[Comp Linguistics] Distributional Hypothesis

Authored by Tony Feng

Created on Oct 4th, 2022

Last Modified on Oct 4th, 2022

Intro

This series of posts contains a summary of materials and readings from the course CSCI 1460 Computational Linguistics that I’ve taken @ Brown University. The class aims to explore techniques behind recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learned) for study and review only.


Theories of Word Meaning

The Meaning of a Word

A naive bag-of-words (BOW) model has no notion of word meaning (e.g., “cat” and “kitten” should have similar meanings, but it treats them as unrelated words). This raises a question: how should we define what a word means?

  • Words refer to sets
  • Words refer to things
  • Words refer to concepts
  • Words are defined by the context

The Distributional Hypothesis

The meaning of a word is defined by its context, and this leads to another question: What is context?

  • Perceptual context / Linguistic context
  • Symbolic features / Real-valued “impressions”
  • First order associations / Higher-order abstraction

The distributional hypothesis states that words found in the same contexts usually have similar meanings. The theory makes it clear how words are “learned”, and it correlates well with a lot of data on humans. However, it is “holistic”: a word’s meaning keeps shifting with the contexts it appears in.


Vector Space Models

Definition

Words are represented as vectors, and words with similar meanings are nearby in the vector space.

Term-document Matrix

  • Word meaning = the set of documents in which the word occurs
  • The cells can be binary indicators, real-valued counts, tf-idf values, etc.
  • It captures broad topical similarity and co-occurrence rather than sameness of meaning.
  • It is good for document classification and retrieval tasks (see the sketch below).
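
Below is a minimal sketch of building a term-document count matrix over a made-up toy corpus (the documents here are hypothetical; a real pipeline might use scikit-learn’s CountVectorizer or tf-idf weighting instead of raw counts).

```python
import numpy as np
from collections import Counter

# Toy corpus: each "document" is just a short string (hypothetical example).
docs = [
    "the cat sat on the mat",
    "the kitten chased the cat",
    "stocks fell as the market closed",
]

vocab = sorted({w for d in docs for w in d.split()})
word2id = {w: i for i, w in enumerate(vocab)}

# Term-document matrix: rows = words, columns = documents, cells = raw counts.
M = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        M[word2id[w], j] = c

# Each row is a sparse, high-dimensional vector representing one word.
print(vocab)
print(M)
```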

Word-Context Matrix

  • For each word, collect the contexts (e.g., the sentences or windows) in which it occurs.
  • The cells can be binary indicators, real-valued counts, tf-idf values, etc.
  • Similar words don’t necessarily co-occur, but they do occur in similar contexts.
  • It captures more grammatical and lexical similarity (see the sketch below).
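
A similar sketch for a word-context (co-occurrence) matrix, assuming a symmetric window of ±2 words over a few made-up sentences:

```python
import numpy as np

sentences = [
    "the cat sat on the mat".split(),
    "the kitten sat on the sofa".split(),
    "stocks fell as the market closed".split(),
]

vocab = sorted({w for s in sentences for w in s})
word2id = {w: i for i, w in enumerate(vocab)}
window = 2  # symmetric context window size (a common, arbitrary choice)

# Word-context matrix: rows = target words, columns = context words,
# cells = co-occurrence counts within the window.
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for s in sentences:
    for i, w in enumerate(s):
        for j in range(max(0, i - window), min(len(s), i + window + 1)):
            if j != i:
                C[word2id[w], word2id[s[j]]] += 1

# "cat" and "kitten" never co-occur here, but they share contexts
# ("the", "sat", "on"), so their rows come out similar.
print(C[word2id["cat"]])
print(C[word2id["kitten"]])
```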

Computing Similarity

Exact Equivalence

  • w1 == w2 iff their vector representations are identical
  • However, language is too variable, so this would never work in practice.

Jaccard Similarity

  • $S = \frac{\text{intersection}(v_1, v_2)}{\text{union}(v_1, v_2)}$
  • It works well for binary vectors, but needs adjustments for real-valued dimensions (see the sketch below).
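
A tiny sketch of Jaccard similarity on binary occurrence vectors (the example vectors are made up):

```python
import numpy as np

def jaccard(v1, v2):
    """Jaccard similarity for binary occurrence vectors."""
    v1, v2 = v1.astype(bool), v2.astype(bool)
    union = np.logical_or(v1, v2).sum()
    return np.logical_and(v1, v2).sum() / union if union else 0.0

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
print(jaccard(a, b))  # 2 shared dimensions / 4 occupied dimensions = 0.5
```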

Euclidean Distance

  • $E = \sqrt{\sum_{i=0}^{n}\left(v_{1}^{i}-v_{2}^{i}\right)^{2}} $
  • It assumes similar words will be of similar magnitude (i.e., occur with similar frequency).
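
A quick sketch of that magnitude sensitivity: a frequent word and a rare word with the same usage pattern end up far apart under Euclidean distance (the count vectors are hypothetical):

```python
import numpy as np

def euclidean(v1, v2):
    """Euclidean distance between two count vectors."""
    return np.sqrt(np.sum((v1 - v2) ** 2))

# A frequent word and a rare word pointing in the same direction.
frequent = np.array([100.0, 200.0, 0.0])
rare = np.array([1.0, 2.0, 0.0])
print(euclidean(frequent, rare))  # ~221.4, large despite identical direction
```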

Cosine Similarity

  • $C = \frac{\vec{v_1} \cdot \vec{v_2}}{|\vec{v_1}|\,|\vec{v_2}|} = \frac{\sum_{i=0}^{n} v_{1}^{i} v_{2}^{i}}{\sqrt{\sum_{i=0}^{n} (v_{1}^{i})^{2}} \sqrt{\sum_{i=0}^{n} (v_{2}^{i})^{2}}}$
  • Dot product (scalar product / inner product) of the two word vectors, normalized by their lengths
  • Geometrically, it is the cosine of the angle between v1 and v2 (the dot product alone is the length of the projection of v1 onto v2 times |v2|)
  • C == 0 if the vectors are orthogonal
  • C == -1 if they point in opposite directions
  • C == 1 if they point in the same direction (see the sketch below)
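
The same hypothetical vectors under cosine similarity, covering the three boundary cases listed above:

```python
import numpy as np

def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of vector norms."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

frequent = np.array([100.0, 200.0, 0.0])
rare = np.array([1.0, 2.0, 0.0])
other = np.array([0.0, 0.0, 5.0])

print(cosine(frequent, rare))    # 1.0  (same direction, frequency ignored)
print(cosine(frequent, other))   # 0.0  (orthogonal vectors)
print(cosine(frequent, -rare))   # -1.0 (opposite directions)
```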

Word Embeddings

Word Vectors vs. Embeddings

Word Vectors: sparse, very high-dimensional
Word Embeddings: dense, low-dimensional; the dimensions are not directly interpretable

Why Word Embeddings?

  • Lower dimensional = less computationally intensive
  • Lower dimensional forces abstraction
  • Lower dimensional removes noise
  • Dimensionality reduction can capture “second order” effects (E.g., w1 occurs with c1, w2 occurs with c2, c1 and c2 are similar. Thus, w1 and w2 are similar.)

We can obtain word embeddings via dimensionality reduction: 1) matrix factorization, 2) neural networks.

Dimensionality Reduction

It represents the data points in a new feature space by transforming the feature matrix. The new feature space is more informative for ML but is less interpretable to humans.

Principal Component Analysis (PCA). Low-rank assumption: we typically assume that our features contain a large amount of redundant information.
$M = U D V^{T}$, where $U$ contains the word embeddings.

Singular Value Decomposition

Truncated Singular Value Decomposition
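
A minimal sketch of truncated SVD with NumPy, assuming $M$ is a word-context count matrix like the one built earlier (here it is filled with random counts just to have something to factorize; a real setup would factorize the actual co-occurrence counts):

```python
import numpy as np

# Hypothetical word-context count matrix: 50 words x 200 context words.
rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(50, 200)).astype(float)

# Full SVD: M = U @ np.diag(D) @ Vt.
U, D, Vt = np.linalg.svd(M, full_matrices=False)

# Truncate to the top-k singular values to get dense k-dimensional embeddings.
k = 10
embeddings = U[:, :k] * D[:k]  # one dense k-dim embedding per word (row)

print(embeddings.shape)  # (50, 10)
```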

For more info, please read this article: Principal Component Analysis in Machine Learning

