Authored by Tony Feng
Created on Oct 13th, 2022
Last Modified on Oct 16th, 2022
Intro
This series of posts contains a summary of materials and readings from the course CSCI 1460 Computational Linguistics that I’ve taken @ Brown University. The class aims to explore techniques regarding recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learnt) for study and review only.
Language Modeling
Definition
- A language model assigns a probability to a sequence of words
- Given a sequence of words, predict the most likely next word
- Generate likely sequences of words
Applications
- Unconstrained text generation
- Conditional text generation
- Machine Translation
- Speech Recognition
- Summarization
- Representation Learning
N-gram Language Models
Directly Computing Corpus Stats
The simplest approach computes the probability of a whole sentence directly from corpus counts. However, language is productive and dynamic, so we cannot expect the same sentence to appear in the corpus multiple times (or even once).
$$P(w_0, \ldots, w_l) = \frac{n}{N(l)}$$
where $n$ is the number of occurrences of $w_0, \ldots, w_l$ and $N(l)$ is the number of sequences of length $l$.
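As a quick illustration (the toy corpus and function below are my own, not from the course), estimating a whole sentence’s probability by direct counting makes the sparsity problem obvious: any sentence not in the corpus gets probability zero.

```python
# Tiny made-up corpus of tokenized sentences, for illustration only.
corpus = [
    "give me an apple".split(),
    "give me a pear".split(),
    "she ate an apple".split(),
]

def direct_probability(sentence, corpus):
    """P(sentence) = count of the exact sentence / number of corpus sequences of the same length."""
    same_length = [s for s in corpus if len(s) == len(sentence)]
    if not same_length:
        return 0.0
    n = sum(1 for s in same_length if s == sentence)
    return n / len(same_length)

print(direct_probability("give me an apple".split(), corpus))   # 1/3
print(direct_probability("give me an orange".split(), corpus))  # 0.0 -- unseen sentence
```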
Unigram Language Model
$$ P\left(w_{0} \ldots w_{n}\right)=P\left(w_{0}\right) P\left(w_{1} \mid w_{0}\right) P\left(w_{2} \mid w_{0}, w_{1}\right) \ldots P\left(w_{n} \mid w_{0} \ldots w_{n-1}\right)$$
Under the naive independence assumption, we have $ P\left(w_{i} \mid w_{0} \ldots w_{i-1}\right) \approx P\left(w_{i}\right) $.
$$ P\left(w_{0} \ldots w_{n}\right) \approx P\left(w_{0}\right) P\left(w_{1}\right) P\left(w_{2}\right) \ldots P\left(w_{n}\right) $$
However, word order is ignored: $P($“give me an apple”$)$ equals $P($“give an apple me”$)$.
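A minimal unigram sketch (on a toy corpus of my own) shows exactly this weakness: reordering the words leaves the probability unchanged.

```python
from collections import Counter
from math import prod

# Made-up tokenized corpus, for illustration only.
tokens = "give me an apple give me a pear she ate an apple".split()
counts = Counter(tokens)
total = len(tokens)

def unigram_prob(sentence):
    """P(w0 ... wn) ~ product of the individual unigram probabilities P(wi)."""
    return prod(counts[w] / total for w in sentence)

# Word order is ignored, so both orderings receive the same probability.
print(unigram_prob("give me an apple".split()))
print(unigram_prob("give an apple me".split()))
```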
Bigram Language Model
Under the Markov assumption, we have $P\left(w_{i} \mid w_{0} \ldots w_{i-1}\right) \approx P\left(w_{i} \mid w_{i-1}\right)$.
$$P\left(w_{0} \ldots w_{n}\right) \approx P\left(w_{0}\right) P\left(w_{1} \mid w_{0}\right) P\left(w_{2} \mid w_{1}\right) \ldots P\left(w_{n} \mid w_{n-1}\right)$$
N-gram Language Model
$$P\left(w_{0} \ldots w_{n}\right) \approx \prod_{i=0}^{n} P\left(w_{i} \mid w_{i-(n-1)} \ldots w_{i-1}\right)$$
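Here is a bigram ($n = 2$) version of this estimate on a toy corpus; the data, sentence markers, and function names are my own invention rather than the course’s reference code. Conditional probabilities are maximum-likelihood counts, $P(w_i \mid w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})$.

```python
from collections import Counter

# Toy corpus with sentence-boundary markers (made up for illustration).
sentences = [
    "<s> give me an apple </s>".split(),
    "<s> give me a pear </s>".split(),
]

unigram_counts, bigram_counts = Counter(), Counter()
for sent in sentences:
    unigram_counts.update(sent)
    bigram_counts.update(zip(sent, sent[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = C(prev word) / C(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sent):
    """P(w0 ... wn) ~ product of P(wi | wi-1) under the Markov assumption."""
    p = 1.0
    for prev, word in zip(sent, sent[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("<s> give me an apple </s>".split()))  # 0.5, since "me" is followed by "an" half the time
```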
Smoothing
Laplace Smoothing
The basic idea is to “add one” to everything, so there won’t be zero counts.
- We need to renormalize so that the result remains a probability distribution
$$P\left(w_{n} \mid w_{n-1}\right)=\frac{N\left(w_{n-1} w_{n}\right)+1}{\sum_{w}\left(N\left(w_{n-1} w\right)+1\right)}$$
- It is often interpreted as discounting, because we borrow probability mass from high-frequency words in order to make room for unseen words.
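A sketch of the add-one estimate, reusing the `unigram_counts` and `bigram_counts` from the bigram sketch above (still toy data); note that the denominator simplifies to $C(w_{n-1}) + |V|$.

```python
# Add-one (Laplace) smoothed bigram estimate. Assumes the unigram_counts and
# bigram_counts built in the bigram sketch above.
vocab_size = len(unigram_counts)

def laplace_bigram_prob(prev, word):
    """P(word | prev) = (C(prev word) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)

print(laplace_bigram_prob("me", "pear"))  # unseen bigram, but no longer zero
```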
Backoff
We can estimate the probability of a longer sequence from the probabilities of its subsequences. If an n-gram of length $n$ has not been observed, fall back to the corresponding $(n-1)$-gram estimate instead.
$$P(c \mid a, b) \rightarrow P(c \mid b) \rightarrow P(c)$$
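A simplified sketch of the idea on toy trigram counts; this is not Katz backoff (there is no proper discounting or normalization), just a fall-through in the spirit of “stupid backoff”, with a made-up scaling constant `alpha`.

```python
from collections import Counter

# Toy corpus; build unigram, bigram, and trigram counts (two <s> markers give trigram context).
sentences = [
    "<s> <s> give me an apple </s>".split(),
    "<s> <s> give me a pear </s>".split(),
]
uni, bi, tri = Counter(), Counter(), Counter()
for s in sentences:
    uni.update(s)
    bi.update(zip(s, s[1:]))
    tri.update(zip(s, s[1:], s[2:]))

def backoff_prob(w2, w1, w, alpha=0.4):
    """Use the trigram estimate if observed; otherwise back off to the bigram,
    then to the unigram, scaling each lower order by alpha."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return alpha * bi[(w1, w)] / uni[w1]
    return alpha * alpha * uni[w] / sum(uni.values())

print(backoff_prob("give", "me", "an"))    # trigram observed
print(backoff_prob("give", "me", "pear"))  # backs off all the way to the unigram P(pear)
```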
Interpolation
Every probability is estimated as a weighted combination of the different n-gram orders. The weights $\lambda_i$ must sum to 1 so that the result remains a valid probability distribution.
$$\hat{P}(c \mid a, b)=\lambda_{1} P(c \mid a, b)+\lambda_{2} P(c \mid b)+\lambda_{3} P(c), \quad \sum_{i} \lambda_{i}=1$$
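A matching sketch of linear interpolation, reusing the `uni`, `bi`, and `tri` counters from the backoff example above; the lambda weights here are arbitrary values that sum to 1 (in practice they are tuned on held-out data).

```python
def interpolated_prob(w2, w1, w, lambdas=(0.6, 0.3, 0.1)):
    """Weighted sum of trigram, bigram, and unigram MLE estimates; the weights sum to 1.
    Assumes the uni, bi, tri counters built in the backoff sketch above."""
    l3, l2, l1 = lambdas
    p3 = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p2 = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p1 = uni[w] / sum(uni.values())
    return l3 * p3 + l2 * p2 + l1 * p1

print(interpolated_prob("give", "me", "an"))    # mixes all three orders
print(interpolated_prob("give", "me", "pear"))  # unseen trigram still gets some probability mass
```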
Kneser-Ney Smoothing
It’s a state-of-the-art smoothing algorithm that combines several ideas:
- Absolute discounting (with the discount $d$ estimated from data) $$ P_{\text {AbsoluteDiscounting }}\left(w_{i} \mid w_{i-1}\right)=\frac{C\left(w_{i-1} w_{i}\right)-d}{\sum_{v} C\left(w_{i-1} v\right)}+\lambda\left(w_{i-1}\right) P\left(w_{i}\right) $$
- Replacing the lower-order (unigram) probability with a continuation probability (defined below). $$P\left(w_{i} \mid w_{i-1}\right)=\frac{\max \left(C\left(w_{i-1} w_{i}\right)-d, 0\right)}{C\left(w_{i-1}\right)}+\lambda\left(w_{i-1}\right) P_{\mathrm{CONTINUATION}}\left(w_{i}\right)$$
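For reference, the continuation probability counts how many distinct contexts a word appears in (rather than its raw frequency), and $\lambda(w_{i-1})$ redistributes the discounted mass. The standard formulation (following Jurafsky & Martin; not spelled out in the lecture slides) is:
$$P_{\mathrm{CONTINUATION}}\left(w_{i}\right)=\frac{\left|\left\{v: C\left(v w_{i}\right)>0\right\}\right|}{\left|\left\{\left(u^{\prime}, w^{\prime}\right): C\left(u^{\prime} w^{\prime}\right)>0\right\}\right|}, \quad \lambda\left(w_{i-1}\right)=\frac{d}{\sum_{v} C\left(w_{i-1} v\right)}\left|\left\{w: C\left(w_{i-1} w\right)>0\right\}\right|$$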
Perplexity
A good language model should assign high probability to sentences that actually appear. Instead of using probability directly, we use a metric called “perplexity”.
$$PPL(W)=\sqrt[n]{\prod_{i=1}^{n} \frac{1}{P\left(w_{i} \mid w_{1} \ldots w_{i-1}\right)}}=P\left(w_{1} \ldots w_{n}\right)^{-\frac{1}{n}}$$
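A small sketch of computing PPL in log space (to avoid numerical underflow); it assumes some conditional bigram estimator such as the Laplace-smoothed one sketched earlier.

```python
import math

def perplexity(sentence, prob_fn):
    """PPL = exp(-(1/n) * sum_i log P(w_i | w_{i-1})): the inverse geometric mean probability.
    prob_fn(prev, word) should return a (smoothed, nonzero) conditional probability."""
    pairs = list(zip(sentence, sentence[1:]))
    log_prob = sum(math.log(prob_fn(prev, word)) for prev, word in pairs)
    return math.exp(-log_prob / len(pairs))

# Example with the Laplace-smoothed bigram estimator from the smoothing sketch:
# print(perplexity("<s> give me an apple </s>".split(), laplace_bigram_prob))
```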
Intuition
- “Weighted average branching factor” -> how many next words can follow any given word?
- A model with lower PPL is less “surprised” by new data and has more certainty about true sequences.
- A good model effectively lowers the branching factor, because it has a good sense of what should come next.
- In natural language, distributions are highly non-uniform, so branching factors are (relatively) low.
- PPL will never be zero! Natural language has inherent uncertainty.
- PPL is not comparable across different datasets.
- Higher-order n-grams lead to lower PPL in general, but they
  - are more likely to overfit the training data,
  - require more memory,
  - result in more zero counts.