[Comp Linguistics] Language Modeling - Ngram Models

Authored by Tony Feng

Created on Oct 13th, 2022

Last Modified on Oct 16th, 2022

Intro

This series of posts contains a summary of materials and readings from the course CSCI 1460 Computational Linguistics that I’ve taken @ Brown University. The class aims to explore techniques behind recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learnt) for study and review only.


Language Modeling

Definition

  • A language model assigns a probability to a sequence of words
  • Given a sequence of words, predict the most likely next word
  • Generate likely sequences of words

Applications

  • Unconstrained text generation
  • Conditional text generation:
    • Machine Translation
    • Speech Recognition
    • Summarization
  • Representation Learning

N-gram Language Models

Directly Computing Corpus Stats

The simplest approach computes the probability of a sequence directly from its relative frequency in a corpus. However, language is dynamic, and we cannot expect the same sentence to appear in the corpus multiple times (or at all).

$$P(w_0 \ldots w_{l-1}) = \frac{n}{N(l)}$$

where $n$ is the number of occurrences of the exact sequence $w_0 \ldots w_{l-1}$ in the corpus and $N(l)$ is the number of sequences of length $l$.
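A minimal sketch of this direct (relative-frequency) estimate on a hypothetical toy corpus; the corpus contents and the `direct_probability` helper are my own illustration, not from the course:

```python
# Hypothetical toy corpus; each inner list is one tokenized sentence.
corpus = [
    ["give", "me", "an", "apple"],
    ["give", "me", "a", "pear"],
    ["give", "me", "an", "apple"],
]

def direct_probability(sequence, corpus):
    """Relative frequency of an exact sequence among corpus sequences of the same length."""
    same_length = [s for s in corpus if len(s) == len(sequence)]
    if not same_length:
        return 0.0
    n = sum(1 for s in same_length if s == sequence)
    return n / len(same_length)

print(direct_probability(["give", "me", "an", "apple"], corpus))   # 2/3
print(direct_probability(["give", "me", "an", "orange"], corpus))  # 0.0 -- unseen sentences get zero
```

The second call shows the sparsity problem: any sentence not seen verbatim in the corpus gets probability zero.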

Unigram Language Model

$$ P\left(w_{0} \ldots w_{n}\right)=P\left(w_{0}\right) P\left(w_{1} \mid w_{0}\right) P\left(w_{2} \mid w_{0}, w_{1}\right) \ldots P\left(w_{n} \mid w_{0} \ldots w_{n-1}\right)$$

According to the naive assumption (each word is independent of its history), we have $P\left(w_{i} \mid w_{0} \ldots w_{i-1}\right) \approx P\left(w_{i}\right)$.

$$ P\left(w_{0} \ldots w_{n}\right) \approx P\left(w_{0}\right) P\left(w_{1}\right) P\left(w_{2}\right) \ldots P\left(w_{n}\right) $$

However, word order is ignored: $P($“give me an apple”$)$ equals $P($“give an apple me”$)$.
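A minimal unigram sketch on a hypothetical toy corpus (the corpus and the `unigram_prob` helper are illustrative assumptions), showing that the two orderings above indeed get the same probability:

```python
from collections import Counter

# Hypothetical toy corpus of tokenized sentences.
corpus = [
    ["give", "me", "an", "apple"],
    ["give", "me", "a", "pear"],
]

tokens = [w for sent in corpus for w in sent]
unigram_counts = Counter(tokens)
total = len(tokens)

def unigram_prob(sentence):
    """P(w_0 ... w_n) under the unigram model: the product of individual word probabilities."""
    p = 1.0
    for w in sentence:
        p *= unigram_counts[w] / total
    return p

# Word order is ignored, so both permutations get the same probability.
print(unigram_prob(["give", "me", "an", "apple"]))
print(unigram_prob(["give", "an", "apple", "me"]))
```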

Bigram Language Model

According to the Markov assumption, we have $P\left(w_{i} \mid w_{0} \ldots w_{i-1}\right) \approx P\left(w_{i} \mid w_{i-1}\right)$.

$$P\left(w_{0} \ldots w_{n}\right) \approx P\left(w_{0}\right) P\left(w_{1} \mid w_{0}\right) P\left(w_{2} \mid w_{1}\right) \ldots P\left(w_{n} \mid w_{n-1}\right)$$
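A minimal bigram sketch under the same assumptions (hypothetical toy corpus, with `<s>` as a start-of-sentence marker):

```python
from collections import Counter

# Hypothetical toy corpus; <s> marks the start of each sentence.
corpus = [
    ["<s>", "give", "me", "an", "apple"],
    ["<s>", "give", "me", "a", "pear"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(sentence):
    """P(w_0 ... w_n) ≈ product of P(w_i | w_{i-1}), with <s> as the initial context."""
    p = 1.0
    prev = "<s>"
    for w in sentence:
        p *= bigram_counts[(prev, w)] / unigram_counts[prev]
        prev = w
    return p

print(bigram_prob(["give", "me", "an", "apple"]))  # 1 * 1 * 0.5 * 1 = 0.5
```

Unlike the unigram model, reordering the words would now change (and here zero out) the probability.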

N-gram Language Model

Generalizing, each word is conditioned on the previous $n-1$ words:

$$P\left(w_{0} \ldots w_{n}\right) \approx \prod_{i=0}^{n} P\left(w_{i} \mid w_{i-(n-1)} \ldots w_{i-1}\right)$$
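A sketch of general n-gram counting with `<s>` padding; the `ngram_counts` helper and the toy corpus are my own illustration:

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count all n-grams (as tuples) in a corpus of tokenized sentences, padding with <s>."""
    counts = Counter()
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Hypothetical toy corpus.
corpus = [["give", "me", "an", "apple"], ["give", "me", "a", "pear"]]
trigrams = ngram_counts(corpus, 3)
bigrams = ngram_counts(corpus, 2)

# P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})
context, word = ("give", "me"), "an"
print(trigrams[context + (word,)] / bigrams[context])  # 0.5
```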


Smoothing

Laplace Smoothing

The basic idea is to “add one” to everything, so there won’t be zero counts.

  • We need to renormalize to keep it a probability distribution; the denominator simplifies to $N(w_{n-1}) + V$, where $V$ is the vocabulary size.
    $$P\left(w_{n} \mid w_{n-1}\right)=\frac{N\left(w_{n-1} w_{n}\right)+1}{\sum_{w}\left(N\left(w_{n-1} w\right)+1\right)}$$
  • It is often interpreted as discounting, because we borrow probability mass from frequent n-grams in order to make room for unseen ones (see the sketch after this list).
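A small sketch of add-one smoothing for bigrams, assuming a hypothetical toy corpus; note the denominator is $C(w_{n-1}) + |V|$:

```python
from collections import Counter

# Hypothetical toy corpus.
corpus = [["give", "me", "an", "apple"], ["give", "me", "a", "pear"]]

unigram_counts = Counter(w for s in corpus for w in s)
bigram_counts = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
vocab = set(unigram_counts)

def laplace_bigram_prob(prev, word):
    """Add-one smoothed P(word | prev): (C(prev word) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

print(laplace_bigram_prob("an", "apple"))  # seen bigram
print(laplace_bigram_prob("an", "pear"))   # unseen bigram, but no longer zero
```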

Backoff

We can estimate the probability of a longer sequence from the probabilities of its shorter subsequences. If an n-gram of length $n$ is not observed, back off to the corresponding $(n-1)$-gram estimate instead; e.g., for an unseen trigram $a\,b\,c$:

$$P(c \mid a, b) \approx P(c \mid b) \approx P(c)$$
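A minimal sketch of the backoff idea on a hypothetical toy corpus. This simple version just falls back to shorter contexts without redistributing discounted mass, so it is not a proper probability distribution (Katz backoff handles that):

```python
from collections import Counter

# Hypothetical toy corpus and n-gram counters keyed by order (1, 2, 3).
corpus = [["give", "me", "an", "apple"], ["give", "me", "a", "pear"]]
counters = {
    k: Counter(tuple(s[i:i + k]) for s in corpus for i in range(len(s) - k + 1))
    for k in (1, 2, 3)
}

def backoff_prob(word, context):
    """Use the longest observed context; if the n-gram is unseen, back off to a shorter one."""
    while context:
        ngram = context + (word,)
        if counters[len(ngram)][ngram] > 0:
            return counters[len(ngram)][ngram] / counters[len(context)][context]
        context = context[1:]  # drop the leftmost word and back off
    return counters[1][(word,)] / sum(counters[1].values())  # unigram fallback

print(backoff_prob("apple", ("me", "an")))  # trigram "me an apple" is observed
print(backoff_prob("pear", ("me", "an")))   # backs off to the bigram, then the unigram
```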

Interpolation

Every estimate is a weighted sum of the n-gram estimate and its lower-order estimates. The weights must sum to 1 so that the result remains a probability distribution.

$$\hat{P}(c \mid a, b)=\lambda_{1} P(c \mid a, b)+\lambda_{2} P(c \mid b)+\lambda_{3} P(c), \quad \sum_{i} \lambda_{i}=1$$
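A sketch of linear interpolation for trigrams; the toy corpus and the weights $(0.6, 0.3, 0.1)$ are hypothetical (in practice the $\lambda$s are tuned on held-out data):

```python
from collections import Counter

# Hypothetical toy corpus and counters.
corpus = [["give", "me", "an", "apple"], ["give", "me", "a", "pear"]]
unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
trigrams = Counter((s[i], s[i + 1], s[i + 2]) for s in corpus for i in range(len(s) - 2))
total = sum(unigrams.values())

def interpolated_prob(c, a, b, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation: lambda1*P(c|a,b) + lambda2*P(c|b) + lambda3*P(c)."""
    l1, l2, l3 = lambdas
    p_tri = trigrams[(a, b, c)] / bigrams[(a, b)] if bigrams[(a, b)] else 0.0
    p_bi = bigrams[(b, c)] / unigrams[b] if unigrams[b] else 0.0
    p_uni = unigrams[c] / total
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(interpolated_prob("apple", "me", "an"))
```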

Kneser-Ney Smoothing

It is a state-of-the-art smoothing algorithm for n-gram models that combines several ideas (a bigram sketch follows the list):

  • Absolute discounting (estimated from data) $$ P_{\text {AbsoluteDiscounting }}\left(w_{i} \mid w_{i-1}\right)=\frac{C\left(w_{i-1} w_{i}\right)-d}{\sum_{\nu} C\left(w_{i-1} v\right)}+\lambda\left(w_{i-1}\right) P\left(w_{i}\right) $$
  • Replacing n-gram probabilities with continuation probabilities ($P_{\mathrm{CONTINUATION}}(w_i)$ asks how many distinct contexts $w_i$ completes, not how often it occurs). $$P\left(w_{i} \mid w_{i-1}\right)=\frac{\max \left(C\left(w_{i-1} w_{i}\right)-d, 0\right)}{C\left(w_{i-1}\right)}+\lambda\left(w_{i-1}\right) P_{\mathrm{CONTINUATION}}\left(w_{i}\right)$$
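A rough bigram sketch of interpolated Kneser-Ney combining these two ideas; the toy corpus and the discount $d = 0.75$ (a common default) are assumptions, and edge cases such as unseen contexts are ignored:

```python
from collections import Counter

# Hypothetical toy corpus and counts.
corpus = [["give", "me", "an", "apple"], ["give", "me", "a", "pear"]]
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s)

def continuation_prob(word):
    """P_CONTINUATION(w): fraction of distinct bigram types that end in w."""
    num_contexts = sum(1 for (prev, w) in bigrams if w == word)
    return num_contexts / len(bigrams)

def kneser_ney_prob(prev, word, d=0.75):
    """Interpolated Kneser-Ney for bigrams with absolute discount d."""
    discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
    # lambda(prev): the discounted mass, spread over the distinct words that follow prev.
    following = sum(1 for (p, w) in bigrams if p == prev)
    lam = (d / unigrams[prev]) * following
    return discounted + lam * continuation_prob(word)

print(kneser_ney_prob("an", "apple"))
```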

Perplexity

A good language model should assign high probability to sentences that actually appear. Instead of using probability directly, we use a metric called “perplexity”.

$$PPL(W)=\sqrt[n]{\prod_{i=1}^{n} \frac{1}{P\left(w_{i} \mid w_{1} \ldots w_{i-1}\right)}}=P\left(w_{1} \ldots w_{n}\right)^{-\frac{1}{n}}$$
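A small sketch computing perplexity in log space (to avoid numerical underflow), given any conditional probability function; the uniform-model example illustrates the branching-factor intuition below:

```python
import math

def perplexity(sentence, prob_fn):
    """PPL = exp(-1/N * sum_i log P(w_i | history)), assuming prob_fn never returns zero.
    `prob_fn(word, history)` is any conditional language-model probability."""
    log_prob = 0.0
    for i, w in enumerate(sentence):
        log_prob += math.log(prob_fn(w, sentence[:i]))
    return math.exp(-log_prob / len(sentence))

# With a uniform model over a vocabulary of size V, PPL equals V (the branching factor).
V = 1000
print(perplexity(["any", "five", "word", "long", "sentence"], lambda w, h: 1 / V))  # 1000.0
```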

Intuition

  • “Weighted average branching factor” -> how many next words can follow any given word?
  • A model with lower PPL is less “surprised” by new data and has more certainty about true sequences.
  • A good model effectively sees a lower branching factor, because it has a good sense of what should come next.
  • In natural language, distributions are highly non-uniform, so branching factors are (relatively) low.
  • PPL will never be zero! Natural language has inherent uncertainty.
  • PPL is not comparable across different datasets.
  • Higher-order n-grams lead to lower PPL in general, but
    • it is more likely to overfit to training data,
    • requires more memory,
    • results in more zero counts.
