[Comp Linguistics] Language Modeling - Neural Architecture

Authored by Tony Feng

Created on Oct 18th, 2022

Last Modified on Oct 20th, 2022

Intro

This series of posts contains a summary of materials and readings from CSCI 1460 Computational Linguistics, a course I took at Brown University. The class explores techniques behind recent advances in NLP with deep learning. I post these “Notes” (what I’ve learnt) for study and review only.


Multi-layer Perceptron

  • An MLP doesn’t readily support long, sequential inputs.
  • An MLP doesn’t encode word order; it is essentially a bag-of-words (BOW) model.
  • Inputs either become muddy (adding everything together, i.e., a “bag of vectors”) or too large (concatenating everything).
  • “Bag-of-vectors” classifiers are nevertheless common and often work well for basic applications (see the sketch below).
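As a concrete illustration, here is a minimal NumPy sketch of a bag-of-vectors classifier: average the word embeddings of a text, then classify the resulting fixed-size vector with a small MLP. The function names and parameter shapes are assumptions for illustration, not code from the course.

import numpy as np

def bag_of_vectors(embeddings):
    """embeddings: (seq_len, d) array of word vectors -> (d,) text vector.
    Word order is lost, but the input size stays constant for any text length."""
    return embeddings.mean(axis=0)

def mlp_classify(doc_vec, W1, b1, W2, b2):
    """One hidden layer with ReLU, then a softmax over class labels."""
    h = np.maximum(0.0, W1 @ doc_vec + b1)
    logits = W2 @ h + b2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()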

Recurrent Neural Network (RNN)

Architecture

The basic idea is that the generation of word $i+1$ depends on word $i$ plus a “memory” of the words generated up to $i$.

Figure: the RNN shown “unrolled”, as a recurrent/recursive loop, and as a single step of the recurrence.

$$h_{t}=g\left(U h_{t-1}+W x_{t}\right)$$ $$y_{t}=f\left(V h_{t}\right)$$

Inference

function ForwardRNN(x, rnn) -> sequence y:
    h_0 <- 0
    for i <- 1 to length(x) do:
        h_i <- g(U * h_(i-1) + W * x_i)
        y_i <- f(V * h_i)
    return y
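A minimal NumPy version of the pseudocode above, assuming $g$ is tanh and $f$ is a softmax; the parameter shapes are illustrative assumptions.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_rnn(x, U, W, V):
    """x: list of input vectors; U, W, V: recurrence, input, and output weight matrices."""
    h = np.zeros(U.shape[0])          # h_0 <- 0
    ys = []
    for x_t in x:                     # for i <- 1 to length(x)
        h = np.tanh(U @ h + W @ x_t)  # h_i <- g(U h_{i-1} + W x_i)
        ys.append(softmax(V @ h))     # y_i <- f(V h_i)
    return ys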

Training Considerations

  • In practice, unrolling the network and padding inputs to a fixed length works better for batching.
  • When producing word $i$, predict from the real word $i-1$, not the predicted word $i-1$ (which is likely wrong); this is known as teacher forcing (see the sketch below).
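A sketch of teacher forcing for an RNN language model, assuming one-hot word vectors and the same $U$, $W$, $V$ as above; the loss shown (average negative log-likelihood) is an assumption.

import numpy as np

def rnn_lm_loss(x, U, W, V):
    """x: list of one-hot word vectors for one training sentence."""
    h = np.zeros(U.shape[0])
    loss = 0.0
    for t in range(1, len(x)):
        h = np.tanh(U @ h + W @ x[t - 1])   # condition on the REAL previous word, not the prediction
        logits = V @ h
        p = np.exp(logits - logits.max())
        p /= p.sum()
        loss -= np.log(p @ x[t])            # negative log-likelihood of the gold next word
    return loss / (len(x) - 1)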

Long-Short Term Memory (LSTM)

Motivation

  • RNNs struggle with long-range dependencies.
  • Vanishing gradients make it hard to update early hidden states for long sequences.

Architecture

Overview

  • Introduce a “gating” mechanism to manage the hidden state / memory
  • Information passes through “gates”
    • The “forget” gate removes information that is no longer needed
    • The “add” gate adds new information likely to be useful in the future
    • The “output” gate decides what part of the context to expose as the hidden state
  • Add an explicit previous “context” in addition to the prior hidden state

Figure: LSTM computation graph

Gate

  • Learn a mask (i.e., a vector) via backpropagation
  • Apply the mask (i.e., elementwise multiplication) to some hidden state

Computation

  • Compute “current” state, “add” gate, “forget” gate, and “output” gate from previous hidden state and current input.
    • g = current state
    • f = forget gate
    • i = add gate
    • o = output gate (controls what to output, i.e., what information is relevant right now)
    • k = intermediate output (context after “forgetting”)
    • j = intermediate output (info to be added to the context)
    • c = updated context
    • h = updated hidden state
  • Combine these quantities using the Hadamard (elementwise) product.
  • Update the context and hidden state for the next timestep.

$$g_{t}=\tanh \left(U_{g} h_{t-1}+W_{g} x_{t}\right)$$
$$f_{t}=\operatorname{sigmoid}\left(U_{f} h_{t-1}+W_{f} x_{t}\right)$$
$$i_{t}=\operatorname{sigmoid}\left(U_{i} h_{t-1}+W_{i} x_{t}\right)$$
$$o_{t}=\operatorname{sigmoid}\left(U_{o} h_{t-1}+W_{o} x_{t}\right)$$
$$k_{t}=f_{t} \odot c_{t-1}$$
$$j_{t}=i_{t} \odot g_{t}$$
$$c_{t}=j_{t}+k_{t}$$
$$h_{t}=o_{t} \odot \tanh \left(c_{t}\right)$$
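The equations map directly onto code. A minimal NumPy sketch of a single LSTM step, with the parameter layout and the sigmoid helper as assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. params is a sequence of (U, W) pairs for g, f, i, o."""
    (Ug, Wg), (Uf, Wf), (Ui, Wi), (Uo, Wo) = params
    g = np.tanh(Ug @ h_prev + Wg @ x_t)   # candidate ("current") state
    f = sigmoid(Uf @ h_prev + Wf @ x_t)   # forget gate
    i = sigmoid(Ui @ h_prev + Wi @ x_t)   # add gate
    o = sigmoid(Uo @ h_prev + Wo @ x_t)   # output gate
    k = f * c_prev                        # context after "forgetting"
    j = i * g                             # information to be added to the context
    c = j + k                             # updated context
    h = o * np.tanh(c)                    # updated hidden state
    return h, c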


Transformer

Architecture

The representation of each word depends on the (slightly less contextualized) representations of the other words.

Figure: the Transformer architecture

Self Attention

The idea is to learn a distribution, i.e., a weighted combination of the other hidden states, that informs the current hidden state.

Each word plays three roles at each timestep, and we learn three weight matrices $(Q, K, V)$ to cast each word into each role. The dot product of a query and a key produces a weight, and the next layer receives the weighted combination of the values.

  • Query: The word as the current focus
  • Key: The word as a context word
  • Value: The word as part of the output

Figure: self-attention

Multi-headed self-attention repeats the attention process multiple times in parallel. Each $(Q, K, V)$ set can focus on a slightly different aspect of the representation.
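A minimal NumPy sketch of a single attention head; the scaling by $\sqrt{d_k}$ follows the standard scaled dot-product formulation and, like the variable names, is an assumption rather than something stated above. Multi-headed attention would call this with several separate projection matrices and concatenate the results.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). One attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # cast each word into its three roles
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # query-key dot products -> weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the context words
    return weights @ V                                # weighted combination of the values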

Blocks

Figure: a Transformer block

Residual Connection

It adds the block’s input to its output, which helps training by mitigating vanishing gradients.

Layer Normalize

It follows the same idea as z-score normalization: rescale each hidden vector to have zero mean and unit variance.

Feedforward Layer

It is a simple perceptron-style layer that combines everything together.
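Putting the three pieces together, here is a minimal NumPy sketch of one Transformer block (residual connections, layer normalization, and a feedforward layer); the post-norm ordering, layer sizes, and the attn argument (a self-attention function like the sketch above) are assumptions.

import numpy as np

def layer_norm(x, eps=1e-5):
    """Z-score normalize each position's vector (layer normalization)."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise perceptron-style layer."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(X, attn, ffn_params):
    """X: (seq_len, d_model); attn: a self-attention function, e.g. the sketch above."""
    X = layer_norm(X + attn(X))                        # residual connection around attention
    X = layer_norm(X + feed_forward(X, *ffn_params))   # residual connection around the feedforward layer
    return X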

Positional Encoding

Transformers aren’t inherently aware of the order in which words occur, because self-attention essentially treats the input as a bag of words.

Figure: a simple positional encoding

Positional encodings add an embedding representation of the absolute position to the input word embedding (sketched after the list below). However,

  • Absolute position is not the same as the relative/order information in language
  • There is less supervision for later positions (long sequences are rarer in training data)
  • It is hard to deal with the recursive structure of human language
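A minimal sketch of the simple absolute scheme, assuming a table of position embeddings (learned or fixed); the names are illustrative.

import numpy as np

def add_positional_encoding(word_embeddings, position_table):
    """word_embeddings: (seq_len, d); position_table: (max_len, d) embeddings
    of absolute positions 0, 1, 2, ... that get added to the word embeddings."""
    seq_len = word_embeddings.shape[0]
    return word_embeddings + position_table[:seq_len]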

Why is it a big deal?

  • Attention
    • has minimal inductive bias
    • can learn arbitrary graph structure
  • Multiheadedness
    • Each “head” focuses on a different subspace of the input
  • Scalability
    • At layer N, no dependency between timesteps, so it can be trained in parallel
    • Faster training = bigger models + more data
    • Allows for massive pretraining
