[Comp Linguistics] Language Modeling - Pretraining

Authored by Tony Feng

Created on Oct 20th, 2022

Last Modified on Oct 25th, 2022

Intro

This series of posts summarizes materials and readings from the course CSCI 1460 Computational Linguistics that I took @ Brown University. The class explores techniques behind recent advances in NLP with deep learning. I post these “Notes” (what I’ve learned) for study and review only.


What is pretraining?

  • Train on some “general” task that encourages good representations
  • Transfer these representations with little updating to other tasks
  • No need to train embeddings anew for every task

Contextualized Word Representations

Basic Idea

Traditional word embedding methods (e.g., word2vec) produce type-level representations: one vector per word type, regardless of context. Contextualized word embedding models (e.g., ELMo, BERT) produce token-level representations: a different vector for each occurrence of a word, depending on its context.
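
To make the distinction concrete, here is a minimal sketch; it assumes the HuggingFace transformers library and the bert-base-uncased checkpoint (not part of the course material). The same word type “bank” gets different token-level vectors in different sentences, whereas a static embedding would give it a single vector.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embedding_of(sentence, word):
    """Return the last-layer hidden state of the (first subword of) `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of("I deposited cash at the bank.", "bank")
v2 = embedding_of("We sat on the bank of the river.", "bank")
# A static (type-level) embedding would make this similarity exactly 1.0;
# contextualized (token-level) embeddings differ across contexts.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```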

Static vs. Contextualized

Advantages

  • They capture word sense (the same word type gets different vectors in different contexts)
  • They capture syntactic and semantic context

Disadvantages

  • Sentence representations are variable-length (one vector per token), so pooling is needed for a fixed-size representation
  • No clear “lexicon”, which (traditionally) linguistics likes to have
  • More training data needed

ELMo


  • Architecture: Two-layer BiLSTM
  • Layer 0 = pretrained static embeddings (GloVe)
  • Trained on a vanilla language modeling task
  • Finetuned by learning a simple linear combination (a “scalar mix”) of the learned layer representations (see the sketch after this list)

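Below is a minimal sketch of that linear-combination step, written in plain PyTorch (it is not the original AllenNLP implementation); the layer count and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learn softmax-normalized weights over frozen layers plus a scale gamma."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # s_j before softmax
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_reps: torch.Tensor) -> torch.Tensor:
        # layer_reps: (num_layers, batch, seq_len, dim), produced by the frozen encoder
        s = torch.softmax(self.weights, dim=0)
        mixed = (s.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed

# Hypothetical shapes: 3 layers (static embeddings + 2 BiLSTM layers), dim 1024.
mix = ScalarMix(num_layers=3)
fake_layers = torch.randn(3, 8, 20, 1024)
print(mix(fake_layers).shape)  # torch.Size([8, 20, 1024])
```
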
BERT


  • Architecture: Deep Transformer encoder (BERT-base = 12 layers, BERT-large = 24 layers)
  • Layer 0 = WordPiece subword embeddings
  • Trained on masked language modeling + next-sentence prediction (see the masked-LM sketch after this list)
  • Typically finetuned by updating all parameters (though there are other strategies)

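As a concrete illustration of the masked language modeling objective, here is a minimal sketch; it assumes the HuggingFace transformers library and the bert-base-uncased checkpoint.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# The model must predict the token hidden behind [MASK].
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))  # expected to be something like "paris"
```
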
GPT


  • Architecture: Deep Transformer decoder
  • Layer 0 = BPE (byte-pair encoding) subword embeddings
  • Trained on vanilla (left-to-right) language modeling (see the sketch after this list)
  • Notable because the recent versions (GPT-3) are HUGE, and very impressive
  • Typically finetuned by updating all parameters (for the smaller models) or used via prompting (for the large models)

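Here is a minimal sketch of the vanilla (causal) language modeling objective, using the small gpt2 checkpoint from HuggingFace transformers as a stand-in for the much larger GPT-3.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# The model is trained to predict each next token; passing labels=input_ids
# makes the forward pass return that next-token cross-entropy loss.
inputs = tokenizer("Language modeling predicts the next token.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])
print(out.loss.item())  # average next-token loss on this sentence
```
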
Different Transfer Methods

Frozen

Just pool representations and train a new classifier on top.
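
A minimal sketch of this strategy, assuming transformers, bert-base-uncased, and a hypothetical binary sentiment task:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():
    p.requires_grad = False  # the pretrained encoder is never updated

classifier = nn.Linear(encoder.config.hidden_size, 2)  # e.g., binary sentiment
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def step(sentences, labels):
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():  # no gradients flow into the frozen encoder
        hidden = encoder(**batch).last_hidden_state
    pooled = hidden.mean(dim=1)  # simple mean pooling over tokens
    loss = nn.functional.cross_entropy(classifier(pooled), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

print(step(["great movie", "terrible movie"], torch.tensor([1, 0])))
```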

Finetuning

Treat pretraining as a good initialization and continue to update all parameters on the new task. Compared to the frozen approach, finetuned models usually perform better and need less data to learn the target task.
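
By contrast, a minimal finetuning sketch (same assumptions as above) simply puts every parameter into the optimizer:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
# Every parameter, pretrained encoder included, receives gradient updates.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
out = model(**batch, labels=torch.tensor([1, 0]))
out.loss.backward()
optimizer.step()
print(out.loss.item())
```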

“Prompt” Tuning

Update a small number of parameters at the input (bottom) of the network: a few learned “soft prompt” vectors are prepended to the input while the pretrained model stays frozen.
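
A minimal sketch of the idea, assuming transformers and gpt2; the prompt length and learning rate are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False  # the backbone is frozen

num_prompt_tokens, dim = 10, model.config.n_embd
soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)  # only the prompt is tuned

inputs = tokenizer("this movie was great", return_tensors="pt")
token_embeds = model.get_input_embeddings()(inputs["input_ids"])  # (1, T, dim)
inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
out = model(inputs_embeds=inputs_embeds)
# A real setup would compute a task loss here and call backward()/step(),
# which would update only `soft_prompt`.
print(out.logits.shape)
```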

Adapters

Update a small number of parameters throughout the network: small bottleneck modules are inserted into each layer and only those are trained, while the pretrained weights stay frozen.
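
A minimal sketch of a single adapter module in plain PyTorch (the hidden and bottleneck sizes are illustrative assumptions; in practice one such module is inserted into every Transformer layer):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck with a residual connection; only this module is trained."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen layer's output dominant at init.
        return h + self.up(torch.relu(self.down(h)))

adapter = Adapter()
print(adapter(torch.randn(8, 20, 768)).shape)  # torch.Size([8, 20, 768])
```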

Zero-Shot or “In-Context” Learning

Cast all tasks as an instance of the task that the model was trained on (e.g., language modeling).

  • Results are high variance, depending on exact wording of the prompts.
  • Unclear what was seen during training, so hard to know how much they are truly learning and generalizing vs. parroting.
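
To make this concrete, here is a minimal sketch that casts sentiment classification as language modeling by scoring candidate completions of a hand-written prompt; it assumes transformers and the small gpt2 checkpoint as a stand-in for GPT-3.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Review: I loved every minute of this film. Sentiment:"

def completion_score(completion: str) -> float:
    """Sum of log-probabilities the LM assigns to the completion tokens."""
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(prompt + completion, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids).logits.log_softmax(dim=-1)
    score = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        # logits at position pos-1 predict the token at position pos
        score += logits[0, pos - 1, full_ids[0, pos]].item()
    return score

print(max([" positive", " negative"], key=completion_score))
```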
