Authored by Tony Feng
Created on Sept 14th, 2022
Last Modified on Sept 14th, 2022
Intro
This series of posts contains a summary of materials and readings from the course CSCI 1460 Computational Linguistics that I’ve taken @ Brown University. The class aims to explore techniques behind recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learnt) for study and review only.
Task-Driven Culture
NLP research has tended to be organized around “tasks” with shared ideas. Instead of building and justifying a whole system themselves, researchers frame their problems in relation to commonly agreed-upon, important “tasks”.
Extrinsic & Intrinsic Tasks
Here, we use the ACL submission topics as an example of this classification.
Extrinsic Tasks: Tasks an “end user” might want to use directly
- Dialog & Interactive Sys
- Info Retrieval & Text Mining
- Language Grounding to Vision, Robotics & Beyond
- Machine Translation & Multilinguality
- NLP Applications
- Question Answering
- Sentiment Analysis, Stylistic Analysis & Argument Mining
- Speech & Multimodality
- Summarization
Intrinsic Tasks: Tasks used as part of a bigger system
- Generation
- ML for NLP
- Info Extraction
- Resources & Evaluation
- Discourse & Pragmatics
- Phonology, Morphology & Word Segmentation
- Syntax: Tagging, Chunking & Parsing
- Semantics: Lexical
- Semantics: Sentence-level Semantics, Textual Inference & Other Areas
Bakeoffs & Leaderboards
Since the statistical revolution, systems are compared using the same inputs, the same ground-truth outputs (standardized test sets), and the same metrics. The field places a lot of stock in empirical comparison of ideas and the applicability of techniques.
- DARPA
- SemEval
- SQuAD
- GLUE
- SuperGLUE
- GEM
- DynaBench
- BigBench
Major “Downstream” Tasks and Evals
Classification
Input & Output
Input: a piece of text
Output: a label
Examples
Sentiment prediction (e.g., for stock trading)
Language detection (e.g., before machine translation)
Relevance prediction (e.g., for retrieval or ad targeting)
Intent classification (e.g., for goal-oriented dialog)
Standard Metrics
Accuracy: How often is the label correct?
Precision: Probability that an example actually is spam, given that the model says it is.
Recall: Probability that the model says it’s spam, given that it actually is spam.
F-1 Score: Harmonic Mean of Precision and Recall
$$
F_{1} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
$$
AUC: Area Under the Curve of true-positive rate vs. false-positive rate (the ROC curve)
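To make these definitions concrete, here is a minimal sketch (my own, not course code) that computes accuracy, precision, recall, and F1 for binary labels; the names `y_true` and `y_pred` are illustrative.

```python
# Minimal sketch of the classification metrics above, for binary labels
# (1 = positive class, e.g. "spam"; 0 = negative). Names are illustrative.

def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)                # how often the label is correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # P(actually spam | predicted spam)
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # P(predicted spam | actually spam)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)           # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# Example: four messages, two actually spam, the model flags one of them.
print(classification_metrics([1, 0, 1, 0], [1, 0, 0, 0]))  # (0.75, 1.0, 0.5, 0.666...)
```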
Info Retrieval
Input & Output
Input: Query + Doc Collection
Output: Ranked List of Docs
Examples
Retrieval systems have increasingly complex goals, which are interwoven with other NLP tasks (summarization, question answering).
Standard Metrics
Precision@K: How many of the top K documents are relevant?
Recall@K: How many of the relevant documents occur in the top K?
Mean Reciprocal Rank (MRR): At what rank does the first relevant document appear, averaged as 1/rank over queries?
Discounted Cumulative Gain (DCG): How much relevance is accumulated going down the ranked list, with lower ranks discounted?
$$
DCG_{p}=\sum_{i=1}^{p} \frac{rel_{i}}{\log _{2}(i+1)}
$$
Area Under the Curve (AUC): A summary metric for precision-recall curves
Average Precision (AP): Weights the precision $P_n$ at each cutoff $n$ by the change in recall $R_n - R_{n-1}$
$$
AP=\sum_{n}\left(R_{n}-R_{n-1}\right) P_{n}
$$
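The ranking metrics above fit in a few lines of code. The following is a minimal sketch (my own, with binary relevance and illustrative names) of Precision@K, Recall@K, reciprocal rank, and DCG for a single query.

```python
import math

# `ranking` is a list of doc ids in ranked order; `relevant` is the set of
# doc ids judged relevant (binary relevance). Names are illustrative.

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranking, relevant):
    # 1 / rank of the first relevant document (0 if none is retrieved);
    # averaging this over queries gives MRR.
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def dcg_at_p(ranking, relevant, p):
    # DCG_p = sum_i rel_i / log2(i + 1), with rel_i in {0, 1} here.
    return sum((1.0 if d in relevant else 0.0) / math.log2(i + 1)
               for i, d in enumerate(ranking[:p], start=1))

ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranking, relevant, 2),   # 0.5
      recall_at_k(ranking, relevant, 2),      # 0.5
      reciprocal_rank(ranking, relevant),     # 0.5 (first relevant doc at rank 2)
      dcg_at_p(ranking, relevant, 4))         # ~1.06
```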
Ground Truth
Query Relevances (“qrels”)
- Humans manually annotate documents for relevance, based on description of query
- Too expensive to estimate recall (requires labeling every document for every query)
- Not “ecologically valid”
User Behavior / A/B Testing
- Does the user click on links?
- Do they stay on the page after clicking?
- Do they go back and reword the query?
Question Answering
Variants of QA
- Open Book
- Closed Book
- QA over Databases
Open Book QA
Input: Question + Doc
Output: Answer / Highlighted Span
Closed Book QA
Input: Question
Output: Answer
Note: Hard to verify answer source
QA over Databases
Input: Question + Database
Output: Answer
Note: 1) Good for short, factoid questions; 2) Requires formal query languages (e.g., SQL)
Standard Metrics
Accuracy: What % of questions were correct?
Exact Match Accuracy: % of times the model’s answer was string-identical to the ground truth
Precision: How many of the predicted tokens were correct?
Recall: How many of the correct tokens were predicted?
F-1: Harmonic mean of precision and recall
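These are essentially the SQuAD-style span metrics. Below is a minimal sketch (my own simplification; the official SQuAD script also lowercases, strips punctuation and articles, and takes a max over multiple gold answers) of exact match and token-level F1.

```python
from collections import Counter

# Simplified SQuAD-style span metrics over whitespace tokens (illustrative only).

def exact_match(prediction, ground_truth):
    return float(prediction.strip() == ground_truth.strip())

def token_f1(prediction, ground_truth):
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # overlapping tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # predicted tokens that are correct
    recall = num_same / len(gold_tokens)     # correct tokens that were predicted
    return 2 * precision * recall / (precision + recall)

print(exact_match("the 14th century", "14th century"))         # 0.0
print(round(token_f1("the 14th century", "14th century"), 3))  # 0.8
```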
Important Intrinsic Tasks and Evals
- Tokenization
- Sentence Splitting
- Part-of-speech Tagging
- Morphological Analysis
- Named Entity Recognition
- Syntactic Parsing
- Coreference Resolution
- Other Annotations
Current Debates
- Extrinsic vs. Intrinsic
- Scientific Validity
- Leaderboardism