[Comp Linguistics] Tasks & Benchmarks

Authored by Tony Feng

Created on Sept 14th, 2022

Last Modified on Sept 14th, 2022

Intro

This series of posts summarizes materials and readings from the course CSCI 1460 Computational Linguistics that I took @ Brown University. The class explores techniques behind recent advances in NLP with deep learning. I post these “Notes” (what I’ve learned) for study and review only.


Task-Driven Culture

NLP research has tended to be organized around shared “tasks”. Instead of building an entire system themselves, researchers justify their work in relation to commonly agreed-upon, important tasks.

Extrinsic & Intrinsic Tasks

Here, we take ACL Submission Topics as examples to make a classification.

Extrinsic Tasks: Something an “end user” might directly want to use

  • Dialogue & Interactive Systems
  • Info Retrieval & Text Mining
  • Language Grounding to Vision, Robotics & Beyond
  • Machine Translation & Multilinguality
  • NLP Applications
  • Question Answering
  • Sentiment Analysis, Stylistic Analysis & Argument Mining
  • Speech & Multimodality
  • Summarization

Intrinsic Tasks: Part of a bigger system

  • Generation
  • ML for NLP
  • Info Extraction
  • Resources & Evaluation
  • Discourse & Pragmatics
  • Phonology, Morphology & Word Segmentation
  • Syntax: Tagging, Chunking & Parsing
  • Semantics: Lexical
  • Semantics: Sentence-level Semantics, Textual Inference & Other Areas

Bakeoffs & Leaderboards

Since the statistical revolution, systems are compared on the same inputs, against the same ground-truth outputs (standardized test sets), using the same metrics. The field places a lot of stock in empirical comparison of ideas and in the applicability of techniques.

  • DARPA
  • SemEval
  • SQuAD
  • GLUE
  • SuperGLUE
  • GEM
  • DynaBench
  • BIG-bench

Major “Downstream” Tasks and Evals

Classification

Input & Output

Input: a piece of text
Output: a label

Examples

Sentiment prediction (e.g., for stock trading)
Language detection (e.g., before machine translation)
Relevance prediction (e.g., for retrieval or ad targeting)
Intent classification (e.g., for goal-oriented dialog)

Standard Metrics

Accuracy: How often is the label correct?
Precision: Probability that it is spam, given that the model says it is.
Recall: Probability that the model says it is spam, given that it actually is spam.
F-1 Score: Harmonic Mean of Precision and Recall
$$ F_{1}=2 \times \frac{\text{precision} \times \text{recall}}{\text{precision}+\text{recall}} $$
AUC: Area under the true-positive rate vs. false-positive rate (ROC) curve
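As a concrete illustration (my own sketch, not course code), here is a minimal example that computes accuracy, precision, recall, and F1 by hand for a binary spam-classification setup; the function name and data are made up for the example.

```python
# Minimal sketch: accuracy, precision, recall, and F1 computed by hand
# for binary labels (1 = spam, 0 = not spam).

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # P(actually spam | predicted spam)
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # P(predicted spam | actually spam)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)           # harmonic mean
    return accuracy, precision, recall, f1

if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```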

Info Retrieval

Input & Output

Input: Query + Doc Collection
Output: Ranked List of Docs

Examples

Retrieval tasks have increasingly complex goals, which are interwoven with other NLP tasks (e.g., summarization, question answering).

Standard Metrics

Precision@K: How many of the top K documents are relevant?
Recall@K: How many of the relevant documents occur in the top K?
Mean Reciprocal Rank: At what rank does the relevant document appear?
Discounted Cumulative Gain (DCG): How much does each additional rank down the list contribute, with gains discounted by position? $$ DCG_{p}=\sum_{i=1}^{p} \frac{rel_{i}}{\log _{2}(i+1)} $$
Area Under the Curve (AUC): Summary metric for the precision-recall curve
Average Precision (AP):

$$ AP=\sum_{n}\left(R_{n}-R_{n-1}\right) P_{n} $$
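To make the ranking metrics concrete, here is a small illustrative sketch (not from the course) that computes Precision@K, Recall@K, reciprocal rank, and DCG for a single query; the document ids and relevance judgments are invented for the example.

```python
# Minimal sketch: ranking metrics for a single query, given a ranked list of
# document ids and a set (or graded dict) of relevant documents.
import math

def precision_at_k(ranked, relevant, k):
    """How many of the top-k documents are relevant?"""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """How many of the relevant documents appear in the top k?"""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (averaged over queries -> MRR)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def dcg_at_p(ranked, rel_scores, p):
    """DCG_p = sum_i rel_i / log2(i + 1): gains are discounted further down the list."""
    return sum(rel_scores.get(d, 0) / math.log2(i + 1)
               for i, d in enumerate(ranked[:p], start=1))

if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2", "d5"]
    relevant = {"d1", "d2"}
    print(precision_at_k(ranked, relevant, 3))      # 1/3
    print(recall_at_k(ranked, relevant, 3))         # 1/2
    print(reciprocal_rank(ranked, relevant))        # 0.5
    print(dcg_at_p(ranked, {"d1": 1, "d2": 1}, 5))  # 1/log2(3) + 1/log2(5)
```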

Ground Truth

Query Relevances (“qrels”)

  • Humans manually annotate documents for relevance, based on description of query
  • Too expensive to estimate recall (requires labeling every document for every query)
  • Not “ecologically valid” (annotator judgments may not reflect real users’ information needs)

User Behavior/AB Testing

  • Does the user click on links?
  • Do they stay on the page after clicking?
  • Do they go back and reword the query?

Question Answering

Variants of QA

  • Open Book
  • Closed Book
  • QA over Databases

Open Book QA

Input: Question + Doc
Output: Answer / Highlighted Span

Closed Book QA

Input: Question
Output: Answer
Note: Hard to verify answer source

QA over Databases

Input: Question + Database
Output: Answer
Note: 1) Good for short, factoid questions 2) Requires formal query languages
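A toy illustration of the “formal query language” point (my own sketch, not course material): answering a factoid question over a hypothetical SQLite table of films, where a semantic parser would have to map the question to the hand-written SQL shown below.

```python
# Minimal sketch: QA over a database means translating a natural-language
# question into a formal query (here, SQL over an in-memory SQLite table).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, director TEXT, year INTEGER)")
conn.executemany("INSERT INTO films VALUES (?, ?, ?)",
                 [("Titanic", "James Cameron", 1997),
                  ("Alien", "Ridley Scott", 1979)])

# Question: "Who directed Titanic?"  ->  produced query:
answer = conn.execute(
    "SELECT director FROM films WHERE title = ?", ("Titanic",)
).fetchone()[0]
print(answer)  # James Cameron
```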

Standard Metrics

Accuracy: What % of questions were correct?
Exact Match Accuracy: % of times the model’s answer was string-identical to the ground truth
Precision: How many of the predicted tokens were correct?
Recall: How many of the correct tokens were predicted?
F-1: Harmonic mean of precision and recall
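A minimal sketch of SQuAD-style answer scoring (simplified: the official script also strips punctuation and articles before comparing): exact match plus token-level precision, recall, and F1 between a predicted answer and the gold answer.

```python
# Minimal sketch: exact match and token-overlap F1 for extractive QA answers.
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the (lowercased, stripped) strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level precision/recall/F1 based on bag-of-token overlap."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # predicted tokens that were correct
    recall = overlap / len(gold_tokens)     # correct tokens that were predicted
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(exact_match("James Cameron", "james cameron"))            # 1.0
    print(token_f1("the director James Cameron", "James Cameron"))  # ~0.667
```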


Important Intrinsic Tasks and Evals

  • Tokenization
  • Sentence Splitting
  • Part-of-speech Tagging
  • Morphological Analysis
  • Named Entity Recognition
  • Syntactic Parsing
  • Coreference Resolution
  • Other Annotations
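Several of these intrinsic tasks can be run in one pass with an off-the-shelf pipeline. The sketch below uses spaCy purely as an illustration (it assumes spaCy is installed and the small English model has been fetched via `python -m spacy download en_core_web_sm`); it is not course code.

```python
# Illustrative sketch: tokenization, sentence splitting, POS tagging, lemmas,
# NER, and dependency parsing from a single spaCy pipeline run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Brown University is in Providence. It was founded in 1764.")

sentences = [sent.text for sent in doc.sents]                # sentence splitting
tokens = [tok.text for tok in doc]                           # tokenization
pos_tags = [(tok.text, tok.pos_) for tok in doc]             # part-of-speech tagging
lemmas = [(tok.text, tok.lemma_) for tok in doc]             # morphological analysis (lemmas)
entities = [(ent.text, ent.label_) for ent in doc.ents]      # named entity recognition
deps = [(tok.text, tok.dep_, tok.head.text) for tok in doc]  # dependency (syntactic) parsing

print(sentences)
print(entities)  # e.g., entities such as 'Brown University' (ORG) and '1764' (DATE)
```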

Current Debates

  • Extrinsic vs. Intrinsic
  • Scientific Validity
  • Leaderboardism
