[Comp Linguistics] Tasks & Benchmarks

Authored by Tony Feng

Created on Sept 14th, 2022

Last Modified on Sept 14th, 2022

Intro

This series of posts summarizes materials and readings from the course CSCI 1460 Computational Linguistics that I took @ Brown University. The class explores techniques behind recent advances in NLP with deep learning. I post these “Notes” (what I’ve learned) for study and review only.


Task-Driven Culture

NLP research has tended to be organized around shared “tasks”. Instead of building an entire system themselves, researchers justify their work in relation to commonly agreed-upon, important tasks.

Extrinsic & Intrinsic Tasks

Here, we take ACL Submission Topics as examples to make a classification.

Extrinsic Tasks: Something an “end user” might directly want to use

  • Dialogue & Interactive Systems
  • Info Retrieval & Text Mining
  • Language Grounding to Vision, Robotics & Beyond
  • Machine Translation & Multilinguality
  • NLP Applications
  • Question Answering
  • Sentiment Analysis, Stylistic Analysis & Argument Mining
  • Speech & Multimodality
  • Summarization

Intrinsic Tasks: Part of a bigger system

  • Generation
  • ML for NLP
  • Info Extraction
  • Resources & Evaluation
  • Discourse & Pragmatics
  • Phonology, Morphology & Word Segmentation
  • Syntax: Tagging, Chunking & Parsing
  • Semantics: Lexical
  • Semantics: Sentence-level Semantics, Textual Inference & Other Areas

Bakeoffs & Leaderboards

Since the statistical revolution, systems are compared on the same inputs, against the same ground-truth outputs (standardized test sets), using the same metrics. The field places a lot of stock in empirical comparison of ideas and in the applicability of techniques.

  • DARPA
  • SemEval
  • SQuAD
  • GLUE
  • SuperGLUE
  • GEM
  • DynaBench
  • BIG-bench

Major “Downstream” Tasks and Evals

Classification

Input & Output

Input: a piece of text
Output: a label

Examples

Sentiment prediction (e.g., for stock trading)
Language detection (e.g., before machine translation)
Relevance prediction (e.g., for retrieval or ad targeting)
Intent classification (e.g., for goal-oriented dialog)

Standard Metrics

Accuracy: How often is the label correct?
Precision: Probability that it is spam, given that the model says it is.
Recall: Probability that the model says it is spam, given that it actually is spam.
F-1 Score: Harmonic Mean of Precision and Recall
$$ F_{1}=2 \times \frac{\text{precision} \times \text{recall}}{\text{precision}+\text{recall}} $$
AUC: Area under the true-positive rate vs. false-positive rate (ROC) curve
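As a concrete illustration (my own sketch, not course code), here is a minimal example that computes accuracy, precision, recall, and F1 by hand for a binary spam-classification setup; the function name and data are made up for the example.

```python
# Minimal sketch: accuracy, precision, recall, and F1 computed by hand
# for binary labels (1 = spam, 0 = not spam).

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # P(actually spam | predicted spam)
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # P(predicted spam | actually spam)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)           # harmonic mean
    return accuracy, precision, recall, f1

if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
    print(binary_metrics(y_true, y_pred))  # (0.75, 0.75, 0.75, 0.75)
```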

Info Retrieval

Input & Output

Input: Query + Doc Collection
Output: Ranked List of Docs

Examples

Retrieval tasks have increasingly complex goals, which are interwoven with other NLP tasks (e.g., summarization, question answering).

Standard Metrics

Precision@K: How many of the top K documents are relevant?
Recall@K: How many of the relevant documents occur in the top K?
Mean Reciprocal Rank: At what rank does the relevant document appear?
Discounted Cumulative Gain (DCG): How much does each additional rank down the list contribute, with gains discounted by position? $$ DCG_{p}=\sum_{i=1}^{p} \frac{rel_{i}}{\log _{2}(i+1)} $$
Area Under the Curve (AUC): Summary metric for the precision-recall curve
Average Precision (AP):

$$ AP=\sum_{n}\left(R_{n}-R_{n-1}\right) P_{n} $$
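To make the ranking metrics concrete, here is a small illustrative sketch (not from the course) that computes Precision@K, Recall@K, reciprocal rank, and DCG for a single query; the document ids and relevance judgments are invented for the example.

```python
# Minimal sketch: ranking metrics for a single query, given a ranked list of
# document ids and a set (or graded dict) of relevant documents.
import math

def precision_at_k(ranked, relevant, k):
    """How many of the top-k documents are relevant?"""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """How many of the relevant documents appear in the top k?"""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document (averaged over queries -> MRR)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def dcg_at_p(ranked, rel_scores, p):
    """DCG_p = sum_i rel_i / log2(i + 1): gains are discounted further down the list."""
    return sum(rel_scores.get(d, 0) / math.log2(i + 1)
               for i, d in enumerate(ranked[:p], start=1))

if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2", "d5"]
    relevant = {"d1", "d2"}
    print(precision_at_k(ranked, relevant, 3))      # 1/3
    print(recall_at_k(ranked, relevant, 3))         # 1/2
    print(reciprocal_rank(ranked, relevant))        # 0.5
    print(dcg_at_p(ranked, {"d1": 1, "d2": 1}, 5))  # 1/log2(3) + 1/log2(5)
```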

Ground Truth

Query Relevances (“qrels”)

  • Humans manually annotate documents for relevance, based on description of query
  • Too expensive to estimate recall (requires labeling every document for every query)
  • Not “ecologically valid” (annotator judgments may not reflect real users’ information needs)

User Behavior/AB Testing

  • Does the user click on links?
  • Do they stay on the page after clicking?
  • Do they go back and reword the query?

Question Answering

Variants of QA

  • Open Book
  • Closed Book
  • QA over Databases

Open Book QA

Input: Question + Doc
Output: Answer / Highlighted Span

Closed Book QA

Input: Question
Output: Answer
Note: Hard to verify answer source

QA over Databases

Input: Question + Database
Output: Answer
Note: 1) Good for short, factoid questions 2) Requires formal query languages
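A toy illustration of the “formal query language” point (my own sketch, not course material): answering a factoid question over a hypothetical SQLite table of films, where a semantic parser would have to map the question to the hand-written SQL shown below.

```python
# Minimal sketch: QA over a database means translating a natural-language
# question into a formal query (here, SQL over an in-memory SQLite table).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, director TEXT, year INTEGER)")
conn.executemany("INSERT INTO films VALUES (?, ?, ?)",
                 [("Titanic", "James Cameron", 1997),
                  ("Alien", "Ridley Scott", 1979)])

# Question: "Who directed Titanic?"  ->  produced query:
answer = conn.execute(
    "SELECT director FROM films WHERE title = ?", ("Titanic",)
).fetchone()[0]
print(answer)  # James Cameron
```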

Standard Metrics

Accuracy: What % of questions were correct?
Exact Match Accuracy: % of times the model’s answer was string-identical to the ground truth
Precision: How many of the predicted tokens were correct?
Recall: How many of the correct tokens were predicted?
F-1: Harmonic mean of precision and recall
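A minimal sketch of SQuAD-style answer scoring (simplified: the official script also strips punctuation and articles before comparing): exact match plus token-level precision, recall, and F1 between a predicted answer and the gold answer.

```python
# Minimal sketch: exact match and token-overlap F1 for extractive QA answers.
from collections import Counter

def exact_match(pred, gold):
    """1.0 if the (lowercased, stripped) strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level precision/recall/F1 based on bag-of-token overlap."""
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # predicted tokens that were correct
    recall = overlap / len(gold_tokens)     # correct tokens that were predicted
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    print(exact_match("James Cameron", "james cameron"))            # 1.0
    print(token_f1("the director James Cameron", "James Cameron"))  # ~0.667
```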


Important Intrinsic Tasks and Evals

  • Tokenization
  • Sentence Splitting
  • Part-of-speech Tagging
  • Morphological Analysis
  • Named Entity Recognition
  • Syntactic Parsing
  • Coreference Resolution
  • Other Annotations
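Several of these intrinsic tasks can be run in one pass with an off-the-shelf pipeline. The sketch below uses spaCy purely as an illustration (it assumes spaCy is installed and the small English model has been fetched via `python -m spacy download en_core_web_sm`); it is not course code.

```python
# Illustrative sketch: tokenization, sentence splitting, POS tagging, lemmas,
# NER, and dependency parsing from a single spaCy pipeline run.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Brown University is in Providence. It was founded in 1764.")

sentences = [sent.text for sent in doc.sents]                # sentence splitting
tokens = [tok.text for tok in doc]                           # tokenization
pos_tags = [(tok.text, tok.pos_) for tok in doc]             # part-of-speech tagging
lemmas = [(tok.text, tok.lemma_) for tok in doc]             # morphological analysis (lemmas)
entities = [(ent.text, ent.label_) for ent in doc.ents]      # named entity recognition
deps = [(tok.text, tok.dep_, tok.head.text) for tok in doc]  # dependency (syntactic) parsing

print(sentences)
print(entities)  # e.g., entities such as 'Brown University' (ORG) and '1764' (DATE)
```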

Current Debates

  • Extrinsic vs. Intrinsic
  • Scientific Validity
  • Leaderboardism
