[Comp Linguistics] Text Classifier: Features

Authored by Tony Feng

Created on Sept 20th, 2022

Last Modified on Sept 20th, 2022

Intro

This series of posts summarizes materials and readings from CSCI 1460 Computational Linguistics, a course I took at Brown University. The class explores techniques behind recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learnt) for study and review only.


Data Pre-processing

Preprocessing VS. Features

Preprocessing: Defining the vocabulary
Features: Capturing the aspects of the language that are independent

Differences in steps

  • Load the data
  • Preprocess the data
  • Extract the features
  • Split the data
  • Train and test the model
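
The order of the steps above can be sketched end to end in a few lines. This is only a toy illustration: the in-memory dataset, the function names, and the majority-class "model" are all assumptions made for the example and do not come from the course materials.

```python
import random
from collections import Counter

def load_data():
    # Toy in-memory dataset standing in for a real corpus (hypothetical examples).
    return [
        ("A great movie", "pos"),
        ("Great acting and a great plot", "pos"),
        ("A boring movie", "neg"),
        ("Terrible plot", "neg"),
    ]

def preprocess(text):
    # Minimal preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def extract_features(tokens):
    # Count features: token -> frequency in the document.
    return Counter(tokens)

def split(examples, test_fraction, seed=0):
    # Shuffle with a fixed seed, then hold out a fraction for testing.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = max(1, int(len(examples) * test_fraction))
    return examples[n_test:], examples[:n_test]

def train_majority(train):
    # Placeholder model: always predicts the most frequent training label.
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(predicted_label, test):
    return sum(label == predicted_label for _, label in test) / len(test)

# Load -> preprocess -> extract features -> split -> train and test.
data = [(extract_features(preprocess(text)), label) for text, label in load_data()]
train, test = split(data, test_fraction=0.25)
model = train_majority(train)
print(accuracy(model, test))
```

A real classifier would replace `train_majority`; the point here is only the order in which the steps run.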

Preprocessing Steps

  • Strip boilerplate/html/etc
  • Tokenization
  • Stemming/Lemmatization
  • Lowercasing
  • Removing punctuation, stop words, numbers, etc.
  • Setting a Vocab Size/Defining out-of-vocabulary (OOV)
  • Word Sense Disambiguation (one word with multiple meanings)
  • Combining Synonyms/Paraphrases
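
A few of the steps above (stripping HTML, lowercasing, tokenizing, removing stop words and numbers, capping the vocabulary) can be sketched with the standard library alone. The stop-word list, the regular expressions, and the `<OOV>` token name are assumptions chosen for illustration; stemming, word sense disambiguation, and synonym merging are left out because they usually need an external library.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}

def preprocess(text):
    """Strip HTML tags, lowercase, tokenize, drop stop words and numbers."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = text.lower()                         # lowercase
    tokens = re.findall(r"[a-z0-9']+", text)    # tokenize; punctuation is dropped
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

def build_vocab(docs, max_size):
    """Keep the max_size most frequent tokens across all documents."""
    counts = Counter(t for doc in docs for t in doc)
    return {word for word, _ in counts.most_common(max_size)}

def apply_vocab(tokens, vocab):
    """Replace every token outside the vocabulary with an <OOV> marker."""
    return [t if t in vocab else "<OOV>" for t in tokens]

tokens = preprocess("<p>The movie is GREAT, 10 out of 10!</p>")
print(tokens)
```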

Feature Engineering

Weighting Strategies

Basic BOW: 1 if the word appears in the document, 0 otherwise
Count: Frequency in a document
TF-IDF: Weighting schemes for important words
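
The first two strategies differ only in what is stored per vocabulary word. A minimal sketch, assuming the vocabulary is a fixed, ordered list of words (the function names are made up for this example):

```python
from collections import Counter

def binary_vector(tokens, vocab):
    # Basic BOW: 1 if the word appears in the document, 0 otherwise.
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

def count_vector(tokens, vocab):
    # Count: how many times each vocabulary word appears in the document.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vocab = ["good", "movie", "bad"]
doc = ["good", "good", "movie"]
print(binary_vector(doc, vocab))
print(count_vector(doc, vocab))
```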

TF-IDF assigns higher weights to words that differentiate a document from the other documents in the collection. More info can be found here.

TF(word, doc) = # of occurrences of the word in the doc / # of words in the doc
IDF(word) = log(# of all docs / # of docs that contain the word)
TF-IDF(word, doc) = TF(word, doc) * IDF(word)
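
The three formulas above translate almost directly into code. The sketch below assumes each document is a list of tokens, uses the natural logarithm (the notes do not specify the base), and assumes the word occurs in at least one document so the IDF denominator is never zero.

```python
import math

def tf(word, doc):
    # Occurrences of the word in the doc / number of words in the doc.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(number of docs / number of docs that contain the word).
    # Assumes the word appears in at least one document.
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

docs = [
    ["the", "movie", "was", "good"],
    ["the", "movie", "was", "bad"],
    ["the", "plot", "was", "good"],
]
print(tf_idf("the", docs[1], docs))  # "the" is in every doc, so its IDF is log(1) = 0
print(tf_idf("bad", docs[1], docs))  # "bad" is in only one doc, so it gets a positive weight
```

Library implementations often add smoothing or normalization on top of this, so their numbers may not match the plain formulas.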

You can try a TF-IDF demo here.

N-Grams

An N-gram is a sequence of N words, such as a unigram, bigram, trigram, 4-gram, etc. The algorithm slides a window of length N over the text to extract features.

However, larger values of N can lead to a large feature space. It’s a tradeoff between expressivity and generalization.

An example of bigram
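
The sliding window can be sketched in a single function. The helper name `ngrams` and the sample sentence are made up for this illustration.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the tokens, one position at a time.
    # Returns an empty list when the text is shorter than n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "love", "this", "movie"]
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))
```

Every distinct tuple becomes its own feature, which is why the feature space grows quickly as N increases.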

More Advanced Features

Syntactic Features

People often use dependency parsers to find meaningful links between non-adjacent words.
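
To illustrate the idea without depending on a particular parser, the sketch below starts from a hand-written, Universal Dependencies style parse of one sentence. The arcs are an assumption written out by hand, not the output of a real parser. It turns each link between non-adjacent words into a feature string, whose format is also made up for this example.

```python
# Hand-written dependency arcs for "The movie was not very good",
# given as (head index, relation, dependent index) with 0-based indices.
tokens = ["The", "movie", "was", "not", "very", "good"]
arcs = [
    (1, "det", 0),     # movie -> The
    (5, "nsubj", 1),   # good -> movie
    (5, "cop", 2),     # good -> was
    (5, "advmod", 3),  # good -> not
    (5, "advmod", 4),  # good -> very
]

def dependency_features(tokens, arcs, min_distance=2):
    # Keep only arcs whose head and dependent are at least min_distance
    # positions apart, i.e. links that a bigram window would miss.
    feats = []
    for head, rel, dep in arcs:
        if abs(head - dep) >= min_distance:
            feats.append(f"{tokens[head].lower()}<-{rel}-{tokens[dep].lower()}")
    return feats

print(dependency_features(tokens, arcs))
```

Here the link between "not" and "good" survives as a feature even though "very" sits between them, which a bigram window over the raw text would not capture.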


MIT License
Last updated on Sep 20, 2022 20:12 EDT