Authored by Tony Feng
Created on Sept 20th, 2022
Last Modified on Sept 20th, 2022
Intro
This series of posts contains a summary of materials and readings from the course CSCI 1460 Computational Linguistics that I’ve taken @ Brown University. The class aims to explore the techniques behind recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learnt) for study and review only.
Data Pre-processing
Preprocessing vs. Features
Preprocessing: Defining the vocabulary
Features: Capturing the aspects of the language that are relevant to the task
Both are distinct steps in a typical pipeline (a sketch follows the list):
- Load the data
- Preprocess the data
- Extract the features
- Split the data
- Train and test the model
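A minimal sketch of this pipeline in Python using scikit-learn on a toy text-classification task; the tiny `texts`/`labels` corpus is made up for illustration, and a bag-of-words featurizer stands in for whatever features you actually use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the data (toy examples in place of a real corpus)
texts = ["I loved this movie", "Terrible plot and acting", "Great film!", "I hated it"]
labels = [1, 0, 1, 0]

# 2-3. Preprocess the data and extract features
# (CountVectorizer lowercases and tokenizes for us; in practice you would
# fit the vectorizer on the training split only to avoid leakage)
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(texts)

# 4. Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

# 5. Train and test the model
model = LogisticRegression()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```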
Preprocessing Steps
- Strip boilerplate/HTML/etc.
- Tokenization
- Stemming/Lemmatization
- Lowercasing
- Removing punctuation, stop words, numbers, etc.
- Setting a Vocab Size/Defining out-of-vocabulary (OOV) tokens (see the sketch after this list)
- Word Sense Disambiguation (one word with multiple meanings)
- Combining Synonyms/Paraphrases
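A rough sketch of a few of these steps (tokenization, lowercasing, punctuation/stop-word removal, and a fixed vocab size with an OOV token); the toy documents, the tiny stop list, and the vocab size are made-up values for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for real documents.
docs = [
    "The cats are sitting on the mat.",
    "A cat sat on the mat, and the dog barked!",
]

STOP_WORDS = {"the", "a", "an", "and", "on", "are", "is"}  # tiny illustrative stop list

def preprocess(text):
    # Tokenize on alphabetic spans (this also strips punctuation and numbers), lowercase.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # Remove stop words.
    return [t for t in tokens if t not in STOP_WORDS]

tokenized = [preprocess(d) for d in docs]

# Set a vocab size: keep the K most frequent tokens, map everything else to an OOV symbol.
K = 5
counts = Counter(t for doc in tokenized for t in doc)
vocab = {t for t, _ in counts.most_common(K)}
mapped = [[t if t in vocab else "<OOV>" for t in doc] for doc in tokenized]
print(mapped)
```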
Feature Engineering
Weighting Strategies
Basic BOW: 1 if the word appears in the document, 0 otherwise
Count: frequency of the word in a document
TF-IDF: a weighting scheme that emphasizes distinctive words
TF-IDF is about assigning higher weights to words that differentiate this document from other documents. More info can be found here.
TF(word, doc) = # of occurrences of the word in the doc / # of words in the doc
IDF(word) = log(# of all docs / # of docs that contain the word)
TF-IDF(word, doc) = TF(word, doc) * IDF(word)
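A small worked example of these formulas, computing TF-IDF from scratch over a made-up corpus; the natural log is used here, though other bases are also common.

```python
import math

# Toy corpus (hypothetical documents), already tokenized.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "my", "homework"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf(word, doc):
    # TF(word, doc) = occurrences of the word in the doc / number of words in the doc
    return doc.count(word) / len(doc)

def idf(word, docs):
    # IDF(word) = log(number of docs / number of docs containing the word)
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" appears in every doc, so its IDF (and TF-IDF) is 0; "mat" is distinctive to doc 0.
print(tf_idf("the", docs[0], docs))              # 0.0
print(round(tf_idf("mat", docs[0], docs), 3))    # ~0.183
```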
You can try a TF-IDF demo here.
N-Grams
An n-gram is a sequence of N consecutive words, e.g. unigram, bigram, trigram, 4-gram, etc. Features are extracted by sliding a window of length N over the text.
However, larger N may lead to a very large feature space. It’s a tradeoff between expressivity and generalization.
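A quick sketch of sliding-window n-gram extraction; the sentence is made up for illustration.

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams: ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(tokens, 3))  # trigrams
```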
More Advanced Features
Syntactic Features
People often use dependency parsers to find meaningful links between non-adjacent words.
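As one hedged example, spaCy’s dependency parser can expose these links; this assumes the `en_core_web_sm` English pipeline is installed, and the sentence is made up for illustration.

```python
import spacy

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The movie that we watched yesterday was surprisingly good.")

# Each token's dependency relation links it to its (possibly non-adjacent) head word.
for token in doc:
    print(f"{token.text:12s} --{token.dep_}--> {token.head.text}")
```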