Authored by Tony Feng
Created on Sept 22nd, 2022
Last Modified on Sept 22nd, 2022
Intro
This series of posts contains a summary of materials and readings from the course CSCI 1460 Computational Linguistics that I’ve taken @ Brown University. The class aims to explore techniques behind recent advances in NLP with deep learning. I posted these “Notes” (what I’ve learnt) for study and review only.
Morphology
Why can’t we just split on white space?
- Compounding
- Many language systems don’t use white space
- Clitics
- Word formation
Finite State Machines
This is the main rule-based computational tool for morphological parsing in NLP.
Finite State Automata (FSA)
It is used to check whether to ‘accept’ a string.
Formal Definition
- A finite set of states: $Q = \{q_0, q_1, \ldots, q_{N-1}\}$
- A finite set of symbols (alphabet): $\Sigma = \{i_0, i_1, \ldots, i_{M-1}\}$
- A start state $q_0$ and a set of final (accepting) states
- A transition function that returns the new state, given a state and a symbol: $ \delta(q, i) $
Pseudocode for Implementation
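A minimal sketch of this recognition loop in Python, assuming the transition function $\delta$ is stored as a dictionary keyed by (state, symbol) pairs; the state names and the toy “sheep language” machine are illustrative, not from the lecture:

```python
# A minimal sketch of deterministic FSA recognition. The transition function
# delta is stored as a dict keyed by (state, symbol) pairs (assumed encoding).

def fsa_accepts(string, transitions, start_state, final_states):
    """Return True if the FSA ends in a final state after reading the string."""
    state = start_state
    for symbol in string:
        # Reject immediately if no transition is defined for (state, symbol)
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    return state in final_states


# Toy example: the "sheep language" baa+! (accepts "baa!", "baaa!", ...)
transitions = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
print(fsa_accepts("baaa!", transitions, start_state=0, final_states={4}))  # True
print(fsa_accepts("ba!", transitions, start_state=0, final_states={4}))    # False
```

The loop consumes one symbol at a time, follows $\delta$, and accepts only if the input is exhausted while the machine sits in a final state.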
Non-Determinism
- Multiple links out of the same node
- Epsilon transitions
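Both forms of non-determinism can be simulated by tracking the set of all states the machine could currently be in. A rough sketch, where transitions map (state, symbol) to a set of next states and the empty string "" marks an epsilon transition (both conventions are my own assumptions):

```python
# A rough sketch of simulating a non-deterministic FSA by tracking the set of
# states the machine could be in after each input symbol.

def epsilon_closure(states, transitions):
    """Expand a set of states with everything reachable via epsilon transitions."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for nxt in transitions.get((state, ""), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def nfa_accepts(string, transitions, start_state, final_states):
    """Return True if any path through the non-deterministic FSA accepts the string."""
    current = epsilon_closure({start_state}, transitions)
    for symbol in string:
        nxt = set()
        for state in current:
            # Multiple links out of the same node are handled by taking all of them
            nxt |= transitions.get((state, symbol), set())
        current = epsilon_closure(nxt, transitions)
    return bool(current & final_states)
```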
Finite State Transducer (FST)
It is used to ‘translate’ one string into another, producing a parse as output.
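A toy sketch of the idea, assuming each transition consumes one input symbol and emits an output string; the tiny machine that maps the surface form "cats" to the parse "cat +N +PL" is purely illustrative:

```python
# A toy finite state transducer: each transition consumes an input symbol and
# emits an output string, so walking the input produces a parse as output.

def fst_transduce(string, transitions, start_state, final_states):
    """Return the output string if the FST accepts the input, else None."""
    state, output = start_state, []
    for symbol in string:
        if (state, symbol) not in transitions:
            return None
        state, emitted = transitions[(state, symbol)]
        output.append(emitted)
    return "".join(output) if state in final_states else None


# Illustrative lexicon-style transducer covering "cat" and "cats"
transitions = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t +N"),
    (3, "s"): (4, " +PL"),
}
print(fst_transduce("cats", transitions, start_state=0, final_states={3, 4}))  # cat +N +PL
print(fst_transduce("cat", transitions, start_state=0, final_states={3, 4}))   # cat +N
```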
Subword Tokenization
Subword Tokenization is a way to algorithmically identify common sub-sequences of characters and merge them into atomic units.
Problems
Word-level tokenizers result in many out-of-vocabulary words (OOVs), and morphological analysis can help, but:
- Running morphological analyzers is expensive and language-specific
- Rare words might be “guessable” from cues
Byte Pair Encoding
Steps
- Break words into characters
- Define a number of iterations to run
- Count the frequency of all pairs of symbols
- Merge the most frequent pair of symbols, treating the merged pair as a new token
- Repeat
- To parse new inputs, just apply the learned merges in the same order (see the sketch below)
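A rough sketch of the training loop under common assumptions (a toy word-frequency vocabulary, words pre-split into characters, and an end-of-word marker "_"); the specific names and toy corpus are illustrative, not from the lecture:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(merged)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_vocab[tuple(new_word)] = freq
    return new_vocab

# Toy corpus: words broken into characters, with an end-of-word marker "_"
vocab = {
    ("l", "o", "w", "_"): 5,
    ("l", "o", "w", "e", "r", "_"): 2,
    ("n", "e", "w", "e", "s", "t", "_"): 6,
    ("w", "i", "d", "e", "s", "t", "_"): 3,
}

num_merges = 10   # number of iterations to run
merges = []       # learned merges, replayed in order when parsing new inputs
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)
    vocab = merge_pair(best, vocab)

print(merges[:5])
```

At parse time, the learned `merges` list is replayed in the same order over a new word’s character sequence, so frequent sub-sequences collapse into the same atomic units seen during training.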