Tokenization
Tokenization - word vs. character
Corpora - a huge dataset collected from various sources, used for training
OOV - Out Of Vocabulary
1. Byte-Pair Encoding
A simple data compression algorithm in which the most common pair of consecutive bytes is replaced with a byte that does not occur in the data.
Suppose we have the data aaabdaaabac, which needs to be encoded (compressed). The byte pair aa occurs most often, so we replace it with Z, since Z does not occur in our data. We now have ZabdZabac where Z = aa. The next most common byte pair is ab, so let's replace it with Y. We now have ZYdZYac where Z = aa and Y = ab. The only byte pair left is ac, which appears only once, so we will not encode it. We can apply byte pair encoding recursively to encode ZY as X. Our data has now been transformed into XdXac where X = ZY, Y = ab, and Z = aa. It cannot be compressed further, as no byte pair appears more than once. We decompress the data by performing the replacements in reverse order.
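A minimal Python sketch of this compression procedure (the function names and the `spare_symbols` parameter are my own, and it assumes the replacement symbols Z, Y, X, ... do not already occur in the data):

```python
from collections import Counter

def most_common_pair(data: str) -> tuple[str, int]:
    """Return the most frequent pair of adjacent symbols and its count."""
    pairs = Counter(data[i:i + 2] for i in range(len(data) - 1))
    return pairs.most_common(1)[0]

def bpe_compress(data: str, spare_symbols: str = "ZYXWV") -> tuple[str, list[tuple[str, str]]]:
    """Repeatedly replace the most common adjacent pair with an unused symbol."""
    table = []  # replacement rules, applied in reverse order to decompress
    for symbol in spare_symbols:
        if len(data) < 2:
            break
        pair, count = most_common_pair(data)
        if count < 2:  # no pair appears more than once -> stop
            break
        data = data.replace(pair, symbol)
        table.append((symbol, pair))
    return data, table

def bpe_decompress(data: str, table: list[tuple[str, str]]) -> str:
    """Undo the replacements in reverse order."""
    for symbol, pair in reversed(table):
        data = data.replace(symbol, pair)
    return data

compressed, table = bpe_compress("aaabdaaabac")
print(compressed)  # XdXac (the merge table may differ from the walkthrough when pair counts tie)
print(bpe_decompress(compressed, table) == "aaabdaaabac")  # True
```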
BPE is used in language models like GPT-2, RoBERTa, XLM, FlauBERT, etc.
A variant of this algorithm is used in NLP. Let us look at the NLP version of it.
BPE ensures that the most common words are represented in the vocabulary as a single token, while rare words are broken down into two or more subword tokens; this is exactly what a subword-based tokenization algorithm is meant to do. A sketch of this NLP variant follows below.
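A rough sketch of how the NLP variant learns merges from a toy word-frequency corpus (the corpus, the `</w>` end-of-word marker, and the helper names are illustrative, in the spirit of the classic subword BPE procedure, not any particular library's API):

```python
from collections import Counter

def get_pair_counts(vocab: dict[tuple[str, ...], int]) -> Counter:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair: tuple[str, str], vocab: dict) -> dict:
    """Merge every occurrence of `pair` into a single symbol in each word."""
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Toy corpus: word frequencies, each word split into characters plus an end-of-word marker.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

merges = []
for _ in range(10):                    # number of merges = vocabulary budget
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)       # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...
print(list(vocab))  # frequent words collapse to single tokens, rare ones stay split as subwords
```

After enough merges, frequent words like "newest" end up as a single token, while rarer words such as "widest" remain split into subword pieces, which is the behaviour described above.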
Reference link