Tokenization
Tokenization is the step where the input text is broken down into smaller units called tokens. Tokens can be as small as individual characters or as large as whole words.
As an example, let's look at the sentence "The child's book."
We could split the text whenever we find white space characters. The output would be:
["The", "child's", "book."]
As you can see, the punctuation is still attached to the words: "child's" keeps its apostrophe and "book." keeps its trailing period.
Alternatively, we could split the text on both white space and punctuation. The output would be:
["The", "child", "'", "s", "book", "."]
Importantly, tokenization is model-specific: each model is trained with a particular tokenizer and vocabulary, so the same text is split differently depending on the model. This can complicate pre-processing and multi-modal modeling.
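A quick way to see this in practice is a sketch using the Hugging Face `transformers` package (assumed to be installed); the two model names below are just examples, and each will split the same sentence differently:

```python
# A sketch of model-specific tokenization, assuming the Hugging Face
# `transformers` package is installed. The model names are only examples.
from transformers import AutoTokenizer

text = "The child's book."

for model_name in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Each model ships its own vocabulary and splitting rules, so the
    # same sentence is broken into different tokens.
    print(model_name, tokenizer.tokenize(text))
```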