Processing Text for an LLM:

Setup

--Tokenization: splitting data into smaller parts, e.g. dividing a sentence into its separate words.

Output 1: Count Characters
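
A minimal sketch of this step, assuming the text is already loaded into a string. The sample raw_text is a made-up stand-in, since the actual dataset is not shown in these notes:

```python
# Hypothetical sample text (assumption: stands in for the real dataset).
raw_text = "Hello, world. This is a test."

# len() on a str counts characters, not words or bytes.
num_chars = len(raw_text)
print("Total number of characters:", num_chars)
```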

Output 2: Split by Whitespace
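
One way this could look, using re.split from the standard library on a made-up sentence. Whether the original notebook used re.split or str.split is an assumption on my part:

```python
import re

# Hypothetical sample sentence (not from the dataset).
text = "Hello, world. This is a test."

# The capturing group (\s) keeps each whitespace separator in the result.
pieces = re.split(r"(\s)", text)

# Drop the whitespace-only entries so only the tokens remain.
tokens = [p for p in pieces if p.strip()]
print(tokens)
```

Note that punctuation stays attached to the words here ("Hello," and "test."), because only whitespace is used as a separator.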

Output 3: Converting Tokens into token IDs
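
A small sketch of how a vocabulary could be built, assuming the tokens are already in a list. The token list is hypothetical; the idea is to map each unique token to an integer:

```python
# Hypothetical token list (assumption: the real one comes from the dataset).
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Sort the unique tokens so the mapping is reproducible.
all_tokens = sorted(set(tokens))

# Vocabulary: token (str) -> token ID (int).
vocab = {token: idx for idx, token in enumerate(all_tokens)}

for token, idx in vocab.items():
    print(token, idx)
```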

Output 4: Encode a sentence from the dataset (tokens -> token IDs)

-prints the token IDs
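
A sketch of the encode step under some assumptions: the vocabulary and sentence below are made up, and encode is an illustrative helper, not necessarily the function used in the notebook.

```python
import re

def encode(text, str_to_int):
    """Split text into tokens and look up each token's ID."""
    # Split on punctuation, double dashes, and whitespace, keeping separators.
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    # Remove empty strings and whitespace-only pieces.
    tokens = [p.strip() for p in pieces if p.strip()]
    return [str_to_int[t] for t in tokens]

# Hypothetical vocabulary (token -> token ID).
vocab = {".": 0, "cat": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}

ids = encode("the cat sat on the mat.", vocab)
print(ids)
```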

Output 5: Decode the encoded text
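
Decoding is the reverse lookup (token ID -> token). A minimal sketch, again with a made-up vocabulary and an illustrative decode helper:

```python
import re

def decode(ids, int_to_str):
    """Turn token IDs back into a string."""
    text = " ".join(int_to_str[i] for i in ids)
    # join() puts a space before punctuation; remove it again.
    return re.sub(r'\s+([,.?!"()\'])', r"\1", text)

# Hypothetical vocabulary and its inverse mapping.
vocab = {".": 0, "cat": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
int_to_str = {idx: token for token, idx in vocab.items()}

decoded = decode([5, 1, 4, 3, 5, 2, 0], int_to_str)
print(decoded)
```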

Output 6: encoding a sentence that contains a word outside the vocabulary

I get a KeyError because the word 'delete' is not a key in the vocabulary.
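
The error can be reproduced with a plain dict lookup. The vocabulary below is hypothetical; the point is only that indexing a dict with a missing key raises KeyError:

```python
# Hypothetical vocabulary that does not contain "delete".
vocab = {".": 0, "cat": 1, "mat": 2, "on": 3, "sat": 4, "the": 5}
tokens = ["delete", "the", "cat"]

missing = None
try:
    ids = [vocab[t] for t in tokens]
except KeyError as err:
    # err.args[0] holds the key that was not found.
    missing = err.args[0]
    print("Not in vocabulary:", missing)
```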

Output 7: add meta-tokens

-<|endoftext|> is a token that marks the boundary between two unrelated texts (the end of one text and the start of the next)
-<|unk|> marks unknown words
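
A sketch of how the two meta-tokens could be appended to the vocabulary, assuming it is built from a sorted list of unique tokens. The token list is made up:

```python
# Hypothetical token list.
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

all_tokens = sorted(set(tokens))
# Append the meta-tokens last so they get the two highest IDs.
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token: idx for idx, token in enumerate(all_tokens)}
print(len(vocab))
```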

Output 8: encoding a sentence with known and unknown tokens
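
A sketch of this step, assuming unknown words are swapped for <|unk|> before the lookup. The vocabulary, the two sample texts, and the encode helper are all illustrative:

```python
import re

# Hypothetical vocabulary that already includes the two meta-tokens.
vocab = {".": 0, "cat": 1, "mat": 2, "on": 3, "sat": 4, "the": 5,
         "<|endoftext|>": 6, "<|unk|>": 7}

def encode(text, str_to_int):
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in pieces if p.strip()]
    # Replace every token that is not in the vocabulary with <|unk|>.
    tokens = [t if t in str_to_int else "<|unk|>" for t in tokens]
    return [str_to_int[t] for t in tokens]

# Two unrelated texts joined by <|endoftext|>; "delete" is unknown.
text = " <|endoftext|> ".join(["delete the cat.", "the mat."])
ids = encode(text, vocab)
print(ids)
```

With this change the unknown word 'delete' maps to the ID of <|unk|> instead of raising a KeyError.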