Griffin AI 5 8 2024 - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

Processing Text for LLM:

Setup

Class notes:
https://github.com/TheEvergreenStateCollege/upper-division-cs/wiki/AI%E2%80%9024sp%E2%80%902024%E2%80%9005%E2%80%9002%E2%80%90Afternoon

Link to code:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb


Tokenizing Text

--Splitting text into smaller parts called tokens, e.g. dividing a sentence into its separate words and punctuation marks.

Output 1: Count Characters
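A minimal sketch of the character count. The linked notebook loads its dataset from a text file; a short inline string (my own stand-in, not the dataset) is used here so the snippet runs on its own.

```python
# Stand-in text; in the notebook this would be the contents of the dataset file.
raw_text = "Hello, world. Is this-- a test?"

# len() on a str counts characters (Unicode code points), not words.
num_chars = len(raw_text)
print("Total number of characters:", num_chars)

# Peek at the start of the text to check that it loaded as expected.
preview = raw_text[:12]
print(preview)
```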

Output 2: Split by Whitespace
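Splitting on whitespace can be sketched with `re.split`. The sample sentence is an assumption for illustration. The capturing group keeps the whitespace in the result, so the empty and whitespace-only pieces are filtered out afterwards.

```python
import re

text = "Hello, world. This, is a test."

# Split on any whitespace character; the parentheses keep the separators in the list.
pieces = re.split(r"(\s)", text)

# Drop the whitespace-only entries so only the word-like tokens remain.
tokens = [p for p in pieces if p.strip()]
print(tokens)
```

Note that punctuation stays attached to the words here (`'Hello,'`), which is why a later step usually splits on punctuation as well.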

Output 3: Converting Tokens into token IDs
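One way to go from tokens to token IDs is to sort the unique tokens and number them. This is a sketch on a made-up sentence; `tokenize` is a hypothetical helper name, and the real vocabulary would be built from the whole dataset.

```python
import re

def tokenize(text):
    # Split on punctuation, double dashes, and whitespace, keeping punctuation as tokens.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in parts if p.strip()]

sample_text = "the cat sat on the mat. the dog sat, too."
tokens = tokenize(sample_text)

# Vocabulary: each unique token, in sorted order, gets an integer ID.
all_tokens = sorted(set(tokens))
vocab = {token: idx for idx, token in enumerate(all_tokens)}

for token, idx in vocab.items():
    print(token, idx)
```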

Output 4: Encode a sentence from the dataset (tokens -> token IDs)
-prints the token IDs
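Encoding can be sketched as tokenizing the text and looking each token up in the vocabulary. The tiny hand-written `vocab` below is an assumption for illustration, not the vocabulary from the dataset.

```python
import re

def encode(text, vocab):
    # Same splitting rule as for building the vocabulary.
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in parts if p.strip()]
    # Map every token to its integer ID.
    return [vocab[t] for t in tokens]

vocab = {",": 0, ".": 1, "cat": 2, "mat": 3, "on": 4, "sat": 5, "the": 6}
ids = encode("the cat sat on the mat.", vocab)
print(ids)
```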

Output 5: Decode the encoded text
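Decoding reverses the lookup: invert the vocabulary, join the tokens with spaces, then remove the spaces that end up in front of punctuation. A sketch with an assumed toy vocabulary:

```python
import re

vocab = {",": 0, ".": 1, "cat": 2, "mat": 3, "on": 4, "sat": 5, "the": 6}

# Inverse mapping: token ID -> token string.
int_to_str = {idx: token for token, idx in vocab.items()}

def decode(ids):
    text = " ".join(int_to_str[i] for i in ids)
    # Joining with spaces leaves "mat ." so strip whitespace before punctuation.
    return re.sub(r'\s+([,.:;?!"()\'])', r"\1", text)

decoded = decode([6, 2, 5, 4, 6, 3, 1])
print(decoded)
```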

Output 6: Encoding text that contains a word missing from the vocabulary

I get a KeyError because the word 'delete' is not a key in the vocabulary.
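The error can be reproduced in a few lines. The vocabulary here is a made-up three-word example; the point is only that a plain dictionary lookup raises `KeyError` for any token it has never seen.

```python
vocab = {"the": 0, "cat": 1, "sat": 2}
tokens = ["the", "cat", "delete"]

try:
    ids = [vocab[t] for t in tokens]
except KeyError as err:
    # err.args[0] holds the key that was not found.
    missing = err.args[0]
    print("Not in vocabulary:", missing)
```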

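Output 7: Add meta-tokens
-<|endoftext|> is a token that marks the boundary between two unrelated texts, i.e. where one text ends and the next begins
-<|unk|> stands in for unknown words, i.e. words that are not in the vocabulary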
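A sketch of adding the two meta-tokens: they are appended after the sorted tokens, so they receive the highest IDs. The three starting tokens are an assumption for illustration.

```python
tokens = ["the", "cat", "sat"]

# Sorted unique tokens first, then the two special tokens at the end.
all_tokens = sorted(set(tokens))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token: idx for idx, token in enumerate(all_tokens)}
print(vocab)
```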

Output 8: Encoding a sentence with known and unknown tokens
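With the meta-tokens in the vocabulary, the encoder can swap any unseen word for <|unk|> instead of raising a KeyError. A minimal sketch, assuming a toy vocabulary and two made-up sentences joined by <|endoftext|>:

```python
import re

vocab = {",": 0, ".": 1, "cat": 2, "sat": 3, "the": 4,
         "<|endoftext|>": 5, "<|unk|>": 6}
int_to_str = {idx: token for token, idx in vocab.items()}

def encode(text):
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [p.strip() for p in parts if p.strip()]
    # Replace anything outside the vocabulary with the unknown token.
    tokens = [t if t in vocab else "<|unk|>" for t in tokens]
    return [vocab[t] for t in tokens]

def decode(ids):
    text = " ".join(int_to_str[i] for i in ids)
    return re.sub(r'\s+([,.:;?!"()\'])', r"\1", text)

# "dog" is not in the vocabulary, so it should come back as <|unk|>.
text = " <|endoftext|> ".join(["the cat sat.", "the dog sat."])
ids = encode(text)
roundtrip = decode(ids)
print(ids)
print(roundtrip)
```

The round trip is lossy: decoding gives back <|unk|>, not the original word.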
