AI‐24sp‐2024‐05‐01‐Morning - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki
AI Self-Hosting, Spring 2024
Week 05
2024-05-01
- Review
- Our sequence so far: from MNIST classification, to text-to-speech synthesis, to LLMs
- LLM Architecture (Chapter 1 of Raschka book)
- Processing Datasets for LLMs (Chapter 2 of Raschka book)
Working through https://github.com/rasbt/LLMs-from-scratch.git
We want to choose the following parameters to exercise GitHub Codespaces / Gitpod:
- Text size (in words or tokens)
- Vocabulary size
- Your choice of a personally meaningful dataset:
  - a public dataset, such as written works of literature or laws
  - private data that you'd like to incorporate into a self-hosted LLM chat
    - you may wish to run this code on your own laptop or machine for privacy
  - by default, we'll choose Wikipedia circa 2008, because it is licensed under Creative Commons
    - you can choose which prefix letter to download
Come to lab on Thursday afternoon with your dataset prepared, ready to work through the text-processing steps below on it.
You'll write a dev diary entry with code blocks documenting the command-line output of each step.
Steps in Processing Data
- Opening File(s) and Counting Words
- Splitting on Punctuation and Whitespace
- Assigning Unique Token IDs (a Vocabulary)
- Handling Meta-Tokens
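The steps above can be sketched end-to-end in a few lines of Python. This is a minimal illustration, not the book's exact code: the filename `sample.txt` and its contents are stand-ins (substitute your own dataset), and the punctuation-splitting regex is one reasonable choice, not the only one.

```python
import re
from pathlib import Path

# A tiny stand-in dataset; replace "sample.txt" with your own file.
Path("sample.txt").write_text(
    "Hello, world. This is a test -- a small test.", encoding="utf-8"
)

# 1. Open the file and count words (whitespace-separated, as a first pass).
text = Path("sample.txt").read_text(encoding="utf-8")
print(len(text.split()), "whitespace-separated words")

# 2. Split on punctuation and whitespace; the capturing group keeps
#    punctuation marks as tokens of their own.
tokens = [t for t in re.split(r'([,.:;?_!"()\']|--|\s)', text) if t.strip()]

# 3. Assign each distinct token a unique integer ID (the vocabulary);
#    sorting first makes the IDs reproducible across runs.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# 4. Handle meta-tokens: one for unknown words, one for document boundaries.
for meta in ("<|endoftext|>", "<|unk|>"):
    vocab[meta] = len(vocab)

# Encode the text as token IDs, falling back to <|unk|> for anything unseen.
ids = [vocab.get(tok, vocab["<|unk|>"]) for tok in tokens]
print("vocabulary size:", len(vocab))
print("first 10 token IDs:", ids[:10])
```

For your dev diary, running each numbered step separately and pasting its printed output is one way to document the pipeline.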
Questions
- Do some datasets need to capture whitespace?
- Why do we save different capitalizations of words?
- Why might GPTs / LLMs be bad at math, based on the vocabulary you've seen?
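One way to make the last question concrete (a toy illustration, not from the book): in a word-level vocabulary, every distinct number string is its own opaque token, so a number that never appeared in the training corpus has no ID at all, and nothing in the representation exposes its digits for computation.

```python
# A tiny corpus that happens to contain a bit of arithmetic.
corpus_tokens = ["2", "+", "2", "=", "4"]
vocab = {tok: i for i, tok in enumerate(sorted(set(corpus_tokens)))}

# "2" and "4" each got an ID, but "22" never appeared, so it is missing
# entirely; a tokenizer would fall back to an unknown meta-token rather
# than decompose it into digits it could reason about.
print("2" in vocab, "22" in vocab)
```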