AI‐24sp‐2024‐05‐01‐Morning - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

AI Self-Hosting, Spring 2024

Week 05

2024-05-01

  • Review
    • Sequence from MNIST classification, to text-to-speech synthesis, to LLMs
  • LLM Architecture (Chapter 1 of Raschka book)
  • Processing Datasets for LLMs (Chapter 2 of Raschka book)

Working through https://github.com/rasbt/LLMs-from-scratch.git

We want to choose the following parameters to exercise GitHub Codespaces / Gitpod:

  • Text size (in words or tokens)
  • Vocabulary size
  • Your choice of a personally meaningful dataset:
    • a public dataset, such as written works of literature or laws
    • private data that you'd like to incorporate into a self-hosted LLM chat
      • you may wish to run this code on your own laptop or machine for privacy
    • by default, we'll choose Wikipedia circa 2008 because it is licensed under Creative Commons

Come to lab on Thursday afternoon with your private dataset prepared to work through the text processing step for your chosen dataset.

You'll write a dev diary entry with code blocks documenting the command-line output of each step.

Steps in Processing Data

  1. Opening File(s) and Counting Words

  2. Splitting on Punctuation and Whitespace

  3. Assigning Unique Token IDs (a Vocabulary)

  4. Handling Meta-Tokens
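The four steps above can be sketched in Python along the lines of the Raschka book's Chapter 2. This is a minimal sketch, not the book's exact code: the sample `text` string, the regex pattern, and the `encode` helper are illustrative choices, though the `<|unk|>` and `<|endoftext|>` meta-tokens match the ones used in the book.

```python
import re

# Step 1: open file(s) and count words. Here we use an inline sample string;
# substitute open("your-dataset.txt").read() for your own dataset.
text = "Hello, world. Is this-- a test?"
word_count = len(text.split())  # whitespace-separated word count

# Step 2: split on punctuation and whitespace, keeping punctuation as tokens.
# The capturing group makes re.split return the delimiters too.
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]

# Step 3: assign unique token IDs (a vocabulary), sorted for reproducibility.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Step 4: handle meta-tokens for unknown words and document boundaries.
for meta in ("<|unk|>", "<|endoftext|>"):
    vocab[meta] = len(vocab)

def encode(s):
    """Map a string to token IDs, falling back to <|unk|> for unseen tokens."""
    toks = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', s) if t.strip()]
    return [vocab.get(t, vocab["<|unk|>"]) for t in toks]
```

Running each step at the command line and pasting the output into your dev diary entry gives you the documentation asked for above.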

Questions

  1. Do some datasets need to capture whitespace?

  2. Why do we save different capitalizations of words?

  3. Why might GPTs / LLMs be bad at math, based on the vocabulary you've seen?
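As a starting point for question 3, consider how a simple word-level tokenizer (an illustrative sketch, not code from the book) handles numbers: each distinct numeral string becomes one opaque token, so "12345" and "12346" share no visible structure, and most large numbers never appear in the vocabulary at all.

```python
import re

def simple_tokenize(s):
    # Split on whitespace, commas, and periods; drop the empty pieces.
    return [t for t in re.split(r'(\s|[,.])', s) if t.strip()]

# "12345" is a single token here -- a vocabulary built from ordinary
# text has likely never seen it, so it would map to <|unk|>.
print(simple_tokenize("12345 + 1"))
```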