AI‐24sp‐2024‐05‐01‐Morning - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

AI Self-Hosting, Spring 2024

Week 05

2024-05-01

  • Review
    • Sequence from MNIST classification, to text-to-speech synthesis, to LLMs
  • LLM Architecture (Chapter 1 of Raschka book)
  • Processing Datasets for LLMs (Chapter 2 of Raschka book)

Working through https://github.com/rasbt/LLMs-from-scratch.git

We want to choose the following parameters to exercise GitHub Codespaces / Gitpod:

  • Text size (in words or tokens)
  • Vocabulary size
  • Your choice of a personally meaningful dataset:
    • a public dataset, such as written works of literature or laws
    • private data that you'd like to incorporate into a self-hosted LLM chat
      • you may wish to run this code on your own laptop or machine for privacy
    • by default, we'll choose Wikipedia circa 2008 because it is licensed under Creative Commons

Come to lab on Thursday afternoon with your private dataset prepared to work through the text processing step for your chosen dataset.

You'll write a dev diary entry with code blocks documenting the command-line output of each step.

Steps in Processing Data

  1. Opening File(s) and Counting Words

  2. Splitting on Punctuation and Whitespace

  3. Assigning Unique Token IDs (a Vocabulary)

  4. Handling Meta-Tokens
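The four steps above can be sketched in Python along the lines of the Raschka book's Chapter 2. This is a minimal sketch, not the book's exact code: the sample `text` string, the regex pattern, and the `encode` helper are illustrative choices, though the `<|unk|>` and `<|endoftext|>` meta-tokens match the ones used in the book.

```python
import re

# Step 1: open file(s) and count words. Here we use an inline sample string;
# substitute open("your-dataset.txt").read() for your own dataset.
text = "Hello, world. Is this-- a test?"
word_count = len(text.split())  # whitespace-separated word count

# Step 2: split on punctuation and whitespace, keeping punctuation as tokens.
# The capturing group makes re.split return the delimiters too.
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]

# Step 3: assign unique token IDs (a vocabulary), sorted for reproducibility.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Step 4: handle meta-tokens for unknown words and document boundaries.
for meta in ("<|unk|>", "<|endoftext|>"):
    vocab[meta] = len(vocab)

def encode(s):
    """Map a string to token IDs, falling back to <|unk|> for unseen tokens."""
    toks = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', s) if t.strip()]
    return [vocab.get(t, vocab["<|unk|>"]) for t in toks]
```

Running each step at the command line and pasting the output into your dev diary entry gives you the documentation asked for above.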

Questions

  1. Do some datasets need to capture whitespace?

  2. Why do we save different capitalizations of words?

  3. Why might GPTs / LLMs be bad at math, based on the vocabulary you've seen?
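As a starting point for question 3, consider how a simple word-level tokenizer (an illustrative sketch, not code from the book) handles numbers: each distinct numeral string becomes one opaque token, so "12345" and "12346" share no visible structure, and most large numbers never appear in the vocabulary at all.

```python
import re

def simple_tokenize(s):
    # Split on whitespace, commas, and periods; drop the empty pieces.
    return [t for t in re.split(r'(\s|[,.])', s) if t.strip()]

# "12345" is a single token here -- a vocabulary built from ordinary
# text has likely never seen it, so it would map to <|unk|>.
print(simple_tokenize("12345 + 1"))
```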