AI‐24sp‐2024‐05‐09‐Afternoon - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

AI Self-Hosting, Spring 2024

Week 06 - Afternoon Lab

Continue with your dataset from last week's lab.

Second Half of Chapter 2 Goals:

  • Continue getting byte-pair-encoded tokens from tiktoken
  • Create training (x, y) pairs: input chunks of tokens paired with target chunks shifted one token ahead, so each input predicts the next token.
  • Add positional encodings.
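The (x, y) pair goal above can be sketched with a sliding window over a sequence of token IDs. This is an illustrative sketch, not the book's verbatim code: `token_ids` stands in for the output of a BPE tokenizer (with tiktoken you would get it from `enc.encode(raw_text)`), and `context_size` and `stride` are example values.

```python
# A minimal sketch: build (x, y) training pairs from a token-ID sequence.
# `token_ids` stands in for the output of a BPE tokenizer such as tiktoken.
token_ids = [464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290]

context_size = 4  # length of each input chunk (illustrative value)
stride = 1        # how far the window slides between samples

pairs = []
for i in range(0, len(token_ids) - context_size, stride):
    x = token_ids[i : i + context_size]          # input chunk
    y = token_ids[i + 1 : i + context_size + 1]  # same chunk shifted one token
    pairs.append((x, y))

print(pairs[0])
# → ([464, 2068, 7586, 21831], [2068, 7586, 21831, 18045])
```

Note that y is simply x shifted by one position: during training, every position in x is asked to predict the token at the same position in y.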

First Half of Chapter 3 Goals:

  • Create embeddings layer as a neural network
  • Start training it using your dataset loader.
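The embedding-layer goal can be sketched in PyTorch as a trainable token embedding plus a trainable positional embedding. The vocabulary size below matches tiktoken's "gpt2" encoding; the embedding dimension, context length, and sample token IDs are illustrative assumptions, not values fixed by the lab.

```python
import torch

vocab_size = 50257   # size of tiktoken's "gpt2" vocabulary
embed_dim = 256      # illustrative embedding dimension
context_length = 4   # illustrative context window

# One trainable vector per token ID, and one per position in the window.
token_embedding = torch.nn.Embedding(vocab_size, embed_dim)
pos_embedding = torch.nn.Embedding(context_length, embed_dim)

# A batch of 2 input chunks, each context_length token IDs long.
inputs = torch.tensor([[464, 2068, 7586, 21831],
                       [2885, 1464, 1807, 3619]])

tok_vecs = token_embedding(inputs)                      # shape (2, 4, 256)
pos_vecs = pos_embedding(torch.arange(context_length))  # shape (4, 256)
input_embeddings = tok_vecs + pos_vecs                  # broadcast add
print(input_embeddings.shape)  # → torch.Size([2, 4, 256])
```

Because the same position vectors are added to every chunk in the batch, the model can distinguish "token 464 in position 0" from "token 464 in position 3" even though both look up the same token embedding.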

If you haven't finished Week 5's lab and written a dev diary entry, start there first and complete it.

Step 1. Create a Week6 Directory and Do Today's Work There

Change back to your assignments directory, one directory up from the data directory in the previous step.

cd <repo_dir>/ai-24sp/assignments/<your_username>

If you don't have your text in your current work environment (for example, a new GitPod workspace), download your dataset / book text using your data.sh script from last time.

./data.sh

Create a new directory for week6, then do today's work by adding and running process.py in that directory.

mkdir -p week6
cd week6
touch process.py
touch dataloader.py

The only lines you need from your Week 5 process.py are the ones that load your dataset. They should look like the following:

with open("../data/mark-twain-autobio.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(raw_text[:100])

You'll add more lines after this in process.py.

Step 2. Install Python Packages

Add a requirements.txt file with the following packages:

tiktoken
torch

Your assignments directory should now look like this:

|-- data
|   |-- data.sh
|   |-- mark-twain-autobio.html
|   `-- mark-twain-autobio.txt
|-- week-05
|   |-- process.py
|   `-- tokenizer.py
`-- week-06
    |-- process.py
    |-- dataloader.py
    `-- requirements.txt

Install these packages with this command:

pip3 install -r requirements.txt

This may take a few minutes on GitPod, so start the process and read ahead to the next step.

Step 3. Write and Run Python

Read Chapter 2 of the Raschka book on LLMs, starting from Section 2.5 through to 2.9.

Examine the code for these sections in the book's accompanying IPython notebooks.

Your goal is to re-type these code sections into your own directory, either in process.py or other .py files that you import into process.py, and adapt them to your own dataset instead of Edith Wharton's short story "The Verdict" used in the examples.
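For dataloader.py, the sliding-window dataset from these sections can be sketched with PyTorch's Dataset and DataLoader. The class name, parameter values, and toy ID list below are illustrative assumptions; a real version would encode your own text with tiktoken rather than use `range(20)`.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NextTokenDataset(Dataset):
    """Sliding-window (x, y) pairs; y is x shifted one token ahead."""

    def __init__(self, token_ids, context_size, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_size, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + context_size]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + context_size + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

# Toy token IDs standing in for enc.encode(raw_text).
token_ids = list(range(20))
dataset = NextTokenDataset(token_ids, context_size=4, stride=4)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

x, y = next(iter(loader))
print(x.shape, y.shape)  # → torch.Size([2, 4]) torch.Size([2, 4])
```

With stride equal to context_size, the windows do not overlap; a smaller stride produces more, overlapping training samples from the same text.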

Step 4. Git Add, Commit, Push

In your assignments directory, run the following command to prevent yourself from accidentally committing any large files to the class monorepo.

Now add, commit, and push your changes on a branch and pull request using our Git Workflow.
