AI‐24sp‐2024‐05‐16‐Afternoon - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

AI Self-Hosting, Spring 2024

Week 07 - Afternoon Lab

Continue with your dataset from last week's lab.

First Half of Chapter 3 Goals:

  • Code a simple, hard-coded attention layer of weights
    • Normalize it to sum to 1.0
  • Code a trainable attention layer
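As a preview of the first goal, here is a minimal sketch of the untrainable version: attention scores are dot products between one input vector and every input vector, normalized with softmax so they sum to 1.0, then used as weights in a sum. The toy embedding values below are made up for illustration and are not from the book.

```python
import torch

# Toy input: 4 token embeddings of dimension 3 (made-up values)
inputs = torch.tensor([
    [0.4, 0.1, 0.8],
    [0.5, 0.8, 0.6],
    [0.2, 0.9, 0.3],
    [0.7, 0.3, 0.1],
])

query = inputs[1]                       # attend from the second token
scores = inputs @ query                 # dot-product attention scores
weights = torch.softmax(scores, dim=0)  # normalize so the weights sum to 1.0
context = weights @ inputs              # weighted sum of all input vectors

print(weights, weights.sum())           # sum is 1.0 (up to float precision)
print(context)
```

The hard-coded version in the book works the same way; only the input values and variable names differ.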

Don't worry if you haven't finished the previous lab. Start fresh with today's lab.

Step 1. Create a Week7 Directory and Do Today's Work There

Change back to your assignments directory, one level up from the data directory used in the previous lab.

cd <repo_dir>/ai-24sp/assignments/<your_username>

If your dataset / book text isn't already in your current work environment (for example, a fresh GitPod workspace), download it again with your data.sh script from last time.

./data.sh

Create a new directory for week7 and do your work there, adding and running the Python files described below.

mkdir -p week7
cd week7
touch dataloader.py

The only lines you need from your Week 6 process.py are the ones that load your dataset. They should look like the following:

raw_text = "" 
with open("../data/mark-twain-autobio.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(raw_text[:100])

You'll also copy over your requirements.txt file from week6 and add the numpy package, which is needed for some numerical calculations and which torch depends on.

tiktoken
torch
numpy

Your directory structure in your personal / team assignments directory will look something like this, but with your own dataset in the data directory.

|-- data.sh
|-- data
|   |-- mark-twain-autobio.html
|   `-- mark-twain-autobio.txt
|-- week5
|   |-- process.py
|   `-- tokenizer.py
|-- week6
|   |-- process.py
|   |-- dataloader.py
|   `-- requirements.txt
`-- week7
    |-- 2_6_sampling.py
    |-- 2_7_embeddings.py
    |-- 2_8_positional.py
    |-- 3_3_1_untrainable.py
    |-- 3_3_2_trainable.py
    |-- dataloader.py
    `-- requirements.txt

Step 2. Install Python Packages

Install these packages with this command:

pip3 install -r requirements.txt

This may take a few minutes on GitPod, so start the process and read ahead to the next step.

Step 3. Write and Run Python

Read Chapter 3 of the Raschka book on LLMs, starting from the beginning and up to and including Section 3.4.1.

Your goal is to re-type these code sections into your own directory, in the file with the appropriate filename.

For example:

  • Section 2.6 in 2_6_sampling.py
  • Section 2.7 in 2_7_embeddings.py
  • Section 2.8 in 2_8_positional.py
  • Section 3.3.1 in 3_3_1_untrainable.py
  • Section 3.3.2 in 3_3_2_trainable.py

and adapt them to your own dataset.
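For the trainable variant (Section 3.3.2's idea), the attention weights come from learnable query, key, and value projection matrices rather than raw dot products between inputs. The sketch below uses illustrative names and dimensions, not the book's exact code:

```python
import torch

torch.manual_seed(123)

d_in, d_out = 3, 2

# Learnable projection matrices (updated by backprop in a full model)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out))
W_key = torch.nn.Parameter(torch.rand(d_in, d_out))
W_value = torch.nn.Parameter(torch.rand(d_in, d_out))

inputs = torch.rand(4, d_in)  # 4 toy token embeddings

queries = inputs @ W_query
keys = inputs @ W_key
values = inputs @ W_value

scores = queries @ keys.T  # pairwise attention scores
# Scale by sqrt of the key dimension, then softmax so each row sums to 1.0
weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
context = weights @ values  # one context vector per token

print(context.shape)
```

The book wraps this same computation in a small nn.Module class; the math is identical.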

Run these files to reproduce the output shown in the book, and copy and paste your results into a dev diary entry for today.

Step 4. Git Add, Commit, Push

In your assignments directory, double-check what you are about to stage so that you don't accidentally commit any large files to the class monorepo.

Now add, commit, and push your changes on a branch and pull request using our Git Workflow.
