# Week 07 - Afternoon Lab (AI 24sp, 2024-05-16)
Continue with your dataset from last week's lab.
## First Half of Chapter 3 Goals

- Code a simple attention layer with hard-coded weights (see the sketch just below)
- Normalize the weights so they sum to 1.0
- Code a trainable attention layer
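To preview where the first two goals are headed, here is a minimal sketch of hard-coded attention weights normalized with softmax. The 3x4 "embeddings", the choice of query token, and the variable names are made up for illustration; in `3_3_1_untrainable.py` you will re-type the book's own example inputs instead.

```python
import torch

# Made-up embeddings for three tokens (illustrative only)
inputs = torch.tensor([
    [0.4, 0.1, 0.8, 0.3],   # token 1
    [0.5, 0.9, 0.1, 0.4],   # token 2
    [0.2, 0.8, 0.6, 0.7],   # token 3
])

query = inputs[1]                        # attend from the second token
scores = inputs @ query                  # dot product of the query with every token
weights = torch.softmax(scores, dim=0)   # normalize so the weights sum to 1.0

print(weights, weights.sum())            # sum is 1.0 (up to floating-point error)
context = weights @ inputs               # weighted sum of all token vectors
print(context)
```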
Don't worry if you haven't finished the previous lab. Start fresh with today's lab.
Change back to your assignments directory, one directory up from the `data` directory from the previous step.

```sh
cd <repo_dir>/ai-24sp/assignments/<your_username>
```
Download your dataset / book text using your `data.sh` script from last time, if you don't already have the text in your current work environment (for example, in a new GitPod workspace).

```sh
./data.sh
```
Create a new directory for week7 and do your work there, starting with a `dataloader.py` file in this directory.

```sh
mkdir -p week7
cd week7
touch dataloader.py
```
The only lines you need from your Week 6 `process.py` are the ones that load your dataset. They should look like the following, with your own filename in place of the Mark Twain example:

```python
raw_text = ""
with open("../data/mark-twain-autobio.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(raw_text[:100])
```
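The later sections assume you can turn `raw_text` into token IDs. As a quick sanity check, you might add something like the following; it assumes the `tiktoken` package listed in the `requirements.txt` below, and the `raw_text` variable from the snippet above.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
# allowed_special lets the encoder pass through the end-of-text marker if your file contains it
token_ids = tokenizer.encode(raw_text, allowed_special={"<|endoftext|>"})
print(f"Total number of tokens: {len(token_ids)}")
print(token_ids[:20])
```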
You'll also copy over your `requirements.txt` file from week6 and add the `numpy` package, which is needed for some numerical calculations and which `torch` depends on:

```
tiktoken
torch
numpy
```
Your directory structure in your personal / team assignments directory will look something like this, but with your own dataset in the `data` directory.
```
|-- data.sh
|-- data
|   |-- mark-twain-autobio.html
|   `-- mark-twain-autobio.txt
|-- week5
|   |-- process.py
|   `-- tokenizer.py
|-- week6
|   |-- process.py
|   |-- dataloader.py
|   `-- requirements.txt
`-- week7
    |-- 2_6_sampling.py
    |-- 2_7_embeddings.py
    |-- 2_8_positional.py
    |-- 3_3_1_untrainable.py
    |-- 3_3_2_trainable.py
    |-- dataloader.py
    `-- requirements.txt
```
Install these packages with this command:

```sh
pip3 install -r requirements.txt
```
This may take a few minutes on GitPod, so start the process and read ahead to the next step.
Read Chapter 3 of the Raschka book on LLMs, starting from the beginning and up to and including Section 3.4.1.
Your goal is to re-type these code sections into your own directory, in the file with the appropriate filename, and adapt them to your own dataset. For example:

- Section 2.6 in `2_6_sampling.py` (see the sampling sketch below)
- Section 2.7 in `2_7_embeddings.py`
- Section 2.8 in `2_8_positional.py`
- Section 3.3.1 in `3_3_1_untrainable.py`
- Section 3.3.2 in `3_3_2_trainable.py` (see the trainable attention sketch below)
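If you want a reference point for Section 2.6 before re-typing it, here is a minimal sketch of the sliding-window sampling idea. It follows the general pattern of the book's data loader but is not a verbatim copy: the class name `GPTDataset`, the `max_length`/`stride` values, and the Mark Twain file path are all illustrative, so substitute your own dataset and the book's actual code.

```python
# A sketch of sliding-window sampling (Section 2.6), not the book's exact code.
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

with open("../data/mark-twain-autobio.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

class GPTDataset(Dataset):
    """Slide a window of max_length tokens across the text; each target is the input shifted by one."""

    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i : i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
dataset = GPTDataset(raw_text, tokenizer, max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=8, shuffle=False)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # expect torch.Size([8, 4]) for both
```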
Run these files to produce the same output as in the book, and copy and paste that output into a dev diary entry for today.
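For the trainable attention goal (covered as you read up to Section 3.4.1 and put into `3_3_2_trainable.py`), the key change is that queries, keys, and values come from learned weight matrices instead of the raw inputs. Here is a minimal forward-pass sketch with made-up inputs and illustrative dimensions; the book builds this into a proper `nn.Module`, which is what you should re-type.

```python
import torch

torch.manual_seed(123)

# Same made-up 3-token embeddings as in the earlier sketch (illustrative only)
inputs = torch.tensor([
    [0.4, 0.1, 0.8, 0.3],
    [0.5, 0.9, 0.1, 0.4],
    [0.2, 0.8, 0.6, 0.7],
])
d_in, d_out = inputs.shape[1], 3

# Trainable projection matrices; only a forward pass is shown here
W_query = torch.nn.Parameter(torch.rand(d_in, d_out))
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out))
W_value = torch.nn.Parameter(torch.rand(d_in, d_out))

queries = inputs @ W_query
keys    = inputs @ W_key
values  = inputs @ W_value

scores = queries @ keys.T                                        # scores for every pair of tokens
weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)  # scale, then normalize each row to 1.0
context = weights @ values                                       # one context vector per token
print(context.shape)                                             # torch.Size([3, 3])
```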
In your assignments directory, run the following command to prevent yourself from accidentally committing any large files to the class monorepo.
Now add, commit, and push your changes on a branch and pull request using our Git Workflow.