# Week 06 - Afternoon Lab
Continue with your dataset from last week's lab.
## Second Half of Chapter 2 Goals

- Continue getting byte-pair-encoded tokens from `tiktoken`.
- Create training (x, y) pairs of token chunks, where each chunk's target is the same chunk shifted one token ahead, so that the model learns to predict the next word (see the sketch after this list).
- Add positional encodings.
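As a preview of the first two goals, here is a minimal sketch, assuming the GPT-2 `tiktoken` encoding and an illustrative context window of 4 tokens (both are choices you can vary, not requirements):

```python
import tiktoken

# Byte-pair-encode text with the GPT-2 vocabulary.
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = tokenizer.encode("Hello, world. This is a test sentence.")

# Sliding window: each input chunk x is paired with a target chunk y
# that is the same window shifted one token to the right.
context_size = 4
for i in range(len(token_ids) - context_size):
    x = token_ids[i : i + context_size]
    y = token_ids[i + 1 : i + context_size + 1]
    print(x, "-->", y)
```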
## First Half of Chapter 3 Goals

- Create an embeddings layer as a neural network (see the sketch after this list).
- Start training it using your dataset loader.
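A matching sketch for the embeddings goal, combining a token-embedding layer with a positional-embedding layer. The vocabulary size is GPT-2's; the embedding width and context length here are illustrative assumptions:

```python
import torch

vocab_size = 50257     # size of the GPT-2 BPE vocabulary
output_dim = 256       # illustrative embedding width
context_length = 4     # must match the window used for your (x, y) pairs

# One learned vector per vocabulary entry, and one per window position.
token_embedding = torch.nn.Embedding(vocab_size, output_dim)
pos_embedding = torch.nn.Embedding(context_length, output_dim)

x = torch.tensor([[6109, 3626, 6100, 345]])  # a batch of one token-id chunk
input_embeddings = token_embedding(x) + pos_embedding(torch.arange(context_length))
print(input_embeddings.shape)  # torch.Size([1, 4, 256])
```

Both layers are ordinary trainable `torch.nn.Embedding` modules, so their weights will be updated once you start training.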
If you haven't finished Week 5's lab and written a dev diary entry, start there first and complete it.
Change back to your assignments directory, one directory up from the `data` directory from the previous step:

```sh
cd <repo_dir>/ai-24sp/assignments/<your_username>
```
If you don't have your text in your current work environment (for example, a new GitPod workspace), download your dataset / book text using your `data.sh` script from last time:

```sh
./data.sh
```
Create a new directory for Week 6 and do your work there, adding and running `process.py` (and `dataloader.py`, described below) in this directory:

```sh
mkdir -p week-06
cd week-06
touch process.py
touch dataloader.py
```
The only lines you need from your Week 5 `process.py` are the ones that load your dataset. They should look like the following:

```python
raw_text = ""
with open("../data/mark-twain-autobio.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(raw_text[:100])
```

You'll add more lines after this in `process.py`.
Add a `requirements.txt` file with the following packages:

```
tiktoken
torch
```
Your assignments directory should now look like this:

```
|-- data
|   |-- data.sh
|   |-- mark-twain-autobio.html
|   `-- mark-twain-autobio.txt
|-- week-05
|   |-- process.py
|   `-- tokenizer.py
`-- week-06
    |-- process.py
    |-- dataloader.py
    `-- requirements.txt
```
Install these packages with this command:

```sh
pip3 install -r requirements.txt
```
This may take a few minutes on GitPod, so start the process and read ahead to the next step.
Read Chapter 2 of the Raschka book on LLMs, from Section 2.5 through 2.9. Examine the code for these sections in the book's accompanying Jupyter notebooks.
Your goal is to re-type these code sections into your own directory, either in `process.py` or in other `.py` files that you import into `process.py`, and adapt them to your own dataset instead of the Edith Wharton short story *The Verdict* used in the examples.
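For `dataloader.py`, here is one possible sketch in the spirit of the book's `GPTDatasetV1` example; the class and function names, batch size, window length, and stride below are illustrative assumptions that you should adapt to your own dataset:

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    """Splits token ids into overlapping (input, target) windows."""

    def __init__(self, text, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(text)
        # Slide a window of max_length tokens over the ids, stepping by stride.
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i : i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1 : i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

def create_dataloader(text, batch_size=8, max_length=256, stride=128, shuffle=True):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDataset(text, tokenizer, max_length, stride)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=True)
```

In `process.py` you could then import it and iterate over batches of your own book text, for example `from dataloader import create_dataloader` followed by `for x, y in create_dataloader(raw_text): ...`.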
In your assignments directory, make sure your large dataset files are ignored by Git so that you don't accidentally commit them to the class monorepo.
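One way to do this is with a `.gitignore`; the patterns below are a sketch that assumes your downloaded texts live under `data/`:

```
# .gitignore in ai-24sp/assignments/<your_username>/
data/*.txt
data/*.html
```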
Now add, commit, and push your changes on a branch, and open a pull request using our Git Workflow.