Week 05 - Afternoon Lab
Complete as much of this pre-lab as you can before coming to lab on Thursday afternoon. If you're able to choose and prepare your dataset beforehand, you can spend more time on the main lab, using byte-pair encoding to tokenize and embed.
Choose a dataset for training your GPT for the final project.
(You may change this later as you discover more about the training purpose and your use case).
It should be:
- Convertible to text (ASCII or UTF-8 encodings are the easiest to work with)
- Accessible to download from a URL using `curl` or `wget`
- Useful or personally meaningful to you
- Written in, and representative of, a natural human language that you are comfortable conversing in
  - For most of us, this is modern English
- Between 600,000 and 1,000,000 space-separated words long (a quick way to check this is sketched just after this list)
  - If your preferred dataset has fewer words than this, it will work better as a finetuning dataset than as a pre-training dataset
  - Choose another dataset, or use the Wikipedia articles by 2-letter prefix, which are available under a Creative Commons license
  - We will perform finetuning in Weeks 9 and 10, so save your smaller datasets until then
- Small enough to fit comfortably on your laptop's storage device, or in the space you have available in GitPod or GitHub Codespaces
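If you want to check the word-count requirement quickly, a minimal sketch like the following will do; the filename is a placeholder for whatever you download.

```python
# Rough word count for a candidate dataset.
# "your-dataset.txt" is a placeholder; substitute your own file.
with open("your-dataset.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

num_words = len(raw_text.split())
print(f"Space-separated words: {num_words:,}")

in_range = 600_000 <= num_words <= 1_000_000
print("Within the 600,000-1,000,000 word range:", in_range)
```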
You can use the Python package `html2text` to convert an HTML or XML page to plaintext:
pip3 install html2text
html2text yourfile.html > yourfile.txt
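If you would rather do the conversion from inside Python than from the shell, the `html2text` package can also be imported as a library. Here is a minimal sketch; the filenames are placeholders:

```python
# Convert an HTML file to plaintext using html2text as a library.
# "yourfile.html" and "yourfile.txt" are placeholder names.
import html2text

with open("yourfile.html", "r", encoding="utf-8") as f:
    html = f.read()

plain_text = html2text.html2text(html)

with open("yourfile.txt", "w", encoding="utf-8") as f:
    f.write(plain_text)
```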
If you have a PDF file, you can convert it to plaintext using the `pdftotext` tool that is part of the `xpdf` source code:
https://dl.xpdfreader.com/xpdf-4.05.tar.gz
We will work on getting this added to our GitPod image, but in the meantime, you can download and compile it for your platform.
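Once you have a `pdftotext` binary, you can also call it from a Python script if you want to keep all of your conversion steps in one place. This is only a sketch; it assumes the binary is on your PATH, and the filenames are placeholders:

```python
# Run the pdftotext CLI from Python; assumes pdftotext is on your PATH.
# "yourfile.pdf" and "yourfile.txt" are placeholder names.
import subprocess

subprocess.run(["pdftotext", "yourfile.pdf", "yourfile.txt"], check=True)
```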
Start with a clean working directory of the class monorepo, on the `main` branch.
Create an assignment directory for yourself and change into it:
cd <repo_dir>
mkdir -p ai-24sp/assignments/<your_username>/data
cd ai-24sp/assignments/<your_username>/data
Write a shell script called `data.sh` to download your files and extract them to this directory.
For example:
#!/bin/sh
wget https://www.gutenberg.org/files/19987/19987-h/19987-h.htm
html2text 19987-h.htm > mark-twain-autobio.txt
Add these patterns to a `.gitignore` file in that directory:
*.htm
*.txt
Change back to your assignments directory, one directory up from the `data` directory in the previous step:
cd <repo_dir>/ai-24sp/assignments/<your_username>
Create a new directory for `week5` and do your work for the next steps in this directory:
mkdir -p week5
cd week5
touch process.py
touch tokenizer.py
Be sure to load your file from one directory up, for example:
with open("../data/name-of-your-file.txt", "r") as f:
raw_text = f.read()
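A quick sanity check right after loading confirms the file was read and produces the first two outputs you will record later in this lab (a minimal sketch continuing from the snippet above):

```python
# Continues from the snippet above: raw_text holds your whole dataset.
print("Total number of character:", len(raw_text))
print(raw_text[:99])
```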
Your directory structure will now look something like:
|-- data
|   |-- .gitignore
|   |-- data.sh
|   |-- 19987-h.htm
|   `-- mark-twain-autobio.txt
`-- week5
    |-- process.py
    `-- tokenizer.py
Read Chapter 2 of the Raschka book on LLMs, all the way through to Section 2.4, on token IDs.
You may find the code for Chapter 2 useful.
Add each Python code listing to a Python file called `process.py` and adapt it to run on the dataset file that you downloaded in the previous step.
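As a rough guide, the first listings adapted to your own file come out looking something like this sketch, which continues from the `raw_text` you loaded above (the splitting regex is the same one that appears in the traceback further down):

```python
# Continuing from raw_text loaded above: split the text into tokens,
# keeping punctuation and "--" as separate tokens.
import re

preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

# Build a vocabulary that maps each unique token to an integer ID.
all_tokens = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_tokens)}
for token, integer in list(vocab.items())[:5]:
    print((token, integer))
```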
You may choose to put your simple tokenizers into `tokenizer.py` and import them into `process.py` if it helps you to modularize your code. In your `process.py` you can import code from `tokenizer.py` with a statement like this:
from tokenizer import SimpleTokenizerV1
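For reference, here is a sketch of what `tokenizer.py` might contain, following the Chapter 2 listings. The body of `encode` matches the traceback shown below; `decode` reverses the mapping, and its exact clean-up regex is an assumption that may differ slightly from the book's.

```python
# tokenizer.py -- a sketch of the simple vocabulary-based tokenizer from Chapter 2.
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token string -> integer ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # integer ID -> token string

    def encode(self, text):
        # Split on punctuation, "--", and whitespace, then drop empty pieces.
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Look up each token's ID (raises KeyError for tokens not in the vocabulary).
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() inserted before punctuation.
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)
```

In `process.py`, something like `tokenizer = SimpleTokenizerV1(vocab)` followed by `ids = tokenizer.encode(text)` and `tokenizer.decode(ids)` produces output pairs like the ID list and decoded sentence shown below.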
At each step, copy and paste the output of your program into your dev diary entry and describe what each one shows.
You should have 8 outputs total that look like the following:
Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
Include as many vocabulary words as you need to show what you think is interesting about this dataset.
('!', 0)
('"', 1)
("'", 2)
...
('Has', 49)
('He', 50)
[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'
This one won't be program output in the usual sense: include the error stack trace you get when a token key can't be found in your vocab dictionary.
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[16], line 5
1 tokenizer = SimpleTokenizerV1(vocab)
3 text = "Hello, do you like tea. Is this-- a test?"
----> 5 tokenizer.encode(text)
Cell In[12], line 9, in SimpleTokenizerV1.encode(self, text)
7 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
8 preprocessed = [item.strip() for item in preprocessed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
10 return ids
Cell In[12], line 9, in <listcomp>(.0)
7 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
8 preprocessed = [item.strip() for item in preprocessed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
10 return ids
KeyError: 'Hello'
('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)
'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
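One way to get from the `KeyError` above to these final outputs, in the direction Section 2.4 points, is to append the `<|endoftext|>` and `<|unk|>` special tokens to the vocabulary and have a second tokenizer substitute `<|unk|>` for anything it does not recognize before looking up IDs. A sketch:

```python
# tokenizer.py (continued) -- a second tokenizer that maps unknown tokens
# to <|unk|> instead of raising a KeyError. Assumes the vocabulary was
# built with the two special tokens appended, e.g.:
#   all_tokens = sorted(set(preprocessed)) + ["<|endoftext|>", "<|unk|>"]
#   vocab = {token: integer for integer, token in enumerate(all_tokens)}
import re

class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Replace any token that is not in the vocabulary with <|unk|>.
        preprocessed = [item if item in self.str_to_int else "<|unk|>"
                        for item in preprocessed]
        return [self.str_to_int[s] for s in preprocessed]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)
```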
In your assignments directory, run the following commands to prevent yourself from accidentally committing any large files to the class monorepo.
cd <repo_dir>/ai-24sp/assignments/<your_username>
find . -size +100k >> .gitignore
git add .gitignore
Now add, commit, and push your changes on a branch and open a pull request using our Git Workflow.