AI-24sp-2024-05-02-Afternoon

AI Self-Hosting, Spring 2024

Week 05 - Afternoon Lab

Pre-Lab

Complete as much of this pre-lab as you can before coming to lab on Thursday afternoon. If you're able to choose and prepare your dataset beforehand, you can spend more time on the main lab, using byte-pair encoding to tokenize and embed.

Step 1. Choose a Dataset

Choose a dataset for training your GPT for the final project.

(You may change this later as you discover more about the training purpose and your use case).

It should:

  • Be convertible to text (ASCII or UTF-8 encodings are the easiest to work with)
  • Be accessible to download from a URL using curl or wget
  • Be useful or personally meaningful to you
  • Be written in, and representative of, a natural human language that you are comfortable conversing in
    • For most of us, this is modern English
  • Have between 600,000 and 1,000,000 space-separated words (see the word-count sketch just after this list)
    • If your preferred dataset has fewer words than this, it will be better suited to fine-tuning than to pre-training
    • In that case, choose another dataset, or use the Wikipedia articles by 2-letter prefix, which are available under a Creative Commons license
    • We will perform fine-tuning in Weeks 9 and 10, so save your smaller datasets until then.
  • Fit comfortably on your laptop's storage device, or in the space you have available in GitPod or GitHub Codespaces
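
If you are not sure whether a candidate file falls in that range, here is a minimal word-count sketch (the filename is a placeholder, and the file is assumed to be plain UTF-8 text):

with open("candidate.txt", "r", encoding="utf-8") as f:
    words = f.read().split()   # split() with no arguments splits on any run of whitespace
print("Space-separated words:", len(words))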

Converting from XML or HTML

You can use the Python package html2text to convert an HTML or XML page to plaintext.

pip3 install html2text
html2text yourfile.html > yourfile.txt

Converting from PDF

If you have a PDF file, you can convert it to plaintext using the pdftotext tool that is part of the xpdf source code.

https://dl.xpdfreader.com/xpdf-4.05.tar.gz

We will work on getting this added to our GitPod image, but in the meantime, you can download and compile it for your platform.
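
If you would rather call it from Python once the binary is built, here is a minimal sketch using the standard-library subprocess module (the filenames are placeholders, and pdftotext is assumed to be on your PATH):

import subprocess

# pdftotext <input.pdf> <output.txt> writes the extracted plain text to the second file
subprocess.run(["pdftotext", "yourfile.pdf", "yourfile.txt"], check=True)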

Step 2. Download Data to Your Assignments Directory

Start with a clean working directory of the class monorepo, on the main branch.

Create a data directory under your assignments folder and change into it:

cd <repo_dir>
mkdir -p ai-24sp/assignments/<your_username>/data
cd ai-24sp/assignments/<your_username>/data

Write a shell script called data.sh to download your files and extract them to this directory.

For example:

#!/bin/sh

wget https://www.gutenberg.org/files/19987/19987-h/19987-h.htm
html2text 19987-h.htm > mark-twain-autobio.txt

Add these patterns to a .gitignore file in that directory so the downloaded and converted files are not committed:

*.htm
*.txt

Step 3. Create a Week5 Directory and Do Today's Work There

Change back to your assignments directory, one directory up from the data directory in the previous step.

cd <repo_dir>/ai-24sp/assignments/<your_username>

Create a new week5 directory and do the next step's work there.

mkdir -p week5
cd week5
touch process.py
touch tokenizer.py

Be sure to load your data file from the data directory one level up, for example:

with open("../data/name-of-your-file.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()

Your directory structure will now look something like:

|-- data
|   |-- .gitignore
|   |-- 19987-h.htm
|   |-- data.sh
|   `-- mark-twain-autobio.txt
`-- week5
    |-- process.py
    `-- tokenizer.py

Step 4. Write and Run Python

Read Chapter 2 of the Raschka book on LLMs, all the way through to Section 2.4, on token IDs.

You may find the code for Chapter 2 useful.

Add each Python code listing to a Python file called process.py and adapt it to run on the dataset file that you downloaded in the previous step.

You may choose to put your simple tokenizers into tokenizer.py and import them into process.py if it helps you modularize your code. In process.py you can import code from tokenizer.py with a statement like this:

from tokenizer import SimpleTokenizerV1

At each step, copy and paste the output of your program into your dev diary entry and describe what it shows.

You should have 8 outputs total that look like the following:

Output 1: Counting total characters and printing the first hundred characters

Total number of character: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no 
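
The exact code comes from the book's listing, but a minimal sketch looks like this (the filename is a placeholder for your own dataset):

with open("../data/name-of-your-file.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
print("Total number of character:", len(raw_text))   # character count, not word count
print(raw_text[:99])   # roughly the first hundred characters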

Output 2: Splitting on whitespace and punctuation marks

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
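
A minimal sketch of this splitting step, reusing raw_text from Output 1 and the same regular expression that appears in the traceback under Output 6 (your listing from the book may include a few more punctuation characters):

import re

# split on punctuation, double dashes, and whitespace, keeping the delimiters as tokens
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]   # drop empty strings
print(preprocessed[:30])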

Output 3: Assigning vocabulary words to token IDs

Include as many vocabulary words as needed to show what you think is interesting about this dataset.

('!', 0)
('"', 1)
("'", 2)
...
('Has', 49)
('He', 50)
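
A minimal sketch of building this vocabulary from the token list in Output 2, printing only the first entries:

# assign consecutive integer IDs to the unique tokens, in sorted order
all_words = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_words)}
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 50:
        break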

Output 4: Encoding a sentence from the dataset

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]

Output 5: Round-trip decoding back to a sentence from token IDs

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

Output 6: Encoding a new test sentence not from the dataset

This one won't be normal program output; instead, include the error stack trace you get when a token key cannot be found in your vocab dictionary.


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[16], line 5
      1 tokenizer = SimpleTokenizerV1(vocab)
      3 text = "Hello, do you like tea. Is this-- a test?"
----> 5 tokenizer.encode(text)

Cell In[12], line 9, in SimpleTokenizerV1.encode(self, text)
      7 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
      8 preprocessed = [item.strip() for item in preprocessed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
     10 return ids

Cell In[12], line 9, in <listcomp>(.0)
      7 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
      8 preprocessed = [item.strip() for item in preprocessed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
     10 return ids

KeyError: 'Hello'

Output 7: Adding meta-tokens to the end of the vocabulary

('younger', 1156)
('your', 1157)
('yourself', 1158)
('<|endoftext|>', 1159)
('<|unk|>', 1160)
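
A minimal sketch of extending the vocabulary with these two meta-tokens and printing the last few entries:

# append the meta-tokens after all of the ordinary vocabulary words
all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])
vocab = {token: integer for integer, token in enumerate(all_tokens)}
for item in list(vocab.items())[-5:]:
    print(item)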

Output 8: Encoding and decoding a sentence containing both known and unknown tokens

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
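
Producing this output requires an encode step that substitutes <|unk|> for any token missing from the vocabulary. Below is a minimal sketch written as a hypothetical subclass of the SimpleTokenizerV1 sketch above; the book develops its own version of this in the same chapter, so adapt its listing rather than treating this as the definitive implementation.

class SimpleTokenizerV2(SimpleTokenizerV1):
    def encode(self, text):
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # replace out-of-vocabulary tokens with the <|unk|> meta-token
        preprocessed = [item if item in self.str_to_int else "<|unk|>" for item in preprocessed]
        return [self.str_to_int[s] for s in preprocessed]

tokenizer = SimpleTokenizerV2(vocab)
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."
print(tokenizer.decode(tokenizer.encode(text)))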

Step 5. Git Add, Commit, Push

In your assignments directory, run the following commands to prevent yourself from accidentally committing any large files to the class monorepo.

cd <repo_dir>/ai-24sp/assignments/<your_username>
find . -size +100k >> .gitignore
git add .gitignore

Now add, commit, and push your changes on a branch, and open a pull request using our Git Workflow.

Main Lab
