AI-Homework-06
Read the second half of Chapter 2 in the Raschka LLM book, from Section 2.6 until the end.
Read Appendix A in the same book, up to and including Section A.5, on PyTorch and tensors.
Attempt a response to the following questions in a dev diary entry.
- Match each of the red and blue diagrams below with one of the following processes:
- Encoding text from words to token IDs (integers).
- Decoding from token IDs (integers) to words.


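For reference, here is a minimal sketch of both directions, using a tiny hypothetical vocabulary (the book builds its real vocabulary from a full text):

```python
# Minimal sketch of the two directions with a tiny hypothetical vocabulary.
str_to_int = {"a": 0, "cat": 1, "sat": 2, "the": 3}
int_to_str = {i: s for s, i in str_to_int.items()}

def encode(words):
    # Encoding: words -> token IDs (integers)
    return [str_to_int[w] for w in words]

def decode(ids):
    # Decoding: token IDs (integers) -> words
    return " ".join(int_to_str[i] for i in ids)

print(encode(["the", "cat", "sat"]))  # [3, 1, 2]
print(decode([3, 1, 2]))              # "the cat sat"
```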
- What are the two special tokens added to the end of the vocabulary shown in the tokenizer below? What are their token IDs?

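Whatever the figure names them, the usual mechanism is to append special tokens after all the ordinary vocabulary entries, so they receive the highest token IDs. A hedged sketch with placeholder token names and a stand-in base vocabulary:

```python
# Sketch only: the special-token names below are placeholders, not the
# answer, and the base vocabulary is a stand-in for the book's full one.
base_tokens = sorted({"a", "cat", "sat", "the"})
all_tokens = base_tokens + ["<|special-1|>", "<|special-2|>"]  # appended last
vocab = {token: token_id for token_id, token in enumerate(all_tokens)}
print(vocab)  # the two special tokens get the two largest IDs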
- The following diagram shows multiple documents being combined (concatenated) into a single large document to provide a high-quality training dataset for a GPT.
What is the special token that separates the original documents in the final combined dataset? How can you tell how many original documents were combined into this dataset?

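The mechanics look roughly like the sketch below (hypothetical document strings; check the diagram for the exact separator token):

```python
# Sketch: concatenating documents with a separator special token.
docs = ["First document text.", "Second document.", "Third document."]
separator = "<|endoftext|>"
combined = f" {separator} ".join(docs)
print(combined)
# Counting separators (plus one) recovers how many documents were joined.
print(combined.count(separator) + 1)  # -> 3
```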
- Using the byte-pair-encoding (BPE) tokenizer, unknown words are broken down into smaller sub-word units, down to single characters and character pairs.

What are some of the advantages of this approach? Let's use the example of a long but unusual word, "Pneumonoultramicroscopicsilicovolcanoconiosis", which might be broken down into smaller word parts like "Pn-eu-mono-ul-tra-micro-scopic-silico-volcano-conio-sis" (see the sketch after this list).
For each choice below, explain why it is an advantage or a disadvantage of this approach.
- It lets the GPT learn connections between the long word and shorter words based on common parts, like "pneumonia", "microscope", or "volcano".
- This approach can work in any language, not just English.
- All words are broken down in the same way, so the process is deterministic and results from repeated runs will be similar.
- The system will handle any word it encounters in chat or inference, even if it has never seen the word during training.
- It is easier than looking up the word in a hash table, not finding it, and falling back to a single `<|unk|>` unknown-word special token.
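To see how a real BPE tokenizer splits this word, you can probe it with `tiktoken`; the actual pieces depend on the tokenizer's learned merges, so they will likely differ from the hyphenation above:

```python
import tiktoken

# Probe how GPT-2's BPE splits a long, unusual word into sub-word pieces.
tokenizer = tiktoken.get_encoding("gpt2")
word = "Pneumonoultramicroscopicsilicovolcanoconiosis"
ids = tokenizer.encode(word)
print(ids)
print([tokenizer.decode([i]) for i in ids])  # the sub-word pieces
```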
- A BPE-tokenized encoding is shown below with successively longer lists of token IDs. The code that produced it is:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = "In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

for i in range(10):
    decoded = tokenizer.decode(integers[:i + 1])
    print(f"{integers[:i + 1]} {decoded}")
```
This is the output:

```
[818] In
[818, 262] In the
[818, 262, 4252] In the sun
[818, 262, 4252, 18250] In the sunlit
[818, 262, 4252, 18250, 8812] In the sunlit terr
[818, 262, 4252, 18250, 8812, 2114] In the sunlit terraces
[818, 262, 4252, 18250, 8812, 2114, 286] In the sunlit terraces of
[818, 262, 4252, 18250, 8812, 2114, 286, 617] In the sunlit terraces of some
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680] In the sunlit terraces of someunknown
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271] In the sunlit terraces of someunknownPlace
```
What kinds of words (tokens) do you think tend to have smaller integers as token IDs, and what kinds tend to have larger integers?
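One way to form a guess before answering is to decode a few IDs from each end of the vocabulary; the probe IDs below are arbitrary:

```python
import tiktoken

# Decode a few small and a few large token IDs and compare what kinds
# of strings live at each end of the GPT-2 vocabulary.
tokenizer = tiktoken.get_encoding("gpt2")
for token_id in [11, 262, 818, 30000, 48000, 50255]:
    print(token_id, repr(tokenizer.decode([token_id])))
```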
What tokens smaller than words receive their own IDs? Are they always separated by a space from the token before them? If you were the Python programmer writing the `tokenizer.decode` method that produced the English strings above from a list of integers, how would you implement the feature of letting these "sub-word" tokens join the sentence with or without a space?
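One possible design, and the one GPT-2's byte-level BPE actually uses, is to bake the leading space into each word-initial token string, so that `decode` is plain concatenation. A minimal sketch with a hypothetical four-entry vocabulary:

```python
# Sketch: tokens that start a new word store their own leading space
# (" the"); continuation tokens like "lit" attach with no space.
int_to_str = {0: "In", 1: " the", 2: " sun", 3: "lit"}

def decode(ids):
    # Plain concatenation: spacing is already encoded in the tokens.
    return "".join(int_to_str[i] for i in ids)

print(decode([0, 1, 2, 3]))  # -> "In the sunlit"
```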
Read the following case description of *Hachette v. Internet Archive*.
In the same dev diary entry as above, write a response of at least 500 words addressing the following questions:
- How does this case relate to selecting training datasets for GPTs, such as our final project in this class?
- Describe the main point of view of Hachette and the other publishers on the plaintiff side of this case (the party that claims it has been wronged).
- Describe the main point of view of Internet Archive, the founder Brewster Kahle, the Electronic Frontier Foundation, or other parties on the defendant side of this case (the party defending against claims that they have done anything wrong).
- What other news event or legal case is similar to this case that is interesting to you?
- Compare and contrast the two cases. How are they similar and how are they different?
- Which of the above arguments are convincing to you? If you were the judge in this case, how would you form your opinion?
- Suppose you wanted to train a GPT based on the code and written work of you and your classmates.
- What ethical questions and considerations would come up for you in this process?
If you use an AI chat to help develop your response, include a link to your chat and try to keep it a 50%-50% conversation: write prompts, questions, or summaries of your own that are at least as long as the AI's responses, or ask the AI to deliberately shorten its answers.
Note: Come up with your own thesis statement after reading the article, before talking to anyone else, whether a classmate or an AI. Do not ask another entity to develop ideas from scratch or reach a conclusion for you. You may use other sources to check your own understanding, but independently verify and do additional work outside of any conversation to make sure the contributions are usable, true, and support your main point.