
AI Homework Week 6

Link to original assignment


Questions

1. Match each of the red or the blue diagram below with one of the following processes:

   * **Encoding text from words to token IDs (integers).** --> image
   * **Decoding from token IDs (integers) to words.** --> image

   Both directions are shown in the code sketch below.
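
A minimal round trip with the GPT-2 BPE tokenizer from `tiktoken` (an assumption on my part; the diagrams use a simpler word-level tokenizer) illustrates the two directions:

```python
import tiktoken

# GPT-2 BPE tokenizer; the diagrams in the assignment use a simpler
# word-level tokenizer, but the two directions are the same.
tokenizer = tiktoken.get_encoding("gpt2")

# Encoding: text -> token IDs (integers)
ids = tokenizer.encode("Hello, world.")
print(ids)  # e.g. [15496, 11, 995, 13]

# Decoding: token IDs (integers) -> text
print(tokenizer.decode(ids))  # "Hello, world."
```
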
2. What are the two special tokens added to the end of the vocabulary shown in the tokenizer below? What are their token IDs?
image

`<|unk|>` : Token ID 783. Represents a word that is not in the tokenizer's vocabulary.

`<|endoftext|>` : Token ID 784. Marks the end of a text.
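
As a rough sketch (not the chapter's exact code) of how those two tokens end up with the highest IDs: a simple word-level vocabulary can be built from the sorted unique words, with the special tokens appended last. The word list below is a hypothetical placeholder.

```python
# Hypothetical word-level vocabulary build. With the dataset shown in the
# screenshot, the base vocabulary has 783 entries, so the two special tokens
# appended at the end receive IDs 783 and 784.
preprocessed = ["the", "sun", "rose", "over", "the", "terrace"]  # placeholder word list

all_tokens = sorted(set(preprocessed))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])  # appended last -> highest token IDs

vocab = {token: token_id for token_id, token in enumerate(all_tokens)}
print(vocab["<|unk|>"], vocab["<|endoftext|>"])  # 5 6 with this toy list; 783 784 in the screenshot
```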

3. The following diagram shows multiple documents being combined (concatenated) into a single large document to provide a high-quality training dataset for a GPT.

What is the special token that separates the original documents in the final combined dataset? How can you tell how many original documents were combined into this dataset?


image

The `<|endoftext|>` token separates the original documents, so counting how many times it occurs shows how many documents were combined: if the token appears only between documents, the number of documents is the number of `<|endoftext|>` occurrences plus one.
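
A small sketch of that count, assuming the combined text is encoded with the GPT-2 BPE tokenizer from `tiktoken` (where `<|endoftext|>` has token ID 50256); the three-document string here is a made-up example.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Hypothetical combined dataset: three documents joined by <|endoftext|>.
combined = "First document. <|endoftext|> Second document. <|endoftext|> Third document."

ids = tokenizer.encode(combined, allowed_special={"<|endoftext|>"})
eot_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]  # 50256 for GPT-2

separators = ids.count(eot_id)
print(separators + 1)  # the token sits between documents, so documents = separators + 1
```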

4. Using the byte-pair-encoding tokenizer, unknown words are broken down into single and double-letter pairs.

image

What are some of the advantages of this approach? Let's use the example of a long, but unusual word, "Pneumonoultramicroscopicsilicovolcanoconiosis" which might be broken down into smaller word parts like "Pn-eu-mono-ul-tra-micro-scopic-silico-volcano-conio-sis".

For each choice below, explain why it is an advantage or disadvantage of this approach.

* It lets the GPT learn connections between the long word and shorter words based on common parts, like "pneumonia", "microscope", or "volcano". This is an advantage because it allows the GPT to understand the relationships between words that share prefixes, suffixes, and roots.
* This approach can work in any language, not just English. This is an advantage because it is more inclusive, giving the GPT access to a broader set of data.
* All words are broken down in the same way, so the process is deterministic and repeated runs produce the same result. The advantage of this is stability and a feeling of general safety and wellbeing, just as all deterministic outputs provide us.
* The system will handle any word it encounters in chat or inference, even if it has never seen the word during training. This can be an advantage because the system won't crash; instead it gets the opportunity to do something similar to what humans do and use "context clues" to guess at the word's meaning. The disadvantage is that it could misinterpret a word or its use.
* It is easier than looking up the word in a hashtable, not finding it, and using a single `<|unk|>` unknown special token. This is an advantage because fewer words end up in the forbidden land of unk tokens. On the flip side, words that might not benefit from being broken down could be misinterpreted; e.g. "flower" broken into "flow" and "er" might take on a meaning unrelated to the petaled flora. (The actual BPE split of the long word above is shown in the sketch after this list.)
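
A quick way to see the real split is to run the word through the GPT-2 BPE tokenizer from `tiktoken`; this is just a sketch, and the actual pieces may differ from the hand-made "Pn-eu-mono-..." split above.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

word = "Pneumonoultramicroscopicsilicovolcanoconiosis"
ids = tokenizer.encode(word)

# Decode each token ID individually to see the subword pieces.
# No <|unk|> token is needed: any string can be decomposed into known
# subwords (or, in the worst case, raw bytes).
pieces = [tokenizer.decode([i]) for i in ids]
print(ids)
print(pieces)
```
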
5. A BPE tokenized encoding is shown below with successively longer lists of token IDs.

The code that produced it is

text = "In the sunlit terraces of someunknownPlace."
split_text = split(text)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

for i in range(10):
    decoded = tokenizer.decode(integers[:i+1])
    print(f"{integers[:i+1]} {decoded}")

This is the output

```
[818] In
[818, 262] In the
[818, 262, 4252] In the sun
[818, 262, 4252, 18250] In the sunlit
[818, 262, 4252, 18250, 8812] In the sunlit terr
[818, 262, 4252, 18250, 8812, 2114] In the sunlit terraces
[818, 262, 4252, 18250, 8812, 2114, 286] In the sunlit terraces of
[818, 262, 4252, 18250, 8812, 2114, 286, 617] In the sunlit terraces of some
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680] In the sunlit terraces of someunknown
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271] In the sunlit terraces of someunknownPlace
```

What kind of words (tokens) do you think tend to have smaller integers for token IDs, and what kind of words (tokens) tend to have larger integers?

What tokens smaller than words receive their own IDs? Are they always separated with a space from the token before them? If you were the Python programmer writing the tokenizer.decode method that produced the English string below from a list of integers, how would you implement the feature of letting these "sub-word" tokens join the sentence with or without a space?

Words that are more common - in our example "the", "In", and "of" - have the smallest token ID values, while rarer words and subword fragments tend to have larger IDs.

It's hard to determine a pattern for the words that are split into multiple tokens. The word "sunlit" is split into tokens for "sun" and "lit", which would suggest that all compound words are composed of multiple tokens.

The words "terraces" and "unknown" are more perplexing - "terraces" is two tokens, "terr" and "aces". Neither piece carries the full word's meaning on its own, and the meaning of "aces" doesn't contribute much to the meaning of "terraces".

Following the pattern of "terraces", one would expect "unknown" to be broken down into two tokens. Strangely enough, it isn't.
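
On the decode sub-question (joining sub-word tokens with or without a space): one common approach, and the one GPT-2's BPE actually takes, is to store any leading space as part of the token's own text, so the decoder never has to decide where spaces go. The tiny vocabulary below is hypothetical, just to illustrate the idea.

```python
# Hypothetical sub-word vocabulary: word-initial tokens carry their own
# leading space (GPT-2 marks it internally with a special character);
# word-internal tokens like "lit" and "aces" have no leading space, so
# they glue onto the previous piece.
id_to_text = {
    0: "In",      # first token of the text, no leading space
    1: " the",
    2: " sun",    # starts the word "sunlit"
    3: "lit",     # continues "sunlit"
    4: " terr",   # starts the word "terraces"
    5: "aces",    # continues "terraces"
}

def decode(token_ids):
    # The space is already part of each token's text, so joining is plain
    # concatenation; no per-token "insert a space?" decision is needed.
    return "".join(id_to_text[i] for i in token_ids)

print(decode([0, 1, 2, 3, 4, 5]))  # -> "In the sunlit terraces"
```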

Human Writing
