AI_Homework6_Response
Link to original assignment
1. Match each of the red and blue diagrams below with one of the following processes:
- **Encoding text from words to token IDs (integers).** -->
- **Decoding from token IDs (integers) to words.** -->
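
A minimal round trip showing both processes, using the GPT-2 BPE tokenizer from the tiktoken package as a stand-in for the tokenizer in the diagrams (a sketch; the example string and printed IDs are just for illustration):

```python
import tiktoken

# GPT-2's byte-pair-encoding tokenizer (used later in this homework as well)
tokenizer = tiktoken.get_encoding("gpt2")

# Encoding: text (words) -> token IDs (integers)
ids = tokenizer.encode("Hello, world.")
print(ids)                     # e.g. [15496, 11, 995, 13]

# Decoding: token IDs (integers) -> text (words)
print(tokenizer.decode(ids))   # "Hello, world."
```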

2. What are the two special tokens added to the end of the vocabulary shown in the tokenizer below? What are their token IDs?

- `<|unk|>`: token ID 783. Indicates a word that is not in the tokenized dataset's input vocabulary.
- `<|endoftext|>`: token ID 784. Indicates the end of a text.
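
A minimal sketch of how special tokens like these might be appended to the end of a vocabulary; the tiny vocabulary and variable names here are made up for illustration, so the resulting IDs are much smaller than 783 and 784:

```python
# Hypothetical vocabulary built from a tokenized dataset; the real one is far larger.
all_tokens = sorted(set(["of", "sunlit", "terraces", "the"]))

# Special tokens are appended at the end, so they receive the largest IDs.
all_tokens.extend(["<|unk|>", "<|endoftext|>"])

vocab = {token: token_id for token_id, token in enumerate(all_tokens)}
print(vocab["<|unk|>"], vocab["<|endoftext|>"])  # the two highest IDs in this toy vocab
```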
3. The following diagram shows multiple documents being
combined (concatenated) into a single large document to
provide a high-quality training dataset for a GPT.
What is the special token that separates the original
documents in the final combined dataset? How can you tell
how many original documents were combined into this dataset?

The `<|endoftext|>` token separates the original documents, so counting how many times this token occurs tells you how many original documents were combined into the dataset.
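
For example, using the GPT-2 BPE tokenizer from tiktoken (a sketch; the documents below are made up, and it assumes `<|endoftext|>` is appended after every document):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Hypothetical combined dataset: the separator is appended after each document,
# so its count equals the number of original documents.
docs = ["First document.", "Second document.", "Third document."]
combined = "".join(doc + " <|endoftext|> " for doc in docs)

ids = tokenizer.encode(combined, allowed_special={"<|endoftext|>"})
endoftext_id = tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

print(ids.count(endoftext_id))  # 3, one per original document
```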
4. Using the byte-pair-encoding tokenizer, unknown words
are broken down into single and double-letter pairs.
What are some of the advantages of this approach?
Let's use the example of a long, but unusual word,
"Pneumonoultramicroscopicsilicovolcanoconiosis" which
might be broken down into smaller word parts
like "Pn-eu-mono-ul-tra-micro-scopic-silico-volcano-conio-sis".
For each choice below, explain why it is an advantage or disadvantage of this approach.
| Choice | Advantage or disadvantage? |
| --- | --- |
| It lets the GPT learn connections between the long word and shorter words based on common parts, like "pneumonia", "microscope", or "volcano". | This is an advantage, because it allows the GPT to understand relationships between words that share prefixes, suffixes, and roots. |
| This approach can work in any language, not just English. | This is an advantage, because it's more inclusive, allowing the GPT access to a broader set of data. |
| All words are broken down in the same way, so the process is deterministic and results from repeated runs will be similar. | This is an advantage, because the tokenization is stable and reproducible: repeated runs over the same text behave consistently. |
| The system will handle any word it encounters in chat or inference, even if it has never seen the word during training. | This can be an advantage, because the system won't crash and instead gets to do something similar to what humans do: use "context clues" to guess at the word's meaning. However, a disadvantage is that it could misinterpret a word or its use. |
| It is easier than looking up the word in a hashtable, not finding it, and using a single `<|unk|>` unknown special token. | This is an advantage, because fewer words end up collapsed into the uninformative `<|unk|>` token. On the flip side, words that don't benefit from being broken down could be misinterpreted; e.g. "flower" split into "flow" and "er" might take on a meaning unrelated to the petaled flora. |
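
A quick sketch of this splitting behavior using the GPT-2 BPE tokenizer from tiktoken (the actual sub-word pieces it chooses may differ from the hand-split example above):

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

word = "Pneumonoultramicroscopicsilicovolcanoconiosis"
ids = tokenizer.encode(word)

# Decode each token ID individually to see the sub-word pieces.
pieces = [tokenizer.decode([i]) for i in ids]
print(pieces)  # a list of sub-word fragments rather than a single <|unk|> token
```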
5. A BPE tokenized encoding is shown below with successively longer lists of token IDs.
The code that produced it is:

```python
import tiktoken

# GPT-2 byte-pair-encoding tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

text = "In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

for i in range(10):
    decoded = tokenizer.decode(integers[:i+1])
    print(f"{integers[:i+1]} {decoded}")
```
This is the output:

```
[818] In
[818, 262] In the
[818, 262, 4252] In the sun
[818, 262, 4252, 18250] In the sunlit
[818, 262, 4252, 18250, 8812] In the sunlit terr
[818, 262, 4252, 18250, 8812, 2114] In the sunlit terraces
[818, 262, 4252, 18250, 8812, 2114, 286] In the sunlit terraces of
[818, 262, 4252, 18250, 8812, 2114, 286, 617] In the sunlit terraces of some
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680] In the sunlit terraces of someunknown
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271] In the sunlit terraces of someunknownPlace
```
What kind of words (tokens) do you think tend to have smaller integers for
token IDs, and what kind of words (tokens) tend to have larger integers?
What tokens smaller than words receive their own IDs? Are they always
separated with a space from the token before them? If you were the
Python programmer writing the tokenizer.decode
method that produced
the English string below from a list of integers, how would you
implement the feature of letting these "sub-word" tokens join the
sentence with or without a space?
Words that are more common, such as "In", "the", and "of" in our example, tend to have the smallest token IDs, while rarer words and sub-word fragments, such as "unknown" (34680) and "Place" (27271), tend to have larger token IDs.
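
One way to check this (a sketch using tiktoken's GPT-2 encoding) is to decode some of the IDs from the output above individually and compare:

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

# Token IDs taken from the output above: common words vs. rarer fragments.
for token_id in [818, 262, 286, 34680, 27271]:
    print(token_id, repr(tokenizer.decode([token_id])))
# Small IDs like 262 (" the") and 286 (" of") are very common words;
# large IDs like 34680 ("unknown") and 27271 ("Place") are rarer pieces.
```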
text = "In the sunlit terraces of someunknownPlace."
split_text = split(text)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
for i in range(10):
decoded = tokenizer.decode(integers[:i+1])
print(f"{integers[:i+1]} {decoded}")
[818] In
[818, 262] In the
[818, 262, 4252] In the sun
[818, 262, 4252, 18250] In the sunlit
[818, 262, 4252, 18250, 8812] In the sunlit terr
[818, 262, 4252, 18250, 8812, 2114] In the sunlit terraces
[818, 262, 4252, 18250, 8812, 2114, 286] In the sunlit terraces of
[818, 262, 4252, 18250, 8812, 2114, 286, 617] In the sunlit terraces of some
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680] In the sunlit terraces of someunknown
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271] In the sunlit terraces of someunknownPlace
It's hard to determine a pattern for which words get split into multiple tokens. The word "sunlit" is split into a token for "sun" and one for "lit", which would suggest that compound words are composed of multiple tokens.
The tokens for "terraces" and "unknown" are more perplexing: "terraces" is two tokens, "terr" and "aces", yet neither piece carries the full word's meaning on its own, and "aces" contributes little to the meaning of "terraces".
Following the pattern of "terraces", one would expect "unknown" to be broken down into multiple tokens as well. Strangely enough, it isn't; it is encoded as the single token 34680.
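
As for how a `tokenizer.decode` method could join sub-word tokens with or without a space: one way (a sketch, not how tiktoken actually implements it) is to store each token's text with any leading space baked into the vocabulary entry, so that decoding is plain concatenation and continuation pieces attach directly to the token before them:

```python
# Hypothetical inverse vocabulary: token ID -> token text, with leading spaces
# stored as part of the token itself (GPT-2's BPE does this for word-initial tokens).
id_to_text = {
    818: "In",
    262: " the",
    4252: " sun",
    18250: "lit",    # continuation piece: no leading space
    8812: " terr",
    2114: "aces",    # continuation piece: no leading space
}

def decode(token_ids):
    # Simple concatenation: spacing comes from the stored token strings,
    # so sub-word pieces join their neighbors without extra spaces.
    return "".join(id_to_text[i] for i in token_ids)

print(decode([818, 262, 4252, 18250, 8812, 2114]))  # "In the sunlit terraces"
```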