wshine ai hw6
-
The blue diagram shows encoding: text is turned into a list of token IDs, one ID per word. The red diagram shows decoding: the token IDs are turned back into words.
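A minimal sketch of the two directions those diagrams show, using a tiny made-up vocabulary and sentence (not the book's actual data):

```python
# Tiny made-up vocabulary; a real tokenizer builds this from the training text.
vocab = {"the": 0, "cat": 1, "sat": 2}
inverse_vocab = {i: w for w, i in vocab.items()}

def encode(text):
    # Blue diagram: words -> token IDs
    return [vocab[w] for w in text.split()]

def decode(ids):
    # Red diagram: token IDs -> words
    return " ".join(inverse_vocab[i] for i in ids)

ids = encode("the cat sat")
print(ids)          # -> [0, 1, 2]
print(decode(ids))  # -> "the cat sat"
```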
-
<|unk|> (id: 783) and <|endoftext|> (id: 784)
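A rough sketch of how the two special tokens end up with the last two IDs when they are appended after the sorted word list; the sample text, regex, and resulting IDs here are made up and will not match the 783/784 above:

```python
import re

# Made-up sample text; the real vocabulary comes from the training corpus.
raw_text = "Hello, world. This is some sample text."
tokens = [t.strip() for t in re.split(r'([,.:;?_!"()\']|--|\s)', raw_text) if t.strip()]

# Special tokens are appended after the sorted word list, so they receive
# the two highest IDs in the vocabulary.
all_tokens = sorted(set(tokens)) + ["<|unk|>", "<|endoftext|>"]
vocab = {token: idx for idx, token in enumerate(all_tokens)}

print(vocab["<|unk|>"], vocab["<|endoftext|>"])  # the last two IDs
```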
-
The special token is <|endoftext|>. You can tell how many documents were joined together by counting the <|endoftext|> tokens that separate them.
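A small example of that counting idea, assuming the tiktoken package with the GPT-2 encoding and assuming <|endoftext|> is inserted as a separator between concatenated documents:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

docs = ["First document.", "Second document.", "Third document."]
text = " <|endoftext|> ".join(docs)

ids = enc.encode(text, allowed_special={"<|endoftext|>"})
eot_id = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

# One separator sits between each pair of documents,
# so documents = separators + 1.
print(ids.count(eot_id) + 1)  # -> 3
```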
-
-
It lets the GPT learn connections between the long word and shorter words that share common parts, like "pneumonia", "microscope", or "volcano". This is an advantage because unknown words can be related to known words, so the model can still get value out of them.
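A quick way to see this splitting, assuming the tiktoken package with the GPT-2 BPE vocabulary (the exact fragments depend on the learned merges):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
word = "pneumonoultramicroscopicsilicovolcanoconiosis"

ids = enc.encode(word)
pieces = [enc.decode([i]) for i in ids]
print(pieces)  # the long word comes back as shorter, reusable fragments
```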
-
This approach can work in any language, not just English.
I think having an approach that works on all languages would be a big advantage. I'm not actually sure this approach works the same for all languages, since the characters used in some languages might be harder to break down into small pieces.
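A small check of that question, assuming the tiktoken GPT-2 tokenizer: its byte-level BPE can at least round-trip any Unicode text, though non-English text may cost more tokens (the sample strings are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for text in ["tokenization", "Tokenisierung", "トークン化"]:
    ids = enc.encode(text)
    assert enc.decode(ids) == text  # byte-level BPE round-trips any Unicode text
    print(text, len(ids))           # non-English text tends to need more tokens
```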
-
All words are broken down in the same way, so the process is deterministic and repeated runs give identical results.
This is an advantage: if words were not always processed the same way and a word might be broken into different chunks each time, I think there would be little value in this approach for handling unknown words.
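A trivial check of the determinism claim, again assuming the tiktoken GPT-2 tokenizer and a made-up word:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
word = "someunfamiliarcompoundword"

# The same input maps to the same token IDs every time it is encoded.
assert enc.encode(word) == enc.encode(word)
print(enc.encode(word))
```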
-
The system will handle any word it encounters during chat or inference, even if it has never seen that word during training.
-
It is simpler than looking the word up in a hashtable, failing to find it, and falling back to a single <|unk|> unknown special token. I'm not sure either of these last two points is true, or how to relate them to the process of tokenizing unknown words during training.
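A minimal sketch of the hashtable-plus-<|unk|> fallback described above, with a made-up vocabulary, to show what BPE avoids:

```python
# Made-up vocabulary for illustration.
vocab = {"the": 0, "cat": 1, "sat": 2, "<|unk|>": 3}

def encode_with_unk(words):
    # Any word missing from the hashtable collapses to the single <|unk|> ID,
    # so its spelling and meaning are lost.
    return [vocab.get(w, vocab["<|unk|>"]) for w in words]

print(encode_with_unk(["the", "cat", "yawned"]))  # -> [0, 1, 3]
```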
-
-
Unknown word parts broken into chunks seem to be given higher token IDs, while known words have significantly lower IDs. They are not always separated by a space. I would probably check by token ID: decide what token ID is the threshold for word-part tokens, and only add a space before a token if its ID is below that threshold.
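A minimal sketch of that threshold heuristic; the vocabulary, token IDs, and threshold value are all made up for illustration:

```python
# Made-up inverse vocabulary and threshold.
id_to_token = {12: "hello", 15: "world", 870: "Ak", 871: "wir", 872: "w"}
SUBWORD_THRESHOLD = 800  # assume IDs at or above this are word fragments

def decode(ids):
    text = ""
    for i in ids:
        # Prepend a space only before tokens whose ID marks them as whole words.
        if text and i < SUBWORD_THRESHOLD:
            text += " "
        text += id_to_token[i]
    return text

print(decode([870, 871, 872, 12, 15]))  # -> "Akwirw hello world"
```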
-
This text is about sharing/distributing digital copies of books. It relates to selecting training datasets for GPTs, since we are using all kinds of text material (articles, blog posts, or potentially books) to train an LLM. The publishers in this case argue that the CDL is violating their copyright on the books and has cost them millions of dollars. The Internet Archive's stance is that it is ensuring the public can continue to make use of books that libraries have already bought and paid for, and that libraries are investing in digitizing texts as a means of preserving them. It also claims to function the same way libraries do, lending out only a single 'copy' at a time. I find the Internet Archive's argument convincing: the books have already been bought and paid for, they are still only being lent to one individual at a time, and this allows more books and knowledge to be shared with others.
Training a GPT on the code and written work of my peers would raise the question of whether or not my peers allowed it. The idea of sharing my work to be used as part of training some AI is slightly uncomfortable. What will it be used for? How can I be sure nothing that personally identifies me is retained?