AI HW 06 Griffin - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki
Tensors
Computational graph: the chain rule is a way to compute the gradients of a loss function by working backward through the graph, one operation at a time.
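To make the chain rule note concrete, here is a minimal sketch in plain Python. The tiny graph (a linear node feeding a squared-error loss), the function names `forward` and `backward`, and the input values are all assumptions chosen for illustration; they are not from the homework itself.

```python
# Tiny computational graph (hypothetical example):
#   z    = w * x + b      linear node
#   loss = (z - y) ** 2   squared-error node

def forward(w, b, x, y):
    """Run the graph forward and return the intermediate z and the loss."""
    z = w * x + b
    loss = (z - y) ** 2
    return z, loss

def backward(w, b, x, y):
    """Apply the chain rule: dloss/dw = dloss/dz * dz/dw (same idea for b)."""
    z, _ = forward(w, b, x, y)
    dloss_dz = 2 * (z - y)   # derivative of the squared-error node
    dz_dw = x                # derivative of the linear node w.r.t. w
    dz_db = 1.0              # derivative of the linear node w.r.t. b
    return dloss_dz * dz_dw, dloss_dz * dz_db

w, b, x, y = 0.5, 0.1, 2.0, 3.0
grad_w, grad_b = backward(w, b, x, y)

# Sanity check: compare against a central finite-difference estimate.
eps = 1e-6
numeric_w = (forward(w + eps, b, x, y)[1] - forward(w - eps, b, x, y)[1]) / (2 * eps)

print(grad_w, grad_b, numeric_w)
```

The analytic gradient and the finite-difference estimate should agree closely, which is a quick way to check a hand-written backward pass.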
Questions:
- Blue: encoding text from words to token IDs. Red: decoding from token IDs to words.
- Unknown and end of text: 783 and 784.
- End of text.
- It lets the GPT learn connections between the long word and shorter words based on common parts, like "pneumonia", "microscope", or "volcano". This is an advantage because our vocabulary might contain subwords, so even if we don't know the whole word we can start to learn the pattern.
- This approach can work in any language, not just English. This is an advantage.
- Subwords have higher IDs. If I were implementing this, I'd save tokens both with spaces and without spaces. I'm also guessing that capitalized tokens are stored under different IDs than their lowercase counterparts.
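The encode and decode answers above can be sketched with a toy word-level tokenizer. This is only an illustration under stated assumptions: the helper names `build_vocab`, `encode`, and `decode`, the sample sentence, and the order in which the two special tokens are appended are all hypothetical. The only point is that the special tokens take the last two IDs in the vocabulary, the way 783 and 784 do in the homework's vocabulary.

```python
import re

UNK = "<|unk|>"          # stands in for any word missing from the vocabulary
EOT = "<|endoftext|>"    # marks the boundary between two texts

def build_vocab(text):
    """Map each unique token to an ID; special tokens get the last two IDs."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    all_tokens = sorted(set(tokens))
    all_tokens.extend([UNK, EOT])   # assumed order, for illustration only
    return {tok: i for i, tok in enumerate(all_tokens)}

def encode(text, vocab):
    """Words to token IDs; unseen words fall back to the unknown token."""
    pieces = re.findall(r"<\|endoftext\|>|\w+|[^\w\s]", text)
    return [vocab.get(p, vocab[UNK]) for p in pieces]

def decode(ids, vocab):
    """Token IDs back to words, removing the space before punctuation."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    text = " ".join(id_to_token[i] for i in ids)
    return re.sub(r"\s+([,.?!])", r"\1", text)

vocab = build_vocab("the cat sat on the mat.")
ids = encode("the dog sat. <|endoftext|> the cat", vocab)
print(ids)
print(decode(ids, vocab))
```

Because this vocabulary is case-sensitive, `encode("The", vocab)` maps to the unknown token here, which matches the guess that capitalized and lowercase forms are separate entries. A byte pair encoding tokenizer would instead split an unseen word into known subwords rather than falling back to a single unknown ID.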
Human Writing: https://www.eff.org/cases/hachette-v-internet-archive
I'm not sure how this relates to datasets because unlike print books, datasets aren't limited. I feel like the library case is pretty silly because humans should be striving to distribute information, and any barriers that limit access to books are bad in my mind. The awesome thing about digital stuff is that the form allows us to share many copies of a dataset.
The publishers feel that the library has been giving out more copies than it paid for.
The Internet Archive has paid for the rights to distribute these books like this, and also they limit how many copies are distributed.
This reminds me of how many video games are only licenses to play. I.e. if a company takes a game off of the server, then nobody has access to it anymore. People have lost the same sense of physical ownership. In contrast, the library still has a physical copy, and limits the number of copies checked out. In this way the digital copies are represented by a physical object.
The Archive's argument is convincing because they aren't distributing more copies than they have access to.
Since it is other people's data, do they deserve part of the profit? What are the security concerns? Also, the dataset is biased by the people it comes from.