AI Homework - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

AI Homework 1

The High Court is set to determine an unjust enrichment case concerning the resurrection of the actor Peter Cushing in the film Rogue One. The case has been brought by Tyburn Film Productions, which argues that it has the right to block or restrict others from resurrecting Cushing according to an agreement made in 1993. Cushing appeared in Star Wars (1977) as the Grand Moff Tarkin and was resurrected as the same character in Rogue One (2016). The defendants claim they had the right to resurrect him under a 1976 agreement with his production company and/or that they acquired that right under an agreement in 2016.

To me, this case is really about who owns the rights to a character and which jurisdiction governs the filming rights. On one side, Tyburn Film Productions says it should control the rights to the actor, while the defendants claim they were only using his likeness per the 1976 agreement. It steps into a grey area of performers' rights law: the judge commented that the law is "not entirely settled" in two specific areas, namely the competing analyses of Regulation 31 of the Copyright and Related Rights Regulations 1996, and arguments at the edges of the scope of unjust enrichment in multi-party or indirect situations. I don't know much about the law, but I do know that using CGI to revive dead actors is not uncommon, as it has been done before. Three years before the release of Rogue One, Paul Walker got into an accident and died as a passenger in a Porsche, shocking the industry and Fast and Furious fans around the world. At the time of his death, the filming of Furious 7 had not yet finished, so to fill the missing role and pay tribute to him, the director rewrote parts of the story and recruited his two brothers as stand-ins, supplemented with CGI. To me, the movie did him justice and made a great tribute to a legend who influenced many people in car culture.

AI Homework 3

I worked with Torsten on this one.

  1. What is connectionism and the distributed representation approach? How does it relate to the MNIST classification of learning the idea of a circle, regardless of whether it is the top of the digit 9 or the top/bottom of the digit 8?

Connectionism is the study of how the neurons in a neural network connect, i.e. how they work together. Distributed representation is a way of encoding information in a system where each input is represented by many features, and each feature is involved in the representation of many possible inputs. I think the way it relates to the classification is mainly the distributed representation: the circle on the pixel grid can be represented by multiple edge features, each of which also participates in representing many other inputs, so the network can recognize the circle whether it is the top of the digit 9 or the top or bottom of the digit 8.

  1. What are some factors that have led to recent progress in the ability of deep learning to mimic intelligent human tasks?

A factor that has led to recent progress in the ability of deep learning to mimic intelligent human tasks is the growth of neural-network research together with the availability of much stronger computational hardware. Chips we now regard as old or underpowered, such as the Nvidia GeForce GTX-era GPUs and the early Intel Core i5 CPUs, were the most powerful hardware available around ten years ago (correct me if I'm wrong). On today's systems, training AI with reinforcement learning and neural networks is a much less complicated task, with chips that can finish epochs of training in a matter of seconds, which allows larger networks to reach higher accuracy on more complex tasks. It makes sense and correlates somewhat with Moore's law, the observation that transistor counts double roughly every two years.

  1. How many neurons are in the average human brain, versus the number of simulated neurons in the biggest AI supercomputer described in the book chapter? Now in the year 2024, how many neurons can the biggest supercomputer simulate? (You may use a search engine or an AI chat itself to speculate).

The average human brain contains around 86 billion neurons. Current supercomputers can perform a much larger amount of computation, with some of the largest AI supercomputers capable of simulating tens of billions of neurons.

Let's say you are training your neural network on pairs of (x,y), where x is a training datapoint (an image in MNIST) and y is the correct label for x.

  1. Why does the neural network, before you've trained it on the first input x, output "trash", or something that is very far from the corresponding y?

It's mostly because the model is not trained yet: it starts with randomly assigned weights and biases, so it hasn't been given any signal about how to separate the classes, and its output is effectively random.

  1. If you have a Numpy array that looks like the following, give its shape as a tuple of maximum dimensions along each axis. For example (p,q,r) is a tensor of the third rank, with p along the first dimension (which is "2D layer"), q "rows" in each "2D layer", and r "columns" in each "row". [[[1,2,3,4], [5,6,7,8], [9,10,11,12]] ]

(2, 3, 4)
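The bracketed array in the question appears to have lost its second 2-D layer in transcription. A quick numpy check, assuming a hypothetical second layer with the same row/column structure, confirms the (2, 3, 4) shape:

```python
import numpy as np

# Hypothetical completion of the array from the question: two 2-D layers,
# each with 3 rows of 4 columns. The second layer's values are invented
# purely to illustrate the shape.
a = np.array([
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    [[13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24]],
])

print(a.shape)  # (2, 3, 4)
```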

Assume your neural network in network.py is created with the following layers

net = Network([7,13,2])

  1. What is the list of shapes of the self.weights and self.biases members in the constructor of the Network? Because these are lists of Numpy matrices, the different elements can have different shapes, so your answer will have a slightly different form. Your answer should look like a list of tuples, such as [ (1,2), (3,4), (5,6) ] which means a list of three shapes, the first has 1 row of 2 columns each, the second has 3 rows of 4 columns each, and the last has 5 rows of 6 columns each.

self.weights: [(13,7),(2,13)]; self.biases: [(13,1),(2,1)]
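A minimal sketch of the Nielsen-style initialization (assuming network.py follows that pattern) confirms these shapes, including the bias column vectors:

```python
import numpy as np

sizes = [7, 13, 2]

# One weight matrix per layer transition, one column vector of
# biases per non-input layer.
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(y, 1) for y in sizes[1:]]

print([w.shape for w in weights])  # [(13, 7), (2, 13)]
print([b.shape for b in biases])   # [(13, 1), (2, 1)]
```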

  1. From the notes, answer this question by ordering the given outputs in terms of the cost function they give out.

D > C > A > B

  1. What is the effect of changing the learning rate (the Greek letter "eta") in training with SGD?

Increasing the learning rate makes bigger steps, so the descent finds a minimum more quickly, but with a higher risk of overshooting or missing a better minimum. Lowering the learning rate mitigates this, but training takes longer.
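A toy illustration (not from the homework) of this trade-off, minimizing f(x) = x², whose gradient is 2x:

```python
# Gradient descent on f(x) = x**2 starting from x = 1.0.
# With a small eta the iterates shrink toward the minimum at 0;
# with too large an eta each step overshoots and the iterates blow up.
def descend(eta, steps=20):
    x = 1.0
    for _ in range(steps):
        x -= eta * 2 * x  # gradient step: x <- x - eta * f'(x)
    return x

print(descend(0.1))   # converges toward 0
print(descend(1.1))   # steps too big: overshoots and diverges
```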

  1. Why is the word "stochastic" in the name "stochastic gradient descent", and how is it different than normal gradient descent?

Stochastic means randomly determined, something akin to pseudo-random. In stochastic gradient descent, the gradient is computed on a randomly chosen example or mini-batch instead of the whole training set. This makes each step much cheaper than normal gradient descent, so the network can converge toward a minimum more quickly, at the cost of noisier steps.
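A toy sketch of the difference (the data and learning rate are made up): fitting a single parameter w to minimize the mean of (w - y)², where full gradient descent uses every point each step, while SGD estimates the gradient from a small random sample.

```python
import random

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # invented targets; the true optimum is their mean, 3.0

def full_gradient(w):
    # Normal gradient descent: average over the entire dataset.
    return sum(2 * (w - y) for y in data) / len(data)

def sgd_gradient(w, batch_size=2):
    # The "stochastic" part: a random mini-batch stands in for the full set.
    batch = random.sample(data, batch_size)
    return sum(2 * (w - y) for y in batch) / batch_size

w = 0.0
for _ in range(200):
    w -= 0.1 * sgd_gradient(w)
print(w)  # hovers near 3.0, the true minimum, with some noise
```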

Human Writing

In my opinion, the essays produced by the two different ChatGPT versions both come out quite well. Although the newer model shows some improvements, it can still make mistakes or handle things a bit less poetically. Comparing the two versions, the GPT-3.5 essay has a more fluent tone, albeit lacking factual detail in places, while the GPT-4 essay feels more factual but less poetic in its expression. Personally, poetic expression is one of the important criteria when evaluating an essay, as it creates a more humane tone and helps the reader visualize the emotions being depicted. Because of this, the GPT-4 essay reads a lot more like "something written by ChatGPT" than the 3.5 one. In general, both can be used as a template or study guide for how to write, but neither should be submitted as-is, for the same reason.

Journalistic framing is an attempt to highlight something, or move focus away from something else, through the way a message is presented. I think a "standard story" means a template used in journalism, usually as a guide for journalists writing about certain subjects.

Because ChatGPT also learns from data that OpenAI feeds in or from user input, including previously known works and citations, the "standard story" could serve as a template for ChatGPT: by reviewing multiple articles it can find key points and summarize the idea. Tying this back to machine learning and deep learning, the model gains its own sense of the pattern and presents it to the user as an answer.

I think that beyond using ChatGPT to create a template of what you want to write, it could also help you research an article, find key points for learning technical topics, or otherwise assist the user. It helps to some extent, but not fully, since we still need to review the information before relying on it.

AI Homework 4

image

  1. The cost of this "trash" output is 3.3585
  2. Increase the bias associated with the digit-2 neuron and decrease the biases associated with all the other neurons.
  3. Increasing the weight, in this case, strengthens the connections from nodes that already fire correctly, reinforcing the associations the network previously learned
  4. Changes to the weights of all the neurons "requested" by each training example. Additionally, changes to the biases of all the neurons "requested" by each training example are also part of what backpropagation computes for the neural network
  5. A drunk stumbling quickly down a hill. Additionally, "asking a random-sized group of people to make a small change to any law in the country, repeated many times, allowing a person the possibility to be chosen multiple times" is a possible analogy for this method
  6. 12 x 4
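As a hedged sketch of how a cost like the 3.3585 in item 1 is computed, here is the quadratic cost of a 10-way output against a one-hot label. The activations below are made up, so the result is not the homework's number:

```python
import numpy as np

# Quadratic cost C = sum_j (a_j - y_j)^2 of a "trash" output vector
# against the one-hot label for the digit 2. All activations invented.
a = np.array([0.4, 0.1, 0.9, 0.3, 0.2, 0.7, 0.1, 0.5, 0.2, 0.6])
y = np.zeros(10)
y[2] = 1.0  # the correct digit is 2

cost = np.sum((a - y) ** 2)
print(cost)
```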

Backpropagation calculus

  1. (784 * 100) + (100 * 10) = 79,400 weights, plus the 100 + 10 = 110 biases?
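Assuming the [784, 100, 10] architecture the answer implies, the parameter count works out as one weight per connection plus one bias per non-input neuron:

```python
# Parameter count for a [784, 100, 10] fully connected network.
sizes = [784, 100, 10]
weights = sum(x * y for x, y in zip(sizes[:-1], sizes[1:]))  # 784*100 + 100*10
biases = sum(sizes[1:])                                      # 100 + 10
print(weights, biases, weights + biases)  # 79400 110 79510
```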

image

  1. These are the 3 choices:

  2. The bottom half of the second diagram is the same as the first diagram

  3. The second diagram extends backward into the neural network, showing a previous layer L-2 whose outputs the layer L-1 depends on

  4. Drawing a path between two quantities in either diagram will show you which partial derivatives to "chain together" when calculating the gradient of the cost C

Human Writing

I remember recording this in my dev diary somewhere, but to be precise I'll put it here.

  • What thoughts or feelings do you have right now, before hearing your synthesized voice? I feel a bit ecstatic, but also weird, since I find the sound of my own voice very cringeworthy.

  • Should you have the right to synthesize your own voice? What are possible advantages or misuses of it? As the owner of the voice, you should have the right to do so. The main advantage so far is having it stand in for you in some tasks, like setting an appointment. But it also has the disadvantage of enabling deepfakes, AI-generated fakes so realistic that other people can easily be misled.

  • Photographs, video, and audio recordings are technologies that changed the way we remember loved ones who have died.

    • Should family members have the right to create AI-simulated personalities of the deceased? It steps into a grey area in my opinion. To be safe, there should be pre-conditions, such as the person having clearly expressed that wish while alive, or having written it into their will for their descendants; it's best to base any AI simulation on those conditions
    • If generative AI allows us another way to interact with the personality of a person in the past, how does this compare with historical re-enactors, or movies depicting fictional events with real people? I think that although we have generative AI, it sometimes doesn't depict the past as accurately as historical re-enactors do. As for the movies, there's a specific word I think fits the scenario, namely "canon"

AI Homework 5

Chapter 1

  1. What is the difference between a GPT and an LLM? Are the terms synonymous? A GPT and an LLM are related but not identical terms. "LLM" is the broad category: any large model that learns to predict the next word in a sequence given the previous words. A GPT is a specific kind of LLM, a decoder-only transformer pre-trained with unsupervised learning on large amounts of text data, including user input. To me, they're not completely synonymous: every GPT is an LLM, but not every LLM is a GPT.

  2. Labeled training pairs of questions and answers, in the model of "InstructGPT" are most similar to which of the following?

A. Posts from Stackoverflow which have responses to them.

B. Posts on Twitter/X and their replies

C. Posts on Reddit and their replies

D. Posts on Quora and their replies

E. Images of handwritten digits and a label of 0 through 9 in the MNIST classification task

For each one, are there multiple labels for a training datapoint, and if so, is there a way to rank the quality or verify (closeness to the truth) of the labels with respect to their training datapoint?

They could have multiple labels for a training data point. Quality can be ranked either through intensive deep learning methods, which are harder to apply or verify, or by members of the platform themselves, similar to the community-notes system on Twitter/X or Reddit's upvote/downvote.

  1. The GPT architecture in the paper "Attention is All You Need" was originally designed for which task

A. Passing the Turing test

B. Making up song lyrics

C. Machine translation from one human language to another

D. Writing a novel

  1. How many layers of neural networks are considered "deep learning" in the Rashka text?

3 or more layers in a neural network

  1. Is our MNIST classifier a deep learning neural network by this definition?

Yes, it is. It has 3-4 layers depending on the configuration

  1. For each statement about how pre-training is related to fine-tuning for GPTs:
  • If the statement is true, write "True" and give an example.
  • If the statement is false, write "False" and give a counter-example.

A. Pre-training is usually much more expensive and time-consuming than fine-tuning.

True. Pre-training an LLM requires access to significant resources and is very expensive in both time and computation, while fine-tuning requires much less, since the model has already learned general patterns and we only continue training it on new information. An example is our MNIST work.

B. Pre-training is usually done with meticulously labeled data while finetuning is usually done on large amounts of unlabeled or self-labeling data.

False. It's the reverse: pre-training is usually done on large amounts of unlabeled or self-labeling data (plain text, where the next word serves as its own label), while fine-tuning, as in InstructGPT, uses meticulously labeled pairs.

C. A model can be fine-tuned by different people than the ones who originally pre-trained a model.

True. A model can be fine-tuned by different people than the ones who originally pre-trained it. In real life there are many LLMs, one of them being GPT-2, which was open-sourced for different people to try. An example is our MNIST work, when we swap people across different prototypes.

D. Pre-training is to produce a more general-purpose model, and fine-tuning specializes it for certain tasks.

True. GPTs are pre-trained as general-purpose language models and then fine-tuned for specific tasks.

E. Fine-tuning usually uses less data than pre-training.

True. Because the general patterns were already learned during pre-training, fine-tuning only needs a smaller, task-specific dataset.

F. Pre-training can produce a model from scratch, but fine-tuning can only change an existing model.

True. There's no way to fine-tune without an already-existing model.

  1. GPTs work by predicting the next word in a sequence, given which of the following as inputs or context?

A. The existing words in sentences it has already produced in the past.

B. Prompts from the user

C. A system prompt that frames the conversation or instructs the GPT to behave in a certain role or manner

D. New labeled pairs that represent up-to-date information that was not present at the time of training

E. The trained model which includes the encoder, decoder, and attention mechanism weights and biases

  1. The reading distinguishes between these three kinds of tasks that you might ask an AI to do:
  • Predicting the next word in a sequence (for a natural language conversation)
  • classifying items, such as a piece of mail as spam, or a passage of text as an example of Romantic vs. realist literature
  • Answering questions on a subject after being trained with question-answer examples Open your favorite AI chat (these are probably all GPTs currently) such as OpenAI ChatGPT, Google's Gemini, Anthropic's Claude, etc.

Have a conversation where you try to understand how these three tasks are the same or different. In particular, is one of these tasks general-purpose enough to implement the other two tasks?

The three tasks overlap, and predicting the next word is general-purpose enough to implement the other two: classifying can be framed as predicting a label token, and answering questions can be framed as predicting the answer text.

Copy and paste a link to your chat into your dev diary entry. image

  1. Which of the following components of the GPT architecture might be neural networks, similar to the MNIST classifier we have been studying? Explain your answer.

A. Encoder, that translates words into a higher-dimensional vector space of features

B. Tokenizer, that breaks up the incoming text into different textual parts

C. Decoder, the translates from a higher-dimensional vector space of features back to words

  1. What is an example of zero-shot learning that we have encountered in this class already? Choose all that apply and explain.

A. Using an MNIST classifier trained on numeric digits to classify alphabetic letters instead.

B. Using the YourTTS model for text-to-speech to clone a voice the model has never heard before

C. Using ChatGPT or a similar AI chat to answer a question it has never seen before with no examples

D. Using spam filters in Outlook by marking a message as spam to improve Microsoft's model of your email reading habits

  1. What is zero-shot learning, and how does it differ from few-shot or many-shot learning?

Zero-shot learning is the ability of a model to generalize to completely unseen tasks without any prior task-specific examples. Unlike few-shot or many-shot learning, where the model is first given a handful (or many) examples, the model here is only given the test input.

  1. What is the number of model parameters quoted for any of the OpenAI models: GPT-3, GPT-3.5, or GPT-4?

GPT-3 has 175 billion parameters. For GPT-3.5 and GPT-4, OpenAI did not officially disclose the number, but it is likely more than that.

Chapter 2

  1. Why can't LLMs operate on words directly? (Hint: think of how the MNIST neural network works, with nodes that have weighted inputs, that are converted with a sigmoid, and fire to the next layer. These are represented as matrices of numbers, which we practiced multiplying with Numpy arrays. The text-to-speech system TTS similarly does not operate directly on sound data.)

Because the network's layers are matrices of numbers, they can only operate on numeric vectors, not raw words. Without splitting the text into tokens and converting those to numbers, the information fed into the LLM is just gibberish that it cannot process.

  1. What is an embedding? What does the dimension of an embedding mean?

Embedding is the process of turning data such as raw text tokens into vectors in a continuous space. The dimension of an embedding is the length of each vector, i.e. how many features represent each token.

  1. What is the dimension of the embedding used in our example of Edith Wharton's short story "The Verdict"? What is it for GPT-3?

The short story has about 2,500 words, so is that the dimension of the embedding? I'm not sure. GPT-3 uses an embedding size of 12,288 dimensions.

  1. Put the following steps in order for processing and preparing our dataset for LLM training

A. Adding position embeddings to the token word embeddings

B. Giving unique token IDs (numbers) to each token.

C. Breaking up natural human text into tokens, which could include punctuation, whitespace, and special "meta" tokens like "end-of-text" and "unknown"

D. Converting token IDs to their embeddings, for example, using Word2Vec

C -> B -> D -> A
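The pipeline can be sketched end to end with a toy vocabulary (the words, IDs, and 4-dimensional vectors below are invented for illustration): tokenize, assign token IDs, look up token embeddings, then add position embeddings.

```python
import numpy as np

text = "the cat sat"
tokens = text.split()                       # break text into tokens
vocab = {"the": 0, "cat": 1, "sat": 2}
ids = [vocab[t] for t in tokens]            # unique token IDs
token_emb = np.random.randn(len(vocab), 4)  # toy 4-dimensional embeddings
pos_emb = np.random.randn(len(ids), 4)      # one position vector per slot
x = token_emb[ids] + pos_emb                # embed, then add positions
print(x.shape)  # (3, 4): three tokens, four embedding dimensions
```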

Human Writing

As for my research, I don't currently have a specific reason for gathering and sourcing data for AI training. However, I believe there should be some consideration of ethics, such as how to train while preserving data privacy and data quality, and how to maintain transparency and accountability. This is based on my past experience with AI technology: I have seen plenty of mischief relating to AI, such as the period when deepfakes were spreading through social media. I also first got interested in AI technology through open-world games, where it isn't a big task, usually revolving around pathfinding and how characters move and act within a range. Later on, the most prominent case for me was how the bots move in the Counter-Strike series. David Deutsch, a physicist and philosopher, emphasizes the potential of AI to augment human creativity and problem-solving capabilities rather than viewing it as a threat. To me, his perspective is broadly correct: currently, and probably for years to come, AI is not sentient enough to cause trouble the way the Terminator movies or the video game Detroit: Become Human depict; it still doesn't have the capacity for emotion. However, it's still better to keep an eye on it. Deutsch views creativity as fundamentally human, involving the ability to create new knowledge and explanations that expand our understanding of the universe. He contrasts creativity with routine problem-solving, emphasizing its role in advancing scientific knowledge and cultural evolution. One compelling use of AI is in creative fields such as art and music generation, where generative algorithms can autonomously produce artworks, compositions, or literature that mimic human creativity.

There has been a big debate about whether work created with generative AI can be deemed creative. Some agree that it's a novel form of expression, while others object, and the reason is apparent: there have been many cases of people using generative AI to plagiarize the work of a less famous individual and claim it as their own. This has sparked outrage across social media, sometimes to the point that the offender had to close their account before justice was served to the original creator. To me, this is the kind of case we should keep in mind if AI is to be a good companion for us.

AI Homework 6

  1. Match each of the red or the blue diagram below with one of the following processes:

image Encoding text from words to token IDs (integers)

image Decoding from token IDs (integers) to words.

  1. What are the two special tokens added to the end of the vocabulary shown in the tokenizer below? What are their token IDs?

image

The <|unk|> special token identifies a word token that the LLM doesn't know or hasn't been trained on, while the <|endoftext|> token marks the end of a sentence, paragraph, or the whole work in general. They usually take the last few spots of the vocabulary; in this case, their token IDs are 783 and 784.
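A toy tokenizer sketch shows why the special tokens get the last IDs (the three-word vocabulary below is invented; it stands in for the real 783-word one):

```python
# Ordinary words get IDs first; the special tokens are appended at the
# end of the vocabulary, so they receive the last two IDs.
words = ["a", "cat", "sat"]  # stand-in for the full vocabulary
vocab = {w: i for i, w in enumerate(words)}
vocab["<|endoftext|>"] = len(vocab)
vocab["<|unk|>"] = len(vocab)

def encode(tokens):
    # Any token not in the vocabulary falls back to <|unk|>.
    return [vocab.get(t, vocab["<|unk|>"]) for t in tokens]

print(encode(["a", "dog", "sat"]))  # [0, 4, 2] — unknown "dog" maps to <|unk|>
```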

  1. The following diagram shows multiple documents being combined (concatenated) into a single large document to provide a high-quality training dataset for a GPT. What is the special token that separates the original documents in the final combined dataset? How can you tell how many original documents were combined into this dataset?

image

The end-of-text token. By counting those tokens we can tell how many original documents were combined; in this case there are 3 separator tokens, which means 4 original documents were used.

  1. Using the byte-pair-encoding tokenizer, unknown words are broken down into single and double-letter pairs. image

What are some of the advantages of this approach? Let's use the example of a long, but unusual word, "Pneumonoultramicroscopicsilicovolcanoconiosis" which might be broken down into smaller word parts like "Pn-eu-mono-ul-tra-micro-scopic-silico-volcano-conio-sis".

For each choice below, explain why it is an advantage or disadvantage of this approach.

  • It lets the GPT learn connections between the long word and shorter words based on common parts, like "pneumonia", "microscope", or "volcano".
  • This approach can work in any language, not just English.
  • All words are broken down in the same way, so the process is deterministic and results from repeated runs will be similar.
  • The system will handle any word it encounters in chat or inference, even if it has never seen the word during training.
  • It is easier than looking up the word in a hashtable, not finding it, and using a single <|unk|> unknown special token.

It lets the model learn connections between long words and shorter words based on common parts, and each sub-word token can also combine with other nouns or adjectives sharing the same beginning or ending, so the model learns to predict words more quickly and accurately as training progresses. I think this approach can also work in other languages, depending on their grammatical context and how words are spelled. Breaking all words down in the same way is a mixed blessing: it makes the process deterministic and repeatable, though it can make some generalizations harder. Handling any word the system encounters, even one never seen during training, is a clear advantage at inference time. Finally, compared with looking a word up in a hashtable, failing, and emitting a single <|unk|> token, byte-pair encoding is more work to implement, but it preserves far more information about the word.

  1. BPE tokenized encoding is shown below with successively longer lists of token IDs. What kind of words (tokens) do you think tend to have smaller integers for token IDs, and what kind of words (tokens) tend to have larger integers?

The most common words in the training data usually get smaller integers for token IDs, while larger integers are reserved for less common words.

  • What tokens smaller than words receive their own IDs? Are they always separated with a space from the token before them? If you were the Python programmer writing the tokenizer.decode method that produced the English string below from a list of integers, how would you implement the feature of letting these "sub-word" tokens join the sentence with or without a space?

"Terr" in "Terrace" has its own ID; in this case it's just a piece of a word, so it isn't usually separated from the following token by a space. As I understand it, in GPT-2-style BPE, whether a token begins a new word is stored as part of the token string itself (a leading-space marker), so the decode method can simply concatenate the tokens without any per-token space logic.
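A sketch of such a decode, assuming the GPT-2-style convention where a token that starts a new word carries a leading space in its stored string, while word-continuation pieces like "ace" do not (the IDs and token strings below are invented):

```python
# Tokens that begin a new word store their own leading space;
# continuation pieces like "ace" store none, so plain concatenation
# reassembles the sentence correctly.
id_to_token = {0: "Terr", 1: "ace", 2: " on", 3: " the", 4: " hill"}

def decode(ids):
    return "".join(id_to_token[i] for i in ids)

print(decode([0, 1, 2, 3, 4]))  # Terrace on the hill
```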

Human Writing

This case is about sharing and distributing digital copies of books. Whether we are training on data or acquiring it for our own use, it makes sense to check the copyright laws and legal issues around it; in our case, because we extract most of our data from a free source that provides books, we can adhere to its terms and be fine. Here, the publishers argue that the Internet Archive is violating their copyright on the books and has cost them millions of dollars. The Internet Archive's side says it is ensuring the public can continue to make use of the books that libraries have bought and paid for, and that libraries today invest in digitizing texts as a means of preserving them. They also claim to function the same way libraries do, lending out only a single 'copy' at a time. In my opinion, I side with the Internet Archive: books may have varying legal terms governing how publishers and authors release them, and the Internet Archive preserves an author's work in cases where the original has been lost, the author has passed away, or the publishers decide to stop releasing it. A parallel exists in video games, especially racing games built around car manufacturers' licenses: when a license ends, the publisher pulls the plug on these games. This happens routinely, and the Internet Archive, along with many other preservation efforts, works to preserve these kinds of games and works so that we keep a long heritage of them. This became even more important recently when Ubisoft took its racing game The Crew off life support, providing no way to play it after its end-of-service date, even though people had paid for it in the past.

This sparked outrage, with people saying it violates the consumer's right to own an item. If I wanted to train a GPT on the code and written work of my classmates, then for ethical reasons I would ask for their permission, gather only what they are willing to share, and not alter it.

AI Homework 7

Question 0. Which of the following vectors is a one-hot encoding? How would you describe a one-hot encoding in English?

The vector on the right is the one-hot encoding of the index on the left. Based on my understanding, a one-hot encoding is a vector that is all zeros except for a single 1 at the position being encoded; I think this simplifies lookups and makes the transition between matrices and vectors easier.
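A minimal sketch of the idea (the index and vocabulary size are arbitrary examples):

```python
import numpy as np

# A one-hot encoding: all zeros except a single 1 at the encoded index.
def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```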

Question 1. What is an (x,y) training example (in English)? Hint: In MNIST, x is a 784-pixel image, and y is a single-character label from 0 to 9.

An input and its label, or an input and the desired output; either way, it's a pair of a training datapoint x and the correct answer y for it.

image

Question 3. Because embedding is stored as a matrix, and we studied how neural network weights can also be stored in a matrix, we can view the operation of transforming an input vector into an embedding as a two-layer neural network. What if the embeddings matrix took you from a vocabulary size of 7 to an output dimension of 128? What is the shape of that matrix?

7x128 or [7, 128]

Question 4. In the above problem, we can treat the input to this [4,3] neural network as a single token ID (as a one-hot encoding) that we wish to convert to its embedding (in a higher-dimensional feature space).

To embed a batch of 8 chunks, we form a matrix from the column vectors of each chunk and multiply that by the embeddings matrix.

If the embeddings matrix goes from a vocabulary of size 6 to an output dimension of 12, what is the shape of the output matrix when we embed a batch of 8 chunks?

8x12, I think: stacking the 8 one-hot column vectors gives an 8x6 matrix, and multiplying by the 6x12 embeddings matrix yields 8x12.
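One reading of the question, treating each chunk as a single one-hot column vector over the vocabulary of 6, can be shape-checked with numpy (the token IDs are arbitrary, chosen just to exercise the shapes):

```python
import numpy as np

# 8 chunks, each a one-hot row over a vocabulary of 6, stacked into an
# 8x6 matrix and multiplied by a 6x12 embeddings matrix.
batch = np.zeros((8, 6))
batch[:, 0] = 1.0                  # every chunk is token ID 0, purely for shape-checking
embeddings = np.random.randn(6, 12)
out = batch @ embeddings
print(out.shape)  # (8, 12)
```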

Question 5. Suppose your embedding matrix (created in Section 2.7 of the book) looks like the example below: image

If you select and print out the embedding for a particular token ID, you get

tensor([ 1.8960, -0.1750,  1.3689, -1.6033], grad_fn=<...>) (Ignore the requires_grad and grad_fn parameters for now.)

A) Which token ID did you get an embedding for? (Remember it is 0-based indexing) 3, I think

B) Which of the following is true?

i) Your vocabulary has 4 token IDs in it, and the embedding output dimension is 7

ii) Your vocabulary has 7 token IDs in it, and the embedding output dimension is 4

iii) Both

iv) Neither

AI Homework 8

In Section 3.5.3, there is a code listing for a compact self-attention class. image On line 2 are the parameters of the constructor.

When we call this class later in the section, what exact numbers do we give for each parameter value? d_in: 3, d_out: 2, context_length: 6, dropout: 0.0

image

Give the values of the variables on line 16: b: 2, num_tokens: 6, d_in: 3

Give the shape of the variables on lines 18-20: keys, values, queries

They all have the same shape of 2x6x2

Give the shape of the variable on line 22 attn_scores and line 25 attn_weights

They both have the same shape of 2x6x6
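A numpy-only sketch (random weights, shapes only; not the book's PyTorch class) reproduces these shapes:

```python
import numpy as np

# Shape-check of the self-attention forward pass for the values above.
b, num_tokens, d_in, d_out = 2, 6, 3, 2
x = np.random.randn(b, num_tokens, d_in)

W_q = np.random.randn(d_in, d_out)
W_k = np.random.randn(d_in, d_out)
W_v = np.random.randn(d_in, d_out)

queries, keys, values = x @ W_q, x @ W_k, x @ W_v   # each (2, 6, 2)
attn_scores = queries @ keys.transpose(0, 2, 1)     # (2, 6, 6)
print(queries.shape, keys.shape, values.shape, attn_scores.shape)
```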

Use Python3's print function in your 3_5_3_causal_class.py to verify your answers, and copy and paste the output into your dev diary.

2 6 3
keys:  torch.Size([2, 6, 2])
queries:  torch.Size([2, 6, 2])
values:  torch.Size([2, 6, 2])
attn:  torch.Size([2, 6, 6])
attn:  torch.Size([2, 6, 6])
context_vecs.shape: torch.Size([2, 6, 2])