
Timothy's Dev Diary

Spring 2024 | Student-Originated Software
Directory 📁

  • Software Construction
  • AI Self-Hosting
  • Operating Systems



Software Construction

SC Homework #1

Completed Git-Tac-Toe with Morgan

Rust Book Chapters 1, 2, 3, and 4.

(screenshots of reading notes)

Complete Rustlings 00 to 06.


SC Homework #2

Rust Book Chapters 5, 6, and 7.

(screenshots of reading notes)

Complete Rustlings 07 to 13, quizzes 1 to 2.

Rustlings 13

(screenshots of Rustlings 13 solutions)


SC Homework #6

Rust Book Chapters 8 and 9.

(screenshots of reading notes)

Complete Rustlings 14

(screenshots of Rustlings 14 solutions)


SC Homework #7

Rust Book Chapters 10 and 11.

(screenshots of reading notes)

Complete Rustlings 15 and Quiz 3

(screenshots of Rustlings 15 and Quiz 3 solutions)

WASM end result

(screenshot: WASM end result, 2024-05-13)

WASM Reference Image

(reference image)
SC Lab #6

2024-05-06 SC Lab 06

Complete City Builder

Step 1

In this generated city:

Hello, city!
Added Road 8
Added Road 14
Added Road 17
Added Road 25
Added Road 33
Added Road 41
Added Road 46
Added Road 5
Added Road 8
Added Road 14
Added Road 17
Added Road 22
Added Road 27
Added Road 34
Added Road 40
Added Road 47
The number of generated addresses is 1086
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
##################################################
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
##################################################
##################################################
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..
##################################################
##################################################
oooooooo#ooooo#oo#ooooooo#ooooooo#ooooooo#oooo#ooo
.......o#o...o#oo#o.....o#o.....o#o.....o#o..o#o..

1,086 addresses were generated and populated into the hash map.

As an overestimate:

  • City of size 50

  • For 7 avenues, with locations both east and west, 7 * (50 + 50) = 700

    • This is an overestimate because some avenues may be leftmost or rightmost
  • For 9 streets, with locations both north and south, 9 * (50 + 50) = 900

    • Some streets may be northmost or southmost.
  • Overestimate total is 1,600

  • Three double-streets above (two streets directly adjacent)

    • "two-lane" or "double-wide" means subtract 3 * 2 * 50
  • 1,600 - 300 = 1,300

  • Each street-avenue intersection overlaps 4 addresses.

  • 7 * 9 = 63 intersections, so subtract 4 * 63 addresses.

  • 1,300 - 252 = 1,048
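
As a sanity check, the estimate can be reproduced in a few lines of Python (all numbers taken from the bullets above):

# Estimate of addresses in a 50x50 city (numbers from the bullets above)
size = 50
avenues, streets = 7, 9
double_streets = 3

total = avenues * (size + size)      # 700: addresses east and west of each avenue
total += streets * (size + size)     # 900: addresses north and south of each street
total -= double_streets * 2 * size   # -300: adjacent "double" street pairs
total -= 4 * avenues * streets       # -252: 4 overlapping addresses per intersection
print(total)                         # 1048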

Three generated addresses (from a separate random city) showing that the code from this morning works.

The address at coordinates (47, 9) is 667 Avenue 48 
The address at coordinates (4, 3) is 6 Avenue 5 
The address at coordinates (14, 16) is 129 Avenue 13 

Morgan found the bug affecting addresses along east-west streets. (screenshot)

After moving the code that initializes address_counter so that it resets after every road,

(screenshot of the fix)

we print out all the addresses numbered 16 to show that the same number is re-used across roads.

Reused address number 16 Avenue 6
Reused address number 16 Avenue 13
Reused address number 16 Avenue 17
Reused address number 16 Avenue 24
Reused address number 16 Avenue 33
Reused address number 16 Avenue 38
Reused address number 16 Avenue 44
Reused address number 16 Avenue 49
Reused address number 16 Street 3
Reused address number 16 Street 6
Reused address number 16 Street 12
Reused address number 16 Street 15
Reused address number 16 Street 20
Reused address number 16 Street 29
Reused address number 16 Street 36
Reused address number 16 Street 45
Reused address number 16 Street 49
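
A minimal Python sketch of the fix (hypothetical names; the actual project is in Rust): the counter restarts for each road, so the same house number can appear on many roads while the (number, road) pair stays unique.

# Hypothetical sketch of per-road address numbering
addresses = {}  # maps (number, road_name) -> coordinates

def number_road(road_name, locations):
    address_counter = 0  # reset for every road: this is the fix
    for coords in locations:
        address_counter += 1
        addresses[(address_counter, road_name)] = coords

number_road("Avenue 6", [(6, 0), (6, 1), (6, 2)])
number_road("Street 3", [(0, 3), (1, 3), (2, 3)])
# "3 Avenue 6" and "3 Street 3" now coexist, just like the reused 16s above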

SC Homework #8

Rust Book Chapter 13.

(screenshots of reading notes)

Complete Rustlings 16, 17, 18

Rustlings 16

(screenshots of Rustlings 16 solutions)

Rustlings 17

(screenshots of Rustlings 17 solutions)

Rustlings 18

(screenshots of Rustlings 18 solutions)

Warm Cells

The cells are paused at the end of section 4.7. An HTML button displayed with the canvas plays or pauses the cells, and clicking a cell toggles it between alive and dead, depending on its previous state.

Code uploaded here: Pull Request [End of 4.7]

(screenshot: paused cells, 2024-06-02)
SC Homework #9

Rust Book Chapters 16 and 17

(screenshots of reading notes)

Complete Rustlings 19 and 20.

Rustlings 19

(screenshots of Rustlings 19 solutions)

Rustlings 20

(screenshots of Rustlings 20 solutions)

WASM Time Profiling

(time-profiling screenshots, 2024-06-03 and 2024-06-06)



AI Self-Hosting

AI Homework #1

Human Writing

Read the following article.

Write a 100 to 200 word response describing the points of view on both sides of this legal case as you understand it, drawing comparisons and contrasts to at least one other, similar news event.

In the legal case regarding Peter Cushing's performers' rights, two very different perspectives emerge. One side of the argument says that any use of someone's identity should require monetary compensation: no matter how much time and effort is put into a project, the actors should get paid for each job, especially if their voice is being used to create the content. On the other side, this copyrighted material will continue to hinder creative freedom for others in the future. By extending copyright terms, companies who own animated characters (or other forms of art) will always own the rights to those characters, even if the business is no longer around.

This complicates things because, after a certain period, society as a whole benefits from unrestricted access to creative works, which can foster innovation and creativity in new generations of artists and creators. This brings to light another similar story, where the Writers Guild has gone on strike. The strikes have been happening for a while, but with the increased use of AI they have moved in front of Netflix's headquarters. From writers to software developers and even business managers, everyone is wondering how AI will affect their futures. Does the benefit of having AI help humanity outweigh the potential misuse, or will AI reduce opportunities?

Source


AI Homework #3
  1. What is connectionism and the distributed representation approach? How does it relate to the MNIST classification of learning the idea of a circle, regardless of whether it is the top of the digit 9 or the top/bottom of the digit 8?

My answer is: Connectionism is the approach of modeling cognition with networks of interconnected nodes, i.e., neural networks; it represents knowledge in the relationships between many connected nodes that work together. The distributed representation approach is where a piece of information is spread across multiple elements rather than stored in a single node. In MNIST terms, the idea of a "circle" is not held by one neuron but distributed across many weights, so the same learned feature can activate whether the circle is the top of a 9 or the top/bottom of an 8.


  1. What are some factors that have led to recent progress in the ability of deep learning to mimic intelligent human tasks?

My answer is: Recent advances in computational power allow larger models with many more parameters to be trained. Since deep learning has been around for a while now, the amount of data available for training has also grown enormously; early on there wasn't much data, which meant we couldn't learn much about neural networks right away. Now, with better algorithms, more data, and better training setups, deep learning is many steps closer to mimicking intelligent human tasks.


  1. How many neurons are in the average human brain, versus the number of simulated neurons in the biggest AI supercomputer described in the book chapter? Now in the year 2024, how many neurons can the biggest supercomputer simulate? (You may use a search engine or an AI chat itself to speculate).

My answer is: 86 billion neurons are in the average human brain, though this varies with age, genetics, and other factors. There is a very big scale difference between the human brain and the simulated neurons in this chapter. The largest supercomputers today can simulate neural networks with billions of neurons and trillions of synapses, but they are still far from matching the scale of the human brain in terms of neuron count and connectivity.


Read Chapter 2: Gradient Descent from 3Blue1Brown and respond to the questions below in your dev diary entry.

Let's say you are training your neural network on pairs of $(x,y)$, where $x$ is a training datapoint (an image in MNIST) and $y$ is the correct label for $x$.

  1. Why does the neural network, before you've trained it on the first input $x$, output "trash", or something that is very far from the corresponding $y$?

My answer is: Before a neural network has been trained, its weights and biases are initialized randomly, so the first output is essentially arbitrary and far from the correct label $y$; it carries no useful information and is seen as trash. Only once the first gradient updates are computed does the network start producing outputs that move toward the labels.
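
A tiny NumPy illustration of this (the shapes follow the MNIST setup; the values are random): with untrained weights and biases, the output activations are arbitrary.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random((784, 1))            # stand-in for one MNIST image
w = rng.standard_normal((10, 784))  # randomly initialized weights
b = rng.standard_normal((10, 1))    # randomly initialized biases
print(sigmoid(w @ x + b).ravel())   # 10 arbitrary activations: "trash"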

Review this shared Google Colab notebook.


  1. If you have a Numpy array that looks like the following, give its shape as a tuple of maximum dimensions along each axis.

For example, (p,q,r) is a tensor of third rank, with p along the first dimension (which "2D layer" you are in), q "rows" in each "2D layer", and r "columns" in each "row".

[[[1,2,3,4],
  [5,6,7,8],
  [9,10,11,12]]
]

Assume your neural network in network.py is created with the following layers

net = Network([7,13,2])

My answer is: Given the shape of the array, (1, 3, 4) is the tensor shape.
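
This can be checked directly with NumPy:

import numpy as np

a = np.array([[[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12]]])
print(a.shape)  # (1, 3, 4): one 2D layer, 3 rows, 4 columns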


  1. What is the list of shapes of the self.weights and self.biases members in the constructor of Network? Because these are lists of Numpy matrices, the different elements can have different shapes, so your answer will have a slightly different form.

The shapes would be: weights = [(13, 7), (2, 13)] and biases = [(13, 1), (2, 1)], since network.py stores one weight matrix and one bias column vector per non-input layer.
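
Assuming a Nielsen-style constructor in network.py, the shapes can be verified directly:

import numpy as np

sizes = [7, 13, 2]
# one bias column vector and one weight matrix per non-input layer
biases = [np.random.randn(y, 1) for y in sizes[1:]]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
print([b.shape for b in biases])   # [(13, 1), (2, 1)]
print([w.shape for w in weights])  # [(13, 7), (2, 13)]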


  1. From the notes, answer this question by ordering the given outputs in terms of the cost function they give out.
image

My answer is D: D > C > A > B


  1. What is the effect of changing the learning rate (the Greek letter $\eta$ "eta") in training with SGD?

Read the hiker analogy by Sebastian Raschka, related to our outdoor hill-climbing activity

My answer is: Changing the learning rate changes the size of each update step taken during training. By increasing it too much, you have a chance of stepping past the minimum of the cost function and overshooting the desired result; too small, and training crawls. The learning rate needs to be a balanced value so that you gradually approach the minimum over time.
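
A one-dimensional sketch of the effect: minimizing the cost $C(w) = w^2$ by gradient descent, a small eta converges toward the minimum while a too-large eta overshoots further on every step.

# gradient descent on C(w) = w**2, whose gradient is 2*w
def descend(eta, steps=10):
    w = 1.0
    for _ in range(steps):
        w -= eta * 2 * w  # step downhill, scaled by the learning rate
    return w

print(descend(eta=0.1))  # approaches the minimum at w = 0
print(descend(eta=1.1))  # overshoots and diverges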


  1. Why is the word "stochastic" in the name "stochastic gradient descent", and how is it different than normal gradient descent?

My answer is: Stochastic refers to randomness or probability. Stochastic gradient descent estimates the gradient from a randomly selected subset (mini-batch) of the training data instead of the full dataset, which makes each step cheaper and lets deep learning work efficiently on large datasets.


Programming MNIST Classifier in Python

Human Writing: Paul McMillin Guest Speaker, ChatGPT and Hallucination

Pre-Class Reading Slides

Human Prompt:

AI chat models are primarily trained from web crawls (automated mass retrievals of data) from websites that may include Wikipedia, social media platforms that don't require logins like X (formerly known as Twitter), and paywall-protected sites like the New York Times and academic journals. Correlations that appear between words, including emotional tone as interpreted by humans, from these training sources are more likely to be included in the chat model and to appear in the generated text. This chat model is similar to the weights and biases (the parameters) that we are training in our MNIST handwritten digit classifier. GPTs like ChatGPT include multiple neural networks in their architecture that work very similarly in principle.

How do the essays hang together? Is 4's essay, which responds to exactly the same prompt, an improvement on that of 3.5?

ChatGPT 4's essay was not much better than ChatGPT 3.5's. They both had good points and used outside sources, and they both claimed their sources were correct and double-checked.

What do you think of ChatGPT as a student writer?

I think ChatGPT can be a useful tool for brainstorming or finding new ideas to write about, but I do not agree with using its output as your own work. There is a time to use it to help you learn and grow, and there are other times when using it would be abuse, which should be avoided.

Would you want to use ChatGPT (or other AI) for an assignment like this?

I would not use ChatGPT for a school assignment as my ethical decisions decide the type of person I choose to be. If another form of AI was better at teaching the user about topics and guiding them towards answers they seek, with the improvements of being able to cite its sources, then I might use it to learn.

If you did, how would you use it? For a first draft? To help edit a first draft you wrote yourself? Would you just submit ChatGPT's version as is, maybe 'making it your own' a bit by changing a few words or adding a few things of your own? If you would use ChatGPT in any way, would you do that merely for convenience, or do you think it would contribute to your development as a thinker and academic writer?

I would use a self-hosted LLM to learn and advance my studies. Being able to use something I created from my own design would be very useful in the future. Having different models focused on specialized tasks could help people finish general tasks quickly.

What is framing and the "standard story" in terms of journalism?

Framing is when information is presented to the reader from a specific point of view. The standard story is a term for the most prevalent version of events, which can carry biases or popular opinions in society.

How does journalistic framing relate to system prompts?

Both journalistic framing and system prompts have a specific way of presenting information to the user. By giving the user specific details you can position the perspective however you want. Biases from outside sources can generate biased data which is important to be aware of.

How does the "standard story" relate to the kinds of information that are likely to be expressed by AI chat systems like ChatGPT?

The more prominent a story is, the more strongly the model will be trained toward that bias. The standard story can also reflect people's opinions in different geographical locations, which can help represent people and their thoughts or beliefs. It is important to be aware of the standard story and account for it in order to get a better output.

How can research assistance for a critical essay be connected to using AI chat to learn technical topics or for assistance with programming languages?

Using ChatGPT to assist in research for a critical essay is a very practical use. ChatGPT can help retrieve information, suggest ways to double-check it, and give feedback on your drafts. Giving all of this a personalized touch as you work with the model adds to the user experience as well as the quality of responses.

Epoch Testing

Prototype-06


AI Homework #4

Question 0. In this equation, match the symbols with their meaning:

image
  • Symbols

    A. $\sigma$

    B. $w_i$

    C. $a_i$

    D. $b$

  • Meanings

i. Activations from the previous layer

ii. Bias of this neuron, or threshold for firing

iii. Sigmoid, or squishing function, to smooth outputs to the 0.0 to 1.0 range

iv. Weights from the previous layer to this neuron

My answer is: [A-iii][B-iv][C-i][D-ii]


Question 1. Calculate the cost of this "trash" output of a neural network and our desired label of the digit "3"

image

My answer is: The cost of this output is the sum of the squares of the differences between the actual output and the desired output, which is 3.3585.
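
The calculation itself is one line of NumPy. The real output values are in the screenshot; the vector below is a placeholder, not the actual data:

import numpy as np

output = np.array([0.4, 0.1, 0.6, 0.9, 0.2, 0.7, 0.3, 0.8, 0.5, 0.1])  # placeholder "trash" output
desired = np.zeros(10)
desired[3] = 1.0  # one-hot label for the digit "3"
print(np.sum((output - desired) ** 2))  # quadratic cost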


Question 2. Suppose we are feeding this image forward through our neural network and want to increase the classification of it as the digit "2"

image

Answer this question about decreasing the cost of the "output-2" node firing

image

My answer is: Increase the bias associated with the digit-2 neuron and decrease the biases associated with all the other neurons.


Question 3. What does the phrase "neurons that fire together wire together" mean in the context of increasing weights in layers in proportion to how they are activated in the previous layer?

image

My answer is: The phrase "neurons that fire together wire together" means that connections between neurons that activate together are strengthened; in backpropagation terms, weights grow in proportion to how strongly the corresponding neurons in the previous layer were activated.


Question 4. The following image shows which of the following:

image
  • changes to the weights of all the neurons "requested" by each training data
  • changes to the biases of all the neurons "requested" by each training data
  • changes to the activations of the previous layer
  • changes to the activation of the output layer

In addition to the answer you chose above, what other choices are changes that backpropagation can actually make to the neural network?

My answer is: The image shows changes to the weights of all the neurons "requested" by each training data point. Among the other choices, the only change backpropagation can actually make to the network is to the biases; it cannot directly set the activations of the previous layer or the output layer, it can only influence them through the weights and biases.


Question 5. In the reading, calculating the cost function delta by mini-batches to find the direction of the steepest descent is compared to a

  • a cautious person calculating how to get down a hill
  • a drunk stumbling quickly down a hill
  • a cat leaping gracefully down a hill
  • a bunch of rocks tumbling down a hill

What is the closest analogy to calculating the best update changes $\nabla C$ by mini-batches?

  • passing laws by electing a new president and waiting for an entire election's paper ballots to be completely counted
  • asking a single pundit on a television show what laws should be changed
  • asking a random-sized group of people to make a small change to any law in the country, repeated $n$ times, allowing a person the possibility to be chosen multiple times
  • making a small change to one law at a time chosen by random groups of $n$ people, until everyone in the country has been asked at least once

My answer is: The closest analogy is the fourth option: making a small change to one law at a time, chosen by random groups of $n$ people, until everyone in the country has been asked at least once. This corresponds to applying incremental updates for each mini-batch until the whole training set has been used.


Question 6. If each row in this image is a mini-batch, what is the mini-batch size?

image

Remember in our MNIST train.py in last week's lab, the mini-batch size was 10.

My answer is: Each row in the image is one mini-batch, and each row contains 12 numbers, so the mini-batch size is 12.


Backpropagation Calculus

3Blue1Brown Chapter 4: Backpropagation Calculus


Question #1. For our neural network with layers of [784, 100, 10], what is the size (number of elements) of the (cost function changes) matrix below:

image

Answer the question again for this smaller neural network

image

My answer is: 100 is the size or number of elements.


Question #2

image
  • Symbols

    A. $a^{(L-1)}$

    B. $\sigma$

    C. $b^{L}$

    D. $w^{L}$

    E. $a^{(L)}$

  • Meanings

i. Activations from the previous layer

ii. Bias of the current layer

iii. Activations of the current layer

iv. Sigmoid, or squishing function, to smooth outputs to the 0.0 to 1.0 range

v. Weights from the previous layer to this neuron

My answer is: [A-i][B-iv][C-ii][D-v][E-iii]


Question 3. In this tree diagram, we see how the final cost function for the first training image, $C_0$, depends on the activation of the output layer $a^{(L)}$. In turn, $a^{(L)}$ depends on the weighted output (before the sigmoid function) $z^{(L)}$, which itself depends on the incoming weights $w^{(L)}$ and activations $a^{(L-1)}$ from the previous layer and the bias of the current layer $b^{(L)}$.

image

What is the relationship of this second, extended diagram to the first one?

image

Choices (choose all that apply)

  1. There is no relationship
  2. The bottom half of the second diagram is the same as the first diagram
  3. The second diagram extends backward into the neural network, showing a previous layer $L-2$ whose outputs the layer $L-1$ depends on.
  4. The second diagram can be extended further back to layer $L-3$, all the way to the first layer $L$
  5. The first diagram is an improved version of the second diagram with fewer dependencies
  6. Drawing a path between two quantities in either diagram will show you which partial derivatives to "chain together" in calculating $\nabla C$

My answer is: 2 - The bottom half of the second diagram is the same as the first diagram; $z^{(L)}$ extends back to $z^{(L-1)}$ and $a^{(L-1)}$ extends back to $a^{(L-2)}$, continuing the pattern. 3 - The second diagram extends backward into the neural network, showing a previous layer $L-2$ whose outputs the layer $L-1$ depends on; this is also true because the diagram keeps extending layers, with each layer's outputs depending on the one before it.


Human Writing

Use the questions below to choose a topic about AI ethics.

  • Consider the Warren Buffett voice cloning demonstration.
  • What thoughts or feelings do you have right now, before hearing your synthesized voice?
    • Should you have the right to synthesize your own voice? What are possible advantages or misuses of it?
    • Photographs, video and audio recordings are technologies that changed the way we remember loved ones who have died.
      • Should family members have the right to create AI simulated personalities of the deceased?
      • If generative AI allows us another way to interact with a personality of a person in the past, how does this compare with historical re-enactors, or movies depicting fictional events with real people?

Response

There should be some sort of background check to allow people to use such sophisticated software. By creating a professional license only obtainable by working for companies that utilize these tools, you can stop people who might be harmful from using the tool. It isn't useful for an ordinary person and can be used to inflict harm in the wrong hands.

My main thought is: if my voice were synthesized, how would people tell the real me from the AI voice? It is a scary thought that your identity could be stolen and used against others. I think you should be able to synthesize your own voice, but only you should have the right to your own voice. This means you could create a second version of yourself to repeat things you forget. It could also help you analyze things in a new light, helping you be a better version of yourself. If you wanted to create a personalized diary of all your memories, you could have your own voice read the entries back to you. This could be personalized for anyone who decides to use this technology. It is a tool that can be useful in certain situations and should only be used to help the user solve problems, achieve academic accomplishments, or continue building on future goals.

If a family member has died, then another family member has the right to grieve however they need to. If artificial intelligence allows the user to hear the voice of someone they hold dear, it is important to embrace that miracle we have achieved. With future technology, we could use these voices with automated scripts to talk during the day. That way it will never feel like they are truly gone from your life.

Using generative AI and historical reenactments together, you could truly create a museum of art. But depictions of real people are meant to be completely true stories, while artificial intelligence isn't known for always being correct in its outputs. This might improve over time, but I don't think using artificial intelligence to take on the persona of a character for the act of telling a story from their perspective is morally right. It is important to remember these stories and pass them on to others. To honor these stories, we can respect other cultures and only tell the stories as they are. Artificial intelligence might change the story over different iterations. There is also the alternative that AI could create new scenarios that would teach us important information we may never have found before. It is important to understand how to use AI the right way for learning purposes.


AI Homework #5

Reading

Read Chapters 1 and 2 of Build a Large Language Model (From Scratch), by Sebastian Raschka.

Chapter 1 Questions

with open("file_name", "r", encoding="utf-8") as f:
raw_text = f.read()

print ("Total number of characters", len(raw_text))
print(raw_text[:100])

import re
result = re.split(r'([,.]|\s)', text)
print(result)
  1. What is the difference between a GPT and an LLM? Are the terms synonymous?

The difference between a GPT and an LLM is that a GPT is a specific type of LLM: a transformer-based model focused on text generation. LLMs as a category cover a variety of designs and are made for various purposes, so the terms are not synonymous.

  1. Labeled training pairs of questions and answers, in the model of "InstructGPT" are most similar to which of the following?

The most similar to the labeled training pairs are E: The images of handwritten digits and a label of 0 through 9 in the MNIST classification task.

A. Posts from Stackoverflow that have responses to them.
B. Posts on Twitter/X and their replies.
C. Posts on Reddit and their replies.
D. Posts on Quora and their replies.
E. Images of handwritten digits and a label of 0 through 9 in the MNIST classification task.

For each one, are there multiple labels for a training data point, and if so, is there a way to rank the quality or verity (closeness to the truth) of the labels with respect to their training data point?

Yes, the posts from Stackoverflow, Twitter/X, Reddit, and Quora can be ranked by quality depending on the number of upvotes received. The last example of MNIST images and labels has one label per image and has no way to be ranked from the start by the model.

  1. The GPT architecture in the paper "Attention is All You Need" was originally designed for which task

A. Passing the Turing test
B. Making up song lyrics
C. Machine translation from one human language to another
D. Writing a novel

The transformer was originally designed for C: machine translation, translating English text to German and French.

  1. How many layers of neural networks is considered "deep learning" in the Raschka text?

A neural network is considered "deep learning" when it has three or more layers, enough to model complex patterns.

  1. Is our MNIST classifier a deep-learning neural network by this definition?

Yes, by the rule that three or more layers count as "deep learning", our MNIST classifier qualifies as deep learning.

  1. For each statement about how pre-training is related to fine-tuning for GPTs:
  • If the statement is true, write "True" and give an example.
  • If the statement is false, write "False" and give a counter-example.

A. Pre-training is usually much more expensive and time-consuming than fine-tuning.

T

B. Pre-training is usually done with meticulously labeled data while finetuning is usually done on large amounts of unlabeled or self-labeling data.

False. This is reversed: pre-training is usually done on large amounts of unlabeled or self-labeling data, while fine-tuning is done with meticulously labeled data. Fine-tuning is useful when you have labeled data.

C. A model can be fine-tuned by different people than the ones who originally pre-trained a model.

T

D. Pre-training is to produce a more general-purpose model, and fine-tuning specializes it for certain tasks.

T

E. Fine-tuning usually uses less data than pre-training.

T

F. Pre-training can produce a model from scratch, but fine-tuning can only change an existing model.

T

  1. GPTs work by predicting the next word in a sequence, given which of the following as inputs or context?

A. The existing words in sentences it has already produced in the past.
B. Prompts from the user.
C. A system prompt that frames the conversation or instructs the GPT to behave in a certain role or manner.
D. New labeled pairs that represent up-to-date information that was not present at the time of training.
E. The trained model, which includes the encoder, decoder, and attention mechanism weights and biases.

A. Input
B. Context
C. Context
D. Input
E. Context

  1. The reading distinguishes between these three kinds of tasks that you might ask an AI to do:
  • Predicting the next word in a sequence (for a natural language conversation)
  • Classifying items, such as a piece of mail as spam, or a passage of text as an example of Romantic vs. realist literature
  • Answering questions on a subject after being trained with question-answer examples

How are these three tasks the same or different? Is one of these tasks general-purpose enough to implement the other two?
(screenshots of the reading, 2024-05-01)

These three tasks are similar because each takes an input, analyzes it, and produces an output: predicting words returns a full string of text, while classifying mail or matching answers to questions assigns items to a type. Next-word prediction is arguably general-purpose enough to implement the other two; by phrasing a classification task or a question-answer pair as text, the model can produce the class label or the answer as its predicted continuation.

  1. Which of the following components of the GPT architecture might be neural networks, similar to the MNIST classifier we have been studying? Explain your answer.

A. Encoder, which translates words into a higher-dimensional vector space of features
B. Tokenizer, which breaks up the incoming text into different textual parts
C. Decoder, which translates from a higher-dimensional vector space of features back to words

B is not a neural network, as the tokenizer only breaks up text. The encoder translates words into a higher-dimensional vector space of features, which makes it a neural network, and the decoder performs the reverse mapping from features back to words, so it can likewise be a neural network.

  1. What is an example of zero-shot learning that we have encountered in this class already? Choose all that apply and explain.

A. Using an MNIST classifier trained on numeric digits to classify alphabetic letters instead.
B. Using the YourTTS model for text-to-speech to clone a voice the model has never heard before.
C. Using ChatGPT or a similar AI chat to answer a question it has never seen before with no examples.
D. Using spam filters in Outlook by marking a message as spam to improve Microsoft's model of your email reading habits.

B. The first run of the model on a new voice, without any fine-tuning, does not go very well; the longer the recording of the input voice, the better the speech output. This zero-shot model definitely needs fine-tuning in a professional setting but can be adapted very quickly.

  1. What is zero-shot learning, and how does it differ from few-shot or many-shot learning?

Zero-shot learning is inference from attributes, text descriptions, and other side information to recognize general categories the model was never explicitly trained on. It differs from few-shot and many-shot learning in that no labeled examples of the task are provided.

  1. What is the number of model parameters quoted for GPT-3, a predecessor of the model used to power the first ChatGPT product?

GPT-3 has 175 billion parameters

Chapter 2 Questions (Part 1)

  1. Why can't LLMs operate on words directly? (Hint: think of how the MNIST neural network works, with nodes that have weighted inputs, that are converted with a sigmoid, and fire to the next layer. These are represented as matrices of numbers, which we practiced multiplying with Numpy arrays. The text-to-speech system TTS similarly does not operate directly on sound data.)

Strings have many issues: they are variable-sized, and computers need numbers to calculate with, since a neural network's layers are matrices of weights. The model must therefore convert the words to numbers, changing the way the computer interprets the text.

  1. What is an embedding? What does the dimension of an embedding mean?

An embedding is a mapping of a token to a vector in a continuous, higher-dimensional space. The dimension of an embedding refers to the number of elements, or features, in that vector representation.

  1. What is the dimension of the embedding used in our example of Edith Wharton's short story "The Verdict"? What is it for GPT-3?

The example in the book uses a 256-dimensional embedding for "The Verdict"; GPT-3 uses 12,288 dimensions.

  1. Put the following steps in order to process and prepare our dataset for LLM training

A. Adding position embeddings to the token word embeddings
B. Giving unique token IDs (numbers) to each token
C. Breaking up natural human text into tokens, which could include punctuation, whitespace, and special "meta" tokens like "end-of-text" and "unknown"
D. Converting token IDs to their embeddings, for example, using Word2Vec

The order of steps is C, B, D, A: break the text into tokens, assign token IDs, convert the IDs to embeddings, and then add position embeddings to the token embeddings.
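
A minimal PyTorch sketch of that pipeline, with a toy vocabulary and dimensions (the tokenization step C is assumed already done):

import torch

token_ids = torch.tensor([5, 1, 4, 2])  # step B: token IDs from a toy vocabulary of 8
tok_layer = torch.nn.Embedding(8, 16)   # step D: token ID -> 16-dim embedding
pos_layer = torch.nn.Embedding(4, 16)   # step A: one embedding per position
input_emb = tok_layer(token_ids) + pos_layer(torch.arange(4))
print(input_emb.shape)                  # torch.Size([4, 16]), ready for the GPT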

Human Writing

Read/listen to Part I of an interview of David Deutsch by Naval Ravikant and Brett Hall

Write a response to this piece of 500 words or more. Use any of the questions below to guide your thinking or react against them.

  • Form and describe a potential philosophy of when you will source a dataset for AI training. When and in what contexts do you hope to gather data for training an AI model?
    • Are there any principles or past experiences that have led you to choose this philosophy?
    • What is David Deutsch's apparent philosophy of AI use? How have his thoughts affected your policy if at all?
  • What is a use of AI that you are interested in, either that you've seen publicly or that you've considered as a thought experiment?
  • What is creativity?
    • What is David Deutsch's apparent definition of creativity? Or, what does he think creativity is not? (anti-definitions)
    • Is it necessary for AIs to be creative in order to be considered intelligent, or vice versa?
    • Is creativity important? Why or why not?
      • Expand your answer above to include both your own personal preferences and circumstances, as well as within human society as a whole.
    • Are there any negative effects of creativity or creative freedom? Why or why not?

Response

I would gather data to train a model from datasets that would help me learn something useful. To train a model that helps the user learn, you need a lot of training data, which is not easily collected by anyone. I think it is important to use software and other tools to gain information and learn. Doing experiments and finding out which data is useful, compared to data that is not, can lead to a successful outcome.

David Deutsch's philosophy of AI makes sense, as AI is not a creative algorithm. It takes in the user's prompt and gives its best output. There could be many different outputs, so how does ChatGPT decide which output it gives to the user? How can it anticipate what the user wants? It cannot, and it also cannot be creative. A genuinely creative AI is what would revolutionize the field, because then people would not have to create anymore.

I think using AI to generate GIFs or small images is interesting. It can be difficult to animate small images even if you are a creative person. But having software do all of the heavy lifting for you means there is no human touch; without it, every piece of art would turn into generated art, and no one could tell the difference. David Deutsch continues by saying that creativity is unbounded, meaning a computer could not compute something not yet identifiable. If there are no boundaries to creativity, then how can we make a computer know whether it is creative, or what it means to be creative? To be intelligent you must know the facts and lay them out so anyone can understand them. To do this without seeming like a robot each time, you must get creative and make it interesting.

Creativity is one of the most important things in life, as it multiplies and grows from each person's imagination. Each input is a new chance at infinite possibilities. You can use creativity as a tool to solve problems or to express yourself. It is important to have fun with it and let it flow, as the outcome will be best that way. You can enrich and inspire other lives by adapting parts of your life or culture. Every day is a new day, and it creates even more circumstances to be unique as we navigate life's many mysteries. With creativity come innovations, but also failures. It might take many tries before you can visually see what you pictured in your head. When views change there can be differences of opinion, which can upset someone who cares deeply about theirs. Forcing creativity can lead to burnout or other forms of stress, and you can also get overwhelmed when there are too many items to deal with or the scope is too large.

AI-generated content is not creative or unique as it follows guidelines and rules to get an output strictly based on the input prompt.


AI Homework #6

AI Reading

Read the second half of Chapter 2 in the Raschka LLM book, from Section 2.6 until the end.

Read Appendix A in the same book, up to and including Section A.5, on PyTorch and tensors.

Questions

  1. Match each of the red or the blue diagram below with one of the following processes:
  • Encoding text from words to token IDs (integers).
  • Decoding from token IDs (integers) to words.
(blue and red diagrams)

Encoding text from words to token IDs (integers) matches with the blue diagram.

Decoding from token IDs (integers) to words matches with the red diagram.

  1. What are the two special tokens added to the end of the vocabulary shown in the tokenizer below? What are their token IDs?
image

The <|unk|> token stands in for unknown words that are not in the vocabulary.

The <|endoftext|> token identifies the end of the text and stops scanning for more text.

  1. The following diagram shows multiple documents being combined (concatenated) into a single large document to provide a high-quality training dataset for a GPT.

What is the special token that separates the original documents in the final combined dataset? How can you tell how many original documents were combined into this dataset?

image

The <|endoftext|> token is a special token that marks where one text ends. You can tell how many original documents were combined by counting the special <|endoftext|> tokens: a token after a stretch of text indicates that one document ended and a new one begins.
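
For example, with hypothetical toy documents, counting the separator tokens recovers the document count:

docs = ["Doc one.", "Doc two.", "Doc three."]  # hypothetical documents
combined = " <|endoftext|> ".join(docs)        # one separator between documents
print(combined.count("<|endoftext|>") + 1)     # 2 separators -> 3 documents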

  1. Using the byte-pair-encoding tokenizer, unknown words are broken down into single and double-letter pairs.
image

What are some of the advantages of this approach? Let's use the example of a long, but unusual word, "Pneumonoultramicroscopicsilicovolcanoconiosis" which might be broken down into smaller word parts like "Pn-eu-mono-ul-tra-micro-scopic-silico-volcano-conio-sis".

Breaking down large unknown words has several advantages: the tokenizer can identify known words within the large unknown word and reduce it to something more familiar. If a long word were completely unknown and treated as one opaque unit, there would be no way to relate it to anything else. By locating known character sequences, we can associate known word parts with the large unknown word and get hints about what the word actually means.

For each choice below, explain why it is an advantage or disadvantage of this approach.

  • It lets the GPT learn connections between the long word and shorter words based on common parts, like "pneumonia", "microscope", or "volcano".
  • This approach can work in any language, not just English.
  • All words are broken down in the same way, so the process is deterministic and results from repeated runs will be similar.
  • The system will handle any word it encounters in chat or inference, even if it has never seen the word during training.
  • It is easier than looking up the word in a hashtable, not finding it, and using a single <|unk|> unknown special token.
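
The subword breakdown can be checked with the same tiktoken GPT-2 tokenizer used in the next question (the exact pieces depend on the learned vocabulary, so they may differ from the hyphenation above):

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
ids = tokenizer.encode("Pneumonoultramicroscopicsilicovolcanoconiosis")
print([tokenizer.decode([i]) for i in ids])  # the word as known subword pieces
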
  1. A BPE tokenized encoding is shown below with successively longer lists of token IDs.

The code that produced it is

text = "In the sunlit terraces of someunknownPlace."
split_text = split(text)
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

for i in range(10):
    decoded = tokenizer.decode(integers[:i+1])
    print(f"{integers[:i+1]} {decoded}")

This is the output

[818] In
[818, 262] In the
[818, 262, 4252] In the sun
[818, 262, 4252, 18250] In the sunlit
[818, 262, 4252, 18250, 8812] In the sunlit terr
[818, 262, 4252, 18250, 8812, 2114] In the sunlit terraces
[818, 262, 4252, 18250, 8812, 2114, 286] In the sunlit terraces of
[818, 262, 4252, 18250, 8812, 2114, 286, 617] In the sunlit terraces of some
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680] In the sunlit terraces of someunknown
[818, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271] In the sunlit terraces of someunknownPlace

What kind of words (tokens) do you think tend to have smaller integers for token IDs, and what kind of words (tokens) tend to have larger integers?

More frequent words tend to have smaller integers for token IDs, while rarer words and subword fragments tend to have larger ones.

What tokens smaller than words receive their own IDs? Are they always separated with a space from the token before them? If you were the Python programmer writing the tokenizer.decode method that produced the English string below from a list of integers, how would you implement the feature of letting these "sub-word" tokens join the sentence with or without a space?

Tokens smaller than words that receive their own ID would be:

Human Writing

Read the following case description of Hachette v. Internet Archive

In the same dev diary entry as above, write a response of at least 500 words addressing the following questions:

  • How does this case relate to selecting training datasets for GPTs, such as our final project in this class?
  • Describe the main point of view of Hachette and other publishers on the plaintiff side of this case (the party who claim they have been wronged).
  • Describe the main point of view of Internet Archive, the founder Brewster Kahle, the Electronic Frontier Foundation, or other parties on the defendant side of this case (the party defending against claims that they have done anything wrong).
  • What other news event or legal case is similar to this case that is interesting to you?
    • Compare and contrast the two cases. How are they similar and how are they different?
  • Which of the above arguments are convincing to you? If you were the judge in this case, how would you form your opinion?
  • Suppose you wanted to train a GPT based on the code and written work of you and your classmates.
    • What ethical questions and considerations would come up for you in this process?

If you use an AI chat to help develop your response, include a link to your chat and attempt to make it a 50%-50% chat: write prompts, questions, or your own summaries that are at least as long as the responses the AI gives you, or ask the AI to deliberately shorten its answers.

Note: Come up with your own thesis statement after reading the article before talking to anyone else, either a classmate or an AI. Do not ask another entity to develop ideas from scratch or come to a conclusion for you. You may wish to use other sources only to check your own understanding, knowing that you should independently verify and do additional work outside of this conversation to make sure the contributions are usable, true, and support your main point.

Response

This case relates to selecting training datasets for GPTs because the Internet Archive holds a huge amount of publicly accessible information and lets people check it out. If we were to grab all the information in the Internet Archive and train a model on it, we would have the contents of every book and PDF in the archive. The main point of Hachette and the other publishers is that the Internet Archive's actions infringe their exclusive rights as copyright holders and could weaken how copyright is enforced. Mass digitization and distribution of copyrighted works without proper authorization or compensation deprives authors and publishers of their rightful income, and can be seen as unauthorized distribution of copyrighted content online.

However, the Internet Archive argued that its lending served as a temporary, emergency measure to address the closures of physical libraries and schools during the pandemic, ensuring that students, educators, and the public could continue to access essential educational resources remotely. They argued that limited use of copyrighted works without permission, for purposes such as education, research, and scholarship, was allowable in a time of crisis. This dispute over data is somewhat like the current effort in the United States to ban TikTok. Everyone's data is stored on the platform and is then shared with service providers or business partners, including usage information, account information, and location and device information. The United States wants to ban this app to keep its citizens' data private and to mitigate the negative impacts of constant technology use on young people's health. These cases are similar because both are about rights and who owns what data, and in both, some parties do not care who owns the data and will use it for product improvement or financial gain. They are different because one side is after the rights to their books for financial gain while the other is after the safety of a whole country's data. As a judge, I would try to do the most ethical thing and treat public knowledge as a solution, not a hindrance, crediting the Internet Archive for making information available so that others can be informed by it. If someone cannot check the material out because it is already checked out, then they must pay to read it; I think that is fair. And while public knowledge is important for education, I think TikTok should be banned so that no personal information is leaked or stolen.

If I trained a GPT model on code and written work from me and my classmates, I would expect it to be something only we would use. As it is our data, we would not be too keen to share it with anyone. No one else would be very interested in it anyway, and they would not have much use for a GPT based on someone else's work. It would be helpful for explaining past work, and it could also format and comment the code to make it more readable.

Source


AI Homework #7

AI Homework 07

Back to AI Self-Hosting

AI Reading

Read Chapter 3 in the Raschka LLM book, typing and adapting the code samples to your own dataset (which we'll do as part of Lab 07)

Read Appendix A in the same book, from Section A.6, on PyTorch and tensors.

Attempt a response to the following questions in a dev diary entry.

Questions

Question 0. Which of the following vectors is a one-hot encoding?

How would you describe a one-hot encoding in English?

image

A one-hot encoding is a vector that is all zeros except for a single 1, like the outlined vector in the image. When you multiply a matrix by a vector of 0s and 1s like this, the single 1 picks out one specific row or column of the matrix, so a one-hot encoding selects exactly one category out of many.
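
In NumPy terms, a minimal sketch of that selection:

import numpy as np

one_hot = np.array([0, 0, 1, 0])      # all zeros except a single 1
matrix = np.arange(12).reshape(4, 3)  # 4 rows of 3 features each
print(one_hot @ matrix)               # selects row 2: [6 7 8]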

Question 1. What is an (x,y) training example (in English)?

Hint: In MNIST, x is a 784-pixel image, and y is a single character label from 0 to 9.

An (x, y) training example is a pair of data used to train a model. In MNIST, x is the image and y is the label, and together they teach the model what it is looking for: one piece of the example is the input data being sorted through, while the other piece is the expected output, the answer to what it is sorting.

Question 2. We call large texts for training a GPT "self-labeling" because we can sample from the text in a sliding window (or batches of words).

Match the following terms (A, B, C) with their definitions below (i, ii, iii):

A. max_length

B. stride

C. batch size

i. the number of token IDs to "slide" forward from one (x,y) training example to the next (x,y) training example

ii. chunk size, or number of token IDs to group together into one x or y of a training example (x,y)

iii. the number of (x,y) training examples returned in each call to next of our dataloader's iterator.

[A-ii][B-i][C-iii]
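
A plain-Python sketch of the sliding window (hypothetical token IDs) showing how the three terms interact:

token_ids = list(range(20))  # hypothetical token IDs
max_length, stride = 4, 2

examples = []
for start in range(0, len(token_ids) - max_length, stride):
    x = token_ids[start : start + max_length]
    y = token_ids[start + 1 : start + max_length + 1]  # y is x shifted by one token
    examples.append((x, y))

batch_size = 3
print(examples[:batch_size])  # one batch of (x, y) training examples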

Question 3. Because embedding is stored as a matrix, and we studied how neural network weights can also be stored in a matrix, we can view the operation of transforming an input vector into an embedding as a two-layer neural network.

For example, this neural network has layers of [4,3] meaning 4 nodes in the first layer and 3 nodes in the 2nd layer.

image

We ignore biases for now, or assume they are all zeros.

The weights for the above neural network are a matrix that takes column vectors of size 4 to column vectors of size 3. Using the rules of matrix multiplication, what is the size of this weights matrix?

$$\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix} \Rightarrow \begin{bmatrix} e \\ f \\ g \end{bmatrix}$$

By the rules of matrix multiplication, a matrix that takes column vectors of size 4 to column vectors of size 3 must have shape 3x4. Likewise, if the embedding matrix took you from a vocabulary size of 7 to an output dimension of 128, the shape of that matrix would be 128x7.
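
A quick PyTorch check of the equivalence (vocabulary 7, embedding dimension 3 to keep it small): looking up a token ID in an embedding layer gives the same vector as multiplying its one-hot encoding by the weight matrix.

import torch

emb = torch.nn.Embedding(7, 3)  # vocabulary of 7 -> 3-dim embeddings
k = torch.tensor([2])           # a single token ID
one_hot = torch.nn.functional.one_hot(k, num_classes=7).float()
print(torch.allclose(emb(k), one_hot @ emb.weight))  # True: lookup == matmul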

Question 4. In the above problem, we can treat the input to this [4,3] neural network as a single token ID (as a one-hot encoding) that we wish to convert to its embedding (in a higher-dimensional feature space).

Not in this hw

Question 5. Suppose your embedding matrix (created in Section 2.7 of the book) looks like the example below:

image

If you select and print out the embedding for a particular token ID, you get

tensor([[ 1.8960, -0.1750,  1.3689, -1.6033]], grad_fn=<EmbeddingBackward0>)

(Ignore the requires_grad and grad_fn parameters for now.)

A) Which token ID did you get an embedding for? (Remember it is 0-based indexing.)

B) Which of the following is true?

i) Your vocabulary has 4 token IDs in it, and the embedding output dimension is 7
ii) Your vocabulary has 7 token IDs in it, and the embedding output dimension is 4
iii) Both
iv) Neither

Not in this hw

Human Writing

Read the following two articles about energy usage in computing

Thoughts on permacomputing

Generative AI's environmental costs are soaring — and mostly secret

In the same dev diary entry as above, write a response of at least 500 words addressing the following questions:

  • How would you summarize the main point or thesis of each article?
    • Try to summarize them each in one sentence.
  • How would you divide and expand the thesis of each article above into two or three main parts? That is:
    • for the permacomputing article, how would you summarize its main sections?
    • for the AI energy article, how would you summarize its main sections?
  • What are two pieces of evidence or arguments that each article provides to support their thesis?
    • Provide two pieces of evidence or arguments for the permacomputing article.
    • Provide another two pieces of evidence or arguments for the AI energy article.
  • What is a related piece of evidence or arguments that you've found independently (e.g. through reading / watching the news, search engines)?
    • Find one piece of evidence or argument that supports or refutes the permacomputing article.
    • Find one piece of evidence or argument that supports or refutes the AI energy article.
  • How are the two readings similar or different?
    • How would you describe the overall attitude of each article?
    • How would you describe the approach each article takes?
  • To what extent do you agree or disagree with the thesis of each article, as you've stated it above?
    • Do you find the pieces of evidence or arguments that you provide convincing? Why or why not?

If you use an AI chat to help develop your response, include a link to your chat and attempt to make it a 50%-50% chat: write prompts, questions, or your own summaries that are at least as long (in number of words) as the responses the AI gives you, or ask the AI to deliberately shorten its answers.

Note: Come up with your own thesis statement after reading the article before talking to anyone else, either a classmate or an AI. Do not ask another entity to develop ideas from scratch or come to a conclusion for you.

You may wish to use other sources only to check your own understanding, knowing that you should independently verify and do additional work outside of this conversation to make sure the contributions are usable, true, and support your main point.

Response

It is important to break down large articles like this. Here is one sentence summarizing each main point:

  • Permacomputing, drawing on permaculture, emphasizes resource-sensitivity: computers need to manage their energy usage, especially while computing on large data sets.
  • Observation is crucial to computing for making difficult-to-see changes visible, as information becomes a resource only when acted upon.
  • Linear progress is misleading: progress in technology isn't always led by hardware, but also by the advancements we make with the hardware, allowing us to create new things and further our creativity.
  • Programming is essential to communities: people should be able to access and create software to address their own problems, and by not relying on larger systems we can assemble something from smaller building blocks with only the necessary features.
  • Programs can function as tools, problem-solvers, engine components, or entirely unique entities, and should be easy to use, automate generic tasks, and support maintenance that makes code smaller and faster.
  • The AI industry is facing an energy crisis: the next generation of generative AI systems will consume vastly more power than current energy systems can handle.

The first piece of evidence from the first article is: "Current consumer-oriented computing systems often go to ridiculous lengths to actually prevent the user from knowing what is going on." This means that users will continue to use devices without thinking about nature, society's electricity use, or their overall wellbeing. It suggests that users focus more on themselves and use devices to escape reality, so they will be of no help with the energy crisis unless something changes before all the power is used up. Another piece of evidence is: "The fossil-industrial story of linear progress has made many people believe that the main driver for computer innovation would be the constant increase of computing capacity". This suggests the "industry" does not really work from facts and just assumes what users do and think. We should focus on gathering evidence and tracking which data is actually true to help us fix the issues that we can.

The first piece of evidence from the second article is: "It remains very hard to get accurate and complete data on environmental impacts." This means that even scientists cannot fully confirm the exact impact of these specific actions. The impacts are hard to identify, but there are ways to narrow down the data, and we should be focusing our time and effort on preventing these impacts as best we can.

The second piece of evidence from the second article is: "Researchers could optimize neural network architectures for sustainability and collaborate with social and environmental scientists to guide technical designs towards greater ecological sustainability." This could be difficult in practice, as neural networks need lots of data to train, and it would take significant effort for software engineers to work with social and environmental scientists on improving the overall structure of neural networks.

A related piece of evidence appears in both articles. Section 2.1 ("Energy") of the first article states that computers must adapt to the available energy conditions. If we do not have enough energy, yet computing demand doubles, the load will overwhelm our electric grids. The second article makes the same point: "Within years, large AI systems are likely to need as much energy as entire nations." We will not have enough power for this, so we must find new ways to generate electricity or to store it.

These two readings are very similar in that both address energy and the resource cost of computing over large amounts of data. The first article is more informative: it tries to teach the reader about each topic and how it affects the larger picture. The second article is more argumentative and focuses on convincing the reader that we need to do something about this energy crisis and the impacts it will produce.

I agree with both articles that energy constraints are a serious problem. However, I do not believe it is an immediate one, since we may not need ever more power if the models we have now are close to as advanced as this approach gets. Once we do hit that roadblock, we will hopefully have a solution for getting more energy or for optimizing the neural network architectures.


AI Pre Lab #5

2024-05-02 AI Pre Lab #5

1

Total number of characters: 842147
DRACULA

CHAPTER I

JONATHAN HARKER’S JOURNAL

(_Kept in shorthand._)

_3 May. Bistritz._--Left Mun

2

['sight', '.', 'I', 'shall', 'be', 'glad', 'as', 'long', 'as', 'I', 'live', 'that', 'even', 'in', 'that', 'moment', 'of', 'final', 'dissolution', ',', 'there', 'was', 'in', 'the', 'face', 'a', 'look', 'of', 'peace', ',', 'such', 'as', 'I', 'never', 'could', 'have', 'imagined', 'might', 'have', 'rested', 'there', '.', 'The', 'Castle', 'of', 'Dracula', 'now', 'stood', 'out', 'against', 'the', 'red', 'sky', ',', 'and', 'every', 'stone', 'of', 'its', 'broken', 'battlements', 'was', 'articulated', 'against', 'the', 'light', 'of', 'the', 'setting', 'sun', '.', 'The', 'gypsies', ',', 'taking', 'us', 'as', 'in', 'some', 'way', 'the', 'cause', 'of', 'the', 'extraordinary', 'disappearance', 'of', 'the', 'dead', 'man', ',', 'turned', ',', 'without', 'a', 'word', ',', 'and', 'rode', 'away', 'as', 'if', 'for', 'their', 'lives', '.', 'Those', 'who', 'were', 'unmounted', 'jumped', 'upon', 'the', 'leiter-wagon', 'and', 'shouted', 'to', 'the', 'horsemen', 'not', 'to', 'desert', 'them', '.', 'The', 'wolves', ',', 'which', 'had', 'withdrawn', 'to', 'a', 'safe', 'distance', ',', 'followed', 'in', 'their', 'wake', ',', 'leaving', 'us', 'alone', '.', 'Mr', '.', 'Morris', ',', 'who', 'had', 'sunk', 'to', 'the', 'ground', ',', 'leaned', 'on', 'his', 'elbow', ',', 'holding', 'his', 'hand', 'pressed', 'to', 'his', 'side', ';', 'the', 'blood', 'still', 'gushed', 'through', 'his', 'fingers', '.', 'I', 'flew', 'to', 'him', ',', 'for', 'the', 'Holy', 'circle', 'did', 'not', 'now', 'keep', 'me', 'back', ';', 'so', 'did', 'the', 'two', 'doctors', '.', 'Jonathan', 'knelt', 'behind', 'him', 'and', 'the', 'wounded', 'man', 'laid', 'back', 'his', 'head', 'on', 'his', 'shoulder', '.', 'With', 'a', 'sigh', 'he', 'took', ',', 'with', 'a', 'feeble', 'effort', ',', 'my', 'hand', 'in', 'that', 'of', 'his', 'own', 'which', 'was', 'unstained', '.', 'He', 'must', 'have', 'seen', 'the', 'anguish', 'of', 'my', 'heart', 'in', 'my', 'face', ',', 'for', 'he', 'smiled', 'at', 'me', 'and', 'said', ':', '--', '“I', 'am', 'only', 'too', 'happy', 'to', 'have', 'been', 'of', 'any', 'service', '!', 'Oh', ',', 'God', '!', '”', 'he', 'cried', 'suddenly', ',', 'struggling', 'up', 'to', 'a', 'sitting', 'posture', 'and', 'pointing', 'to', 'me', ',', '“It', 'was', 'worth', 'for', 'this', 'to', 'die', '!', 'Look', '!', 'look', '!', '”', 'The', 'sun', 'was', 'now', 'right', 'down', 'upon', 'the', 'mountain', 'top', ',', 'and', 'the', 'red', 'gleams', 'fell', 'upon', 'my', 'face', ',', 'so', 'that', 'it', 'was', 'bathed', 'in', 'rosy', 'light', '.', 'With', 'one', 'impulse', 'the', 'men', 'sank', 'on', 'their', 'knees', 'and', 'a', 'deep', 'and', 'earnest', '“Amen”', 'broke', 'from', 'all', 'as', 'their', 'eyes', 'followed', 'the', 'pointing', 'of', 'his', 'finger', '.', 'The', 'dying', 'man', 'spoke', ':', '--', '“Now', 'God', 'be', 'thanked', 'that', 'all', 'has', 'not', 'been', 'in', 'vain', '!', 'See', '!', 'the', 'snow', 'is', 'not', 'more', 'stainless', 'than', 'her', 'forehead', '!', 'The', 'curse', 'has', 'passed', 'away', '!', '”', 'And', ',', 'to', 'our', 'bitter', 'grief', ',', 'with', 'a', 'smile', 'and', 'in', 'silence', ',', 'he', 'died', ',', 'a', 'gallant', 'gentleman', '.', 'NOTE', 'Seven', 'years', 'ago', 'we', 'all', 'went', 'through', 'the', 'flames', ';', 'and', 'the', 'happiness', 'of', 'some', 'of', 'us', 'since', 'then', 'is', ',', 'we', 'think', ',', 'well', 'worth', 'the', 'pain', 'we', 'endured', '.', 'It', 'is', 'an', 'added', 'joy', 'to', 'Mina', 'and', 'to', 'me', 'that', 'our', 'boy’s', 'birthday', 'is',
'the', 'same', 'day', 'as', 'that', 'on', 'which', 'Quincey', 'Morris', 'died', '.', 'His', 'mother', 'holds', ',', 'I', 'know', ',', 'the', 'secret', 'belief', 'that', 'some', 'of', 'our', 'brave', 'friend’s', 'spirit', 'has', 'passed', 'into', 'him', '.', 'His', 'bundle', 'of', 'names', 'links', 'all', 'our', 'little', 'band', 'of', 'men', 'together', ';', 'but', 'we', 'call', 'him', 'Quincey', '.', 'In', 'the', 'summer', 'of', 'this', 'year', 'we', 'made', 'a', 'journey', 'to', 'Transylvania', ',', 'and', 'went', 'over', 'the', 'old', 'ground', 'which', 'was', ',', 'and', 'is', ',', 'to', 'us', 'so', 'full', 'of', 'vivid', 'and', 'terrible', 'memories', '.', 'It', 'was', 'almost', 'impossible', 'to', 'believe', 'that', 'the', 'things', 'which', 'we', 'had', 'seen', 'with', 'our', 'own', 'eyes', 'and', 'heard', 'with', 'our', 'own', 'ears', 'were', 'living', 'truths', '.', 'Every', 'trace', 'of', 'all', 'that', 'had', 'been', 'was', 'blotted', 'out', '.', 'The', 'castle', 'stood', 'as', 'before', ',', 'reared', 'high', 'above', 'a', 'waste', 'of', 'desolation', '.', 'When', 'we', 'got', 'home', 'we', 'were', 'talking', 'of', 'the', 'old', 'time', '--', 'which', 'we', 'could', 'all', 'look', 'back', 'on', 'without', 'despair', ',', 'for', 'Godalming', 'and', 'Seward', 'are', 'both', 'happily', 'married', '.', 'I', 'took', 'the', 'papers', 'from', 'the', 'safe', 'where', 'they', 'had', 'been', 'ever', 'since', 'our', 'return', 'so', 'long', 'ago', '.', 'We', 'were', 'struck', 'with', 'the', 'fact', ',', 'that', 'in', 'all', 'the', 'mass', 'of', 'material', 'of', 'which', 'the', 'record', 'is', 'composed', ',', 'there', 'is', 'hardly', 'one', 'authentic', 'document', ';', 'nothing', 'but', 'a', 'mass', 'of', 'typewriting', ',', 'except', 'the', 'later', 'note-books', 'of', 'Mina', 'and', 'Seward', 'and', 'myself', ',', 'and', 'Van', 'Helsing’s', 'memorandum', '.', 'We', 'could', 'hardly', 'ask', 'any', 'one', ',', 'even', 'did', 'we', 'wish', 'to', ',', 'to', 'accept', 'these', 'as', 'proofs', 'of', 'so', 'wild', 'a', 'story', '.', 'Van', 'Helsing', 'summed', 'it', 'all', 'up', 'as', 'he', 'said', ',', 'with', 'our', 'boy', 'on', 'his', 'knee', ':', '--', '“We', 'want', 'no', 'proofs', ';', 'we', 'ask', 'none', 'to', 'believe', 'us', '!', 'This', 'boy', 'will', 'some', 'day', 'know', 'what', 'a', 'brave', 'and', 'gallant', 'woman', 'his', 'mother', 'is', '.', 'Already', 'he', 'knows', 'her', 'sweetness', 'and', 'loving', 'care', ';', 'later', 'on', 'he', 'will', 'understand', 'how', 'some', 'men', 'so', 'loved', 'her', ',', 'that', 'they', 'did', 'dare', 'much', 'for', 'her', 'sake', '.', '”', 'JONATHAN', 'HARKER', '.', 'THE', 'END']

3

['DRACULA', 'CHAPTER', 'I', 'JONATHAN', 'HARKER’S', 'JOURNAL', '(', '_', 'Kept', 'in', 'shorthand', '.', '_', ')', '_', '3', 'May', '.', 'Bistritz', '.', '_', '--', 'Left', 'Munich', 'at', '8:35', 'P', '.', 'M', '.']
186381
11927
('!', 0)
('&', 1)
('(', 2)
(')', 3)
('*', 4)
(',', 5)
('--', 6)
('.', 7)
('000', 8)
('1', 9)
('10', 10)
('10:18', 11)
('10:30', 12)
('11', 13)
('11:40', 14)
('12', 15)
('12:30', 16)
('12:45', 17)
('12th', 18)
('13', 19)
('14', 20)
('15', 21)
('16', 22)
('17', 23)
('1777;', 24)
('17s', 25)
('17th', 26)
('18', 27)
('1854', 28)
('1873', 29)
('19', 30)
('197', 31)
('1st', 32)
('2', 33)
('20', 34)
('21', 35)
('22', 36)
('23', 37)
('24', 38)
('24th', 39)
('25', 40)
('26', 41)
('27', 42)
('28', 43)
('29', 44)
('2:', 45)
('2:35', 46)
('3', 47)
('3-4', 48)
('30', 49)
('31', 50)

4

[1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]

5

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

6

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[16], line 5
      1 tokenizer = SimpleTokenizerV1(vocab)
      3 text = "Hello, do you like tea. Is this-- a test?"
----> 5 tokenizer.encode(text)

Cell In[12], line 9, in SimpleTokenizerV1.encode(self, text)
      7 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
      8 preprocessed = [item.strip() for item in preprocessed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
     10 return ids

Cell In[12], line 9, in <listcomp>(.0)
      7 preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
      8 preprocessed = [item.strip() for item in preprocessed if item.strip()]
----> 9 ids = [self.str_to_int[s] for s in preprocessed]
     10 return ids

KeyError: 'Hello'

7

('”', 11924)
('”;', 11925)
('”’', 11926)
('<|endoftext|>', 11927)
('<|unk|>', 11928)

8

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'
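For reference, here is a minimal sketch of the tokenizer behind steps 5 through 8, following the SimpleTokenizerV1/V2 pattern from the Raschka book. It assumes raw_text already holds the Dracula text loaded in step 1; V2 substitutes <|unk|> for out-of-vocabulary words, which is exactly what avoids the KeyError from step 6 and produces the output in step 8.

    import re

    # steps 2-3: tokenize the raw text and build the {token: id} vocab,
    # then extend it with the two special tokens from step 7
    preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
    preprocessed = [item.strip() for item in preprocessed if item.strip()]
    all_tokens = sorted(set(preprocessed))
    all_tokens.extend(["<|endoftext|>", "<|unk|>"])
    vocab = {token: i for i, token in enumerate(all_tokens)}

    class SimpleTokenizerV2:
        """V1 plus <|unk|> substitution for words not in the vocab."""
        def __init__(self, vocab):
            self.str_to_int = vocab
            self.int_to_str = {i: s for s, i in vocab.items()}

        def encode(self, text):
            tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
            tokens = [t.strip() for t in tokens if t.strip()]
            # V1 raised KeyError('Hello') here; V2 falls back to <|unk|>
            tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
            return [self.str_to_int[t] for t in tokens]

        def decode(self, ids):
            text = " ".join(self.int_to_str[i] for i in ids)
            return re.sub(r'\s+([,.?!"()\'])', r'\1', text)  # re-attach punctuation

    tokenizer = SimpleTokenizerV2(vocab)
    text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace."
    print(tokenizer.decode(tokenizer.encode(text)))  # step 8's output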

AI Lab 7

2024-05-16 AI Lab #7

2.6

230107
x: [350, 13, 337, 1539]
y:      [13, 337, 1539, 319]
[350] ----> 13
[350, 13] ----> 337
[350, 13, 337] ----> 1539
[350, 13, 337, 1539] ----> 319
 P ----> .
 P. ---->  M
 P. M ----> .,
 P. M., ---->  on
[tensor([[7707, 2246, 6239,   32]]), tensor([[2246, 6239,   32,  198]])]
[tensor([[2246, 6239,   32,  198]]), tensor([[6239,   32,  198,  198]])]
Inputs:
 tensor([[ 7707,  2246,  6239,    32],
        [  198,   198, 41481,   314],
        [  198,   198,    41,  1340],
        [12599,  1565,   367, 14175],
        [ 1137,   447,   247,    50],
        [  449, 11698,    45,  1847],
        [  198,   198, 28264,  8896],
        [  457,   287, 45883, 13557]])

Targets:
 tensor([[ 2246,  6239,    32,   198],
        [  198, 41481,   314,   198],
        [  198,    41,  1340, 12599],
        [ 1565,   367, 14175,  1137],
        [  447,   247,    50,   449],
        [11698,    45,  1847,   198],
        [  198, 28264,  8896,   457],
        [  287, 45883, 13557,     8]])
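The x/y pairs and the 8x4 batches above come from a sliding-window dataset: every chunk of max_length token ids is paired with the same chunk shifted one position right, so each position predicts the next token. A minimal sketch along the lines of the book's GPTDatasetV1, assuming the GPT-2 BPE tokenizer from tiktoken and the Dracula text in raw_text (the filename below is hypothetical):

    import tiktoken
    import torch
    from torch.utils.data import Dataset, DataLoader

    class GPTDatasetV1(Dataset):
        def __init__(self, txt, tokenizer, max_length, stride):
            token_ids = tokenizer.encode(txt)
            self.input_ids, self.target_ids = [], []
            # slide a max_length window over the ids; the target window
            # is the same ids shifted one token to the right
            for i in range(0, len(token_ids) - max_length, stride):
                self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
                self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

        def __len__(self):
            return len(self.input_ids)

        def __getitem__(self, idx):
            return self.input_ids[idx], self.target_ids[idx]

    raw_text = open("dracula.txt", encoding="utf-8").read()  # hypothetical filename
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(raw_text, tokenizer, max_length=4, stride=4)
    loader = DataLoader(dataset, batch_size=8, shuffle=False)
    inputs, targets = next(iter(loader))  # each of shape [8, 4], as printed above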

2.7

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)
tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)
tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)
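An embedding layer is just a trainable lookup table indexed by token id, which is all the printout above shows: the full 6x3 weight matrix, then row 3, then rows 2, 3, 5, and 1 in input order. A sketch that should reproduce these numbers, assuming the book's torch.manual_seed(123):

    import torch

    torch.manual_seed(123)
    embedding_layer = torch.nn.Embedding(6, 3)  # 6-entry vocab, 3-dim vectors

    print(embedding_layer.weight)                        # the whole lookup table
    print(embedding_layer(torch.tensor([3])))            # row 3
    print(embedding_layer(torch.tensor([2, 3, 5, 1])))   # rows in input order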

2.8

Token IDs:
 tensor([[ 7707,  2246,  6239,    32],
        [  198,   198, 41481,   314],
        [  198,   198,    41,  1340],
        [12599,  1565,   367, 14175],
        [ 1137,   447,   247,    50],
        [  449, 11698,    45,  1847],
        [  198,   198, 28264,  8896],
        [  457,   287, 45883, 13557]])

Inputs shape:
 torch.Size([8, 4])
torch.Size([8, 4, 256])
torch.Size([4, 256])
torch.Size([8, 4, 256])
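The last three shapes show token and positional embeddings being combined: the [8, 4] batch of ids becomes [8, 4, 256], a [4, 256] positional table covers the four positions, and broadcasting the addition over the batch keeps the result [8, 4, 256]. A sketch, assuming the GPT-2 vocab size of 50257 (random ids stand in for the dataloader batch):

    import torch

    vocab_size, output_dim, context_length = 50257, 256, 4

    token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
    pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

    inputs = torch.randint(0, vocab_size, (8, 4))  # stand-in for the dataloader batch
    token_embeddings = token_embedding_layer(inputs)                    # [8, 4, 256]
    pos_embeddings = pos_embedding_layer(torch.arange(context_length))  # [4, 256]

    # broadcasting adds the same positional vectors to every sequence in the batch
    input_embeddings = token_embeddings + pos_embeddings                # [8, 4, 256]
    print(input_embeddings.shape)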

3.3.1

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
tensor(0.9544)
tensor(0.9544)
Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
tensor([0.4419, 0.6515, 0.5683])
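These numbers come from the simplest form of attention: dot the second input (the query) against every input to get scores, softmax the scores into weights that sum to 1, and average the inputs with those weights to get the context vector. A sketch, assuming the book's six 3-dimensional example inputs:

    import torch

    # the running "Your journey starts with one step" example from the book
    inputs = torch.tensor([
        [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
        [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55],
    ])

    query = inputs[1]                                 # second input as the query
    attn_scores_2 = inputs @ query                    # dot product with every input
    attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
    print(attn_weights_2, attn_weights_2.sum())       # weights sum to 1
    print(attn_weights_2 @ inputs)                    # context vector [0.4419, 0.6515, 0.5683]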

3.3.2

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])
Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
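Section 3.3.2 is the same computation for all six queries at once: one matrix multiply produces every pairwise score, a row-wise softmax normalizes each query's weights, and one more multiply yields all six context vectors. A sketch with the same inputs tensor:

    import torch

    inputs = torch.tensor([
        [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
        [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55],
    ])

    attn_scores = inputs @ inputs.T                    # all pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)  # normalize each row
    print(attn_weights.sum(dim=-1))                    # every row sums to 1
    context_vecs = attn_weights @ inputs               # one context vector per input
    print(context_vecs[1])                             # matches the earlier 2nd vector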

AI Homework 8

2024-05-23 AI Lab #8

3.4.1

tensor([1.4274, 1.1425])
keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
tensor(2.8430)
tensor([2.4696, 2.8430, 2.8140, 1.4756, 1.5018, 1.8638])
tensor([0.1912, 0.2490, 0.2440, 0.0947, 0.0965, 0.1246])
tensor([0.7687, 0.3144])
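Section 3.4 swaps the raw dot products for trainable projections: three weight matrices map each input into query, key, and value spaces, and the scores are divided by the square root of the key dimension before the softmax. A sketch for the second input (the numbers above depend on the random seed used for the weights, so mine will differ):

    import torch

    torch.manual_seed(123)  # assumption: the lab's actual seed may differ
    inputs = torch.tensor([
        [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
        [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55],
    ])
    d_in, d_out = 3, 2
    W_query = torch.rand(d_in, d_out)
    W_key = torch.rand(d_in, d_out)
    W_value = torch.rand(d_in, d_out)

    query_2 = inputs[1] @ W_query     # project the second input into query space
    keys = inputs @ W_key             # keys.shape:   torch.Size([6, 2])
    values = inputs @ W_value         # values.shape: torch.Size([6, 2])

    attn_scores_2 = query_2 @ keys.T  # unnormalized scores
    d_k = keys.shape[-1]
    attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)  # scaled softmax
    context_vec_2 = attn_weights_2 @ values
    print(context_vec_2)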

3.4.2

tensor([[0.7475, 0.3051],
        [0.7687, 0.3144],
        [0.7678, 0.3139],
        [0.7419, 0.3035],
        [0.7337, 0.2963],
        [0.7533, 0.3097]], grad_fn=<MmBackward0>)
tensor([[-0.0739,  0.0713],
        [-0.0748,  0.0703],
        [-0.0749,  0.0702],
        [-0.0760,  0.0685],
        [-0.0763,  0.0679],
        [-0.0754,  0.0693]], grad_fn=<MmBackward0>)

3.5.1

tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])
tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<MulBackward0>)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<DivBackward0>)
tensor([[0.2899,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.4656, 0.1723,   -inf,   -inf,   -inf,   -inf],
        [0.4594, 0.1703, 0.1731,   -inf,   -inf,   -inf],
        [0.2642, 0.1024, 0.1036, 0.0186,   -inf,   -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786,   -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]],
       grad_fn=<MaskedFillBackward0>)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]],
       grad_fn=<SoftmaxBackward0>)
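Of the two masking strategies printed above, the book keeps the second: fill everything above the diagonal of the score matrix with -inf before the softmax, so future tokens get exactly zero weight and each row still sums to 1 with no renormalization step. A self-contained sketch:

    import torch

    torch.manual_seed(123)
    inputs = torch.tensor([
        [0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64],
        [0.22, 0.58, 0.33], [0.77, 0.25, 0.10], [0.05, 0.80, 0.55],
    ])
    queries = inputs @ torch.rand(3, 2)
    keys = inputs @ torch.rand(3, 2)
    attn_scores = queries @ keys.T                     # [6, 6]

    context_length = attn_scores.shape[0]
    mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
    masked = attn_scores.masked_fill(mask.bool(), -torch.inf)  # -inf above the diagonal
    attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
    print(attn_weights)  # lower-triangular weights; each row sums to 1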

3.5.2

tensor([[2., 2., 2., 0., 2., 2.],
        [0., 0., 2., 2., 2., 2.],
        [2., 0., 0., 2., 0., 2.],
        [2., 0., 0., 2., 0., 2.],
        [0., 2., 2., 2., 2., 0.],
        [0., 2., 0., 2., 0., 0.]])
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7599, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.0000, 0.0000, 0.4638, 0.0000, 0.0000],
        [0.0000, 0.3966, 0.3968, 0.3775, 0.3941, 0.0000],
        [0.0000, 0.3327, 0.0000, 0.3084, 0.0000, 0.0000]],
       grad_fn=<MulBackward0>)
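The 3.5.2 tensors show dropout applied to the attention weights with p=0.5: about half the entries are zeroed, and the survivors are scaled by 1/(1-p) = 2, which is why a matrix of ones comes back as twos. A sketch:

    import torch

    torch.manual_seed(123)
    dropout = torch.nn.Dropout(0.5)  # training-mode dropout, p = 0.5
    example = torch.ones(6, 6)
    print(dropout(example))          # ~half zeroed, survivors scaled to 2.0

The same dropout module applied to the real attention weight matrix produces the second tensor above.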

3.5.3

torch.Size([2, 6, 3])
tensor([[[-0.1263, -0.4881],
         [-0.2705, -0.5097],
         [-0.3103, -0.5140],
         [-0.3055, -0.4562],
         [-0.2315, -0.4149],
         [-0.2854, -0.4101]],

        [[-0.1263, -0.4881],
         [-0.2705, -0.5097],
         [-0.3103, -0.5140],
         [-0.3055, -0.4562],
         [-0.2315, -0.4149],
         [-0.2854, -0.4101]]], grad_fn=<UnsafeViewBackward0>)
context_vecs.shape: torch.Size([2, 6, 2])

Causal Attention Class Parameter Values

In Section 3.5.3, there is a code listing for a compact self-attention class.

image
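Since the listing above is a screenshot, here is a sketch of the class for reference, following the book's CausalAttention. The line numbers in the questions below refer to the book's listing, which may not match this sketch line for line:

    import torch
    import torch.nn as nn

    class CausalAttention(nn.Module):
        def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
            super().__init__()
            self.d_out = d_out
            self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.dropout = nn.Dropout(dropout)
            # upper-triangular causal mask, registered so it moves with the module
            self.register_buffer(
                "mask",
                torch.triu(torch.ones(context_length, context_length), diagonal=1),
            )

        def forward(self, x):
            b, num_tokens, d_in = x.shape                 # the "line 16" unpacking
            keys = self.W_key(x)                          # [b, num_tokens, d_out]
            queries = self.W_query(x)
            values = self.W_value(x)
            attn_scores = queries @ keys.transpose(1, 2)  # [b, num_tokens, num_tokens]
            attn_scores.masked_fill_(
                self.mask.bool()[:num_tokens, :num_tokens], -torch.inf
            )
            attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
            attn_weights = self.dropout(attn_weights)
            return attn_weights @ values                  # context vectors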
  • On line 2 are the parameters to the constructor.
    • When we call this class later in the section, what exact numbers do we give for each parameter value?
      • d_in
      • d_out
      • context_length
      • dropout

My answer is:

  • d_in: 3
  • d_out: 2
  • context_length: 6 (batch.shape[1], the number of tokens in each input)
  • dropout: 0.0 (printing the module shows Dropout(p=0.0, inplace=False))
image

Give the values of the variables on line 16:

  • b: 2
  • num_tokens: 6
  • d_in: 3

Give the shape of the variables on lines 18-20. Each is torch.Size([2, 6, 2]); their values print as:

  • keys:
tensor([[[-0.1955,  0.4681],
         [-0.4527,  0.5057],
         [-0.4222,  0.4954],
         [-0.3403,  0.2727],
         [ 0.2456,  0.1702],
         [-0.6476,  0.3915]],

        [[-0.1955,  0.4681],
         [-0.4527,  0.5057],
         [-0.4222,  0.4954],
         [-0.3403,  0.2727],
         [ 0.2456,  0.1702],
         [-0.6476,  0.3915]]], grad_fn=<UnsafeViewBackward0>)
  • values:
tensor([[[-0.1263, -0.4881],
         [-0.4081, -0.5302],
         [-0.3888, -0.5227],
         [-0.2916, -0.2766],
         [ 0.0693, -0.2418],
         [-0.4905, -0.3688]],

        [[-0.1263, -0.4881],
         [-0.4081, -0.5302],
         [-0.3888, -0.5227],
         [-0.2916, -0.2766],
         [ 0.0693, -0.2418],
         [-0.4905, -0.3688]]], grad_fn=<UnsafeViewBackward0>)
  • queries:
tensor([[[ 0.2561,  0.3511],
         [-0.1963,  0.4377],
         [-0.1863,  0.4377],
         [-0.1927,  0.2190],
         [ 0.0465,  0.3153],
         [-0.2921,  0.2356]],

        [[ 0.2561,  0.3511],
         [-0.1963,  0.4377],
         [-0.1863,  0.4377],
         [-0.1927,  0.2190],
         [ 0.0465,  0.3153],
         [-0.2921,  0.2356]]], grad_fn=<UnsafeViewBackward0>)

Give the shape of the variable on line 22:

  • attn_scores, shape torch.Size([2, 6, 6]):
tensor([[[ 0.1143,  0.0616,  0.0658,  0.0086,  0.1226, -0.0284],
         [ 0.2433,  0.3102,  0.2997,  0.1862,  0.0263,  0.2985],
         [ 0.2413,  0.3057,  0.2955,  0.1828,  0.0287,  0.2920],
         [ 0.1402,  0.1980,  0.1899,  0.1253, -0.0101,  0.2106],
         [ 0.1385,  0.1384,  0.1366,  0.0701,  0.0651,  0.0933],
         [ 0.1674,  0.2514,  0.2401,  0.1637, -0.0316,  0.2814]],

        [[ 0.1143,  0.0616,  0.0658,  0.0086,  0.1226, -0.0284],
         [ 0.2433,  0.3102,  0.2997,  0.1862,  0.0263,  0.2985],
         [ 0.2413,  0.3057,  0.2955,  0.1828,  0.0287,  0.2920],
         [ 0.1402,  0.1980,  0.1899,  0.1253, -0.0101,  0.2106],
         [ 0.1385,  0.1384,  0.1366,  0.0701,  0.0651,  0.0933],
         [ 0.1674,  0.2514,  0.2401,  0.1637, -0.0316,  0.2814]]],
       grad_fn=<UnsafeViewBackward0>)

and line 25:

  • attn_weights, shape torch.Size([2, 6, 6]) (note: the tensor below, with its grad_fn=<MaskedFillBackward0> and -inf entries, is the masked scores captured just before the softmax):
tensor([[[ 0.1143,    -inf,    -inf,    -inf,    -inf,    -inf],
         [ 0.2433,  0.3102,    -inf,    -inf,    -inf,    -inf],
         [ 0.2413,  0.3057,  0.2955,    -inf,    -inf,    -inf],
         [ 0.1402,  0.1980,  0.1899,  0.1253,    -inf,    -inf],
         [ 0.1385,  0.1384,  0.1366,  0.0701,  0.0651,    -inf],
         [ 0.1674,  0.2514,  0.2401,  0.1637, -0.0316,  0.2814]],

        [[ 0.1143,    -inf,    -inf,    -inf,    -inf,    -inf],
         [ 0.2433,  0.3102,    -inf,    -inf,    -inf,    -inf],
         [ 0.2413,  0.3057,  0.2955,    -inf,    -inf,    -inf],
         [ 0.1402,  0.1980,  0.1899,  0.1253,    -inf,    -inf],
         [ 0.1385,  0.1384,  0.1366,  0.0701,  0.0651,    -inf],
         [ 0.1674,  0.2514,  0.2401,  0.1637, -0.0316,  0.2814]]],
       grad_fn=<MaskedFillBackward0>)

Use Python3's print function in your 3_5_3_causal_class.py to verify your answers, and copy and paste the output into your dev diary.

Exercise 3.2 Returning 2-dimensional embedding vectors

In Section 3.5.4, when you are working on multi-headed attention:

Change the input arguments for the MultiHeadAttentionWrapper(..., num_heads=2) call such that the output context vectors are 2-dimensional instead of 4-dimensional while keeping the setting num_heads=2. Hint: You don't have to modify the class implementation; you just have to change one of the other input arguments.

Run and test your work in 3_4_3_multi_head.py.
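My understanding of the one-argument change: the wrapper concatenates one d_out-sized context vector per head, so halving d_out from 2 to 1 makes the concatenated output 2-dimensional while keeping num_heads=2. A sketch, assuming the book's MultiHeadAttentionWrapper and the stacked batch tensor from 3.5.3:

    import torch
    # assumes the book's MultiHeadAttentionWrapper class and the `batch`
    # tensor (torch.stack((inputs, inputs), dim=0), shape [2, 6, 3]) from 3.5.3
    torch.manual_seed(123)
    d_in, d_out = 3, 1  # d_out was 2; halving it is the only change needed
    context_length = batch.shape[1]
    mha = MultiHeadAttentionWrapper(d_in, d_out, context_length, 0.0, num_heads=2)
    context_vecs = mha(batch)
    print(context_vecs.shape)  # torch.Size([2, 6, 2]): two heads x d_out=1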

3.4.3 (6.1)

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])

Human Writing

Read pages 1-7 and the last paragraph on page 8 of Ethics and Second-Order Cybernetics.

In your dev diary, write an essay responding to one or more of the following prompts:

  • What is cybernetics?
  • What is the relationship of cybernetics to artificial intelligence?
  • What is ethics...
    • ...in your personal understanding before reading this essay?
    • ...as described in this essay?
    • As far as your personal view is similar to or different from the meaning of ethics implied in the essay, say more about how your view developed
  • What is the cybernetics of cybernetics?
  • What kinds of non-verbal experiences is it possible to have with a large language model that is focused mostly on verbal interactions?
    • For non-verbal experiences which are currently impossible to have with an LLM, do you think it will ever be possible, and how will you know?
  • How can cybernetics affect artificial intelligence, in particular language interactions and recent LLM progress?
  • How can artificial intelligence, in particular LLM progress, affect cybernetics?

Response

Cybernetics is the study of systems in machines and living organisms. When we study how things work, we combine principles from engineering, computer science, and biology. Using this knowledge we can learn how systems work and how decisions are made based on feedback in each given situation. The reading explains that a theory of the brain must be written by a brain, and for the theory to feel complete, the brain must account for its own writing of it. By entering their own space, or “domain” as the essay calls it, cyberneticians must account for their own abilities, exposing parameters and descriptions that were initially hidden. That is when cybernetics becomes second-order cybernetics: the observer is now also the observed, under an ever larger microscope, making the universe seem extremely large, even infinite. It is important to focus on our role in the universe, as we must see the bigger picture to find our way. By observing and describing a system we directly influence the system itself.

Consider the question of whether the chicken or the egg came first: if there is a chicken, is the possibility that there was an egg 100%? Does the chicken know it came from the egg? The egg seems like the only viable option, for either it came first, or the chicken would know it came from that egg. Otherwise, the chicken would not know what an egg is, since it did not come from one, or at least never saw what it came out of.

The idea is that feedback loops take an output and feed it back in as the next input, creating a feeling of activeness or aliveness in a system as work passes from one part to the next. It is crucial to manage these systems with a control mechanism that uses inputs to alter the system's behavior. These inputs can have positive or negative effects on the system depending on the desired output. Through signaling and learning, the system can communicate globally to identify changes and alter its behavior based on those inputs.

Applying cybernetics to cybernetics itself, we can use principles from the field to identify higher-level concepts. This helps us improve research in cybernetics, broadening what it means to study systems and organisms and bridging the gap between nature and technology.

It is very important to focus on ethics in anything related to computer science, because the future of technology is held in each person's hands. By learning and creating together we can build the greater future we envision, with improved systems. When designing new systems, you should always keep others' personal privacy in mind, as well as the impact using the system might have on them. Keeping users' data secure and staying transparent about how the system works will increase the usability and lifetime of the product.


AI Lab 9

2024-05-30 AI Lab #9

Read Chapter 5 of the Raschka book on LLMs, starting from the beginning and up to and including Section 5.

Your goal is to adapt the files gpt_train.py and previous_chapters.py to measure

  • Validation and training loss on your chosen dataset, reading Section 5.1
    • Compare your validation and training loss to that shown in class yesterday.
    • Does your validation loss remain higher than training loss? How can we close this gap?
  • Read the section "5.4 Loading and saving model weights in PyTorch" to make sure you don't have to train each time
    • you may wish to have a command-line argument, or copy and paste a new python script that loads a model for inference instead of training it
  • Adapt the Python code so you can pass it a command-line argument seed phrase to start a conversation
    • so you don't have to use the hard-coded phrase "Every effort moves you" (a sketch follows this list)
  • Read Section "5.5 Loading pretrained weights from OpenAI" and try chatting with both the OpenAI model and your own model.
    • How are they different or the same?
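A minimal sketch of the loading and seed-phrase adaptations (the flag names are hypothetical; it assumes the book's GPTModel and generate_text_simple are importable from previous_chapters.py, and that a checkpoint was saved earlier with torch.save(model.state_dict(), "model.pth")):

    import argparse
    import tiktoken
    import torch
    from previous_chapters import GPTModel, generate_text_simple

    GPT_CONFIG_124M = {  # the shortened-context config used in chapter 5
        "vocab_size": 50257, "context_length": 256, "emb_dim": 768,
        "n_heads": 12, "n_layers": 12, "drop_rate": 0.1, "qkv_bias": False,
    }

    def text_to_token_ids(text, tokenizer):
        encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        return torch.tensor(encoded).unsqueeze(0)  # add a batch dimension

    def token_ids_to_text(token_ids, tokenizer):
        return tokenizer.decode(token_ids.squeeze(0).tolist())

    parser = argparse.ArgumentParser()
    parser.add_argument("--seed-phrase", default="Every effort moves you",
                        help="starting text, replacing the hard-coded phrase")
    parser.add_argument("--checkpoint", default="model.pth",
                        help="weights saved earlier, so no retraining (Section 5.4)")
    args = parser.parse_args()

    tokenizer = tiktoken.get_encoding("gpt2")
    model = GPTModel(GPT_CONFIG_124M)
    model.load_state_dict(torch.load(args.checkpoint))  # restore instead of train
    model.eval()

    token_ids = generate_text_simple(
        model=model,
        idx=text_to_token_ids(args.seed_phrase, tokenizer),
        max_new_tokens=25,
        context_size=GPT_CONFIG_124M["context_length"],
    )
    print(token_ids_to_text(token_ids, tokenizer))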

The validation loss and training loss are different from class yesterday because the training set is different. The training loss should keep decreasing while the validation loss levels out at some average value. Both start at about 12, which is close to where they started in class. After two more steps, the validation loss is slightly larger than the training loss. Both numbers slowly decrease, occasionally drifting slightly apart and then becoming almost the same number again. There is not much of a gap for me; if one did open up, the usual remedies would be more training data, regularization such as dropout, or stopping training earlier.

I did not get to chat with my model.





Operating Systems

Threads

Ran p1.c, p2.c, p3.c, p4.c - Introduction to fork() and exec()

Introduction to pipe()


Schedules

Ran scheduler.py - Introduction to response time and turnaround time

Ran mlfq.py - Introduction to MLFQ scheduler


Memory

Ran free and pmap commands - Introduction to displaying memory information and memory usage


Concurrency

Locks vs. mutexes vs. semaphores?

  • Test-and-set: atomically check and set a flag. If the old value was 0, the lock was free, so set it to 1 and proceed; if the old value was 1, it is already locked, so the thread knows to wait and retry.

  • Compare-and-swap: compare the value in memory to an expected value (0 for unlocked); if they match, swap in the new value (1) and take the lock, otherwise do nothing because it is already locked.

  • Load-linked / store-conditional: can be used for locking or other things. lock(mem) does a load-linked into r1; if the value is 0, attempt a store-conditional of 1, otherwise it's locked, so just sleep and retry. The load is linked to the stored memory address, so if anything else writes there in between, the store-conditional fails.

Locks can be expensive in time: with a big data structure and lots of threads reading from or updating it, you can end up needing a lot of locks.

  • Fetch-and-add: atomically get a value from memory, add to it (e.g. 1), and store the result back.

  • Ticket lock: avoids starvation. With a plain lock, when a lot of threads are trying to enter, which one gets access is a random hit or miss. The ticket system is round-robin: each thread takes a ticket, and when you release the lock, the next ticket in line gets it (see the sketch after these notes).

  • Yield: give up the CPU instead of spinning; a thread can yield when it finds the lock on the memory it needs is held.

  • Transactional memory: the memory manager does the work. Make a group of operations a transaction, and it either completes or undoes what it has done if it fails.

Sleep vs. spin: sleeping until woken vs. retrying over and over in a loop.
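Here is a runnable sketch of the ticket-lock idea in Python. A plain Lock stands in for the atomic fetch-and-add instruction a real implementation would use, and CPython's GIL makes the unguarded reads safe enough for this demo:

    import threading

    class TicketLock:
        def __init__(self):
            self._ticket_guard = threading.Lock()  # stands in for atomic fetch-and-add
            self._next_ticket = 0                  # next ticket to hand out
            self._now_serving = 0                  # ticket currently allowed in

        def acquire(self):
            with self._ticket_guard:               # "fetch-and-add": take a ticket
                my_ticket = self._next_ticket
                self._next_ticket += 1
            while self._now_serving != my_ticket:  # spin (a real lock might yield here)
                pass

        def release(self):
            self._now_serving += 1                 # round-robin: next ticket's turn

    lock, counter = TicketLock(), 0

    def worker():
        global counter
        for _ in range(1000):
            lock.acquire()
            counter += 1                           # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                                 # always 4000: FIFO order, no starvation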


Condition Variables

Condition variables and barriers: a parent waits on its children, i.e. for threads to complete their work or for a condition to become true.

Local counter variable vs. a global counter: when the local counter reaches a certain threshold, copy its value into the global counter (this keeps threads from contending on the shared counter at every update).
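A sketch of the parent-waits-for-child pattern with a condition variable in Python (threading.Condition pairs a lock with wait/notify; the while loop re-checks the condition after every wakeup, as the class examples stress):

    import threading

    done = False
    cond = threading.Condition()

    def child():
        global done
        print("child")
        with cond:            # lock, update shared state, wake the waiter
            done = True
            cond.notify()

    print("parent: begin")
    threading.Thread(target=child).start()
    with cond:
        while not done:       # always wait in a loop, never on a bare if
            cond.wait()
    print("parent: end")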


Deadlock

Classwork and Lab

Ran main-deadlock.c, main-race.c, and main-signal.c

Notes

The 3 most common types of concurrency bugs:

  • Violation of an atomicity assumption
  • Violation of an ordering assumption
  • Deadlock

All of these create the possibility of deadlock.

Conditions for deadlock (all four must hold at once):

Mutual exclusion: threads claim exclusive control of the resources they use (e.g. locks).

Hold-and-wait: a thread holds one resource while waiting for another that someone else has. Prevent it by acquiring all of your locks at once, up front.

No preemption: a lock cannot be forcibly taken away from the thread that holds it. Work around it with a trylock: if you can't get lock 2, unlock lock 1 and retry.

Circular wait: a circular chain of threads, each waiting on a resource the next thread in the chain has already acquired. Prevent it by imposing a partial (or total) ordering on lock acquisition.

Deadlock vs livelock

Livelock: multiple threads can execute without blocking indefinitely (the deadlock case), but the system as a whole is unable to proceed because of a repeated pattern of resource contention.

Preventing any one of these four conditions prevents deadlock.


Semaphores

Added semaphores and ran fork-join.c, rendezvous.c, and barrier.c
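For reference, the fork-join pattern from fork-join.c translated into a Python sketch: a semaphore initialized to 0 makes the parent's first acquire block until the child signals.

    import threading

    sem = threading.Semaphore(0)   # starts at 0, so acquire() blocks immediately

    def child():
        print("child")
        sem.release()              # signal: child is done

    print("parent: begin")
    threading.Thread(target=child).start()
    sem.acquire()                  # wait for the child's signal
    print("parent: end")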





Extras

Commands

* Terminal *

    - cargo build
    - cargo run
    - sudo chown -R gitpod:gitpod /opt/.cargo/

* Git *

    git status
    git add .    # or * for all files
    git commit -m "<description of your changes>"

    # Checkout and create a new branch for your work
    git checkout -b <name>-<date>
    git status
    git add <changed_or_new_files>
    git commit -m "<describe changes>"
    git push -u origin <name>-<date>




⬆️ Back to top
