
DEEP LEARNING HOMEWORK 3 RESPONSES

1. What is connectionism and the distributed representation approach? How does it relate to the MNIST classification of learning the idea of a circle, regardless of whether it is the top of the digit 9 or the top/bottom of the digit 8?

"Connectionism" is a term that was used for deep learning in the 80s and 90s. It was popularized to describe a type of deep learning built around the distributed representation approach. Under this approach, the network learns sets of features that together indicate the output, instead of learning each output directly.

Although the 10 target nodes in the MNIST network do represent the 10 outputs, the intermediate layers have more to do with the features they extract, like the loops found in 6, 8, 9, or 0, or the horizontal lines in 2, 4, or 7.

2. What are some factors that have led to recent progress in the ability of deep learning to mimic intelligent human tasks?

Deep learning has been around for a long time and has developed steadily throughout its history. Those advances helped deep learning reach its current level of popularity; however, I'd speculate that the recent AI boom has the most to do with the decreased cost of, and increased access to, powerful CPUs and GPUs, which allow for larger datasets, larger models, and shorter training periods.

3. How many neurons are in the average human brain, versus the number of simulated neurons in the biggest AI supercomputer described in the book chapter? Now in the year 2024, how many neurons can the biggest supercomputer simulate? (You may use a search engine or an AI chat itself to speculate.)

According to figure 1.11, the biggest AI supercomputer simulated between 10⁶ and 10⁷ neurons. The human brain was listed as having about 10¹⁰. Supercomputers today can simulate about 1.15 billion, or a little more than 10⁹, neurons.

CHAPTER 2 GRADIENT DESCENT RESPONSES

4. Let's say you are training your neural network on pairs (x, y), where x is a training datapoint (an image in MNIST) and y is the correct label for x. Why does the neural network, before you've trained it on the first input, output "trash", or something that is very far from the corresponding y?

An untrained neural network is initialized with randomized weight values, so on the first run its output is arbitrary. As more epochs or batches run through the network and the weights are updated to reduce the error between its outputs and the labels, the outputs will more closely match the labels.
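
As a minimal sketch of why this happens (assuming a single sigmoid layer; this is illustrative, not the course's network.py):

```python
import numpy as np

# Randomly initialized weights and biases, as in an untrained network.
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 784))   # weights start as random noise
b = rng.standard_normal((10, 1))     # biases start as random noise too

x = rng.random((784, 1))             # stand-in for a flattened MNIST image

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

output = sigmoid(W @ x + b)
print(output.ravel())  # 10 arbitrary activations: "trash" until training
```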

GOOGLE COLAB NOTEBOOK RESPONSES

5. If you have a NumPy array that looks like the following, give its shape as a tuple of maximum dimensions along each axis. For example, (p, q, r) is a tensor of third rank, with p along the first dimension (which "2D layer"), q "rows" in each "2D layer", and r "columns" in each "row".

[[[1,2,3,4], [5,6,7,8], [9,10,11,12]]]

Answer: (1, 3, 4)
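
NumPy will confirm this directly (a quick check, assuming the array above):

```python
import numpy as np

a = np.array([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]])
print(a.shape)  # (1, 3, 4): one "2D layer", 3 rows, 4 columns
```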
6. Assume your neural network in network.py is created with the following layers: net = Network([7,13,2]). What is the shape of the self.weights and self.biases members in the constructor of Network? Use the same shape notation as in the previous question.

I'm glad you asked this question because I always get confused by this... In network.py each weight matrix has shape (neurons in the next layer, neurons in the previous layer), so that np.dot(w, a) maps one layer's activations to the next. self.weights: layer 1 (between a0 and a1): (13, 7); layer 2 (between a1 and a2): (2, 13). self.biases: a1: (13, 1); a2: (2, 1).
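
A sketch of how the constructor builds these lists (following Nielsen-style network.py; the exact variable names may differ):

```python
import numpy as np

sizes = [7, 13, 2]
# One bias column vector per non-input layer.
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# One (next_layer, prev_layer) weight matrix per pair of adjacent layers.
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

print([b.shape for b in biases])   # [(13, 1), (2, 1)]
print([w.shape for w in weights])  # [(13, 7), (2, 13)]
```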
7. From the notes, answer this question ordering the given outputs in terms of the cost function they give out.

Answer (ordered from lowest to highest cost): B, A, C, D

- B: No error-cost, because all outputs are marked correctly.
- A: Some error-cost, because 4 is not marked. The rest of the outputs are marked correctly.
- C: More error-cost, because 0-3 and 5-9 are marked (but only partially). This error-cost is reduced by 4 also being shaded.
- D: Lots of error-cost. Dead wrong for every output marker.
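
A hypothetical illustration of this ranking with a quadratic cost (the actual outputs A-D are in the course notes; the vectors below are made up to mimic their descriptions):

```python
import numpy as np

target = np.zeros(10)
target[4] = 1.0               # correct label: the digit 4

B = target.copy()             # exactly right
A = np.zeros(10)              # 4 not marked, everything else correct
C = np.full(10, 0.5)          # every marker half-shaded, including 4
D = 1.0 - target              # dead wrong for every marker

def quadratic_cost(output):
    return 0.5 * np.sum((output - target) ** 2)

for name, out in [("B", B), ("A", A), ("C", C), ("D", D)]:
    print(name, quadratic_cost(out))  # 0.0, 0.5, 1.25, 5.0: costs increase
```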
8. What is the effect of changing the learning rate (the Greek letter "eta") in training with SGD?

Eta is the step size of each update. In our hiking example, this translated literally to the size of each step we took. With larger steps, we might reach the top of the hill more quickly, but we also have the potential to wander off in the wrong direction. How would we be affected if eta were large and we took bigger steps? Imagine the mounds were much smaller: we might mistakenly step over our highest point. What if eta were smaller and we took half-steps or quarter-steps? Now imagine the mounds were bigger: it would take us a long time to reach the top of a single mound, and if we did, we might not have time to explore another (possibly taller!) mound before the epoch countdown ran out.
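
A toy sketch of the same effect in code, minimizing f(w) = w² with the update w ← w − eta · f′(w) (the names here are illustrative, not from network.py):

```python
def descend(eta, steps=10, w=1.0):
    # Gradient of f(w) = w**2 is 2*w; each step moves by eta times that.
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(descend(eta=0.1))   # small eta: slow, steady progress toward 0
print(descend(eta=0.9))   # large eta: overshoots, oscillating across 0
print(descend(eta=1.1))   # too large: each step lands farther away (diverges)
```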

RASCHKA'S HIKER ANALOGY

9. Why is the word "stochastic" in the name "stochastic gradient descent", and how is it different than normal gradient descent?

In the hiking activity on Thursday, students in groups of 3 represented training batches of that size. After the three students selected a direction, using the angles of their feet to sense a local maximum, another group of students would switch in and calculate the next step.

Gradient descent describes the practice of finding minima to reduce error costs. The example above represents a particular type of gradient descent: stochastic gradient descent. SGD works by iterating through small training batches, and "stochastic" refers to the randomness involved: each batch is a random sample of the training data, so each step is based on that sample rather than on the whole dataset. This method produces a less smooth path toward a local minimum, because each group is different; they have different-sized feet, or different cooperation patterns, producing different results.

In the hiking example, to produce the smooth results of standard (full-batch) gradient descent, the entire class would stand on the hill and try to work together to determine the direction of the next step. You can imagine that coordinating would take longer, and this is similar to the additional computational resources a batch of that size would require. The average of our many conflicting viewpoints would produce a smooth line, because the average would result in a conservative view. This could be beneficial, but it could also inhibit the exploration that might lead to an accurate answer more quickly, before the epoch limit was reached. A sketch of the shuffle-and-batch loop appears below.
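
A sketch of the "stochastic" part of the loop (mirroring the spirit of network.py's SGD method; update_mini_batch here is a placeholder for the actual weight update):

```python
import random

def sgd_epochs(training_data, epochs, mini_batch_size, update_mini_batch):
    for epoch in range(epochs):
        random.shuffle(training_data)      # the randomness = "stochastic"
        mini_batches = [
            training_data[k:k + mini_batch_size]
            for k in range(0, len(training_data), mini_batch_size)
        ]
        for mini_batch in mini_batches:    # one noisy step per batch
            update_mini_batch(mini_batch)
```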
