AI‐24sp‐2024‐04‐17‐Morning
Back to AI Self-Hosting Spring 2024
Gradient Descent
3Blue1Brown Gradient Descent Notes
We'll each create Dev Diary entries describing our training runs as the network explores the cost landscape via gradient descent.
Review
- We downloaded MNIST handwritten digit data
- We loaded and formatted it in NumPy arrays
- We coded a simple feedforward neural network
Models as Probabilistic Compiled Output of Training
- Each model is just a collection of parameters that can be trained, then run as part of inference
- For neural networks, these parameters are weights and biases
- For other machine learning algorithms, they may be different
- e.g. Support Vector Machines (SVMs) or Nearest-Neighbor Clustering, which can be used to classify spam
- Generative Pre-trained Transformers (GPTs)
- How many parameters for the MNIST neural network from 3blue1brown?
- How many parameters for the Mike Nielsen code that we are typing in lab?
- Are they different?
- If each weight or bias is a 64-bit floating point number, how big should a model file be? (See the sketch below.)
- How many parameters are on some popular HuggingFace models?
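As a rough check on the parameter-count questions above, here is a minimal sketch. It assumes the 3Blue1Brown architecture (784-16-16-10, two hidden layers of 16 neurons) and Nielsen's book default (784-30-10); both counts cover a fully connected network's weights and biases:

```python
def count_params(sizes):
    """Total weights + biases for a fully connected network
    with the given layer sizes."""
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

# 3Blue1Brown's example network: two hidden layers of 16 neurons
print(count_params([784, 16, 16, 10]))   # 13,002 parameters
# Nielsen's default from the book: one hidden layer of 30 neurons
print(count_params([784, 30, 10]))       # 23,860 parameters

# At 64 bits (8 bytes) per parameter, the raw model file sizes:
print(13_002 * 8)   # 104,016 bytes, about 104 KB
print(23_860 * 8)   # 190,880 bytes, about 191 KB
```

So the two networks are different, and both are tiny: popular HuggingFace models typically list parameter counts (millions to billions) on their model cards, and the same arithmetic scales up.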
Cost Function
- We can calculate the cost (or "error") for a single training example (datapoint) by
- Presenting the input data to the input layer (it's a 784-pixel image for MNIST)
- Calculating each next layer's activations from the previous layer's, using the weights and adding the biases
- At the output layer, comparing it to our label
- But there are many, many (60,000) training examples.
- What does the value of the cost function for each one mean?
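To make the per-example cost concrete, here is a minimal sketch in the style of Nielsen's code: a sigmoid feedforward pass followed by the quadratic cost. The random weights and biases below are untrained stand-ins, not a real model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Propagate input a through the network: weighted sum,
    plus bias, then the sigmoid activation, layer by layer."""
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return a

# Illustrative: random parameters for a 784-30-10 network
sizes = [784, 30, 10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(y, 1) for y in sizes[1:]]

x = np.random.rand(784, 1)         # stand-in for one MNIST image
y = np.zeros((10, 1)); y[3] = 1.0  # one-hot label for the digit 3

# Cost for this single training example: half the squared distance
# between the 10-vector output and the one-hot label
output = feedforward(x, weights, biases)
print(0.5 * np.sum((output - y) ** 2))
```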
Hyperparameters
- Learning rate $\eta$
- how quickly we move around (explore) the "cost function landscape"
- Mini batch size
- how many datapoints' gradients we average at each step of SGD (stochastic gradient descent)
- Number of epochs
- how many times we run through all the datapoints, in chunks of mini-batches (all three hyperparameters appear in the SGD call sketched below)
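In Nielsen's network.py, all three hyperparameters are passed straight into the training call; this is essentially the invocation from chapter 1 of the book:

```python
import mnist_loader
import network

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

net = network.Network([784, 30, 10])
# epochs=30, mini_batch_size=10, learning rate eta=3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

Changing one of the three values and re-running is the quickest way to get a feel for what each hyperparameter does.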
Data and Labels
- For supervised learning: training data looks like tuples of (data, label) (see the loader sketch below)
- What is the difference between training data and test data?
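Those tuples are exactly what Nielsen's mnist_loader returns. One format detail worth knowing: training labels are one-hot 10-vectors, while validation and test labels are plain integer digits:

```python
import mnist_loader

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()
# Python 3 ports return iterators; make them indexable
training_data, test_data = list(training_data), list(test_data)

x, y = training_data[0]
print(x.shape)   # (784, 1) -- one flattened 28x28 image
print(y.shape)   # (10, 1)  -- one-hot label vector

x_t, y_t = test_data[0]
print(y_t)       # an integer 0-9, not a one-hot vector
```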
In-Class Activity
https://github.com/TheEvergreenStateCollege/upper-division-cs/wiki/AI%E2%80%90Homework%E2%80%9003
Our Goal in Lab on Thursday
In teams or solo:
- Do a complete training run on MNIST data
- Save our model parameters as a file (Python pickle; see the sketch after this list)
- Load it again, so we don't have to train each time, and classify some MNIST digits from the validation set
- Exchange it with another team and compare our files.
- Decide what to pass along about the model's training
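A minimal sketch of the save/load step, assuming a trained `net` from Nielsen's Network class is in scope (the filename and dictionary keys here are arbitrary choices):

```python
import pickle
import numpy as np
import network

# Save just the learned parameters (assumes a trained `net`)
with open("mnist_model.pkl", "wb") as f:
    pickle.dump({"sizes": net.sizes,
                 "weights": net.weights,
                 "biases": net.biases}, f)

# Later, or on another team's machine: load and rebuild the network
with open("mnist_model.pkl", "rb") as f:
    params = pickle.load(f)
net2 = network.Network(params["sizes"])
net2.weights = params["weights"]
net2.biases = params["biases"]

# Classify one validation digit: the prediction is the index of the
# largest output activation (assumes `validation_data` from the
# loader; in Python 3 ports, wrap the loader's zips in list() first)
x, y = validation_data[0]
print("predicted:", np.argmax(net2.feedforward(x)), "label:", y)
```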
Grassy Knoll Exercise
Questions
- What are some cost functions that would compare two vectors of size 10 and produce 1 number? (Two candidates are sketched below.)
- An epoch is one pass through all the datapoints, minimizing the cost function as we go. Why isn't one epoch enough?
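Two common candidates for the cost-function question above, sketched for a pair of 10-vectors: the quadratic cost (what the chapter 1 code uses) and the cross-entropy cost (introduced later in Nielsen's book):

```python
import numpy as np

def quadratic_cost(a, y):
    """Half the squared Euclidean distance between output a and label y."""
    return 0.5 * np.sum((a - y) ** 2)

def cross_entropy_cost(a, y):
    """Cross-entropy between label y and output activations a in (0, 1)."""
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

a = np.full((10, 1), 0.1)          # a vague network output
y = np.zeros((10, 1)); y[7] = 1.0  # one-hot label for the digit 7
print(quadratic_cost(a, y))        # 0.45 -- one number for the error
print(cross_entropy_cost(a, y))    # ~3.25
```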
Post Grassy Knoll Exercise
3) What if you choose multiple starting points? (e.g. one in each corner and one in the middle)
- MNIST images are square, but the cost function landscape is not the same shape as the input image: it exists in a space of more than 784 dimensions. What is a "corner" in such a space?
- What if we choose random sets of random locations? Does this help?
What if different mini-batches start at different starting points?
Save the best location (weights/biases) so far, and continue exploring.
Best accuracy so far: 92%
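A hedged sketch of the multiple-starting-points idea: each fresh Network() is a new random location in the landscape, so we can run several short training attempts and keep the best weights/biases found so far. The helper function name here is illustrative, and a Python 3 port of Nielsen's network.py is assumed:

```python
import copy
import numpy as np
import mnist_loader
import network

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()
training_data = list(training_data)
validation_data = list(validation_data)

def validation_accuracy(net, data):
    """Fraction of validation digits the network classifies correctly."""
    results = [(np.argmax(net.feedforward(x)), y) for x, y in data]
    return sum(int(pred == y) for pred, y in results) / len(data)

best_net, best_acc = None, 0.0
for restart in range(5):
    # Each Network() call draws fresh random weights and biases,
    # i.e. a new random starting point in the cost landscape.
    net = network.Network([784, 30, 10])
    net.SGD(training_data, 5, 10, 3.0)  # a short run per restart
    acc = validation_accuracy(net, validation_data)
    if acc > best_acc:
        best_net, best_acc = copy.deepcopy(net), acc
    print(f"restart {restart}: {acc:.1%} (best so far {best_acc:.1%})")
```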