AI‐24sp‐2024‐04‐17‐Morning
Back to AI Self-Hosting Spring 2024
Gradient Descent
3Blue1Brown Gradient Descent Notes
We'll each create Dev Diary entries describing our training runs as the network explores the cost landscape via gradient descent.
Review
- We downloaded MNIST handwritten digit data
- We loaded and formatted it in NumPy arrays
- We coded a simple feedforward neural network
Models as Probabilistic Compiled Output of Training
- Each model is just a collection of parameters that can be trained, then run as part of inference
- For neural networks, these parameters are weights and biases
- For other machine learning algorithms, they may be different
- e.g. Support Vector Machines (SVMs) or Nearest-Neighbor Clustering, which can be used to classify spam
- Generative Pre-trained Transformers (GPTs)
- How many parameters for the MNIST neural network from 3blue1brown?
- How many parameters for the Mike Nielsen code that we are typing in lab?
- Are they different?
- If each weight or bias is a 64-bit floating point number, how big should a model file be? (See the sketch below.)
- How many parameters are on some popular HuggingFace models?
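As a rough check on the parameter-count questions above, here is a minimal sketch. It assumes the 3Blue1Brown architecture (784-16-16-10, two hidden layers of 16 neurons) and Nielsen's book default (784-30-10); both counts cover a fully connected network's weights and biases:

```python
def count_params(sizes):
    """Total weights + biases for a fully connected network
    with the given layer sizes."""
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

# 3Blue1Brown's example network: two hidden layers of 16 neurons
print(count_params([784, 16, 16, 10]))   # 13,002 parameters
# Nielsen's default from the book: one hidden layer of 30 neurons
print(count_params([784, 30, 10]))       # 23,860 parameters

# At 64 bits (8 bytes) per parameter, the raw model file sizes:
print(13_002 * 8)   # 104,016 bytes, about 104 KB
print(23_860 * 8)   # 190,880 bytes, about 191 KB
```

So the two networks are different, and both are tiny: popular HuggingFace models typically list parameter counts (millions to billions) on their model cards, and the same arithmetic scales up.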
Cost Function
- We can calculate the cost (or "error") for a single training example (datapoint) by
- Presenting the input data to the input layer (it's a 784-pixel image for MNIST)
- Calculating each next layer's activations from the previous layer's, using the weights and adding the biases
- At the output layer, comparing it to our label
- But there are many, many (60,000) training examples.
- What does the value of the cost function for each one mean?
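To make the per-example cost concrete, here is a minimal sketch in the style of Nielsen's code: a sigmoid feedforward pass followed by the quadratic cost. The random weights and biases below are untrained stand-ins, not a real model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Propagate input a through the network: weighted sum,
    plus bias, then the sigmoid activation, layer by layer."""
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return a

# Illustrative: random parameters for a 784-30-10 network
sizes = [784, 30, 10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(y, 1) for y in sizes[1:]]

x = np.random.rand(784, 1)         # stand-in for one MNIST image
y = np.zeros((10, 1)); y[3] = 1.0  # one-hot label for the digit 3

# Cost for this single training example: half the squared distance
# between the 10-vector output and the one-hot label
output = feedforward(x, weights, biases)
print(0.5 * np.sum((output - y) ** 2))
```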
Hyperparameters
- Learning rate $\eta$
- how quickly we move around (explore) the "cost function landscape"
- Mini batch size
- how many datapoints' gradients we average at each step of SGD (stochastic gradient descent)
- Number of epochs
- how many times we run through all the datapoints, in chunks of mini-batches (all three hyperparameters appear in the SGD call sketched below)
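In Nielsen's network.py, all three hyperparameters are passed straight into the training call; this is essentially the invocation from chapter 1 of the book:

```python
import mnist_loader
import network

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()

net = network.Network([784, 30, 10])
# epochs=30, mini_batch_size=10, learning rate eta=3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```

Changing one of the three values and re-running is the quickest way to get a feel for what each hyperparameter does.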
Data and Labels
- For supervised learning: training data looks like tuples of (data, label) (see the loader sketch below)
- What is the difference between training data and test data?
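Those tuples are exactly what Nielsen's mnist_loader returns. One format detail worth knowing: training labels are one-hot 10-vectors, while validation and test labels are plain integer digits:

```python
import mnist_loader

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()
# Python 3 ports return iterators; make them indexable
training_data, test_data = list(training_data), list(test_data)

x, y = training_data[0]
print(x.shape)   # (784, 1) -- one flattened 28x28 image
print(y.shape)   # (10, 1)  -- one-hot label vector

x_t, y_t = test_data[0]
print(y_t)       # an integer 0-9, not a one-hot vector
```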
In-Class Activity
https://github.com/TheEvergreenStateCollege/upper-division-cs/wiki/AI%E2%80%90Homework%E2%80%9003
Our Goal in Lab on Thursday
In teams or solo:
- Do a complete training run on MNIST data
- Save our model parameters as a file (Python pickle; see the sketch after this list)
- Load it again, so we don't have to train each time, and classify some MNIST digits from the validation set
- Exchange it with another team and compare our files.
- Decide what to pass along about the model's training
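A minimal sketch of the save/load step, assuming a trained `net` from Nielsen's Network class is in scope (the filename and dictionary keys here are arbitrary choices):

```python
import pickle
import numpy as np
import network

# Save just the learned parameters (assumes a trained `net`)
with open("mnist_model.pkl", "wb") as f:
    pickle.dump({"sizes": net.sizes,
                 "weights": net.weights,
                 "biases": net.biases}, f)

# Later, or on another team's machine: load and rebuild the network
with open("mnist_model.pkl", "rb") as f:
    params = pickle.load(f)
net2 = network.Network(params["sizes"])
net2.weights = params["weights"]
net2.biases = params["biases"]

# Classify one validation digit: the prediction is the index of the
# largest output activation (assumes `validation_data` from the
# loader; in Python 3 ports, wrap the loader's zips in list() first)
x, y = validation_data[0]
print("predicted:", np.argmax(net2.feedforward(x)), "label:", y)
```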
Grassy Knoll Exercise
Questions
- What are some cost functions that would compare two vectors of size 10 and produce 1 number? (Two candidates are sketched below.)
- An epoch is one pass through all the datapoints, minimizing the cost function as we go. Why isn't one epoch enough?
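Two common candidates for the cost-function question above, sketched for a pair of 10-vectors: the quadratic cost (what the chapter 1 code uses) and the cross-entropy cost (introduced later in Nielsen's book):

```python
import numpy as np

def quadratic_cost(a, y):
    """Half the squared Euclidean distance between output a and label y."""
    return 0.5 * np.sum((a - y) ** 2)

def cross_entropy_cost(a, y):
    """Cross-entropy between label y and output activations a in (0, 1)."""
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

a = np.full((10, 1), 0.1)          # a vague network output
y = np.zeros((10, 1)); y[7] = 1.0  # one-hot label for the digit 7
print(quadratic_cost(a, y))        # 0.45 -- one number for the error
print(cross_entropy_cost(a, y))    # ~3.25
```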
Post Grassy Knoll Exercise
3) What if you choose multiple starting points? (e.g. one in each corner and one in the middle)
- MNIST images are square, but the cost function landscape is not the same shape as the input image: it exists in a space of more than 784 dimensions. What is a "corner" in such a space?
- What if we choose random sets of random locations? Does this help?
What if different mini-batches start at different starting points?
Save the best location (weights/biases) so far, and continue exploring.
Best accuracy so far: 92%
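A hedged sketch of the multiple-starting-points idea: each fresh Network() is a new random location in the landscape, so we can run several short training attempts and keep the best weights/biases found so far. The helper function name here is illustrative, and a Python 3 port of Nielsen's network.py is assumed:

```python
import copy
import numpy as np
import mnist_loader
import network

training_data, validation_data, test_data = \
    mnist_loader.load_data_wrapper()
training_data = list(training_data)
validation_data = list(validation_data)

def validation_accuracy(net, data):
    """Fraction of validation digits the network classifies correctly."""
    results = [(np.argmax(net.feedforward(x)), y) for x, y in data]
    return sum(int(pred == y) for pred, y in results) / len(data)

best_net, best_acc = None, 0.0
for restart in range(5):
    # Each Network() call draws fresh random weights and biases,
    # i.e. a new random starting point in the cost landscape.
    net = network.Network([784, 30, 10])
    net.SGD(training_data, 5, 10, 3.0)  # a short run per restart
    acc = validation_accuracy(net, validation_data)
    if acc > best_acc:
        best_net, best_acc = copy.deepcopy(net), acc
    print(f"restart {restart}: {acc:.1%} (best so far {best_acc:.1%})")
```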