# Griffin LLM


Add context tokens to the tokenizer (an unknown-word token and an end-of-text token).
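A minimal sketch of what that could look like (the class name and vocab handling here are my own assumptions, following the common `<|unk|>` / `<|endoftext|>` convention):

```python
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        # vocab maps string tokens to integer IDs; append the two
        # special context tokens if they aren't already present
        vocab = dict(vocab)
        for special in ("<|unk|>", "<|endoftext|>"):
            if special not in vocab:
                vocab[special] = len(vocab)
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        words = re.findall(r"\w+|[^\w\s]", text)
        # any word not in the vocabulary falls back to <|unk|>
        unk = self.str_to_int["<|unk|>"]
        return [self.str_to_int.get(w, unk) for w in words]

    def decode(self, ids):
        return " ".join(self.int_to_str[i] for i in ids)
```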

`pip install tiktoken` to use for byte pair encoding (section 2.5). Byte pair encoding handles unknown words by breaking them down into subword units that are in the vocabulary. It builds that vocabulary by iteratively merging frequently occurring characters into subwords, and frequent subwords into words.
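For example (the sample word here is just a made-up out-of-vocabulary string):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")

# "Akwirw ier" is not a vocabulary word, so BPE splits it into subwords;
# <|endoftext|> has to be explicitly allowed as a special token
ids = enc.encode("Akwirw ier <|endoftext|>", allowed_special={"<|endoftext|>"})
print(ids)
print([enc.decode([i]) for i in ids])  # inspect each subword piece
```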

Dataloader: used to fetch the input-target pairs. We create the target pairs using a sliding window approach: the target is the input shifted by one token, so if the input starts at token x, the target starts at token x + 1.

The context size is the size of our sample, i.e. the number of tokens per input.
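A tiny worked example of the sliding window (the token IDs are made up):

```python
tokens = [40, 367, 2885, 1464, 1807]  # made-up token IDs
context_size = 4

x = tokens[:context_size]         # inputs:  [40, 367, 2885, 1464]
y = tokens[1:context_size + 1]    # targets: [367, 2885, 1464, 1807]
```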

The data loader iterates over the input dataset and returns the inputs and targets as PyTorch tensors, which can be thought of as multidimensional arrays. Note there is one tensor for inputs and one tensor for targets.
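A sketch of such a data loader (class and parameter names are mine, not necessarily the repo's):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Sliding-window dataset: each sample is a context_size-long chunk
    of token IDs, and its target is the same chunk shifted by one."""
    def __init__(self, token_ids, context_size, stride):
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - context_size, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + context_size]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + context_size + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

token_ids = list(range(100))  # placeholder token IDs
loader = DataLoader(TokenDataset(token_ids, context_size=4, stride=4),
                    batch_size=8, shuffle=False)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)  # one tensor for inputs, one for targets
```

Each batch yields two tensors of shape `(batch_size, context_size)`: one for the inputs and one for the targets.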


Okay, so embeddings are vectors, really long vectors with a ton of dimensions (elements). These vectors are fed into the model as its input. I'm still not clear on what each element in the vector represents, or how the final embedding is created.
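As far as I can tell, in GPT-style models the final input embedding is the sum of a learned token embedding and a learned positional embedding, and the individual elements don't have fixed meanings; they are learned parameters. A sketch (the sizes here are arbitrary):

```python
import torch

vocab_size, context_size, emb_dim = 50257, 4, 256  # arbitrary example sizes

tok_emb = torch.nn.Embedding(vocab_size, emb_dim)    # one learned row per token ID
pos_emb = torch.nn.Embedding(context_size, emb_dim)  # one learned row per position

token_ids = torch.tensor([[40, 367, 2885, 1464]])    # batch of one sample
# final input embedding = token embedding + positional embedding
x = tok_emb(token_ids) + pos_emb(torch.arange(context_size))
print(x.shape)  # torch.Size([1, 4, 256])
```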

## Training

I'm running `gpt_train.py` with `python gpt_train.py train`, but it only gets through part of the first epoch. The furthest it got was step 000005 of the first epoch before terminating; usually it only makes it to step 0000000.

- The number of epochs is set correctly, i.e. > 1
- I've tried using different datasets

I'm wondering if it's a memory issue and the program is just running out of memory, so I used `htop` to monitor Gitpod's resources while running it:

The CPU maxes out, but the memory sticks around 50%. I don't think memory is the problem then, but the CPU could be. I don't have access to a better computer at the moment, but I'd like to rerun this on a different machine to see if the problem is the CPU.