How to train a LLM

Theoretical approach

Training an LLM is a science. This entry gives a brief overview to familiarize you with the core concepts and vocabulary; it is not possible to go into detail for every section. If you only want to read one part, I would advise reading the practical approach and referring to the other sections when needed.

1. Data Curation

Data sources

For a general overview of possible data sources, check out Zhao 3.2.

Scrape the internet

  • This provides a large variety of data
  • There are copyright problems, since we do not have explicit permissions. (This might interfere with our open-source approach.)

Public datasets

These can be open source and even pre-curated.

Use an LLM to generate a dataset

This falls into a legal grey area: the generated data only truly counts as open source if the generating LLM was itself trained exclusively on open-source data. An example is AlpacaDataCleaned, which has been used to train Alpaca with text generated by GPT-3.

Data Diversity

Mixing multiple datasets is the most common approach for LLMs; typically generalized data is mixed with specialized data. There is not yet an exact mixing proportion known to work best for specific tasks. Zhao 4.1

Data Preprocessing

This step removes noisy, redundant, irrelevant, and potentially toxic data. [Zhao 4.1]

  • Language-based filtering (only keep data in languages that will be used)
  • Metric-based filtering (focus on data you are interested in, using quantitative evaluation metrics)
  • Statistic-based filtering (filter by statistical properties such as punctuation distribution, symbol-to-word ratio, and sentence length)
  • Keyword-based filtering (identify and remove noisy or distracting elements)

De-duplication

  • performed at different granularities, to reduce repetitive patterns and the chance of over-fitting

Privacy Reduction

  • remove personally identifiable information from the data, for instance with keyword spotting

Tokenization

Decide how the raw text is split into tokens, the minimal building blocks the model works with. [Zhao 4.1]

  • word-based tokenization (dominant in traditional NLP, but the same input can be segmented differently in some languages, and it produces a large vocabulary of low-frequency words, which makes training difficult)
  • character tokenization
  • subword tokenization (widely used for transformer-based LMs; segments words into logically coherent sub-word units)

It is usually better to use a tokenizer trained on the pre-training corpus

Subword tokenizers

Byte-Pair Encoding (BPE) tokenization Gage, Sennrich

Start with basic symbols and iteratively merge the pair of tokens that most frequently appears together; the original tokens persist in the vocabulary. This is used by GPT-2, BART, and LLaMA.
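To make the merge procedure concrete, here is a minimal, self-contained sketch of BPE vocabulary learning on the classic toy corpus from Sennrich's paper; the corpus, the number of merge steps and the "</w>" end-of-word marker are illustrative choices, not part of any particular library.

```python
from collections import Counter

# Toy corpus: words split into characters, with a word-end marker "</w>" and word frequencies.
corpus = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
          ("n", "e", "w", "e", "s", "t", "</w>"): 6, ("w", "i", "d", "e", "s", "t", "</w>"): 3}

def most_frequent_pair(corpus):
    # Count how often each adjacent pair of symbols appears, weighted by word frequency.
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(5):                       # 5 merge steps for illustration
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```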

Unigram tokenization Kudo

Start with a large set of possible sub-strings and subsequently remove tokens until the expected vocabulary size is reached. The Viterbi algorithm is used to optimally split words into tokens, the resulting segmentation measures how well the vocabulary can explain the training data, and roughly 10% of the tokens are removed at each step. This is used by T5 & mBART.

SentencePiece Kudo,Richardson

An open-source tokenizer implementation that supports both BPE and Unigram tokenization, with an existing Python library: sentencepiece
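A minimal sketch of using the sentencepiece Python library; the file name "corpus.txt", the vocabulary size and the chosen model type are assumptions for illustration.

```python
import sentencepiece as spm

# Train a small tokenizer on a plain-text file (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="tokenizer",
    vocab_size=8000, model_type="bpe"      # or "unigram" (the default)
)

# Load the trained model and tokenize some text.
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Testing the tokenizer.", out_type=str))  # subword pieces
print(sp.encode("Testing the tokenizer.", out_type=int))  # token ids
```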


2. Model Architecture

Transformers

= a network architecture based on self-attention mechanisms. This is the component that decides which tokens in a given context correspond most strongly to the query token. Transformers translate tokens into semantic representations. Transformer explanation

Vaswani

Attention

Attention function:

  • spotlighting the most important parts of an input
  • receives a Query (the element to process, an individual token), Keys (the elements to compare against, i.e. the other tokens in the context) and Values (the information carried by those tokens), and outputs Weights (a scoring of the keys)

Example for the sentence: "The animal didn't cross the street because it was too tired"

  • Query: the word "it"
  • Keys: all words in the sentence ("The", "animal", "didn't", etc.)
  • Values: the actual information about each word
  • Weights: higher for "animal" (because that is what "it" refers to), lower for the other words

  • The weight-calculation is usually done using SoftMax

Self-attention: a specific way of calculating attention, by comparing the query to every token in its own context. Using self-attention layers instead of recurrences or convolutions allows for parallelization and improved training efficiency. [Vaswani]
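A minimal sketch of scaled dot-product self-attention as described above, written with PyTorch; the tensor sizes are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # weights[i, j] = softmax(Q_i . K_j / sqrt(d_k)): how much token i attends to token j
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ value, weights

# Self-attention: Q, K and V all come from the same token representations.
tokens = torch.randn(1, 11, 64)   # 1 sentence, 11 tokens, 64-dimensional embeddings
out, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, weights.shape)   # (1, 11, 64), (1, 11, 11)
```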

Types of Transformers

Encoder-only

  • self-attention over all tokens of the input, so every token can attend to every other token

task: text classification

Decoder-only

  • like the encoder, but without self-attention to future elements (causal masking)
  • generates one element at a time
  • auto-regressive = consuming the previously generated symbols as additional input

task: text generation

Encoder-decoder

  • combine both & allow cross-attention

task: translation

[Talebi], [Vaswani]

Residual Connections

= passing intermediate outputs through to deeper layers while skipping some layers in between. The benefit is that this allows for gradual development of the layers' weights, thereby reducing over-fitting or overspecialized paths. The disadvantage of this approach is that an echo of the less-processed data is present in the higher layers, which introduces a bias to the weights. There are methods to work around this, but those exceed the scope of this wiki entry. The most common approach is to use residual connections. Zhang
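A minimal sketch of what a residual (skip) connection looks like in code, assuming PyTorch; the wrapper class name is hypothetical.

```python
import torch.nn as nn

class Residual(nn.Module):
    """Wraps any sub-layer so its input is added back to its output: y = x + f(x)."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x):
        # The "skip" path lets activations and gradients flow past the sub-layer unchanged.
        return x + self.sublayer(x)
```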

Layer Normalization

= rescale the values between layers.

  • Where to normalize: before or after the layer?
  • How to normalize: Using the Layer Norm or the Root Means Square Norm?

Layer Norm: $$y = \frac{x-\overline{x}}{\sqrt{\mathrm{Var}(x) + \epsilon}} \times \gamma + \beta$$

RMS Norm: $$y = \frac{x}{\mathrm{RMS}(x)} \times \gamma + \beta$$

with $\overline{x}$ = mean, $\epsilon$ = small constant for numerical stability, $\gamma$ = gain factor, $\beta$ = bias term

The most common approach is pre-layer normalization and using the Layer Norm [Talebi]
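A small sketch implementing the two formulas above in PyTorch so the difference is visible; the tensor shapes and epsilon value are illustrative.

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance, then scale and shift.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, beta, eps=1e-5):
    # Rescale by the root mean square only (no mean subtraction), then scale and shift.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * gamma + beta

x = torch.randn(2, 8)                          # 2 tokens with 8 features each
gamma, beta = torch.ones(8), torch.zeros(8)
print(layer_norm(x, gamma, beta).std(dim=-1))  # roughly 1 per token
print(rms_norm(x, gamma, beta))
```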

Activation functions

= control how strongly a path in a layer responds to a given input. There exist many variants, with the Gaussian Error Linear Unit (GeLU) being the most widely used. Zhao

Examples: GeLU, ReLU, Swish, SwiGLU, GeGLU
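A short illustration of a few of these activation functions using PyTorch's built-in implementations; the input values are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, steps=7)
print(F.relu(x))   # zero for negative inputs, identity for positive
print(F.gelu(x))   # smooth, probabilistic gating of the input (most common in LLMs)
print(F.silu(x))   # Swish / SiLU: x * sigmoid(x)
```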

Positional Embeddings

= placement of tokens in vector space by embedding functions that encode where a token sits in the sequence. This is necessary for better token correspondence and a better understanding of the context. Positional embeddings can be set once at the input [Vaswani] or be dynamically adjustable, known as relative position representations. Shaw
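A sketch of the fixed sinusoidal positional encoding from [Vaswani], added once to the token embeddings at the input; the sequence length and embedding size below are illustrative.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    positions = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions / div)
    pe[:, 1::2] = torch.cos(positions / div)
    return pe

embeddings = torch.randn(32, 128)                                  # 32 tokens, 128-dim embeddings
embeddings = embeddings + sinusoidal_positional_encoding(32, 128)  # added once at the input
```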

Size

The size of the model is determined by the number of tokens in the training data and by how long the model is trained. If the model is too big or trained too long, it can easily over-fit; if it is too small or not trained long enough, it under-performs. A rough estimate for the sweet spot for large models is $20$ tokens per model parameter. This estimation can help prevent under-fitting, resulting in better, smaller models: even though the current push is towards bigger models, for the same dataset a correctly trained model with a fraction of the parameters outperforms a model that has not been trained long enough. Hoffmann While training, the training loss and validation loss should stay almost the same; once they start to diverge, the model is being over-fitted. Dandekar
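A back-of-the-envelope example of the roughly 20-tokens-per-parameter rule from Hoffmann et al.; the parameter count below is purely illustrative, not a recommendation for our project.

```python
# Rough Chinchilla-style sizing estimate: ~20 training tokens per model parameter.
n_params = 125_000_000          # e.g. a GPT-2-small-sized model (illustrative)
n_tokens = 20 * n_params        # ≈ 2.5 billion training tokens
print(f"~{n_tokens / 1e9:.1f}B training tokens suggested")
```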


3. Training at Scale

Optimization Setting

Batch Training

  • not dealing with the entire dataset at once, but splitting it into batches and computing gradients for these

Learning Rate

= how fast/how much the weights are adjusted, i.e. how much influence an individual training step has

  • the initial phase should have a low learning rate (warm-up) to avoid overshooting, the middle should reach the peak learning rate, and towards the end the learning rate should be reduced again for stability (see the sketch below)
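A sketch of such a schedule (linear warm-up followed by cosine decay), which is one common way to realize the low-peak-low pattern described above; all hyperparameter values are illustrative.

```python
import math

def learning_rate(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=2000, total_steps=60000):
    """Low at the start (warm-up), peak in the middle, decayed again towards the end."""
    if step < warmup_steps:                                 # linear warm-up
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:                                 # floor after training
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))     # cosine decay from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)
```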

Optimizer

  • scales how the computed gradients are applied when adjusting the weights

Stability

  • Check-pointing - save snapshots of training artifacts to be able to resume from intermediate points
  • Weight Decay - penalize large parameter values
  • Gradient Clipping - rescale gradients whose values become too large (see the sketch below)
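A minimal PyTorch sketch showing all three stability techniques in a single training step; the model, loss and all numeric values are placeholders.

```python
import torch

model = torch.nn.Linear(512, 512)                     # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              weight_decay=0.1)       # weight decay penalizes large parameters

x, target = torch.randn(8, 512), torch.randn(8, 512)
loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()

torch.save({"model": model.state_dict(),              # check-pointing: snapshot to resume from
            "optimizer": optimizer.state_dict()}, "checkpoint.pt")
```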

Scalable Training Techniques

Mixed Precision Training

Using 16-bit floating point data types whenever possible and 32-bit only when necessary
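A minimal sketch of mixed-precision training with PyTorch's torch.cuda.amp; the model and data are placeholders and a CUDA GPU is assumed.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()              # placeholder model, requires a GPU
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()                  # keeps small fp16 gradients from underflowing

x, target = torch.randn(8, 512).cuda(), torch.randn(8, 512).cuda()
with torch.cuda.amp.autocast():                       # 16-bit where safe, 32-bit where necessary
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```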

3D Parallelism

= combining multiple parallelization techniques (data, pipeline and tensor parallelism) to train efficiently across many GPUs.

Data Parallelism

  • distribute the training data, such that each GPU can do forward and backward propagation on its share of the data. The resulting gradients are then aggregated
  • this is very scalable and already implemented by popular deep-learning libraries such as TensorFlow and PyTorch (see the sketch below)
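A minimal data-parallel sketch using PyTorch's DistributedDataParallel; the model and dataset are toy placeholders, and the script is assumed to be launched with `torchrun --nproc_per_node=<num_gpus> train_ddp.py`.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")               # torchrun sets up the process group env
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(128, 128).cuda(rank), device_ids=[rank])  # full replica per GPU
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 128))
loader = DataLoader(dataset, batch_size=32,
                    sampler=DistributedSampler(dataset))  # each GPU sees a different data shard

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x.cuda(rank)), y.cuda(rank))
    loss.backward()                                    # DDP aggregates (all-reduces) gradients here
    optimizer.step()
```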

Pipeline Parallelism (a form of model parallelism)

  • distribute different transformer layers across multiple GPUs
  • GPU idle time can be reduced by techniques such as padding multiple batches of data and asynchronous gradient updates

Tensor Parallelism

  • decompose tensors (parameter matrices) across multiple GPUs
  • this allows for faster matrix-matrix multiplication

An example of an open-source implementation is Megatron-LM

Redundancy decreasing

  • partition optimizer states, gradients and parameters across GPUs, to better distribute the data while limiting memory redundancy and cross-GPU communication.
  • ZeRO, FSDP and activation recomputation techniques can be used here.

Implementations can be found in DeepSpeed, PyTorch and Megatron-LM

Zhao 4.3


Practical approach

Best approach for our project

In our case, we do not need a general large language model. We need a language model that is trained on English, Python code, Markdown, and maybe some other datasets. We therefore do not need to train a full-scale LLM for our use case.

Train a small language Model (SLM) with specific smaller datasets

An example is given in the step-by-step walkthrough on how to train an SLM with the TinyStories dataset. Dandekar This also includes a Codelab. DandekarCode We could take this code, add a few more datasets for code generation in Python, Markdown, ... and train on that.

LLM Structure

Step 1: Import the Dataset

Here the TinyStories dataset is loaded. It consists of stories aimed at 3- to 4-year-olds and is remarkable because it facilitates learning coherent English while being very small. We should add additional datasets for our specific use case.
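One possible way to load TinyStories, assuming the Hugging Face datasets library and the commonly used dataset id "roneneldan/TinyStories"; Dandekar's walkthrough may obtain the data differently.

```python
from datasets import load_dataset  # Hugging Face `datasets` library

# Dataset id and the "text" field name are assumptions for illustration.
dataset = load_dataset("roneneldan/TinyStories")
print(dataset["train"][0]["text"][:200])   # a short story in simple English
```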

Step 2: Tokenize the Dataset

The dataset is transformed from natural language into token encodings. This works in a similar way to the embedding model in RAG. The tokenized and encoded data is then separated into batches/shards for modular handling. Here the data is also split into train, validation and test sections.
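A sketch of what this step can look like, assuming the GPT-2 BPE tokenizer via tiktoken (common in nanoGPT-style code); the walkthrough's exact tokenizer, file names and shard layout may differ.

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")          # GPT-2 BPE tokenizer

def tokenize_split(texts, out_path):
    # Turn a list of stories into one flat array of token ids and write it to disk.
    ids = []
    for text in texts:
        ids.extend(enc.encode_ordinary(text))
        ids.append(enc.eot_token)            # mark the end of each story
    np.array(ids, dtype=np.uint16).tofile(out_path)   # uint16 is enough for a ~50k vocabulary

tokenize_split(["Once upon a time..."], "train.bin")  # repeat for validation / test splits
```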

Step 3: Create Input-Output batches for the dataset

A memory map is created to prevent memory leaks when accessing the files. Input and output sequences are randomly sampled from the batches. The output sequence is always shifted by one token relative to the input data, as seen in the image on Transformers (Section 2 Image purple)
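A sketch of the batch-creation step, assuming the tokenized shards from step 2 were written as flat uint16 arrays (as in the sketch above); the context length and batch size are illustrative.

```python
import numpy as np
import torch

block_size, batch_size = 128, 32          # context length and batch size (illustrative)

def get_batch(path="train.bin"):
    # np.memmap avoids loading the whole shard into RAM; re-creating it per call
    # side-steps the memory-leak pattern mentioned above.
    data = np.memmap(path, dtype=np.uint16, mode="r")
    idx = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in idx])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in idx])
    return x, y                            # y is x shifted by one token (the prediction target)
```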

Step 4: Define the LM Model Architecture

Multiple classes and functions are combined into a Transformer Block() with the order LayerNorm → Self-Attention → LayerNorm → MLP (a sketch of such a block follows after the list below).

The classes are defined as follows:

  • First is the Layer-norm (Section 2).
  • The CausalSelfAttention() is the implementation of self-attention (Section 2 Image Orange). Here Multi-Head Attention is used, to capture different aspects of the input sequence simultaneously and capture multiple types of relationships.
  • The Multi-Layer Perceptron (MLP) expands the dimensions, applies the activation function and then contracts the size back to the original (Section 2 Image blue)
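A sketch of how such a block is typically wired together, not the walkthrough's exact code; CausalSelfAttention, MLP and config.n_embd refer to the classes and configuration described above and are assumed to be defined elsewhere.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm block: LayerNorm -> Self-Attention -> LayerNorm -> MLP,
    with residual connections around both sub-layers."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)   # masked multi-head self-attention (step 4)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)                    # expand -> activation -> contract (step 4)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))           # residual connection around attention
        x = x + self.mlp(self.ln_2(x))            # residual connection around the MLP
        return x
```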

After defining the individual Transformer Blocks, the Generative Pre-trained Transformer (GPT) configuration is set and the model structure is initialized. The GPT class defines methods for initialization and for generating new tokens based on the input.

Step 5: Define the loss function

This measures how far the model's predictions are from the actual target tokens.
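For a language model this is usually the cross-entropy between the predicted next-token distribution and the actual next tokens; a minimal sketch, assuming PyTorch:

```python
import torch.nn.functional as F

# logits: (batch, block_size, vocab_size) model predictions; targets: (batch, block_size) token ids
def language_model_loss(logits, targets):
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```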

Step 6 & 7: Define LM Training Configuration

  • set core training Parameters
  • optimize execution for hardware
  • set learning rate components with a scheduler

Step 8: (Pre-)train the LM

  • execute the training loop for the given data
  • save the model state after each iteration of the loop
  • make sure that train_loss and validation_loss stay similar, to avoid over-fitting (see the skeleton below)
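A skeleton of what the loop can look like, combining the pieces sketched in the previous steps; model stands for the GPT from step 4, get_batch and language_model_loss are the sketches from steps 3 and 5, and all numeric values are illustrative.

```python
import torch

max_steps, eval_interval = 5000, 250
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)

for step in range(max_steps):
    x, y = get_batch("train.bin")
    loss = language_model_loss(model(x), y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    if step % eval_interval == 0:
        with torch.no_grad():
            val_x, val_y = get_batch("validation.bin")
            val_loss = language_model_loss(model(val_x), val_y)
        # diverging train/validation loss is the over-fitting signal mentioned above
        print(f"step {step}: train {loss.item():.3f}, val {val_loss.item():.3f}")
        torch.save(model.state_dict(), f"checkpoint_{step}.pt")   # snapshot to resume from
```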

Other Sources

  • a good overview Talebi
  • an immensely useful paper Zhao
  • an example of going through the code of a GPT Karpathy