# Configuration
Configuration is done through a JSON file and command-line arguments. The configuration file `config.json` should be in the root directory, and command-line args can be used to supersede the config's args without rewriting the config itself. Here's a breakdown of the main parameters, each listed as:

(config arg | command-line arg | shortened command-line arg)
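
As a quick illustration, a minimal `config.json` could look something like the fragment below. The keys are the documented parameters from the sections that follow; the values are placeholders, not recommendations.

```json
{
    "cache_folder": "/path/to/cache",
    "dataset_path": "/path/to/train_dataset",
    "validation_dataset_path": "/path/to/val_dataset",
    "teacher_models_folder": "/path/to/teachers",
    "student_path": "/path/to/student",
    "context_len": 2048,
    "num_epochs": 2,
    "lr": 1e-5
}
```

Any of these can then be superseded at launch with the corresponding command-line flag (e.g. `--lr` or `-ctx`) without touching the file.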
## Paths

- `cache_folder` | `--cache_folder` | `-c`: Directory for cache storage. Ideally this should be an empty folder.
- `dataset_path` | `--dataset_path` | `-d`: Path to the training dataset.
- `validation_dataset_path` | `--validation_dataset_path` | `-vd`: Path to the validation dataset.
- `teacher_models_folder` | `--teacher_models_folder` | `-tm`: Directory containing teacher models, or a path pointing directly at a single teacher.
- `student_path` | `--student_path` | `-s`: Path to the student model.
## Cache Settings

- `max_cache_size_gb` | `--max_cache_size_gb` | `-maxgb`: Maximum cache size in GB. Used to keep the main h5 dataset under this limit, and to switch to chunked collection + training when the calculated size of the collected h5 dataset exceeds it. Only the main h5 dataset's size is tracked; any misc. files/states are not counted, so be mindful of that. A rough size estimate is sketched below.
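
To get a feel for whether a run will stay under `max_cache_size_gb`, here is a back-of-the-envelope estimate of the collected h5 dataset's size. It assumes 16-bit values and ignores top-K indices, compression and metadata, so treat it as a ballpark sketch, not the pipeline's actual accounting.

```python
def estimate_h5_size_gb(num_samples: int, context_len: int,
                        logits_per_token: int, bytes_per_value: int = 2) -> float:
    """Rough upper bound: one distribution per token, 16-bit values assumed."""
    return num_samples * context_len * logits_per_token * bytes_per_value / 1e9

# Full 32000-entry vocabulary vs. top-K 200, for 1000 samples of 2048 tokens:
print(estimate_h5_size_gb(1000, 2048, 32000))  # ~131 GB (the wiki quotes ~100 GB)
print(estimate_h5_size_gb(1000, 2048, 200))    # ~0.8 GB (plus top-K indices)
```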
## Pipeline Settings

- `ignore_model_type` | `--ignore_model_type`: If True, lets completion teachers collect instruct data, and instruct teachers collect completion data. Use at your own discretion.
- `rebase_dataset` | `--rebase_dataset`: Rebase the dataset without safety checks. Overwrites all metadata in the h5 dataset, so be very careful to use the exact same text dataset as the one used for its collection. The intended use is to update the dataset to a newer version of the pipeline after breaking changes were implemented.
- `use_teachers` | `--use_teachers`: Whether to use teachers for distillation. Useful when you downloaded the dataset from the internet and want to distill using it, but don't have the teachers it was collected with on disk.
## Model Settings

- `context_len` | `--context_len` | `-ctx`: Context length to collect and train on.
- `save_sys_range` | `--save_sys_range`, `save_user_range` | `--save_user_range`, `save_assistant_range` | `--save_assistant_range`: Boolean flags to save specific token ranges within conversations. Used only with instruct data to collect and train only on the content of the conversation, avoiding the prompt formatting tokens. See Content Ranges for a more in-depth explanation.
- `crop_distr_to_size` | `--crop_distr_to_size`: Crop distribution size for token filtering. A very rudimentary way to avoid training on tokens that aren't in the student's vocab. It is applied as `distributions[:, :crop_distr_to_size]`, where the first dimension is for tokens and the second for logits. MAKE SURE TO SET IT CORRECTLY, IT IS A VERY IMPORTANT PARAMETER; it is supposed to be set to the base model's vocabulary size, which you can check using this (see also the sketch after this list).
- `enable_topK` | `--enable_topK`: Enable top-K sampling for collecting and training.
- `save_topK` | `--save_topK` | `-topk`: Configure top-K sampling for collecting and training. Slashes storage requirements enormously: without it, 1000 samples of 2048 tokens take up ~100 GB of storage at a vocabulary size of 32000, but with top-K at 200 this goes down to only ~1.4 GB (see the size estimate under Cache Settings).
- `device` | `--device`: Main device for any single-device tensor shenanigans (`cuda:0`, ...).
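
One way to look up the base model's vocabulary size for `crop_distr_to_size` is to read it from the model's Hugging Face config; a minimal sketch, with a hypothetical path:

```python
from transformers import AutoConfig

# Hypothetical path; point this at the student's *base* model.
config = AutoConfig.from_pretrained("/path/to/student_base_model")
print(config.vocab_size)  # e.g. 32000 -> use this value for crop_distr_to_size
```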
## Collection Settings

- `num_inference_workers` | `--num_inference_workers` | `-niw`: Number of inference workers for parallel collection. Use if you have enough VRAM. Note: it can't handle more than 3 inference workers due to multiprocessing being a rapscallion, else it just hangs indefinitely.
- `reserve_vram` | `--reserve_vram`: Amount of VRAM to reserve per GPU during collection; will try to keep that much memory in GB free on each GPU (`[4, 0.5]` for 4 GB reserved on the first GPU and 0.5 GB on the second).
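
For reference, the collection settings above would look like this inside `config.json` (placeholder values, two GPUs assumed):

```json
{
    "num_inference_workers": 2,
    "reserve_vram": [4, 0.5]
}
```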
## Training Settings

- `num_epochs` | `--num_epochs` | `-ne`: Number of training epochs.
- `num_warmup_steps` | `--num_warmup_steps` | `-nws`: Number of warmup steps for the lr.
- `batch_size` | `--batch_size` | `-bs`: Training batch size.
- `grad_accum_batches` | `--grad_accum_batches` | `-g`: Number of gradient accumulations done before calling `optimizer.step()`.
- `grad_checkpointing` | `--grad_checkpointing`: Enable gradient checkpointing for memory savings, at the cost of slower training.
- `temperature` | `--temperature` | `-t`: Temperature for distillation (see the loss sketch after this list).
- `lr` | `--lr` | `-lr`: Learning rate.
- `decay_start` | `--decay_start`: Start decaying the lr to 0 at this percentage of total training steps; only used by the `wsd` lr scheduler.
- `alpha` | `--alpha`: Weighting factor for weighted losses. The code behind this parameter will likely change quite a bit in the future, but currently it's used like this: `weights = ((kl_div_per_token / kl_div_per_token.max()) + 1).pow(alpha)` (see the loss sketch after this list).
- `lr_scheduler` | `--lr_scheduler`: Learning rate scheduler name. The pipeline offers a custom implementation of the Warmup Stable Decay learning rate scheduler from MiniCPM under the `wsd` name, plus any other lr scheduler from Transformers' `get_scheduler()` function, like `cosine`, `linear`, `constant`, ...
- `optimizer` | `--optimizer`: Optimizer name. Currently available ones are: `adam`, `adamw`, `adamw8bit`, `adamw32bit`, `paged_adamw`, `paged_adamw8bit`, `paged_adamw32bit`, `sgd`, `rmsprop`, `rmsprop8bit`, `rmsprop32bit`, `adagrad`.
- `data_order` | `--data_order`: Order of samples during training. `shuffle` randomizes the order of samples every epoch, `sorted` sorts the data by sample length (can be useful to diagnose OOM errors), `native` keeps the same order of samples as in the text dataset.
- `training_precision` | `--training_precision`: Training precision: `fp32`, `fp16`, `bf16`, `8bit` or `4bit` (4bit uses nf4 quantization with double-quant enabled).
- `validate_every_n_epochs` | `--validate_every_n_epochs`: Validation frequency measured in epochs, so it scales with your dataset. Accepts floating point numbers like `0.1`, which will do 10 validation steps every epoch, etc.
- `save_student_every_n_epochs` | `--save_student_every_n_epochs`: Frequency of saving the student model, in epochs. Same jam as the above.
- `num_gpu0_layers` | `--num_gpu0_layers`: Number of layers for GPU 0. Used only with `device_map = "custom"`.
- `device_map` | `--device_map`: Device mapping strategy. Currently supports any device map provided by HF Accelerate (`balanced`, `balanced_low_0`, ...), and `custom` for our custom splitting strategy, which also lets you set how many layers you want on the first GPU with `num_gpu0_layers`.
- `max_memory` | `--max_memory`: Maximum memory allocation for each device. Must be formatted as a JSON object; an example is provided within the `config.json` in the repo.
- `multi_gpu` | `--multi_gpu`: Enable multi-GPU training.
- `save_final_state` | `--save_final_state`: Save the final model and optimizer state after training. Can consume enormous amounts of storage and RAM.
- `use_flash_attn_2` | `--use_flash_attn_2` | `-fa2`: Whether to use Flash Attention 2, or default to sdpa.
- `wandb_comment` | `--wandb_comment` | `-wdb`: A comment for Weights and Biases logging (`{comment} ModelName lr(lr) (Date/Time)`).
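
To make `temperature` and `alpha` concrete, here is a minimal sketch of a temperature-scaled, KL-weighted distillation loss for one sample. Only the `weights` line is taken from the formula quoted above; the softmax/KL plumbing and the mean reduction are illustrative assumptions, not the pipeline's actual loss code.

```python
import torch
import torch.nn.functional as F

def weighted_kd_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 1.0,
                     alpha: float = 1.0) -> torch.Tensor:
    # Temperature-scaled distributions, shape [seq_len, vocab]
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Per-token KL divergence between teacher and student
    kl_div_per_token = F.kl_div(student_logprobs, teacher_probs,
                                reduction="none").sum(dim=-1)

    # Weighting factor controlled by `alpha`, as quoted in the parameter list
    weights = ((kl_div_per_token / kl_div_per_token.max()) + 1).pow(alpha)

    return (kl_div_per_token * weights).mean()
```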
## Student Settings

- `freeze_layers` | `--freeze_layers` | `-fl`: Layers to freeze during training. Uses a list of strings of layer names to freeze: `[".block_sparse_moe.gate", ...]` (see the sketch below).
- `add_bos` | `--add_bos`: Add a beginning-of-sequence token to every sample.
- `prompt_format` | `--prompt_format`: Prompt format to use for instruct samples.
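
A minimal sketch of how `freeze_layers` can be interpreted, assuming a simple substring match on parameter names (the path and the matching rule here are illustrative assumptions, not the pipeline's exact code):

```python
from transformers import AutoModelForCausalLM

student_model = AutoModelForCausalLM.from_pretrained("/path/to/student")  # hypothetical path
freeze_layers = [".block_sparse_moe.gate"]  # patterns from the config

for name, param in student_model.named_parameters():
    # Any parameter whose name contains one of the patterns is frozen
    if any(pattern in name for pattern in freeze_layers):
        param.requires_grad = False
```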