# Configuration
Configuration is done through a JSON file and command-line arguments. The configuration file `config.json` should be in the root directory, and command-line args can be used to supersede the config's args without rewriting the config itself. Here's a breakdown of the main parameters, each listed as:

(config arg | command-line arg | shortened command-line arg)
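
As a quick illustration, a minimal `config.json` could look something like the fragment below. The keys are the documented parameters from the sections that follow; the values are placeholders, not recommendations.

```json
{
    "cache_folder": "/path/to/cache",
    "dataset_path": "/path/to/train_dataset",
    "validation_dataset_path": "/path/to/val_dataset",
    "teacher_models_folder": "/path/to/teachers",
    "student_path": "/path/to/student",
    "context_len": 2048,
    "num_epochs": 2,
    "lr": 1e-5
}
```

Any of these can then be superseded at launch with the corresponding command-line flag (e.g. `--lr` or `-ctx`) without touching the file.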
## Paths

- `cache_folder` | `--cache_folder` | `-c`: Directory for cache storage. Ideally this should be an empty folder.
- `dataset_path` | `--dataset_path` | `-d`: Path to the training dataset.
- `validation_dataset_path` | `--validation_dataset_path` | `-vd`: Path to the validation dataset.
- `teacher_models_folder` | `--teacher_models_folder` | `-tm`: Directory containing teacher models, or a path pointing directly at a single teacher.
- `student_path` | `--student_path` | `-s`: Path to the student model.
## Cache Settings

- `max_cache_size_gb` | `--max_cache_size_gb` | `-maxgb`: Maximum cache size in GB. Used to keep the main h5 dataset under this limit, and to switch to chunked collection + training when the calculated size of the collected h5 dataset exceeds it. Only the main h5 dataset's size is tracked; any misc. files/states are not counted, so be mindful of that. A rough size estimate is sketched below.
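
To get a feel for whether a run will stay under `max_cache_size_gb`, here is a back-of-the-envelope estimate of the collected h5 dataset's size. It assumes 16-bit values and ignores top-K indices, compression and metadata, so treat it as a ballpark sketch, not the pipeline's actual accounting.

```python
def estimate_h5_size_gb(num_samples: int, context_len: int,
                        logits_per_token: int, bytes_per_value: int = 2) -> float:
    """Rough upper bound: one distribution per token, 16-bit values assumed."""
    return num_samples * context_len * logits_per_token * bytes_per_value / 1e9

# Full 32000-entry vocabulary vs. top-K 200, for 1000 samples of 2048 tokens:
print(estimate_h5_size_gb(1000, 2048, 32000))  # ~131 GB (the wiki quotes ~100 GB)
print(estimate_h5_size_gb(1000, 2048, 200))    # ~0.8 GB (plus top-K indices)
```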
## Pipeline Settings

- `ignore_model_type` | `--ignore_model_type`: If True, lets completion teachers collect instruct data, and instruct teachers collect completion data. Use at your own discretion.
- `rebase_dataset` | `--rebase_dataset`: Rebase the dataset without safety checks. Overwrites all metadata in the h5 dataset, so be very careful to use the exact same text dataset as the one used for its collection. The intended use is to update the dataset to a newer version of the pipeline after breaking changes were implemented.
- `use_teachers` | `--use_teachers`: Whether to use teachers for distillation. Useful when you downloaded the dataset from the internet and want to distill using it, but don't have the teachers it was collected with on disk.
## Model Settings

- `context_len` | `--context_len` | `-ctx`: Context length to collect and train on.
- `save_sys_range` | `--save_sys_range`, `save_user_range` | `--save_user_range`, `save_assistant_range` | `--save_assistant_range`: Boolean flags to save specific token ranges within conversations. Used only with instruct data to collect and train only on the content of the conversation, avoiding the prompt formatting tokens. See Content Ranges for a more in-depth explanation.
- `crop_distr_to_size` | `--crop_distr_to_size`: Crop distribution size for token filtering. A very rudimentary way to avoid training on tokens that aren't in the student's vocab. It is applied as `distributions[:, :crop_distr_to_size]`, where the first dimension is for tokens and the second for logits. MAKE SURE TO SET IT CORRECTLY, IT IS A VERY IMPORTANT PARAMETER; it is supposed to be set to the base model's vocabulary size, which you can check using this (see also the sketch after this list).
- `enable_topK` | `--enable_topK`: Enable top-K sampling for collecting and training.
- `save_topK` | `--save_topK` | `-topk`: Configure top-K sampling for collecting and training. Slashes storage requirements enormously: without it, 1000 samples of 2048 tokens take up ~100 GB of storage at a vocabulary size of 32000, but with top-K at 200 this goes down to only ~1.4 GB (see the size estimate under Cache Settings).
- `device` | `--device`: Main device for any single-device tensor shenanigans (`cuda:0`, ...).
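
One way to look up the base model's vocabulary size for `crop_distr_to_size` is to read it from the model's Hugging Face config; a minimal sketch, with a hypothetical path:

```python
from transformers import AutoConfig

# Hypothetical path; point this at the student's *base* model.
config = AutoConfig.from_pretrained("/path/to/student_base_model")
print(config.vocab_size)  # e.g. 32000 -> use this value for crop_distr_to_size
```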
## Collection Settings

- `num_inference_workers` | `--num_inference_workers` | `-niw`: Number of inference workers for parallel collection. Use if you have enough VRAM. Note: it can't handle more than 3 inference workers due to multiprocessing being a rapscallion, else it just hangs indefinitely.
- `reserve_vram` | `--reserve_vram`: Amount of VRAM to reserve per GPU during collection; will try to keep that much memory in GB free on each GPU (`[4, 0.5]` for 4 GB reserved on the first GPU and 0.5 GB on the second).
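
For reference, the collection settings above would look like this inside `config.json` (placeholder values, two GPUs assumed):

```json
{
    "num_inference_workers": 2,
    "reserve_vram": [4, 0.5]
}
```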
## Training Settings

- `num_epochs` | `--num_epochs` | `-ne`: Number of training epochs.
- `num_warmup_steps` | `--num_warmup_steps` | `-nws`: Number of warmup steps for the lr.
- `batch_size` | `--batch_size` | `-bs`: Training batch size.
- `grad_accum_batches` | `--grad_accum_batches` | `-g`: Number of gradient accumulations done before calling `optimizer.step()`.
- `grad_checkpointing` | `--grad_checkpointing`: Enable gradient checkpointing for memory savings, at the cost of slower training.
- `temperature` | `--temperature` | `-t`: Temperature for distillation (see the loss sketch after this list).
- `lr` | `--lr` | `-lr`: Learning rate.
- `decay_start` | `--decay_start`: Start decaying the lr to 0 at this percentage of total training steps; only used by the `wsd` lr scheduler.
- `alpha` | `--alpha`: Weighting factor for weighted losses. The code behind this parameter will likely change quite a bit in the future, but currently it's used like this: `weights = ((kl_div_per_token / kl_div_per_token.max()) + 1).pow(alpha)` (see the loss sketch after this list).
- `lr_scheduler` | `--lr_scheduler`: Learning rate scheduler name. The pipeline offers a custom implementation of the Warmup Stable Decay learning rate scheduler from MiniCPM under the `wsd` name, plus any other lr scheduler from Transformers' `get_scheduler()` function, like `cosine`, `linear`, `constant`, ...
- `optimizer` | `--optimizer`: Optimizer name. Currently available ones are: `adam`, `adamw`, `adamw8bit`, `adamw32bit`, `paged_adamw`, `paged_adamw8bit`, `paged_adamw32bit`, `sgd`, `rmsprop`, `rmsprop8bit`, `rmsprop32bit`, `adagrad`.
- `data_order` | `--data_order`: Order of samples during training. `shuffle` randomizes the order of samples every epoch, `sorted` sorts the data by sample length (can be useful to diagnose OOM errors), `native` keeps the same order of samples as in the text dataset.
- `training_precision` | `--training_precision`: Training precision: `fp32`, `fp16`, `bf16`, `8bit` or `4bit` (4bit uses nf4 quantization with double-quant enabled).
- `validate_every_n_epochs` | `--validate_every_n_epochs`: Validation frequency measured in epochs, so it scales with your dataset. Accepts floating point numbers like `0.1`, which will do 10 validation steps every epoch, etc.
- `save_student_every_n_epochs` | `--save_student_every_n_epochs`: Frequency of saving the student model, in epochs. Same jam as the above.
- `num_gpu0_layers` | `--num_gpu0_layers`: Number of layers for GPU 0. Used only with `device_map = "custom"`.
- `device_map` | `--device_map`: Device mapping strategy. Currently supports any device map provided by HF Accelerate (`balanced`, `balanced_low_0`, ...), and `custom` for our custom splitting strategy, which also lets you set how many layers you want on the first GPU with `num_gpu0_layers`.
- `max_memory` | `--max_memory`: Maximum memory allocation for each device. Must be formatted as a JSON object; an example is provided within the `config.json` in the repo.
- `multi_gpu` | `--multi_gpu`: Enable multi-GPU training.
- `save_final_state` | `--save_final_state`: Save the final model and optimizer state after training. Can consume enormous amounts of storage and RAM.
- `use_flash_attn_2` | `--use_flash_attn_2` | `-fa2`: Whether to use Flash Attention 2, or default to sdpa.
- `wandb_comment` | `--wandb_comment` | `-wdb`: A comment for Weights and Biases logging (`{comment} ModelName lr(lr) (Date/Time)`).
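
To make `temperature` and `alpha` concrete, here is a minimal sketch of a temperature-scaled, KL-weighted distillation loss for one sample. Only the `weights` line is taken from the formula quoted above; the softmax/KL plumbing and the mean reduction are illustrative assumptions, not the pipeline's actual loss code.

```python
import torch
import torch.nn.functional as F

def weighted_kd_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     temperature: float = 1.0,
                     alpha: float = 1.0) -> torch.Tensor:
    # Temperature-scaled distributions, shape [seq_len, vocab]
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # Per-token KL divergence between teacher and student
    kl_div_per_token = F.kl_div(student_logprobs, teacher_probs,
                                reduction="none").sum(dim=-1)

    # Weighting factor controlled by `alpha`, as quoted in the parameter list
    weights = ((kl_div_per_token / kl_div_per_token.max()) + 1).pow(alpha)

    return (kl_div_per_token * weights).mean()
```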
## Student Settings

- `freeze_layers` | `--freeze_layers` | `-fl`: Layers to freeze during training. Uses a list of strings of layer names to freeze: `[".block_sparse_moe.gate", ...]` (see the sketch below).
- `add_bos` | `--add_bos`: Add a beginning-of-sequence token to every sample.
- `prompt_format` | `--prompt_format`: Prompt format to use for instruct samples.
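
A minimal sketch of how `freeze_layers` can be interpreted, assuming a simple substring match on parameter names (the path and the matching rule here are illustrative assumptions, not the pipeline's exact code):

```python
from transformers import AutoModelForCausalLM

student_model = AutoModelForCausalLM.from_pretrained("/path/to/student")  # hypothetical path
freeze_layers = [".block_sparse_moe.gate"]  # patterns from the config

for name, param in student_model.named_parameters():
    # Any parameter whose name contains one of the patterns is frozen
    if any(pattern in name for pattern in freeze_layers):
        param.requires_grad = False
```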