Training Techniques

RL

KTO

SimPO

ExPO

https://huggingface.co/papers/2405.19107 proposes DRO (Direct Reward Optimisation), a framework and associated algorithms that do not require pairwise preferences.

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF https://arxiv.org/abs/2405.19320

Xwin-LM: Strong and Scalable Alignment Practice for LLMs https://huggingface.co/papers/2405.20335

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment https://huggingface.co/papers/2405.19332

PEFT

We aim to have access to enough compute (2 x H100/A100-80 nodes) to be able to do full fine-tunes, but it would be interesting to see how performance compares with some of the PEFT tuning techniques.
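For reference, here is a minimal sketch of the most common PEFT approach (LoRA) using the peft library; the base model name and the hyperparameters (rank, alpha, target modules) are illustrative placeholders rather than tuned values.

```python
# Minimal LoRA sketch via the peft library (illustrative hyperparameters).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # placeholder base model

lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,            # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```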

MoRA

LISA

Real World Testing

Batch Size

For throughput, a larger batch size is better, but overly large batch sizes can lead to worse training results: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e

In practice, since our tunes are typically memory-limited, we don't have to worry about batch sizes getting too big. Aiming for a global batch size of 64 is probably fine (see also gradient accumulation, sketched below).
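A minimal sketch of how that global batch size would be assembled with gradient accumulation in the HF Trainer, assuming a single 8-GPU node for illustration; the per-device batch size is simply whatever fits in memory.

```python
# Sketch: reaching a global batch size of 64 via gradient accumulation.
from transformers import TrainingArguments

num_gpus = 8               # e.g. one 8-GPU node (assumed for illustration)
per_device_batch = 2       # whatever fits in GPU memory
target_global_batch = 64

grad_accum = target_global_batch // (per_device_batch * num_gpus)  # -> 4

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=per_device_batch,
    gradient_accumulation_steps=grad_accum,
)
# effective global batch = per_device_batch * num_gpus * grad_accum = 2 * 8 * 4 = 64
```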

Benchmarks

Testing speed and loss for FP16, BF16, and TF32, and for different gradient accumulation step counts:
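For reference, a short sketch of how these precision settings are toggled (directly in PyTorch and via the HF Trainer); the specific values are illustrative, not our benchmark settings.

```python
# Sketch: precision toggles compared in the benchmarks above.
import torch
from transformers import TrainingArguments

# TF32 matmuls on Ampere+ GPUs (A100/H100): faster, slightly reduced precision.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

args = TrainingArguments(
    output_dir="out",
    bf16=True,    # or fp16=True for mixed precision on pre-Ampere GPUs
    tf32=True,    # same effect as the torch.backends flags above
)
```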

Methods and tools for efficient training on a single GPU

Multiple GPUs and parallelism

DeepSpeed:
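A minimal sketch of a DeepSpeed ZeRO stage 2 config as it might be passed to the HF Trainer; the stage choice and the "auto" values are illustrative (the Trainer fills "auto" fields from its own arguments), not our actual settings.

```python
# Sketch: DeepSpeed ZeRO stage 2 config, passed to the HF Trainer as a dict
# (it could equally live in a ds_config.json file given via --deepspeed).
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,                   # shard optimizer state + gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    deepspeed=ds_config,              # accepts a dict or a path to the JSON file
)
```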