Training Techniques
RL
KTO
SimPO
ExPO
https://huggingface.co/papers/2405.19107 introduces DRO (Direct Reward Optimisation), a framework and associated algorithms that do not require pairwise preferences.
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF https://arxiv.org/abs/2405.19320
Xwin-LM: Strong and Scalable Alignment Practice for LLMs https://huggingface.co/papers/2405.20335
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment https://huggingface.co/papers/2405.19332
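Of the methods above, KTO is notable for training on unpaired feedback (similar in spirit to DRO's move away from pairwise preferences). A minimal sketch using TRL's KTOTrainer, with a placeholder model and a tiny illustrative inline dataset (hyperparameters are not project settings; recent TRL versions take `processing_class=`, older ones `tokenizer=`):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "example-org/base-model"  # placeholder, not an actual shisa checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# KTO uses unpaired feedback: each row is a single completion labelled
# desirable (True) or undesirable (False) -- no chosen/rejected pairs needed.
train_dataset = Dataset.from_dict({
    "prompt":     ["What is the capital of Japan?", "What is the capital of Japan?"],
    "completion": ["Tokyo.", "Kyoto."],
    "label":      [True, False],
})

args = KTOConfig(output_dir="kto-out", per_device_train_batch_size=2, beta=0.1)
trainer = KTOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older TRL releases
)
trainer.train()
```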
PEFT
We aim to have access to enough compute (2 x H100/A100-80 nodes) to do full fine-tunes, but it would be interesting to see how performance compares with some of the PEFT techniques.
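One concrete baseline for that comparison would be a plain LoRA run. A minimal sketch using the peft library (the model name, rank, and target module names are illustrative assumptions; the Llama-style projection names would need to match the actual architecture):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("example-org/base-model")  # placeholder

lora_config = LoraConfig(
    r=16,                    # adapter rank
    lora_alpha=32,           # scaling factor applied to the adapter update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```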
MoRA
LISA
Real World Testing
Batch Size
For throughput, a larger batch size is more efficient, but overly large batch sizes can lead to poorer training results: https://medium.com/mini-distill/effect-of-batch-size-on-training-dynamics-21c14f7a716e
In practice, since we are typically memory-limited for our tunes, we don't have to worry about our batch sizes getting too big. Aiming for a global batch size of 64 is probably fine (see also gradient accumulation).
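For concreteness, the global batch size is just per-device batch size × gradient accumulation steps × number of GPUs, so 64 is reachable even with a small per-device batch. A sketch with illustrative numbers (the GPU count and per-device size below are assumptions, not our actual settings):

```python
from transformers import TrainingArguments

num_gpus = 8                      # illustrative assumption; adjust to the actual node/GPU count
per_device_train_batch_size = 2   # whatever fits in memory
gradient_accumulation_steps = 4

global_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
assert global_batch_size == 64

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
)
```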
Benchmarks
Testing speed and loss for FP16, BF16, and TF32, and for gradient accumulation steps:
Methods and tools for efficient training on a single GPU
Multiple GPUs and parallelism
DeepSpeed:
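Both the precision options above and DeepSpeed are toggled through the HF Trainer. A minimal sketch (the flag values and the ZeRO config path are placeholders, not our actual training config):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    bf16=True,                          # or fp16=True on pre-Ampere GPUs; enable only one
    tf32=True,                          # TF32 matmuls on Ampere+ (A100/H100)
    gradient_accumulation_steps=4,
    deepspeed="ds_zero2_config.json",   # placeholder path to a DeepSpeed ZeRO config
)
```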