# Preparations for the first start
Here are the key steps to make sure everything launches smoothly:

- Set **ALL** arguments in the config correctly. The most important ones are:
  - Paths to everything.
  - `max_cache_size_gb` - set it according to how much storage you can shell out for the main h5 dataset.
  - `crop_distr_to_size` - set it to the vocabulary size of the base model used in the teachers.

  If anything isn't set as it should be, it'll crash. But overall, all arguments are pretty important, else they wouldn't be provided. Take a look at Configuration for more info, and see the sketch after this list for the most important entries.
- Make sure your main dataset and validation dataset are structured in the format accepted by the pipeline: Data Preparation
- Prepare all the models according to this page: Preparing Models
- And this should be it: you're ready to start distillation by launching `collect_and_finetune.py` using your prepared venv.
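To make the first step concrete, here's a minimal sketch of what those entries might look like. Only `max_cache_size_gb` and `crop_distr_to_size` are named on this page; the path keys and values below are hypothetical placeholders, so check Configuration for the exact names your version expects.

```python
# Minimal sketch, NOT a complete or authoritative config.
# The path keys are hypothetical placeholders for "paths to everything".
config = {
    "cache_folder": "/data/distillery_cache",           # hypothetical: where the pipeline caches data
    "dataset_path": "/data/train_data.jsonl",           # hypothetical: main dataset
    "validation_dataset_path": "/data/val_data.jsonl",  # hypothetical: validation dataset
    "max_cache_size_gb": 200,     # storage budget for the main h5 dataset
    "crop_distr_to_size": 32000,  # vocab size of the teachers' base model
}
```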
## What to expect during the first start?
On the very first start of the pipeline, it'll do multiple things:

- Create the necessary folder structure in the `cache` folder that you've specified. It's advisable to point it at an empty folder, else some unexpected behaviors may arise.
- As the collection of data is done with ExllamaV2, all the models need to be in the `.safetensors` file format, not just `.pt`/`.bin`, so the pipeline will try to convert all models to the correct format automatically: it creates a new folder named after the model with a `_safetensors` suffix, converts all of the model's files into it, and once that's done, deletes the original model. A rough sketch of this conversion is shown after this list.
- Once all models are in the correct file format, it'll prompt you for a couple of things: the model's pipeline config, and the prompt format for that model. Both will be saved to the model's folder so you don't have to input them on every restart.
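For reference, here's a rough sketch of what such a conversion boils down to, using the `safetensors` and `torch` libraries directly. This is an illustration under simplifying assumptions (a single unsharded `pytorch_model.bin`, no tokenizer files), not the pipeline's actual conversion code.

```python
import os
import torch
from safetensors.torch import save_file

model_dir = "models/teacher_model"    # hypothetical path to the original model
out_dir = model_dir + "_safetensors"  # new folder: model name + "_safetensors"
os.makedirs(out_dir, exist_ok=True)

# Load the old-style checkpoint and re-save its tensors as .safetensors
state_dict = torch.load(os.path.join(model_dir, "pytorch_model.bin"), map_location="cpu")
state_dict = {k: v.contiguous() for k, v in state_dict.items()}  # safetensors needs contiguous tensors
save_file(state_dict, os.path.join(out_dir, "model.safetensors"))
```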
Config: mainly used for collection; these are the default values:

```python
config = {
    'batch_size': 1,
    'add_bos': True,
    'seq_chunk_len': 256,
    'completion': False
}
```
- `batch_size` - the batch size to use during collection.
- `add_bos` - whether or not to add a BOS token to every sample.
- `seq_chunk_len` - chunking length during the forward pass. From testing, 256 seems to perform the best, even better than larger values, for reasons unknown.
- `completion` - whether to treat this model as a completion or an instruct model. Used when the `ignore_model_type` arg is `False`: in that case, instruct data and completion data are collected separately, instruct samples by instruct teachers and completion samples by completion teachers; the student is trained on both instruct and completion samples either way. If `ignore_model_type` is `True`, the pipeline will not distinguish between model types, and will let all teachers collect all data.

This config is saved as `pipeline_config.json` in the model's folder.
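As a quick illustration, reading that file back with the defaults above would look like this (the model folder path is a hypothetical placeholder):

```python
import json

with open("models/teacher_model/pipeline_config.json") as f:
    pipeline_config = json.load(f)

print(pipeline_config)
# With the defaults: {'batch_size': 1, 'add_bos': True, 'seq_chunk_len': 256, 'completion': False}
```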
Prompt format: used for collection of instruct samples; these are the default values:

```python
prompt_format = {
    'SYS_START': "### System:\n",
    'USER_START': "### User:\n",
    'ASSISTANT_START': "### Assistant:\n",
    'SYS_END': "\n",
    'USER_END': "\n",
    'ASSISTANT_END': "\n"
}
```
Note: CHANGE THESE VALUES TO THE CORRECT PROMPT FORMAT OF THE TEACHER

The names of the keys should be fairly informative already, but here's a short explanation anyway: `USER_START` is what gets added before every user turn, and `USER_END` is what gets added to the end of every user turn. Both can contain special tokens. The same logic applies to all other keys.
With the default values, this is how a conversation will be formatted:

```json
{"init": "sys message of sorts", "conversations": [{"from": "human", "value": "User said hi"}, {"from": "gpt", "value": "AI said hi"}]}
```

becomes:

```
### System:
sys message of sorts
### User:
User said hi
### Assistant:
AI said hi
```
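For the curious, here's a rough sketch of how these keys combine into the final string, reusing the default `prompt_format` dict from above. This is an illustration of the logic described on this page, not the pipeline's actual code.

```python
def format_conversation(sample: dict, fmt: dict) -> str:
    """Apply a prompt format to one ShareGPT-style sample."""
    text = ""
    if sample.get("init"):  # system message, if present
        text += fmt["SYS_START"] + sample["init"] + fmt["SYS_END"]
    for turn in sample["conversations"]:
        if turn["from"] == "human":
            text += fmt["USER_START"] + turn["value"] + fmt["USER_END"]
        else:  # assumed to be "gpt"
            text += fmt["ASSISTANT_START"] + turn["value"] + fmt["ASSISTANT_END"]
    return text

sample = {
    "init": "sys message of sorts",
    "conversations": [
        {"from": "human", "value": "User said hi"},
        {"from": "gpt", "value": "AI said hi"},
    ],
}
print(format_conversation(sample, prompt_format))  # prints the example above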
- Next up, if you didn't fuck up any args, it should be smooth sailing: first comes collection of the data into the h5 dataset, then training the student on all of it. May god be merciful on your soul, and may it not crash or OOM during any of this.

Note: it'll ask you for wandb info needed for logging at the very beginning of training. Some text may be cropped due to the nature of progress bars and all, so try to figure out what it wants, and good luck.