Hyper‐parameters Teacher Model - transducens/demint GitHub Wiki

Llama Factory

Important hyper-parameters

Model: Llama3-8B
Dataset: TheBigSix (TSCC v2, CIMA, Conversational Uptake, Multicultural classroom discourse dataset, MathDIAL, ClariQ)
Fine Tuning Method: Lora (Full, Freeze, Lora)
- Full: Full fine-tuning involves updating all model parameters during training.
- Freeze: Freezing certain layers (typically lower layers) and only training the top layers. This reduces computational cost and avoids overfitting to some extent.
- Lora: LoRA involves adding trainable low-rank matrices to the model’s weights, which allows efficient fine-tuning by updating a small number of parameters.
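To make the efficiency argument for LoRA concrete, the following minimal sketch compares the number of trainable parameters in a full weight matrix with the number in its two low-rank factors. The 4096 x 4096 shape is an illustrative assumption, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in matrix is trained."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative square projection matrix (assumed shape, not read from the model).
d = 4096
for rank in (8, 128):
    lora = lora_trainable_params(d, d, rank)
    share = lora / full_params(d, d)
    print(f"rank={rank}: {lora} trainable parameters ({share:.2%} of the full matrix)")
```

Even at rank 128, the low-rank factors hold only a small fraction of the parameters of the matrix they adapt.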
Adapter Path: None
- Adapters are small neural network modules inserted into each layer of a pre-trained model. During fine-tuning, instead of updating the entire model's parameters, only the parameters of these adapters are updated. This makes the fine-tuning process more efficient in terms of both memory and computational requirements.
Quantization bit: 8 (none, 8, 4)
- Quantization is a technique used to reduce the computational and memory requirements of neural networks by representing weights and activations with lower precision. Quantization typically involves converting 32-bit floating-point (FP32) numbers to lower-bit representations such as 16-bit, 8-bit, 4-bit, or even 2-bit.
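The idea can be sketched with a simple absolute-maximum scheme that maps floats to 8-bit integers and back. This is a simplified illustration, not the scheme used by the quantization library itself, and the function names are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
# Each restored value differs from the original by at most half a quantization step.
print(codes, restored)
```

The rounding error per weight is bounded by half the scale, which is the precision traded for the smaller memory footprint.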
Prompt template: Llama3 (llama3, mistral, olmo, openchat, ...)
- Prompt templates are a powerful tool for structuring input data for fine-tuning language models. As we are fine-tuning Llama3-8B, we are using the llama3 template.
RoPE scaling: dynamic (none, linear, dynamic)
- None: No additional scaling is applied to the rotary position embeddings. The model uses the embeddings as they are. Suitable for tasks where the default positional encoding is sufficient, and sequences are of typical lengths encountered during pre-training.
- Linear: Applies a linear scaling factor to the rotary position embeddings. This can help the model adjust its sensitivity to positional information linearly with respect to the position in the sequence. Useful for tasks where the input sequences are longer than those seen during pre-training or where a linear adjustment to positional sensitivity is beneficial.
- Dynamic: Adjusts the scaling factor dynamically based on the sequence length or other factors. This allows the model to adapt its positional sensitivity more flexibly, improving performance on tasks with varying sequence lengths. Ideal for tasks with highly variable sequence lengths or where the model needs to dynamically adjust its understanding of positional information.
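The difference between linear and dynamic scaling can be sketched as follows. Linear scaling divides the position by a fixed factor, while the dynamic variant shown here enlarges the rotary base only when the sequence exceeds the pre-training length. The dynamic formula follows the NTK-aware approach as commonly implemented and should be read as an assumption; the default base, factor, and function names are illustrative.

```python
def rope_angles(position, dim, base=10000.0, linear_factor=1.0):
    """Rotation angles for one position; linear scaling compresses the position index."""
    scaled_position = position / linear_factor
    return [scaled_position * base ** (-2 * i / dim) for i in range(dim // 2)]


def dynamic_ntk_base(seq_len, max_position, dim, base=10000.0, factor=2.0):
    """Return the rotary base, enlarged only for sequences longer than max_position."""
    if seq_len <= max_position:
        return base  # behaves like "none" for ordinary lengths
    growth = (factor * seq_len / max_position) - (factor - 1)
    return base * growth ** (dim / (dim - 2))


# With linear_factor=2, position 8 receives the angles that position 4 had originally.
print(rope_angles(8, dim=4, linear_factor=2.0) == rope_angles(4, dim=4))
# The dynamic base grows only once the input is longer than the pre-training window.
print(dynamic_ntk_base(1024, 2048, dim=128), dynamic_ntk_base(4096, 2048, dim=128))
```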
Booster: unsloth (none, flashattn2, unsloth)
- None: No additional booster or optimization technique is applied. The model operates with its default configurations.
- Flashattn2: Flash Attention is a technique designed to make the attention mechanism in transformer models more efficient by reducing memory usage and computational overhead.
- Unsloth: Combines Flash Attention with other optimizations that aim to speed up training and reduce the computational load without sacrificing model accuracy. According to the project's own claims, it "can speed up LLM training by up to 30 times while also reducing the amount of memory needed by 60%."
Visual inputs: no (no, yes)
- Capability of the model to handle and integrate visual data as part of its input. This option is relevant for multimodal models that are designed to process not just text, but also images, videos, or other forms of visual data.
Stage: Supervised Fine-Tuning (Supervised Fine-Tuning, Reward Modeling, PPO, DPO, ORPO, Pre-training)
- Supervised Fine-Tuning: This involves fine-tuning a pre-trained model on a specific dataset with labeled examples. The model learns to map input sequences to target outputs based on supervised learning principles. Fine-tuning for specific tasks such as classification, question answering, summarization, etc., where labeled data is available.
- Reward Modeling: Involves training a model to predict rewards based on certain criteria, typically used as a precursor to reinforcement learning techniques. The reward model helps in evaluating the quality of generated outputs. Preparing for reinforcement learning steps like PPO by developing a reward signal that guides the policy learning.
- PPO (Proximal Policy Optimization): A reinforcement learning algorithm that optimizes the policy by interacting with an environment. PPO uses the reward signal to adjust the model’s behavior, making it suitable for fine-tuning models based on user feedback or other interaction-based criteria. Tasks requiring model behavior optimization based on interaction, such as chatbots or game-playing agents.
- DPO (Direct Preference Optimization): An alternative to the reward-model-plus-PPO pipeline. It trains the policy directly on pairs of preferred and rejected responses, using a classification-style loss measured against a frozen reference model, so no separate reward model or reinforcement learning loop is needed. Suitable when preference pairs are available and a simpler, more stable alternative to PPO is wanted.
- ORPO (Odds Ratio Preference Optimization): A preference-optimization method that adds an odds-ratio penalty on rejected responses to the supervised fine-tuning loss, so that instruction tuning and preference alignment happen in a single stage without a reference model. Suitable when preference pairs are available and a single training stage is desired.
- Pre-training: The initial phase of training a language model on a large, diverse corpus of text data. This stage aims to learn general language representations that can later be fine-tuned for specific tasks. Building a strong foundation for a language model that can be adapted to various downstream tasks through fine-tuning.
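For the stage selected here, Supervised Fine-Tuning, the training signal can be sketched as an average negative log-likelihood over the response tokens only, with the prompt tokens masked out. The function name and the toy numbers are hypothetical; a real trainer computes the log-probabilities from the model.

```python
def sft_loss(token_logprobs, is_response):
    """Average negative log-likelihood over response tokens; prompt tokens are masked."""
    losses = [-lp for lp, keep in zip(token_logprobs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)


# Toy example: the first two tokens belong to the prompt, the last two to the reply.
logprobs = [-0.1, -0.2, -1.0, -3.0]
mask = [False, False, True, True]
print(sft_loss(logprobs, mask))  # only the two reply tokens contribute
```

Masking the prompt means the model is rewarded for producing the target reply, not for reproducing the input it was given.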
Learning Rate: 1e-4
- The learning rate is a crucial hyperparameter in training machine learning models, including fine-tuning large language models like Llama3-8B. It controls the step size at each iteration while moving toward a minimum of the loss function. In simpler terms, it determines how much the model's weights are updated with respect to the error each time the model weights are updated.
- When using QLoRA, it's crucial to select an appropriate learning rate to ensure stable and effective training. Given that QLoRA involves both quantization and low-rank adaptation, a slightly conservative learning rate is generally recommended. Typical Range: 1e-4 to 5e-4, though starting lower can help ensure stability initially.
- Recommendation: Begin with a lower learning rate to ensure stability and to observe how the model behaves during training. You can use learning rate schedulers or gradually increase the learning rate if the model seems to be converging too slowly.
Epochs: 10
- Each epoch refers to one complete pass through the entire training dataset. During an epoch, the model processes every example in the dataset exactly once, which involves updating the model's parameters based on the computed loss from each example or batch of examples.
Maximum Gradient Norm: 1.0
- It is a parameter used to clip gradients during training to prevent them from becoming excessively large, which can destabilize the training process. By setting a maximum gradient norm, you ensure that the gradient updates remain within a controlled range, leading to more stable and effective training. This technique is particularly useful when fine-tuning large models like Llama3-8B, especially when combined with other strategies like learning rate scheduling and gradient accumulation. A common value for the maximum gradient norm is around 1.0, but this can be adjusted based on the specific needs of your training process.
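A minimal sketch of clipping by global norm, assuming the gradients have been flattened into a single list of floats. The function name is hypothetical, and deep learning frameworks apply the same idea across all parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already within the limit
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
# The direction of the update is preserved; only its length is reduced.
print(norm_before, clipped)
```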
Max samples: 100000
- In our dataset (TheBigSix), each sample is one conversation between teacher and student, including the "system" value. In total we have 89,664 samples, so a limit of 100000 keeps all of them.
Compute type: fp32 (fp16, bf16, fp32, pure_bf16)
- The compute type determines the precision of floating-point operations used during the training of the model. This affects how calculations are performed on the model's parameters.
- FP16: (Half-Precision Floating Point): Uses 16-bit floating-point numbers to reduce memory usage and increase computation speed on compatible hardware.
- BF16: (Brain Floating Point): Uses 16-bit floating-point numbers but with a larger exponent range than FP16, offering better numerical stability.
- FP32: (Single-Precision Floating Point): Uses 32-bit floating-point numbers, providing high numerical precision and stability.
- Pure BF16: Similar to BF16 but ensures that all operations are performed using BF16 precision without fallback to FP32.
- Difference with Quantization:
  - Quantization: Applies to how model weights and activations are stored and used, primarily affecting the inference phase but also sometimes the training phase.
  - Compute Type: Refers to the precision used for arithmetic operations during training and inference.
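The trade-off between FP16 and BF16 can be illustrated with the standard library alone. In this sketch, FP16 rounding uses the half-precision format of the `struct` module, and BF16 is approximated by truncating a float32 to its top 16 bits. Real BF16 conversion usually rounds instead of truncating, so treat the helper as a simplification; both function names are hypothetical.

```python
import math
import struct


def to_fp16(x):
    """Round x to IEEE half precision; values beyond the FP16 range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(float("inf"), x)


def to_bf16(x):
    """Keep the sign, the 8 exponent bits, and the top 7 mantissa bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Range: FP16 overflows on large values, BF16 keeps the float32 exponent range.
print(to_fp16(70000.0), to_bf16(1e38))
# Precision: FP16 keeps more mantissa bits than BF16.
print(to_fp16(1.001), to_bf16(1.001))
```

BF16 trades mantissa bits for exponent bits, which is why it tolerates large activations and gradients better but represents each value more coarsely.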
Cutoff length: 3500
- It specifies the maximum number of tokens for each input sequence during training or inference. It helps manage computational efficiency and memory usage by ensuring consistent input sizes. In "TheBigSix", the longest conversation has 177588 words, which is far too many. Only 5 conversations are that large (67926, 102975, 105268, 160137, and 177588 words); once those 5 are removed, the longest remaining conversation has 2394 words.
- The cutoff was set to 3500 to ensure that all the tokens fit after tokenization. Allowing about 6 tokens per sequence (the value of each speaker) and given that the largest conversation (excluding the 5 biggest ones) has 97 sequences, the estimate is 2394 + 97 × 6 = 2976, so a cutoff of 3000 could be enough. To leave some margin, 3500 was selected.
Batch size: 90
- Batch size is the number of training examples utilized in one iteration. It determines how many samples the model processes before updating the internal model parameters.
- In our dataset (TheBigSix), each sample is a whole conversation between teacher and student, including the "system" value. In total we have 89,664 samples (approximately 90,000). We chose a large batch size because capable GPUs are available.
Gradient accumulation: 8
- It is a technique that allows for larger effective batch sizes by accumulating gradients over multiple mini-batches before updating the model parameters. This is particularly useful for training large models on hardware with limited memory.
- If your hardware cannot accommodate very large batch sizes due to memory constraints, gradient accumulation allows you to simulate larger batch sizes by accumulating gradients over multiple smaller batches.
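With the values on this page (batch size 90, gradient accumulation 8), the effective batch size and the number of parameter updates per epoch can be worked out as below. The sketch assumes a single device, no validation split, and that partial batches are rounded up; the function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples that contribute to one parameter update."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_steps_per_epoch(num_samples, per_device_batch, grad_accum_steps, num_devices=1):
    """Parameter updates per epoch, rounding up partial batches (an assumption)."""
    micro_batches = math.ceil(num_samples / (per_device_batch * num_devices))
    return math.ceil(micro_batches / grad_accum_steps)


print(effective_batch_size(90, 8))              # samples per update
print(optimizer_steps_per_epoch(89664, 90, 8))  # updates per epoch
```

Holding out part of the data for validation or training on several devices would reduce the number of updates per epoch accordingly.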
Val size: 0.1 (0 - 1)
- Validation size parameter specifies the portion of the dataset that is set aside for validation during the training process. Validation data is used to evaluate the model's performance on unseen data after each epoch, which helps in monitoring the model's generalization ability and tuning hyperparameters.
LR scheduler: reduce_lr_on_plateau (linear, cosine, cosine_with_restarts, polynomial, constant, constant_with_warmup, inverse_sqrt, reduce_lr_on_plateau)
- Learning Rate scheduler is a method or strategy to adjust the learning rate during the training process. It changes the learning rate according to a pre-defined schedule or based on certain conditions to improve the training process.
- linear: Decreases the learning rate linearly from an initial value to a final value over the course of training. Simple and effective for many standard training scenarios.
- cosine: Decreases the learning rate along half a cosine curve, from the initial value towards zero over the course of training: slowly at first, faster in the middle, and slowly again near the end. A common default for fine-tuning.
- cosine_with_restarts: Follows the cosine decay but includes periodic restarts where the learning rate is reset to a higher value. Helps the model escape local minima and explore new parts of the parameter space.
- polynomial: Decreases the learning rate according to a polynomial decay function. Provides more flexibility than linear decay, allowing for slower decreases in the learning rate.
- constant: Keeps the learning rate constant throughout the training process. Simple and effective when you do not want to change the learning rate.
- constant_with_warmup: Starts with a small learning rate and gradually increases it to a specified value (warmup phase), then keeps it constant. Helps stabilize training initially before settling into a constant learning rate.
- inverse_sqrt: Decreases the learning rate proportionally to the inverse square root of the step number. Often used in transformer models and sequence-to-sequence tasks.
- reduce_lr_on_plateau: Reduces the learning rate when a specified metric (e.g., validation loss) has stopped improving. Useful for fine-tuning and scenarios where you need to adjust the learning rate based on performance metrics.
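Three of these schedules can be sketched in a few lines. The first two functions return a multiplier applied to the initial learning rate, and the third replays a list of validation losses and lowers the rate when they stop improving. The `factor` and `patience` values and all function names are illustrative assumptions, not the defaults of any particular library.

```python
import math


def linear_with_warmup(step, warmup_steps, total_steps):
    """Ramp up from 0 to 1 during warmup, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


def cosine_with_warmup(step, warmup_steps, total_steps):
    """Ramp up during warmup, then follow half a cosine curve down to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


def reduce_on_plateau(lr, val_losses, factor=0.1, patience=2):
    """Multiply lr by factor whenever the loss fails to improve for more than `patience` checks."""
    best, bad_checks = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, bad_checks = loss, 0
        else:
            bad_checks += 1
            if bad_checks > patience:
                lr *= factor
                bad_checks = 0
    return lr


for step in (0, 10, 55, 100):
    print(step, linear_with_warmup(step, 10, 100), cosine_with_warmup(step, 10, 100))
print(reduce_on_plateau(1e-4, [1.0, 0.9, 0.9, 0.9, 0.9]))
```

Unlike the step-based schedules, the plateau schedule needs a validation metric, so it only makes sense when evaluation runs during training.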
Extras:
- Logging steps: 10
  - It refers to the intervals at which training metrics are recorded and reported during the training process. This logging helps you monitor the training progress, diagnose issues, and make adjustments to the training configuration if necessary.
  - As the dataset "TheBigSix" has approximately 90,000 samples and the batch size is 90, there will be about 1000 batches in each epoch (with a gradient accumulation of 8, that corresponds to about 125 parameter updates).
- Save steps: 100
  - It refers to the intervals at which the current state of the model is saved to disk during the training process. This feature is essential for various reasons, including the ability to resume training from a checkpoint, analyze the model's progress over time, and ensure that work is not lost in case of interruptions.
- Warmup steps: 10
  - It refers to the initial phase of training during which the learning rate is gradually increased from a small value to the initial learning rate. This helps to stabilize the training process and prevent large updates to the model parameters early on, which can cause instability.
- NEFTune Alpha: 0
  - It is a parameter (neftune_noise_alpha) used in the Llama Factory framework for fine-tuning large language models (LLMs). NEFTune (Noisy Embedding Fine-Tuning) adds random noise to the token embedding vectors during training, and alpha controls the magnitude of that noise; with a value of 0, no noise is added. This technique is designed to improve the robustness and generalization of the model by introducing controlled noise into the training process.
- Optimizer: adamw_torch (adamw_torch, adamw_8bit, adafactor)
  - It refers to the algorithm used to adjust the weights of the neural network during training.
  - adamw_torch: This is the AdamW optimizer implemented in PyTorch. AdamW is a variant of the Adam optimizer with weight decay, which is often used to prevent overfitting by adding a penalty on the size of the weights. It is widely used and generally provides good performance across various tasks.
  - adamw_8bit: This is a more memory-efficient version of the AdamW optimizer, utilizing 8-bit precision for certain calculations. This can significantly reduce the memory footprint and increase computational speed, making it suitable for training very large models where memory is a constraint.
  - adafactor: This is an alternative optimizer to Adam that is particularly efficient in terms of memory usage. Adafactor dynamically scales the learning rates and uses less memory, making it well-suited for training very large models. It has been shown to work effectively in various large-scale training scenarios.
- Resize token embeddings: yes (no/yes)
  - Adjusts the size of the token embeddings. This is often used when the vocabulary size changes or if there is a need to match the embeddings size to a specific layer size in the model. Usually used when fine-tuning a pre-trained model on a new dataset with a different vocabulary.
- Upcast LayerNorm: no (no/yes)
  - Upcast weights of layernorm in float32.
  - Uses higher precision (e.g., float32 instead of float16) for Layer Normalization operations. To improve numerical stability and precision during training, which can be particularly important for very deep networks. Usually used in scenarios where model training suffers from instability due to lower precision arithmetic.
- Enable S^2 Attention: yes (no/yes)
  - Use shift short attention proposed by LongLoRA.
  - S^2-Attn (shift short attention) splits the sequence into groups and computes attention only within each group. In half of the attention heads the groups are shifted by half a group size, so that information can still flow between neighbouring groups. This approximates full attention at a much lower cost during fine-tuning. Usually used when extending a model to longer context lengths, where standard attention becomes computationally expensive.
- Pack sequences: yes (no/yes)
  - Pack sequences into samples of fixed length.
  - Packs sequences together to utilize computational resources more efficiently. The purpose is to increase training efficiency by reducing the amount of padding in batches, which can save memory and computation time. Usually used when dealing with variable-length sequences to improve throughput during training.
- Enable LLaMA Pro: no (no/yes)
  - Make the parameters in the expanded blocks trainable.
  - LLaMA Pro is a block-expansion method, not a separate product tier: extra transformer blocks are added to the pre-trained model and only those new blocks are trained, while the original blocks stay frozen. Enabling this option restricts training to the parameters of the expanded blocks. Usually used with a model that has already been expanded, to add new knowledge while limiting forgetting of what the original model learned.
- Enable external logger: no (no/yes)
  - Use TensorBoard or wandb to log experiment.
  - Enables logging of training metrics to an external logging service or system. The purpose is to provide better tracking, visualization, and analysis of the training process through external tools. Usually used when detailed monitoring and logging of the training process is needed, often for large-scale experiments or production environments.
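Among the extras above, NEFTune Alpha is the easiest to illustrate in isolation. The sketch below adds uniform noise to a sequence of embedding vectors, scaled by alpha divided by the square root of (sequence length × embedding dimension), which is the scaling described for NEFTune. The function name and the plain-list representation of embeddings are assumptions made for the example.

```python
import math
import random


def neftune_noise(embeddings, alpha, rng=None):
    """Add uniform noise in [-1, 1], scaled by alpha / sqrt(seq_len * dim), to each value."""
    rng = rng or random.Random()
    seq_len, dim = len(embeddings), len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + rng.uniform(-1.0, 1.0) * scale for x in row] for row in embeddings]


embeddings = [[0.0] * 4 for _ in range(2)]  # 2 tokens, 4 dimensions
print(neftune_noise(embeddings, alpha=0))   # alpha = 0 leaves the embeddings unchanged
print(neftune_noise(embeddings, alpha=5, rng=random.Random(0)))
```

The noise is applied only during training; at inference time the embeddings are used as they are.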
LoRA configurations:
- Rank: 128
  - Specifies the rank of the low-rank approximation. A lower rank reduces the number of parameters, making the model more efficient while potentially sacrificing some fine-tuning accuracy. Common values might be 4, 8, 16, etc., depending on the trade-off between efficiency and accuracy.
- Alpha: 1.0
  - A scaling factor applied to the low-rank updates. In standard LoRA the update is multiplied by alpha / rank, so alpha balances the influence of the low-rank adaptation on the model’s weights relative to the chosen rank. It is often adjusted based on empirical performance during validation.
- Dropout: 0
  - The dropout rate applied to the low-rank adaptation layers. The purpose is to prevent overfitting by randomly dropping out a fraction of the low-rank updates during training. Common values are 0.1, 0.2, etc., depending on the risk of overfitting.
- LoRA + LR ratio: 0
  - Sets the ratio between the learning rates of the two LoRA matrices, as proposed by LoRA+: the B matrix is trained with a learning rate that is a multiple of the one used for the A matrix. For instance, if the base learning rate is 'n' and the LoRA+ LR ratio is set to 'k', matrix A uses 'n' and matrix B uses 'k×n'.
  - The motivation is that the two matrices play different roles in the update, so training B more aggressively than A can lead to better convergence and performance.
- Create New Adapter: yes (no/yes)
  - Creates a new adapter with randomly initialized weights instead of continuing to train an existing one. The low-rank updates live in the new adapter layers, which preserves the integrity of the original model and keeps fine-tuning efficient and modular.
- rslora: no (no/yes)
  - Rank-Stabilized LoRA is a variant of LoRA that changes how the low-rank update is scaled: instead of multiplying it by alpha / rank, it uses alpha / sqrt(rank). The goal is to keep the size of the update stable as the rank grows, so that higher ranks can actually improve results instead of learning more slowly.
- DoRA: no (no/yes)
  - Weight-Decomposed Low-Rank Adaptation is a variant of LoRA that splits each pre-trained weight matrix into a magnitude component and a direction component. The direction is updated with LoRA, while the magnitude is trained as a separate small vector. The aim is to bring the learning behavior closer to that of full fine-tuning while keeping the number of trainable parameters low.
- Lora Modules: none
  - This parameter specifies the names of the modules within the model where Low-Rank Adaptation (LoRA) should be applied. Its goal is to target specific parts of the model for low-rank adaptation, allowing for focused and efficient fine-tuning. You list the names of the modules where LoRA should be applied. Multiple modules can be specified, separated by commas. Example: to apply LoRA to the query and value projections of the attention layers in a Llama model, you might specify something like 'q_proj,v_proj'.
- Additional Modules: none
  - This parameter specifies the names of the modules apart from the LoRA layers that should be set as trainable during the fine-tuning process. Its goal is to allow certain parts of the model, other than those specified for LoRA, to be trained. This provides additional flexibility in the fine-tuning process. You list the names of the modules to be set as trainable, separated by commas. Example: to fine-tune the embedding layer and the output head of a Llama model along with the LoRA adaptations, you might specify something like 'embed_tokens,lm_head'.
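How Rank and Alpha interact can be seen from the scale applied to the low-rank update. The sketch below compares the standard LoRA scale (alpha / rank) with the rank-stabilized variant (alpha / sqrt(rank)); the function name is hypothetical.

```python
import math


def lora_scale(alpha, rank, rank_stabilized=False):
    """Multiplier applied to the low-rank update B @ A before it is added to the weights."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)  # rsLoRA scaling
    return alpha / rank                 # standard LoRA scaling


# Values used on this page: rank 128, alpha 1.0.
print(lora_scale(1.0, 128))
print(lora_scale(1.0, 128, rank_stabilized=True))
```

With a large rank and a small alpha, the standard scale becomes very small, which is worth keeping in mind when choosing the two values together.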
Freeze configurations: default
- Freeze configurations allow you to specify which layers or components of the model should not be updated during training. This can be useful in scenarios where only a subset of the model's parameters needs to be fine-tuned.
RLHF configurations: default
- RLHF (Reinforcement Learning from Human Feedback) is a technique that uses human feedback to guide the training process of models. This feedback helps in refining the model's outputs to better align with human preferences.
GaLore configurations: default
- Stands for Gradient Low-Rank Projection, which is a technique to optimize the gradient updates during training by projecting them into a lower-dimensional space.
BAdam configurations: default
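- BAdam is a memory-efficient optimization method that combines block coordinate descent with Adam: the model's parameters are partitioned into blocks, and only one block at a time is updated with Adam while the others stay frozen. This allows full-parameter fine-tuning with a much smaller memory footprint.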
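As a rough sketch, several of the choices described on this page can be collected into a configuration dictionary keyed by argument names that appear in the full listing of training hyper-parameters on this page. The mapping from the interface labels to these names (for example, Batch size to per_device_train_batch_size) is an assumption, and the model path, dataset name, and output directory are hypothetical placeholders.

```python
import json

# Settings from this page, keyed by argument names from the hyper-parameter listing.
# The three string paths/names marked below are hypothetical placeholders.
config = {
    "model_name_or_path": "path/to/Llama3-8B",  # placeholder
    "dataset": "TheBigSix",                     # placeholder dataset name
    "output_dir": "output/teacher-model",       # placeholder
    "template": "llama3",
    "quantization_bit": 8,
    "rope_scaling": "dynamic",
    "use_unsloth": True,
    "visual_inputs": False,
    "shift_attn": True,
    "resize_vocab": True,
    "upcast_layernorm": False,
    "cutoff_len": 3500,
    "max_samples": 100000,
    "val_size": 0.1,
    "packing": True,
    "per_device_train_batch_size": 90,
    "do_train": True,
}


def dump_config(cfg):
    """Serialize the settings as JSON so they can be reviewed or stored with the run."""
    return json.dumps(cfg, indent=2, sort_keys=True)


print(dump_config(config))
```

Keeping the settings in one serialized file makes it easier to reproduce a run or to compare two runs side by side.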

All hyper-parameters for training

- model_name_or_path MODEL_NAME_OR_PATH
                      Path to the model weight or identifier from
                      huggingface.co/models or modelscope.cn/models.
                      (default: None)
- adapter_name_or_path ADAPTER_NAME_OR_PATH
                      Path to the adapter weight or identifier from
                      huggingface.co/models. (default: None)
- cache_dir CACHE_DIR
                      Where to store the pre-trained models downloaded from
                      huggingface.co or modelscope.cn. (default: None)
- use_fast_tokenizer [USE_FAST_TOKENIZER]
                      Whether or not to use one of the fast tokenizer
                      (backed by the tokenizers library). (default: True)
- no_use_fast_tokenizer
                      Whether or not to use one of the fast tokenizer
                      (backed by the tokenizers library). (default: False)
- resize_vocab [RESIZE_VOCAB]
                      Whether or not to resize the tokenizer vocab and the
                      embedding layers. (default: False)
- split_special_tokens [SPLIT_SPECIAL_TOKENS]
                      Whether or not the special tokens should be split
                      during the tokenization process. (default: False)
- new_special_tokens NEW_SPECIAL_TOKENS
                      Special tokens to be added into the tokenizer.
                      (default: None)
- model_revision MODEL_REVISION
                      The specific model version to use (can be a branch
                      name, tag name or commit id). (default: main)
- low_cpu_mem_usage [LOW_CPU_MEM_USAGE]
                      Whether or not to use memory-efficient model loading.
                      (default: True)
- no_low_cpu_mem_usage
                      Whether or not to use memory-efficient model loading.
                      (default: False)
- quantization_bit QUANTIZATION_BIT
                      The number of bits to quantize the model using
                      bitsandbytes. (default: None)
- quantization_type {fp4,nf4}
                      Quantization data type to use in int4 training.
                      (default: nf4)
- double_quantization [DOUBLE_QUANTIZATION]
                      Whether or not to use double quantization in int4
                      training. (default: True)
- no_double_quantization
                      Whether or not to use double quantization in int4
                      training. (default: False)
- quantization_device_map {auto}
                      Device map used to infer the 4-bit quantized model,
                      needs bitsandbytes>=0.43.0. (default: None)
- rope_scaling {linear,dynamic}
                      Which scaling strategy should be adopted for the RoPE
                      embeddings. (default: None)
- flash_attn {off,sdpa,fa2,auto}
                      Enable FlashAttention for faster training and
                      inference. (default: auto)
- shift_attn [SHIFT_ATTN]
                      Enable shift short attention (S^2-Attn) proposed by
                      LongLoRA. (default: False)
- mixture_of_depths {convert,load}
                      Convert the model to mixture-of-depths (MoD) or load
                      the MoD model. (default: None)
- use_unsloth [USE_UNSLOTH]
                      Whether or not to use unsloth's optimization for the
                      LoRA training. (default: False)
- visual_inputs [VISUAL_INPUTS]
                      Whether or not to use multimodal LLM that accepts
                      visual inputs. (default: False)
- moe_aux_loss_coef MOE_AUX_LOSS_COEF
                      Coefficient of the auxiliary router loss in mixture-
                      of-experts model. (default: None)
- disable_gradient_checkpointing [DISABLE_GRADIENT_CHECKPOINTING]
                      Whether or not to disable gradient checkpointing.
                      (default: False)
- upcast_layernorm [UPCAST_LAYERNORM]
                      Whether or not to upcast the layernorm weights in
                      fp32. (default: False)
- upcast_lmhead_output [UPCAST_LMHEAD_OUTPUT]
                      Whether or not to upcast the output of lm_head in
                      fp32. (default: False)
- infer_backend {huggingface,vllm}
                      Backend engine used at inference. (default:
                      huggingface)
- vllm_maxlen VLLM_MAXLEN
                      Maximum input length of the vLLM engine. (default:
                      2048)
- vllm_gpu_util VLLM_GPU_UTIL
                      The fraction of GPU memory in (0,1) to be used for the
                      vLLM engine. (default: 0.9)
- vllm_enforce_eager [VLLM_ENFORCE_EAGER]
                      Whether or not to disable CUDA graph in the vLLM
                      engine. (default: False)
- offload_folder OFFLOAD_FOLDER
                      Path to offload model weights. (default: offload)
- use_cache [USE_CACHE]
                      Whether or not to use KV cache in generation.
                      (default: True)
- no_use_cache        Whether or not to use KV cache in generation.
                      (default: False)
- hf_hub_token HF_HUB_TOKEN
                      Auth token to log in with Hugging Face Hub. (default:
                      None)
- ms_hub_token MS_HUB_TOKEN
                      Auth token to log in with ModelScope Hub. (default:
                      None)
- export_dir EXPORT_DIR
                      Path to the directory to save the exported model.
                      (default: None)
- export_size EXPORT_SIZE
                      The file shard size (in GB) of the exported model.
                      (default: 1)
- export_device EXPORT_DEVICE
                      The device used in model export, use cuda to avoid
                      addmm errors. (default: cpu)
- export_quantization_bit EXPORT_QUANTIZATION_BIT
                      The number of bits to quantize the exported model.
                      (default: None)
- export_quantization_dataset EXPORT_QUANTIZATION_DATASET
                      Path to the dataset or dataset name to use in
                      quantizing the exported model. (default: None)
- export_quantization_nsamples EXPORT_QUANTIZATION_NSAMPLES
                      The number of samples used for quantization. (default:
                      128)
- export_quantization_maxlen EXPORT_QUANTIZATION_MAXLEN
                      The maximum length of the model inputs used for
                      quantization. (default: 1024)
- export_legacy_format [EXPORT_LEGACY_FORMAT]
                      Whether or not to save the `.bin` files instead of
                      `.safetensors`. (default: False)
- export_hub_model_id EXPORT_HUB_MODEL_ID
                      The name of the repository if push the model to the
                      Hugging Face hub. (default: None)
- print_param_status [PRINT_PARAM_STATUS]
                      For debugging purposes, print the status of the
                      parameters in the model. (default: False)
- template TEMPLATE   Which template to use for constructing prompts in
                      training and inference. (default: None)
- dataset DATASET     The name of provided dataset(s) to use. Use commas to
                      separate multiple datasets. (default: None)
- dataset_dir DATASET_DIR
                      Path to the folder containing the datasets. (default:
                      data)
- split SPLIT         Which dataset split to use for training and
                      evaluation. (default: train)
- cutoff_len CUTOFF_LEN
                      The cutoff length of the tokenized inputs in the
                      dataset. (default: 1024)
- reserved_label_len RESERVED_LABEL_LEN
                      The minimum cutoff length reserved for the tokenized
                      labels in the dataset. (default: 1)
- train_on_prompt [TRAIN_ON_PROMPT]
                      Whether to disable the mask on the prompt or not.
                      (default: False)
- streaming [STREAMING]
                      Enable dataset streaming. (default: False)
- buffer_size BUFFER_SIZE
                      Size of the buffer to randomly sample examples from in
                      dataset streaming. (default: 16384)
- mix_strategy {concat,interleave_under,interleave_over}
                      Strategy to use in dataset mixing (concat/interleave)
                      (undersampling/oversampling). (default: concat)
- interleave_probs INTERLEAVE_PROBS
                      Probabilities to sample data from datasets. Use commas
                      to separate multiple datasets. (default: None)
- overwrite_cache [OVERWRITE_CACHE]
                      Overwrite the cached training and evaluation sets.
                      (default: False)
- preprocessing_num_workers PREPROCESSING_NUM_WORKERS
                      The number of processes to use for the pre-processing.
                      (default: None)
- max_samples MAX_SAMPLES
                      For debugging purposes, truncate the number of
                      examples for each dataset. (default: None)
- eval_num_beams EVAL_NUM_BEAMS
                      Number of beams to use for evaluation. This argument
                      will be passed to `model.generate` (default: None)
- ignore_pad_token_for_loss [IGNORE_PAD_TOKEN_FOR_LOSS]
                      Whether or not to ignore the tokens corresponding to
                      padded labels in the loss computation. (default: True)
- no_ignore_pad_token_for_loss
                      Whether or not to ignore the tokens corresponding to
                      padded labels in the loss computation. (default:
                      False)
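
The pair of flags above controls whether padded label positions count toward the loss. A minimal sketch of the idea follows; `mask_padded_labels` is a hypothetical helper, and `-100` is PyTorch's default `ignore_index` for cross-entropy.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padded_labels(labels, pad_token_id, ignore_pad_token_for_loss=True):
    """Replace padding in the labels so the loss skips it (hypothetical helper)."""
    if not ignore_pad_token_for_loss:
        # Padding is kept as an ordinary target token.
        return list(labels)
    return [IGNORE_INDEX if token == pad_token_id else token for token in labels]


print(mask_padded_labels([5, 6, 7, 0, 0], pad_token_id=0))  # [5, 6, 7, -100, -100]
```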
- val_size VAL_SIZE   Size of the development set, should be an integer or a
                      float in range `[0,1)`. (default: 0.0)
- packing PACKING     Whether or not to pack the sequences in training. Will
                      automatically enable in pre-training. (default: None)
- tokenized_path TOKENIZED_PATH
                      Path to save or load the tokenized datasets. (default:
                      None)
- output_dir OUTPUT_DIR
                      The output directory where the model predictions and
                      checkpoints will be written. (default: None)
- overwrite_output_dir [OVERWRITE_OUTPUT_DIR]
                      Overwrite the content of the output directory. Use
                      this to continue training if output_dir points to a
                      checkpoint directory. (default: False)
- do_train [DO_TRAIN]
                      Whether to run training. (default: False)
- do_eval [DO_EVAL]   Whether to run eval on the dev set. (default: False)
- do_predict [DO_PREDICT]
                      Whether to run predictions on the test set. (default:
                      False)
- evaluation_strategy {no,steps,epoch}
                      The evaluation strategy to use. (default: no)
- prediction_loss_only [PREDICTION_LOSS_ONLY]
                      When performing evaluation and predictions, only
                      returns the loss. (default: False)
- per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE
                      Batch size per GPU/TPU/MPS/NPU core/CPU for training.
                      (default: 8)
- per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE
                      Batch size per GPU/TPU/MPS/NPU core/CPU for
                      evaluation. (default: 8)
- per_gpu_train_batch_size PER_GPU_TRAIN_BATCH_SIZE
                      Deprecated, the use of `--per_device_train_batch_size`
                      is preferred. Batch size per GPU/TPU core/CPU for
                      training. (default: None)
- per_gpu_eval_batch_size PER_GPU_EVAL_BATCH_SIZE
                      Deprecated, the use of `--per_device_eval_batch_size`
                      is preferred. Batch size per GPU/TPU core/CPU for
                      evaluation. (default: None)
- gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS
                      Number of update steps to accumulate before
                      performing a backward/update pass. (default: 1)
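
The batch-size options combine multiplicatively. The short sketch below shows the usual arithmetic; the function name and the `num_devices` parameter are introduced here only for illustration.

```python
def effective_batch_size(per_device_train_batch_size=8,
                         gradient_accumulation_steps=1,
                         num_devices=1):
    """Number of examples that contribute to a single optimizer update."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


# Two GPUs, 8 examples per GPU, gradients accumulated over 4 steps.
print(effective_batch_size(8, 4, 2))  # 64
```

Raising `gradient_accumulation_steps` therefore increases the effective batch size without increasing per-device memory use.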
- eval_accumulation_steps EVAL_ACCUMULATION_STEPS
                      Number of prediction steps to accumulate before
                      moving the tensors to the CPU. (default: None)
- eval_delay EVAL_DELAY
                      Number of epochs or steps to wait for before the first
                      evaluation can be performed, depending on the
                      evaluation_strategy. (default: 0)
- learning_rate LEARNING_RATE
                      The initial learning rate for AdamW. (default: 5e-05)
- weight_decay WEIGHT_DECAY
                      Weight decay for AdamW if we apply some. (default:
                      0.0)
- adam_beta1 ADAM_BETA1
                      Beta1 for AdamW optimizer (default: 0.9)
- adam_beta2 ADAM_BETA2
                      Beta2 for AdamW optimizer (default: 0.999)
- adam_epsilon ADAM_EPSILON
                      Epsilon for AdamW optimizer. (default: 1e-08)
- max_grad_norm MAX_GRAD_NORM
                      Max gradient norm. (default: 1.0)
- num_train_epochs NUM_TRAIN_EPOCHS
                      Total number of training epochs to perform. (default:
                      3.0)
- max_steps MAX_STEPS
                      If > 0: set total number of training steps to perform.
                      Override num_train_epochs. (default: -1)
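
The relationship between `num_train_epochs` and `max_steps` can be sketched as follows. This is an approximation of how the total number of optimizer updates is typically derived, written under assumptions (the rounding details may differ between library versions), and `total_training_steps` is a hypothetical helper.

```python
import math


def total_training_steps(num_examples,
                         per_device_train_batch_size=8,
                         gradient_accumulation_steps=1,
                         num_devices=1,
                         num_train_epochs=3.0,
                         max_steps=-1):
    """Approximate number of optimizer updates for a run (hypothetical helper)."""
    if max_steps > 0:
        # A positive max_steps takes precedence over num_train_epochs.
        return max_steps
    batches_per_epoch = math.ceil(
        num_examples / (per_device_train_batch_size * num_devices)
    )
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(num_train_epochs * updates_per_epoch)


print(total_training_steps(1000, 8, 4))                # 93
print(total_training_steps(1000, 8, 4, max_steps=50))  # 50
```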
- lr_scheduler_type {linear,cosine,cosine_with_restarts,polynomial,constant,constant_with_warmup,inverse_sqrt,reduce_lr_on_plateau}
                      The scheduler type to use. (default: linear)
- lr_scheduler_kwargs LR_SCHEDULER_KWARGS
                      Extra parameters for the lr_scheduler such as
                      {'num_cycles': 1} for the cosine with hard restarts
                      (default: {})
- warmup_ratio WARMUP_RATIO
                      Linear warmup over warmup_ratio fraction of total
                      steps. (default: 0.0)
- warmup_steps WARMUP_STEPS
                      Linear warmup over warmup_steps. (default: 0)
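
The two warmup options and the default `linear` scheduler can be illustrated with a small sketch. It assumes that an explicit `warmup_steps` value takes precedence over `warmup_ratio` and that the learning rate rises linearly to its peak and then decays linearly to zero; the function names are hypothetical.

```python
import math


def resolve_warmup_steps(total_steps, warmup_steps=0, warmup_ratio=0.0):
    """Assumed rule: an explicit warmup_steps wins, else use the ratio."""
    if warmup_steps > 0:
        return warmup_steps
    return math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step, total_steps, base_lr=5e-5, num_warmup=0):
    """Learning rate at a given step for linear warmup plus linear decay."""
    if step < num_warmup:
        return base_lr * step / max(1, num_warmup)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - num_warmup)


warmup = resolve_warmup_steps(total_steps=200, warmup_ratio=0.25)
print(warmup)  # 50
for step in (0, 25, 50, 200):
    print(step, linear_schedule_lr(step, 200, base_lr=1.0, num_warmup=warmup))
```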
- log_level {detail,debug,info,warning,error,critical,passive}
                      Logger log level to use on the main node. Possible
                      choices are the log levels as strings: 'debug',
                      'info', 'warning', 'error' and 'critical', plus a
                      'passive' level which doesn't set anything and lets
                      the application set the level. Defaults to 'passive'.
                      (default: passive)
- log_level_replica {detail,debug,info,warning,error,critical,passive}
                      Logger log level to use on replica nodes. Same choices
                      and defaults as ``log_level`` (default: warning)
- log_on_each_node [LOG_ON_EACH_NODE]
                      When doing a multinode distributed training, whether
                      to log once per node or just once on the main node.
                      (default: True)
- no_log_on_each_node
                      When doing a multinode distributed training, whether
                      to log once per node or just once on the main node.
                      (default: False)
- logging_dir LOGGING_DIR
                      Tensorboard log dir. (default: None)
- logging_strategy {no,steps,epoch}
                      The logging strategy to use. (default: steps)
- logging_first_step [LOGGING_FIRST_STEP]
                      Log the first global_step (default: False)
- logging_steps LOGGING_STEPS
                      Log every X update steps. Should be an integer or a
                      float in range `[0,1)`. If smaller than 1, will be
                      interpreted as ratio of total training steps.
                      (default: 500)
- logging_nan_inf_filter [LOGGING_NAN_INF_FILTER]
                      Filter nan and inf losses for logging. (default: True)
- no_logging_nan_inf_filter
                      Filter nan and inf losses for logging. (default:
                      False)
- save_strategy {no,steps,epoch}
                      The checkpoint save strategy to use. (default: steps)
- save_steps SAVE_STEPS
                      Save checkpoint every X update steps. Should be an
                      integer or a float in range `[0,1)`. If smaller than
                      1, will be interpreted as ratio of total training
                      steps. (default: 500)
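
`logging_steps`, `save_steps` and `eval_steps` all accept either an absolute step count or a fraction of the run. The sketch below shows one way such a value can be resolved; rounding up is an assumption, and `resolve_interval` is a hypothetical helper.

```python
import math


def resolve_interval(value, max_steps):
    """Turn a step count or a ratio in [0, 1) into a number of steps."""
    if value < 1:
        # Interpreted as a fraction of the total training steps.
        return math.ceil(max_steps * value)
    return int(value)


print(resolve_interval(500, 2000))   # 500
print(resolve_interval(0.25, 2000))  # 500
```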
- save_total_limit SAVE_TOTAL_LIMIT
                      If a value is passed, will limit the total amount of
                      checkpoints. Deletes the older checkpoints in
                      `output_dir`. When `load_best_model_at_end` is
                      enabled, the 'best' checkpoint according to
                      `metric_for_best_model` will always be retained in
                      addition to the most recent ones. For example, for
                      `save_total_limit=5` and
                      `load_best_model_at_end=True`, the four last
                      checkpoints will always be retained alongside the best
                      model. When `save_total_limit=1` and
                      `load_best_model_at_end=True`, it is possible that two
                      checkpoints are saved: the last one and the best one
                      (if they are different). Default is unlimited
                      checkpoints (default: None)
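
The retention rule described for `save_total_limit` can be sketched in a few lines. This is a simplified model of the behaviour stated above, not the Trainer's own code: checkpoints are represented by their step numbers, and `checkpoints_to_keep` is a hypothetical helper.

```python
def checkpoints_to_keep(steps, save_total_limit=None, best_step=None):
    """Return the checkpoint steps that survive rotation (hypothetical helper).

    best_step stands for the best checkpoint when load_best_model_at_end
    is enabled, and should be None otherwise.
    """
    ordered = sorted(steps)
    if save_total_limit is None or len(ordered) <= save_total_limit:
        return ordered  # unlimited, or nothing to delete yet

    limit = save_total_limit
    if best_step is not None and limit == 1 and ordered[-1] != best_step:
        limit = 2  # keep both the best and the most recent checkpoint

    keep = []
    if best_step is not None and best_step in ordered:
        keep.append(best_step)  # the best checkpoint is always retained
    for step in reversed(ordered):  # then fill up with the most recent ones
        if len(keep) >= limit:
            break
        if step not in keep:
            keep.append(step)
    return sorted(keep)


saved = [500, 1000, 1500, 2000, 2500, 3000, 3500]
print(checkpoints_to_keep(saved, save_total_limit=5, best_step=500))
# [500, 2000, 2500, 3000, 3500]
print(checkpoints_to_keep(saved, save_total_limit=1, best_step=1000))
# [1000, 3500]
```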
- save_safetensors [SAVE_SAFETENSORS]
                      Use safetensors saving and loading for state dicts
                      instead of default torch.load and torch.save.
                      (default: True)
- no_save_safetensors
                      Use safetensors saving and loading for state dicts
                      instead of default torch.load and torch.save.
                      (default: False)
- save_on_each_node [SAVE_ON_EACH_NODE]
                      When doing multi-node distributed training, whether to
                      save models and checkpoints on each node, or only on
                      the main one (default: False)
- save_only_model [SAVE_ONLY_MODEL]
                      When checkpointing, whether to only save the model, or
                      also the optimizer, scheduler & rng state. Note that
                      when this is true, you won't be able to resume
                      training from checkpoint. This enables you to save
                      storage by not storing the optimizer, scheduler & rng
                      state. You can only load the model using
                      from_pretrained with this option set to True.
                      (default: False)
- no_cuda [NO_CUDA]   This argument is deprecated. It will be removed in
                      version 5.0 of 🤗 Transformers. (default: False)
- use_cpu [USE_CPU]   Whether or not to use cpu. If set to False, we will
                      use cuda/tpu/mps/npu device if available. (default:
                      False)
- use_mps_device [USE_MPS_DEVICE]
                      This argument is deprecated. `mps` device will be used
                      if available similar to `cuda` device. It will be
                      removed in version 5.0 of 🤗 Transformers (default:
                      False)
- seed SEED           Random seed that will be set at the beginning of
                      training. (default: 42)
- data_seed DATA_SEED
                      Random seed to be used with data samplers. (default:
                      None)
- jit_mode_eval [JIT_MODE_EVAL]
                      Whether or not to use PyTorch jit trace for inference
                      (default: False)
- use_ipex [USE_IPEX]
                      Use Intel extension for PyTorch when it is available,
                      installation: 'https://github.com/intel/intel-extension-for-pytorch'
                      (default: False)
- bf16 [BF16]         Whether to use bf16 (mixed) precision instead of
                      32-bit. Requires Ampere or higher NVIDIA architecture
                      or using CPU (use_cpu) or Ascend NPU. This is an
                      experimental API and it may change. (default: False)
- fp16 [FP16]         Whether to use fp16 (mixed) precision instead of
                      32-bit (default: False)
- fp16_opt_level FP16_OPT_LEVEL
                      For fp16: Apex AMP optimization level selected in
                      ['O0', 'O1', 'O2', and 'O3']. See details at
                      https://nvidia.github.io/apex/amp.html (default: O1)
- half_precision_backend {auto,apex,cpu_amp}
                      The backend to be used for half precision. (default:
                      auto)
- bf16_full_eval [BF16_FULL_EVAL]
                      Whether to use full bfloat16 evaluation instead of
                      32-bit. This is an experimental API and it may change.
                      (default: False)
- fp16_full_eval [FP16_FULL_EVAL]
                      Whether to use full float16 evaluation instead of
                      32-bit (default: False)
- tf32 TF32           Whether to enable tf32 mode, available in Ampere and
                      newer GPU architectures. This is an experimental API
                      and it may change. (default: None)
- local_rank LOCAL_RANK
                      For distributed training: local_rank (default: -1)
- ddp_backend {nccl,gloo,mpi,ccl,hccl}
                      The backend to be used for distributed training
                      (default: None)
- tpu_num_cores TPU_NUM_CORES
                      TPU: Number of TPU cores (automatically passed by
                      launcher script) (default: None)
- tpu_metrics_debug [TPU_METRICS_DEBUG]
                      Deprecated, the use of `--debug tpu_metrics_debug` is
                      preferred. TPU: Whether to print debug metrics
                      (default: False)
- debug DEBUG [DEBUG ...]
                      Whether or not to enable debug mode. Current options:
                      `underflow_overflow` (Detect underflow and overflow in
                      activations and weights), `tpu_metrics_debug` (print
                      debug metrics on TPU). (default: None)
- dataloader_drop_last [DATALOADER_DROP_LAST]
                      Drop the last incomplete batch if it is not divisible
                      by the batch size. (default: False)
- eval_steps EVAL_STEPS
                      Run an evaluation every X steps. Should be an integer
                      or a float in range `[0,1)`. If smaller than 1, will
                      be interpreted as ratio of total training steps.
                      (default: None)
- dataloader_num_workers DATALOADER_NUM_WORKERS
                      Number of subprocesses to use for data loading
                      (PyTorch only). 0 means that the data will be loaded
                      in the main process. (default: 0)
- dataloader_prefetch_factor DATALOADER_PREFETCH_FACTOR
                      Number of batches loaded in advance by each worker. 2
                      means there will be a total of 2 * num_workers batches
                      prefetched across all workers. Default is 2 for
                      PyTorch < 2.0.0 and otherwise None. (default: None)
- past_index PAST_INDEX
                      If >=0, uses the corresponding part of the output as
                      the past state for next step. (default: -1)
- run_name RUN_NAME   An optional descriptor for the run. Notably used for
                      wandb logging. (default: None)
- disable_tqdm DISABLE_TQDM
                      Whether or not to disable the tqdm progress bars.
                      (default: None)
- remove_unused_columns [REMOVE_UNUSED_COLUMNS]
                      Remove columns not required by the model when using an
                      nlp.Dataset. (default: True)
- no_remove_unused_columns
                      Remove columns not required by the model when using an
                      nlp.Dataset. (default: False)
- label_names LABEL_NAMES [LABEL_NAMES ...]
                      The list of keys in your dictionary of inputs that
                      correspond to the labels. (default: None)
- load_best_model_at_end [LOAD_BEST_MODEL_AT_END]
                      Whether or not to load the best model found during
                      training at the end of training. When this option is
                      enabled, the best checkpoint will always be saved. See
                      `save_total_limit` for more. (default: False)
- metric_for_best_model METRIC_FOR_BEST_MODEL
                      The metric to use to compare two different models.
                      (default: None)
- greater_is_better GREATER_IS_BETTER
                      Whether the `metric_for_best_model` should be
                      maximized or not. (default: None)
- ignore_data_skip [IGNORE_DATA_SKIP]
                      When resuming training, whether or not to skip the
                      first epochs and batches to get to the same training
                      data. (default: False)
- fsdp FSDP           Whether or not to use PyTorch Fully Sharded Data
                      Parallel (FSDP) training (in distributed training
                      only). The base option should be `full_shard`,
                      `shard_grad_op` or `no_shard` and you can add
                      CPU-offload to `full_shard` or `shard_grad_op` like
                      this: `full_shard offload` or `shard_grad_op offload`.
                      You can add auto-wrap to `full_shard` or
                      `shard_grad_op` with the same syntax: `full_shard
                      auto_wrap` or `shard_grad_op auto_wrap`. (default: )
- fsdp_min_num_params FSDP_MIN_NUM_PARAMS
                      This parameter is deprecated. FSDP's minimum number of
                      parameters for Default Auto Wrapping. (useful only
                      when `fsdp` field is passed). (default: 0)
- fsdp_config FSDP_CONFIG
                      Config to be used with FSDP (Pytorch Fully Sharded
                      Data Parallel). The value is either a fsdp json config
                      file (e.g., `fsdp_config.json`) or an already loaded
                      json file as `dict`. (default: None)
- fsdp_transformer_layer_cls_to_wrap FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP
                      This parameter is deprecated. Transformer layer class
                      name (case-sensitive) to wrap, e.g, `BertLayer`,
                      `GPTJBlock`, `T5Block` .... (useful only when `fsdp`
                      flag is passed). (default: None)
- accelerator_config ACCELERATOR_CONFIG
                      Config to be used with the internal Accelerator object
                      initialization. The value is either an accelerator json
                      config file (e.g., `accelerator_config.json`) or an
                      already loaded json file as `dict`. (default: None)
- deepspeed DEEPSPEED
                      Enable deepspeed and pass the path to deepspeed json
                      config file (e.g. `ds_config.json`) or an already
                      loaded json file as a dict (default: None)
- label_smoothing_factor LABEL_SMOOTHING_FACTOR
                      The label smoothing epsilon to apply (zero means no
                      label smoothing). (default: 0.0)
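
For intuition, label smoothing blends the usual negative log-likelihood with a term that spreads probability mass over the whole vocabulary. The sketch below computes this for a single token in plain Python. It follows the common formulation `(1 - epsilon) * nll + epsilon * uniform_term`; the function name is hypothetical.

```python
import math


def label_smoothed_nll(logits, target, epsilon=0.0):
    """Label-smoothed cross-entropy for one token (illustrative sketch)."""
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]

    nll = -log_probs[target]                     # standard loss
    uniform = -sum(log_probs) / len(log_probs)   # loss against a uniform target
    return (1 - epsilon) * nll + epsilon * uniform


confident = [10.0, 0.0, 0.0]
print(label_smoothed_nll(confident, 0, epsilon=0.0))
print(label_smoothed_nll(confident, 0, epsilon=0.1))
```

With a non-zero epsilon the very confident prediction is penalised more, which is the intended regularising effect.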
- optim {adamw_hf,adamw_torch,adamw_torch_fused,adamw_torch_xla,adamw_torch_npu_fused,adamw_apex_fused,adafactor,adamw_anyprecision,sgd,adagrad,adamw_bnb_8bit,adamw_8bit,lion_8bit,lion_32bit,paged_adamw_32bit,paged_adamw_8bit,paged_lion_32bit,paged_lion_8bit,rmsprop,rmsprop_bnb,rmsprop_bnb_8bit,rmsprop_bnb_32bit,galore_adamw,galore_adamw_8bit,galore_adafactor,galore_adamw_layerwise,galore_adamw_8bit_layerwise,galore_adafactor_layerwise}
                      The optimizer to use. (default: adamw_torch)
- optim_args OPTIM_ARGS
                      Optional arguments to supply to optimizer. (default:
                      None)
- adafactor [ADAFACTOR]
                      Whether or not to replace AdamW by Adafactor.
                      (default: False)
- group_by_length [GROUP_BY_LENGTH]
                      Whether or not to group samples of roughly the same
                      length together when batching. (default: False)
- length_column_name LENGTH_COLUMN_NAME
                      Column name with precomputed lengths to use when
                      grouping by length. (default: length)
- report_to REPORT_TO [REPORT_TO ...]
                      The list of integrations to report the results and
                      logs to. (default: None)
- ddp_find_unused_parameters DDP_FIND_UNUSED_PARAMETERS
                      When using distributed training, the value of the flag
                      `find_unused_parameters` passed to
                      `DistributedDataParallel`. (default: None)
- ddp_bucket_cap_mb DDP_BUCKET_CAP_MB
                      When using distributed training, the value of the flag
                      `bucket_cap_mb` passed to `DistributedDataParallel`.
                      (default: None)
- ddp_broadcast_buffers DDP_BROADCAST_BUFFERS
                      When using distributed training, the value of the flag
                      `broadcast_buffers` passed to
                      `DistributedDataParallel`. (default: None)
- dataloader_pin_memory [DATALOADER_PIN_MEMORY]
                      Whether or not to pin memory for DataLoader. (default:
                      True)
- no_dataloader_pin_memory
                      Whether or not to pin memory for DataLoader. (default:
                      False)
- dataloader_persistent_workers [DATALOADER_PERSISTENT_WORKERS]
                      If True, the data loader will not shut down the worker
                      processes after a dataset has been consumed once. This
                      keeps the workers' Dataset instances alive. Can
                      potentially speed up training, but will increase RAM
                      usage. (default: False)
- skip_memory_metrics [SKIP_MEMORY_METRICS]
                      Whether or not to skip adding of memory profiler
                      reports to metrics. (default: True)
- no_skip_memory_metrics
                      Whether or not to skip adding of memory profiler
                      reports to metrics. (default: False)
- use_legacy_prediction_loop [USE_LEGACY_PREDICTION_LOOP]
                      Whether or not to use the legacy prediction_loop in
                      the Trainer. (default: False)
- push_to_hub [PUSH_TO_HUB]
                      Whether or not to upload the trained model to the
                      model hub after training. (default: False)
- resume_from_checkpoint RESUME_FROM_CHECKPOINT
                      The path to a folder with a valid checkpoint for your
                      model. (default: None)
- hub_model_id HUB_MODEL_ID
                      The name of the repository to keep in sync with the
                      local `output_dir`. (default: None)
- hub_strategy {end,every_save,checkpoint,all_checkpoints}
                      The hub strategy to use when `--push_to_hub` is
                      activated. (default: every_save)
- hub_token HUB_TOKEN
                      The token to use to push to the Model Hub. (default:
                      None)
- hub_private_repo [HUB_PRIVATE_REPO]
                      Whether the model repository is private or not.
                      (default: False)
- hub_always_push [HUB_ALWAYS_PUSH]
                      Unless `True`, the Trainer will skip pushes if the
                      previous one wasn't finished yet. (default: False)
- gradient_checkpointing [GRADIENT_CHECKPOINTING]
                      If True, use gradient checkpointing to save memory at
                      the expense of slower backward pass. (default: False)
- gradient_checkpointing_kwargs GRADIENT_CHECKPOINTING_KWARGS
                      Gradient checkpointing key word arguments such as
                      `use_reentrant`. Will be passed to
                      `torch.utils.checkpoint.checkpoint` through
                      `model.gradient_checkpointing_enable`. (default: None)
- include_inputs_for_metrics [INCLUDE_INPUTS_FOR_METRICS]
                      Whether or not the inputs will be passed to the
                      `compute_metrics` function. (default: False)
- fp16_backend {auto,apex,cpu_amp}
                      Deprecated. Use half_precision_backend instead
                      (default: auto)
- push_to_hub_model_id PUSH_TO_HUB_MODEL_ID
                      The name of the repository to which to push the
                      `Trainer`. (default: None)
- push_to_hub_organization PUSH_TO_HUB_ORGANIZATION
                      The name of the organization to which to push the
                      `Trainer`. (default: None)
- push_to_hub_token PUSH_TO_HUB_TOKEN
                      The token to use to push to the Model Hub. (default:
                      None)
- mp_parameters MP_PARAMETERS
                      Used by the SageMaker launcher to send mp-specific
                      args. Ignored in Trainer (default: )
- auto_find_batch_size [AUTO_FIND_BATCH_SIZE]
                      Whether to automatically decrease the batch size in
                      half and rerun the training loop again each time a
                      CUDA Out-of-Memory was reached (default: False)
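
The retry behaviour behind `auto_find_batch_size` can be pictured as a simple halving loop. The sketch below is a stand-in written for illustration: it catches Python's built-in `MemoryError` rather than a real CUDA out-of-memory error, and both `run_with_auto_batch_size` and `fake_train` are hypothetical names.

```python
def run_with_auto_batch_size(train_fn, starting_batch_size=8):
    """Call train_fn(batch_size), halving the batch size after each OOM."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            return batch_size, train_fn(batch_size)
        except MemoryError:
            # Stand-in for a CUDA out-of-memory error: retry with half the size.
            batch_size //= 2
    raise RuntimeError("No executable batch size found.")


def fake_train(batch_size):
    """Pretend that only batches of at most 2 examples fit in memory."""
    if batch_size > 2:
        raise MemoryError
    return "done"


print(run_with_auto_batch_size(fake_train, 8))  # (2, 'done')
```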
- full_determinism [FULL_DETERMINISM]
                      Whether to call enable_full_determinism instead of
                      set_seed for reproducibility in distributed training.
                      Important: this will negatively impact the
                      performance, so only use it for debugging. (default:
                      False)
- torchdynamo TORCHDYNAMO
                      This argument is deprecated, use
                      `--torch_compile_backend` instead. (default: None)
- ray_scope RAY_SCOPE
                      The scope to use when doing hyperparameter search with
                      Ray. By default, `"last"` will be used. Ray will then
                      use the last checkpoint of all trials, compare those,
                      and select the best one. However, other options are
                      also available. See the Ray documentation
                      (https://docs.ray.io/en/latest/tune/api_docs/analysis.html#ray.tune.ExperimentAnalysis.get_best_trial)
                      for more options.
                      (default: last)
- ddp_timeout DDP_TIMEOUT
                      Overrides the default timeout for distributed training
                      (value should be given in seconds). (default: 1800)
- torch_compile [TORCH_COMPILE]
                      If set to `True`, the model will be wrapped in
                      `torch.compile`. (default: False)
- torch_compile_backend TORCH_COMPILE_BACKEND
                      Which backend to use with `torch.compile`, passing one
                      will trigger a model compilation. (default: None)
- torch_compile_mode TORCH_COMPILE_MODE
                      Which mode to use with `torch.compile`, passing one
                      will trigger a model compilation. (default: None)
- dispatch_batches DISPATCH_BATCHES
                      Deprecated. Pass {'dispatch_batches':VALUE} to
                      `accelerator_config`. (default: None)
- split_batches SPLIT_BATCHES
                      Deprecated. Pass {'split_batches':True} to
                      `accelerator_config`. (default: None)
- include_tokens_per_second [INCLUDE_TOKENS_PER_SECOND]
                      If set to `True`, the speed metrics will include `tgs`
                      (tokens per second per device). (default: False)
- include_num_input_tokens_seen [INCLUDE_NUM_INPUT_TOKENS_SEEN]
                      If set to `True`, will track the number of input
                      tokens seen throughout training. (May be slower in
                      distributed training) (default: False)
- neftune_noise_alpha NEFTUNE_NOISE_ALPHA
                      Activates neftune noise embeddings into the model.
                      NEFTune has been reported to improve model
                      performance for instruction fine-tuning. Check out
                      the original paper here:
                      https://arxiv.org/abs/2310.05914 and the original code
                      here: https://github.com/neelsjain/NEFTune. Only
                      supported for `PreTrainedModel` and `PeftModel`
                      classes. (default: None)
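
To show what `neftune_noise_alpha` controls, the sketch below adds NEFTune-style noise to a small embedding matrix in plain Python. It assumes the formulation from the paper linked above, where uniform noise is scaled by `alpha / sqrt(seq_len * dim)`; the function name and the list-of-lists representation are simplifications for illustration.

```python
import math
import random


def neftune_noise(embeddings, alpha, rng=None):
    """Add uniform noise of magnitude alpha / sqrt(seq_len * dim) (sketch)."""
    rng = rng or random.Random()
    seq_len, dim = len(embeddings), len(embeddings[0])
    magnitude = alpha / math.sqrt(seq_len * dim)
    return [
        [value + rng.uniform(-magnitude, magnitude) for value in row]
        for row in embeddings
    ]


# Four token embeddings of dimension four, all zeros for readability.
embeddings = [[0.0] * 4 for _ in range(4)]
noisy = neftune_noise(embeddings, alpha=5.0, rng=random.Random(0))
print(max(abs(v) for row in noisy for v in row) <= 5.0 / math.sqrt(16))  # True
```

The noise is applied only during training; a larger alpha means stronger perturbation of the input embeddings.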
- optim_target_modules OPTIM_TARGET_MODULES
                      Target modules for the optimizer defined in the
                      `optim` argument. Only used for the GaLore optimizer
                      at the moment. (default: None)
- sortish_sampler [SORTISH_SAMPLER]
                      Whether to use SortishSampler or not. (default: False)
- predict_with_generate [PREDICT_WITH_GENERATE]
                      Whether to use generate to calculate generative
                      metrics (ROUGE, BLEU). (default: False)
- generation_max_length GENERATION_MAX_LENGTH
                      The `max_length` to use on each evaluation loop when
                      `predict_with_generate=True`. Will default to the
                      `max_length` value of the model configuration.
                      (default: None)
- generation_num_beams GENERATION_NUM_BEAMS
                      The `num_beams` to use on each evaluation loop when
                      `predict_with_generate=True`. Will default to the
                      `num_beams` value of the model configuration.
                      (default: None)
- generation_config GENERATION_CONFIG
                      Model id, file path or url pointing to a
                      GenerationConfig json file, to use during prediction.
                      (default: None)
- use_badam [USE_BADAM]
                      Whether or not to use the BAdam optimizer. (default:
                      False)
- badam_mode {layer,ratio}
                      Whether to use layer-wise or ratio-wise BAdam
                      optimizer. (default: layer)
- badam_start_block BADAM_START_BLOCK
                      The starting block index for layer-wise BAdam.
                      (default: None)
- badam_switch_mode {ascending,descending,random,fixed}
                      The strategy for picking the block to update in
                      layer-wise BAdam. (default: ascending)
- badam_switch_interval BADAM_SWITCH_INTERVAL
                      Number of steps to update the block for layer-wise
                      BAdam. Use -1 to disable the block update. (default:
                      50)
- badam_update_ratio BADAM_UPDATE_RATIO
                      The ratio of the update for ratio-wise BAdam.
                      (default: 0.05)
- badam_mask_mode {adjacent,scatter}
                      The mode of the mask for BAdam optimizer. `adjacent`
                      means that the trainable parameters are adjacent to
                      each other, `scatter` means that trainable parameters
                      are randomly chosen from the weight. (default:
                      adjacent)
- badam_verbose BADAM_VERBOSE
                      The verbosity level of the BAdam optimizer: 0 prints
                      nothing, 1 prints the block prefix, 2 prints the
                      trainable parameters. (default: 0)
- use_galore [USE_GALORE]
                      Whether or not to use the gradient low-Rank projection
                      (GaLore). (default: False)
- galore_target GALORE_TARGET
                      Name(s) of modules to apply GaLore. Use commas to
                      separate multiple modules. Use "all" to specify all
                      the linear modules. (default: all)
- galore_rank GALORE_RANK
                      The rank of GaLore gradients. (default: 16)
- galore_update_interval GALORE_UPDATE_INTERVAL
                      Number of steps to update the GaLore projection.
                      (default: 200)
- galore_scale GALORE_SCALE
                      GaLore scaling coefficient. (default: 0.25)
- galore_proj_type {std,reverse_std,right,left,full}
                      Type of GaLore projection. (default: std)
- galore_layerwise [GALORE_LAYERWISE]
                      Whether or not to enable layer-wise update to further
                      save memory. (default: False)
- dpo_beta DPO_BETA   The beta parameter for the DPO loss. (default: 0.1)
- dpo_loss {sigmoid,hinge,ipo,kto_pair}
                      The type of DPO loss to use. (default: sigmoid)
- dpo_label_smoothing DPO_LABEL_SMOOTHING
                      The robust DPO label smoothing parameter in cDPO that
                      should be between 0 and 0.5. (default: 0.0)
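
As a reference point for `dpo_beta` and `dpo_label_smoothing`, the sketch below computes the `sigmoid` variant of the DPO loss for a single preference pair. It is written from the published DPO objective with conservative (cDPO) label smoothing; the function and argument names are hypothetical, and the other `dpo_loss` variants are not covered.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp,
                     beta=0.1, label_smoothing=0.0):
    """Sigmoid DPO loss for one (chosen, rejected) pair of log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    logits = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    return (
        -(1 - label_smoothing) * log_sigmoid(beta * logits)
        - label_smoothing * log_sigmoid(-beta * logits)
    )


# Policy identical to the reference: the loss equals log(2).
print(dpo_sigmoid_loss(-3.0, -4.0, -3.0, -4.0))
# Policy prefers the chosen answer more strongly than the reference: lower loss.
print(dpo_sigmoid_loss(-1.0, -5.0, -2.0, -2.0))
```

A larger beta makes the loss react more sharply to deviations from the reference model.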
- dpo_ftx DPO_FTX     The supervised fine-tuning loss coefficient in DPO
                      training. (default: 0.0)
- orpo_beta ORPO_BETA
                      The beta (lambda) parameter in ORPO loss representing
                      the weight of the SFT loss. (default: 0.1)
- ppo_buffer_size PPO_BUFFER_SIZE
                      The number of mini-batches to make experience buffer
                      in a PPO optimization step. (default: 1)
- ppo_epochs PPO_EPOCHS
                      The number of epochs to perform in a PPO optimization
                      step. (default: 4)
- ppo_score_norm [PPO_SCORE_NORM]
                      Use score normalization in PPO training. (default:
                      False)
- ppo_target PPO_TARGET
                      Target KL value for adaptive KL control in PPO
                      training. (default: 6.0)
- ppo_whiten_rewards [PPO_WHITEN_REWARDS]
                      Whiten the rewards before computing advantages in PPO
                      training. (default: False)
- ref_model REF_MODEL
                      Path to the reference model used for the PPO or DPO
                      training. (default: None)
- ref_model_adapters REF_MODEL_ADAPTERS
                      Path to the adapters of the reference model. (default:
                      None)
- ref_model_quantization_bit REF_MODEL_QUANTIZATION_BIT
                      The number of bits to quantize the reference model.
                      (default: None)
- reward_model REWARD_MODEL
                      Path to the reward model used for the PPO training.
                      (default: None)
- reward_model_adapters REWARD_MODEL_ADAPTERS
                      Path to the adapters of the reward model. (default:
                      None)
- reward_model_quantization_bit REWARD_MODEL_QUANTIZATION_BIT
                      The number of bits to quantize the reward model.
                      (default: None)
- reward_model_type {lora,full,api}
                      The type of the reward model in PPO training. Lora
                      model only supports lora training. (default: lora)
- additional_target ADDITIONAL_TARGET
                      Name(s) of modules apart from LoRA layers to be set as
                      trainable and saved in the final checkpoint. (default:
                      None)
- lora_alpha LORA_ALPHA
                      The scale factor for LoRA fine-tuning. If unset,
                      lora_rank * 2 is used. (default: None)
- lora_dropout LORA_DROPOUT
                      Dropout rate for the LoRA fine-tuning. (default: 0.0)
- lora_rank LORA_RANK
                      The intrinsic dimension for LoRA fine-tuning.
                      (default: 8)
- lora_target LORA_TARGET
                      Name(s) of target modules to apply LoRA. Use commas to
                      separate multiple modules. Use "all" to specify all
                      the linear modules. LLaMA choices: ["q_proj",
                      "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj",
                      "down_proj"], BLOOM & Falcon & ChatGLM choices:
                      ["query_key_value", "dense", "dense_h_to_4h",
                      "dense_4h_to_h"], Baichuan choices: ["W_pack",
                      "o_proj", "gate_proj", "up_proj", "down_proj"], Qwen
                      choices: ["c_attn", "attn.c_proj", "w1", "w2",
                      "mlp.c_proj"], InternLM2 choices: ["wqkv", "wo", "w1",
                      "w2", "w3"], Others choices: the same as LLaMA.
                      (default: all)
- loraplus_lr_ratio LORAPLUS_LR_RATIO
                      LoRA plus learning rate ratio (lr_B / lr_A). (default:
                      None)
- loraplus_lr_embedding LORAPLUS_LR_EMBEDDING
                      LoRA plus learning rate for lora embedding layers.
                      (default: 1e-06)
- use_rslora [USE_RSLORA]
                      Whether or not to use the rank stabilization scaling
                      factor for LoRA layer. (default: False)
- use_dora [USE_DORA]
                      Whether or not to use the weight-decomposed lora
                      method (DoRA). (default: False)
- create_new_adapter [CREATE_NEW_ADAPTER]
                      Whether or not to create a new adapter with randomly
                      initialized weight. (default: False)
- name_module_trainable NAME_MODULE_TRAINABLE
                      Name of trainable modules for partial-parameter
                      (freeze) fine-tuning. Use commas to separate multiple
                      modules. Use "all" to specify all the available
                      modules. LLaMA choices: ["mlp", "self_attn"], BLOOM &
                      Falcon & ChatGLM choices: ["mlp", "self_attention"],
                      Qwen choices: ["mlp", "attn"], InternLM2 choices:
                      ["feed_forward", "attention"], Others choices: the
                      same as LLaMA. (default: all)
- num_layer_trainable NUM_LAYER_TRAINABLE
                      The number of trainable layers for partial-parameter
                      (freeze) fine-tuning. (default: 2)
- pure_bf16 [PURE_BF16]
                      Whether or not to train model in purely bf16 precision
                      (without AMP). (default: False)
- stage {pt,sft,rm,ppo,dpo,orpo}
                      Which stage will be performed in training. (default:
                      sft)
- finetuning_type {lora,freeze,full}
                      Which fine-tuning method to use. (default: lora)
- use_llama_pro [USE_LLAMA_PRO]
                      Whether or not to make only the parameters in the
                      expanded blocks trainable. (default: False)
- plot_loss [PLOT_LOSS]
                      Whether or not to save the training loss curves.
                      (default: False)
- do_sample [DO_SAMPLE]
                      Whether or not to use sampling, use greedy decoding
                      otherwise. (default: True)
- no_do_sample        Whether or not to use sampling, use greedy decoding
                      otherwise. (default: False)
- temperature TEMPERATURE
                      The value used to modulate the next token
                      probabilities. (default: 0.95)
- top_p TOP_P         The smallest set of most probable tokens with
                      probabilities that add up to top_p or higher are kept.
                      (default: 0.7)
- top_k TOP_K         The number of highest probability vocabulary tokens to
                      keep for top-k filtering. (default: 50)
- num_beams NUM_BEAMS
                      Number of beams for beam search. 1 means no beam
                      search. (default: 1)
- max_length MAX_LENGTH
                      The maximum length the generated tokens can have. It
                      can be overridden by max_new_tokens. (default: 1024)
- max_new_tokens MAX_NEW_TOKENS
                      The maximum numbers of tokens to generate, ignoring
                      the number of tokens in the prompt. (default: 1024)
- repetition_penalty REPETITION_PENALTY
                      The parameter for repetition penalty. 1.0 means no
                      penalty. (default: 1.0)
- length_penalty LENGTH_PENALTY
                      Exponential penalty to the length that is used with
                      beam-based generation. (default: 1.0)
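
The generation parameters that close the list above (temperature, top_k, top_p) interact in a way that is easier to see in code. The following is a minimal sketch in plain Python, not LLaMA-Factory's or Transformers' actual implementation: the function name `filter_next_token_probs` is hypothetical, the defaults are copied from the list, and treating `top_k <= 0` as "disabled" is an assumption made here for illustration.

```python
import math


def filter_next_token_probs(logits, temperature=0.95, top_k=50, top_p=0.7):
    """Apply temperature, then top-k, then top-p, and renormalize what is left.

    Returns a dict mapping token index -> probability. Illustrative only.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide the logits before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k most probable token indices.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    if top_k > 0:  # assumption: non-positive top_k disables the filter
        order = order[:top_k]
    kept_mass = sum(weights[i] for i in order)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / kept_mass
        if cumulative >= top_p:
            break

    total = sum(weights[i] for i in kept)
    return {i: weights[i] / total for i in kept}


example = filter_next_token_probs(
    [2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.7
)
print(example)
```

A token is then sampled from the returned distribution. With `do_sample` set to False these filters are not used, because greedy decoding always picks the most probable token.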

All parameters for exporting/merging models

- help            show this help message and exit
- model_name_or_path MODEL_NAME_OR_PATH
                        Path to the model weight or identifier from huggingface.co/models or modelscope.cn/models. (default: None)
- adapter_name_or_path ADAPTER_NAME_OR_PATH
                       Path to the adapter weight or identifier from huggingface.co/models. (default: None)
- cache_dir CACHE_DIR
                        Where to store the pre-trained models downloaded from huggingface.co or modelscope.cn. (default: None)
- use_fast_tokenizer [USE_FAST_TOKENIZER]
                        Whether or not to use one of the fast tokenizer (backed by the tokenizers library). (default: True)
- no_use_fast_tokenizer
                        Whether or not to use one of the fast tokenizer (backed by the tokenizers library). (default: False)
- resize_vocab [RESIZE_VOCAB]
                        Whether or not to resize the tokenizer vocab and the embedding layers. (default: False)
- split_special_tokens [SPLIT_SPECIAL_TOKENS]
                        Whether or not the special tokens should be split during the tokenization process. (default: False)
- new_special_tokens NEW_SPECIAL_TOKENS
                        Special tokens to be added into the tokenizer. (default: None)
- model_revision MODEL_REVISION
                        The specific model version to use (can be a branch name, tag name or commit id). (default: main)
- low_cpu_mem_usage [LOW_CPU_MEM_USAGE]
                        Whether or not to use memory-efficient model loading. (default: True)
- no_low_cpu_mem_usage
                        Whether or not to use memory-efficient model loading. (default: False)
- quantization_bit QUANTIZATION_BIT
                        The number of bits to quantize the model using bitsandbytes. (default: None)
- quantization_type {fp4,nf4}
                        Quantization data type to use in int4 training. (default: nf4)
- double_quantization [DOUBLE_QUANTIZATION]
                        Whether or not to use double quantization in int4 training. (default: True)
- no_double_quantization
                        Whether or not to use double quantization in int4 training. (default: False)
- quantization_device_map {auto}
                        Device map used to infer the 4-bit quantized model, needs bitsandbytes>=0.43.0. (default: None)
- rope_scaling {linear,dynamic}
                        Which scaling strategy should be adopted for the RoPE embeddings. (default: None)
- flash_attn {off,sdpa,fa2,auto}
                        Enable FlashAttention for faster training and inference. (default: auto)
- shift_attn [SHIFT_ATTN]
                        Enable shift short attention (S^2-Attn) proposed by LongLoRA. (default: False)
- mixture_of_depths {convert,load}
                        Convert the model to mixture-of-depths (MoD) or load the MoD model. (default: None)
- use_unsloth [USE_UNSLOTH]
                        Whether or not to use unsloth's optimization for the LoRA training. (default: False)
- visual_inputs [VISUAL_INPUTS]
                        Whether or not to use multimodal LLM that accepts visual inputs. (default: False)
- moe_aux_loss_coef MOE_AUX_LOSS_COEF
                        Coefficient of the auxiliary router loss in mixture-of-experts model. (default: None)
- disable_gradient_checkpointing [DISABLE_GRADIENT_CHECKPOINTING]
                        Whether or not to disable gradient checkpointing. (default: False)
- upcast_layernorm [UPCAST_LAYERNORM]
                        Whether or not to upcast the layernorm weights in fp32. (default: False)
- upcast_lmhead_output [UPCAST_LMHEAD_OUTPUT]
                        Whether or not to upcast the output of lm_head in fp32. (default: False)
- infer_backend {huggingface,vllm}
                        Backend engine used at inference. (default: huggingface)
- vllm_maxlen VLLM_MAXLEN
                        Maximum input length of the vLLM engine. (default: 2048)
- vllm_gpu_util VLLM_GPU_UTIL
                        The fraction of GPU memory in (0,1) to be used for the vLLM engine. (default: 0.9)
- vllm_enforce_eager [VLLM_ENFORCE_EAGER]
                        Whether or not to disable CUDA graph in the vLLM engine. (default: False)
- offload_folder OFFLOAD_FOLDER
                        Path to offload model weights. (default: offload)
- use_cache [USE_CACHE]
                        Whether or not to use KV cache in generation. (default: True)
- no_use_cache        Whether or not to use KV cache in generation. (default: False)
- hf_hub_token HF_HUB_TOKEN
                        Auth token to log in with Hugging Face Hub. (default: None)
- ms_hub_token MS_HUB_TOKEN
                        Auth token to log in with ModelScope Hub. (default: None)
- export_dir EXPORT_DIR
                        Path to the directory to save the exported model. (default: None)
- export_size EXPORT_SIZE
                        The file shard size (in GB) of the exported model. (default: 1)
- export_device EXPORT_DEVICE
                        The device used in model export, use cuda to avoid addmm errors. (default: cpu)
- export_quantization_bit EXPORT_QUANTIZATION_BIT
                        The number of bits to quantize the exported model. (default: None)
- export_quantization_dataset EXPORT_QUANTIZATION_DATASET
                        Path to the dataset or dataset name to use in quantizing the exported model. (default: None)
- export_quantization_nsamples EXPORT_QUANTIZATION_NSAMPLES
                        The number of samples used for quantization. (default: 128)
- export_quantization_maxlen EXPORT_QUANTIZATION_MAXLEN
                        The maximum length of the model inputs used for quantization. (default: 1024)
- export_legacy_format [EXPORT_LEGACY_FORMAT]
                        Whether or not to save the `.bin` files instead of `.safetensors`. (default: False)
- export_hub_model_id EXPORT_HUB_MODEL_ID
                        The name of the repository if pushing the model to the Hugging Face hub. (default: None)
- print_param_status [PRINT_PARAM_STATUS]
                        For debugging purposes, print the status of the parameters in the model. (default: False)
- template TEMPLATE   Which template to use for constructing prompts in training and inference. (default: None)
- dataset DATASET     The name of provided dataset(s) to use. Use commas to separate multiple datasets. (default: None)
- eval_dataset EVAL_DATASET
                        The name of provided validation dataset(s) to use. Use commas to separate multiple datasets. (default: None)
- dataset_dir DATASET_DIR
                        Path to the folder containing the datasets. (default: data)
- split SPLIT         Which dataset split to use for training and evaluation. (default: train)
- cutoff_len CUTOFF_LEN
                        The cutoff length of the tokenized inputs in the dataset. (default: 1024)
- reserved_label_len RESERVED_LABEL_LEN
                        The minimum cutoff length reserved for the tokenized labels in the dataset. (default: 1)
- train_on_prompt [TRAIN_ON_PROMPT]
                        Whether to disable the mask on the prompt or not. (default: False)
- streaming [STREAMING]
                        Enable dataset streaming. (default: False)
- buffer_size BUFFER_SIZE
                        Size of the buffer to randomly sample examples from in dataset streaming. (default: 16384)
- mix_strategy {concat,interleave_under,interleave_over}
                        Strategy to use in dataset mixing (concat/interleave) (undersampling/oversampling). (default: concat)
- interleave_probs INTERLEAVE_PROBS
                        Probabilities to sample data from datasets. Use commas to separate multiple datasets. (default: None)
- overwrite_cache [OVERWRITE_CACHE]
                        Overwrite the cached training and evaluation sets. (default: False)
- preprocessing_num_workers PREPROCESSING_NUM_WORKERS
                        The number of processes to use for the pre-processing. (default: None)
- max_samples MAX_SAMPLES
                        For debugging purposes, truncate the number of examples for each dataset. (default: None)
- eval_num_beams EVAL_NUM_BEAMS
                        Number of beams to use for evaluation. This argument will be passed to `model.generate` (default: None)
- ignore_pad_token_for_loss [IGNORE_PAD_TOKEN_FOR_LOSS]
                        Whether or not to ignore the tokens corresponding to padded labels in the loss computation. (default: True)
- no_ignore_pad_token_for_loss
                        Whether or not to ignore the tokens corresponding to padded labels in the loss computation. (default: False)
- val_size VAL_SIZE   Size of the development set, should be an integer or a float in range `[0,1)`. (default: 0.0)
- packing PACKING     Whether or not to pack the sequences in training. Will automatically enable in pre-training. (default: None)
- tokenized_path TOKENIZED_PATH
                        Path to save or load the tokenized datasets. (default: None)
- use_badam [USE_BADAM]
                        Whether or not to use the BAdam optimizer. (default: False)
- badam_mode {layer,ratio}
                        Whether to use layer-wise or ratio-wise BAdam optimizer. (default: layer)
- badam_start_block BADAM_START_BLOCK
                        The starting block index for layer-wise BAdam. (default: None)
- badam_switch_mode {ascending,descending,random,fixed}
                        The strategy for picking the block to update for layer-wise BAdam. (default: ascending)
- badam_switch_interval BADAM_SWITCH_INTERVAL
                        Number of steps to update the block for layer-wise BAdam. Use -1 to disable the block update. (default: 50)
- badam_update_ratio BADAM_UPDATE_RATIO
                        The ratio of the update for ratio-wise BAdam. (default: 0.05)
- badam_mask_mode {adjacent,scatter}
                        The mode of the mask for BAdam optimizer. `adjacent` means that the trainable parameters are adjacent to each other, `scatter` means that trainable parameters are randomly chosen from the weight. (default: adjacent)
- badam_verbose BADAM_VERBOSE
                        The verbosity level of BAdam optimizer. 0 for no print, 1 for print the block prefix, 2 for print trainable parameters (default: 0)
- use_galore [USE_GALORE]
                        Whether or not to use the gradient low-Rank projection (GaLore). (default: False)
- galore_target GALORE_TARGET
                        Name(s) of modules to apply GaLore. Use commas to separate multiple modules. Use "all" to specify all the linear modules. (default: all)
- galore_rank GALORE_RANK
                        The rank of GaLore gradients. (default: 16)
- galore_update_interval GALORE_UPDATE_INTERVAL
                        Number of steps to update the GaLore projection. (default: 200)
- galore_scale GALORE_SCALE
                        GaLore scaling coefficient. (default: 0.25)
- galore_proj_type {std,reverse_std,right,left,full}
                        Type of GaLore projection. (default: std)
- galore_layerwise [GALORE_LAYERWISE]
                        Whether or not to enable layer-wise update to further save memory. (default: False)
- dpo_beta DPO_BETA   The beta parameter for the DPO loss. (default: 0.1)
- dpo_loss {sigmoid,hinge,ipo,kto_pair}
                        The type of DPO loss to use. (default: sigmoid)
- dpo_label_smoothing DPO_LABEL_SMOOTHING
                        The robust DPO label smoothing parameter in cDPO that should be between 0 and 0.5. (default: 0.0)
- dpo_ftx DPO_FTX     The supervised fine-tuning loss coefficient in DPO training. (default: 0.0)
- orpo_beta ORPO_BETA
                        The beta (lambda) parameter in ORPO loss representing the weight of the SFT loss. (default: 0.1)
- ppo_buffer_size PPO_BUFFER_SIZE
                        The number of mini-batches to make experience buffer in a PPO optimization step. (default: 1)
- ppo_epochs PPO_EPOCHS
                        The number of epochs to perform in a PPO optimization step. (default: 4)
- ppo_score_norm [PPO_SCORE_NORM]
                        Use score normalization in PPO training. (default: False)
- ppo_target PPO_TARGET
                        Target KL value for adaptive KL control in PPO training. (default: 6.0)
- ppo_whiten_rewards [PPO_WHITEN_REWARDS]
                        Whiten the rewards before computing advantages in PPO training. (default: False)
- ref_model REF_MODEL
                        Path to the reference model used for the PPO or DPO training. (default: None)
- ref_model_adapters REF_MODEL_ADAPTERS
                        Path to the adapters of the reference model. (default: None)
- ref_model_quantization_bit REF_MODEL_QUANTIZATION_BIT
                        The number of bits to quantize the reference model. (default: None)
- reward_model REWARD_MODEL
                        Path to the reward model used for the PPO training. (default: None)
- reward_model_adapters REWARD_MODEL_ADAPTERS
                        Path to the adapters of the reward model. (default: None)
- reward_model_quantization_bit REWARD_MODEL_QUANTIZATION_BIT
                        The number of bits to quantize the reward model. (default: None)
- reward_model_type {lora,full,api}
                        The type of the reward model in PPO training. Lora model only supports lora training. (default: lora)
- additional_target ADDITIONAL_TARGET
                        Name(s) of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. (default: None)
- lora_alpha LORA_ALPHA
                        The scale factor for LoRA fine-tuning. If unset, lora_rank * 2 is used. (default: None)
- lora_dropout LORA_DROPOUT
                        Dropout rate for the LoRA fine-tuning. (default: 0.0)
- lora_rank LORA_RANK
                        The intrinsic dimension for LoRA fine-tuning. (default: 8)
- lora_target LORA_TARGET
                        Name(s) of target modules to apply LoRA. Use commas to separate multiple modules. Use "all" to specify all the linear modules. LLaMA choices: ["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"], BLOOM & Falcon & ChatGLM choices: ["query_key_value", "dense", "dense_h_to_4h", "dense_4h_to_h"], Baichuan choices: ["W_pack", "o_proj", "gate_proj",
                        "up_proj", "down_proj"], Qwen choices: ["c_attn", "attn.c_proj", "w1", "w2", "mlp.c_proj"], InternLM2 choices: ["wqkv", "wo", "w1", "w2", "w3"], Others choices: the same as LLaMA. (default:
                        all)
- loraplus_lr_ratio LORAPLUS_LR_RATIO
                        LoRA plus learning rate ratio (lr_B / lr_A). (default: None)
- loraplus_lr_embedding LORAPLUS_LR_EMBEDDING
                        LoRA plus learning rate for lora embedding layers. (default: 1e-06)
- use_rslora [USE_RSLORA]
                        Whether or not to use the rank stabilization scaling factor for LoRA layer. (default: False)
- use_dora [USE_DORA]
                        Whether or not to use the weight-decomposed lora method (DoRA). (default: False)
- create_new_adapter [CREATE_NEW_ADAPTER]
                        Whether or not to create a new adapter with randomly initialized weight. (default: False)
- name_module_trainable NAME_MODULE_TRAINABLE
                        Name of trainable modules for partial-parameter (freeze) fine-tuning. Use commas to separate multiple modules. Use "all" to specify all the available modules. LLaMA choices: ["mlp",
                        "self_attn"], BLOOM & Falcon & ChatGLM choices: ["mlp", "self_attention"], Qwen choices: ["mlp", "attn"], InternLM2 choices: ["feed_forward", "attention"], Others choices: the same as LLaMA.
                        (default: all)
- num_layer_trainable NUM_LAYER_TRAINABLE
                        The number of trainable layers for partial-parameter (freeze) fine-tuning. (default: 2)
- pure_bf16 [PURE_BF16]
                        Whether or not to train model in purely bf16 precision (without AMP). (default: False)
- stage {pt,sft,rm,ppo,dpo,orpo}
                        Which stage will be performed in training. (default: sft)
- finetuning_type {lora,freeze,full}
                        Which fine-tuning method to use. (default: lora)
- use_llama_pro [USE_LLAMA_PRO]
                        Whether or not to make only the parameters in the expanded blocks trainable. (default: False)
- plot_loss [PLOT_LOSS]
                        Whether or not to save the training loss curves. (default: False)
- do_sample [DO_SAMPLE]
                        Whether or not to use sampling, use greedy decoding otherwise. (default: True)
- no_do_sample        Whether or not to use sampling, use greedy decoding otherwise. (default: False)
- temperature TEMPERATURE
                        The value used to modulate the next token probabilities. (default: 0.95)
- top_p TOP_P         The smallest set of most probable tokens with probabilities that add up to top_p or higher are kept. (default: 0.7)
- top_k TOP_K         The number of highest probability vocabulary tokens to keep for top-k filtering. (default: 50)
- num_beams NUM_BEAMS
                        Number of beams for beam search. 1 means no beam search. (default: 1)
- max_length MAX_LENGTH
                        The maximum length the generated tokens can have. It can be overridden by max_new_tokens. (default: 1024)
- max_new_tokens MAX_NEW_TOKENS
                        The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt. (default: 1024)
- repetition_penalty REPETITION_PENALTY
                        The parameter for repetition penalty. 1.0 means no penalty. (default: 1.0)
- length_penalty LENGTH_PENALTY
                        Exponential penalty to the length that is used with beam-based generation. (default: 1.0)
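
As a rough illustration of how the export parameters above fit together when merging a LoRA adapter into its base model, the sketch below assembles a configuration dictionary and prints it as JSON. Only the keys come from the list; the helper `build_export_config`, the model identifier, and both paths are hypothetical placeholders. How such a configuration is handed to the tool depends on the installed LLaMA-Factory version and is not shown here.

```python
import json

# Parameter names copied from the export list above (a subset, for illustration).
DOCUMENTED_EXPORT_KEYS = {
    "model_name_or_path", "adapter_name_or_path", "template", "finetuning_type",
    "export_dir", "export_size", "export_device", "export_legacy_format",
    "export_quantization_bit", "export_quantization_dataset",
    "export_quantization_nsamples", "export_quantization_maxlen",
    "export_hub_model_id",
}


def build_export_config(base_model, adapter_path, export_dir, **overrides):
    """Collect export settings in one dict, starting from the documented defaults."""
    config = {
        "model_name_or_path": base_model,
        "adapter_name_or_path": adapter_path,
        "template": "llama3",           # matches the prompt template used for training
        "finetuning_type": "lora",
        "export_dir": export_dir,
        "export_size": 1,               # shard size in GB (documented default)
        "export_device": "cpu",         # documented default
        "export_legacy_format": False,  # keep `.safetensors` output
    }
    config.update(overrides)
    unknown = set(config) - DOCUMENTED_EXPORT_KEYS
    if unknown:
        raise ValueError(f"Not in the documented export list: {sorted(unknown)}")
    return config


# All three arguments are placeholders, not values taken from this wiki.
example_config = build_export_config(
    base_model="meta-llama/Meta-Llama-3-8B",
    adapter_path="saves/llama3-8b/lora/sft",
    export_dir="models/llama3-8b-merged",
)
print(json.dumps(example_config, indent=2))
```

Rejecting unknown keys early is a design choice of this sketch: it catches misspelled parameter names before a long export run starts.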

All parameters for inference

- model_name_or_path MODEL_NAME_OR_PATH
                        Path to the model weight or identifier from huggingface.co/models or modelscope.cn/models. (default: None)
- adapter_name_or_path ADAPTER_NAME_OR_PATH
                        Path to the adapter weight or identifier from huggingface.co/models. Use commas to separate multiple adapters. (default: None)
- adapter_folder ADAPTER_FOLDER
                        The folder containing the adapter weights to load. (default: None)
- cache_dir CACHE_DIR
                        Where to store the pre-trained models downloaded from huggingface.co or modelscope.cn. (default: None)
- use_fast_tokenizer [USE_FAST_TOKENIZER]
                        Whether or not to use one of the fast tokenizer (backed by the tokenizers library). (default: True)
- no_use_fast_tokenizer
                        Whether or not to use one of the fast tokenizer (backed by the tokenizers library). (default: False)
- resize_vocab [RESIZE_VOCAB]
                        Whether or not to resize the tokenizer vocab and the embedding layers. (default: False)
- split_special_tokens [SPLIT_SPECIAL_TOKENS]
                        Whether or not the special tokens should be split during the tokenization process. (default: False)
- new_special_tokens NEW_SPECIAL_TOKENS
                        Special tokens to be added into the tokenizer. Use commas to separate multiple tokens. (default: None)
- model_revision MODEL_REVISION
                        The specific model version to use (can be a branch name, tag name or commit id). (default: main)
- low_cpu_mem_usage [LOW_CPU_MEM_USAGE]
                        Whether or not to use memory-efficient model loading. (default: True)
- no_low_cpu_mem_usage
                        Whether or not to use memory-efficient model loading. (default: False)
- quantization_method {bitsandbytes,hqq,eetq}
                        Quantization method to use for on-the-fly quantization. (default: bitsandbytes)
- quantization_bit QUANTIZATION_BIT
                        The number of bits to quantize the model using bitsandbytes. (default: None)
- quantization_type {fp4,nf4}
                        Quantization data type to use in int4 training. (default: nf4)
- double_quantization [DOUBLE_QUANTIZATION]
                        Whether or not to use double quantization in int4 training. (default: True)
- no_double_quantization
                        Whether or not to use double quantization in int4 training. (default: False)
- quantization_device_map {auto}
                        Device map used to infer the 4-bit quantized model, needs bitsandbytes>=0.43.0. (default: None)
- rope_scaling {linear,dynamic}
                        Which scaling strategy should be adopted for the RoPE embeddings. (default: None)
- flash_attn {auto,disabled,sdpa,fa2}
                        Enable FlashAttention for faster training and inference. (default: auto)
- shift_attn [SHIFT_ATTN]
                        Enable shift short attention (S^2-Attn) proposed by LongLoRA. (default: False)
- mixture_of_depths {convert,load}
                        Convert the model to mixture-of-depths (MoD) or load the MoD model. (default: None)
- use_unsloth [USE_UNSLOTH]
                        Whether or not to use unsloth's optimization for the LoRA training. (default: False)
- visual_inputs [VISUAL_INPUTS]
                        Whether or not to use multimodal LLM that accepts visual inputs. (default: False)
- moe_aux_loss_coef MOE_AUX_LOSS_COEF
                        Coefficient of the auxiliary router loss in mixture-of-experts model. (default: None)
- disable_gradient_checkpointing [DISABLE_GRADIENT_CHECKPOINTING]
                        Whether or not to disable gradient checkpointing. (default: False)
- upcast_layernorm [UPCAST_LAYERNORM]
                        Whether or not to upcast the layernorm weights in fp32. (default: False)
- upcast_lmhead_output [UPCAST_LMHEAD_OUTPUT]
                        Whether or not to upcast the output of lm_head in fp32. (default: False)
- train_from_scratch [TRAIN_FROM_SCRATCH]
                        Whether or not to randomly initialize the model weights. (default: False)
- infer_backend {huggingface,vllm}
                        Backend engine used at inference. (default: huggingface)
- vllm_maxlen VLLM_MAXLEN
                        Maximum sequence (prompt + response) length of the vLLM engine. (default: 2048)
- vllm_gpu_util VLLM_GPU_UTIL
                        The fraction of GPU memory in (0,1) to be used for the vLLM engine. (default: 0.9)
- vllm_enforce_eager [VLLM_ENFORCE_EAGER]
                        Whether or not to disable CUDA graph in the vLLM engine. (default: False)
- vllm_max_lora_rank VLLM_MAX_LORA_RANK
                        Maximum rank of all LoRAs in the vLLM engine. (default: 32)
- offload_folder OFFLOAD_FOLDER
                        Path to offload model weights. (default: offload)
- use_cache [USE_CACHE]
                        Whether or not to use KV cache in generation. (default: True)
- no_use_cache
                        Whether or not to use KV cache in generation. (default: False)
- infer_dtype {auto,float16,bfloat16,float32}
                        Data type for model weights and activations at inference. (default: auto)
- hf_hub_token HF_HUB_TOKEN
                        Auth token to log in with Hugging Face Hub. (default: None)
- ms_hub_token MS_HUB_TOKEN
                        Auth token to log in with ModelScope Hub. (default: None)
- export_dir EXPORT_DIR
                        Path to the directory to save the exported model. (default: None)
- export_size EXPORT_SIZE
                        The file shard size (in GB) of the exported model. (default: 1)
- export_device {cpu,auto}
                        The device used in model export, use `auto` to accelerate exporting. (default: cpu)
- export_quantization_bit EXPORT_QUANTIZATION_BIT
                        The number of bits to quantize the exported model. (default: None)
- export_quantization_dataset EXPORT_QUANTIZATION_DATASET
                        Path to the dataset or dataset name to use in quantizing the exported model. (default: None)
- export_quantization_nsamples EXPORT_QUANTIZATION_NSAMPLES
                        The number of samples used for quantization. (default: 128)
- export_quantization_maxlen EXPORT_QUANTIZATION_MAXLEN
                        The maximum length of the model inputs used for quantization. (default: 1024)
- export_legacy_format [EXPORT_LEGACY_FORMAT]
                        Whether or not to save the `.bin` files instead of `.safetensors`. (default: False)
- export_hub_model_id EXPORT_HUB_MODEL_ID
                        The name of the repository to use when pushing the model to the Hugging Face Hub. (default: None)
- print_param_status [PRINT_PARAM_STATUS]
                        For debugging purposes, print the status of the parameters in the model. (default: False)
- template TEMPLATE
                        Which template to use for constructing prompts in training and inference. (default: None)
- dataset DATASET
                        The name of dataset(s) to use for training. Use commas to separate multiple datasets. (default: None)
- eval_dataset EVAL_DATASET
                        The name of dataset(s) to use for evaluation. Use commas to separate multiple datasets. (default: None)
- dataset_dir DATASET_DIR
                        Path to the folder containing the datasets. (default: data)
- cutoff_len CUTOFF_LEN
                        The cutoff length of the tokenized inputs in the dataset. (default: 1024)
- train_on_prompt [TRAIN_ON_PROMPT]
                        Whether or not to disable the mask on the prompt. (default: False)
- mask_history [MASK_HISTORY]
                        Whether or not to mask the history and train on the last turn only. (default: False)
- streaming [STREAMING]
                        Enable dataset streaming. (default: False)
- buffer_size BUFFER_SIZE
                        Size of the buffer to randomly sample examples from in dataset streaming. (default: 16384)
- mix_strategy {concat,interleave_under,interleave_over}
                        Strategy to use in dataset mixing (concat/interleave) (undersampling/oversampling). (default: concat)
- interleave_probs INTERLEAVE_PROBS
                        Probabilities to sample data from datasets. Use commas to separate multiple datasets. (default: None)
- overwrite_cache [OVERWRITE_CACHE]
                        Overwrite the cached training and evaluation sets. (default: False)
- preprocessing_num_workers PREPROCESSING_NUM_WORKERS
                        The number of processes to use for the pre-processing. (default: None)
- max_samples MAX_SAMPLES
                        For debugging purposes, truncate the number of examples for each dataset. (default: None)
- eval_num_beams EVAL_NUM_BEAMS
                        Number of beams to use for evaluation. This argument will be passed to `model.generate` (default: None)
- ignore_pad_token_for_loss [IGNORE_PAD_TOKEN_FOR_LOSS]
                        Whether or not to ignore the tokens corresponding to the pad label in loss computation. (default: True)
- no_ignore_pad_token_for_loss
                        Do not ignore the tokens corresponding to the pad label in loss computation; the negated form of `ignore_pad_token_for_loss`. (default: False)
- val_size VAL_SIZE
                        Size of the development set, should be an integer or a float in range `[0,1)`. (default: 0.0)
- packing PACKING
                        Enable sequence packing in training. Enabled automatically in pre-training. (default: None)
- neat_packing [NEAT_PACKING]
                        Enable sequence packing without cross-attention. (default: False)
- tool_format TOOL_FORMAT
                        Tool format to use for constructing function calling examples. (default: None)
- tokenized_path TOKENIZED_PATH
                        Path to save or load the tokenized datasets. (default: None)
- use_badam [USE_BADAM]
                        Whether or not to use the BAdam optimizer. (default: False)
- badam_mode {layer,ratio}
                        Whether to use layer-wise or ratio-wise BAdam optimizer. (default: layer)
- badam_start_block BADAM_START_BLOCK
                        The starting block index for layer-wise BAdam. (default: None)
- badam_switch_mode {ascending,descending,random,fixed}
                        The strategy for picking the block to update in layer-wise BAdam. (default: ascending)
- badam_switch_interval BADAM_SWITCH_INTERVAL
                        Number of steps to update the block for layer-wise BAdam. Use -1 to disable the block update. (default: 50)
- badam_update_ratio BADAM_UPDATE_RATIO
                        The ratio of the update for ratio-wise BAdam. (default: 0.05)
- badam_mask_mode {adjacent,scatter}
                        The mode of the mask for the BAdam optimizer. `adjacent` means that the trainable parameters are adjacent to each other, `scatter` means that trainable parameters
                        are randomly chosen from the weight. (default: adjacent)
- badam_verbose BADAM_VERBOSE
                        The verbosity level of the BAdam optimizer. 0 prints nothing, 1 prints the block prefix, 2 prints the trainable parameters. (default: 0)
- use_galore [USE_GALORE]
                        Whether or not to use gradient low-rank projection (GaLore). (default: False)
- galore_target GALORE_TARGET
                        Name(s) of modules to apply GaLore. Use commas to separate multiple modules. Use `all` to specify all the linear modules. (default: all)
- galore_rank GALORE_RANK
                        The rank of GaLore gradients. (default: 16)
- galore_update_interval GALORE_UPDATE_INTERVAL
                        Number of steps to update the GaLore projection. (default: 200)
- galore_scale GALORE_SCALE
                        GaLore scaling coefficient. (default: 0.25)
- galore_proj_type {std,reverse_std,right,left,full}
                        Type of GaLore projection. (default: std)
- galore_layerwise [GALORE_LAYERWISE]
                        Whether or not to enable layer-wise update to further save memory. (default: False)
- pref_beta PREF_BETA
                        The beta parameter in the preference loss. (default: 0.1)
- pref_ftx PREF_FTX
                        The supervised fine-tuning loss coefficient in DPO training. (default: 0.0)
- pref_loss {sigmoid,hinge,ipo,kto_pair,orpo,simpo}
                        The type of DPO loss to use. (default: sigmoid)
- dpo_label_smoothing DPO_LABEL_SMOOTHING
                        The robust DPO label smoothing parameter in cDPO that should be between 0 and 0.5. (default: 0.0)
- kto_chosen_weight KTO_CHOSEN_WEIGHT
                        The weight factor of the desirable losses in KTO training. (default: 1.0)
- kto_rejected_weight KTO_REJECTED_WEIGHT
                        The weight factor of the undesirable losses in KTO training. (default: 1.0)
- simpo_gamma SIMPO_GAMMA
                        The target reward margin term in SimPO loss. (default: 0.5)
- ppo_buffer_size PPO_BUFFER_SIZE
                        The number of mini-batches to make experience buffer in a PPO optimization step. (default: 1)
- ppo_epochs PPO_EPOCHS
                        The number of epochs to perform in a PPO optimization step. (default: 4)
- ppo_score_norm [PPO_SCORE_NORM]
                        Use score normalization in PPO training. (default: False)
- ppo_target PPO_TARGET
                        Target KL value for adaptive KL control in PPO training. (default: 6.0)
- ppo_whiten_rewards [PPO_WHITEN_REWARDS]
                        Whiten the rewards before computing advantages in PPO training. (default: False)
- ref_model REF_MODEL
                        Path to the reference model used for the PPO or DPO training. (default: None)
- ref_model_adapters REF_MODEL_ADAPTERS
                        Path to the adapters of the reference model. (default: None)
- ref_model_quantization_bit REF_MODEL_QUANTIZATION_BIT
                        The number of bits to quantize the reference model. (default: None)
- reward_model REWARD_MODEL
                        Path to the reward model used for the PPO training. (default: None)
- reward_model_adapters REWARD_MODEL_ADAPTERS
                        Path to the adapters of the reward model. (default: None)
- reward_model_quantization_bit REWARD_MODEL_QUANTIZATION_BIT
                        The number of bits to quantize the reward model. (default: None)
- reward_model_type {lora,full,api}
                        The type of the reward model in PPO training. The `lora` type only supports LoRA training. (default: lora)
- additional_target ADDITIONAL_TARGET
                        Name(s) of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint. Use commas to separate multiple modules. (default: None)
- lora_alpha LORA_ALPHA
                        The scale factor for LoRA fine-tuning. When unset, `lora_rank * 2` is used. (default: None)
- lora_dropout LORA_DROPOUT
                        Dropout rate for the LoRA fine-tuning. (default: 0.0)
- lora_rank LORA_RANK
                        The intrinsic dimension for LoRA fine-tuning. (default: 8)
- lora_target LORA_TARGET
                        Name(s) of target modules to apply LoRA. Use commas to separate multiple modules. Use `all` to specify all the linear modules. (default: all)
- loraplus_lr_ratio LORAPLUS_LR_RATIO
                        LoRA plus learning rate ratio (lr_B / lr_A). (default: None)
- loraplus_lr_embedding LORAPLUS_LR_EMBEDDING
                        LoRA plus learning rate for lora embedding layers. (default: 1e-06)
- use_rslora [USE_RSLORA]
                        Whether or not to use the rank stabilization scaling factor for LoRA layer. (default: False)
- use_dora [USE_DORA]
                        Whether or not to use the weight-decomposed LoRA method (DoRA). (default: False)
- pissa_init [PISSA_INIT]
                        Whether or not to initialize a PiSSA adapter. (default: False)
- pissa_iter PISSA_ITER
                        The number of iteration steps performed by FSVD in PiSSA. Use -1 to disable it. (default: 16)
- pissa_convert [PISSA_CONVERT]
                        Whether or not to convert the PiSSA adapter to a normal LoRA adapter. (default: False)
- create_new_adapter [CREATE_NEW_ADAPTER]
                        Whether or not to create a new adapter with randomly initialized weight. (default: False)
- freeze_trainable_layers FREEZE_TRAINABLE_LAYERS
                        The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive numbers mean the last n layers are set as trainable, negative numbers mean
                        the first n layers are set as trainable. (default: 2)
- freeze_trainable_modules FREEZE_TRAINABLE_MODULES
                        Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate multiple modules. Use `all` to specify all the available
                        modules. (default: all)
- freeze_extra_modules FREEZE_EXTRA_MODULES
                        Name(s) of modules apart from hidden layers to be set as trainable for freeze (partial-parameter) fine-tuning. Use commas to separate multiple modules.
                        (default: None)
- pure_bf16 [PURE_BF16]
                        Whether or not to train model in purely bf16 precision (without AMP). (default: False)
- stage {pt,sft,rm,ppo,dpo,kto}
                        Which stage will be performed in training. (default: sft)
- finetuning_type {lora,freeze,full}
                        Which fine-tuning method to use. (default: lora)
- use_llama_pro [USE_LLAMA_PRO]
                        Whether or not to make only the parameters in the expanded blocks trainable. (default: False)
- freeze_vision_tower [FREEZE_VISION_TOWER]
                        Whether or not to freeze the vision tower in MLLM training. (default: True)
- no_freeze_vision_tower
                        Do not freeze the vision tower in MLLM training; the negated form of `freeze_vision_tower`. (default: False)
- train_mm_proj_only [TRAIN_MM_PROJ_ONLY]
                        Whether or not to train the multimodal projector for MLLM only. (default: False)
- compute_accuracy [COMPUTE_ACCURACY]
                        Whether or not to compute the token-level accuracy at evaluation. (default: False)
- plot_loss [PLOT_LOSS]
                        Whether or not to save the training loss curves. (default: False)
- do_sample [DO_SAMPLE]
                        Whether or not to use sampling, use greedy decoding otherwise. (default: True)
- no_do_sample
                        Disable sampling and use greedy decoding; the negated form of `do_sample`. (default: False)
- temperature TEMPERATURE
                        The value used to modulate the next token probabilities. (default: 0.95)
- top_p TOP_P
                        The smallest set of most probable tokens with probabilities that add up to top_p or higher are kept. (default: 0.7)
- top_k TOP_K
                        The number of highest probability vocabulary tokens to keep for top-k filtering. (default: 50)
- num_beams NUM_BEAMS
                        Number of beams for beam search. 1 means no beam search. (default: 1)
- max_length MAX_LENGTH
                        The maximum length the generated tokens can have. It can be overridden by max_new_tokens. (default: 1024)
- max_new_tokens MAX_NEW_TOKENS
                        The maximum number of tokens to generate, ignoring the number of tokens in the prompt. (default: 1024)
- repetition_penalty REPETITION_PENALTY
                        The parameter for repetition penalty. 1.0 means no penalty. (default: 1.0)
- length_penalty LENGTH_PENALTY
                        Exponential penalty to the length that is used with beam-based generation. (default: 1.0)
- default_system DEFAULT_SYSTEM
                        Default system message to use in chat completion. (default: None)
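
The sampling options listed above (`temperature`, `top_k`, `top_p`) are easier to reason about with a small worked example. The following is a minimal pure-Python sketch, assuming the common order of operations: temperature scaling, then top-k, then top-p on the renormalised top-k distribution. The function name `filter_next_token_probs` and the toy logits are hypothetical and not part of Llama Factory; the real filtering happens inside the generation backend.

```python
import math


def filter_next_token_probs(logits, temperature=0.95, top_k=50, top_p=0.7):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax over the scaled logits.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens (in this sketch, 0 disables the filter).
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass_k = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative (renormalised) probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    # Renormalise the surviving probabilities so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_next_token_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.7))
```

With these toy values only the two most probable tokens survive, so the sampler would choose between them. Lowering `top_p` or `top_k` narrows the candidate set further, while `num_beams` greater than 1 or disabling `do_sample` replaces sampling with a deterministic search.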