
# Llama Java parameters

The following is a list of all the parameters used in this project.

> [!NOTE]
> Other reference documents: Transformers docs.

## Model parameters

- **Basic parameters**

| Parameter | Default | Description |
| --- | --- | --- |
| model_path | / | Llama model path. |
| lora_base | / | Optional model to use as a base for the layers modified by the LoRA adapter. |
| lora_path | / | Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). |
| lora_scale | 0.0 | Apply the LoRA adapter with a user-defined scaling factor S (implies --no-mmap). |
| verbose | false | Print verbose output to stderr. |
| numa_strategy | 0 | Attempt one of the NUMA optimization strategies that may help on some NUMA systems (default: disabled). See enum LlamaNumaStrategy. |
- **Context parameters**

| Parameter | Default | Description |
| --- | --- | --- |
| seed | -1 | Set the random number generator seed. |
| context_size | 512 | Size of the prompt context used by the model during text generation. |
| batch_size | 2048 | Set the batch size for prompt processing. |
| ubatch | 512 | Physical maximum batch size (default: 512). |
| seq_max | 1 | Maximum number of sequences (default: 1). |
| threads | 4 | Set the number of threads used for generation (single token). |
| threads_batch | 4 | Set the number of threads used for prompt and batch processing (multiple tokens). |
| rope_scaling_type | -1 | RoPE scaling type. See enum LlamaRoPEScalingType. |
| pooling_type | -1 | Pooling type for embeddings. See enum LlamaPoolingType. |
| rope_freq_base | 0.0 | Base frequency for RoPE sampling. |
| rope_freq_scale | 0.0 | Scale factor for RoPE sampling. |
| yarn_ext_factor | -1.0 | YaRN extrapolation mix factor (NaN = read from the model). |
| yarn_attn_factor | 1.0 | YaRN magnitude scaling factor. |
| yarn_beta_fast | 32.0 | YaRN low correction dimension. |
| yarn_beta_slow | 1.0 | YaRN high correction dimension. |
| yarn_orig_ctx | 0 | YaRN original context size. |
| defrag_thold | -1.0 | KV cache defragmentation threshold (default: -1.0, < 0 = disabled). |
| logits_all | false | Return logits for all tokens, not just the last token. |
| embedding | false | Embedding mode only. |
| offload_kqv | true | Whether to offload the KQV ops (including the KV cache) to the GPU. |
| flash_attn | false | Enable flash attention (default: disabled). |
- **Model parameters**

| Parameter | Default | Description |
| --- | --- | --- |
| gpu_layers | 0 | Number of layers to offload to the GPU (-ngl). If -1, all layers are offloaded. |
| split_mode | 1 | How to split the model across multiple GPUs. See enum LlamaSplitMode. |
| main_gpu | / | When using multiple GPUs, this option controls which GPU is used for small tensors for which the overhead of splitting the computation across all GPUs is not worthwhile. |
| tensor_split | / | When using multiple GPUs, this option controls how large tensors should be split across all GPUs. |
| vocab_only | false | Only load the vocabulary, not the weights. |
| mmap | true | Use mmap to load the model if possible (disabling it gives a slower load but may reduce pageouts if mlock is not used). |
| mlock | false | Lock the model in memory, preventing it from being swapped out when memory-mapped. |
| check_tensors | false | Validate model tensor data (default: disabled). |

### JSON template

```json
{
  "model_path": "",
  "lora_base": "",
  "lora_path": "",
  "lora_scale": 0.0,
  "verbose": false,
  "numa_strategy": 0,
  "seed": -1,
  "context_size": 512,
  "batch_size": 2048,
  "ubatch": 512,
  "seq_max": 1,
  "threads": 4,
  "threads_batch": 4,
  "rope_scaling_type": -1,
  "pooling_type": -1,
  "rope_freq_base": 0.0,
  "rope_freq_scale": 0.0,
  "yarn_ext_factor": -1.0,
  "yarn_attn_factor": 1.0,
  "yarn_beta_fast": 32.0,
  "yarn_beta_slow": 1.0,
  "yarn_orig_ctx": 0,
  "defrag_thold": -1.0,
  "logits_all": false,
  "embedding": false,
  "offload_kqv": true,
  "flash_attn": false,
  "gpu_layers": 0,
  "split_mode": 1,
  "main_gpu": 0,
  "tensor_split": [],
  "vocab_only": false,
  "mmap": true,
  "mlock": false,
  "check_tensors": false
}
```
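
How this JSON is supplied to the application depends on your setup. As a minimal sketch, the snippet below uses Jackson (a common JSON library, not something this project mandates) to load the template above, point it at a local GGUF model, and enable full GPU offload. The file name and field values are illustrative only.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.io.File;
import java.io.IOException;

public class ModelConfigExample {

    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();

        // Load the model-parameters template shown above (path is an example).
        ObjectNode config = (ObjectNode) mapper.readTree(new File("model-config.json"));

        // Point at a local GGUF model and offload all layers to the GPU (-1 = all).
        config.put("model_path", "/models/llama-3-8b-instruct.Q4_K_M.gguf");
        config.put("gpu_layers", -1);

        // Use a larger prompt context and all available CPU threads.
        config.put("context_size", 4096);
        config.put("threads", Runtime.getRuntime().availableProcessors());

        // Write the adjusted configuration back to disk.
        mapper.writerWithDefaultPrettyPrinter()
              .writeValue(new File("model-config.json"), config);
    }
}
```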

## Generate parameters

| Parameter | Default | Description |
| --- | --- | --- |
| temperature | 0.8 | Adjust the randomness of the generated text. |
| repeat_penalty | 1.1 | Control the repetition of token sequences in the generated text. |
| penalize_nl | true | Penalize newline tokens when applying the repeat penalty. |
| frequency_penalty | 0.0 | Repeat alpha frequency penalty. |
| presence_penalty | 0.0 | Repeat alpha presence penalty. |
| top_k | 40 | Top-K sampling: limit the next token selection to the K most probable tokens. |
| top_p | 0.9 | Top-P sampling: limit the next token selection to a subset of tokens with a cumulative probability above a threshold P. |
| tsf | 1.0 | Tail Free Sampling (TFS): enable tail free sampling with parameter z. |
| typical | 1.0 | Typical sampling: enable typical sampling with parameter p. |
| min_p | 0.05 | Min-P sampling: set a minimum base probability threshold for token selection. |
| mirostat_mode | DISABLED | Mirostat sampling: enable Mirostat sampling, controlling perplexity during text generation. DISABLED = disabled, V1 = Mirostat, V2 = Mirostat 2.0. |
| mirostat_eta | 0.1 | Mirostat sampling: set the Mirostat learning rate, parameter eta. |
| mirostat_tau | 5.0 | Mirostat sampling: set the Mirostat target entropy, parameter tau. |
| dynatemp_range | 0.0 | Dynamic temperature sampling: dynamic temperature range. The final temperature will be in the range (temperature - dynatemp_range) to (temperature + dynatemp_range). |
| dynatemp_exponent | 1.0 | Dynamic temperature sampling: dynamic temperature exponent. |
| grammar_rules | / | Specify a grammar (defined inline or in a file) to constrain model output to a specific format. |
| max_new_token_size | 512 | Maximum number of new tokens to generate. |
| verbose_prompt | false | Print the prompt before generating text. |
| last_tokens_size | 64 | Maximum number of tokens to keep in the last_n_tokens deque. |
| special | false | If true, special tokens are rendered in the output. |
| logit_bias | / | Adjust the probability distribution of specific tokens. |
| stopping_word | / | Stop word list that halts generation; values can be text or token IDs. |
| infill | false | Enable infill mode for the model. |
| spm_fill | false | Use the Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle), as some models prefer this. |
| prefix_token | / | Specify a prefix token in infill mode (if not specified, it is read from the model by default). |
| suffix_token | / | Specify a suffix token in infill mode. |
| middle_token | / | Specify a middle token in infill mode. |
| session_cache | false | If enabled, each chat conversation will be stored in the session cache. |
| prompt_cache | false | Cache the system prompt in the session and do not update it again. |
| user | User | Specify the user nickname. |
| assistant | Assistant | Specify the bot nickname. |

### JSON template

```json
{
  "temperature": 0.8,
  "repeat_penalty": 1.1,
  "penalize_nl": true,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "top_k": 40,
  "top_p": 0.9,
  "tsf": 1.0,
  "typical": 1.0,
  "min_p": 0.05,
  "mirostat_mode": "DISABLED",
  "mirostat_eta": 0.1,
  "mirostat_tau": 5.0,
  "dynatemp_range": 0.0,
  "dynatemp_exponent": 1.0,
  "grammar_rules": null,
  "max_new_token_size": 512,
  "verbose_prompt": false,
  "last_tokens_size": 64,
  "special": false,
  "logit_bias": null,
  "stopping_word": null,
  "infill": false,
  "spm_fill": false,
  "prefix_token": "",
  "suffix_token": "",
  "middle_token": "",
  "session_cache": false,
  "prompt_cache": false,
  "user": "User",
  "assistant": "Assistant"
}
```
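
As with the model parameters, the generate parameters can also be assembled programmatically before being handed to the chat application. The sketch below builds two illustrative presets with Jackson (the library choice and the preset values are assumptions, not requirements of this project): a near-deterministic preset and a more creative one that switches Mirostat 2.0 on.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class GenerateConfigExample {

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Near-deterministic preset: low temperature, tight nucleus sampling.
        ObjectNode precise = mapper.createObjectNode();
        precise.put("temperature", 0.2);
        precise.put("top_p", 0.5);
        precise.put("repeat_penalty", 1.1);
        precise.put("max_new_token_size", 256);

        // Creative preset: higher temperature plus Mirostat 2.0 (mirostat_mode = V2).
        ObjectNode creative = mapper.createObjectNode();
        creative.put("temperature", 1.0);
        creative.put("mirostat_mode", "V2");
        creative.put("mirostat_tau", 5.0);
        creative.put("mirostat_eta", 0.1);
        creative.put("max_new_token_size", 512);

        // Print both presets as pretty JSON for inspection.
        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(precise));
        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(creative));
    }
}
```

Fields omitted from a preset simply keep the defaults listed in the table above.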