# Llama Java parameters
The following is a list of all the parameters used in this project.
> [!NOTE]
> Other reference documents: Transformers docs.
## Model parameters
- Basic parameters
Parameter | Default | Description |
---|---|---|
model_path | / | Llama model path. |
lora_base | / | Optional model to use as a base for the layers modified by the LoRA adapter. |
lora_path | / | Apply a LoRA (Low-Rank Adaptation) adapter to the model (implies --no-mmap). |
lora_scale | 0.0 | Apply the LoRA adapter with user-defined scaling S (implies --no-mmap). |
verbose | false | Print verbose output to stderr. |
numa_strategy | 0 | Attempt one of the NUMA optimization strategies that may help on some systems (0 = disabled). enum LlamaNumaStrategy |
- Context parameters
Parameter | Default | Description |
---|---|---|
seed | -1 | Set the random number generator seed. |
context_size | 512 | Size of the prompt context used by the model during text generation. |
batch_size | 2048 | Set the batch size for prompt processing. |
ubatch | 512 | Physical maximum batch size. |
seq_max | 1 | Maximum number of sequences. |
threads | 4 | Set the number of threads used for generation (single token). |
threads_batch | 4 | Set the number of threads used for prompt and batch processing (multiple tokens). |
rope_scaling_type | -1 | RoPE scaling type. enum LlamaRoPEScalingType |
pooling_type | -1 | Pooling type for embeddings. enum LlamaPoolingType |
rope_freq_base | 0.0 | Base frequency for RoPE sampling. |
rope_freq_scale | 0.0 | Scale factor for RoPE sampling. |
yarn_ext_factor | -1.0 | YaRN extrapolation mix factor, NaN = from model. |
yarn_attn_factor | 1.0 | YaRN magnitude scaling factor. |
yarn_beta_fast | 32.0 | YaRN low correction dim. |
yarn_beta_slow | 1.0 | YaRN high correction dim. |
yarn_orig_ctx | 0 | YaRN original context size. |
defrag_thold | -1.0 | KV cache defragmentation threshold (< 0 = disabled). |
logits_all | false | Return logits for all tokens, not just the last token. |
embedding | false | Embedding mode only. |
offload_kqv | true | Whether to offload the KQV ops (including the KV cache) to GPU. |
flash_attn | false | Enable flash attention. |
- Model parameters
Parameter | Default | Description |
---|---|---|
gpu_layers | 0 | Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded. |
split_mode | 1 | How to split the model across multiple GPUs. enum LlamaSplitMode |
main_gpu | / | When using multiple GPUs, this option controls which GPU is used for small tensors, for which the overhead of splitting the computation across all GPUs is not worthwhile. |
tensor_split | / | When using multiple GPUs, this option controls how large tensors should be split across all GPUs. |
vocab_only | false | Only load the vocabulary, not the weights. |
mmap | true | Use mmap for faster loading if possible (disabling it loads more slowly but may reduce pageouts if not using mlock). |
mlock | false | Lock the model in memory, preventing it from being swapped out when memory-mapped. |
check_tensors | false | Validate model tensor data. |
### JSON template

```json
{
  "model_path": "",
  "lora_base": "",
  "lora_path": "",
  "lora_scale": 0.0,
  "verbose": false,
  "numa_strategy": 0,
  "seed": -1,
  "context_size": 512,
  "batch_size": 2048,
  "ubatch": 512,
  "seq_max": 1,
  "threads": 4,
  "threads_batch": 4,
  "rope_scaling_type": -1,
  "pooling_type": -1,
  "rope_freq_base": 0.0,
  "rope_freq_scale": 0.0,
  "yarn_ext_factor": -1.0,
  "yarn_attn_factor": 1.0,
  "yarn_beta_fast": 32.0,
  "yarn_beta_slow": 1.0,
  "yarn_orig_ctx": 0,
  "defrag_thold": -1.0,
  "logits_all": false,
  "embedding": false,
  "offload_kqv": true,
  "flash_attn": false,
  "gpu_layers": 0,
  "split_mode": 1,
  "main_gpu": 0,
  "tensor_split": [],
  "vocab_only": false,
  "mmap": true,
  "mlock": false,
  "check_tensors": false
}
```
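The same parameters can also be set programmatically. Below is a minimal sketch, assuming the library exposes a builder-style `ModelParameter` class and a `Model` constructor that accepts it; the package, class, and method names here are illustrative assumptions and may differ in your version of the library.

```java
import chat.octet.model.Model;
import chat.octet.model.parameters.ModelParameter;

public class ModelParameterExample {

    public static void main(String[] args) {
        // Assumed builder-style API; field names mirror the JSON template above.
        ModelParameter modelParams = ModelParameter.builder()
                .modelPath("/models/model.gguf")  // hypothetical model path
                .contextSize(512)
                .threads(4)
                .threadsBatch(4)
                .gpuLayers(0)                     // -1 offloads all layers to GPU
                .mmap(true)
                .mlock(false)
                .verbose(false)
                .build();

        // Load the model with the parameters above (assumed to be AutoCloseable).
        try (Model model = new Model(modelParams)) {
            // ... run chat or completion with the loaded model
        }
    }
}
```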
## Generate parameters
Parameter | Default | Description |
---|---|---|
temperature | 0.8 | Adjust the randomness of the generated text. |
repeat_penalty | 1.1 | Control the repetition of token sequences in the generated text. |
penalize_nl | true | Penalize newline tokens when applying the repeat penalty. |
frequency_penalty | 0.0 | Repeat alpha frequency penalty. |
presence_penalty | 0.0 | Repeat alpha presence penalty. |
top_k | 40 | TOP-K Sampling: Limit the next token selection to the K most probable tokens. |
top_p | 0.9 | TOP-P Sampling: Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P. |
tsf | 1.0 | Tail Free Sampling (TFS): Enable tail free sampling with parameter z. |
typical | 1.0 | Typical Sampling: Enable typical sampling with parameter p. |
min_p | 0.05 | Min P Sampling: Set a minimum base probability threshold for token selection. |
mirostat_mode | DISABLED | Mirostat Sampling: Enable Mirostat sampling, controlling perplexity during text generation. DISABLED = disabled, V1 = Mirostat, V2 = Mirostat 2.0. |
mirostat_eta | 0.1 | Mirostat Sampling: Set the Mirostat learning rate, parameter eta. |
mirostat_tau | 5.0 | Mirostat Sampling: Set the Mirostat target entropy, parameter tau. |
dynatemp_range | 0.0 | Dynamic Temperature Sampling: Dynamic temperature range. The final temperature will be in the range [temperature - dynatemp_range, temperature + dynatemp_range]. |
dynatemp_exponent | 1.0 | Dynamic Temperature Sampling: Dynamic temperature exponent. |
grammar_rules | / | Specify a grammar (defined inline or in a file) to constrain model output to a specific format. |
max_new_token_size | 512 | Maximum new token generation size. |
verbose_prompt | false | Print the prompt before generating text. |
last_tokens_size | 64 | Maximum number of tokens to keep in the last_n_tokens deque. |
special | false | If true, special tokens are rendered in the output. |
logit_bias | / | Adjust the probability distribution of specified tokens. |
stopping_word | / | Stop word list used to stop generation; values can be text or token IDs. |
infill | false | Enable infill mode for the model. |
spm_fill | false | Use Suffix/Prefix/Middle pattern for infill (instead of Prefix/Suffix/Middle) as some models prefer this. |
prefix_token | / | Specify a prefix token in infill mode (if not specified, it is read from the model by default). |
suffix_token | / | Specify a suffix token in infill mode. |
middle_token | / | Specify a middle token in infill mode. |
session_cache | false | If enabled, each chat conversation will be stored in the session cache. |
prompt_cache | false | Cache the system prompt in the session and do not update it again. |
user | User | Specify user nickname. |
assistant | Assistant | Specify bot nickname. |
### JSON template

```json
{
  "temperature": 0.8,
  "repeat_penalty": 1.1,
  "penalize_nl": true,
  "frequency_penalty": 0.0,
  "presence_penalty": 0.0,
  "top_k": 40,
  "top_p": 0.9,
  "tsf": 1.0,
  "typical": 1.0,
  "min_p": 0.05,
  "mirostat_mode": "DISABLED",
  "mirostat_eta": 0.1,
  "mirostat_tau": 5.0,
  "dynatemp_range": 0.0,
  "dynatemp_exponent": 1.0,
  "grammar_rules": null,
  "max_new_token_size": 512,
  "verbose_prompt": false,
  "last_tokens_size": 64,
  "special": false,
  "logit_bias": null,
  "stopping_word": null,
  "infill": false,
  "spm_fill": false,
  "prefix_token": "",
  "suffix_token": "",
  "middle_token": "",
  "session_cache": false,
  "prompt_cache": false,
  "user": "User",
  "assistant": "Assistant"
}
```
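Generation parameters can be assembled the same way. The sketch below assumes a builder-style `GenerateParameter` class whose fields mirror the JSON template above; the import and method names are illustrative assumptions, not confirmed by this page.

```java
import chat.octet.model.parameters.GenerateParameter;

public class GenerateParameterExample {

    public static void main(String[] args) {
        // Assumed builder-style API; values match the defaults listed above.
        GenerateParameter generateParams = GenerateParameter.builder()
                .temperature(0.8f)
                .repeatPenalty(1.1f)
                .topK(40)
                .topP(0.9f)
                .maxNewTokenSize(512)
                .user("User")
                .assistant("Assistant")
                .build();

        // Pass generateParams to the model's chat/completion call;
        // the exact method signature depends on the library version.
    }
}
```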