shisa-v2
See: shisa-ai/shisa-v2
Areas of Improvement
Code cleanup
- Set up black, gitleaks hooks
- Move code around so it makes sense
- Add working dirs to .gitignore
- 1-click cloud deploy containers for training, evals
 
Language leakage
- Run sweeps w/ different sampling parameters to determine the best settings for minimizing leakage (see the sketch below)
- Test a reduced tokenizer size for language leakage (maybe not a problem if not using the extended tokenizer)
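A minimal sketch of such a sweep, assuming the shisa-7b-v1 checkpoint and a crude Japanese-character-ratio heuristic as the leakage metric (both are assumptions; a real run should use many more prompts and proper language detection):

```python
import itertools
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "augmxnt/shisa-7b-v1"  # assumed checkpoint
PROMPTS = ["日本の首都はどこですか?", "自己紹介をしてください。"]

# Crude leakage proxy: fraction of non-space chars in JA Unicode ranges.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def ja_ratio(text: str) -> float:
    chars = [c for c in text if not c.isspace()]
    return len(JA_CHARS.findall(text)) / max(len(chars), 1)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Sweep a small grid of sampling settings and report the mean JA ratio;
# on JA prompts, lower-leakage settings should score closer to 1.0.
for temp, top_p, rep_pen in itertools.product([0.2, 0.7, 1.0], [0.9, 0.95], [1.0, 1.15]):
    ratios = []
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temp,
            top_p=top_p,
            repetition_penalty=rep_pen,
            max_new_tokens=128,
        )
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        ratios.append(ja_ratio(tokenizer.decode(new_tokens, skip_special_tokens=True)))
    print(f"temp={temp} top_p={top_p} rep_pen={rep_pen} ja_ratio={sum(ratios)/len(ratios):.3f}")
```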
 
Instruction following
- Compare English vs Japanese instruction following
 
Language steerability
- Training samples for "reply in Japanese", "reply in English", "reply in the language the user speaks", etc.
- Multi-turn training with language switching within turns (see the sketch below)
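A sketch of what these samples could look like (the chat schema is an assumption in ShareGPT style, not a fixed shisa-v2 format):

```python
steerability_samples = [
    # Explicit language directive in the system prompt.
    {
        "conversations": [
            {"role": "system", "content": "Reply in Japanese."},
            {"role": "user", "content": "What is the capital of Japan?"},
            {"role": "assistant", "content": "日本の首都は東京です。"},
        ]
    },
    # Multi-turn sample where the user switches language mid-conversation
    # and the assistant follows.
    {
        "conversations": [
            {"role": "user", "content": "東京の人口を教えてください。"},
            {"role": "assistant", "content": "東京都の人口はおよそ1,400万人です。"},
            {"role": "user", "content": "Now answer in English: what about Osaka?"},
            {"role": "assistant", "content": "Osaka Prefecture has a population of roughly 8.8 million."},
        ]
    },
]
```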
 
Training Data
Tuning diversity
- See: https://github.com/jondurbin/bagel
- Mix up instruction formats/formatting
 
Language prefs
- Review the % of translate-to-JA vs translate-to-EN samples
- Potentially take the Snow/translation datasets (and our own datasets) and swap in automated "Reply in English/Japanese" variations, appended or prepended (see the sketch below)
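A sketch of the append/prepend augmentation, assuming simple `{"ja": ..., "en": ...}` translation pairs (field names and directive wording are illustrative):

```python
import random

DIRECTIVES_EN = ["Reply in English.", "Answer in English."]
DIRECTIVES_JA = ["日本語で答えてください。", "日本語で返信してください。"]

def steer(pair: dict) -> dict:
    """Turn a translation pair into a language-steered instruction sample."""
    if random.random() < 0.5:  # tune this EN/JA split per the review above
        directive, source, output = random.choice(DIRECTIVES_EN), pair["ja"], pair["en"]
    else:
        directive, source, output = random.choice(DIRECTIVES_JA), pair["en"], pair["ja"]
    # Randomly prepend or append the directive to the source text.
    prompt = random.choice([f"{directive} {source}", f"{source} {directive}"])
    return {"instruction": prompt, "output": output}

print(steer({"ja": "猫が好きです。", "en": "I like cats."}))
```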
 
Niceties
- Figure out a good way to insert "who made you", "tell me about yourself", "describe yourself", etc. samples (see the sketch below)
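One possible approach is templating a small pool of identity Q&A pairs into the tuning mix; all wording below (questions and answers) is an illustrative assumption:

```python
# Map varied identity questions to a consistent answer per language.
IDENTITY_QA = {
    "Who made you?": "en",
    "Tell me about yourself.": "en",
    "Describe yourself.": "en",
    "あなたは誰ですか?": "ja",
    "あなたを作ったのは誰ですか?": "ja",
}
ANSWERS = {
    "en": "I am Shisa, a bilingual Japanese/English assistant trained by AUGMXNT.",
    "ja": "私はAUGMXNTによって訓練されたバイリンガルアシスタントのShisaです。",
}

identity_samples = [
    {"instruction": q, "output": ANSWERS[lang]} for q, lang in IDENTITY_QA.items()
]
```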
 
DPO Review
- DPO data quality really needs manual review - LLM-as-judge output needs sanity checking; lots of errors
- DPO vs KTO? https://twitter.com/ethayarajh/status/1732837520784957476 (data shapes sketched below)
- Preference Tuning LLMs with Direct Preference Optimization Methods
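For reference, the two methods want different data shapes (field names loosely follow TRL's DPOTrainer/KTOTrainer conventions, but treat the examples as assumptions): DPO needs paired chosen/rejected completions, while KTO only needs a per-example good/bad label, which is cheaper to sanity-check by hand:

```python
# DPO: pairwise preference over two completions of the same prompt.
dpo_example = {
    "prompt": "日本で一番高い山は?",
    "chosen": "日本で一番高い山は富士山です。",
    "rejected": "The tallest mountain in Japan is Mt. Fuji.",  # language leakage
}

# KTO: a single completion with a binary thumbs-up/down label.
kto_example = {
    "prompt": "日本で一番高い山は?",
    "completion": "日本で一番高い山は富士山です。",
    "label": True,
}
```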
 
Pre-Training
- 12B vs 8B, but maybe start with a fine-tune of bigger models w/o additional pre-training
- Curriculum training: https://twitter.com/stablequan/status/1734057289542484038
 
Relevant New Research
- Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
- Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?
- Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings
- How Multilingual is Multilingual LLM?
- Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention
- Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting
 
Evals
See: https://github.com/AUGMXNT/inference-benchmark for benchmarks
llm_judge fork
- Swap to lm-eval's vLLM backend for fast inference (or the OpenAI API w/ llama.cpp GGUF, ExLlamaV2, MLC, etc.) - 50X faster than HF Transformers (see the client sketch below)
- Keep the data format
- OpenAI API 1.0+
- Make compatible w/ shisa-eval-server (human eval)
- Turn the ELYZA-tasks-100 tasks into tasks.json w/ a custom judging rubric (may need to extend the format?)
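A sketch of the swapped-out generation call using the OpenAI 1.0+ client pointed at a local vLLM OpenAI-compatible server (URL, model name, and sampling params are assumptions):

```python
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model augmxnt/shisa-7b-v1
from openai import OpenAI  # openai>=1.0 client

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="augmxnt/shisa-7b-v1",  # assumed model under eval
    messages=[{"role": "user", "content": "日本の四季について教えてください。"}],
    temperature=0.7,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

The same client also works against llama.cpp's server or any other OpenAI-compatible endpoint, so the judge code only needs one code path.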
Bigger runs
Options
- Orion 14B
  - 2.5T tokens, multilingual
  - Efficient tokenizer
  - Instant commercial license: https://www.orionstar.com/llm-license.html
  - Very high reasoning scores
 
- Swallow 70B
  - Llama 2 based, Llama license (700M MAU cap)
  - 7B, 13B, 70B
  - 70B has GQA, +100B JA pretrain tokens, 46K JA-extended vocab
 
 
Bad Options
- Yi 34B
  - Not so efficient
  - Instant commercial license: https://www.lingyiwanwu.com/yi-license
  - Terrible JA tokenizer
 
- DeepSeek LLM 67B (MIT License)
  - No commercial limitations, just restrictions to lawful, non-military, non-harmful-to-minors use, etc.
  - Has a 7B to tune on
  - GQA, 2T EN/CN pretrain, 4K context, 102.4K vocab
  - Chars/token - en: 4.329528, ja: 0.852132 (see the comparison sketch below)
  - Oof, bad tokenizer for Japanese
 
 
- Qwen-72B (Qwen License) - licensing sucks
  - Can't train derived works
  - 100M MAU cap
 
- Mixtral 8x7B (Apache 2.0) - too hard to tune
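For tokenizer comparisons like the DeepSeek chars/token numbers above, a quick sketch (model IDs and sample sentences are illustrative; a real comparison should average over a large parallel corpus):

```python
from transformers import AutoTokenizer

# Parallel EN/JA sentences; higher chars/token = more efficient tokenizer.
SAMPLES = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "ja": "素早い茶色の狐がのろまな犬を飛び越える。",
}

# HF hub names as of writing (assumptions).
for name in ["deepseek-ai/deepseek-llm-67b-base", "tokyotech-llm/Swallow-70b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in SAMPLES.items():
        n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
        print(f"{name} {lang}: {len(text) / n_tokens:.2f} chars/token")
```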
 
Misc
Improved HF Space?
See: