shisa-v2
See: shisa-ai/shisa-v2
Areas of Improvement
Code cleanup
- Set up black, gitleaks hooks
- Move code around so it makes sense
- Add working dirs to .gitignore
- 1-click cloud deploy containers for training, evals
 
Language leakage
- Run sweeps w/ different sampling parameters to determine the best settings for minimizing leakage (see the sketch below)
- Test a reduced tokenizer size for language leakage (maybe not a problem if not using the extended tokenizer)
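A minimal sketch of such a sweep, assuming the shisa-7b-v1 checkpoint and a crude Japanese-character-ratio heuristic as the leakage metric (both are assumptions; a real run should use many more prompts and proper language detection):

```python
import itertools
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "augmxnt/shisa-7b-v1"  # assumed checkpoint
PROMPTS = ["日本の首都はどこですか?", "自己紹介をしてください。"]

# Crude leakage proxy: fraction of non-space chars in JA Unicode ranges.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def ja_ratio(text: str) -> float:
    chars = [c for c in text if not c.isspace()]
    return len(JA_CHARS.findall(text)) / max(len(chars), 1)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# Sweep a small grid of sampling settings and report the mean JA ratio;
# on JA prompts, lower-leakage settings should score closer to 1.0.
for temp, top_p, rep_pen in itertools.product([0.2, 0.7, 1.0], [0.9, 0.95], [1.0, 1.15]):
    ratios = []
    for prompt in PROMPTS:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=temp,
            top_p=top_p,
            repetition_penalty=rep_pen,
            max_new_tokens=128,
        )
        new_tokens = out[0][inputs["input_ids"].shape[1]:]
        ratios.append(ja_ratio(tokenizer.decode(new_tokens, skip_special_tokens=True)))
    print(f"temp={temp} top_p={top_p} rep_pen={rep_pen} ja_ratio={sum(ratios)/len(ratios):.3f}")
```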
 
Instruction following
- Compare English vs Japanese instruction following
 
Language steerability
- Training samples for "reply in Japanese", "reply in English", "reply in the language the user speaks", etc.
- Multi-turn training with language switching within turns (see the sketch below)
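A sketch of what these samples could look like (the chat schema is an assumption in ShareGPT style, not a fixed shisa-v2 format):

```python
steerability_samples = [
    # Explicit language directive in the system prompt.
    {
        "conversations": [
            {"role": "system", "content": "Reply in Japanese."},
            {"role": "user", "content": "What is the capital of Japan?"},
            {"role": "assistant", "content": "日本の首都は東京です。"},
        ]
    },
    # Multi-turn sample where the user switches language mid-conversation
    # and the assistant follows.
    {
        "conversations": [
            {"role": "user", "content": "東京の人口を教えてください。"},
            {"role": "assistant", "content": "東京都の人口はおよそ1,400万人です。"},
            {"role": "user", "content": "Now answer in English: what about Osaka?"},
            {"role": "assistant", "content": "Osaka Prefecture has a population of roughly 8.8 million."},
        ]
    },
]
```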
 
Training Data
Tuning diversity
- See: https://github.com/jondurbin/bagel
- Mix up instruction formats/formatting
 
Language prefs
- Review the % of translate-to-JA vs translate-to-EN samples
- Potentially take the Snow/translation datasets (and our own datasets) and swap in automated "Reply in English/Japanese" variations, appended or prepended (see the sketch below)
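A sketch of the append/prepend augmentation, assuming simple `{"ja": ..., "en": ...}` translation pairs (field names and directive wording are illustrative):

```python
import random

DIRECTIVES_EN = ["Reply in English.", "Answer in English."]
DIRECTIVES_JA = ["日本語で答えてください。", "日本語で返信してください。"]

def steer(pair: dict) -> dict:
    """Turn a translation pair into a language-steered instruction sample."""
    if random.random() < 0.5:  # tune this EN/JA split per the review above
        directive, source, output = random.choice(DIRECTIVES_EN), pair["ja"], pair["en"]
    else:
        directive, source, output = random.choice(DIRECTIVES_JA), pair["en"], pair["ja"]
    # Randomly prepend or append the directive to the source text.
    prompt = random.choice([f"{directive} {source}", f"{source} {directive}"])
    return {"instruction": prompt, "output": output}

print(steer({"ja": "猫が好きです。", "en": "I like cats."}))
```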
 
Niceties
- Figure out a good way to insert "who made you", "tell me about yourself", "describe yourself", etc. samples (see the sketch below)
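One possible approach is templating a small pool of identity Q&A pairs into the tuning mix; all wording below (questions and answers) is an illustrative assumption:

```python
# Map varied identity questions to a consistent answer per language.
IDENTITY_QA = {
    "Who made you?": "en",
    "Tell me about yourself.": "en",
    "Describe yourself.": "en",
    "あなたは誰ですか?": "ja",
    "あなたを作ったのは誰ですか?": "ja",
}
ANSWERS = {
    "en": "I am Shisa, a bilingual Japanese/English assistant trained by AUGMXNT.",
    "ja": "私はAUGMXNTによって訓練されたバイリンガルアシスタントのShisaです。",
}

identity_samples = [
    {"instruction": q, "output": ANSWERS[lang]} for q, lang in IDENTITY_QA.items()
]
```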
 
DPO Review
- DPO data quality really needs manual review - LLM-as-judge output needs sanity checking; lots of errors
- DPO vs KTO? https://twitter.com/ethayarajh/status/1732837520784957476 (data shapes sketched below)
- Preference Tuning LLMs with Direct Preference Optimization Methods
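For reference, the two methods want different data shapes (field names loosely follow TRL's DPOTrainer/KTOTrainer conventions, but treat the examples as assumptions): DPO needs paired chosen/rejected completions, while KTO only needs a per-example good/bad label, which is cheaper to sanity-check by hand:

```python
# DPO: pairwise preference over two completions of the same prompt.
dpo_example = {
    "prompt": "日本で一番高い山は?",
    "chosen": "日本で一番高い山は富士山です。",
    "rejected": "The tallest mountain in Japan is Mt. Fuji.",  # language leakage
}

# KTO: a single completion with a binary thumbs-up/down label.
kto_example = {
    "prompt": "日本で一番高い山は?",
    "completion": "日本で一番高い山は富士山です。",
    "label": True,
}
```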
 
Pre-Training
- 12B vs 8B, but maybe start with a fine-tune of bigger models w/o additional pre-training
- Curriculum training: https://twitter.com/stablequan/status/1734057289542484038
 
Relevant New Research
- Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
- Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?
- Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings
- How Multilingual is Multilingual LLM?
- Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention
- Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting
 
Evals
See: https://github.com/AUGMXNT/inference-benchmark for benchmarks
llm_judge fork
- Swap to lm-eval's vLLM backend for fast inference (or the OpenAI API w/ llama.cpp GGUF, ExLlamaV2, MLC, etc.) - 50X faster than HF Transformers (see the client sketch below)
- Keep the data format
- OpenAI API 1.0+
- Make compatible w/ shisa-eval-server (human eval)
- Turn the ELYZA-tasks-100 tasks into tasks.json w/ a custom judging rubric (may need to extend the format?)
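A sketch of the swapped-out generation call using the OpenAI 1.0+ client pointed at a local vLLM OpenAI-compatible server (URL, model name, and sampling params are assumptions):

```python
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model augmxnt/shisa-7b-v1
from openai import OpenAI  # openai>=1.0 client

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="augmxnt/shisa-7b-v1",  # assumed model under eval
    messages=[{"role": "user", "content": "日本の四季について教えてください。"}],
    temperature=0.7,
    max_tokens=256,
)
print(resp.choices[0].message.content)
```

The same client also works against llama.cpp's server or any other OpenAI-compatible endpoint, so the judge code only needs one code path.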
Bigger runs
Options
- Orion 14B
  - 2.5T tokens, multilingual
  - Efficient tokenizer
  - Instant commercial license: https://www.orionstar.com/llm-license.html
  - Very high reasoning scores
 
- Swallow 70B
  - Llama 2 based, Llama license (700M MAU cap)
  - 7B, 13B, 70B
  - 70B has GQA, +100B JA pretrain tokens, 46K JA-extended vocab
 
 
Bad Options
- Yi 34B
  - Not so efficient
  - Instant commercial license: https://www.lingyiwanwu.com/yi-license
  - Terrible JA tokenizer
 
- DeepSeek LLM 67B (MIT License)
  - No commercial limitations, just restrictions to lawful, non-military, non-harmful-to-minors use, etc.
  - Has a 7B to tune on
  - GQA, 2T EN/CN pretrain, 4K context, 102.4K vocab
  - Chars/token - en: 4.329528, ja: 0.852132 (see the comparison sketch below)
  - Oof, bad tokenizer for Japanese
 
 
- Qwen-72B (Qwen License) - licensing sucks
  - Can't train derived works
  - 100M MAU cap
 
- Mixtral 8x7B (Apache 2.0) - too hard to tune
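For tokenizer comparisons like the DeepSeek chars/token numbers above, a quick sketch (model IDs and sample sentences are illustrative; a real comparison should average over a large parallel corpus):

```python
from transformers import AutoTokenizer

# Parallel EN/JA sentences; higher chars/token = more efficient tokenizer.
SAMPLES = {
    "en": "The quick brown fox jumps over the lazy dog.",
    "ja": "素早い茶色の狐がのろまな犬を飛び越える。",
}

# HF hub names as of writing (assumptions).
for name in ["deepseek-ai/deepseek-llm-67b-base", "tokyotech-llm/Swallow-70b-hf"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in SAMPLES.items():
        n_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
        print(f"{name} {lang}: {len(text) / n_tokens:.2f} chars/token")
```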
 
Misc
Improved HF Space?
See: