shisa-v2
See: shisa-ai/shisa-v2
Areas of Improvement
Code cleanup
- Set up black and gitleaks pre-commit hooks (see the config sketch after this list)
- Move code around so it makes sense
- Set up .gitignore for working dirs
- 1-click cloud deploy containers for training, evals
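A minimal `.pre-commit-config.yaml` sketch for the black/gitleaks hooks above; both projects publish official pre-commit hooks, but the `rev` pins here are placeholders:

```yaml
repos:
  # Auto-format Python on commit
  - repo: https://github.com/psf/black
    rev: 24.1.1        # placeholder pin
    hooks:
      - id: black
  # Block commits that contain secrets
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.2       # placeholder pin
    hooks:
      - id: gitleaks
```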
Language leakage
- Run sweeps w/ different sampling parameters to determine the settings that best minimize leakage (see the sketch after this list)
- Test a reduced tokenizer size for language leakage (maybe not a problem if not using the extended tokenizer)
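A rough sketch of what such a sweep could look like. `generate` is a hypothetical wrapper around the model's sampling API, and the kana/kanji character ratio is just one possible proxy for leakage:

```python
# Hypothetical sweep: vary sampling parameters and measure how often a
# JA-prompted model leaks non-Japanese text into its replies.
import itertools
import re

JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # kana + common kanji

def ja_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Japanese."""
    chars = [c for c in text if not c.isspace()]
    return sum(bool(JA_CHARS.match(c)) for c in chars) / max(len(chars), 1)

def sweep(generate, ja_prompts, temperatures=(0.1, 0.5, 0.8), top_ps=(0.9, 0.95, 1.0)):
    """Return mean JA ratio per (temperature, top_p); higher = less leakage."""
    results = {}
    for t, p in itertools.product(temperatures, top_ps):
        ratios = [ja_ratio(generate(q, temperature=t, top_p=p)) for q in ja_prompts]
        results[(t, p)] = sum(ratios) / len(ratios)
    return results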
Instruction following
- Compare English vs Japanese instruction following
Language steerability
- Training samples for "reply in Japanese", "reply in English", "reply in the language the user speaks", etc.
- Multi-turn training with language switching within turns (see the example below)
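A sketch of one hypothetical multi-turn sample with a mid-conversation language switch; the field names are illustrative, not the actual training schema:

```python
# Illustrative multi-turn sample: the user switches from English to
# Japanese mid-conversation and the assistant is expected to follow.
sample = {
    "conversations": [
        {"role": "user", "content": "What is the capital of Japan?"},
        {"role": "assistant", "content": "The capital of Japan is Tokyo."},
        {"role": "user", "content": "では、日本語で東京について教えてください。"},
        {"role": "assistant", "content": "東京は日本の首都で、世界有数の大都市です。"},
    ]
}
```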
Training Data
Tuning diversity
- See: https://github.com/jondurbin/bagel
- Mix up instruction formats/formatting (see the rendering sketch below)
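A sketch of rendering the same sample into several common prompt formats (bagel-style diversity); the templates here are approximations of the usual alpaca/chatml/vicuna layouts:

```python
# Render one (instruction, response) pair in several prompt formats so
# the model doesn't overfit to a single template (approximate templates).
FORMATS = {
    "alpaca": "### Instruction:\n{q}\n\n### Response:\n{a}",
    "chatml": "<|im_start|>user\n{q}<|im_end|>\n<|im_start|>assistant\n{a}<|im_end|>",
    "vicuna": "USER: {q}\nASSISTANT: {a}",
}

def render_all(q: str, a: str) -> dict:
    return {name: tmpl.format(q=q, a=a) for name, tmpl in FORMATS.items()}
```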
Language prefs
- Review the % of translate-to-JA vs translate-to-EN samples
- Potentially take the Snow/translation datasets (and our own datasets) and swap in automated variations of "Reply in English/Japanese", appended or prepended (see the sketch below)
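A sketch of that augmentation under the assumption that each translation pair gives us both a JA and an EN answer; the directive strings and schema are illustrative:

```python
# Hypothetical augmentation: turn translation pairs into language-
# steerability samples by prepending/appending reply-language directives.
import random

DIRECTIVES = {
    "ja": ["日本語で答えてください。", "Reply in Japanese."],
    "en": ["英語で答えてください。", "Reply in English."],
}

def make_sample(question: str, answer_ja: str, answer_en: str) -> dict:
    lang = random.choice(["ja", "en"])
    directive = random.choice(DIRECTIVES[lang])
    if random.random() < 0.5:
        prompt = f"{directive}\n{question}"   # prepended
    else:
        prompt = f"{question}\n{directive}"   # appended
    return {"prompt": prompt, "response": answer_ja if lang == "ja" else answer_en}
```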
Niceties
- Figure out a good way to insert "who made you", "tell me about yourself", "describe yourself", etc. samples (see the seed sketch below)
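One hypothetical approach: a small set of seed question/answer pairs mixed into the SFT data, then expanded with paraphrased/translated variants. The wording is purely illustrative:

```python
# Illustrative identity seed samples; paraphrase/translate variants
# would be generated from these before mixing into the SFT set.
IDENTITY_SEEDS = [
    ("Who made you?", "I'm Shisa, an open model trained by AUGMXNT."),
    ("Tell me about yourself.", "I'm Shisa, a bilingual (JA/EN) assistant."),
    ("Describe yourself.", "I'm a Japanese/English language model called Shisa."),
]
```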
DPO Review
- DPO quality really needs manual review - LLM-as-judge needs sanity checking; lots of errata (a training sketch follows this list)
- DPO vs KTO? https://twitter.com/ethayarajh/status/1732837520784957476
- Preference Tuning LLMs with Direct Preference Optimization Methods
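A minimal DPO sketch using TRL's DPOTrainer, assuming a preference dataset with prompt/chosen/rejected columns; the model id is a placeholder and the argument names match older TRL releases, so they may differ by version:

```python
# Minimal DPO fine-tuning sketch with TRL (API details vary by version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("base-model")      # placeholder id
ref_model = AutoModelForCausalLM.from_pretrained("base-model")  # frozen reference
tokenizer = AutoTokenizer.from_pretrained("base-model")

# Dataset must provide "prompt", "chosen", "rejected" columns
train_dataset = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=TrainingArguments(output_dir="dpo-out", per_device_train_batch_size=1),
    beta=0.1,  # strength of the KL penalty against the reference model
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```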
Pre-Training
- 12B vs 8B, but maybe start by trying a fine-tune (without continued pre-training) on bigger models
- Curriculum Training https://twitter.com/stablequan/status/1734057289542484038 (see the ordering sketch below)
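A toy sketch of the curriculum idea: order training samples from easy to hard. Sample length is a crude difficulty proxy used only for illustration; a real curriculum would need a better signal (e.g. loss under a reference model):

```python
# Toy curriculum ordering: train on short/simple samples first.
def curriculum_order(samples: list[dict]) -> list[dict]:
    return sorted(samples, key=lambda s: len(s["prompt"]) + len(s["response"]))
```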
Relevant New Research
- Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
- Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference?
- Hyperpolyglot LLMs: Cross-Lingual Interpretability in Token Embeddings
- How Multilingual is Multilingual LLM?
- Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention
- Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting
Evals
See: https://github.com/AUGMXNT/inference-benchmark for benchmarks
llm_judge fork
- Swap to lm-eval vLLM for fast inference (or OpenAI API w/ llama.cpp GGUF, ExLlamaV2, MLC, etc; see the client sketch after this list) - 50X faster than HF Transformers
- Keep data format
- OpenAI API 1.0+
- Make compatible w/ shisa-eval-server (human eval)
- Turn Elyza 100 tasks into tasks.json w/ custom judging rubric (may need to extend format?)
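A sketch of pointing the OpenAI 1.0+ client at a local OpenAI-compatible server (vLLM, llama.cpp, etc., started separately) for fast judge inference; the base URL, model id, and prompts are placeholders:

```python
# Judge inference via an OpenAI-compatible endpoint using the 1.0+ client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="judge-model",  # placeholder: whatever the server is serving
    messages=[
        {"role": "system", "content": "You are a strict grading judge."},
        {"role": "user", "content": "Rate the following answer from 1-10: ..."},
    ],
    temperature=0,
)
print(resp.choices[0].message.content)
```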
Bigger runs
Options
- Orion 14B
  - 2.5T tokens, multilingual
  - Efficient tokenizer
  - Instant commercial license: https://www.orionstar.com/llm-license.html
  - Very high reasoning scores
- Swallow 70B
  - Llama2 based, Llama license (700M MAU)
  - 7B, 13B, 70B variants
  - 70B has GQA, +100B JA pretrain, 46K vocab (JA extended)
Bad Options
- Yi 34B
  - Not so efficient
  - Instant commercial license: https://www.lingyiwanwu.com/yi-license
  - Terrible JA tokenizer
- DeepSeek LLM 67B (MIT License)
  - No commercial limitations, just restrictions to lawful, non-military use, no harming of minors, etc.
  - Has a 7B to tune on
  - GQA, 2T EN/CN pretrain, 4K context, 102.4K vocab
  - Tokenizer efficiency: en 4.329528, ja 0.852132 (see the measurement sketch below)
  - Oof, bad tokenizer for Japanese
- Qwen-72B (Qwen License) - licensing sucks
  - Can't train derived works
  - 100M MAU limit
- Mixtral 8x7B (Apache 2.0) - too hard to tune
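The en/ja numbers quoted for DeepSeek above look like average characters per token (higher = more efficient). A sketch of how such a comparison might be computed; the model id and corpus samples are placeholders:

```python
# Rough tokenizer-efficiency check: average characters per token.
from transformers import AutoTokenizer

def chars_per_token(model_id: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    return len(text) / n_tokens

# Example (placeholder id and texts):
# chars_per_token("deepseek-ai/deepseek-llm-67b-base", english_corpus_sample)
# chars_per_token("deepseek-ai/deepseek-llm-67b-base", japanese_corpus_sample)
```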
Misc
Improved HF Space?
See: