shisa-v2

There are several goals for shisa-v2:

  • We want to make sure that the shisa-v2 repo captures all our working code; ideally we should be able to launch new tunes from future models and dataset improvements with close to one click
  • Obviously we want to increase the quality. The goal is at least to be the SOTA open model (full stop), and to be competitive with leading proprietary models
    • Better JA fluency
    • Eliminate improper code switching/language leakage
    • Proper language switching/steerability between EN and JA based on context

What we need to build:

  • Efficient eval framework
  • Heuristic evals
  • New datasets
  • Human Eval framework
  • ChatArena

Tokenizer

For shisa-v1 we made a very efficient tokenizer extension, but it was a bit underbaked. A paper published in Feb 2024 analyzed extending/swapping tokenizers; its conclusion was that for modern models (1.5B and 7B parameter tests), 50B or more tokens of additional training were required to recover performance vs the native tokenizer. It's a bit unclear how this applies to extension with pre-averaged values (the FVT ablation performs significantly better but isn't shown), but ideally we can pick a good-enough tokenizer and skip tokenizer extension (and the extensive continued pre-train it requires).
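
For reference, the "pre-averaged values" (FVT-style) approach just initializes each new token's embedding as the mean of the embeddings of the sub-tokens the original tokenizer would have produced for it. A minimal sketch, assuming a Hugging Face causal LM; the base model name and the new tokens are placeholders, not necessarily what we'd use:

```python
# Sketch: tokenizer extension with FVT-style pre-averaged embedding init.
# Base model and new tokens are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"   # placeholder base model
new_tokens = ["日本語", "東京都"]      # placeholder JA tokens to add

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)

# Record how the *original* tokenizer splits each new token, before extending it,
# so we know which existing embeddings to average.
sub_ids = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    in_emb = model.get_input_embeddings().weight
    out_emb = model.get_output_embeddings().weight  # assumes an untied lm_head
    for tok, ids in sub_ids.items():
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # New row = mean of the old sub-token embeddings (the "pre-averaged" init).
        in_emb[new_id] = in_emb[ids].mean(dim=0)
        out_emb[new_id] = out_emb[ids].mean(dim=0)
```

Whether this is worth doing at all vs simply picking a base model with good-enough JA tokenization is exactly the question above.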

If we do end up switching tokenizers, look into Zero-Shot Tokenizer Transfer (ZeTT): https://github.com/bminixhofer/zett

Pre-Training

Along the same lines, we want to find a model where we can get good performance while skipping extensive continued pre-training (billions of tokens) entirely. This approach will have us fine-tuning many base model candidates on a known dataset (e.g., the shisa-v1 set) and evaluating performance. One thing we can look into is combining some CPT data with our fine-tuning dataset.
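
As a rough idea of what combining CPT with the fine-tuning dataset could look like, here's a minimal sketch that blends a thin slice of raw JA text into the SFT mix using Hugging Face `datasets`; the dataset names, the single `text` column, and the 90/10 ratio are all placeholder assumptions:

```python
# Sketch: blend a small fraction of continued pre-train (raw JA text) into the SFT mix.
# Dataset names, column layout, and mixing ratio are placeholders.
from datasets import load_dataset, interleave_datasets

sft = load_dataset("our-org/shisa-v1-sft", split="train")   # placeholder: shisa-v1 SFT set
cpt = load_dataset("our-org/ja-web-corpus", split="train")  # placeholder: raw JA CPT corpus

# Assumes both datasets have already been mapped down to a single `text` column.
mixed = interleave_datasets(
    [sft, cpt],
    probabilities=[0.9, 0.1],            # mostly SFT, with a thin CPT slice
    seed=42,
    stopping_strategy="first_exhausted", # stop when either source runs out
)
print(mixed[0]["text"])
```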

Ablations

While there are a lot of questions I'd like answered, the biggest one in terms of training order is whether paired (EN+JA) training leads to better downstream performance (by promoting cross-lingual transfer) than randomized/mixed training on the same dataset. We will need some sufficiently large dual-translation datasets to test on and to see whether it matters for fine-tuning (Swallow tested a parallel corpus for pretraining); a sketch of the two orderings follows the reference below.

  • https://arxiv.org/pdf/2404.17790 "Therefore, no evidence was obtained that the parallel corpus promotes cross-lingual transfer and improves abilities other than translation"
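
A minimal sketch of how the two orderings could be constructed from a parallel dataset, assuming rows of the form {"en": ..., "ja": ...}; names and the sample data are illustrative only:

```python
# Sketch: build paired vs mixed orderings for the EN+JA training-order ablation.
import random

def paired_order(pairs):
    """Each EN example is immediately followed by its JA translation."""
    out = []
    for p in pairs:
        out.append({"lang": "en", "text": p["en"]})
        out.append({"lang": "ja", "text": p["ja"]})
    return out

def mixed_order(pairs, seed=42):
    """Same examples, but EN and JA shuffled independently of each other."""
    out = [{"lang": "en", "text": p["en"]} for p in pairs]
    out += [{"lang": "ja", "text": p["ja"]} for p in pairs]
    random.Random(seed).shuffle(out)
    return out

# Illustrative data only.
pairs = [{"en": "Hello.", "ja": "こんにちは。"}, {"en": "Thank you.", "ja": "ありがとう。"}]
print(paired_order(pairs))
print(mixed_order(pairs))
```

Training the same base model on both orderings of the same data and comparing downstream JA evals would give us a direct read on whether pairing matters for fine-tuning.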