ZZZ Other JA Models and Resources
In general, the best-maintained resource tracking the latest JA-related LLM stuff is: https://github.com/llm-jp/awesome-japanese-llm/
I'm also now curating a JP AI Twitter List: https://twitter.com/i/lists/1738064886427734518
See also the latest HF JA text-generation models: https://huggingface.co/models?pipeline_tag=text-generation&language=ja&sort=modified
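The same listing can also be pulled programmatically with `huggingface_hub`; a minimal sketch (both `text-generation` and `ja` are Hub tags, so they can go straight into the `filter` argument; sorted by downloads here rather than last-modified for simplicity):

```python
# Minimal sketch: list JA text-generation models on the HF Hub, roughly mirroring the
# web filter linked above. "text-generation" and "ja" are both Hub tags.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(filter=["text-generation", "ja"], sort="downloads", direction=-1, limit=20):
    print(model.id)
```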
We'll just use this to track some of the more interesting Japanese releases and write-ups we come across:
2024-01-15 Karasu/Qarasu
Released at the end of 2023, but there's a new announcement here:
- https://www.lightblue-tech.com/2024/01/15/20240115_news/
- https://note.com/peter_lightblue/n/ne08a7c8cc47a
- https://huggingface.co/lightblue
First new model to use one of our datasets (a quick loading sketch follows this list):
- https://huggingface.co/models?dataset=dataset:augmxnt/ultra-orca-boros-en-ja-v1
- Karasu 7B is based off of shisa-7b-v1
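For reference, the dataset itself can be pulled down with the `datasets` library; a minimal sketch (the `train` split name is an assumption, check the dataset card for the actual layout):

```python
# Minimal sketch: load the augmxnt ultra-orca-boros EN/JA dataset and peek at the schema.
# The split name ("train") is an assumption -- see the dataset card for the actual layout.
from datasets import load_dataset

ds = load_dataset("augmxnt/ultra-orca-boros-en-ja-v1", split="train")
print(ds)     # row count and column names
print(ds[0])  # one example
```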
2023-12-21 Nekomata
Rinna has released new Qwen-based models (150K Qwen tokenizer, +66B tokens of continued pre-training). Based on how strong Qwen-14B Chat was, I'm interested to see how this tune compares (a minimal loading sketch follows the links below):
- Tweet: https://twitter.com/rinna_research/status/1737648832345989428
- Announcement: https://rinna.co.jp/news/2023/12/20231221.html
- Collection: https://huggingface.co/collections/rinna/nekomata-6582b5134ee85531becbb9a9
- Benchmarks: https://rinnakk.github.io/research/benchmarks/lm/index.html
- Instruct model uses a mix of stuff (including mistranslations): https://huggingface.co/rinna/nekomata-14b-instruction
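For quick testing, a hedged loading sketch: since the base is Qwen (custom modeling code), `trust_remote_code=True` is required; the bare prompt below is just illustrative and not rinna's documented instruction template.

```python
# Hedged sketch: load rinna/nekomata-14b-instruction with transformers and generate.
# The plain prompt is illustrative only -- check the model card for the real template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rinna/nekomata-14b-instruction"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("日本の首都はどこですか？", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0]))
```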
2023-12-20 ELYZA-tasks-100 Shootoff
A very detailed writeup testing a lot of JA (and a few non-JA) LLMs using GPT-4 judging (with some analysis of that approach as well). For instruction following, I assume a decent, large-ish instruct model is pretty much necessary to answer many of these questions properly (a minimal judging sketch follows the links below).
- https://qiita.com/wayama_ryousuke/items/105a164e5c80c150caf1
- Hey, in the addendum there's shisa-7b-v1: https://qiita.com/wayama_ryousuke/items/105a164e5c80c150caf1#appendix-3-%E3%81%95%E3%82%89%E3%81%AB%E4%BB%96%E3%81%AE%E3%83%A2%E3%83%87%E3%83%AB%E3%82%82%E8%A9%95%E4%BE%A1%E3%81%97%E3%81%A6%E3%81%BF%E3%81%9F
- Code: https://github.com/Northern-System-Service/gpt4-autoeval
- Spreadsheet: https://docs.google.com/spreadsheets/d/1nOWtneRdrkxwQbAN0rWmXqiJXR9IXK9lVkyDjQTqNGc/edit#gid=1023787356
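The general GPT-4-judging loop is simple enough to sketch. This is not the linked repo's actual prompts or rubric, just the shape of it; the `elyza/ELYZA-tasks-100` split and field names (`input`, `output`, `eval_aspect`) are assumptions from its dataset card, and it expects `OPENAI_API_KEY` in the environment.

```python
# Hedged sketch of GPT-4-as-judge over ELYZA-tasks-100 -- not the gpt4-autoeval repo's
# actual prompts/rubric. Dataset split and field names are assumptions from its card.
from datasets import load_dataset
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
tasks = load_dataset("elyza/ELYZA-tasks-100", split="test")

def judge(question: str, reference: str, rubric: str, answer: str) -> str:
    """Ask GPT-4 to grade one model answer on a 1-5 scale; returns the raw judgment."""
    prompt = (
        "You are grading a Japanese LLM answer on a 1-5 scale.\n\n"
        f"Question:\n{question}\n\nReference answer:\n{reference}\n\n"
        f"Grading notes:\n{rubric}\n\nModel answer:\n{answer}\n\n"
        "Reply with a single integer from 1 to 5."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

example = tasks[0]
print(judge(example["input"], example["output"], example["eval_aspect"], "（モデルの回答をここに）"))
```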
2023-12-19 Swallow
- Site: https://tokyotech-llm.github.io/
- Announcement: https://tokyotech-llm.github.io/swallow-llama
- Technical Writeup: https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907
- Tokenizer extended to a 43,176-token vocab (I believe padding to a multiple of 64 would be better for perf; see the sketch after this list)
- 100B-token continued pretrain
- Instruct models use the same mistranslated datasets; I've run JA MT-Bench results that show the expected performance
- I've created the appropriate chat_template for their instruct format
- Some of my notes/testing in https://discord.com/channels/1147858054231105577/1147862078695149608/1187047159468675073
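On the multiple-of-64 point: transformers can pad the embedding matrix past the raw vocab size so the matmul dimensions stay Tensor-Core friendly. A minimal sketch (the `tokyotech-llm/Swallow-7b-hf` repo id is an assumption, check their HF org for the actual names):

```python
# Minimal sketch of padding the embedding matrix to a multiple of 64 for perf.
# The Swallow repo id here is an assumption -- check their HF org for actual names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tokyotech-llm/Swallow-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(len(tokenizer))  # the extended ~43K vocab
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
print(model.get_input_embeddings().weight.shape[0])  # rounded up to a multiple of 64
```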
2023-12-18 Kotoba Recipes
Kazuki Fujii (who worked on training Swallow) published PEFT, FSDP, and PEFT+FSDP recipes (a minimal PEFT sketch follows the links below). (Aside: FSDP may still degrade Mistral, so it's better to use DeepSpeed, or Unsloth for small tunes.)
- https://medium.com/@kaz.tokyo.tech20/kotoba-recipes-library-5-minutes-to-start-llama-2-continual-learning-5f95c244a566
- https://github.com/kotoba-tech/kotoba-recipes
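Not their code, but for context, the PEFT side of a recipe boils down to something like this; the LoRA hyperparameters and Mistral target module names here are just typical defaults, not kotoba-recipes' settings.

```python
# Hedged LoRA/PEFT sketch -- not the kotoba-recipes implementation, just the usual shape.
# target_modules are the standard Llama/Mistral attention projection names.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable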
2023-12-14 DeepSpeed Writeup
Most useful for Japanese native speakers, but it has fun code and animated diagrams explaining some DeepSpeed concepts: https://zenn.dev/turing_motors/articles/d00c46a79dc976
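Unrelated to the article's own code, but as a reminder of what this looks like in practice, a hedged minimal ZeRO-3 config wired into the HF Trainer (whose `deepspeed` argument accepts a dict or a JSON file path):

```python
# Hedged sketch: a minimal DeepSpeed ZeRO-3 config passed to HF TrainingArguments.
# "auto" values let the HF integration fill in batch sizes from the Trainer settings.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(output_dir="out", bf16=True, deepspeed=ds_config)
```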
2023-11-01 NTT LLM - tsuzumi 7B
NTT's commercial LLM targeted for March 2024 release
2023-10-21 ALMA Ja
An EN/JA translation model based off of ALMA-7B
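A hedged prompting sketch: the `webbigdata/ALMA-7B-Ja` repo id and the exact template wording are assumptions from memory (check the actual model card), but ALMA models use a plain "Translate this from X to Y" template rather than a chat format.

```python
# Hedged sketch of ALMA-style EN->JA translation prompting. Model id and the exact
# template wording are assumptions -- check the actual model card.
from transformers import pipeline

translator = pipeline("text-generation", model="webbigdata/ALMA-7B-Ja", device_map="auto")

prompt = (
    "Translate this from English to Japanese:\n"
    "English: The weather in Tokyo is lovely today.\n"
    "Japanese:"
)
print(translator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"])
```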
2023-08-29 ELYZA-japanese-Llama-2-7b
- Announcement: https://note.com/elyza/n/na405acaca130
- Technical Writeup: https://zenn.dev/elyza/articles/2fd451c944649d
- Evals (ELYZA-100 discussion): https://zenn.dev/elyza/articles/5e7d9373c32a98
Sakura
A series of models being fine-tuned by a Chinese community on Chinese base models? Need to look into it more
Japanese Stability
- Nov 2023: Beta 7B/70B (Llama 2) with an additional 100B-token pretrain on slightly filtered data (SlimPajama for EN, but unfiltered for JA), using the default tokenizer. The 70B is probably the strongest explicitly JA-focused open model, but it's kneecapped by a bad fine-tune (in native-speaker testing, Xwin-LM-70B-V0.1 generated significantly better Japanese chat responses!)
- Japanese announcement
- Japanese MT-Bench
- Not officially announced, but they also did a Mistral 7B pretrain called "Gamma"
- The instruct tune again used poorly translated datasets
- Gamma instruct: dolly, anthropic
- Beta instruct: Anthropic HH-RLHF, Databricks Dolly 15k, OpenAssistant Conversations Dataset
- August 2023: Stability AI JP released their first "Alpha" models (English announcement): Apache 2.0, 7B parameters, 750B-token pretrain (unfiltered datasets), GPT-NeoX architecture, 65K NovelAI/nerdstash-tokenizer-v1 - fluency is limited (a tokenizer comparison sketch follows this list)
- The fine-tune used an Alpaca translation, and the largely incorrect dolly, anthropic, and wikinews datasets from here
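To make the tokenizer point concrete, a hedged sketch comparing how many tokens the default Llama-2 tokenizer (Beta) and the 65K nerdstash tokenizer (Alpha) spend on the same Japanese sentence; the repo ids are assumptions, and the nerdstash tokenizer is loaded the way the Alpha model card does.

```python
# Hedged sketch: compare JA token counts between the default Llama-2 tokenizer (used by
# the Beta models) and the 65K NovelAI nerdstash tokenizer (used by the Alpha models).
# Repo ids are assumptions -- check the actual model/tokenizer cards.
from transformers import AutoTokenizer, LlamaTokenizer

text = "日本語のトークナイザー効率は推論コストと実効コンテキスト長に直結します。"

beta_tok = AutoTokenizer.from_pretrained("stabilityai/japanese-stablelm-base-beta-7b")
nerdstash_tok = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1")

for name, tok in [("llama-2 default", beta_tok), ("nerdstash 65k", nerdstash_tok)]:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    print(f"{name}: {len(ids)} tokens for {len(text)} characters")
```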