Dia-TTS

We recently released continued finetunes of Dia-TTS for the Malaysian context; they should be able to zero-shot any Malaysian and Singaporean speaker:

  1. https://huggingface.co/mesolitica/Malaysian-Dia-1.6B, trained on the entire Malaysian Emilia; great speaker diversity, but weaker audio quality and speaker consistency.
  • The diarizer pipeline from Pyannote is not perfect, so some pseudolabeled speakers actually contain multiple speakers.
  • The original Emilia did not save MP3s at a proper bitrate for the 24k sample rate, see https://github.com/open-mmlab/Amphion/issues/436; basically, even though the audio is saved at a 24k sample rate, the actual quality is closer to a 16k sample rate.
  • Great for diverse voice conversion, such as synthetic speech instructions.
  2. https://huggingface.co/mesolitica/Malaysian-Podcast-Dia-1.6B, trained on Malaysian and Singaporean podcasts from Malaysian-Emilia-v2; great audio quality but less speaker diversity. Reference/target pairs were filtered by the cosine similarity of their embeddings, roughly like this:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# `model` is an embedding model, `ys` is the list of audio clips, and `speakers`
# maps each pseudolabeled speaker to {row index: {'audio', 'transcription'}},
# where the row index points back into the embedding matrix built from `ys`.
vectors = []
for i in range(0, len(ys), 4):
    # Embed the clips in batches of 4.
    vectors_ = model(ys[i: i + 4])
    vectors.append(vectors_)

# Pairwise cosine similarity between all clip embeddings.
cosine = cosine_similarity(np.concatenate(vectors))

for speaker in speakers.keys():
    data_ = []
    for row in speakers[speaker]:
        for row_ in speakers[speaker]:
            if row == row_:
                continue

            # Only keep pairs that are very likely the same speaker.
            if cosine[row, row_] < 0.8:
                continue

            data_.append({
                'reference_audio': speakers[speaker][row]['audio'],
                'reference_text': speakers[speaker][row]['transcription'],
                'target_audio': speakers[speaker][row_]['audio'],
                'target_text': speakers[speaker][row_]['transcription'],
            })
  • Saved at a proper bitrate for the MP3 24k sample rate.
  • Great for podcast-quality voice conversion, but it requires a clean audio reference to perform at its best.

Install library

pip3 install git+https://github.com/mesolitica/dia-fix-compile

This fork skips torch compile on the encoder side. Most of the time the encoder input has a dynamic length, especially when you are doing batch inference, and the encoder forward pass only happens once per generation, so compiling it brings little benefit.

How to do voice conversion

Remember, all these generations are stochastic; you might not get the same output when running locally.

# wget https://github.com/mesolitica/malaya-speech/raw/refs/heads/master/speech/speech-instructions/husein-assistant.mp3
from dia.model import Dia
model = Dia.from_pretrained("mesolitica/Malaysian-Podcast-Dia-1.6B", compute_dtype="float16")
clone_from_text = "[S1] Uhm, hello, selamat pagi ye, saya dari customer service, boleh saya bantu you dengan apa apa ke?"
clone_from_audio = 'husein-assistant.mp3'
t_ = 'Hello semua orang, saya suka nasi ayam dan nama saya Husein bin Zolkepli.'
texts = [clone_from_text + '[S1] ' + t_.strip()]
clone_from_audios = [clone_from_audio]
output = model.generate(
    texts, 
    audio_prompt=clone_from_audios, 
    use_torch_compile=True, 
    verbose=True, 
    max_tokens=2000, 
    temperature = 1.0, 
    cfg_scale = 1.0,
)

https://github.com/user-attachments/assets/cad0470a-d410-46bf-ac85-d2ca30a56b74
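
To write the result to disk, the upstream Dia API provides a save_audio helper; a minimal sketch, assuming the fork keeps the same interface and that a batched generate call returns one waveform per input text:

# `output` is the return value of the batched model.generate call above; take
# the first (and only) waveform and save it under an example filename.
model.save_audio('voice-conversion.mp3', output[0])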

Context switching

t_ = 'Hello semua orang, I like chicken chop with rice but I tak suka sangat fish and chip, jadi selalunya I makan ayam gepuk, sedap sangat, try lah.'
texts = [clone_from_text + '[S1] ' + t_.strip()]
clone_from_audios = [clone_from_audio]

output = model.generate(
    texts, 
    audio_prompt=clone_from_audios, 
    use_torch_compile=True, 
    verbose=True, 
    max_tokens=2000, 
    temperature = 1.0, 
    cfg_scale = 1.0,
)

https://github.com/user-attachments/assets/5540a952-9320-4a5b-bbfc-65f7d7e8413a

Filler sounds

It should be able to inherit the base model's capabilities; read more about the supported filler sounds at https://github.com/nari-labs/dia?tab=readme-ov-file#features

t_ = 'Hello semua orang, Uhm, I like chicken chop with rice, haha (laughs), (clears throat), but I tak suka sangat fish and chip, jadi selalunya I makan ayam gepuk, sedap sangat, (coughs) (coughs) try lah.'
texts = [clone_from_text + '[S1] ' + t_.strip()]
clone_from_audios = [clone_from_audio]

output = model.generate(
    texts, 
    audio_prompt=clone_from_audios, 
    use_torch_compile=True, 
    verbose=True, 
    max_tokens=2000, 
    temperature = 1.0, 
    cfg_scale = 1.0,
)

https://github.com/user-attachments/assets/84257056-ad2c-45a9-ac12-a841298a6e31

Podcast style

text = "[S1] Dia is an open weights text to dialogue model, boleh cakap melayu dan bunyi english local. [S2] Yeah betul, boleh buat macam macam. [S1] Wow. cool giler! (laughs) [S2] Try it now on Github or Hugging Face."

output = model.generate(
    text, 
    use_torch_compile=True, 
    verbose=True,
    max_tokens=2000, 
    temperature = 1.0, 
    cfg_scale = 1.0,
)

First example,

https://github.com/user-attachments/assets/3aea09c2-5822-4808-a295-07e548501b4d

Second example,

https://github.com/user-attachments/assets/ba2aa373-a298-4d26-afc6-17e29bcd2617

Finetuning on your dataset

Because these models can voice-convert any speaker but that is far from stable enough to serve in production, you may want to finetune on your own dataset.

The source code for the currently released models is at https://github.com/mesolitica/malaya-speech/tree/master/session/dia-tts

Voice Conversion

You want the model to keep its voice conversion capability while producing better generations.

Mid-training

  1. Because the Pyannote diarizer is not perfect, you want to continue mid-training on higher-confidence same-speaker pairs; this can be done with a speaker embedding model such as TitaNet Large, as sketched below.
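
A minimal sketch of that filtering step, assuming NVIDIA NeMo's TitaNet Large checkpoint, its get_embedding helper, and a hypothetical utterances mapping from each pseudolabeled speaker to its audio file paths:

# Filter pseudolabeled speakers with TitaNet Large embeddings before mid-training.
# `utterances` is a hypothetical dict: pseudolabeled speaker -> list of audio paths.
import numpy as np
import nemo.collections.asr as nemo_asr
from sklearn.metrics.pairwise import cosine_similarity

speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained('titanet_large')

filtered = {}
for speaker, files in utterances.items():
    # Embed every utterance attributed to this pseudolabeled speaker.
    embs = np.stack([
        speaker_model.get_embedding(f).squeeze().cpu().numpy() for f in files
    ])
    sims = cosine_similarity(embs)
    # Keep only utterances close to the rest of the cluster; drop the ones the
    # diarizer likely mixed in from other speakers. 0.8 is an example threshold.
    keep = sims.mean(axis=1) >= 0.8
    filtered[speaker] = [f for f, k in zip(files, keep) if k]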

Post-training

  1. Use the mid-training model to generate N synthetic outputs, then pick the best one with an automatic pipeline (for example, computing CER and speaker similarity) or by manual verification. This gives you positive and negative outcomes, so you can start training with a DPO loss; see the sketch after this list.
  2. You can fix pronunciations by adding extra speakers that pronounce the words correctly in SFT; during inference, you can simply ignore those extra speakers if they are not needed.
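
A minimal sketch of the best-of-N selection for building DPO pairs, assuming a hypothetical transcribe ASR function, the speaker_model from the previous sketch, a reference_emb embedding of the target voice, and target_text / candidates for one prompt:

# Score each candidate by speaker similarity minus character error rate, then
# take the best and worst as the chosen/rejected pair for DPO.
import numpy as np
from jiwer import cer
from sklearn.metrics.pairwise import cosine_similarity

def score(candidate_path, target_text, reference_emb):
    hyp = transcribe(candidate_path)  # hypothetical ASR call
    emb = speaker_model.get_embedding(candidate_path).squeeze().cpu().numpy()
    sim = cosine_similarity([reference_emb], [emb])[0, 0]
    return sim - cer(target_text, hyp)

scores = [score(p, target_text, reference_emb) for p in candidates]
chosen = candidates[int(np.argmax(scores))]    # positive sample for DPO
rejected = candidates[int(np.argmin(scores))]  # negative sample for DPO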

Text-to-Speech

Now you want the model to generate fixed speakers from the prompt alone, for example Husein: my name is Husein, so that when the model sees Husein: in the prefix, it knows the output should sound like Husein. Plus, when you do TTS this way, you no longer need an audio reference; the decoder only needs to start with the BOS token.
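
A minimal inference sketch after such finetuning, assuming your own checkpoint was trained with the Husein: prefix convention described above (the released checkpoints do not ship this speaker):

# No audio_prompt is passed: the speaker identity comes from the text prefix
# alone, and the decoder starts from the BOS token.
texts = ['[S1] Husein: Hi, my name is Husein and I like chicken chop with rice.']
output = model.generate(
    texts,
    use_torch_compile=True,
    verbose=True,
    max_tokens=2000,
    temperature=1.0,
    cfg_scale=1.0,
)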

Mid-training

  1. You can use voice conversion to generate a lot of synthetic speech for a specific speaker; after that, make sure the pronunciations are correct using forced alignment, and verify that the pitch is OK. It is much better if you can record with voice actors, but please verify the voice actors' outputs as well. After you gather enough data (you do not need that much; I tried a 1-hour dataset and the output was already perfect), you can continue SFT without audio references. Each training example should pair the input Husein: Hi my name is Husein with audio that speaks Hi my name is Husein; a sketch of the record layout follows this list.
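
A minimal sketch of the record layout described above; the field names and paths are hypothetical, so match them to whatever your finetuning scripts expect:

# Each record pairs a speaker-prefixed text with the audio that speaks it;
# no reference audio column is needed for this SFT stage.
dataset = [
    {'text': 'Husein: Hi my name is Husein.', 'audio': 'husein/0001.mp3'},
    {'text': 'Husein: Selamat pagi semua, apa khabar hari ini?', 'audio': 'husein/0002.mp3'},
]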

Post-training

  1. You can fix pronunciations by adding extra speakers that pronounce correctly in SFT; the prompt can be something like speaker1: Hi my name is Husein, with the correctly pronounced audio for Hi my name is Husein as the output. During inference later, you can simply stick to the Husein speaker.

Full parameter or LoRA?

LoRA is good enough if you are doing post-training.
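
A minimal sketch of attaching LoRA adapters, assuming you use Hugging Face PEFT on top of the underlying PyTorch module; the model.model attribute and the target module names are assumptions, so check them against the actual Dia implementation:

from peft import LoraConfig, get_peft_model

# Target module names are hypothetical; inspect the decoder to pick the actual
# attention/feed-forward projections you want to adapt.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'],
)
peft_model = get_peft_model(model.model, lora_config)  # model.model assumed to be the nn.Module
peft_model.print_trainable_parameters()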

How to add new languages?

Modern Malay (Bahasa Melayu or Bahasa Malaysia) is written using the Latin (Roman) alphabet, just like English. This writing system is called Rumi. Here are some key points:

  1. Similarity to English: both use the same 26-letter Latin alphabet, A–Z.
  2. The script is written left to right.
  3. Most letters have similar phonetic values, though pronunciation rules differ.

Because of this, full parameter finetuning was already able to generate Malay outputs at the 0.05 epoch checkpoint. Dia-TTS did not disclose which languages it was pretrained on; if you are training on a different character set or a right-to-left script, it will probably take more effort.