Voicebank development

This page is a work in progress. Contributions are welcome.

Anyone can make a voicebank with their own voice and use it in OpenUtau. There are two main types of voicebanks: classic UTAU concatenative voicebanks and machine learning voicebanks.

UTAU voicebank development

To make a UTAU voicebank, you need to record all the syllables in a language and label them.

How does it work

  • The user inputs lyrics and notes.
  • The phonemizer converts the notes and lyrics into a list of concatenation units.
  • For each concatenation unit, the resampler loads the corresponding sample from the voicebank, changes its duration and pitch, and applies any flags.
  • The wavtool joins the audio slices produced by the resampler and outputs the final audio.
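
To make the data flow concrete, here is a minimal sketch of that pipeline in Python. It is not OpenUtau's actual code; every name in it (Note, phonemize, resample, wavtool) is hypothetical and only illustrates how the three stages hand data to each other.

from dataclasses import dataclass

@dataclass
class Note:
    lyric: str          # e.g. "ka"
    pitch: int          # MIDI note number
    duration_ms: float

def phonemize(notes):
    # Turn lyrics + notes into concatenation units: sample aliases to look up in oto.ini.
    return [{"alias": n.lyric, "pitch": n.pitch, "duration_ms": n.duration_ms} for n in notes]

def resample(unit, voicebank_dir):
    # Load the sample for this unit, stretch it to the note length, shift its pitch,
    # and apply flags. Here we just return a placeholder string for the rendered slice.
    return f"slice({unit['alias']} @ {unit['pitch']}, {unit['duration_ms']} ms, from {voicebank_dir})"

def wavtool(slices):
    # Crossfade and join the rendered slices into the final audio.
    return " + ".join(slices)

notes = [Note("ka", 60, 500), Note("sa", 62, 500), Note("ne", 64, 1000)]
units = phonemize(notes)
slices = [resample(u, "Singers/MyVoicebank") for u in units]
print(wavtool(slices))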

Recording

Firstly, find a reclist suitable for your language. A reclist is a text file that contains all the syllables or phonemes of a language and their combinations. Here are some publicly available reclists:

Phonemizer      Reclist
EN VCCV         Core American English VCCV by PaintedCz
EN ARPA         ARPAsing resource website
EN XSAMPA       Delta-style English reclists
JA VCV & CVVC   Japanese reclists
ZH CVVC         Hr.J Chinese CVVC by haru
  • English voicebanks recorded with any of these methods have various pros and cons. You may want to try recording a language with a smaller set of vowels first, or recording and training an AI voicebank for DiffSinger instead.

Theoretically, you can record a voicebank with any software you like. However, a dedicated reclist recorder can prompt you line by line and automatically save each take using the text of the line as its file name. The recommended software is recstar.

Many people record with a "guide BGM" so that their samples stay at a consistent BPM and pitch. You can find some guide BGMs to start with here and here (Japanese site). ("#-Mora" refers to the number of syllables in each recorded string.)

Multi-pitch voicebanks

You can record multiple subbanks for one voicebank. For example, you can record at multiple pitches to widen its range, or record in multiple vocal modes to let the user choose between different singing styles. Each subbank is equivalent to a full single-pitch voicebank and should contain all the voice lines in your reclist.

You can start developing your voicebank with only one subbank, and add more subbanks in the future.

Otoing

After recording, you will need to make oto.ini files.

oto.ini is a configuration file that tells OpenUtau where each phoneme is in each sample of the voicebank and how to manipulate it.

The recommended way to make oto.ini files for a voicebank is with vLabeler.

If you want a tutorial that covers how to oto most styles of voicebank, Yin's tutorial is commonly recommended by the community. It uses older tools than vLabeler, but the basic logic is the same.

How oto.ini works

Let's look at an example oto.ini line from Kasane Teto in vLabeler and dissect it.

[Image: an oto.ini entry for Kasane Teto displayed in vLabeler, showing the five markers described below]

  1. Yellow line: Left blank. The start of the phoneme.
  2. Green line: Overlap. Everything between the left blank and overlap will be blended with the previous note.
  3. Red line: Preutterance. The start of the note.
  4. Blue line: Consonant. Everything before this line will not be looped.
  5. White line: Right blank. The end of the phoneme. Everything between the consonant and right blank will be looped.
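
In the file itself, each entry is a single plain-text line: the sample's file name, an alias, and five values in milliseconds, in the order offset (left blank), consonant, cutoff (right blank), preutterance, overlap. As a rough sketch (the file name, alias, and numbers below are invented for illustration; a negative cutoff is measured from the offset rather than from the end of the file):

_ああいあうえあ.wav=- あ,6650.0,120.0,-350.0,100.0,50.0

vLabeler fills in these values for you, so you rarely need to edit them by hand.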

See also: Anatomy of the OTO on UtaForum.

Other files

Create a subfolder inside your OpenUtau Singers folder and put your voice files (wav files and oto.ini) into it. Besides the recordings and oto.ini, your voicebank still needs the files and information described below.
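
For example, a finished single-pitch voicebank might end up looking something like this (every name below is a placeholder; a multi-pitch bank often keeps each subbank in its own subfolder with its own wav files and oto.ini):

Singers/
  MyVoicebank/
    character.txt
    oto.ini
    _ああいあうえあ.wav
    (...the rest of your recordings)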

character.txt

character.txt tells OpenUtau that this folder is a voicebank and gives its name. Create a new text file named character.txt in your voicebank and edit it. Here is a minimal example:

name=name_of_your_voicebank
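
character.txt can also carry a few optional fields that UTAU-style voicebanks commonly include and that OpenUtau can display, such as an icon, the author, and a website. For example (the values below are placeholders):

name=name_of_your_voicebank
image=icon.bmp
author=your_name
web=https://example.com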

Set up Subbanks

If your voicebank contains multiple subbanks, you'll need to set them up in OpenUtau under Tools → Singers → Edit subbanks, where you can assign each subbank to a pitch range within a voice color.

Default phonemizer

Launch OpenUtau. In Tools → Singers, click ⚙ → Default Phonemizer and select the phonemizer that your voicebank supports. When a user chooses your voicebank, this phonemizer will be selected automatically.

Subbank and default phonemizer information is stored in character.yaml inside the voicebank. See the tech note on character.yaml.
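
As a rough illustration of what ends up in that file for a two-pitch bank with a default phonemizer, it could look something like the sketch below. The field names and values are written from memory and should be treated as assumptions; the character.yaml tech note is the authoritative reference.

default_phonemizer: OpenUtau.Plugin.Builtin.ArpasingPhonemizer
subbanks:
- color: ''
  prefix: ''
  suffix: _C4
  tone_ranges:
  - C1-F4
- color: ''
  prefix: ''
  suffix: _G4
  tone_ranges:
  - F#4-B7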

Packing a UTAU voicebank

In Tools → Singers, click ⚙ → Publish Singer. You'll get a zip file of your singer that you can distribute.

Machine Learning voicebank development

Machine learning voicebanks produce more fluent singing with fewer manual edits, but you'll need a GPU to train them. OpenUtau supports two engines that let you make voicebanks yourself: NNSVS and DiffSinger. To make a machine learning voicebank, you need to record your singing voice, label the recordings, and train a machine learning model.

Recording

Record songs in the target language with any recording software you like. Just ensure that:

  • all the lyrics are in the language your voicebank supports
  • your dataset contains all the phonemes in the language.

Labelling

After recording, make phoneme-level labels for your voicebank. You can use vLabeler to make the labels.

SVS Singing voice database - tutorial by PixPrucer
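
Labels for singing voice databases are commonly stored as HTK-style .lab files: one phoneme per line with its start and end times (NNSVS-style databases use 100-nanosecond units). The fragment below is made up purely to show the shape of such a file:

0 2400000 sil
2400000 3100000 k
3100000 5800000 a
5800000 6600000 w
6600000 9200000 a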

There are also automated tools that make labels for you:

Training

After labelling your dataset, you can either train a DiffSinger voicebank or an ENUNU (NNSVS-based) voicebank.

DiffSinger

ENUNU