Reversing RHVoice

Reversing software is the process of understanding how it works, often with the goal of modifying or fixing it. Reversing is usually applied to compiled programs, so things should be easier if you have the source code, but that is not always the case, especially if the source language is unfamiliar.

RHVoice

I needed an easy tool to speak text to me, and from what I've heard, RHVoice is the best synthesizer for the Russian language. It is made by Olga Yakovleva for people with impaired vision. There are government scientific programs that are supposed to help, but they fail for many reasons, among them a lack of motivation and public feedback, because nobody who depends on voice synthesizers in daily life works in the Academy of Sciences. From my Russian friends who use a screen reader daily, I've heard that RHVoice is the best practical solution.

I decided to hack on RHVoice because I wanted better human-to-machine interfaces (human augmentation is one of my part-time obsessions), and also because I can see that this project is already useful, so if I can improve its usability further, that would be a good thing.

I also wanted to bring the Belarusian language to it and to understand how the whole system works.

Reversing RHVoice for a better user experience

RHVoice is written in C++, and I have problems with that language. There is also a Python interface for RHVoice that is required to make the synthesizer usable in NVDA, so I started my quest by reversing the NVDA plugin to get a standalone speaking module out of it.

The NVDA plugin is a .py file that calls RHVoice.dll using ctypes. I could successfully load the .dll and get the RHVoice version. Loading the synthesizer turned out to be more troublesome, because there was no error reporting. The wrapper that exposes functions from the .dll contained a catch-all expression that returns 0 on any error (RHVoice_new_tts_engine).
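
For reference, that first step looks roughly like this. It is only a sketch: it assumes RHVoice.dll lies next to the script and that the version function is exported as RHVoice_get_version (the exact exported names are listed in src/lib/lib.def).

import ctypes
import os

# Assumption: RHVoice.dll is placed next to this script.
dll_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "RHVoice.dll")
lib = ctypes.CDLL(dll_path)

# Assumption: the version is exported as RHVoice_get_version and returns
# a C string; check src/lib/lib.def for the real exported names.
lib.RHVoice_get_version.restype = ctypes.c_char_p
print(lib.RHVoice_get_version())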

Finding the wrapper in src/lib.cpp took an hour or more, just because I was not familiar with C++. Compiling it with AppVeyor took another hour or two.

Adding logging

The RHVoice_new_tts_engine initialization function needs to be passed a long structure with a few callbacks and other parameters. When the engine failed to initialize, there was no error message to tell what went wrong, so I had to add logging of exceptions to stderr, which is turned on by calling RHVoice_set_logging(True).
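
From Python, turning it on is a one-liner. This is a sketch: RHVoice_set_logging exists only in the patched RHVoice.dll described here, not in upstream RHVoice.

import ctypes

lib = ctypes.CDLL("RHVoice.dll")
# RHVoice_set_logging was added by this patch: it switches logging of
# exceptions to stderr on or off. It is not part of upstream RHVoice.
lib.RHVoice_set_logging(True)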

After a few more hours RHVoice.dll got logging. Downloading a new set of compiled files from AppVeyor and trying the Python module on it produced this report from RHVoice_tts_engine_struct:

No language resources are available

I got the same error on Linux while running the test:

echo test|./build/linux/test/RHVoice-test

Reported here: https://github.com/Olga-Yakovleva/RHVoice/issues/14

Finding language resources

scons install clearly sets these paths up correctly, but I don't want to install anything just yet. I just want to test that the Python module works.

The message is defined in engine.hpp inside the no_languages exception, which is thrown only in engine.cpp. The code appeared too mysterious to me at first:

engine::engine(const init_params& p):
  voice_profiles_spec("voice_profiles"),
  data_path(p.data_path),
  config_path(p.config_path),
  version(VERSION),
  languages(p.get_language_paths(),path::join(config_path,"dicts")),
  voices(p.get_voice_paths(),languages),
  prefer_primary_language("prefer_primary_language",true),
  logger(p.logger)
{
  logger->log(tag,RHVoice_log_level_info,"creating a new engine");
  if(languages.empty())
    throw no_languages();

An hour later I could read that. The part that comes after the engine::engine prototype (it is a constructor, actually) and before the function body is called a "member initializer list".

Starting from the bottom, languages is checked for emptiness. It is initialized by this line from the member initializer list:

languages(p.get_language_paths(),path::join(config_path,"dicts")),

languages is a class member, defined in engine.hpp as language_list languages;, and language_list is in turn defined in terms of resource_list, accepting language_paths and userdict_path as its arguments.

Digging further, if config_path is not set, it defaults to "./config" on Windows and "/etc/RHVoice" on Linux. These settings are set in src/core/SConscript. There is also a DATA_PATH setting, which defaults to "./data" on Windows.

Looking at how RHVoice uses these directories, it expects to find RHVoice.ini there (on Windows). I have no idea what the "dicts" that languages requires for initialization are. On the other hand, p.get_language_paths() is rather clear - it returns the subdirectories of DATA_PATH/languages. Then the languages constructor initializes each of the four hardcoded languages: Russian, English, Esperanto and Georgian.

So, putting data/languages/Russian etc. into the RHVoice.dll directory solved the initialization problem.
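
A quick way to sanity-check that layout from Python, assuming the default DATA_PATH of "./data" relative to the directory from which RHVoice.dll is loaded:

import os

# DATA_PATH defaults to "./data" on Windows (set in src/core/SConscript);
# p.get_language_paths() returns the subdirectories of data/languages.
languages_dir = os.path.join("data", "languages")

if os.path.isdir(languages_dir):
    print("Languages the engine will see:", os.listdir(languages_dir))
else:
    print("Missing", languages_dir, "- the engine will throw no_languages")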

Asking RHVoice to speak up

Just passing "text" to the engine is not enough. Text needs to be wrapped into a message object, which includes the text and its length, but also synthesis parameters such as the main voice, which is obligatory. I got this information from RHVoice_message_struct in the src/lib/lib.cpp file after inspecting the public RHVoice_speak function mentioned in src/lib/lib.def. lib.def defines which functions are visible from RHVoice.dll.

An important step was to add logging to the RHVoice_new_message function in src/lib/lib.cpp. This made it possible to pinpoint the exact error message at runtime, which turned out to be:

RHVoice_new_message: No synthesis parameters

Feeding an empty synth_params to the RHVoice_new_message call gave another error in the stderr log output:

RHVoice_new_message: The main voice name is mandatory

This attribute is called voice_profile. The difference between a "voice" and a "voice profile" is that a "voice" object contains info such as sample rate, language, gender and country, while a "voice profile" contains multiple voices plus functions to select a voice for a specific language or based on the text data. So, a "voice profile" is basically a combination of selected voices for narrating some text content.

From the start every voice gets a profile with the same name. Specifying the voice name in the message's synth_params made the callback function fire after invoking RHVoice_speak. There is still no sound, but it is a bit of progress. FWIW, the string "text" resulted in 17 calls to the callback function. It might be pretty slow if Python gets called back too many times.
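
On the Python side such a counting callback looks roughly like this. It is only a sketch: the callback prototype assumed below (samples pointer, sample count, user data) has to be checked against the RHVoice headers before it is wired into the engine initialization structure.

import ctypes

# Assumed prototype of the speech callback; verify it against the headers.
SpeechCallback = ctypes.CFUNCTYPE(
    ctypes.c_int,                    # return non-zero to continue synthesis
    ctypes.POINTER(ctypes.c_short),  # pointer to the synthesized samples
    ctypes.c_uint,                   # number of samples in this chunk
    ctypes.c_void_p)                 # user data passed through by the engine

calls = []

@SpeechCallback
def count_calls(samples, count, user_data):
    calls.append(count)
    return 1

# After RHVoice_speak returns, len(calls) tells how many times Python was
# re-entered - 17 for the string "text" in the experiment above.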

Making sound from RHVoice data

Callback functions were specified when initializing the RHVoice engine. One callback function, to be exact - speech_callback is obligatory. Inspection shows that it is called from src/core/hts_engine_call.cpp. The caller is the method sink::on_input(), which seems to be a callback too.

When executing the RHVoice.py wrapper (a work in progress at the moment), this message appears on stderr:

default Engine is default

It appears right before the callback is executed. The string comes from src/third-party/mage/mage.cpp, in particular from MAGE::Mage::addEngine. It looks like RHVoice doesn't produce sound itself, but uses a 3rd-party library.

Mage can be found at http://mage.numediart.org/ and it is a library for speech synthesis in environments that need fast response. So this is the code that produces the actual sound, and it says it is based on the HTS engine (http://hts-engine.sourceforge.net/), which describes itself as "software to synthesize speech waveform". Now I am confused. If HTS makes waveforms (digital sound data, or samples), then what does Mage do?

Mage is written by Maria Astrinaki, also a lady like Olga Yakovleva. It has a very tidy and good-looking web site with a couple of interesting scientific papers on the subject. But still no answer to where the difference is. After a few more hours the answer is: Mage is HTS with real-time streaming capability.

http://www.numediart.org/projects/project-13-2-phts-for-maxmsp/

Both synthesizers use HMMs, or Hidden Markov Models, which have a very cool example on the Wikipedia page, but it doesn't explain how to get the sound out. Clearly, the samples received by speech_callback should be the waveform, but which format is it in?

I joined the pieces, wrote them into a binary file, loaded it with the Raw Data import in Audacity, and found by trying that the data is 16000 Hz, 16 bits per sample, mono. Then I used the Python wave module and voila - after a few more hours there is a dumped wave file.
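
The dumping itself is a few lines with the standard wave module. A sketch, assuming the raw 16-bit chunks passed to the speech callback were already collected into a list called chunks (a hypothetical variable name):

import wave

# chunks: list of bytes objects with the raw 16-bit signed samples collected
# by the speech callback (hypothetical name used for this sketch).
chunks = [b""]

out = wave.open("output.wav", "wb")
out.setnchannels(1)        # mono
out.setsampwidth(2)        # 16 bits per sample = 2 bytes
out.setframerate(16000)    # 16000 Hz, found via Raw Data import in Audacity
out.writeframes(b"".join(chunks))
out.close()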

Handling Russian language

Speeding up the speech

It turned out that the produced sound was both slow and low. The answer was to tweak the parameters of the synthesizer. The experimentally found way, after one more day, is to set relative_pitch and relative_rate to 1.0, but how this works is not completely clear, because there are also absolute values for the same params.

Getting volume higher

Setting relative_volume to 1.0 solved the problem.
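
So the defaults the wrapper settled on experimentally look like this (a sketch; the names mirror the relative fields of the synthesis parameters discussed above):

# Experimentally found defaults; the names mirror the relative fields of the
# synthesis parameters passed to RHVoice_new_message.
DEFAULT_SYNTH_PARAMS = {
    "relative_rate": 1.0,    # normal speed
    "relative_pitch": 1.0,   # normal pitch
    "relative_volume": 1.0,  # fixes the low output volume
}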

Adding command line interface

Finally, to make it a convenient command line tool, command line help and parameters for an -i input file and an -o output file were added.
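
A minimal sketch of that interface with the standard argparse module; the -i/-o options are the ones mentioned above, while the default output name and the final wrapper call are placeholders for the code described earlier:

import argparse
import sys

parser = argparse.ArgumentParser(description="Speak text with RHVoice")
parser.add_argument("-i", dest="input", help="input text file (default: stdin)")
parser.add_argument("-o", dest="output", default="output.wav", help="output .wav file")
args = parser.parse_args()

text = open(args.input).read() if args.input else sys.stdin.read()
# ...feed `text` to the wrapper described above and dump the synthesized
# samples into args.output with the wave module.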

Getting output to speakers

The realtime interfaces needed to send the waveform to the speakers require more time and more sources, but it is possible, just not in this timeframe.