Reversing RHVoice
Reversing software is the process of understanding how it works, often with the goal of modifying or fixing it. Reversing is usually applied to compiled programs, so things should be easy if you have the source code, but that is not always the case, especially if the source language is unfamiliar.
RHVoice
I needed an easy tool to speak text to me, and from what I've heard, RHVoice is the best synthesizer for the Russian language. It is made by Olga Yakovleva for people with impaired vision. There are governmental scientific programs that are supposed to help, but they fail for many reasons, one of them being the lack of motivation and public feedback, because nobody who needs a voice synthesizer for daily life works in the Academy of Sciences. From my Russian friends who use a screen reader daily, I've heard that RHVoice is the best practical solution.
I decided to hack on RHVoice because I wanted better human-to-machine interfaces (human augmentation is one of my part-time obsessions), and also because I see that this project is already useful, so if I can improve its usability further, that would be a good thing.
I also wanted to bring the Belarusian language to it and to understand how the whole system works.
Reversing RHVoice for a better user experience
RHVoice is written in C++, and I have problems with that language. There is also a Python interface for RHVoice that is required to make the synthesizer usable in NVDA, so I started my quest by reversing the NVDA plugin to get a standalone speaking module out of it.
The NVDA plugin is a .py file that calls RHVoice.dll using ctypes. I could successfully load the .dll and get the RHVoice version. Loading the synthesizer appeared more troublesome, because there was no error reporting. The wrapper that exposes functions from the .dll contained a catch-all expression that returns 0 on any error (RHVoice_new_tts_engine). Finding the wrapper in src/lib.cpp took an hour or more, just because I was not familiar with C++. Compiling it with AppVeyor took another hour or two.
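For illustration, that first step looked roughly like this (a sketch; I am assuming the version call is exported as RHVoice_get_version and returns a C string):

    # sketch: load RHVoice.dll with ctypes and ask for its version
    import ctypes

    lib = ctypes.CDLL("RHVoice.dll")
    lib.RHVoice_get_version.restype = ctypes.c_char_p
    print(lib.RHVoice_get_version())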
Adding logging
The RHVoice_new_tts_engine initialization function needs to be passed a long structure with a few callbacks and other parameters. When the engine failed to initialize, there was no error message to tell what went wrong, so I had to add logging of exceptions to stderr, which is turned on by calling RHVoice_set_logging(True).
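From Python the switch is a one-liner (a sketch, assuming lib is the ctypes handle to the patched RHVoice.dll from above):

    # sketch: turn on the stderr exception logging added to the DLL
    lib.RHVoice_set_logging(True)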
After a few more hours RHVoice.dll got logging. Downloading a new set of compiled files from AppVeyor and trying the Python module on it produced this report from RHVoice_tts_engine_struct:

    No language resources are available
I got the same error on Linux while running the test:

    echo test|./build/linux/test/RHVoice-test
Reported here https://github.com/Olga-Yakovleva/RHVoice/issues/14
Finding language resources
scons install clearly puts them in the right place, but I don't want to install anything just yet - I just want to test that the Python module works.
The message is defined in engine.hpp inside the exception no_languages, which is thrown only in engine.cpp. The code appeared too mysterious to me on the first try:
    engine::engine(const init_params& p):
        voice_profiles_spec("voice_profiles"),
        data_path(p.data_path),
        config_path(p.config_path),
        version(VERSION),
        languages(p.get_language_paths(),path::join(config_path,"dicts")),
        voices(p.get_voice_paths(),languages),
        prefer_primary_language("prefer_primary_language",true),
        logger(p.logger)
    {
        logger->log(tag,RHVoice_log_level_info,"creating a new engine");
        if(languages.empty())
            throw no_languages();
An hour later I could read that. The stuff that comes after the engine::engine function prototype (it is actually a constructor) and before the function body is called a "member initializer list".
Starting from the bottom, languages is checked for
emptiness. It is initialized by the line above:
    languages(p.get_language_paths(),path::join(config_path,"dicts")),
languages is a class member, defined in engine.hpp as language_list languages;, and language_list is in turn defined in terms of resource_list, taking language_paths and userdict_path as its constructor arguments.
Digging further: if config_path is not set, it defaults to "./config" on Windows and "/etc/RHVoice" on Linux. These settings are set in src/core/SConscript. There is also a DATA_PATH setting, set to "./data" on Windows by default.
Looking at how RHVoice uses these directories, it expects to find RHVoice.ini there (on Windows). I have no idea what the "dicts" are that languages requires for initialization. On the other hand, p.get_language_paths() is rather clear - it returns the subdirectories of DATA_PATH/languages. The languages constructor then initializes each of the four hardcoded languages - Russian, English, Esperanto, Georgian.
So, putting data/languages/Russian etc. into the RHVoice.dll directory solved the initialization problem.
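For reference, the layout next to RHVoice.dll that made initialization pass looks like this (RHVoice.ini is what the code expects in the config directory, though I did not check whether it is mandatory):

    RHVoice.dll
    config/
        RHVoice.ini
    data/
        languages/
            Russian/
            English/
            Esperanto/
            Georgian/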
Asking RHVoice to speak up
Just passing "text" to engine is not enough. Text needs to
be wrapped into message object, which includes text, its
length, but also synthesis parameters such as main voice,
and this one is obligatory. I've got this information from
RHVoice_message_struct in src/lib/lib.cpp file after
inspecting public RHVoice_speak function mentioned in
src/lib/lib.def. lib.def defines which functions are
visible from RHVoice.dll
An important step was to add logging to the RHVoice_new_message function in src/lib/lib.cpp. This allowed me to pinpoint the exact error message at runtime:

    RHVoice_new_message: No synthesis parameters
Feeding an empty synth_params to the RHVoice_new_message call gave another error in the stderr log output:

    RHVoice_new_message: The main voice name is mandatory
This attribute is called voice_profile. The difference between a "voice" and a "voice profile" is that a "voice" object carries info such as sample rate, language, gender and country, while a "voice profile" contains multiple voices and functions to select a voice for a specific language or based on the text data. So a "voice profile" is basically a combination of voices selected for narrating some text content.
From the start every voice gets a profile with the same name.
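Putting it together, creating a message from Python looks roughly like this. This is a sketch: the field list of RHVoice_synth_params is abbreviated and its exact layout must be copied from RHVoice.h, the voice name "Elena" is just an example, I am assuming message type 0 means plain text, and lib and engine are the DLL and engine handles from the earlier steps.

    # sketch: synthesis parameters with the mandatory main voice
    import ctypes

    class RHVoice_synth_params(ctypes.Structure):
        # abbreviated -- copy the real field order from RHVoice.h
        _fields_ = [
            ("voice_profile", ctypes.c_char_p),    # the mandatory main voice
            ("absolute_rate", ctypes.c_double),
            ("absolute_pitch", ctypes.c_double),
            ("absolute_volume", ctypes.c_double),
            ("relative_rate", ctypes.c_double),
            ("relative_pitch", ctypes.c_double),
            ("relative_volume", ctypes.c_double),
        ]

    params = RHVoice_synth_params(voice_profile=b"Elena",
                                  relative_rate=1.0,   # 1.0 turned out to be
                                  relative_pitch=1.0,  # the neutral value
                                  relative_volume=1.0) # (found later)
    text = b"text"
    message = lib.RHVoice_new_message(engine, text, len(text), 0,
                                      ctypes.byref(params), None)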
Specifying a voice name in the synth_params of the message made the callback function fire after invoking RHVoice_speak. There is still no sound, but it is a bit of progress. FWIW, the string "text" resulted in 17 calls to the callback function, which might be pretty slow if Python gets called too many times.
Making sound from RHVoice data
Callback functions were specified when initializing the RHVoice engine. One callback function, to be exact - speech_callback is obligatory. Inspection shows that it is called from src/core/hts_engine_call.cpp. The caller is the method sink::on_input(), which seems to be a callback too.
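On the Python side such a callback can be declared with ctypes roughly like this (a sketch; the signature - a buffer of 16-bit samples, a sample count and a user data pointer, returning int - is my reading and should be checked against the headers):

    # sketch: collect the raw 16-bit samples handed over by the engine
    import ctypes

    chunks = []

    SpeechCallback = ctypes.CFUNCTYPE(ctypes.c_int,
                                      ctypes.POINTER(ctypes.c_short),
                                      ctypes.c_uint, ctypes.c_void_p)

    @SpeechCallback
    def speech_callback(samples, count, user_data):
        chunks.append(ctypes.string_at(samples, count * 2))  # 2 bytes per sample
        return 1  # non-zero is assumed to mean "keep going"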
When executing the RHVoice.py wrapper (a work in progress at the moment), this message is seen on stderr:

    default Engine is default
It appears right before the callback is executed. This string comes from src/third-party/mage/mage.cpp, in particular from MAGE::Mage::addEngine. It looks like RHVoice doesn't produce sound itself, but uses a third-party library.
Mage is found at http://mage.numediart.org/ and is a library for speech synthesis in environments that need fast response. So this is the code that produces the actual sound, and it says it is based on the HTS engine (http://hts-engine.sourceforge.net/), which describes itself as "software to synthesize speech waveform". Now I am confused. If HTS makes waveforms (digital sound data, or samples), then what does Mage do?
Mage is written by Maria Astrinaki, also a lady like Olga Yakovleva. It has a very tidy, good-looking web site with a couple of interesting scientific papers on the subject, but still no answer to where the difference is. After a few more hours the answer emerged - Mage is HTS with real-time streaming capability.
http://www.numediart.org/projects/project-13-2-phts-for-maxmsp/
Both synthesizers use HMMs, or Hidden Markov Models, which have a very cool example on the Wikipedia page, but that doesn't explain how to get the sound out. Clearly, the samples received by speech_callback should be the waveform, but which format are they in?
I joined the pieces and wrote them into a binary file, loaded it with Audacity's Raw Data import, and found by trial that the data is 16000 Hz, 16 bits per sample, mono. Then I used the Python wave module, and voila - after a few more hours there is a dumped wave file.
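The dumping itself is a few lines with the standard wave module (the format values are the ones found above; chunks is the list of sample buffers collected by the callback sketch earlier):

    # sketch: write collected samples as a 16000 Hz, 16-bit, mono wave file
    import wave

    with wave.open("out.wav", "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 16 bits = 2 bytes per sample
        f.setframerate(16000)    # 16000 Hz
        f.writeframes(b"".join(chunks))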
Handling the Russian language
Speeding up the speech
It appeared that the produced sound was both slow and low. The answer was to tweak the parameters of the synthesizer. The experimentally found way, after one more day, is to set relative_pitch and relative_rate to 1.0. How it works is not completely clear, because there are also absolute values for the same params; presumably a zero-filled synth_params structure sets these multipliers to 0.0, which slows down and lowers the voice.
Getting volume higher
Setting relative_volume to 1.0 solved the problem.
Adding command line interface
Finally, to make it a convenient command line tool, command line help and parameters for -i input and -o output files were added.
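A minimal sketch of that interface with argparse (only the -i and -o options come from the work above, the rest is illustrative):

    # sketch: command line help plus -i input and -o output file options
    import argparse

    parser = argparse.ArgumentParser(description="Speak text with RHVoice")
    parser.add_argument("-i", metavar="FILE", help="input text file")
    parser.add_argument("-o", metavar="FILE", help="output wave file")
    args = parser.parse_args()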
Getting output to speakers
The realtime interface to send the waveform to the speakers requires more time and more sources, but it is possible, just not in this timeframe.