# Implementing a speech recognizer plugin
A speech recognizer plugin in DialogOS consists of three components:

- The recognizer interface, which implements `com.clt.speech.recognition.Recognizer`.
- The recognizer client, which extends `com.clt.dialog.client.RecognizerClient`.
- The recognizer plugin, which implements `com.clt.dialogos.plugin.Plugin`.
It is convenient to package all three components into a single Gradle project. This project can either be standalone or a subproject of the DialogOS Gradle project. The Sphinx and MaryTTS plugins put the client classes into a "client" subpackage and the plugin classes into a "plugin" subpackage; other plugins should follow this convention.
A standalone Gradle project needs to include a dependency on DialogOS, preferably via jitpack.io. See plugins.
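A minimal sketch of such a build file, assuming jitpack's usual `com.github.<user>:<repo>:<version>` coordinate scheme (the exact artifact name and version are assumptions; check the plugins page):

```groovy
// build.gradle of a standalone plugin project -- sketch only
repositories {
    mavenCentral()
    maven { url 'https://jitpack.io' }
}

dependencies {
    // coordinates follow the jitpack convention; substitute the real version tag
    implementation 'com.github.dialogos-project:dialogos:<version>'
}
```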
A Gradle subproject needs to be included in the top-level `settings.gradle`. The `build.gradle` for the subproject needs to declare dependencies on `:Diamant`, `:com.clt.speech`, and `:RecognizerClient`, plus any other dependencies that are required to run your speech recognizer.
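For example, the dependency block of such a subproject could look roughly like this (a sketch; adjust the configuration names to your build):

```groovy
// build.gradle of the plugin subproject
dependencies {
    implementation project(':Diamant')
    implementation project(':com.clt.speech')
    implementation project(':RecognizerClient')
    // ...plus whatever libraries your speech recognizer itself needs
}
```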
The most important method in the recognizer interface is `startImpl`, which does whatever your speech recognizer needs to do to recognize speech, and then returns an object of type `com.clt.speech.recognition.RecognitionResult`. `RecognitionResult` is an interface; thus, you will need to implement your own class for it. Alternatively, if your recognizer returns the recognition result as a single string and you do not care about confidences, you can use the class `com.clt.speech.recognition.simpleresult.SimpleRecognizerResult` in the `com.clt.speech` subproject.
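As a rough illustration, a recognizer whose engine delivers a plain string could implement `startImpl` along the following lines. This is a sketch: `MyEngine` and its method are made up, and the exact signature of `startImpl` should be copied from the DialogOS sources or an existing plugin.

```java
import com.clt.speech.recognition.RecognitionResult;
import com.clt.speech.recognition.simpleresult.SimpleRecognizerResult;

public class MyRecognizer /* extends the appropriate DialogOS recognizer base class */ {

    private final MyEngine engine = new MyEngine(); // hypothetical wrapper around your engine

    protected RecognitionResult startImpl() throws Exception {
        // Block until the engine has recognized one utterance...
        String utterance = engine.listenAndRecognize(); // hypothetical call
        // ...and wrap the plain string in a DialogOS result object.
        return new SimpleRecognizerResult(utterance);
    }
}
```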
The recognizer client implements the method `createRecognizer`, which returns an instance of your recognizer interface. It also defines a number of other methods which provide metadata.
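A corresponding client could be sketched as follows; the exact signatures of `createRecognizer` and the metadata methods should be copied from `RecognizerClient` or an existing plugin such as Sphinx.

```java
import com.clt.dialog.client.RecognizerClient;
import com.clt.speech.recognition.Recognizer;

public class MyRecognizerClient extends RecognizerClient {

    protected Recognizer createRecognizer() {
        // Hand DialogOS an instance of your recognizer implementation.
        return new MyRecognizer();
    }

    // ...plus the metadata methods declared by RecognizerClient
    // (e.g. a human-readable name).
}
```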
A key class in your plugin is the plugin node class. This is a subclass of `com.clt.diamant.graph.Node` which defines the behavior of the nodes in the DialogOS graph that call your plugin. We write `MyNode` for your node class below; replace it as appropriate. You simplify your life considerably if you subclass `com.clt.diamant.graph.nodes.AbstractInputNode`.
Ensure that there is a file `src/main/resources/META-INF/services/com.clt.dialogos.plugin.Plugin`, and that it contains the fully qualified name of the recognizer plugin class on a single line.
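For example, if your plugin class were `org.example.myrecognizer.plugin.MyPlugin` (a made-up name), the file would contain exactly this line:

```
org.example.myrecognizer.plugin.MyPlugin
```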
Ensure that the plugin is registered as a recognizer plugin by having the following line in the `initialize` method of your plugin class:

```java
com.clt.diamant.graph.Node.registerNodeTypes("Speech Recognition", Arrays.asList(new Class<?>[] { MyNode.class }));
```
Set the name under which the configuration panel of your plugin is displayed in the "Graph" menu by returning it from the `getName` method of your plugin class.
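Putting the last two points together, a minimal plugin class might look roughly like this (class and display names are made up, and the `Plugin` interface declares further methods that are omitted here):

```java
import java.util.Arrays;

public class MyPlugin implements com.clt.dialogos.plugin.Plugin {

    public void initialize() {
        // Register the node class so it appears under "Speech Recognition".
        com.clt.diamant.graph.Node.registerNodeTypes("Speech Recognition",
                Arrays.asList(new Class<?>[] { MyNode.class }));
    }

    public String getName() {
        return "My Speech Recognizer"; // shown in the "Graph" menu
    }

    // ...the remaining methods required by com.clt.dialogos.plugin.Plugin
}
```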
The node toolbox is the rightmost part of the DialogOS window, which displays the palette of node types. You define the name under which your plugin will be displayed in the toolbox by returning it from the `getNodeTypeName` method of your node class.
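In the node class, that might look like the following sketch; the exact signature of `getNodeTypeName` is an assumption here and is best copied from `AbstractInputNode` or an existing plugin.

```java
public class MyNode extends com.clt.diamant.graph.nodes.AbstractInputNode {

    // Name displayed in the node toolbox (assumed signature -- verify
    // against the DialogOS sources).
    public static String getNodeTypeName(Class<?> c) {
        return "My Recognizer";
    }
}
```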
If you want an icon displayed next to the name, ensure that:

- The package mentioned in Resources#resources matches the package of your plugin.
- There is a file `src/main/resources/<your package>/<MyNode>.png`, where `<MyNode>` stands for your node class. For example, a node class `org.example.myrecognizer.MyNode` would get its icon from `src/main/resources/org/example/myrecognizer/MyNode.png`.
For a concrete example, the rest of this page walks through the Sphinx plugin. It is subdivided into the packages `edu.cmu.lti.dialogos.sphinx.client` and `edu.cmu.lti.dialogos.sphinx.plugin`, which implement the recognizer client via CMU Sphinx-4 and the plugin capabilities, respectively.
In DialogOS, a speech recognition client implements `AbstractRecognizer`. For our purposes, `AbstractRecognizer` has too many implementation requirements (dealing with domains and contexts, with transcription, with properties, etc.). As we do not need domains, `SingleDomainRecognizer` hides them by disabling them. `SphinxBaseRecognizer` abstracts away transcription, properties, and the audio format (which is always fixed for CMU Sphinx). Finally, `Sphinx` (which stands at the end of the class hierarchy `AbstractRecognizer` -> `SingleDomainRecognizer` -> `SphinxBaseRecognizer` -> `Sphinx`) actually defines `startImpl()` and deals with recognition proper via classes from CMU Sphinx.
One of the challenges of the code is the interweaving of CMU Sphinx's and DialogOS's classes. On both ends, class names include Recognizer, Recognition, Result, and similar; this can be very confusing, so while analyzing the code, try to keep track of whether a class belongs to the CMU Sphinx or the DialogOS world.
Quite a lot of the code deals with transforming between objects of these two worlds and adapting functionality. `startImpl()` sets up the CMU Sphinx-based recognizer (a `ConfigurableSpeechRecognizer` that adds some functionality to CMU Sphinx's standard recognizer API), calls its `startRecognition()` method, waits for a result, adds some additional checks, and finally returns a transformation of the CMU Sphinx result into a DialogOS result. Likewise, `stopImpl()` merely passes the stop request on to the `ConfigurableSpeechRecognizer`.
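In paraphrased form, the control flow of `startImpl()` is roughly the following; the helper methods are invented for the sketch and do not appear under these names in the actual code.

```java
protected RecognitionResult startImpl() throws SpeechException {
    // Inject the current context and obtain a configured recognizer (hypothetical helper).
    ConfigurableSpeechRecognizer csr = setupSphinxRecognizer(getContext());
    csr.startRecognition();
    SpeechResult sphinxResult = waitForResult(csr); // hypothetical: blocks until CMU Sphinx returns
    checkResult(sphinxResult);                      // hypothetical: the "additional checks"
    return toDialogOSResult(sphinxResult);          // hypothetical: CMU Sphinx -> DialogOS transform
}
```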
The implementation uses `SphinxContext` to store all information that is relevant to recognition (e.g. the language to be used, pronunciations in addition to the built-in lexicon, the grammar to recognize from, etc.). This information is injected into the setup of the `ConfigurableSpeechRecognizer` before recognition is initialized. To confuse you a bit more, CMU Sphinx also encapsulates (some of) this information in a `Context` object, which is not the same as the plugin's `SphinxContext`.
The `Plugin` class itself only deals with some housekeeping, most notably registering the `SphinxNode` class as a type of node to be used in dialog models. It also sets up `Settings`, which keeps some global settings of the recognizer, such as an exception dictionary; this is where you would put settings such as voice activity detection, loudness, or similar. `SphinxNode` derives from `AbstractInputNode`, which provides the basic recognition setup (managing a grammar, checking results against the grammar, etc.). `SphinxNode` itself then implements `createRecognitionExecutor()`, and the `SphinxRecognitionExecutor` it creates manages recognition execution.
A `RecognitionExecutor` manages the recognition. On `start()`, it receives the grammar, the possible patterns that it expects as outcomes of recognition, a timeout, a state listener (which reflects the recognition state in the GUI), and a threshold for recognition confidence. `stop()` is expected to abort any ongoing recognition. The `SphinxRecognitionExecutor` sets the recognizer's context according to the grammar, registers state listeners, and initiates recognition. Upon return of a recognition result, it checks the result and returns it. It also manages the timeout.
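Rendered as an interface, the contract described above would look roughly like this; the parameter types and their order are assumptions derived from the description, not copied from the DialogOS sources.

```java
// Hypothetical rendering of the RecognitionExecutor contract.
public interface RecognitionExecutor {

    // Start recognition against the given grammar and return the checked result.
    RecognitionResult start(Grammar grammar,
                            List<Pattern> expectedPatterns,
                            long timeoutMillis,
                            RecognizerListener stateListener,
                            float confidenceThreshold) throws SpeechException;

    // Abort any ongoing recognition.
    void stop();
}
```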