# Implementing a speech recognizer plugin
A speech recognizer plugin in DialogOS consists of three components:

- The recognizer interface, which implements `com.clt.speech.recognition.Recognizer`.
- The recognizer client, which extends `com.clt.dialog.client.RecognizerClient`.
- The recognizer plugin, which implements `com.clt.dialogos.plugin.Plugin`.
It is convenient to package all three components into a single Gradle project. This project can either be standalone or a subproject of the DialogOS Gradle project. The Sphinx and MaryTTS plugins put the client classes into a "client" subpackage and the plugin classes into a "plugin" subpackage; other plugins should follow this convention.
A standalone Gradle project needs to include a dependency on DialogOS, preferably via jitpack.io. See plugins.
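A minimal sketch of such a build file, assuming jitpack's usual `com.github.<user>:<repo>:<version>` coordinate scheme (the exact artifact name and version are assumptions; check the plugins page):

```groovy
// build.gradle of a standalone plugin project -- sketch only
repositories {
    mavenCentral()
    maven { url 'https://jitpack.io' }
}

dependencies {
    // coordinates follow the jitpack convention; substitute the real version tag
    implementation 'com.github.dialogos-project:dialogos:<version>'
}
```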
A Gradle subproject needs to be included in the top-level `settings.gradle`. The `build.gradle` for the subproject needs to declare dependencies on `:Diamant`, `:com.clt.speech`, and `:RecognizerClient`, plus any other dependencies that are required to run your speech recognizer.
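For example, the dependency block of such a subproject could look roughly like this (a sketch; adjust the configuration names to your build):

```groovy
// build.gradle of the plugin subproject
dependencies {
    implementation project(':Diamant')
    implementation project(':com.clt.speech')
    implementation project(':RecognizerClient')
    // ...plus whatever libraries your speech recognizer itself needs
}
```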
The most important method in the recognizer interface is `startImpl`, which does whatever your speech recognizer needs to do to recognize speech, and then returns an object of type `com.clt.speech.recognition.RecognitionResult`. `RecognitionResult` is an interface; thus, you will need to implement your own class for it. Alternatively, if your recognizer returns the recognition result as a single string and you do not care about confidences, you can use the class `com.clt.speech.recognition.simpleresult.SimpleRecognizerResult` in the `com.clt.speech` subproject.
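As a rough illustration, a recognizer whose engine delivers a plain string could implement `startImpl` along the following lines. This is a sketch: `MyEngine` and its method are made up, and the exact signature of `startImpl` should be copied from the DialogOS sources or an existing plugin.

```java
import com.clt.speech.recognition.RecognitionResult;
import com.clt.speech.recognition.simpleresult.SimpleRecognizerResult;

public class MyRecognizer /* extends the appropriate DialogOS recognizer base class */ {

    private final MyEngine engine = new MyEngine(); // hypothetical wrapper around your engine

    protected RecognitionResult startImpl() throws Exception {
        // Block until the engine has recognized one utterance...
        String utterance = engine.listenAndRecognize(); // hypothetical call
        // ...and wrap the plain string in a DialogOS result object.
        return new SimpleRecognizerResult(utterance);
    }
}
```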
The recognizer client implements the method `createRecognizer`, which returns an instance of your recognizer interface. It also defines a number of other methods which provide metadata.
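A corresponding client could be sketched as follows; the exact signatures of `createRecognizer` and the metadata methods should be copied from `RecognizerClient` or an existing plugin such as Sphinx.

```java
import com.clt.dialog.client.RecognizerClient;
import com.clt.speech.recognition.Recognizer;

public class MyRecognizerClient extends RecognizerClient {

    protected Recognizer createRecognizer() {
        // Hand DialogOS an instance of your recognizer implementation.
        return new MyRecognizer();
    }

    // ...plus the metadata methods declared by RecognizerClient
    // (e.g. a human-readable name).
}
```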
A key class in your plugin is the plugin node class. This is a subclass of `com.clt.diamant.graph.Node` which defines the behavior of the nodes in the DialogOS graph that call your plugin. We write `MyNode` for your node class below; replace it as appropriate. You simplify your life considerably if you subclass `com.clt.diamant.graph.nodes.AbstractInputNode`.
Ensure that there is a file `src/main/resources/META-INF/services/com.clt.dialogos.plugin.Plugin`, and that it contains the fully qualified name of the recognizer plugin class on a single line.
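For example, if your plugin class were `org.example.myrecognizer.plugin.MyPlugin` (a made-up name), the file would contain exactly this line:

```
org.example.myrecognizer.plugin.MyPlugin
```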
Ensure that the plugin is registered as a recognizer plugin by having the following line in the `initialize` method of your plugin class:

```java
com.clt.diamant.graph.Node.registerNodeTypes("Speech Recognition", Arrays.asList(new Class<?>[] { MyNode.class }));
```
Set the name under which the configuration panel of your plugin is displayed in the "Graph" menu by returning it from the `getName` method of your plugin class.
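Putting the last two points together, a minimal plugin class might look roughly like this (class and display names are made up, and the `Plugin` interface declares further methods that are omitted here):

```java
import java.util.Arrays;

public class MyPlugin implements com.clt.dialogos.plugin.Plugin {

    public void initialize() {
        // Register the node class so it appears under "Speech Recognition".
        com.clt.diamant.graph.Node.registerNodeTypes("Speech Recognition",
                Arrays.asList(new Class<?>[] { MyNode.class }));
    }

    public String getName() {
        return "My Speech Recognizer"; // shown in the "Graph" menu
    }

    // ...the remaining methods required by com.clt.dialogos.plugin.Plugin
}
```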
The node toolbox is the rightmost part of the DialogOS window, which displays the palette of node types. You define the name under which your plugin will be displayed in the toolbox by returning it from the `getNodeTypeName` method of your node class.
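In the node class, that might look like the following sketch; the exact signature of `getNodeTypeName` is an assumption here and is best copied from `AbstractInputNode` or an existing plugin.

```java
public class MyNode extends com.clt.diamant.graph.nodes.AbstractInputNode {

    // Name displayed in the node toolbox (assumed signature -- verify
    // against the DialogOS sources).
    public static String getNodeTypeName(Class<?> c) {
        return "My Recognizer";
    }
}
```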
If you want an icon displayed next to the name, ensure that:

- The package mentioned in Resources#resources matches the package of your plugin.
- There is a file `src/main/resources/<your package>/<MyNode>.png`, where `<MyNode>` stands for your node class. For example, a node class `org.example.myrecognizer.MyNode` would get its icon from `src/main/resources/org/example/myrecognizer/MyNode.png`.
For a concrete example, the rest of this page walks through the Sphinx plugin. It is subdivided into the packages `edu.cmu.lti.dialogos.sphinx.client` and `edu.cmu.lti.dialogos.sphinx.plugin`, which implement the recognizer client via CMU Sphinx-4 and the plugin capabilities, respectively.
In DialogOS, a speech recognition client implements `AbstractRecognizer`. For our purposes, `AbstractRecognizer` has too many implementation requirements (dealing with domains and contexts, with transcription, with properties, etc.). As we do not need domains, `SingleDomainRecognizer` hides them by disabling them. `SphinxBaseRecognizer` abstracts away transcription, properties, and the audio format (which is always fixed for CMU Sphinx). Finally, `Sphinx` (which stands at the end of the class hierarchy `AbstractRecognizer` -> `SingleDomainRecognizer` -> `SphinxBaseRecognizer` -> `Sphinx`) actually defines `startImpl()` and deals with recognition proper via classes from CMU Sphinx.
One of the challenges of the code is the interweaving of CMU Sphinx's and DialogOS's classes. On both ends, class names include Recognizer, Recognition, Result, and similar; this can be very confusing, so while analyzing the code, try to keep track of whether a class belongs to the CMU Sphinx or the DialogOS world.
Quite a lot of the code deals with transforming between objects of these two worlds and adapting functionality. `startImpl()` sets up the CMU Sphinx-based recognizer (a `ConfigurableSpeechRecognizer` that adds some functionality to CMU Sphinx's standard recognizer API), calls its `startRecognition()` method, waits for a result, adds some additional checks, and finally returns a transformation of the CMU Sphinx result into a DialogOS result. Likewise, `stopImpl()` merely passes the stop request on to the `ConfigurableSpeechRecognizer`.
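In paraphrased form, the control flow of `startImpl()` is roughly the following; the helper methods are invented for the sketch and do not appear under these names in the actual code.

```java
protected RecognitionResult startImpl() throws SpeechException {
    // Inject the current context and obtain a configured recognizer (hypothetical helper).
    ConfigurableSpeechRecognizer csr = setupSphinxRecognizer(getContext());
    csr.startRecognition();
    SpeechResult sphinxResult = waitForResult(csr); // hypothetical: blocks until CMU Sphinx returns
    checkResult(sphinxResult);                      // hypothetical: the "additional checks"
    return toDialogOSResult(sphinxResult);          // hypothetical: CMU Sphinx -> DialogOS transform
}
```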
The implementation uses `SphinxContext` to store all information that is relevant to recognition (e.g. the language to be used, pronunciations in addition to the built-in lexicon, the grammar to recognize from, etc.). This information is injected into the setup of the `ConfigurableSpeechRecognizer` before recognition is initialized. To confuse you a bit more, CMU Sphinx also encapsulates (some of) this information in a `Context` object, which is not the same as the plugin's `SphinxContext`.
The `Plugin` class itself only deals with some housekeeping, most notably registering the `SphinxNode` class as a type of node to be used in dialog models. It also sets up `Settings`, which keeps some global settings of the recognizer, such as an exception dictionary; this is where you would put settings such as voice activity detection, loudness, or similar. `SphinxNode` derives from `AbstractInputNode`, which provides the basic recognition setup (managing a grammar, checking results against the grammar, etc.). `SphinxNode` itself then implements `createRecognitionExecutor()`, and the `SphinxRecognitionExecutor` it creates manages recognition execution.
A `RecognitionExecutor` manages the recognition. On `start()`, it receives the grammar, the possible patterns that it expects as outcomes of recognition, a timeout, a state listener (which reflects the recognition state in the GUI), and a threshold for recognition confidence. `stop()` is expected to abort any ongoing recognition. The `SphinxRecognitionExecutor` sets the recognizer's context according to the grammar, registers state listeners, and initiates recognition. Upon return of a recognition result, it checks the result and returns it. It also manages the timeout.
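Rendered as an interface, the contract described above would look roughly like this; the parameter types and their order are assumptions derived from the description, not copied from the DialogOS sources.

```java
// Hypothetical rendering of the RecognitionExecutor contract.
public interface RecognitionExecutor {

    // Start recognition against the given grammar and return the checked result.
    RecognitionResult start(Grammar grammar,
                            List<Pattern> expectedPatterns,
                            long timeoutMillis,
                            RecognizerListener stateListener,
                            float confidenceThreshold) throws SpeechException;

    // Abort any ongoing recognition.
    void stop();
}
```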