Voice Recognition - JpEncausse/SARAH-Documentation GitHub Wiki
The official documentation has been moved to http://wiki.sarah.encausse.net/
WSRSpeechManager instantiates multiple WSRSpeechEngine instances to wrap a Microsoft SpeechRecognitionEngine. It relies on two libraries: System.Speech.* and Microsoft.Speech.* (for Kinect).
See also:
- S.A.R.A.H: An enriched grammar
- S.A.R.A.H: Demo 1: XML grammar
- S.A.R.A.H: Demo 5: HTTP request
The SpeechEngine loads XML grammar files to match sentences against a given confidence level. For each match, an HTTP request is sent to the server.
- Audio input can come from a device, files, the network, ...
- A grammar can trigger an instant action
- A grammar can be executed in a given context
- A grammar can be rewritten / reloaded
Set the only property to true to deactivate all other Kinect features and reduce CPU usage.
- Each engine works with only one language.
- The audio framework is automatically reloaded every hour.
```ini
; Hot replace SARAH name
name=SARAH
; Speech engine language
language=fr-FR
; Speech only: do not start other features (for low CPU)
only=false
; Speech 1st word confidence (aka SARAH)
trigger=0.8
; Speech overall confidence
confidence=0.70
; Restart engine every X milliseconds (1000 x 60 x 60 = 3600000 = 1 hour)
restart=3600000

[directory]
; Path to XML Grammar directories
directory1=macros
directory2=plugins
```
To use SARAH in another language:
- Set the speech engine language: `language=en-US`
- Copy and rename `grammar.xml` into `grammar_en_US.xml` (convention)
- Translate the commands and set the attribute `xml:lang="en-US"`
The Microsoft Speech Platform SDK 11 supports grammar files that use Extensible Markup Language (XML) elements and attributes, as specified in the World Wide Web Consortium (W3C) Speech Recognition Grammar Specification (SRGS) Version 1.0. These XML elements and attributes represent the rule structures that define the words or phrases (commands) recognized by speech recognition engines.
```xml
<grammar version="1.0" xml:lang="fr-FR" mode="voice" root="ruleTime" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
  <rule id="ruleTime" scope="public">
    <example>Sarah il est quelle heure ?</example>
    <item>Sarah</item>
    <one-of>
      <item>il est quelle heure</item>
      <item>quelle heure est il</item>
    </one-of>
    <item repeat="0-1">
      <one-of>
        <item>à NewYork</item>
        <item>à Paris</item>
      </one-of>
    </item>
  </rule>
</grammar>
```

The MSDN Grammar Elements page describes the Microsoft implementation. Here is a Solitaire card game example.
SARAH improves the Microsoft grammar with HTTP parameters. When an XML rule matches, the linked HTTP request is executed.
```xml
<grammar version="1.0" xml:lang="fr-FR" mode="voice" root="ruleMeteo" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
  <rule id="ruleMeteo" scope="public">
    <example>Sarah quelle est la météo pour demain ?</example>
    <tag>out.action=new Object(); </tag>
    <item>Sarah</item>
    <one-of>
      <item>quelle est la météo</item>
      <item>est-ce qu'il pleut</item>
      <item>comment dois-je m'habiller</item>
    </one-of>
    <item repeat="0-1">
      <one-of>
        <item>aujourd'hui<tag>out.action.date="1";</tag></item>
        <item>en ce moment<tag>out.action.date="1";</tag></item>
        <item>après demain<tag>out.action.date="4";</tag></item>
      </one-of>
    </item>
    <tag>out.action._attributes.uri="http://127.0.0.1:8080/sarah/phantom/meteo";</tag>
  </rule>
</grammar>
```

An object named action is created.
- Each property of action becomes an HTTP request parameter (e.g. out.action.date="1";)
- The uri attribute of action defines the request URI (e.g. out.action._attributes.uri="http://127.0.0.1:8080/sarah/phantom/meteo";)
This will send the given request:
http://127.0.0.1:8080/sarah/phantom/meteo?date=1
And logs the given Semantics:
```xml
<SML text="Sarah quelle est la météo aujourd'hui" utteranceConfidence="0.855" confidence="0.855">
  <action confidence="0.998" uri="http://127.0.0.1:8080/sarah/phantom/meteo">
    <date confidence="0.998">1</date>
  </action>
</SML>
```

- Grammars must follow XML encoding: `&` must be written `&amp;`
- Actions must follow HTTP encoding
Example of encoding both values:

```xml
<tag>out.action.param1=encodeURIComponent("Sam &amp; Max")</tag>
```

SARAH improves the Microsoft grammar with custom attributes. Like uri, these attributes trigger an instant behavior:
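On the Node.js side, such a URL-encoded value can be decoded back with the standard decodeURIComponent function (the raw string below is what the encoded example would look like in the query string):

```javascript
// The grammar sent param1=encodeURIComponent("Sam & Max");
// in the query string it arrives URL-encoded and can be decoded explicitly:
var raw = 'Sam%20%26%20Max';          // %20 = space, %26 = &
var value = decodeURIComponent(raw);  // back to "Sam & Max"
```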
```xml
<tag>out.action._attributes.tts="Bonjour le monde";</tag>
```

| Name | Values | Description |
|---|---|---|
| uri | URI (http://) | Define the HTTP request URI |
| tts | String | Trigger instant TTS |
| notts | boolean | Stop text-to-speech |
| dictation | boolean or lang | Send audio to Google |
| play | Path or URI | Play a local or streamed MP3/WAV/WMA |
| picture | boolean | Upload a photo taken by the Kinect (main sensor only) |
| threashold | float | Override the default confidence |
| context | list | Activate a comma-separated grammar list |
| listen | boolean | Stop/start listening |
| restart | boolean | Restart the speech engine |
| height | boolean | Say the current user's height in meters |
In the real world, some grammars do not need to be loaded up front and should only apply in a given context.
For example, the XBMC plugin provides two grammars:
- The first one starts/stops XBMC and activates the second grammar.
- The second grammar handles all XBMC commands.
- Rules starting with "lazy" are not loaded:
  - in the rule name (in the XML)
  - in the file name
- A grammar attribute can trigger a context:
  - out.action._attributes.context = "chatterbot.xml"
  - out.action._attributes.context = "default"
- The HTTP server can also receive an HTTP request to do the same
The custom.ini file also contains a section listing the grammars to load at startup:

```ini
; Reset grammar to default after given timeout (millis)
ctxTimeout=60000

[context]
; XML Grammar files to load (instead of all)
```
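For example, the [context] section could whitelist only a couple of grammar files. The key names (grammar1, grammar2) mirror the directory1/directory2 pattern above but are an assumption here, and the file names are purely illustrative:

```ini
[context]
; XML Grammar files to load (instead of all)
grammar1=grammar.xml
grammar2=xbmc.xml
```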
Microsoft grammars are very efficient but use closed statements. In real life, some grammars need wildcards. See also Google vs Microsoft.
SARAH recherche * sur Wikipedia
The Microsoft <ruleref special="GARBAGE" /> tag is used to bypass unknown audio between two known words. The dictation attribute then sends this audio to Google speech-to-text.
```xml
<tag>out.action=new Object(); </tag>
<item>Sarah recherche</item>
<ruleref special="GARBAGE" />
<item>wikipedia</item>
<tag>out.action._attributes.uri="http://127.0.0.1:8080/sarah/dictionary";</tag>
<tag>out.action._attributes.dictation="true";</tag>
```

The value of the dictation attribute can be true/false, or the language to detect (e.g. en-US).
The server script must parse the given string to extract the words between "recherche" and "wikipedia":
```javascript
exports.action = function(data, callback, config, SARAH){
  var search = data.dictation;
  // Lazily capture everything between "recherche" and an optional "sur" + "Wikipedia"
  var rgxp = /Sarah recherche (.+?)(?: sur)? Wikipedia/i;
  var match = search.match(rgxp);
  if (!match || match.length <= 1){
    return callback({'tts': "Je ne comprends pas"});
  }
  search = match[1];
  // ...
}
```
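As a sanity check, here is how such a regex (with an optional "sur") extracts the search term from a sample dictation result; the sentence is illustrative:

```javascript
// Illustrative use of the parsing regex on a sample dictation result
var rgxp = /Sarah recherche (.+?)(?: sur)? Wikipedia/i;
var match = 'Sarah recherche Albert Einstein sur Wikipedia'.match(rgxp);
console.log(match[1]); // → "Albert Einstein"
```

The lazy quantifier (.+?) keeps "sur" out of the captured term, whether or not the user says it.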
The audio of each speech recognition is processed to compute pitches. An average pitch is computed by the ProfileManager to recognize people.
A male voice has a pitch between 100 and 200; a female voice has a pitch above 200. The pitch property is a threshold used to compare users.
```ini
; Pitch delta to define a voice
pitch=30
```
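A sketch of how such a threshold comparison could work, assuming the ProfileManager simply checks whether a measured average pitch falls within ±pitch of a stored profile (the matchProfile function and profile shape are hypothetical, not SARAH's actual API):

```javascript
// Hypothetical sketch: match a measured average pitch against known profiles
// using the configured delta (pitch=30 in custom.ini).
var PITCH_DELTA = 30;
function matchProfile(profiles, avgPitch){
  for (var i = 0; i < profiles.length; i++){
    if (Math.abs(profiles[i].pitch - avgPitch) <= PITCH_DELTA){
      return profiles[i].name; // voice close enough to a known profile
    }
  }
  return null; // unknown voice
}
```

For example, with profiles Bob (140) and Alice (230), a measured pitch of 150 matches Bob, while 190 matches nobody.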
See also the full list of HTTP parameters to control client features. For instance, the listen parameter can switch listening on or off.
The RTP server listens for and recognizes incoming audio through the network. Here is the ffmpeg command line to stream the Kinect microphone audio over RTP:
```shell
ffmpeg -f dshow -i audio="Réseau de microphones (Kinect U" -ar 16000 -acodec pcm_s16le -f rtp rtp://127.0.0.1:7887
```
It should also work from a Raspberry Pi using the right command line.
```ini
; Local RTP Client 7887
rtpport=7887
```
The File Watcher monitors the /audio folder. Files added to it are recognized.
- The HTTP engine can also store incoming uploads in this folder
- The Mail plugin can also download attachments to this folder
```ini
; Path to audio folder to watch
audio=audio
```
For the Kinect client, the framework provides a way to compare a given text against the client's grammars. The text is compared using phonetic matching.
```
http://127.0.0.1:8888?emulate=text to recognize
```
For instance, the Android app performs local speech-to-text, then sends the resulting text to the client, which checks it against its grammars.
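A caller just needs to URI-encode the text before building the request; the host and port come from the example URL above, and the sample sentence is illustrative:

```javascript
// Build the emulate request URL; the recognized text must be URI-encoded
var text = 'quelle heure est il';
var url = 'http://127.0.0.1:8888?emulate=' + encodeURIComponent(text);
console.log(url); // → http://127.0.0.1:8888?emulate=quelle%20heure%20est%20il
```

The resulting URL can then be fetched with any HTTP client (e.g. Node's http.get) to trigger the phonetic matching on the client.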