Voice Recognition - JpEncausse/SARAH-Documentation GitHub Wiki

The official documentation has been moved to http://wiki.sarah.encausse.net/


WSRSpeechManager instantiates multiple WSRSpeechEngine instances, each wrapping a Microsoft SpeechRecognitionEngine. It relies on two libraries: System.Speech.* and Microsoft.Speech.* (for Kinect).

In a Nutshell

The SpeechEngine loads XML grammar files and matches spoken sentences against them at a given confidence level. For each match, an HTTP request is sent to the server.

  • Audio input is provided by a device, files, the network, ...
  • A grammar can trigger an instant action
  • A grammar can be executed in a given context
  • A grammar can be rewritten / reloaded

Set the only property to true to disable all other Kinect features and reduce CPU usage.

Limitations

  • Each engine works with only one language.
  • The audio framework is automatically reloaded every hour.

Configuration

; Hot replace SARAH name
name=SARAH

; Speech engine language
language=fr-FR

; Speech only: do not start other features (for low CPU)
only=false

; Speech 1st word confidence (aka SARAH)
trigger=0.8

; Speech overall confidence
confidence=0.70

; Restart engine every X millisecond (1000 x 60 x 60 = 3600000 = 1 hour)
restart=3600000

[directory]
; Path to XML Grammar directories
directory1=macros 
directory2=plugins

Set up another language

To use SARAH in another language:

  • Set speech engine language: language=en-US
  • Copy and rename grammar.xml into grammar_en_US.xml (convention)
  • Translate commands and set attribute xml:lang="en-US"
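Following the convention above, a minimal grammar_en_US.xml might start like this (a sketch; the rule and phrases are illustrative):

```xml
<!-- grammar_en_US.xml : translated copy with xml:lang updated to en-US -->
<grammar version="1.0" xml:lang="en-US" mode="voice" root="ruleTime"
         xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
  <rule id="ruleTime" scope="public">
    <example>Sarah what time is it?</example>
    <item>Sarah</item>
    <item>what time is it</item>
  </rule>
</grammar>
```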

Grammar

The Microsoft Speech Platform SDK 11 supports grammar files that use Extensible Markup Language (XML) elements and attributes, as specified in the World Wide Web Consortium (W3C) Speech Recognition Grammar Specification (SRGS) Version 1.0. These XML elements and attributes represent the rule structures that define the words or phrases (commands) recognized by speech recognition engines.

<grammar version="1.0" xml:lang="fr-FR" mode="voice" root="ruleTime" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
  <rule id="ruleTime" scope="public">
    <example>Sarah il est quelle heure ?</example>
    <item>Sarah</item>
    <one-of>
      <item>il est quelle heure</item>
      <item>quelle heure est il</item>
    </one-of>
    <item repeat="0-1">
      <one-of>
        <item>à NewYork</item>
        <item>à Paris</item>
      </one-of>
    </item>
  </rule>
</grammar>

The MSDN Grammar Elements documentation describes the Microsoft implementation and includes a Solitaire card game example.

HTTP Request

SARAH extends the Microsoft grammar format with HTTP parameters. When an XML element matches, the linked HTTP request is executed.

<grammar version="1.0" xml:lang="fr-FR" mode="voice" root="ruleMeteo" xmlns="http://www.w3.org/2001/06/grammar" tag-format="semantics/1.0">
  <rule id="ruleMeteo" scope="public">
    <example>Sarah quelle est la météo pour demain ?</example>
    <tag>out.action=new Object(); </tag>
     
    <item>Sarah</item>
    <one-of>
      <item>quelle est la météo</item>
      <item>est-ce qu'il pleut</item>
      <item>comment dois-je m'habiller</item>
    </one-of>
 
    <item repeat="0-1">
      <one-of>
        <item>aujourd'hui<tag>out.action.date="1";</tag></item>
        <item>en ce moment<tag>out.action.date="1";</tag></item>
        <item>après demain<tag>out.action.date="4";</tag></item>
      </one-of>
    </item>
     
    <tag>out.action._attributes.uri="http://127.0.0.1:8080/sarah/phantom/meteo";</tag>
  </rule> 
</grammar>

An object named action is created.

  • Each property attached to action becomes an HTTP request parameter. (e.g. out.action.date="1";)
  • The uri attribute attached to action defines the request URI. (e.g. out.action._attributes.uri="http://127.0.0.1:8080/sarah/phantom/meteo";)

This will send the given request:

http://127.0.0.1:8080/sarah/phantom/meteo?date=1
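The mapping from the matched semantics to the request URL can be sketched as follows (a minimal illustration, not SARAH's actual client code; the buildRequestUrl helper is hypothetical):

```javascript
// Sketch: turn the matched "action" semantics into an HTTP request URL.
// Properties become query parameters; _attributes.uri is the base URI.
function buildRequestUrl(action) {
  var uri = action._attributes.uri;        // request URI from _attributes
  var params = [];
  for (var key in action) {
    if (key === '_attributes') continue;   // attributes are not parameters
    params.push(encodeURIComponent(key) + '=' + encodeURIComponent(action[key]));
  }
  return params.length ? uri + '?' + params.join('&') : uri;
}

var action = {
  _attributes: { uri: 'http://127.0.0.1:8080/sarah/phantom/meteo' },
  date: '1'
};
console.log(buildRequestUrl(action));
// → http://127.0.0.1:8080/sarah/phantom/meteo?date=1
```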

And logs the given Semantics:

<SML text="Sarah quelle est la météo aujourd'hui" utteranceConfidence="0.855" confidence="0.855">
  <action confidence="0.998" uri="http://127.0.0.1:8080/sarah/phantom/meteo">
    <date confidence="0.998">1</date>
  </action>
</SML>

Troubleshooting

  • Grammar files must follow XML encoding: & must be written as &amp;
  • Action values must follow HTTP (URL) encoding

Example of applying both encodings to a value:

<tag>out.action.param1=encodeURIComponent("Sam &amp; Max")</tag>

Attributes

SARAH extends the Microsoft grammar format with custom attributes. Like uri, these attributes trigger instant behavior:

<tag>out.action._attributes.tts="Bonjour le monde";</tag>

List

Name        Values           Description
uri         URI (http://)    Define the HTTP request URI
tts         String           Trigger instant TTS
notts       Boolean          Stop text-to-speech
dictation   Boolean or lang  Send the audio to Google Speech to Text
play        Path or URI      Play a local or streamed MP3/WAV/WMA
picture     Boolean          Upload a photo taken by Kinect (main sensor only)
threashold  Float            Override the default confidence
context     List             Activate a comma-separated list of grammars
listen      Boolean          Stop/start listening
restart     Boolean          Restart the speech engine
height      Boolean          Say the current user's height in meters

Context (Lazy)

In the real world, some grammars do not need to be loaded at startup and only make sense in a given context.

For example, the XBMC plugin provides two grammars:

  • The first starts/stops XBMC and activates the second grammar.
  • The second handles all XBMC commands.

Activation

  • Rules starting with "lazy" are not loaded
    • in the rule name (in the XML)
    • in the file name
  • A grammar attribute can trigger a context change
    • out.action._attributes.context = "chatterbot.xml"
    • out.action._attributes.context = "default"
  • The HTTP server can also receive an HTTP request to do the same
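Building such a context-switch request can be sketched as follows (a hypothetical illustration: the ?context= query parameter name mirrors the grammar attribute above, and the client port is taken from the Emulate Recognition example below; verify against your client's actual HTTP parameters):

```javascript
// Hypothetical sketch: build the client URL that activates a grammar list.
// The grammar file names are joined with commas, as in the context attribute.
function contextUrl(host, grammars) {
  return 'http://' + host + ':8888?context=' + encodeURIComponent(grammars.join(','));
}

console.log(contextUrl('127.0.0.1', ['xbmc.xml']));
// → http://127.0.0.1:8888?context=xbmc.xml
```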

custom.ini

; Reset grammar to default after given timeout (millis)
ctxTimeout=60000

[context]         
; XML Grammar files to load (instead of all)

The custom.ini file also contains a section listing the grammars to load at startup.

Wildcard (Google)

Microsoft grammars are very efficient but use closed statements. In real life, some grammars need wildcards. See also Google vs Microsoft.

SARAH recherche * sur Wikipedia

The Microsoft <ruleref special="GARBAGE" /> tag is used to bypass unknown audio between two known words. The dictation attribute then sends this audio to Google's Speech to Text service.

<tag>out.action=new Object(); </tag>
    
<item>Sarah recherche</item>
<ruleref special="GARBAGE" />
<item>wikipedia</item>
    
<tag>out.action._attributes.uri="http://127.0.0.1:8080/sarah/dictionary";</tag>
<tag>out.action._attributes.dictation="true";</tag>

The value of the dictation attribute can be true/false or the language to detect (e.g. en-US).

The server-side script must parse the returned string to extract the words between "recherche" and "Wikipedia":

exports.action = function(data, callback, config, SARAH){
  var search = data.dictation;
  // Match the words spoken between "recherche" and (an optional "sur") "Wikipedia"
  var rgxp = /Sarah recherche (.+?)(?: sur)? Wikipedia/i;

  var match = search.match(rgxp);
  if (!match || match.length <= 1){
    return callback({'tts': "Je ne comprends pas"});
  }
  search = match[1];
  ...
}

Pitch

The audio of each speech recognition is processed to compute its pitch. An average pitch is computed by the ProfileManager to recognize people.

A male voice has a pitch between roughly 100 and 200 Hz; a female voice is above 200 Hz. The pitch property is the threshold delta used to compare users.

; Pitch delta to define a voice
pitch=30
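The role of the pitch delta can be sketched as follows (an illustration of the idea, not ProfileManager's actual code; the function name is hypothetical):

```javascript
// Sketch: a measured pitch matches a known user's voice when it falls
// within the configured delta (pitch=30 in the config above).
function isSameVoice(knownPitch, measuredPitch, delta) {
  return Math.abs(knownPitch - measuredPitch) <= delta;
}

console.log(isSameVoice(150, 170, 30)); // male voice, within the 30 Hz delta → true
console.log(isSameVoice(150, 230, 30)); // outside the delta, likely another user → false
```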

Speech Engine

See also the HTTP parameters that control all client features. For instance, the listen parameter switches listening on and off.

RTP Server

The RTP server listens for and recognizes incoming audio over the network. Here is the ffmpeg command line to stream Kinect audio from one machine to another:

ffmpeg  -f dshow  -i audio="Réseau de microphones (Kinect U"   -ar 16000 -acodec pcm_s16le -f rtp rtp://127.0.0.1:7887

It should also work from a Raspberry Pi with the appropriate command line.

custom.ini

; Local RTP Client 7887
rtpport=7887

File Watcher

The File Watcher monitors the /audio folder. Files added to it are recognized.

  • The HTTP engine can also store incoming uploads in this folder
  • The Mail plugin can also download attachments to this folder

custom.ini

; Path to audio folder to watch
audio=audio
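The watcher's selection step can be sketched as filtering for recognizable audio files (the extension list is an assumption, based on the formats named for the play attribute; the function is hypothetical):

```javascript
// Sketch: pick out the audio files a watcher would hand to the recognizer.
// MP3/WAV/WMA are the formats mentioned for the play attribute.
function isAudioFile(name) {
  return /\.(wav|mp3|wma)$/i.test(name);
}

var files = ['mail.wav', 'notes.txt', 'Upload.MP3'];
console.log(files.filter(isAudioFile)); // [ 'mail.wav', 'Upload.MP3' ]
```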

Emulate Recognition

For Kinect, the framework provides a way to compare a given text against the client's grammars. The text is compared using phonetic similarity.

http://127.0.0.1:8888?emulate=text to recognize

For instance, the Android app performs local speech-to-text, then sends the result to the client, which checks it against its grammars.
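Since the text travels in the query string, it must be URL-encoded before being sent; a minimal sketch (the helper function is hypothetical, host and port come from the example above):

```javascript
// Sketch: build the emulate request for a given sentence.
// Encoding the spaces and accents is the important part.
function emulateUrl(text) {
  return 'http://127.0.0.1:8888?emulate=' + encodeURIComponent(text);
}

console.log(emulateUrl('quelle heure est il'));
// → http://127.0.0.1:8888?emulate=quelle%20heure%20est%20il
```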
