Vosk Tutorial and Example - ECE-180D-WS-2023/Knowledge-Base-Wiki GitHub Wiki

Vosk: Near-Realtime Offline Speech Recognition

Virtual assistants such as Siri and Alexa have become an integral part of our daily lives. For example, users can make phone calls, send messages, play music, set reminders, and access information with just a few voice commands. In addition, voice assistants like Alexa have been used by some people with disabilities who have difficulty using a traditional touch interface [1] (https://www.pbs.org/wgbh/nova/article/people-with-disabilities-use-ai-to-improve-their-lives/).

These AI assistants rely on complex speech recognition algorithms to transcribe and interpret our speech and respond appropriately. However, the most fundamental task they must accomplish is detecting their 'wake word' and the following phrase to preprocess and send the relevant data to a cloud server [2] (https://towardsdatascience.com/how-amazon-alexa-works-your-guide-to-natural-language-processing-ai-7506004709d3).

Although these voice assistants are very impressive, they require an internet connection to process the complex instructions that are given to them. But what if you wanted to use speech recognition in an offline environment for your capstone project? Well...you're in luck! There is a ready-made solution for speech recognition called Vosk that doesn't require much technical skill and works offline. Although its models are generally less accurate than those used by the tech giants' voice assistants, Vosk is powerful enough for many uses, including our capstone design course.

What is Vosk?

Vosk is a free, open-source, offline speech recognition toolkit built on top of Kaldi. It has bindings for many languages and platforms, including Android, C, iOS, and Python, and it provides models for more than 20 languages and dialects. For simplicity, we will be using Python for this tutorial.

Some Pros:	Some Cons:
Simple to Use	Some words may not be in the model dictionary
Low Latency
Good Accuracy

Install Vosk Python

Note: We recommend that you work in a virtual environment with venv or anaconda/miniconda. Don't forget to enter the virtual environment before beginning!

You can install vosk using pip with the command:

pip install vosk

If the pip install command does not work properly, the libraries can be installed with conda instead:

conda install vosk

Note: On Windows, with Python version 3.7 or greater, PyAudio may fail to install with pip. If it does, use conda.

If you wish to use a microphone instead of a prerecorded .wav file, then you will need sounddevice and pyaudio. A microphone is required to continue with the tutorial, so we highly recommend installing them. Both packages can be installed with pip with the commands:

pip install sounddevice 
pip install pyaudio

First Test with Example Code

To verify the installation, clone the vosk-api GitHub repo and navigate to ./vosk-api/python/example. To test real-time speech recognition, run the following command:

python test_microphone.py

You will notice that vosk-model-small-en-us-0.15.zip will download. This model works well for most use cases, as its accuracy is high for the amount of storage it needs. However, if you find the accuracy of the model unacceptable, you can download a larger model online and trade speed for accuracy. If the larger model is still unacceptable, you can create and train your own model by following the instructions at the bottom of the Models page of the official documentation. We will leave this exercise to the reader.

Making a Custom Speech Recognition Script

Now let's make a script that detects a pre-specified set of keywords. Create a new file with the editor of your choice in the directory of your choice. We recommend that you follow along step-by-step with the tutorial to understand the code more thoroughly, but if you are in a rush, the complete code is at the end of this section.

We are first going to add our imports:

import queue
import sys
import sounddevice as sd
import json

from vosk import Model, KaldiRecognizer, SetLogLevel

Next, we define a queue that will store the microphone data and the callback function that is required for sd.RawInputStream that handles adding the microphone data to the queue.

q = queue.Queue()

def callback(indata, frames, time, status):
    """
    Check for invalid status and add the indata to the queue.
    This is called (from a separate thread) for each audio block.
    """
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))
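
If you want to see the hand-off between the callback and the main thread before plugging in a microphone, the sketch below may help. It does not use sounddevice at all: a plain thread plays the role of the audio driver, and `fake_audio_thread` and the silent blocks are made up purely for illustration.

```python
import queue
import threading

q = queue.Queue()

def callback(indata, frames, time, status):
    """Same shape as the sounddevice callback: push raw bytes onto the queue."""
    if status:
        print(status)
    q.put(bytes(indata))

def fake_audio_thread(n_blocks, frames_per_block):
    """Hypothetical stand-in for the audio driver: calls callback n_blocks times."""
    for _ in range(n_blocks):
        # int16 mono audio uses 2 bytes per frame; silence is all zero bytes
        silent_block = bytearray(2 * frames_per_block)
        callback(silent_block, frames_per_block, None, None)

producer = threading.Thread(target=fake_audio_thread, args=(3, 8000))
producer.start()
producer.join()

# The main thread consumes the blocks in the order they were produced
blocks = [q.get() for _ in range(3)]
print([len(b) for b in blocks])  # [16000, 16000, 16000]
```

The queue is what makes this safe: the audio thread only ever calls `q.put`, and the main thread only ever calls `q.get`.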

Now we will list our desired words. Make sure that all words in special_words are lowercase and are str type since the recognizer only outputs in lowercase. Define the words that you want to detect in the format:

DESIRED_WORD : COMMON_ERROR_LIST

COMMON_ERROR_LIST is found experimentally using test_microphone.py from the Vosk GitHub examples (see: First Test with Example Code).

special_words = {
    "start" : ["stuart"],
    "stop" : [],
    "pause" : [],
    "continue" : [],
    "player" : [],
    "one" : ["won"],
    "two" : ["to", "too"],
    "i'm excited" : ["i'm actually"],
}
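
One possible way to use this dictionary is to flatten it once into a reverse lookup, so that each recognized word can be resolved with a single dictionary access. This is only a sketch: `build_lookup` is a hypothetical helper, not part of Vosk, and the dictionary here is a shortened copy of the one above.

```python
special_words = {
    "start": ["stuart"],
    "stop": [],
    "one": ["won"],
    "two": ["to", "too"],
}

def build_lookup(words):
    """Map each command and each of its known misrecognitions to the command."""
    lookup = {}
    for command, errors in words.items():
        lookup[command] = command
        for err in errors:
            lookup[err] = command
    return lookup

lookup = build_lookup(special_words)
print(lookup.get("stuart"))  # start
print(lookup.get("too"))     # two
print(lookup.get("banana"))  # None
```

If the same misrecognition appears in two error lists, the later entry silently wins, so keep the error lists disjoint.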

Now we will do some initialization. Silence Vosk's logging, and initialize the Vosk model with the language of your choice. See models for the correct model name. Finally, we set up the default input device (usually your microphone) as our SoundDevice.

# Turn off Vosk CLI logging feature (on by default)
SetLogLevel(-1)

# Initialize the Vosk Model 
model = Model(lang="en-us")

# Set up the sound device using the default input device
device_info = sd.query_devices(None, "input")
device = device_info["name"]
samplerate = int(device_info["default_samplerate"])

Open the input stream with sounddevice, which then calls our callback function for every audio block to add the raw data to the queue. Then we will initialize the Kaldi recognizer. Kaldi is the open-source speech recognition toolkit that Vosk is built on. Luckily, Vosk abstracts away all the hard stuff, and the only 'machine learning' we need to do is initialize the recognizer with the model and the sample rate.

with sd.RawInputStream(samplerate=samplerate, blocksize = 8000, device=device,
        dtype="int16", channels=1, callback=callback):
    print("#" * 80)
    print("Press Ctrl+C to stop the recording")
    print("#" * 80)

    # Initialize the KaldiRecognizer from vosk.
    # For more info about Kaldi: http://kaldi-asr.org/
    rec = KaldiRecognizer(model, samplerate)
    prev_guess = ""
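
The blocksize of 8000 is counted in frames, so how much audio each callback delivers depends on the sample rate of your device. The short calculation below is a sketch of that arithmetic. The three sample rates are example values only (your device's default may differ), and `block_stats` is a hypothetical helper.

```python
def block_stats(samplerate, blocksize, bytes_per_sample=2, channels=1):
    """Return (seconds of audio per block, bytes per block)."""
    seconds = blocksize / samplerate
    size = blocksize * bytes_per_sample * channels
    return seconds, size

# int16 samples are 2 bytes each, and the stream above is mono
for rate in (16000, 44100, 48000):
    seconds, size = block_stats(rate, 8000)
    print(f"{rate} Hz: {seconds:.3f} s per block, {size} bytes")
```

A smaller blocksize means the recognizer sees new audio more often, at the cost of more callback invocations.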

Now on to the meat of the script. Inside a while True loop, get the data from the front of the queue.

# Get the top of the queue and pass through our recognizer
data = q.get()

When the recognizer decides that a phrase is complete, rec.AcceptWaveform(data) returns True and the final result is available from rec.Result(). Otherwise, we process the partial result from rec.PartialResult(), which returns JSON as a str. Convert the str to a dict and look at the last word of the "partial" value. This is the most recent word that was spoken. We then check whether it differs from the previous guess so we don't detect the same command over and over. Finally, we check the guessed word against the keys and values of the special word dictionary.

if rec.AcceptWaveform(data):
    # print(f"VOSK thought you said: {rec.Result()}")
    prev_guess = ""
else:
    # Get the partial result string in the format of a json file.
    # Convert it to a json dict and get the value of "partial"
    rtguess = rec.PartialResult()
    rtguess = json.loads(rtguess)["partial"]

    # Only take the most recently said word in phrase
    if rtguess != "": 
        rtguess = rtguess.split()[-1]

    # Skip repeated guesses so the same command is not reported twice
    if rtguess != prev_guess:
        print(f"GUESS: {rtguess} PREV: {prev_guess}")

        # Compare our real-time guess with our special words dict
        for word, errs in special_words.items():
            if rtguess == word or rtguess in errs:
                print(f"COMMAND {word} DETECTED!")
        prev_guess = rtguess
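
The parsing step can be tried without a microphone by feeding it hand-written strings in the same shape that PartialResult returns. Note that because the snippet above keeps only the last word, a multi-word key such as "i'm excited" can never match. The sketch below shows one way around that by comparing each candidate against the end of the phrase. `match_command` is a hypothetical helper and the input strings are invented.

```python
import json

special_words = {
    "start": ["stuart"],
    "two": ["to", "too"],
    "i'm excited": ["i'm actually"],
}

def match_command(partial_json, words):
    """Return the command whose text (or a known misrecognition) ends the phrase."""
    phrase = json.loads(partial_json)["partial"]
    if not phrase:
        return None
    tokens = phrase.split()
    for command, errors in words.items():
        for candidate in [command] + errors:
            cand_tokens = candidate.split()
            # Compare the candidate against the last len(cand_tokens) spoken words
            if tokens[-len(cand_tokens):] == cand_tokens:
                return command
    return None

print(match_command('{"partial": "please stuart"}', special_words))     # start
print(match_command('{"partial": "wow i\'m excited"}', special_words))  # i'm excited
print(match_command('{"partial": ""}', special_words))                  # None
```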

Complete Code

As promised, here is the code for you to copy and paste into your project!

import queue
import sys
import sounddevice as sd
import json

from vosk import Model, KaldiRecognizer, SetLogLevel


q = queue.Queue()

def callback(indata, frames, time, status):
    """
    Check for invalid status and add the indata to the queue.
    This is called (from a separate thread) for each audio block.
    """
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))


try:
    ####################################################################
    # Define the words that you want to detect in the format:
    # DESIRED_WORD : COMMON_ERROR_LIST
    # COMMON_ERROR_LIST is found experimentally using test_microphone.py
    #####################################################################
    # Make sure that all words in special_words are lowercase
    # and str type. The recognizer only outputs in lowercase
    #####################################################################
    special_words = {
        "start" : ["stuart"],
        "stop" : [],
        "pause" : [],
        "continue" : [],
        "player" : [],
        "one" : ["won"],
        "two" : ["to", "too"],
        "i'm excited" : ["i'm actually"],
    }

    # Turn off Vosk CLI logging feature (on by default)
    SetLogLevel(-1)

    # Initialize the Vosk Model 
    model = Model(lang="en-us")

    # Set up the sound device using the default input device
    device_info = sd.query_devices(None, "input")
    device = device_info["name"]
    samplerate = int(device_info["default_samplerate"])

    # Open the input stream; sounddevice calls the callback for each audio
    # block, which adds the raw data to the queue
    with sd.RawInputStream(samplerate=samplerate, blocksize = 8000, device=device,
            dtype="int16", channels=1, callback=callback):
        print("#" * 80)
        print("Press Ctrl+C to stop the recording")
        print("#" * 80)

        # Initialize the KaldiRecognizer from vosk.
        # For more info about Kaldi: http://kaldi-asr.org/
        rec = KaldiRecognizer(model, samplerate)
        prev_guess = ""

        while True:
            # Get the top of the queue and pass through our recognizer
            data = q.get()
            
            # When the recognizer decides the phrase is complete,
            # rec.AcceptWaveform returns True and the final result is
            # available from rec.Result()
            if rec.AcceptWaveform(data):
                # print(f"VOSK thought you said: {rec.Result()}")
                prev_guess = ""
            else:
                # Get the partial result string in the format of a json file.
                # Convert it to a json dict and get the value of "partial"
                rtguess = rec.PartialResult()
                rtguess = json.loads(rtguess)["partial"]

                # Only take the most recently said word in phrase
                if rtguess != "": 
                    rtguess = rtguess.split()[-1]

                # Skip repeated guesses so the same command is not reported twice
                if rtguess != prev_guess:
                    print(f"GUESS: {rtguess} PREV: {prev_guess}")

                    # Compare our real-time guess with our special words dict
                    for word, errs in special_words.items():
                        if rtguess == word or rtguess in errs:
                            print(f"COMMAND {word} DETECTED!")
                    prev_guess = rtguess

# Exit nicely on Keyboard Interrupt, else print exception name and message
except KeyboardInterrupt:
    print("\nDone")
except Exception as e:
    print(type(e).__name__ + ": " + str(e))

Vosk in Voice Activated Chess

Let us use this speech recognition code for a voice-activated chess game, one of its many potential applications. In this game, we will be looking for the user to make moves using their microphone and feeding those moves into the chess library for Python. The chess library itself is thankfully easy to use. To install it, simply run this in your command line:

pip install chess

Another option is to use the conda installer as follows if the pip install command does not work. Please refer to the previous voice processing section for more details on conda installation.

conda install chess

Here is some simple code to familiarize yourself with the library. There are many other functions that can be used to learn the state of the board and the game, and those can be found in the documentation linked below.

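import chess

# Create a board with all pieces in their starting positions
board = chess.Board()
print(board)

while True:
    move = input("Enter a move: ")
    try:
        # push_san raises ValueError for invalid or illegal SAN input
        board.push_san(move)
    except ValueError:
        print("Invalid or illegal move, try again")
        continue
    print(board)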

The code creates a new chess board object by initializing the 'Board' class, which represents a standard chess board with all of the pieces in their starting positions. The board is then printed to the console using the 'print' function, representing the board state in an 8x8 grid of characters. The upper case letters stand for the white pieces, while the lower case letters stand for the black pieces.

The program then enters into a 'while' loop that continuously prompts the user to input a chess move. The input is read as a string using the 'input' function and stored in a variable called 'move'.

The program then uses the 'push_san' method of the board object to apply the move to the board. This method takes the move input as a string in SAN (Standard Algebraic Notation) format, which is a standard way of representing chess moves. Once the move has been applied, the updated board is printed to the console again using the 'print' function.

This loop continues indefinitely, allowing the user to enter new moves and watch the resulting board state in real time.

Try running this code and experimenting with the other functions available in the chess library. These additional functions can be used to enhance the program and make it feel like a true chess game.

Typical chess notation looks like "Nf6": the piece letter comes first (it is omitted for pawn moves, as in "e4"), and the last two characters denote the ending square. Hence, to integrate with the Vosk code, the voice commands need to be broken into three distinct components: the piece, the ending file, and the ending rank. Providing confirmations along the way will prove useful, since these short, unusual words are easy for the recognizer to mishear.
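
To make the three-component idea concrete, here is a minimal sketch of turning three spoken words into a SAN string. The names `piece_dict`, `letter_list`, and `number_dict` match the ones used in the full code at the end, but their contents are our assumption, since this page never defines them. `words_to_san` is a hypothetical helper.

```python
# Assumed lookup tables (contents are a guess, not taken from the chess library)
piece_dict = {"pawn": "", "knight": "N", "bishop": "B",
              "rook": "R", "queen": "Q", "king": "K"}
letter_list = list("abcdefgh")
number_dict = {"one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8"}

def words_to_san(piece, file, rank):
    """Join three spoken words into a SAN string such as 'Nf6' or 'e4'."""
    if piece not in piece_dict:
        raise ValueError(f"unknown piece: {piece}")
    if file not in letter_list:
        raise ValueError(f"unknown file: {file}")
    if rank not in number_dict:
        raise ValueError(f"unknown rank: {rank}")
    return piece_dict[piece] + file + number_dict[rank]

print(words_to_san("knight", "f", "six"))  # Nf6
print(words_to_san("pawn", "e", "four"))   # e4
```

This simple scheme only covers plain moves. Captures by pawns, castling, promotions, and moves that need disambiguation (two knights that can reach the same square) would need extra command words.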

Modifying the original speech collection code, we want to add different words to be recognized, going into the "special_words" dictionary:

special_words = {
        "a":[],
        "b":["be", "the"],
        "c":["see", "sea", "she"],
        #etc. (full dictionary included at the end)
    }

As you may notice, this dictionary maps each command word to the words that are commonly mistaken for it. This cloud of near-matches allows more room for error in the speech detection while still detecting the correct commands.

One approach for receiving the commands is by breaking down the intended move into the aforementioned three component parts for detection. This can be done inside the "while True" loop from the previous speech detection code, like so:

                    # Compare our real-time guess with our special words dict
                    for word, errs in special_words.items():
                        if (rtguess == word or rtguess in errs) and prev_guess != rtguess:
                            # check if the piece command word is valid

                            # check if the file command is valid

                            # check if the rank command is valid
                    # (full code included at the end)

The most recently recognized word is stored in the variable 'rtguess'. The code first loops through the dictionary called 'special_words', which maps each chess command word to the words the recognizer commonly mistakes for it. This full dictionary is included at the end for reference.

If the recognized word matches one of the command words or one of its known misrecognitions, and it is different from the previous guess, then the code proceeds to execute the corresponding action for that command.

If the player has entered a valid piece name in their first slot, the code stores the piece name and prints a message indicating that the piece has been detected. If the player has entered a valid file name in their second slot, the code stores the file name and prints a message indicating that the file has been detected. If the player has entered a valid rank number in the third slot, the code stores the rank number and prints a message indicating that the rank has been detected.

Once the player has entered all three parts of their move, the code concatenates the three parts of the move into a single command string and applies the move to the chess board using the 'push_san' method. The updated board is then printed to the console.

The code also switches the turn to the next player and prints a message indicating which player is to move next. Finally, the code resets the move command and clears the input string. If any part of the player's input is invalid, the code prints an error message indicating which part of the input is invalid.

The output of this addition should look like this:

r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
P P P P P P P P
R N B Q K B N R
################################################################################


#Command instructions: Say the Piece, Ending file, then Ending rank


################################################################################


################################################################################
#White to move
################################################################################
COMMAND pawn DETECTED!
COMMAND e DETECTED!
COMMAND four DETECTED!
e4
r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . P . . .
. . . . . . . .
P P P P . P P P
R N B Q K B N R
################################################################################
#Black to move
################################################################################

(And so on).

Wrapping Up

Congratulations on embarking on the journey to master speech recognition! This incredible technology holds boundless potential, ranging from empowering the disabled community to effortlessly checking the weather and beyond. As time progresses, speech recognition will undoubtedly become even faster and more precise, opening up endless possibilities for its application. We sincerely hope that this tutorial has equipped you with valuable insights, enabling you to leverage the power of speech recognition with confidence.

Chess Game Code

Integrate the chess code with the speech processing to get Voice Activated Chess. This is a simple exercise and is left to the reader as such. The previous code with voice activation coupled with command detection and the chess engine is one of the ways that speech detection and processing can be used to enhance existing activities.

special_words = {
        "a":["hay", "hey"],
        "b":["be", "the"],
        "c":["see", "sea", "she"],
        "d":["de", "t"],
        "e": ["he"],
        "f": ["after", "as", "if"],
        "g": ["gee"],
        "h": ["is"],
        "one":["won", "when"],
        "two" : ["to", "too", "do"],
        "three" : ["free"],
        "four" : ["for", "thor"],
        "five" : ["phi"],
        "six" : [],
        "seven" : [],
        "eight" : [],
        "knight" : ["night"],
        "pawn" : ["on", "bon", "palm", "pollen", "point", "pine", "paul", "pom", "time"],
        "bishop" : ["fisher", "should"],
        "rook" : ["look", "rock"],
        "queen" : ["green", "queen's", "clean"],
        "king" : ["kim", "kill"]
    }


                    # Compare our real-time guess with our special words dict
                    for word, errs in special_words.items():
                        if (rtguess == word or rtguess in errs) and prev_guess != rtguess:
                            prev_guess = rtguess
                            if rtguess != '':
                                # Check if the piece command word is valid
                                if i%3 == 0:
                                    try:
                                        piece = piece_dict[word]
                                        command[(i%3)] = piece
                                        print(f"COMMAND {word} DETECTED!")
                                        i = i+1
                                    except KeyError:
                                        # Not a piece name; ignore it
                                        print('')

                                # Check if the file command is valid
                                elif i%3 == 1:
                                    if word in letter_list:
                                        command[(i%3)] = word
                                        print(f"COMMAND {word} DETECTED!")
                                        i = i+1
                                    else:
                                        # Not a file letter; ignore it
                                        print('')

                                # Check if the rank command is valid
                                elif i%3 == 2:
                                    try:
                                        number = number_dict[word]
                                        print(f"COMMAND {word} DETECTED!")
                                        command[(i%3)] = number

                                        # Join piece, file, and rank into one SAN string
                                        command_string = "".join(command).replace(" ", "")
                                        print(command_string)
                                        board.push_san(command_string)
                                        print(board)

                                        # Switch turns
                                        turn = not turn
                                        print("#" * 80)
                                        if turn:
                                            print("White to move")
                                        else:
                                            print("Black to move")
                                        print("#" * 80)

                                        i = i+1

                                        # Clear command
                                        command = ["", "", ""]
                                        command_string = ""
                                    except (KeyError, ValueError):
                                        # Not a rank word, or the board rejected the move
                                        print('')
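
The fragment above relies on several names that this page never defines (`piece_dict`, `letter_list`, `number_dict`, `i`, `command`, `turn`, and `board`). As a starting point for the exercise, here is a minimal sketch of the slot-tracking state on its own, written so that it runs without a microphone or the chess library. The table contents and the `MoveCollector` class are our assumptions, not the original authors' code.

```python
# Assumed lookup tables: the names come from the fragment above, the contents are a guess
piece_dict = {"pawn": "", "knight": "N", "bishop": "B",
              "rook": "R", "queen": "Q", "king": "K"}
letter_list = list("abcdefgh")
number_dict = {"one": "1", "two": "2", "three": "3", "four": "4",
               "five": "5", "six": "6", "seven": "7", "eight": "8"}


class MoveCollector:
    """Hypothetical helper that collects piece, file, and rank in that order."""

    def __init__(self):
        self.command = ["", "", ""]
        self.i = 0  # i % 3 selects the slot: 0 = piece, 1 = file, 2 = rank

    def feed(self, word):
        """Accept one command word; return a SAN string once all slots are full."""
        slot = self.i % 3
        if slot == 0 and word in piece_dict:
            self.command[0] = piece_dict[word]
        elif slot == 1 and word in letter_list:
            self.command[1] = word
        elif slot == 2 and word in number_dict:
            self.command[2] = number_dict[word]
        else:
            return None  # the word does not fit the current slot, so ignore it
        self.i += 1
        if slot == 2:
            san = "".join(self.command)
            self.command = ["", "", ""]  # clear the command for the next move
            return san
        return None


collector = MoveCollector()
spoken = ["pawn", "four", "e", "four", "knight", "f", "six"]
results = [collector.feed(w) for w in spoken]
print(results)  # [None, None, None, 'e4', None, None, 'Nf6']
```

In the real loop, whenever `feed` returns a string you would pass it to `board.push_san` and flip `turn`. Handling a rejected move (for example by letting the player start over) is left open here.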