
Speech Recognition and Processing in Python

Sharjeel Rahman

Introduction

Training your own speech-to-text model for a project is a daunting task, and there are many factors to consider in speech signal processing (noise, reverb, volume, microphone distance, and so on) that will determine the performance of the model. Thankfully, the SpeechRecognition library for Python, built by Anthony Zhang, simplifies this task. In this article, we will learn how to perform speech-to-text prediction in Python and what to consider when creating voice commands for your project.

Background on Speech Signals

In theory, speech signals are simple: if we take the Fourier transform of a recording and plot frequency against time (a spectrogram), we can break the signal down into its consonants, vowels, fricatives, and stops. With this in mind, the word cloud of speech commands you create should take each word's spectral structure into account. Below is a plot of frequency vs. time for several vowels.

[Figure: spectrogram of vowels, with formants F1 and F2 marked]

We can easily see the formants (the thick clusters of frequencies marked F1 and F2), and from these we can deduce which sound is being produced and where the stops between words fall. Formants are directly related to the size and shape of the vocal tract, which produces the frequency response shown above. By analyzing these formants, we gain information about the type of sound and speech being produced. As such, when building your word cloud, choose words with very different spectral structure.
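For reference, here is a minimal sketch of how you might plot such a spectrogram yourself with SciPy and Matplotlib. The filename vowel.wav is a hypothetical recording, and the plot assumes mono audio.

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a mono recording (vowel.wav is a placeholder filename)
rate, samples = wavfile.read("vowel.wav")

# Compute the short-time Fourier transform (frequency vs. time)
f, t, Sxx = spectrogram(samples, fs=rate)

# Plot power in dB; formants appear as dark horizontal bands
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))
plt.xlabel("Time [s]")
plt.ylabel("Frequency [Hz]")
plt.show()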

Getting Started

Installing

To install the library, run the following command with pip:

pip install SpeechRecognition

PyAudio is required only if you need microphone input:

pip install pyaudio

For further instructions on installation, see the PyPI documentation linked in the references.

Using the Library

In this section, we will write a simple script that performs speech-to-text inference; the same script will also be used in the next section on creating clean data. The entire script can be found at the bottom of the article.

To begin, we import the libraries used and create an instance of our recognizer.
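Mirroring the full script at the end of the article:

import speech_recognition as sr

recognizer = sr.Recognizer()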

Then, we decide on which recognizer to use. My recommendation would be the IBM speech-to-text recognizer or the Google Web Speech API, which is what we will set our recognizer instance to use. The library supports several other engines, and most of them require some form of setup or account creation, but the Google Web Speech API ships with a default key, so we can import the library and quickly get going. It is important to note that this recognizer requires an internet connection.

Next, we create a microphone instance; with that, we are all set up to perform speech-to-text inference. The following function can be called at any time to run an inference.
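Here is a condensed sketch of that function; the full version with error handling appears at the bottom of the article.

microphone = sr.Microphone()

def inference(recognizer, microphone):
    with microphone as source:
        # Calibrate to ambient noise, then capture one phrase
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    # Send the captured audio to the Google Web Speech API
    return recognizer.recognize_google(audio)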

The results are below. Top is "hello," and bottom is the inference on "would."


My intended word was “would,” and the recognizer produced “would.” In the next section, we will discuss ways to control which words we detect and how to manage this.

What to Consider in Your Project

As you may have noticed already, running inference on microphone input introduces some amount of error. This section focuses on useful functions for reducing noise and making the recognizer easier to work with.

Energy Thresholding

The energy_threshold property on the recognizer instance defines the minimum audio energy level for detecting speech. This can be useful for detecting speech in noisy environments or for filtering out low-volume speech signals. Energy thresholding lets you set how strict the recognizer is about when it starts recognizing: lower values make it more sensitive to quiet sounds, while higher values make it ignore more ambient noise. It is worth tweaking and testing values to see what best filters out the signals your microphone picks up that you do not want. You can either set the threshold manually, e.g. recognizer.energy_threshold = 300 (the library's documented default), or use recognizer.adjust_for_ambient_noise(source), which automatically adjusts the energy threshold based on the noise level in the environment.
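Both approaches look like this side by side; the threshold value shown is just the library default, and you would tune it for your environment.

with microphone as source:
    # Option 1: listen to the room for a moment and calibrate automatically
    recognizer.adjust_for_ambient_noise(source, duration=1)

# Option 2: set the threshold by hand (lower = more sensitive)
recognizer.energy_threshold = 300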

Noise Reduction

Although this is somewhat covered by energy thresholding, there are further methods from outside libraries that help reduce noise. Energy thresholding ensures the microphone triggers at the correct time, but it does not account for static microphone noise. A filter I can recommend, and have worked with before, is the Savitzky-Golay filter, which is easily importable from the SciPy toolkit.
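As a sketch of how this could be wired into the pipeline, here is one way to smooth captured audio before recognition. The window length and polynomial order are illustrative values you would tune, and the code assumes 16-bit mono audio.

import numpy as np
import speech_recognition as sr
from scipy.signal import savgol_filter

def denoise(audio):
    # Unpack the captured AudioData into 16-bit integer samples
    samples = np.frombuffer(audio.get_raw_data(), dtype=np.int16)
    # Smooth the waveform; window_length and polyorder are tuning knobs
    smoothed = savgol_filter(samples, window_length=11, polyorder=3)
    # Repackage so the result can still be passed to recognize_google()
    return sr.AudioData(smoothed.astype(np.int16).tobytes(),
                        audio.sample_rate, audio.sample_width)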

Confidence Value

When the inference is complete, you'll see a list of candidate words alongside the confidence values the model produces. If we are attempting to predict the word “would,” we may see similarly structured words such as “wood,” “could,” “woot,” etc., with varying confidence. Because of this, you may find success in setting a specific threshold value for each word: I would set a threshold of 0.XX and accept the command only if “would” scores above that confidence value, as sketched below. However, using words that sound like other common words will be detrimental, and making sure each of your words has a unique structure will make the inference procedure far more reliable.
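A sketch of the response shape and a threshold check, assuming show_all=True (covered in the next section); the transcripts and confidence values here are made up for illustration.

# Illustrative shape of a Google Web Speech API response
result = {
    "alternative": [
        {"transcript": "would", "confidence": 0.87},
        {"transcript": "wood"},
        {"transcript": "could"},
    ],
    "final": True,
}

best = result["alternative"][0]
# Accept the command only if the top guess clears our threshold
if best["transcript"] == "would" and best.get("confidence", 0) >= 0.80:
    print("Command accepted")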

Word Cloud

If you are using speech recognition for something like the settings menu of a game, it is more practical to define a word bank of acceptable phrases. To enable this, set the show_all parameter to True in the call to the Google API, so that it returns the other candidate transcriptions in addition to the most likely one.

response["transcription"] = recognizer.recognize_google(audio, show_all=True)

The next step is to pull out all of the possible transcriptions. The code below goes through the entire list and appends each phrase (without its confidence level) to an array called choices.

guess = inference(recognizer, microphone)

# Retry until we actually get a transcription back
while not guess["transcription"]:
    guess = inference(recognizer, microphone)

choices = []
length = len(guess["transcription"]["alternative"])

for i in range(0, length):
    choice = guess["transcription"]["alternative"][i]
    choices.append(choice["transcript"])

From here, depending on your use of speech recognition, you can define a word bank of acceptable words. For example, for a game's settings, some words I would like the user to say are “stop”, “back”, and “select”. These can be saved in a list, which I called acceptable_words. Then, when a phrase is spoken, the code below walks through the transcriptions from most likely to least and checks whether each one is in the word bank. One thing to be cautious of is capitalization: it is best to print out all the alternatives to check whether the API capitalized them, and including both cases in your word bank is a safe bet.

acceptable_words = ["stop", "back", "select"]

for word in choices:
    # Note: this comparison is case-sensitive
    if word in acceptable_words:
        print(f"Recognized word: {word}")
        break

Listening Time

A useful adaptation for practical applications is adjusting the listening period. This is done through the parameters of the line audio = recognizer.listen(source). One important parameter is the second one, timeout, which defines how many seconds the device will listen for speech to begin before giving up and raising a speech_recognition.WaitTimeoutError exception. For example, if you want the device to listen for 10 seconds, modify the line to audio = recognizer.listen(source, timeout=10). This is useful for debugging and for situations where you want to avoid the speech detector picking up background noise or other sounds that are not part of the speech input.

Another parameter you can adjust is phrase_time_limit, which defines the maximum duration in seconds of a single phrase or sentence. If a phrase exceeds this duration, the device stops listening and returns the part of the phrase recognized before the limit was reached. To use only this parameter and not timeout, pass None for the timeout in recognizer.listen(); for example, audio = recognizer.listen(source, None, 10) sets the phrase time limit to 10 seconds while disabling the timeout. This is useful when working with short phrases or sentences: the recognizer may otherwise continue picking up background noise after your phrase has been spoken, so capping the phrase length reduces processing time and can make the speech recognition system both more accurate and more efficient.
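Both parameters can also be combined, using keyword arguments for readability; the values here are illustrative.

with microphone as source:
    # Wait up to 10 s for speech to start, then capture at most 5 s of it
    audio = recognizer.listen(source, timeout=10, phrase_time_limit=5)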

Conclusion

In this article, we learned how to perform speech-to-text inference on spoken input, along with considerations for your project such as energy thresholding, confidence values, and noise reduction. Adjusting these parameters can help you optimize the performance of your speech recognition system and make it more robust and accurate for your specific use case. The full script used in this article is below.

Code


import speech_recognition as sr
import pyaudio  # required by sr.Microphone for microphone access

def inference(recognizer, microphone):
    with microphone as source:
        # Calibrate to ambient noise, then capture one phrase
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    response = {
        "success": True,
        "error": None,
        "transcription": None
    }

    try:
        # Send the captured audio to the Google Web Speech API
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # The API was unreachable or unresponsive
        response["success"] = False
        response["error"] = "API Unavailable"
    except sr.UnknownValueError:
        # Audio was captured but could not be recognized
        response["error"] = "Unable to recognize speech"

    return response

r = sr.Recognizer()
mic = sr.Microphone()

words = inference(r, mic)
print(words["transcription"])

References:

Speech Recognition Documentation: https://pypi.org/project/SpeechRecognition/

Savitzky-Golay Filter: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html

Wikipedia Formant: https://en.wikipedia.org/wiki/Formant