Speech Recognition Using TensorFlow - 180D-FW-2024/Knowledge-Base-Wiki GitHub Wiki
Automatic Speech Recognition (ASR) transforms spoken language into text, powering applications like voice assistants, smart devices, and transcription tools. ASR operates by extracting audio features, processing them through machine learning models, and generating text outputs. For instance, a spoken word like "hello" is converted into text data.
This tutorial demonstrates how to create a simple ASR system using TensorFlow. We will use the Mini Speech Commands dataset, which contains audio files for common speech commands such as "yes," "no," "up," and "down." This example will guide you through preprocessing audio data into spectrograms, building a Convolutional Neural Network (CNN), and training the model to recognize commands.
Spectrograms are a visual representation of the frequency content of a signal over time. They are particularly useful for analyzing audio signals, such as speech or music, because they show how different frequency components (like pitch and tone) vary over the duration of the sound.
- Convert Time Domain to Frequency Domain:
  - Break the audio signal into small segments and compute the frequencies in each segment using a short-time Fourier transform (STFT).
- Calculate Magnitude:
  - Fourier transforms produce complex numbers, so we work with the magnitude to represent the strength of each frequency.
- Color Mapping:
  - Time runs along the x-axis, and frequency along the y-axis.
  - Color intensity represents amplitude.
In Figure 1 below, you can see an example of a simple spectrogram.
Figure 1: Example of a Spectrogram showing frequency changes over time
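If you would like to generate a similar picture yourself, here is a minimal, self-contained sketch (not part of the tutorial's pipeline; the synthetic chirp signal and sample rate are arbitrary choices for illustration) using Matplotlib's built-in specgram helper:

import numpy as np
import matplotlib.pyplot as plt

fs = 16000                                        # sample rate in Hz (same rate as the dataset used later)
t = np.linspace(0, 1.0, fs, endpoint=False)       # 1 second of time steps
signal = np.sin(2 * np.pi * (200 + 400 * t) * t)  # a simple chirp whose frequency rises over time

# specgram applies an STFT, takes magnitudes, and color-maps them (time on x, frequency on y)
plt.specgram(signal, NFFT=256, Fs=fs, noverlap=128)
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.title('Spectrogram of a synthetic chirp')
plt.show()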
In the context of speech recognition, spectrograms allow us to capture critical information about the audio signal. Using its spatial pattern recognition capabilities - or in plain English, its ability to detect shapes and details - the CNN can interpret the spectrograms to classify audio commands accurately.
What are CNNs?
- CNNs are a type of neural network widely used in image recognition. This type of neural network excels in identifying spatial patterns, such as edges and textures in data.
Key Components:
- Convolutional Layers: Identify local features in the input (e.g., patterns in a spectrogram).
- Pooling Layers: Reduce dimensionality and emphasize important features.
- Fully Connected Layers: Interpret extracted features and assign probabilities for classification.
Figure 2: Structure of a Convolutional Neural Network showing the flow from input to classification.
Our aim is to recognize spoken words and assign, or "classify", each audio clip to one of several pre-defined, or "labeled", categories.
TensorFlow is an open-source machine learning library developed by Google. It is widely used for creating machine learning models, including deep learning models for tasks such as speech recognition. If you are interested in learning more about TensorFlow, visit the official TensorFlow documentation.
In your local environment, begin by installing and importing the necessary libraries. We'll use TensorFlow for model building, Matplotlib for visualization, and Seaborn for plotting.
!pip install -U -q tensorflow tensorflow_datasets
import os
import pathlib
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras import layers, models
from IPython import display
We use TensorFlow's get_file
function to download the Mini Speech Commands dataset if it's not already in the specified directory.
# Define the dataset path
DATASET_PATH = 'data/mini_speech_commands'
data_dir = pathlib.Path(DATASET_PATH)
# Download and extract if the dataset does not exist
if not data_dir.exists():
tf.keras.utils.get_file(
'mini_speech_commands.zip',
origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
extract=True,
cache_dir='.',
cache_subdir='data'
)
# Verify the dataset structure
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[(commands != 'README.md') & (commands != '.DS_Store')]
print('Commands:', commands)
TensorFlow's audio_dataset_from_directory
simplifies loading audio files directly from directories and labels them based on folder names. We split the dataset for training and validation.
Let's begin! The following code loads the audio files from the specified directory, splits them into training and validation sets, and sets each audio file's length to 1 second (16,000 samples).
# Load audio dataset and split it into training and validation sets
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
directory=data_dir,
batch_size=64,
validation_split=0.2,
seed=0,
output_sequence_length=16000,
subset='both'
)
After loading the dataset, we can print the class (command) names to ensure they were loaded correctly.
#Display the label names
label_names = np.array(train_ds.class_names)
print("Label names:", label_names)
Commands: ['right' 'go' 'no' 'left' 'stop' 'up' 'down' 'yes']
The audio files may have unnecessary extra dimensions (like empty channels). We can use the squeeze
function to remove the extra dimensions for consistency.
def squeeze(audio, labels):
audio = tf.squeeze(audio, axis=-1)
return audio, labels
train_ds = train_ds.map(squeeze, tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, tf.data.AUTOTUNE)
To ensure a proper evaluation after training, we split the validation set in half to create a separate test set.
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)
Let's visualize a few sample audio waveforms from the dataset to get a better understanding of the data. This plot will display a selection of audio waveforms along with their associated labels.
The results are outlined in Figure 3 below.
for example_audio, example_labels in train_ds.take(1):
print(example_audio.shape)
print(example_labels.shape)
label_names[[1,1,3,0]]
plt.figure(figsize=(16, 10))
rows = 3
cols = 3
n = rows * cols
for i in range(n):
plt.subplot(rows, cols, i+1)
audio_signal = example_audio[i]
plt.plot(audio_signal)
plt.title(label_names[example_labels[i]])
plt.yticks(np.arange(-1.2, 1.2, 0.2))
plt.ylim([-1.1, 1.1])
Figure 3: Multiple Waveforms of the Dataset Commands
To train a CNN, we need to convert the audio waveforms into spectrograms, which represent frequency over time.
We start by defining the get_spectrogram
function, which will transform each audio waveform into a spectrogram.
def get_spectrogram(waveform):
# Compute the Short-Time Fourier Transform (STFT)
spectrogram = tf.signal.stft(
waveform, frame_length=255, frame_step=128
)
# Get the magnitude of the STFT
spectrogram = tf.abs(spectrogram)
# Add a channel dimension for compatibility with CNNs
spectrogram = spectrogram[..., tf.newaxis]
return spectrogram
- Short-Time Fourier Transform (STFT): Converts each waveform into frequency components.
- Magnitude Calculation: We take the absolute value to keep only the magnitude of each frequency.
- Channel Dimension: Adds a new dimension, so the spectrogram is compatible with CNNs that expect image-like data.
Next, we apply our function to each audio waveform in the training and validation datasets using make_spec_ds. This function maps the transformation across the dataset.
def make_spec_ds(ds):
return ds.map(
map_func=lambda audio, label: (get_spectrogram(audio), label),
num_parallel_calls=tf.data.AUTOTUNE
)
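With the helper defined, we apply it to the training, validation, and test splits. The variable names below are the spectrogram datasets referenced by the rest of the tutorial:

# Convert each split from waveforms to spectrograms
train_spectrogram_ds = make_spec_ds(train_ds)
val_spectrogram_ds = make_spec_ds(val_ds)
test_spectrogram_ds = make_spec_ds(test_ds)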
We can start by analyzing and listening to three examples to understand the transformation.
for i in range(3):
label = label_names[example_labels[i]]
waveform = example_audio[i]
spectrogram = get_spectrogram(waveform)
print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=16000)) # Plays the audio
Note: display.Audio
allows us to listen to each sample.
We define a plot_spectrogram
function to visualize the frequency components over time.
def plot_spectrogram(spectrogram, ax):
if len(spectrogram.shape) > 2:
assert len(spectrogram.shape) == 3
spectrogram = np.squeeze(spectrogram, axis=-1)
# Convert frequencies to log scale and transpose for correct orientation
log_spec = np.log(spectrogram.T + np.finfo(float).eps)
height, width = log_spec.shape
X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
Y = range(height)
# Plot the log-transformed spectrogram
ax.pcolormesh(X, Y, log_spec)
Now, let's look at the spectrograms obtained from the dataset to get a broader view of the data the model will be trained on.
for example_spectrograms, example_spect_labels in train_spectrogram_ds.take(1):
break
rows = 3
cols = 3
n = rows * cols
fig, axes = plt.subplots(rows, cols, figsize=(16, 9))
for i in range(n):
r = i // cols
c = i % cols
ax = axes[r][c]
plot_spectrogram(example_spectrograms[i].numpy(), ax)
ax.set_title(label_names[example_spect_labels[i].numpy()])
plt.show()
Figure 5: Multiple Command Spectrograms
Finally, let's use our function to create a combined plot of the waveform and its spectrogram for a sample command.
fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
# Plot waveform
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, 16000])
# Plot spectrogram
plot_spectrogram(spectrogram.numpy(), axes[1])
axes[1].set_title('Spectrogram')
plt.suptitle(label.title())
plt.show()
Figure 6: Waveform to Spectrogram Mapping for the 'Right' Command
To speed up the training process, we use TensorFlow's caching and prefetching capabilities, which enable efficient data loading.
train_spectrogram_ds = train_spectrogram_ds.cache().shuffle(10000).prefetch(tf.data.AUTOTUNE)
val_spectrogram_ds = val_spectrogram_ds.cache().prefetch(tf.data.AUTOTUNE)
test_spectrogram_ds = test_spectrogram_ds.cache().prefetch(tf.data.AUTOTUNE)
- Cache: Stores data in memory after the first epoch to reduce loading time.
- Shuffle: Randomly shuffles the dataset to help with model generalization.
- Prefetch: Loads the next batch while the current one is being processed.
With our spectrograms ready, we can now set up a Convolutional Neural Network (CNN) to classify the audio commands.
Below, we outline the model architecture designed to capture patterns in the spectrograms and differentiate between various commands.
input_shape = example_spectrograms.shape[1:]
print('Input shape:', input_shape)
num_labels = len(label_names)
# Instantiate the Normalization layer and adapt it to the training data
norm_layer = layers.Normalization()
norm_layer.adapt(data=train_spectrogram_ds.map(lambda spec, label: spec))
model = models.Sequential([
layers.Input(shape=input_shape),
# Resize and normalize the input
layers.Resizing(32, 32),
norm_layer,
layers.Conv2D(32, 3, activation='relu'),
layers.Conv2D(64, 3, activation='relu'),
layers.MaxPooling2D(),
layers.Dropout(0.25),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dropout(0.5),
layers.Dense(num_labels),
])
model.summary()
After running this cell, you should see something like this:
| Layer (type) | Output Shape | Param # |
|---|---|---|
| resizing_3 (Resizing) | (None, 32, 32, 1) | 0 |
| normalization_3 (Normalization) | (None, 32, 32, 1) | 3 |
| conv2d_6 (Conv2D) | (None, 30, 30, 32) | 320 |
| conv2d_7 (Conv2D) | (None, 28, 28, 64) | 18,496 |
| max_pooling2d_3 (MaxPooling2D) | (None, 14, 14, 64) | 0 |
| dropout_6 (Dropout) | (None, 14, 14, 64) | 0 |
| flatten_3 (Flatten) | (None, 12544) | 0 |
| dense_6 (Dense) | (None, 128) | 1,605,760 |
| dropout_7 (Dropout) | (None, 128) | 0 |
| dense_7 (Dense) | (None, 8) | 1,032 |
- Normalization Layer: Scales pixel values to improve model convergence.
- Convolutional Layers:
  - Two Conv2D layers extract features from the spectrogram images, with ReLU (a threshold operation) as the activation function.
  - MaxPooling2D reduces the spatial dimensions, helping the model focus on the most important features.
  - Dropout layers prevent overfitting by randomly disabling neurons during training.
- Fully Connected (Dense) Layers:
  - A Flatten layer converts the 2D feature maps into a vector.
  - A dense layer with 128 neurons captures more abstract patterns.
This CNN architecture is designed to capture the distinct audio patterns in each spectrogram and classify the commands accurately.
We start by compiling the model with an optimizer, loss function and metrics. These components are essential for configuring the learning process before training.
model.compile(
optimizer=tf.keras.optimizers.Adam(),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'],
)
- Optimizer: Adam is chosen for its efficiency and robustness.
- Loss Function: SparseCategoricalCrossentropy is used because it is well suited to multiclass classification.
- Metrics: Accuracy is used to monitor model performance in terms of correct predictions.
Next, we train the model on the spectrogram dataset, using early stopping to halt training if the model stops improving.
EPOCHS = 10
history = model.fit(
train_spectrogram_ds,
validation_data=val_spectrogram_ds,
epochs=EPOCHS,
callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)
Note: An epoch refers to one complete pass of the training data through the algorithm. It is a hyperparameter that determines how long the model trains.
- Early stopping is a form of regularization that halts training once validation performance stops improving; it is illustrated below for clarity.
Figure 7: Early Stopping Callback Function
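As an optional variation (not used in the cell above), the same Keras callback can monitor validation loss and restore the weights from the best epoch instead of keeping the final ones:

# Optional variant of the callback used above
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch validation loss
    patience=2,                  # stop after 2 epochs without improvement
    restore_best_weights=True,   # roll back to the best-performing weights
    verbose=1,
)
# Pass it to model.fit(..., callbacks=[early_stop]) exactly as before.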
To assess the model performance, we plot the training and validation loss and accuracy over epochs. This gives us a clear view of how well the model learns and generalizes.
metrics = history.history
plt.figure(figsize=(16,6))
# Plotting loss
plt.subplot(1,2,1)
plt.plot(history.epoch, metrics['loss'], metrics['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.ylim([0, max(plt.ylim())])
plt.xlabel('Epoch')
plt.ylabel('Loss [CrossEntropy]')
# Plotting accuracy
plt.subplot(1,2,2)
plt.plot(history.epoch, 100*np.array(metrics['accuracy']), 100*np.array(metrics['val_accuracy']))
plt.legend(['accuracy', 'val_accuracy'])
plt.ylim([0, 100])
plt.xlabel('Epoch')
plt.ylabel('Accuracy [%]')
Figure 8: Training and Validation Loss and Accuracy Curves
The two graphs above show how well our model is learning over time.
- Loss Plot:
- Loss is a measure of how far off the model's predictions are from the correct answers.
- Training loss shows how the model does on the data it is learning from.
- Validation loss shows how it performs on new, unseen data.
- Accuracy Plot:
- Accuracy is simply the percentage of correct answers.
- Higher accuracy is better.
After training, we evaluate the model's performance on the test dataset to check how well it generalizes to new, unseen data.
model.evaluate(test_spectrogram_ds, return_dict=True)
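For a quick qualitative check (a small sketch reusing the variables defined above, not part of the original flow), we can also run the trained model on one test batch and turn its raw logits into probabilities with a softmax, since the final Dense layer has no activation:

# Inspect the first prediction from one batch of test spectrograms
for spectrograms, labels in test_spectrogram_ds.take(1):
    logits = model(spectrograms)            # raw scores from the final Dense layer
    probs = tf.nn.softmax(logits, axis=-1)  # convert logits to class probabilities
    pred_idx = tf.argmax(probs[0]).numpy()
    print('Predicted:', label_names[pred_idx])
    print('Actual:   ', label_names[labels[0].numpy()])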
To analyze classification results in more detail, we can create a confusion matrix, which shows the frequency of correct and incorrect predictions for each command class.
y_pred = model.predict(test_spectrogram_ds)
y_pred = tf.argmax(y_pred, axis=1)
y_true = tf.concat(list(test_spectrogram_ds.map(lambda s,lab: lab)), axis=0)
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx,
xticklabels=label_names,
yticklabels=label_names,
annot=True, fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
Figure 9: Confusion Matrix - Labels and Model Predictions
As you can see in the confusion matrix, the model is more prone to prediction errors for similar-sounding words such as "no" and "go" or "go" and "down". This suggests that certain commands might require additional tuning, perhaps with more training data or with preprocessing that better captures the subtle differences between similar audio signals.
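One inexpensive preprocessing experiment (a sketch, not part of the tutorial above; the noise level is an arbitrary choice) is to augment the training waveforms with low-level Gaussian noise before the spectrogram step, which often helps the model generalize:

# Sketch: add low-level noise to each waveform, then convert to spectrograms as before
def add_noise(audio, labels, noise_level=0.005):
    noise = tf.random.normal(tf.shape(audio), stddev=noise_level)
    return audio + noise, labels

# Example usage: build an augmented training set and reuse make_spec_ds from earlier
augmented_train_ds = train_ds.map(add_noise, num_parallel_calls=tf.data.AUTOTUNE)
augmented_spectrogram_ds = make_spec_ds(augmented_train_ds)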
The speech recognition system described above provides a solid foundation for audio classification. However, it is important to understand its limitations.
Environmental Constraints
- Performance may degrade in noisy environments
- Background sounds can interfere with recognition accuracy
- Model may struggle with varying microphone qualities
Resource Requirements
- Audio-to-spectrogram conversion is computationally expensive
- Real-time processing requires sufficient CPU/GPU resources
Speaker Variability
- The speaker's accent may impact performance
- Voice variations (pitch, speed, emotion) can impact accuracy
- Distance from the microphone affects recognition
- ScienceDirect on Spectrograms: "Spectrogram." ScienceDirect, https://www.sciencedirect.com/topics/engineering/spectrogram.
- Spectrogram - Wikipedia: "Spectrogram." Wikipedia, The Free Encyclopedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Spectrogram.
- Waveform Analysis Paper: Balaji, V., and G. Sadashivappa. "Waveform Analysis and Feature Extraction from Speech Data of Dysarthric Persons." 2019 6th International Conference on Signal Processing and Integrated Networks (SPIN), IEEE, 2019, pp. 955-960. doi:10.1109/SPIN.2019.8711768.
- Image of CNN Structure: "Image of CNN Structure." ImageKit, https://ik.imagekit.io/upgrad1/abroad-images/imageCompo/images/41Q35ZMU.png?pr-true.
- TensorFlow Audio Tutorial: "Simple Audio Recognition: Recognizing Keywords." TensorFlow, https://www.tensorflow.org/tutorials/audio/simple_audio#setup.
- SpeechRecognition Library on PyPI: "SpeechRecognition." PyPI, https://pypi.org/project/SpeechRecognition/.
- Deepgram - Python Audio Libraries: "Best Python Audio Libraries for Speech Recognition in 2023." Deepgram, https://deepgram.com/learn/best-python-audio-libraries-for-speech-recognition-in-2023.
To extend Kimon's exploration of real-time speech recognition using TensorFlow, I will now discuss some applications of this methodology. There are many ways that speech recognition can be used to bolster entire industries. From healthcare to automotive, speech recognition is already, or will soon become, indispensable.
I have a friend who works as a medical scribe. He complains of long twelve-hour days spent writing non-stop. This is an incredibly inefficient system, as the entire job of the scribe is to take notes. In addition, it is very error-prone. A robust speech-recognition system would allow far more time for patient care, and reduce the frequency of errors.
Almost everyone has had a bad experience with customer support. For years, the most common choice for companies was to use call centers in other countries, where agents were expected to read from a script, which often made it frustrating to establish context for the conversation. Nowadays, most companies rely on text-based AI solutions. This typically works much better, but service is best when it has a touch of humanity. Studies have shown that consumers have less trust in support labeled as "AI" (1). Using a speech-recognition front end could reduce frustration by providing the illusion of humanity.
For many years, we had fantastic analog controls in our vehicles, with a plethora of knobs and buttons to use without ever needing to look. Now, cars have become glorified computers with all controls obscured by a menu on a digital touchscreen. This makes it very dangerous to adjust the volume or turn down the temperature, as the driver is forced to take his eyes off the road (2). Speech recognition would enter the scene seamlessly, allowing drivers to keep their eyes on the road at all times. With the newest technology, drivers could even have full-blown conversations with their cars to keep them awake on long drives.
The most obvious use case for speech recognition is for use by those who are visually impaired or unable to type. For example, a blind person could wear a device that allows them to interact with their environment in an auditory way. They could be aware of hazards in their path, such as stairs, a skateboard, or even a busy street. It would be as if another person were guiding them. While this new technology may put service dogs out of a job, it has the potential to grant a new level of autonomy to people with disabilities.
At this moment, speech recognition is mostly used in cases where the correct identification of a spoken word is NOT mission-critical. Anyone who has owned an Amazon Alexa can attest that speech recognition is imperfect and mistakes are made constantly. Right now, speech recognition is not the answer if a person's life is at risk, as in some accessibility use cases. As AI continues to advance, we may look to a future where speech recognition can be trusted for daily use, even when accuracy is critical.
(1) https://www.businessinsider.com/ai-chatbots-customer-service-call-center-annoying-problems-2024-11
(2) https://interestingengineering.com/transportation/dangerous-touch-screen-in-cars