Speech Recognition Using TensorFlow - 180D-FW-2024/Knowledge-Base-Wiki GitHub Wiki
Automatic Speech Recognition (ASR) transforms spoken language into text, powering applications like voice assistants, smart devices, and transcription tools. ASR operates by extracting audio features, processing them through machine learning models, and generating text outputs. For instance, a spoken word like "hello" is converted into text data.
This tutorial demonstrates how to create a simple ASR system using TensorFlow. We will use the Mini Speech Commands dataset, which contains audio files for common speech commands such as "yes," "no," "up," and "down." This example will guide you through preprocessing audio data into spectrograms, building a Convolutional Neural Network (CNN), and training the model to recognize commands.
Spectrograms are a visual representation of the frequency content of a signal over time. They are particularly useful for analyzing audio signals, such as speech or music, because they show how different frequency components (like pitch and tone) vary over the duration of the sound.
- Convert Time Domain to Frequency Domain:
  - Break the audio signal into small segments and compute the frequencies in each segment using a short-time Fourier transform (STFT).
- Calculate Magnitude:
  - Fourier transforms produce complex numbers, so we work with the magnitude to represent the strength of each frequency.
- Color Mapping:
  - Time runs along the x-axis, and frequency along the y-axis.
  - Color intensity represents amplitude.
In Figure 1 below, you can see an example of a simple spectrogram.
Figure 1: Example of a Spectrogram showing frequency changes over time
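If you would like to generate a similar picture yourself, here is a minimal, self-contained sketch (not part of the tutorial's pipeline; the synthetic chirp signal and sample rate are arbitrary choices for illustration) using Matplotlib's built-in specgram helper:

import numpy as np
import matplotlib.pyplot as plt

fs = 16000                                        # sample rate in Hz (same rate as the dataset used later)
t = np.linspace(0, 1.0, fs, endpoint=False)       # 1 second of time steps
signal = np.sin(2 * np.pi * (200 + 400 * t) * t)  # a simple chirp whose frequency rises over time

# specgram applies an STFT, takes magnitudes, and color-maps them (time on x, frequency on y)
plt.specgram(signal, NFFT=256, Fs=fs, noverlap=128)
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')
plt.title('Spectrogram of a synthetic chirp')
plt.show()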
In the context of speech recognition, spectrograms allow us to capture critical information about the audio signal. Using its spatial pattern recognition capabilities - or in plain English, its ability to detect shapes and details - the CNN can interpret the spectrograms to classify audio commands accurately.
What are CNNs?
- CNNs are a type of neural network widely used in image recognition. This type of neural network excels in identifying spatial patterns, such as edges and textures in data.
Key Components:
- Convolutional Layers: Identify local features in the input (e.g., patterns in a spectrogram).
- Pooling Layers: Reduce dimensionality and emphasize important features.
- Fully Connected Layers: Interpret extracted features and assign probabilities for classification.
Figure 2: Structure of a Convolutional Neural Network showing the flow from input to classification.
Our aim is to recognize spoken words and assign, or "classify", each audio clip to one of several pre-defined, or "labeled", categories.
TensorFlow is an open-source machine learning library developed by Google. It is widely used for creating machine learning models, including deep learning models for tasks such as speech recognition. If you are interested in learning more about TensorFlow, visit the official TensorFlow documentation.
In your local environment, begin by installing and importing the necessary libraries. We'll use TensorFlow for model building, Matplotlib for visualization, and Seaborn for plotting.
!pip install -U -q tensorflow tensorflow_datasets
import os
import pathlib
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras import layers, models
from IPython import display
We use TensorFlow's get_file
function to download the Mini Speech Commands dataset if it's not already in the specified directory.
# Define the dataset path
DATASET_PATH = 'data/mini_speech_commands'
data_dir = pathlib.Path(DATASET_PATH)
# Download and extract if the dataset does not exist
if not data_dir.exists():
tf.keras.utils.get_file(
'mini_speech_commands.zip',
origin="http://storage.googleapis.com/download.tensorflow.org/data/mini_speech_commands.zip",
extract=True,
cache_dir='.',
cache_subdir='data'
)
# Verify the dataset structure
commands = np.array(tf.io.gfile.listdir(str(data_dir)))
commands = commands[(commands != 'README.md') & (commands != '.DS_Store')]
print('Commands:', commands)
TensorFlow's audio_dataset_from_directory
simplifies loading audio files directly from directories and labels them based on folder names. We split the dataset for training and validation.
Let's begin! The following code loads the audio files from the specified directory, splits them into training and validation sets, and sets each audio file's length to 1 second (16,000 samples).
# Load audio dataset and split it into training and validation sets
train_ds, val_ds = tf.keras.utils.audio_dataset_from_directory(
directory=data_dir,
batch_size=64,
validation_split=0.2,
seed=0,
output_sequence_length=16000,
subset='both'
)
After loading the dataset, we can print the class (command) names to ensure they were loaded correctly.
#Display the label names
label_names = np.array(train_ds.class_names)
print("Label names:", label_names)
Commands: ['right' 'go' 'no' 'left' 'stop' 'up' 'down' 'yes']
The audio files may have unnecessary extra dimensions (like empty channels). We can use the squeeze
function to remove the extra dimensions for consistency.
def squeeze(audio, labels):
audio = tf.squeeze(audio, axis=-1)
return audio, labels
train_ds = train_ds.map(squeeze, tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, tf.data.AUTOTUNE)
To ensure a proper evaluation after training, we split the validation set in half to create a separate test set.
test_ds = val_ds.shard(num_shards=2, index=0)
val_ds = val_ds.shard(num_shards=2, index=1)
Let's visualize a few sample audio waveforms from the dataset to get a better understanding of the data. This plot will display a selection of audio waveforms along with their associated labels.
The results are outlined in Figure 3 below.
for example_audio, example_labels in train_ds.take(1):
print(example_audio.shape)
print(example_labels.shape)
label_names[[1,1,3,0]]
plt.figure(figsize=(16, 10))
rows = 3
cols = 3
n = rows * cols
for i in range(n):
plt.subplot(rows, cols, i+1)
audio_signal = example_audio[i]
plt.plot(audio_signal)
plt.title(label_names[example_labels[i]])
plt.yticks(np.arange(-1.2, 1.2, 0.2))
plt.ylim([-1.1, 1.1])
Figure 3: Multiple Waveforms of the Dataset Commands
To train a CNN, we need to convert the audio waveforms into spectrograms, which represent frequency over time.
We start by defining the get_spectrogram
function, which will transform each audio waveform into a spectrogram.
def get_spectrogram(waveform):
# Compute the Short-Time Fourier Transform (STFT)
spectrogram = tf.signal.stft(
waveform, frame_length=255, frame_step=128
)
# Get the magnitude of the STFT
spectrogram = tf.abs(spectrogram)
# Add a channel dimension for compatibility with CNNs
spectrogram = spectrogram[..., tf.newaxis]
return spectrogram
- Short-Time Fourier Transform (STFT): Converts each waveform into frequency components.
- Magnitude Calculation: We take the absolute value to keep only the magnitude of each frequency.
- Channel Dimension: Adds a new dimension, so the spectrogram is compatible with CNNs that expect image-like data.
Next, we apply our function to each audio waveform in the training and validation datasets using make_spec_ds. This function maps the transformation across the dataset.
def make_spec_ds(ds):
return ds.map(
map_func=lambda audio, label: (get_spectrogram(audio), label),
num_parallel_calls=tf.data.AUTOTUNE
)
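With the helper defined, we apply it to the training, validation, and test splits. The variable names below are the spectrogram datasets referenced by the rest of the tutorial:

# Convert each split from waveforms to spectrograms
train_spectrogram_ds = make_spec_ds(train_ds)
val_spectrogram_ds = make_spec_ds(val_ds)
test_spectrogram_ds = make_spec_ds(test_ds)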
We can start by analyzing and listening to three examples to understand the transformation.
for i in range(3):
label = label_names[example_labels[i]]
waveform = example_audio[i]
spectrogram = get_spectrogram(waveform)
print('Label:', label)
print('Waveform shape:', waveform.shape)
print('Spectrogram shape:', spectrogram.shape)
print('Audio playback')
display.display(display.Audio(waveform, rate=16000)) # Plays the audio
Note: display.Audio
allows us to listen to each sample.
We define a plot_spectrogram
function to visualize the frequency components over time.
def plot_spectrogram(spectrogram, ax):
if len(spectrogram.shape) > 2:
assert len(spectrogram.shape) == 3
spectrogram = np.squeeze(spectrogram, axis=-1)
# Convert frequencies to log scale and transpose for correct orientation
log_spec = np.log(spectrogram.T + np.finfo(float).eps)
height, width = log_spec.shape
X = np.linspace(0, np.size(spectrogram), num=width, dtype=int)
Y = range(height)
# Plot the log-transformed spectrogram
ax.pcolormesh(X, Y, log_spec)
Now, let's look at the spectrograms obtained from the dataset to get a broader view of the data the model will be trained on.
for example_spectrograms, example_spect_labels in train_spectrogram_ds.take(1):
break
rows = 3
cols = 3
n = rows * cols
fig, axes = plt.subplots(rows, cols, figsize=(16, 9))
for i in range(n):
r = i // cols
c = i % cols
ax = axes[r][c]
plot_spectrogram(example_spectrograms[i].numpy(), ax)
ax.set_title(label_names[example_spect_labels[i].numpy()])
plt.show()
Figure 5: Multiple Command Spectrograms
Finally, let's use our function to create a combined plot of the waveform and its spectrogram for a sample command.
fig, axes = plt.subplots(2, figsize=(12, 8))
timescale = np.arange(waveform.shape[0])
# Plot waveform
axes[0].plot(timescale, waveform.numpy())
axes[0].set_title('Waveform')
axes[0].set_xlim([0, 16000])
# Plot spectrogram
plot_spectrogram(spectrogram.numpy(), axes[1])
axes[1].set_title('Spectrogram')
plt.suptitle(label.title())
plt.show()
Figure 6: Waveform to Spectrogram Mapping for the 'Right' Command
To speed up the training process, we use TensorFlow's caching and prefetching capabilities, which enable efficient data loading.
train_spectrogram_ds = train_spectrogram_ds.cache().shuffle(10000).prefetch(tf.data.AUTOTUNE)
val_spectrogram_ds = val_spectrogram_ds.cache().prefetch(tf.data.AUTOTUNE)
test_spectrogram_ds = test_spectrogram_ds.cache().prefetch(tf.data.AUTOTUNE)
- Cache: Stores data in memory after the first epoch to reduce loading time.
- Shuffle: Randomly shuffles the dataset to help with model generalization.
- Prefetch: Loads the next batch while the current one is being processed.
With our spectrograms ready, we can now set up a Convolutional Neural Network (CNN) to classify the audio commands.
Below, we outline the model architecture designed to capture patterns in the spectrograms and differentiate between various commands.
input_shape = example_spectrograms.shape[1:]
print('Input shape:', input_shape)
num_labels = len(label_names)
# Instantiate the Normalization layer and adapt it to the training data
norm_layer = layers.Normalization()
norm_layer.adapt(data=train_spectrogram_ds.map(lambda spec, label: spec))
model = models.Sequential([
layers.Input(shape=input_shape),
# Resize and normalize the input
layers.Resizing(32, 32),
norm_layer,
layers.Conv2D(32, 3, activation='relu'),
layers.Conv2D(64, 3, activation='relu'),
layers.MaxPooling2D(),
layers.Dropout(0.25),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dropout(0.5),
layers.Dense(num_labels),
])
model.summary()
After running this cell, you should see something like this:
| Layer (type) | Output Shape | Param # |
|---|---|---|
| resizing_3 (Resizing) | (None, 32, 32, 1) | 0 |
| normalization_3 (Normalization) | (None, 32, 32, 1) | 3 |
| conv2d_6 (Conv2D) | (None, 30, 30, 32) | 320 |
| conv2d_7 (Conv2D) | (None, 28, 28, 64) | 18,496 |
| max_pooling2d_3 (MaxPooling2D) | (None, 14, 14, 64) | 0 |
| dropout_6 (Dropout) | (None, 14, 14, 64) | 0 |
| flatten_3 (Flatten) | (None, 12544) | 0 |
| dense_6 (Dense) | (None, 128) | 1,605,760 |
| dropout_7 (Dropout) | (None, 128) | 0 |
| dense_7 (Dense) | (None, 8) | 1,032 |
- Normalization Layer: Scales pixel values to improve model convergence.
- Convolutional Layers:
  - Two Conv2D layers extract features from the spectrogram images, with ReLU (a threshold operation) as the activation function.
  - MaxPooling2D reduces the spatial dimensions, helping the model focus on the most important features.
  - Dropout layers prevent overfitting by randomly disabling neurons during training.
- Fully Connected (Dense) Layers:
  - A Flatten layer converts the 2D feature maps into a vector.
  - A dense layer with 128 neurons captures more abstract patterns.
This CNN architecture is designed to capture the distinct audio patterns in each spectrogram and classify the commands accurately.
We start by compiling the model with an optimizer, loss function and metrics. These components are essential for configuring the learning process before training.
model.compile(
optimizer=tf.keras.optimizers.Adam(),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'],
)
- Optimizer: Adam is chosen for its efficiency and robustness.
- Loss Function: SparseCategoricalCrossentropy is used because it is well suited to multiclass classification.
- Metrics: Accuracy is used to monitor model performance in terms of correct predictions.
Next, we train the model on the spectrogram dataset, using early stopping to halt training if the model stops improving.
EPOCHS = 10
history = model.fit(
train_spectrogram_ds,
validation_data=val_spectrogram_ds,
epochs=EPOCHS,
callbacks=[tf.keras.callbacks.EarlyStopping(verbose=1, patience=2)],
)
Note: An epoch refers to one complete pass of the training data through the algorithm. It is a hyperparameter that determines how long the model trains.
- Early stopping is a form of regularization that halts training once validation performance stops improving; it is illustrated below for clarity.
Figure 7: Early Stopping Callback Function
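As an optional variation (not used in the cell above), the same Keras callback can monitor validation loss and restore the weights from the best epoch instead of keeping the final ones:

# Optional variant of the callback used above
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',          # watch validation loss
    patience=2,                  # stop after 2 epochs without improvement
    restore_best_weights=True,   # roll back to the best-performing weights
    verbose=1,
)
# Pass it to model.fit(..., callbacks=[early_stop]) exactly as before.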
To assess the model performance, we plot the training and validation loss and accuracy over epochs. This gives us a clear view of how well the model learns and generalizes.
metrics = history.history
plt.figure(figsize=(16,6))
# Plotting loss
plt.subplot(1,2,1)
plt.plot(history.epoch, metrics['loss'], metrics['val_loss'])
plt.legend(['loss', 'val_loss'])
plt.ylim([0, max(plt.ylim())])
plt.xlabel('Epoch')
plt.ylabel('Loss [CrossEntropy]')
# Plotting accuracy
plt.subplot(1,2,2)
plt.plot(history.epoch, 100*np.array(metrics['accuracy']), 100*np.array(metrics['val_accuracy']))
plt.legend(['accuracy', 'val_accuracy'])
plt.ylim([0, 100])
plt.xlabel('Epoch')
plt.ylabel('Accuracy [%]')
Figure 8: Training and Validation Loss and Accuracy Curves
The two graphs above show how well our model is learning over time.
- Loss Plot:
- Loss is a measure of how far off the model's predictions are from the correct answers.
- Training loss shows how the model does on the data it is learning from.
- Validation loss shows how it performs on new, unseen data.
- Accuracy Plot:
- Accuracy is simply the percentage of correct answers.
- Higher accuracy is better.
After training, we evaluate the model's performance on the test dataset to check how well it generalizes to new, unseen data.
model.evaluate(test_spectrogram_ds, return_dict=True)
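For a quick qualitative check (a small sketch reusing the variables defined above, not part of the original flow), we can also run the trained model on one test batch and turn its raw logits into probabilities with a softmax, since the final Dense layer has no activation:

# Inspect the first prediction from one batch of test spectrograms
for spectrograms, labels in test_spectrogram_ds.take(1):
    logits = model(spectrograms)            # raw scores from the final Dense layer
    probs = tf.nn.softmax(logits, axis=-1)  # convert logits to class probabilities
    pred_idx = tf.argmax(probs[0]).numpy()
    print('Predicted:', label_names[pred_idx])
    print('Actual:   ', label_names[labels[0].numpy()])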
To analyze classification results in more detail, we can create a confusion matrix, which shows the frequency of correct and incorrect predictions for each command class.
y_pred = model.predict(test_spectrogram_ds)
y_pred = tf.argmax(y_pred, axis=1)
y_true = tf.concat(list(test_spectrogram_ds.map(lambda s,lab: lab)), axis=0)
confusion_mtx = tf.math.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(confusion_mtx,
xticklabels=label_names,
yticklabels=label_names,
annot=True, fmt='g')
plt.xlabel('Prediction')
plt.ylabel('Label')
plt.show()
Figure 9: Confusion Matrix - Labels and Model Predictions
As you can see in the confusion matrix, the model is more prone to prediction errors for similar-sounding words such as "no" and "go" or "go" and "down". This suggests that certain commands might require additional tuning, perhaps with more training data or with preprocessing that better captures the subtle differences between similar audio signals.
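One inexpensive preprocessing experiment (a sketch, not part of the tutorial above; the noise level is an arbitrary choice) is to augment the training waveforms with low-level Gaussian noise before the spectrogram step, which often helps the model generalize:

# Sketch: add low-level noise to each waveform, then convert to spectrograms as before
def add_noise(audio, labels, noise_level=0.005):
    noise = tf.random.normal(tf.shape(audio), stddev=noise_level)
    return audio + noise, labels

# Example usage: build an augmented training set and reuse make_spec_ds from earlier
augmented_train_ds = train_ds.map(add_noise, num_parallel_calls=tf.data.AUTOTUNE)
augmented_spectrogram_ds = make_spec_ds(augmented_train_ds)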
The speech recognition system described above provides a solid foundation for audio classification. However, it is important to understand its limitations.
Environmental Constraints
- Performance may degrade in noisy environments
- Background sounds can interfere with recognition accuracy
- Model may struggle with varying microphone qualities
Resource Requirements
- Audio-to-spectrogram conversion is computationally expensive
- Real-time processing requires sufficient CPU/GPU resources
Speaker Variability
- The speaker's accent may impact performance
- Voice variations (pitch, speed, emotion) can impact accuracy
- Distance from the microphone affects recognition
- ScienceDirect on Spectrograms: "Spectrogram." ScienceDirect, https://www.sciencedirect.com/topics/engineering/spectrogram.
- Spectrogram - Wikipedia: "Spectrogram." Wikipedia, The Free Encyclopedia, Wikimedia Foundation, https://en.wikipedia.org/wiki/Spectrogram.
- Waveform Analysis Paper: Balaji, V., and G. Sadashivappa. "Waveform Analysis and Feature Extraction from Speech Data of Dysarthric Persons." 2019 6th International Conference on Signal Processing and Integrated Networks (SPIN), IEEE, 2019, pp. 955-960. doi:10.1109/SPIN.2019.8711768.
- Image of CNN Structure: "Image of CNN Structure." ImageKit, https://ik.imagekit.io/upgrad1/abroad-images/imageCompo/images/41Q35ZMU.png?pr-true.
- TensorFlow Audio Tutorial: "Simple Audio Recognition: Recognizing Keywords." TensorFlow, https://www.tensorflow.org/tutorials/audio/simple_audio#setup.
- SpeechRecognition Library on PyPI: "SpeechRecognition." PyPI, https://pypi.org/project/SpeechRecognition/.
- Deepgram - Python Audio Libraries: "Best Python Audio Libraries for Speech Recognition in 2023." Deepgram, https://deepgram.com/learn/best-python-audio-libraries-for-speech-recognition-in-2023.
To extend Kimon's exploration of real-time speech recognition using TensorFlow, I will now discuss some applications of this methodology. There are many ways that speech recognition can be used to bolster entire industries. From healthcare to automotive, speech recognition is already, or will soon become, indispensable.
I have a friend who works as a medical scribe. He complains of long twelve-hour days spent writing non-stop. This is an incredibly inefficient system, as the entire job of the scribe is to take notes. In addition, it is very error-prone. A robust speech-recognition system would allow far more time for patient care, and reduce the frequency of errors.
Almost everyone has had a bad experience with customer support. For years, the most common choice for companies was to use call centers in other countries, where agents were expected to read from a script, which often made it frustrating to establish context for the conversation. Nowadays, most companies rely on text-based AI solutions. This typically works much better, but service is best when it has a touch of humanity. Studies have shown that consumers have less trust in support labeled as "AI" (1). Using a speech-recognition front end could reduce frustration by providing the illusion of humanity.
For many years, we had fantastic analog controls in our vehicles, with a plethora of knobs and buttons to use without ever needing to look. Now, cars have become glorified computers with all controls obscured by a menu on a digital touchscreen. This makes it very dangerous to adjust the volume or turn down the temperature, as the driver is forced to take his eyes off the road (2). Speech recognition would enter the scene seamlessly, allowing drivers to keep their eyes on the road at all times. With the newest technology, drivers could even have full-blown conversations with their cars to keep them awake on long drives.
The most obvious use case for speech recognition is for use by those who are visually impaired or unable to type. For example, a blind person could wear a device that allows them to interact with their environment in an auditory way. They could be aware of hazards in their path, such as stairs, a skateboard, or even a busy street. It would be as if another person were guiding them. While this new technology may put service dogs out of a job, it has the potential to grant a new level of autonomy to people with disabilities.
At this moment, speech recognition is mostly used in cases where the correct identification of a spoken word is NOT mission-critical. Anyone who has owned an Amazon Alexa can attest that speech recognition is imperfect and mistakes are made constantly. Right now, speech recognition is not the answer if a person's life is at risk, as in some accessibility use cases. As AI continues to advance, we may look to a future where speech recognition can be trusted for daily use, even when accuracy is critical.
(1) https://www.businessinsider.com/ai-chatbots-customer-service-call-center-annoying-problems-2024-11
(2) https://interestingengineering.com/transportation/dangerous-touch-screen-in-cars