Speech Recognition: Foundational Concepts from a Biological Perspective

Intro

With the computing advancements of the last 10-15 years, neural networks and deep learning have greatly advanced machine learning (ML) and its applications to many human-like tasks. One such avenue is speech recognition: tools such as Siri, Google Assistant, and Alexa are now readily available to the public. These tools were made possible by vast amounts of data, but quantity alone is not sufficient when creating ML applications. From a young age we are able to understand speech remarkably well without being exposed to millions of different samples, so why not look at how our own biology perceives and processes sound?

Human Speech Transmission/Reception

Speech Production (Vocal Cords/Vocal Tract)

The very first stage of the speech model is the production of sound itself, which can be broken down into two components: the vibration of the vocal cords and the shape of the vocal tract.
There are three types of source functions: periodic, noisy, and a mixture of the two. For periodic sources, air from the lungs passes through the vocal cords and sets up a periodic vibration. Rather than containing one specific frequency, the source function contains the pitch along with its harmonics. To better explain pitch, consider the difference between a young child and an opera singer. Regardless of what is actually being said, the child's voice will sound much higher than the opera singer's. This is because the rate of vibration of the vocal folds, i.e. the pitch, is much higher for the child than for the opera singer. Pitch varies from person to person; rough statistics (in Hz) are given in the table below, followed by a small pitch-estimation sketch:

| Speaker | Low (Hz) | Average (Hz) | High (Hz) |
|---|---|---|---|
| Men | 80 | 125 | 200 |
| Women | 150 | 225 | 350 |
| Children | 200 | 300 | 500 |
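
To make the idea of pitch concrete, here is a minimal sketch (not from the original text) of estimating the fundamental frequency of a voiced segment with a simple autocorrelation search over the pitch range in the table above. The function name `estimate_pitch` and the 16 kHz sampling rate are illustrative assumptions; real pitch trackers are considerably more robust.

```python
import numpy as np

def estimate_pitch(segment, fs, f_min=80.0, f_max=500.0):
    """Rough pitch estimate of a voiced segment via autocorrelation."""
    segment = segment - np.mean(segment)      # remove any DC offset
    corr = np.correlate(segment, segment, mode="full")
    corr = corr[len(corr) // 2:]              # keep non-negative lags only
    # Search only lags that correspond to plausible pitches (see table above).
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return fs / best_lag

# Quick check with a crude periodic "vocal fold" waveform at 220 Hz.
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
toy_voiced = np.sign(np.sin(2 * np.pi * 220 * t))
print(estimate_pitch(toy_voiced, fs))         # roughly 220 Hz
```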

For noisy sources there is no clear set of frequency components and, as a result, no clear pitch. Some sources have slight periodic components mixed in with noise. The phonemes (sounds) of English are each created by one of these source functions and can be roughly classified into three categories: vowels have periodic sources, semi-vowels have a mixture of periodicity and noise, and consonants are either periodic or noisy. These classifications are not unique to any specific speaker, but the actual pitch values can vary greatly from person to person. What is more important for separating different phonemes is the shape of the vocal tract. A common approach is to model the vocal tract as a transfer function that modifies the vocal cord vibration and to view the result from a frequency-spectrum perspective. Specifically, the peaks of this spectrum, called formant frequencies, hold the relevant information about the sound.
One of the challenging parts of speech recognition is that the formant frequency values are not consistent between men, women, and children. The formants depend on the individual geometry of the vocal tract, which varies in length, volume, and shape from person to person. Since formants are generally only analyzed for vowels, another approach is required to identify consonants. Speech is not simply a concatenation of sounds but something more fluid, as neighboring sounds affect one another. Consonants can then be identified based on how they shape the starts and ends of adjacent vowels, though they can be difficult to discern because their duration is very short.
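
The source-filter picture above can be sketched in a few lines of Python: a periodic impulse train stands in for the vocal cord vibration, and a cascade of resonant filters stands in for the vocal tract transfer function whose peaks act as formants. The specific pitch (125 Hz) and formant values (roughly 730, 1090, and 2440 Hz, often quoted for an /a/-like vowel) are illustrative assumptions rather than anything prescribed here.

```python
import numpy as np
from scipy import signal

fs = 16000
f0 = 125                                  # source pitch (see the table above)
t = np.arange(0, 0.5, 1 / fs)

# Source: impulse train at the pitch period, a crude stand-in for glottal pulses.
source = np.zeros_like(t)
source[::int(fs / f0)] = 1.0

# Vocal tract: cascade of resonant (peaking) filters; their center
# frequencies play the role of formants.
formants_hz = [730, 1090, 2440]           # illustrative /a/-like values
speech = source.copy()
for fc in formants_hz:
    b, a = signal.iirpeak(fc, Q=10, fs=fs)
    speech = signal.lfilter(b, a, speech)

# The spectral envelope now shows peaks near the chosen formants, while the
# fine harmonic structure still reflects the 125 Hz pitch of the source.
freqs, psd = signal.welch(speech, fs=fs, nperseg=1024)
print(freqs[np.argmax(psd)])              # lands near the first formant
```

Changing only `f0` changes the perceived pitch, while changing only `formants_hz` changes which vowel-like sound the spectrum resembles, mirroring the separation between source and vocal tract described above.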

This highlights the incredible way in which our brains work: two people speaking the same sound each have their own unique biology, resulting in different vocal cord pitch and different formant values, yet we still perceive it as the same sound. For speech recognition this is quite a challenge, and it is where a large amount of data from different people and dialects plays a big role, as there are no set-in-stone “truths” in speech production.

Spectrograms

Up until now speech has been treated using this linear model, but these sounds are ever changing, so we cannot treat the whole signal like an LTI system. Instead we must look at small segments, which is achieved through a process called windowing. The effect is analogous to slicing out a portion of our speech and only looking at that piece. It is not quite that simple, as there are many considerations for the shape, length, and rate at which we window, but the overarching idea is to continually look at small segments of the speech signal. Together, this windowing and a frequency transform (the Short-Time Fourier Transform, or STFT) can be plotted into a figure called a spectrogram, which holds three axes of information.

Along the bottom are the various windowed segments, which may overlap (in fact they usually do, to allow reconstruction and avoid aliasing). The vertical axis is frequency, and the intensity of the gray scale indicates which values have greater magnitude. By modifying the parameters of the window we get different plots emphasizing things like formant frequencies, pitch, and their harmonics. These variations allow us to further augment the original data, and together they serve as input to the various machine learning algorithms developed to date.
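
As a concrete illustration of windowing and the STFT, the sketch below (with illustrative choices of a 25 ms Hann window and 50% overlap, and a toy two-tone signal standing in for real speech) shows how the spectrogram captures content that changes over time.

```python
import numpy as np
from scipy import signal

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
# Toy non-stationary signal: 300 Hz for the first half second, 1200 Hz after.
x = np.where(t < 0.5, np.sin(2 * np.pi * 300 * t), np.sin(2 * np.pi * 1200 * t))

# Short-Time Fourier Transform: 25 ms Hann windows with 50% overlap.
nperseg = int(0.025 * fs)                 # window length in samples
noverlap = nperseg // 2                   # overlap between neighboring windows
freqs, times, Z = signal.stft(x, fs=fs, window="hann",
                              nperseg=nperseg, noverlap=noverlap)

spectrogram = np.abs(Z)                   # magnitude of each time-frequency bin
# Dominant frequency in an early window vs. a late window:
print(freqs[np.argmax(spectrogram[:, 5])])    # near 300 Hz
print(freqs[np.argmax(spectrogram[:, -5])])   # near 1200 Hz
```

Plotting `spectrogram` with time on one axis, frequency on the other, and magnitude as intensity reproduces the three-axis picture described above.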

Ears and MFCC

So far the primary focus has been on speech production and some of the transformations that can be applied through digital signal processing techniques like the STFT. However, this raises the question of how our actual ears process speech and whether there might be a better method to employ. Our eardrums vibrate in response to incoming sound and propagate this vibration through the cochlea. The rate of vibration causes specific cochlear hairs to resonate, which transmits the signal to the brain. From our current understanding of this process, the cochlea sends the frequency components of the auditory signal to the brain, but there is a nuance: frequencies are not resolved on a linear scale. The bandwidth associated with each region grows at higher frequencies, in a roughly logarithmic fashion. In other words, we are more sensitive to changes at lower frequencies, while higher ones are harder to distinguish.

To mimic this behavior of the ear we instead work on the Mel scale and compute features called Mel-Frequency Cepstral Coefficients (MFCCs). Specifically, the frequency components of the signal are converted to mels through a logarithmic mapping. The reasoning lies in our ability to perceive differences at lower frequencies more easily than at higher ones, even when the differences in Hertz are the same. For example, we can distinctly notice the difference between a 500 Hz and a 1000 Hz tone, but it is much harder for a 10,000 Hz and a 10,500 Hz tone. This rescaling to model our own biology has empirically improved the error rates of machine learning models.
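
To see how different the Mel scale is from a linear Hertz scale, here is a small sketch using the commonly cited 2595·log10(1 + f/700) mapping (one of several conventions in use); it reproduces the 500/1000 Hz versus 10,000/10,500 Hz comparison above.

```python
import numpy as np

def hz_to_mel(f_hz):
    """A common Hz-to-mel mapping (one of several conventions in use)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 500 Hz gap corresponds to very different perceptual (mel) gaps:
low_gap = hz_to_mel(1000) - hz_to_mel(500)      # roughly 390 mels
high_gap = hz_to_mel(10500) - hz_to_mel(10000)  # roughly 50 mels
print(round(low_gap), round(high_gap))
```

In a full MFCC pipeline, a bank of triangular filters spaced evenly on this mel scale is applied to the STFT magnitudes, followed by a logarithm and a discrete cosine transform to produce the final coefficients.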

Emotion

The next step in speech recognition for AI is to identify emotions, allowing machines to converse with humans more naturally. When people converse, they are constantly picking up on cues about the other's emotions in order to react appropriately. If AI were able to understand a person's emotions through their body language, facial expressions, and voice, it would become much more effective in certain applications. For example, a call center that understands a customer's emotions can redirect them to the appropriate person to help. Another application would be businesses that want feedback on how an advertisement was received in order to better tailor their marketing. In the medical field, the technology could help doctors better monitor how a patient is feeling during examinations or procedures. As with any new technology, there are ethical implications to consider as well as issues that must be addressed. The main issues with this technology are privacy concerns and potential bias in how accurately different individuals' emotions can be read.

“We have a lot of neurons in our brain for social interactions. We’re born with some of those skills, and then we learn more. It makes sense to use technology to connect to our social brains, not just our analytical brains,” says MIT Sloan professor Brynjolfsson (Alkhaldi). Speaking and listening are just two parts of conversation. For a conversation to go smoothly, reading the other person’s emotions and reacting to them is just as important as speaking and listening. This is why the natural progression of AI and speech recognition should be an understanding of emotions, or our social brains. Without emotion you lose a big part of the understanding and context of what the other person is saying. Since emotion isn’t as concrete as deciphering what was said, many parameters are required to recognize it accurately.

Ethical Implications of Emotion AI

Emotion AI requires large amounts of training data for its machine learning algorithms, just like any other AI. The data it needs includes facial expressions from camera feeds as well as microphone recordings of speech, which makes the collection of training data very privacy sensitive. Applying the technology would also require video and voice data from the user, so if it were to be implemented it is important that these concerns are addressed; users should be made aware of what data they are allowing it to use. The second concern with emotion AI is bias. Not everyone expresses emotions the same way, so there is potential for bias from the researchers’ opinions about what certain expressions mean. There has already been controversy over facial recognition technology because of findings that accuracy was best on white male subjects and significantly poorer for dark-skinned female subjects. Bias like this could show up in emotion detection as well; for example, in elderly subjects emotion may appear differently on their faces because of wrinkles (Somers), which could cause someone to appear angry or fatigued when they aren’t. One way to alleviate this issue would be to gather extensive participant input through questionnaires that ask what emotion they think a given face is portraying. Surveying a wide range of people would also reduce bias.

Conclusion

Speech recognition and emotion AI, like all other machine learning models, rely heavily on their input data. Many results from neural networks are largely empirical and can be difficult to ground in sound theory explaining why they work well. Treating these models like black boxes is not that different from our understanding of the brain, but we do have a lot of control over the types of inputs that are fed into them. Understanding how speech production and hearing work in humans gives that much-needed context and serves as a guide for how a machine can also go about recognizing and processing sounds. By no means is this all that is required, as sound identification is often combined with knowledge about the language itself. A linguistic perspective can provide insight into which sounds can go together, the structures of sentences, and things of this nature, further improving the accuracy of speech recognition software. Just as understanding human biology gives context to speech recognition, gathering extensive human input on how emotions are displayed will be essential to create a ground truth for emotion recognition. Topics like Natural Language Processing can be a next step to explore if this is of interest, but from a biological perspective, hopefully this article has given you some of the intuition needed to understand one part of speech processing.

Sources

https://theaisummer.com/speech-recognition/

https://www.analyticsvidhya.com/blog/2022/03/a-comprehensive-overview-on-automatic-speech-recognition-asr/

Rabiner, L. R., & Schafer, R. W. (2011). Theory and applications of Digital Speech Processing. Pearson/Prentice Hall.

Abeer, A (2023) ECE M214A Lecture Slides

Alkhaldi, Nadejda. “Emotional AI: Are Algorithms Smart Enough to Decipher Human Emotions?” IoT For All, 4 May 2022, https://www.iotforall.com/emotional-ai-are-algorithms-smart-enough-to-decipher-human-emotions.

Somers, Meredith. “Emotion AI, Explained.” MIT Sloan, 8 Mar. 2019, https://mitsloan.mit.edu/ideas-made-to-matter/emotion-ai-explained.

Word Count: 1923