Fundamentals of Audio Signal Processing

Introduction

In the field of audio technology, the pursuit of precise sound quality has become a focal point. Audio quality plays a pivotal role in shaping our perception and enjoyment of music, films, and other forms of digital content. To achieve pristine audio, we must delve into the complex domain of audio signal processing: from the initial capture of sound to its reproduction, techniques and algorithms refine and optimize the auditory experience. This article walks through the fundamentals and the processes behind the scenes of audio signal processing, then looks at speech recognition as an application where audio signal processing is essential.

Background

The evolution of sound technology is a history of steady modernization in audio quality, control, and recording. Most households nowadays possess sound systems including loudspeakers, TVs, computers, and mobile phones; however, it was not always like that. Early audio does not come close to the high-quality sound we have today. Some of the very first audio inventions were the carbon microphone, invented by David Edward Hughes in 1875; the first electric loudspeaker, invented by Alexander Graham Bell in 1876; and the first moving-coil driver, invented by Oliver Lodge at the end of the 19th century. Audio took a significant step during World War II, when the coaxial Duplex driver was introduced in 1943. Shortly after, the loudspeaker became popular, substantially improving audio in theaters: quality and clarity at higher volumes greatly improved. Modernization then brought mobile phones, Bluetooth, and virtual reality, expanding the possibilities of audio, and sound reproduction continues to improve with each new generation of devices. The evolution of audio quality is ongoing, with continued advances in digital signal processing and immersive audio technologies shaping the future of audio reproduction.

Fundamentals of Audio

The most common way of depicting sound is the waveform, which visually illustrates the fluctuation of sound amplitude over time: amplitude is plotted on the y-axis and time on the x-axis, varying above and below the axis. In general, amplitude denotes the relative strength of sound waves or transmitted vibrations, affecting our perception of volume (loudness). It is quantified in decibels (dB), representing the sound pressure level or intensity. The range of human hearing begins at 0 dB; everyday sounds typically register below 60 dB, and louder noises such as a car engine starting are typically closer to 70 dB. The figure below shows a short (3 millisecond) excerpt of a sound file.

[Figure: waveform of a 3 ms excerpt of a sound file]
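To make the decibel scale concrete, here is a minimal Python sketch of the standard sound pressure level formula, 20·log10(p/p0), using the conventional 20 µPa reference for 0 dB; the pressure values for the example sounds are rough, illustrative figures.

```python
import numpy as np

P_REF = 20e-6  # reference pressure in pascals: the 0 dB threshold of hearing

def pressure_to_db_spl(pressure_pa):
    """Convert an RMS sound pressure in pascals to dB SPL."""
    return 20 * np.log10(pressure_pa / P_REF)

# Roughly 0.02 Pa for ordinary conversation, 0.06 Pa for a starting car engine.
print(pressure_to_db_spl(0.02))  # ~60 dB: a typical everyday sound
print(pressure_to_db_spl(0.06))  # ~70 dB: closer to a car engine starting
```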

Frequency represents the rate at which a sound pressure wave repeats itself per second, measured in hertz (Hz). It is inversely proportional to the wave's period: lower-frequency sounds complete fewer oscillations per second than higher-frequency ones. Natural sounds in everyday life encompass a spectrum of frequencies. Tonal sounds consist of a fundamental frequency accompanied by a series of overtones, which are multiples of the fundamental frequency. Overtones are usually not heard as separate tones, since they blend so well with the fundamental. Pitch can be defined as the position of a single sound within the complete range of sound, determined by the frequency of the waves producing it. Below is a simple visual of how the waveform looks for quieter and louder sounds, and for lower and higher pitches.

[Figure: waveforms for quiet vs. loud sounds and for low vs. high pitches]
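To make the fundamental/overtone relationship concrete, here is a minimal NumPy sketch that builds a tone from a 220 Hz fundamental plus its first two overtones; the frequency and amplitude weights are illustrative choices, not fixed rules.

```python
import numpy as np

SAMPLE_RATE = 44100  # samples per second
DURATION = 1.0       # seconds
t = np.linspace(0, DURATION, int(SAMPLE_RATE * DURATION), endpoint=False)

fundamental = 220.0  # Hz: the pitch we perceive
# Overtones sit at integer multiples of the fundamental; here their
# amplitudes fall off so the fundamental dominates.
tone = (1.00 * np.sin(2 * np.pi * 1 * fundamental * t)   # fundamental
      + 0.50 * np.sin(2 * np.pi * 2 * fundamental * t)   # first overtone
      + 0.25 * np.sin(2 * np.pi * 3 * fundamental * t))  # second overtone

# Normalize so the combined waveform stays within [-1, 1] for playback.
tone /= np.max(np.abs(tone))
```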

The phase of a waveform refers to the timing of a point within a wave cycle of a periodic waveform. It is a crucial aspect of audio signals, determining how waves interfere with each other. For instance, when two waves of the same frequency combine in phase, they produce a larger wave with greater amplitude; if the same waves are 180 degrees out of phase, their amplitudes interfere destructively, canceling each other out.
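A minimal sketch of this interference behavior, assuming two 440 Hz sine waves: adding identical waves doubles the peak amplitude, while a 180-degree phase shift cancels it.

```python
import numpy as np

SAMPLE_RATE = 44100
t = np.arange(int(SAMPLE_RATE * 0.01)) / SAMPLE_RATE  # 10 ms of sample times
f = 440.0  # Hz

wave = np.sin(2 * np.pi * f * t)
in_phase = np.sin(2 * np.pi * f * t)              # 0 degrees out of phase
out_of_phase = np.sin(2 * np.pi * f * t + np.pi)  # 180 degrees out of phase

print(np.max(np.abs(wave + in_phase)))      # ~2.0: constructive, amplitudes add
print(np.max(np.abs(wave + out_of_phase)))  # ~0.0: destructive, amplitudes cancel
```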

Audio Signal Processing

Audio signal processing involves the application of techniques to manipulate audio signals; these signals exist in both analog and digital forms, with audible frequencies ranging from 20 to 20,000 Hz. Analog signals are continuous electrical waveforms, whereas digital signals are represented in binary code. In digital audio, an analog-to-digital converter measures the analog signal many times per second at a specific sample rate, and quantization maps each measured amplitude onto one of a fixed number of levels set by the bit depth. A higher sample rate and bit depth contribute to enhanced audio resolution and smoother playback.
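Here is a minimal sketch of quantization, assuming an input signal already normalized to [-1, 1]; the `quantize` helper is illustrative, not a library function.

```python
import numpy as np

def quantize(signal, bit_depth):
    """Map a signal in [-1, 1] onto 2**bit_depth evenly spaced levels."""
    levels = 2 ** bit_depth
    step = 2.0 / (levels - 1)  # spacing between adjacent levels
    # Round each sample to the nearest level; error is at most half a step.
    return np.clip(np.round(signal / step) * step, -1.0, 1.0)

t = np.linspace(0, 1, 1000, endpoint=False)
x = np.sin(2 * np.pi * 5 * t)
coarse = quantize(x, 3)   # 8 levels: an audible "staircase" approximation
fine = quantize(x, 16)    # 65,536 levels: CD-quality amplitude resolution
print(np.max(np.abs(x - coarse)), np.max(np.abs(x - fine)))
```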

In order to accurately reproduce the sound, we must take enough samples: the sampling rate must be at least twice the highest frequency in the original recording (the Nyquist rate), otherwise the sound is not faithfully reproduced. For music, a standard sample rate is 44.1 kHz (44,100 samples per second), which is the standard for most consumer audio. After sampling, the computer must store the values. The bit depth characterizes the number of bits the digital gear has available for mapping a signal's amplitude values. Since sound waves have an extensive array of potential amplitude values, it becomes necessary to represent these values as bits for accurate measurement within the digital domain. The sketch below shows what happens when the sampling-rate requirement is violated.
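This is a minimal sketch of the Nyquist requirement: a 5 kHz tone sampled at 44.1 kHz (well above its 10 kHz Nyquist rate) is captured faithfully, while sampling the same tone at 8 kHz folds it down to a 3 kHz alias. The tone and rates are illustrative choices.

```python
import numpy as np

def dominant_frequency(samples, sample_rate):
    """Return the strongest frequency component, found via the FFT."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

TONE = 5000.0  # Hz; its Nyquist rate is 10,000 samples per second
for rate in (44100, 8000):  # one rate above the Nyquist rate, one below
    t = np.arange(rate) / rate  # one second of sample times
    samples = np.sin(2 * np.pi * TONE * t)
    print(rate, "->", dominant_frequency(samples, rate), "Hz")
# 44100 -> 5000.0 Hz (faithful); 8000 -> 3000.0 Hz (alias at 8000 - 5000)
```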

The bit depth cannot be too low: with too few bits per sample, the stored waveform becomes a coarse staircase approximation of the original. Choosing an adequate bit depth is therefore essential for accurately reproducing the audio.

Obstacles within Audio

A primary focus is to employ computational methods to alter sounds, addressing issues such as overmodulation, echo, and unwanted noise. To correct these issues that interfere with audio quality, we use post-processing algorithms. Useful techniques include data compression, acoustic echo cancellation (AEC), resampling, filtering, equalization, automatic gain control, and beamforming. Diving deeper into compression first: dynamic range compression diminishes the dynamic range of audio signals. When recording material with wide volume variations, working without compression can lead to a distorted sound; a compressor moderates the volume of the loudest sounds while amplifying the softer ones. A separate technique, data compression, reduces the bandwidth of digital audio streams. Its two primary types are lossless and lossy compression; lossy is more widely used due to its higher compression ratio, with MP3 and AAC among the most prevalent formats.
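To illustrate the core idea of dynamic range compression, here is a minimal per-sample sketch; the threshold and ratio are arbitrary values, and real compressors additionally smooth their gain changes with attack and release times.

```python
import numpy as np

def compress(signal, threshold=0.5, ratio=4.0):
    """Static compressor: amplitude above `threshold` is reduced by `ratio`."""
    magnitude = np.abs(signal)
    over = magnitude > threshold  # only the loud samples are affected
    out = np.copy(signal)
    out[over] = np.sign(signal[over]) * (
        threshold + (magnitude[over] - threshold) / ratio
    )
    return out

x = np.array([0.1, 0.4, 0.8, -1.0])
print(compress(x))  # loud samples pulled toward the threshold:
                    # [ 0.1    0.4    0.575 -0.625]
```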

Moreover, filters serve as fundamental circuits in signal processing and play a crucial role in eliminating undesired noise, echo, and distortion; the most used filters are low-pass, high-pass, band-pass, and band-stop. Low-pass filters pass frequencies below a designated cut-off frequency while attenuating those above it. A high-pass filter operates inversely, allowing frequencies above the cut-off to pass through. After resampling a signal, a band-pass filter can be applied to eliminate excess noise by attenuating frequencies outside its pass band. The band-stop filter leaves the majority of frequencies unaltered while suppressing those within a specified range to very low levels. Filtering is important for audio processing because it modifies the frequency response, changes the tone as needed, and removes unwanted noise. The sketch below shows all four types.
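Here is a minimal sketch of these four filter types using SciPy's Butterworth designs; the filter order and cutoff frequencies are illustrative choices.

```python
import numpy as np
from scipy.signal import butter, sosfilt

SAMPLE_RATE = 44100  # Hz

# Fourth-order Butterworth filters, specified by their cut-off frequencies.
low_pass  = butter(4, 1000,        btype="lowpass",  fs=SAMPLE_RATE, output="sos")
high_pass = butter(4, 1000,        btype="highpass", fs=SAMPLE_RATE, output="sos")
band_pass = butter(4, [300, 3400], btype="bandpass", fs=SAMPLE_RATE, output="sos")
band_stop = butter(4, [55, 65],    btype="bandstop", fs=SAMPLE_RATE, output="sos")

# Example: suppress 60 Hz mains hum while leaving a 440 Hz tone intact.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
noisy = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 60 * t)
clean = sosfilt(band_stop, noisy)
```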

Voice Recognition

Audio signal processing is essential for understanding how voice recognition works. Voice recognition is a machine's or program's ability to receive and interpret dictation and to understand spoken commands. Computers running voice recognition software must first convert analog audio into digital signals, a procedure known as analog-to-digital (A/D) conversion. The system must also possess a digital database of words, accompanied by a swift mechanism for comparing that data with incoming signals. The speech patterns are stored on the hard drive and loaded into memory when the program starts; a comparator then checks the patterns against the output of the A/D converter.
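The comparator step can be illustrated with a toy sketch that scores an incoming digitized pattern against a stored one using normalized correlation; a real recognizer compares spectral features (such as MFCCs) rather than raw samples, so everything below is purely illustrative.

```python
import numpy as np

def similarity(incoming, stored):
    """Normalized correlation between two equal-length sample patterns."""
    incoming = (incoming - incoming.mean()) / incoming.std()
    stored = (stored - stored.mean()) / stored.std()
    return float(np.dot(incoming, stored) / len(incoming))

rng = np.random.default_rng(0)
stored_pattern = rng.standard_normal(1000)  # a pattern "loaded from disk"
matching = stored_pattern + 0.1 * rng.standard_normal(1000)  # same word, noisy
different = rng.standard_normal(1000)                        # unrelated audio

print(similarity(matching, stored_pattern))   # close to 1.0: likely a match
print(similarity(different, stored_pattern))  # close to 0.0: no match
```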

Google Speech AI

One audio recognition technology that is heavily used nowadays is Google Speech AI, also known as Google Cloud Speech-to-Text. This technology applies machine learning and neural networks to process audio data and convert it into words. It is used across various industries in applications such as voice assistants, transcription services, call center analytics, voice search, and dictation software. Speech recognition serves as the foundation for voice assistants such as Siri and Alexa, which are used heavily in everyday life; it enables users to engage with computers using natural language. Speech recognition encompasses several steps for audio accuracy. The first essential part is recognizing the words and content the user provides, which involves training the model to identify each word. The program must then convert the recognized audio into text. The AI analyzes the frequency of spoken words and their combinations to deduce meaning, a process known as predictive modeling. Companies such as Google, Apple, and Amazon use AI-based voice recognition to enhance their customer service, providing a simpler, more efficient path to a smooth-running operation.
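As a minimal sketch of calling Cloud Speech-to-Text from Python (this assumes the google-cloud-speech package is installed and Google Cloud credentials are configured; the file name and audio parameters are hypothetical):

```python
# Requires: pip install google-cloud-speech, plus configured GCP credentials.
from google.cloud import speech

client = speech.SpeechClient()

# "command.wav" is a hypothetical 16 kHz, 16-bit mono PCM recording.
with open("command.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    # Each result carries ranked alternatives; take the most likely transcript.
    print(result.alternatives[0].transcript)
```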

Conclusion

Audio signal processing, with its roots in both analog and digital domains, plays an essential role in shaping the quality and clarity of sound. The journey begins with the representation of sound through waveforms, illustrating amplitude variations over time. Comprehending the frequency content of natural sounds is important for understanding the richness of audio. Advanced techniques, including compression, filtering, and equalization, balance the dynamic range and eliminate unwanted elements. The transition from analog to digital further refines audio quality, especially in modern digital systems. The evolution of audio processing now extends to AI-based speech recognition, pointing toward a future of efficient, high-quality speech processing.
