Basic Foundations of Spatial Audio utilizing Pyroomacoustics

Introduction

Spatial audio refers to the digital techniques by which audio signals are processed so that sounds appear to come from specific locations and directions relative to the listener. It is often used to deepen a user's sense of immersion, whether in music, a video game, or a movie theater. In this article, we explore the various fields in which spatial audio technology is applied, highlighting how it enhances immersion and realism in each context. We then give a basic introduction to the mathematical foundations of spatial audio before walking through a brief tutorial with Pyroomacoustics, a popular Python signal processing module, to create a virtual room and add directionality to both audio sources and microphones.

Background

Spatial audio technology has revolutionized how we experience sound across different domains. Its ability to create immersive and realistic auditory environments has led to significant advancements in several fields, including virtual reality, augmented reality, gaming, and even healthcare.

In virtual reality, spatial audio is essential to creating an immersive experience. It allows users to navigate through the virtual environment while anchoring them with auditory cues that match the visual stimuli. An example of the use of spatial audio in real-world virtual reality applications is training simulations: whether for a pilot, surgeon, or military personnel, these simulations use spatial audio to create realism and enhance the effectiveness of the training. Augmented reality, which blends virtual objects with the real world, uses spatial audio to strengthen this integration by ensuring virtual sounds come from the appropriate direction and distance. Its main use in augmented reality is navigation, where applications use spatial audio to guide users, improving orientation and usability.

The two fields that currently apply spatial audio the most are gaming and cinema. These industries rely on spatial audio to design unique and captivating environments for improved gameplay and cinematic experiences. Whether it is a first-person shooter game or a surround-sound installation in a movie theater, spatial audio technology offers a gateway to realistic auditory exposure. Spatial audio is also finding innovative applications in healthcare: it is used in therapy for conditions like PTSD or anxiety, where virtual environments with calming sounds help patients relax, and advanced hearing aids use spatial audio processing to help users better localize sounds. Spatial audio's ability to create a three-dimensional auditory experience has vast and varied applications across numerous fields.

Understanding how time and frequency change in real space is integral to the foundations of spatial audio and audio processing. To convert an ordinary audio signal into one that can be placed in a virtualized three-dimensional space, we rely on two main techniques: shifting the signal in time by a specific amount, and modifying the amplitude of its frequency content according to the angle and direction of the source. Together, these two techniques comprise the Head-Related Transfer Function, or HRTF for short. The HRTF is the acoustic transfer function that maps a sound source to the ear canal of a listener, without taking into account any interference from the room or the objects within it. HRTFs are highly idiosyncratic, as they depend on physiological factors such as the shape of the listener's head, pinna (the visible part of the ear), and torso, and on how these reflect and diffract sound. As a result, measuring the exact HRTF for a user is an extremely tedious and lengthy process, requiring experiments that record impulse responses to characterize the system. The measurement is conducted in an anechoic chamber to eliminate reflections and uses high-precision microphones placed at the ear canal. The impulse responses from each loudspeaker to the microphones are recorded and transformed into the frequency domain, capturing the unique filtering effects of the listener's anatomy. The data captured by the microphones includes both the direct path and any diffractions around the head and torso. Innovations such as machine learning and 3D modeling are further improving HRTF accuracy. An example of the general measurement process is shown below in Figure 1:

Figure 1: HRTF Measurement Setup
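
To make the two cues concrete, the sketch below is a minimal illustration, not a measured HRTF: it applies an interaural time difference using Woodworth's spherical-head approximation and a crude interaural level difference to a mono signal. The head radius, the roughly 6 dB maximum attenuation, and the function name simple_binaural_pan are illustrative assumptions rather than values from this article:

    import numpy as np

    def simple_binaural_pan(mono, fs, azimuth_deg, head_radius=0.0875, c=343.0):
        # Toy binaural rendering: an ITD plus a crude ILD, NOT a measured HRTF
        az = np.radians(azimuth_deg)
        # interaural time difference from Woodworth's spherical-head formula
        itd = (head_radius / c) * (abs(az) + np.sin(abs(az)))
        delay = int(round(itd * fs))
        # crude interaural level difference: the far ear is up to ~6 dB quieter
        far_gain = 10.0 ** (-6.0 * abs(np.sin(az)) / 20.0)

        near = np.concatenate([mono, np.zeros(delay)])
        far = np.concatenate([np.zeros(delay), mono]) * far_gain
        # positive azimuth = source to the right, so the left ear is the far one
        left, right = (far, near) if azimuth_deg >= 0 else (near, far)
        return np.stack([left, right], axis=1)  # stereo array, shape (n, 2)

    # example: a 1 kHz tone placed 45 degrees to the listener's right
    fs = 44100
    t = np.arange(fs) / fs
    stereo = simple_binaural_pan(np.sin(2 * np.pi * 1000 * t), fs, 45)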

Once we have determined the transfer function necessary to localize a signal, sound quality becomes the next concern in a realistic simulation. A key measure of sound quality is the reverberation time, which characterizes the amount of echo and the clarity of a sound within a room. Reverberation time is defined as the time it takes for a sound to decay by 60 dB. It can be approximated through Sabine's equation, a simple estimate given by RT60 = 0.049 V / A, where RT60 is the reverberation time in seconds, V is the volume of the space in cubic feet, and A is the total room absorption in sabins at a given frequency. A is in turn determined by A = Σ S·α, where S is the surface area of each material in square feet and α is its sound absorption coefficient (NRC) at that frequency. A short worked example of Sabine's equation is given after Figure 2. Understanding reverberation time is important for many applications: the design of concert halls, recording studios, and public spaces all relies on it to ensure optimal sound quality. More advanced techniques such as computational acoustic modeling and real-time digital signal processing can be used to fine-tune these reverberation characteristics and allow for a more tailored acoustic experience.

Figure 2: Example Reverberation Time of a Sound
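
As a quick worked example of Sabine's equation, the sketch below estimates RT60 for a hypothetical 20 ft x 15 ft x 10 ft room; the absorption coefficients (roughly 0.05 for painted drywall, 0.3 for carpet) and the function name sabine_rt60 are illustrative assumptions, not values from this article:

    def sabine_rt60(volume_ft3, surfaces):
        # Sabine's equation: RT60 = 0.049 * V / A, with A = sum of S * alpha
        # (imperial units: cubic feet for V, square feet for S)
        A = sum(S * alpha for S, alpha in surfaces)
        return 0.049 * volume_ft3 / A

    # hypothetical 20 ft x 15 ft x 10 ft room: painted drywall walls and
    # ceiling (alpha ~ 0.05) and a carpeted floor (alpha ~ 0.3)
    volume = 20 * 15 * 10
    surfaces = [
        (2 * (20 * 10 + 15 * 10), 0.05),  # four walls
        (20 * 15, 0.05),                  # ceiling
        (20 * 15, 0.30),                  # carpeted floor
    ]
    print(f"Estimated RT60: {sabine_rt60(volume, surfaces):.2f} s")  # ~1.05 s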

In contrast to these traditional, manual signal processing methods, many modern software packages, such as Pyroomacoustics, abstract away the heavy computation and provide intuitive tools to simulate and analyze spatial audio. At its core, Pyroomacoustics expedites this process by acting as a room impulse response generator based on an image source model that works with 2D/3D rooms of any type. This is done by first creating a virtualized room. The module then generates artificial room impulse responses (RIRs) between the audio source and the listener's microphone, simulating the real-world physics that a generic HRTF would capture and using Sabine's equation to modulate sound quality. Once the impulse responses have been created, the source audio sample is convolved with the RIRs corresponding to the listener's location in the virtualized room. Next, we introduce the basic features of Pyroomacoustics by walking through an example that sets up different audio locations within a room.
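
Under the hood, the key operation is simply a convolution of the dry source signal with an RIR. The following toy sketch, which uses a synthetic exponentially decaying RIR invented for illustration rather than one produced by Pyroomacoustics, shows the operation the library performs for every source-microphone pair:

    import numpy as np
    from scipy.signal import fftconvolve

    fs = 16000
    dry = np.random.randn(fs)  # stand-in for one second of a dry source signal

    # synthetic RIR: a direct-path impulse plus exponentially decaying reflections
    rir = np.zeros(int(0.3 * fs))
    rir[0] = 1.0
    rir += 0.3 * np.random.randn(rir.size) * np.exp(-np.linspace(0, 8, rir.size))

    # the "wet" signal heard at the microphone is the convolution of the two
    wet = fftconvolve(dry, rir)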

Pyroomacoustics Basic Tutorial

The first step in setting up our program is to install the Pyroomacoustics Python module alongside the other modules we will use to read and play the audio signals. To do this, we can run:

    pip install numpy pyroomacoustics scipy sounddevice

Once we have installed Pyroomacoustics, we begin by importing the necessary packages into our Python file:

    import numpy as np
    import pyroomacoustics as pra
    from scipy.io import wavfile
    import sounddevice as sd

Next, we set up our basic room parameters with a 10 m x 7.5 m x 3.5 m room and a 0.3-second reverberation time, and import our source signal by reading an example wavfile. In the example, we invert Sabine's formula (with pra.inverse_sabine()) to determine the wall energy absorption and the maximum order of the image source model needed to achieve the target reverberation time. Finally, we create a virtualized room in the shape of a shoebox (a rectangular, box-shaped room in which every corner is a right angle) using the pra.ShoeBox() method:

    # The desired reverberation time and dimensions of the room
    rt60_tgt = 0.3  # seconds
    room_dim = [10, 7.5, 3.5]  # meters

    # We invert Sabine's formula to obtain the parameters for the ISM simulator
    e_absorption, max_order = pra.inverse_sabine(rt60_tgt, room_dim)

    # import a mono wavfile as the source signal
    # the sampling frequency should match that of the room
    fs, audio = wavfile.read("./CantinaBand3.wav")

    # Create the room
    room = pra.ShoeBox(
        room_dim, fs=fs, materials=pra.Material(e_absorption), max_order=max_order
    )

Following this, to add an audio source and a microphone array (the listener) to the virtualized room, we can simply use the built-in methods .add_source() and .add_microphone_array() of pra.ShoeBox. The source takes a single coordinate list, while the microphone array takes an array whose columns are the microphone positions (built here with np.c_):

    # place the source in the room
    room.add_source([2.5, 3.73, 1.76], signal=audio, delay=0.5)

    # define the locations of the microphones
    mic_locs = np.c_[
        [6.3, 4.87, 1.2], [6.3, 4.93, 1.2],  # mic 1  # mic 2
    ]

    # finally place the array in the room
    room.add_microphone_array(mic_locs)

Next, we generate the RIRs that characterize the room's system by calling room.simulate(), which builds the RIRs automatically and convolves them with the source signal. The output of the simulated convolutions is stored in the signals attribute of the room's microphone array (room.mic_array.signals), which we then write to a wav file:

    # Run the simulation (this will also build the RIR automatically)
    room.simulate()

    room.mic_array.to_wav(
        "./CantinaBand3_Mod.wav",
        norm=True,
        bitdepth=np.int16,
    )
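
As an optional sanity check (this assumes a reasonably recent Pyroomacoustics release, which provides a Room.measure_rt60() method), we can compare the reverberation time actually achieved by the simulation against the 0.3-second target:

    # measure_rt60() returns one RT60 estimate per (microphone, source) pair
    rt60_measured = room.measure_rt60()
    print(f"Target RT60: {rt60_tgt} s")
    print(f"Measured RT60: {rt60_measured.mean():.3f} s")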

Finally, we can play the original audio signal followed by the newly modified signal using a simple loop that reads each file and calls sd.play() and sd.wait():

    audios = ["./CantinaBand3.wav", "./CantinaBand3_Mod.wav"]
    for path in audios:
        print(f"Playing {path}")
        fs, audio = wavfile.read(path)
        sd.play(audio, fs)
        sd.wait()

Conclusion

In this exploration of spatial audio foundations and their practical implementation using Pyroomacoustics, we have delved into the techniques behind creating immersive auditory experiences. We began by surveying the fields in which spatial audio technology is used: virtual reality, augmented reality, gaming, healthcare, and plenty of others, where its ability to create a three-dimensional auditory experience enhances realism and pushes user engagement to new heights. We then covered the fundamental concepts of Head-Related Transfer Functions (HRTFs), looked into the detailed measurement process and the significance of sound quality in spatial simulations, and highlighted the complexities involved in replicating real-world audio environments digitally.

Through Pyroomacoustics, we have demonstrated how these complexities can be abstracted away, allowing for streamlined creation and analysis of spatial audio scenarios. From setting up room parameters and importing audio signals to defining source and microphone locations within a virtualized space, the tutorial offers a structured approach to introductory spatial audio processing. It is meant as a foundational resource; readers who want to explore more features can consult the official Pyroomacoustics documentation web page.

Full Example Code

    import numpy as np
    import pyroomacoustics as pra
    from scipy.io import wavfile
    import sounddevice as sd

    # The desired reverberation time and dimensions of the room
    rt60_tgt = 0.3  # seconds
    room_dim = [10, 7.5, 3.5]  # meters

    # We invert Sabine's formula to obtain the parameters for the ISM simulator
    e_absorption, max_order = pra.inverse_sabine(rt60_tgt, room_dim)

    # import a mono wavfile as the source signal
    # the sampling frequency should match that of the room
    fs, audio = wavfile.read("./CantinaBand3.wav")

    # Create the room
    room = pra.ShoeBox(
        room_dim, fs=fs, materials=pra.Material(e_absorption), max_order=max_order
    )

    # place the source in the room
    room.add_source([2.5, 3.73, 1.76], signal=audio, delay=0.5)

    # define the locations of the microphones
    mic_locs = np.c_[
        [6.3, 4.87, 1.2], [6.3, 4.93, 1.2],  # mic 1  # mic 2
    ]

    # finally place the array in the room
    room.add_microphone_array(mic_locs)

    # Run the simulation (this will also build the RIR automatically)
    room.simulate()

    room.mic_array.to_wav(
        "./CantinaBand3_Mod.wav",
        norm=True,
        bitdepth=np.int16,
    )

    audios = ["./CantinaBand3.wav", "./CantinaBand3_Mod.wav"]
    for path in audios:
        print(f"Playing {path}")
        fs, audio = wavfile.read(path)
        sd.play(audio, fs)
        sd.wait()
