dev.AudioRoadmap - tooll3/t3 GitHub Wiki

Discussing future Audio Pipeline

This is currently just a place to store some thoughts about the long-term future of Audio in Tooll. Nothing mentioned here is a promise, and everything is subject to change (and probably will change).

It may rely heavily on a Channel-like workflow, similar to TouchDesigner's, which is currently in the ideation phase for future Tooll development. At the very least, it should be compatible with that workflow so you can crunch your audio into all kinds of bits and bytes to drive visuals and whatnot.

Ultimately, it would be great if Tooll were able to handle most audio output/manipulation that a typical DAW would, sans recording. That means playing back live audio and multiple audio clips simultaneously, routing audio through busses for effects processing, running FFTs on arbitrary audio streams, etc., to drive different visuals in Tooll. With this, the possibilities for VJ-ing, live performance, composition, and playful iteration can be greatly expanded.

Integrating VST/CLAP effects would be a real game changer here.

For the backend, NAudio and BASS may already have everything we need to make this happen, so long as it's handled carefully. If not, we may need to seek an alternative.

Things that still need to be considered:

  • If most operators are lazily evaluated (i.e. outside the [AudioOutput] loop), then FFT and other effects driving visuals won't be sample-accurate. Is that OK? (Probably...)
  • How do Unreal and other similar tools handle the UX for this?
  • Many other things, I'm sure

AudioEngine

After mentally working through some of the pros and cons of following certain Tooll workflow standards, I currently prefer a static AudioEngine class that manages the project's audio playback - meaning it can reference instances of [AudioClip] and [AudioSource] and trigger an "Update" function such that all audio output will execute according to the project's desired sample rate.

This class will do nothing but play audio along with the timeline (or consistently for [AudioSource]) - aka provide samples to its [AudioClip] and [AudioSource] instances. It will not output audio by itself - not without an [AudioOutput] operator. This allows audio to be used for more than just sound :)

For sample rate, we should support 48000 Hz by default. No other sample rates seem necessary in the short term, and any other audio, when imported, should be re-sampled. For live input with [AudioSource], we should request a sample rate of 48000 Hz until we have a demand or need for greater flexibility. This can simply be defined as a constant in the AudioEngine.

This class should also cache information about the provided audio files, including but not limited to loading them all into a dictionary and pre-computing FFT data if necessary. It would also handle re-sampling to the target sample rate if necessary.
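To make the re-sampling step concrete, here is a minimal sketch using linear interpolation. It is written in Python purely for brevity (not Tooll's C#), and `resample_linear` is a hypothetical name. Linear interpolation is a swapped-in simplification: a production resampler would also low-pass filter to avoid aliasing, which is the kind of work the audio libraries are better suited for.

```python
TARGET_RATE = 48_000  # Hz, the proposed engine-wide constant


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample a mono sequence by linear interpolation (illustration only)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        i0 = int(pos)                          # sample at or before pos
        i1 = min(i0 + 1, len(samples) - 1)     # next sample, clamped at the end
        frac = pos - i0
        out.append(samples[i0] * (1.0 - frac) + samples[i1] * frac)
    return out


# Doubling the rate of a short ramp inserts the midpoints.
doubled = resample_linear([0, 1, 2, 3], src_rate=24_000)
```

Doing this once at import time and caching the result keeps the playback path free of per-buffer rate conversion.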

This class would also define the default audio bufferSize, likely a single global bufferSize to keep the code tame and the devices happy.
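The choice of bufferSize directly sets how much time one buffer covers, which matters for both latency and how "fresh" audio-driven visuals can be. A small Python calculation (illustrative only; the candidate sizes are assumptions, not decisions) shows the relationship duration = bufferSize / sampleRate:

```python
SAMPLE_RATE = 48_000  # Hz, the proposed default


def buffer_duration_ms(buffer_size, sample_rate=SAMPLE_RATE):
    """Time covered by one buffer, in milliseconds."""
    return 1000.0 * buffer_size / sample_rate


for size in (256, 512, 1024):
    print(size, round(buffer_duration_ms(size), 2))
# 256 5.33
# 512 10.67
# 1024 21.33
```

For comparison, one frame at 60 fps lasts about 16.67 ms, so a 1024-sample buffer spans more than one rendered frame.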

It would also make sense for this AudioEngine to detect feedback and mute itself to prevent damage to users' systems (or hearing).

For [AudioOutput] instances nested inside another operator, I'm not entirely sure what to do. One idea is to have operators containing nested audio automatically generate an Input boolean for determining if the [AudioOutput] should output audio or not.

For [AudioClip] and [AudioSource], this class should be aware of whether or not each instance is currently wired to anything that requests its output (i.e. an Output window or a Command). If it isn't, don't bother reading samples for it.

SampleBuffer

This would be the class used for audio sampling, to be pooled and re-used as new samples are received from our audio streams. It would simply be a wrapper class for an array of samples of length bufferSize. We should target 32-bit floating point audio by default to simplify development, but should strive to maintain interoperability with 24-bit audio once we get our sea legs. It would be a good idea to use .NET 7's generic math by default.
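The pooling idea can be sketched as follows. This is a Python illustration rather than C#, and `SampleBuffer`, `SampleBufferPool`, `rent`, and `give_back` are hypothetical names; the buffer size of 1024 is also an assumption.

```python
from array import array

BUFFER_SIZE = 1024  # assumed global bufferSize


class SampleBuffer:
    """Thin wrapper around a fixed-length array of 32-bit floats."""
    __slots__ = ("samples",)

    def __init__(self, size=BUFFER_SIZE):
        self.samples = array("f", [0.0]) * size


class SampleBufferPool:
    """Hands out buffers and takes them back so steady-state playback allocates nothing."""

    def __init__(self, size=BUFFER_SIZE):
        self._size = size
        self._free = []
        self.allocated = 0  # how many buffers were ever created

    def rent(self):
        if self._free:
            return self._free.pop()
        self.allocated += 1
        return SampleBuffer(self._size)

    def give_back(self, buffer):
        # Contents are left as-is; the next renter is expected to overwrite them.
        self._free.append(buffer)


pool = SampleBufferPool()
first = pool.rent()
pool.give_back(first)
second = pool.rent()  # the same object comes back, no new allocation
```

Once the pool has warmed up, `allocated` stops growing, which is the property the zero-garbage goal depends on.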

SampleBuffers should also be interoperable with Tooll's future Channel system. Since each audio channel will have a dedicated SampleBuffer, this should be relatively straightforward, and could provide an extremely thorough testing ground for the stability and efficiency of the Channel workflow.

SampleBufferGroup

It may be worth having a wrapper class for SampleBuffers to support arbitrary numbers of channels, for simpler interop with different kinds of audio and Channel workflows.

.NET approach

More recent versions of .NET provide many more options for controlling how the GC works - allowing it to work more asynchronously and less aggressively. To me, this option should be considered but should be a last resort. The development goal of this should be to create zero garbage at runtime, with the only exceptions being when the sample rate is changed or other non-runtime events.

Programming this in C# is often thought to be a bit of a square peg, but audio libraries like NAudio tend to disagree. .NET's latest advancements should allow us to create a system that performs at near-native speeds.

When passing SampleBuffers from op to op, each op should maintain its own buffer - its output buffer - which it writes into and re-uses on every update. Downstream ops should treat the buffers they receive as immutable (read-only). Essentially, operations should work something like OutputSampleBuffer[i] = TransformSample(InputSampleBuffer[i]). Working solely with value-types in this way and not creating new SampleBuffer objects means we should be set up for success for efficient operations. Ideally, any object instantiated at runtime should be pooled if discarded.
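A minimal sketch of that pattern, in Python for brevity rather than C#. `MapOp` and `process` are hypothetical names; the point is only that the output array is allocated once and the input is never written to.

```python
from array import array


class MapOp:
    """Applies a per-sample transform into an output buffer it owns and re-uses."""

    def __init__(self, buffer_size, transform):
        self.transform = transform
        self.output = array("f", [0.0]) * buffer_size  # allocated once

    def process(self, input_samples):
        out = self.output
        for i in range(len(out)):
            # OutputSampleBuffer[i] = TransformSample(InputSampleBuffer[i])
            out[i] = self.transform(input_samples[i])
        return out


halve = MapOp(4, lambda s: s * 0.5)
source = array("f", [1.0, -1.0, 0.5, 0.0])
first_result = halve.process(source)
second_result = halve.process(source)  # same output object, no new allocation
```

Because each op only reads its input, several ops can consume the same upstream buffer without copying it first.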

SIMD

We should also look into the potential for utilizing SIMD vector operations wherever possible - if we can turn a 1024-length buffer into something the CPU considers to be a 256-length buffer, that'd be wonderful. I don't know a ton about that, though, and it would need to be investigated. Using Spans to create Vector4s out of an array should be trivial, though, if that's even necessary. (Edit: I actually found a wonderful library for exactly this, and it's something we should probably use for other things too)

Here's what might be a helpful guide to using .NET's native SIMD stuff.

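A note on Roslyn vs. RyuJIT, since the two are easy to conflate: they aren't alternatives. Roslyn compiles C# to IL, and RyuJIT is the runtime's JIT compiler that turns that IL into machine code, which is where SIMD instructions actually get emitted. So operators compiled with Roslyn should still reap the benefits of SIMD as long as they run on a RyuJIT-based runtime (the default on modern .NET), just like Core and Editor. It's still worth verifying from inside an operator, e.g. by checking Vector.IsHardwareAccelerated.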

Library approach

Much of this may sound like it warrants reinventing the wheel. However, BASS and NAudio, both of which are currently referenced in the Tooll project, are well-respected libraries for doing everything we're talking about here. Where we can offload work to these libraries, we probably should, and we should investigate opportunities for that. We need to research what kinds of inputs and outputs these libraries give us for audio streams, and whether we can use those outputs with little enough overhead or GC pressure to still enable our Tooll tomfoolery.

Operators (Realtime)

I propose the following operators for realtime audio work - meaning these will be updated by the AudioEngine, independently of Tooll's typical [Command] or DirtyFlag-oriented approach. Not to say they won't have DirtyFlags - it's just that if any non-audio op requests its output, it will almost certainly always be dirty.

I also think that [AudioSource] and [AudioClip] should still not even bother updating their outputs unless they're requested - they can be marked dirty with each buffer of samples from the AudioEngine, but should only copy the contents to their outputs when Update() is called.
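One way to picture this "mark dirty, copy on demand" behaviour is sketched below in Python (not C#). `LazyAudioSlot`, `on_engine_samples`, and `update` are hypothetical names, and holding a reference to the engine's buffer assumes that buffer is not overwritten before `update` runs.

```python
class LazyAudioSlot:
    """Remembers the newest engine buffer but only copies it when output is requested."""

    def __init__(self, size):
        self._latest = None          # reference to the engine's most recent buffer
        self._output = [0.0] * size  # this op's own output storage
        self.dirty = False
        self.copies = 0              # counts how often a copy actually happened

    def on_engine_samples(self, samples):
        # Called by the engine for every buffer: cheap, no copying.
        self._latest = samples
        self.dirty = True

    def update(self):
        # Called only when something downstream requests the output.
        if self.dirty and self._latest is not None:
            self._output[:] = self._latest
            self.dirty = False
            self.copies += 1
        return self._output


slot = LazyAudioSlot(2)
slot.on_engine_samples([0.1, 0.2])
slot.on_engine_samples([0.3, 0.4])
slot.on_engine_samples([0.5, 0.6])
latest = slot.update()  # three buffers arrived, but only one copy is made
```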

[AudioSource]

This is the live audio input - on the operator you can select from a list of audio sources available on your system, and that audio will play back in real time. You can have as many instances of each device as you like, as on the AudioEngine they should only be references to the actual device. As such, they can share data, allowing multiple references to have little computational overhead.
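The "many operators, one device" idea could look roughly like the reference-counted registry below. This is a Python sketch, not C#; `DeviceRegistry`, `acquire`, `release`, and the `open_device` callback are all hypothetical and stand in for whatever the audio backend actually provides.

```python
class DeviceRegistry:
    """Opens each named device once and shares it between all operators using it."""

    def __init__(self, open_device):
        self._open_device = open_device  # callback that really opens a device
        self._entries = {}               # name -> [device, reference count]

    def acquire(self, name):
        entry = self._entries.get(name)
        if entry is None:
            entry = [self._open_device(name), 0]
            self._entries[name] = entry
        entry[1] += 1
        return entry[0]

    def release(self, name):
        entry = self._entries[name]
        entry[1] -= 1
        if entry[1] == 0:
            # Last user is gone; a real backend would close the device here.
            del self._entries[name]

    def is_open(self, name):
        return name in self._entries


opened = []


def fake_open(name):
    opened.append(name)
    return {"name": name}


registry = DeviceRegistry(fake_open)
mic_a = registry.acquire("Microphone")
mic_b = registry.acquire("Microphone")  # second operator, same underlying device
```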

The instance should output as many channels as that device contains, in a similar way to Carla.

Input:

  • Audio source name (drop-down list)
  • Always on (bool) - if true, this audio source will continue to be processed, even when playback is stopped.
  • Gain

Output:

  • SampleBuffer[] or SampleBufferGroup

[AudioClip]

This is essentially a [TimeClip] but for audio file playback, with additional bells and whistles for controlling gain, panning, and the region of the provided audio file that clip will play back. Changing the start/end time of this file will lengthen/shorten the duration of the clip in the Timeline while maintaining its start position.
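The relationship between the clip's position, the selected file region, and the sample to read can be sketched as below. This is Python for illustration, with all times assumed to be in seconds (Tooll's timeline may well use different units), and the function and parameter names are hypothetical.

```python
SAMPLE_RATE = 48_000  # Hz


def clip_duration(region_start, region_end):
    """Length of the clip on the timeline equals the length of the selected region."""
    return region_end - region_start


def sample_index_at(timeline_time, clip_start, region_start, region_end):
    """Index of the file sample playing at timeline_time, or None outside the clip."""
    local = timeline_time - clip_start
    if local < 0 or local >= clip_duration(region_start, region_end):
        return None
    return int(round((region_start + local) * SAMPLE_RATE))


# A clip placed at 10 s that plays the file region from 1.5 s to 4.0 s.
index = sample_index_at(12.0, clip_start=10.0, region_start=1.5, region_end=4.0)
```

Moving `region_end` changes the duration while `clip_start` stays put, matching the behaviour described above.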

Inputs

  • [AudioFile] or filepath

Outputs

  • SampleBuffer[] or SampleBufferGroup

[AudioOutput]

This will play back audio through your system's default audio device by forwarding the SampleBuffers provided to it to the AudioEngine. This is likely the most challenging piece of the puzzle, as performance will have a direct impact on its quality. The way that audio channels are interpreted (i.e. Left, Right) will be determined by the index of the SampleBuffer in the SampleBuffer[]/SampleBufferGroup provided to this, following standard audio conventions for Stereo, 5.1, and 7.1 sound (i.e. index 0 is Left, index 1 is Right).
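Since each channel lives in its own SampleBuffer, the output stage will probably need to interleave them, because many audio APIs expect frames laid out as L R L R and so on. A minimal Python sketch of that step (illustrative, with a hypothetical `interleave` helper):

```python
def interleave(channels):
    """Turn per-channel buffers into one interleaved buffer, channel index = position in frame."""
    channel_count = len(channels)
    frames = len(channels[0])
    out = [0.0] * (frames * channel_count)
    for c, samples in enumerate(channels):
        out[c::channel_count] = samples  # every channel_count-th slot, starting at c
    return out


left = [1.0, 2.0, 3.0]
right = [10.0, 20.0, 30.0]
stereo = interleave([left, right])  # index 0 is Left, index 1 is Right
```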

A stretch goal would be to allow choosing a specific device, so that multiple [AudioOutput] instances can be used for different physical and virtual devices.

If the selected device is not found, it should log an error to the user and fall back to the system's default device.

It may be worth considering this operator as functioning similarly to [RenderTarget], except instead of the OutputWindow or a separate requesting operator asking for its data, it's driven by the AudioEngine.

Inputs

  • SampleBuffer[] or SampleBufferGroup
  • Gain

Operators (lazily evaluated)

These operators will be totally divorced from the AudioEngine class (with the potential exception of [AudioFile]). They will be evaluated in the same way as every other operator in Tooll - evaluated as needed. If they are placed in a chain that feeds an [AudioOutput], they will of course run in lockstep with the AudioEngine.

Bus

This is where the magic happens. This operator provides a convenient place to merge different audio streams together to make your audio pipeline more manageable, more efficient, or reroute specific sounds, devices, or channels to achieve specific visual or auditory effects. It's very dumb, though I'm probably underestimating the engineering required to properly mix different sample streams together. In theory, it would just output the sample-by-sample sum of its inputs (plain addition in the time domain, no FFT needed), though the summed signal can exceed full scale and clip, so further research into gain staging or limiting is required. This may be one of those things BASS or NAudio needs to help with if it proves to be too complicated.
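The simplest reading of "the sum of its inputs" is adding samples index by index and then applying the bus gain. A minimal Python sketch (not C#; `mix` is a hypothetical name, and all inputs are assumed to have the same length):

```python
def mix(inputs, gain=1.0, out=None):
    """Sum equally sized mono buffers sample by sample, then apply gain."""
    length = len(inputs[0])
    if out is None:
        out = [0.0] * length  # a real implementation would re-use a pooled buffer
    for i in range(length):
        total = 0.0
        for samples in inputs:
            total += samples[i]
        out[i] = total * gain
    return out


drums = [0.25, 0.5, -0.5]
synth = [0.25, -0.5, -0.5]
bus = mix([drums, synth])  # note the last sample sums to -1.0, right at full scale
```

Nothing here prevents the sum from exceeding the [-1, 1] range, which is where limiting or per-input gain would come in.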

Input

  • SampleBuffer[] or SampleBufferGroup (Multi-Input Slot)
  • Gain

Output

  • SampleBuffer[] or SampleBufferGroup

Gain

This is just an op to apply gain separately for greater flexibility in audio routing and audio/visual effects. Ideally, it's not specific to audio and can support a wider breadth of Channel workflows. But maybe an audio-specific variant is warranted to utilize proper dB conventions.
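For the audio-specific variant, the usual amplitude convention is linear = 10^(dB / 20). A small Python sketch of the conversion (function names are hypothetical):

```python
import math


def db_to_linear(db):
    """Convert a gain in decibels to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)


def linear_to_db(linear):
    """Convert a linear amplitude multiplier to decibels (silence maps to -inf)."""
    if linear <= 0.0:
        return float("-inf")
    return 20.0 * math.log10(linear)


half_volume = db_to_linear(-6.0)  # roughly 0.5
unity = db_to_linear(0.0)         # exactly 1.0
```

A generic Channel-oriented Gain op would just multiply by a float, while the audio variant could expose its parameter in dB and convert with `db_to_linear` before multiplying.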

Input

  • SampleBuffer[] or SampleBufferGroup
  • Gain (float)

Output

  • SampleBuffer[] or SampleBufferGroup

[AudioFile]

This provides an audio file that can be used in an [AudioClip]. It provides no playback by itself, but it provides a convenient visual indicator of where certain audio files live and useful metadata if you choose to use it. I am wondering if this should be mandatory.

Input:

  • File path

Outputs:

  • File path
  • Audio metadata

(Optional) a struct providing all of that audio's metadata, so an [AudioClip] can loop (wrap) or rename itself without requesting that information itself?

FFT

This allows you to perform FFT operations on an audio stream, much like how the current [FFT] or [AudioAnalysis] operators work. We may want several variations of this or several different kinds of outputs - I haven't thought the specifics through.
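As a reference for what such an operator would output, the sketch below computes a magnitude spectrum with a naive DFT. It is Python for illustration only and deliberately not a fast implementation (a real operator would use an FFT from a library); `magnitude_spectrum` and `bin_frequency` are hypothetical names. The frequency of bin k is k * sampleRate / N.

```python
import cmath
import math

SAMPLE_RATE = 48_000  # Hz


def magnitude_spectrum(samples):
    """Magnitudes of bins 0..N/2 of the DFT, normalised by N (naive O(N^2) version)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, s in enumerate(samples):
            acc += s * cmath.exp(-2j * math.pi * k * t / n)
        magnitudes.append(abs(acc) / n)
    return magnitudes


def bin_frequency(k, n, sample_rate=SAMPLE_RATE):
    """Centre frequency in Hz of bin k for a transform of length n."""
    return k * sample_rate / n


# A sine that completes exactly 4 cycles in 64 samples lands in bin 4.
N = 64
sine = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
spectrum = magnitude_spectrum(sine)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
```

With a 1024-sample buffer at 48000 Hz, each bin would be 46.875 Hz wide, which gives a feel for the resolution available to visuals.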

Inputs

  • SampleBuffer or SampleBuffer[] or SampleBufferGroup

Outputs

  • SampleBuffer or SampleBuffer[] or SampleBufferGroup

Channel workflow inside the audio playback loop

Our future Tooll Channel workflow should be interoperable with everything in here, allowing for absolutely heinous operations on raw audio data. These operations would become much more demanding if they feed an [AudioOutput], since they then have to keep up with the audio loop, but could be great fun to play with for surprising audio effects (someone could even make a compressor or delay) or an important tool to shape audio output to drive visuals.

Audio Clip Playback Speed control

I would want speed to be controlled through the typical [TimeClip] interface, though I don't know much about how that works and how well that would play with the way we sample audio. Could be worth doing though.

Sketch

The sketch below is not up to date with what I've described above, but until I update it, it can serve as a general example of the possibility space.

[Image: sketch]