Week 2 Codefest

Bootstrapping the main project!

Introduction to the problem: Real-time parameter updating of a VAE for audio synthesis

My initial idea is to apply audio synthesizer techniques to a generative variational autoencoder (VAE) network, really just to see what it sounds like! In traditional synthesis, low-frequency oscillators (LFOs) are used to update parameters (filter cutoff points, fundamental frequencies, volumes, etc.) in real time, using oscillators in the infrasonic-to-low-audible range to create richer, time-evolving textures. I would like to see if the same idea can be applied to generative networks in a way that creates interesting sonic textures. To do this, I would like to train a VAE on a collection of sounds (on a server, not at the edge) and, in traditional VAE style, use the latent space to control various parameters. I would then drive the decoder half of the network with an LFO, modulating the latent parameters, to create evolving textures. The goal is to implement that latter half on a chiplet with a really lightweight network. I'm not sure if this will work or sound like anything interesting, but let's have a look.
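To make the idea concrete, here's a rough sketch of the LFO-on-latent-space concept in Python/NumPy. The decoder below is just a stand-in (the real one would come from training), and the sample rate, frame length, and per-dimension LFO rates are made-up values for illustration:

```python
import numpy as np

# Hypothetical stand-in for the trained VAE decoder: maps a small latent
# vector to a short audio frame. The real decoder would come from training.
def decoder(z, frame_len=1024):
    rng = np.random.default_rng(0)
    w = rng.standard_normal((len(z), frame_len)) * 0.1
    return np.tanh(z @ w)  # fake "audio" frame, just to show the data flow

sr = 16000          # sample rate (assumed)
frame_len = 1024    # samples per decoded frame
latent_dim = 8      # 4-8 latent parameters, per the plan
z_base = np.zeros(latent_dim)                                    # a point in latent space
lfo_rate = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 8.0])   # Hz, one LFO per dimension
lfo_depth = 0.5

frames = []
for i in range(100):                   # ~6.4 s of audio at these settings
    t = i * frame_len / sr
    # Each latent dimension gets its own sine LFO, like an LFO on a filter cutoff
    z = z_base + lfo_depth * np.sin(2 * np.pi * lfo_rate * t)
    frames.append(decoder(z, frame_len))

audio = np.concatenate(frames)
```

The decoder is the only part that has to run per-frame, which is what makes it the natural candidate for the chiplet.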

Heilmeier Questions

What am I trying to do?

I'm trying to train a lightweight VAE network on some database of sounds with 4-8 latent space parameters, and implement the decoder and an LFO on a chiplet. The LFO will adjust the latent space parameters in real time to synthesize sound.
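For reference, this is roughly the shape of the lightweight VAE I have in mind (a sketch in PyTorch; the layer sizes are placeholders, and the input is assumed to be fixed-size audio chunks or flattened spectrogram frames). Only the decoder half plus the LFO would need to live on the chiplet:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Small VAE: the encoder stays on the server; only the decoder
    (driven by an LFO-modulated z) would need to fit on the chiplet."""
    def __init__(self, input_dim=1024, hidden_dim=128, latent_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Tanh(),  # audio in [-1, 1]
        )

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Keeping the decoder to a couple of small dense layers is what would (hopefully) make it feasible in hardware.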

How is it done today, and what are the limits of current practice?

Generative audio today is done on a CPU or GPU, and to my knowledge there are no dedicated standalone synthesizers doing this. That means you're normally sitting at your computer while making music, which isn't uncommon, but a lot of people like dedicated devices and getting away from their monitors.

What’s new in your approach and why do you think it will succeed?

What's new is moving the generative model onto dedicated hardware: it would make this type of sound design more portable and allow more modular hardware that expands on the current state of standalone synthesizers.

Who cares?

Hardware synthesizers are wildly popular in electronic music, and it would be exciting to introduce AI-based synthesis to a standalone device. People will only care if it sounds interesting, however.

What are the risks?

I don't know if the type of VAE we need will fit on dedicated HW. I also don't know if it will sound good; that depends on the network choice.

(I'm not going to address time and cost because the answer to both is 'just the time I have for the class'.)

What are the midterm and final “exams” to check for success?

Midterm: SW: a working algorithm that generates interesting sounds. HW: a top-level diagram for the design and the associated workflow to implement it. I should be able to implement the LFO and have a testbench to verify my design.
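For the LFO piece, one option is to write a fixed-point phase-accumulator LFO in Python first and use it as a golden reference that the HW testbench checks against. This is only a sketch; the bit widths and update rate are guesses:

```python
import math

PHASE_BITS = 16   # assumed phase accumulator width
OUT_BITS = 12     # assumed signed output width

def lfo_step(phase, increment):
    """Advance the phase accumulator one tick and return a signed
    OUT_BITS sine sample plus the new phase."""
    phase = (phase + increment) & ((1 << PHASE_BITS) - 1)
    angle = 2 * math.pi * phase / (1 << PHASE_BITS)
    sample = int(round(math.sin(angle) * ((1 << (OUT_BITS - 1)) - 1)))
    return sample, phase

def lfo_increment(freq_hz, update_rate_hz):
    """Phase increment for a given LFO frequency and update rate."""
    return int(round(freq_hz / update_rate_hz * (1 << PHASE_BITS)))

# Quick self-check: a 2 Hz LFO updated at 1 kHz should be back near zero
# after half a period (250 updates -> half a cycle).
phase = 0
inc = lfo_increment(2.0, 1000.0)
for _ in range(250):
    sample, phase = lfo_step(phase, inc)
assert abs(sample) < 64, "LFO did not return near zero after half a period"
```

The HW testbench could then drive the same frequency/rate settings and compare the RTL output against this model sample by sample.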

Code

I got some generated sounds using the spoken MNIST database chunked into 100 ms sections, but it didn't sound like anything. That makes me wonder about the viability of this project. I'm going to try a more musically diverse dataset with variable-size chunks, but I'm not sure if that will work. My ChatGPT transcript so far is attached.

CF2-Transcript.pdf
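For reference, the chunking itself is simple; something like the following (assuming 8 kHz mono audio, which is what the spoken-digit recordings use) is what I mean by fixed-size 100 ms sections:

```python
import numpy as np

def chunk_audio(signal, sr, chunk_ms=100):
    """Split a 1-D audio signal into fixed-length chunks (dropping the
    remainder), e.g. 100 ms windows used as training frames."""
    chunk_len = int(sr * chunk_ms / 1000)
    n_chunks = len(signal) // chunk_len
    return signal[:n_chunks * chunk_len].reshape(n_chunks, chunk_len)

# Example with a synthetic 1 s, 440 Hz tone at 8 kHz
sr = 8000
sig = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
chunks = chunk_audio(sig, sr, chunk_ms=100)   # -> shape (10, 800)
print(chunks.shape)
```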

My Colab notebook is in the CF2 folder.

I still need to get something that sounds interesting to make this project a success. I need to think about it a little more.