Week 3 Codefest
Introduction
After doing some research, I decided to try accelerating the WaveNet-style autoencoder Google uses for its Magenta NSynth project (https://arxiv.org/pdf/1704.01279). One issue I ran into early on is that it's an autoregressive model: audio is generated one sample at a time, so inference is inherently sequential rather than parallelizable. I'm wondering if that makes this a bad project to accelerate, given that parallelization is sort of the point of this class.
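To make that sequential dependency concrete, here's a toy sketch of an autoregressive generation loop. The one-line model is a stand-in, not the NSynth network: the point is just that each sample depends on everything generated before it, so the loop can't be parallelized across samples.

```python
import torch

# Stand-in next-sample predictor (NOT the NSynth network): maps the
# context so far to a single new sample.
model = lambda ctx: torch.tanh(ctx[..., -1:])

def generate(n_samples, context=torch.zeros(1, 1)):
    for _ in range(n_samples):
        nxt = model(context)                         # needs all prior samples
        context = torch.cat([context, nxt], dim=-1)  # append before next step
    return context

audio = generate(1024)  # strictly one sample per iteration
```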
1.
Regardless, I attempted to profile this and found it might be too complicated for what I'm trying to learn here, so I asked ChatGPT to create a similar network in PyTorch that would be easier to understand. Profiling that network showed the convolution was the bottleneck: when generating 1024 samples, conv1d took more than 75% of the runtime, 0.163 seconds of 0.215 seconds total. ChatGPT estimates the convolution can be sped up by 130x, effectively removing its time from the overall design. If I also accelerate the rest (padding/activations/adds), which ChatGPT believes could be sped up by 10-15x, I could make it real time. Right now it takes 0.215 seconds to generate 0.0625 seconds of audio, so speeding up just the conv block by ~100x would already get us to roughly real time. For the sake of the course, though, let's plan on doing both, giving a ~10x overall speedup: 0.0625 seconds of audio every 0.0215 seconds, around 3x faster than real time.
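For reference, here's roughly how such a profile can be collected with cProfile. The dilated-conv generator below is a hypothetical stand-in for the network ChatGPT built (shapes, channel counts, and dilations are assumptions), so the exact numbers will differ, but conv1d should similarly dominate.

```python
import cProfile, pstats
import torch
import torch.nn.functional as F

# Stand-in generator: a stack of dilated 1D convolutions producing
# "audio" one sample at a time.
weights = [torch.randn(64, 64, 2) for _ in range(8)]  # (out_ch, in_ch, K=2)

def generate(n_samples=1024):
    buf = torch.zeros(1, 64, 256)                     # rolling context window
    for _ in range(n_samples):
        h = buf
        for d, w in enumerate(weights):               # dilations 1, 2, ..., 128
            h = torch.tanh(F.conv1d(h, w, padding='same', dilation=2 ** d))
        buf = torch.cat([buf[:, :, 1:], h[:, :, -1:]], dim=2)

cProfile.run('generate()', 'gen.prof')
stats = pstats.Stats('gen.prof').sort_stats('cumulative')
stats.print_stats(10)  # conv1d should dominate the top entries
```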
2, 3, 4
I have decided to offload both the dilated convolution and the network's remaining math operations (padding/activations/adds) to an FPGA. As discussed, this should give a 10x increase in speed, about 3x better than real time. Assuming we're using PCIe Gen4 and looking at 2 seconds of audio (a normal drum sample) @ 16 kHz, we'd be offloading a tensor of shape (1, 256, 32000) at 4 bytes per element, i.e., 32.77 MB. The same amount comes back from the device, so 32.77 MB x 2 = 65.54 MB. At a speed of 14 GB/s, that's 65.54 MB / 14000 MB/s = 4.68 ms. Given the expected 10x speedup and our baseline of 0.0215 seconds per 1/16 of a second of audio, 2 seconds of audio needs 32 x 0.0215 = 0.688 seconds of compute, plus 0.00468 seconds of transfer, for 0.69268 seconds total. The transfer time is negligible next to the compute, so we should be good!
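The budget above in a few lines of Python (the constants are the ones assumed in the text: float32 activations and ~14 GB/s effective PCIe Gen4 throughput):

```python
# Transfer/compute budget for offloading 2 s of audio at 16 kHz.
batch, channels, samples = 1, 256, 2 * 16_000
bytes_per_elem = 4                                  # float32

one_way = batch * channels * samples * bytes_per_elem
round_trip = 2 * one_way                            # to and from the device

pcie_bps = 14e9                                     # ~14 GB/s effective
transfer_s = round_trip / pcie_bps                  # ~4.68 ms

compute_s = (2.0 / 0.0625) * 0.0215                 # 32 chunks x 21.5 ms
print(f"{one_way / 1e6:.2f} MB one way, "
      f"{transfer_s * 1e3:.2f} ms transfer, "
      f"{compute_s + transfer_s:.5f} s total")      # 32.77 MB, 4.68 ms, 0.69268 s
```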
5
ChatGPT suggested I use a combination of PyMTL3 and MyHDL, and in the spirit of the class I decided: why not. Its rationale was to write the convolution module in PyMTL3 and then use MyHDL to verify that the hardware implementation matches the PyTorch model. I then asked ChatGPT to help me implement the convolution block in PyMTL3.
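As a starting point, here's a minimal sketch of what the multiply-accumulate core of that convolution block might look like in PyMTL3 (assuming PyMTL3 3.x's @update/@= syntax; the 3-tap size, example weights, and 32-bit widths are placeholders, and this is simulation-only, not a conversion-ready design):

```python
from pymtl3 import *

# Hypothetical 3-tap multiply-accumulate: the dot product of a K-sample
# window with fixed weights, computed in one combinational pass.
class Conv1DMac( Component ):
    def construct( s, weights=(1, 2, 3) ):
        s.x = [ InPort( Bits32 ) for _ in weights ]  # current input window
        s.y = OutPort( Bits32 )

        @update
        def up_mac():
            acc = Bits32( 0 )
            for xi, wi in zip( s.x, weights ):
                acc += xi * Bits32( wi )
            s.y @= acc

m = Conv1DMac()
m.elaborate()
m.apply( DefaultPassGroup() )
m.sim_reset()
for i, v in enumerate( (4, 5, 6) ):
    m.x[i] @= v
m.sim_eval_combinational()
print( int( m.y ) )  # 1*4 + 2*5 + 3*6 = 32
```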
6 and 7
I attempted to get a 1D convolution block working in MyHDL, but after a protracted conversation with ChatGPT, it couldn't produce an HDL implementation that matched the output of the reference below. I'm not sure why.
```python
import numpy as np

def conv1d_reference(x, w, b=0):
    """
    Naive 1D convolution (no padding, stride=1).
    Note: like PyTorch's conv1d, this is cross-correlation (no kernel flip).
    x: input signal, shape (T,)
    w: kernel weights, shape (K,)
    b: bias term (scalar)
    returns: output signal, shape (T - K + 1,)
    """
    K = len(w)
    T = len(x)
    y = np.zeros(T - K + 1)
    for i in range(T - K + 1):
        y[i] = np.sum(x[i:i+K] * w) + b
    return y
```
Anyway, to test my design flow, I had ChatGPT generate a testbench and implement this in HDL. The attached files are in my GitHub repo, and here is the transcript with ChatGPT: [CF3.transcript.pdf](https://github.com/user-attachments/files/19830438/CF3.transcript.pdf).
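For what it's worth, here's a sketch of how a streaming version of that reference plus a self-checking testbench might look in MyHDL (assuming MyHDL 1.0's @block/run_sim() API and integer-quantized weights; the real network's float32 weights would first need fixed-point quantization, which is a common source of HDL-vs-reference mismatches). One subtlety: the newest sample has to pair with the last weight so the shift-register window lines up with the reference's sliding window.

```python
import numpy as np
from myhdl import block, always, instance, delay, Signal, intbv

W = (1, -2, 3)           # hypothetical integer kernel
X = (4, 5, -6, 7, 8, 2)  # hypothetical integer input stream

@block
def conv1d_hdl(clk, x_in, y_out, w):
    """Streaming conv1d: one sample in per rising edge; once K samples
    have arrived, y_out holds the dot product of the last K samples."""
    K = len(w)
    taps = [Signal(intbv(0, min=-2**15, max=2**15)) for _ in range(K - 1)]

    @always(clk.posedge)
    def mac():
        # Shift register of previous samples (taps[0] is the most recent).
        for i in range(K - 2, 0, -1):
            taps[i].next = taps[i - 1]
        taps[0].next = x_in
        # Newest sample pairs with the LAST weight so the window matches
        # conv1d_reference's x[i:i+K] * w ordering.
        acc = x_in * w[K - 1]
        for i in range(K - 1):
            acc += taps[i] * w[K - 2 - i]
        y_out.next = acc

    return mac

@block
def tb():
    clk = Signal(bool(0))
    x_in = Signal(intbv(0, min=-2**15, max=2**15))
    y_out = Signal(intbv(0, min=-2**31, max=2**31))
    dut = conv1d_hdl(clk, x_in, y_out, W)

    @instance
    def stimulus():
        got = []
        for n, sample in enumerate(X):
            x_in.next = sample
            clk.next = 1
            yield delay(1)
            clk.next = 0
            yield delay(1)
            if n >= len(W) - 1:          # window is full
                got.append(int(y_out))
        # Same computation as conv1d_reference with b=0.
        ref = np.correlate(np.array(X), np.array(W), mode='valid')
        assert got == ref.tolist(), (got, ref)
        print("HDL matches reference:", got)

    return dut, stimulus

tb().run_sim()
```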