Week 4 Codefest - zanzibarcircuit/ECE510 GitHub Wiki

Rethinking

After doing some research, I found that my NSynth-style network is probably not the best choice for this class. A more workable network would be RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. Unlike NSynth's autoregressive model, which is very (very) slow, RAVE is a variational autoencoder that encodes audio into 64 latent variables and decodes them into a 256-sample waveform, or about 5.8 ms at a 44.1 kHz sampling rate. In a VAE like this, the encoder is mainly needed for training or for live audio processing, so I'm going to focus on implementing just the decoder network.
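
As a quick sanity check on that latency figure (a minimal sketch; the 64 and 256 are just the sizes quoted above, not values read out of the RAVE code):

SAMPLE_RATE   = 44_100   # Hz
FRAME_SAMPLES = 256      # samples decoded per latent frame
LATENT_DIM    = 64       # latent variables per frame

frame_ms = FRAME_SAMPLES / SAMPLE_RATE * 1000
print(f"{LATENT_DIM} latents -> {FRAME_SAMPLES} samples @ {SAMPLE_RATE} Hz = {frame_ms:.1f} ms of audio")
# -> 64 latents -> 256 samples @ 44100 Hz = 5.8 ms of audio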

Implementing in Hardware

I'm still working on understanding the overall network, but today I'm going to try to implement the conv1d layer from the residual stack.


This will be a toy(ish) example related to the real network: essentially a 1D convolution parallelized across the input channels (latent variables). At each time step, a kernel of size 3 is applied to each input channel, and the output is the sum of those per-channel results.

Python Implementation

I had AI generate a Python version of this first to compare the hardware against. The code is below.

import numpy as np

def leaky_relu(x, negative_slope=0.01):
    return np.where(x >= 0, x, x * negative_slope)

def conv1d_residual_block(input_data, skip_data, weights):
    input_channels, time_steps = input_data.shape
    output_channels, _, kernel_size = weights.shape

    # Pad input on time axis (zero padding by 1 on each side)
    padded_input = np.pad(input_data, ((0, 0), (1, 1)), mode='constant', constant_values=0)

    # Output buffer
    output_data = np.zeros((output_channels, time_steps))

    # Process each time step
    for t in range(time_steps):
        for out_ch in range(output_channels):
            acc = 0
            for in_ch in range(input_channels):
                window = padded_input[in_ch, t:t+kernel_size]  # 3-sample window
                kernel = weights[out_ch, in_ch, :]
                acc += np.sum(window * kernel)
            # Add skip connection
            acc += skip_data[out_ch, t]
            # Apply LeakyReLU activation
            output_data[out_ch, t] = leaky_relu(acc)

    return output_data

For each time step and output channel, the kernel is applied to each input channel and the per-channel results are summed; the skip value for that channel is then added, and a leaky ReLU is applied. I tested this with a small example, and it seems to work, giving me this output:

[[-0.13 -0.04 -0.03 21.  ]
 [  7.   14.   17.   11. ]]
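
For reference, here is the kind of tiny driver that exercises it. The shapes match my test (2 input channels, 2 output channels, 4 time steps, kernel size 3), but the inputs and weights below are made-up placeholders, so it won't reproduce the exact numbers above:

import numpy as np

np.random.seed(0)
input_channels, output_channels, time_steps, kernel_size = 2, 2, 4, 3

input_data = np.random.randint(-3, 4, size=(input_channels, time_steps)).astype(float)
skip_data  = np.random.randint(-3, 4, size=(output_channels, time_steps)).astype(float)
weights    = np.random.randint(-2, 3, size=(output_channels, input_channels, kernel_size)).astype(float)

print(conv1d_residual_block(input_data, skip_data, weights))  # shape (2, 4)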

Now let's try it in Verilog.

Verilog Implementation

Below is the Verilog implementation. I feel like each codefest is one step forward and two steps back. I finally sort of understand how to get things going on my actual machine with Visual Studio Code, and I used iverilog this week to test the code ChatGPT generated, but I'm spending all my time figuring out the tools instead of figuring out how the Verilog code actually works. Iverilog (which ChatGPT recommended) is really geared toward plain Verilog rather than SystemVerilog, which is what I should be using given that it's the modern standard for FPGA design. I wasted so much time getting my VS Code/iverilog combo working with .v instead of .sv files that I had almost no time to dig into the actual code, which is below.

    // Sliding 3-sample window per input channel: each clock, shift in the
    // newest sample and drop the oldest.
    always @(posedge clk or posedge rst) begin
        if (rst) begin
            for (i = 0; i < INPUT_CHANNELS; i = i + 1) begin
                for (k = 0; k < KERNEL_SIZE; k = k + 1) begin
                    window[i][k] <= 0;
                end
            end
        end else begin
            for (i = 0; i < INPUT_CHANNELS; i = i + 1) begin
                window[i][2] <= window[i][1];
                window[i][1] <= window[i][0];
                window[i][0] <= input_data[i];
            end
        end
    end

    // Combinational multiply-accumulate: for each output channel, sum
    // window * kernel over every input channel and tap.
    always @(*) begin
        for (o = 0; o < OUTPUT_CHANNELS; o = o + 1) begin
            mac_result[o] = 0;
            for (i = 0; i < INPUT_CHANNELS; i = i + 1) begin
                for (k = 0; k < KERNEL_SIZE; k = k + 1) begin
                    mac_result[o] = mac_result[o] + window[i][k] * weights[o][i][k];
                end
            end
        end
    end

    // Registered output stage: add the skip connection, then apply an integer
    // leaky ReLU. Negative values are scaled by an arithmetic right shift of 6
    // (roughly x/64 instead of 0.01*x); result must be declared signed for the
    // >= 0 test and >>> to behave as intended.
    always @(posedge clk) begin
        for (o = 0; o < OUTPUT_CHANNELS; o = o + 1) begin
            result[o] = mac_result[o] + skip_data[o];  // blocking: used immediately below
            if (result[o] >= 0)
                case (o)
                    0: output_data_0 <= result[o][DATA_WIDTH-1:0];
                    1: output_data_1 <= result[o][DATA_WIDTH-1:0];
                endcase
            else
                case (o)
                    0: output_data_0 <= result[o] >>> 6;
                    1: output_data_1 <= result[o] >>> 6;
                endcase
        end
    end

endmodule
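
To convince myself the shift-register window behaves like a streaming 1D convolution, I find it easier to model just that part in software. This is my own sketch under simplifying assumptions (float math, no skip connection or activation), not the generated RTL; it mirrors the window[i][2] <= window[i][1] shifting above, with weight index 0 hitting the newest sample:

import numpy as np

def streaming_conv1d(samples, weights):
    # samples: (input_channels, time_steps); weights: (out_ch, in_ch, kernel_size)
    output_channels, input_channels, kernel_size = weights.shape
    window = np.zeros((input_channels, kernel_size))   # window[:, 0] holds the newest sample
    outputs = []
    for x_t in samples.T:                              # one "clock" per time step
        window = np.roll(window, 1, axis=1)            # shift taps: [2] <= [1], [1] <= [0]
        window[:, 0] = x_t                             # load the newest sample
        outputs.append(np.einsum('oik,ik->o', weights, window))  # MAC over channels and taps
    return np.array(outputs).T                         # (output_channels, time_steps)

At time t the window only holds samples t, t-1, and t-2, so the first couple of outputs come from a partially filled (all-zero after reset) window, just like the hardware.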

The test bench it generated for me ran and gave this output:

Time 5:  output_data_0=x,  output_data_1=x (reset)
Time 15: output_data_0=0,  output_data_1=0
Time 25: output_data_0=1,  output_data_1=5
Time 35: output_data_0=13, output_data_1=11
Time 45: output_data_0=17, output_data_1=9

They aren't really close to the Python results. ChatGPT says that's because the hardware uses integer math rather than floating point, but I need some time to make sense of that. That's all the time I have for this, and I'm feeling totally overwhelmed by the openness of this class. Every path I take leads me down 1000 more paths, and I don't end up learning anything. I thought I had a good, simple idea to implement, but it's turning out to be anything but.
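
To start making sense of the integer-math point, here's a toy comparison (my own example, not from the generated code) of the float leaky ReLU with slope 0.01 against the hardware's arithmetic-shift-by-6 approximation, which scales negative values by 1/64 ≈ 0.016:

# Comparing the float leaky ReLU (slope 0.01) with the shift-based one (>> 6, i.e. ~x/64).
for acc in (-130, -40, -30, 2100):                # made-up pre-activation accumulator values
    float_lrelu = acc * 0.01 if acc < 0 else acc  # what the Python model does
    shift_lrelu = (acc >> 6) if acc < 0 else acc  # what 'result >>> 6' does on a signed value
    print(acc, float_lrelu, shift_lrelu)
# -130 -1.3 -3
# -40 -0.4 -1
# -30 -0.3 -1
# 2100 2100 2100

On top of the activation difference, quantizing the weights and inputs to integers changes the MAC results themselves, and the hardware window at time t holds samples t-2..t (with the tap order reversed) rather than the padded t-1..t+1 window the Python version uses, so I wouldn't expect the two outputs to line up exactly yet.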

My code is in the main folder.

[CF4.transcript.pdf](https://github.com/user-attachments/files/19933470/CF4.transcript.pdf)