Week 6 & 7 Codefest - zanzibarcircuit/ECE510 GitHub Wiki
Summary
I used these two weeks to refocus on the end goal. I'm trying to implement the decoder of a VAE to speed up decoding for an audio generator network. The final network I decided on is below:
```
[Input z]
Latent Vector (1D: 4 elements)
              │
              ▼
+----------------------------+
|   Fully Connected Layer    |  FC: 4 → 4ch x 1fr
+----------------------------+
              │
              ▼
+----------------------------+
|          ReLU #0           |  4ch x 1fr
+----------------------------+
              │
              ▼
+----------------------------+
|        Upsample #0         |  x2 → 4ch x 2fr
+----------------------------+
              │
              ▼
+----------------------------+
|         Conv1D #1          |  4in → 2out, K3, P1, S1
+----------------------------+
              │
              ▼
+----------------------------+
|          ReLU #1           |  2ch x 2fr
+----------------------------+
              │
              ▼
+----------------------------+
|        Upsample #1         |  x2 → 2ch x 4fr
+----------------------------+
              │
              ▼
+----------------------------+
|         Conv1D #2          |  2in → 1out, K3, P1, S1
+----------------------------+
              │
              ▼
+----------------------------+
|          ReLU #2           |  1ch x 4fr
+----------------------------+
              │
              ▼
+----------------------------+
|    Conv1D Output Layer     |  1in → 2out, K1, P0, S1
+----------------------------+
              │
              ▼
+----------------------------+
|          ReLU #3           |  2ch x 4fr
+----------------------------+
              │
              ▼
[Final Output]
Mel Spectrogram (2ch x 4fr)
```
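To sanity-check the layer ordering and tensor shapes in the diagram, here is a minimal NumPy sketch of the decoder's forward pass. The weights are random placeholders and nearest-neighbor upsampling is assumed; the actual golden model lives in the ipynb.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def upsample2(x):
    # Nearest-neighbor x2 upsampling along the frame axis: (ch, fr) -> (ch, 2*fr)
    return np.repeat(x, 2, axis=1)

def conv1d(x, w, b, pad):
    # x: (in_ch, fr), w: (out_ch, in_ch, k), stride 1
    in_ch, fr = x.shape
    out_ch, _, k = w.shape
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((out_ch, fr))  # "same" output length for K3/P1 and K1/P0
    for o in range(out_ch):
        for t in range(fr):
            out[o, t] = np.sum(w[o] * xp[:, t:t + k]) + b[o]
    return out

def decoder(z, p):
    x = relu(p["fc_w"] @ z + p["fc_b"]).reshape(4, 1)  # FC 4->4, reshape to 4ch x 1fr
    x = upsample2(x)                                   # 4ch x 2fr
    x = relu(conv1d(x, p["w1"], p["b1"], pad=1))       # Conv1D #1: 4 -> 2
    x = upsample2(x)                                   # 2ch x 4fr
    x = relu(conv1d(x, p["w2"], p["b2"], pad=1))       # Conv1D #2: 2 -> 1
    x = relu(conv1d(x, p["w3"], p["b3"], pad=0))       # Output layer: 1 -> 2, K1
    return x                                           # 2ch x 4fr mel frame

# Placeholder parameters matching the diagram's shapes
p = {
    "fc_w": rng.standard_normal((4, 4)), "fc_b": rng.standard_normal(4),
    "w1": rng.standard_normal((2, 4, 3)), "b1": rng.standard_normal(2),
    "w2": rng.standard_normal((1, 2, 3)), "b2": rng.standard_normal(1),
    "w3": rng.standard_normal((2, 1, 1)), "b3": rng.standard_normal(2),
}
z = rng.standard_normal(4)
out = decoder(z, p)
print(out.shape)  # (2, 4)
```

Each stage's output shape can be checked against the annotations in the diagram above.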
Keep in mind that this is a simplified version of the network we were using before, and it can be scaled up. I chose it so I could get a proof of concept of the speedup and work from there; I'm fine using this version for my final project.
I created modules for each layer, as well as golden outputs in Python with profiling to quantify the potential speedup. Here is the profiling output:
```
--- Benchmarking Results ---
Number of benchmark runs: 100
Average total forward pass time: 0.00010528 seconds
Average time per component:
  - relu_total     : 0.00002664 seconds (25.30%)
  - Conv1          : 0.00002303 seconds (21.88%)
  - Conv2          : 0.00001329 seconds (12.62%)
  - Conv3_Output   : 0.00001148 seconds (10.90%)
  - upsample_total : 0.00000659 seconds (6.26%)
  - fc             : 0.00000579 seconds (5.50%)
  - reshape        : 0.00000204 seconds (1.93%)
```
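The actual profiling code is in the ipynb; per-component timing like the above can be collected with a simple `time.perf_counter` harness. The components below are illustrative stand-ins, not the real layers:

```python
import time
import numpy as np

def avg_time(fn, runs=100):
    # Average wall-clock seconds per call over `runs` calls
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

x = np.random.default_rng(0).standard_normal((4, 4))
components = {
    "relu": lambda: np.maximum(x, 0.0),  # stand-in for relu_total
    "fc":   lambda: x @ x,               # stand-in for the FC layer
}
times = {name: avg_time(fn) for name, fn in components.items()}
total = sum(times.values())
for name, t in times.items():
    print(f"  - {name:5s}: {t:.8f} seconds ({100 * t / total:.2f}%)")
```

Note that the percentages in the real report are taken against the total forward-pass time, so they need not sum to 100% once framework overhead is included.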
Below are the runtimes for each layer, along with the projected speedup on the FPGA:
All of the individual modules have been verified for functionality against the golden outputs in the ipynb; you can compare the output in the tb files to the ipynb results directly. Everything matches except the first FC layer, where there is a small delta due to the limited numeric resolution on the FPGA, which is a minor issue. We now have a working decoder! The next step was to get the top level working: I combined all the modules, along with the implemented top level, into a single file so I can take the design down to the gate level in OpenLane. I will save that for the next project.
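The FC-layer delta can be bounded by the quantization step of the hardware number format. A small sketch of that check, assuming for illustration a fixed-point format with 8 fractional bits (the actual FPGA width may differ):

```python
import numpy as np

def to_fixed(x, frac_bits=8):
    # Quantize to a signed fixed-point grid with `frac_bits` fractional bits
    return np.round(x * (1 << frac_bits)) / (1 << frac_bits)

# Hypothetical golden FC outputs from the Python model
golden = np.array([0.3141, -1.2718, 0.5772, 0.0])
quant = to_fixed(golden, frac_bits=8)

# Rounding error is bounded by half an LSB: 2**-(frac_bits + 1)
delta = np.abs(golden - quant).max()
print(delta <= 2 ** -9)  # True
```

A testbench comparison can then accept any per-element delta within half an LSB as a match rather than requiring bit-exact equality.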
One thing that helped me get this working was switching to Gemini. It proved better for FPGA coding; although the system has some quirks (it gets hung up a lot when coding in the canvas), it gave me a working decoder, effective debugging, and a final top-level module.