Development Roadmap
This project will be divided into several milestones that will be tackled sequentially in order to implement a convolutional neural network on an FPGA. The goal of this page is to divide a somewhat ambitious task into smaller, realizable goals in a way that encourages focused development.
Phase 1: Establish a baseline software implementation using Caffe
We will begin the project development by implementing a fixed network structure using the Caffe framework. The purpose is to develop a simple reference to guide the development of the FPGA implementation. The network scope will be kept small, probably something simple like recognizing pictures of cats. The small size is required to keep the initial hardware implementation feasible. From the software implementation, we can get the following reference points to compare against the hardware:
- Classifier performance
- Determination of the network structure/parameters
- The network weights
- Latency
Once this is done, we will use Ristretto to obtain a fixed-point neural network. Fixed-point math is much easier to implement in hardware.
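As a quick illustration of what the fixed-point arithmetic looks like, here is a minimal C sketch of a Q4.12 multiply. The format and the names are placeholders; Ristretto would choose the actual bit widths per layer:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical Q4.12 format: 1 sign bit, 3 integer bits, 12 fractional
 * bits. Ristretto would pick the real widths per layer. */
#define FRAC_BITS 12
typedef int16_t fixed_t;

/* Quantize a floating-point value into Q4.12. */
static fixed_t to_fixed(float x) {
    return (fixed_t)(x * (1 << FRAC_BITS));
}

/* Fixed-point multiply: widen to 32 bits, multiply, shift back down.
 * In fabric this is one DSP multiply plus wiring for the shift. */
static fixed_t fixed_mul(fixed_t a, fixed_t b) {
    return (fixed_t)(((int32_t)a * (int32_t)b) >> FRAC_BITS);
}

int main(void) {
    fixed_t w = to_fixed(0.75f);
    fixed_t x = to_fixed(-1.5f);
    printf("0.75 * -1.5 ~= %f\n",
           fixed_mul(w, x) / (float)(1 << FRAC_BITS));
    return 0;
}
```

The key point is that the whole operation is an integer multiply and a shift, which maps directly onto an FPGA DSP block, whereas floating point needs a much larger dedicated unit.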
Phase 2: Implement a static neural network supporting only feedforward operation
Phase 2 is likely much longer than Phase 1 because all the pieces of the network must be built from scratch. To do a literal implementation of the CNN, we would have to implement the following modules:
- An input data buffer that contains the data to be classified.
- Individual node modules, parameterized to accept outputs from a configurable number of inputs with associated weights (a software reference model is sketched in C after this list). The node modules should leave hooks so that the weights can be reconfigured in preparation for eventual hardware training.
- Convolution modules, which can implement N x N kernels. The kernel size N will be fixed at compile time, but the kernel coefficients themselves should be loadable at runtime.
- A variety of dimensional reduction modules, such as max pooling, average pooling, etc.
- Network controller(s) that can load in the network configuration. For this phase, the network controller will use predetermined settings partially based on the results from Phase 1, such as the network weights and the kernel values.
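To pin down what an individual node module actually has to compute, here is a minimal software reference model in C. It reuses the Q4.12 convention from the sketch above; the ReLU activation and the saturation behavior are illustrative assumptions, not the final hardware design:

```c
#include <stdint.h>
#include <stddef.h>

#define FRAC_BITS 12
typedef int16_t fixed_t;

/* Reference model of one node: a weighted sum of n inputs followed by an
 * activation. In hardware, the multiplies map onto DSP blocks and the
 * weights sit behind a write port -- the "hook" for later retraining. */
fixed_t node_forward(const fixed_t *inputs, const fixed_t *weights,
                     size_t n, fixed_t bias) {
    /* Wide accumulator: Q4.12 * Q4.12 products are Q8.24. */
    int32_t acc = (int32_t)bias << FRAC_BITS;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)inputs[i] * (int32_t)weights[i];
    acc >>= FRAC_BITS;   /* back to Q4.12 */

    if (acc < 0)          /* placeholder activation: ReLU */
        acc = 0;
    if (acc > INT16_MAX)  /* saturate rather than wrap */
        acc = INT16_MAX;
    return (fixed_t)acc;
}
```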
Already I can see some issues with this approach.
- The cost of each individual node in fabric is high. The routing between nodes isn't really a big deal since FPGAs come with lots of routing resources. The ALMs/slices (depending on whether you speak Intel or Xilinx) are probably okay since they're plentiful, and an individual node probably doesn't eat up much for weight storage and the general data pipeline. What is an issue is the sheer number of multiplications and additions, and that's just for the weighted sums. It is possible to implement these in fabric, but timing and utilization are going to suck. We only have a limited number of DSP blocks.
- A similar issue comes up for implementing the convolutions. Consider a single layer of R x R pixels and a kernel of N x N: there are N^2 multiplications and N^2 additions for a single output pixel. Do this for each of the R^2 pixels and the total is roughly on the order of N^2 * R^2 operations. We can tackle this spatially or temporally, and the best solution is probably some mixture. A purely spatial approach would do all the convolutions at the same time in parallel, but at any real scale we would eat up all of the DSP blocks. The other extreme would be to have a single convolution module and do everything sequentially (sketched in the reference code after this list). This would eat up way fewer resources, but it would take forever to process.
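To make the N^2 * R^2 figure concrete, here is the purely sequential extreme as a C reference model: one multiply-accumulate at a time. Zero padding at the borders is an assumption made only to keep the sketch short:

```c
#include <stdint.h>

#define FRAC_BITS 12
typedef int16_t fixed_t;

/* Sequential reference convolution: an R x R image with an N x N kernel,
 * zero-padded at the borders. Each output pixel costs N*N multiply-
 * accumulates, so the whole layer is on the order of N^2 * R^2 operations,
 * the figure quoted above. */
void conv2d(const fixed_t *img, fixed_t *out, const fixed_t *kernel,
            int r, int n) {
    int half = n / 2;
    for (int y = 0; y < r; y++) {
        for (int x = 0; x < r; x++) {
            int32_t acc = 0;  /* wide accumulator for Q8.24 products */
            for (int ky = 0; ky < n; ky++) {
                for (int kx = 0; kx < n; kx++) {
                    int iy = y + ky - half;
                    int ix = x + kx - half;
                    if (iy >= 0 && iy < r && ix >= 0 && ix < r)
                        acc += (int32_t)img[iy * r + ix] *
                               (int32_t)kernel[ky * n + kx];
                }
            }
            out[y * r + x] = (fixed_t)(acc >> FRAC_BITS);  /* back to Q4.12 */
        }
    }
}
```

A mixed spatial/temporal design would unroll some subset of these loops into parallel DSP blocks and iterate over the rest.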
I believe that we can handle these issues if we pipeline the neural network intelligently. Point by point:
- When I visualize the operation of a neural network in general, I think of the outputs of the previous layer going into the current layer. A CNN has convolution and feature reduction thrown into the mix, but the general idea doesn't change. The entire network pieces this together, layer by layer. To me, this mental visualization seems like a pretty good way of breaking up the network into pipeline stages. We can keep the previous and current layers in fabric, and store the intermediate results for all layers (not just the previous and current) in some sort of memory. Fortunately for us, modern FPGAs come with lots of RAM resources. This does mean we will need to write some sort of controller that accesses the correct information stored in the RAM, but that seems pretty doable.
- The convolutions can be handled with a similar strategy. If we store the convolution filters and results in a RAM, we can organize it by layer, write a controller, and so on.
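Here is a toy C model of that controller idea, just to show the bookkeeping involved; the structure names, sizes, and address layout are all assumptions, not a committed memory map:

```c
#include <stdint.h>
#include <stddef.h>

typedef int16_t fixed_t;

/* Toy model of the layer controller's bookkeeping: one big activation RAM
 * (block RAM on the FPGA) holds every layer's outputs, with a base offset
 * per layer. The datapath only ever touches two windows of it at a time:
 * the previous layer (read) and the current layer (write). */
#define NUM_LAYERS 4

typedef struct {
    fixed_t ram[1 << 16];      /* stand-in for on-chip block RAM */
    size_t  base[NUM_LAYERS];  /* base[i] = start of layer i's outputs */
} layer_mem;

/* node_forward is the reference model from the Phase 2 module list. */
extern fixed_t node_forward(const fixed_t *inputs, const fixed_t *weights,
                            size_t n, fixed_t bias);

/* Compute one layer from the previous one. The weight layout (n_out rows
 * of n_in weights) is a placeholder the real controller would load from
 * its configuration. */
void step_layer(layer_mem *m, int layer, const fixed_t *weights,
                size_t n_in, size_t n_out) {
    const fixed_t *prev = &m->ram[m->base[layer - 1]];
    for (size_t j = 0; j < n_out; j++)
        m->ram[m->base[layer] + j] =
            node_forward(prev, &weights[j * n_in], n_in, 0);
}
```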
Phase 3a: Develop PCIe interface PL
In order to make this into an accelerator, we have to make it so some host system can control it. PCIe is the current standard for interfacing external hardware with a computer, so we'll pursue that route. Fortunately for us, both Intel and Xilinx offer a wealth of PCIe IP we can leverage, leaving us to develop some middleware PL to handle the abstracted signalling.
In terms of requirements, the PCIe interface should be fast enough so that the host system can send input data and retrieve a result faster than the host CPU can process the data itself. This could be achieved by minimizing latency, maximizing throughput, or some combination of the two.
One issue that will need to be addressed is how to test the interface without having developed a driver first, all while on a limited budget. For this reason, I've labelled this phase 3a, indicating that we might rearrange it with other Phase 3 tasks depending on what we learn along the way.
Phase 3b: Develop PCIe driver
This is the software half of the PCIe interface. There are a few capabilities that I have in mind for initial support, namely programming the convolutional filters and network weights on the fly. Additionally, the software should be able to quickly provide input data and retrieve the result once the FPGA signals an interrupt.
Since this is sort of a hobby project, we will keep the driver development simple(r) at the expense of a little latency by leveraging Linux UIO drivers. Since I want to experiment a little with AWS and some of their FPGA offerings, we have to develop a Linux driver. This is because the EC2 F1 instances all run Linux and Azure currently has no fully flexible FPGA option (Brainwave only supports limited applications). We'll limit ourselves to a UIO driver to reduce driver development headaches, but eventually we can make a fully fledged kernel driver.
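To give a feel for why UIO keeps things simple, here is roughly what the user-space side could look like in C. The device node, the register map, and the interrupt re-enable write are all assumptions about our eventual hardware; only the open/mmap/read pattern is the standard UIO contract:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* Hypothetical device node; the real index depends on probe order. */
    int fd = open("/dev/uio0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* UIO map 0 lives at mmap offset 0; 4096 is a placeholder size. */
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *regs = p;

    regs[0] = 1;  /* hypothetical "start classification" register */

    /* read() on a UIO device blocks until the next interrupt and returns
     * the interrupt count -- this is how we wait for the FPGA's result. */
    uint32_t irq_count;
    if (read(fd, &irq_count, sizeof(irq_count)) != sizeof(irq_count)) {
        perror("read");
        return 1;
    }

    printf("result = 0x%08x after %u interrupt(s)\n", regs[1], irq_count);

    /* Re-arm the interrupt, assuming the driver implements irqcontrol. */
    uint32_t reenable = 1;
    write(fd, &reenable, sizeof(reenable));

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

Everything happens in user space, so beyond binding the device to a UIO driver there is no kernel module of our own to maintain, which is exactly the development headache we're trying to defer.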