Place and Route Research - azonenberg/openfpga GitHub Wiki

Research for a future open source place-and-route (and pack) tool.

Packing

Packing is the process of mapping LUTs and flip flops into FPGA blocks (either LEs or PLBs).

Cluster-Based Architecture, Timing-Driven Packing and Timing-Driven Placement for FPGAs
- Thesis on timing-driven (as opposed to connection-driven) packing algorithms. Provides a good overview of the field. Also the only paper so far that answers "why the hell are we packing anyway?", which is basically "because it runs faster". Treating LUTs and FFs as independent tile types with very specific routing constraints would probably result in better solutions but would take a lot longer. update 12/5/2017 - packing to coarser blocks also allows the placer to be more technology-independent. Chip-specific knowledge is sequestered to the packer and router, allowing the placement engine to be generic.
- TL;DR find the critical path, start with the critical path, add more nodes based on "critical-ness", re-calculate critical path occasionally
- Seems like timing driven algorithms usually include a weighting between time and space which is useful
Routability Driven Packing
- To be read
VPR has an associated packer called VPACK, but I can't find a good paper on it. Will keep looking.

Placement

Placement is the process of placing PLBs into the FPGA fabric

There are two general approaches to this process - simulated annealing and analytical placement. Simulated annealing is a statistical model, while analytical models are analytical (duh?). Simulated annealing doesn't actually have to be single-threaded. Analytical methods tend to be faster but not as good (usually due to optimizing for slightly the wrong things).

VPR - Versatile Place and Route
- Bills itself as the "State of the art" academic PAR tool. Now part of VTR - Verilog to Routing
- Basically does clever Simulated Annealing, using a HWPL net model, for placement
- Designed to be modular so that individual algorithms can be swapped out - we should copy this feature but not the detailed design
- Includes packing, placement, and routing
- Uses an XML-based "architecture" file to describe the FPGA in question, allowing the tool to be generic over architectures - again, we should copy this feature but not the detailed design
Near-linear wirelength estimation for FPGA placement - "Star+"
- This is a wire length metric, replacing the traditional half-perimeter wire length model, for multi-pin nets. It has some very useful properties
- Differentiable w.r.t. block position change, making it suitable for analytic placers (and it won't over-value long nets like quadratic wire length metrics do)
- Can be calculated in constant time
- Can be re-calculated after a block move using only the block's new position and the net's old center of mass
Timing-driven Placement for FPGAs
- Use an estimate of delay based on (delta X, delta Y) - generally, routing from (2,4) to (6,8) will take the same length of time as from (3, 5) to (7, 9)
- Calculate the full delay of the critical path to determine the slack of a connection, and factor that into your cost function
- Recalculate the full delays every time you change the temperature in your simulated annealing
QPF: Efficient Quadratic Placement for FPGAs - IEEE paywall
- Analytic placer based on quadratic wire length with linearizing adjustments
- Uses low-temp SA to refine the final placement
- 6x faster than VPR, not quite as good a result
StarPlace: A new analytic method for FPGA placement
- Analytic placer based on Star+, conjugate gradient and successive over-relaxation
- 5x faster than VPR, 8-9% lower critical path delay as well
An Efficient and Effective Analytical Placer for FPGAs
- Analytic placer based on Log-Sum-Exp approximation to HPWL and timing cost
- Uses timing-driven low-temp SA to refine the final placement

Parallel Placement Algorithms

Population Annealing
- A very parallel version of simulated annealing (see also GPU-accelerated population annealing)
- A fundamental algorithm, not an FPGA paper specifically
GPU Approach to FPGA placement based on star+
- Partitions the chip into thread-block-sized regions, alternating the regions on even and odd iterations
- Using star+'s data locality, performs simulated annealing by evaluating every swap associated with a random displacement vector, with one thread per block
- Shows a speedup of between 6x and 90x relative to VPR, on some fairly small circuits
- awygle's favorite option
Parallelizing Simulated Annealing Placement for GPGPU - thesis
- Runs simulated annealing on subsets of the netlist (as opposed to subsets of the FPGA)
- Provides about a 5-20x speedup with very slightly less good results than VPR

Routing

Coming soon

Current Thoughts

awygle - Currently I'm of the opinion that the best option for implementation is the GPU-accelerated star+ placement algorithm with some ideas from the timing-driven placement paper (which seems pretty parallelizable, just needs a table lookup based on the displacement vector before threads are dispatched). No real opinion on packing or routing yet. As for overall architecture, I like the way that VPR is structured conceptually, in terms of having a data format for expressing the architecture information to make porting easy and the ability to swap in new algorithms in a straightforward way. I tend to like the Unix philosophy of independent tools, but a well-structured monolithic application would work just as well.