Z DARPA Report 2014 Q1
SEEC: Specialized Extremely Efficient Computing
Date: ?Apr 15, 2014?
Quarterly Report Jan 1, 2014 – Mar 31, 2014
PI Mark Horowitz, Stanford University
Contract # HR0011-11-C-0007
We continue to develop and debug both Orion, our computational photography DSL, and our DPDA-to-hardware compiler (the Stencil Engine Generator). We have been mapping increasingly complex algorithms to the flow, which has increased tool running time. To address this issue we have instituted a hierarchical flow that splits the design into several smaller pieces and synthesizes them in parallel on a cluster. This, along with other continuing work, should help extend the hardware generator to support image processing algorithms that operate in three dimensions or on pyramids. For example, SIFT is a common feature tracking algorithm that uses a pyramid for scale invariance, while multi-frame HDR requires registering multiple frames and integrating them over time. See the Stencil Engine Generator section below for more details.
Our prototype FPGA platform now has a VITA sensor in place of the earlier USB camera. We continue to map increasingly complex algorithms to the platform, which has driven necessary optimizations to the design flow; see the "FPGA platform" section below for details.
In August 2013 we taped out a test chip based on our floating-point generator. This quarter, in mid-February, we received chip samples from the manufacturer. We then packaged the chips and assembled a test board. A full set of functional tests shows that both the chip and the board work correctly. The fastest FPU module (the single-precision CMA design) runs at up to 1.2 GHz and consumes 25 mW. Next we plan to do performance debugging on the chip and then collect the energy and performance data needed for an in-depth analysis of the FPU design's efficiency.
We have also made progress on each of the individual tasks, which we describe below.
Our mantra in creating hardware is that engineers should create constructors, not instances. We call these hardware-oriented constructors generators. To hierarchically create bigger and bigger generators, we use Genesis2, an extension of SystemVerilog that includes a software-like elaboration layer. A generator is very different from high-level synthesis (HLS), e.g. C-to-Verilog flows. HLS turns a C program into hardware, such that a given set of inputs produces the same output in both the C code and the hardware. In a generator, however, the program is not a functional description of an algorithm, but rather a procedural description of how a piece of hardware needs to be built. Put differently, the inputs to the generator program are architectural parameters, and its output is the hardware description.
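To make the distinction concrete, here is a minimal sketch in Python (purely for illustration; the adder-tree module and its parameters are hypothetical and this is not Genesis2 syntax): a program whose inputs are architectural parameters and whose output is a hardware description.

```python
# A toy generator: its inputs are architectural parameters (operand
# count and width), and its output is Verilog for that instance.

def adder_tree_generator(n_inputs, width):
    """Emit Verilog for a balanced adder tree; n_inputs is a power of two."""
    assert n_inputs >= 2 and n_inputs & (n_inputs - 1) == 0
    ports = ",\n".join(f"  input  [{width-1}:0] in{i}" for i in range(n_inputs))
    wires, level, tmp = [], [f"in{i}" for i in range(n_inputs)], 0
    while len(level) > 1:                      # one adder level per pass
        nxt = []
        for a, b in zip(level[0::2], level[1::2]):
            name = f"s{tmp}"
            tmp += 1
            wires.append(f"  wire [{width-1}:0] {name} = {a} + {b};")
            nxt.append(name)
        level = nxt
    return (f"module adder_tree (\n{ports},\n  output [{width-1}:0] out\n);\n"
            + "\n".join(wires)
            + f"\n  assign out = {level[0]};\nendmodule\n")

# Different parameters produce different hardware: the program is a
# constructor, not an instance.
print(adder_tree_generator(n_inputs=4, width=16))
```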
As of mid-February, we have samples of our test chip back from the manufacturer! As mentioned above, we have debugged the chip, and all functions appear to be working. We are now collecting and analyzing the performance and power consumption of the chip. We also built and assembled a test board for the floating-point chip, which is now being used to test the functionality, performance, and energy efficiency of the chip. See above for details.
As for the generator itself, we now have a floating-point division and square-root generator that rounds the least significant bit correctly. We formalized the rounding problem and formally verified the rounding logic; simulation results confirm the functionality. Based on the formulas we derived, we are further optimizing the division and square-root generator.
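As an illustration of what formalizing the rounding problem buys (a self-contained sketch of ours, not the generator's actual logic), the round-to-nearest decision for a fixed-point square root reduces to an exact integer comparison, and the tie case can be shown to be impossible:

```python
import math

def rn_sqrt_fixed(x: int, q: int) -> int:
    """Round-to-nearest square root for unsigned fixed point with q
    fraction bits: returns the nearest integer to sqrt(x / 2**q) * 2**q.
    All arithmetic is exact, so the LSB is provably correct."""
    n = x << q                       # sqrt(x / 2**q) * 2**q == sqrt(x << q)
    y = math.isqrt(n)                # floor of the exact square root
    # Round up iff sqrt(n) > y + 1/2, i.e. iff 4n > (2y + 1)**2. A tie
    # (4n == (2y + 1)**2) is impossible: the right side is odd, the left even.
    return y + (4 * n > (2 * y + 1) ** 2)

# Exhaustive check against a float reference for a small 8-bit-fraction
# format (small enough that the double-precision reference is reliable).
q = 8
for x in range(1, 1 << 12):
    assert rn_sqrt_fixed(x, q) == round(math.sqrt(x / 2**q) * 2**q), x
```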
This quarter we built the core of the Linear Algebra Generator, with a micro-programmable controller inside each PE (processing element). We also designed a new architecture for the controller that can optimize the hardware for matrix multiplication, which will be implemented soon.
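The following toy interpreter sketches what a micro-programmable PE looks like; the opcodes and microcode format are invented here for illustration and do not reflect the controller's actual ISA:

```python
# Hypothetical sketch of a micro-programmable PE: a tiny microcode
# interpreter whose program computes one dot product (the MAC loop at
# the heart of matrix multiplication).

def run_pe(microcode, a_row, b_col):
    acc, pc, i = 0, 0, 0
    while pc < len(microcode):
        op, arg = microcode[pc]
        if op == "CLEAR":                # zero the accumulator
            acc = 0
        elif op == "MAC":                # acc += a[i] * b[i]; advance index
            acc += a_row[i] * b_col[i]
            i += 1
        elif op == "LOOP":               # jump back while elements remain
            if i < len(a_row):
                pc = arg
                continue
        pc += 1
    return acc

# One dot product of [1, 2, 3] and [4, 5, 6] under a three-word program.
prog = [("CLEAR", None), ("MAC", None), ("LOOP", 1)]
assert run_pe(prog, [1, 2, 3], [4, 5, 6]) == 32
```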
We continued our work on the flow that turns Orion domain-specific-language applications into Stencil Engines. We submitted a paper that was accepted for publication at SIGGRAPH. The paper argues for the stencil domain abstraction as an efficient way to produce both fixed-function hardware and efficient software, and presents the Orion-to-Stencil-Engine flow as a vehicle for prototyping real-time image signal processing algorithms. Much of the quarter, though, was spent improving the design flow to increase the efficiency of synthesizing large designs using automated hierarchical design techniques.
Next quarter we plan to improve the documentation in order to open-source the project. We also plan to make the flow more robust, allowing the designs and tools to be more easily used by other universities and researchers. Additionally, we are looking to bring up new applications and features on the Orion-to-Stencil-Engine flow in order to make the system more attractive to other researchers.
We successfully integrated the VITA sensor into our hardware and software framework and can now use it as a video source instead of the USB camera. For compatibility reasons, the data from the sensor goes to DRAM first before being forwarded to the DPDA-generated hardware for processing. This approach is slower than connecting the sensor directly, but it makes debugging easier. Soon we will connect the sensor directly to the processing engine.
During the last few months we have been mapping increasingly complex algorithms to our flow, which has increased tool running time. To address this issue we put substantial effort into implementing a hierarchical flow that splits the design into several smaller pieces and synthesizes them in parallel on a cluster.
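In outline, the parallel step of such a flow looks like the following sketch, where the block names and the body of run_synthesis are placeholders standing in for real per-partition synthesis jobs on the cluster:

```python
# Outline of the hierarchical flow's parallel step (a sketch only).
from concurrent.futures import ProcessPoolExecutor

def run_synthesis(block: str) -> str:
    # In the real flow this would launch one synthesis run (one cluster
    # job) for a single partition of the design and return its netlist.
    return f"{block}.netlist"

blocks = ["line_buffer", "stencil_core", "output_fifo"]  # hypothetical split

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        netlists = list(pool.map(run_synthesis, blocks))
    # A final sequential pass stitches the pieces into the top level.
    print("stitching top level from:", netlists)
```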
We continue to work on integrating the Aptina sensor. We now have Aptina parts on hand as well as reference design files for the carrier PCB. In addition, we have started working with NVIDIA on a computation module based on a Tegra tablet. The current plan is to use a 1 Gbit Ethernet link between the two boards and to run the viewfinder and higher-level application components on the Tegra.
This has also initiated an effort to create a new API for the programmable ISP. It will be based on the old FCam interface with the addition of the V4/Orion programming language. Conceptually, such a model will resemble shader kernels and CUDA, a common existing way of interacting with graphics cards.
Today's SoC designers inevitably run into interface mismatches between their systems and existing IP blocks. While many bus standards have been developed to standardize IP interfaces, the diversity of standards creates a new problem: it is unlikely that all blocks in a system will use the same standard. Standardizing IP bus formats also has not solved the problem of integrating an IP block into the SoC's software stack: a significant portion of SoC design cost comes from the need to develop low-level software drivers and APIs for each system. These problems are fundamental to SoC design: the designer of a block cannot possibly know about the environment where that block will be used, because that environment does not exist when the block is created.
While this problem seems fundamental, the recent design trend towards building flexible hardware, or hardware generators, provides a possible solution. In previous quarters, we proposed and prototyped a parameterized bus interface generator that can connect the interface signals in each IP block to the current bus abstraction. Since the mapping between the IP block and the system bus is done algorithmically, this mapping is maintained even while the system bus and the generator evolve to support more complex operations. This parameterized interface grants current IP blocks the flexibility to connect to system interconnects in existing and future systems.
Having completed a prototype for a flexible interface generator, we focused this quarter's efforts on hardware and software integration for IP generators. Many fixed IP blocks advertise their interface resources through a standard protocol bus. For generators, however, the interface is likely to change with the desired IP instance. To help generators better advertise their interfaces to an interconnect generator, we developed a set of primitives that represent the different types of structures that might be found on an IP block's control interfaces. When an IP instance is created, these primitives are converted to interface RTL. Since these primitives exist in the generator space, they can be fed into our interface generator to connect the IP block to the rest of the hardware. These primitives are currently implemented in the ISP generator.
Additionally, these primitives incorporate some knowledge about when and how each interface resource can be accessed and modified. Using the information stored in these primitives, we have worked on a method for automatically creating a low-level C driver for each ISP-Gen accelerator, as well as a set of higher-level data structures and methods designed to make the hardware more programmer-friendly. We have also made progress in mapping function calls used to test Orion code to driver calls used to control the ISP-Gen accelerator. Once completed, we plan to use this as a portable API generator.
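The following sketch (with invented names; not the actual ISP-Gen primitives) illustrates the idea: a register primitive in generator space carries its offset and access rules, and a small generator prints the matching low-level C accessors:

```python
# Illustrative only: turn control-interface register primitives into
# a minimal memory-mapped C driver.
from dataclasses import dataclass

@dataclass
class RegPrimitive:
    name: str
    offset: int
    writable: bool

def emit_c_driver(base: int, regs: list) -> str:
    out = [f"#define ACCEL_BASE 0x{base:08X}u"]
    for r in regs:
        out.append(
            f"static inline unsigned read_{r.name}(void) "
            f"{{ return *(volatile unsigned *)(ACCEL_BASE + 0x{r.offset:X}); }}")
        if r.writable:
            out.append(
                f"static inline void write_{r.name}(unsigned v) "
                f"{{ *(volatile unsigned *)(ACCEL_BASE + 0x{r.offset:X}) = v; }}")
    return "\n".join(out)

regs = [RegPrimitive("ctrl", 0x0, True), RegPrimitive("status", 0x4, False)]
print(emit_c_driver(0x40000000, regs))
```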
Previously we started creating OptiGraph, a DSL for performing graph analytics on large, highly skewed graphs such as social networks. We now have a working prototype that can perform multiple classic analyses such as PageRank, betweenness centrality, and triangle counting. In addition, we have found that the OptiGraph implementation performs competitively with multiple existing C++ graph analysis packages, including SNAP and GraphLab. We are currently exploring various performance optimizations, such as the tradeoffs among internal memory layouts for the graph (e.g. arrays of arrays, bitsets, etc.). We are also studying how to best partition these graphs and perform automatic load balancing given the highly skewed workloads typical in these applications and datasets.
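One example of the layout tradeoffs under study, as a toy sketch (not OptiGraph's implementation): the same adjacency structure stored as an array of arrays versus a flat CSR layout, which trades pointer chasing for contiguous scans.

```python
# Two layouts for the same directed graph.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3

# Array-of-arrays: convenient, but each neighbor list is its own object.
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)

# CSR: one offsets array plus one flat neighbors array.
offsets = [0] * (n + 1)
for u, _ in edges:
    offsets[u + 1] += 1
for i in range(n):
    offsets[i + 1] += offsets[i]
nbrs = [0] * len(edges)
fill = offsets[:-1].copy()
for u, v in edges:
    nbrs[fill[u]] = v
    fill[u] += 1

# The two layouts encode identical neighborhoods.
assert [nbrs[offsets[u]:offsets[u + 1]] for u in range(n)] == adj
```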
We have also implemented new static analyses within the Delite framework that allow us to better handle nested parallelism for highly parallel hardware targets such as GPUs. Using this analysis we can generate multidimensional GPU kernels that exploit the parallelism available in all nest levels rather than just one by automatically selecting how to map the nested parallelism onto the GPU hardware. In order to maximize performance we analyze the data access patterns within the parallel op and generate the GPU kernel such that the most frequent accesses will generate coalesced (vector) loads when possible. This strategy can have significant performance benefits in multiple domains that manipulate multidimensional data. We have currently applied it mostly to various matrix operations within our linear algebra DSL, and we are looking at using it to achieve better utilization and load balancing for various graph computations as well.
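As a toy illustration of the coalescing consideration (index arithmetic only, not generated GPU code): assigning the innermost dimension to consecutive threads yields stride-1 accesses, while the transposed mapping yields strided, uncoalesced ones.

```python
# Toy model of the mapping choice for a nested parallel op over an
# R x C row-major matrix: giving the inner (column) index to consecutive
# threads makes adjacent threads touch adjacent addresses (a coalesced,
# vector load); the transposed mapping produces stride-C accesses.
R, C = 4, 8
good = [divmod(t, C) for t in range(R * C)]   # thread t -> (row, col)
bad = [(t % R, t // R) for t in range(R * C)]

def addr(ij):
    return ij[0] * C + ij[1]                  # row-major address

assert addr(good[1]) - addr(good[0]) == 1     # stride-1: coalesced
assert addr(bad[1]) - addr(bad[0]) == C       # stride-C: uncoalesced
```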
This past quarter, I submitted a paper to SIGGRAPH 2014 on solving constrained layout synthesis problems with SMT solvers, by first tracing the execution of a probabilistic program and converting the trace to a logical formula.
I have three main projects this quarter:
1. A compiler that, from an individual functional probabilistic program, produces the description of a specialized register machine that can be synthesized.
The project is based on a simple observation: an abstract machine abstracts away details of a program's execution, reducing its steps to what can be considered "essential". Why not, then, synthesize hardware that does little more than those essential steps, taking bigger steps where the program allows, and see what we can get away with in space and time?
This idea has come up in various forms before, from the LISP machines [1] of the 1970s to, most recently, the Reduceron [2] in 2010, which offers an execution engine for lazy functional programs (with Haskell as example) that is expressible in existing reconfigurable hardware platforms (FPGA).
What this project provides is a concrete mechanization of abstract machines, specialized to the program. An initial program analysis similar to k-CFA first detects long sequences of abstract reduction steps that always happen, grouping the abstract executions into large transitions on the machine state. The result is a specialized abstract machine whose transitions are specialized to the program under analysis. The mechanization of such machines results in concrete reduction steps that may run more efficiently than those in the Reduceron due to their program-specific nature.
I am currently formulating the machine's reduction steps based on the CESK machine [3], using a notion of space/time invariants to translate abstract steps into concrete ones. As in the Reduceron, this scheme will treat individual function bodies (the Reduceron's 'templates') as space-invariant.
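For reference, a minimal CESK machine for the pure lambda calculus can be written in a few lines; this is a sketch of the starting point, not of the specialized, program-dependent machine.

```python
# Terms: 'x' (variable), ("lam", "x", body), or (fun, arg) (application).
# The store maps fresh addresses to values, as in the CESK formulation.
import itertools

def is_value(c):
    return isinstance(c, tuple) and c[0] == "clo"

def step(state, counter=itertools.count()):   # shared allocator for the sketch
    ctrl, env, store, kont = state
    if isinstance(ctrl, str):                          # variable: look it up
        return store[env[ctrl]], env, store, kont
    if isinstance(ctrl, tuple) and ctrl[0] == "lam":   # lambda: close over env
        return ("clo", ctrl, env), env, store, kont
    if not is_value(ctrl):                             # application (f, a):
        f, a = ctrl                                    # evaluate f first
        return f, env, store, ("arg", a, env, kont)
    if kont[0] == "arg":                               # f done: evaluate arg
        _, a, env2, k = kont
        return a, env2, store, ("call", ctrl, k)
    if kont[0] == "call":                              # arg done: apply f
        _, clo, k = kont
        _, lam, cenv = clo
        _, x, body = lam
        addr = next(counter)                           # allocate in the store
        return body, {**cenv, x: addr}, {**store, addr: ctrl}, k

def run(expr):
    state = (expr, {}, {}, ("halt",))
    while not (is_value(state[0]) and state[3][0] == "halt"):
        state = step(state)
    return state[0]

identity = ("lam", "x", "x")
print(run((identity, identity)))    # the identity closure
```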
[1] LISP machine. http://en.wikipedia.org/wiki/Lisp_machine
[2] Reduceron. http://www.cs.york.ac.uk/fp/reduceron/
[3] Matthias Felleisen's thesis describes, among others, the CESK machine.
2. Probabilistic knowledge compilation: online synthesis of efficient sampling machines.
A key (if not the one and only) insight from machine learning that has applicability outside of machine learning is fluidity in representation. A related field is knowledge compilation (KC), which seeks to compile alternative representations of such objects as CNF SAT formulae in order to support solution and solution-counting queries in (hopefully) polynomial time.
I propose the field of probabilistic knowledge compilation, which is concerned with the construction of inexact alternative representations, for which inexact answers to the original query can be obtained, but (hopefully) much faster than the exact methods of KC.
As a first experiment, I consider AND/OR Boolean formulas over linear integer arithmetic, which essentially describe disjoint unions of polyhedra over the integer lattice. As a first realization of the idea, I propose the online synthesis of efficient sampling machines, which perform like importance samplers but learn from past runs which states to avoid, and which start with an approximate representation that leaves room for growth. The formula, initially over all variables, is restricted (and therefore approximated) to subsets of the variables; in a depth-first sampling scheme, this allows the search to terminate early. Then, during sampling, assignments that did not satisfy the constraints are added to a trie-like store so that future runs avoid them intelligently, similar to conflict-driven clause learning (CDCL).
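The following toy sketch (a deliberately small stand-in for the real machines, with an invented two-variable constraint) shows the flavor of the scheme: sample assignments depth-first, record failing states, and prune a prefix once all of its extensions are known to fail.

```python
import random

DOMAIN = range(4)                        # each variable ranges over {0..3}

def satisfies(x, y):                     # hypothetical constraint: x * y >= 6
    return x * y >= 6

learned = set()                          # failing assignments and prefixes

def learn(assign):
    # Record a failing assignment; once every child of its prefix is
    # known to fail, promote the prefix (a crude, trie-flavored step).
    learned.add(assign)
    prefix = assign[:-1]
    if prefix and all(prefix + (v,) in learned for v in DOMAIN):
        learned.add(prefix)

def sample(rng, max_tries=1000):
    for _ in range(max_tries):
        assign, pruned = (), False
        for _ in range(2):               # depth-first over the two variables
            v = rng.randrange(4)
            if assign + (v,) in learned:
                pruned = True            # this prefix already failed: skip
                break
            assign += (v,)
        if not pruned:
            if satisfies(*assign):
                return assign
            learn(assign)                # remember the bad state
    return None

rng = random.Random(0)
print(sample(rng), "after learning", len(learned), "bad states")
```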
Next steps include, among other things, investigations of random depth-first search with backtracking as a non-reversible statistical process.
3. Shred.js: tracing probabilistic programs in JavaScript
I am also taking the developments in the (accepted) AISTATS 2014 paper and building a more portable version for use with the Webchurch system. As this is the second iteration of such work, I am also investigating:
1. a representation of the space of structures associated with a probabilistic program to facilitate running MCMC kernels on traces.
2. a lightweight implementation of tracing that does not presuppose high-level syntax structures and allows (sequence-sensitive) composition of optimizations on traces.
The hope is that the techniques developed for Shred.js will make the tracing techniques in the AISTATS 2014 paper more portable and transferable to a wide variety of platforms, with MCMC on Webchurch being the first step.
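A minimal sketch of lightweight tracing in the spirit of item 2 above (our toy in Python, not Shred.js itself): random choices are recorded at addresses, and re-running the program against an edited trace replays the remaining choices, the basic ingredient for running MCMC kernels over traces.

```python
import random

class Tracer:
    """Record the random choices of a run so it can be replayed."""
    def __init__(self, trace=None):
        self.old = dict(trace or {})     # choices to reuse on replay
        self.new = {}                    # choices actually made this run
        self.counter = 0

    def flip(self, p=0.5):
        addr = self.counter              # naive addressing: call order
        self.counter += 1
        val = self.old.get(addr, random.random() < p)
        self.new[addr] = val
        return val

def program(t):
    # A tiny probabilistic program: number of heads in two biased flips.
    return int(t.flip(0.3)) + int(t.flip(0.3))

random.seed(1)
t1 = Tracer()
out1 = program(t1)
print("trace:", t1.new, "->", out1)

# Propose a change to one recorded choice, then replay the rest.
edited = dict(t1.new)
edited[0] = not edited[0]
t2 = Tracer(edited)
out2 = program(t2)
print("edited:", t2.new, "->", out2)
```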
No major equipment purchases were made during the reporting period.
Ofer Shacham left Stanford to take a position at Google. He will continue to consult on the project and to support the Genesis2 tool he created.
Conversations with many companies continued. We are working closely with NVIDIA and Google on various aspects of the hardware, and have discussed the imaging part of the project with both ST and Sony.
At this point there are no major problems or concerns.
None in this report.
Z-DARPA Report 2013 Q4
- Final draft: "SEEC Report 2013q4-mhchanges.docx" (not uploaded; the wiki does not support .docx uploads)
- Final PDF: https://github.com/StanfordVLSI/Genesis2/wiki/files/DARPA_HR0011-11-0007_SEEC_Quarterly_Technical_Report_Jan_2014.pdf