Reducing the Compile Time for a DML Device - intel/device-modeling-language GitHub Wiki

The compilation of DML devices can sometimes be slow. This page describes some reasons for slow compilation, and some techniques to overcome them.

Possible Reasons for Slow Compilation

  • A device with thousands of register declarations might slow down the compiler.
  • DMLC produces C files, which are then compiled by GCC. Large C files take more time for DMLC to generate, and more time for GCC to compile.
    • In DML 1.2, unique C functions are generated for the get/set/read/write operations of each register. This usually dominates C file size; furthermore, these functions are the slowest both for DMLC to generate and for GCC to compile.
    • In DML 1.4, the compiler can often generate accessor functions that are shared between all register instances, together with lookup tables. The tables can sometimes be large, but tables are an order of magnitude faster both for DMLC to generate and for GCC to compile. Therefore, the compile time depends heavily on how often DMLC can generate shared implementations of common methods.
  • When C code generation does not dominate the time, the template expansion phase, in particular the evaluation of parameter values, tends to be the main bottleneck.
  • The DML compiler is written in Python, which is a slow language.

Profile your compilation

  • Run make V=1 to get the relevant DMLC and GCC command lines. Make sure to copy the full command lines used for compilation. Note that the dmlc --dep and gcc -M commands only generate dependency files and are largely irrelevant here.
  • Run these commands in isolation, timing each with e.g. time. Note that the commands should be run from the linux64/obj/modules/MODULENAME directory.
    • If you work on a multi-user machine, profiling results can depend too much on the machine's load; in this case you may want to move your code to a local machine where you have exclusive access to the CPU. Use the DMLC_DUMP_INPUT_FILES environment variable, as described in the README.
  • Build DMLC locally to enable some extra profiling. The simplest approach is to edit py/dml/dmlc.py and set time_dmlc = True. This adds a printout like:
       startup    0.073
       parsing    16.078
       process    236.947
       info       0.001
       c          137.433
       total      391.375
    
  • If the time is dominated by C compilation, try passing --split-c-file=1000000 to DMLC. The generated C code will then be split into chunks of roughly one megabyte. You can compare the timestamps of these files to see which ones were extra slow to generate, and if you compile them sequentially (for x in *-dml-*.c ; do gcc $x ; done) you can see which ones are expensive for GCC to compile. DMLC often generates many similar functions at the same time, so by looking at the first function declaration of every file (grep -m 1 -B 2 '^{' *-dml-*.c | grep -v -e -# -e ':{' -e ^--) you can often get a fair estimate of the culprits. Functions named _DML_M_* come from non-shared methods.

Move to DML 1.4

If your device is written in DML 1.2, then moving it to 1.4 will usually reduce compile times by a factor of 2-3.

Run gcc -O0

The default is to compile the generated C code with gcc -O2, which adds some compile time. -O0 gives slower code, but the performance of device models is usually not that critical.

Make methods shared to reduce code explosion

If a large part of the time is spent generating or compiling C functions named _DML_M_*, then you probably have a template (or an in each block) that is instantiated many times. The C function name is constructed from the DML object name plus the method name, so try to spot recurring patterns among functions with similar endings. Changing a declaration from method m(..) { .. } into shared method m(..) { ... } allows DMLC to share its implementation across all instantiations of the template. A shared method declaration does require some extra care: you can only access parameters that have been assigned a type (like param foo : int;), you can only access other objects through typed parameters (like param companion_reg : register; param companion_reg = cast(other_bank.my_reg, register);), and you can only call other shared methods. Still, converting your most commonly used methods into shared methods is well worth the effort in terms of reduced compile times; see the sketch below.
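
As a rough sketch (the template and register names are made up for illustration, and the builtin method names and signatures used here, write_register and set_val, should be checked against dml-builtins.dml for your DML version), a shared write method reaching a companion register through a typed parameter could look like this:

dml 1.4;
device example_dev;

// Every register using this template clears a companion status register
// when written. Because the method is shared, DMLC can emit one C function
// for the whole template instead of one per register instance.
template clear_status_on_write is register {
    // Shared methods may only refer to typed parameters like this one.
    param status_reg : register;

    shared method write_register(uint64 value, uint64 enabled_bytes, void *aux) {
        default(value, enabled_bytes, aux);
        // Other objects are only reachable through typed parameters;
        // set_val is assumed to be one of the register template's shared accessors.
        status_reg.set_val(0);
    }
}

bank regs {
    register ctrl size 4 @ 0x0 is clear_status_on_write {
        param status_reg = cast(regs.status, register);
    }
    register status size 4 @ 0x4;
}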

Use pypy

The pypy interpreter often speeds up the DMLC part of the build by around 5x for large files (though it can be around 2x slower for tiny files). The README explains how to set it up.

Break out smaller devices for unit testing DML code

Compile times are often dominated by huge register banks, which are seldom modified. In day-to-day work, when implementing an isolated piece of logic in some common DML file, waiting a minute for that huge register bank to compile is a waste of time. Instead, create a small stub DML device that makes use of your common code in a minimal fashion, and write a small unit test for this stub to verify that the logic in your common code works as intended; see the sketch below. The time needed to compile and test your code is then reduced to a second or two, which is a huge increase in productivity. Of course, you will still need to compile the full device to validate that it works in practice within the full system, but you don't need to do this as often.
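
For example, if the logic under test lives in a template defined in some common file (here hypothetically called common.dml, defining a template checksum_engine), the stub device can be as small as:

dml 1.4;
device stub_for_common_code;

// Hypothetical: common.dml defines the template checksum_engine,
// whose logic we want to exercise without the full device.
import "common.dml";

bank regs {
    // A single register is enough to drive the template from a unit test.
    register data size 4 @ 0x0 is checksum_engine;
}

A unit test can then write to the stub's register and check the result, without ever building the full device.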

Avoid creating explicit register objects for "dumb" registers

When modeling a large system at enterprise scale, bank definitions are usually produced from a machine-readable hardware description (expressed in a format such as IP-XACT), which provides things like register offsets and access patterns such as "read-only". These simple definitions are sufficient for the vast majority of registers; only a small subset of registers need additional side-effects declared in DML code. For such "dumb" registers, the expressive power of DML is overkill. It is fairly easy to write your own table-based register dispatcher in C, tuned for your particular IP-XACT dialect, and it is possible to create a combined model where registers with side-effects are modeled in DML and the remaining registers are delegated to that table-based C model; one possible shape of the DML side is sketched below.
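
A minimal sketch of the DML side of such a combined model is shown here. The C dispatcher functions (table_read/table_write and the table behind them) are hypothetical, and the signatures of the bank's unmapped_read/unmapped_write methods are assumed; check dml-builtins.dml in your version for the exact form. The idea is only that explicitly declared registers take precedence, while every other access falls through to the C table.

dml 1.4;
device combined_model;

// Hypothetical C dispatcher, generated from the IP-XACT description
// and compiled into the same module.
extern uint64 table_read(conf_object_t *dev, uint64 offset, uint64 bits);
extern void table_write(conf_object_t *dev, uint64 offset, uint64 bits, uint64 value);

bank regs {
    // Only registers with side-effects are declared in DML.
    register irq_ack size 4 @ 0x10 {
        method write_register(uint64 value, uint64 enabled_bytes, void *aux) {
            default(value, enabled_bytes, aux);
            // ... side-effect goes here ...
        }
    }

    // Accesses that hit no declared register fall through to the C table.
    // NOTE: assumed method signatures; verify against dml-builtins.dml.
    method unmapped_read(uint64 offset, uint64 bits, void *aux) -> (uint64) throws {
        return table_read(dev.obj, offset, bits);
    }
    method unmapped_write(uint64 offset, uint64 value, uint64 bits, void *aux) throws {
        table_write(dev.obj, offset, bits, value);
    }
}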

Machine-generated bank declarations

Instead of generating one template per bank, generate one template per register, and only instantiate the registers that you access or override. This avoids expensive processing of registers that you never access.
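
As a sketch (all names are hypothetical, and the parameter choices assume the DML 1.4 register params size, offset and init_val), the generator could emit one small template per register into a generated file:

// generated_regs.dml (machine-generated, one template per register):
dml 1.4;

template gen_ctrl_reg is register {
    param size = 4;
    param offset = 0x0;
    param init_val = 0x1;
}
template gen_status_reg is register {
    param size = 4;
    param offset = 0x4;
    param init_val = 0x0;
}
// ... thousands more, none of which cost anything until instantiated ...

The hand-written device then imports the generated file and instantiates only what it needs:

dml 1.4;
device my_device;
import "generated_regs.dml";

bank regs {
    // Only registers we access or override are instantiated; the rest
    // never go through template expansion or C code generation.
    register ctrl is gen_ctrl_reg;
    register status is gen_status_reg;
}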

Hand-written bank declarations

By writing the set of important registers by hand, you can often spot patterns. For instance, if you see:

register foo_enable @ 0;
register foo_status @ 4;
register foo_data[i<8] @ 8 + i;
register bar_enable @ 80;
register bar_status @ 84;
register bar_data[i<8] @ 88 + i;

then you will instinctively extract a template for enable/status/data and instantiate it in two groups foo and bar, as sketched below. Apart from giving code that is easier to read, this also allows you to implement functionality in shared methods, which helps if a template is instantiated many times. Only registers with functionality need to be explicitly modeled; dumb registers can be driven by a machine-generated C table. You do need to validate the hand-written code against the machine-readable specification, which requires a bit of trickery.
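
A sketch of the refactored declarations (register sizes and the exact offset layout are illustrative):

template enable_status_data is group {
    param base;   // bank-relative byte offset of this block

    register enable size 4 @ base + 0;
    register status size 4 @ base + 4;
    register data[i < 8] size 4 @ base + 8 + i * 4;
}

bank regs {
    group foo is enable_status_data { param base = 0; }
    group bar is enable_status_data { param base = 80; }
}

Functionality common to all instances can then be added to the template, preferably as shared methods as described above.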