Benchmarking and Profiling

The first set of benchmarks was run in late August 2014 to provide a baseline for future work on speeding up the code.

The run time per iteration (“user” time as reported by the `time` command) was recorded on 2.4 GHz Intel Xeon machines. Because the simulation is stochastic, a distribution of run times is shown for each version.

The first benchmark plot compares the program before and after adding code to write to the UnitStats table and to put the database in WAL mode. There is no noticeable difference between the two versions.
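For context: WAL (write-ahead logging) is an SQLite journal mode, enabled with a PRAGMA statement, that reduces the cost of interleaving writes with reads. Below is a minimal sketch of enabling it through the SQLite C API, assuming the scenario database is SQLite and using a placeholder filename:

```c
#include <stdio.h>
#include <sqlite3.h>

int
main (void)
{
  sqlite3 *db;
  char *errmsg = NULL;

  /* "scenario.db" is a placeholder, not ADSM's actual database path. */
  if (sqlite3_open ("scenario.db", &db) != SQLITE_OK)
    {
      fprintf (stderr, "open failed: %s\n", sqlite3_errmsg (db));
      return 1;
    }
  /* WAL mode lets readers proceed while the simulation writes, and does
     less fsync work than the default rollback journal. */
  if (sqlite3_exec (db, "PRAGMA journal_mode=WAL;", NULL, NULL, &errmsg)
      != SQLITE_OK)
    {
      fprintf (stderr, "PRAGMA failed: %s\n", errmsg);
      sqlite3_free (errmsg);
    }
  sqlite3_close (db);
  return 0;
}
```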

Figure 1. Run time before and after adding code to write to the UnitStats table and to put the database in WAL mode.

Optimization 1: Unit State Monitor

Profiling showed the Unit State Monitor taking more time than it should. That module was fixed to make better use of Unit State Change events to track the number of units in each state (commit 2c903cc). Figure 2 shows that this resulted in a 1.1×–1.4× speedup.
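The actual fix is in commit 2c903cc; the sketch below, with invented type and function names, just illustrates the counting technique: adjust per-state counters on each Unit State Change event instead of re-counting all units.

```c
#include <glib.h>

/* Illustration only; type and function names are made up, not ADSM's.
   The monitor keeps one counter per disease state and adjusts it on each
   Unit State Change event, instead of re-counting all units every day. */

typedef enum { SUSCEPTIBLE, LATENT, INFECTIOUS, IMMUNE, DESTROYED, N_STATES }
  unit_state_t;

typedef struct { guint count[N_STATES]; } state_monitor_t;

static void
handle_unit_state_change (state_monitor_t *monitor,
                          unit_state_t old_state, unit_state_t new_state)
{
  /* O(1) work per event, versus O(number of units) per simulation day. */
  monitor->count[old_state]--;
  monitor->count[new_state]++;
}

int
main (void)
{
  state_monitor_t monitor = { { 100, 0, 0, 0, 0 } }; /* 100 susceptible */

  handle_unit_state_change (&monitor, SUSCEPTIBLE, LATENT);
  g_print ("susceptible=%u latent=%u\n",
           monitor.count[SUSCEPTIBLE], monitor.count[LATENT]);
  return 0;
}
```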

Figure 2. Run time before and after a fix to the Unit State Monitor.

(Non-)Optimization 2: GQuarks

Profiling still showed the “monitor” modules (the ones that maintain output variables) taking more time than they should. Output variables are objects that can have sub-categories (and sub-sub-categories, and so on). According to the GLib documentation, the data structures used to implement them can be sped up by using “GQuarks” instead of strings to index into sub-categories. But figure 3 shows that trying this had no noticeable effect, so the change was not checked in.
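For context, a GQuark is GLib's interned-string identifier: g_quark_from_string() maps a string to a small integer, equal strings always get the same integer, and integer keys can be hashed and compared much more cheaply than strings. A minimal sketch (the category name and table are illustrative, not ADSM code):

```c
#include <glib.h>

int
main (void)
{
  /* Interning the same string twice yields the same quark, so an
     integer comparison replaces strcmp(). */
  GQuark q1 = g_quark_from_string ("Direct Contact");
  GQuark q2 = g_quark_from_string ("Direct Contact");
  g_assert (q1 == q2);

  /* A sub-category table keyed by quark can use direct hashing. */
  GHashTable *subcategories =
    g_hash_table_new (g_direct_hash, g_direct_equal);
  g_hash_table_insert (subcategories,
                       GUINT_TO_POINTER (q1),
                       GINT_TO_POINTER (42) /* placeholder value */);
  g_print ("value = %d\n",
           GPOINTER_TO_INT (g_hash_table_lookup (subcategories,
                                                 GUINT_TO_POINTER (q1))));
  g_hash_table_destroy (subcategories);
  return 0;
}
```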

Figure 3. Run time before and after using GQuarks in output variables.

Optimization 3: Simpler Output Variables

As mentioned above, output variables are objects that can have sub-categories (and sub-sub-categories, and so on). It might be better to have extremely simple output variable objects and to keep 1D or 2D arrays of them when sub-categories are needed (commit f8e72f2). Figure 4 shows that this change results in a 1.7×–4.3× speedup on top of optimization #1, or a 2×–6× speedup over the first benchmark from August 18.
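A hypothetical sketch of the flattened design (none of these names come from the ADSM source): each output variable is reduced to a bare value, sub-categories become array dimensions, and an update becomes plain array indexing instead of nested lookups.

```c
#include <glib.h>

/* Hypothetical names, for illustration only. */
typedef struct { double value; } output_var_t;

#define N_CONTACT_TYPES 2     /* e.g. direct, indirect */
#define N_PRODUCTION_TYPES 3  /* illustrative count */

typedef struct
{
  /* One very simple output variable per (contact type, production type)
     pair, replacing one object with nested sub-categories. */
  output_var_t num_units_infected[N_CONTACT_TYPES][N_PRODUCTION_TYPES];
} outputs_t;

int
main (void)
{
  outputs_t outputs = { 0 };

  /* Recording an infection is now a constant-time array update. */
  outputs.num_units_infected[0][1].value += 1;
  g_print ("%g\n", outputs.num_units_infected[0][1].value);
  return 0;
}
```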

Figure 4. Run time before and after simplifying the output variable data structure.

Optimization 4: Zone Monitor

Optimization 4 is a small fix to the Zone Monitor, very similar to the change made to the Unit State Monitor in optimization #1 (commits 46dc5f1 and a74adf9). The speedup is not large, but it is consistent across the benchmark scenarios, bringing us to a 2×–6.5× speedup over the first benchmark from August 18.

Figure 5. Run time before and after a fix to the Zone Monitor.

(Non-)Optimization 5: Memoization Table

In March 2010, code for a data structure called a “memoization table” (essentially a cache for spatial searches) was contributed. In its original form it was slow due to an unnecessary insertion sort and frequent re-calculation of distances. Even with those problems fixed, running with the memoization table is still slower than running without it (figure 6). The memoization table never expires entries from its cache, so its memory use is also high. This code is stored in a branch named “distance-speedup” in case we wish to revisit it.
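To make the idea concrete, here is a hypothetical sketch of such a cache, with invented names; the real code is in the distance-speedup branch. Note that, as described above, nothing here ever evicts an entry.

```c
#include <glib.h>

typedef struct { guint unit_id; double radius; } search_key_t;

/* The underlying, expensive spatial search; assumed to exist elsewhere. */
extern GSList *do_spatial_search (guint unit_id, double radius);

static guint
search_key_hash (gconstpointer p)
{
  const search_key_t *k = p;
  return k->unit_id * 31u + (guint) (k->radius * 1000.0);
}

static gboolean
search_key_equal (gconstpointer a, gconstpointer b)
{
  const search_key_t *ka = a, *kb = b;
  return ka->unit_id == kb->unit_id && ka->radius == kb->radius;
}

GHashTable *
memo_table_new (void)
{
  return g_hash_table_new (search_key_hash, search_key_equal);
}

GSList *
search_with_memoization (GHashTable *memo, guint unit_id, double radius)
{
  search_key_t key = { unit_id, radius };
  GSList *result = g_hash_table_lookup (memo, &key);

  if (result == NULL)
    {
      result = do_spatial_search (unit_id, radius);
      /* Entries are inserted but never removed: the memory problem. */
      g_hash_table_insert (memo, g_memdup2 (&key, sizeof key), result);
    }
  return result;
}
```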

Figure 6. Run time with and without the memoization table.

Optimization 6: Rearrange Desired Contacts Array

Profiling showed the program spends a lot of time on contact spread. That code uses an array of desired contacts, which is filled in with potential recipients during a scan over units near the source of infection. That array was re-arranged to make it faster to identify and discard units that cannot be recipients (commit a7605b0). Figure 7 shows that in two benchmark scenarios, this change either has no effect or produces a very small speedup. But the benchmark scenario that is heaviest on contact spread shows a large improvement.
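The real change is in commit a7605b0; the following is only a guess at its general shape, with invented names. If the desired-contacts array is grouped by recipient production type, then a scanned unit can be discarded immediately whenever no pending contact wants its type:

```c
#include <glib.h>

#define N_TYPES 3  /* hypothetical number of production types */

typedef struct { guint production_type; double x, y; } unit_t;
typedef struct { double desired_distance; unit_t *best_match; } desired_contact_t;

typedef struct
{
  desired_contact_t *contacts;  /* all pending contacts, grouped by type */
  guint start[N_TYPES + 1];     /* group t = contacts[start[t] .. start[t+1]-1] */
} grouped_contacts_t;

/* The expensive comparison, assumed to exist elsewhere. */
extern void update_best_match (desired_contact_t *contact, unit_t *unit);

void
consider_unit (grouped_contacts_t *g, unit_t *unit)
{
  guint t = unit->production_type;

  /* Fast discard: no pending contact wants this unit's production type. */
  if (g->start[t] == g->start[t + 1])
    return;
  for (guint i = g->start[t]; i < g->start[t + 1]; i++)
    update_best_match (&g->contacts[i], unit);
}
```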

Figure 7. Run time before and after rearranging the desired contacts array.

Optimization 7: Shorter Loops

There were more opportunities to shorten loops by making better use of the information in the infectious_units hash table and in the Unit State Change events (commits e19d80e, 89d6ff9, b0094a5, and 92c4b90). This produced a modest speedup in all three benchmark scenarios (figure 8).
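As an illustration of the technique (function and type names invented, though the infectious_units table is mentioned above): iterating over a hash table that holds only the infectious units avoids touching the rest of the population at all.

```c
#include <glib.h>

typedef struct unit unit_t;

/* Per-unit work, assumed to exist elsewhere. */
extern void attempt_contacts_from (unit_t *unit);

void
spread_from_infectious_units (GHashTable *infectious_units)
{
  GHashTableIter iter;
  gpointer key, value;

  /* Loop length = number of infectious units, not number of units. */
  g_hash_table_iter_init (&iter, infectious_units);
  while (g_hash_table_iter_next (&iter, &key, &value))
    attempt_contacts_from ((unit_t *) value);
}
```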

Figure 8. Run time before and after shortening some loops.

Optimization 8: Inlining in Contact Spread Module

Some minor inlining in the contact spread module (commit adcd882) produces a noticeable speedup in the benchmark scenario that is heaviest on contact spread (figure 9).
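As a generic illustration of the technique, not the actual change from commit adcd882: a small, hot predicate declared static inline in a header lets the compiler expand it at each call site, removing function-call overhead.

```c
#include <glib.h>

/* Hypothetical unit record, for illustration only. */
typedef struct { gboolean destroyed; double distance_from_source; } unit_t;

/* "static inline" in a header allows the compiler to expand this at
   every call site in the contact spread module. */
static inline gboolean
unit_is_eligible_recipient (const unit_t *unit, double max_distance)
{
  return !unit->destroyed && unit->distance_from_source <= max_distance;
}
```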

Figure 9. Run time before and after inlining some functions in the contact spread module.

Optimization 9: Short-Distance Contacts Fix

Understanding this change and the next one requires a brief review of how contact spread works. The rules for contact spread say that the simulator should:

  1. choose a desired distance for the contact.
  2. choose the potential recipient whose distance from the source best matches the desired distance.

Figure 10 illustrates this. The red X is the source of infection. The dotted black circle is the desired distance for the contact. The blue X’s are potential recipients. The blue X inside the green circle would be chosen.

Figure 10. Choosing a recipient for contact spread.

The green circle’s radius is twice that of the dotted black circle. If the desired distance is d, then any potential recipient inside the green circle is within d of the desired distance, while any potential recipient outside it misses by more than d; so a recipient inside the green circle is always a better match, based on distance, than any recipient outside it. However, if there were no potential recipient inside the green circle, the contact would go all the way out to the other blue X. There is no upper limit on contact distance.

In the simulator code, scanning the green circle is generally a fast operation (it can make use of the spatial index), while falling back on a scan outside the circle is a slower operation.
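To make the matching rule concrete, here is a self-contained toy version of the distance-matching step (all names invented). The real code would first run this over the results of the fast green-circle query, and only fall back to scanning everything else when that query comes back empty.

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Toy illustration of the rule: pick the unit whose distance from the
   source best matches the desired contact distance. */

typedef struct { double x, y; } point_t;

static double
distance (point_t a, point_t b)
{
  return hypot (a.x - b.x, a.y - b.y);
}

static size_t
best_match (point_t source, const point_t *units, size_t n, double desired)
{
  size_t best = 0;
  double best_mismatch = INFINITY;

  for (size_t i = 0; i < n; i++)
    {
      double mismatch = fabs (distance (source, units[i]) - desired);
      if (mismatch < best_mismatch)
        {
          best_mismatch = mismatch;
          best = i;
        }
    }
  return best;
}

int
main (void)
{
  point_t source = { 0, 0 };
  point_t units[] = { { 1.0, 0 }, { 5.0, 0 }, { 0, 2.5 } };

  /* Desired distance 2.0 -> mismatches 1.0, 3.0, 0.5 -> unit 2 wins. */
  printf ("chosen unit: %zu\n", best_match (source, units, 3, 2.0));
  return 0;
}
```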

Benchmark scenario “I” was showing a high number of fallback scans. The cause was parameters set to produce very short-distance contacts; for those contacts, the green circle was so small that it contained few or no neighbors.

The simulator now estimates the population density and sets a minimum radius for the green circle (commit bfffc96). Experimentally, a circle that is expected to contain three dozen neighbors seems to be a good choice. This change produces a small speedup in scenario “I” but does not affect the other benchmark scenarios (figure 11).
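The arithmetic behind such a minimum radius is simple: a circle expected to hold about 36 units in a population of uniform density ρ (units per km²) has area 36/ρ, and therefore radius √(36/(πρ)). A sketch of that calculation, assuming this simple uniform-density estimate (the actual estimator in commit bfffc96 may differ):

```c
#include <math.h>
#include <stdio.h>

/* Radius of a circle expected to contain about 3 dozen units, given an
   estimated population density in units per square km: solves
   density * pi * r^2 = 36 for r. */
static double
min_search_radius (double units_per_km2)
{
  return sqrt (36.0 / (M_PI * units_per_km2));
}

int
main (void)
{
  /* e.g. 0.5 units per square km -> radius of about 4.8 km */
  printf ("min radius = %.2f km\n", min_search_radius (0.5));
  return 0;
}
```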

Figure 11. Run time before and after a fix for short-distance contacts.

Optimization 10: Rare High Incoming Types

After optimization #9, the benchmark scenarios were still showing a high number of fallback scans.

Sorting these occurrences by recipient type revealed that the cause was contacts going to what could be called “rare high incoming” types. Because those types are rare, the simulator will often fail to find one inside the green circle; because they have a high incoming contact rate, the fallback scans happen frequently, and the performance penalty adds up.

The simulator now calculates a rarity score (0–2) and an incoming rate score (0–2) for each type. The two scores are multiplied together, and any type where the product is ≥ 3 is designated a rare high incoming type. Those types are treated as a special case during generation of direct and indirect contacts (commit 006af27).
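A sketch of the classification rule just described (struct and function names invented; how the two scores themselves are derived is not shown):

```c
#include <glib.h>

typedef struct
{
  guint rarity_score;    /* 0 = common ... 2 = rare */
  guint incoming_score;  /* 0 = few incoming contacts ... 2 = many */
} type_scores_t;

static gboolean
is_rare_high_incoming (const type_scores_t *t)
{
  /* With both scores capped at 2, only a type scoring 2 on both axes
     reaches a product of 3 or more. */
  return t->rarity_score * t->incoming_score >= 3;
}

int
main (void)
{
  type_scores_t example = { 2, 2 };  /* hypothetical production type */

  g_print ("rare high incoming: %s\n",
           is_rare_high_incoming (&example) ? "yes" : "no");
  return 0;
}
```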

This change produces a speedup in all three benchmark scenarios, most noticeably for the one that is heaviest on contact spread (figure 12). The scoring system that is used to designate rare high incoming types could potentially be tuned for improved results.

Figure 12. Run time before and after treating rare high incoming types as a special case.

Overall change in throughput since the first benchmark: 520%–1230%.

How these numbers are generated: Because these runs are done on a shared cluster, we cannot assume that results generated on different machines, or at different times, are comparable. For each plot that superimposes the run time distributions of a version A and a version B, the runs are done on the same machine, and they are interleaved (one iteration with version A, one iteration with version B, and so on) so that a speedup or slowdown of the machine will not disproportionately affect the results for one version.