Routines for Optimizing SparsE Systems (ROSES)

We have options for running the GUNNS solver’s numerical methods on a computer’s Graphical Processing Unit (GPU), for speed. Modern GPU’s have thousands of processing cores that can massively parallelize math operations. This has benefits for rapid image rendering, but also for pure number crunching in crypto-currency mining and the kinds of numerical methods that GUNNS uses.

nVidia provides a run-time library called CUDA to interface with and run processes on its GPU’s. They currently dominate the GPU market and their GPU’s are commonly found in our workstations. So our initial delivery of ROSES only supports CUDA and nVidia GPU’s. We may eventually add support for other GPU brands and their run-time libraries, provided they have similar API’s.

We now provide 3 modes for controlling where the solver’s numerical methods are run. This is controlled by the mGpuMode term in the solver class:

CPU only: this is the old mode and runs all computations on the CPU, as before.
Dense Matrix GPU solution: a mode for decomposing the admittance matrix on the GPU, optimized for dense matrices but also works for sparse matrices.
Sparse Matrix GPU solution: a mode for solving the [A]{x} = {b} system on the GPU, optimized for sparse matrices but with some limitations.

All of these options have advantages and disadvantages. In general, the CPU mode is still faster than the GPU modes for all but very large networks. This is mainly due to the added overhead of copying the [A] matrix back and forth between main memory and the GPU. However as network size increases, the gains from higher computational speed in the GPU start to outweigh the memory copy overhead, and the GPU modes overtake the CPU in speed. These trends, and the relative pros & cons of the various modes are explained in detail below.

We expect the most likely use of ROSES will be for large thermal networks, on the order of thousands of nodes, imported from Thermal Desktop.

Compatible CUDA Versions

Note that we have limited compatibility between GUNNS and CUDA versions:

GUNNS releases 17.2 – 19.5 are compatible with CUDA 8, and we tested with 8.5.
As of Febrauary 2025, the GUNNS master branch is compatible with CUDA 12, is tested with 12.2, and is not compatible with CUDA 8.
We don’t know the compatibility with CUDA versions 9 through 11.

All of the code that interfaces with CUDA is in the gunns/ms-utils/math/linear_algebra/cuda folder. The 8 & 12 versions are not cross-compatible, and we have no requirement for backwards compatibility. Users of the latest GUNNS master branch who still need to use CUDA 8 will need to revert the GUNNS code in that folder to the 19.5 release version.

The CUDA API has changed a lot over the years, and there are still more changes coming that will affect us. Most of our calls to the API are deprecated in 12.2 and may break in versions post-12.2. It looks like they are planning to move all the linear algebra to a new cuDSS library. We will update GUNNS to follow these changes at some point TBD.

Installation of CUDA

ROSES requires CUDA to be installed, both for compilation and run-time. We have these dependencies on CUDA:

Compile dependencies:
- cuda_runtime.h
- cusolverSp.h
- cusolverDn.h
- cublas_v2.h
Run-time library dependencies:
- libcublas_static.a
- libcudart_static.a
- libcusolver.so
- libcusparse_static.a
- libculibos.a

See here for nVidia’s guide for installing on Linux.

Note that in Section 3.2, step 6, we had to change the command to $ sudo yum install cuda-10-0, because by default just “cuda” didn’t install the latest version.
You should end up with /usr/local/cuda-10.0 (or whatever version) and a symbolic link to it at /usr/local/cuda.

Building ROSES

By default, ROSES is disabled. ROSES is enabled in GUNNS by compiling the solver (Gunns class) with the GUNNS_CUDA_ENABLE compiler option. To compile with ROSES, add this:

 -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -DGUNNS_CUDA_ENABLE=1

…to the CXXFLAGS environment variable, or if in Trick, to TRICK_CFLAGS and TRICK_CXXFLAGS. Note that the include and lib paths may vary based on your CUDA installation. In Trick, also make sure to add TRICK_EXCLUDE += /usr/local/cuda (or your install location) so that Trick will not attempt to compile, ICG or SWIG the CUDA stuff.

See gunns/sims/SIM_roses_benchmark/S_overrides.mk for an example of how to set Trick sim makefile options to enable ROSES.

All of the GUNNS compiled libraries (lib/test, lib/no_trick/, lib/trick, lib/trick_if) support ROSES and non-ROSES versions. The default $ make command builds the non-ROSES version. Use $ make roses to build the ROSES versions. The trickified library lib/trick_if only has one version, which supports both ROSES and default configurations.

Timing Benchmarking and Instrumentation

We provide a couple of ways to determine the best GPU option for your network. Note that currently, both of these depend on Trick:

SIM_roses_benchmark

The GUNNS repository contains a Trick sim that can be used to benchmark the ROSES performance and its various modes on your machine. The performance depends mainly on the GPU type, and to a lesser extent the CPU, memory and bus speeds, and the OS.

To build the sim, start a fresh shell in the GUNNS environment, go to gunns/sims/SIM_roses_benchmark, and build the simulation with the appropriate CP command for your Trick version. This sim will use the compiled libs gunns/lib/trick/libgunnsroses.a and gunns/lib/trick_if/libgunns.o if they are built. To run the sim, do:

$ ./S_main…exe RUN_test/input.py

It will cycle through networks of increasing size and output the average mSolveTime (see below) for each mode. At the end, it will terminate itself and estimate the ideal mGpuSizeThreshold values for the GPU_DENSE and GPU_SPARSE modes on this machine. Further options are available in the RUN_test/input.py to change the min & max network sizes to sweep through, and enable output of extra verification data.

Below is an example of the screen output:

GUNNS & ROSES Timing Benchmark Results:

GPU mode:   NO_GPU      GPU_DENSE   GPU_SPARSE
# nodes     time (s)     time (s)     time (s)
----------------------------------------------
      5     1.38e-06     4.13e-04     2.44e-03
     17     4.10e-06     9.24e-04     3.07e-03
     37     3.29e-05     1.76e-03     2.43e-03
     65     8.10e-05     2.88e-03     2.75e-03
    101     2.37e-04     4.57e-03     3.43e-03
    145     5.93e-04     6.38e-03     3.71e-03
    197     1.14e-03     8.72e-03     4.15e-03
    257     2.36e-03     1.17e-02     5.20e-03
    325     4.35e-03     1.49e-02     6.26e-03
    401     7.75e-03     1.86e-02     7.29e-03
    485     1.31e-02     2.31e-02     8.73e-03
    577     2.12e-02     2.84e-02     9.98e-03
    677     3.38e-02     3.36e-02     1.22e-02
    785     5.14e-02     4.04e-02     1.37e-02
    901     7.69e-02     4.82e-02     1.59e-02
   1025     1.13e-01     5.68e-02     1.84e-02
   1157     1.61e-01     6.74e-02     2.09e-02
   1297     2.29e-01     7.93e-02     2.37e-02
   1445     3.29e-01     9.29e-02     2.70e-02
   1601     4.27e-01     1.07e-01     3.01e-02
----------------------------------------------
GPU # nodes
threshold estimates:          675          388

Here are some benchmarks we have observed on example systems:

System: a DELL Precision T7600 with Xeon CPU E5-2687W @ 3.10 GHz, CENTOS 7, Trick 17, GeForce GTX 770, CUDA 8.5:

GPU # nodes
threshold estimates           676          391
(10 run average):

System: a DELL Precision T7600 with Xeon CPU E5-2687W @ 3.10 GHz, CENTOS 7, Trick 17, GeForce GTX 970, CUDA 8.5:

GPU # nodes
threshold estimates           604          284
(10 run average):

System: a DELL Precision T7600 with Xeon CPU E5-2687W @ 3.10 GHz, CENTOS 7, Trick 17, GeForce GTX 1080Ti, CUDA 8.5:

GPU # nodes
threshold estimates           549          261
(10 run average):

System: a DELL Precision 7920 with Xeon GOLD 6242 CPU @ 2.8 GHz, RHEL 8, Trick 19, GeForce GTX 1080Ti, CUDA 12.2:

GPU # nodes
threshold estimates           507          244
(10 run average):

Some notes:

As # nodes increases, GPU_SPARSE overtakes GPU_DENSE in speed before GPU_DENSE overtakes NO_GPU.
GPU_SPARSE becomes faster than GPU_DENSE because GUNNS matrices are usually very sparse, and it has a smaller amount of data it has to move between CPU & GPU memory, i.e. one [A] and one {b} vs. two [A].
As expected, better graphics cards and modern systems lower the GPU threshold.
The network used in this sim has similar sparsity as the typical GUNNS aspects, i.e. around 3-4 incident links per node.

Run-time Instrumentation

The above off-line benchmarking can give you a general idea of where the threshold between GPU & CPU is on your machine, but the actual best threshold will vary for each individual network. We expect several factors like sparsity, topography, aspect, and how dynamic the network is to all affect the best GPU mode and best GPU/CPU threshold.

So, we provide instrumentation to give a better idea of how a specific network performs in run-time. The GUNNS solver contains two variables that record elapsed step times in real-time:

mSolveTime: this the accumulated wall clock time, in seconds, of the [A] decomposition and {x} solve numerical methods for all islands of the most recent minor step of the solver.
mStepTime: this is the elapsed wall clock time, in seconds, of the most recent major step of the solver. This will always be larger than mSolveTime. This gives an indication of how much of the total network time is spent in the numerical solution methods, as opposed to stepping the other network objects like nodes & links. Note that this includes all the minor steps in this major step.

These terms allow direct observation of the performance of a specific network in response to changing its mGpuMode and mGpuSizeThreshold settings. This is intended to allow you to fine-tune the best settings for your specific network.

Note that these times simply record the elapsed wall-clock time between start and end, but this also includes any time spent interrupted by the OS to perform other threads. So these times aren’t necessarily the true time spent by the network. We believe these interrupts account for a lot of the variance you see in these times from step to step.

Currently, these clocks only work in Trick sims, as we rely on an accurate clock function provided by Trick.

Here is an example of what mSolveTime and mStepTime look like in run-time, while switching through the mGpuModes:

GPU Options, pros & cons:

Judging by the timing metrics from SIM_roses_benchmark, it would seem that GPU_SPARSE is always a better option than GPU_DENSE, and GPU_DENSE is never needed. However, GPU_SPARSE has some very severe limitations that aren’t captured in SIM_roses_benchmark, and these should be carefully considered:

GPU_SPARSE doesn’t get the decomposed [A] from the GPU and therefore can’t reuse it. The decomposed [A] is reused for 2 things:

1. In computing Network Capacitance, re-using the decomposed [A] eliminates a decomposition for every node in the system for which the network capacitance is needed.
2. When networks reach steady-state, the input [A] doesn’t change and therefore doesn’t need to be re-decomposed if we already have its decomposition. This eliminates a decomposition for every step where the input [A] hasn’t changed.

Therefore, networks that use network capacitance or spend a lot of time in steady-state are likely to see GPU_SPARSE become slower than other options.

This picture shows the basic differences between the 3 GPU options, where things happen on the CPU vs. GPU, what data is available to each option and when:

Using ROSES at Run-Time

Even though GUNNS may have been compiled with ROSES enabled (see above), you still must set controls in the solver to use the GPU. There are two runtime variables to control:

mGpuMode — An enumeration of the GPU mode, or which GPU or CPU method to use: NO_GPU, GPU_DENSE, GPU_SPARSE.
mGpuSizeThreshold — Only matrices and matrix islands that are at least this size (# nodes, or rows) will be solved on the GPU, and those that are smaller will still be solved on the CPU. This defaults to a very large number so that by default, no matrices will be sent to the GPU. The solver will not allow a value smaller than 2.

These are protected attributes of the solver class, but can be set by calling the public setGpuOptions(const Gunns::GpuMode mode, const int threshold) method. In Trick, the attributes can be set directly via Trick View. These options can be changed at any time during run.

In Trick sims, we recommend calling setGpuOptions from the input file.

All modes should result in the same network solution at all times. The only difference should be in how long it takes to solve, which may be observed in the mStepTime and mSolveTime variables (see above).

Developer Notes for GunnSmiths

All direct CUDA dependencies are in ms-utils/math/linear_algebra/cuda. Then, core/Gunns.cpp depends on those classes.
The CUDA dependencies in core/Gunns.cpp are wrapped in the GUNNS_CUDA_ENABLE compiler directive. We could have instead put them in a derived solver class, but this would complicate GunnsDraw by having to support different solver classes.
To run unit tests on the CUDA stuff, do “$ make -f Makefile.roses” instead of the usual “$ make”
Unit tests with CUDA enabled are in these test folders: ms-utils/math/linear_algebra/cuda and core/test.
The LCOV code coverage for core/Gunns is split between the ROSES and non-ROSES unit tests. Code coverage should account for the overlap of these two tests.
We suppress many “possible” Valgrind errors in the ROSES unit tests.
- We think these are false errors related to several warnings from Valgrind about “noted but unhandled ioctl…. This could cause spurious value errors to appear.”
- All of these come from the CUDA libraries.
- We maintain these suppressions in error suppression file gunns/test/utils/roses.supp, and Makefile.roses uses these suppressions by default.
- Newer graphics cards may output more of these errors that need to be added to the suppressions file. The gunns/test/utils/extract_valgrind_suppressions.py script can help automate this process; see instructions in that script.
The clock function used for mStepTime and mSolveTime is connected to Trick via a macro defined in core/GunnsInfraMacros.hh. There is room here to implement a function for non-Trick environments, but this is left to the users.

ROSES - nasa/gunns GitHub Wiki

Routines for Optimizing SparsE Systems (ROSES)

Compatible CUDA Versions

Installation of CUDA

Building ROSES

Timing Benchmarking and Instrumentation

SIM_roses_benchmark

Run-time Instrumentation

GPU Options, pros & cons:

Using ROSES at Run-Time

Developer Notes for GunnSmiths

⚠️ GitHub.com Fallback ⚠️

ROSES - nasa/gunns GitHub Wiki

Routines for Optimizing SparsE Systems (ROSES)

Compatible CUDA Versions

Installation of CUDA

Building ROSES

Timing Benchmarking and Instrumentation

SIM_roses_benchmark

Run-time Instrumentation

GPU Options, pros & cons:

Using ROSES at Run-Time

Developer Notes for GunnSmiths

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️