Runtimes and Benchmarking

Theoretical and practical runtimes

The number of TE calculations scales with O(P² d l S), where P is the number of processes, d the average in-degree that is eventually inferred, l the maximum temporal search depth per process, and S the number of surrogate calculations, assuming that d is independent of the network size P. Note that in the worst case of a fully connected network, d = P, which leads to cubic runtimes in P.
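As a quick sanity check, this scaling can be turned into a back-of-the-envelope estimate of the number of TE calculations; the parameter values in the following sketch are illustrative assumptions only:

```python
# Rough estimate of the number of TE calculations for a full network analysis,
# following the O(P^2 d l S) scaling above. All parameter values are illustrative.
P = 100   # number of processes (network nodes)
d = 3     # average inferred in-degree
l = 5     # maximum temporal search depth per process
S = 200   # number of surrogate calculations

n_te_calculations = P ** 2 * d * l * S
print(f"approx. number of TE calculations: {n_te_calculations:.2e}")  # ~3.0e+07
```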

IDTxl's network inference algorithm uses parallelisation to speed up the analysis (N denotes the number of time series samples); a usage sketch for selecting the CPU or GPU estimator follows the list:

  • For CPU estimators we parallelise over targets, with O(P d l S) TE calculations per target for the KSG estimator and O(P d l) TE calculations for the Gaussian estimator, which uses analytical surrogates. Each calculation takes O(K N log N) for the KSG estimator (where K is the number of nearest neighbours) and O(N) for the Gaussian estimator.
  • For GPU estimators we parallelise over targets and surrogates. For a single target we have O(P d l) TE calculations including surrogates, assuming that all surrogates for a source fit into the GPU's main memory and can be processed in parallel. Each calculation on the GPU takes O(N²) time. Note that we additionally parallelise over the N points depending on available memory, which leads to faster runtimes in practice.
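Below is a minimal sketch of how the CPU or GPU KSG estimator is selected for a single-target analysis in IDTxl; the data, lag, and surrogate settings are placeholder values, not recommendations:

```python
import numpy as np
from idtxl.data import Data
from idtxl.multivariate_te import MultivariateTE

# Toy data: 5 processes x 1000 samples x 1 replication ('psr' dimension order).
data = Data(np.random.rand(5, 1000, 1), dim_order='psr')

settings = {
    'cmi_estimator': 'JidtKraskovCMI',  # CPU KSG estimator; use 'OpenCLKraskovCMI' for the GPU KSG estimator
    'max_lag_sources': 5,               # temporal search depth l
    'min_lag_sources': 1,
    'n_perm_max_stat': 200,             # number of surrogates S for the maximum statistic
}

# Analysing a single target is the unit of work that IDTxl parallelises over.
results = MultivariateTE().analyse_single_target(settings=settings, data=data, target=0)
```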

For example, for network inference using the CPU and GPU KSG implementations, we obtain the following runtimes for a single target on one computing node (S=200 surrogates, in-degree d=3, coupled logistic map dynamics; Novelli, CNS, 2018):

| P   | N    | Runtime in hours (CPU) | Runtime in hours (GPU) |
|-----|------|------------------------|------------------------|
| 10  | 100  | 0.01                   | 0.005                  |
| 10  | 1000 | 0.25                   | 0.05                   |
| 100 | 100  | 0.15                   | 0.03                   |
| 100 | 1000 | 5.50                   | 0.45                   |

Hardware used: Intel Xeon CPUs with clock speeds of 2.1-2.6 GHz, an NVIDIA V100 SXM2 GPU, and 16 GB RAM (maximum runtimes over 10 repetitions).

The practical runtime for a full network analysis that considers each of the P processes as a target can be parallelised over targets and thus depends on the number of available computing nodes. In the worst case, where only a single node is available, the full runtime is the single-target runtime multiplied by P. In the best case, where the number of computing nodes equals P, the full runtime equals the single-target runtime because all targets can be analysed in parallel.
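One common pattern for this kind of parallelisation is to submit one job per target (e.g. via a job array) and save the per-target results for later combination. The following sketch assumes the target index is passed as a command-line argument and uses placeholder data and settings:

```python
import pickle
import sys

import numpy as np
from idtxl.data import Data
from idtxl.multivariate_te import MultivariateTE

target = int(sys.argv[1])  # e.g. a SLURM/PBS job array index, one job per target

data = Data(np.random.rand(10, 1000, 1), dim_order='psr')  # placeholder data
settings = {'cmi_estimator': 'JidtKraskovCMI',
            'max_lag_sources': 5,
            'min_lag_sources': 1}

results = MultivariateTE().analyse_single_target(settings=settings,
                                                 data=data,
                                                 target=target)

# Save the per-target result; results can be combined once all jobs have finished.
with open(f'results_target_{target}.p', 'wb') as f:
    pickle.dump(results, f)
```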

Practical recommendations on using GPUs versus CPUs

The above-mentioned theoretical scaling of the GPU algorithm with O(N²) versus the O(N log N) scaling on the CPU has implications for real-world datasets. As a rule of thumb, datasets with low N (up to approximately N=10,000 samples) that require a large number of statistical surrogates, e.g. S=1000 (for stricter multiple-comparison correction when the network has many nodes), fare a lot better on GPUs. For datasets with around 100,000 samples and S=200 to S=300 surrogate computations, CPU and GPU performance is typically on par. If you have millions of samples (which is a nice thing to have :-)), the CPU will most likely outperform GPUs at present.
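This rule of thumb could be encoded in a small, hypothetical helper like the one below; the estimator names are IDTxl's KSG estimators, but the thresholds are rough and should be benchmarked on your own hardware:

```python
def suggest_cmi_estimator(n_samples, n_surrogates):
    """Hypothetical helper for picking a KSG estimator; thresholds are rough."""
    if n_samples <= 10_000 and n_surrogates >= 1000:
        return 'OpenCLKraskovCMI'  # GPU KSG estimator
    if n_samples >= 1_000_000:
        return 'JidtKraskovCMI'    # CPU KSG estimator
    return 'either (benchmark both)'

print(suggest_cmi_estimator(5_000, 1_000))   # -> OpenCLKraskovCMI
print(suggest_cmi_estimator(100_000, 200))   # -> either (benchmark both)
```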
