Runtimes and Benchmarking

Theoretical and practical runtimes

The number of TE calculations scales with O(P² d l S), where P is the number of processes, d the average in-degree that is eventually inferred, l the maximum temporal search depth per process, and S the number of surrogate calculations, assuming that d is independent of the network size P. Note that in the worst case of a fully connected network, d = P, which leads to cubic runtimes in P.
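As a quick sanity check, this scaling can be turned into a back-of-the-envelope estimate of the number of TE calculations; the parameter values in the following sketch are illustrative assumptions only:

```python
# Rough estimate of the number of TE calculations for a full network analysis,
# following the O(P^2 d l S) scaling above. All parameter values are illustrative.
P = 100   # number of processes (network nodes)
d = 3     # average inferred in-degree
l = 5     # maximum temporal search depth per process
S = 200   # number of surrogate calculations

n_te_calculations = P ** 2 * d * l * S
print(f"approx. number of TE calculations: {n_te_calculations:.2e}")  # ~3.0e+07
```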

IDTxl's network inference algorithm uses parallelisation to speed up the analysis (N denotes the number of time series samples); a usage sketch for selecting the CPU or GPU estimator follows the list:

  • For CPU estimators we parallelise over targets, with O(P d l S) TE calculations per target for the KSG estimator and O(P d l) TE calculations for the Gaussian estimator, which uses analytical surrogates. Each calculation takes O(K N log N) for the KSG estimator (where K is the number of nearest neighbours) and O(N) for the Gaussian estimator.
  • For GPU estimators we parallelise over targets and surrogates. For a single target we have O(P d l) TE calculations including surrogates, assuming that all surrogates for a source fit into the GPU's main memory and can be processed in parallel. Each calculation on the GPU takes O(N²) time. Note that we additionally parallelise over the N points depending on available memory, which leads to faster runtimes in practice.
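Below is a minimal sketch of how the CPU or GPU KSG estimator is selected for a single-target analysis in IDTxl; the data, lag, and surrogate settings are placeholder values, not recommendations:

```python
import numpy as np
from idtxl.data import Data
from idtxl.multivariate_te import MultivariateTE

# Toy data: 5 processes x 1000 samples x 1 replication ('psr' dimension order).
data = Data(np.random.rand(5, 1000, 1), dim_order='psr')

settings = {
    'cmi_estimator': 'JidtKraskovCMI',  # CPU KSG estimator; use 'OpenCLKraskovCMI' for the GPU KSG estimator
    'max_lag_sources': 5,               # temporal search depth l
    'min_lag_sources': 1,
    'n_perm_max_stat': 200,             # number of surrogates S for the maximum statistic
}

# Analysing a single target is the unit of work that IDTxl parallelises over.
results = MultivariateTE().analyse_single_target(settings=settings, data=data, target=0)
```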

For example, for network inference using the CPU and GPU KSG implementations, we obtain the following runtimes for a single target on one computing node (S=200 surrogates, in-degree d=3, coupled logistic map dynamics; Novelli, CNS, 2018):

| P   | N    | Runtime in hours (CPU) | Runtime in hours (GPU) |
|-----|------|------------------------|------------------------|
| 10  | 100  | 0.01                   | 0.005                  |
| 10  | 1000 | 0.25                   | 0.05                   |
| 100 | 100  | 0.15                   | 0.03                   |
| 100 | 1000 | 5.50                   | 0.45                   |

Hardware used: Intel Xeon CPUs with clock speeds of 2.1-2.6 GHz, an NVIDIA V100 SXM2 GPU, and 16 GB RAM (maximum runtimes over 10 repetitions).

The practical runtime for a full network analysis that considers each of the P processes as a target can be parallelised over targets and thus depends on the number of available computing nodes. In the worst case, where only a single node is available, the full runtime is the single-target runtime multiplied by P. In the best case, where the number of computing nodes equals P, the full runtime equals the single-target runtime because all targets can be analysed in parallel.
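One common pattern for this kind of parallelisation is to submit one job per target (e.g. via a job array) and save the per-target results for later combination. The following sketch assumes the target index is passed as a command-line argument and uses placeholder data and settings:

```python
import pickle
import sys

import numpy as np
from idtxl.data import Data
from idtxl.multivariate_te import MultivariateTE

target = int(sys.argv[1])  # e.g. a SLURM/PBS job array index, one job per target

data = Data(np.random.rand(10, 1000, 1), dim_order='psr')  # placeholder data
settings = {'cmi_estimator': 'JidtKraskovCMI',
            'max_lag_sources': 5,
            'min_lag_sources': 1}

results = MultivariateTE().analyse_single_target(settings=settings,
                                                 data=data,
                                                 target=target)

# Save the per-target result; results can be combined once all jobs have finished.
with open(f'results_target_{target}.p', 'wb') as f:
    pickle.dump(results, f)
```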

Practical recommendations on using GPUs versus CPUs

The above-mentioned theoretical scaling of the GPU algorithm with O(N²) versus the O(N log N) scaling on the CPU has implications for real-world datasets. As a rule of thumb, datasets with low N (up to approximately N=10,000 samples) that require a large number of statistical surrogates, e.g. S=1000 (for stricter multiple-comparison correction when the network has many nodes), fare a lot better on GPUs. For datasets with around 100,000 samples and S=200 to S=300 surrogate computations, CPU and GPU performance is typically on par. If you have millions of samples (which is a nice thing to have :-)), the CPU will most likely outperform GPUs at present.
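This rule of thumb could be encoded in a small, hypothetical helper like the one below; the estimator names are IDTxl's KSG estimators, but the thresholds are rough and should be benchmarked on your own hardware:

```python
def suggest_cmi_estimator(n_samples, n_surrogates):
    """Hypothetical helper for picking a KSG estimator; thresholds are rough."""
    if n_samples <= 10_000 and n_surrogates >= 1000:
        return 'OpenCLKraskovCMI'  # GPU KSG estimator
    if n_samples >= 1_000_000:
        return 'JidtKraskovCMI'    # CPU KSG estimator
    return 'either (benchmark both)'

print(suggest_cmi_estimator(5_000, 1_000))   # -> OpenCLKraskovCMI
print(suggest_cmi_estimator(100_000, 200))   # -> either (benchmark both)
```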
