R parallel Benchmarks
Photo: William Warby@Flickr
### Revolution Analytics R
The simplest way to introduce an (automatic) level of parallelism without any code change is to use the MKL libraries (Intel Math Kernel Library) that ship with the freely available Revolution Analytics R. Unfortunately not all computational procedures take advantage of the MKL, but the speedups are nevertheless impressive. It basically shows that not using parallelism is a big waste: one loses a roughly 100-fold speed advantage for specific calculations such as the matrix multiply.
The picture above shows the speedups for RRO Rgui 3.2.2 (64-bit) compared to the normal R 3.2.2 running on a 16-core workstation (2x Xeon 2687W 3.1 GHz). Unfortunately this is a micro-benchmark: it does not allow enough time to fire up threads and finishes too quickly on the R Pro engine. Furthermore the LDA is not really well parallelized and does not use many threads.
# RRO Rgui 3.2.2 64-bit
# 16 MKLthreads (2x Xeon 2687W 3.1 GHz)
Matrix creation: 2.747148 sec.
Matrix multiply: 1.395878 sec.
Cholesky Factorization: 0.3190181 sec.
Singular Value Decomposition: 2.043302 sec.
Principal Components Analysis: 7.013975 sec.
Linear Discriminant Analysis: 18.6924 sec.
Total: 32.21173 sec.
# R 3.2.2 64-bit
# No MKLthreads (2x Xeon 2687W 3.1 GHz)
Matrix creation: 2.636405 sec.
Matrix multiply: 150.7625 sec.
Cholesky Factorization: 18.19884 sec.
Singular Value Decomposition: 44.15803 sec.
Principal Components Analysis: 183.521 sec.
Linear Discriminant Analysis: 134.9886 sec.
Total: 534.2654 sec.
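To see how many MKL threads RRO is actually using, and to change that number, a minimal sketch is shown below. It assumes the RevoUtilsMath helper package (which normally ships with RRO/MRO) is installed; plain R without MKL will simply skip the calls.
# Check/adjust the number of MKL threads (RRO/MRO only, assumes RevoUtilsMath is available)
if (requireNamespace("RevoUtilsMath", quietly = TRUE)) {
  library(RevoUtilsMath)
  getMKLthreads()     # number of MKL threads currently in use
  setMKLthreads(16)   # e.g. one thread per physical core on the 2x 2687W box
} else {
  cat("RevoUtilsMath not found - this R is probably not linked against MKL\n")
}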
For further entertainment please also run the following lines, which create a large matrix and perform a cross-product calculation. If they do not run (out of memory), decrease m and n accordingly. The matrix creation seems highly inefficient, and it depends mostly on raw CPU and memory speed.
# Matrix creation requires > 100 GByte RAM
m <- 100000; n <- 100000
system.time (A <- matrix (runif (m*n),m,n))
# Matrix multiply
system.time (B <- crossprod(A))
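A rough back-of-the-envelope check before attempting the allocation: a matrix of doubles needs 8 bytes per element, so the sketch below (same m and n as above) estimates the memory required and can guide how far to scale the dimensions down on smaller machines.
# Estimate memory before allocating: doubles take 8 bytes per element
m <- 100000; n <- 100000
cat(sprintf("Matrix A alone needs about %.0f GByte RAM\n", m * n * 8 / 1e9))
# crossprod(A) allocates another n x n result of the same size,
# so scale m and n down (e.g. m <- n <- 20000) if memory is tight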
Code:
MKL-benchmark.R - download the code for the above R Pro MKL benchmark
Links:
MKL blog - Microbenchmark of the Pro R engine
MKL bench - Additional MKL Revo R benchmarks
Revo R - Info about RevoR
Pro R - Benefits of Multithreaded Performance with RRO
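If you are unsure which BLAS/LAPACK your R build is actually linked against, newer R versions (newer than the 3.2.2 used above) can report it directly from the session; a minimal check could look like this:
# Report the linked BLAS/LAPACK libraries (MKL shows up here on RRO/MRO builds)
sessionInfo()   # on R >= 3.4 this lists the BLAS/LAPACK libraries in use
La_version()    # LAPACK version string of the running session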
### The ATT R benchmark revisited
The original R 2.5 benchmark has been extended and modified over a number of years. It can be invoked directly in R via copy/paste of the following single line:
# Run the old R Benchmark v2.5 (not recommended)
source(url("http://r.research.att.com/benchmarks/R-benchmark-25.R"))
# Total time: 5.5 seconds on (2x Xeon E5-2687W 3.1 GHz)
Unfortunately it is a micro-benchmark and each test runs for only milliseconds on newer CPUs or systems. That means the times are variable, because the threads are not allowed to fire up completely and many CPUs have power saving enabled, so these short runs have a high error rate. The benchmark is also used by Microsoft to showcase the Revolution R (Pro) engine, so I extended the current ATT benchmark to version 3.0 to match a modern 16-core workstation (2x Xeon E5-2687W 3.1 GHz). It is not the latest CPU from Intel, and also not the biggest or oldest (see Dual CPU setups), just a two-year-old high-end dual-CPU workstation. Interestingly, even the latest desktop Core i7 CPUs such as the Intel Core i7-5960X are now faster than that system, at least in single-CPU mode.
R Benchmark 3.0
===============
Number of times each test is run______________________________: 3
I. Matrix calculation
---------------------
Creation, transp., deformation of a 2600 x 2600 matrix (sec): 1.07599999999993
4000 x 4000 normal distributed random matrix ^1000_____ (sec): 0.96200000000008
Sorting of 1.1e+07 random values______________________ (sec): 1.06800000000012
5500 x 5500 cross-product matrix (b = a' * a)__________ (sec): 1.02499999999991
Linear regr. over a 4800 x 4800 matrix (c = a \ b')___ (sec): 0.965999999999985
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.01880425043882
II. Matrix functions
--------------------
FFT over 5500000 random values________________________ (sec): 0.994000000000142
Eigenvalues of a 1200 x 1200 random matrix____________ (sec): 1.17099999999991
Determinant of a 6500 x 6500 random matrix____________ (sec): 1.05500000000011
Cholesky decomposition of a 8000 x 8000 matrix________ (sec): 1.00400000000027
Inverse of a 3600 x 3600 random matrix________________ (sec): 0.971999999999753
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.01731985091094
III. Programmation
------------------
5500000 Fibonacci numbers calculation (vector calc)____ (sec): 0.989999999999964
Creation of a 6200 x 6200 Hilbert matrix (matrix calc) (sec): 1.07100000000028
Grand common divisors of 950000 pairs (recursion)_____ (sec): 1.08899999999976
Creation of a 800 x 800 Toeplitz matrix (loops)_______ (sec): 0.996999999999753
Escoufier's method on a 45x45 matrix (mixed)____________ (sec): 0.860000000000582
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.01868507029692
Total time for all 15 tests_____________________________ (sec): 15.3000000000005
Overall mean (sum of I, II and III trimmed means/3)_____ (sec): 1.01826950113436
--- End of test ---
Now this can be used, as already showcased multiple times on the web, to compare against the native R version. Running Revolution R vs. the original R on the same system (2x Xeon E5-2687W 3.1 GHz) gives the following results:
The new ATT benchmark v3.0 takes about one second per test on this system and runs for a total of 15 seconds (still a micro-benchmark). It can be invoked via copy/paste directly into the Pro engine or plain R:
# Run the new R Benchmark v3.0 (recommended)
source(url("https://raw.githubusercontent.com/tobigithub/R-parallel/gh-pages/R/R-benchmark-30.R"))
# Total time: 15.3 seconds on (2x Xeon E5-2687W 3.1 GHz)
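As noted above, millisecond- to second-scale timings are noisy because threads and CPU clocks never reach a steady state. A quick way to see this on any machine uses only base R; the 1000 x 1000 matrix size below is an arbitrary choice for illustration.
# Run-to-run variability of a very short matrix multiply
A <- matrix(rnorm(1000 * 1000), 1000, 1000)
times <- replicate(10, system.time(A %*% A)["elapsed"])
round(times, 3)                    # individual timings scatter noticeably
round(sd(times) / mean(times), 2)  # relative spread (coefficient of variation)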
One could argue that the benchmark is not balanced and favors matrix operations, which it does; but then again, not parallelizing and not using processor features that have been available for ten years is the actual crime, at least once people start larger and longer calculations. Another question: why is the benchmark 24-fold faster and not 16-fold? Because the actual thread count is 32, and hyper-threading can have a positive effect on some calculations; the 24-fold speedup is simply a consequence of that.
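The 16-core/32-thread distinction is easy to verify from within R itself. The following sketch uses the parallel package (part of base R); the counts in the comments assume the dual E5-2687W workstation described above.
# Physical cores vs. logical (hyper-threaded) cores
library(parallel)
detectCores(logical = FALSE)   # 16 physical cores on the 2x E5-2687W system
detectCores(logical = TRUE)    # 32 logical threads with hyper-threading enabled
# (on some platforms logical = FALSE is not supported and returns NA)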
The 10-second benchmark (each test tuned to run roughly 10 seconds) speaks volumes: about 150 seconds, or 2-3 minutes, for the parallel version versus over one hour (69 minutes) for the normal R version. Who wants to wait one hour? Nobody. Again, the benchmarks depend heavily on raw CPU speed, so a CPU running at 4-5 GHz (Intel -X or -K series) will meet or outperform many of these numbers even with a lower thread count. Run the 10s R Benchmark example on your CPU to see what happens.
# Run the new R Benchmark v3.0 (10s) (10 seconds each)
source(url("https://raw.githubusercontent.com/tobigithub/R-parallel/gh-pages/R/R-benchmark-30-10s.R"))
# Total time: 150 seconds on (2x Xeon E5-2687W 3.1 GHz)
Code:
Links:
Benefits of Multithreaded Performance with RRO
### Firing all empty threads benchmark
Firing up threads can take a substantial amount of time, the so-called overhead, especially when the actual runtime is very short in relation. Seconds are needed to invoke the parallel machinery, so computational code that only requires microseconds is not worth parallelizing; in such a case running the code sequentially may be preferred. The following benchmark measures the overhead needed to start and stop such a cluster. In the Windows Task Manager (Ctrl-Alt-Del) observe the multiple conhost.exe and Rscript.exe child processes as they are built up and then destroyed at the end of the (empty) calculation.
# How long does it take to fire empty threads
# Installation of the required library with all dependencies
doInstall <- TRUE # Change to FALSE if you don't want packages installed.
toInstall <- c("doParallel")
if (doInstall && !all(toInstall %in% installed.packages()[, 1]))
{
cat("Please install required package. Select server:"); chooseCRANmirror();
install.packages(toInstall, dependencies = c("Depends", "Imports"))
}
library(doParallel);
# how often to repeat the test
n = 3
Sys.time()->start;
for (i in 1:n)
{
cl <- makeCluster(detectCores());
registerDoParallel(cl);
#getDoParWorkers();
stopCluster(cl);
}
cl;
t=(Sys.time()-start);
cat(round(as.numeric(t/n),2)," sec total\n")
cat(round(as.numeric(t/n/detectCores()),2),"sec. per thread\n")
# clean up memory
invisible(gc())
The result for a 32-thread machine is shown below and represents a substantial amount of time; on higher-clocked systems (> 4 GHz) the time can drop to around 0.04 sec per thread. Such overhead is acceptable on a local machine as long as the real code runs long enough.
socket cluster with 32 nodes on host ‘localhost’
14.33419 sec total
0.4479435 sec. per thread
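The practical consequence is to pay this startup cost only once: create the cluster, reuse it for many tasks, and stop it at the very end. A minimal sketch with doParallel/foreach follows; the toy loop is just an illustration, not part of the benchmark above.
# Amortize the startup overhead: create the cluster once and reuse it
library(doParallel)              # also attaches foreach and parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
# many small tasks now share the same workers instead of re-spawning them
res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)
print(res)
stopCluster(cl)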
Code:
More to follow...