R parallel Benchmarks - tobigithub/R-parallel GitHub Wiki


Photo: William Warby@Flickr


### Revolution Analytics R

The simplest way to introduce an (automatic) level of parallelism without any code change is to use the MKL libraries (Intel Math Kernel Library) bundled with the freely available Revolution Analytics R. Unfortunately not all computational routines take advantage of MKL; nevertheless the speedups are simply impressive. They basically show that not using parallelism is a big waste: one loses a roughly 100-fold speed advantage for specific operations such as matrix multiplication.
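If RRO with MKL is installed, you can query and control the number of MKL threads; a minimal sketch, assuming the RevoUtilsMath package that ships with RRO:

# Sketch, assuming the RevoUtilsMath package bundled with RRO
library(RevoUtilsMath)
getMKLthreads()    # number of threads MKL currently uses
setMKLthreads(16)  # e.g. pin MKL to one thread per physical core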

[Figure: benchmark speedups, RRO Rgui 3.2.2 vs. R 3.2.2]

The figure above shows the speedups for RRO Rgui 3.2.2 (64-bit) compared to the normal R 3.2.2, running on a 16-core workstation (2x Xeon E5-2687W 3.1 GHz). Unfortunately this is a micro-benchmark: it does not allow enough time to fire up threads and finishes too quickly on the RRO engine. Furthermore, the LDA step is not well parallelized and does not use many threads.

# RRO Rgui 3.2.2 64-bit
# 16 MKLthreads (2x Xeon 2687W 3.1 GHz)
Matrix creation: 2.747148  sec. 
Matrix multiply: 1.395878  sec. 
Cholesky Factorization: 0.3190181  sec. 
Singular Value Decomposition: 2.043302  sec. 
Principal Components Analysis: 7.013975  sec. 
Linear Discriminant Analysis:    18.6924  sec. 
Total:  32.21173  sec. 

# R 3.2.2 64-bit
# No MKLthreads (2x Xeon 2687W 3.1 GHz)
Matrix creation: 2.636405  sec. 
Matrix multiply: 150.7625  sec. 
Cholesky Factorization: 18.19884  sec. 
Singular Value Decomposition: 44.15803  sec. 
Principal Components Analysis: 183.521  sec. 
Linear Discriminant Analysis:    134.9886  sec. 
Total:  534.2654  sec. 
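To get a feel for these numbers on your own machine, here is a minimal sketch that times a few of the same MKL-accelerated operations (the matrix size is an arbitrary choice, not the one used by the benchmark script):

# Time a few of the MKL-accelerated operations (size chosen arbitrarily)
n <- 4000
A <- matrix(rnorm(n*n), n, n)
system.time(B <- crossprod(A))            # matrix multiply (B = A' * A)
system.time(C <- chol(B))                 # Cholesky factorization
system.time(S <- svd(A, nu = 0, nv = 0))  # singular values only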

For further entertainment, please also run the two lines below, which create a matrix and perform a cross-product calculation. If they do not run (not enough RAM), decrease m and n accordingly. Matrix creation itself seems highly inefficient and depends strongly on raw CPU and memory speed.

# Matrix creation requires > 100 GByte RAM
m <- 100000; n <- 100000
system.time(A <- matrix(runif(m*n), m, n))

# Matrix multiply
system.time(B <- crossprod(A))
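A quick sanity check on the memory requirement: R stores the matrix as doubles, 8 bytes per entry, so A alone needs about 75 GiB and crossprod(A) allocates a second matrix of the same size:

# Memory estimate: m*n doubles at 8 bytes each
m <- 100000; n <- 100000
m * n * 8 / 2^30   # ~74.5 GiB for A alone; crossprod(A) needs as much again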

Code:

MKL-benchmark.R - download the code for the RRO MKL benchmark above

Links:

MKL blog - Microbenchmark of the Pro R engine

MKL bench - Additional MKL Revo R benchmarks

Revo R - Info about RevoR

Pro R - Benefits of Multithreaded Performance with RRO


### The ATT R benchmark revisited

The original R Benchmark 2.5 has been extended and modified over a number of years. It can be invoked directly in R via copy/paste of the following single line:

# Run the old R Benchmark v2.5 (not recommended)
source(url("http://r.research.att.com/benchmarks/R-benchmark-25.R"))
# Total time: 5.5 seconds on (2x Xeon E5-2687W 3.1 GHz)

Unfortunately it is a micro-benchmark and completes within milliseconds on newer CPUs and systems. That means times are highly variable: threads are not given enough time to fire up completely, and many CPUs have power-saving features enabled, so these short runs carry a large error. The benchmark is also used by Microsoft to showcase the Revolution R (RRO) engine, so I extended the current ATT Benchmark to version 3.0 to match a modern 16-core workstation (2x Xeon E5-2687W 3.1 GHz). It is not the latest CPU from Intel, and neither the biggest nor the oldest (see Dual CPU setups); just a two-year-old high-end dual-CPU workstation. Interestingly, even the latest desktop Core i7 CPUs such as the Intel Core i7-5960X are now faster than that system, at least in single-CPU mode.
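Because such short runs are noisy, repeating each timing and taking the median gives more stable numbers; a minimal sketch of that idea:

# Repeat a short timing several times and report the median to reduce noise
timings <- replicate(5, system.time(crossprod(matrix(runif(1e6), 1000)))["elapsed"])
median(timings)

On the dual-Xeon workstation the extended benchmark produces the following output: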

   R Benchmark 3.0
   ===============
Number of times each test is run______________________________:  3

   I. Matrix calculation
   ---------------------
Creation, transp., deformation of a  2600 x 2600  matrix (sec):  1.07599999999993 
4000 x 4000  normal distributed random matrix ^1000_____ (sec):  0.96200000000008 
Sorting of  1.1e+07  random values______________________ (sec):  1.06800000000012 
5500 x 5500  cross-product matrix (b = a' * a)__________ (sec):  1.02499999999991 
Linear regr. over a  4800 x 4800  matrix (c = a \ b')___ (sec):  0.965999999999985 
                      --------------------------------------------
                    Trimmed geom. mean (2 extremes eliminated):  1.01880425043882 

   II. Matrix functions
   --------------------
FFT over  5500000  random values________________________ (sec):  0.994000000000142 
Eigenvalues of a  1200 x 1200  random matrix____________ (sec):  1.17099999999991 
Determinant of a  6500 x 6500  random matrix____________ (sec):  1.05500000000011 
Cholesky decomposition of a  8000 x 8000  matrix________ (sec):  1.00400000000027 
Inverse of a  3600 x 3600  random matrix________________ (sec):  0.971999999999753 
                      --------------------------------------------
                    Trimmed geom. mean (2 extremes eliminated):  1.01731985091094 

   III. Programmation
   ------------------
5500000  Fibonacci numbers calculation (vector calc)____ (sec):  0.989999999999964 
Creation of a  6200 x 6200  Hilbert matrix (matrix calc) (sec):  1.07100000000028 
Grand common divisors of  950000  pairs (recursion)_____ (sec):  1.08899999999976 
Creation of a  800 x 800  Toeplitz matrix (loops)_______ (sec):  0.996999999999753 
Escoufier's method on a 45x45 matrix (mixed)____________ (sec):  0.860000000000582 
                      --------------------------------------------
                    Trimmed geom. mean (2 extremes eliminated):  1.01868507029692 


Total time for all 15 tests_____________________________ (sec):  15.3000000000005 
Overall mean (sum of I, II and III trimmed means/3)_____ (sec):  1.01826950113436 
                      --- End of test ---

This can now be used, as already showcased multiple times on the web, to compare against the native R version. On the same system (2x Xeon E5-2687W 3.1 GHz), running Revolution R vs. original R gives the following results:

[Figure: R Benchmark 3.0 results, Revolution R vs. original R]

The new ATT benchmark v3.0 is tuned so that each test takes roughly one second on this system, for a total runtime of about 15 seconds (still a micro-benchmark). It can be invoked via copy/paste directly into RRO or R:

# Run the new R Benchmark v3.0 (recommended)
source(url("https://raw.githubusercontent.com/tobigithub/R-parallel/gh-pages/R/R-benchmark-30.R"))
# Total time: 15.3 seconds on (2x Xeon E5-2687W 3.1 GHz)

One could argue that the benchmark is not balanced and favors matrix operations, which it does; but then again, not parallelizing and not using processor features that have been available for ten years is the actual crime, at least once people start running larger and longer calculations. Another question: why is the benchmark 24-fold faster and not 16-fold? Because the actual thread count is 32, and hyper-threading can have a positive effect on some calculations; the 24-fold speedup is simply a result of that.
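You can check the physical core and logical thread counts on your own machine with the parallel package:

# Physical cores vs. logical (hyper-threaded) threads
library(parallel)
detectCores(logical = FALSE)  # 16 physical cores here (may be NA on some platforms)
detectCores(logical = TRUE)   # 32 logical threads with hyper-threading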

[Figure: R Benchmark 3.0 (10s) results, Revolution R vs. original R]

The 10-second benchmark (each test tuned to run roughly 10 seconds) speaks volumes: about 150 seconds, or 2-3 minutes, for the parallel version, versus over one hour (69 minutes) for the normal R version. Who wants to wait an hour? Nobody. Again, the benchmarks depend strongly on raw CPU speed, so a 4-5 GHz CPU (Intel -X or -K series) can actually meet or outperform many of these numbers even with a lower thread count. Run the 10s R Benchmark example on your CPU to see what happens.

# Run the new R Benchmark v3.0 (10s) (10 seconds each)
source(url("https://raw.githubusercontent.com/tobigithub/R-parallel/gh-pages/R/R-benchmark-30-10s.R"))
# Total time: 150 seconds on (2x Xeon E5-2687W 3.1 GHz)

Code:

R-benchmark-30.R

R-benchmark-30-10s.R

Links:

Benefits of Multithreaded Performance with RRO


### Firing up empty threads benchmark

Firing up threads can take a substantial amount of time, so-called overhead, especially when the actual runtime is short in relation. If seconds are needed to invoke the parallel machinery, it is not useful to parallelize computational code that only requires microseconds; in such cases running the code sequentially may be preferable. The following benchmark measures the overhead needed to start and stop such computations. In the Windows Task Manager (Ctrl-Alt-Del), observe the multiple conhost.exe and rscript.exe child processes as they are built up and then destroyed at the end of the (empty) calculation.

# How long does it take to fire up empty threads?
# Install the required library with all dependencies
doInstall <- TRUE  # Change to FALSE if you don't want packages installed.
toInstall <- c("doParallel")
if (doInstall && !is.element(toInstall, installed.packages()[, 1])) {
  cat("Please install required package. Select server:")
  chooseCRANmirror()
  install.packages(toInstall, dependencies = c("Depends", "Imports"))
}

library(doParallel)

# how often to repeat the test
n <- 3
start <- Sys.time()
for (i in 1:n) {
  # create, register and immediately destroy one worker per core
  cl <- makeCluster(detectCores())
  registerDoParallel(cl)
  # getDoParWorkers()
  stopCluster(cl)
}
cl
# measure in explicit seconds so the units never silently switch to minutes
t <- as.numeric(difftime(Sys.time(), start, units = "secs"))
cat(round(t / n, 2), " sec total\n")
cat(round(t / n / detectCores(), 2), "sec. per thread\n")

# clean up memory
invisible(gc())

The result for a 32-thread machine is shown below, and it is a substantial amount of time. On higher-clocked systems (> 4 GHz) the times can be as low as 0.04 sec per thread. Such overhead on a local machine is acceptable as long as the actual code runs long enough.

socket cluster with 32 nodes on host ‘localhost’
14.33419  sec total
0.4479435 sec. per thread
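The practical consequence: create the cluster once, reuse it for all parallel work, and stop it only at the end, so the startup cost is paid a single time. A minimal sketch with doParallel and foreach:

# Amortize the startup overhead: one cluster, many parallel calls
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
res1 <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)
res2 <- foreach(i = 1:8, .combine = c) %dopar% i^2
stopCluster(cl)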

Code:

Fire empty doSNOW threads

Fire empty doParallel threads


More to follow...