
XCMS with 24,000 samples on 32 threads

Tobias Kind (2015)

The xcms package is used in the life sciences for the alignment of LC-MS and GC-MS datasets. XCMS uses the snow library under Windows and Rmpi under Linux. The peak finding algorithm distributes its tasks to all available processors and then collects the data for the final alignment. For this example the faahKO samples were simply replicated (copied), so that more than 24,000 files with a total size of 85 GByte were obtained (12,000 each in the WT and KO groups).
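A minimal sketch of how such a replicated test set can be built; the target directory name and the replicate count of 2,000 (2,000 × 12 files = 24,000) are illustrative assumptions, not the exact ones used here:

    ### replicate the 12 faahKO CDF files into a large test set (sketch)
    library(faahKO)
    cdfs <- dir(system.file("cdf", package = "faahKO"),
                recursive = TRUE, full.names = TRUE)
    dir.create("bigset", showWarnings = FALSE)
    for (i in seq_len(2000)) {   # 2000 x 12 files = 24,000 copies
        file.copy(cdfs, file.path("bigset",
                  sprintf("rep%04d_%s", i, basename(cdfs))))
    }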

    ### find peaks in NetCDF files, use the number of slaves to distribute the work
    library(xcms)                      # provides xcmsSet()
    myCPUs <- 16                       # number of parallel workers, adjust to your machine
    xset <- xcmsSet(nSlaves = myCPUs)  # processes all MS files found in the working directory

Subsequent xcms processing steps such as retention time grouping are not fully parallelized. The memory requirement is roughly 120 MByte per thread, so for 32 threads around 4 GByte, which is very convenient for small data files; larger files will require more.
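The quoted memory budget is simple arithmetic and can be checked directly (the 120 MByte figure is the empirical per-thread value from above):

    ### rough RAM budget for n parallel workers (sketch)
    threads <- 32
    mb_per_thread <- 120
    threads * mb_per_thread / 1024   # ~3.75 GByte, i.e. around 4 GByte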

The 24,000 samples were processed in roughly 8.7 hours, hence the processing speed is, surprisingly, only around 1.3 seconds per sample. Looking into the sections that are actually parallelized will show where the overhead comes from. Note also that these LC-MS data files contain only a subset of scans and are quite small.

        user     system    elapsed [s]
    25672.34    2135.20    31232.23

For an additional benchmark the somewhat more recent mtbls2 package can be used. These are 20-minute LC-MS runs obtained on a QTOF instrument. Using 400 replicated samples, the memory requirement grows to 8 GByte for 32 threads and the processing time increases slightly to 2.5 seconds per sample. Many more settings are possible, such as deisotoping and other deconvolution methods, but the power of parallelism is shown quite beautifully. The IPO package can be used to optimize XCMS parameters, and MetaboNexus can be utilized for preconfigured XCMS settings.
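A minimal sketch of such an IPO optimization, following the 2015-era IPO API; the `files` vector of representative raw runs and the 30 ppm tolerance are assumptions:

    ### optimize centWave peak picking parameters with IPO (sketch)
    library(IPO)
    params <- getDefaultXcmsSetStartingParams("centWave")   # default search grid
    params$ppm <- 30                              # QTOF-like tolerance (assumption)
    result <- optimizeXcmsSet(files = files, params = params, nSlaves = 4)
    result$best_settings$parameters               # optimized peak picking settings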

The final xcms processing time for the 24,000 samples, an elapsed time of 31232.23 seconds, equals 8 hours 40 minutes 32.23 seconds. There was no file system bottleneck: all 32 threads were processing at 100%, and the disk system was an SSD RAID with 4 SSDs. The final result file could not be loaded into Excel or OpenOffice, as both have column limits of at most 16,384; Dell Statistica, however, was able to open the file myAlign.tsv (a 375 × 24,879 matrix) with a small file size of 144 MByte.
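Such a wide matrix can also be inspected directly in R instead of a spreadsheet (a sketch, assuming the tab-separated report file from above):

    ### load the wide alignment report in R (sketch)
    aligned <- read.delim("myAlign.tsv", check.names = FALSE)
    dim(aligned)   # expect 375 rows x 24,879 columns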

The memory requirement shown in the CPU screenshot below is somewhat misleading: only 4 GByte RAM were used by XCMS and its parallel threads, which are in fact spawned Rscript.exe processes (and their Windows conhost.exe helpers).

Interestingly, when calculating the performance or efficiency of the peak finding algorithm (xcmsSet) alone, we can conclude that the speedup when using xcmsSet(nSlaves = nCPU) with the snow library is roughly nCPU/2; hence with 16 CPUs a maximum 8-fold speedup can be observed, and with 8 CPUs a maximum 4-fold speedup can be obtained.

The above picture was obtained using only 200 samples and benchmarking only the peak finding section.

    ### finds peaks in NetCDF, use number of slaves to distribute
    xset <- xcmsSet(nSlaves = myCPUs)

Using a dual-socket Intel Xeon E5-2687W (3.1 GHz) workstation (16 cores, 32 threads), we can see that the maximum performance is obtained with 16 threads, although not the theoretical 16-fold speedup, and that over-saturating the CPUs with more than 16 threads actually decreases performance.

| Threads | time elapsed [s] | speedup | sec/sample | samples/sec | Comment |
|--------:|-----------------:|--------:|-----------:|------------:|---------|
| 1   | 351.47 | 1.00 | 1.76 | 0.57 | 200 samples |
| 2   | 188.39 | 1.87 | 0.94 | 1.06 | |
| 4   | 102.88 | 3.42 | 0.51 | 1.94 | |
| 8   | 60.45  | 5.81 | 0.30 | 3.31 | |
| 16  | 48.95  | 7.18 | 0.24 | 4.09 | nCPUs |
| 24  | 51.58  | 6.81 | 0.26 | 3.88 | |
| 32  | 53.88  | 6.52 | 0.27 | 3.71 | nthreads |
| 64  | 73.10  | 4.81 | 0.37 | 2.74 | |
| 128 | NA     | NA   | NA   | NA   | error |

Table generated with MarkDownTableGenerator.
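The speedup column follows directly from the elapsed times, speedup = t(1 thread) / t(N threads); a small sketch reproduces it and compares against the nCPU/2 rule of thumb:

    ### reproduce the speedup column from the elapsed times (sketch)
    threads <- c(1, 2, 4, 8, 16, 24, 32, 64)
    elapsed <- c(351.47, 188.39, 102.88, 60.45, 48.95, 51.58, 53.88, 73.10)
    data.frame(threads,
               speedup = round(elapsed[1] / elapsed, 2),
               rule_of_thumb = threads / 2)   # speedup ~ nCPU/2, valid up to 16 threads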

To come to a conclusion: the obtained average time of 1.3 seconds per sample for the overall xcms run is far away from the 0.24 seconds per sample that could theoretically be obtained here from the parallelized section alone. Basically, without the parallel overhead and the additional retention time grouping calculations of XCMS, the run of 24,000 samples could be finished in 1.6 hours. Around 60% of the total time during the processing of the 24,000 samples was used for other tasks and computational overhead. Here techniques such as line-by-line time and memory profiling with library(lineprof) could help to further improve the parallel performance of any R package.
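A minimal profiling sketch with lineprof; the script name is a placeholder, and the package is installed from GitHub rather than CRAN:

    ### line-by-line time and memory profiling (sketch)
    # devtools::install_github("hadley/lineprof")
    library(lineprof)
    prof <- lineprof(source("xcms-workflow.R"))   # placeholder script name
    shine(prof)                                   # interactive per-line time/memory view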

Is there any advantage in using a fast disk system with xcms? A small subset of 200 samples running the whole XCMS-TK-demo-snow-cluster.R xcms script with 16 CPUs revealed that there is almost no difference: disk RAID (149 s), SSD RAID (143 s) and RAM disk (144 s) performed essentially the same. This is despite the fact that the RAM disk is almost 1000-fold faster than a simple disk system in 4k sequential speed. Hence xcms is mostly compute-bound and has very little disk overhead; in addition, with enough RAM the OS disk cache can sufficiently buffer all data input/output.

The table below shows the specific run times for each xcms command (16 threads used). Only the xcmsSet() peak finding section is parallelized, nothing else. It becomes clear that when running more than 5,000 samples the peak finding algorithm contributes only around 20% of the total time, while the reporting function grows almost exponentially. Of course the peak finding can have additional deconvolution parameters that would change the whole statistics, but a more detailed time profiling, down to the sub-routine level, could reveal many more sections that could potentially be parallelized.

All times in seconds unless noted otherwise; the column headers give the number of samples.

| Num | xcms task | info | 100 | 200 | 400 | 800 | 1600 | 3200 | 6400 | 12800 |
|----:|-----------|------|----:|----:|----:|----:|-----:|-----:|-----:|------:|
| 1 | xcmsSet()   | find peaks     | 47.46 | 49.34 | 84.43 | 160.63 | 306.27 | 611.19 | 1296.72 | 2357.73 |
| 2 | group()     | group peaks    | 8.65 | 8.50 | 11.52 | 17.22 | 27.66 | 51.94 | 79.48 | 187.08 |
| 3 | retcor()    | RT deviations  | 1.69 | 1.85 | 3.95 | 11.15 | 33.84 | 110.10 | 355.32 | 1377.49 |
| 4 | group()     | group samples  | 1.11 | 1.07 | 1.25 | 1.41 | 2.05 | 3.10 | 4.24 | 9.21 |
| 5 | fillPeaks() | peak groups    | 32.12 | 35.27 | 60.65 | 139.69 | 265.45 | 498.92 | 997.20 | 1978.40 |
| 6 | reporttab() | write report   | 29.22 | 33.75 | 60.05 | 145.76 | 296.69 | 717.02 | 2504.96 | 6750.80 |
| 7 | - | Total time [s]   | 121.02 | 130.39 | 222.51 | 476.61 | 932.59 | 1993.23 | 5239.44 | 12662.70 |
| 8 | - | Total time [min] | 2.02 | 2.17 | 3.71 | 7.94 | 15.54 | 33.22 | 87.32 | 211.05 |
| 9 | - | Total time [h]   | 0.0 | 0.0 | 0.1 | 0.1 | 0.3 | 0.6 | 1.46 | 3.52 |
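A sketch of how such per-step timings can be collected; it follows the standard faahKO diffreport() pattern, and the exact arguments of the script used here may differ:

    ### time each xcms step separately (sketch)
    library(xcms)
    t <- list()
    t$findpeaks <- system.time(xset   <- xcmsSet(nSlaves = 16))["elapsed"]
    t$group1    <- system.time(xsetG  <- group(xset))["elapsed"]
    t$retcor    <- system.time(xsetR  <- retcor(xsetG))["elapsed"]
    t$group2    <- system.time(xsetG2 <- group(xsetR))["elapsed"]
    t$fill      <- system.time(xsetF  <- fillPeaks(xsetG2))["elapsed"]
    t$report    <- system.time(reporttab <- diffreport(xsetF, "WT", "KO",
                                                       "myAlign"))["elapsed"]
    unlist(t)   # per-step elapsed times in seconds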

A final investigation of different peak finding options reveals a slight correlation between the time spent and the number of peaks found (R^2 = 0.6). MetaboNexus, IPO, MetaXCMS and MS-DIAL seem to be good starting points for further investigation. One would need to determine whether these are real "peaks" or just all m/z-RT pairs listed; hence all subsequent xcms steps are not considered here. The table below used 32 samples of the faahKO dataset and 32 samples of the mtbls2 dataset with 16 CPUs in parallel. Be aware that due to the copy procedure different files can be copied, so peak numbers can differ slightly. The related R scripts can be found at the bottom.

| # | XCMS peak finding method | faahKO time [s] | faahKO peaks found | mtbls2 time [s] | mtbls2 peaks found |
|--:|--------------------------|----------------:|-------------------:|----------------:|-------------------:|
| 1 | xcmsFindPeaksTime original | 16.94 | 12,892 | 40.51 | 36,754 |
| 2 | HP_QTOF 30 ppm | 46.26 | 57,998 | 521.41 | 145,028 |
| 3 | HP_QTOF_high 15 ppm | 46.77 | 57,998 | 373.72 | 100,890 |
| 4 | HP_Orbi 2.5 ppm | 22.89 | 20,562 | 28.79 | 4,454 |
| 5 | UP_QTOF 30 ppm | 36.76 | 75,276 | 143.72 | 218,856 |
| 6 | UP_QTOF_high 15 ppm | 38.79 | 75,276 | 126.53 | 184,432 |
| 7 | UP_Orbi 2.5 ppm | 17.31 | 23,326 | 24.27 | 17,620 |
| 8 | mzMatch 2.0 ppm | 110.98 | 99,798 | 27.34 | 2,692 |
| 9 | CentWave Generic 30 ppm | 73.84 | 103,840 | 578.79 | 156,956 |
| 10 | massifquant 10 ppm | 18.87 | 8,210 | 36.13 | 43,382 |
| 11 | Trust Metabolomics | 56.65 | 30,052 | 324.66 | 463,928 |
| 12 | IPO centwave initial | 59.21 | 70,474 | 36.68 | 26,657 |
| 13 | IPO MatchedFilter initial (lowres) | 26.30 | 50,692 | 87.65 | 143,752 |

The table above basically shows that for an estimated 1,000 samples of the faahKO dataset the peak finding part can be finished in less than 1 h with all methods. In the case of the high-resolution UPLC-QTOF mtbls2 set, nine of the thirteen peak finding methods can finish an estimated 1,000-sample set within 1 h; the remaining four cannot. The peak numbers do not relate to the true number of compounds or peaks in the set, rather to the first round of parallelized xcms peak finding. All computations were performed with the [Revolution R](http://www.revolutionanalytics.com/) (PRO) engine.
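The R^2 value quoted above can be checked directly from the table; this sketch uses only the faahKO columns, while the original correlation may have been computed over a different pooling of both datasets:

    ### correlate peak finding time with the number of peaks found (sketch)
    secs  <- c(16.94, 46.26, 46.77, 22.89, 36.76, 38.79, 17.31,
               110.98, 73.84, 18.87, 56.65, 59.21, 26.30)
    peaks <- c(12892, 57998, 57998, 20562, 75276, 75276, 23326,
               99798, 103840, 8210, 30052, 70474, 50692)
    summary(lm(secs ~ peaks))$r.squared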

Additional material:

* snow and xcms - the packages used here
* IPO - for using design of experiments to optimize XCMS parameters
* MetaboNexus - for optimized LC and GC XCMS parameters
* MetaXCMS - contains recommendations for QTOFs and Orbitraps
* MS-DIAL - true LC-MS/MS based mass spectral deconvolution

Source code: