R GSoC 2015: Subsetted and parallel computations in matrixStats - HenrikBengtsson/matrixStats GitHub Wiki

Subsetted and parallel computations in matrixStats

Below is a project proposal for the R Project's participation in the Google Summer of Code 2015. The application deadline for students is on March 27, 2015, but please also note that there is a "soft" deadline to submit project tests on March 20.

Background and related work

The matrixStats package provides fast and memory-conservative functions for calculating summaries and statistics on rows and columns of matrices as well as vectors. For instance,

y <- matrixStats::colVars(X)

is much faster (~10-60 times) than

y <- apply(X, MARGIN = 2, FUN = var)

For benchmark results on all functions of the package, see Benchmark Reports.
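The equivalence of the two calls above can be checked directly; here is a quick sanity check on made-up example data:

```r
# Sanity check: matrixStats::colVars() agrees with the apply() equivalent.
library(matrixStats)

X <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)  # example data
y_fast <- colVars(X)
y_apply <- apply(X, MARGIN = 2, FUN = var)
stopifnot(isTRUE(all.equal(y_fast, y_apply)))
```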

There are no known projects with the same goals as the matrixStats package, apart from individual functions that are part of larger packages scattered across CRAN and Bioconductor. In contrast, the matrixStats package focuses on providing a light-weight and cross-platform solution that is straightforward to use and install (with a minimal number of package dependencies). The matrixStats package was created in 2007 and released on CRAN in 2009. It is well tested (96% code coverage), currently used by 55 CRAN and Bioconductor packages, and among the top-2% most installed packages (RStudio CRAN logs).

Project ideas

The idea of this project is to extend the matrixStats package with mechanisms for calculating matrixStats summaries and statistics

  1. on a subset of data (e.g. rows or columns), and
  2. chunked parallel processing.

Efficient processing of data in subsets will boost the performance of use cases commonly seen by end users as well as developers. Furthermore, support for subsetted calculations provides a natural framework for processing data in parallel. As one of the examples below illustrates, subsetted calculations are also useful for calculating summaries across groups of rows or columns.

The following proof-of-concept examples illustrate what the outcome of this project may be.

Calculations on a subset of rows and columns:

y <- colVars(X, rows = -(1:20), cols = 101:400)

Row summaries across groups of columns:

colGroups <- split(seq_along(f), f)  ## 'f' is a factor grouping the columns of X
y <- sapply(colGroups, FUN = function(idxs) rowVars(X, cols = idxs))
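Since the `cols` argument is part of the proposal rather than the current API, the reference semantics it should reproduce can be written today with explicit subsetting. A self-contained sketch with made-up example data:

```r
library(matrixStats)

X <- matrix(rnorm(5 * 10), nrow = 5, ncol = 10)
f <- as.factor(c(3, 1, 1, 2, 2, 1, 1, 2, 3, 1))  # one group label per column
colGroups <- split(seq_along(f), f)

# Reference semantics: explicit subsetting creates an intermediate copy,
# which the proposed rowVars(X, cols = idxs) would avoid.
Y <- sapply(colGroups, FUN = function(idxs) rowVars(X[, idxs, drop = FALSE]))
stopifnot(identical(dim(Y), c(5L, 3L)))  # one column of row variances per group
```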

Chunked parallel processing:

cl <- parallel::makeForkCluster(nnodes = 4)
matrixStats::useCluster(cl)
y <- colVars(X)

Details

Below are further details and motivation for this project.

Subsetted computations

Consider the following 1000-by-1000 numeric matrix:

X <- matrix(rnorm(1000^2), ncol = 1000, nrow = 1000)

Assume that we want to find the minimum value for a subset of the columns. This can be done using

y <- matrixStats::colMins(X[, 101:400])

Although colMins() has been optimized to use a minimal amount of memory, we cannot avoid the creation of the intermediate object X[, 101:400] in the above call. This intermediate object introduces overhead due to (i) memory allocation of a 1000-by-300 matrix and (ii) the need to copy values. It also (iii) adds extra work for the garbage collector, which has to spend time cleaning it up afterward. In addition, the intermediate object (iv) increases the overall memory footprint.
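The size of that intermediate copy can be inspected directly; a 1000-by-300 double matrix occupies roughly 2.4 MB, all of which is allocated, filled, and later garbage-collected on every such call:

```r
# Measure the size of the intermediate object created by X[, 101:400].
X <- matrix(rnorm(1000^2), ncol = 1000, nrow = 1000)
size <- object.size(X[, 101:400])
print(size)  # roughly 2.4 MB, allocated and copied on every call
```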

The first goal of this project is to extend the API by adding optional arguments rows and cols to functions, i.e.

y <- matrixStats::colMins(X, cols = 101:400)

as well as

y <- matrixStats::rowMins(X, rows = -(1:20), cols = 101:400)

This way the subsetting can instead be done internally and "on-the-fly" by native code.

Here is a proof-of-concept benchmark that illustrates the costs of subsetting with X[, 101:400] as well as the approach where the computations are done on the complete object and subsetting is done afterward:

X <- matrix(rnorm(1e6), ncol = 1000, nrow = 1000)
X_S <- X[, 101:400]  ## Emulate planned subsetting
stats <- microbenchmark::microbenchmark(
             colMins(X_S),
             colMins(X)[101:400],
             colMins(X[, 101:400]),
             "apply()+min()"=apply(X[, 101:400], MARGIN = 2L, FUN = min)
         )
print(stats)
Unit: microseconds
                  expr     min      lq  mean  median      uq     max
          colMins(X_S)   335.3   347.6   429   406.5   419.8   684.4
   colMins(X)[101:400]  3540.4  3707.7  4006  3723.5  3794.1  6610.8
 colMins(X[, 101:400])  5822.0  6060.1  7453  6267.6  9121.8 12733.8
         apply()+min() 11914.3 14266.5 17614 15178.5 17035.9 92841.3

This shows that the potential speedup in this particular case is at least 10 times compared with current colMins() and about 40 times compared with the traditional apply() + min() approach. If these calculations are done 10,000s of times (e.g. across human gene expressions or in bootstrap analyses), then the overall processing time saved with the proposed methods will be substantial.

Analogously, vector-based functions can gain from subsetting, e.g.

y <- weightedMean(x, w = w, idxs = 201:400)

Note that here the subsetting is applied to two objects, x and w.
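The intended semantics of the proposed idxs argument can be expressed with today's API by subsetting both vectors up front (the idxs argument itself does not yet exist in matrixStats):

```r
library(matrixStats)

x <- rnorm(500)
w <- runif(500)
idxs <- 201:400

# Reference semantics for the proposed weightedMean(x, w = w, idxs = idxs):
# both the values and their weights are subsetted before averaging.
y_ref <- weightedMean(x[idxs], w = w[idxs])
stopifnot(isTRUE(all.equal(y_ref, sum(w[idxs] * x[idxs]) / sum(w[idxs]))))
```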

Parallel processing

The parallel package has been part of the core R distribution since R 2.14.0 (October 2011). It provides various mechanisms for parallel processing. With the above subsetting API in place, it will immediately be possible to do efficient matrixStats calculations on data chunks in parallel, e.g.

library('matrixStats')
library('parallel')
X <- matrix(rnorm(1e6), ncol = 1000, nrow = 1000)
cl <- makeForkCluster(nnodes = 4L)
subsets <- splitIndices(ncol(X), ncl = 4L)
y <- parLapply(cl, subsets, fun = function(cols) {
    matrixStats::colMins(X, cols = cols)
})
y <- Reduce(c, y)
stopCluster(cl)

The second goal of this project is to simplify this and allow for calls such as:

y <- colMins(X, cl = cl)

or, better, by telling matrixStats via a global setting to use parallel processing, e.g.

matrixStats::useCluster(cl)

such that matrixStats calls look the same regardless of whether parallel processing is used, e.g.

yc <- colMins(X)
yr <- rowMins(X)

The latter, referred to as "implicit parallel processing", would provide any existing code and packages already using matrixStats with an immediate performance boost without the need for modifications/updates.
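A minimal sketch of how such a global setting could work, assuming an option-based registry. Note that useCluster() and colMins_dispatch() are hypothetical names invented for illustration only; they are not part of matrixStats, and an actual implementation would live inside the package's own functions:

```r
# Hypothetical sketch of implicit parallel processing via a global option.
# None of these helpers exist in matrixStats; they only illustrate the idea.
useCluster <- function(cl) {
  options(matrixStats.cluster = cl)
  invisible(cl)
}

colMins_dispatch <- function(X) {
  cl <- getOption("matrixStats.cluster", NULL)
  if (is.null(cl)) {
    return(matrixStats::colMins(X))  # sequential path, no added overhead
  }
  # Split the columns into one chunk per cluster node and combine the results.
  subsets <- parallel::splitIndices(ncol(X), length(cl))
  parts <- parallel::parLapply(cl, subsets, fun = function(cols) {
    matrixStats::colMins(X[, cols, drop = FALSE])
  })
  Reduce(c, parts)
}
```

With no cluster registered, colMins_dispatch() falls back to the plain sequential call, so existing code would be unaffected.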

Tasks and challenges

The overall objective of this project, and of the matrixStats package, is to provide the larger research community and industry with an awesome and reliable tool.

Examples of tasks, challenges and goals (some may be optional):

  • Implement support for optional arguments 'rows' and 'cols' (and 'idxs' for vector functions) in existing matrixStats functions, both at the R level and the C level. This needs to be done such that the added overhead is kept at a minimum, also when no subsetting is performed.

  • Write unit tests for common cases as well as corner cases (e.g. single-element objects, missing values, and integer-overflow calculations). Correctness of the calculations of matrixStats is critical for users.

  • Implement support for chunked parallel processing. Overhead added should be kept at a minimum, also in the case when parallel processing is not used.

  • Explore options for implementing the parallel support in native code in order to further minimize any overhead.

  • Consider how to handle early stopping and exceptions due to failed parallel processing (e.g. timeouts).

  • Sometimes the overhead of parallel processing is greater than the gain, e.g. for small data objects. Add size options/thresholds to control when parallel processing should be skipped. Explore the possibility to tune (estimate and set) these thresholds automatically for the R setup currently running.

  • Write benchmark reports.

  • Embrace bug reports and reproducible research.

  • Never be afraid to ask questions or propose alternative ideas/solutions - we all learn from asking questions and making mistakes, and behind every great idea there is a large number of "failed" ideas.

  • Pay attention to, and maybe even engage with, the "official" efforts in R and Bioconductor on parallel and distributed processing in R.

  • Learn new things and have fun!

Mentors

Any interested students should get in touch with Henrik Bengtsson ([email protected]; Google Melange username/id: henrikb) or Héctor Corrada Bravo ([email protected]; Google Melange username/id: hcorrada) as soon as possible so we can begin work on a more precise project proposal with specific goals and target completion dates. If anything is unclear don't hesitate to drop us a message.

Skills required

In order to succeed with this project, the following skills are required or highly recommended:

  • R package development, e.g. building, testing and testing and testing.
  • C-level programming, since modifications to existing native code will be needed.
  • Experience with GitHub is a major advantage, especially since it will be how you will access existing code and share your code.
  • If you already know how to use valgrind, that will also be a major advantage. If not, you will most likely learn how to use it since it is a very useful tool for validating C code.

Skill tests

Students, please send us results (links) for the tests below. You are not required to do all tests, but doing more tests, and more difficult tests, makes you more likely to be selected for this project. These tests also serve as a means for you (and us) to get a feeling for whether this project is a good match for you. The more of the tests that you understand, attack, and even solve, the more confident you can be that you will be able to deliver on this project.

  1. Easy: Git and R package creation

    • Fork the matrixStats repository, build the package, and check it with R CMD check --as-cran.
    • Check out the 'develop' branch and redo the same.
    • If possible, also check with R CMD check --use-valgrind.
  2. Easy: Prototyping in R

    • Write an R function rowMins_R() that takes argument cols such that one can call:
    X <- matrix(1:50, nrow = 5, ncol = 10)
    y <- rowMins_R(X)
    y <- rowMins_R(X, cols = 6:8)

Demonstrate that the output is correct.

  3. Easy/Medium: Miscellaneous

    • If you have an idea on how to improve this project or matrixStats, we would love to hear about it.
    • If you have developed your own R package that uses C/C++ code (or other related code) which you are proud of, please don't hesitate to share it with us.
  4. Medium: C programming

    • Using the inline package, add a prototype of your rowMins_R() function above implemented in C, and name it rowMins_C().
  5. Medium: R programming

    • Using your rowMins_C() (alternatively rowMins_R()) function above, verify that you can find the row mins for a set of column groups of a matrix (similar to one of the examples in the 'Project ideas' section):
     X <- matrix(1:50, nrow = 5, ncol = 10)
     factors <- as.factor(c(3, 1, 1, 2, 2, 1, 1, 2, 3, 1))
     groups <- split(seq_along(factors), factors)
     Y <- sapply(groups, FUN = function(idxs) rowMins_R(X, cols = idxs))

Demonstrate its correctness.

  6. Medium/Hard: C programming

    • Implement weightedVar(x, w), which calculates the weighted variance of vector x, in C; it is currently implemented in plain R. Demonstrate its correctness.
      Hint: You might find it helpful to use the weightedMean() function, with its high-level R and low-level C functions, as a template and modify accordingly.
  7. Hard: Begin work on the project.
