GSOC 2021 Proposal - HenrikBengtsson/matrixStats GitHub Wiki

matrixStats: Consistent Support for Name Attributes

Below is a project proposal for the R Project's participation in the Google Summer of Code 2021 on the matrixStats package. The application deadline for students is on April 13, 2021, but please also note that there is a "soft" deadline to submit project tests on April 6, 2021.

Last updated: 2021-02-20

Background

The matrixStats package provides high-performing functions operating on rows and columns of matrices, e.g. rowMedians(), colRanks(), colSds(). These functions are optimized to run as fast as possible while consuming as little memory as possible. The performance gain comes from carefully designing and implementing the algorithms in native code (C) - all without relying on parallel processing.

For example, consider the following numeric 20-by-500 matrix:

X <- matrix(rnorm(20 * 500), nrow = 20, ncol = 500)

To calculate the mean and median for each column, we can do the following in base R:

mu <- apply(X, MARGIN = 2, FUN = median)

With matrixStats, we can do:

mu <- matrixStats::colMedians(X)

For this particular matrix, the matrixStats solution is 71 times faster and allocates 67 times less memory while performing the calculations. The performance comes from carefully written C code, which has less overhead compared to R code. Another reason for the improved performance is that less memory is allocated and less often, which results in R's garbage collector not having to run as frequently. We can see this as:

bench::mark(
  mu <- apply(X, MARGIN = 2, FUN = median),
  mu <- matrixStats::colMedians(X)
)
# # A tibble: 2 x 13
#   expression                                   min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#   <bch:expr>                               <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>
# 1 mu <- apply(X, MARGIN = 2, FUN = median)  13.9ms  14.3ms      69.0   293.2KB     34.5    22    11
# 2 mu <- matrixStats::colMedians(X)           212µs 214.5µs    4618.     4.38KB      0    2309     0
# # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>, gc <list>

In pipelines where these calculations are performed over many matrices, say, 1,000 - 100,000s of matrices, the performance gain is significant when using the matrixStats package. This means that pipelines run faster, results are obtained sooner. If these pipelines are run in cloud services, the total cost will be lower because the overall CPU time will be lower. The energy consumption and therefore also the CO2 footprint will also be lower when using matrixStats.

For these and other reasons, the matrixStats package is used by a lot of R users and developers in their scripts and R packages. As of February 2021, there are 162 package on CRAN and 158 package on Bioconductor, totaling 320 packages, that directly rely on matrixStats for matrix-based operations. There are many more that depend on it indirectly. Moreover, the success of matrixStats has resulted in its API being implemented also for more complex matrix-like data types such as sparse matrices (Matrix and sparseMatrixStats) and Bioconductor-specific large matrix-like structures (DelayedArray and DelayedMatrixStats). To conclude, there is a lot of interest in the matrixStats package and its current and future development.

Problem: Row and column names are handled inconsistently, if at all

Consider the following 5-by-3 matrix with row and column names:

X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)
rownames(X) <- c("a", "b", "c", "d", "e")
colnames(X) <- c("A", "B", "C")
X
#              A          B            C
# a  0.265712524  1.8908484 -1.884761053
# b -0.693603605 -0.7357395 -0.358448768
# c -0.239368448  0.6221126  1.076547924
# d  0.143282946 -0.8143789  0.005601136
# e -0.008303701  0.2542775  0.521039672

If we use the base R method to calculate column medians, we see that the names attribute of the results reflects the column names of the input matrix:

apply(X, MARGIN = 2, FUN = median)
#            A            B            C 
# -0.008303701  0.254277475  0.005601136

In contrast, if we use matrixStats, this information is dropped:

matrixStats::colMedians(X)
# [1] -0.008303701  0.254277475  0.005601136

Goal of this project

The goal of this GSOC project is to make matrixStats function handle names in the same manner as the base R function. In particular, we will treat the behavor of apply as the definition of how the matrixStats functions should behave. Hence, we should have:

matrixStats::colMedians(X)
#            A            B            C 
# -0.008303701  0.254277475  0.005601136

to be equivalent to

apply(X, MARGIN = 2, FUN = median)
#            A            B            C 
# -0.008303701  0.254277475  0.005601136

However, we will at least temporarly have an option to turn this behavor off through an option:

options(matrixStats.useNames = FALSE)
matrixStats::colMedians(X)
# [1] -0.008303701  0.254277475  0.005601136

For backward compatible reasons, we will also support matrixStats.useNames = NA (see below). This option can be made permanent and changed into an argument useNames later if it turns out that performance loss or backwards compatibility becomes an issue. Else, if no such problem arise or if checking the option contributes to a significant overhead, it should later on be removed.

Outcome and impact

If successful, these improvements will be included in the matrixStats package as soon as possible. Because the matrixStats is a highly popular already, the added support for name attributes will immediately reach a wide, active user base while at the same time attract users and developers who lack this feature and therefore have not been able to use matrixStats for their needs.

Tasks and challenges

The overall objectives of this project and the matrixStats package is to provide individuals, the research community, and the industry with an outstanding and reliable tool. Equally important is that you will learn new things from this project and have fun along the way!

Below is a sketched outline of tasks, challenges, and goals (some may be optional) this project covers. Although we, the mentor, have a good understanding of what needs to be accomplished, we do not know all the details. For example, just like with any other real-world project, there are knowns and unknowns, and some of the unknowns will force us to step back and reconsider the current design. This makes projects challenging, fun, and how we learn from them.

Tasks

The matrixStats API is implemented as R functions, which then call C functions for maximum performance. For example, colMedians() calls C function C_rowMedians() internally. We propose to first implement proof-of-concept support for useNames at the R level before considering doing it in C. This will allow us to implement all the needed unit tests as soon as possible. Correctness is critical in this project and for the matrixStats package.

Here is a sketch of an outline for this project:

  1. Catogarize the matrixStats according to the following criteria: Are the results of the function generated in C or R code? Does the function already support naming? Which datatypes (vector/matrices) does the function accept and return? Does it make sense for the function to support naming?

  2. Update the R functions to detect an option matrixStats.useNames = TRUE / FALSE / NA with NA being the backward-compatible default (explained further below)

  3. Write package tests for each function updated that tests all cases of useNames as well with and without row and column names in the input data. Also, write benchmarks where you test out different input sizes, dimensions and combinations of named and unnamed dimensions.

  4. Run reverse-package dependency checks on all 300+ packages to make sure nothing breaks (we will be able to help with compute resources, if needed)

  5. Move implementation for naming down to the C code to minimize the overhead of handling name attributes. Here, you do not have from scratch, but you can use an already created template where tests are already added.

  6. Relying on the above package tests, validate the correctness of the updated C code

  7. If time allows, run reverse-package dependency checks with matrixStats.useNames = FALSE and matrixStats.useNames = TRUE being the default to identify which packages rely on the current in-consistent behavior. Finally, check whether checking matrixStats.useNames has any measureable overhead by itself.

It is hard to estimate how much time and efforts it will take to implement and validate the new naming features throughout the matrixStats API up front. This will become more clear about half-way into Step 2 where we have a better sense on how much of the cases requires special care and much can rely on a generic solution. Writing test cases helps identify this.

Challenges

Because the matrixStats package is used for it's performance, we cannot risk that handling names creates an overhead that could have a significant negative impact on performance for downstream dependencies. For this reason, we will have to consider whether to introduce support for row and column names as optional, e.g.

matrixStats::colMedians(X, useNames = TRUE)
#            A            B            C 
# -0.008303701  0.254277475  0.005601136

However, if benchmarking shows that this should not be necessary, we will be providing a temporary option matrixStats.useNames = FALSE which can be later removed if there is no need for it.

What complicates the matter further is that some of the matrixStats functions do already preserve names, e.g.

matrixStats::colLogSumExps(X)
#        A        B        C 
# 2.605958 2.576195 1.364362

Because of the popularity of matrixStats, it is likely that there are some scripts and packages out that rely on the current in-consistent behavior across functions. We will be able to check existing packages using reverse-dependency checks (e.g. revdepcheck) and work with those package maintainers to update their code accordingly. However, we can never be able to find the users out there with scripts that rely on this. So, we need to move forward with great care here.

If the reason for not making matrixStats.useNames = TRUE the new default was performance, the reason for not making matrixStats.useNames = FALSE is to not break existing code. This leaves us with the conservative alternative of matrixStats.useNames = NA, which should leave the default behavior as-is for all functions, e.g.

options(matrixStats.useNames = NA)
matrixStats::colMedians(X)
# [1] -0.008303701  0.254277475  0.005601136

matrixStats::colLogSumExps(X)
#        A        B        C 
# 2.605958 2.576195 1.364362

When matrixStats implements matrixStats.useNames = NA throughout and it has been published on CRAN, we can start contacting maintainers of packages whose code breaks or runs slower if we would switch the default to matrixStats.useNames = FALSE or matrixStats.useNames = TRUE. Such packages can be identify by running reverse-dependency checks. Also after release there might be a good idea to keep this option in case there are some special corner cases in scripts people run which may cause crashes or sluggerish performance due to the naming.

In case you are fast

If it turns out that support for naming can be added faster than we expect above, there are additional improvements that can be worked on after finishing the above that would be valuable to the matrixStats package and all of its users. For ideas, see https://github.com/HenrikBengtsson/matrixStats/issues. A interesting candidate is to:

  • Speed up code by skipping checking for NA:s when possible. Checking for NA:s is very expensive and if we knows that there are no NA:s in the input data, we can pass na.rm = FALSE to skip these checks. Unfortunately, we rarely know whether or not there are missing values in the input data. However, since R 3.5.0, R keeps track of this information internally in some cases and we can use the new, powerful ALTREP framework to quickly check this information. One way we can use this in matrixStats is to query ALTREP for this information if the user specified na.rm = TRUE and if ALTREP says there are no NA:s in the data, then we can automatically switch that to na.rm = FALSE before running our algorithm. See also https://github.com/HenrikBengtsson/matrixStats/issues/191.

What you will learn or improve upon

Here are a few areas that you will learn or learn more about when working on this project:

  • R package development, e.g. building, installing, and checking

  • Robust code development through unit tests that have a high code coverage

  • Write reproducible tests, troubleshooting, and fixing bugs

  • How to introduce new features into a well-established code base while minimizing the risk of breaking existing code that is mostly unknown to you

  • Interact with other package developers via their issue trackers

  • Version control, pull requests, GitHub

All the above are valuable skills and fundamental concepts in programming and software engineering.

Don't be afraid to ask questions or propose alternative ideas/solutions - we all learn from asking questions and doing mistakes - and behind every great idea there are often many "failed" ideas.

Mentors

  1. Jakob Peder Pettersen, PhD Student, Department of Biotechnology and Food Science, Norwegian University of Science and Technology (NTNU), [email protected]. Jakob is a part of Almaaslab and does reseach on genome-scale metabolic modelling and behavor of microbial communities.

  2. Henrik Bengtsson, Associate Professor, Department of Epidemiology and Biostatistics, University of California San Francisco (UCSF), [email protected]. Henrik is a member of the R Foundation and the R Consortium Infrastructure Steering Committee. He is the author of a large number of CRAN and Bioconductor packages including the future framework for parallel processing. Henrik has been a GSOC mentor ('Subsetted and parallel computations in matrixStats', GSOC 2015).

If you are interested in conducting this project, please see https://github.com/rstats-gsoc/gsoc2021/wiki for how to proceed. If anything is unclear about this project, drop us a message.

Skills requires

In order to succeed with this project, the following skills are required or a highly recommended:

  • R package development, e.g. building, testing and testing and testing

  • Installing an R package with C node

  • Understanding basic lower-level programming constructs such as pointers

  • Some basic C programming, which can be learned

  • Experience with Git and GitHub is a major advantage, especially since it will be how you will access existing code and share your code

  • Being able to communicate in written English

  • Ability to communicate in spoken English is a preference but not a requirement

We expect that the participating student meets deadlines and responds to mentors and the GSOC organization in a timely manner.

Skill tests

Students, please send us results (links) of the below tests. You are not required to do all tests, but doing more tests and more difficult tests makes you more likely to be selected for this project. Also, these tests also serves as means for you (and us) to get a feeling for whether this project is a good match for you. The more of the tests that you understand and try to attack and even solve, the more confident you can be that you will be able to deliver on this project.

  1. Easy: Installing R packages with C code

    • Install matrixStats from source, i.e. install.packages("matrixStats", type = "source"). This requires that you have developer tools such as gcc and make installed. If you have not done this before, consider this important task being a 'Medium' task.
  2. Easy: Git and R package testing

    • Fork the matrixStats repository and build the package and check it with both devtools::check() and R CMD check --as-cran.
  3. Easy: Prototyping in R

    • Add argument useNames = NA to colMedians(). If a non-NA value is passed, give an informative error message.

    • Remember to add new argument to the Roxygen-based help and regenerated.

    • Write a package test that asserts that matrixStats.useNames = NA works and matrixStats.useNames = TRUE (or FALSE) gives the expected error.

    • Try to make the package passes R CMD check will all OKs

  4. Medium: Simple support for name attributes

    • Add handling of matrixStats.useNames to colMedians().

    • Write package tests that assert that also matrixStats.useNames = FALSE and matrixStats.useNames = TRUE works as expected.

    • Make the package passes R CMD check will all OKs

  5. Medium: A related, slightly different case

    • Add support naming support to colLogSumExps()

    • Write package tests and check with R CMD check.

  6. Medium/Hard: Implement in C code

    • Implement naming support for colLogSumExps() in C code (remember the template).

    • Validate correctness with R CMD check.

  7. Hard: Begin to work on the project.

⚠️ **GitHub.com Fallback** ⚠️