Performance testing - Rdatatable/data.table GitHub Wiki
This page describes the plan for performance testing -- what we do to check that merging a PR will not make the code slower.
atime overview and example
We use atime on GitHub actions for performance testing.
- the action is defined in https://github.com/Rdatatable/data.table/blob/master/.github/workflows/performance-tests.yml
- the test cases are in https://github.com/Rdatatable/data.table/blob/master/.ci/atime/tests.R
A performance test template is
# Comments with links to related issues/PRs (delete this line)
# Issue with FUN short description https://github.com/Rdatatable/data.table/issues/4498
# Caused by TODO add PR link or delete this line if cause unknown
# test case adapted from https://github.com/Rdatatable/data.table/pull/4501#issue-625311918 which is the fix PR.
"Short test name" = atime::atime_test(
# arguments (N, setup, expr, pkg.edit.fun) to pass to atime_versions (delete this line)
N = 10^seq(5, 7),
setup = {
L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
setkey(L, V1)
},
expr = {
data.table:::`[.data.table`(L, , .SD)
},
Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
Note above that each SHA1 git version is specified with an argument name which will appear in the output, and a comment so we can later understand where the SHA1 came from.
- If there is a historical commit/PR known to have caused a regression, then use the argument names Before, Regression, Fixed. Before and Fixed should be fast and Regression should be slow.
- If there is no known point in the past which was fast, then use the argument names Fast and Slow. Fast is a new version after the fix, and Slow is an old version before the fix.
- You can include additional historical references if relevant. For example in https://github.com/Rdatatable/data.table/pull/4501#issuecomment-2226725327 the parent of the first commit was in 2020 (Slower old), and the parent on master of the last commit was in 2024 (Slow new), and the fix was faster than both. So three historical references are relevant (Slower old, Slow new, Fast). But make sure to use the names Slow/Fast, or Before/Regression/Fixed, which are assigned special colors, whereas any other names are shown in grey.
- Historical references must be in non-deleted branches. So please avoid deleting branches that may be relevant for performance tests. If you get an error like below, it means you need to find the relevant PR on github and click "Restore branch" button.
Error in revparse_single(object, branch): Error in 'git2r_revparse_single': Requested object could not be found
when trying to checkout 353dc7a6b66563b61e44b2fa0d7b73a0f97ca461
Also you should use data.table:::`[.data.table`(x,i,j)
instead of x[i,j]
because atime tests different versions by substituting data.table::
with data.table.someSHAversion::
(a different package name is created and installed for each version, and to have that package work you need to provide a special pkg.edit.fun because of the particularities of data.table, https://github.com/Rdatatable/data.table/blob/master/.ci/atime/tests.R#L43)
Also if there is some issue/PR comment with benchmark code that was used to derive the atime test, please include a link, as above test case adapted from https://github.com/Rdatatable/data.table/pull/4501#issue-625311918 which is the fix PR.
Documentation of historical commits
For each historical commit, make sure the corresponding comment has links to github web pages where we can see where this SHA1 came from. There are two such types of links
- For "Last commit in the PR (PR_LINK) that fixes the issue" (the commit is in a related PR), just provide a link to the PR commits tab page, such as https://github.com/Rdatatable/data.table/pull/4501/commits
- Can also be "Parent of the commit which caused the regression in the PR (PR_LINK)"
- Or "Commit which caused the regression in the PR (PR_LINK)"
- Make sure we can see the commit ID when we go to PR_LINK
- For "Parent of first commit (COMMIT_LINK) in the PR (PR_LINK) which VERBS the issue" (commit is not in a related PR), then COMMIT_LINK should be a link to the commit ID, such as https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098
- For example above we have
Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
- Make sure to include a PR_LINK to its commits tab, such as https://github.com/Rdatatable/data.table/pull/4501/commits so we can see that the commit is in that PR. In this case, we see that 74636333d7da965a11dad04c322c752a409db098 is really the first commit in that PR.
- VERBS should be fixes or causes
- and the https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098 page shows that cacdc92df71b777369a217b6c902c687cf35a70d is the parent
- do not include the link to the same commit ID as in the code, as that does not help understanding how the commit is related to the PR/issue, and it is redundant with the code which already has that ID. For example the code above has
Slow = "cacdc92df71b777369a217b6c902c687cf35a70d"
so it is not necessary to inculde the linkhttps://github.com/Rdatatable/data.table/commit/cacdc92df71b777369a217b6c902c687cf35a70d
- do not include a link which includes data.table/pull/number/commits/sha (does not show parent) instead use data.table/commit/sha because that page shows the parent
- For example above we have
Why do we include historical references at all? Don't they just add visual clutter?
Discussion from https://github.com/Rdatatable/data.table/pull/4501#issuecomment-2227321671
It is true that Slow and Fast (and other historical references) are not strictly necessary to see future regressions, and they could add visual clutter, but they also essential to answer an important question: is the current performance as good/bad (or better than) the bad/good performance we have seen in the past? (at the time the issue was created/fixed)
the bad performance might still clutter things up in other PRs. After all, we fixed it! ...
you are right about the clutter but the benefit of being able to easily/immediately see the historical references outweighs the drawback of clutter.
We want to be able to see if future changes undo the historical fix. most of the time all of the different versions in a PR will all be the same as the "Fast" or "Fixed" commit but if there is some difference, we want to be able to easily/immediately see if it is as bad as we saw when the fix PR was merged.
A great example is the setDT performance test, which currently shows the result below. Without the "Slow" reference we may wonder if CRAN is faster or slower than the historical bad performance we had seen in the past. But with that reference it is clear that CRAN performance is the same as the historical bad performance that we are trying to fix.
Can not be run from forks
Any pull request sent from a fork can not be perfomance tested, because the Github action requires permission to create artifacts in the rdatatable/data.table
repo. So it is normal that the action fails for pull requests from forks.
Running locally
- To run all performance tests locally on your machine, use
atime::atime_pkg("path/to/data.table",".ci")
which will save result figures to thedata.table/.ci/atime
directory -- the main result figure istests_all_facet.png
. - To run a single performance test on your machine, just copy the same arguments as atime_test to atime_versions,
vinfo <- atime::atime_versions(
pkg.path="path/to/data.table",
pkg.edit.fun = pkg.edit.fun,
N = 10^seq(1, 7, by=0.5),
setup = {
L = as.data.table(as.character(rnorm(N, 1L, 0.5)))
setkey(L, V1)
},
expr = {
data.table:::`[.data.table`(L, , .SD)
},
Slow = "cacdc92df71b777369a217b6c902c687cf35a70d", # Parent of the first commit (https://github.com/Rdatatable/data.table/commit/74636333d7da965a11dad04c322c752a409db098) in the PR that fixes the issue
Fast = "353dc7a6b66563b61e44b2fa0d7b73a0f97ca461" # Last commit in the PR (https://github.com/Rdatatable/data.table/pull/4501/commits) that fixes the issue
)
plot(vinfo)
refs <- atime::references_best(vinfo)
plot(refs)
pred <- predict(refs)
plot(pred)
Alternatively you can actually evaluate the code in atime/tests.R and then use do.call
as below
vinfo <- do.call(atime::atime_versions, c(test.list["setDT improved in #5427"](/Rdatatable/data.table/wiki/"setDT-improved-in-#5427"), pkg.path="path/to/data.table"))
Or, since atime-2024.9.23, you can use the code below
pkg.info <- atime::atime_pkg_test_info("path/to/data.table") #meta-data about test cases (not run yet)
one.call <- pkg.info$test.call["memrecycle regression fixed in #5463"](/Rdatatable/data.table/wiki/"memrecycle-regression-fixed-in-#5463") #un-evaluated call to atime_versions()
one.result <- eval(one.call) #runs atime_versions()
plot(one.result)
Past issues
Performance testing was done in some cases:
The discussions/code above make use of the atime R package, which can measure time/memory for different versions of data.table
, so we can use it to compare master to release to PR.
- Michael says that it is important to run perf tests on several platforms - link and on GitHub CI as of 2024 we are actually only using Linux.
- The doc says that win/mac/linux images are provided by GitHub Actions.
Related team
A team, Performance Testers is assigned to one who is actively involved with the performance testing aspects of data.table
. Responsibilities that fall under this specialized role can include, but are not restricted to:
- Evaluating the scalability of
data.table
functions to track how they perform as datasets grow asymptotically. - Running comparative performance benchmarks to portray the relative efficiency of operations, i.e., in contrast to other packages that achieve similar functionality as
data.table
. - Writing open-source material like blog posts to document such (with code to run the benchmarks provisioned therein), as they would tend to be a great resource for the community. Examples: df-atime-figures, df-partial-match
- Designing test scenarios to measure performance, such as handling large datasets, performing complex queries, having concurrent operations, etc.
- Staying informed with the latest developments in R programming and performance testing methodologies to bring such updates to
data.table
.