ruptures for change point detection - rstats-gsoc/gsoc2025 GitHub Wiki
Background
ruptures (1.7k stars on github as of Mar 2025) is a popular python implementation of classic and state-of-the-art change-point detection algorithms, which are reviewed in Selective review of offline change point detection methods. Rebecca Killick maintains a helpful web page with references.
Related work
The Time Series CRAN Task View has a section reviewing R packages for change-point detection. However, there are drawbacks to existing R/python packages.
- changepoint and ruptures: several common algorithms, but current implementations do not achieve asymptotically optimal time complexity.
- gfpop, PeakSegOptimal, binsegRcpp: efficient C++ code, but R API not consistent with other tools.
Details of your coding project
The goal of this project is to re-implement classic (binary segmentation) and state-of-the-art (PELT, FPOP) change-point detection in modern C++ (using, for instance, Armadillo), which can be interfaced with R (and eventually Python).
- Re-implement core algorithms in C++
- common distributions (normal, Poisson, ...)
- Dynamic programming
- PELT for multi-variate data.
- FPOP for uni-variate data.
- Binary segmentation
- R package with consistent API to access each algorithm.
- Documentation
- vignettes
- tests
Expected impact
Ruptures is extremely popular so this project could have a large impact.
Mentors
Contributors, please contact the mentors below after completing at least one of the tests below.
- EVALUATING MENTOR: Charles Truong is the maintainer of ruptures in Python.
- Toby Hocking [email protected] is the author of numerous R packages, and has been a mentor/admin for R-GSOC since 2013.
Tests
Contributors, please do one or more of the following tests before contacting the mentors above.
- Easy: Write an R function
getCumsum
- Input is $X$, an array of size TxD containing doubles.
- Output is $Y$, an array of size (T+1)xD such that the first row is filled with 0s and $Y[i] = \sum_{j<i} X[j]$. Here, $Y[i]$ is the i-th row of $Y$.
- Write a C++ implementation using Armadillo and a R wrapper using RcppArmadillo.
- Medium: Write a C++ class
Cost
- This class is constructed from a double array $X$ of size TxD.
- Add a member function
double eval(int start, int end)
which returns $\sum_{t=\text{start}}^{\text{end}-1} \left| X[t] - m\right|^2$ where $|\cdot|$ is the Euclidean norm and $m$ is the empirical mean of $X$ on the segment $[i, j[$.
- Hard: Write an efficient
Cost::eval
function with constant time complexity. Use the following relations to precompute relevant quantities.
$$ \sum_{t=\text{start}}^{\text{end}-1} \left| X[t] - m \right|^2 = \left(\sum_{t} \left| X[t] \right|^2\right) - \left(\left| \sum_{t} X[t] \right|^2 / (\text{end}-\text{star})\right) $$
and
$$ \sum_{t=\text{start}}^{\text{end}-1} \left| X[t] \right|^2 = \left(\sum_{t=1}^{\text{end}-1} \left| X[t] \right|^2 \right) - \left(\sum_{t=1}^{\text{start}-1} \left| X[t] \right|^2 \right) $$
and
$$ \sum_{t=\text{start}}^{\text{end}-1} X[t] = \left(\sum_{t=1}^{\text{end}-1} X[t] \right) - \left(\sum_{t=1}^{\text{start}-1} X[t] \right). $$
Solutions of tests
Contributors, please post a link to your test results here.
Minh Long Nguyen | Github | Solutions for all 3 tests
Dev Goel | Github | Solutions
Olivier Boulant | Github | Solutions