Google Season of Docs Proposal: Restructuring OpenMS Developer Documentation (OpenMS ReDevDoc)
Background
OpenMS is a cross-platform (Linux, Windows, macOS) software framework built around a core open-source C++ library, which implements all data structures and algorithms required for mass spectrometric (MS) data analysis. It was created in 2006, is currently at release 2.6 and is licensed under the three-clause BSD license. Our contributors are computational mass spectrometrists, bioinformaticians, and data scientists. Among our users are large data repositories like MassIVE and PRIDE, individuals who value the flexibility and open-source nature of OpenMS, and companies in need of customized solutions to their analytical problems. OpenMS was cited in ~900 scientific publications in 2020, and its tools were used ~3,000 times per month by unique users in the first quarter of 2021. We have seen strong interest in our workshops, and over the last years we have trained between 80 and 120 participants annually (2020 numbers are lower due to COVID-19 travel restrictions). Numerous downstream tools such as MSstats, aLFQ, QCloud2 and Skyline have been integrated with OpenMS.
Mass spectrometry is a sensitive high-throughput technique capable of direct quantitative and qualitative measurement of proteins, which perform essentially all cellular tasks required for life. For example, direct interactions between SARS-CoV-2 proteins and human proteins (such as the viral spike protein and human ACE2) are essential for virulence; mass spectrometry has produced a full map of 332 pairwise protein interactions. OpenMS is used not only for proteins but also for metabolites. In addition, MS is used for more specialized purposes such as the analysis of cross-linked molecules (both protein-protein and protein-nucleic acid) and nucleic acid identification and quantification.
The documentation problem of OpenMS
OpenMS uses modern object-oriented, template-based C++, and extensive English documentation is available for several thousand C++ functions of the public API. All development is performed in the open, using GitHub. Besides C++, most of the functionality of OpenMS is also available from Python through pyOpenMS. These Python bindings allow for rapid prototyping and offer aspiring developers an easier route to get acquainted with OpenMS, but they also create a documentation issue: classes are documented in C++ using Doxygen and (partly duplicated) in Python using docstrings.
Besides the duplicated class documentation, there are also two tutorials for software developers, one for C++ (using Doxygen) and one for pyOpenMS (using readthedocs). For historic reasons, these two tutorials have diverged considerably. This divergence makes the documentation hard to maintain and makes it difficult for pyOpenMS developers to take the step from pyOpenMS to C++. Also, both tutorials focus on proteomics, with metabolomics completely absent. Metabolomics is an important use case for OpenMS, and its importance in biomedicine is rapidly growing. Lastly, there is little flow in the tutorials: later sections rarely build upon previous ones to achieve a non-trivial goal.
OpenMS ReDevDoc scope
The OpenMS project (code-named OpenMS ReDevDoc) will:
- Review the current pyOpenMS tutorial and propose logical steps of increasing difficulty, culminating in the application of machine learning to a mass spectrometry data set, for example a gradient boosting machine to predict retention time.
- Using the review from the previous step, update the tutorial and relegate details (about other classes or functions) that do not fit in the flow to separate chapters.
- Devise and document a metabolomics workflow, for example one culminating in the visualisation of the quantities of metabolites in a mixture.
- Create a quick “cheat sheet” that documents the main classes and methods.
- Implement interactive code examples as Jupyter Notebooks and install a pyOpenMS server, such that the code examples can be run on the server, without the need to install pyOpenMS locally.
- Rewrite the C++ developer's guide such that it is in lock-step with the pyOpenMS tutorial.
- Make a documentation landing page that summarizes all available documentation.
Work that is out-of-scope of the OpenMS ReDevDoc project:
- Substantial changes to the C++ and Python API developer documentation at the level of classes and functions.
The core OpenMS developers Hannes Röst, Oliver Alka, Timo Sachsenberg and Tjeerd Dijkstra have committed to supervising the OpenMS ReDevDoc project. We have not yet identified a technical writer to work on the project.
Measuring OpenMS ReDevDoc success
OpenMS receives an average of 300 pull requests per year to add or update classes or functions. pyOpenMS is downloaded 2500 times per quarter on average (pypistats of pyOpenMS). We believe that improved tutorials will result in more pull requests for OpenMS and more downloads of pyOpenMS. We would consider the OpenMS ReDevDoc project successful if, after publication of the new documentation:
- The yearly number of pull requests increases by 25%.
- The quarterly number of downloads of pyOpenMS increases by 25%.
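To make these two criteria concrete, the short sketch below converts the baselines quoted above (300 pull requests per year, 2500 downloads per quarter) into absolute targets. The function names and the "meets or exceeds" interpretation of the 25% threshold are assumptions made for illustration, not part of the proposal.

```python
# Baselines quoted in the proposal text.
BASELINE_PULL_REQUESTS_PER_YEAR = 300
BASELINE_DOWNLOADS_PER_QUARTER = 2500
TARGET_INCREASE = 0.25  # the 25% threshold named in both criteria


def target(baseline: float, increase: float = TARGET_INCREASE) -> float:
    """Return the value a metric must reach for the given relative increase."""
    return baseline * (1 + increase)


def is_successful(observed: float, baseline: float,
                  increase: float = TARGET_INCREASE) -> bool:
    """Assumed interpretation: success means meeting or exceeding the target."""
    return observed >= target(baseline, increase)


pr_target = target(BASELINE_PULL_REQUESTS_PER_YEAR)
download_target = target(BASELINE_DOWNLOADS_PER_QUARTER)
print(f"Pull requests per year needed: {pr_target:.0f}")
print(f"pyOpenMS downloads per quarter needed: {download_target:.0f}")
```

Under these assumptions, the targets are 375 pull requests per year and 3125 pyOpenMS downloads per quarter.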
Previous experience with GSoD or GSoC
So far, we have no prior experience with technical writers and have written all documentation without external help. In contrast, we have been highly active in mentoring GSoC students and have participated over the course of three summers under the umbrella of the Open Bioinformatics Foundation (OBF). All students successfully completed GSoC, and two of them continued contributing to OpenMS. In 2017, a student added algorithms for high-resolution isotope generators. In 2018, a student improved the estimation of error probabilities and extended the project into a master's thesis. In 2020, we mentored two students. The first student automatically generated an OpenMS R package from our Python bindings, and the second student developed a novel tool for protein database suitability estimation. In Season of Docs, we will follow the same principles we developed when mentoring GSoC students: provide a friendly atmosphere on equal footing, hold frequent meetings, and work closely together to achieve our joint goal of improving open-source software and lowering the barrier to entry for beginners.
Reference: GSoC 2017 mentors: Timo Sachsenberg, Julianus Pfeuffer, Artem Tarasov, Oliver Alka https://summerofcode.withgoogle.com/archive/2017/projects/6722516903002112/
GSoC 2018 mentors: Timo Sachsenberg, Julianus Pfeuffer, Oliver Alka https://summerofcode.withgoogle.com/archive/2018/projects/5921078421487616/
GSoC 2020 mentors: Hannes Röst, Timo Sachsenberg, Oliver Alka and Chris Bielow, Julianus Pfeuffer https://summerofcode.withgoogle.com/archive/2020/projects/5403269754519552/ https://summerofcode.withgoogle.com/archive/2020/projects/5225514815455232/
Project budget
We estimate that this work will take four months to complete.
Budget item | Amount | Running Total | Notes/justifications |
---|---|---|---|
Technical writer: read up on mass spectrometry and the current (py)OpenMS tutorials, propose new pyOpenMS tutorials for both proteomics and metabolomics, and write documentation | 8000.00 | 8000.00 | One month for reading up on MS, proteomics and metabolomics; two months for the pyOpenMS tutorials; one month for the interactive code examples and a C++ tutorial in lock-step with the pyOpenMS one |
Project t-shirts (10 t-shirts) | 200.00 | 8200.00 | Reuse the OpenMS t-shirt design from ASMS 2019 conference |
TOTAL | 8200.00 |
Contact
Open an issue on Gitter to contact Tjeerd Dijkstra (@tjeerdijk), Oliver Alka (@oliveralka) or Timo Sachsenberg (@timosachsenberg).