GSoC 2021 Projects - aiidateam/aiida-core GitHub Wiki

Getting started with

AiiDA is a python framework for managing computational science workflows, with roots in computational materials science. It helps researchers manage large numbers of simulations (1k, 10k, 100k, ...) and complex workflows involving multiple executables. At the same time, it records the provenance of the entire simulation pipeline with the aim of making it fully reproducible.

AiiDA is used in research projects at universities, research institutes and companies (see SciPy 2020 talk, publications, and testimonials).

To be considered as a GSoC student, please make a small pull request to aiida-core. This could be a simple bug fix, an improvement to the documentation, etc. See e.g.

Why work on AiiDA?

Help accelerate the transition to open (computational) science.
Help fix the reproducibility crisis. Computational science is a good place to start.
Work with a team of computational scientists (mostly physics backgrounds) who are passionate about both science and coding.
We have an active Slack workspace & biweekly developer meetings.

A background in materials science is not needed, but a basic interest in materials science topics will make things easier for you.

Project 1 - Extending the AiiDA REST API towards workflow management

Level: intermediate

AiiDA comes with a built-in REST API (based on the flask microframework) that provides access to the provenance graph stored automatically with any workflow execution. In order to enable the integration of AiiDA as a workflow backend into new or existing web platforms, we plan to extend the REST API to support workflow management.

The design of the REST API extension will follow an AiiDA enhancement proposal that is currently being drafted (and will be ready before you start).

Expected outcomes

In this project, you will implement POST methods that allow the creation of new AiiDA entities via the REST API, starting with /users, and continuing with /computers, /nodes and /groups.
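As a rough illustration of what a POST handler for /users has to do, the sketch below validates a JSON body, rejects duplicates and returns an HTTP status code together with a response body. It is framework-agnostic and does not use the AiiDA or flask APIs: `User`, `UserStore` and `post_user` are hypothetical names, and the in-memory store stands in for the AiiDA database. The actual design will follow the enhancement proposal mentioned above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass
class User:
    """Hypothetical stand-in for an AiiDA user record."""
    id: int
    email: str
    first_name: str = ""
    last_name: str = ""


class UserStore:
    """Hypothetical in-memory replacement for the database backend."""

    def __init__(self) -> None:
        self._users: Dict[str, User] = {}

    def exists(self, email: str) -> bool:
        return email in self._users

    def create(self, email: str, first_name: str = "", last_name: str = "") -> User:
        user = User(id=len(self._users) + 1, email=email,
                    first_name=first_name, last_name=last_name)
        self._users[email] = user
        return user


def post_user(store: UserStore, raw_body: str) -> Tuple[int, dict]:
    """Handle the body of a POST /users request.

    Returns a tuple (HTTP status code, JSON-serializable response body).
    """
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, {"error": "request body is not valid JSON"}

    if not isinstance(payload, dict):
        return 400, {"error": "request body must be a JSON object"}

    email = payload.get("email")
    if not isinstance(email, str) or not email:
        return 400, {"error": "field 'email' is required"}

    if store.exists(email):
        return 409, {"error": "a user with this email already exists"}

    user = store.create(
        email=email,
        first_name=str(payload.get("first_name", "")),
        last_name=str(payload.get("last_name", "")),
    )
    # 201 Created signals that a new entity was stored.
    return 201, asdict(user)
```

In a flask-based API, a function like this would be called from the view registered for the POST method, with the request body passed in and the returned status code set on the response.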

For particularly motivated students, there are exciting stretch goals available (not required/expected):

Option 1: implement a new /processes endpoint supporting GET, PUT and DELETE for workflow management
Option 2: implement authentication for the new endpoints

Skills

We expect you to be familiar with object-oriented programming in python. Some familiarity with web frameworks like flask will be beneficial.

Project 2 - Performance optimizations at the ORM level

Level: intermediate

AiiDA uses an object-relational mapping (ORM) to map python objects to corresponding records in its PostgreSQL database. The AiiDA ORM allows users to create and manage objects (e.g. AiiDA nodes in the provenance graph) through the AiiDA python API.

While an ORM provides useful abstraction for the user, it adds overhead that can become a bottleneck when operating on large numbers of objects at once.

The goal of this project is to speed up these processes by implementing an ORM API for bulk object creation.
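The difference between per-object and bulk insertion can be sketched at the database level with the standard-library sqlite3 module. This is only an illustration of the idea, not AiiDA code: AiiDA uses PostgreSQL through django or sqlalchemy, and the `node` table and function names below are hypothetical.

```python
import sqlite3
from typing import Iterable


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a minimal, hypothetical node table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, label TEXT)")
    return conn


def insert_one_by_one(conn: sqlite3.Connection, labels: Iterable[str]) -> None:
    """One statement and one commit per object, like storing objects in a loop."""
    for label in labels:
        conn.execute("INSERT INTO node (label) VALUES (?)", (label,))
        conn.commit()


def insert_bulk(conn: sqlite3.Connection, labels: Iterable[str]) -> None:
    """All rows in a single transaction, reusing one prepared statement."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO node (label) VALUES (?)",
            [(label,) for label in labels],
        )


def count_nodes(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
```

Both functions leave the database in the same state; the bulk variant avoids the per-object round trip and commit, which is the kind of overhead a bulk ORM API would aim to remove.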

Expected outcomes

You will implement bulk insertion functionality in the AiiDA ORM that works with both ORM backends supported by AiiDA (django and sqlalchemy) and provides performance improvements of several orders of magnitude for large numbers of operations (don't worry, it will).

Stretch goal for exceptional students (not required/expected): use your implementation inside AiiDA to speed up data import and export from AiiDA archive files & more.

Skills

You will need to understand what an object-relational mapping is and be able to work with existing ORM python frameworks. This requires familiarity with object-oriented programming in python as well as a basic understanding of relational databases (like PostgreSQL). Previous experience with an ORM like django or sqlalchemy is beneficial, but not required.

Project 3 - Built-in support for containerized simulation codes (docker, shifter, singularity, ...)

Level: advanced

AiiDA automatically records every step of a workflow in the form of a directed acyclic graph, where each individual step is represented as a (calculation) node, linked to (data) nodes representing its inputs and outputs. The outputs, in turn, can be inputs of following (calculation) steps (see, e.g., the AiiDA tutorial). By inspecting all the "ancestors" of a given "result" in the graph, one obtains a complete description of its "provenance", i.e. the full sequence of (calculation) steps that produced it.
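The notion of provenance as "all ancestors of a result" can be sketched with a small graph traversal. The example below does not use the AiiDA API; the node names and the `PARENTS` mapping are made up for illustration, with each node mapped to the nodes it directly depends on.

```python
from typing import Dict, List, Set

# Hypothetical provenance graph: each node maps to its direct parents.
# For a calculation, the parents are its input data nodes;
# for a data node, the parent is the calculation that created it.
PARENTS: Dict[str, List[str]] = {
    "structure": [],
    "parameters": [],
    "relax_calc": ["structure", "parameters"],
    "relaxed_structure": ["relax_calc"],
    "bands_calc": ["relaxed_structure", "parameters"],
    "band_structure": ["bands_calc"],
}


def ancestors(node: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return every node from which `node` can be reached, i.e. its provenance."""
    seen: Set[str] = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a node can be reached via several paths in a DAG
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen
```

Calling `ancestors("band_structure", PARENTS)` collects both calculations and all data nodes that fed into them, while an initial input such as "structure" has no ancestors.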

When a calculation is performed by a simulation code on a remote high-performance computer, a representation of the simulation code is included as an input of the calculation. As of today, AiiDA represents codes in a way similar to "symlinks" to an existing executable on a remote computer: it stores a reference to the computer on which the code is installed, and the full path to the executable (along with additional metadata, such as the dynamic libraries loaded at runtime, etc.).

Recent years have seen an increasing adoption of containers (using technologies such as docker, singularity, shifter or sarus), including in the HPC domain, where executables are no longer compiled on the target machine but are compiled once and run in a portable, encapsulated environment. The encapsulation of the full run-time environment, together with the availability of global container registries, constitutes a major step forward in terms of reproducibility: storing the identifier of the container in the AiiDA graph makes it possible to directly re-run existing workflows without access to the computer where they were originally executed.

This project aims to make containerized codes first-class citizens in the AiiDA provenance graph, making it possible to reproducibly re-run recorded workflows, even if simulation steps are run on different remote (super)computers.

Expected outcomes

In this project, you will extend the Code class in AiiDA to support codes that may be pulled from a container registry on demand (e.g. DockerHub or some local registry in the supercomputer centre).
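One possible shape for such a code representation is sketched below. It is not the AiiDA Code class and not a proposed design: `ContainerizedCode`, its fields and the image name used in the usage note are hypothetical. The point is only that the stored information (image identifier plus executable path inside the container) is independent of any particular computer, and that the launch command can be derived from it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContainerizedCode:
    """Hypothetical description of a code that lives in a container image."""

    image: str            # identifier in a container registry
    executable: str       # path of the executable inside the container
    engine: str = "docker"

    def command(self, arguments: List[str]) -> List[str]:
        """Build the command line that runs the executable in the container."""
        if self.engine == "docker":
            prefix = ["docker", "run", "--rm", self.image]
        elif self.engine == "singularity":
            # singularity can run images from a docker registry via docker://
            prefix = ["singularity", "exec", f"docker://{self.image}"]
        else:
            raise ValueError(f"unsupported container engine: {self.engine}")
        return prefix + [self.executable, *arguments]
```

For example, `ContainerizedCode("example/simcode:1.0", "/usr/bin/simcode").command(["-in", "input.txt"])` yields a `docker run` command line, and switching `engine` to "singularity" produces the equivalent `singularity exec` invocation for the same image.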

Stretch goal for exceptional students (not required/expected): implement routines to re-run workflows recorded in an existing AiiDA graph, with no input other than the computer on which to run them.

Skills

You will need to work with the workflow engine of AiiDA, which requires advanced python knowledge (including basic understanding of coroutines), as well as prior experience with container technologies (docker or singularity). Experience with job schedulers on clusters/supercomputers will be beneficial.

Project 4 - Make the AiiDA REST API extensible through plugins

Level: intermediate

AiiDA lives in an ecosystem of plugins that provide a wide range of functionalities, from support for certain simulation codes, scientific workflows and new data types to support for schedulers on supercomputers (see intro to plugin internals). This project focuses on making it possible for plugins to extend the AiiDA REST API, a feature that becomes increasingly important with the integration of AiiDA into web platforms.

Expected outcomes

Under the guidance of your mentors:

You will refactor the AiiDA REST API to use python entry points for registering API endpoints.
All existing endpoints (/users, /computers, /nodes, /groups, ...) will be registered through entry points themselves.
The aiida-diff demo plugin will include an example of how to add a new REST endpoint.
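The entry point mechanism referred to above can be sketched with the standard-library importlib.metadata module. The group name "aiida.rest.endpoints" and the `load_endpoints` helper are assumptions made for this example; the group name used by AiiDA would be decided as part of the project.

```python
from importlib import metadata
from typing import Dict

# Hypothetical entry point group; a plugin would declare its endpoints
# under this group in its package metadata, e.g. in pyproject.toml:
#
#   [project.entry-points."aiida.rest.endpoints"]
#   diff = "aiida_diff.rest:DiffResource"   # hypothetical module and class
ENDPOINT_GROUP = "aiida.rest.endpoints"


def load_endpoints(group: str = ENDPOINT_GROUP) -> Dict[str, object]:
    """Discover and load all endpoint objects registered under `group`."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):
        # Python 3.10 and later
        selected = entry_points.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name
        selected = entry_points.get(group, [])
    # ep.load() imports the module and returns the referenced object
    return {ep.name: ep.load() for ep in selected}
```

The REST API would then iterate over the returned mapping and register each object with the web framework, so that built-in endpoints and plugin endpoints go through the same code path. With no package declaring entry points in the group, the function returns an empty dictionary.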

Skills

We expect you to be familiar with object-oriented programming in python. Some familiarity with web frameworks like flask will be beneficial.

Project N - Your Idea Here

If you're already familiar with AiiDA and have your own idea on how to improve it, we're happy to consider it (you may also want to check the development roadmap for further interesting project ideas). In this case, please think about the steps you would take to attack the problem and contact us in advance so that we can draw up a rough work plan.

Mentorship

The mentors for GSoC 2021 are:

Leopold Talirz @ltalirz
Chris Sewell @chrisjsewell

Please use the GSoC 2021 discussion thread to say hi and ask any questions you may have.