GSoC 2020 Projects - aiidateam/aiida-core GitHub Wiki

Getting started with

AiiDA is a Python framework for managing computational science workflows, with roots in computational materials science. It helps researchers manage large numbers of simulations (1k, 10k, 100k, ...) and complex workflows involving multiple executables. At the same time, it records the provenance of the entire simulation pipeline with the aim of making it fully reproducible.

AiiDA is used in research projects at universities, research institutes and companies (examples of recent works using AiiDA are 1(https://www.nature.com/articles/s41565-017-0035-5) 2(https://pubs.acs.org/doi/10.1021/acscentsci.9b00619) 3(https://www.nature.com/articles/s41565-019-0577-9)).

To be considered as a GSoC student, you are asked to make a small pull request to aiida-core, for example a simple bug fix or an improvement to the documentation. See e.g. GitHub issues by-label

Why work on AiiDA?

  • Help accelerate the transition to open (computational) science
  • Contribute to fixing the reproducibility crisis. Computational science is a good place to start.
  • Work with a team of computational scientists (mostly physicists) who are passionate about both science and coding

A background in materials science is not needed, but a basic interest in materials science topics will make things easier for you.

Project 1 - Performance optimizations at the ORM level

Level: intermediate

AiiDA has its own front-end Object Relational Mapper (ORM) to map Python objects to the corresponding records in the (PostgreSQL) database. This ORM allows users to create and manage objects (e.g. AiiDA nodes in the provenance graph) through the AiiDA Python API.

While an ORM provides a useful abstraction for the user, it adds overhead that can become a bottleneck when operating on large numbers of objects at once. For example, the export/import functionality in AiiDA, which allows users to export (parts of) a provenance graph and import it into another database, requires dealing with lots of objects in the database in one go.

The goal of this project is to speed up these processes by implementing a generic ORM API for bulk object creation that works with both low-level ORM backends supported by AiiDA (the Django and SQLAlchemy libraries).
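To illustrate why bulk creation matters, the sketch below contrasts inserting rows one statement at a time with handing the whole batch to the database driver in a single call. It is only an illustration under stated assumptions: it uses the `sqlite3` module of the Python standard library as a stand-in for PostgreSQL, and the `node` table and all function names are hypothetical and not part of AiiDA's ORM.

```python
import sqlite3


def insert_one_by_one(conn, labels):
    """Insert rows with one statement per object, as a per-object save would."""
    for label in labels:
        conn.execute("INSERT INTO node (label) VALUES (?)", (label,))
    conn.commit()


def insert_bulk(conn, labels):
    """Insert all rows through a single executemany call and one commit."""
    conn.executemany(
        "INSERT INTO node (label) VALUES (?)",
        [(label,) for label in labels],
    )
    conn.commit()


def count_nodes(conn):
    """Return the number of rows currently stored in the node table."""
    return conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]


# Hypothetical in-memory database with a single toy table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, label TEXT)")

insert_one_by_one(conn, [f"single-{i}" for i in range(100)])
insert_bulk(conn, [f"bulk-{i}" for i in range(1000)])
print(count_nodes(conn))
```

Both functions store the same kind of rows; the bulk variant simply avoids the per-object round trip. A front-end bulk API would have to expose this pattern independently of the backend library underneath.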

Expected outcomes

This project will

  1. implement bulk insertion functionality in the AiiDA ORM to make the importing of data as efficient as possible
  2. make data import implementation independent of the low-level ORM backend (django / sqlalchemy) by passing through AiiDA's front-end ORM

Skills

This project will require the participant to work with the ORM of AiiDA, so an understanding of Object Relational Mappers is important. AiiDA's ORM is implemented with two different libraries, Django and SQLAlchemy, so previous experience with those is desirable but not required. Finally, AiiDA uses PostgreSQL as the Relational Database Management System (RDBMS), so basic knowledge and understanding of an SQL-type database would be of benefit.

Project 2 - Built-in support for codes encapsulated in containers (docker, shifter, singularity, ...)

Level: intermediate

AiiDA stores all calculation executions (including detailed information on inputs and outputs) in the form of a directed acyclic graph, where each calculation is represented as a node, and is linked to other data nodes representing the inputs and the outputs that it created. Outputs, in turn, can then be inputs of new calculations. This graph is generated automatically by AiiDA; by inspecting all the "ancestors" of a given data node in the graph, we have a complete description of the "provenance" of that data node, i.e. the full sequence of calculations (with their inputs) that led to its generation.
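The idea of reading provenance off the graph can be sketched with a toy example. The graph below is a hypothetical stand-in, not AiiDA's data model or query API: node names starting with `D` are data nodes, those starting with `C` are calculations, and `PARENTS` maps each node to the nodes it directly depends on.

```python
from collections import deque

# Hypothetical toy provenance graph: node -> nodes it was directly produced from.
PARENTS = {
    "D1": [],
    "D2": [],
    "C1": ["D1", "D2"],  # calculation C1 took D1 and D2 as inputs
    "D3": ["C1"],        # D3 is an output of C1
    "C2": ["D3"],        # D3 is in turn an input of C2
    "D4": ["C2"],
}


def ancestors(node, parents):
    """Return all nodes reachable from `node` by following links backwards."""
    seen = set()
    queue = deque(parents[node])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents[current])
    return seen


# The provenance of D4 is the full chain of calculations and inputs behind it.
print(sorted(ancestors("D4", PARENTS)))
```

Because the graph is acyclic, the traversal always terminates; the `seen` set only prevents visiting a shared ancestor more than once.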

When a calculation is performed by an external code (e.g. a binary on a remote high-performance computer (HPC)), the code is included as an input of the calculation. As of today, codes in AiiDA are represented as "symlinks" to an existing executable on a remote computer, i.e., they contain a reference to the computer on which they are installed and the full path to the executable (plus some additional metadata, such as which dynamic libraries to load at runtime).

Recent years have seen an increasing adoption of containers (using technologies such as docker, singularity, shifter or sarus), including in the HPC domain, where executables are no longer compiled on the target machine but are compiled once and run in a portable, encapsulated environment. The encapsulation of the full run-time environment, together with the availability of global container registries, constitutes a major step forward in terms of reproducibility: storing the identifier of the container in the AiiDA graph makes it possible to directly re-run existing workflows without access to the computer where they were originally executed.

This project will make containerized codes first-class citizens in the AiiDA provenance graph, making it possible to re-run recorded workflows, even if simulation steps are run on different remote (super)computers.

Expected outcomes

This project will

  1. extend the Code class/interface in AiiDA, to define a code that is not necessarily already installed on a supercomputer, but may be pulled from a container registry on demand (e.g. DockerHub or some local registry in the supercomputer centre)
  2. implement routines to re-run workflows recorded in an existing AiiDA graph, with no parameters except on which computer to run.
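The difference between the two kinds of code can be sketched as follows. This is a minimal illustration, not AiiDA's `Code` class: the class names `InstalledCode` and `ContainerizedCode`, their fields, and the image and computer names are all hypothetical, and the command lines only assume the standard `docker run` and `singularity exec` invocations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstalledCode:
    """A code tied to one computer: a reference to the computer plus a path."""
    computer: str
    executable_path: str

    def command(self, target_computer):
        # An installed code cannot be used anywhere but on its own computer.
        if target_computer != self.computer:
            raise ValueError(f"code is only installed on {self.computer}")
        return [self.executable_path]


@dataclass(frozen=True)
class ContainerizedCode:
    """Hypothetical code identified by a container image rather than a computer."""
    image: str        # identifier of the image in a container registry
    executable: str   # command to run inside the container
    engine: str = "docker"

    def command(self, target_computer):
        # The image identifier is independent of the target computer.
        if self.engine == "docker":
            return ["docker", "run", "--rm", self.image, self.executable]
        if self.engine == "singularity":
            return ["singularity", "exec", f"docker://{self.image}", self.executable]
        raise ValueError(f"unsupported container engine: {self.engine}")


code = ContainerizedCode(image="example/simulation:1.0", executable="simulate")
print(code.command("cluster-a"))
print(code.command("cluster-b"))
```

Since the stored identifier does not mention a computer, the same recorded code can be resolved on any machine that provides a supported container engine, which is what makes re-running a recorded workflow elsewhere possible.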

Skills

The participant will need to work with the workflow engine of AiiDA. This requires advanced Python knowledge (including basic understanding of coroutines), as well as prior experience with container technologies (docker or singularity). Experience with job schedulers on clusters/supercomputers will be beneficial.

Project 3 - Upgrade tornado dependency of AiiDA and plumpy

Level: advanced

Coroutines, and asynchronous programming in general, are used in many Python web technologies, such as jupyter notebooks, voilà, bokeh etc. As web technologies evolve, so do libraries for asynchronous programming, such as tornado or the asyncio module of the Python standard library (available since Python 3.4).

AiiDA and the workflow library plumpy used by AiiDA's engine have not kept up with recent developments, forcing AiiDA to run with outdated versions of tornado and making it incompatible with the latest Python web technology. The growing use of AiiDA in jupyter notebooks and web applications on platforms like the AiiDA lab and the Materials Cloud makes it increasingly important to resolve this issue.

Expected outcomes

This project will:

  1. Replace tornado dependencies of plumpy by asyncio
  2. Replace tornado dependencies of aiida-core by asyncio
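As a rough picture of the target style, the sketch below shows native coroutines driven by asyncio only. Older tornado code typically expresses the same logic with the `@tornado.gen.coroutine` decorator and `yield`; here that becomes `async def` and `await`. The function names `run_step` and `main` are hypothetical and do not correspond to anything in plumpy or aiida-core.

```python
import asyncio


async def run_step(name, delay):
    """Stand-in for one asynchronous step of a process."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"{name} done"


async def main():
    # Run both steps concurrently; gather returns results in argument order,
    # regardless of which step finishes first.
    return await asyncio.gather(run_step("a", 0.02), run_step("b", 0.01))


results = asyncio.run(main())
print(results)
```

Relying only on the standard library event loop is what removes the pin on a specific tornado version.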

Optional bonus outcomes

Another dependency of AiiDA, the process & socket manager circus, is currently also only compatible with tornado up to v5. circus is used by many other open-source projects besides AiiDA, so updating this library would benefit not only AiiDA but many other open-source projects as well. If the primary goals and deliverables are reached before the end of the project, the knowledge the student has gained from migrating aiida-core and plumpy from tornado to asyncio can be applied to circus as well. The AiiDA mentors will help manage interactions with the circus maintainers.

Skills

The student will need experience with asynchronous programming and must be able to quickly dive into the plumpy & circus Python packages and grasp their inner workings in order to update their use of coroutines.

Project N - Your Idea Here

If you're already familiar with AiiDA and have your own idea on how to improve it, we're happy to consider it. In this case, please think about the steps of how you would go about attacking the problem so that we can draw up a rough work plan.

Mentorship

Available co-mentors are

We have an active Slack workspace & biweekly developer meetings.