Journaling a project with DataLad - GlascherLab/LabWiki GitHub Wiki
DataLad is a software tool built on Git and git-annex that enables the journaling of entire datasets (including large files, which usually cannot be saved to GitHub). It is very useful for:

- keeping a record of the different stages of a project,
- trying out different analyses in separate branches, which can later be merged or discarded,
- deriving subdatasets for specific preprocessing pipelines and analyses that can be published independently later on, and
- mirroring a dataset to an online repository.
All projects in the Glascher Lab should be managed using DataLad and BIDS, even if they are "only" behavioral studies. The goal of dataset management with DataLad is to create a detailed record of the progress of the project for yourself, but also for me (Jan), as I might have to continue working with the data after you have left the lab.
A brief conceptual overview of the features of DataLad can be found in the introductory chapter of the DataLad Handbook. The handbook provides a comprehensive tutorial that covers all the features of DataLad by creating an exemplary dataset. Working through the tutorial will likely take two days, but it is time well invested.
Adina Wagner, one of the developers of DataLad, gave a workshop at our institute a few years ago. You can find her online course here. The nice thing about this course is its focus on research data, and it covers additional topics (e.g. what makes a good file name?), but it is not as comprehensive as the handbook.
For a quick reference you can refer to this DataLad cheat sheet. Similarly, because DataLad is built on top of Git, most Git commands will work as well; see this Git cheat sheet for a quick reference.
## Getting Started
It is probably best practice to make a project folder a DataLad dataset even before you populate it with data files. This can be done using the following DataLad command:
```shell
cd /path/to/your/project/folder
datalad create -c text2git --description="Description and location of the project" .
```
The dot (`.`) refers to the current directory (to which you just `cd`'d and for which you have prepared the setgid bit). DataLad will therefore create its files in this folder. If you replace `.` with a name, DataLad will create a folder of that name in the current directory and create its files there.
If you want to create a DataLad dataset in a folder that already contains files belonging to a project, use the `-f` option:
```shell
cd /path/to/your/project/folder
datalad create -f -c text2git --description="Description and location of the project" .
```
Please check out Chapter 1 of the DataLad handbook for more details.
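If you want to try this out without touching a real project first, the following sketch creates a throwaway dataset in a temporary directory. The description text and identity values are placeholders, and the script skips itself if DataLad is not installed:

```shell
# Hypothetical demo: create a throwaway DataLad dataset in a temp directory.
# Skips gracefully if DataLad is not installed on this machine.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed -- skipping demo"; exit 0; }

# Git identity for the commits DataLad makes (placeholder values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)
cd "$tmp"
datalad create -c text2git --description "throwaway demo dataset" .
ls -a .datalad     # DataLad keeps its own configuration here
datalad status     # a freshly created dataset reports nothing to save
```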
## Obtaining information about the dataset status
Now you can populate your project folder with data files, code, documentation etc. You can check the status of your dataset with
```shell
datalad status
```
This will give a Git-like status report and list untracked (not yet saved) files and folders.
More detailed information about the last commit (`-n 1`), including a diff-like report, can be retrieved with
```shell
git log -p -n 1
```
You can also create a couple of very useful aliases in your `$HOME/.gitconfig` file that show a nicely formatted one-line version of the Git history. The `hist` alias displays the entire history, the `newest` alias shows only the last entry. These are the lines that need to be added to your `.gitconfig` file:
```ini
[alias]
    hist = log --oneline --graph --decorate --all
    newest = log --oneline -n 1
```
You can then display the Git/DataLad history with `git hist`, or only the latest entry with `git newest`.
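To see the aliases in action without touching a real project, here is a small sketch that defines them locally in a throwaway Git repository (the repository, identity, and commit message are made up for the demo; using `git config` without `--global` keeps the aliases out of your `~/.gitconfig`):

```shell
# Demo in a throwaway repository; the local config keeps ~/.gitconfig untouched.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Demo User"          # identity for the demo commit only
git config user.email "demo@example.com"
git config alias.hist "log --oneline --graph --decorate --all"
git config alias.newest "log --oneline -n 1"
echo "hello" > file.txt
git add file.txt
git commit -q -m "add file.txt"
git hist       # entire history as a one-line graph
git newest     # only the most recent commit
```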
## Logging changes to a dataset
This is arguably the most frequent operation in DataLad. Once you have completed a task (even a small one), you can log all new edits to the dataset (e.g. new files, but also changes to existing ones) with
```shell
datalad save -m "commit message"
```
The Handbook has some recommendations on how to write good commit messages and on what to do if you forget to include the commit message in your `datalad save` command.
### How often should I save changes to my dataset?
You should use `datalad save` frequently, whenever you have accomplished a small task. This will create a detailed record of everything you have done with your dataset.
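As a concrete sketch of this edit-and-save cycle: the file name and commit message below are made up, and the script skips itself if DataLad is not installed.

```shell
# Hypothetical edit-and-save cycle in a throwaway dataset.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed -- skipping demo"; exit 0; }

# Git identity for the commits DataLad makes (placeholder values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)
cd "$tmp"
datalad create -c text2git .
echo "participant_id" > participants.tsv      # placeholder data file
datalad status                                # reports the untracked file
datalad save -m "add participants table"      # logs the change
datalad status                                # now clean again
```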
## CHANGELOG.json file for documenting larger tasks
Even though the BIDS format does not explicitly call for a CHANGELOG file, I highly recommend starting one to document the provenance of files in the dataset (i.e. how files were created). In fact, a CHANGELOG file is also recommended by the YODA principles of data management. It can be used to keep a log of larger tasks that are implemented in the dataset (e.g. a particular preprocessing pipeline, a specific analysis, a certain computational model etc.). The CHANGELOG file should be a JSON file (consistent with the rest of the BIDS metadata), with entries like this:
```json
{
    "title": "CHANGELOG for matchpennies",
    "changes": [
        {
            "date": "MM-DD-YYYY",
            "task": "type of task (e.g. preprocessing, modeling, statistical analysis etc.)",
            "name": "specific name of this task",
            "author": "name of person conducting this task",
            "code": "relative path to script(s) for executing this task (use a string array [\"str1\", \"str2\", ...] for multiple scripts)",
            "figure": "relative path to figure(s) resulting from this task",
            "output": "relative path to other output files resulting from this task",
            "result": "brief description of the results of this task",
            "commit": "first 8 digits of the commit ID of the last 'datalad save' before adding this entry (see 'git log --oneline -n 1')"
        }
    ]
}
```
Entries can span multiple lines if needed. You can also add more fields to each record if you think it is useful for documenting the changes.
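Two small helpers can make maintaining the CHANGELOG less error-prone: `git rev-parse --short=8 HEAD` retrieves the abbreviated commit ID for the "commit" field, and Python's built-in `json.tool` checks that the file is still valid JSON after hand-editing. The repository and file content below are a minimal made-up demo:

```shell
# Demo in a throwaway repository (placeholder identity and commit).
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "demo commit"

# The 8-character commit ID for the "commit" field of a new entry
# (in a real project, run this after your last 'datalad save'):
git rev-parse --short=8 HEAD

# Validate the CHANGELOG after editing it (minimal made-up content):
printf '{"title": "CHANGELOG for demo", "changes": []}\n' > CHANGELOG.json
python3 -m json.tool CHANGELOG.json >/dev/null && echo "CHANGELOG.json is valid JSON"
```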
## Creating subdatasets
Once you have become familiar with the basic operation of DataLad, but before you actually start working with the data (preprocessing, statistics, modeling, etc.), you should think about where you want or should take your project in the (more distant) future. Most journals nowadays require that the analysis code (and often also the data) be uploaded to a public repository. But this does not necessarily mean that all the data (including the raw data files) need to be included. In fact, it might be advisable to withhold the original data files if there are other analyses and publications planned for the dataset. On the other hand, if you intend to publish the entire project as an open public dataset (e.g. in Nature Scientific Data) and adhere to Open Science principles, you have to include all data files (including the raw ones). Nevertheless, there are still files that should be kept private and apart from an open dataset (e.g. grant proposals, ethics proposals, and other confidential information).
However these aspects turn out (and the decision is often made at a later stage of the project), it is advisable to prepare for these different scenarios. With DataLad you can create subdatasets that are derived from a root dataset but are self-contained and can be copied, shared, and published by themselves. The online course on research data management showcases an example of this, and Chapter 6 of the handbook provides more details. It is good to think about these issues early (ideally before you start processing and analyzing the data) and to decide which meaningful subdatasets should be planned for in the course of the project. These issues should also be discussed in lab meetings and in jour fixe meetings with me.
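A minimal sketch of how such a subdataset is derived: the `-d .` option tells DataLad to register the new dataset inside the containing (super)dataset. All folder names here are hypothetical, and the script skips itself if DataLad is not installed.

```shell
# Hypothetical demo: register a subdataset inside a root (super)dataset.
command -v datalad >/dev/null 2>&1 || { echo "datalad not installed -- skipping demo"; exit 0; }

# Git identity for the commits DataLad makes (placeholder values)
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

tmp=$(mktemp -d)
cd "$tmp"
datalad create -c text2git superds            # the root dataset
cd superds
datalad create -d . -c text2git derivatives   # created AND registered as a subdataset
datalad subdatasets                           # lists 'derivatives'
```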