Reward File Management: Heudiconv (2018) - PennBBL/reward2018 GitHub Wiki

Reward File Info and Heudiconv

The REWARD dataset is a large dataset with various scan protocols collected over the course of a few years. The files can be in monstrum, chead, or XNAT. This preliminary part of revamping the REWARD dataset aims to answer the following questions:

1. What files currently exist in chead that are not in XNAT?
2. What are the scan protocols each participant went through for each scan session?
3. What are the parameters for the scan protocols the participants went through for each scan session?

The goal of point 1 is to ensure that all files we expect to have for the REWARD dataset actually exist in chead, where we can conduct our analyses. The goal of points 2 and 3 is to obtain the information we need to create a heuristic file that will structure our dataset into BIDS format.

About heudiconv and BIDS format

Briefly, heudiconv is a process that uses a heuristic to organize files into BIDS format. BIDS format is a standard structure that makes sharing neuroimaging data easier so that science can be made more reproducible. More information about all this can be found at this link: http://reproducibility.stanford.edu/bids-tutorial-series-part-2a/

Overview of steps

XNAT contains information about the protocols for each scan. Thus, interfacing with XNAT to obtain the number of scans in XNAT as well as the scan protocol names will help with creating a ground truth of what we should expect in our dataset.
However, chead may be missing some dicom files. To ensure that all dicom files are present for analyses, we check which files exist in chead so that we know which files are in XNAT but missing from chead.
heudiconv uses information about scan protocols to create a heuristic for organizing data such that it follows BIDS format structure (see section About heudiconv and BIDS format for more info). To create a reasonable heuristic, dicom information for each scan in chead is needed. This dicom information will provide the parameters for the scan protocols each scan included so that a heuristic file can be made.
Finally, once a heuristic is agreed upon, we run the final heudiconv script to organize the data into BIDS format (and validate that it, indeed, is in BIDS format).

Step 1: Interfacing with XNAT to obtain information about files

The goal of this step is to obtain information about the files in XNAT. Specifically, we want to know the following:

number of participants and scans in each project
number of scans for each scan protocol

We will need to know the project names. This is tricky because sometimes the project name listed on XNAT does not match its meta-data.

To be sure the project names match what we're looking for, click on your project name. Click on any subject.
On the right hand side in the small box Actions, click View XML.
Look for the line <xnat:share label="LABEL" project="PROJECTNAME" subject_ID="SUBID"/> (note: the caps represent what the value for the given variable should be). Make sure you use the value of project when you specify your project names.
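
If you save the XML from View XML to a file, the project value can also be pulled out programmatically. The following is a minimal Python 3 sketch, not part of the reward2018 scripts. The sample XML, its namespace URI, and the function name shared_projects are made up for illustration; only the xnat:share line mirrors the one described above.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for what "View XML" shows. The namespace URI and the
# surrounding elements are placeholders, not copied from a real subject.
SAMPLE_XML = """<xnat:Subject xmlns:xnat="http://example.org/xnat" ID="SUBID">
  <xnat:sharing>
    <xnat:share label="LABEL" project="PROJECTNAME" subject_ID="SUBID"/>
  </xnat:sharing>
</xnat:Subject>"""


def shared_projects(xml_text):
    """Return the project attribute of every share element in the XML."""
    root = ET.fromstring(xml_text)
    projects = []
    for element in root.iter():
        # ElementTree expands prefixed tags to "{namespace-uri}share";
        # compare only the local name so the exact URI does not matter.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "share" and "project" in element.attrib:
            projects.append(element.attrib["project"])
    return projects


print(shared_projects(SAMPLE_XML))
```

For a real subject, read the saved XML file into a string and pass it to the same function.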

Now, to interface with XNAT, we will need to use Python 2.7. Make sure Python 2.7 and all the dependencies for the scripts are installed. You can use your local computer for these steps. All the scripts needed are in /data/jux/BBL/reward2018/scripts/heudiconv/xnat.

Edit the xnat.cfg file with your bblxnat credentials.
Edit xnat2BIDS.py with the proper values and run it. It will output a *_query_info.tsv file that provides all the subject and scan information as well as which projects they belong to.
It will also output a *_scan_info.tsv file with the following information:
- the scan protocols (which are the headers of the tsv file)
- the order in which a subject received the scan protocol (note: each observation is a subject)

These output files for REWARD are located in /data/jux/BBL/projects/reward2018/results/xnat_info/

With these tsv files, you can now provide summaries of how many subjects are in each project and how many subjects are in each scan protocol (ACCORDING TO XNAT). From there, you can also find out if you have any missing subjects on chead. See Step 2.
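
As a rough illustration of this kind of summary, here is a small Python 3 sketch that counts subjects and scans per project from a query-info style tsv. It is not the xnat_summary.R script used in Step 2. The column names (project, subject, scan_id) and the sample rows are assumptions, so adjust them to the headers in your own *_query_info.tsv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; the real *_query_info.tsv may use different column names.
SAMPLE_TSV = (
    "project\tsubject\tscan_id\n"
    "PROJ_A\tsub01\t1001\n"
    "PROJ_A\tsub01\t1002\n"
    "PROJ_A\tsub02\t1003\n"
    "PROJ_B\tsub03\t1004\n"
)


def summarize(tsv_file, project_col="project", subject_col="subject"):
    """Count unique subjects and total scan rows for each project."""
    subjects = defaultdict(set)
    scans = defaultdict(int)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        subjects[row[project_col]].add(row[subject_col])
        scans[row[project_col]] += 1
    return {
        project: {"subjects": len(subjects[project]), "scans": scans[project]}
        for project in subjects
    }


summary = summarize(io.StringIO(SAMPLE_TSV))
for project, counts in sorted(summary.items()):
    print(project, counts["subjects"], counts["scans"])
```

To use it on a file, open the tsv with open(path, newline="") and pass the file handle instead of the StringIO object.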

Step 2: Obtaining information about the existence of files

The goal of this step is to know how many scans we have for each process that we will need as well as whether there are missing scans on chead. All scripts that you will need for this step are in /data/jux/BBL/projects/reward2018/scripts/subjectInfo. For any R scripts, you will need to source the script setup.R since all the functions depend on the ones written in there.

To summarise how many subjects are in each project and how many subjects are in each scan protocol, use the script xnat_summary.R. NOTE: It was easiest to use my local computer to do this part.
- This script needs your tsv files from Step 1. Make sure they're in the same directory.
- This will output a *_query_summary.csv that tells you number of scans for each project.
- This will also output *_scans_summary.csv that will tell you number of scans for each scan protocol for each project.
- These outputs are located in /data/jux/BBL/projects/reward2018/results/xnat_info/summaries/. You can modify the functions num.query and num.scans in the setup.R script.
To find whether there are scans on XNAT that are missing on chead, you will first need a csv file that contains all the subject IDs and all the scan IDs. An example of such a file is /data/jux/BBL/projects/reward2018/results/subjectInfo/cheadRewardDicoms_09-24-2018.csv. The script that generated it is extractCheadScanNames.sh.
- Now, the script searchForWhatsMissingOnCfn_10-08-2018.R uses the function findMissing to compare two vectors: our variable of interest (A) and our "ground truth" variable (B). In other words, we want to know whether A is missing datapoints that are in B. The function takes in A and B and outputs a dataframe of what's included and what's missing. This script looks for subjects that are on XNAT but missing on chead. It is fairly well commented, so you can look there for more info on all the steps I did.
  - The output of this script is /data/jux/BBL/projects/reward2018/results/subjectInfo/missingScanInfo/cfnMissingFromXnat_10-08-2018.csv.
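
The comparison that findMissing performs can be sketched in a few lines of Python 3. This is one reading of the description above, not a port of the R function: the name find_missing, the "included"/"missing" labels, and the sample IDs are all assumptions made for illustration.

```python
def find_missing(of_interest, ground_truth):
    """Label each ground-truth ID as included in or missing from of_interest.

    of_interest  -- the IDs we actually have (A), e.g. scans found on chead
    ground_truth -- the IDs we expect to have (B), e.g. scans listed on XNAT
    """
    present = set(of_interest)
    rows = []
    for item in sorted(set(ground_truth)):
        status = "included" if item in present else "missing"
        rows.append({"id": item, "status": status})
    return rows


# Invented IDs standing in for the chead and XNAT lists.
chead_ids = ["sub01", "sub02"]
xnat_ids = ["sub01", "sub02", "sub03"]

report = find_missing(chead_ids, xnat_ids)
missing_on_chead = [row["id"] for row in report if row["status"] == "missing"]
print(missing_on_chead)
```

Using sets keeps the lookup fast even when the ID lists are long, and sorting the ground-truth IDs makes the report order reproducible.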

Now, we preliminarily know what we should expect in terms of number of scans and what scans are missing. This will be informative if we want to download anything before we move to putting our data into BIDS format.

Step 3: Creating a heuristic file

To begin creating a heuristic file, we will need to obtain scan protocol information based on the dicoms that exist. This can be done using singularity on chead. The script that does this for the REWARD dataset is located in this path: /data/jux/BBL/projects/reward2018/scripts/heudiconv/dicomInfo.sh

This script binds the rawData containing dicoms with your home base.
To run this, you will need to change the username to yours.
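
For orientation, a dicom-info pass with heudiconv under singularity typically looks like the command assembled below. This Python 3 sketch only builds and prints the command; it does not run it and it is not the contents of dicomInfo.sh. The image name, the /base mount point, the dicom directory template, and the assumption that the image runs heudiconv as its entry point are all hypothetical. The flags themselves (-d, -o, -f convertall, -s, -ss, -c none, --overwrite) are standard heudiconv options.

```python
import shlex


def dicominfo_command(image, raw_data, subject, session):
    """Build (but do not run) a singularity + heudiconv dicom-info command."""
    return [
        "singularity", "run",
        "-B", raw_data + ":/base",  # bind rawData to /base (assumed mount point)
        image,
        # {subject} and {session} are filled in by heudiconv, not by Python.
        "-d", "/base/{subject}/{session}/*/*.dcm",  # hypothetical dicom layout
        "-o", "/base/output",
        "-f", "convertall",  # built-in heuristic that only catalogues series
        "-s", subject,
        "-ss", session,
        "-c", "none",  # no conversion, just write the dicominfo tsv
        "--overwrite",
    ]


command = dicominfo_command(
    image="heudiconv.simg",  # hypothetical image file
    raw_data="/data/jux/BBL/studies/reward/rawData",
    subject="SUBID",
    session="SESSIONID",
)
print(" ".join(shlex.quote(part) for part in command))
```

Printing the quoted command first makes it easy to check the bind path and template before running anything on chead.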

You should now expect an output directory created in your rawData folder.

For reward, this is located in /data/jux/BBL/studies/reward/rawData/output and copied into /data/jux/BBL/projects/reward2018/results/heudiconv/output.
For this directory, there is a hidden .heudiconv directory that contains dicom information for each scan for each subject.
The dicom information is a tsv file that can be found in /data/jux/BBL/projects/reward2018/results/heudiconv/output/.heudiconv/{subID}/info/dicominfo_ses-*.tsv.

To check if you have dicom infos for all participants, you can use the script /data/jux/BBL/projects/reward2018/scripts/subjectInfo/missingScanInfo/scansOnCheadWithoutDicomInfo.sh. If you find any missing dicom info files, the dicoms themselves may be missing on chead. You can check this with the script scansOnCheadWithoutDicomFILE.sh. Outputs for the reward dataset are located in /data/jux/BBL/projects/reward2018/results/subjectInfo/missingScanInfo.
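
The idea behind this check can be sketched in Python 3 as follows. This is not the shell script named above. The function name and the demo directory are made up; only the {subID}/info/dicominfo_ses-*.tsv layout comes from the path given earlier.

```python
import tempfile
from pathlib import Path


def subjects_without_dicominfo(heudiconv_dir, subject_ids):
    """Return the subject IDs that have no info/dicominfo_ses-*.tsv file."""
    missing = []
    for subject in subject_ids:
        info_dir = Path(heudiconv_dir) / subject / "info"
        has_info = info_dir.is_dir() and any(info_dir.glob("dicominfo_ses-*.tsv"))
        if not has_info:
            missing.append(subject)
    return missing


# Demo on a throwaway directory: sub01 has a dicom info file, sub02 does not.
with tempfile.TemporaryDirectory() as tmp:
    info_dir = Path(tmp) / "sub01" / "info"
    info_dir.mkdir(parents=True)
    (info_dir / "dicominfo_ses-1.tsv").write_text("series_id\tprotocol_name\n")
    missing = subjects_without_dicominfo(tmp, ["sub01", "sub02"])

print(missing)
```

On chead, heudiconv_dir would be the hidden .heudiconv directory inside the output folder, and subject_ids would come from your list of subjects on chead.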

To look at the information in these dicom infos more cleanly, I used R. Briefly, R displays a clean view of the tsv file as a dataframe so that a heuristic file can be made. If you source the script setup.R from Step 2, you can just use the function read.tsv("NAME_OF_TSV_FILE") to easily read dicom info files.
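
If you prefer Python to R for this inspection, the same view can be had with the standard csv module. The sketch below is Python 3 and is an illustration only. The column names follow heudiconv's dicominfo layout (series_id, protocol_name, dim1 to dim4, TR, TE), while the sample values and the function name read_dicominfo are invented.

```python
import csv
import io

# Made-up rows; only the column names follow heudiconv's dicominfo layout.
SAMPLE_DICOMINFO = (
    "series_id\tprotocol_name\tdim1\tdim2\tdim3\tdim4\tTR\tTE\n"
    "2-MPRAGE\tMPRAGE\t192\t256\t160\t1\t1.81\t3.51\n"
    "5-rest\trest_bold\t64\t64\t46\t120\t3.0\t32.0\n"
)

COLUMNS = ("series_id", "protocol_name", "dim1", "dim2", "dim3", "dim4", "TR", "TE")


def read_dicominfo(tsv_file, columns=COLUMNS):
    """Read a dicominfo tsv and keep only the columns useful for a heuristic."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return [{name: row[name] for name in columns} for row in reader]


rows = read_dicominfo(io.StringIO(SAMPLE_DICOMINFO))
for row in rows:
    print(row["series_id"], row["protocol_name"], row["dim4"], row["TR"], row["TE"])
```

All values come back as strings, so convert dimensions and timings to numbers before comparing them.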

To create a robust heuristic for each scan protocol, I did the following:

For each project, I looked at three subjects.
For each subject, I created a BIDS compliant scan name and recorded the scan protocol name from the dicom info tsv files that corresponds to this BIDS compliant scan name.
For each scan protocol, I looked at the dicom info to extract parameters that could distinguish different protocols (e.g., dimensions, TR, TE). A csv version of this heuristic can be found in /data/jux/BBL/projects/reward2018/results/heudiconv/forHeuristic/reward_heuristic.csv.

I then used this csv to write if-then statements for labeling different scan protocols. The resulting heuristic file is located in /data/jux/BBL/projects/reward2018/scripts/heudiconv/reward_heuristic.py
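
To give a sense of what such if-then statements look like, here is a minimal heuristic sketch in the usual heudiconv shape (a create_key helper plus an infotodict function). It is not the contents of reward_heuristic.py. The protocol names, dimension thresholds, BIDS templates, and the SeqInfo stand-in used for the demo are all hypothetical.

```python
from collections import namedtuple


def create_key(template, outtype=("nii.gz",), annotation_classes=None):
    """Standard heudiconv helper that describes one BIDS output name."""
    if not template:
        raise ValueError("Template must be a valid format string")
    return template, outtype, annotation_classes


def infotodict(seqinfo):
    """Map each series to a BIDS name using if-then rules (hypothetical rules)."""
    t1w = create_key("sub-{subject}/{session}/anat/sub-{subject}_{session}_T1w")
    rest = create_key(
        "sub-{subject}/{session}/func/sub-{subject}_{session}_task-rest_bold"
    )
    info = {t1w: [], rest: []}
    for s in seqinfo:
        protocol = s.protocol_name.lower()
        # Assumed rule: structural scans are MPRAGE with 160 slices.
        if "mprage" in protocol and s.dim3 == 160:
            info[t1w].append(s.series_id)
        # Assumed rule: resting-state runs have more than 100 volumes.
        elif "rest" in protocol and s.dim4 > 100:
            info[rest].append(s.series_id)
    return info


# Stand-in for heudiconv's seqinfo entries, with only the fields used above.
SeqInfo = namedtuple("SeqInfo", "series_id protocol_name dim3 dim4")
demo = [
    SeqInfo("2-MPRAGE", "MPRAGE", 160, 1),
    SeqInfo("5-rest", "rest_bold", 46, 120),
    SeqInfo("6-rest", "rest_bold", 46, 10),  # too short, matches no rule
]
labelled = infotodict(demo)
for key, series in labelled.items():
    print(key[0], series)
```

Combining the protocol name with a parameter such as the number of volumes helps keep aborted or partial runs from being labelled as complete scans.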

Step 4: Running heudiconv and checking BIDS format

Run the script runHeudiconv.sh which uses the heuristic python script reward_heuristic.py to format the directory structure into BIDS format.
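
After the conversion, a quick structural check can catch obvious problems before a full validation. The Python 3 sketch below is an assumption-laden illustration, not part of runHeudiconv.sh and not a replacement for a proper BIDS validator. It only checks that dataset_description.json exists and that each NIfTI file name starts with its subject label. The function name and the demo files are made up.

```python
import re
import tempfile
from pathlib import Path

SUBJECT_DIR = re.compile(r"sub-[A-Za-z0-9]+")


def quick_bids_check(bids_root):
    """Return a list of obvious structural problems in a BIDS directory."""
    root = Path(bids_root)
    problems = []
    if not (root / "dataset_description.json").is_file():
        problems.append("missing dataset_description.json")
    subject_dirs = sorted(
        p for p in root.iterdir() if p.is_dir() and p.name.startswith("sub-")
    )
    if not subject_dirs:
        problems.append("no sub-* directories found")
    for subject_dir in subject_dirs:
        if not SUBJECT_DIR.fullmatch(subject_dir.name):
            problems.append("bad subject directory name: " + subject_dir.name)
        for nifti in sorted(subject_dir.rglob("*.nii.gz")):
            # BIDS file names begin with the subject label, e.g. sub-01_...
            if not nifti.name.startswith(subject_dir.name + "_"):
                relative = nifti.relative_to(root).as_posix()
                problems.append(relative + " does not start with " + subject_dir.name + "_")
    return problems


# Demo on a throwaway directory with one good file and one badly named file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset_description.json").write_text("{}")
    good = root / "sub-01" / "ses-1" / "anat"
    good.mkdir(parents=True)
    (good / "sub-01_ses-1_T1w.nii.gz").write_text("")
    bad = root / "sub-02" / "anat"
    bad.mkdir(parents=True)
    (bad / "T1w.nii.gz").write_text("")
    problems = quick_bids_check(root)

for problem in problems:
    print(problem)
```

An empty list from this check does not mean the dataset is BIDS compliant; it only means none of these basic checks failed.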