Reward File Management: Heudiconv (2018) - PennBBL/reward2018 GitHub Wiki
Reward File Info and Heudiconv
The REWARD dataset is a large dataset with various scan protocols collected over the course of a few years. The files can be in either monstrum, chead, or XNAT. The aim of this part of the preliminary process of revamping the REWARD dataset is to obtain information about the following:
- What files currently exist in chead that are not in XNAT?
- What are the scan protocols each participant went through for each scan session?
- What are the parameters for the scan protocols the participants went through for each scan session?
The goal of point 1 is to ensure that all files we expect to have for the REWARD dataset actually exist in chead, where we can conduct our analyses. The goal of points 2 and 3 is to obtain the information we need to create a heuristic file that will structure our dataset into BIDS format.
About heudiconv and BIDS format
Briefly, heudiconv is a process that uses a heuristic to organize files into BIDS format. BIDS format is a standard structure that makes sharing neuroimaging data easier so that science can be made more reproducible. More information about all this can be found at this link: http://reproducibility.stanford.edu/bids-tutorial-series-part-2a/
Overview of steps
- XNAT contains information about the protocols for each scan. Thus, interfacing with XNAT to obtain the number of scans in XNAT as well as the scan protocol names will help with creating a ground truth of what we should expect in our dataset.
- However, it could well be the case that chead is missing some dicom files. To ensure that all dicom files are present for analyses, obtaining information about the existence of files in chead will be useful in knowing what files are missing (but that are in XNAT).
- heudiconv uses information about scan protocols to create a heuristic for organizing data such that it follows BIDS format structure (see section About heudiconv and BIDS format for more info). To create a reasonable heuristic, dicom information for each scan in chead is needed. This dicom information will provide the parameters for the scan protocols each scan included so that a heuristic file can be made.
- Finally, once a heuristic is agreed upon, we run the final heudiconv script to organize the data into BIDS format (and validate that it, indeed, is in BIDS format).
Step 1: Interfacing with XNAT to obtain information about files
The goal of this step is to obtain information about the files in XNAT. Specifically, we want to know the following:
- number of participants and scans in each project
- number of scans for each scan protocol
We will need to know the project names. This is tricky because sometimes the project name listed on XNAT does not match its meta-data.
- To be sure the project names match what we're looking for, click on your project name. Click on any subject.
- On the right hand side in the small box Actions, click View XML.
- Look for the line <xnat:share label="LABEL" project="PROJECTNAME" subject_ID="SUBID"/> (note: the caps represent what the value for the given variable should be). Make sure you use the value of project when you specify your project names.
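If you save the View XML page to a file, the project value can also be pulled out programmatically. The following is only a minimal Python 3 sketch and is not one of the scripts in this repo: the function name `shared_projects`, the sample XML, and the namespace URI in it are assumptions for illustration. It matches on the local tag name `share`, so it should not depend on the exact namespace your XNAT instance declares.

```python
import xml.etree.ElementTree as ET


def shared_projects(xml_text):
    """Return the project attribute of every <xnat:share> element."""
    root = ET.fromstring(xml_text)
    projects = []
    for elem in root.iter():
        # Parsed tags look like "{namespace-uri}share"; keep the local name only.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "share" and "project" in elem.attrib:
            projects.append(elem.attrib["project"])
    return projects


# Hypothetical, trimmed-down stand-in for a saved View XML page.
SAMPLE_XML = """<xnat:Subject xmlns:xnat="http://nrg.wustl.edu/xnat" ID="SUBID">
  <xnat:sharing>
    <xnat:share label="LABEL" project="PROJECTNAME" subject_ID="SUBID"/>
  </xnat:sharing>
</xnat:Subject>"""

print(shared_projects(SAMPLE_XML))
```

With the sample above, the only value returned is the placeholder PROJECTNAME; for a real subject it would be the project name to use in the next steps.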
Now, to interface with XNAT, we will need to use Python 2.7. Make sure that it and all the dependencies for the scripts are installed. You can use your local computer for these steps. All the scripts needed are in /data/jux/BBL/reward2018/scripts/heudiconv/xnat.
- Edit the xnat.cfg file with your bblxnat credentials.
- Edit xnat2BIDS.py with the proper values. This will output a *_query_info.tsv file that provides all the subject and scan information as well as which projects they belong to.
- This will output a *_scan_info.tsv file with the following information:
  - the scan protocols (which are the headers of the tsv file)
  - the order in which a subject received the scan protocol (note: each observation is a subject)
These output files for REWARD are located in /data/jux/BBL/projects/reward2018/results/xnat_info/
With these tsv files, you can now provide summaries of how many subjects are in each project and how many subjects are in each scan protocol (ACCORDING TO XNAT). From there, you can also find out if you have any missing subjects on chead. See Step 2.
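As a rough illustration of that kind of summary, here is a minimal Python 3 sketch that counts how many subjects have an entry for each scan protocol in a *_scan_info.tsv style file. It is not the R code used in Step 2. The layout is assumed from the description above (one row per subject, one column per scan protocol), and the column names, the `subject` ID column, and the treatment of empty or `NA` cells as "not scanned" are all assumptions.

```python
import csv
import io


def subjects_per_protocol(tsv_file, id_column):
    """Count the subjects with a non-empty entry under each protocol column."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = {}
    for row in reader:
        for protocol, value in row.items():
            # Skip the ID column and any overflow cells (key None).
            if protocol is None or protocol == id_column:
                continue
            counts.setdefault(protocol, 0)
            # Assumption: an empty or "NA" cell means the subject has no such scan.
            if value is not None and value.strip() not in ("", "NA"):
                counts[protocol] += 1
    return counts


# Hypothetical two-subject file; real protocol names will differ.
SAMPLE_TSV = "subject\tmprage\tbold_task\n001\t1\t2\n002\t1\t\n"

print(subjects_per_protocol(io.StringIO(SAMPLE_TSV), "subject"))
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` sample.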
Step 2: Obtaining information about the existence of files
The goal of this step is to know how many scans we have for each process that we will need as well as whether there are missing scans on chead. All the scripts that you will need for this step are in /data/jux/BBL/projects/reward2018/scripts/subjectInfo. For any R scripts, you will need to source the script setup.R since all the functions depend on the ones written in there.
- To summarise how many subjects are in each project and how many subjects are in each scan protocol, use the script xnat_summary.R. NOTE: It was easiest to use my local computer to do this part.
  - This script needs your tsv files from Step 1. Make sure they're in the same directory.
  - This will output a *_query_summary.csv that tells you number of scans for each project.
  - This will also output *_scans_summary.csv that will tell you number of scans for each scan protocol for each project.
  - These outputs are located in /data/jux/BBL/projects/reward2018/results/xnat_info/summaries/

  You can modify the functions num.query and num.scans in the setup.R script.
- To find whether there are scans on XNAT that are missing on chead, you will first need a csv file that contains all the subject IDs and all the scan IDs. An example of such a file is /data/jux/BBL/projects/reward2018/results/subjectInfo/cheadRewardDicoms_09-24-2018.csv. The code that did this is extractCheadScanNames.sh.
  - Now, the script searchForWhatsMissingOnCfn_10-08-2018.R will use the function findMissing to look for everything in one vector (A) and see if it is missing in another (B). In other words, we want to know if our variable of interest (A) is missing datapoints in our "ground truth" variable (B). Thus, this function takes in the A and B and outputs a dataframe of what's included and what's missing. This script looks for subjects missing on chead that are on XNAT. This script is fairly well commented so you can look there for more info on all the steps I did.
    - The output of this script is /data/jux/BBL/projects/reward2018/results/subjectInfo/missingScanInfo/cfnMissingFromXnat_10-08-2018.csv.
Now, we preliminarily know what we should expect in terms of number of scans and what scans are missing. This will be informative if we want to download anything before we move to putting our data into BIDS format.
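To make the included/missing comparison concrete, here is a minimal Python 3 sketch of the same idea. It is not the findMissing function from setup.R (that one is written in R, and I have not reproduced its arguments or output columns). The names `find_missing`, `ground_truth` and `candidate`, and the example IDs, are all hypothetical.

```python
def find_missing(ground_truth, candidate):
    """Label every ID in ground_truth as included in, or missing from, candidate."""
    candidate_ids = set(candidate)
    rows = []
    for item in sorted(set(ground_truth)):
        status = "included" if item in candidate_ids else "missing"
        rows.append({"id": item, "status": status})
    return rows


# Hypothetical scan IDs: XNAT plays the ground truth, chead is what we check.
xnat_scans = ["1001", "1002", "1003"]
chead_scans = ["1001", "1003"]

for row in find_missing(xnat_scans, chead_scans):
    print(row["id"], row["status"])
```

Here the ground truth is the XNAT list, so anything labeled missing is a scan that exists on XNAT but was not found on chead.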
Step 3: Creating a heuristic file
To begin creating a heuristic file, we will need to obtain scan protocol information based on the dicoms that exist. This can be done using singularity on chead. The script that does this for the REWARD dataset is located in this path: /data/jux/BBL/projects/reward2018/scripts/heudiconv/dicomInfo.sh
- This script binds the rawData containing dicoms with your home base
- To run this, you will need to change the username to yours
You should now expect an output directory created in your rawData folder.
- For reward, this is located in /data/jux/BBL/studies/reward/rawData/output and copied into /data/jux/BBL/projects/reward2018/results/heudiconv/output.
- For this directory, there is a hidden .heudiconv directory that contains dicom information for each scan for each subject.
- The dicom information is a tsv file that can be found in /data/jux/BBL/projects/reward2018/results/heudiconv/output/.heudiconv/{subID}/info/dicominfo_ses-*.tsv.
To check if you have dicom infos for all participants, you can use the script /data/jux/BBL/projects/reward2018/scripts/subjectInfo/missingScanInfo/scansOnCheadWithoutDicomInfo.sh. If you find any missing dicom info files, it could be the case that the dicoms themselves are missing on chead. You can check this with the script scansOnCheadWithoutDicomFILE.sh. Outputs for the reward dataset are located in /data/jux/BBL/projects/reward2018/results/subjectInfo/missingScanInfo.
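For reference, the same kind of check can be sketched in Python 3. This is not what scansOnCheadWithoutDicomInfo.sh does internally (I have not reproduced that script). It only assumes the .heudiconv/{subID}/info/dicominfo_ses-*.tsv layout described above. The function name and the example subject IDs are hypothetical.

```python
from pathlib import Path


def subjects_without_dicominfo(heudiconv_dir, subject_ids):
    """Return the subject IDs with no dicominfo_ses-*.tsv under {subID}/info."""
    heudiconv_dir = Path(heudiconv_dir)
    missing = []
    for sub_id in subject_ids:
        info_dir = heudiconv_dir / sub_id / "info"
        # A subject counts as missing if the info directory is absent or empty
        # of dicominfo files.
        has_info = info_dir.is_dir() and any(info_dir.glob("dicominfo_ses-*.tsv"))
        if not has_info:
            missing.append(sub_id)
    return missing


# Example call (hypothetical subject IDs):
# subjects_without_dicominfo(
#     "/data/jux/BBL/projects/reward2018/results/heudiconv/output/.heudiconv",
#     ["10001", "10002"],
# )
```

Passing the expected subject list explicitly means a subject with no directory at all is also reported, rather than silently skipped.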
To more cleanly look at information from these dicom infos, I used R. Briefly, R displays a clean view of the tsv file as a dataframe so that a heuristic file can be made. If you source the script setup.R from Step 2, you can just use the function read.tsv("NAME_OF_TSV_FILE") to easily read the dicom info files.
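If you would rather skim a dicom info file outside of R, the following Python 3 sketch lists the distinct combinations of protocol name, dimensions, TR and TE in one tsv. It is only an illustration. The column names (`protocol_name`, `dim1` to `dim4`, `TR`, `TE`) are assumed to match the header heudiconv writes, so check them against your own file. The sample rows and their values are made up.

```python
import csv
import io

# Assumed column names; verify against the header of your dicominfo tsv.
FIELDS = ("protocol_name", "dim1", "dim2", "dim3", "dim4", "TR", "TE")


def distinct_protocols(tsv_file, fields=FIELDS):
    """Return the unique value combinations of `fields`, in first-seen order."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    seen = []
    for row in reader:
        key = tuple(row.get(field, "") for field in fields)
        if key not in seen:
            seen.append(key)
    return seen


# Made-up rows standing in for a dicominfo_ses-*.tsv file.
SAMPLE_TSV = (
    "series_id\tprotocol_name\tdim1\tdim2\tdim3\tdim4\tTR\tTE\n"
    "2-mprage\tmprage\t256\t256\t176\t1\t1.81\t3.51\n"
    "3-rest\trest_bold\t64\t64\t46\t124\t3.0\t32.0\n"
    "4-rest\trest_bold\t64\t64\t46\t124\t3.0\t32.0\n"
)

for combo in distinct_protocols(io.StringIO(SAMPLE_TSV)):
    print(combo)
```

Values are kept as strings exactly as they appear in the file, which is enough for eyeballing which parameters separate one protocol from another.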
To create a robust heuristic for each scan protocol, I did the following:
- For each project, I looked at three subjects.
- For each subject, I created a BIDS compliant scan name and recorded the scan protocol name from the dicom info tsv files that corresponds to this BIDS compliant scan name.
- For each scan protocol, I looked at the dicom info to extract parameters that could distinguish different protocols (e.g., dimensions, TR, TE). A csv version of this heuristic can be found in /data/jux/BBL/projects/reward2018/results/heudiconv/forHeuristic/reward_heuristic.csv.
I then used this csv to create if-then statements for labeling the different scan protocols. The resulting heuristic file is located in /data/jux/BBL/projects/reward2018/scripts/heudiconv/reward_heuristic.py.
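For readers who have not seen a heudiconv heuristic before, here is a minimal Python sketch of the general shape such a file takes: a `create_key` helper plus an `infotodict(seqinfo)` function holding the if-then statements. This is not the contents of reward_heuristic.py. The protocol name substrings (`mprage`, `rest`), the dimension thresholds, and the BIDS names are hypothetical placeholders. The `FakeSeq` tuple only exists so the sketch can be run without heudiconv.

```python
from collections import namedtuple


def create_key(template, outtype=("nii.gz",), annotation_classes=None):
    """Build the (template, outtype, annotation_classes) key heudiconv expects."""
    if not template:
        raise ValueError("Template must be a valid format string")
    return template, outtype, annotation_classes


def infotodict(seqinfo):
    """Map each series to a BIDS name using if-then rules on its dicom info."""
    t1w = create_key("sub-{subject}/{session}/anat/sub-{subject}_{session}_T1w")
    rest = create_key(
        "sub-{subject}/{session}/func/sub-{subject}_{session}_task-rest_bold"
    )
    info = {t1w: [], rest: []}
    for s in seqinfo:
        name = s.protocol_name.lower()
        # Hypothetical rules; real ones come from the parameters in the csv.
        if "mprage" in name and s.dim3 > 100:
            info[t1w].append(s.series_id)
        elif "rest" in name and s.dim4 > 50:
            info[rest].append(s.series_id)
    return info


# Stand-in for heudiconv's per-series records, with only the fields used above.
FakeSeq = namedtuple("FakeSeq", "series_id protocol_name dim3 dim4")
example = [
    FakeSeq("2-mprage", "MPRAGE_TI1100", 176, 1),
    FakeSeq("3-rest", "bbl1_restbold", 46, 124),
    FakeSeq("4-localizer", "localizer", 3, 1),
]

for key, series in infotodict(example).items():
    print(key[0], series)
```

Combining a name match with a parameter check (dimensions, TR, TE) is what keeps the rules robust when two protocols share a similar name, which is why the parameters were recorded for each protocol above.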
Step 4: Running heudiconv and checking BIDS format
Run the script runHeudiconv.sh which uses the heuristic python script reward_heuristic.py to format the directory structure into BIDS format.