RatCatcher - hasselmonians/knowledge-base GitHub Wiki
RatCatcher
A general utility for parsing data and passing it to analysis scripts to be run on a high-performance computing cluster.
In its simplest form, RatCatcher
is an uncomplicated class that contains all the relevant information to find data on the cluster and produce batch files. You can use it to create a batch script that can be run on a high-performance computing cluster from your raw data and a custom analysis function, and then gather your data into a table afterwards.
It is agnostic to the format of the data, and to the type of analysis performed,
so that RatCatcher
can be used to perform any number of analyses on any type of data.
RatCatcher
scales well with increasing numbers of data files.
What does RatCatcher actually do?
RatCatcher
generates files on a high-performance computing cluster (or local directory)
that allow you to submit a batch job that will perform the same analysis function on many data files,
and then gather the data afterwards into a table.
Once the RatCatcher object is set up, running the batchify()
function
will create a filenames.txt
file that contains a list of file paths to your raw data files,
a filecodes.csv
file that contains any numerical information needed to access the raw data,
and a batchscript.sh
script that will be submitted to execute the jobs on the cluster.
This batch script will perform the chosen analysis on each file listed in filenames.txt
.
After the jobs have run, RatCatcher
can gather the data into a table in your local MATLAB prompt.
How do I install it?
The best way is to clone the repository, or to download and unzip. Then, just add it to your MATLAB path.
It is dependent on mtools
, so you will need that as well.
A (contrived) usage example
This is how you can test RatCatcher
on your setup.
Run script.m
in RatCatcher/test/
.
You will need to configure the paths first (see the script for details).
Then, log into the cluster,
and navigate to the directory you specified as r.remotepath
.
Then run the batch script by:
qsub batchscript-test-Test.sh
This should perform the trivial task of copying [1, 2] into output files. You can then gather the data locally with:
data_table = r.gather;
A (real) usage example
In this example, we will load up MATLAB on our local computer,
and tell RatCatcher
to
- collect from Caitlin's "A" and "B" data sets, and
- perform the "Bandwidth Estimator" analysis.
Then, on the cluster, we will
- run the auto-generated script.
Back in local MATLAB, we will
- gather the data into a table.
Create the RatCatcher
object.
At minimum, the following fields need to be filled out.
r = RatCatcher;
r.expID = {'Caitlin', 'A'; 'Caitlin', 'B'};
r.remotepath = '/projectnb/mypath2folder/cluster';
r.localpath = '/mnt/mypath2folder/cluster';
r.protocol = 'BandwidthEstimator';
Create the batch scripts.
This will generate files in r.localpath
.
r = r.batchify();
Go onto the cluster and submit the script.
ssh [email protected]
cd /projectnb/mypath2folder/cluster
qsub Caitlin-A-Caitlin-B-BandwidthEstimator.sh
Wait until the run has completed, then
data_table = r.gather;
data_table = r.stitch(data_table);
This will gather the data into a table
in MATLAB on your local computer.
stitch
ing appends the full path of the raw data.
What does RatCatcher actually do?
Behind the scenes, this is how RatCatcher
works:
- It finds the data specified by the object's properties using the parse function.
- Then, it stores the filenames to the data in filenames.txt, and the file codes in filecodes.csv. The actual names of the files are a bit longer, incorporating the experimental ID and the protocol, but each file will begin with filenames or filecodes respectively.
- In addition, a batch script is created.
The batch script is a shell script that specifies options to the job scheduler (SGE) on the cluster.
The batchify
function also fills out several details, such as the name of the job. The most important
job of the batch script is to tell the cluster to run MATLAB and a function called the batch function.
This is a MATLAB function that performs the requisite analysis.
Then, you can run the batch script on the cluster by submitting it using qsub
.
It will evaluate the batch function over all specified files.
# on the cluster
qsub scriptname.sh
Once the script finishes running, output files are produced in the directory specified by RatCatcher
's remotepath
.
You can gather the data into a table
in MATLAB with
data_table = r.gather();
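To make the gathering step concrete, here is a minimal sketch of collecting per-job output files into a table. It is not RatCatcher's gather implementation: the output-demo-N.mat naming, the result variable, and the temporary directory are all assumptions made up for illustration.

```matlab
% Simulate three per-job output files (hypothetical names and contents).
outdir = tempname;
mkdir(outdir);
for k = 1:3
    result = [k, 2 * k];
    save(fullfile(outdir, sprintf('output-demo-%d.mat', k)), 'result');
end

% Collect every output file into one table, keyed by job index.
listing = dir(fullfile(outdir, 'output-demo-*.mat'));
index   = zeros(numel(listing), 1);
results = zeros(numel(listing), 2);
for k = 1:numel(listing)
    s   = load(fullfile(outdir, listing(k).name));
    tok = regexp(listing(k).name, '(\d+)\.mat$', 'tokens', 'once');
    index(k)      = str2double(tok{1});   % job index parsed from the file name
    results(k, :) = s.result;
end
data_table = sortrows(table(index, results), 'index');
disp(data_table)
```

Keying each row by the job index keeps the table aligned with the order of the filenames list, whatever order the files are found in.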
Class properties
filenames
will be set automatically when you run batchify
, though you can also generate your own with the static build
function.
If expID
is a row vector cell array, then filenames
is a column vector cell array of full file paths to the raw data.
If expID
is a matrix cell array, then filenames
is a column vector cell array of column vector cell arrays of full file paths to the raw data.
This allows for easier separation of data and processed results into chunks.
If there aren't a lot of special conditions or parameters in your data (i.e. you just have 100 experiments in the same condition),
then using a simple expID
(and thus a single cell array of filenames) is perfectly fine.
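As an illustration of the two shapes described above, the following sketch builds both by hand. The file paths are hypothetical, and the flattening step is just one way to work with the nested form, not something RatCatcher is known to do.

```matlab
% Row-vector expID: filenames is a flat column cell array of paths.
flat = {'/data/A/rec1.mat'; '/data/A/rec2.mat'};

% Matrix expID (one row per dataset): filenames is a column cell array
% whose elements are themselves column cell arrays of paths.
nested = { ...
    {'/data/A/rec1.mat'; '/data/A/rec2.mat'}; ...
    {'/data/B/rec1.mat'}};

% The nested form keeps datasets separate, but it can be flattened on demand.
all_files = vertcat(nested{:});
fprintf('%d datasets, %d files in total\n', numel(nested), numel(all_files));
```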
filecodes
is a field useful for storing numerical information that allows you to specify further within a data file. For example, if you had 100 recordings and kept track of cell and tetrode number, you might have a 100 x 2
matrix for your filecodes
. These properties are intended to be available to the batchify
function so that they can be written into the batch script that contains the function call to the batch function specified in protocol
that performs the actual analysis.
If expID is a row vector cell array, then filecodes is a column vector cell array of file codes.
If expID is a matrix cell array, then filecodes is a column vector cell array of column vector cell arrays of file codes.
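A small sketch of how file codes pair up with filenames, assuming (for illustration only) that the two columns hold cell number and tetrode number, as in the example above. The paths and values are made up.

```matlab
% One row of filecodes per entry in filenames (hypothetical values).
filenames = {'/data/rec1.mat'; '/data/rec1.mat'; '/data/rec2.mat'};
filecodes = [1 3; 2 3; 1 4];   % assumed layout: [cell, tetrode]

% The ii-th job uses the ii-th filename together with the ii-th row of codes.
ii = 2;
filename = filenames{ii};
cellnum  = filecodes(ii, 1);
tetrode  = filecodes(ii, 2);
fprintf('%s: cell %d on tetrode %d\n', filename, cellnum, tetrode);
```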
The expID field contains a character vector or cell array of character vectors that serves as an unambiguous identifier of the raw data to be analyzed.
Say you have data saved in some filesystem, where each subfolder indicates different conditions of an experiment (dosage, animal, setup, etc.). Perhaps within each of those folders, you have yet more subfolders. That is, your experiment can be classified by two or more identifiers (e.g. animal and date of experiment).
The expID field reads columns as increasing specificity and rows as more data.
For example, if you were working with Caitlin's dataset from this paper, your expID
would look something like this:
expID =
3×2 cell array
{'Caitlin'} {'A'}
{'Caitlin'} {'B'}
{'Caitlin'} {'C'}
This would indicate that this is Caitlin's data from clusters A
, B
, and C
. The power of the expID
is that as long as it is specified in the parse
function what to do with a certain expID
pattern, it works. You can also bypass the expID
process by delivering a list of filenames directly to the RatCatcher
functions. You can get a list of filenames with the RatCatcher.getFileNames()
function.
The protocol
field determines which analysis should be performed. RatCatcher
doesn't actually do any real calculations, but sets up the batch files needed to run the computations on a high-performance computing cluster. It searches your MATLAB path for a function named [protocol '.batchFunction'].
The localpath
field contains the absolute path to where the batch files should be placed (when on your local computer) and the remotepath
field contains the absolute path from the perspective of the high-performance computing cluster.
For instance, if you mounted your cluster on your local machine at /mnt/myproject/cluster/ then that is your localpath. If from the cluster's perspective (when accessing via ssh), your files are at /projectnb/myproject/cluster then that is your remotepath. If your local computer does not have the cluster mounted, localpath will be some path in your local file system and you will have to copy the files RatCatcher produces over to the remotepath before running the script on the cluster.
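The relationship between the two paths can be sketched as a simple prefix swap. This is only an illustration of the idea, using the example paths above and a made-up file name; it is not a RatCatcher function.

```matlab
% The same directory, seen from the local machine and from the cluster.
localpath  = '/mnt/myproject/cluster';
remotepath = '/projectnb/myproject/cluster';

% A file written locally (hypothetical name) ...
local_file = [localpath '/filenames-example.txt'];

% ... is visible on the cluster under the remote prefix.
remote_file = strrep(local_file, localpath, remotepath);
disp(remote_file)
```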
project is the name of the project on the cluster, i.e. the account that is billed for the compute usage.
Mode
Another important property is mode
.
The mode can be set to one of three values: 'array'
, 'parallel'
, or 'singular'
.
You can set the mode by changing the mode property:
% set the mode to 'array' (or 'parallel', or 'singular')
r.mode = 'array';
Array jobs
In the default 'array'
mode, RatCatcher
will generate an array job on the cluster,
which will use many nodes in parallel.
This can be combined with parallelism inside of the batch function,
so single-threaded and multi-threaded jobs can be performed many times faster.
The form for the batch function is:
function batchFunction(index, location, batchname, outfile, test)
...
end
Parallel jobs
In 'parallel'
mode, RatCatcher
partitions the array job out,
so that each node is multi-threaded, and each thread is handling a different input file.
This is very fast, usually at least about 200x faster than non-array, non-parallel jobs,
but the batch function can be difficult to code.
RatCatcher
can take advantage of parallel processing to speed up analyses for large datasets.
For costly analyses, this can dramatically speed up run-time,
since the cluster will use more cores at once, though still one per data file.
The nbins
property is automatically set, but can be manually set or changed as well.
For many files but not very time-consuming analysis, it is better to set the nbins
property
to be small, to limit the number of times MATLAB
is opened on compute nodes.
By default, the nbins
property is set to optimally use the cluster,
though the assumptions of optimality are not valid if each analysis takes less than two hours.
In 'parallel'
mode, RatCatcher
will expect the batch function to run in parallel.
A parallelized batch function has the following function signature:
function batchFunction(bin_id, bin_total, location, batchname, outfile, test)
The batchify
function will automatically set up the correct arguments for you.
Inside your function, however, you must call the getParallelOptions
function,
[bin_start, bin_finish] = RatCatcher.getParallelOptions(bin_id, bin_total, location, batchname)
and run your code inside of a parfor
loop, e.g.
parfor ii = bin_start:bin_finish
[filename, filecode] = RatCatcher.read(ii, location, batchname);
% do important calculations here
% then save your outfile
save(outfile, 'VariableName')
end
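To show what splitting files into bins means, here is a hypothetical partitioning scheme that divides the file indices into contiguous, nearly equal ranges. It is an assumption about the general idea, not the actual logic of getParallelOptions; num_files and the rounding rule are made up for the sketch.

```matlab
% Hypothetical: split 10 files across 3 bins as evenly as possible.
num_files = 10;
bin_total = 3;

% Bin edges: bin k covers indices edges(k)+1 through edges(k+1).
edges = round(linspace(0, num_files, bin_total + 1));

bins = zeros(bin_total, 2);
for bin_id = 1:bin_total
    bin_start  = edges(bin_id) + 1;
    bin_finish = edges(bin_id + 1);
    bins(bin_id, :) = [bin_start, bin_finish];
end
disp(bins)   % one row per bin: [bin_start, bin_finish]
```

Each bin then corresponds to one task of the array job, and the parfor loop inside the batch function covers only that bin's range.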
Singular jobs
Finally, you can run RatCatcher
in 'singular'
mode.
'singular'
mode doesn't set up an array job, nor does it automatically use parallelism.
This mode is useful when the analysis you're doing is very lightweight
and spawning many jobs (and loading up MATLAB hundreds of times) would take longer
than just iterating through a loop (perhaps even in parallel).
The form of the batch function is
function batchFunction(location, batchname, outfile, test)
...
end
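A rough sketch of what the body of a singular-mode batch function might do: read the whole filenames list once and loop over it in a single MATLAB session. The file naming (filenames-batchname.txt), the demo paths, and the placeholder "analysis" are assumptions for illustration, and the function arguments are shown as plain variables so that the snippet runs on its own.

```matlab
% Stand-ins for the arguments batchify would pass (hypothetical values).
location  = tempname;
mkdir(location);
batchname = 'demo-Example';
outfile   = fullfile(location, ['output-' batchname '.mat']);

% Stand-in for the filenames file that batchify would have written.
names_file = fullfile(location, ['filenames-' batchname '.txt']);
fid = fopen(names_file, 'w');
fprintf(fid, '%s\n', '/data/rec1.mat', '/data/rec2.mat', '/data/rec3.mat');
fclose(fid);

% Body of the batch function: one pass over every file.
filenames = strsplit(strtrim(fileread(names_file)), newline);
results   = zeros(numel(filenames), 1);
for ii = 1:numel(filenames)
    % Placeholder analysis: the length of the path string.
    results(ii) = numel(filenames{ii});
end
save(outfile, 'results');
```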
Other properties
There are a host of other properties
- filenames
- filecodes
- batchname
- batchfuncpath
- batchscriptpath
- verbose
which allow you to override options automatically set during the batching process.
By manually setting any of these class properties to be non-empty, RatCatcher
will use your preset instead.
The defaults are determined by running a series of functions
r.getFileNames()
r.getBatchScriptName()
r.getBatchScriptPath()
r.getBatchFuncPath()
which is performed inside the r.validate()
function, and is automatically performed during batchify
ing.
What's the difference between batchify and validate?
validate updates the RatCatcher object by updating the filenames, batch script name, batch script path, and batch function path properties.
batchify does all of this (it runs validate) and also creates batch files on the cluster in r.localpath.
The batchname
is the canonical kernel of text that appears in every file created by RatCatcher
.
By default, it is a combination of all the parts of the expID
and the protocol
,
so it should be unique to each experimental dataset and analysis method.
Files created by RatCatcher
have names like filenames-batchname.txt
.
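The naming pattern can be reproduced with a few lines of MATLAB. This sketch only mirrors the file name seen in the usage example earlier (Caitlin-A-Caitlin-B-BandwidthEstimator); how RatCatcher actually assembles the batchname internally is an assumption.

```matlab
expID    = {'Caitlin', 'A'; 'Caitlin', 'B'};
protocol = 'BandwidthEstimator';

% Transpose so that (:) walks each row of expID from left to right.
parts     = expID';
batchname = strjoin([parts(:)', {protocol}], '-');

% File names then embed the batchname.
names_file = ['filenames-' batchname '.txt'];
disp(names_file)
```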
The batch function is a MATLAB function (or a function in another language) that performs the analysis protocol
on each file containing the raw or preprocessed data.
It is determined from the name of the protocol
.
RatCatcher
will try to find a MATLAB class on your computer with the same name as the specified protocol
and thence find a function named batchFunction
.
You can also specify your own absolute path to any function on your MATLAB path to override this.
The batch script is the shell script that is run for each filename indicated. It sets up the run-time environment on the cluster and then invokes the batch function with the correct arguments.
The default script is RatCatcher-generic-script.sh
, but this can be changed
by substituting any absolute path to a shell script on your MATLAB path.
Pre-Processing
A general use of RatCatcher
looks something like this:
Set up your RatCatcher
object.
r = RatCatcher;
r.expID = {};
r.protocol = 'BandwidthEstimator';
r.remotepath = '/projectnb/mypath2folder/cluster';
r.localpath = '/mnt/mypath2folder/cluster';
r.project = 'hasselmogrp';
Then, batch your files. They will end up in r.localpath
.
r.batchify();
filenames = RatCatcher.getFileNames(identifiers, filesig, masterpath);
The identifiers
is very much like expID
except that it contains only the discrete filenames.
The filesig
is the pattern to search for within the files specified by identifiers
.
Use **
to indicate searching in all subfolders and *
to indicate searching for anything that matches the pattern. For example,
filesig = fullfile('**', '*.plx')
would find all files in the directory identifiers
and subdirectories that are Plexon (.plx
) files.
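The effect of such a pattern can be checked with MATLAB's built-in dir, which supports ** for recursive search (R2016b and later). Whether getFileNames uses dir internally is an assumption; the temporary directory and file names below are made up for the demonstration.

```matlab
% Build a small throwaway directory tree.
root = tempname;
mkdir(fullfile(root, 'sub'));
fclose(fopen(fullfile(root, 'a.plx'), 'w'));
fclose(fopen(fullfile(root, 'sub', 'b.plx'), 'w'));
fclose(fopen(fullfile(root, 'sub', 'notes.txt'), 'w'));

% '**' descends into all subfolders, '*.plx' matches any Plexon file.
filesig = fullfile('**', '*.plx');
listing = dir(fullfile(root, filesig));

% Full paths as a column cell array.
found = fullfile({listing.folder}, {listing.name})';
disp(found)
```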
Customizing parsing the raw data
The function parse
performs a different operation based on who the experimenter (and alphanumeric code) is. If you are not built into the RatCatcher
ecosystem yet, it is important to tell parse
what to do with you. You can do this in one of three ways:
- If you are a part of the Hasselmo, Howard, or Eichenbaum labs, send me an email. You know who I am.
- Generate a cell array of filenames and a list of filecodes by yourself and update the default (empty) properties in your RatCatcher object.
- Add a new experimenter name to the parse_core static method switch/case statement that expresses what to do to find the correct data, given an experimenter name.
Customizing your batching
You can also force batchify
to use custom data, locations, or scripts.
Under the basic usage, batchify
uses the parse
function to figure out
what filenames
and filecodes
you need based on the expID
provided
(with instructions detailed in the parse
function itself).
The function also uses
a default batch script, RatCatcher-generic-script.sh
.
This script requests a 16-core node and outputs an error and log file.
It also limits the run to 24 hours before terminating.
Then, it loads MATLAB 2018a and runs the batch function from the command line.
batchify
reads the properties from the RatCatcher
object,
so to override the defaults, update the RatCatcher
object's properties.
You can get extra feedback from the function by setting
r.verbose = true;
% r.batchify batches the files specified by the ratcatcher object
r.batchify();
Customizing your batch script
Each protocol has to have a batch function as a static class method or package function.
When you specify the analysis method (in r.protocol
), RatCatcher
will find the right batch function.
The function has to:
- Set up the MATLAB path on the cluster (if you are using MATLAB).
- Acquire the filenames and file codes from the .txt and .csv files generated by batchify.
- Read the data from a file using the filenames as a guide.
- Perform the analysis.
- Save the data.
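The steps above can be sketched as follows for a single array-job task. This is a self-contained stand-in, not a real protocol: the file names (filenames-batchname.txt, filecodes-batchname.csv), the demo contents, and the placeholder analysis are assumptions, and the batch function's arguments appear as plain variables so that the snippet runs on its own.

```matlab
% Stand-ins for the arguments passed to the batch function (hypothetical).
index     = 2;                       % would come from the job scheduler
location  = tempname;
mkdir(location);
batchname = 'demo-Example';
outfile   = fullfile(location, sprintf('output-%s-%d.mat', batchname, index));

% Stand-ins for the files batchify would have written.
names_file = fullfile(location, ['filenames-' batchname '.txt']);
codes_file = fullfile(location, ['filecodes-' batchname '.csv']);
fid = fopen(names_file, 'w');
fprintf(fid, '%s\n', '/data/rec1.mat', '/data/rec2.mat', '/data/rec3.mat');
fclose(fid);
fid = fopen(codes_file, 'w');
fprintf(fid, '%d,%d\n', [1 3; 2 3; 1 4]');
fclose(fid);

% 1. Set up the MATLAB path (e.g. with addpath); nothing to add in this demo.
% 2. Acquire the filename and file code for this task.
filenames = strsplit(strtrim(fileread(names_file)), newline);
filecodes = dlmread(codes_file, ',');   % readmatrix is the newer alternative
filename  = filenames{index};
filecode  = filecodes(index, :);

% 3-4. Read the data and analyze it (placeholder: sum of the file code).
result = sum(filecode);

% 5. Save the result for this task.
save(outfile, 'result', 'filename', 'filecode');
```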
The index
is set by the job scheduler. If you have 10 jobs, it goes from 1-10, and is stored in the
environment variable $SGE_TASK_ID
. The batchify
function sets up the call to the MATLAB batch function from inside the batch script, and so sets the index
argument to the task ID.
A good example of a batch function can be found here.
Any custom batch script must be a .sh
file on your MATLAB
path.
The best way to check is with which(r.batchscriptpath)
because that's how batchify
actually does it.
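A quick way to run that check from the MATLAB prompt is sketched below. The script name is the default mentioned earlier; the messages are just illustrative. The built-in which returns an empty character vector when nothing is found.

```matlab
% Check whether a batch script can be found on the MATLAB path.
scriptname = 'RatCatcher-generic-script.sh';
resolved   = which(scriptname);

if isempty(resolved)
    fprintf('%s is not on the MATLAB path.\n', scriptname);
else
    fprintf('%s resolves to %s\n', scriptname, resolved);
end
```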
batchify uses string parsing to fill out the correct fields in the batch script.
- PROJECT_NAME: the name of the project on the cluster
- BATCH_NAME: the batchname
- NUM_FILES: the total number of data files upon which the protocol will be run
- ARGUMENT: the actual argument passed to the batch function
In order for batchify
to work correctly, these tags should exist in the generic batch script.
They will be replaced with actual parameters during the batching process.
- PROJECT_NAME is set by the RatCatcher property
- BATCH_NAME is automatically generated or set by the user by an argument to batchify
- NUM_FILES is automatically determined
- ARGUMENT is more of a special case
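The tag replacement can be pictured as plain string substitution. The template below is a made-up, shortened stand-in for the generic script (the real one may differ), and the ARGUMENT string is a hypothetical call; only the four tag names come from the description above. The -P, -N, and -t options are standard SGE directives for project, job name, and array range.

```matlab
% Hypothetical, shortened template containing the four tags.
template = strjoin({ ...
    '#!/bin/bash', ...
    '#$ -P PROJECT_NAME', ...
    '#$ -N BATCH_NAME', ...
    '#$ -t 1-NUM_FILES', ...
    'matlab -nodisplay -r "ARGUMENT"'}, newline);

% Values that would normally come from the RatCatcher object (made up here).
project   = 'hasselmogrp';
batchname = 'Caitlin-A-BandwidthEstimator';
num_files = 12;
argument  = 'disp($SGE_TASK_ID); exit;';   % hypothetical MATLAB command

% Replace each tag with its value.
script = template;
script = strrep(script, 'PROJECT_NAME', project);
script = strrep(script, 'BATCH_NAME', batchname);
script = strrep(script, 'NUM_FILES', num2str(num_files));
script = strrep(script, 'ARGUMENT', argument);
disp(script)
```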
Customizing your batch function
The prototypical batch function has the following function call:
batchFunction(index, location, batchname, outfile, test)
- index is automatically set to correspond to the SGE_TASK_ID which iterates up to NUM_FILES
- batchname is, unsurprisingly, the batchname determined as above
- location is automatically set to the remotepath property of RatCatcher
- outfile is the name of the output file, also automatically determined
- test is a logical flag
The batch function can be any function that has these arguments in this order. It does not necessarily even have to be a MATLAB function either, if a custom batch script is used.
To see what a good batch function looks like, check here.
Customizing your protocol
A "protocol" is some process that operates on your data to get useful results. If this is computationally expensive and needs to happen on a lot of data, it's best to run it on a high-performance computing cluster.
Currently, the following protocols exist for RatCatcher
:
The only requirement is that your analysis have a batch function defined for it. It finds it by looking at:
path2BatchFunction = which([r.protocol '.batchFunction']);
so it's best if the batch function is a static method of a class, or part of a package. See below for more details.
You can also provide the path to a custom batch function by filling out the batchfuncpath
property of the RatCatcher
object.
Running your scripts
batchify
creates a batch script that will run (using qsub
) all the jobs on the cluster,
and put the output files where the data should be stored.
# on the cluster
qsub scriptName.sh
What is the generic batch script?
This script is a template that batchify
fills in with the correct values.
It requires 16 cores on the cluster, creates a log file and error file,
sets the name of the project, limits to a 24-hour run,
and then runs MATLAB from the command line.
Generating multiple scripts
A new script is generated for each different expID
.
A single call to the qsub
command on the cluster will run the script on each data file,
producing an output file or files for each (if directed in the batch function).
If you want to run multiple analyses, you will need multiple calls to batchify
with the correctly specified RatCatcher
object.
Post-Processing
Once the jobs have been run, the data can be gathered.
data_table = r.gather();
The paths to the raw data can be stitched onto the data table, for easy reference.
data_table = r.stitch(data_table);
Extra features
RatCatcher
also provides tools for wrangling data.
You can go from a saved data_table
to an analysis object and the Session
object
(from CMBHOME
)
by using the extract
function.
[best, root] = RatCatcher.extract(data_table, index, 'BandwidthEstimator');
Conversely, a data_table
can be indexed to find the indices which correspond to a given filename and cell number.
parse
is called to determine the filenames and file codes.
index = r.index(data_table);
Furthermore, you can use RatCatcher.getFileNames()
and RatCatcher.wrangle()
to gather lists of file names and file codes
and to sequentially load files to build filenames.txt
and filecodes.csv
metadata files.
The sort
function can be used to organize your filenames
and filecodes
alphabetically.
Batch functions can take advantage of this organization by iterating over filecodes
associated with a given filename without having to reload the same raw data file.
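The benefit of sorted filenames can be sketched with built-in functions. This uses MATLAB's plain sort rather than the natsort functions credited below, and the file names and codes are made up; it only illustrates how grouping by filename lets each raw file be loaded once.

```matlab
% Unsorted metadata: the same raw file appears more than once (hypothetical).
filenames = {'/data/b.mat'; '/data/a.mat'; '/data/b.mat'};
filecodes = [2 1; 1 1; 1 1];

% Sort filenames and apply the same permutation to the file codes.
[filenames, order] = sort(filenames);
filecodes = filecodes(order, :);

% Group rows by filename so each raw file only needs to be loaded once.
[unique_files, ~, group] = unique(filenames);
for k = 1:numel(unique_files)
    codes = filecodes(group == k, :);
    % A real batch function would load unique_files{k} here, then loop over codes.
    fprintf('%s: %d file code(s)\n', unique_files{k}, size(codes, 1));
end
```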
Setting a preference file
You can create a function called pref.m
inside of ../RatCatcher/@RatCatcher/
to automatically set up a custom RatCatcher
object every time you instantiate one. This file is ignored by git. It should look something like this:
function p = pref()
p = struct;
p.expID = {'experimenter', 'id1'; 'experimenter', 'id2'};
p.protocol = 'BandwidthEstimator';
p.localpath = 'myPath2ClusterFromLocalComputer';
p.remotepath = 'myPath2ClusterFromRemoteComputer';
p.project = 'hasselmogrp';
end
If this function exists, all future instantiated RatCatcher
objects will have these properties set. If you don't want to set a property, set it to []
instead.
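The "empty means leave it alone" rule can be sketched with plain structs. This is a guess at the behavior described above, not RatCatcher's constructor code; the defaults struct and its values are hypothetical.

```matlab
% Hypothetical default property values of a freshly created object.
obj = struct('protocol', '', 'project', 'defaultproject', 'localpath', '');

% Preferences as returned by a pref function; [] means "do not set".
p = struct( ...
    'protocol',  'BandwidthEstimator', ...
    'project',   [], ...
    'localpath', '/mnt/myproject/cluster');

% Copy over only the non-empty preferences.
names = fieldnames(p);
for k = 1:numel(names)
    value = p.(names{k});
    if ~isempty(value)
        obj.(names{k}) = value;
    end
end
disp(obj)
```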
License Information
RatCatcher
is written by Alec Hoyland and is released under the GNU General Public License 3.0.
The natsort
functions were written by Stephen Cobeldick (c) 2018.
mtools
were written and/or archived by sg-s.
The table of contents was created by gh-md-toc.