# Execution of CancerDiscover
The first step of the pipeline is to place your raw CEL file data into the `DataFiles` directory. In the `DataFiles` directory, create a two-column, comma-separated file called `sampleList.txt`, where the first column contains the name of each CEL file and the second column contains the class label (e.g., normal, tumor) associated with that sample.
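For example, a minimal `sampleList.txt` could be created like this (the CEL file names below are hypothetical placeholders; substitute your own):

```bash
# Create a minimal two-column sampleList.txt in the DataFiles directory
cat > sampleList.txt <<'EOF'
GSM101.CEL,normal
GSM102.CEL,normal
GSM103.CEL,tumor
GSM104.CEL,tumor
EOF
```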
If you want to use the sample data for classification:

```bash
cp SampleData/* ../DataFiles
```

This command copies the data files and the `sampleList.txt` file from the `SampleData` directory into the `DataFiles` directory.
## 1. Initialization
Once you have finished making the `sampleList.txt` file in the `DataFiles` directory, go to the `Scripts` directory to execute the next steps of the pipeline.

There are two versions of the pipeline: BASH and SLURM (Simple Linux Utility for Resource Management). Depending on your access to a SLURM scheduler, you will use one set of scripts or the other. If you have access to a SLURM scheduler, execute the scripts ending in `.slurm`; otherwise, use the scripts ending in `.bash`. Due to the complexity of data manipulation and/or the sheer size of your data, it is recommended to run the SLURM scripts on a high-performance computer.
Next, in the `Scripts` directory, edit the file called `Configuration.txt` to make any changes desired for processing your data, including the normalization method, the size of data partitions, and which feature selection and classification algorithms are to be executed. The default settings for normalization, background correction, and data partitioning are:
- Normalization: `normMethod="quantiles"`
- Background correction: `bgCorrectMethod="rma"`
- PM value correction: `pmCorrectMethod="pmonly"`
- Summary: `summaryMethod="medianpolish"`
- Number of folds for data partitioning: `foldNumber=2`
The default setting for data partitioning is 50:50.
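Taken together, the corresponding lines of `Configuration.txt` would look roughly like the sketch below (variable names are taken from the defaults above; the actual file may contain additional settings):

```bash
# Default preprocessing settings in Configuration.txt (sketch)
normMethod="quantiles"        # quantile normalization
bgCorrectMethod="rma"         # RMA background correction
pmCorrectMethod="pmonly"      # use perfect-match probe values only
summaryMethod="medianpolish"  # median-polish probe summarization
foldNumber=2                  # 2 folds => a 50:50 train/test split
```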
The default setting for feature selection algorithms will choose all possible feature selection algorithm options. You can find the list of feature selection methods and their associated file names in `featureSelectionAlgorithms.lookup` in the `Scripts` directory.
The default setting for classification algorithms will generate models using the following algorithms:
- Decision Tree
- IBK
- Naive Bayes
- Random Forest
- Support Vector Machine
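Which classifiers run is also controlled through `Configuration.txt`; a hypothetical excerpt might look like the following (the option names here are illustrative only; consult the configuration file itself for the real ones):

```bash
# Hypothetical classifier toggles in Configuration.txt
DecisionTree=TRUE
IBK=TRUE
NaiveBayes=TRUE
RandomForest=TRUE
SupportVectorMachine=TRUE
```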
To initialize the pipeline, run:

```bash
cd ../Scripts
bash initialization.bash
```
## 2. Normalization
```bash
bash masterScript_1.bash
```

For SLURM users:

```bash
sbatch masterScript_1.slurm
```
The above script performs normalization on the raw CEL data and generates the expression set matrix.
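Once the script finishes, you can sanity-check the output before moving on (a generic verification step, not part of the pipeline itself; paths assume you are still in the `Scripts` directory):

```bash
# Confirm the normalized expression matrix was written to DataFiles
ls -lh ../DataFiles/ExpressionSet.txt
head -n 3 ../DataFiles/ExpressionSet.txt  # peek at the first few rows
```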
## 3. Feature Selection
After normalization is complete, you will have a single file called `ExpressionSet.txt` in your `DataFiles` directory. The next command builds a master feature vector file from the `ExpressionSet.txt` file and performs data partitioning, dividing the master feature vector file into two parts: training and testing. The program then performs feature selection using only the training portion of the master feature vector.
The default setting for data partitioning is 50:50, meaning the master feature vector file will be split evenly into training and testing data sets while retaining approximately even distributions of your sample classes between the two daughter files. To achieve a different split, such as 80:20 for training/testing, replace the `2` with a `5` in the configuration file `Configuration.txt`. This tells the program to perform 5 folds, where the training file retains 4 folds and the testing file retains a single fold, or 20% of the master feature vector data.
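For example, changing the fold number in `Configuration.txt` to get an 80:20 split looks like this (same variable name as in the defaults above):

```bash
# In Configuration.txt: 5 folds => 4 folds (80%) train, 1 fold (20%) test
foldNumber=5
```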
The default setting for feature selection will perform all available feature selection methods unless otherwise specified in the `Configuration.txt` file. To change these options, edit `Configuration.txt` in the `Scripts` directory: write `TRUE` next to each feature selection method you wish to perform and `FALSE` next to each method you do not. You can find the list of feature selection methods and their associated file names in `featureSelectionAlgorithms.lookup` in the `Scripts` directory.
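A hypothetical excerpt of such settings might look like the following (the method names here are illustrative; the real names are listed in `featureSelectionAlgorithms.lookup`):

```bash
# Hypothetical feature selection toggles in Configuration.txt
InfoGain=TRUE   # run this method
ReliefF=TRUE    # run this method
Chi2=FALSE      # skip this method
```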
The following commands perform feature selection on the normalized expression matrix:
```bash
bash masterScript_2.bash
```

For SLURM users:

```bash
sbatch masterScript_2.slurm
```
## 4. Model Training and Testing
Once feature selection has been completed, new feature vectors are made based on the ranked lists of features. The new feature vectors will be generated based on your threshold selections and immediately used to build and test classification models using the classification algorithms of your choosing. Lastly, the directories will be reset, and your old directories and files will be placed in the `CompletedExperiments` directory under a time-stamped name.
The following commands perform model training and testing on the feature vectors:
```bash
bash masterScript_3.bash
```

For SLURM users:

```bash
sbatch masterScript_3.slurm
```
The last lines of the `masterScript_3` scripts move the contents of `DataFiles` to `CompletedExperiments`, so the next experiment can run in the `DataFiles` directory. You can find all raw data, feature selection outputs, training and testing feature vectors, models, and model results in the time-stamped folder within the `CompletedExperiments` directory. To run experiments with new data, begin again with step 1.
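To review past runs, you can list the archive (a generic listing command, not part of the pipeline itself; the path assumes you are in the `Scripts` directory):

```bash
# Each completed experiment is stored under a time-stamped name
ls ../CompletedExperiments/
```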
Overall, users need to run only four scripts:
```bash
bash initialization.bash  # Initialization
bash masterScript_1.bash  # Normalization
bash masterScript_2.bash  # Feature Selection
bash masterScript_3.bash  # Model Training and Testing
```
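For SLURM users, the equivalent sequence is sketched below (initialization is shown as a bash script because only the `.bash` version is mentioned above; submit each job only after the previous one completes):

```bash
bash initialization.bash     # Initialization
sbatch masterScript_1.slurm  # Normalization
sbatch masterScript_2.slurm  # Feature Selection
sbatch masterScript_3.slurm  # Model Training and Testing
```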