Execution of CancerDiscover - HelikarLab/CancerDiscover GitHub Wiki

The first step of the pipeline is to place your raw CEL file data into the DataFiles directory.

In the DataFiles directory, you will need to create a two-column comma-separated file called "sampleList.txt", where the first column contains the name of each CEL file and the second column contains the class label (e.g., normal, tumor) associated with that sample.
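For example, a sampleList.txt for four samples might look like this (the CEL file names below are illustrative, not part of the sample data):

```
GSM001.CEL,normal
GSM002.CEL,normal
GSM003.CEL,tumor
GSM004.CEL,tumor
```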

If you want to use the sample data for classification, the following command will copy the data and sampleList.txt files from the SampleData directory into the DataFiles directory:

```shell
cp SampleData/* ../DataFiles
```

1. Initialization

Once you have finished making the sampleList.txt file in the DataFiles directory, please visit the Scripts directory to execute the next steps of the pipeline.

There are two versions of the pipeline: BASH and SLURM (Simple Linux Utility for Resource Management). Depending on your access to a SLURM scheduler, you will use one or the other set of scripts. If you have access to a SLURM scheduler, execute the scripts ending in .slurm; otherwise, use the scripts ending in .bash. Due to the complexity of data manipulation and/or the sheer size of your data, it is recommended to use the SLURM scripts on a high-performance computer.
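If you are unsure whether a SLURM scheduler is available on your system, a quick shell check (not part of the pipeline itself) is to look for the sbatch command:

```shell
# Decide which script suffix to use based on whether SLURM's sbatch is on the PATH.
choose_suffix() {
    if command -v sbatch >/dev/null 2>&1; then
        echo "slurm"
    else
        echo "bash"
    fi
}

suffix=$(choose_suffix)
echo "Use the .$suffix scripts on this system"
```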

Next, in the Scripts directory, edit the file called Configuration.txt to make any desired changes for processing your data, including the normalization method, the size of data partitions, and which feature selection and classification algorithms are to be executed. The default settings for normalization, background correction, and data partitioning are:

  - Normalization: `normMethod="quantiles"`
  - Background correction: `bgCorrectMethod="rma"`
  - PM value correction: `pmCorrectMethod="pmonly"`
  - Summary: `summaryMethod="medianpolish"`
  - Number of folds for data partitioning: `foldNumber=2`

The default setting for data partitioning is 50:50.

The default setting for feature selection algorithms will run all available feature selection algorithms. You can find the list of feature selection methods and their associated file names in featureSelectionAlgorithms.lookup in the Scripts directory.

The default setting for classification algorithms will generate models using the following algorithms:

  - Decision Tree  
  - IBK   
  - Naive Bayes   
  - Random Forest   
  - Support Vector Machine

To initialize the pipeline, run the following commands:

```shell
cd ../Scripts
bash initialization.bash
```

2. Normalization

For BASH users:

```shell
bash masterScript_1.bash
```

For SLURM users:

```shell
sbatch masterScript_1.slurm
```

The purpose of the above script is to perform normalization on the raw CEL data and generate the expression set matrix.

3. Feature Selection

After normalization is complete, you will have a single file called ExpressionSet.txt in your DataFiles directory. The next command builds a master feature vector file from ExpressionSet.txt and performs data partitioning, dividing the master feature vector file into two parts: training and testing. The program then performs feature selection using only the training portion of the master feature vector.

The default setting for data partitioning is 50:50, meaning the master feature vector file will be split evenly into training and testing data sets while retaining approximately even distributions of your sample classes between the two daughter files. To achieve a larger training portion, such as an 80:20 training/testing split, replace the 2 with a 5 in the configuration file Configuration.txt. This tells the program to create 5 folds, of which the training file retains 4 folds (80%) and the testing file retains a single fold (20%) of the master feature vector data.
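The relationship between foldNumber and the split described above can be sketched with simple shell arithmetic:

```shell
# Illustrative arithmetic: with k folds, one fold is held out for testing,
# so testing receives 100/k percent of the data and training the rest.
foldNumber=5                      # value set in Configuration.txt
test_pct=$((100 / foldNumber))    # 20
train_pct=$((100 - test_pct))     # 80
echo "foldNumber=$foldNumber gives a ${train_pct}:${test_pct} train/test split"
```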

The default setting for feature selection will perform all available forms of feature selection unless otherwise specified in Configuration.txt. If you wish to change these options, edit Configuration.txt in the Scripts directory: write TRUE next to each feature selection method you wish to perform and FALSE next to each method you do not. You can find the list of feature selection methods and their associated file names in the Scripts directory in the file named featureSelectionAlgorithms.lookup.
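As an illustration, toggling feature selection methods in Configuration.txt might look like the following. The method names here are hypothetical; consult featureSelectionAlgorithms.lookup for the actual names used by the pipeline:

```
# Hypothetical Configuration.txt excerpt -- real option names are listed
# in Scripts/featureSelectionAlgorithms.lookup
InfoGain=TRUE
Relief=TRUE
ChiSquared=FALSE
```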

The following commands perform feature selection on the normalized expression matrix:

```shell
bash masterScript_2.bash
```

For SLURM users:

```shell
sbatch masterScript_2.slurm
```

4. Model training and testing

Once feature selection is complete, new feature vectors are generated from the ranked lists of features according to your threshold selections, and immediately used to build and test classification models using the classification algorithm(s) of your choosing. Lastly, the directories are reset, and your old directories and files are placed in a time-stamped folder inside CompletedExperiments.

The following commands perform model training and testing on the feature vectors:

```shell
bash masterScript_3.bash
```

For SLURM users:

```shell
sbatch masterScript_3.slurm
```

The last lines of the masterScript_3 scripts move the contents of DataFiles to CompletedExperiments, so that the next experiment can run in the DataFiles directory. You can find all raw data, feature selection outputs, training and testing feature vectors, models, and model results in a time-stamped folder within the CompletedExperiments directory. To run an experiment with new data, begin again with step 1.
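The archiving step at the end of masterScript_3 behaves roughly like the following sketch. The folder naming here is assumed from the description above; the actual script may differ:

```shell
# Sketch of the archiving behaviour: move finished experiment files into a
# time-stamped folder so DataFiles is empty for the next run.
timestamp=$(date +%Y-%m-%d_%H-%M-%S)
dest="CompletedExperiments/Experiment_$timestamp"   # hypothetical naming scheme
mkdir -p "$dest"
echo "Finished results would be moved to $dest"
# e.g.: mv ../DataFiles/* "$dest"/
```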

Overall, users need to run only four scripts:

```shell
bash initialization.bash   # Initialization
bash masterScript_1.bash   # Normalization
bash masterScript_2.bash   # Feature Selection
bash masterScript_3.bash   # Model Training and Testing
```
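For the BASH version, these four steps can be chained in a small wrapper that stops on the first failure. This is a convenience sketch, not part of the pipeline; it skips any script not present in the current directory:

```shell
#!/usr/bin/env bash
# Run the four pipeline stages in order, aborting if any stage fails.
set -e

run_step() {
    echo ">>> running $1"
    bash "$1"
}

for script in initialization.bash masterScript_1.bash \
              masterScript_2.bash masterScript_3.bash; do
    if [ -f "$script" ]; then
        run_step "$script"
    else
        echo "skipping $script (not found in current directory)"
    fi
done
```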