Directory Structure of the Pipeline - HelikarLab/CancerDiscover GitHub Wiki

After downloading CancerDiscover, you will find several directories, one of which (Scripts) contains all of the scripts necessary to process data:

DataFiles directory contains the raw CEL files and the sampleList.txt file

Outputs directory contains the resultsSummary.txt file, which summarizes the model accuracies and reports the context that gave the highest accuracy

Scripts directory contains all of the source code

Models directory contains all of the classification models

Temp directory contains intermediate files that are generated as part of the execution of the pipeline

Feature Selection directory contains the feature selection algorithm output files and two nested directories for ARFF file generation, namely Chunks and ArffPreprocessing

Chunks contains the feature sets produced at different thresholds

ArffPreprocessing directory contains the feature vectors in ARFF format. The feature vectors made here are split into training and testing datasets, which are placed in their respective directories

Train is the repository of the training data for the modeling

Test is the repository of the testing data for model testing

SampleData is a directory which contains 10 sample CEL files and their associated sampleList.txt file

Logs is a directory which contains the elapsed time in seconds for each leg of the pipeline from initialization through model testing

CompletedExperiments is the directory into which the above directories containing experimental data are moved when the pipeline has finished running. It acts as a repository of old experiment files, organized by a timestamp that reads as Year-month-day-hours-minutes-seconds
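
The timestamp naming used in CompletedExperiments can be illustrated with a short Python sketch. This is not code from the pipeline: the function names are hypothetical, and the zero-padded, hyphen-separated layout is an assumption, since the text above only gives the field order Year-month-day-hours-minutes-seconds.

```python
from datetime import datetime
from pathlib import Path

# Assumed layout: zero-padded fields joined by hyphens. The wiki only states
# the field order (Year-month-day-hours-minutes-seconds), so the padding and
# separator are guesses made for illustration.
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"


def experiment_dir_name(moment: datetime) -> str:
    """Format a point in time as an archive directory name (hypothetical helper)."""
    return moment.strftime(TIMESTAMP_FORMAT)


def experiment_dir_path(moment: datetime, root: str = "CompletedExperiments") -> Path:
    """Build the path of an archived experiment under the archive root."""
    return Path(root) / experiment_dir_name(moment)


def parse_experiment_dir_name(name: str) -> datetime:
    """Recover the point in time from an archive directory name."""
    return datetime.strptime(name, TIMESTAMP_FORMAT)


if __name__ == "__main__":
    finished = datetime(2017, 3, 5, 14, 7, 9)
    print(experiment_dir_path(finished).as_posix())
```

If the names are zero-padded as assumed here, sorting them alphabetically also sorts the archived experiments chronologically, which makes it easy to find the most recent run.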