Workflow Demo
Initialization
data.base.dir <- 'data'
if (!dir.exists(data.base.dir)) dir.create(data.base.dir)
trainset.name <- 'GSE31210'          # GEO accession; also the download folder name
testset.name <- 'ArrayExpress'       # folder holding the pre-downloaded test data
trainset.cel.dir <- file.path(data.base.dir, 'train/cel')
trainset.pcl.dir <- file.path(data.base.dir, 'train/pcl')
testset.cel.dir <- file.path(data.base.dir, 'test/cel')
testset.pcl.dir <- file.path(data.base.dir, 'test/pcl')
chip.type <- 'HG-U133_Plus_2'        # Affymetrix platform shared by both datasets
annotation.data <- 'hgu133plus2.db'  # matching Bioconductor annotation package
library(oligo)
library(GEOquery)
library(annotate)
library(annotation.data, character.only = TRUE)
library(genefilter)
library(magrittr)
library(impute)
If you want to change the datasets, clear the data directory and change the values above. Be careful, though: the trainset and the testset are prepared differently. The trainset is the coursework data from GEO and is much larger than the testset, which is obtained from ArrayExpress. The two are also compressed differently (tar vs. zip), so don't mix up the two preparation paths if you modify the workflow.
Explanation
trainset.name and testset.name are the folders that contain your downloaded data. Change chip.type and annotation.data if you are going to apply this script to a different platform.
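For instance, if your data came from the older HG-U133A platform (an illustrative alternative, not used in this demo), the two values would change together:

chip.type <- 'HG-U133A'
annotation.data <- 'hgu133a.db'  # Bioconductor annotation package for HG-U133A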
Data preparation
preprocessing workflow
It differs slightly from the preprocessing pipeline: to enforce consistency across datasets, data-driven probe filtering is skipped. We only remove probes that lack a proper gene ID in the annotation data.
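Concretely, getSYMBOL() returns NA for probes without a gene symbol, and the na.omit() call in the function below drops those rows. A small illustration (real HG-U133 Plus 2 probe IDs; the AFFX control probe is expected to map to NA):

# a regular probe maps to a gene symbol; the AFFX control probe returns NA
getSYMBOL(c('1007_s_at', 'AFFX-TrpnX-3_at'), annotation.data)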
get_anno_exprs <- function(file.names) {
  # Read the CEL files, RMA-normalize, extract the expression matrix,
  # and impute missing values with k-nearest neighbors
  data.exprs <- (file.names %>%
    read.celfiles() %>%
    rma() %>%
    exprs() %>%
    impute.knn())$data
  # Map probe IDs to gene symbols; na.omit() drops probes without a gene ID
  data.anno <- rownames(data.exprs) %>%
    getSYMBOL(annotation.data) %>%
    {data.frame(geneID = .)} %>%
    cbind(data.exprs) %>%
    na.omit()
  # When several probes map to one gene, keep the probe with the largest IQR
  probenames <-
    apply(data.anno[, -1], 1, IQR) %>%
    {findLargest(rownames(data.anno),
                 testStat = .,
                 data = annotation.data)}
  data.anno[probenames, ]
}
Define the pipeline get_anno_exprs, from CEL files to an annotated expression set.
Trainset
This part of the workflow starts from the raw GEO data.
# download the supplementary raw-data archive for GSE31210 and extract the CEL files
getGEOSuppFiles(trainset.name, baseDir = data.base.dir) %>%
  rownames() %>%
  untar(exdir = trainset.cel.dir)
Download the GEO dataset and untar it. All CEL.gz files will end up in trainset.cel.dir; you can skip this step if you already have them.
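As an optional sanity check (not part of the original workflow), you can count the extracted files:

# should return a non-zero count of gzipped CEL files
length(list.celfiles(trainset.cel.dir, listGzipped = TRUE))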
trainset.exprs <-
list.celfiles(trainset.cel.dir, listGzipped = TRUE) %>%
{file.path(trainset.cel.dir,.)} %>%
get_anno_exprs()
Preprocessing.
Testset
The testset has already been obtained from the ArrayExpress database, based on the chip information of the trainset. We downloaded the CEL files with another R script that builds the compendium; ArrayExpress offers a convenient API for downloading a specific type of array data.
NOTE: These data are packaged differently from those from GEO. If you decide to switch the testset to GEO data, refer to the corresponding workflow above.
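The compendium script itself is not shown here, but as a rough sketch of what the download could look like with the Bioconductor ArrayExpress package (the query keywords and the number of accessions are illustrative assumptions, not the actual script):

library(ArrayExpress)
# find experiments that used the same chip as the trainset
hits <- queryAE(keywords = 'HG-U133_Plus_2', species = 'homo+sapiens')
# fetch and unpack the raw archives into the testset directory
for (acc in head(hits$ID, 3)) {
  try(getAE(acc, path = testset.cel.dir, type = 'raw'))
}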
list.files(file.path(data.base.dir, testset.name), pattern = '*.zip') %>%
{file.path(data.base.dir, testset.name, .)} %>%
lapply(function(file) unzip(file, exdir = testset.cel.dir))
Unzip the downloaded files.
testset.exprs <- list.celfiles(testset.cel.dir) %>%
{file.path(testset.cel.dir,.)} %>%
get_anno_exprs()
Preprocessing.
Normalization
We perform 0-1 (min-max) normalization, which is compulsory before the data are fed to the model. Beware: the testset is scaled with the trainset's range (see the toy example after the function below).
# scale x into [0, 1] using the range of y (y supplies the min and max)
min_max_norm <- function(x, y) {
  (x - min(y)) / (max(y) - min(y))
}
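A toy illustration with made-up numbers, showing why the reference vector matters: test values outside the training range land outside [0, 1].

train.vals <- c(2, 4, 6, 8)
test.vals <- c(3, 5, 9)              # 9 exceeds the training maximum
min_max_norm(test.vals, train.vals)  # 0.1666667 0.5000000 1.1666667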
# Normalize each trainset column (sample) by its own range
trainset.normalized <- mapply(min_max_norm,
                              trainset.exprs[,-1],
                              trainset.exprs[,-1],
                              SIMPLIFY = FALSE) %>%
  as.data.frame() %>%
  {cbind(trainset.exprs$geneID %>% data.frame(gene = .), .)}
# Normalize each testset column with the range of the corresponding
# trainset column (mapply pairs the columns positionally)
testset.normalized <- mapply(min_max_norm,
                             testset.exprs[,-1],
                             trainset.exprs[,-1],
                             SIMPLIFY = FALSE) %>%
  as.data.frame() %>%
  {cbind(testset.exprs$geneID %>% data.frame(gene = .), .)}
# This assumes the testset ends up with the same genes, in the same
# order, as the trainset after get_anno_exprs()
rownames(testset.normalized) <- trainset.exprs$geneID
Export
# The pcl output directories are never created earlier, so make them first
for (d in c(trainset.pcl.dir, testset.pcl.dir)) {
  if (!dir.exists(d)) dir.create(d, recursive = TRUE)
}
write.table(trainset.normalized,
            file = file.path(trainset.pcl.dir, 'train.pcl'),
            quote = FALSE,
            row.names = FALSE,
            sep = '\t')
write.table(testset.normalized,
            file = file.path(testset.pcl.dir, 'test.pcl'),
            quote = FALSE,
            row.names = FALSE,
            sep = '\t')
Write the tables out as pcl files. The model only accepts pcl data.
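The pcl file written here is simply a tab-delimited table with gene symbols in the first column and one column per sample. A hypothetical excerpt (made-up values and sample names):

gene	sample_1	sample_2
A1BG	0.4312	0.3977
A1CF	0.1205	0.0889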
Train the model
The following code utilizes some Python modules from the greenelab/adage project.
NOTE: Python needs theano and docopt to run.
python Train_test_DAs/SdA_train.py data/train/pcl/train.pcl 0 50 10 500 0.1 0.01 --seed1 123 --seed2 123
if [ ! -d data/train/result ]; then
mkdir -p data/train/result
fi
mv data/train/pcl/*.txt data/train/result
The model training is done in Python. The algorithm was originally coded in Python 2 and has been adapted to Python 3.
Training options, from left to right:
- skip columns: 0
- network size: 50
- batch size: 10
- epoch size: 500
- corruption level: 0.1
- learning rate: 0.01
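The training script encodes these options in its output filename: with the values above it writes train_50_batch10_epoch500_corrupt0.1_lr0.01_seed1_123_seed2_123_network_SdA.txt, which is exactly the network file passed to the test command below.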
Test the model
python Train_test_DAs/SdA_test.py data/test/pcl/test.pcl 0 data/train/result/train_50_batch10_epoch500_corrupt0.1_lr0.01_seed1_123_seed2_123_network_SdA.txt 50
if [ ! -d data/test/result ]; then
mkdir -p data/test/result
fi
mv data/test/pcl/*.txt data/test/result