Workflow Demo

Initialization

# dataset identifiers; these are also the folder names under data/
trainset.name <- 'GSE31210'
testset.name <- 'ArrayExpress'
data.base.dir <- 'data'
trainset.cel.dir <- file.path(data.base.dir, 'train/cel')
trainset.pcl.dir <- file.path(data.base.dir, 'train/pcl')
testset.cel.dir <- file.path(data.base.dir, 'test/cel')
testset.pcl.dir <- file.path(data.base.dir, 'test/pcl')
# create every directory up front; the pcl directories are needed
# later when the normalized data are exported
for (d in c(trainset.cel.dir, trainset.pcl.dir,
            testset.cel.dir, testset.pcl.dir)) {
  if (!dir.exists(d)) dir.create(d, recursive = TRUE)
}
# platform and the matching Bioconductor annotation package
chip.type <- 'HG-U133_Plus_2'
annotation.data <- 'hgu133plus2.db'
library(oligo)
library(GEOquery)
library(annotate)
library(annotation.data, character.only = TRUE)
library(genefilter)
library(magrittr)
library(impute)

If you want to change the dataset, just clear the data directory and change the values above. Be careful, though: the trainset and the testset are prepared differently. The trainset is the coursework data from GEO and is much larger than the testset, which is obtained from ArrayExpress. The two are also compressed differently (tar vs. zip), so don't mix the steps up if you modify the workflow.
Explanation
trainset.name and testset.name name the folders that contain your downloaded data.
Change chip.type and annotation.data if you want to apply this script to a different platform.
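
For example, switching to the older HG-U133A platform would only require changing the platform values; a minimal sketch (the GEO accession below is a placeholder, not a real recommendation):

trainset.name   <- 'GSE12345'      # placeholder accession for the new dataset
chip.type       <- 'HG-U133A'
annotation.data <- 'hgu133a.db'    # annotation package matching the new chip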

Data preparation

Preprocessing workflow

This differs slightly from the usual preprocessing pipeline. To enforce consistency across datasets, data-driven probe filtering is skipped; we only remove probes that lack a proper gene ID in the annotation data.
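
As a small illustration of that removal: getSYMBOL returns NA for probes with no symbol mapping (for example the AFFX control probes), and na.omit later drops those rows. A sketch, assuming the usual mappings for this chip:

getSYMBOL(c('1007_s_at', 'AFFX-BioB-5_at'), annotation.data)
# expected: '1007_s_at' maps to a symbol ('DDR1'), the control probe to NA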

get_anno_exprs <- function(file.names) {
  # read CEL files, RMA-normalize, extract the expression matrix,
  # and impute missing values with KNN
  data.exprs <- (file.names %>% 
                   read.celfiles() %>% 
                   rma() %>% 
                   exprs() %>%
                   impute.knn())$data
  # map probe IDs to gene symbols and drop probes without one
  data.anno <- rownames(data.exprs) %>% 
    getSYMBOL(annotation.data) %>% 
    {data.frame(geneID = .)} %>%
    cbind(data.exprs) %>%
    na.omit()
  # when several probes map to the same gene, keep the probe with
  # the largest IQR across samples
  probenames <- 
    apply(data.anno[,-1], 1, IQR) %>%
    {findLargest(rownames(data.anno), 
                 testStat = ., 
                 data = annotation.data)}
  data.anno[probenames,]
}

This defines get_anno_exprs, the pipeline from CEL files to an annotated expression set: RMA normalization, KNN imputation, gene-symbol annotation, and collapsing duplicate probes per gene by largest IQR.

Trainset

This part of the workflow starts from the raw data on GEO.

getGEOSuppFiles(trainset.name, baseDir = data.base.dir) %>% 
  rownames() %>%
  untar(exdir = trainset.cel.dir)

Download the GEO supplementary archive and untar it.
All CEL.gz files will end up in trainset.cel.dir; you can skip this step if you already have them.
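
If you re-run the script often, a simple guard makes this step skip the download when CEL files are already present; a minimal sketch reusing the calls above:

# download and untar only when trainset.cel.dir is still empty
if (length(list.celfiles(trainset.cel.dir, listGzipped = TRUE)) == 0) {
  getGEOSuppFiles(trainset.name, baseDir = data.base.dir) %>%
    rownames() %>%
    untar(exdir = trainset.cel.dir)
}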

trainset.exprs <- 
  list.celfiles(trainset.cel.dir, listGzipped = TRUE) %>% 
  {file.path(trainset.cel.dir,.)} %>%
  get_anno_exprs()

Preprocessing.

Testset

The testset has already been obtained from the ArrayExpress database, selected to match the chip type of the trainset. We downloaded the CEL files with another R script that builds the compendium; ArrayExpress offers a convenient API for downloading a specific type of array data.
NOTE: These data are prepared differently from those from GEO. If you decide to switch the testset to a GEO dataset, refer to the corresponding workflow above.
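
The download script itself is not shown here, but as a rough sketch, the Bioconductor ArrayExpress package can fetch the raw zip files for an accession (the accession below is a placeholder):

# hypothetical download step via the ArrayExpress package
library(ArrayExpress)
getAE('E-MTAB-0000', path = file.path(data.base.dir, testset.name),
      type = 'raw', extract = FALSE)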

list.files(file.path(data.base.dir, testset.name), pattern = '*.zip') %>% 
  {file.path(data.base.dir, testset.name, .)} %>%
  lapply(function(file) unzip(file, exdir = testset.cel.dir))

Unzip the downloaded files.

testset.exprs <- list.celfiles(testset.cel.dir) %>% 
  {file.path(testset.cel.dir,.)} %>%
  get_anno_exprs()

Preprocessing.

Normalization

We perform 0-1 (min-max) normalization, which is compulsory before the data are fed to the model.
Beware: the testset is scaled using the trainset's range, not its own.

min_max_norm <- function(x, y) {
  # scale x into [0, 1] using the range of y
  (x - min(y)) / (max(y) - min(y))
}
# each trainset sample (column) is scaled by its own range
trainset.normalized <- mapply(min_max_norm, 
                              trainset.exprs[,-1], 
                              trainset.exprs[,-1], 
                              SIMPLIFY = FALSE) %>% 
  as.data.frame() %>%
  {cbind(trainset.exprs$geneID %>% data.frame(gene = .), .)}
# the testset is scaled by the trainset's overall range; note that
# pairing test and train columns in mapply would fail whenever the
# two datasets have different numbers of samples
train.range <- range(as.matrix(trainset.exprs[,-1]))
testset.normalized <- lapply(testset.exprs[,-1], 
                             min_max_norm, 
                             y = train.range) %>% 
  as.data.frame() %>%
  {cbind(testset.exprs$geneID %>% data.frame(gene = .), .)}
rownames(testset.normalized) <- testset.exprs$geneID
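
A quick toy check of the scaling, using min_max_norm as defined above:

x <- c(2, 4, 6)
min_max_norm(x, x)    # 0.00 0.50 1.00, scaled by its own range
min_max_norm(5, x)    # 0.75, scaled by x's range rather than its own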

Exportation

write.table(trainset.normalized, 
            file = file.path(trainset.pcl.dir, 'train.pcl'),
            quote = FALSE,
            row.names = FALSE,
            sep = '\t')
write.table(testset.normalized,
            file = file.path(testset.pcl.dir, 'test.pcl'),
            quote = FALSE,
            row.names = FALSE,
            sep = '\t')

Write to PCL. The model only accepts PCL files.
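
To confirm the export looks sane before training, you can peek at the first few rows (the column subset here is arbitrary):

read.delim(file.path(trainset.pcl.dir, 'train.pcl'), nrows = 3)[, 1:4]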

Train the model

The following code uses Python modules from the greenelab/adage project.
NOTE: the Python environment needs Theano and docopt installed.

python Train_test_DAs/SdA_train.py data/train/pcl/train.pcl 0 50 10 500 0.1 0.01 --seed1 123 --seed2 123
mkdir -p data/train/result
mv data/train/pcl/*.txt data/train/result

Model training is done in Python. The algorithm was originally written in Python 2 and has been adapted to Python 3.
Training options, from left to right:

  1. skip columns: 0
  2. network size: 50
  3. batch size: 10
  4. epoch size: 500
  5. corruption level: 0.1
  6. learning rate: 0.01
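
The trained network file is written next to the input PCL, and its name encodes these options. A sketch in R of the naming pattern, inferred from the filename referenced in the test step below (the 'train' prefix comes from the input PCL basename):

net.file <- sprintf('train_%d_batch%d_epoch%d_corrupt%s_lr%s_seed1_%d_seed2_%d_network_SdA.txt',
                    50, 10, 500, 0.1, 0.01, 123, 123)
# 'train_50_batch10_epoch500_corrupt0.1_lr0.01_seed1_123_seed2_123_network_SdA.txt'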

Test the model

python Train_test_DAs/SdA_test.py data/test/pcl/test.pcl 0 data/train/result/train_50_batch10_epoch500_corrupt0.1_lr0.01_seed1_123_seed2_123_network_SdA.txt 50
mkdir -p data/test/result
mv data/test/pcl/*.txt data/test/result