Chipseq Pipeline at ACCRE - shengqh/cqsperl GitHub Wiki

Log in to ACCRE

To use ACCRE, you need an SSH terminal to connect from your local computer to ACCRE. PuTTY is the usual choice on Windows, but any other SSH terminal should work.

If you have a gateway maintained by ACCRE (as CQS does), you can connect to the gateway directly; otherwise, connect to login.accre.vanderbilt.edu for job control. The difference is that a gateway lets you run tasks that take hours or even days, whereas the default login server allows tasks of at most 15 minutes.
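On Linux or macOS, the built-in ssh client is enough. As a minimal sketch, the snippet below assembles the command for the default login server. The account name your_vunetid is a placeholder that does not come from this page, and a gateway host name would replace ACCRE_HOST if you have one.

```shell
# Placeholder account name (an assumption): replace with your own ACCRE user name.
ACCRE_USER="your_vunetid"
# Default login server named above; use your gateway host here instead if you have one.
ACCRE_HOST="login.accre.vanderbilt.edu"
# Print the command rather than running it, so the target can be checked first.
echo "ssh ${ACCRE_USER}@${ACCRE_HOST}"
```

Running the printed command opens the session.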

Setup

Run the following commands. They only need to be run once.

echo "set ssl:verify-certificate false" >> ~/.lftprc
echo "source /scratch/jbrown_lab/path.txt" >> ~/.bashrc

The first line configures lftp, which is used to download data from VANTAGE. The second line sets up the Perl library used to generate the pipeline scripts.

You may want to check that those lines were added successfully:

tail ~/.lftprc
tail ~/.bashrc
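Because echo with >> appends every time it runs, repeating the setup by accident leaves duplicate lines in both files. A minimal sketch of a guarded append follows; append_once is a hypothetical helper name, not part of the pipeline.

```shell
# append_once LINE FILE: add LINE to FILE only if an identical line is not already there.
append_once() {
  line=$1
  file=$2
  touch "$file"   # make sure the file exists so grep does not report an error
  # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string, not a regex
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage with the two setup lines from above:
#   append_once 'set ssl:verify-certificate false' ~/.lftprc
#   append_once 'source /scratch/jbrown_lab/path.txt' ~/.bashrc
```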

Download data from VANTAGE

Assume you received a username (Brown_1111) and password (#FAKE_PASSWORD) from VANTAGE and want to store your data in the folder /scratch/jbrown_lab/data/3808:

cd /scratch/jbrown_lab/data
mkdir 3808
cd 3808
lftp [email protected]

Once you have entered the password and logged in to the FTP server, you can list the files in the folder with

ls

and download all files by

mget *

or download the whole folder (if there are subfolders) with:

mirror

When all files are downloaded, type "quit" to exit lftp.
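An interrupted transfer can leave a truncated file, so it may be worth testing the compressed files before using them. The sketch below relies on gzip -t, which checks the integrity of a compressed file without unpacking it. check_gz_files is a hypothetical helper name, and the sketch assumes the delivered files end in .gz, as the FASTQ files later on this page do.

```shell
# check_gz_files DIR: test every *.gz file in DIR, print the ones that fail,
# and finish with a non-zero status if any file is damaged.
check_gz_files() (
  dir=$1
  status=0
  for f in "$dir"/*.gz; do
    [ -e "$f" ] || continue          # the pattern matched nothing
    if ! gzip -t "$f" 2>/dev/null; then
      echo "CORRUPT: $f"
      status=1
    fi
  done
  exit "$status"                     # ends the subshell that forms the function body
)

# Usage, from the download folder used above:
#   check_gz_files /scratch/jbrown_lab/data/3808
```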

Write configuration file

Setup folder and copy template file

First of all, create your own folder under /scratch/jbrown_lab, just like mine:

mkdir /scratch/jbrown_lab/shengq2
cd /scratch/jbrown_lab/shengq2
mkdir codes
mkdir projects

We will use the codes folder to store the configuration files and the projects folder to store the analysis results. Now copy the template file into the codes folder:

cd codes
cp /scratch/jbrown_lab/examples/20190909_chipseq_zf_1163_mouse.pl .

Prepare configuration file

more /scratch/jbrown_lab/examples/20190909_chipseq_zf_1163_mouse.pl

The content of the configuration file looks like:

#!/usr/bin/perl
use strict;
use warnings;

use CQS::ClassFactory;
use CQS::FileUtils;
use CQS::PerformChIPSeq;

my $def = {
  task_name  => "zf_1163_mouse",
  email      => "quanhu.sheng.1\@vumc.org",
  target_dir => create_directory_or_die("/scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo"),

  is_paired_end => 0,

  perform_cutadapt => 0,
  adapter          => "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",    #TruSeq adapter
  min_read_length  => 30,

  #mapping
  aligner => "bowtie2",

  #peak calling
  peak_caller => "macs",

  files => {
    "Fed_Hepatocyte_BRD4_IP"  => ["/scratch/jbrown_lab/data/1163/1163-ZF-1-ATCACGTT_S1_R1_001.fastq.gz"],
    "Fast_Hepatocyte_BRD4_IP" => ["/scratch/jbrown_lab/data/1163/1163-ZF-2-ACTTGATG_S2_R1_001.fastq.gz"],
    "Fed_Hepatocyte_WCE"      => ["/scratch/jbrown_lab/data/1163/1163-ZF-3-TAGCTTGT_S3_R1_001.fastq.gz"],
    "Fast_Hepatocyte_WCE"     => ["/scratch/jbrown_lab/data/1163/1163-ZF-4-GGCTACAG_S4_R1_001.fastq.gz"],
  },
  treatments => {
    "Fed_Hepatocyte_BRD4_IP"  => ["Fed_Hepatocyte_BRD4_IP"],
    "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_BRD4_IP"],
  },
  controls => {
    "Fed_Hepatocyte_BRD4_IP"  => ["Fed_Hepatocyte_WCE"],
    "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_WCE"],
  },
  perform_chipqc => 1,
};

performChIPSeq_ucsc_mm10($def);
1;

The configuration file is a Perl file that contains the data definition, the experimental design, and the task request.

For each new project, you need to change the following entries:

task_name

  task_name  => "zf_1163_mouse",

email

  email      => "quanhu.sheng.1\@vumc.org",

target folder

  target_dir => create_directory_or_die("/scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo"),

paired-end or single-end data. This example uses single-end data.

is_paired_end => 0,

files

  files => {
    "Fed_Hepatocyte_BRD4_IP"  => ["/scratch/jbrown_lab/data/1163/1163-ZF-1-ATCACGTT_S1_R1_001.fastq.gz"],
    "Fast_Hepatocyte_BRD4_IP" => ["/scratch/jbrown_lab/data/1163/1163-ZF-2-ACTTGATG_S2_R1_001.fastq.gz"],
    "Fed_Hepatocyte_WCE"      => ["/scratch/jbrown_lab/data/1163/1163-ZF-3-TAGCTTGT_S3_R1_001.fastq.gz"],
    "Fast_Hepatocyte_WCE"     => ["/scratch/jbrown_lab/data/1163/1163-ZF-4-GGCTACAG_S4_R1_001.fastq.gz"],
  },

The files structure can be generated initially with the following command, and the sample names can then be updated manually:

cqstools file_def -i /scratch/jbrown_lab/data/1163 -n \(ZF..\)
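If cqstools is not available, a rough equivalent can be produced with a plain shell loop. This is a stand-in, not a description of how cqstools works: list_file_def is a hypothetical helper, and taking the sample name from the part of the file name before the first underscore is an assumption made for this sketch. The names still need manual editing afterwards.

```shell
# list_file_def DIR: print one Perl hash entry per *.fastq.gz file in DIR.
# Assumption: the provisional sample name is the file name up to the first "_".
list_file_def() (
  for f in "$1"/*.fastq.gz; do
    [ -e "$f" ] || continue          # the pattern matched nothing
    base=$(basename "$f")
    name=${base%%_*}                 # strip everything from the first "_" onward
    printf '    "%s" => ["%s"],\n' "$name" "$f"
  done
)

# Usage:
#   list_file_def /scratch/jbrown_lab/data/1163
```

The printed lines can be pasted between the braces of files => { ... }.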

experimental design

You need to define the treatments and controls. The controls section is optional.

  treatments => {
    "Fed_Hepatocyte_BRD4_IP"  => ["Fed_Hepatocyte_BRD4_IP"],
    "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_BRD4_IP"],
  },
  controls => {
    "Fed_Hepatocyte_BRD4_IP"  => ["Fed_Hepatocyte_WCE"],
    "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_WCE"],
  },

species

performChIPSeq_ucsc_mm10($def);

If your samples are from human, change it to the following line:

performChIPSeq_ucsc_hg19($def);
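With the entries edited, one quick sanity check is that every FASTQ path listed under files actually exists. The sketch below pulls quoted absolute paths ending in .fastq.gz out of the configuration file and reports the ones that are missing. missing_fastq is a hypothetical helper, and the pattern assumes paths are written in double quotes as in the example above.

```shell
# missing_fastq CONFIG: print every quoted *.fastq.gz path in CONFIG that does not exist.
missing_fastq() {
  # grep -o prints only the matched text (the quoted path); tr removes the quotes
  grep -o '"/[^"]*\.fastq\.gz"' "$1" | tr -d '"' | while IFS= read -r f; do
    [ -e "$f" ] || echo "MISSING: $f"
  done
}

# Usage, from the codes folder:
#   missing_fastq 20190909_chipseq_zf_1163_mouse.pl
```

No output means every listed path was found.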

Generate ACCRE cluster job scripts

Once the configuration file is prepared, run this Perl file to generate all the cluster script files:

perl 20190909_chipseq_zf_1163_mouse.pl
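If the command above stops with an error, a compile-only check can help narrow it down. perl -c compiles the file without running it, and because use lines are processed at compile time, a failure here may also mean that the CQS modules set up through ~/.bashrc are not being found. check_perl_config is a hypothetical helper name used for this sketch.

```shell
# check_perl_config FILE: compile FILE with "perl -c" (nothing is executed)
# and report whether compilation succeeded.
check_perl_config() {
  if perl -c "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "FAILED: $1"
    return 1
  fi
}

# Usage:
#   check_perl_config 20190909_chipseq_zf_1163_mouse.pl
# To see the actual messages on failure, run: perl -c 20190909_chipseq_zf_1163_mouse.pl
```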

Go over the folder structures

Running the configuration Perl file automatically generates several folders and the necessary scripts for you.

cd /scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo
ls -la

The folders should be listed as:

drwxr-xr-x 10 shengq2 h_vangard_1  4096 Sep  9 15:44 .
drwxr-xr-x  3 shengq2 brown_lab    4096 Sep  9 15:43 ..
-rw-r--r--  1 shengq2 h_vangard_1  1338 Sep  9 15:43 20180123_chipseq_zf_1163_mouse.pl
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 bowtie2
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 bowtie2_cleanbam
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 fastqc_raw
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 macs1callpeak
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 macs1callpeak_chipqc
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 macs1callpeak_homer_annotation
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:43 multiqc
drwxr-xr-x  5 shengq2 h_vangard_1  4096 Sep  9 15:44 sequencetask
-rw-r--r--  1 shengq2 h_vangard_1 14832 Sep  9 15:43 zf_1163_mouse.config
-rw-r--r--  1 shengq2 h_vangard_1  5162 Sep  9 15:43 zf_1163_mouse.def

Each folder contains three subfolders named pbs, log, and result. The pbs folder contains the scripts. The log folder will hold the log output generated by the cluster system while the scripts run. The result folder will hold the result files.
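The layout described above can be checked with a short loop. This is a sketch only: check_task_folders is a hypothetical helper, and it simply reports any task folder that lacks one of the three expected subfolders.

```shell
# check_task_folders PROJECT_DIR: for each folder inside PROJECT_DIR,
# report any of pbs, log, result that is missing.
check_task_folders() (
  for task in "$1"/*/; do            # the trailing slash restricts the match to folders
    [ -d "$task" ] || continue       # the pattern matched nothing
    for sub in pbs log result; do
      [ -d "$task$sub" ] || echo "MISSING: $task$sub"
    done
  done
)

# Usage:
#   check_task_folders /scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo
```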

Submit jobs to ACCRE cluster

cd sequencetask/pbs
sh zf_1163_mouse_pipeline_st.pbs.submit
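This page does not say how to follow the jobs after submission. Assuming the cluster runs the Slurm scheduler (an assumption, not something stated above), squeue -u lists the jobs of one user. show_my_jobs is a hypothetical wrapper that falls back to a message when squeue is not installed, for example on a local computer.

```shell
# show_my_jobs: list the current user's queued and running jobs.
# Assumption: the cluster uses Slurm, so the squeue command is available there.
show_my_jobs() {
  if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}"
  else
    echo "squeue not found: run this on the cluster"
  fi
}

# Usage:
#   show_my_jobs
```

Once the jobs have started, the log subfolder of each task is the place to look for messages from the cluster system.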