Chipseq Pipeline at ACCRE - shengqh/cqsperl GitHub Wiki
Log in to ACCRE
To use ACCRE, you need an SSH terminal to connect from your local computer. PuTTY is commonly used on Windows, but any other SSH terminal should work.
If you have a gateway maintained by ACCRE (as CQS does), you can connect to the gateway directly; otherwise, connect to login.accre.vanderbilt.edu for job control. The difference between a gateway and the default login server is that a gateway allows tasks that run for hours or even days, whereas tasks on the login server are limited to 15 minutes.
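From a terminal with an OpenSSH client installed, the connection can be sketched as below. The function name connect_accre and the account name your_vunetid are placeholders for illustration, not part of ACCRE's tooling; pass your gateway's host name as the second argument if you have one.

```shell
# Open an SSH session to ACCRE.
# $1: your ACCRE account name (placeholder in the usage line below)
# $2: optional host; defaults to the login server named above
connect_accre() {
    ssh "$1@${2:-login.accre.vanderbilt.edu}"
}

# Usage (uncomment and replace the placeholder account name):
# connect_accre your_vunetid
```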
Setup
Run the following commands. They only need to be run once.
echo "set ssl:verify-certificate false" >> ~/.lftprc
echo "source /scratch/jbrown_lab/path.txt" >> ~/.bashrc
The first line configures lftp, which is used to download data from VANTAGE. The second line sets up the Perl library path, which is used to generate the pipeline scripts.
You can check that those lines were added successfully with:
tail ~/.lftprc
tail ~/.bashrc
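Because >> appends every time it runs, repeating the setup commands would add duplicate lines. A minimal sketch of a guard that appends a line only when it is missing is shown below; append_once is a hypothetical helper name, not something provided by ACCRE or cqsperl.

```shell
# Append a line to a file only if that exact line is not already present.
# $1: target file, $2: line to add
append_once() {
    touch "$1"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}
```

With this helper, the setup could be written as append_once ~/.lftprc "set ssl:verify-certificate false" followed by append_once ~/.bashrc "source /scratch/jbrown_lab/path.txt", and running it a second time would leave both files unchanged.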
Download data from VANTAGE
Assume you received a username (Brown_1111) and password (#FAKE_PASSWORD) from VANTAGE and want to store your data in the folder /scratch/jbrown_lab/data/3808:
cd /scratch/jbrown_lab/data
mkdir 3808
cd 3808
lftp Brown_1111@<vantage-ftp-host>
Once you have entered the password and logged in to the FTP server, you can list the files in the folder with
ls
and download all files by
mget *
or download the whole folder (if there are subfolders) with:
mirror
When all files are downloaded, type "quit" to exit lftp.
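Large transfers occasionally get truncated, so it can be worth checking the downloaded archives before starting the analysis. The sketch below assumes the files are gzip-compressed FASTQ files named *.fastq.gz, as in the examples later on this page; check_fastq_gz is a hypothetical helper name.

```shell
# Test every *.fastq.gz file in a folder and report any that fail.
# $1: folder to check (defaults to the current directory)
check_fastq_gz() {
    status=0
    for f in "${1:-.}"/*.fastq.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        # gzip -t verifies the compressed data without extracting it
        if gzip -t "$f" 2>/dev/null; then
            echo "OK       $f"
        else
            echo "CORRUPT  $f"
            status=1
        fi
    done
    return $status
}
```

For the download above, this would be called as check_fastq_gz /scratch/jbrown_lab/data/3808. Any file reported as CORRUPT is a candidate for downloading again.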
Write configuration file
Setup folder and copy template file
First of all, create your own folder under /scratch/jbrown_lab, just like mine:
mkdir /scratch/jbrown_lab/shengq2
cd /scratch/jbrown_lab/shengq2
mkdir codes
mkdir projects
The codes folder stores the configuration files, and the projects folder stores the analysis results. Now copy the template file into the codes folder:
cd codes
cp /scratch/jbrown_lab/examples/20190909_chipseq_zf_1163_mouse.pl .
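The folder setup and template copy above can be collected into one small function. This is a sketch only: setup_workspace and the new configuration file name are hypothetical, and mkdir -p is used so that rerunning it does not fail when the folders already exist.

```shell
# Create codes/ and projects/ under a personal folder and copy a template
# configuration file into codes/ under a new name.
# $1: personal folder, $2: template file, $3: new configuration file name
setup_workspace() {
    mkdir -p "$1/codes" "$1/projects" || return 1
    # Keep an existing configuration file rather than overwriting your edits
    [ -e "$1/codes/$3" ] || cp "$2" "$1/codes/$3"
}
```

An example call would be setup_workspace /scratch/jbrown_lab/shengq2 /scratch/jbrown_lab/examples/20190909_chipseq_zf_1163_mouse.pl my_project.pl, where my_project.pl stands in for whatever name you choose.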
Prepare configuration file
more /scratch/jbrown_lab/examples/20190909_chipseq_zf_1163_mouse.pl
The content of the configuration file looks like:
#!/usr/bin/perl
use strict;
use warnings;
use CQS::ClassFactory;
use CQS::FileUtils;
use CQS::PerformChIPSeq;
my $def = {
  task_name => "zf_1163_mouse",
  email => "quanhu.sheng.1\@vumc.org",
  target_dir => create_directory_or_die("/scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo"),
  is_paired_end => 0,
  perform_cutadapt => 0,
  adapter => "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC", #TruSeq adapter
  min_read_length => 30,
  #mapping
  aligner => "bowtie2",
  #peak calling
  peak_caller => "macs",
  files => {
    "Fed_Hepatocyte_BRD4_IP" => ["/scratch/jbrown_lab/data/1163/1163-ZF-1-ATCACGTT_S1_R1_001.fastq.gz"],
    "Fast_Hepatocyte_BRD4_IP" => ["/scratch/jbrown_lab/data/1163/1163-ZF-2-ACTTGATG_S2_R1_001.fastq.gz"],
    "Fed_Hepatocyte_WCE" => ["/scratch/jbrown_lab/data/1163/1163-ZF-3-TAGCTTGT_S3_R1_001.fastq.gz"],
    "Fast_Hepatocyte_WCE" => ["/scratch/jbrown_lab/data/1163/1163-ZF-4-GGCTACAG_S4_R1_001.fastq.gz"],
  },
  treatments => {
    "Fed_Hepatocyte_BRD4_IP" => ["Fed_Hepatocyte_BRD4_IP"],
    "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_BRD4_IP"],
  },
  controls => {
    "Fed_Hepatocyte_BRD4_IP" => ["Fed_Hepatocyte_WCE"],
    "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_WCE"],
  },
  perform_chipqc => 1,
};
performChIPSeq_ucsc_mm10($def);
1;
The configuration file is a Perl file that contains the data definition, the experimental design and the task request.
For each new project, you need to change the following entries:
- task_name and email
task_name => "zf_1163_mouse",
email => "quanhu.sheng.1\@vumc.org",
- target folder
target_dir => create_directory_or_die("/scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo"),
- paired-end or single-end data. This example uses single-end data.
is_paired_end => 0,
- files
files => {
  "Fed_Hepatocyte_BRD4_IP" => ["/scratch/jbrown_lab/data/1163/1163-ZF-1-ATCACGTT_S1_R1_001.fastq.gz"],
  "Fast_Hepatocyte_BRD4_IP" => ["/scratch/jbrown_lab/data/1163/1163-ZF-2-ACTTGATG_S2_R1_001.fastq.gz"],
  "Fed_Hepatocyte_WCE" => ["/scratch/jbrown_lab/data/1163/1163-ZF-3-TAGCTTGT_S3_R1_001.fastq.gz"],
  "Fast_Hepatocyte_WCE" => ["/scratch/jbrown_lab/data/1163/1163-ZF-4-GGCTACAG_S4_R1_001.fastq.gz"],
},
The file structure can be generated initially with the following command, after which the sample names can be updated manually:
cqstools file_def -i /scratch/jbrown_lab/data/1163 -n \(ZF..\)
- experimental design. You need to define the treatments and controls. The controls section is optional.
treatments => {
  "Fed_Hepatocyte_BRD4_IP" => ["Fed_Hepatocyte_BRD4_IP"],
  "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_BRD4_IP"],
},
controls => {
  "Fed_Hepatocyte_BRD4_IP" => ["Fed_Hepatocyte_WCE"],
  "Fast_Hepatocyte_BRD4_IP" => ["Fast_Hepatocyte_WCE"],
},
- species
performChIPSeq_ucsc_mm10($def);
Change it to the following line if your samples are from human.
performChIPSeq_ucsc_hg19($def);
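A mistyped path in the files section would likely only show up later as a failed job, so it may be worth checking the paths before generating scripts. The sketch below is a rough text scan, not part of cqsperl: it assumes every input path in the configuration is written as a double-quoted absolute path ending in .fastq.gz, and check_config_files is a hypothetical helper name.

```shell
# Print every quoted *.fastq.gz path in a configuration file that does not exist.
# $1: configuration file
check_config_files() {
    # grep -o prints only the matching part of each line, one match per line
    grep -o '"/[^"]*\.fastq\.gz"' "$1" | tr -d '"' | while IFS= read -r f; do
        [ -e "$f" ] || echo "missing: $f"
    done
}
```

Running check_config_files 20190909_chipseq_zf_1163_mouse.pl in the codes folder prints nothing when all listed files are present.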
Generate ACCRE cluster job scripts
Once the configuration file is prepared, run it with Perl to generate all cluster script files:
perl 20190909_chipseq_zf_1163_mouse.pl
Go over the folder structures
Running the configuration file automatically generates multiple folders and the necessary scripts.
cd /scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo
ls -la
The folders should be listed as:
drwxr-xr-x 10 shengq2 h_vangard_1 4096 Sep 9 15:44 .
drwxr-xr-x 3 shengq2 brown_lab 4096 Sep 9 15:43 ..
-rw-r--r-- 1 shengq2 h_vangard_1 1338 Sep 9 15:43 20180123_chipseq_zf_1163_mouse.pl
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 bowtie2
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 bowtie2_cleanbam
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 fastqc_raw
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 macs1callpeak
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 macs1callpeak_chipqc
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 macs1callpeak_homer_annotation
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:43 multiqc
drwxr-xr-x 5 shengq2 h_vangard_1 4096 Sep 9 15:44 sequencetask
-rw-r--r-- 1 shengq2 h_vangard_1 14832 Sep 9 15:43 zf_1163_mouse.config
-rw-r--r-- 1 shengq2 h_vangard_1 5162 Sep 9 15:43 zf_1163_mouse.def
Each folder contains three sub-folders named pbs, log and result. The pbs folder contains the scripts. The log folder will contain the log output generated by the cluster system while the scripts run. The result folder will contain the result files.
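To see this layout at a glance, a short loop can count the entries in each task's pbs, log and result sub-folders. This is an illustrative sketch; summarize_tasks is a hypothetical helper name.

```shell
# Print the number of entries in pbs/, log/ and result/ for each task folder.
# $1: project folder (defaults to the current directory)
summarize_tasks() {
    for task in "${1:-.}"/*/; do
        [ -d "${task}pbs" ] || continue      # skip folders without the expected layout
        name=$(basename "$task")
        for sub in pbs log result; do
            # ls -A lists hidden files too but omits . and ..
            n=$(ls -A "${task}${sub}" 2>/dev/null | wc -l | tr -d ' ')
            echo "$name/$sub: $n"
        done
    done
}
```

Called from the project folder as summarize_tasks, it prints one line per sub-folder, such as bowtie2/pbs followed by its entry count.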
Submit jobs to ACCRE cluster
cd sequencetask/pbs
sh zf_1163_mouse_pipeline_st.pbs.submit
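After submission, the log folders described above are a natural first place to look when something goes wrong. The sketch below lists log files that mention "error" (case-insensitive); it assumes nothing about the log file names, and find_failed_logs is a hypothetical helper name. A match is only a hint, since some tools print the word in harmless messages.

```shell
# List log files under <project>/*/log/ that contain the word "error".
# $1: project folder (defaults to the current directory)
find_failed_logs() {
    for log in "${1:-.}"/*/log/*; do
        [ -f "$log" ] || continue
        # -q: quiet, -i: ignore case
        grep -qi "error" "$log" && echo "$log"
    done
    return 0
}
```

For this project it would be called as find_failed_logs /scratch/jbrown_lab/shengq2/projects/20190909_chipseq_zf_1163_mouse_redo.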