#Example Workflow
In this example/tutorial, we will process raw sequence data in fastq format and call CNVs from the alignments. Just as a demonstration, I will give you individualized commands, but you should REALLY consider processing your data with a pipeline script! To get you started, I have a repository that contains several pipeline scripts for processing sequence data.
##Prerequisites
You will need to install the following programs prior to processing your data (at a minimum, these are the tools used in this tutorial):
- BWA (for read alignment)
- Samtools (for SAM/BAM conversion, sorting, merging and indexing)
- Picard tools (for sequence dictionary creation and duplicate marking)
- A Java runtime (required by Picard and RAPTR-SV)
- Perl (required by the helper scripts mentioned below)
##Data files
Just to make things easier, I will refer to the following file types by using pseudonyms:
- The fastq file for the first read of your sample: sequence_1.fq
- The fastq file for the second read of your sample: sequence_2.fq
- Your reference genome fasta file: reference.fa
- Gap locations within your reference genome fasta: reference.gaps.bed
- The SAM file from BWA alignment: sample.sam
- The BAM (or binary version of the SAM file) file: sample.bam
- Post processing BAM file: sample.sorted.nodup.bam
- RAPTR-SV preprocessing step "flat" file: sample.preprocessing.flat
- RAPTR-SV cluster SV calls: sample.cluster.(either "insertions", "deletions" or "tand")
##Workflow step one: Reference genome preprocessing
Generating an appropriate reference genome file is a crucial step before you undertake any analysis of sequence data! If you are downloading data from NCBI, the reference genome chromosome names often have huge titles with special characters (e.g. '|' and ':'). These characters are not very "unix-friendly", so it pays to consider converting them to simple karyotype chromosome names (e.g. "chr1", "chr2") and to order them numerically. If you need to preprocess the chromosome names and data in this way, here are some simple perl "one-liners" that can help you do this:
# This one liner splits your reference fasta into separate, ordered, chromosome names. It assumes that the chromosomes are ordered Karyotypically
$ perl -e '$h = <>; $c = 1; open(OUT, "> chr$c.fa"); print OUT ">chr$c\n"; while(<>){chomp; if($_ =~ /^>/){$c++; close OUT; open(OUT, "> chr$c.fa"); print OUT ">chr$c\n";}else{print OUT "$_\n";}} close OUT;' < reference.fa
# Now, check to make sure that the chromosome names are correct! If you have an 'X' chromosome, you may have to alter the file name, and the first line of the fasta file
# Finally, run "cat" on all of the chromosome names to make one large fasta file:
$ cat chr1.fa chr2.fa (... all the remaining files ...) > reference.ordered.fa
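As a quick sanity check, list the fasta headers to confirm the naming and ordering (the chromosome number and rename below are just placeholders for your own data, assuming your sex chromosome received an autosome-style name from the one-liner above):
# List every fasta header in the combined reference
$ grep '>' reference.ordered.fa
# Example fixup if, say, "chr30" should really be the X chromosome; fix the single-chromosome file, then re-run the "cat" step above
$ sed -i 's/^>chr30$/>chrX/' chr30.fa && mv chr30.fa chrX.fa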
In order to produce rapid alignment results, BWA (the recommended alignment program) needs to generate an index of all possible kmers within your reference fasta, and then process them using the Burrows-Wheeler transform. In order to generate this index, you need to run the following command on your fasta file before alignment:
$ bwa index reference.fa
This should generate a series of files within the same directory as the reference.fa file. Also, you will need to generate a fasta index (".fai") file using samtools. You can generate this file by running this command:
$ samtools faidx reference.fa
Just in case Picard needs a sequence dictionary file, it is best to generate one ahead of time as well:
$ java -jar ~/PicardDirectory/CreateSequenceDictionary.jar REFERENCE=reference.fa OUTPUT=reference.dict
You should be all set! Just please make sure that the ".dict", ".fai", and BWA index files were all generated by the versions of the programs that you will use for the final analysis! Also, it pays to use the same reference genome fasta for ALL subsequent analysis, as the names of the chromosomes will need to be consistent in your alignments and RAPTR-SV data.
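If you want to double check your work, here is roughly what you should see next to your reference after all three commands have run (the BWA index extensions listed are current as of recent BWA versions; treat this as a sketch rather than an exhaustive listing):
$ ls reference*
# reference.dict  reference.fa  reference.fa.amb  reference.fa.ann
# reference.fa.bwt  reference.fa.fai  reference.fa.pac  reference.fa.sa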
##Workflow step two: Alignment
Next, you will need to align your sequence reads to the reference genome that you prepared in the first step. We will be using BWA and Samtools for this step. I would highly recommend using "read groups" in all of your alignment files. It is good practice, and it saves you quite a bit of reprocessing down the road if you ever want to combine samples for analysis. Please read the GATK FAQ about BAM files to get a sense of how Read Groups should be generated for your data.
First, BWA alignment of your reads (replace the '-R' argument with appropriate read group information for your sample):
$ bwa mem -t (number of cpu threads to use) -M -R '@RG\tID:seq1.lib.run\tLB:seq1.lib\tPL:ILLUMINA\tSM:seq1' reference.fa sequence_1.fq sequence_2.fq > sample.sam
Next, convert the output SAM file into a BAM file using samtools, and sort it by chromosome location:
$ samtools view -bS sample.sam | samtools sort -@ (number of cpu threads) -o sample.bam -
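As an aside, if disk space is tight you can avoid writing the large intermediate SAM file entirely by piping the aligner straight into samtools (this assumes a reasonably recent samtools build that supports the '-o' output flag, which the sort command above already uses):
$ bwa mem -t (number of cpu threads) -M -R '@RG\tID:seq1.lib.run\tLB:seq1.lib\tPL:ILLUMINA\tSM:seq1' reference.fa sequence_1.fq sequence_2.fq | samtools view -bS - | samtools sort -@ (number of cpu threads) -o sample.bam -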
Before starting the next steps, you will need to index the BAM file for fast access, using samtools:
$ samtools index sample.bam
##Workflow step three: BAM post processing
The final step we need to perform before RAPTR-SV analysis is to ensure that our BAM is properly sorted and that optical duplicates are marked in the bam file. First, you may need to combine BAM files from the same biological sample into one larger file. In order to do this, you can use samtools to merge all of the files together (please NOTE: only the most recent version of samtools combines read groups from different BAM files together by default!):
$ samtools merge sample.merged.sorted.bam sample1.bam sample2.bam ... sampleN.bam
In order to mark optical duplicates in your data, you will need to use Picard tools (if you merged BAM files in the previous step, substitute your merged BAM as the INPUT here):
$ java -jar ~/PicardDirectory/MarkDuplicates.jar INPUT=sample.bam OUTPUT=sample.sorted.nodup.bam METRICS_FILE=sample.dup.metrics VALIDATION_STRINGENCY=LENIENT
After this final step, your sorted, merged and duplicate marked BAM (sample.sorted.nodup.bam) should be ready for RAPTR-SV processing!
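One small addition that I would suggest here (an assumption on my part, but most BAM-reading tools expect an up-to-date index next to the file): since duplicate marking wrote a brand new BAM, regenerate its index before moving on:
$ samtools index sample.sorted.nodup.bam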
##Workflow step four: RAPTR-SV
Great! You've made it this far! Now, here is the next stage of the process: RAPTR-SV preprocessing. You will want to generate the necessary meta-data in order to identify SVs from your sample, and the default parameters for RAPTR-SV preprocessing should be a good start. Here is how to run the preprocessing step:
$ java -jar /directorytoRAPTR-SV/RAPTR-SV.jar preprocess -i sample.sorted.nodup.bam -o sample.preprocess -r reference.fa
RAPTR-SV uses several temporary files in order to store data on disk (to avoid storing most of that data in your precious system RAM!), and the default location for this data is your system's "tmp" directory, typically /tmp on Unix-based systems. If your system administrator set the size of the /tmp partition to be only a few gigabytes, RAPTR-SV might quickly use up that space and die with an error! To get around this, make a temporary directory on a larger partition, and add the following option to your subsequent RAPTR-SV commands:
-p path/to/temp/directory
RAPTR-SV will clean up after itself, but it needs the space to run both the preprocessing and clustering steps!
You can also set more CPU threads for the program to use by setting the '-t' argument:
-t (number of threads)
This will show a (nearly) linear increase in speed for clustering, but you will find fewer benefits for preprocessing. Still, more threads cannot hurt your processing speed!
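Putting those two options together (the temp directory path and thread count below are just placeholders for your own system), a full preprocessing command might look like this:
$ java -jar /directorytoRAPTR-SV/RAPTR-SV.jar preprocess -i sample.sorted.nodup.bam -o sample.preprocess -r reference.fa -p /large/partition/tmp -t 8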
After the preprocess step finishes, you will have four files in the same directory as your sample's BAM (if you used the "-g" option in the preprocessing step, the "D" will be replaced with the sample Read Group ID from the BAM):
- sample.preprocess.D.bam <- this is the split read alignment BAM file
- sample.preprocess.D.anchor.bam <- these are the anchor reads to the split reads
- sample.preprocess.D.divet <- these are the discordant read pairings
- sample.preprocess.flat <- this is the "master file location" file that tells the RAPTR-SV cluster mode where to find the previously mentioned files
You will use the ".flat" file for the next step, which is the actual SV calling using RAPTR-SV cluster mode:
$ java -jar /directorytoRAPTR-SV/RAPTR-SV.jar cluster -s sample.preprocess.flat -g reference.gaps.bed -o sample.cluster
Cluster mode should generate three additional files, which contain your final SV calls. Here are the names for these files:
- sample.cluster.deletions <- RAPTR-SV deletion calls
- sample.cluster.insertions <- Insertion SVs (mostly called from discordant read pairings)
- sample.cluster.tand <- Tandem duplication SVs
##Workflow step five: Subsequent use
While I work on improving the general ease of use of the program, I have included several "helper" programs in my GitHub account to help process the data from RAPTR-SV into better callsets. From personal use, here are my suggestions (I would be very happy to hear about your user experience and/or custom utilities, so please feel free to contact me at the above listed email address!):
####Filtering
I have created a filtration program (written in perl) that helps narrow down the list of RAPTR-SV calls to a more manageable size, and format the output into a more standard type (here: BED format). In order to run this program, you will need to clone my "perl_toolchain" repository and run a local::lib setup using the install scripts in the base folder of the repository. The repository is at this address: https://github.com/njdbickhart/perl_toolchain
Here is an example usage of the filtration program:
$ perl ~/perl_toolchain/sequence_data_scripts/filterRAPTRSVFiles.pl -b sample.sorted.nodup.bam -i sample.cluster.deletions -o output.filtered.bed
####Annotation
Usually, you will want to check to see if your SV calls intersect with other genomic features (e.g. genes). I have created a java program that will automatically intersect your SV calls from RAPTR-SV (from the filtered script, or raw output) with multiple genomic feature files, and generate a spreadsheet using the jxl library (http://jexcelapi.sourceforge.net/). The repository for this program is at this address: https://github.com/njdbickhart/AnnotateUsingGenomicInfo