GBS Bioinformatics Protocols
GBS protocols
This is a place to describe how and why each step is done for our GBS bioinformatics. This is very liberally taken from Liz Mandeville's GBS vignette and Amanda Meuser's Bioinformatics Protocol, but updated with tips and tricks specific to the StatsGen Lab at York. This is intended to be a working document, so please add to it as you discover things useful to our lab.
Data Management Steps
Decompressing .ora files
When we get libraries back from SickKids, they come as a .fastq.ora file that needs to be decompressed before we can demultiplex.
This is done using proprietary Illumina software called DRAGEN, which can be downloaded here: DRAGEN
More information on how this decompression works is here: Ora Compression Help
orad -d --gz Library.fastq.ora --ora-reference ~/Downloads/orad_2_6_1/oradata/
You will either need to have added the orad (DRAGEN) software to your bin to make this work, or you will need to reference where it lives (i.e. in Downloads); see the example after the flag descriptions.
-d decompression
--gz decompress in gz format
--ora-reference tells the software where the oradata references are. Our data will be compressed with the Homo sapiens reference, because this is how SickKids does it so that they can put multiple libraries on the lane. I've been reassured that this shouldn't result in a loss of data.
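If you haven't added orad to your bin, a minimal sketch of calling it from where it was downloaded (the paths here follow the Downloads example above and may differ on your machine):

```bash
# Calling orad directly from the Downloads folder (paths are assumptions; adjust to where you unpacked DRAGEN)
~/Downloads/orad_2_6_1/orad -d --gz Library.fastq.ora --ora-reference ~/Downloads/orad_2_6_1/oradata/
```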
Moving the library to compute canada
Once you've downloaded the library from SickKids (and either before or after you've decompressed it), you'll have to transfer it to Compute Canada. The easiest way to do this is using rsync (but see more directions and options here: transferring data).
From within the folder where the library is, this will look like this. NB: For the StatsGen Lab Eryn will almost always do this step, as she will set the permissions for the library once it's on Cedar.
rsync -avzh --no-g --no-p Library1-Mouse_Fescue_S1_L008_R1_001.fastq.ora mcfarlas@cedar.computecanada.ca:/home/mcfarlas/projects/def-mcfarlas/GBS_libraries
Unzipping the .fastq.gz
Once you have a .fastq.gz file on Compute Canada somewhere (probably Cedar!), you want to unzip it so that you can use the .fastq file itself. This step really could take a while (the first time I did this, it took at least an hour, but Cedar was cranky and I was using York wifi, so who knows what normal looks like).
gunzip filename.fastq.gz
After you've done this step, you can take a look at how many reads you have (which is the number of lines in your fastq file divided by 4 - if this doesn't make sense to you, make sure to take a look at what to expect in a fastq file).
You can look at the number of lines like this:
wc -l filename.fastq
It might also surprise you how long this takes.
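A quick one-liner sketch to get the read count directly (assuming a standard four-lines-per-read fastq; the filename is a placeholder):

```bash
# Reads = lines / 4 for a standard fastq
echo $(( $(wc -l < filename.fastq) / 4 ))
```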
Demultiplexing
The next step is to demultiplex your data. This is because we have pooled all of our samples to sequence in one lane at a time. We're going to use a program called Sabre to do this. Sabre is available on Compute Canada.
To use sabre you need a barcode file, which is a tab-delimited text file where the first column is the barcode and the second column is the name of the individual fastq file you would like to make for that barcode/individual:
For example:
barcode1    barcode1_output_file.fastq
barcode2    barcode2_output_file.fastq
etc...
Running single-end demultiplexing with sabre will look like this:
sabre se -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq -m 2
-f is the input fastq file, which you have just unzipped above
-b is the barcode file
-u is the unknown output, specifically where the reads with no barcodes will go. We will have a lot of these because we've done paired-end sequencing, but we only have a barcode on one side (the EcoRI end). This means that half of our reads aren't barcoded and get jettisoned.
-m is the number of allowed mismatches. Note that Amanda Meuser used 2, I'm running down where this decision came from.
Probably we won't run this using an interactive job (remember that there are interactive and batch jobs), but using a slurm script, like the sketch below.
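A minimal sketch of what that slurm script might look like; the account name, module name, and resource requests are assumptions to adjust for our allocation:

```bash
#!/bin/bash
#SBATCH --account=def-mcfarlas   # assumption: the lab allocation from the rsync path above
#SBATCH --time=02:00:00          # assumption: adjust to your library size
#SBATCH --mem=4G
#SBATCH --job-name=sabre_demux

module load sabre                # assumption: module name on Compute Canada

sabre se -f input_file.fastq -b barcode_data.txt -u unknown_barcode.fastq -m 2
```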
One thing to keep in mind is that we need just the unique identifier for sabre to work, not the entire barcode (which includes EcoRI and the Illumina primer). Also, what you give sabre needs to be the reverse complement of the unique identifier (thank you to Liz Mandeville for explaining this to me, and making us a key!). This means that the barcodes in your sabre barcodes file for demultiplexing should only be 8-10 base pairs long, not the entire barcode from the barcode plates file.
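If you ever need to reverse complement an identifier yourself, a quick sketch with standard shell tools (the barcode here is made up):

```bash
# Reverse complement a hypothetical 8 bp identifier: reverse the string, then swap complementary bases
echo "ACGGTCAT" | rev | tr "ACGT" "TGCA"   # prints ATGACCGT
```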
Alignment
Indexing
The first step is that we need to index the reference genome. This only needs to be done once for each reference genome. Essentially, indexing tells our alignment software where specific locations are in the genome (like the index in the back of a book). This makes alignment go faster. For a much better explanation, read here.
bwa index /reference/genome/file.fa
Depending on your workflow (for example, if you're the first in the lab to work on your specific species), the indexing might be part of a nextflow pipeline. Or, if you're working on a species that we've worked on before (as of right now, this doesn't apply to anyone), you can skip this step and use the genome that someone else has already indexed.
Alignment
The next step is to align the reads for each individual to the reference genome, giving each individual their own genome (extremely low coverage; we don't really consider or use this as a genome because GBS covers so little of it). This is computationally intensive because it needs to happen separately for each individual (pulling from the fastq files made in the demultiplexing step).
There are a couple of ways that we can do this.
The first is using nextflow - this is probably my preference, because it is an organized way to get each of the jobs launched for each individual. However, I'm not sure what this is going to do to priority.
The basic command for alignment using bwa is:
bwa mem -t 16 reference.fa sampleID.fq > bwa/sampleID.sam
To get this to work, the reference genome needs to have been indexed, and all of the other index files (.amb, .ann, .bwt, .pac, .sa) need to be in the folder with it. Additionally, this assumes that you've already made a bwa working directory inside your species directory. "-t" tells bwa how many threads to use; if this is in a shell script, it should match #SBATCH --ntasks-per-node=16 at the top of the script.
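If you're not using nextflow, one alternative is a slurm array job that launches one alignment per individual. A sketch, assuming a hypothetical samples.txt with one sample ID per line and the lab allocation from above:

```bash
#!/bin/bash
#SBATCH --account=def-mcfarlas    # assumption: lab allocation
#SBATCH --time=06:00:00           # assumption: adjust per library
#SBATCH --ntasks-per-node=16      # matches -t 16 below
#SBATCH --array=1-96              # assumption: one task per individual; set to your sample count

module load bwa

# samples.txt is a hypothetical file listing one sample ID per line
sampleID=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

bwa mem -t 16 reference.fa ${sampleID}.fq > bwa/${sampleID}.sam
```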
After alignment, you need to convert from sam to bam. This converts the sam file to a compressed binary file.
samtools view -b -S -o bwa/sampleID.bam bwa/sampleID.sam
Finally, we sort and index the bam files
samtools sort bwa/sampleID.bam -o bwa/sampleID.sorted.bam
samtools index bwa/sampleID.sorted.bam
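As a design note, these steps can also be piped together per individual so the large intermediate .sam file never hits disk; a sketch, not a required part of the protocol:

```bash
# Align, convert, and sort in one pass; samtools sort accepts SAM on stdin
bwa mem -t 16 reference.fa sampleID.fq | samtools sort -o bwa/sampleID.sorted.bam -
samtools index bwa/sampleID.sorted.bam
```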
Variant Calling
After alignment, sorting, and indexing, it's time to bring all of the individuals' files together again to call variants. This is when we look across all of the sequences and find the SNPs that are both regularly called and sufficiently variable to be informative.
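We haven't pinned down the exact commands yet, but since we're planning on samtools and bcftools (see the Nextflow note below), a minimal sketch of joint variant calling might look like this; the output name is a placeholder and any filtering thresholds still need to be decided:

```bash
# Joint genotyping across all sorted bams; -m is the multiallelic caller, -v keeps only variant sites
bcftools mpileup -f reference.fa bwa/*.sorted.bam | bcftools call -mv -Oz -o variants.vcf.gz
```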
Additional Resources
Bioinformatics Pipeline document created by Amanda Meuser
Bioinformatics Vignette created by Liz Mandeville
bioinformatics_vignette_fall2020.pdf
Nextflow resources
Nextflow on Compute Canada
Nextflow for Newcomers
Nextflow for Genomics Note - this isn't the pipeline we're going to use (we're going to use samtools and bcftools instead), but it gives an idea of how to do this all using Nextflow.