Example analysis pipeline - zkstewart/psQTL GitHub Wiki

Preparation

Prior to psQTL, you should map your reads to your genome however you normally do so, getting sorted BAM results. An example is indicated for a single sample below, but any popular program should work.

bwa mem -t <CPUS> -R <readgroup> <genomeFASTA> <forwardreads> <reverseread> > sample1.sam
samtools sort -@ <CPUS> -O bam -o sample1.sorted.bam -m <MEM>G sample1.sam
bamtools index sample1.sorted.bam

Next, you may optionally want to mark PCR duplicates. The necessity of this is debateable, but doing so is unlikely to hurt.

picard MarkDuplicates I=sample1.sorted.bam O=sample1.sorted.md.bam M=sample1.md_metrics.txt
bamtools index sample1.sorted.md.bam

Afer doing this for all of your samples, you next want to format a metadata file for the analysis. This file should be tab-delimited with two columns and no header. The left column indicates your sample IDs, and the right indicates which of two groups it belongs to. Note that the sample ID should be the same as your BAM files, just with the suffix removed. So for example, if your file is sample1.sorted.md.bam then your sample ID is sample1 and your suffix is .sorted.md.bam.

sample1	group1
sample2 group2
sample3 group2

Now you're ready to use psQTL_prep.py to initialise your working directory. Using this, we'll instruct psQTL to set up a new folder/directory for an analysis of your mapped BAM files using the metadata file you've got ready to go. Note that we're going to name our working directory psqtl_analysis, our metadata file is named metadata.tsv, and our BAM files are all located in /location/of/bam_files with the suffix .sorted.md.bam. It's important that all of your BAM files have that exact same suffix!

python /location/of/psQTL_prep.py initialise -d psqtl_analysis --meta metadata.tsv --bam /location/of/bam_files --bamSuffix .sorted.md.bam

The program will tell you that some values have been stored in various caches - these caches are what psQTL uses to remember where your file are located and what options you've specified.

With the initialisation done, we're ready to call variants now. The program will remember where your metadata and BAM files are located, but you will also need to let psQTL know where the genome FASTA file is located; this should be the exact same file as you used when mapping reads. Lastly, we need to provide some information on how we want to filter our variants in terms of their quality scores. Our genome file will be called /location/of/genome.fasta. We'll use the recommended value of 30 for quality scores and we'll also use 12 threads to speed things along.

python /location/of/psQTL_prep.py call -d psqtl_analysis -f /location/of/genome.fasta --qual 30 --threads 12

We should expect this process to take a while. If we're interested in predicting genomic deletions to see if they appear to be QTLs as well, we can use the depth functionality of psQTL to do that. We'll indicate the same genome FASTA file as before, and let psQTL know what window size we want to predict deletions in.

python /location/of/psQTL_prep.py depth -d psqtl_analysis -f /location/of/genome.fasta --windowSize 1000 --threads 12

Once the analyses are complete, we can view the analysis folder for some details on what we've predicted. Expect this to tell you what your files are, how many variants you called (before and after filtering), and how many windows in your genome appear to contain a deletion in one or more of your samples.

python /location/of/psQTL_prep.py view -d psqtl_analysis

Processing

Either using psQTL or independently yourself, you should have one or both of 1) a variant VCF file, and 2) a deletion VCF-like file. We want to process the file(s) and calculate how much the two groups segregate at variants or genomic deletions.

'call' variants

To calculate the Euclidean distance (ED) between the two groups' small variant calls, we can begin by running:

python /location/of/psQTL_proc.py ed -i call -d psqtl_analysis

That will produce the psQTL_call.alleles_ed.tsv.gz and psQTL_call.genotypes_ed.tsv.gz files within your analysis directory which contain raw ED values for each variant in your VCF file.

We could instead run the command like below by specifying two parent which are contained in our VCF file:

python /location/of/psQTL_proc.py ed -i call -d psqtl_analysis --parents p1 p2

That would produce the aforementioned two files, as well as the psQTL_call.inheritance_ed.tsv.gz file. All three of these files contain a segregation statistic calculated in a slightly different way, with details of their differences available here.

'depth' copy number variants

To look at the segregation between groups at each genomic window region using the deletion VCF-like file, we can run:

python /location/of/psQTL_proc.py ed -i depth -d psqtl_analysis

This will produce the psQTL_depth.ed.tsv.gz file with the raw ED values for segregation at each genomic window region.

sPLS-DA analysis

Other statistics that may be of benefit are the Balanced Accuracy (BA) and feature selection processes employed by sPLS-DA. To generate these statistics for both 'call' and 'depth' files, we can do:

python /location/of/psQTL_proc.py splsda -i call depth -d psqtl_analysis --threads 12

If you only want to run it for either 'call' or 'depth', you can accomplish that by only specifying -i call or -i depth, respectively. However, using both -i call depth will perform an additional sPLS-DA step which is of benefit.

This will produce several files in the psqtl_analysis/splsda directory, and can be plotted in subsequent steps.

Post-processing

Plotting

After running the above processing step, you are now ready to plot the results. psQTL relies upon the visual inspection of results, with a usual process involving you plotting ED and/or BA statistics across the entire genome, then manually looking for regions that have elevated segregation. You can then plot just those regions to give you a "zoomed in" look at the statistics, and iterate upon this process until you are happy with what you've found.

To begin, you might plot all your data for all the chromosomes like:

python /location/of/psQTL_post.py plot -d psqtl_analysis \
    -i call depth -m ed splsda
    -p line scatter -s circos \
    -f /location/of/genome.fasta \
    -o plot.png

Note that we're plotting the data circularly (-s circos) as this will give us the obligatory "pretty publication plot" and also happens to be a good option for densely presenting data.

By specifying -m ed splsda we're telling psQTL to plot not just the Euclidean distance segregation statistic, but also the results from sPLS-DA analysis. These are both presented on different rows or "tracks" in the plot.

We're also plotting the individual data points as well as a smoothed line for both the ed and splsda results using -p line scatter. When specifying these two plot types together, they get overlaid to produce a single row or "track" in your visualisation.

With the resulting plot, you can look for peaks in segregation statistics on your scatter/line plot, or look for SNPs and CNVs that were selected by sPLS-DA as indicated by diamonds. These both help to locate where the evidence most strongly suggests segregation between your two populations, which in theory will be your major QTL(s).

Hypothetically, let's say you find that chr2 is of interest, specifically in the region from 3.8 Mbp to 4.2 Mbp. We can plot that region in greater detail like below:

python /location/of/psQTL_post.py plot -d psqtl_analysis \
    -i call depth -m ed splsda
    -p line scatter coverage genes -s horizontal \
    -f /location/of/genome.fasta \
    --regions chr2:3800000-4200000 \
    -o plot2.png

Note that we're plotting the data horizontally now (-s horizontal) since that is often more appropriate for a single region being visualised.

By running the above command, we're also producing additional plot types since we specified coverage genes as part of our -p argument. The coverage data might help us to contextualise any CNVs that occur, and make it easier to detect if the segregation is due to duplication in one group, or perhaps deletion in the other. The genes data lets us visually identify genes that are within this potential QTL region, and from there begin a literature search to see if any of these genes have been previously associated with the phenotypic trait that segregates between our two populations.

However, by omitting the --ed argument, we've only been visualising one type of segregation statistic as found in the psQTL_call.alleles_ed.tsv.gz file we created earlier. We can specify --ed genotypes or --ed inheritance for one of the other two files, assuming they exist. These alternate statistics may help to identify different types of segregation better (or worse) than the default.

We might also want to consult the data in psQTL_call.[alleles / inheritance / genotypes]_ed.tsv.gz or psQTL_depth.ed.tsv.gz to pick out the variants that segregate most strongly in this QTL. The reporting module can help facilitate this.

Reporting

Based on our plots, we've theoretically identified a region that we're interested in on chr2 from 3.8 Mbp to 4.2 Mbp. We can generate a report table indicating the genes located within this region as rows, and their proximity to segregating SNPs like:

python /location/of/psQTL_post.py report -d psqtl_analysis \
    -m ed-call -t genes \
    -o report1.tsv

We can report on candidate genes and their proximity to potentially segregating CNVs like:

python /location/of/psQTL_post.py report -d psqtl_analysis \
    -m ed-depth -t genes \ # switched ed-call to ed-depth
    -o report2.tsv

Or for candidate genes and their proximity to selected or integrated sPLS-DA features like:

python /location/of/psQTL_post.py report -d psqtl_analysis \
    -m splsda -t genes \ # switched to splda
    -o report3.tsv

We can of course change what type of ED statistic we're looking at just like we did when plotting by adding in the --ed inheritance or --ed genotypes option.

And we can also switch -t genes for -t markers to generate a different table which makes each SNP or windowed region (in which CNVs were assesed) a row in the table. That might enable you to identify a relevant marker to focus on and potentially use for genomic selection. To do so, you'll likely want to consult your VCF file in a more manual capacity, as this stage is outside of the scope of psQTL (at least for now).

⚠️ **GitHub.com Fallback** ⚠️