Example analysis pipeline - zkstewart/psQTL GitHub Wiki

Preparation

Prior to psQTL, you should map your reads to your genome however you normally do so, getting sorted BAM results. An example is indicated for a single sample below, but any popular program should work.

bwa mem -t <CPUS> -R <readgroup> <genomeFASTA> <forwardreads> <reverseread> > sample1.sam
samtools sort -@ <CPUS> -O bam -o sample1.sorted.bam -m <MEM>G sample1.sam
bamtools index sample1.sorted.bam

Next, you may optionally want to mark PCR duplicates. The necessity of this is debateable, but doing so is unlikely to hurt.

picard MarkDuplicates I=sample1.sorted.bam O=sample1.sorted.md.bam M=sample1.md_metrics.txt
bamtools index sample1.sorted.md.bam

Afer doing this for all of your samples, you next want to format a metadata file for the analysis. As detailed earlier this file should be tab-delimited with two columns and no header. The left column indicates your sample IDs, and the right indicates which of two bulks it belongs to. Note that the sample ID should be the same as your BAM files, just with the suffix removed. So for example, if your file is sample1.sorted.md.bam then your sample ID is sample1 and your suffix is .sorted.md.bam.

sample1	bulk1
sample2 bulk2
sample3 bulk2

Now you're ready to use psQTL_prep.py to initialise your working directory. Using this, we'll instruct psQTL to set up a new folder/directory for an analysis of your mapped BAM files using the metadata file you've got ready to go. Note that we're going to name our working directory psqtl_analysis, our metadata file is named metadata.tsv, and our BAM files are all located in /location/of/bam_files with the suffix .sorted.md.bam. It's important that all of your BAM files have that exact same suffix!

python /location/of/psQTL_prep.py initialise -d psqtl_analysis --meta metadata.tsv --bam /location/of/bam_files --bamSuffix .sorted.md.bam

The program will tell you that some values have been stored in various caches - these caches are what psQTL uses to remember where your file are located and what options you've specified.

With the initialisation done, we're ready to call variants now. The program will remember where your metadata and BAM files are located, but you will also need to let psQTL know where the genome FASTA file is located; this should be the exact same file as you used when mapping reads. Lastly, we need to provide some information on how we want to filter our variants in terms of their quality scores. Our genome file will be called /location/of/genome.fasta. We'll use the recommended value of 30 for quality scores and we'll also use 12 threads to speed things along.

python /location/of/psQTL_prep.py call -d psqtl_analysis -f /location/of/genome.fasta --qual 30 --threads 12

We should expect this process to take a while. If we're interested in predicting genomic deletions to see if they appear to be QTLs as well, we can use the depth functionality of psQTL to do that. We'll indicate the same genome FASTA file as before, and let psQTL know what window size we want to predict deletions in.

python /location/of/psQTL_prep.py depth -d psqtl_analysis -f /location/of/genome.fasta --windowSize 1000 --threads 12

Once the analyses are complete, we can view the analysis folder for some details on what we've predicted. Expect this to tell you what your files are, how many variants you called (before and after filtering), and how many windows in your genome appear to contain a deletion in one or more of your samples.

python /location/of/psQTL_prep.py view -d psqtl_analysis

Processing

Either using psQTL or independently yourself, you should have one or both of 1) a variant VCF file, and 2) a deletion VCF-like file. We want to process the file(s) and calculate how much the two bulks segregate at variants or genomic deletions. One statistic used to measure that is the Euclidean distance (ED), and we can calculate that for the variant call file with:

python /location/of/psQTL_proc.py ed -i call -d psqtl_analysis

That will produce the psQTL_variants.ed.tsv.gz file within your analysis directory which contains the raw ED values for each variant in your VCF file.

Alternatively, to look at the segregation at genomic window regions using the deletion VCF-like file, we can do it with:

python /location/of/psQTL_proc.py ed -i depth -d psqtl_analysis

This will produce the psQTL_depth.ed.tsv.gz file with the raw ED values for segregation at each genomic window region.

Another statistic that may be of benefit are the Balanced Accuracy (BA) and feature selection processes employed by sPLS-DA. To generate these statistics for both call and depth files, we can do:

python /location/of/psQTL_proc.py splsda -i call depth -d psqtl_analysis --threads 12

This will produce several files in the psqtl_analysis/splsda directory, and can be plotted in subsequent steps.

Post-processing

Plotting

After running the above processing step, you are now ready to plot the results. psQTL relies upon the visual inspection of results, with a usual process involving you plotting ED and/or BA statistics across the entire genome, then manually looking for regions that have elevated segregation. You can then plot just those regions to give you a "zoomed in" look at the statistics, and iterate upon this process until you are happy with what you've found.

To begin, you might plot all your data for all the chromosomes like:

python /location/of/psQTL_post.py plot -d psqtl_analysis \
    -i call depth -m ed splsda
    -p line scatter -s circos \
    -f /location/of/genome.fasta \
    -o plot.png

Note that we're plotting the data circularly (-s circos) as this will give us the obligatory "pretty publication plot" and also happens to be a good option for densely presenting data.

Using this plot, you can look for peaks in segregation statistics on your line plot, or look for SNPs and CNVs that were selected by sPLS-DA as indicated by diamonds. These both help to locate where the evidence most strongly suggests segregation between your two populations, which in theory will be your major QTL(s).

Hypothetically, let's say you find that chr2 is of interest, specifically in the region from 3.8 Mbp to 4.2 Mbp. We can plot that region in greater detail like below:

python /location/of/psQTL_post.py plot -d psqtl_analysis \
    -i call depth -m ed splsda
    -p line scatter coverage genes -s horizontal \
    -f /location/of/genome.fasta \
    -o plot2.png

Note that we're plotting the data horizontally now (-s horizontal) since that is more appropriate for a single region being visualised. Using this plot, we can visually identify genes that are within this QTL region, and from there begin a literature search to see if any of these genes have been previously associated with the phenotypic trait that segregates between our two populations.

We might also want to consult the data files in psQTL_variants.ed.tsv.gz or psQTL_depth.ed.tsv.gz to pick out the variants that segregate most strongly in this QTL. It goes beyond the scope of psQTL, but using that information you might want to design marker regions for genomic selection of plants or investigate their impact in more detail e.g., seeing whether a particular variant is changing amino acids or disrupting a gene's reading frame.

Reporting

This is still to be implemented. It should let you tabulate reports to interrogate the data in the context of gene proximity to variants or deletions.

⚠️ **GitHub.com Fallback** ⚠️