Home - zkstewart/psQTL GitHub Wiki

Background

Traditional bulked segregant analysis

A variety of methods and pipelines exist for performing a bulked segregant analysis (BSA). In this type of experiment, organisms will be partitioned (segregated) into two populations (bulks) according to some differing phenotype, then tissue samples will be obtained from all individuals of each population. Those samples are pooled together and DNA is extracted together. Data analysis aims to identify differences in the allele frequencies of these two DNA pools to determine if there is a systematic difference that would point to the existence of one or more quantitative trait loci (QTL).

Historically, pooling has been necessary because of the cost of sequencing, especially when populations consist of dozens or hundreds of individuals. However, this approach carries biases, including but not limited to:

Unequal amounts of DNA obtained from each individual sample within the pooled sample may skew the allele frequency, especially if the overrepresented sample is of a different genotype than most of its peers.
It can be hard or impossible to tell whether a population is a mix of homozygous reference (0/0) and homozygous alternate (1/1) individuals, or is heterozygous (0/1) at that site.

To address this second issue, parents are commonly sequenced separately to provide insight into the likely genotypes of their offspring. However, it is not uncommon for parent samples to be unavailable, or for the true parents of some organisms to be unknown.
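Both biases can be illustrated with a small Python sketch. The function name, the genotype encoding (alt-allele counts of 0, 1 or 2 per diploid sample) and the DNA weights are assumptions made for this example only; they are not part of psQTL.

```python
def pooled_alt_frequency(genotypes, dna_amounts):
    """Alt-allele frequency observed in a pool (hypothetical helper).

    genotypes: alt-allele count per diploid sample (0 = 0/0, 1 = 0/1, 2 = 1/1).
    dna_amounts: relative amount of DNA each sample contributes to the pool.
    """
    total = sum(dna_amounts)
    return sum((g / 2) * (w / total) for g, w in zip(genotypes, dna_amounts))

# Bias 1: an overrepresented sample of a different genotype skews the estimate.
genotypes = [0, 0, 0, 2]
equal = pooled_alt_frequency(genotypes, [1, 1, 1, 1])   # equal DNA from each sample
skewed = pooled_alt_frequency(genotypes, [1, 1, 1, 3])  # the 1/1 sample gives 3x the DNA

# Bias 2: different genotype compositions look identical once pooled.
mixed_homozygotes = pooled_alt_frequency([0, 0, 2, 2], [1, 1, 1, 1])
all_heterozygous = pooled_alt_frequency([1, 1, 1, 1], [1, 1, 1, 1])

print(equal, skewed)
print(mixed_homozygotes, all_heterozygous)
```

In this toy pool, tripling the DNA from a single 1/1 sample doubles the apparent alt-allele frequency, and a half 0/0, half 1/1 population produces the same pooled frequency as an entirely 0/1 population.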

Per-sample segregant analysis

Reduced sequencing costs open up the possibility of sequencing each tissue sample individually. The benefits of doing so include but are not limited to:

Without sample pooling, the bias from unequal DNA contributions is eliminated. Allele frequency no longer needs to be estimated; it can be determined directly for each population.
Samples can be individually genotyped, and we can know the exact proportion of the population that is of homozygous or heterozygous genotype.

Because each individual's genotype is known, we can run an analysis without parent samples being available. We can also use statistics that draw on each sample's genotype rather than the collective allele frequency, in order to obtain greater power when predicting QTLs.
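As a minimal sketch of what per-sample genotypes make possible, the Python below counts genotypes and derives the allele frequency directly for two bulks. The helper name, the toy genotype lists and the simple frequency difference are assumptions for illustration; this is not one of the statistics psQTL computes, and it assumes unphased diploid VCF-style genotype strings with no missing calls.

```python
from collections import Counter

def genotype_summary(genotypes):
    """Return (genotype proportions, alt-allele frequency) for one group.

    Hypothetical helper: expects a non-empty list of unphased diploid
    genotype strings ("0/0", "0/1" or "1/1") with no missing calls.
    """
    counts = Counter(genotypes)
    n = len(genotypes)
    proportions = {gt: counts[gt] / n for gt in ("0/0", "0/1", "1/1")}
    alt_alleles = counts["0/1"] + 2 * counts["1/1"]
    return proportions, alt_alleles / (2 * n)

# Toy genotypes at a single variant site for two phenotypic bulks.
bulk1 = ["0/0", "0/0", "0/1", "0/0"]
bulk2 = ["1/1", "0/1", "1/1", "1/1"]

props1, af1 = genotype_summary(bulk1)
props2, af2 = genotype_summary(bulk2)

print(props1, af1)
print(props2, af2)
print("difference in alt-allele frequency:", abs(af1 - af2))
```

Unlike a pooled estimate, the genotype proportions here distinguish a bulk of heterozygotes from a mix of homozygotes, without any reference to parent samples.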

Segregation at deletion sites

It is well established that SNPs and small indels within QTLs can influence phenotype. However, large deletions (such as those that deactivate or eliminate genes or their regulatory elements) can also be a major contributor to phenotypic differences among organisms. If such a deletion is responsible for the phenotypic segregation of two populations, an analysis of variant site segregation will identify it only through variants linked to the deletion. This is not guaranteed, however, so it can be useful to analyse deletions specifically and how their occurrence segregates between populations.
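The idea of a deletion segregating between populations can be sketched in Python as a comparison of carrier fractions. The presence/absence calls, the helper name and the use of a plain difference are all assumptions for illustration; psQTL's own deletion handling and statistics are described on the pages listed further down.

```python
def deletion_carrier_fraction(has_deletion):
    """Fraction of samples in a group that carry the deletion (hypothetical helper)."""
    return sum(has_deletion) / len(has_deletion)

# Toy per-sample calls for one deleted region: True means the deletion is present.
bulk1_calls = [True, True, True, False]
bulk2_calls = [False, False, True, False]

frac1 = deletion_carrier_fraction(bulk1_calls)
frac2 = deletion_carrier_fraction(bulk2_calls)

print(frac1, frac2, abs(frac1 - frac2))
```

A large gap between the two fractions suggests that the deletion itself tracks the phenotype, whether or not any nearby SNP or indel happens to be linked to it.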

Introduction to psQTL

psQTL runs a per-sample segregant analysis (PSA; also known as ISA or individual segregant analysis) to improve the statistical power of predicting QTLs relative to traditional BSA approaches. It offers three modules to streamline a PSA experiment, including:

It will prepare your data, which includes predicting small variants (SNPs and indels) and larger deleted regions (CNVs), with psQTL_prep.py.
It will process your data to statistically identify variants that segregate between two populations with psQTL_proc.py.
It will plot and report your results with psQTL_post.py, to allow for interrogation and understanding of the likely QTL(s) and what genes may be associated with them.

What to find in this Wiki

Refer to the Installing psQTL page for a detailed run-through of psQTL's installation.

Refer to the Prerequisites page for links to the software and packages that psQTL relies upon.

Refer to the Using psQTL page for a detailed overview of each psQTL function. This page breaks down into subpages for each of the three parts of the psQTL pipeline, ordered from psQTL_prep.py, to psQTL_proc.py, and finishing with psQTL_post.py.

Refer to the Interpreting results page for an overview of how to interpret the plots that psQTL generates, with examples.

Refer to the Example analysis pipeline to see a mock analysis and how the different options of psQTL can be used to identify results.

Refer to the About segregation statistics page for information on the statistics that psQTL uses for predicting QTLs.

Refer to the Rerunning parts of psQTL page for help with situations where you want to change parameters or files in an existing analysis folder.

Refer to the FAQs page for answers to previously asked questions.

Planned features

Several features are half-implemented or are planned for future implementation. These include:

Automatic assignment of samples to groups to optimise segregation statistics
1. This may be useful for identifying potential phenotyping errors, or for discovering latent QTLs in previously sequenced populations where the segregating trait was never phenotyped.

The author is also open to suggestions for features that may be useful. Feel free to open issues in this repository to provide such suggestions, although keep in mind that there is no guarantee any suggestion will be acted upon, especially if it would be time-consuming.