FAQs - zkstewart/psQTL GitHub Wiki

What is the difference between a 'bulk' and a 'group'?

In a typical bulked segregant analysis (BSA), the two populations with divergent phenotypes are called bulks, because the DNA from each population is pooled into a single bulked sample before sequencing. In a per-sample segregant analysis (PSA), no bulking occurs. Hence, we simply refer to the two populations being compared in PSA as groups.

How does psQTL differ from BSA methods?

psQTL is positioned as a replacement for BSA methods like QTL-seq in situations where you have sequenced each sample individually, rather than performing bulked or pooled sequencing. If you have per-sample sequencing data, psQTL can be used in any situation where a BSA method would apply.

Thinking in terms of the data encoded within a VCF, psQTL uses the genotype call (GT) of each sample to obtain the true allele frequency, assuming no genotype calling errors. This differs from BSA, which uses the allele depth (AD) from pooled sequencing to obtain an estimated allele frequency within a group.
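To make the GT versus AD distinction concrete, here is a minimal Python sketch. It is not psQTL's own code; the function names are hypothetical and the read depths are made-up illustration values. It assumes GT strings use `/` or `|` as separators and that any non-zero allele counts as an alternate allele.

```python
def alt_freq_from_gt(gt_calls):
    """Alternate allele frequency counted from per-sample GT calls (PSA-style)."""
    alt = total = 0
    for gt in gt_calls:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            continue  # skip samples without a complete genotype call
        alt += sum(1 for allele in alleles if allele != "0")
        total += len(alleles)  # works for any ploidy
    return alt / total if total else None


def alt_freq_from_ad(ref_depth, alt_depth):
    """Alternate allele frequency estimated from pooled read depths (BSA-style)."""
    total = ref_depth + alt_depth
    return alt_depth / total if total else None


# Four diploid samples carrying 3 alternate alleles out of 8 in total
per_sample = alt_freq_from_gt(["0/0", "0/1", "1/1", "0/0"])

# Hypothetical pooled reads for the same group: 22 reference, 18 alternate
pooled = alt_freq_from_ad(22, 18)
```

With these inputs the per-sample count gives 3/8 = 0.375, while the pooled estimate of 18/40 = 0.45 drifts away from it because reads are sampled unevenly from the pool.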

Besides giving greater precision in the allele frequency, PSA as implemented in psQTL allows novel statistics to be used to measure segregation between groups. These statistics benefit from knowing each sample's genotype rather than only the collective allele frequency. Refer to the About segregation statistics Wiki page for more details.

Is psQTL suitable for my polyploid organism?

Yes, psQTL is able to work with polyploids! At this time, due to the difficulty associated with accurately calling variants in polyploid organisms, psQTL does not attempt to perform variant calling for you as part of the psQTL_prep.py call function.

You are encouraged to devise a system that works for your species in particular and produce a filtered VCF to input to psQTL as python /location/of/psQTL_prep.py initialise -d /location/to/run/the/analysis --fvcf yourPolyploidVariants.vcf. Copy number variant calling by psQTL is able to accommodate polyploidy, and all downstream processes are similarly capable of handling polyploid data.

Is psQTL appropriate for my population?

Any cross which produces segregation in the phenotype, and for which we can reasonably expect the difference in phenotype to be attributable to variants which occur in one group and not the other, is suited to psQTL analysis.

In practice, this often means that a biparental cross has occurred, and the progeny from that cross are observed to segregate strongly for a particular phenotype. In this situation, we should expect there to be particular genetics being inherited in one group which are associated with the phenotype, with the other group not inheriting those genetics.

psQTL does not enforce any experimental design, and so you can theoretically input data from any two groups. However, there is no guarantee that the results will be meaningful if those samples are not closely related as we'd find in a biparental mapping population. Good experimental design will be critical to the success of any BSA or PSA outcome.

Can psQTL handle heterozygous organisms?

Yes, psQTL can work with data coming from a heterozygous organism. None of the calculations performed by psQTL make any assumptions regarding heterozygosity. All psQTL aims to do is compare the allele or genotype frequency between two groups and locate variants where this frequency is biased, such that the variant occurs more often in one group than the other. Non-QTL regions are expected to show little to no bias in that frequency, whereas QTL regions will show a greater degree of bias. That principle holds true regardless of heterozygosity level.
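The idea of frequency bias between two groups can be sketched in a few lines of Python. This is only an illustration using a plain absolute difference in alternate allele frequency; it is not one of psQTL's actual statistics (see the About segregation statistics Wiki page for those), and the function names and genotypes are hypothetical.

```python
def alt_allele_frequency(gt_calls):
    """Fraction of called alleles that are non-reference within one group."""
    alleles = []
    for gt in gt_calls:
        called = gt.replace("|", "/").split("/")
        if "." not in called:  # ignore samples with missing calls
            alleles.extend(called)
    return sum(1 for allele in alleles if allele != "0") / len(alleles)


def frequency_bias(group1_calls, group2_calls):
    """Absolute difference in alternate allele frequency between two groups."""
    return abs(alt_allele_frequency(group1_calls) - alt_allele_frequency(group2_calls))


# A site where both groups carry the variant equally often (heterozygotes included)
unlinked = frequency_bias(["0/1", "0/0", "0/1", "1/1"], ["0/1", "0/1", "0/0", "1/1"])

# A site where the variant is concentrated in the first group
linked = frequency_bias(["1/1", "1/1", "0/1", "1/1"], ["0/0", "0/0", "0/1", "0/0"])
```

Here `unlinked` is 0.0 and `linked` is 0.75. Heterozygous calls simply contribute one alternate allele each, so nothing in the comparison depends on the overall heterozygosity level.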

I already have my own VCF of variant calls. Can psQTL analyse this?

Yes, as detailed above, psQTL is capable of accepting your own custom VCF, so long as it conforms to the normal formatting of a VCF file.

Does psQTL expect data to originate from WGS, or can I use exome or DArT sequencing?

psQTL is agnostic to sequencing technology, and as such any type of data is suitable. However, if you are working with sequencing types like exome or DArT sequencing, where reads are not evenly distributed throughout the genome, you may want to change how CNV prediction occurs.

Specifically, when using psQTL_prep.py depth, you may want to change the --windowSize value from its default of 1000 to something better suited to your sequencing. For exome sequencing, having a window size loosely based on the average length of a gene (introns included) may better capture meaningful differences in copy numbers; in such a case, --windowSize 10000 might be a good choice. Likewise for DArT sequencing, the average expected spacing between each restriction enzyme site may help to inform your chosen window size.

If you find yourself unsure, don't worry. Just try the default option first, and if your CNV output results look a bit messy, try adding a zero onto your window size and re-run psQTL to see if that makes more sense of your data.
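To see why window size matters for unevenly distributed data, the sketch below counts how many fixed-size windows along a chromosome contain at least one sequenced target. It is a toy illustration under stated assumptions, not how psQTL computes depth: the function name, target positions, and chromosome length are all hypothetical.

```python
def covered_window_fraction(positions, chrom_length, window_size):
    """Fraction of fixed-size windows that contain at least one sequenced position."""
    n_windows = -(-chrom_length // window_size)  # ceiling division
    covered = {pos // window_size for pos in positions}
    return len(covered) / n_windows


# Hypothetical capture targets scattered along a 100 kb chromosome
targets = [5_200, 14_800, 23_100, 37_500, 41_900,
           58_300, 66_000, 79_400, 83_700, 95_100]

small = covered_window_fraction(targets, 100_000, 1_000)
large = covered_window_fraction(targets, 100_000, 10_000)
```

In this example only 10% of the 1000 bp windows contain any data, whereas every 10000 bp window does, which is the kind of difference that adding a zero to the window size is meant to address.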

How does psQTL handle missing data?

psQTL recognises missing data in the GT field of a VCF as being denoted by a period (.). It treats partially missing genotypes (e.g., 0/.) as fully missing.
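The rule described above can be expressed as a short Python check. This is a sketch of the stated behaviour rather than psQTL's actual implementation, and the function name is hypothetical. It assumes alleles in a GT string are separated by `/` or `|`.

```python
import re


def is_missing_gt(gt):
    """Return True if any allele in the GT string is '.', including partial calls."""
    alleles = re.split(r"[/|]", gt)
    return "." in alleles


calls = ["0/1", "./.", "0/.", "1|1", "."]
missing_flags = [is_missing_gt(gt) for gt in calls]
```

For the calls above, the second, third, and fifth genotypes are flagged as missing, so `0/.` is handled the same way as `./.`.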

When plotting or reporting Euclidean distance results with psQTL_post.py, the --missing value sets a percentage threshold: SNPs or CNVs where the percentage of samples with missing data exceeds that threshold are filtered out.

sPLS-DA is stricter, as it does not tolerate missing data. Hence, any amount of missing data at a site will result in that site being filtered out.
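The two filtering rules can be contrasted with a small Python sketch. It is an illustration only: the function names and the `max_missing_percent` parameter are hypothetical, and it assumes the percentage is calculated over all samples at a site, which may not match exactly how psQTL counts them.

```python
def missing_percent(gt_calls):
    """Percentage of samples at a site whose GT contains a missing allele."""
    missing = sum(1 for gt in gt_calls if "." in gt)
    return 100 * missing / len(gt_calls)


def keep_for_distance(gt_calls, max_missing_percent):
    """Keep the site unless its missing percentage exceeds the threshold."""
    return missing_percent(gt_calls) <= max_missing_percent


def keep_for_splsda(gt_calls):
    """Keep the site only if no sample has missing data."""
    return missing_percent(gt_calls) == 0


site = ["0/1", "./.", "1/1", "0/0"]  # one of four samples is missing (25%)

lenient = keep_for_distance(site, 50)  # kept under a 50% threshold
strict = keep_for_distance(site, 20)   # filtered under a 20% threshold
splsda = keep_for_splsda(site)         # filtered, since any missing data excludes it
```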

My population is from a selfed organism. How do I input that with --parents?

You can input the same sample twice like --parents sample12 sample12 to indicate selfing, and any relevant parts of the psQTL code will handle it appropriately.

Can I input .gz files?

When it comes to genome FASTA and GFF3 files, or a VCF you have produced yourself, they can be input as-is or as a gzip'd (or bgzip'd) file. Files produced by psQTL are generally gzip'd by default, and psQTL expects them to remain how they were created. In other words, if you want to unzip an intermediate file and inspect it manually in a decompressed state, make a copy of the file elsewhere first.
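If you only want to look inside a gzip'd file, one option is to read it directly without decompressing it on disk, which leaves the original untouched. The sketch below is a generic Python approach rather than anything provided by psQTL; the function name is hypothetical, and it assumes compressed files can be recognised by a .gz extension. Python's gzip module can generally read bgzip'd files as well, since they are valid gzip streams.

```python
import gzip


def read_first_lines(path, n=5):
    """Return the first n lines of a plain-text or gzip-compressed file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # zip() stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for _, line in zip(range(n), handle)]
```

Calling `read_first_lines` on an intermediate file lets you inspect its header without changing the file that psQTL expects to find.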