Using psQTL - zkstewart/psQTL GitHub Wiki

Format a metadata file before running psQTL

psQTL requires you to format a metadata file as a tab-delimited text file (.TSV) with two columns and no header. See a mock example below.

sample1	group1
sample2	group1
sample3	group2
sample4	group2

The left column must contain your sample identifiers. The right column indicates which group each sample belongs to, and must be either group1 or group2. It doesn't matter which phenotype is labelled as group1 or group2, as this will not influence your results. For example, it doesn't matter if your group1 is disease-resistant and group2 is disease-susceptible or vice versa. You just need to partition your samples into two populations.
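If you want to check a metadata file before running psQTL, the format rules above can be sketched as a small validator. This is not part of psQTL; the function name read_metadata is hypothetical, and rejecting duplicate sample identifiers and skipping blank lines are assumptions of this sketch rather than documented psQTL behaviour.

```python
import io

ALLOWED_GROUPS = {"group1", "group2"}

def read_metadata(handle):
    """Parse a two-column, header-less TSV into a {sample: group} dict.

    Hypothetical helper, not part of psQTL.
    """
    samples = {}
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated columns, found {len(fields)}"
            )
        sample, group = fields
        if group not in ALLOWED_GROUPS:
            raise ValueError(f"line {lineno}: group must be group1 or group2, not '{group}'")
        if sample in samples:
            # assumption: each sample should be listed only once
            raise ValueError(f"line {lineno}: duplicate sample identifier '{sample}'")
        samples[sample] = group
    return samples

# The mock example from above, held in memory instead of a file
example = "sample1\tgroup1\nsample2\tgroup1\nsample3\tgroup2\nsample4\tgroup2\n"
metadata = read_metadata(io.StringIO(example))
print(metadata)
```

To check a real file, pass an open file handle instead of the io.StringIO object.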

If you are providing BAM files for psQTL to analyse and call variants from, each BAM file name must begin with the sample identifier, immediately followed by a consistent file suffix. For a sample termed sample1, you should have a BAM file named sample1{bamSuffix}, where {bamSuffix} is the value you provide on the command line using the --bamSuffix option.
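The naming rule can be checked ahead of time with a few lines of Python. This is a minimal sketch, not something psQTL provides: the function name missing_bams is hypothetical, the suffix .sorted.bam is only an example value, and it assumes all BAM files sit in a single directory.

```python
import tempfile
from pathlib import Path

def missing_bams(sample_ids, bam_dir, bam_suffix):
    """Return the sample identifiers with no file named <sample><bam_suffix> in bam_dir.

    Hypothetical helper, not part of psQTL.
    """
    bam_dir = Path(bam_dir)
    return [s for s in sample_ids if not (bam_dir / f"{s}{bam_suffix}").is_file()]

# Demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1.sorted.bam", "sample2.sorted.bam"):
        (Path(tmp) / name).touch()
    missing = missing_bams(["sample1", "sample2", "sample3"], tmp, ".sorted.bam")

print(missing)  # ['sample3']
```

An empty list means every sample in the metadata file has a BAM file with the expected name.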

If you provide a VCF file, the sample identifiers in your metadata file should match those in the VCF header line that begins with #CHROM.

You can provide a metadata file that doesn't exactly match your VCF or BAM files. If you do so, the VCF filtering will only consider the samples contained within your metadata file, and results produced by psQTL_proc.py will only use the indicated samples when calculating the segregation statistics. However, I think you should avoid this unless you're intentionally using it as a mechanism to filter your results. The program will warn you when there's a discrepancy.
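To see a discrepancy for yourself before psQTL warns about it, you can compare the metadata sample identifiers against the #CHROM header line. The sketch below is independent of psQTL and its function names are hypothetical. It relies only on the standard VCF layout, in which sample columns follow the nine fixed columns (#CHROM to FORMAT), and it assumes an uncompressed, tab-delimited VCF.

```python
import io

def vcf_samples(handle):
    """Return the sample identifiers from the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed (#CHROM ... FORMAT); samples come after
            return line.rstrip("\r\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")

def compare_samples(metadata_ids, vcf_ids):
    """Return (only in metadata, only in VCF) as two sorted lists."""
    metadata_ids, vcf_ids = set(metadata_ids), set(vcf_ids)
    return sorted(metadata_ids - vcf_ids), sorted(vcf_ids - metadata_ids)

# Mock VCF header held in memory; sample names are illustrative only
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\tsample5\n"
)
only_meta, only_vcf = compare_samples(
    ["sample1", "sample2", "sample3"], vcf_samples(io.StringIO(vcf_text))
)
print(only_meta)  # ['sample3']
print(only_vcf)   # ['sample5']
```

Samples that appear only in the VCF are the ones that would be left out of the analysis. Samples that appear only in the metadata file usually point to a typo in an identifier.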

Step-by-step through the psQTL pipeline

The psQTL_prep.py script is used to set up a directory within which an analysis will be run. At this stage, you specify all relevant file locations and optionally produce VCF and VCF-like files for variant and CNV predictions. See the psQTL_prep page for details on how you can do this.

The psQTL_proc.py script is used to process the VCF and VCF-like files and calculate segregation statistics that enable QTL identification. See the psQTL_proc page for more details.

Finally, the psQTL_post.py script is used to post-process the segregation statistics and generate plots and/or report tables, which can be used to identify QTLs and potential candidate genes or marker variants. See the psQTL_post page for details on how to do this.