psQTL_post - zkstewart/psQTL GitHub Wiki

The command-line usage shown here assumes Option 2 was used to obtain the software, as described at the Installing psQTL page. Note that program invocation is different if you obtain psQTL through conda, dropping the initial python and the .py file suffix.

For example, python psQTL_post.py -h becomes psQTL_post -h. Keep this in mind when reading the following instructions.

psQTL_proc

This script has two modes which are detailed by asking for help information on the command-line.

python psQTL_post.py -h

usage: psQTL_post.py [-h] [-v] {plot,report} ...

psQTL_post.py processes the output of psQTL_proc.py to either 1) plot segregation statistics or 2) report on gene proximity to potential QTLs. The segregation statistics can be plotted as a combination of
line plots, scatter plots, alignment coverage plots, and/or gene locations. The gene proximity report will identify genes that are proximal to or contained within potential QTL regions. The input directory
is expected to have been 'initialise'd by psQTL_prep.py and 'process'ed by psQTL_proc.py.

positional arguments:
  {plot,report}
    plot         Plot segregation statistics
    report       Report gene proximity to potential QTLs

options:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

plot

The plot mode takes results generated by psQTL_proc.py ed and/or psQTL_proc.py splsda and visualises them for QTL identification. For interpretation tips and examples of these plots, see the Interpreting results page. Command-line help information is as follows:

usage: psQTL_post.py plot [-h] -d WORKINGDIRECTORY -f GENOMEFASTA -o OUTPUTFILENAME [--ed {alleles,inheritance,genotypes}] [--power POWER] [--regions REGIONS [REGIONS ...]] [--exclude EXCLUSIONSTSV] [--missing MISSINGFILTER] [-v] -i {call,depth} [{call,depth} ...] -m
                          {ed,splsda} [{ed,splsda} ...] -p {line,scatter,coverage,genes} [{line,scatter,coverage,genes} ...] -s {horizontal,circos} [--annotation ANNOTATIONGFF3] [--wma WMASIZE] [--coverageSamples COVERAGESAMPLES [COVERAGESAMPLES ...]]
                          [--highlights HIGHLIGHTS [HIGHLIGHTS ...]] [--noGeneNames] [--width WIDTH] [--height HEIGHT] [--space AXISSPACE]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -f GENOMEFASTA        Specify the location of the genome FASTA file
  -o OUTPUTFILENAME     Specify the location to write the output file; for 'plot', this must end with '.pdf', '.png', or '.svg'; for 'report', this must end with '.tsv' or '.csv'
  --ed {alleles,inheritance,genotypes}
                        Optionally, specify the type of 'call' Euclidean distance measurement to use; 'inheritance' is only available if psQTL_proc.py was previously run with the --parents argument
  --power POWER         Optionally, specify the power to raise Euclidean distances to reduce noise (default: 4)
  --regions REGIONS [REGIONS ...]
                        Optionally, specify which regions to plot. Providing no input to this argument will plot all chromosomes; otherwise, specify the chromosome(s) to plot by their individual genome contig identifiers (e.g., 'chr1') with the option to specify a range within
                        the chromosome with chr:start-end format (e.g., 'chr1:1000000-2000000').
  --exclude EXCLUSIONSTSV
                        Optionally, specify a headerless 3-column TSV format file (chrom start end) indicating genomic regions to exclude from any plotted or reported outputs.
  --missing MISSINGFILTER
                        Optionally, specify the proportion of missing data that is tolerated in either group before a variant is filtered out (recommended: 0.5)
  -v, --version         show program's version number and exit
  -i {call,depth} [{call,depth} ...]
                        Specify one or both of 'call' and 'depth' to indicate which types of results to process.
  -m {ed,splsda} [{ed,splsda} ...]
                        Specify whether you are analysing 'ed' (Euclidean distance) and/or 'splsda' (Sparse Partial Least Squares Discriminant Analysis) measurements
  -p {line,scatter,coverage,genes} [{line,scatter,coverage,genes} ...]
                        Specify one or more plot types to generate
  -s {horizontal,circos}
                        Specify the style of plot to generate
  --annotation ANNOTATIONGFF3
                        Optionally, specify the location of the genome annotation GFF3 file if you want to plot gene locations
  --wma WMASIZE         LINE PLOT: optionally, specify the number of previous values to consider during weighted moving average calculation (default: 5)
  --coverageSamples COVERAGESAMPLES [COVERAGESAMPLES ...]
                        COVERAGE PLOT: Optionally, specify one or more samples to plot coverage data as individual lines; these samples will be omitted from the group values
  --highlights HIGHLIGHTS [HIGHLIGHTS ...]
                        Optionally, specify plotted regions to highlight using the same format as --regions; these regions will have an opaque background colour used to indicate their location(s).
  --noGeneNames         GENES PLOT: Optionally, provide this flag if you have set '-p genes' and would like gene models to be shown without their names being displayed; this could be useful if the gene names become illegible at the size you are plotting
  --width WIDTH         Optionally, specify the total output plot width (default: calculated internally with 5 per region for horizontal or a flat value of 8 for circos)
  --height HEIGHT       Optionally, specify the output plot height (default: calculated internally with 5 per plot type for horizontal or a flat value of 8 for circos)
  --space AXISSPACE     CIRCOS PLOT: Optionally, specify the space (in degrees) to allow for the Y-axis labels in the top centre of the plot (default: 15)

-f requires the reference genome FASTA file that statistics were generated from.

-i can accept one or both of the 'call' and 'depth' values e.g., -i call depth. Results from the small variants (call) and/or CNVs (depth) will be plotted depending on your selection.

-m controls what type of measurements we want to visualise or report, specifically referring to ED measurements or sPLS-DA values. Note that these statistics were generated independently for the 'call' and/or 'depth' variants; by specifying -i call depth -m ed splsda you will plot each pairwise combination of results i.e., the ED values for call and depth datasets, as well as the sPLS-DA results for the call and depth datasets.

-p is used to indicate which plot types you want to produce, with one, some, or all values being able to be specified for a single plot e.g., -p line scatter coverage genes. Refer to the Interpreting results page for examples and details.

-s allows you to choose whether you want to produce a circos or horizontal plot representation of the same underlying data.

--ed controls which ED result you want to visualise or report on. You can choose to assess the results from the standard alleles ED formula (which is the default option if unspecified), the alternative genotypes ED formula, or, if you've run with --parents during psQTL_proc.py you can instead choose the inheritance ED results. Refer to the About segregation statistics page for more details of these statistics.

--power is used to apply a power transformation to all Euclidean distance values. The default value of 4 is appropriate for minimising noise and maximising segregation signals in the data, and is based on the paper cited in the About segregation statistics page. You probably shouldn't change this.

--missing is used to filter variants and CNVs for which data is missing in the specified proportion of one or more groups. This filtering is very important to ensure that segregation results are not arising from comparison of a single sample in (e.g.,) group 1 when compared to samples in group 2. Hence, you should leave this at the default of 0.5 or even consider making it more strict if you have small group sizes.

--wma is used to control the Weighted Moving Average value for smoothing the line plot, with a default value of 5. Decreasing this value towards 0 will decrease the amount of smoothing, while increasing the value will give the line more smoothing. Line smoothing can be useful both for visual clarity in the plot, as well as to identify genomic regions where the segregation statistic is consistently high or low. This helps to minimise the impact of data variability and potentially clarify where QTL regions are really occurring.

--space is used when producing a Circos plot, and controls how much space (in degrees) is left for Y-axis labels. You might increase this from the default of 10 if you find that your Y-axis labels are overlapping the data portion of your plot.

--width and --height are used to control the plot dimensions when producing horizontal plots; these variables do not currently affect -s circos plots. Values are given in inches akin to how ggplot2 in R operates.

--regions is used to indicate specific genomic regions you wish to visualise; not specifying any values here will result in the entire genome being visualised. One or more genomic regions can be specified in chromosome:start-end format, with the chromosome value expected to refer to contigs identified within the genome FASTA file as input with the -f parameter.

--exclude is used to indicate a BED file indicating regions within which statistics should be ignored. In practice, you might use this to filter out repeat regions where spurious variant calls may be more likely. Start and end positions are expected to conform to the standards as defined in the UCSC BED file format definition page.

--highlights is used to indicate specific genomic regions you want to highlight. Values are input in the same way as --regions, with the indicated region being given an opaque orange-yellow background to allow for visual demarcation of important sections within larger regions.

--noGeneNames is a flag argument, and is used to turn OFF the display of gene names produced by -p genes. This is useful in cases where the plot is not scaled for gene name text to be properly legible. Note that gene names only display in -s horizontal plots and hence this argument has no effect on -s circos plots.

--coverageSamples is used only when you are producing a -p coverage plot. Up to three samples can have their median-normalised read coverage visualised as individual lines in this plot. The remaining members of each group will have their coverage values visualised as a kind of "moving boxplot" with a median line and Q1 and Q3 margins indicated with shading. Note that the samples input to --coverageSamples are not otherwise included in their group coverage values. This can let you identify how particular samples might deviate from the rest of their group. In practice you may also use this for plotting the two parent samples which may let you identify where each group may be inheriting their CNV(s) from.

report

The plot mode takes results generated by psQTL_proc.py ed and/or psQTL_proc.py splsda and tabulates them to allow inspection of candidate genes and genetic markers.

usage: psQTL_post.py report [-h] -d WORKINGDIRECTORY -f GENOMEFASTA -o OUTPUTFILENAME [--ed {alleles,inheritance,genotypes}] [--power POWER] [--regions REGIONS [REGIONS ...]] [--missing MISSINGFILTER] [-v]
                            -m {ed-call,ed-depth,splsda} -t {genes,markers} -a ANNOTATIONGFF3 [--radius RADIUSSIZE]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -f GENOMEFASTA        Specify the location of the genome FASTA file
  -o OUTPUTFILENAME     Specify the location to write the output file; for 'plot', this must end with '.pdf', '.png', or '.svg'; for 'report', this must end with '.tsv' or '.csv'
  --ed {alleles,inheritance,genotypes}
                        Optionally, specify the type of 'call' Euclidean distance measurement to use; 'inheritance' is only available if psQTL_proc.py was previously run with the --parents argument
  --power POWER         Optionally, specify the power to raise Euclidean distances to reduce noise (default: 4)
  --regions REGIONS [REGIONS ...]
                        Optionally, specify which regions to plot. Providing no input to this argument will plot all chromosomes; otherwise, specify the chromosome(s) to plot by their individual genome
                        contig identifiers (e.g., 'chr1') with the option to specify a range within the chromosome with chr:start-end format (e.g., 'chr1:1000000-2000000').
  --missing MISSINGFILTER
                        Optionally, specify the proportion of missing data that is tolerated in either group before a variant is filtered out (recommended: 0.5)
  -v, --version         show program's version number and exit
  -m {ed-call,ed-depth,splsda}
                        Specify whether you are analysing 'ed-call' (Euclidean distance of 'call' variants), 'ed-depth' (Euclidean distance of 'depth' CNVs), or 'splsda' (Sparse Partial Least Squares
                        Discriminant Analysis) measurements
  -t {genes,markers}    Specify whether your output should be focused on 'genes' or 'markers'
  -a ANNOTATIONGFF3     Specify the location of the genome annotation GFF3 file
  --radius RADIUSSIZE   Optionally, specify the radius (in bp) surrounding a 'call' variant or 'depth' CNV window that you want to consider as being 'proximal' to a gene (default: 50000)

Many arguments provided here behave identically to how they work in plot. Options which exist only here, or behave differently, include:

-m controls what type of measurements we want to visualise or report, specifically referring to ED measurements or sPLS-DA values including BA and selected 'call' or 'depth' variants. The argument behaves differently with report than seen with plot; when reporting, you instead specify one of either ed-call, ed-depth, or splsda. Only one value can be reported on in each file, so we need to combine what is the -i and -m parameters that you find in the plotting module.

-t modifies the output format to summarise statistics with respect to genes or markers which include 'call' and 'depth' variants. See the Interpreting results page for details on these report formats.

--radius controls how much of a radius surrounds each marker or gene. When run with -t genes, the radius sets the region size within which segregation statistics are considered to be proximal to that gene and are counted and averaged. When run with -t markers, it controls the region within which the most proximal gene to that marker will be found (if none exist within this region, no genes are reported as being proximal). An appropriate --radius value should make it easier to identify candidate genes which occur nearby variants or CNVs which may influence their function or expression, with the default being 50000 for a distance of 50 Kbp.