psQTL_proc - zkstewart/psQTL GitHub Wiki

The command-line usage shown here assumes Option 2 was used to obtain the software, as described at the Installing psQTL page. Note that program invocation is different if you obtain psQTL through conda, dropping the initial python and the .py file suffix.

For example, python psQTL_proc.py -h becomes psQTL_proc -h. Keep this in mind when reading the following instructions.

psQTL_proc

This script has two modes which are detailed by asking for help information on the command-line.

python psQTL_proc.py -h

usage: psQTL_proc.py [-h] [-v] {ed,splsda} ...

psQTL_proc.py processes VCF or VCF-like files containing variant or CNV predictions for samples belonging to two groups. The working directory is expected to have been 'initialise'd by psQTL_prep.py such
that all necessary files have been validated and prepared for analysis.

positional arguments:
  {ed,splsda}
    ed           Compute Euclidean distance of 'call' and/or 'depth' variants
    splsda       Compute local sPLS-DA of 'call' and/or 'depth' variants

options:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit

ed

The ed mode calculates the Euclidean distance (ED) segregation statistic which measures how distinct the allele, genotype, or haplotype inheritance is between your two segregating populations. See the About segregation statistics page for further details. Asking for help on the command-line will return:

usage: psQTL_proc.py ed [-h] -d WORKINGDIRECTORY -i {call,depth} [{call,depth} ...] [-v] [--parents PARENTSAMPLES PARENTSAMPLES] [--considerIdentical]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -i {call,depth} [{call,depth} ...]
                        Specify one or both of 'call' and 'depth' to indicate which types of variants to process.
  -v, --version         show program's version number and exit
  --parents PARENTSAMPLES PARENTSAMPLES
                        Optionally, provide the names of the two parents used to generate the groups; this is used to apply an alternative form of ED which leverages the parents' genotypes to extract more
                        signal out of your data. If not provided, the standard ED will be used.
  --considerIdentical   Optionally, provide this flag to prevent filtration of variants where both groups' genotypes are identical; this can occur when both groups have the same variant with respect to the
                        reference genome. Not recommended unless you have a specific reason to do so.

-i can accept one or both of the 'call' and 'depth' values e.g., -i call depth. The ED statistic will be calculated from the small variant VCF file, and/or from the CNV VCF-like file, depending on which of these options you choose.

--considerIdentical is an optional argument; it is not recommended that you set this unless you very deliberately want to change how psQTL handles invariant sites. I am currently unaware of a situation where you would want to do this, but the option is provided just in case such a need ever arises.

--parents is an optional argument. You can omit this option to produce only the two naive segregation statistics which do not incorporate knowledge of parent genotypes. If you specify both parent samples to this option e.g., --parents sample1 sample, you can also generate the haplotype inheritance ED statistic. To calculate this statistic, psQTL first filters genotype calls in progeny that are impossible given the known parental genotypes. For example, if the two parent genotypes are 0/0 and 1/1, all progeny are expected to have a genotype of 0/1; any deviation from this is not possible and will result in the sample(s) being filtered. This statistic is also incompatible with odd-numbered ploidy values since it is difficult or impossible to identify which allele is being inherited from a parent when aneuploid gametes are involved. If you are working with e.g., a triploid organism should not specify this option since errors will be raised.

splsda

The splsda mode runs local sPLS-DA within windowed regions of each chromosome/contig. See the About segregation statistics page for further details.

usage: psQTL_proc.py splsda [-h] -d WORKINGDIRECTORY -i {call,depth} [{call,depth} ...] [-v] [--threads THREADS] [--splsdaWindowSize SPLSDAWINDOWSIZE] [--maf MAFFILTER] [--ber BERFILTER]
                            [--maxiters MAXITERATIONS] [--nrepeat NUMREPEATS]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -i {call,depth} [{call,depth} ...]
                        Specify one or both of 'call' and 'depth' to indicate which types of variants to process.
  -v, --version         show program's version number and exit
  --threads THREADS     Optionally, specify the number of threads to use when running sPLS-DA analyses (default: 1)
  --splsdaWindowSize SPLSDAWINDOWSIZE
                        Optionally, specify the window size that sPLS-DA will be run within (recommended:
                        100000, unless --windowSizeIsSNPs is set, in which case you should lower this
                        according to some division of the number of variants you expect to find in each
                        chromosome or contig)
  --windowSizeIsSNPs    Optionally, provide this flag to indicate that the value given to
                        --splsdaWindowSize is the number of SNPs to include in each window,
                        rather than the length in base pairs
  --maf MAFFILTER       Optionally, specify the Minor Allele Frequency to filter variants by prior to sPLS-DA analysis (recommended: 0.05; equivalent to 5 percent MAF)
  --ber BERFILTER       Optionally, specify the Balanced Error Rate to select variants during sPLS-DA analysis (recommended: 0.4)
  --maxiters MAXITERATIONS
                        Optionally, specify the maximum number of iterations allowed for sPLS-DA convergence (recommended: 10000)
  --nrepeat NUMREPEATS  Optionally, specify the number of times to repeat M-fold validation during sPLS-DA optimisation (recommended: 20 (default) or more if you have the time)

-i can accept one or both of the 'call' and 'depth' values e.g., -i call depth. sPLS-DA will be run in windowed regions using data from the small variant VCF file, and/or from the CNV VCF-like file, depending on which of these options you choose.

--maf specifies the Minor Allele Frequency (MAF) value that variants or CNVs will be filtered by prior to sPLS-DA analysis. It has a recommended default value of 0.05 if you do no specify this argument. It is used to speed up the analysis which may otherwise be incredibly slow if we do not filter variants or CNVs with a low MAF. Variants or CNVs with a low MAF are unlikely to be truly segregating across groups, and their filtration does not result in any relevant data points being omitted.

--ber specifies the Balanced Error Rate (BER) value. It has a recommended default option of 0.4 if you do not specify this argument. It is also used to speed up the analysis by only identifying SNPs or CNVs within regions that offer some benefit for predicting which group each sample belongs to. A value of 0.5 means that sPLS-DA is completely unable to predict a sample's group based on the variants that occur within the region, with values approaching 0 indicating greater success with this prediction.

--maxiters and --nrepeat control techniques that sPLS-DA uses to obtain more robust results. They both have suitable defaults if you do not specify this argument. They are used during cross-fold validation of group prediction results, and having higher values here will generally gives your results more robustness. This is a tradeoff against time, as increasing either value will have a major impact on the time to complete the sPLS-DA analysis.

--splsdaWindowSize is different than the --windowSize as seen in psQTL_prep.py. It is used to specify the window size within which SNPs or CNVs will be assessed. The default value of 100000 is intended to group multiple SNPs or multiple CNVs together for combined analysis when assuming that defaults were using during psQTL_prep.py. If you did not use the default --windowSize during psQTL_prep.py, you should set --splsdaWindowSize to be equal to or larger than the --windowSize value.

--windowSizeIsSNPs interacts with the --splsdaWindowSize parameter. It modifies the number given to --splsdaWindowSize such that it will define the length of each window by the number of variants (call SNPs or depth CNVs), rather than the length in base pairs. This could be useful for non-uniform sequencing technologies (e.g., exome or restriction enzyme-based approaches).

For both --splsdaWindowSize and --windowSizeIsSNPs, note that psQTL may tweak the exact value given so as to ensure that each window is of an approximately equal size. As a toy example, on a contig of 100 bp in length, if you specify --splsdaWindowSize 99 you would normally end up with two windows: one with length 99 bp, and another with length 1 bp. This may lead to biases in the statistics presented for the shorter window length. Instead, psQTL will recognise that, based on the provided window length value, we are going to end up with two windows and will ensure that both windows are 50 bp in length to mitigate any bias.