psQTL_prep - zkstewart/psQTL GitHub Wiki

The command-line usage shown here assumes Option 2 was used to obtain the software, as described at the Installing psQTL page. Note that program invocation is different if you obtain psQTL through conda, dropping the initial python and the .py file suffix.

For example, python psQTL_prep.py -h becomes psQTL_prep -h. Keep this in mind when reading the following instructions.

psQTL_prep

This script has several modes which are detailed by asking for help information on the command-line.

python psQTL_prep.py -h

usage: psQTL_prep.py [-h] [-v] {initialise,init,depth,call,view} ...

psQTL_prep.py manages several steps in preparation for the psQTL pipeline. A working directory must be 'initialise'd
before downstream processing and post- processing takes place. An analysis directory can be 'view'ed to see what
samples are present. Otherwise, if you have not done so already, you may optionally use this script to 'call' variants
or calculate the 'depth' of coverage (and CNV predictions arising from that) for your samples before proceeding on to
use the psQTL_proc.py script for further analysis.

positional arguments:
  {initialise,init,depth,call,view}
    initialise (init)   Initialise a working directory for the psQTL pipeline
    depth               Generate samtools depth files per sample
    call                Call variants per sample
    view                View the metadata of an analysis directory

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

initialise

The first mode is initialise which is the starting point of all psQTL analyses. It will create a working directory and populate it with caches that allow psQTL to remember where your input files are located. Asking for help information returns the following:

usage: psQTL_prep.py initialise [-h] -d WORKINGDIRECTORY [-v] [--meta METADATAFILE] [--vcf VCFFILE]
                                [--fvcf FILTEREDVCFFILE] [--dvcf DEPTHFILE] [--bam BAMFILES [BAMFILES ...]]
                                [--bamSuffix BAMSUFFIX] [--windowSize WINDOWSIZE] [--qual QUALFILTER]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -v, --version         show program's version number and exit
  --meta METADATAFILE   Optionally, specify the location of a metadata TSV file containing two columns indicating 1)
                        sample ID and 2) the group it belongs to
  --vcf VCFFILE         Optionally, specify a VCF file containg per-sample variant calls that you have already
                        produced
  --fvcf FILTEREDVCFFILE
                        Optionally, specify a filtered VCF file containg per-sample variant calls that you have
                        already produced
  --dvcf DEPTHFILE      Optionally, specify a depth VCF-like file containing per-sample allele copy numbers that you
                        have already produced
  --bam BAMFILES [BAMFILES ...]
                        Optionally, specify one or more locations of BAM files and/or directories containing BAM files
                        for variant calling and/or depth calculations
  --bamSuffix BAMSUFFIX
                        Optionally, specify the suffix used to denote BAM files; relevant if directories are provided
                        to --bam
  --windowSize WINDOWSIZE
                        Optionally, specify the window size that reads will be binned into for CNV calling
                        (recommended: 1000)
  --qual QUALFILTER     Optionally, specify the QUAL value that variants must equal or exceed to be included in the
                        final VCF file (recommended: 30.0)

--vcf and --fvcf provide the ability to input your own VCF file with variant calling results e.g., from GATK, freebayes, or whichever your tool of choice is. Specifying the file to --vcf allows you to use psQTL to perform a simple QUAL score filter of your data; in all likelihood you've done your own QC, and hence you should input your file through --fvcf to indicate that it is already filtered.

--dvcf allows you to provide your own VCF*-like* file containing allele copy numbers formatted as VCF GT fields. In practice, since no other tool produces such a file, this option is of limited benefit and only provided for future extensibility.

--bam and --bamSuffix provide the ability to indicate your read mappings in BAM format. These files can be used to call variants, or to assess the read coverage and produce allele copy number estimates for CNV prediction.

-p allows you to specify the ploidy level of your organism, and is used to interpret the median-normalised coverage data in terms of the most likely number of allele copies. The only influence it has on small variant handling is to raise an error if you try to call variants in a non-diploid, to make it clear that this behaviour isn't supported by psQTL's inbuilt variant calling pipeline

--windowSize is detailed below under the depth subheading; it is used specifically when calling CNVs, and has no influence on small variant calling.

call

The call mode provides an in-built method of small variant prediction which pipelines the bcftools mpileup through bcftools call methods. It is optionally provided as an assistant for researchers working on standard diploid organisms.

usage: psQTL_prep.py call [-h] -d WORKINGDIRECTORY [-v] -p PLOIDYNUM -f GENOMEFASTA [--qual QUALFILTER] [--bam BAMFILES [BAMFILES ...]] [--bamSuffix BAMSUFFIX] [--threads THREADS]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -v, --version         show program's version number and exit
  -p PLOIDYNUM          Specify the ploidy number of the samples being analysed e.g., '2' for diploid, '4' for tetraploid, etc.
  -f GENOMEFASTA        Specify the location of the genome FASTA file that BAM files are aligned to
  --qual QUALFILTER     Optionally, specify the QUAL value that variants must equal or exceed to be included in the final VCF file (recommended: 30.0)
  --bam BAMFILES [BAMFILES ...]
                        Optionally, specify one or more locations of BAM files and/or directories containing BAM files for variant calling and/or depth calculations
  --bamSuffix BAMSUFFIX
                        Optionally, specify the suffix used to denote BAM files; relevant if directories are provided to --bam
  --threads THREADS     Optionally, specify the number of threads to use for bcftools variant calling (default: 1)

-p is as detailed above; for call mode it is a mandatory argument which will only accept a value of 2 which is to make it very explicit that only diploidy is handled appropriately by this mode.

-f requires the reference genome FASTA file that reads were mapped to.

--bam and --bamSuffix are as detailed above.

--qual is for specifying the QUAL score that the VCF will be filtered by. Variants will be retained only if they are >= QUAL.

--threads allow for multiple CPU threads to be used during variant calling to speed up its operation.

depth

The depth mode provides an in-built method of allele copy number prediction which utilises a custom method of summing reads within window regions and dividing by the median. This median-normalised coverage allows for allele copy numbers to be inferred which highlights potential genomic duplications and/or deletions that occur within your samples.

usage: psQTL_prep.py depth [-h] -d WORKINGDIRECTORY [-v] -p PLOIDYNUM -f GENOMEFASTA [--windowSize WINDOWSIZE] [--bam BAMFILES [BAMFILES ...]] [--bamSuffix BAMSUFFIX] [--threads THREADS]

options:
  -h, --help            show this help message and exit
  -d WORKINGDIRECTORY   Specify the location where the analysis is being performed
  -v, --version         show program's version number and exit
  -p PLOIDYNUM          Specify the ploidy number of the samples being analysed e.g., '2' for diploid, '4' for tetraploid, etc.
  -f GENOMEFASTA        Specify the location of the genome FASTA file that BAM files are aligned to
  --windowSize WINDOWSIZE
                        Optionally, specify the window size that reads will be binned into for CNV calling (recommended: 1000)
  --bam BAMFILES [BAMFILES ...]
                        Optionally, specify one or more locations of BAM files and/or directories containing BAM files for depth calculations
  --bamSuffix BAMSUFFIX
                        Optionally, specify the suffix used to denote BAM files; relevant if directories are provided to --bam
  --threads THREADS     Optionally, specify the number of threads to use for samtools depth calculations (default: 1)

-p is as detailed above; for depth mode you may specify any ploidy number according to the ploidy of your organism.

-f requires the reference genome FASTA file that reads were mapped to.

--bam and --bamSuffix are as detailed above.

--threads allow for multiple CPU threads to be used during read depth asessment to speed up its operation.

--windowSize controls the size of the non-overlapping windows within which read alignments are summed along each chromosome/contig. There are several reasons for doing this, rather than looking at each individual nucleotide position. These include:

Read alignment depth per position can vary quite a lot over short spans, so summing read depth in larger windows allows for some of that variability to be smoothed out.
It reduces the computation time for read depth, since treating each individual position in the genome as a separate data point would massively increase the computation time.
CNVs are generally defined as being insertions or deletions of 1,000 bp or longer, and the default value for psQTL reflects that desire for true CNVs to be located rather than shorter indels which may already be detected during variant calling.

The default of 1000 is recommended for WGS data, but if you are using a form of sequencing which does not evenly cover all genomic regions e.g., exome or DArT sequencing, you may want to increase this value. In short, your goal would be to choose a value that:

Is large enough to allow for multiple sequenced regions to be summed together to minimise data variability effects.
- For exome sequencing this might be based on the average length of a gene (introns included).
- For DArT sequencing this might be based on the average expected spacing between each restriction enzyme site.
Is narrow enough to allow real CNVs to be identified without having their variation 'smoothed' over by summing them with adjacent non-CNV regions.

Choosing a window size that potentially encompasses two of those features (i.e., the average length of a gene, or the spacing between enzyme sites) may be a useful objective metric to base your decision around.

Otherwise, I would recommend you try running psQTL with a step-wise increase in the window size from 10000 to 100000 to see how that impacts your CNV segregation results.

view

The depth mode provides a helper functionality to obtain some simple metrics on your variant and CNV predictions.

usage: psQTL_prep.py view [-h] -d WORKINGDIRECTORY [-v]

options:
  -h, --help           show this help message and exit
  -d WORKINGDIRECTORY  Specify the location where the analysis is being performed
  -v, --version        show program's version number and exit

An example of its output is shown below.

## psQTL_prep.py - View Directory ##
# Parameters:
Working directory: /location/of/your/working/directory
BAM files: [ ... ]

# Metadata details:
Metadata file: /location/of/your/metadata.tsv
Group 1 samples (n=21): [ ... ]
Group 2 samples (n=29): [ ... ]

# 'Call' VCF details:
VCF file: None

# Filtered 'call' VCF details:
Filtered VCF file: /location/of/your/variants.vcf.gz
Quality filter: 30.0
Num. filtered variants: 18608599
Filtered samples (n=70): [ ... ]
Filtered contigs (n=10): ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10']

# Depth VCF-like details:
Depth file: /location/of/your/working/directory/depth/psQTL_depth.vcf.gz
Window size: 1000 bp
Total num. bins: 323739
Num. bins with CNV: 279522
Samples (n=70): [ ... ]
Contigs (n=10): ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10']

# Potential issues:
VCF file may need to be set or generated before further analysis can proceed

Analysis viewing complete!
Program completed successfully!

Some details to note include:

Long lists are abbreviated with ... here, but would normally showcase sample IDs as found in your VCF header line or each individual BAM files.
The 'call' VCF shows as None since we have specified a pre-generated and filtered VCF to the --fvcf option as detailed above.
Some details are given including the number of small variants in our VCF (18608599), as well as the number of windows that read depth was summed within (323739) and how many of those windows show potential CNVs where not all samples show the same number of allele copies (279522).

The "Potential issues" can flag several possible problems in your analysis. Here, the warning is not problematic for us. We did not specify a --vcf option since we provided input directly to --fvcf instead. However, it can detail other potential problems which can include:

Metadata file needs to be set before further analysis can proceed indicates that you have not specified a --meta value during psQTL_prep.py initialise which will need to be rectified before continuing with the psQTL pipeline.
Metadata only indicates two samples; analysis interpretation may be limited warns you that looking for QTLs between only two segregating individuals will return results which are compromised.
VCF file may need to be set or generated before further analysis can proceed or Depth file may need to be set or generated before further analysis can proceed give warnings that small variant and CNV prediction data is lacking, with downstream analsis steps not able to work with 'call' or 'depth' data, respectively.
VCF contigs (n=X) do not match depth contigs (n=Y) may flag a potential discrepancy in the reference genome FASTA used for small variant and CNV prediction. If there are contigs in your reference genome which e.g., do not have any variants predicted, you can ignore this warning.
<X> samples do not match <Y> samples may flag discrepancies between the samples that occur in your VCF and VCF-like files, or between those files and your metadata.
- psQTL is built to handle the use of subsetted metadata, such that your metadata has fewer samples listed than are found in your VCF or VCF*-like* files.
- Alternatively, if samples are listed in your metadata which do not exist in your VCF or VCF*-like* files, you should expect errors later on.