User Manual - ToBoDev/assessPool GitHub Wiki

It is highly recommended that new users read this walk-through before starting to use assessPool. Contact us with bugs or questions by submitting an issue via github, or emailing us at [email protected]!

Download and Installation

Getting assessPool is easy! Open a terminal window on your computer, navigate to your desired working directory, and clone the project with:

git clone https://github.com/ToBoDev/assessPool.git

To update assessPool, run the following from within your assessPool directory*:

git pull

*note: if you have edited any scripts (including assessPool_main.Rmd), you will need to use a `-f` flag. This will overwrite any edits, including filter settings.

git pull -f

It is important that you have a recent version of R (3.4.2+) installed. Although assessPool is largely self-contained with regard to R libraries, it does require a few external dependencies:

  1. RStudio [1], an integrated development environment (IDE) for R. It is available as open-source software; installers and tarballs are located at the link provided.

  2. CPAN, the Comprehensive Perl Archive Network, included with a Perl installation. You can test whether cpan is already installed on your machine with: cpan --version

  3. vcftools [2], a program package designed for working with VCF files. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files. VCFtools is available as a zip or tarball at the link provided.

  4. vcflib [3], a C++ library for parsing and manipulating VCF files. Download instructions can be found here. Make sure you follow the instructions for editing your .bashrc so that vcflib can be called globally.

  5. (OPTIONAL) GNU parallel [4], a shell tool for executing jobs in parallel. See link for installation instructions. Installing parallel is not required; however, if you have large datasets it can speed up your analysis substantially.

PoPoolation2 [5] also requires the Perl module twotailed.pm (Text::NSP::Measures::2D::Fisher::twotailed). You can check whether this module is already installed by running the following lines of code (on Linux):

sudo apt-get install perl-doc
perldoc -l Text::NSP::Measures::2D::Fisher::twotailed

If these lines return a path to the module, you're good to go! Run these lines to add its library root to your @INC path:

export PERL5LIB="$(perldoc -l Text::NSP::Measures::2D::Fisher::twotailed | sed 's|/Text/NSP/Measures/2D/Fisher/twotailed\.pm$||')"
export PERLLIB="$PERL5LIB"

If instead you get a "No Documentation Found" error, run the following lines of code to install it and add it to your path:

sudo cpan App::cpanminus
sudo cpanm Text::NSP::Measures::2D::Fisher::twotailed
export PERL5LIB="$(perldoc -l Text::NSP::Measures::2D::Fisher::twotailed | sed 's|/Text/NSP/Measures/2D/Fisher/twotailed\.pm$||')"
export PERLLIB="$PERL5LIB"

Updating assessPool

We're adding new features and bug fixes all the time! To get the latest version, navigate to your assessPool directory on the command line and type the following lines. IMPORTANT: These git calls will overwrite your local files, so if you want to save your current custom parameter setup or R notebook output, make sure to move the .Rmd and .nb.html files to a safe place before running.

git fetch --all
git reset --hard origin/master

Getting Started

Once you have assessPool and its dependencies installed, you're ready to go (almost). Open RStudio, and navigate to File > Open Project. Select the file assessPool.Rproj from your assessPool directory. This may automatically open the R Markdown file assessPool_main.Rmd - if not, navigate to File > Open File in RStudio and select it from the assessPool directory. Once you have the R Markdown script in front of you, take a quick look at how it's set up. The file is composed of code chunks, which look something like this:

```{r}
#this is a comment
this is some R code
```

assessPool is designed to be run by stepping through each code chunk in order. It's also modular, so chunks can be re-run with different parameters before proceeding to the next step. The first chunk is LibraryImport.

Run this chunk by clicking the green arrow in the top right, then sit back and relax. The first time you run this chunk, RStudio will install all R libraries required by the script. Using the R package Packrat, the libraries will be installed in the self-contained packrat directory within assessPool and will not affect other installs on your system. Packrat should install itself automatically, but if it does not and instead gives you an error, simply install it yourself with the line install.packages("packrat") in your RStudio console. If vcflib is not globally installed, this chunk will also download and build the program inside your assessPool/scripts/ directory. This will take some time, so it is a good opportunity to go make a cup of coffee - or to set up your input files in the next code chunk.
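If Packrat's automatic bootstrap fails, the following console snippet (a minimal sketch using standard Packrat functions) installs the package and restores the project-local library:

```{r}
# Run in the RStudio console from the assessPool project root.
install.packages("packrat")  # install Packrat itself
packrat::restore()           # rebuild the project-local library in packrat/
```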

assessPool requires only a .vcf (Variant Call Format) file produced by SNP-calling software such as FreeBayes or GATK and a .fasta reference file produced by assembly software such as rainbow or PEAR. If you are using assessPool after the dDocent pipeline, these files will be called 'reference.fasta' and either 'TotalRawSNPs.vcf' (raw) or 'Final.recode.vcf' (filtered). Generally speaking, default dDocent filters are not calibrated for pooled data, so we recommend that you either (1) start with the raw file and use assessPool's built-in filtering steps (more on this later) or (2) filter on your own using vcftools or another program.

The vcf and reference files can be placed in the assessPool/ directory, where they will be found automatically; it is important that you only have one of each in the directory. If you have multiple vcf/reference files or if they are located in a different directory, set them up this way instead in the next chunk ("PreAnalysisParameters"):

vcf_file <- "/path/to/file/myfile.vcf"
ref_file <- "/path/to/file/myfile.fasta"

This is also the place to set your project name, which will create a folder in the main assessPool/ directory to store your output. If you are running assessPool multiple times with different input files or parameters, remember to change your project name so your original files will not be overwritten. NOTE: As with most command-line applications, please make sure not to include any spaces in your project name, e.g. "Genus species" should be written as "Genus_species" or "GenusSpecies".
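For example, assuming the chunk exposes the project name as a variable (the exact variable name may differ in your version of the script), the setup might look like:

```{r}
# Hypothetical parameter setup for the PreAnalysisParameters chunk;
# project_name is illustrative - use the variable name in your script.
project_name <- "GenusSpecies"  # no spaces!
vcf_file <- "/path/to/file/myfile.vcf"
ref_file <- "/path/to/file/myfile.fasta"
```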

VCF Filtering

It is strongly recommended that you filter your vcf file before proceeding - while it is impossible to remove all sequencing errors while keeping only real variant sites, filtering can help to improve your signal to noise ratio and give you more confident results. You can either (1) filter your vcf prior to assessPool, or (2) use the recommended filtering steps provided in assessPool, which are included with pooled data in mind. If your vcf has already been filtered to your liking, proceed to the next section below.

Otherwise, the first filtering chunk is named "PreFilteringParameters". The following options can be changed:

show.filter.output <- TRUE
vcftools_PATH <- "export PATH=<PATH_TO_VCFTOOLS_0.1.12_or_later>:$PATH;"

The first controls how verbose your filtering output will be; change TRUE to FALSE if you want to hide the detailed per-filter output. The second helps the script find the correct version of VCFtools. If, after running this chunk, the output reports VCFtools v0.1.12 or later, you do not need to change this path. If instead your VCFtools is out of date, there are a few ways to solve the problem. The easiest is to install a newer version of VCFtools on your computer. If you have multiple versions of the software installed, provide a path to the correct version like so:

vcftools_PATH <- "export PATH=/path/to/your/bin/:$PATH;"

This will prevent assessPool from using an outdated version of VCFtools, which would throw errors later down the line. You will need to re-run this chunk after setting the path. Once you are using the right version of VCFtools, you are ready to start filtering! The filtering steps included in assessPool loosely follow dDocent recommendations and are, in the order they appear in the script:

  • Filter by pool number: Removes variant sites called in fewer than x pools, specified by min.pool.number. Setting this parameter to 2 preserves the maximum number of informative sites; a more conservative threshold would be approximately 50% of the total number of pools.
  • Filter by quality score: Removes variant sites with a quality score lower than x, specified by min.quality.score. The default minimum quality score is 30, but this should be adjusted per your data.
  • Filter by minimum depth: Removes variant sites with a total read depth of less than x, specified by min.depth.threshold. The default minimum depth is 3; read more about the rationale for choosing a low minimum depth on the dDocent filtering tutorial. This filter also requires a maximum number of dropped pools per site, specified by max.missing. This should be set to your total number of pools minus the minimum pool number you set in step one.
  • Filter by allele length: Removes variant sites with a length of over x base pairs, specified by max.allele.length. The default maximum length is 10, but this should be adjusted per your data.
  • Filter by mis-paired reads: Removes variant sites in which there is a mismatch in properly paired reads between the reference and alternate allele, i.e. all the reads supporting the reference allele are paired, but the reads supporting the alternate allele are not. Removing these sites helps to reduce cut site polymorphisms and paralogs.
  • Filter by quality/depth ratio: Removes variant sites with a quality score:depth ratio below x, specified by quality.depth.ratio. This filter helps remove high-coverage sites whose quality scores are over-inflated by that coverage - read more on the dDocent filtering tutorial. A threshold of 0.25 is recommended for RADseq data.
  • Filter by quality/maximum mean depth: Similar to the above filter, removes variant sites above the mean depth cutoff whose quality scores are not at least 2 times their depth. Read more on the dDocent filtering tutorial.
  • Filter by maximum mean depth: Removes variant sites with a maximum mean depth of over x, specified by max.mean.depth.threshold. This filter helps to remove multi-copy loci and paralogs, and can be informed by looking at the histogram produced in the previous chunk. If your histogram has a long trailing tail of sites that are high above the mean depth (to the right), choose a threshold here that removes this tail. You can examine the results of this threshold by re-running the histogram chunk.
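Putting the parameter names above together, a full set of filter settings might look like the sketch below (the values are illustrative, for a hypothetical 12-pool dataset, and should be tuned to your data):

```{r}
# Illustrative filter settings for a hypothetical 12-pool dataset
min.pool.number          <- 2     # site must be called in at least 2 pools
min.quality.score        <- 30    # minimum quality score
min.depth.threshold      <- 3     # minimum total read depth
max.missing              <- 10    # total pools (12) minus min.pool.number (2)
max.allele.length        <- 10    # maximum allele length in bp
quality.depth.ratio      <- 0.25  # minimum quality:depth ratio
max.mean.depth.threshold <- 100   # cutoff for the high-coverage tail in the histogram
```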

You can view a summary of your filtering progress by running the filtering visualization chunk, which produces a bar graph of variant sites preserved after each filtering step.

Don't worry if large amounts of sites are dropped during filtering - smaller subsets of higher-quality data will still yield informative patterns. Of course, these suggested filters are not comprehensive and may not be sufficient for every dataset. If you wish to apply additional filters, you may filter your data prior to assessPool and skip these steps entirely.

Pre-PoPoolation2 Setup

Now that you have a filtered vcf file, there are a few more steps to take before running PoPoolation2. The first is the "PreAnalysisRun" chunk, which reads the vcf file, extracts necessary data, cleans up formatting, and returns several dataframes that will be used later in the analysis. If you have filtered your vcf using the steps above, the program will automatically find this filtered file and use it for the remainder of the analysis.

The following chunk is "PopoolationParameters", in which you have the following options for data inclusion and analysis inclusion:

  • include.multiallelic : set to TRUE if you want to include SNPs with >2 alleles, FALSE if you only want to include biallelic SNPs.
  • include.indels : set to TRUE if you want to include insertions and deletions, FALSE if you only want to include SNPs.
  • perform_snpfreq : set to TRUE if you want to run PoPoolation2 SNP frequency analysis (not currently included in later steps)
  • perform_fst : set to TRUE if you want to run PoPoolation2 Fixation index (required for future steps)
  • perform_fet : set to TRUE if you want to run PoPoolation2 Fisher's Exact Test (required for future steps)

If both include.multiallelic and include.indels are set to FALSE, only biallelic SNPs are included. Note that both Fst and Fet analyses are required for later steps in the pipeline; you have the option of running them one at a time if you are interested in trying out different parameters. These parameters are:

  • min_count : The minimum count of the minor allele, used for SNP identification. If you have already filtered your data, it is okay to set this value fairly low.
  • min_cov : The minimum coverage, used for SNP identification. You can set this value to the same minimum you set in filtering.
  • max_cov : The maximum coverage, which can be set in several ways:
    • 500 : A maximum coverage of 500 will be used for all populations.
    • c(500, 200, 400) : A maximum coverage of 500 will be used for the first population, a maximum coverage of 200 for the second population, and so on.
    • 2% : The 2% highest coverages will be ignored. This value is independently estimated for every population.
  • min_covered_fract, window_size, step_size : These values affect sliding window size and coverage. If you are analyzing site-by-site, leave all of these variables at '1'. Support for sliding windows will be implemented in future versions.
  • pool_size : Number of individuals in each pool. This can be approximate, but is mandatory. It can be set in several ways:
    • 50 : All samples have a pool size of 50
    • c(50, 60, 100) : First sample has a pool size of 50, second has a pool size of 60, etc.
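As a concrete example for a hypothetical three-pool dataset, the parameters above could be set as follows (values are illustrative only):

```{r}
# Illustrative PopoolationParameters settings for three pools
include.multiallelic <- FALSE     # biallelic SNPs only
include.indels       <- FALSE
perform_snpfreq      <- FALSE     # not used in later steps
perform_fst          <- TRUE      # required for later steps
perform_fet          <- TRUE      # required for later steps

min_count <- 2                    # minimum minor allele count
min_cov   <- 3                    # match the minimum depth used in filtering
max_cov   <- c(500, 200, 400)     # per-population maxima (or a single value, or "2%")
min_covered_fract <- 1            # site-by-site analysis:
window_size       <- 1            #   leave the sliding-window
step_size         <- 1            #   settings at 1
pool_size <- c(50, 60, 100)       # individuals per pool
```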

Running PoPoolation2

While PoPoolation2 natively handles multiple pools, it will discard missing data (i.e., if a SNP is called in only 11 of your 12 pools, it will be dropped). To get around this and keep all potentially informative sites, we run PoPoolation2 in a pairwise fashion and combine the output.
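Because every pair of pools is analyzed, the number of PoPoolation2 runs grows quadratically: with n pools there are choose(n, 2) pairwise comparisons, e.g.:

```{r}
n_pools <- 12
choose(n_pools, 2)  # 66 pairwise comparisons for 12 pools
```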

The next code chunk, "PopoolationSetupRun", takes the earlier dataframes produced by the script and coerces them into the .sync format required. It also prints output about the types and relative amounts of sites in your data; if you notice a decrease in the number of SNPs as compared to the number post-filtering, it is likely because indels longer than 1bp are dropped at this step. This step also produces several output files:

  • in <project name>/output/ :
    • All variable sites saved in <project name>_<num pools>_allvar.txt
    • All SNPs saved in <project name>_<num pools>_allsnps.txt
    • All bi-SNPs saved in <project name>_<num pools>_bisnps.txt
    • All INDELs saved in <project name>_<num pools>_allindels.txt
  • in <project name>/popoolation/ :
    • popool2_run.sh : Generated script for running all PoPoolation2 analyses per pair of populations
    • <project name>_<pool1>_<pool2>.sync : One .sync file created per each pair of populations
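For reference, each line of a PoPoolation2 .sync file lists the contig, position, reference base, and one A:T:C:G:N:del allele-count column per pool; a line in a pairwise file might look like this (names and counts are illustrative):

```
contig_101   245   A   12:0:7:0:0:0   15:0:1:0:0:0
```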

If you are unsure about your parameters or filtering, now is the time to re-run the previous steps. Once you are ready to move forward, take a look at the chunk "PopoolationRun". This chunk has the following options:

  • use_parallel : set to TRUE if you have gnu-parallel installed on your machine and have multiple processors. Set to FALSE if you do not wish to run in parallel (note - this can take several hours).
  • no_cores : number of cores to use for analysis - it is recommended that you not use all cores available, as this may interfere with any other running processes.
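A reasonable way to set these options, sketched here using the base R parallel package to count available cores, is:

```{r}
use_parallel <- TRUE                      # requires GNU parallel to be installed
no_cores <- parallel::detectCores() - 1   # leave one core free for other processes
```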

Once you have set parallelization options, run this chunk to run PoPoolation2 for each analysis for each pair of pools. This may take a little while depending on pool size and number of pools.

Post-PoPoolation2 Analysis

Take a look at your <project name>/popoolation/ directory after the run is completed - for each combination of populations, you should see a .fst file, a .fet file, and a .params file for each. Let's look at the .params files first.

These files are simply logs of the parameters used in the PoPoolation2 run; they won't be used in the analysis, but are useful to look back on. Next, let's take a look at the .fst and .fet files.

The columns important to us are the first (contig ID), second (position), fifth (depth at site), and sixth (value - in this case Fet). In the next steps, we will combine sites by contig and position in order to evaluate SNPs over multiple populations. The next chunk is "AnalysisParameters", which allows you to adjust multiple options controlling the final output and visualization:

  • strong_diff : An Fst value between 0 and 1 considered strong differentiation - this will vary substantially from dataset to dataset. Note - this will only affect output files, not any summary calculations.
  • p_cutoff : A Fet p-value considered significant. Note - this will only affect output files, not any summary calculations.
  • fasta_generation : If set to TRUE, will pull contig sequences and create FASTA files for both strongly differentiated and alternatively fixed sites for later use in BLAST or similar applications.
  • first_mincov, last_mincov, cov_step : Minimum coverage levels to use for analysis series; coverage will go from the min to the max by the step, e.g. first_mincov=5, last_mincov=75, cov_step=5 will produce analyses for 5x, 10x, 15x...70x, 75x coverage.
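For example (values illustrative), with the coverage series the last three parameters generate shown via seq():

```{r}
strong_diff      <- 0.5   # Fst considered strong differentiation (dataset-dependent)
p_cutoff         <- 0.01  # Fet p-value considered significant
fasta_generation <- TRUE  # write FASTAs for differentiated/fixed sites

first_mincov <- 5
last_mincov  <- 75
cov_step     <- 5
seq(first_mincov, last_mincov, by = cov_step)  # 5, 10, 15, ..., 70, 75
```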

The final data-wrangling chunk is "AnalysisRun", which will return dataframes needed for visualization steps, summarize the number of sites per coverage level, and write the following output files in <project name>/output/:

  • All variable sites saved in <project name>_allvar_postPopoolation2.csv
  • All variable sites called in all pools saved in <project_name>_calledAllPools_postPopoolation2.csv
  • All strongly differentiated sites saved in <project_name>_strongly_differentiated_sites.csv
  • All alternatively fixed sites saved in <project_name>_fixed_snps.csv
  • All contigs which contain a strongly differentiated site saved in <project_name>_strongly_differentiated_sites_fasta.fa
  • All contigs which contain a fixed site saved in <project_name>_fixed_sites_fasta.fa
  • All significant p-value sites saved in <project_name>_lowP_snps.csv

Visualization

Coming soon...

Output files

In /output/:

  • *_allvar.txt : All variable sites after filtering, SYNC format with header.
  • *_allsnps.txt : All SNPs (biallelic & multiallelic) after filtering, SYNC format with header.
  • *_bisnps.txt : All biallelic SNPs after filtering, SYNC format with header.
  • *_allindels.txt : All 1bp insertions and deletions after filtering, SYNC format with header.
  • *_allvar_postPopoolation2.csv : All variable sites after PoPoolation2, csv format.
  • *_calledAllPools_postPopoolation2.csv : All variable sites called in all pools, csv format.
  • *_strongly_differentiated_sites.csv : All strongly differentiated (in at least one comparison) variable sites, csv format.
  • *_fixed_snps.csv : All alternatively fixed (in at least one comparison) variable sites, csv format.
  • *_lowP_snps.csv : All variable sites with a significant p-value (from Fisher's Exact Test, in at least one comparison), csv format.
  • .csv files above have the following columns:
    • snpid : Contig name + variable site position in contig
    • CHROM : Contig name
    • POS : Variable site position in contig
    • NS : Number of pools in which variable site is called
    • TDP : Total read depth across all pools
    • TYPE : Variable site type (SNP, complex, etc.)
    • AN : Number of alleles at site
    • REF : Reference allele
    • ALT : Alternate allele
    • Rlen : Length (bp) of reference allele
    • Alen : Length (bp) of alternate allele
    • DP.* : Read depth for each pool
    • RO.* : Number of reference allele observations for each pool
    • AO.* : Number of alternate allele observations for each pool
    • *.fet : FET p-value for each comparison between pools
    • *.fst : FST score for each comparison between pools
  • *_strongly_differentiated_sites_fasta.fa : All contigs which contain a strongly differentiated site in at least one comparison, fasta format.
  • *_fixed_sites_fasta.fa : All contigs which contain a fixed site in at least one comparison, fasta format.
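Since the .csv outputs share the column scheme above, downstream exploration in R is straightforward; for instance (the file name is illustrative - substitute your project name - and "P1.P2.fst" stands in for one of your per-comparison *.fst columns):

```{r}
# Load the all-pools table and keep sites strongly differentiated
# in one hypothetical comparison column, here called P1.P2.fst
snps <- read.csv("MyProject/output/MyProject_calledAllPools_postPopoolation2.csv")
subset(snps, P1.P2.fst >= 0.5)
```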

In /popoolation/:

  • *.sync : PoPoolation2 input in allele counts per SNP, per pairwise comparison.
  • *.fst : PoPoolation2 FST output in FST values per SNP, per pairwise comparison.
  • *.fst.params : Log file per FST comparison.
  • *.fet : FET output in log p-values per SNP, per pairwise comparison.
  • *.fet.params : Log file per FET comparison.
  • *_rc : Major and minor alleles per SNP, per pairwise comparison.
  • *_pwc : Differences in allele frequencies per SNP, per pairwise comparison.

All output files in this directory are produced by PoPoolation2. See the PoPoolation2 documentation for more detailed information about these files.

In /logs/:

  • filter.log : Saves parameters and summary output from filtering steps.
  • setup.log : Saves parameters and summary output from vcf and sync file setup steps.
  • popoolation2.log : Saves parameters and summary output from PoPoolation2.
  • analysis.log : Saves parameters and summary output from FST/FET analysis.

Bibliography

  1. RStudio Team (2015). RStudio: Integrated Development for R. RStudio, Inc., Boston, MA URL http://www.rstudio.com/.
  2. The Variant Call Format and VCFtools (2011). Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert Handsaker, Gerton Lunter, Gabor Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin and 1000 Genomes Project Analysis Group, Bioinformatics.
  3. vcflib: A C++ library for parsing and manipulating VCF files. Erik Garrison. https://github.com/vcflib/vcflib
  4. O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.
  5. Robert Kofler, Ram Vinay Pandey, Christian Schlötterer; PoPoolation2: identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq), Bioinformatics, Volume 27, Issue 24, 15 December 2011, Pages 3435–3436, https://doi.org/10.1093/bioinformatics/btr589