BPP Assumptions and Input files - Pas-Kapli/bpp-tutorial GitHub Wiki

The major assumptions of BPP are the following:

1. There is no recombination within a locus.

To ensure this assumption is met, a good strategy is to target short genomic segments (loci), e.g., 500 to 1,000 bp.

2. There is free recombination between loci.

The different loci should be physically distant from one another in the genome. This ensures that recombination between them is common, so that the loci have approximately independent histories.
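Assumptions 1 and 2 together suggest a simple pre-filter on candidate loci: keep short segments, and keep only segments that lie far apart on the same chromosome. The following is a minimal sketch, not part of BPP itself. The `Locus` type, the `select_loci` function, and the `min_gap` spacing threshold are hypothetical names introduced for illustration; the text above gives no specific minimum distance, so the value must be chosen by the user.

```python
from typing import Iterable, List, NamedTuple


class Locus(NamedTuple):
    """A candidate locus (hypothetical helper type, not a BPP structure)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def select_loci(candidates: Iterable[Locus], *, min_gap: int,
                min_len: int = 500, max_len: int = 1000) -> List[Locus]:
    """Greedily keep loci of suitable length that are at least `min_gap`
    bases away from the previously kept locus on the same chromosome.

    The default length bounds follow the 500 to 1,000 bp guideline above;
    `min_gap` has no default because the source gives no value for it.
    """
    kept: List[Locus] = []
    last_end = {}  # chromosome -> end coordinate of the last kept locus
    for locus in sorted(candidates):  # sorts by chromosome, then start
        length = locus.end - locus.start
        if not (min_len <= length <= max_len):
            continue  # too short or too long
        prev_end = last_end.get(locus.chrom)
        if prev_end is not None and locus.start - prev_end < min_gap:
            continue  # too close to the previously kept locus
        kept.append(locus)
        last_end[locus.chrom] = locus.end
    return kept
```

For example, `select_loci(candidates, min_gap=10_000)` uses an arbitrary 10 kb spacing chosen purely for illustration; an appropriate spacing depends on the recombination landscape of the organism under study.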

3. All loci are characterized by neutral evolution.

Loci should be evolving neutrally, meaning that their gene trees are not significantly affected by natural selection.

Despite this assumption, protein-coding genes appear to be usable for BPP analysis even if they show obvious evidence of purifying selection. Most proteins perform similar functions in closely related species, and the main effect of purifying selection on nonsynonymous mutations is a reduction of the neutral mutation rate for the locus. Studies comparing species trees inferred from exons with those inferred from introns or noncoding DNA found highly consistent results between the two kinds of data (Shi and Yang, 2018). Nevertheless, it is prudent to analyze non-coding and coding regions of the genome as separate data sets.

4. All loci evolve in a clock-like manner.

Currently, BPP assumes a constant rate over time (the molecular clock). This is a suitable assumption for analyzing closely related species with sequence divergences below ∼10%. However, relaxed clock models are currently being implemented in BPP, so stay tuned and check for updates if you're working with deeper phylogenies!
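One rough way to check a locus against the ∼10% divergence guideline is to compute the largest uncorrected pairwise distance (p-distance) in its alignment. The sketch below assumes the sequences are already aligned and supplied as a dictionary from name to sequence; the function names are illustrative and not part of BPP. Sites with gaps or ambiguity codes are skipped, which is one of several reasonable conventions.

```python
from typing import Dict


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Only sites where both sequences have an unambiguous base (A, C, G, T)
    are compared; gaps and ambiguity codes are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def max_pairwise_divergence(alignment: Dict[str, str]) -> float:
    """Largest p-distance over all pairs of sequences in one alignment."""
    names = list(alignment)
    return max(
        (p_distance(alignment[x], alignment[y])
         for i, x in enumerate(names) for y in names[i + 1:]),
        default=0.0,  # fewer than two sequences: nothing to compare
    )
```

A locus whose `max_pairwise_divergence` is well above 0.10 may be a poor fit for the strict clock. Note that the p-distance underestimates the true divergence when multiple substitutions have hit the same site, so it is only a coarse screen.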

5. For the MSC model without introgression, we assume there is no migration (gene flow) between species.

If there is evidence of substantial gene flow among the species, it is more appropriate to analyze the data under the MSci model.

Input files:

Every BPP analysis requires three input files:

The Imap file that contains the information about the assignment of the samples to species.
The Sequence file that contains the sequences.
The Control file that contains all the parameters for the analysis and the paths to input and output files.
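Because the Imap file and the Sequence file must agree with each other, a quick consistency check before running an analysis can save time. The sketch below rests on two assumptions about the file layout that should be verified against the BPP documentation: that each non-empty Imap line holds two whitespace-separated columns (an individual label followed by a species name), and that each sequence name ends with `^` followed by the individual label. The helper names `parse_imap` and `unmapped_sequences` are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_imap(text: str) -> Dict[str, str]:
    """Parse Imap content into {individual label: species name}.

    Assumes two whitespace-separated columns per non-empty line.
    """
    mapping: Dict[str, str] = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 columns, got {len(fields)}")
        individual, species = fields
        mapping[individual] = species
    return mapping


def unmapped_sequences(sequence_names: Iterable[str],
                       imap: Dict[str, str]) -> List[str]:
    """Return sequence names whose label after '^' is absent from the Imap.

    Names without a '^' are also reported, since they carry no label.
    """
    missing: List[str] = []
    for name in sequence_names:
        if "^" not in name:
            missing.append(name)
            continue
        label = name.rsplit("^", 1)[1]
        if label not in imap:
            missing.append(name)
    return missing
```

A typical use would be to read the Imap file with `parse_imap(open(path).read())`, collect the sequence names from the Sequence file, and confirm that `unmapped_sequences` returns an empty list before launching BPP.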