Sampling species

Sometimes we need to keep only a sample of the surviving species in our datasets. In that case, we can use the script SpeciesSampler to prepare the data. The usage is:

python SpeciesSampler Mode Input ExperimentFolder

This will generate new datasets in which the species that have not been sampled are removed from the output. The modes are:

  • i: The user gives a file with the species that must be preserved (one species per line).
  • r: The user gives a number between 0 and 1 to determine the proportion of species that will be randomly sampled.
  • n: The user gives a number to determine how many species are randomly sampled.
  • w: The user gives a file (.tsv) with the name of each lineage in the species tree and the probability of sampling that lineage. If the probabilities add up to more than 1, the values are normalized.
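
To make the difference between the random modes concrete, here is a minimal Python sketch of the ideas behind r, n and the normalization step of w. It is not the SpeciesSampler code: the function names are hypothetical, and rounding the proportion to a whole number of species in the r case is an assumption made only for illustration.

```python
import random


def sample_by_proportion(species, proportion, rng):
    """Analogue of mode r: keep a random fraction of the species.

    Assumption: the fraction is rounded to the nearest whole number of species.
    """
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1")
    k = round(proportion * len(species))
    return rng.sample(species, k)


def sample_by_count(species, count, rng):
    """Analogue of mode n: keep a fixed number of randomly chosen species."""
    if not 0 <= count <= len(species):
        raise ValueError("count must be between 0 and the number of species")
    return rng.sample(species, count)


def normalize_weights(weights):
    """Analogue of the normalization in mode w.

    If the probabilities add up to more than 1, rescale them so they sum to 1;
    otherwise return them unchanged.
    """
    total = sum(weights.values())
    if total > 1:
        return {name: w / total for name, w in weights.items()}
    return dict(weights)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    species = [f"n{i}" for i in range(100)]  # hypothetical species names
    print(len(sample_by_count(species, 15, rng)))
    print(len(sample_by_proportion(species, 0.2, rng)))
    print(normalize_weights({"a": 2.0, "b": 1.0, "c": 1.0}))
```

Both random modes draw without replacement, so no species can appear twice in a sample.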

Samples are created in ./ExperimentFolder/SAMPLE_#.

If you launch the script again, it will create a new SAMPLE folder.
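
Since every run adds another SAMPLE folder, it can be handy to list the ones that already exist. The following is a small sketch, assuming only the SAMPLE_ prefix described above; the helper name list_sample_folders is hypothetical and is not part of Zombi.

```python
from pathlib import Path


def list_sample_folders(experiment_folder):
    """Return the SAMPLE_* subfolders of an experiment folder, sorted by name."""
    root = Path(experiment_folder)
    # Keep directories only, in case a regular file happens to share the prefix.
    return sorted(p for p in root.glob("SAMPLE_*") if p.is_dir())


if __name__ == "__main__":
    for folder in list_sample_folders("TestFolder"):
        print(folder)
```

The sort is alphabetical, so a folder named SAMPLE_10 would be listed before SAMPLE_2.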

A (very easy) example

Let us say that we want to create a dataset with 100 species and their genomes, and then sample only 15 species. We would first compute T and G:

python Zombi T SpeciesTreeParameters.tsv TestFolder

python Zombi G GenomeParameters.tsv TestFolder

And then we simply run:

python SpeciesSampler n 15 TestFolder