Preparing data for bpp - Pas-Kapli/bpp-tutorial GitHub Wiki

Imap file

The Imap file is a two-column file that contains the assignment of the samples to species. The first column holds the sample name and the second column holds the name of the species that sample belongs to. Each sample name needs to be the same across loci.

The Imap file format is as follows:

SampleName1  SpeciesName
SampleName2  SpeciesName
SampleName3  SpeciesName
SampleName4  SpeciesName
SampleName5  SpeciesName
SampleName6  SpeciesName
SampleName7  SpeciesName
SampleName8  SpeciesName
SampleName9  SpeciesName
SampleName10 SpeciesName

*The separator between the SampleName and the SpeciesName can be one or more spaces, tabs, or a mixture of the two.
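A quick sanity check of the two-column layout described above can be sketched in the shell. The helper name check_imap and the toy sample and species names are illustrative assumptions, not part of BPP; the check only tests that every non-empty line has exactly two whitespace-separated columns and then counts the samples per species.

```shell
# check_imap FILE  (hypothetical helper, not part of BPP)
# Reports lines that do not have exactly two whitespace-separated columns,
# then prints "species count" for every species found in the file.
check_imap() {
    awk '
        NF == 0 { next }                       # ignore blank lines
        NF != 2 {
            printf "line %d: expected 2 columns, found %d\n", NR, NF > "/dev/stderr"
            bad = 1
            next
        }
        { count[$2]++ }                        # tally samples per species
        END {
            for (sp in count) print sp, count[sp]
            exit bad + 0                       # non-zero if any line was malformed
        }
    ' "$1"
}

# Toy Imap file mixing spaces and tabs as separators (made-up names).
imap_demo=$(mktemp)
printf 'SampleName1  SpeciesA\nSampleName2\tSpeciesA\nSampleName3 \t SpeciesB\n' > "$imap_demo"
check_imap "$imap_demo" | sort
rm -f "$imap_demo"
```

On a real dataset the same function could be called as `check_imap Imap.txt`; the per-species counts make it easy to spot a misspelled species name, which would show up as an extra species.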

Download the Imap file for the Brown Frogs dataset, Imap.txt:

wget https://raw.githubusercontent.com/Pas-Kapli/bpp-tutorial/master/A01_Frogs/data/Imap.txt

Sequence file

BPP takes as input a single file containing all the aligned loci of the dataset. In this file the alignments are placed one after the other, in contrast to other phylogenetic software, which usually concatenates them side by side.

The sequence file format is as follows:

[#SEQUENCES LOCUS1] [#ALIGNMENT_SITES_LOCUS1]

SequenceNameA^SampleName1     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameB^SampleName2     GGAGCCAACAGAGTTTAACATTCT...
SequenceNameC^SampleName3     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameD^SampleName4     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameE^SampleName5     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameF^SampleName6     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameG^SampleName7     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameH^SampleName8     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameI^SampleName9     GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameJ^SampleName10    GGAGCCAACAGAGTTTAACGTTCT...
.
.

[#SEQUENCES LOCUS2] [#ALIGNMENT_SITES_LOCUS2]

SequenceNameA^SampleName1     TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName2     TCCCTTTCTCGGGCATTG...
SequenceNameC^SampleName3     TCCCTTTCTCRGGCATTG...
SequenceNameD^SampleName4     TCCCTTTCTCGGGCATTG...
SequenceNameE^SampleName5     TCCCTTTCTCGGGCATTG...
SequenceNameF^SampleName6     TCCCTTTCTCGGGCATTG...
SequenceNameG^SampleName7     TCCCTTTCTCGGGCATTG...
SequenceNameH^SampleName8     TCCCTTTCTCGGGCATTG...
SequenceNameI^SampleName9     TCCCTTTCTCGGGCATTG...
SequenceNameJ^SampleName10    TCCCTTTCTCGGGCATTG...
.
.

The SampleNames refer to the particular individual animals sequenced for this locus and correspond to the SampleNames in the Imap file. The SequenceNames may be codes for the particular sequence and they are ignored by the program.
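Because the SampleNames after the ^ must match the Imap file, it can help to list any name in the sequence file that has no Imap entry. The following is a minimal sketch under stated assumptions: the helper name missing_samples and the toy data are hypothetical, and the check assumes that the name is the first whitespace-separated field of each sequence line.

```shell
# missing_samples IMAP SEQFILE  (hypothetical helper, not part of BPP)
# Prints every sample name that follows "^" in SEQFILE but does not
# appear in the first column of IMAP. Assumes IMAP is not empty.
missing_samples() {
    awk '
        NR == FNR { if (NF >= 2) known[$1] = 1; next }    # first file: Imap
        index($1, "^") {                                  # second file: sequence lines
            sample = substr($1, index($1, "^") + 1)       # text after the caret
            if (!(sample in known) && !seen[sample]++) print sample
        }
    ' "$1" "$2"
}

# Toy data (made-up names): SampleName3 is absent from the Imap file.
demo_dir=$(mktemp -d)
printf 'SampleName1 SpeciesA\nSampleName2 SpeciesB\n' > "$demo_dir/imap"
printf '3 4\n\nSeqA^SampleName1  ACGT\n^SampleName2  ACGT\n^SampleName3  ACGT\n' > "$demo_dir/seq"
missing_samples "$demo_dir/imap" "$demo_dir/seq"
rm -rf "$demo_dir"
```

An empty result means every sample in the sequence file is assigned to a species.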

The following format is also valid:

[#SEQUENCES LOCUS1] [#ALIGNMENT_SITES_LOCUS1]

^SampleName1     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName2     GGAGCCAACAGAGTTTAACATTCT...
^SampleName3     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName4     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName5     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName6     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName7     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName8     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName9     GGAGCCAACAGAGTTTAACGTTCT...
^SampleName10    GGAGCCAACAGAGTTTAACGTTCT...
.
.
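The header line of each locus states the number of sequences and the number of alignment sites, so both can be compared with what actually follows. The sketch below is one possible check, not a BPP tool: the helper name check_loci is hypothetical, and it assumes that each sequence sits on a single line (name first, then the sequence) and that a line made of exactly two integers is a locus header.

```shell
# check_loci SEQFILE  (hypothetical helper, not part of BPP)
# For every locus, compares the header counts with the number of sequence
# lines and with the length of each sequence.
check_loci() {
    awk '
        function close_locus() {
            if (locus && got != nseq) {
                printf "locus %d: header says %d sequences, found %d\n", locus, nseq, got
                bad = 1
            }
        }
        NF == 0 { next }                                   # skip blank lines
        NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {    # locus header
            close_locus()
            locus++; nseq = $1 + 0; nsites = $2 + 0; got = 0
            next
        }
        {                                                  # sequence line
            got++
            seq = ""
            for (i = 2; i <= NF; i++) seq = seq $i         # join space-separated chunks
            if (length(seq) != nsites) {
                printf "locus %d, %s: expected %d sites, found %d\n", locus, $1, nsites, length(seq)
                bad = 1
            }
        }
        END { close_locus(); print locus + 0, "loci checked"; exit bad + 0 }
    ' "$1"
}

# Toy sequence file with two consistent loci (made-up data).
seq_demo=$(mktemp)
printf '2 4\n\n^SampleName1  ACGT\n^SampleName2  ACGA\n\n\n2 3\n\n^SampleName1  ACG\n^SampleName2  ACA\n' > "$seq_demo"
check_loci "$seq_demo"
rm -f "$seq_demo"
```

The function returns a non-zero exit status when any mismatch is reported, so it can be used as a guard before starting a run.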

Prepare sequence file for large datasets

Preparing the sequence file by hand can be a challenging task for large datasets. Below is an example of how we can convert a collection of independent phylip-formatted files into a single BPP sequence input file.

In a folder containing the desired phylip alignments, we run the following for loop. It writes all alignments to a single file, appending two newline characters after each alignment so that consecutive alignments are separated by blank lines.

Download a zip archive containing the four alignment files, unzip it and create "bpp_seqfile.txt" with these commands:

wget https://github.com/Pas-Kapli/bpp-tutorial/raw/master/A01_Frogs/data/individual_loci.zip
unzip individual_loci.zip
rm individual_loci.zip
cd individual_loci
for i in *phy; do cat "${i}"; printf '\n\n'; done > bpp_seqfile.txt

Explore the new file with less:

less bpp_seqfile.txt

Move up and down with Page Up/Page Down; quit with q.

Next, we want to change the sample names according to the BPP sequence format. That is easily done using bash with a sed command.

To rename a single sample, the syntax of the command looks like this (but do not run the command):

sed -i "s/^SampleName /\^SampleName /g" bpp_seqfile.txt

Optional explanation: The command is a 'search and replace': it searches for the first part between the '/ ... /' and replaces it with the second part between '/ ... /'. Here it replaces 'SampleName' followed by a space, when it appears at the beginning of a line, with '^SampleName' followed by a space. The first ^ character means "start of line", i.e. we only want to match 'SampleName' when it appears at the beginning of the line. The trailing space ensures that the whole name is matched, so that 'SampleName1' does not also match 'SampleName10', and it is kept in the replacement so that the name stays separated from the sequence. The second ^ symbol appears as '\^'. The backslash (\) marks the ^ as a literal symbol rather than an action; in the replacement part it is not strictly required, but it is harmless.
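For many samples, one substitution per name can be generated from the first column of the Imap file instead of being typed by hand. This is only a sketch under assumptions, not the tutorial's script: the helper name add_carets is hypothetical, each sample name is assumed to start its line and be followed by at least one space, and names are assumed to contain no characters that are special to sed (such as / or .). Unlike `sed -i`, it writes to standard output and leaves the input file untouched.

```shell
# add_carets IMAP SEQFILE  (hypothetical helper, not part of BPP)
# Builds one sed rule per sample name in IMAP and applies all of them to
# SEQFILE, printing the renamed alignment to standard output.
add_carets() {
    rules=$(mktemp)
    # One rule per Imap line, e.g.  s/^SampleName1 /^SampleName1 /
    # The space after the name is kept so name and sequence stay separated.
    awk 'NF >= 2 { printf "s/^%s /^%s /\n", $1, $1 }' "$1" > "$rules"
    sed -f "$rules" "$2"
    rm -f "$rules"
}

# Toy data (made-up names); SampleName1 must not match SampleName10.
demo_dir=$(mktemp -d)
printf 'SampleName1 SpeciesA\nSampleName10 SpeciesB\n' > "$demo_dir/imap"
printf '2 4\n\nSampleName1  ACGT\nSampleName10 ACGT\n' > "$demo_dir/seq"
add_carets "$demo_dir/imap" "$demo_dir/seq"
rm -rf "$demo_dir"
```

On real data the output would be redirected to a new file, for example `add_carets Imap.txt bpp_seqfile.txt > bpp_seqfile_renamed.txt` (the output file name is made up here), and inspected with less before use.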

Download a bash script with the sed commands for the frog data, rename_frog_seqfile.sh, and execute it with the following commands:

wget https://raw.githubusercontent.com/Pas-Kapli/bpp-tutorial/master/A01_Frogs/data/individual_loci/rename_frog_seqfile.sh
chmod +x rename_frog_seqfile.sh
./rename_frog_seqfile.sh bpp_seqfile.txt

After checking the file with less, we move it up one directory:

mv bpp_seqfile.txt ../
cd ../
ls
bpp_seqfile.txt  Imap.txt  individual_loci

We now have two of the three input files for bpp, the Imap.txt and the bpp_seqfile.txt.

If for any reason you have not managed to create it, here is the formatted Seqfile for BPP.

Next, learn how to format the control file for Species Tree inference