Preparing data for bpp - Pas-Kapli/bpp-tutorial GitHub Wiki
Imap file
The Imap file is a two-column file that contains the assignment of the samples to species. The first column holds the sample names and the second column holds the name of the species each sample belongs to. Each sample name needs to be the same across loci.
The Imap file format is as follows:
SampleName1 SpeciesName
SampleName2 SpeciesName
SampleName3 SpeciesName
SampleName4 SpeciesName
SampleName5 SpeciesName
SampleName6 SpeciesName
SampleName7 SpeciesName
SampleName8 SpeciesName
SampleName9 SpeciesName
SampleName10 SpeciesName
*The separator between the SampleName and the SpeciesName can be space(s), tab(s), or a mixture of the two characters.
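The format rules above can be checked mechanically before running bpp. The following is a minimal sketch, not part of bpp or of this tutorial's scripts: `check_imap` is a hypothetical helper name, and treating a sample that is listed twice as a mistake is an assumption made here. It relies on awk's default field splitting, which accepts any mixture of spaces and tabs as the separator.

```shell
#!/usr/bin/env bash
# check_imap FILE: report lines that do not have exactly two columns and
# sample names that occur more than once (assumed here to be a mistake).
# Hypothetical helper, not part of bpp.
check_imap() {
  awk '
    NF == 0 { next }    # ignore blank lines
    NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1 }
    NF == 2 && seen[$1]++ { printf "line %d: sample %s listed more than once\n", NR, $1; bad = 1 }
    END { exit bad + 0 }    # non-zero exit status if any problem was found
  ' "$1"
}

# Toy demonstration on a made-up three-line Imap file.
demo_dir=$(mktemp -d)
printf 'SampleName1 SpeciesA\nSampleName2\t SpeciesA\nSampleName3 SpeciesB\n' > "$demo_dir/Imap_demo.txt"
check_imap "$demo_dir/Imap_demo.txt" && echo "Imap_demo.txt looks consistent"
```

For a real dataset the same function could be called as `check_imap Imap.txt`; it prints nothing and returns success when no problem is found.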
Download the Imap file for the brown frogs dataset (Imap.txt):
wget https://raw.githubusercontent.com/Pas-Kapli/bpp-tutorial/master/A01_Frogs/data/Imap.txt
Sequence file
Bpp takes as input a single file containing all the aligned loci of the dataset. In this file, the alignments are arranged one after the other, in contrast to other phylogenetic software that usually concatenate them one next to the other.
The sequence file format is as follows:
[#SEQUENCES LOCUS1] [#ALIGNMENT_SITES_LOCUS1]
SequenceNameA^SampleName1 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameB^SampleName2 GGAGCCAACAGAGTTTAACATTCT...
SequenceNameC^SampleName3 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameD^SampleName4 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameE^SampleName5 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameF^SampleName6 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameG^SampleName7 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameH^SampleName8 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameI^SampleName9 GGAGCCAACAGAGTTTAACGTTCT...
SequenceNameJ^SampleName10 GGAGCCAACAGAGTTTAACGTTCT...
.
.
[#SEQUENCES LOCUS2] [#ALIGNMENT_SITES_LOCUS2]
SequenceNameA^SampleName1 TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName2 TCCCTTTCTCGGGCATTG...
SequenceNameB^SampleName3 TCCCTTTCTCRGGCATTG...
SequenceNameC^SampleName4 TCCCTTTCTCGGGCATTG...
SequenceNameD^SampleName5 TCCCTTTCTCGGGCATTG...
SequenceNameE^SampleName6 TCCCTTTCTCGGGCATTG...
SequenceNameF^SampleName7 TCCCTTTCTCGGGCATTG...
SequenceNameG^SampleName8 TCCCTTTCTCGGGCATTG...
SequenceNameH^SampleName9 TCCCTTTCTCGGGCATTG...
SequenceNameI^SampleName10 TCCCTTTCTCGGGCATTG...
.
.
The SampleNames refer to the particular individual animals sequenced for this locus and correspond to the SampleNames in the Imap file. The SequenceNames may be codes for the particular sequence and they are ignored by the program.
The following format is also valid:
[#SEQUENCES LOCUS1] [#ALIGNMENT_SITES_LOCUS1]
^SampleName1 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName2 GGAGCCAACAGAGTTTAACATTCT...
^SampleName3 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName4 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName5 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName6 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName7 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName8 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName9 GGAGCCAACAGAGTTTAACGTTCT...
^SampleName10 GGAGCCAACAGAGTTTAACGTTCT...
.
.
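Because the text after the `^` has to correspond to a sample name in the Imap file, a quick cross-check of the two files can catch typos early. The sketch below is one possible way to do this with awk; `missing_samples` is a hypothetical helper name, and the demo files are made up for illustration. It accepts both formats shown above (with or without a sequence name before the `^`).

```shell
#!/usr/bin/env bash
# missing_samples SEQFILE IMAP: print every sample name that follows a "^" in
# SEQFILE but has no entry in the first column of IMAP.
# Hypothetical helper, not part of bpp. Assumes IMAP is not empty.
missing_samples() {
  awk '
    NR == FNR { if (NF == 2) known[$1] = 1; next }   # first file read: the Imap
    $1 ~ /\^/ {                                      # sequence lines only
      sample = $1
      sub(/^.*\^/, "", sample)                       # keep the text after "^"
      if (!(sample in known) && !reported[sample]++) print sample
    }
  ' "$2" "$1"
}

# Toy demonstration: SampleName3 is used in the sequences but absent from the Imap.
demo_dir=$(mktemp -d)
printf 'SampleName1 SpeciesA\nSampleName2 SpeciesB\n' > "$demo_dir/Imap_demo.txt"
printf '3 8\nSeqA^SampleName1 GGAGCCAA\n^SampleName2 GGAGCCAA\nSeqC^SampleName3 GGAGCCAA\n' > "$demo_dir/seq_demo.txt"
missing_samples "$demo_dir/seq_demo.txt" "$demo_dir/Imap_demo.txt"   # prints SampleName3
```

An empty output means that every sample named in the sequence file has an entry in the Imap file.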
Prepare sequence file for large datasets
Preparing the sequence file by hand can be a challenging task for large datasets. Below is an example of how we can convert a collection of independent phylip formatted files into a single BPP sequence input file.
In a folder containing the desired phylip alignments, we run the following for loop. It writes all alignments to a single file, with two blank lines after each alignment.
Download a zip folder containing the four alignment files, unzip it and create the "bpp_seqfile.txt" with these commands:
wget https://github.com/Pas-Kapli/bpp-tutorial/raw/master/A01_Frogs/data/individual_loci.zip
unzip individual_loci.zip
rm individual_loci.zip
cd individual_loci
for i in *phy; do cat "${i}"; printf '\n%.0s' {1..2}; done > bpp_seqfile.txt
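After concatenating, it can be worth confirming that each locus header still agrees with the alignment that follows it. The sketch below is an assumption-laden sanity check rather than anything bpp provides: `check_loci` is a hypothetical helper name, and it assumes sequential (non-interleaved) phylip with one sequence per line and a header consisting of two integers.

```shell
#!/usr/bin/env bash
# check_loci FILE: for every locus, compare the header ("#sequences #sites")
# with the number of sequence lines and the length of each sequence.
# Hypothetical helper; assumes one sequence per line (sequential phylip).
check_loci() {
  awk '
    function finish() {    # report on the locus that just ended
      if (locus && got != want) {
        printf "locus %d: header says %d sequences, found %d\n", locus, want, got
        bad = 1
      }
    }
    # A header line consists of exactly two integers.
    NF == 2 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
      finish(); locus++; want = $1; sites = $2; got = 0; next
    }
    locus && NF >= 2 {     # a sequence line: name followed by the sequence
      got++
      seq = $0
      sub(/^[ \t]*[^ \t]+[ \t]+/, "", seq)   # drop the name
      gsub(/[ \t]/, "", seq)                 # drop spaces inside the sequence
      if (length(seq) != sites) {
        printf "locus %d: %s has %d sites, header says %d\n", locus, $1, length(seq), sites
        bad = 1
      }
    }
    END { finish(); printf "%d loci checked\n", locus; exit bad + 0 }
  ' "$1"
}

# Toy demonstration: two made-up loci, concatenated as in the loop above.
demo_dir=$(mktemp -d)
printf '2 6\nSampleName1 GGAGCC\nSampleName2 GGAGCT\n' > "$demo_dir/locus1.phy"
printf '2 4\nSampleName1 TCCC\nSampleName2 TCCC\n' > "$demo_dir/locus2.phy"
for i in "$demo_dir"/*.phy; do cat "$i"; printf '\n\n'; done > "$demo_dir/bpp_seqfile_demo.txt"
check_loci "$demo_dir/bpp_seqfile_demo.txt"   # prints "2 loci checked"
```

On the tutorial data the call would be `check_loci bpp_seqfile.txt`; any mismatch is printed and the function returns a non-zero status.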
Explore the new file with less:
less bpp_seqfile.txt
Move up and down with Page-Up/Down, quit with q
Next, we want to change the sample names according to the BPP sequence format. That is easily done in bash with a sed command.
To rename a single sample, the syntax of the command would look like this (but do not run the command):
sed -i "s/^SampleName /\^SampleName /g" bpp_seqfile.txt
Optional explanation: The command replaces every occurrence of 'SampleName' that appears at the beginning of a line with '^SampleName'. It is a 'search and replace': sed searches for the pattern between the first pair of slashes and replaces it with the text between the second pair. The first ^ character means "start of line", i.e. we only want to match 'SampleName' when it appears at the beginning of a line. The trailing space in the search pattern prevents 'SampleName1' from also matching 'SampleName10'. The second ^ appears as '\^': the backslash (\) escapes the ^ so that it is treated as a literal character rather than as an anchor.
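For datasets with many samples, writing one sed command per sample becomes tedious. As an alternative sketch (this swaps sed for awk and is not the content of the tutorial's rename script), the sample names can be read from the first column of the Imap file and a `^` prepended to every line that starts with one of them. `add_carets` is a hypothetical helper name, and the approach assumes that the names in the phylip files are exactly the sample names used in the Imap file.

```shell
#!/usr/bin/env bash
# add_carets SEQFILE IMAP: write SEQFILE to standard output, prepending "^" to
# every line whose first field is a sample name listed in IMAP.
# Hypothetical helper; assumes IMAP is not empty and names match exactly.
add_carets() {
  awk '
    NR == FNR { if (NF == 2) known[$1] = 1; next }   # first file read: the Imap
    ($1 in known) { sub(/^[ \t]*/, "^") }            # put "^" at the start of the line
    { print }                                        # headers and blank lines pass through
  ' "$2" "$1"
}

# Toy demonstration with made-up names; whole-field matching keeps
# SampleName1 and SampleName10 apart without any extra care.
demo_dir=$(mktemp -d)
printf 'SampleName1 SpeciesA\nSampleName10 SpeciesB\n' > "$demo_dir/Imap_demo.txt"
printf '2 6\nSampleName1 GGAGCC\nSampleName10 GGAGCT\n' > "$demo_dir/seq_demo.txt"
add_carets "$demo_dir/seq_demo.txt" "$demo_dir/Imap_demo.txt"
```

Unlike `sed -i`, this sketch does not modify the file in place, so the result would be redirected to a new file and inspected before replacing the original. Running it on already renamed output changes nothing, because a first field such as `^SampleName1` is no longer found among the Imap names.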
Download a bash script with the sed commands for the frog data, rename_frog_seqfile.sh, and execute it with the following commands:
wget https://raw.githubusercontent.com/Pas-Kapli/bpp-tutorial/master/A01_Frogs/data/individual_loci/rename_frog_seqfile.sh
chmod +x rename_frog_seqfile.sh
./rename_frog_seqfile.sh bpp_seqfile.txt
After checking the file with less, we move it one folder back:
mv bpp_seqfile.txt ../
cd ../
ls
bpp_seqfile.txt Imap.txt individual_loci
We now have two of the three input files for bpp, the Imap.txt and the bpp_seqfile.txt.
If for any reason you haven't managed to create it, here is the formatted Seqfile for BPP.