Waltkthrough Divergence - lczech/popoolation GitHub Wiki
Walkthrough for calculating the divergence between species
Requirements
see [Manual]
Data
Download the full genomes of D. melanogaster, D. simulans and D. yakuba from FlyBase (http://flybase.org/). used files:
- dmel-all-chromosome-r5.29.fasta.gz
- dsim-all-chromosome-r1.3.fasta.gz
- dyak-all-chromosome-r1.3.fasta.gz
Unzip the files
Walkthrough
Prepare reference genomes
Many downstream software tools have problems with headers containing any whitespace. As a cautionary measure it is recommended to keep only the not-whitespace characters of the fasta headers.
awk '{print $1}' dmel-all-chromosome-r5.29.fasta > dmel-short.fasta
awk '{print $1}' dsim-all-chromosome-r1.3.fasta > dsim-short.fasta
awk '{print $1}' dyak-all-chromosome-r1.3.fasta > dyak-short.fasta
Create multiple genome alignments using mauve
Start Mauve by double clicking.
In case Mauve complains about insufficient memory you may assign more memory by using (progressiveMauve has to be added to the path):
java -Xmx2000m -jar Mauve.jar
- Click File and Align with progressiveMauve
- Click Add sequence and select dmel-short.fasta
- Click Add sequence and select dsim-short.fasta
- Click Add sequence and select dyak-short.fasta
- Note:* the order of the sequences is important. The first should be
the reference sequence, i.e the genome for which the divergence will
be calculated and the last should be the outgroup (D. yakuba)
- Click output file and enter: dmel-dsim-dyak
- Press Align
Parse the Mauve output
The output of Mauve is in the multi fasta alignment format. To calculate the divergence of D.melanogaster using a sliding window approach it is necessary to reformat the output.
perl mauve-parser.pl --ref-input dmel-short.fasta --input mel-sim-yak --output multi-align-dmel-anchored --crosscheck
This will create an output like this:
2L 10588 A A A
2L 10589 T - -
2L 10590 T - T
2L 10591 T T T
2L 10592 T T T
- column 1: the chromosome id of D. melanogaster
- column 2: the position in the respective chromosome
- column 3: the reference character of D. melanogaster
- column 4: the reference character of D. simulans
- column 5: the reference character of D. yakuba
Calculate the divergence using a sliding window approach
perl calculate-dxy.pl --input multi-align-dmel-anchored --window-size 50000 --step-size 50000 --output divergence
Reformat for the script Visualize-output.pl
awk 'BEGIN{OFS="\t"}{print $1,$2,"-","-",$3}' divergence > divergence.reformated
Create a graphical overview
perl ~/dev/PopGenTools/Visualise-output.pl --input divergence.reformated --output dmel.div5k.pdf --ylab dxy --chromosomes "X 2L 2R 3L 3R 4"
The result should be like this: http://popoolation.googlecode.com/files/dmel.div5k.pdf