Construction of evolutionary tree
1. Overview
Phylogenetic tree construction is a powerful tool in evolutionary biology that helps to understand the evolutionary relationships among different species or strains of viruses. This document outlines the steps and considerations for constructing a phylogenetic tree for a virus of interest, which can provide insights into the evolutionary processes and the degree of relatedness to known viruses. Such analysis is crucial for rapidly identifying the host source of viral diseases, exploring transmission pathways, and assessing whether the virus has undergone mutations in the short term. This information can guide clinical decisions, including the choice of medication.
The construction of a viral phylogenetic tree can be done using either the whole genome or signature genes. Since the RNA-dependent RNA polymerase (RdRp) protein is relatively conserved among RNA viruses, it is commonly used for building phylogenetic trees.
2. Objectives
Construct an evolutionary tree of betacoronaviruses, using Fig. 2 of the reference paper (https://journals.asm.org/doi/full/10.1128/jvi.01953-16) as the target.
3. Software and Data
This workflow requires the installation of several bioinformatics tools and the use of specific data files.
Software
- BLAST v2.16.0
- trimal v1.5
- MAFFT v7.525
- seqkit v2.8.2
- iqtree v2.3.6
Data
- Betacoronaviruses background genome
4. Steps
0.1. Software Installation
First, we'll install all the necessary bioinformatics tools using Conda.
conda install mafft trimal iqtree blast seqkit -c bioconda -c conda-forge
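After installation, it can help to confirm that each tool is actually reachable before starting the workflow. The following is a minimal sketch; the function name missing_tools is hypothetical, and the executable names are assumed to be those provided by the Conda packages above.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing command name per line; prints nothing if all are found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumptions based on the packages installed above.
missing=$(missing_tools mafft trimal iqtree makeblastdb blastx seqkit)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
else
    echo "All tools found."
fi
```

If a tool is reported as missing, check that the Conda environment used for the installation is the one currently activated.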
0.2. Download Betacoronaviruses background genome Data
You can download the background sequences from NCBI by searching for their IDs.
You can also use NCBI's web tool Batch Entrez to download them in bulk.
The files have already been downloaded, so you can copy them directly into your own directory.
mkdir tree
cd tree
# copy Betacoronaviruses background genome
cp /home/shipeibo/tree/beta_rdrp.fasta ./
# copy Betacoronaviruses rdrp protein sequence
cp /home/shipeibo/tree/rdrp.faa ./
1.0. BLAST Database Creation and Search
Create a BLAST database from the protein FASTA file and perform a BLAST search to identify related sequences.
mkdir blast_db
makeblastdb -in rdrp.faa -dbtype prot -out ./blast_db/rdrp
blastx -query beta_rdrp.fasta -db ./blast_db/rdrp -out results.txt -outfmt 6 -evalue 1e-5
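With -outfmt 6, each line of results.txt holds 12 tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
Extract the query ID and the query start and end positions (columns 1, 7, and 8) from the BLAST results.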
awk '{print $1,$7,$8}' results.txt > ID_begin_end.txt
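A query genome can match more than one database protein, in which case results.txt contains several lines for the same query and the command above keeps all of them. The sketch below is one possible way to keep only the highest-scoring hit per query, assuming the default 12-column tabular layout. The function name best_hit_coords and the output file name ID_begin_end.best.txt are hypothetical. For reverse-strand hits, where blastx reports qstart greater than qend, the two positions are swapped so that the start is always the smaller value; such regions would still need to be reverse-complemented separately.

```shell
# Keep the highest-scoring hit per query from BLAST tabular (-outfmt 6) output.
# Columns used: 1 = qseqid, 7 = qstart, 8 = qend, 12 = bitscore.
best_hit_coords() {
    awk -F'\t' '
        !($1 in best) || $12 + 0 > best[$1] {
            best[$1] = $12 + 0
            # Store the smaller coordinate as the start, the larger as the end.
            lo[$1] = ($7 + 0 < $8 + 0) ? $7 : $8
            hi[$1] = ($7 + 0 < $8 + 0) ? $8 : $7
        }
        END { for (id in best) print id, lo[id], hi[id] }
    ' "$1" | sort
}

# Only runs when the BLAST output from the step above is present.
if [ -f results.txt ]; then
    best_hit_coords results.txt > ID_begin_end.best.txt
fi
```

The result has the same three-column layout as ID_begin_end.txt, so it can be used in its place in the extraction step.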
2.0. Sequence Extraction
Extract the RdRp-coding regions from the genome sequences, using the IDs and positions obtained from the BLAST results.
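#!/bin/bash
FASTA_FILE="beta_rdrp.fasta"
ID_FILE="ID_begin_end.txt"
OUTPUT_FILE="extracted_sequences.fasta"
# Start from an empty output file
> "$OUTPUT_FILE"
# For each hit, select the matching record by ID, then cut the hit region from it.
# seqkit's help describes --chr as applying with --gtf/--bed, so seqkit grep
# is used here to select the record before subseq -r cuts the region.
while read -r ID BEGIN END; do
    seqkit grep -p "$ID" "$FASTA_FILE" | seqkit subseq -r "$BEGIN:$END" >> "$OUTPUT_FILE"
done < "$ID_FILE"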
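If the coordinates file holds several lines for the same genome, the output will contain several records with the same ID, which downstream tools may reject or silently rename. A quick check, sketched with standard Unix tools only (the function name duplicate_ids is hypothetical):

```shell
# Print FASTA record IDs (first word of each header) that occur more than once.
duplicate_ids() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort | uniq -d
}

# Only runs when the output of the extraction step is present.
if [ -f extracted_sequences.fasta ]; then
    echo "Records: $(grep -c '^>' extracted_sequences.fasta)"
    echo "Duplicated IDs:"
    duplicate_ids extracted_sequences.fasta
fi
```

An empty list under "Duplicated IDs:" means every record name is unique.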
3.0. Sequence Alignment
Align the extracted sequences using MAFFT.
mafft --auto extracted_sequences.fasta > extracted_sequences_aln.fasta
4.0. Sequence Trimming
Trim the aligned sequences to remove poorly aligned regions using trimAl.
trimal -in extracted_sequences_aln.fasta -out extracted_sequences_aln_trimal.fasta -gt 0.8 -cons 5
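Parameters of trimAl: -gt 0.8 removes alignment columns in which fewer than 80% of the sequences have a residue (that is, columns with gaps in more than 20% of sequences); -cons 5 guarantees that at least 5% of the columns of the original alignment are kept.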
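To see how much trimming removed, the number of alignment columns before and after can be compared. The sketch below uses only awk and assumes both files are in FASTA format; the function name seq_lengths is hypothetical.

```shell
# Print the length of each record in a (possibly line-wrapped) FASTA file, one per line.
seq_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { gsub(/[[:space:]]/, ""); len += length($0) }
         END { if (seen) print len }' "$1"
}

# In a valid alignment every record has the same length,
# so a single distinct value is expected for each file.
for f in extracted_sequences_aln.fasta extracted_sequences_aln_trimal.fasta; do
    if [ -f "$f" ]; then
        echo "$f: $(seq_lengths "$f" | sort -nu | tr '\n' ' ')columns"
    fi
done
```

More than one value printed for a file would indicate that its records differ in length, which means the file is not a proper alignment.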
5.0. Phylogenetic Tree Construction
Construct a phylogenetic tree using IQ-TREE. For more information about IQ-TREE, see the IQ-TREE website.
mkdir iqtree
iqtree -s extracted_sequences_aln_trimal.fasta --prefix ./iqtree/betacoronaviruses -T 4 --mem 3G --ufboot 1000 --boot-trees
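With the options above, IQ-TREE writes its results next to the given prefix: the report (.iqtree), the maximum-likelihood tree in Newick format (.treefile), the bootstrap consensus tree (.contree), the bootstrap trees requested by --boot-trees (.ufboot), and the run log (.log). The following sketch checks which of these files are present and non-empty; the function name list_outputs is hypothetical.

```shell
# List which expected IQ-TREE output files exist (and are non-empty) for a given prefix.
list_outputs() {
    for ext in iqtree treefile contree ufboot log; do
        if [ -s "$1.$ext" ]; then
            echo "found   $1.$ext"
        else
            echo "missing $1.$ext"
        fi
    done
}

# Prefix taken from the iqtree command above.
list_outputs ./iqtree/betacoronaviruses
```

If files are reported as missing, the .log file, when present, is usually the first place to look for the reason.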