Constructionofevolutionarytree - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Construction of evolutionary tree

1. Overview

Phylogenetic tree construction is a powerful tool in evolutionary biology that helps to understand the evolutionary relationships among different species or strains of viruses. This document outlines the steps and considerations for constructing a phylogenetic tree for a virus of interest, which can provide insights into the evolutionary processes and the degree of relatedness to known viruses. Such analysis is crucial for rapidly identifying the host source of viral diseases, exploring transmission pathways, and assessing whether the virus has undergone mutations in the short term. This information can guide clinical decisions, including the choice of medication.

The construction of a viral phylogenetic tree can be done using either the whole genome or signature genes. Since the RNA-dependent RNA polymerase (RdRp) protein is relatively conserved among RNA viruses, it is commonly used for building phylogenetic trees.

2. Objectives

Construct an evolutionary tree of betacoronaviruses, as in Fig. 2 of https://journals.asm.org/doi/full/10.1128/jvi.01953-16.

3. Software and Data

This workflow requires the installation of several bioinformatics tools and the use of specific data files.

Software

  • BLAST v2.16.0
  • trimal v1.5
  • MAFFT v7.525
  • seqkit v2.8.2
  • iqtree v2.3.6

Data

  • Betacoronaviruses background genome

4. Steps

0.1. Software Installation

First, we'll install all the necessary bioinformatics tools using Conda.

conda install mafft trimal iqtree blast seqkit -c bioconda -c conda-forge
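After installation, it can help to confirm that each tool is actually on the PATH before starting. The following is a minimal sketch: the helper name check_tools is hypothetical, and the executable names in the usage line assume that the Conda packages install binaries called mafft, trimal, iqtree, makeblastdb, blastx, and seqkit.

```shell
# check_tools: print "missing: NAME" for every command not found on PATH.
# Returns 0 if all commands were found, 1 otherwise.
# (Hypothetical helper, not part of any of the tools above.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage: check_tools mafft trimal iqtree makeblastdb blastx seqkit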

0.2. Download Betacoronavirus Background Genome Data

You can download these sequences from NCBI by searching for their accession IDs.

You can also use NCBI's web tool Batch Entrez to download them in bulk.

The sequences have already been downloaded, so you can copy them directly into your own directory.


mkdir tree
cd tree
# copy the Betacoronavirus background genomes
cp /home/shipeibo/tree/beta_rdrp.fasta ./
# copy the Betacoronavirus RdRp protein sequences
cp /home/shipeibo/tree/rdrp.faa ./

1.0. BLAST Database Creation and Search

Create a BLAST protein database from rdrp.faa, then run blastx to search the translated nucleotide sequences in beta_rdrp.fasta against it. The hits locate the RdRp region in each sequence.

mkdir blast_db
makeblastdb -in rdrp.faa -dbtype prot -out ./blast_db/rdrp
blastx -query beta_rdrp.fasta -db ./blast_db/rdrp -out results.txt -outfmt 6 -evalue 1e-5

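With -outfmt 6, each line of results.txt describes one hit in 12 tab-separated columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore.

Extract the query ID (column 1) and the start and end positions of the hit on the query (columns 7 and 8) from the BLAST results.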

awk '{print $1,$7,$8}' results.txt > ID_begin_end.txt
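The awk command above keeps every hit, so a query with several hits appears several times, and for hits in a reverse reading frame blastx can report qstart greater than qend. One possible refinement, sketched here under assumptions, keeps only the highest-bitscore hit per query, orders the coordinates, and records the strand in a fourth column. The function name best_hits and the extra strand column are illustrative additions, not part of the original workflow, and a loop that reads this output would need to read four fields instead of three.

```shell
# best_hits: read BLAST tabular output (-outfmt 6) and print, for each query,
# its highest-bitscore hit as: query_id start end strand
# (Hypothetical helper; the strand column is an addition for illustration.)
best_hits() {
    awk '
    {
        # $1 = qseqid, $7 = qstart, $8 = qend, $12 = bitscore
        if (!($1 in best) || ($12 + 0) > best[$1]) {
            if (!($1 in best)) order[++n] = $1   # remember first-seen order
            best[$1] = $12 + 0
            if (($7 + 0) <= ($8 + 0)) { s[$1] = $7; e[$1] = $8; strand[$1] = "+" }
            else                      { s[$1] = $8; e[$1] = $7; strand[$1] = "-" }
        }
    }
    END {
        for (i = 1; i <= n; i++) {
            q = order[i]
            print q, s[q], e[q], strand[q]
        }
    }' "$1"
}
```

Usage: best_hits results.txt > ID_begin_end_strand.txt (the output file name is only an example). Sequences on the "-" strand would still need to be reverse-complemented after extraction.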

2.0. Sequence Extraction

Extract the RdRp sequences based on the IDs and positions obtained from the BLAST results.

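#!/bin/bash

FASTA_FILE="beta_rdrp.fasta"
ID_FILE="ID_begin_end.txt"
OUTPUT_FILE="extracted_sequences.fasta"

# Start from an empty output file
: > "$OUTPUT_FILE"

# For each hit, select the record by ID, then cut out the BEGIN:END region.
# According to the seqkit help text, --chr only takes effect together with
# --gtf or --bed, so the record is selected with seqkit grep instead.
while read -r ID BEGIN END; do
    seqkit grep -p "$ID" "$FASTA_FILE" | seqkit subseq -r "$BEGIN:$END" >> "$OUTPUT_FILE"
done < "$ID_FILE"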
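As a quick sanity check after the extraction, each line of ID_begin_end.txt should correspond to exactly one record in extracted_sequences.fasta. The sketch below compares the two counts; the function name check_extraction is hypothetical, and the one-record-per-line expectation is an assumption about the intended behaviour of the script.

```shell
# check_extraction ID_FILE FASTA_FILE
# Compare the number of coordinate lines with the number of FASTA records
# (header lines starting with ">"). Hypothetical helper for illustration.
check_extraction() {
    id_file=$1
    fasta_file=$2
    expected=$(grep -c '' "$id_file" || true)      # number of lines
    found=$(grep -c '^>' "$fasta_file" || true)    # number of FASTA headers
    if [ "$expected" -eq "$found" ]; then
        echo "OK: $found records"
    else
        echo "MISMATCH: expected $expected records, found $found"
        return 1
    fi
}
```

Usage: check_extraction ID_begin_end.txt extracted_sequences.fasta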

3.0. Sequence Alignment

Align the extracted sequences using MAFFT.

mafft --auto extracted_sequences.fasta > extracted_sequences_aln.fasta
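In a valid multiple sequence alignment every sequence has the same length, gaps included. The sketch below lists the distinct sequence lengths in a FASTA file, so a single value in the output suggests the alignment is well-formed. The function name aln_lengths is hypothetical.

```shell
# aln_lengths FASTA_FILE
# Print each distinct sequence length found in the file, one per line.
# A well-formed alignment yields exactly one value.
aln_lengths() {
    awk '
    /^>/ { if (seen) print len; seen = 1; len = 0; next }   # new record
         { gsub(/[ \t\r]/, ""); len += length($0) }         # sequence line
    END  { if (seen) print len }
    ' "$1" | sort -n -u
}
```

Usage: aln_lengths extracted_sequences_aln.fasta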

4.0. Sequence Trimming

Trim the aligned sequences to remove poorly aligned regions using trimAl.

trimal -in extracted_sequences_aln.fasta -out extracted_sequences_aln_trimal.fasta -gt 0.8 -cons 5

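Parameters of trimAl: -gt 0.8 removes alignment columns in which fewer than 80% of the sequences have a residue (that is, columns with more than 20% gaps), and -cons 5 guarantees that at least 5% of the columns of the original alignment are kept.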
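To get a feel for the gap threshold, the sketch below counts how many columns of an alignment have a non-gap fraction at or above a given value. It only illustrates the gap-fraction rule and is not trimAl's implementation; in particular it ignores the -cons safeguard. The function name count_kept_columns is hypothetical, and "-" is assumed to be the only gap character.

```shell
# count_kept_columns ALIGNMENT THRESHOLD
# Print "kept total": the number of columns whose fraction of non-gap
# characters is >= THRESHOLD, followed by the total number of columns.
count_kept_columns() {
    awk -v thr="$2" '
    /^>/ { if (seq != "") seqs[++n] = seq; seq = ""; next }
         { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END {
        if (seq != "") seqs[++n] = seq
        total = length(seqs[1])
        kept = 0
        for (c = 1; c <= total; c++) {
            nongap = 0
            for (i = 1; i <= n; i++)
                if (substr(seqs[i], c, 1) != "-") nongap++
            if (nongap / n >= thr + 0) kept++
        }
        print kept, total
    }' "$1"
}
```

Usage: count_kept_columns extracted_sequences_aln.fasta 0.8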

5.0. Phylogenetic Tree Construction

Construct a phylogenetic tree using IQ-TREE. For more information about IQ-TREE, see the IQ-TREE website.

mkdir iqtree 
iqtree -s extracted_sequences_aln_trimal.fasta --prefix ./iqtree/betacoronaviruses -T 4 --mem 3G --ufboot 1000 --boot-trees
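With the prefix used above, IQ-TREE writes the maximum-likelihood tree with bootstrap supports to ./iqtree/betacoronaviruses.treefile in Newick format. The sketch below pulls the support values out of such a file and counts how many reach a chosen cutoff; the IQ-TREE documentation suggests treating ultrafast bootstrap supports of 95 or more as reliable. The function names are hypothetical, and the parsing assumes that supports are written as plain numbers directly after a closing parenthesis, for example ")98:0.0123".

```shell
# support_values TREEFILE
# Print the internal-node support values of a Newick tree, one per line.
support_values() {
    grep -o ')[0-9][0-9.]*' "$1" | tr -d ')'
}

# count_supported TREEFILE MIN
# Count the support values that are >= MIN.
count_supported() {
    support_values "$1" | awk -v min="$2" '($1 + 0) >= (min + 0) { k++ } END { print k + 0 }'
}
```

Usage: count_supported ./iqtree/betacoronaviruses.treefile 95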