Virusworkflowexample - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

example:Surveillance of Bat Coronaviruses in Kenya: Identifying Relatives of Human Coronaviruses NL63 and 229E and Their Recombination History

Background

Africa is recognized as a major hotspot for zoonotic emerging diseases. With its rich biodiversity, the continent is home to a multitude of bat species, several of which are known to harbor important zoonotic diseases such as Marburg hemorrhagic fever and rabies. Our initial screening efforts have revealed the presence of diverse coronaviruses (CoVs) in African bats, including samples collected in the southern regions of Kenya and other African countries like South Africa, Nigeria, and Ghana. Recent studies have further provided compelling evidence that HCoV-229E has its origins in bat viruses circulating in Africa, highlighting the zoonotic potential of bat-borne CoVs from this continent.

Data Introduction

In this study, we analyzed 30 metatranscriptome sequencing datasets from Kenyan bats to identify relatives of human coronaviruses and to explore their recombination history. data path: /home/shipeibo/data

Methodology

1. Filtering

1.1: Clean reads with fastp

fastp --detect_adapter_for_pe \\
      --dedup \\
      --dup_calc_accuracy 3 \\
      --dont_eval_duplication \\
      --qualified_quality_phred 20 \\
      --n_base_limit 5 \\
      --average_qual 20 \\
      --length_required 50 \\
      --low_complexity_filter \\
      --correction \\
      --thread 8 \\
      -i sample_test_r1.fq.gz \\
      -o  sample_test_r1.fastp.fq.gz \\
      -I sample_test_r2.fq.gz\\
      -O  sample_test_r2.fastp.fq.gz \\
      --json sample_test.json \\
      --html  sample_test.html

1.2: Remove rRNA with Bowtie2

bowtie2 --local --threads 8 -1 sample_test_r1.fastp.fq.gz -2 sample_test_r2.fastp.fq.gz -x ./rRNA -S sample_test.rRNA.sam --un-conc-gz ./sample_test
mv sample_test.1 sample_test.cleanreads.1.fq.gz
mv sample_test.2 sample_test.cleanreads.2.fq.gz

2. Assembly

megahit --memory 20000000000 --min-contig-len 300 -t 12 --out-dir ./megahit --out-prefix sample_test -1 ./sample_test/sample_test.cleanreads.1.fq.gz -2 ./sample_test/sample_test.cleanreads.2.fq.gz
perl -pe 's/^>/>sample_test-/' ./sample_test/megahit/sample_test.contigs.fa > ./sample_test/megahit/sample_test_addname.fna

3. Virus Identification

Step 3.0 Using seqkit to filter contigs with insufficient length

seqkit seq -m 1000 sample_test_addname.fna > sample_test_addname_gt1k.fna

Step 3.1 Using getORF to find all ORF translates from all assembled contigs

getorf -sequence sample_test_addname_gt1k.fna -outseq sample_test_addname_gt1k.orfs.faa -find 0 -table 1 -minsize 600

Step 3.2 Using HMMsearch to search those ORF translates onto RdRp profiles to find markerprotein

hmmsearch --cpu 20 --noali -o /dev/null --tblout hmmsearch.tblout -E 1e-5 /home/renzirui/database/RdRp_profiles/RdRp_HMM_profile_CLUSTALO.db sample_test_addname_gt1k.orfs.faa

Step 3.3 Fetch the sequence ID with those significant hit to RdRp profiles

cat hmmsearch.tblout | grep -v '#' | awk '{print $1}' > significant_hitID.list

Step 3.4 Mapping those ORF sequence ID into its contig nucleotide ID

cat significant_hitID.list | perl -ne '@a=split/_/; $out=join('_',@a[0,$#a-1]); print "$out\n"' > significant_hitID_contigid.list

Step 3.5 Grep those contigs with hallmark genes (RdRp) from all assemblies

seqkit grep -f significant_hitID_contigid.list sample_test_addname_gt1k.fna > hitRdRp_contigs.fna

Step 3.6 Primitive viral contig classification using VirusHost Database

diamond blastx -o blastx_vhdbcds_results.txt -d <PATH_TO_YOUR_VHDB_DIAMOND_INDEX> -q hitRdRp_contigs.fna --threads 8 --sensitive --max-target-seqs 1 --evalue 1E-5 --block-size 2.0 --index-chunks 1 --outfmt 6 qseqid qlen sseqid slen qstart qend sstart send evalue bitscore length pident mismatch gaps stitle qcovhsp scovhsp

Step 3.7 Fetch the viral sequence matched to the Vertebrate-associated viruses

cat blastx_vhdbcds_results.txt | grep Vertebrata | awk '{print $1}' > vertebrate_contigID.list
seqkit grep -f vertebrate_contigID.list hitRdRp_contigs.fna > vertebrate_assoc_viruses.fna

Step 3.8 run checkv

checkv end_to_end hitRdRp_contigs.fna output_directory -t 16

4. Phylogenetic Tree Construction

The construction of a viral phylogenetic tree can be done using either the whole genome or signature genes. Since the RNA-dependent RNA polymerase (RdRp) protein is relatively conserved among RNA viruses, it is commonly used for building phylogenetic trees.

4.1 download background genome Data

You can download these sequences at NCBI by searching for their IDs.

You can also use NCBI's web tool Batch Entrez to batch download.

4.2 RDRP BLAST Database Creation and Search

makeblastdb -in rdrp.faa -dbtype prot -out ./blast_db/rdrp
blastx -query beta_rdrp.fasta -db ./blast_db/rdrp -out results.txt -outfmt 6 -evalue 1e-5

4.3 Sequence Extraction

#!/bin/bash

awk '{print $1,$7,$8}' results.txt > ID_begin_end.txt

FASTA_FILE="beta_rdrp.fasta"
ID_FILE="ID_begin_end.txt"
OUTPUT_FILE="extracted_sequences.fasta"

> $OUTPUT_FILE

while read -r ID BEGIN END; do
    seqkit subseq --chr $ID -r $BEGIN:$END $FASTA_FILE >> $OUTPUT_FILE
done < $ID_FILE

4.4 Phylogenetic Tree Construction

Align the extracted sequences using MAFFT. Trim the aligned sequences to remove poorly aligned regions using TrimAl. Construct a phylogenetic tree using IQ-TREE.

mafft --auto extracted_sequences.fasta > extracted_sequences_aln.fasta
trimal -in extracted_sequences_aln.fasta -out extracted_sequences_aln_trimal.fasta -gt 0.8 -cons 5
mkdir iqtree 
iqtree -s extracted_sequences_aln_trimal.fasta --prefix ./iqtree/betacoronaviruses -T 4 --mem 3G --ufboot 1000 --boot-trees

5. Viral Genome Annotation

Viruses, despite their great abundance and significance in biological systems, remain largely mysterious. Indeed, the vast majority of the perhaps hundreds of millions of viral species on the planet remain undiscovered. Additionally, many viruses deposited in central databases like GenBank and RefSeq are littered with genes annotated as ‘hypothetical protein’ or the equivalent, a efficient virus discovery and annotation tool is needed to discover familiar or divergent viral sequences from user-input contigs.

5.1 Viral Genome Annotation

cenotetaker3 -c ./vertebrate_assoc_viruses.fna -r my_meta_ct3 -p T

5.2 Visulize genome map output file *.gbf

Visulize the *.gbf in Geneious Also there are some other tools such as UCSC Genome Browser, IGV, SnapGene Viewer can do this, but some of them require converting the file format first.

6. Recombination Analysis

Gene recombination is an important mechanism in the evolution of viruses, allowing for rapid genetic variation. To reveal the role of genetic mechanisms such as recombination in certain gene evolutions, recombination analysis is necessary to identify potential recombination signals and assess their reliability. This tutorial will guide beginners through the use of Simplot, a popular tool for virus recombination analysis.

Step 6.1: Sequence Alignment

Group and align sequences. It is recommended to use unique names for different reference sequence groups to facilitate automatic grouping by Simplot.
Save the alignment as a .fasta file.

mafft --auto extracted_sequences.fasta > extracted_sequences_aln.fasta

Step 6.2: Running Simplot

Launch Simplot and open SeqPage. Click on "File" -> "Open" and select the saved .fasta file. All sequences will be selected by default.
Change the way groups are identified 3
Adjusting parameters :Window Size ,Step Size,Vertical Axis
Switch to the "Simplot" tab and click on "Command" -> "Query" and select Human_coronavirus_229E
The "Start Scan" button in the bottom right corner will become highlighted.

Step 6.3: Result Output

Simplot can save plots directly as BMP images or metafiles, but the resolution may not meet journal requirements. Therefore, it is recommended to export the data and plot it in Excel or other software.

Click on "File" -> "Save Chart Values as CSV" to export the data as a CSV file.
Open the CSV file in Excel, select the data, and insert a line chart to further beautify the plot.