Virusworkflowexample - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
example:Surveillance of Bat Coronaviruses in Kenya: Identifying Relatives of Human Coronaviruses NL63 and 229E and Their Recombination History
Background
Africa is recognized as a major hotspot for zoonotic emerging diseases. With its rich biodiversity, the continent is home to a multitude of bat species, several of which are known to harbor important zoonotic diseases such as Marburg hemorrhagic fever and rabies. Our initial screening efforts have revealed the presence of diverse coronaviruses (CoVs) in African bats, including samples collected in the southern regions of Kenya and other African countries like South Africa, Nigeria, and Ghana. Recent studies have further provided compelling evidence that HCoV-229E has its origins in bat viruses circulating in Africa, highlighting the zoonotic potential of bat-borne CoVs from this continent.
Data Introduction
In this study, we analyzed 30 metatranscriptome sequencing datasets from Kenyan bats to identify relatives of human coronaviruses and to explore their recombination history. data path: /home/shipeibo/data
Methodology
1. Filtering
1.1: Clean reads with fastp
fastp --detect_adapter_for_pe \\
--dedup \\
--dup_calc_accuracy 3 \\
--dont_eval_duplication \\
--qualified_quality_phred 20 \\
--n_base_limit 5 \\
--average_qual 20 \\
--length_required 50 \\
--low_complexity_filter \\
--correction \\
--thread 8 \\
-i sample_test_r1.fq.gz \\
-o sample_test_r1.fastp.fq.gz \\
-I sample_test_r2.fq.gz\\
-O sample_test_r2.fastp.fq.gz \\
--json sample_test.json \\
--html sample_test.html
1.2: Remove rRNA with Bowtie2
bowtie2 --local --threads 8 -1 sample_test_r1.fastp.fq.gz -2 sample_test_r2.fastp.fq.gz -x ./rRNA -S sample_test.rRNA.sam --un-conc-gz ./sample_test
mv sample_test.1 sample_test.cleanreads.1.fq.gz
mv sample_test.2 sample_test.cleanreads.2.fq.gz
2. Assembly
megahit --memory 20000000000 --min-contig-len 300 -t 12 --out-dir ./megahit --out-prefix sample_test -1 ./sample_test/sample_test.cleanreads.1.fq.gz -2 ./sample_test/sample_test.cleanreads.2.fq.gz
perl -pe 's/^>/>sample_test-/' ./sample_test/megahit/sample_test.contigs.fa > ./sample_test/megahit/sample_test_addname.fna
3. Virus Identification
Step 3.0 Using seqkit to filter contigs with insufficient length
seqkit seq -m 1000 sample_test_addname.fna > sample_test_addname_gt1k.fna
Step 3.1 Using getORF to find all ORF translates from all assembled contigs
getorf -sequence sample_test_addname_gt1k.fna -outseq sample_test_addname_gt1k.orfs.faa -find 0 -table 1 -minsize 600
Step 3.2 Using HMMsearch to search those ORF translates onto RdRp profiles to find markerprotein
hmmsearch --cpu 20 --noali -o /dev/null --tblout hmmsearch.tblout -E 1e-5 /home/renzirui/database/RdRp_profiles/RdRp_HMM_profile_CLUSTALO.db sample_test_addname_gt1k.orfs.faa
Step 3.3 Fetch the sequence ID with those significant hit to RdRp profiles
cat hmmsearch.tblout | grep -v '#' | awk '{print $1}' > significant_hitID.list
Step 3.4 Mapping those ORF sequence ID into its contig nucleotide ID
cat significant_hitID.list | perl -ne '@a=split/_/; $out=join('_',@a[0,$#a-1]); print "$out\n"' > significant_hitID_contigid.list
Step 3.5 Grep those contigs with hallmark genes (RdRp) from all assemblies
seqkit grep -f significant_hitID_contigid.list sample_test_addname_gt1k.fna > hitRdRp_contigs.fna
Step 3.6 Primitive viral contig classification using VirusHost Database
diamond blastx -o blastx_vhdbcds_results.txt -d <PATH_TO_YOUR_VHDB_DIAMOND_INDEX> -q hitRdRp_contigs.fna --threads 8 --sensitive --max-target-seqs 1 --evalue 1E-5 --block-size 2.0 --index-chunks 1 --outfmt 6 qseqid qlen sseqid slen qstart qend sstart send evalue bitscore length pident mismatch gaps stitle qcovhsp scovhsp
Step 3.7 Fetch the viral sequence matched to the Vertebrate-associated viruses
cat blastx_vhdbcds_results.txt | grep Vertebrata | awk '{print $1}' > vertebrate_contigID.list
seqkit grep -f vertebrate_contigID.list hitRdRp_contigs.fna > vertebrate_assoc_viruses.fna
Step 3.8 run checkv
checkv end_to_end hitRdRp_contigs.fna output_directory -t 16
4. Phylogenetic Tree Construction
The construction of a viral phylogenetic tree can be done using either the whole genome or signature genes. Since the RNA-dependent RNA polymerase (RdRp) protein is relatively conserved among RNA viruses, it is commonly used for building phylogenetic trees.
4.1 download background genome Data
You can download these sequences at NCBI by searching for their IDs.
You can also use NCBI's web tool Batch Entrez to batch download.
4.2 RDRP BLAST Database Creation and Search
makeblastdb -in rdrp.faa -dbtype prot -out ./blast_db/rdrp
blastx -query beta_rdrp.fasta -db ./blast_db/rdrp -out results.txt -outfmt 6 -evalue 1e-5
4.3 Sequence Extraction
#!/bin/bash
awk '{print $1,$7,$8}' results.txt > ID_begin_end.txt
FASTA_FILE="beta_rdrp.fasta"
ID_FILE="ID_begin_end.txt"
OUTPUT_FILE="extracted_sequences.fasta"
> $OUTPUT_FILE
while read -r ID BEGIN END; do
seqkit subseq --chr $ID -r $BEGIN:$END $FASTA_FILE >> $OUTPUT_FILE
done < $ID_FILE
4.4 Phylogenetic Tree Construction
Align the extracted sequences using MAFFT. Trim the aligned sequences to remove poorly aligned regions using TrimAl. Construct a phylogenetic tree using IQ-TREE.
mafft --auto extracted_sequences.fasta > extracted_sequences_aln.fasta
trimal -in extracted_sequences_aln.fasta -out extracted_sequences_aln_trimal.fasta -gt 0.8 -cons 5
mkdir iqtree
iqtree -s extracted_sequences_aln_trimal.fasta --prefix ./iqtree/betacoronaviruses -T 4 --mem 3G --ufboot 1000 --boot-trees
5. Viral Genome Annotation
Viruses, despite their great abundance and significance in biological systems, remain largely mysterious. Indeed, the vast majority of the perhaps hundreds of millions of viral species on the planet remain undiscovered. Additionally, many viruses deposited in central databases like GenBank and RefSeq are littered with genes annotated as ‘hypothetical protein’ or the equivalent, a efficient virus discovery and annotation tool is needed to discover familiar or divergent viral sequences from user-input contigs.
5.1 Viral Genome Annotation
cenotetaker3 -c ./vertebrate_assoc_viruses.fna -r my_meta_ct3 -p T
5.2 Visulize genome map output file *.gbf
Visulize the *.gbf in Geneious Also there are some other tools such as UCSC Genome Browser, IGV, SnapGene Viewer can do this, but some of them require converting the file format first.
6. Recombination Analysis
Gene recombination is an important mechanism in the evolution of viruses, allowing for rapid genetic variation. To reveal the role of genetic mechanisms such as recombination in certain gene evolutions, recombination analysis is necessary to identify potential recombination signals and assess their reliability. This tutorial will guide beginners through the use of Simplot, a popular tool for virus recombination analysis.
Step 6.1: Sequence Alignment
- Group and align sequences. It is recommended to use unique names for different reference sequence groups to facilitate automatic grouping by Simplot.
- Save the alignment as a
.fasta
file.
mafft --auto extracted_sequences.fasta > extracted_sequences_aln.fasta
Step 6.2: Running Simplot
- Launch Simplot and open SeqPage. Click on "File" -> "Open" and select the saved
.fasta
file. All sequences will be selected by default. - Change the way groups are identified 3
- Adjusting parameters :Window Size ,Step Size,Vertical Axis
- Switch to the "Simplot" tab and click on "Command" -> "Query" and select Human_coronavirus_229E
- The "Start Scan" button in the bottom right corner will become highlighted.
Step 6.3: Result Output
Simplot can save plots directly as BMP images or metafiles, but the resolution may not meet journal requirements. Therefore, it is recommended to export the data and plot it in Excel or other software.
- Click on "File" -> "Save Chart Values as CSV" to export the data as a CSV file.
- Open the CSV file in Excel, select the data, and insert a line chart to further beautify the plot.