Workshop: Virus Identification & Viral genome quality assessment

Overview

In today's workshop, you will learn how to identify RNA viruses in contigs assembled from metatranscriptomic data. This is done with an HMM-based profile search of predicted ORFs against a public RNA-dependent RNA polymerase (RdRp) profile database. You will then locate the viral nucleotide sequences (contigs) that encode an RdRp. Using VirusHostDB, you will run a homology search to give those viral sequences a preliminary classification. Finally, you will assess the quality and completeness of the identified viral genomes using CheckV.

Objectives

Learn how to find the hallmark gene (RdRp) and the corresponding contigs of RNA viruses using HMMER
Learn how to do a preliminary classification by homology search using DIAMOND
Learn how to assess the quality of viral genomes using CheckV

Software and Databases

Software

HMMER v3.3 or later
CheckV v1.0.1 or later
DIAMOND v2.1.8

Databases

RdRp-scan
CheckV-db
VirusHost database

Steps

Part 0. Software installation & Database deployment

Installation of CheckV

Using conda or mamba (recommended):

CheckV has strict dependency requirements, so we create a dedicated environment for it:
```
conda create -n checkv -c conda-forge -c bioconda checkv=1.0.3
conda activate checkv
```

Installation of HMMER

[SKIP TODAY] Using conda or mamba (skipped today because installing CheckV also installs HMMER as a dependency)
```
conda install -c conda-forge -c bioconda hmmer
```

Installation of other software and tools for today's workshop

seqkit for sequence manipulation
EMBOSS for running getorf

conda install -c bioconda -c conda-forge seqkit emboss

Download the database of CheckV

[ Downloaded, SKIP TODAY ] If you installed CheckV using conda or pip, you need to download the database:

checkv download_database ./

Configure the database of CheckV

The CheckV database does not need extra configuration, but you can set an environment variable to point to the database. Add the line to your .bashrc file to make it permanent.

export CHECKVDB=/home/renzirui/database/checkv-db-v1.5

Alternatively, specify the path to the database explicitly with the -d parameter when running CheckV.
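The two options above can be combined in a wrapper script. The following is only a minimal sketch: the function name `resolve_checkv_db` is hypothetical and not part of CheckV. It shows one way to let an explicit path win and fall back to `CHECKVDB` otherwise.

```shell
# Hypothetical helper (not part of CheckV): choose the database directory.
# An explicit first argument takes priority; otherwise fall back to $CHECKVDB.
resolve_checkv_db() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    elif [ -n "${CHECKVDB:-}" ]; then
        printf '%s\n' "$CHECKVDB"
    else
        echo "error: set CHECKVDB or pass a database path" >&2
        return 1
    fi
}

# Example: print the path that would be handed to `checkv ... -d <path>`.
CHECKVDB=/home/renzirui/database/checkv-db-v1.5
resolve_checkv_db            # prints the value of CHECKVDB
resolve_checkv_db /tmp/mydb  # prints /tmp/mydb
```

The resolved path could then be passed on as `-d "$(resolve_checkv_db)"`.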

Download the RdRp HMM profiles of RdRp-scan

RdRp-scan is hosted on GitHub, so we can simply clone it
```
git clone https://github.com/JustineCharon/RdRp-scan.git
```

Copy the RdRp profile from the repo directory to your own directory

mkdir RdRp_profiles
cp RdRp-scan/Profile_db_and_alignments/RdRp_HMM_profile_CLUSTALO.db.h3* RdRp_profiles/

Check that the copy succeeded by listing the directory with ls. A successful copy looks like this:

$ ls RdRp_profiles/
RdRp_HMM_profile_CLUSTALO.db.h3f  RdRp_HMM_profile_CLUSTALO.db.h3i  RdRp_HMM_profile_CLUSTALO.db.h3m  RdRp_HMM_profile_CLUSTALO.db.h3p
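The same check can be scripted instead of read by eye. This is a small sketch under the assumption that the four pressed index files (`.h3f`, `.h3i`, `.h3m`, `.h3p`) share the prefix shown above; the function name `check_pressed_db` is made up for illustration.

```shell
# Hypothetical helper: report which of the four pressed index files
# are missing or empty for a given database prefix.
check_pressed_db() {
    pressed_prefix="$1"
    pressed_missing=0
    for pressed_ext in h3f h3i h3m h3p; do
        if [ ! -s "${pressed_prefix}.${pressed_ext}" ]; then
            echo "missing or empty: ${pressed_prefix}.${pressed_ext}" >&2
            pressed_missing=1
        fi
    done
    return "$pressed_missing"
}

if check_pressed_db RdRp_profiles/RdRp_HMM_profile_CLUSTALO.db; then
    echo "RdRp profile database looks complete"
else
    echo "RdRp profile database is incomplete"
fi
```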

Download the protein database from the VirusHost database

Open the "Index of /ftp/db/virushostdb" page on genome.jp to find the file you need.
[ Downloaded, SKIP TODAY ] Download it to the server using curl, wget, or axel. Below is an example using wget:
```
wget https://www.genome.jp/ftp/db/virushostdb/virushostdb.formatted.cds.faa.gz
```

Build the DIAMOND index for the protein database from VirusHost DB

Use diamond makedb to build the index

--in path to the input FASTA file

--db output database name

--threads number of CPU threads for makedb (short form -p; in DIAMOND, -t sets the temporary directory, not the thread count)

diamond makedb --in /home/renzirui/database/VirusHostDB/virushostdb.formatted.cds.faa.gz --db virushostdb.formatted.cds.dmnd --threads 8

Part 1. Identify putative viral genome from metatranscriptomic assemblies

You can first copy the demo dataset to your own directory

cp /home/renzirui/workshop_virusidentify/dataset/workshop_assembled_demo.fna <DESTINATION_DIRECTORY>

Step 1.0 Using seqkit to remove contigs shorter than 1,000 nt

seqkit seq -m 1000 workshop_assembled_demo.fna > workshop_assembled_demo_gt1k.fna
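To make explicit what the length filter does, here is an illustrative awk equivalent. It is a sketch, not a replacement for seqkit: the function name `filter_min_len` is hypothetical, and unlike seqkit it writes each kept sequence on a single line.

```shell
# Keep FASTA records whose sequence length is >= the minimum given as $1.
# Reads FASTA on stdin and writes the surviving records to stdout.
filter_min_len() {
    awk -v min="$1" '
        # Print the buffered record if it is long enough.
        function emit() { if (hdr != "" && length(seq) >= min) print hdr "\n" seq }
        /^>/ { emit(); hdr = $0; seq = ""; next }   # new record starts
             { seq = seq $0 }                       # join wrapped sequence lines
        END  { emit() }
    '
}

# Toy input: "short" has 4 nt, "long" has 10 nt spread over two lines.
printf '>short\nACGT\n>long\nACGTAC\nGTAC\n' | filter_min_len 5
```

With a threshold of 5, only the `long` record is printed.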

Step 1.1 Using getorf to find and translate all ORFs in the assembled contigs

getorf -sequence workshop_assembled_demo_gt1k.fna -outseq workshop_assembled_demo_gt1k.orfs.faa -find 0 -table 1 -minsize 600

Step 1.2 Using hmmsearch to search the translated ORFs against the RdRp profiles to find the marker protein

hmmsearch --cpu 20 --noali -o /dev/null --tblout hmmsearch.tblout -E 1e-5 /home/renzirui/database/RdRp_profiles/RdRp_HMM_profile_CLUSTALO.db workshop_assembled_demo_gt1k.orfs.faa

Step 1.3 Fetch the sequence IDs with significant hits to the RdRp profiles

grep -v '^#' hmmsearch.tblout | awk '{print $1}' | sort -u > significant_hitID.list
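If you want to tighten the threshold after the search, the table can be filtered again without rerunning hmmsearch. The sketch below assumes the standard `--tblout` layout, in which column 1 is the target (ORF) name and column 5 is the full-sequence E-value; the function name `tblout_hit_ids` and the demo rows are invented for illustration.

```shell
# Print unique target IDs whose full-sequence E-value (column 5)
# is at or below the cutoff given as $1. Reads a tblout table on stdin.
tblout_hit_ids() {
    awk -v max_e="$1" '!/^#/ && ($5 + 0) <= (max_e + 0) { print $1 }' | LC_ALL=C sort -u
}

# Invented demo rows, truncated to the first seven columns.
tblout_hit_ids 1e-5 <<'EOF'
# target name  accession  query name  accession  E-value  score  bias
k141_12_3  -  profileA  -  3.2e-40  140.1  0.2
k141_12_3  -  profileB  -  1.0e-12   50.3  0.1
k141_99_1  -  profileA  -  2.0e-03   12.0  0.0
k141_7_2   -  profileC  -  4.5e-08   35.7  0.4
EOF
```

An ORF that hits several profiles is listed once, and the weak hit `k141_99_1` is dropped.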

Step 1.4 Map the ORF sequence IDs back to their contig IDs

perl -ne 'chomp; @a = split /_/; print join("_", @a[0..$#a-1]), "\n"' significant_hitID.list | sort -u > significant_hitID_contigid.list
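The mapping relies on getorf naming each ORF as the contig ID plus an underscore and a counter. Under that assumption, stripping the final `_<number>` recovers the contig ID even when the contig ID itself contains underscores. A minimal sed-based sketch (the function name `orf_to_contig` is hypothetical):

```shell
# Strip the trailing "_<counter>" that getorf appends to each ORF ID,
# then collapse ORFs from the same contig into one line.
orf_to_contig() {
    sed 's/_[0-9][0-9]*$//' | LC_ALL=C sort -u
}

# Two ORFs from the same contig, plus one contig ID with several underscores.
printf '%s\n' k141_12_3 k141_12_8 NODE_1_length_1500_cov_5.0_2 | orf_to_contig
```

This prints two contig IDs: `NODE_1_length_1500_cov_5.0` and `k141_12`.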

Step 1.5 Extract the contigs that carry the hallmark gene (RdRp) from the assembly

seqkit grep -f significant_hitID_contigid.list workshop_assembled_demo_gt1k.fna > hitRdRp_contigs.fna

Step 1.6 Preliminary viral contig classification using the VirusHost database

diamond blastx -o blastx_vhdbcds_results.txt -d <PATH_TO_YOUR_VHDB_DIAMOND_INDEX> -q hitRdRp_contigs.fna --threads 8 --sensitive --max-target-seqs 1 --evalue 1E-5 --block-size 2.0 --index-chunks 1 --outfmt 6 qseqid qlen sseqid slen qstart qend sstart send evalue bitscore length pident mismatch gaps stitle qcovhsp scovhsp

Step 1.7 Fetch the viral sequences that match vertebrate-associated viruses

grep Vertebrata blastx_vhdbcds_results.txt | awk '{print $1}' | sort -u > vertebrate_contigID.list
seqkit grep -f vertebrate_contigID.list hitRdRp_contigs.fna > vertebrate_assoc_viruses.fna
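A stricter variant restricts the keyword match to the subject title instead of the whole line. The sketch below assumes the 17-column `--outfmt 6` layout requested in Step 1.6, where `stitle` is column 15 and fields are tab-separated. The function name `ids_by_title_keyword` is hypothetical, and the demo titles are invented and do not reproduce real VirusHostDB headers.

```shell
# Print unique query IDs whose subject title (column 15) contains the keyword $1.
# Reads tab-separated DIAMOND output on stdin.
ids_by_title_keyword() {
    awk -F '\t' -v kw="$1" 'index($15, kw) > 0 { print $1 }' | LC_ALL=C sort -u
}

# Build two invented 17-column rows; only qseqid, sseqid and stitle vary.
row='%s\t1500\t%s\t900\t1\t1500\t1\t500\t1e-50\t300\t500\t60.0\t200\t0\t%s\t99.0\t55.0\n'
{
    printf "$row" k141_12 protA 'polymerase|Some virus|Homo sapiens|Metazoa; Vertebrata'
    printf "$row" k141_7  protB 'polymerase|Other virus|Zea mays|Viridiplantae'
} | ids_by_title_keyword Vertebrata
```

Splitting on tabs matters here because the title contains spaces.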

Part 2. Quality assessment of putative viral genomes

Step 2.1 Run CheckV

checkv end_to_end hitRdRp_contigs.fna output_directory -t 16
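CheckV writes its per-contig results to `quality_summary.tsv` in the output directory. As a quick first look, the sketch below tallies contigs per quality tier. It assumes the table has a header row containing a `checkv_quality` column, which it locates by name; the function name `count_by_quality` and the demo table are invented for illustration.

```shell
# Tally contigs per CheckV quality tier from a quality_summary.tsv file ($1).
# The column is found by its header name rather than by a fixed position.
count_by_quality() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "checkv_quality") col = i; next }
        col > 0 { n[$col]++ }
        END     { for (q in n) print q "\t" n[q] }
    ' "$1" | LC_ALL=C sort
}

# Invented demo table with only four of the real columns.
demo=$(mktemp)
printf 'contig_id\tcontig_length\tcheckv_quality\tcompleteness\n' > "$demo"
printf 'k141_12\t9800\tHigh-quality\t95.2\n' >> "$demo"
printf 'k141_7\t2100\tLow-quality\t21.0\n' >> "$demo"
printf 'k141_30\t1500\tLow-quality\t14.8\n' >> "$demo"
count_by_quality "$demo"
rm -f "$demo"
```

On the real output this would be called as `count_by_quality output_directory/quality_summary.tsv`.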

VirusIdentificationAndQualityAssessment - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
