VirusIdentificationAndQualityAssessment - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Workshop: Virus Identification & Viral genome quality assessment

Overview

In today's workshop, you will learn how to identify RNA virus from assembled contigs from metatranscriptomic data. This will be done using HMM-based profile search on predicted ORFs against public RdRp profile database. You will then locate the corresponding viral nucleic acid sequences that contain RdRp. Based on VirusHostDB, you will perform a homologous search to do primitive classification of those viral sequences, and finally, we will learn to assess the quality and completeness of the identified viral genomes using CheckV.

Objectives

  • Learn how to find hallmark gene (RdRP) and the corespondent contigs of RNA virus using HMMER
  • Learn how to do some primitive classification by homologous search using DIAMOND
  • Learn how to assess the quality of viral genome using CheckV

Softwares and Databases

Softwares

  • HMMER v3.3 or later
  • CheckV v1.0.1 or later
  • DIAMOND v2.1.8

Databases

  • RdRp-scan
  • CheckV-db
  • VirusHost database

Steps

Part 0. Software installation & Database deployment

Installation of CheckV

  • Using conda or mamba (recommended):

    CheckV has some scrict prerequisites of dependent softwares or tools, so we create a new environment for it

    conda create -n checkv -c conda-forge -c bioconda checkv=1.0.3
    conda activate checkv

Installation of HMMER

  • [SKIP TODAY] Using conda or mamba (Skipped today because the checkv will install HMMER at the same time)

    conda install -c conda-forge -c bioconda hmmer

Installation of other softwares or tools for today's workshop

  • seqkit for sequence manipulation
  • EMBOSS for run getorf
conda install -c bioconda -c conda-forge seqkit emboss

Download the database of CheckV

[ Downloaded, SKIP TODAY ] If you install using conda or pip you will need to download the database:

checkv download_database ./

Configure the database of CheckV

CheckV database do not need extra configuration, but you can use environmental variable to specify the path to the database, or just write it into your .bashrc file

export CHECKVDB=/home/renzirui/database/checkv-db-v1.5

Or you can explicitly specify the path to the database using the parameter -d when executing the checkV

Download the RdRP HMM profile of RdRp-scan

  • RdRp-scan has archived in the GitHub so we can just simply clone it

    git clone https://github.com/JustineCharon/RdRp-scan.git
  • Copy the RdRp profile from the repo directory to your own directory

    mkdir RdRp_profiles
    cp RdRp-scan/Profile_db_and_alignments/RdRp_HMM_profile_CLUSTALO.db.h3* RdRp_profiles/
  • Check if the copy is successful using ls to list your directory, below represent a success copy

    $ ls RdRp_profiles/
    RdRp_HMM_profile_CLUSTALO.db.h3f  RdRp_HMM_profile_CLUSTALO.db.h3i  RdRp_HMM_profile_CLUSTALO.db.h3m  RdRp_HMM_profile_CLUSTALO.db.h3p
    

Download the protein database from the VirusHost database

  • Open the Index of /ftp/db/virushostdb (genome.jp) to find the file you need

  • [ Downloaded, SKIP TODAY ] Download it on the server using curl ,wget or axel below provide an example of wget

    wget https://www.genome.jp/ftp/db/virushostdb/virushostdb.formatted.cds.faa.gz

Build the diamond blast index for the protein databse from VirusHost DB

  • Using diamond makedb to build the index

    --in path to input fasta file

    --db output db name

    -t specify the number of threads for makedb

    diamond makedb --in /home/renzirui/database/VirusHostDB/virushostdb.formatted.cds.faa.gz --db virushostdb.formatted.cds.dmnd -t 8

Part 1. Identify putative viral genome from metatranscriptomic assemblies

You can first copy the demo dataset to your own directory

cp /home/renzirui/workshop_virusidentify/dataset/workshop_assembled_demo.fna <DESTINATION_DIRECTORY>

Step 1.1 Using seqkit to filter contigs with insufficient length

seqkit seq -m 1000 workshop_assembled_demo.fna > workshop_assembled_demo_gt1k.fna

Step 1.1 Using getORF to find all ORF translates from all assembled contigs

getorf -sequence workshop_assembled_demo_gt1k.fna -outseq workshop_assembled_demo_gt1k.orfs.faa -find 0 -table 1 -minsize 600

Step 1.2 Using HMMsearch to search those ORF translates onto RdRp profiles to find markerprotein

hmmsearch --cpu 20 --noali -o /dev/null --tblout hmmsearch.tblout -E 1e-5 /home/renzirui/database/RdRp_profiles/RdRp_HMM_profile_CLUSTALO.db workshop_assembled_demo_gt1k.orfs.faa

Step 1.3 Fetch the sequence ID with those significant hit to RdRp profiles

cat hmmsearch.tblout | grep -v '#' | awk '{print $1}' > significant_hitID.list

Step 1.4 Mapping those ORF sequence ID into its contig nucleotide ID

cat significant_hitID.list | perl -ne '@a=split/_/; $out=join('_',@a[0,$#a-1]); print "$out\n"' > significant_hitID_contigid.list

Step 1.5 Grep those contigs with hallmark genes (RdRp) from all assemblies

seqkit grep -f significant_hitID_contigid.list workshop_assembled_demo_gt1k.fna > hitRdRp_contigs.fna

Step 1.6 Primitive viral contig classification using VirusHost Database

diamond blastx -o blastx_vhdbcds_results.txt -d <PATH_TO_YOUR_VHDB_DIAMOND_INDEX> -q hitRdRp_contigs.fna --threads 8 --sensitive --max-target-seqs 1 --evalue 1E-5 --block-size 2.0 --index-chunks 1 --outfmt 6 qseqid qlen sseqid slen qstart qend sstart send evalue bitscore length pident mismatch gaps stitle qcovhsp scovhsp

Step 1.7 Fetch the viral sequence matched to the Vertebrate-associated viruses

cat blastx_vhdbcds_results.txt | grep Vertebrata | awk '{print $1}' > vertebrate_contigID.list
seqkit grep -f vertebrate_contigID.list hitRdRp_contigs.fna > vertebrate_assoc_viruses.fna

Part 2. Quality assessment of putative viral genomes

Step 2.1 run checkv

checkv end_to_end hitRdRp_contigs.fna output_directory -t 16
⚠️ **GitHub.com Fallback** ⚠️