VirusIdentificationAndQualityAssessment - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
In today's workshop, you will learn how to identify RNA virus from assembled contigs from metatranscriptomic data. This will be done using HMM-based profile search on predicted ORFs against public RdRp profile database. You will then locate the corresponding viral nucleic acid sequences that contain RdRp. Based on VirusHostDB, you will perform a homologous search to do primitive classification of those viral sequences, and finally, we will learn to assess the quality and completeness of the identified viral genomes using CheckV.
- Learn how to find hallmark gene (RdRP) and the corespondent contigs of RNA virus using HMMER
- Learn how to do some primitive classification by homologous search using DIAMOND
- Learn how to assess the quality of viral genome using CheckV
- HMMER v3.3 or later
- CheckV v1.0.1 or later
- DIAMOND v2.1.8
- RdRp-scan
- CheckV-db
- VirusHost database
-
Using conda or mamba (recommended):
CheckV has some scrict prerequisites of dependent softwares or tools, so we create a new environment for it
conda create -n checkv -c conda-forge -c bioconda checkv=1.0.3 conda activate checkv
-
[SKIP TODAY] Using conda or mamba (Skipped today because the checkv will install HMMER at the same time)
conda install -c conda-forge -c bioconda hmmer
- seqkit for sequence manipulation
- EMBOSS for run getorf
conda install -c bioconda -c conda-forge seqkit emboss
[ Downloaded, SKIP TODAY ] If you install using conda
or pip
you will need to download the database:
checkv download_database ./
CheckV database do not need extra configuration, but you can use environmental variable to specify the path to the database, or just write it into your .bashrc
file
export CHECKVDB=/home/renzirui/database/checkv-db-v1.5
Or you can explicitly specify the path to the database using the parameter -d
when executing the checkV
-
RdRp-scan has archived in the GitHub so we can just simply clone it
git clone https://github.com/JustineCharon/RdRp-scan.git
-
Copy the RdRp profile from the repo directory to your own directory
mkdir RdRp_profiles cp RdRp-scan/Profile_db_and_alignments/RdRp_HMM_profile_CLUSTALO.db.h3* RdRp_profiles/
-
Check if the copy is successful using
ls
to list your directory, below represent a success copy$ ls RdRp_profiles/ RdRp_HMM_profile_CLUSTALO.db.h3f RdRp_HMM_profile_CLUSTALO.db.h3i RdRp_HMM_profile_CLUSTALO.db.h3m RdRp_HMM_profile_CLUSTALO.db.h3p
-
Open the Index of /ftp/db/virushostdb (genome.jp) to find the file you need
-
[ Downloaded, SKIP TODAY ] Download it on the server using
curl
,wget
oraxel
below provide an example of wgetwget https://www.genome.jp/ftp/db/virushostdb/virushostdb.formatted.cds.faa.gz
-
Using diamond makedb to build the index
--in
path to input fasta file--db
output db name-t
specify the number of threads for makedbdiamond makedb --in /home/renzirui/database/VirusHostDB/virushostdb.formatted.cds.faa.gz --db virushostdb.formatted.cds.dmnd -t 8
You can first copy the demo dataset to your own directory
cp /home/renzirui/workshop_virusidentify/dataset/workshop_assembled_demo.fna <DESTINATION_DIRECTORY>
seqkit seq -m 1000 workshop_assembled_demo.fna > workshop_assembled_demo_gt1k.fna
getorf -sequence workshop_assembled_demo_gt1k.fna -outseq workshop_assembled_demo_gt1k.orfs.faa -find 0 -table 1 -minsize 600
hmmsearch --cpu 20 --noali -o /dev/null --tblout hmmsearch.tblout -E 1e-5 /home/renzirui/database/RdRp_profiles/RdRp_HMM_profile_CLUSTALO.db workshop_assembled_demo_gt1k.orfs.faa
cat hmmsearch.tblout | grep -v '#' | awk '{print $1}' > significant_hitID.list
cat significant_hitID.list | perl -ne '@a=split/_/; $out=join('_',@a[0,$#a-1]); print "$out\n"' > significant_hitID_contigid.list
seqkit grep -f significant_hitID_contigid.list workshop_assembled_demo_gt1k.fna > hitRdRp_contigs.fna
diamond blastx -o blastx_vhdbcds_results.txt -d <PATH_TO_YOUR_VHDB_DIAMOND_INDEX> -q hitRdRp_contigs.fna --threads 8 --sensitive --max-target-seqs 1 --evalue 1E-5 --block-size 2.0 --index-chunks 1 --outfmt 6 qseqid qlen sseqid slen qstart qend sstart send evalue bitscore length pident mismatch gaps stitle qcovhsp scovhsp
cat blastx_vhdbcds_results.txt | grep Vertebrata | awk '{print $1}' > vertebrate_contigID.list
seqkit grep -f vertebrate_contigID.list hitRdRp_contigs.fna > vertebrate_assoc_viruses.fna
checkv end_to_end hitRdRp_contigs.fna output_directory -t 16