Home - gabrielvpina/viralquest GitHub Wiki

Welcome to the ViralQuest wiki!

ViralQuest is a pipeline for viral identification and characterization in DNA and RNA samples. It takes a FASTA file as input (not FASTQ) and reports the information it finds about the viral sequences in that file.

Basic steps in ViralQuest

1. Select viral sequences

ViralQuest is designed to search efficiently for viral sequences in a file. It is possible to take the raw FASTA file of an assembled sample and align all of its sequences at once against a large database, but that costs a lot of time, energy and computing power. Instead, ViralQuest first selects the sequences that are possibly viral, which reduces run time and makes the process more efficient. These strategies are based on:

Sequence alignment - Sequences can be aligned against a small database of viral proteins, such as the RefSeq Viral Release, using the BLASTx algorithm to select every sequence with some similarity to viral proteins;
Hidden Markov Models - Profiles built from clusters of viral proteins that describe conserved viral regions. These pre-built models (RVDB, VFAM and eggNOG Viral OGs) can be compared with amino acid data using HMMER to detect similarity.

In summary, ViralQuest aligns the sequences against a small database of characterized viral proteins, predicts the possible Open Reading Frames (ORFs) of all sequences, and compares them with HMM models of conserved viral regions. Combining these two results makes the search for viral elements in a FASTA file more efficient.
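The selection step can be pictured as merging two sets of sequence IDs. The sketch below is not ViralQuest's own code: it assumes BLASTx results in the standard 12-column tabular format (`-outfmt 6`), HMMER results in `hmmsearch --tblout` format, a hypothetical ORF naming scheme `<contig>_ORF<n>`, an illustrative E-value cutoff of 1e-5, and that the two results are combined as a union.

```python
def blastx_hits(lines, max_evalue=1e-5):
    """Return query IDs with at least one BLASTx hit at or below max_evalue.

    Assumes BLAST tabular output (-outfmt 6, default 12 columns):
    column 1 is the query ID and column 11 is the E-value.
    """
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def hmm_hits(lines, max_evalue=1e-5, orf_sep="_ORF"):
    """Return contig IDs whose predicted ORFs matched a viral HMM.

    Assumes hmmsearch --tblout rows: column 1 is the ORF name and
    column 5 is the full-sequence E-value. The "<contig>_ORF<n>" naming
    scheme is hypothetical and only used to map an ORF back to its contig.
    """
    hits = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if float(fields[4]) <= max_evalue:
            hits.add(fields[0].rsplit(orf_sep, 1)[0])
    return hits


def candidate_viral_ids(blast_lines, hmm_lines):
    """Union of both evidence sources (an assumption about how they combine)."""
    return blastx_hits(blast_lines) | hmm_hits(hmm_lines)


# Made-up rows, for illustration only.
blast_rows = [
    "c1\tYP_1\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-30\t200",
    "c2\tYP_2\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t30",
]
hmm_rows = [
    "# target name  accession  query name  accession  E-value ...",
    "c3_ORF1 - VFAM_1 - 2e-12 45.0 0.1 3e-12 44.0 0.1 1.0 1 0 0 1 1 1 1 -",
]
print(sorted(candidate_viral_ids(blast_rows, hmm_rows)))
```

Here `c2` is dropped because its only hit has a weak E-value, while `c3` is kept on HMM evidence alone.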

2. Global Alignment

After the possible viral sequences are selected, a global alignment against large nucleotide (nt/core_nt) and amino acid (nr/refseq_protein) databases removes false positives from the first search and gives a better characterization of the viral sequences. Because non-viral sequences were already removed in step 1, this process is considerably faster than a normal global alignment of all raw sequences.
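The speed-up comes from sending only the step 1 candidates to the large databases. As a minimal sketch of that idea (not ViralQuest's implementation), the hypothetical helper `filter_fasta` below keeps only the FASTA records whose ID is in a given set, so that the reduced file can be used as the query for the slower search.

```python
def filter_fasta(lines, keep_ids):
    """Yield the FASTA lines of records whose ID is in keep_ids.

    The record ID is taken as the header text between '>' and the
    first whitespace character.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in keep_ids
        if keep:
            yield line


# Made-up records, for illustration only.
fasta = [">c1 len=6\n", "ACGTAC\n", ">c2\n", "GGGG\n", ">c3\n", "TTTT\n"]
subset = list(filter_fasta(fasta, {"c1", "c3"}))
print("".join(subset), end="")
```

With real data, an open file handle can be passed as `lines`, since iterating over a text file yields its lines one by one.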

3. Taxonomic Characterization

The BLASTx results are collected, and the species of each match is looked up in the following databases:

NCBI Taxonomy;
ICTV Master Species List;

This search may return the Phylum, Class, Order, Family, Genus, Species and scientific name of the matched (subject) viral sequence.
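The lookup itself amounts to mapping a species name to its ranks. The sketch below assumes the taxonomy has been exported to a CSV table with one column per rank; the column names, the helper names and the case-insensitive matching are assumptions for illustration, not the layout ViralQuest actually reads.

```python
import csv
import io

RANKS = ("Phylum", "Class", "Order", "Family", "Genus", "Species")


def load_lineages(csv_text):
    """Index lineage rows by lower-cased species name.

    Assumes a CSV with one column per rank in RANKS (hypothetical layout).
    Empty cells are stored as None.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["Species"].strip().lower(): {rank: (row.get(rank) or None) for rank in RANKS}
        for row in reader
    }


def lookup_lineage(species, table):
    """Return the rank dictionary for a species, or None if it is unknown."""
    return table.get(species.strip().lower())


# One illustrative row; check names against the taxonomy release in use.
example_csv = (
    "Phylum,Class,Order,Family,Genus,Species\n"
    "Kitrinoviricota,Alsuviricetes,Martellivirales,Virgaviridae,"
    "Tobamovirus,Tobacco mosaic virus\n"
)
table = load_lineages(example_csv)
print(lookup_lineage("tobacco mosaic virus", table))
print(lookup_lineage("Unknown virus", table))
```

Returning `None` for an unmatched name mirrors the "may return" wording above: not every BLASTx subject will be present in both taxonomies.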

4. Conserved Domains Analysis

To further characterize the viral sequences, an analysis of conserved regions is needed to support the results and to better characterize sequences from possibly novel viruses. The HMM models chosen for this task are those of Pfam and, as complementary information, a database of functional annotations for Pfam conserved regions is also used in this analysis.
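As an illustration of what the conserved-domain results look like once parsed, the sketch below groups Pfam hits per ORF. It assumes the domains were reported in HMMER's `hmmscan --domtblout` format (target = Pfam model, query = ORF); the function name and the 1e-3 cutoff on the independent E-value are illustrative choices, not ViralQuest's settings.

```python
from collections import defaultdict


def domains_by_orf(lines, max_ievalue=1e-3):
    """Group Pfam domain hits by ORF, ordered by position on the ORF.

    Assumes hmmscan --domtblout rows: column 1 = model name,
    column 2 = model accession, column 4 = query (ORF) name,
    column 13 = independent E-value, columns 18-19 = alignment
    start and end on the query sequence.
    """
    domains = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split()
        if float(f[12]) <= max_ievalue:
            domains[f[3]].append((f[0], f[1], int(f[17]), int(f[18])))
    for hits in domains.values():
        hits.sort(key=lambda d: d[2])
    return dict(domains)


# Made-up rows, for illustration only.
rows = [
    "# target name accession tlen query name accession qlen ...",
    "DomB PF00002.1 200 c1_ORF1 - 900 1e-20 70.0 0.1 1 1 1e-24 2e-20 69.0 0.1 5 190 500 690 495 695 0.95 example domain B",
    "DomA PF00001.1 450 c1_ORF1 - 900 1e-50 170.0 0.1 1 1 1e-54 2e-50 169.0 0.1 5 440 30 470 25 475 0.95 example domain A",
]
for orf, hits in domains_by_orf(rows).items():
    print(orf, hits)
```

The per-ORF domain list can then be joined with a functional annotation table keyed by Pfam accession.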