FAQs - PaleovirologyLab/hi-fever GitHub Wiki

Below are answers to some frequently asked questions about using HI-FEVER. If your question is not here, feel free to contact us.

How can I tell what is a "true" EVE versus a false positive?

This is a very challenging question, and there is still no perfect method for answering it. HI-FEVER addresses the problem by providing information from many sources to help users distinguish between EVEs and false positives. The best approach is still to combine evidence from multiple sources.

Firstly, the summary table provides a predicted classification based on criteria that draw upon the taxonomy and labels of reciprocal database hits. Secondly, there are a few rules you can apply to each candidate EVE to help distinguish true EVEs from false positives:

Look at the top hit from the reciprocal nr database. If the hit title contains a host protein name, it is very likely to be a false hit. Examples of host protein labels are ubiquitin, GTPase and helicase (usually enzymes).
Anything with a taxonomy of Bamfordvirae in either the nr or RVDB hits should be carefully scrutinised, because this group of viruses is notorious for having many cross-matches to cellular proteins. It can be easier to consider them all false hits unless they include viral hallmark genes (capsids, structural proteins etc.).
There will be a lot of cross-matches between endogenous retroviruses and retrotransposons, especially for the polymerase genes. There will also be cross-matches between endogenous retroviruses and co-opted cellular genes like syncytin. For the validation of retroviral EVEs, it is recommended to use additional tools like RepeatMasker, Censor GIRI searches or LTRharvest.
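The first two rules above can be applied in bulk as a rough first pass. The following is only a minimal sketch and not part of HI-FEVER: the function name, its arguments, the keyword list and the returned labels are all hypothetical, and you would need to map the arguments onto the corresponding fields of your own summary table. It does not replace manual inspection.

```python
# Hypothetical keyword list taken from the examples above; extend it for your data.
HOST_KEYWORDS = ("ubiquitin", "gtpase", "helicase")


def triage_hit(nr_title, nr_taxonomy, rvdb_taxonomy):
    """Return a coarse review flag for one candidate EVE.

    nr_title      -- title of the top reciprocal nr hit (assumed input)
    nr_taxonomy   -- taxonomy string of the top nr hit (assumed input)
    rvdb_taxonomy -- taxonomy string of the top RVDB hit (assumed input)
    """
    title = nr_title.lower()
    # Rule 1: a host protein name in the nr hit title suggests a false hit.
    if any(keyword in title for keyword in HOST_KEYWORDS):
        return "likely_false_positive"
    # Rule 2: Bamfordvirae in either lineage needs careful manual scrutiny.
    lineage = f"{nr_taxonomy};{rvdb_taxonomy}".lower()
    if "bamfordvirae" in lineage:
        return "scrutinise"
    # Everything else is still only a candidate awaiting validation.
    return "candidate"
```

A hit flagged as "candidate" here has only passed these simple text filters; retroviral hits in particular still need the additional tools mentioned above.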

The summary table classification aims to help with the initial steps of EVE validation. However, we recommend confirming each candidate EVE of interest through alignments, clustering or phylogenies.

Do I need a powerful computer to run HI-FEVER?

No! We have designed HI-FEVER to run on anything from a laptop to a computing cluster. The minimum requirements are ~4 GB of disk space for the MINI reciprocal databases, plus some temporary space for the genomes and intermediate files. On less powerful computers the workflow may take longer, but it should still finish within a few hours (based on 100,000 viral protein queries against 20 vertebrate genomes). If you have more computational power, you can consider using the full reciprocal databases, and runtimes should be shorter.