Usage - PaleovirologyLab/hi-fever GitHub Wiki

Inputs

  • Path to data directory (--data_path): Path to the directory with your input files.
  • Query proteins (--query_file_aa): The viral proteins that you want to search for in the host genomes. You can provide it as a .fasta file or diamond database (.dmnd) built from your query proteins. The workflow will automatically detect the type of file you provide.
  • Host genome(s) (--ftp_file): The genomes you would like to screen. Provide a custom ftp file in plain text (example at data/ftp_list.txt)
  • E-mail (--email): users need to provide an email in that is used to fetch the NCBI-database with Efetch in order to get taxonomical information about the protein hits and the host genomes.
  • Proteins for reciprocal search (--reciprocal_nr_db and --reciprocal_rvdb_db): HI-FEVER performs a second DIAMOND search to expand on the annotation of the EVE candidates. By default, HI-FEVER runs reciprocal searches against two databases that should be provided at data/MINI-nr_rep_seq-clustered_70id_80c_wtaxa.dmnd and data/MINI_rvdbv28_wtaxa.dmnd (download them from OSF).
  • Output directory (--outdir): Provide a name for folder where results will be stored. Default to hi-fever/output

Outputs

HI-FEVER creates 2 folders in the --outputdir:

Sequences (fastas folder):

  • loci-context-coordinates.fasta.gz nucleotide sequences of the candidate EVEs including the genomic context (flanking regions etc.).
  • loci-merged-coordinates.fasta.gz nucleotide sequences of the candidate EVEs.

Note: Sequences in this folder are candidate EVEs. We recommend a further filtering step for these sequences since many EVE candidates are likely to be cross-matches to host proteins. You can use the annotation results in the SQL folder to identify the locus-id of interesting candidates and observe their top hits to determine if they are true EVEs or have homology to host proteins.

Tables and metadata (sql folder):

  • assembly_metadata.tsv information about the genome assemblies provided in the ftp file including taxonomy, submitter, assembly level etc.
  • assembly_stats.tsv statistics about the genome assemblies provided in the ftp file including size and coverage
  • genewise.tsv predicted reconstructed sequences of each EVE candidate
  • locus_assembly_map.tsv file mapping the genomic locus ID to the assembly from which it came
  • matches.dmnd.annot.tsv results of the initial forward DIAMOND search of query proteins against genome assemblies
  • predicted_ORFs.tsv predicted ORFs within the region of each candidate EVE
  • reciprocal-*-matches.dmnd.tsv results of the reciprocal DIAMOND searches of candidate EVEs against reciprocal databases. There will be one result file for each reciprocal database search.
  • taxonomy_table.tsv taxonomy data for every genome assembly and reciprocal hit from the HI-FEVER run

Parameters

A key advantage of HI-FEVER is the flexibility provided by parameter customisation. Example combinations of parameters for different research scopes and scenarios can be found in example commands.

Inputs and outputs

  • --query_file_aa data/circoviridae.fa: Custom protein query file in fasta format (default: protein_query.fasta)
  • --ftp_file data/assemblies.txt: Custom ftp file in plain text (default: ftp_list.txt)
  • --email [email protected]: Email address required for efetching taxonomy data from NCBI (REQUIRED)
  • --reciprocal_db file custom.dmnd: File with proteins or database used during the reciprocal search (default: minimal_reciprocal.dmnd)
  • --outdir herpesviridae_vs_tarsier: Name of output directory (default: hifever_output)
  • -with-report report.html: Create Nextflow html workflow report (includes run time, user information, task metadata, and CPU, memory, and I/O usage)

Query clustering

  • --mmseqs_minseqid 0.70: Sequence identity threshold for clustering of the protein query (default: 0.95 = 95%). Input protein queries are clustered prior to the search to avoid redundant searches of highly identical proteins
  • --mmseqs_cover 0.80: Minimum percentage of cluster member sequence length that must overlap with the representative sequence (default: 0.90 = 90%)

Diamond search settings

  • --diamond_forks 8: DIAMOND fork count. By default Nextflow attempts to run all available forward DIAMOND tasks in parallel (one for each currently downloaded assembly), which can lead to overuse of memory resources and job termination. On local machines and clusters, it is therefore suggested to limit the number of parallel DIAMOND tasks (i.e., "forks") allowed at once. For setting the value, we recommend total cores / 12. Note that this does not affect the reciprocal DIAMOND search, which uses all available CPUs in a single process (default: 4 for cluster workflow)
  • --diamond_mode ultra-sensitive: DIAMOND sensitivity (default: very-sensitive)
  • --diamond_matrix BLOSUM45: DIAMOND substitution matrix (default: BLOSUM62)
  • chunk_size 10000: Size of host genome to process per DIAMOND forward search (default: 50000). Reducing this number will return more hits but take longer

EVE criteria settings

  • --interval 500: Maximum interval length allowed between features in the context FASTA (default: 1000). This parameter modifies the allowed distance between DIAMOND hits before they are merged and counted as one contiguous hit
  • --flank 5000: Maximum length of (available) flanking sequence to add upstream and downstream of detected features in the context FASTA (default: 3000). This parameter allows you to modify how much host sequence upstream and downstream of the DIAMOND hits you would like to extract and keep for downstream analysis.
  • --orf_size_nt 300: Minimum length of extracted ORFs, in nucleotides (default: 150). This parameter allows you to alter the size threshold used when searching for ORFs within EVE hits

Genewise settings

  • --stop_task soft-mask: Modify in-frame STOP codons in the Genewise coding DNA sequence output (default: remove). Options: remove [delete in-frame STOPs from the coding sequence], soft-mask [convert in-frame STOPs to lowercase
  • --pairs_per_task 100: Subset ($n$) of total loci count ($N$) processed by each Genewise task. To ensure equivalent workload across tasks and a high level of parallelisation, Genewise operations are split into $T$ tasks, where $T = N/n$ (default: 500)
  • --genewise_matrix BLOSUM45: Genewise substitution matrix (default: BLOSUM62). Options: BLOSUM80, BLOSUM62, BLOSUM45, BLOSUM30

Example inputs

HI-FEVER searches host genomes for matches against protein queries. As such, it requires a file providing the ftp list of genomes to be screened and a fasta file of protein queries. Examples of these files are below:

Query proteins

A FASTA file containing protein sequences to search for (default 'protein_query.fasta') e.g:

>ADI48253.1 putative Rep [Circoviridae TM-6c]
MQSVNWCFTLNNYTNEDVNKLKQVKCRYICLGFEVGDKKQTPHIQGFIQFEKKVRLSVWKKINKKIHAEI
MKGTIEQAINYCKKSGTFEERGEIIKMGERRDLKEAKKKCAEVGLRAITDCESTYNLQVIRNCQIMLEYH
EKERDFKPEVIWIYGESGAGKTKYISEKCAEVDTYWKDATKWWNGYDRHEITVMDDFRASNMKMNELLKL
IDRYPHRVEIKGGFRQMLSKKIYISSIMHPKDVYNLPEEPVKQLLRRIDTIIKI

>NP_042987.1 Rep [Human betaherpesvirus 6A]
MFSIINPSDDFWTKDKYIMLTIKGPMEWEAEIPGISTDFFCKFSNVSVPHFRDMHSPGAPDIKWITACTK
MIDVILNYWNNKTAVPTPAKWYAQAENKAGRPSLILLIALDGIPSATIGKHTTEIRGVLIKDFFDGNAPK
IDDWCTYAKTKKNGGGTQVFSLSYIPFALLQIIRPQFQWAWTNINELGDVCDEIHRKHIISHFNKKPNVK
LMLFPKDGINGISLKSKFLGTIEWLSDLGIVTEDAWIRRDIRSYMQLLTLTHGDVLIHRALSIAKKRIRA
TRKAIDFIAHIDTDFQIYENPVYQLFCLQSFDPILAGTILYQWLSHRGGKKNTVSFIGPPGCGKSMLTGA
ILENIPLHGILHGSLNTKNLRAYGQVLVLWWKDISINFDNFNIIKSLLGGQKIIFPINENDHVQIGPCPI
IATSCVDIRSMVHSNLHKINLSQRVYNFTFDKVIPRNFPVIQKDDINQFLFWARNRSINCFIDYTVPKIL

Host genomes

The host genomes file (default: ftp_list.txt) should be a list of links to genome assemblies to screen. Links to assemblies available from NCBI can be found on the RefSeq and GenBank ftp sites or in the ftp tab on the NCBI genome page.

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/330/505/GCA_000330505.1_EIA2_v2
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/065/795/GCA_023065795.1_ASM2306579v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/208/925/GCF_000208925.1_JCVI_ESG2_1.0

Preparing a reciprocal database

An essential step in hi-fever is the reciprocal DIAMOND search. There are three options of input databases that you can use for this step based on your research question and your computational resources:

  1. Minimal reciprocal database (recommended for most users, requires ~3GB of storage space)

By default, hi-fever includes a reciprocal database (hi-fever/data/minimal_reciprocal.dmnd) with the most frequent and informative hits you can get while searching for EVEs in vertebrate genomes. We provide it as an alternative to the full NCBI non reduntand protein database (nr) and Reference Virus Database (RVDB). which normally require >100GB of storage space. Whilst we have made all efforts to preserve a diversity of informative proteins in the minimal reciprocal database, there is a chance that the closest EVE match will not be represented.

  1. Full reciprocal databases (recommended for large workstations and cluster computers, requires ~115 GB of storage space)

Download the latest version of both the NCBI nr database and RVDB proteins database for the most accurate EVE annotations using:

cd data
# NCBI nr database
wget https://files.osf.io/v1/resources/tejwd/providers/googledrive/nr_rep_seq.fasta.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
gunzip prot.accession2taxid.FULL.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
diamond makedb --in nr_rep_seq.fasta.gz -d nr_clustered_wtaxa --taxonmap prot.accession2taxid.FULL --taxonnodes nodes.dmp --taxonnames names.dmp --threads 16

# RVDB
wget https://rvdb-prot.pasteur.fr/files/U-RVDBv26.0-prot.fasta.xz
unxz U-RVDBv26.0-prot.fasta.xz
mmseqs createdb U-RVDBv26.0-prot.fasta DB
mmseqs linclust --min-seq-id 0.98 --cov-mode 1 -c 0.90 DB DB_clu tmp
mmseqs createsubdb DB_clu DB DB_clu_rep
mmseqs convert2fasta DB_clu_rep DB_clu_rep.fasta
mv DB_clu_rep.fasta U-RVDBv26.0-prot-clustered-minid0.98-cov0.90.fasta
awk '{if($0~">") {split($0,a,"|"); print ">"a[3],substr($0,2,length($0))} else print $0}' U-RVDBv26.0-prot-clustered-minid0.98-cov0.90.fasta > U-RVDBv26.0-prot-clustered-minid0.98-cov0.90-relabelled.fasta
diamond makedb --in U-RVDBv26.0-prot-clustered-minid0.98-cov0.90-relabelled.fasta -d rvdbv26_clustered_wtaxa.dmnd --taxonmap prot.accession2taxid.FULL --taxonnodes nodes.dmp --taxonnames names.dmp

Note: A preclustered of nr version is made available by Arcadia Science. RVDB was developed by Arifa Khan's group at CBER. For hi-fever, RVDB-prot is required, a protein version maintained at the Institut Pasteur with archives available here. The commands above offer a template for downloading and formatting these databases, though it is recommended to edit them to install the most recent database releases. These commands require the diamond and mmseqs, both of which are provided in the conda environment distributed with HI-FEVER.

  1. Custom reciprocal database (recomended only for users with previous knowledge about their EVEs): hi-feverbuilts a reciprocal database a user-provided fasta file or a prebuilt diamond database (specified under the --reciprocal_db parameter). This is not recommended unless you already have a lot of information about the EVEs that you are searching for, as it will not easily distinguish between true EVEs and host cross-matches.

Example commands

Some example sets of parameters are shown below as a guide to how to customise your HI-FEVER run.

A default run

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results

Using the full NCBI and nr databases for the reciprocal DIAMOND search. Recommended where possible but is resource intensive and will take considerably longer.

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --full_reciprocal TRUE

Providing a custom database (in fasta format) for the reciprocal DIAMOND search. Suitable if you have a shortlist of proteins to confirm your EVEs, however may lose information on host cross-matches.

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --reciprocal_db reciprocal_proteins.fasta

Providing a custom database (in dmnd format) for the reciprocal DIAMOND search. Suitable if you have a shortlist of proteins to confirm your EVEs and have prebuilt a DIAMOND database from them.

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --dont_build_reciprocal TRUE --reciprocal_db reciprocal_proteins.fasta

Clustering the query proteins at low identity prior to DIAMOND searches. Suitable if you have many similar query proteins and are looking for more general matches.

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --mmseqs_minseqid 0.70 mmseqs_cover 0.60

Customising the diamond search modes. Suitable if you want to run a less computationally-intensive search.

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --diamond-forks 1 --diamond_mode sensitive

Returning longer candidate EVEs by merging more distant neigboring hits and returning longer flanking sequences. Suitable when searching for multiple protein integrations from one virus eg. proviruses.

nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --interval 3000 --flank 5000