Usage - PaleovirologyLab/hi-fever GitHub Wiki
Inputs
- Path to data directory (
--data_path
): Path to the directory with your input files. - Query proteins (
--query_file_aa
): The viral proteins that you want to search for in the host genomes. You can provide it as a.fasta
file or diamond database (.dmnd
) built from your query proteins. The workflow will automatically detect the type of file you provide. - Host genome(s) (
--ftp_file
): The genomes you would like to screen. Provide a customftp
file in plain text (example atdata/ftp_list.txt
) - E-mail (
--email
): users need to provide an email in that is used to fetch the NCBI-database withEfetch
in order to get taxonomical information about the protein hits and the host genomes. - Proteins for reciprocal search (
--reciprocal_nr_db
and--reciprocal_rvdb_db
): HI-FEVER performs a second DIAMOND search to expand on the annotation of the EVE candidates. By default, HI-FEVER runs reciprocal searches against two databases that should be provided atdata/MINI-nr_rep_seq-clustered_70id_80c_wtaxa.dmnd
anddata/MINI_rvdbv28_wtaxa.dmnd
(download them from OSF). - Output directory (
--outdir
): Provide a name for folder where results will be stored. Default tohi-fever/output
Outputs
HI-FEVER creates 2 folders in the --outputdir
:
Sequences (fastas
folder):
loci-context-coordinates.fasta.gz
nucleotide sequences of the candidate EVEs including the genomic context (flanking regions etc.).loci-merged-coordinates.fasta.gz
nucleotide sequences of the candidate EVEs.
Note: Sequences in this folder are candidate EVEs. We recommend a further filtering step for these sequences since many EVE candidates are likely to be cross-matches to host proteins. You can use the annotation results in the SQL folder to identify the
locus-id
of interesting candidates and observe their top hits to determine if they are true EVEs or have homology to host proteins.
Tables and metadata (sql
folder):
assembly_metadata.tsv
information about the genome assemblies provided in the ftp file including taxonomy, submitter, assembly level etc.assembly_stats.tsv
statistics about the genome assemblies provided in the ftp file including size and coveragegenewise.tsv
predicted reconstructed sequences of each EVE candidatelocus_assembly_map.tsv
file mapping the genomic locus ID to the assembly from which it camematches.dmnd.annot.tsv
results of the initial forward DIAMOND search of query proteins against genome assembliespredicted_ORFs.tsv
predicted ORFs within the region of each candidate EVEreciprocal-*-matches.dmnd.tsv
results of the reciprocal DIAMOND searches of candidate EVEs against reciprocal databases. There will be one result file for each reciprocal database search.taxonomy_table.tsv
taxonomy data for every genome assembly and reciprocal hit from the HI-FEVER run
Parameters
A key advantage of HI-FEVER is the flexibility provided by parameter customisation. Example combinations of parameters for different research scopes and scenarios can be found in example commands.
Inputs and outputs
--query_file_aa data/circoviridae.fa
: Custom protein query file in fasta format (default: protein_query.fasta)--ftp_file data/assemblies.txt
: Custom ftp file in plain text (default: ftp_list.txt)--email [email protected]
: Email address required for efetching taxonomy data from NCBI (REQUIRED)--reciprocal_db file custom.dmnd
: File with proteins or database used during the reciprocal search (default: minimal_reciprocal.dmnd)--outdir herpesviridae_vs_tarsier
: Name of output directory (default: hifever_output)-with-report report.html
: Create Nextflow html workflow report (includes run time, user information, task metadata, and CPU, memory, and I/O usage)
Query clustering
--mmseqs_minseqid 0.70
: Sequence identity threshold for clustering of the protein query (default: 0.95 = 95%). Input protein queries are clustered prior to the search to avoid redundant searches of highly identical proteins--mmseqs_cover 0.80
: Minimum percentage of cluster member sequence length that must overlap with the representative sequence (default: 0.90 = 90%)
Diamond search settings
--diamond_forks 8
: DIAMOND fork count. By default Nextflow attempts to run all available forward DIAMOND tasks in parallel (one for each currently downloaded assembly), which can lead to overuse of memory resources and job termination. On local machines and clusters, it is therefore suggested to limit the number of parallel DIAMOND tasks (i.e., "forks") allowed at once. For setting the value, we recommend total cores / 12. Note that this does not affect the reciprocal DIAMOND search, which uses all available CPUs in a single process (default: 4 for cluster workflow)--diamond_mode ultra-sensitive
: DIAMOND sensitivity (default: very-sensitive)--diamond_matrix BLOSUM45
: DIAMOND substitution matrix (default: BLOSUM62)chunk_size 10000
: Size of host genome to process per DIAMOND forward search (default: 50000). Reducing this number will return more hits but take longer
EVE criteria settings
--interval 500
: Maximum interval length allowed between features in the context FASTA (default: 1000). This parameter modifies the allowed distance between DIAMOND hits before they are merged and counted as one contiguous hit--flank 5000
: Maximum length of (available) flanking sequence to add upstream and downstream of detected features in the context FASTA (default: 3000). This parameter allows you to modify how much host sequence upstream and downstream of the DIAMOND hits you would like to extract and keep for downstream analysis.--orf_size_nt 300
: Minimum length of extracted ORFs, in nucleotides (default: 150). This parameter allows you to alter the size threshold used when searching for ORFs within EVE hits
Genewise settings
--stop_task soft-mask
: Modify in-frame STOP codons in the Genewise coding DNA sequence output (default: remove). Options: remove [delete in-frame STOPs from the coding sequence], soft-mask [convert in-frame STOPs to lowercase--pairs_per_task 100
: Subset ($n$) of total loci count ($N$) processed by each Genewise task. To ensure equivalent workload across tasks and a high level of parallelisation, Genewise operations are split into $T$ tasks, where $T = N/n$ (default: 500)--genewise_matrix BLOSUM45
: Genewise substitution matrix (default: BLOSUM62). Options: BLOSUM80, BLOSUM62, BLOSUM45, BLOSUM30
Example inputs
HI-FEVER searches host genomes for matches against protein queries. As such, it requires a file providing the ftp list of genomes to be screened and a fasta file of protein queries. Examples of these files are below:
Query proteins
A FASTA file containing protein sequences to search for (default 'protein_query.fasta') e.g:
>ADI48253.1 putative Rep [Circoviridae TM-6c]
MQSVNWCFTLNNYTNEDVNKLKQVKCRYICLGFEVGDKKQTPHIQGFIQFEKKVRLSVWKKINKKIHAEI
MKGTIEQAINYCKKSGTFEERGEIIKMGERRDLKEAKKKCAEVGLRAITDCESTYNLQVIRNCQIMLEYH
EKERDFKPEVIWIYGESGAGKTKYISEKCAEVDTYWKDATKWWNGYDRHEITVMDDFRASNMKMNELLKL
IDRYPHRVEIKGGFRQMLSKKIYISSIMHPKDVYNLPEEPVKQLLRRIDTIIKI
>NP_042987.1 Rep [Human betaherpesvirus 6A]
MFSIINPSDDFWTKDKYIMLTIKGPMEWEAEIPGISTDFFCKFSNVSVPHFRDMHSPGAPDIKWITACTK
MIDVILNYWNNKTAVPTPAKWYAQAENKAGRPSLILLIALDGIPSATIGKHTTEIRGVLIKDFFDGNAPK
IDDWCTYAKTKKNGGGTQVFSLSYIPFALLQIIRPQFQWAWTNINELGDVCDEIHRKHIISHFNKKPNVK
LMLFPKDGINGISLKSKFLGTIEWLSDLGIVTEDAWIRRDIRSYMQLLTLTHGDVLIHRALSIAKKRIRA
TRKAIDFIAHIDTDFQIYENPVYQLFCLQSFDPILAGTILYQWLSHRGGKKNTVSFIGPPGCGKSMLTGA
ILENIPLHGILHGSLNTKNLRAYGQVLVLWWKDISINFDNFNIIKSLLGGQKIIFPINENDHVQIGPCPI
IATSCVDIRSMVHSNLHKINLSQRVYNFTFDKVIPRNFPVIQKDDINQFLFWARNRSINCFIDYTVPKIL
Host genomes
The host genomes file (default: ftp_list.txt
) should be a list of links to genome assemblies to screen. Links to assemblies available from NCBI can be found on the RefSeq and GenBank ftp sites or in the ftp tab on the NCBI genome page.
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/330/505/GCA_000330505.1_EIA2_v2
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/023/065/795/GCA_023065795.1_ASM2306579v1
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/208/925/GCF_000208925.1_JCVI_ESG2_1.0
Preparing a reciprocal database
An essential step in hi-fever
is the reciprocal DIAMOND search. There are three options of input databases that you can use for this step based on your research question and your computational resources:
- Minimal reciprocal database (recommended for most users, requires ~3GB of storage space)
By default, hi-fever
includes a reciprocal database (hi-fever/data/minimal_reciprocal.dmnd
) with the most frequent and informative hits you can get while searching for EVEs in vertebrate genomes. We provide it as an alternative to the full NCBI non reduntand protein database (nr
) and Reference Virus Database (RVDB
). which normally require >100GB of storage space. Whilst we have made all efforts to preserve a diversity of informative proteins in the minimal reciprocal database, there is a chance that the closest EVE match will not be represented.
- Full reciprocal databases (recommended for large workstations and cluster computers, requires ~115 GB of storage space)
Download the latest version of both the NCBI nr database and RVDB proteins database for the most accurate EVE annotations using:
cd data
# NCBI nr database
wget https://files.osf.io/v1/resources/tejwd/providers/googledrive/nr_rep_seq.fasta.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
gunzip prot.accession2taxid.FULL.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip
diamond makedb --in nr_rep_seq.fasta.gz -d nr_clustered_wtaxa --taxonmap prot.accession2taxid.FULL --taxonnodes nodes.dmp --taxonnames names.dmp --threads 16
# RVDB
wget https://rvdb-prot.pasteur.fr/files/U-RVDBv26.0-prot.fasta.xz
unxz U-RVDBv26.0-prot.fasta.xz
mmseqs createdb U-RVDBv26.0-prot.fasta DB
mmseqs linclust --min-seq-id 0.98 --cov-mode 1 -c 0.90 DB DB_clu tmp
mmseqs createsubdb DB_clu DB DB_clu_rep
mmseqs convert2fasta DB_clu_rep DB_clu_rep.fasta
mv DB_clu_rep.fasta U-RVDBv26.0-prot-clustered-minid0.98-cov0.90.fasta
awk '{if($0~">") {split($0,a,"|"); print ">"a[3],substr($0,2,length($0))} else print $0}' U-RVDBv26.0-prot-clustered-minid0.98-cov0.90.fasta > U-RVDBv26.0-prot-clustered-minid0.98-cov0.90-relabelled.fasta
diamond makedb --in U-RVDBv26.0-prot-clustered-minid0.98-cov0.90-relabelled.fasta -d rvdbv26_clustered_wtaxa.dmnd --taxonmap prot.accession2taxid.FULL --taxonnodes nodes.dmp --taxonnames names.dmp
Note: A preclustered of nr version is made available by Arcadia Science. RVDB was developed by Arifa Khan's group at CBER. For hi-fever, RVDB-prot is required, a protein version maintained at the Institut Pasteur with archives available here. The commands above offer a template for downloading and formatting these databases, though it is recommended to edit them to install the most recent database releases. These commands require the diamond and mmseqs, both of which are provided in the conda environment distributed with HI-FEVER.
- Custom reciprocal database (recomended only for users with previous knowledge about their EVEs):
hi-fever
builts a reciprocal database a user-provided fasta file or a prebuilt diamond database (specified under the--reciprocal_db
parameter). This is not recommended unless you already have a lot of information about the EVEs that you are searching for, as it will not easily distinguish between true EVEs and host cross-matches.
Example commands
Some example sets of parameters are shown below as a guide to how to customise your HI-FEVER run.
A default run
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results
Using the full NCBI and nr databases for the reciprocal DIAMOND search. Recommended where possible but is resource intensive and will take considerably longer.
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --full_reciprocal TRUE
Providing a custom database (in fasta format) for the reciprocal DIAMOND search. Suitable if you have a shortlist of proteins to confirm your EVEs, however may lose information on host cross-matches.
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --reciprocal_db reciprocal_proteins.fasta
Providing a custom database (in dmnd format) for the reciprocal DIAMOND search. Suitable if you have a shortlist of proteins to confirm your EVEs and have prebuilt a DIAMOND database from them.
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --dont_build_reciprocal TRUE --reciprocal_db reciprocal_proteins.fasta
Clustering the query proteins at low identity prior to DIAMOND searches. Suitable if you have many similar query proteins and are looking for more general matches.
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --mmseqs_minseqid 0.70 mmseqs_cover 0.60
Customising the diamond search modes. Suitable if you want to run a less computationally-intensive search.
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --diamond-forks 1 --diamond_mode sensitive
Returning longer candidate EVEs by merging more distant neigboring hits and returning longer flanking sequences. Suitable when searching for multiple protein integrations from one virus eg. proviruses.
nextflow main.nf --query_file_aa viruses.fasta --ftp_file genomes.txt --outdir hi_fever_results --interval 3000 --flank 5000