Task: prepareref - sanger-pathogens/ariba Wiki
This prepares reference data for use with the task [[run|Task:-run]].
The input can be either provided by the user, or downloaded
using the task getref.
prepareref will do the following:
1. Sanity check the input FASTA file(s) and metadata file(s). Inconsistent/bad data will be removed and reported in log files. If this happens, a warning will be written to stderr. It is important to check any removed sequences and/or variants. If you are missing a gene from your final output after running ARIBA, please check that it was not removed by prepareref.
2. Run cd-hit on the sequences that pass stage 1. An independent run of cd-hit is carried out for each of the four groups of reference sequences: non-coding and variants only, non-coding and presence/absence, gene and variants only, gene and presence/absence.
The output directory it makes is then used as input to the task [[run|Task:-run]].
Using downloaded data
If ariba getref was run with the output files prefix getref_out, then run prepareref like this:
ariba prepareref -f getref_out.fa -m getref_out.tsv prepareref.out
to make a new directory called prepareref.out, which can be used as input to the task [[run|Task:-run]]. For example, to get and use ARG-ANNOT data:
ariba getref argannot getref.out
ariba prepareref -f getref.out.fa -m getref.out.tsv prepareref.out
and the new directory prepareref.out can be used as input to the task [[run|Task:-run]].
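If these two steps are driven from a script, the commands can be assembled as argument lists first. The following is a minimal Python sketch, not part of ARIBA: the helper names `build_commands` and `run_all` are hypothetical, and only the `getref` and `prepareref` invocations shown above are used. Running the commands assumes that `ariba` is installed and on the PATH.

```python
import subprocess


def build_commands(ref_db, prefix, out_dir):
    """Return the getref and prepareref commands as argument lists.

    ref_db  -- reference name passed to getref, e.g. "argannot"
    prefix  -- output files prefix given to getref
    out_dir -- directory that prepareref will create
    """
    getref = ["ariba", "getref", ref_db, prefix]
    prepareref = [
        "ariba", "prepareref",
        "-f", prefix + ".fa",   # FASTA file written by getref
        "-m", prefix + ".tsv",  # metadata file written by getref
        out_dir,
    ]
    return [getref, prepareref]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_all(build_commands("argannot", "getref.out", "prepareref.out"))` would issue the same two commands as the ARG-ANNOT example.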
User-provided data - FASTA only
If you have a file of genes called in.fasta, the usage is:
ariba prepareref --all_coding yes -f in.fasta out_dir
or if all the sequences are non-coding:
ariba prepareref --all_coding no -f in.fasta out_dir
If you have a mix of genes and non-coding sequences, or associated variants of interest, this information must be supplied in a metadata file. See the next section.
User-provided data - FASTA and metadata
The usage is:
ariba prepareref -f file.fa -m metadata.tsv out_dir
where the reference sequences are in the FASTA file file.fa and their associated metadata is in metadata.tsv. Every sequence in the FASTA file must have at least one corresponding entry in the metadata file.
Reference sequences can either be non-coding or genes (but must always be nucleotide sequences). Gene sequences will have extra analysis performed, for example looking for non-synonymous amino acid changes.
Additionally, reference sequences can be either of:
Presence/absence sequences. ARIBA will look for these sequences in the input reads and report any that it finds, and also any variants between these sequences and the input reads.
Variants only sequences. These should have known variant details specified in the metadata file (see below). ARIBA reports one of these sequences only when it finds at least one of the given variants in it. If a sample has one of these sequences, but does not have one of the supplied variants, then it is not reported. If you supply a variants only sequence, but no variant, then the sequence will be removed during sanity checks.
In case you have sequences and metadata split over multiple files, the options -f and -m can each be specified more than once. For example:
ariba prepareref -f in.1.fa -f in.2.fa -f in.3.fa \
    -m meta.1.tsv -m meta.2.tsv output_directory
Each sequence must have associated metadata in one (or more) of the metadata files - it does not matter which file. For example, a sequence in in.1.fa must have metadata in meta.1.tsv or meta.2.tsv (or in both) - it does not have to be in meta.1.tsv.
The sequence names must be unique across all input files. Sequence names in the FASTA file(s) must match exactly to the names in the metadata.
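The naming rules can be checked before running prepareref. The following is a minimal Python sketch, not ARIBA's own sanity check: the function names are hypothetical, and it assumes that a sequence name is the whole FASTA header line after ">" and that the sequence name is the first tab-separated field of each metadata line.

```python
from collections import Counter


def fasta_names(lines):
    """Return the sequence names found in FASTA header lines, in order."""
    # Assumption: the name is everything after ">" on the header line.
    return [line[1:].strip() for line in lines if line.startswith(">")]


def metadata_names(lines):
    """Return the set of sequence names in column 1 of the metadata lines."""
    return {line.rstrip("\n").split("\t")[0] for line in lines if line.strip()}


def check_names(fasta_lines, metadata_lines):
    """Report duplicated names and names present on one side only."""
    seqs = fasta_names(fasta_lines)
    meta = metadata_names(metadata_lines)
    counts = Counter(seqs)
    return {
        "duplicated": sorted(n for n, c in counts.items() if c > 1),
        "no_metadata": sorted(set(seqs) - meta),
        "not_in_fasta": sorted(meta - set(seqs)),
    }
```

For several input files, the lines of all FASTA files (and of all metadata files) can be concatenated, for example with `itertools.chain`, before calling `check_names`, because names must be unique across all input files.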
A metadata file (01.filter.check_metadata.tsv) comprises the reference sequences that passed the sanity checks at stage 1 of prepareref. The file must be tab-delimited and consist of six columns:
1. Sequence name. This must match the name used in the FASTA file.
2. 0 or 1 to indicate whether or not this is a gene. Use 0 for a non-coding sequence and 1 for a gene.
3. 0 or 1 to indicate if this is a presence/absence sequence or a variants only sequence. Use 0 for presence/absence and 1 for variants only.
4. If this line is describing a variant, put it here in the form <wild type><position><variant type>, for example K10L. Put a "." if this line is not describing a variant. The reference sequence must have either the wild type or variant type at the given position, otherwise the variant will be removed during sanity checks. If the sequence is a gene (1 in column 2), this is taken to be an amino acid change. If the sequence is non-coding (0 in column 2), this is taken to be a nucleotide change.
5. Variants can be put into groups. This is the group name for this variant. To not put it into a group, use ".". This can be useful when summarising across several runs. Suppose there are two alleles for a gene, both of which confer the same resistance if they have a particular SNP. Putting those SNPs in the same group allows ARIBA to track them as a group and simply report whether each sample has any variant from the group at the summary stage.
6. Free text that can be used to describe the sequence and/or the variant. Put a "." if you do not want to provide a description.
More than one line can be used for each reference sequence, for when there are multiple variants or descriptions.
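The rule for column 4, that the reference must carry either the wild type or the variant type at the given position, can be sketched in Python as follows. This is an illustration rather than ARIBA's implementation: the function names are hypothetical, positions are assumed to be 1-based, and for a gene the `sequence` argument is assumed to be the already translated amino acid sequence (translation is not shown).

```python
import re


def parse_variant(text):
    """Split a variant such as K10L into (wild type, position, variant type)."""
    match = re.fullmatch(r"([A-Za-z])(\d+)([A-Za-z])", text)
    if match is None:
        raise ValueError("Not of the form <wild type><position><variant type>: " + text)
    return match.group(1).upper(), int(match.group(2)), match.group(3).upper()


def variant_fits_reference(sequence, variant):
    """True if the sequence has the wild type or the variant type at the position.

    Assumption: positions count from 1, so position 1 is sequence[0].
    """
    wild_type, position, variant_type = parse_variant(variant)
    if position < 1 or position > len(sequence):
        return False
    return sequence[position - 1].upper() in (wild_type, variant_type)
```

A variant for which `variant_fits_reference` returns False corresponds to one that would be removed during the sanity checks described above.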
Here are some example lines (in a real file the columns are separated by tabs):
To just describe the sequence, but no variant:
sequence1 1 0 . . this is a description of sequence1
i.e. this is a presence/absence gene and we are simply giving a description, not a variant.
A nucleotide variant:
sequence2 0 0 A42G group1 description of variant
i.e. this is a non-coding sequence where the wild type is an A at position 42 and we are interested if there is a G at that position. We have put the variant into a group called "group1".
An amino acid variant in a gene:
sequence3 1 1 I10L . .
i.e. the wild type is an I at position 10 of the amino acid sequence. The 1 in column 3 indicates that it is a variants only gene, which means we are only interested in the given amino acid change. In this example, no description and no group have been given.
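To make the six-column layout concrete, the following Python sketch parses one metadata line into named fields and flags variants only sequences that have no variant. It is a sketch under stated assumptions, not ARIBA's parser: `MetadataRecord`, `parse_metadata_line` and `variants_only_missing_variant` are hypothetical names, and "." is mapped to `None` for convenience.

```python
from collections import namedtuple

MetadataRecord = namedtuple(
    "MetadataRecord",
    ["name", "is_gene", "variant_only", "variant", "group", "description"],
)


def parse_metadata_line(line):
    """Parse one tab-delimited metadata line into a MetadataRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("Expected 6 tab-separated columns, got %d" % len(fields))
    name, gene_flag, variant_only_flag, variant, group, description = fields
    if gene_flag not in ("0", "1") or variant_only_flag not in ("0", "1"):
        raise ValueError("Columns 2 and 3 must be 0 or 1: " + line.rstrip("\n"))

    def dot_to_none(value):
        # "." means "not given" in columns 4 to 6.
        return None if value == "." else value

    return MetadataRecord(
        name,
        gene_flag == "1",          # column 2: 1 = gene, 0 = non-coding
        variant_only_flag == "1",  # column 3: 1 = variants only, 0 = presence/absence
        dot_to_none(variant),
        dot_to_none(group),
        dot_to_none(description),
    )


def variants_only_missing_variant(records):
    """Names of variants only sequences for which no line supplies a variant."""
    with_variant = {r.name for r in records if r.variant is not None}
    return sorted({r.name for r in records if r.variant_only} - with_variant)
```

Sequences returned by `variants_only_missing_variant` are the ones that the sanity checks would be expected to remove, since a variants only sequence needs at least one variant.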
In addition to the metadata file, prepareref produces a log file (01.filter.check_metadata.log) to list the removed reference sequences (see output file 01.filter.check_genes.log for their exact sequences).
Reference sequence and cluster names
When the ARIBA pipeline is run, it must give each cluster a name. This is the method used to name the clusters:
1. Gather everything up to the first dot (.) in the name of each sequence in the cluster.
2. If everything in (1) is the same, use that name. Otherwise, look for a common prefix for everything in (1). If there is a common prefix, then use it, but add "-" on the end of the common prefix.
3. If there was no common prefix, take the most common string from (1), and add "+" on the end to indicate there were other names.
However, cluster names must be at least 3 characters long. If everything above fails, then the cluster is simply called "cluster". To avoid non-unique cluster names, -1, -2, -3 etc is appended to the cluster names where necessary.
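The naming rules above can be sketched in Python as follows. This is one reading of the description, not ARIBA's actual code, and the function names are hypothetical. Two points are assumptions because the text does not settle them: the 3-character minimum is checked before "-" or "+" is added, and the first use of a name stays unchanged while later repeats get -1, -2, and so on.

```python
import os
from collections import Counter


def name_cluster(seq_names, min_length=3):
    """Name one cluster from the names of the sequences it contains."""
    if not seq_names:
        raise ValueError("A cluster must contain at least one sequence")

    # Step 1: everything up to the first dot in each sequence name.
    stems = [name.split(".", 1)[0] for name in seq_names]

    if len(set(stems)) == 1:
        # Step 2: all stems are the same, so use that name as it is.
        base, suffix = stems[0], ""
    else:
        prefix = os.path.commonprefix(stems)
        if prefix:
            # Step 2: a common prefix exists, mark it with "-".
            base, suffix = prefix, "-"
        else:
            # Step 3: no common prefix, use the most common stem plus "+".
            base, suffix = Counter(stems).most_common(1)[0][0], "+"

    # Assumption: names shorter than min_length fall back to "cluster".
    if len(base) < min_length:
        return "cluster"
    return base + suffix


def make_unique(names):
    """Append -1, -2, ... to repeated cluster names (first use stays as is)."""
    used = set()
    unique = []
    for name in names:
        candidate = name
        counter = 0
        while candidate in used:
            counter += 1
            candidate = "{}-{}".format(name, counter)
        used.add(candidate)
        unique.append(candidate)
    return unique
```

Under these assumptions, a cluster holding sequences named `aac3.x` and `aac6.y` would be named from their common prefix with "-" added, while a cluster whose stems share nothing would take the most frequent stem with "+" added.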
When downloading a supported reference using getref, we attempt to name the sequences in such a way as to result in "sensible" cluster names. This usually works; however, there are always a few sequence names for which we cannot automatically determine a sensible name.
The following options affect the clustering using CD-HIT:
--no_cdhit. Using this will prevent clustering. Instead, each input sequence is put into its own "cluster". Incompatible with --cdhit_clusters.
--cdhit_clusters FILENAME. Use this to specify how the sequences should be clustered, instead of using CD-HIT. The file format is one cluster per line, with sequence names separated by whitespace. Incompatible with --no_cdhit.
--cdhit_min_id FLOAT. Sequence identity threshold (cd-hit option -c). Default: 0.9.
--cdhit_min_length FLOAT. Length difference cutoff (cd-hit option -s). Default: 0.
--cdhit_max_memory INT. Memory limit in MB (cd-hit option -M). Use 0 for unlimited.
--min_gene_length INT. Minimum allowed length in nucleotides of reference genes. Genes shorter than this are removed. Default: 6.
--max_gene_length INT. Maximum allowed length in nucleotides of reference genes. Genes longer than this are removed. Default: 10000.
--min_noncoding_length INT. Minimum allowed length in nucleotides of non-coding sequences. Sequences shorter than this are removed. Default: 6.
--max_noncoding_length INT. Maximum allowed length in nucleotides of non-coding sequences. Sequences longer than this are removed. Default: 20000.
--genetic_code INT. Number of the genetic code to use. Currently supported: 1, 4, 11. Default: 11.
--force. Overwrite output directory, if it already exists. Without this option, output directory must not already exist.
--threads INT. Number of threads (currently only applies to cdhit). Default: 1.
--verbose. Be verbose. Silent by default, unless there are errors.
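For --cdhit_clusters, the file format described above (one cluster per line, sequence names separated by whitespace) can be produced with a few lines of Python. This is a minimal sketch with hypothetical function names; the checks that a name contains no whitespace and appears in only one cluster are assumptions that follow from the format, not documented ARIBA behaviour.

```python
def format_clusters(clusters):
    """Return the text of a clusters file: one cluster per line."""
    seen = set()
    lines = []
    for cluster in clusters:
        if not cluster:
            raise ValueError("Empty cluster")
        for name in cluster:
            # A name with whitespace would be read back as two names.
            if any(ch.isspace() for ch in name):
                raise ValueError("Whitespace in sequence name: %r" % name)
            # Assumption: each sequence belongs to exactly one cluster.
            if name in seen:
                raise ValueError("Sequence in more than one cluster: " + name)
            seen.add(name)
        lines.append(" ".join(cluster))
    return "\n".join(lines) + "\n"


def parse_clusters(text):
    """Read the same format back into a list of lists of sequence names."""
    return [line.split() for line in text.splitlines() if line.strip()]
```

The returned text can be written to a file whose name is then passed to prepareref with --cdhit_clusters.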
If you get the following error:
Warning: Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000) Not fatal, but may affect results !! .... Fatal Error: in diag_test_aapn_est, MAX_DIAG reached Program halted !!
Then please see issue #278 for the workaround.