Task: prepareref - sanger-pathogens/ariba GitHub Wiki

Task: prepareref

This prepares reference data for use with the task run. The input can be either provided by the user, or downloaded using the task [[getref|Task:-getref]].

prepareref will do the following:

  1. Sanity check the input fasta file(s) and metadata file(s). Inconsistent/bad data will be removed and reported in log files. If this happens, a warning will be written to stderr. It is important to check any removed sequences and/or variants. If you are missing a gene from your final output after running ARIBA, please check that it was not removed by prepareref.

  2. Run cd-hit on the sequences that pass stage 1. An independent run of cd-hit is carried out for each the four groups of reference sequences: non-coding and variant only, non-coding and presence absence, gene and vairant only, gene and presence/absence.

The output directory it makes is then used as input to the task run.

Using downloaded data

Assuming ariba getref was run with the output files prefix getref_out, then run prepareref like this:

ariba prepareref -f getref_out.fa -m getref_out.tsv prepareref.out

to make a new directory called prepareref.out, which can be used as input to the task [[run|Task:-run]]. For example, to get and use ARG-ANNOT data:

ariba getref argannot getref.out
ariba prepareref -f getref.out.fa -m getref.out.tsv prepareref.out

and the new directory preareref.out can be used as input to ariba run.

User-provided data - FASTA only

If you have a file of genes called in.fasta, the usage is:

ariba prepareref --all_coding yes -f in.fasta out_dir

or if all the sequences are non-coding:

ariba prepareref --all_coding no -f in.fasta out_dir

If you have a mix of genes and non-coding sequences, or associated variants of interest, this information must be supplied in a metadata file. See the next section.

User-provided data - FASTA and metadata

The usage is:

ariba prepareref -f file.fa -m metadata.tsv out_dir

where the reference sequences are in the FASTA file file.fa, and their associated metadata is in metadata.tsv. Every sequence in the FASTA file must have at least one corresponding entry in the metadata file.

Reference sequences

Reference sequences can either be non-coding or genes (but must always be nucleotide sequences). Gene sequences will have extra analysis performed, for example looking for non-synonymous amino acid changes.

Additionally, reference sequences can be either of:

  1. Presence/absence sequences. ARIBA will look for these sequences in the input reads and report any that it finds, and also any variants between these sequences and the input reads.

  2. Variants only sequences. These should have known variant details specified in the metadata file (see below). ARIBA reports only when it finds at least one of the given variants in each of these these sequences. If a sample has one of these sequences, but does not have one of the supplied variants, then it is not reported. If you supply a variants only sequence, but no variant, then the sequence will be removed during sanity checks.

In case you have sequences and metadata split over multiple files, the options -f and -m can be specified more than once. For example:

ariba prepareref -f in.1.fa -f in.2.fa -f in.3.fa \
    -m meta.1.tsv -m meta.2.tsv output_directory

Each sequence must have associated metadata in one (or more) of the metadata files - it does not matter which file. For example, a sequence in file in.1.fa must have metadata in meta.1.tsv or meta.2.tsv (or in both) - it does not have to be in meta.1.tsv.

The sequence names must be unique across all input files. Sequence names in the FASTA file(s) must match exactly to the names in the metadata.

Metadata

A metadata file (01.filter.check_metadata.tsv) is comprised of reference sequences that passed the sanity check at the stage 1 of prepareref. The file must be tab-delimited and consist of six columns:

  1. Sequence name. This must match the name used in the FASTA file.

  2. A 0 or 1 to indicate whether or not this is a gene. Use 0 for a non-coding sequence and 1 for a gene.

  3. A 0 or 1 to indicate if this is a presence/absence sequence or a variants only sequence. Use 0 for presence/absence and 1 for variants only.

  4. If this line is describing a variant, put it here in the form <wild type><position><variant type>, for example K10L. Put a "." if this line is not describing a variant. The reference sequence must have either the wild type or variant type at the given position, otherwise the variant will be removed during sanity checks. If the sequence is a gene (1 in column 2), this is taken to be an amino acid change. If the sequence is non-coding (0 in column 2), this is taken to be a nucleotide change.

  5. Variants can be put into groups. This is the group name for this variant. To not put it into a group, use ".". This can be useful when summarising across several runs. Suppose there are two alleles for a gene, both of which confer the same resistance if they have a particular SNP. Putting those SNPs in the same group allows ARIBA to track them as a group and simply report whether each sample has any variant from the group at the summary stage.

  6. Free text that can be used to describe the sequence and/or the variant. Put a "." if you do not want to provide a description.

More than one line can be used for each reference sequence, for when there are multiple variants or descriptions.

Here are some example lines:

  • To just describe the sequence, but no variant:

    sequence1 1 0 . . this is a description of sequence1

    ie this is a presence/absence gene and we are simply giving a description, not a variant.

  • A nucleotide variant:

    sequence2 0 0 A42G group1 description of variant

    ie this is a non-coding sequence where the wild type is an A at position 42 and we are interested if there is a G at that position. We have put the variant into a group called "group1".

  • An amino acid variant in a gene

    sequence3 1 1 I10L . .

    ie the wild type is an I at position 10 of the amino acid sequence. The 1 in column 3 indicates that it is a variants-only gene, which means we are only interested in the given amino acid change. In this example, no description and no group have been given.

In addition to the metadata file, prepareref produces a log file (01.filter.check_metadata.log) to list the removed reference sequences (see output file 01.filter.check_genes.log for their exact sequences).

Reference sequence and cluster names

When the ARIBA pipeline is run, it must give each cluster a name. This is the method used to name the clusters:

  1. Gather everything up to the first dot (.) in the name of each sequence in the cluster.

  2. If everything in (1) is the same, use that name. Otherwise, look for a common prefix for everything in (1). If there is a common prefix, then use it, but add "-" on the end of the common prefix.

  3. If there was no common prefix, take the most common string from (1), and add "+" on the end to indicate there were other names.

However, cluster names must be at least 3 characters long. If everything above fails, then the cluster is simply called "cluster". To avoid non-unique cluster names, -1, -2, -3 etc is appended to the cluster names where necessary.

When downloading a supported reference using getref, we attempt to name the sequences in such a way as to result in "sensible" cluster names. This usually works, however, there are always a few sequence names where we cannot automatically determine a sensible name.

Other options

The following options affect the clustering using CD-HIT:

  • --no_cdhit. Using this will prevent clustering. Instead, each input sequence is put into its own "cluster". Incompatible with --cdhit_clusters.

  • --cdhit_clusters FILENAME. Use this to specify how the sequences should be clustered, instead of using CD-HIT. The file format is one cluster per line. Sequence names separated by whitespace. Incompatible with --no_cdhit.

  • --cdhit_min_id FLOAT. Sequence identity threshold (cd-hit option -c). Default: 0.9.

  • --cdhit_min_length FLOAT. Length difference cutoff (cd-hit option -s). Default: 0.

  • --cdhit_max_memory INT. Memory limit in MB (cd-hit option -M). Use 0 for unlimited.

Miscellaneous options:

  • --min_gene_length INT. Minimum allowed length in nucleotides of reference genes. Genes shorter than this are removed. Default: 6.

  • --max_gene_length INT. Maximum allowed length in nucleotides of reference genes. Genes longer than this are removed. Default: 10000.

  • --min_noncoding_length INT. Minimum allowed length in nucleotides of non-coding sequences. Sequences shorter than this are removed. Default: 6.

  • --max_noncoding_length INT. Maximum allowed length in nucleotides of non-coding sequences. Sequences longer than this are removed. Default: 20000.

  • --genetic_code INT. Number of genetic code to use. Currently supported 1,4,11. Default: 11.

  • --force. Overwrite output directory, if it already exists. Without this option, output directory must not already exist.

  • --threads INT. Number of threads (currently only applies to cdhit). Default: 1.

  • --verbose. Be verbose. Silent by default, unless there are errors.

Troubleshooting

If you get the following error:

Warning:
Some seqs are too long, please rebuild the program with make parameter MAX_SEQ=new-maximum-length (e.g. make MAX_SEQ=10000000)
Not fatal, but may affect results !!
....
Fatal Error:
in diag_test_aapn_est, MAX_DIAG reached
Program halted !!

Then please see issue #278 for the workaround.

⚠️ **GitHub.com Fallback** ⚠️