Task: aln2meta - sanger-pathogens/ariba GitHub Wiki

Use this task when you have one or more sets of reference sequences, each with a multiple alignment, together with SNP information that you want ARIBA to call. The variant grouping option of ARIBA can then be used to track the "same" SNPs across all the sequences.

The procedure is explained using the following example on toy data.

Example

Make the input files

We will use the following toy sequences. They are supposed to represent different alleles of the same (very short!) gene.

>seq1
ATGGCTAATTAG
>seq2
ATGTTTAATTAG
>seq3
ATGTTTTGTAATTAG
>seq4
ATGTTTGATAATTAG

They translate to the following amino acid sequences.

>seq1
MAN*
>seq2
MFN*
>seq3
MFCN*
>seq4
MFDN*
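
The translations above can be reproduced with a short sketch that uses the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative only and are not part of ARIBA; the sketch assumes ungapped, in-frame sequences.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate an ungapped coding sequence codon by codon ('*' marks a stop)."""
    if len(nucleotides) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    codons = (nucleotides[i:i + 3] for i in range(0, len(nucleotides), 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# The toy alleles from this example.
toy_sequences = {
    "seq1": "ATGGCTAATTAG",
    "seq2": "ATGTTTAATTAG",
    "seq3": "ATGTTTTGTAATTAG",
    "seq4": "ATGTTTGATAATTAG",
}

for name, seq in toy_sequences.items():
    print(f">{name}\n{translate(seq)}")
```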

Here is a multiple alignment of the amino acid sequences:

>seq1
M-AN*
>seq2
MF-N*
>seq3
MFCN*
>seq4
MFDN*

and the corresponding nucleotide sequences:

>seq1
ATG---GCTAATTAG
>seq2
ATGTTT---AATTAG
>seq3
ATGTTTTGTAATTAG
>seq4
ATGTTTGATAATTAG

This final file is the one that must be used as input to ariba aln2meta. Every sequence must have the same length in this file (length includes the gaps).
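
As a rough sketch of how the nucleotide alignment can be derived from the amino acid alignment, the code below writes three gap characters for every gap in the protein alignment and copies one codon for every residue, then checks that all aligned sequences have the same length. The helper names `thread_codons` and `check_equal_lengths` are hypothetical and are not ARIBA functions.

```python
def thread_codons(aligned_protein, nucleotides):
    """Place codons under an aligned protein, writing '---' for each '-'."""
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if len(nucleotides) != 3 * n_residues:
        raise ValueError("nucleotide length does not match the protein")
    pieces = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            pieces.append("---")
        else:
            pieces.append(nucleotides[pos:pos + 3])
            pos += 3
    return "".join(pieces)


def check_equal_lengths(alignment):
    """Return the common aligned length, or raise if the lengths differ."""
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"aligned sequences differ in length: {lengths}")
    return next(iter(lengths.values()))


protein_alignment = {
    "seq1": "M-AN*",
    "seq2": "MF-N*",
    "seq3": "MFCN*",
    "seq4": "MFDN*",
}
nucleotides = {
    "seq1": "ATGGCTAATTAG",
    "seq2": "ATGTTTAATTAG",
    "seq3": "ATGTTTTGTAATTAG",
    "seq4": "ATGTTTGATAATTAG",
}

nucleotide_alignment = {
    name: thread_codons(protein_alignment[name], nucleotides[name])
    for name in protein_alignment
}
aligned_length = check_equal_lengths(nucleotide_alignment)

# Print the alignment in FASTA format.
for name, seq in nucleotide_alignment.items():
    print(f">{name}\n{seq}")
```

Running a length check like this before calling ariba aln2meta is a cheap way to catch a sequence that was accidentally left unaligned.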

In addition, a file of SNP information is needed. Suppose we know the following two SNPs confer antibiotic resistance:

  1. A2D in sequence seq1
  2. F2E in sequence seq4

ARIBA can be used to identify the corresponding SNPs in any of the sequences. The second required file is a TSV file containing information on these SNPs. It must have four columns:

  1. Sequence name. Must exactly match the name of a sequence in the multiple alignment FASTA file.

  2. The SNP, for example A2D.

  3. Group name. If you do not want to put the SNP into a group, use ".".

  4. A description of the SNP, for example "Causes resistance to antibiotic x".

In this example, we will use the file:

seq1	A2D	group1	Description of A2D.group1
seq4	F2E	group2	Description of F2E.group2
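
Because the columns must be separated by tabs, it can be safer to generate the file than to type it by hand. The following is a minimal sketch, assuming the four-column layout described above; `wild_type_matches` and `to_tsv` are hypothetical helpers, and the wild-type check is only an illustration of the kind of sanity check worth doing, not ARIBA's own code.

```python
import csv
import io
import re

# Unaligned amino acid sequences from this example.
proteins = {"seq1": "MAN*", "seq2": "MFN*", "seq3": "MFCN*", "seq4": "MFDN*"}

# Columns: sequence name, SNP, group name, description.
snps = [
    ("seq1", "A2D", "group1", "Description of A2D.group1"),
    ("seq4", "F2E", "group2", "Description of F2E.group2"),
]


def wild_type_matches(protein, snp):
    """Check that the wild-type residue of an SNP such as 'A2D' is in the protein."""
    match = re.fullmatch(r"([A-Z*])(\d+)([A-Z*])", snp)
    if match is None:
        raise ValueError(f"cannot parse SNP: {snp!r}")
    wild_type = match.group(1)
    position = int(match.group(2))  # 1-based position in the ungapped sequence
    return 1 <= position <= len(protein) and protein[position - 1] == wild_type


def to_tsv(rows):
    """Serialise rows as tab-separated text, one SNP per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


for name, snp, group, description in snps:
    if not wild_type_matches(proteins[name], snp):
        raise ValueError(f"{snp} does not match sequence {name}")

tsv_text = to_tsv(snps)
print(tsv_text, end="")
```

The text in `tsv_text` can then be saved to a file such as snps.tsv.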

Run aln2meta

Run aln2meta like this:

ariba aln2meta seqs.aln.fa snps.tsv coding out

where:

  • seqs.aln.fa is the multifasta alignment file of nucleotide sequences

  • snps.tsv is the TSV file of SNP information

  • coding is used because these are coding sequences. For non-coding sequences, use noncoding instead; the SNPs must then be nucleotide SNPs rather than amino acid SNPs.

  • out is the prefix of the names of the output files.

Note that ARIBA sanity checks the SNPs against the sequences. It outputs these two warnings:

Warning: position has a gap in sequence  seq2 corresponding to variant A2D (group1) in sequence  seq1 ... Ignoring for seq2
Warning: position has a gap in sequence  seq1 corresponding to variant F2E (group2) in sequence  seq4 ... Ignoring for seq1

which makes sense looking at the sequences. For example, the A2D variant in seq1 aligns to a gap in seq2, so it gets ignored for seq2 (but included for the other sequences).
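
The gap check behind these warnings can be illustrated with a small sketch: convert the SNP position in its own sequence to an alignment column, then look for gaps in that column in the other sequences. This is an assumption about the general idea, not ARIBA's actual implementation, and the function names `column_of` and `gapped_sequences` are made up for this example. It works on the amino acid alignment because the SNPs here are amino acid changes.

```python
def column_of(aligned, position):
    """Return the 0-based alignment column of the position-th (1-based) residue."""
    seen = 0
    for column, ch in enumerate(aligned):
        if ch != "-":
            seen += 1
            if seen == position:
                return column
    raise ValueError(f"position {position} is beyond the end of the sequence")


def gapped_sequences(alignment, name, position):
    """List the other sequences that have a gap where the SNP aligns."""
    column = column_of(alignment[name], position)
    return [
        other
        for other, aligned in alignment.items()
        if other != name and aligned[column] == "-"
    ]


protein_alignment = {
    "seq1": "M-AN*",
    "seq2": "MF-N*",
    "seq3": "MFCN*",
    "seq4": "MFDN*",
}

# A2D is defined on seq1 and F2E on seq4; both are at position 2 of their sequence.
for name, snp, position in [("seq1", "A2D", 2), ("seq4", "F2E", 2)]:
    skipped = gapped_sequences(protein_alignment, name, position)
    print(f"{snp} in {name}: gap in {skipped}")
```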

Run prepareref

The aln2meta command above outputs three files, which can be used as input to ariba prepareref like this:

ariba prepareref -f out.fa -m out.tsv --cdhit_clusters out.cluster out.prepareref

and then ariba run can be run as normal.

More than one set of multiple alignments

It is possible to use more than one set of multiple alignments, for example when you have several genes, each of which has multiple alleles and SNPs of interest. Run aln2meta once for each gene (that is, once for each set of alleles). For example:

    ariba aln2meta seqs.aln.1.fa snps.1.tsv coding out1
    ariba aln2meta seqs.aln.2.fa snps.2.tsv coding out2
    ariba aln2meta seqs.aln.3.fa snps.3.tsv coding out3

Then cat the relevant files together and run prepareref:

    cat out*fa > all.fa
    cat out*tsv > all.tsv
    cat out*cluster > all.cluster

    ariba prepareref -f all.fa -m all.tsv --cdhit_clusters all.cluster out.prepareref

Alternatively, skip the cat step and use -f and -m once for each file. Finally, ariba run can be run as normal.