Task: run - sanger-pathogens/ariba GitHub Wiki

Task: run

This runs the main ARIBA local assembly pipeline.

Assuming ariba prepareref has been run, with the output directory called ref, run the pipeline with

ariba run ref reads_1.fq reads_2.fq output_dir

where the reads_1.fq, reads_2.fq are the names of the forwards and reverse paired reads files. The reads files can be in any format that is compatible with minimap (in particular, gzipped).

Important: ARIBA assumes that read N in the file reads_1.fq is the mate of read N in the file reads_2.fq. All output files will be put in a new directory called out_dir.

To see all the options, use --help:

ariba run --help

Report file

The most important file is report.tsv, an example of which can be found here. The meaning of the columns in report.tsv is as follows.

Column Description
1. ariba_ref_name ariba name of reference sequence chosen from cluster (needs to rename to stop some tools breaking)
2. ref_name original name of reference sequence chosen from cluster, before renaming
3. gene 1=gene, 0=non-coding (same as metadata column 2)
4. var_only 1=variant only, 0=presence/absence (same as metadata column 3)
5. flag cluster flag
6. reads number of reads in this cluster
7. cluster name of cluster
8. ref_len  length of reference sequence
9. ref_base_assembled number of reference nucleotides assembled by this contig
10. pc_ident %identity between reference sequence and contig
11. ctg name of contig matching reference
12. ctg_len length of contig
13. ctg_cov mean mapped read depth of this contig
14. known_var is this a known SNP from reference metadata? 1 or 0
15. var_type The type of variant. Currently only SNP supported
16. var_seq_type Variant sequence type. if known_var=1, n or p for nucleotide or protein
17. known_var_change if known_var=1, the wild/variant change, eg I42L
18. has_known_var if known_var=1, 1 or 0 for whether or not the assembly has the variant
19. ref_ctg_change amino acid or nucleotide change between reference and contig, eg I42L
20. ref_ctg_effect effect of change between reference and contig, eg SYS, NONSYN (amino acid changes only)
21. ref_start start position of variant in reference
22. ref_end end position of variant in reference
23. ref_nt nucleotide(s) in reference at variant position
24. ctg_start start position of variant in contig
25. ctg_end end position of variant in contig
26. ctg_nt nucleotide(s) in contig at variant position
27. smtls_total_depth  total read depth at variant start position in contig, reported by mpileup
28. smtls_nts nucleotides on contig, as reported by mpileup. The first is the contig nucleotide
29. smtls_nts_depth depths on contig, as reported by mpileup. One number per nucleotide in the previous column
30. var_description description of variant from reference metdata
31. free_text other free text about reference sequence, from reference metadata

If a gene is assembled with no variants then there will be one row for that gene, with information only in columns 1-12 (and possibly 30) and the remaining columns are dots. Otherwise, there is one row per variant. If you want a short summary of genes present and the corresponding flags, run:

cut -f1,4 report.tsv | uniq

Other files

The other files written to the output directory are as follows.

  • assembled_seqs.fa.gz. This is a gzipped FASTA of the assembled sequences. During assembly, the sequence flanking each reference sequence is assembled, but in this file only the parts of the contigs that match the reference sequences are kept. Furthermore, since no filter of nucleotide identity or reference coverage is applied when generating this file, partial matches from contigs to reference sequences are also included.

  • assembled_genes.fa.gz. This is a gzipped FASTA file of assembled gene sequences. It does not contain non-coding sequences (those are in assembled_seqs.fa.gz), only genes. When comparing a local assembly to a gene, mismatches near the end of the gene can cause the alignment to be too short. ARIBA tries to extend the match by looking for start and stop codons. The extended sequences are in this file, whereas the unextended sequences are in assembled_seqs.fa.gz.

  • assemblies.fa.gz. This is a gzipped FASTA file of the assemblies. It contains the complete, unedited, contigs. Since flanking regions of each reference sequence is assembled, the contig is usually longer than its matching reference sequence. As a result, file sizes should often show the following relationship: assemblies.fa.gz > assembled_genes.fa.gz and assembled_seqs.fa.gz, and sometimes assembled_seqs.fa.gz > assembled_genes.fa.gz.

  • debug.report.tsv. Not only does this file contain all rows of report.tsv, but also includes rows about synonymous mutations.

  • log.clusters.gz. Detailed logging is kept for the progress of each cluster. This is a gzipped file containing all the logging information.

  • version_info.txt. This contains detailed information on the versions of ARIBA and its dependencies. It is the output of running the task version.

Options

nucmer options:

  • --nucmer_min_id INT. Minimum alignment identity (the delta-filter -i option). Default: 90.
  • --nucmer_min_len INT. Minimum alignment length (the delta-filter -l option). Default: 20.
  • --nucmer_breaklen INT. Value to use for -breaklen when running nucmer. Default: 200.

Assembly options:

  • --assembly_cov INT. Target read coverage when sampling reads for assembly. The reads are randomly downsampled to the given coverage for assembly. If this option is more than the actual read coverage, then all reads are used for assembly. Default: 50.
  • --min_scaff_depth INT. Minimum number of read pairs needed as evidence for scaffold link between two contigs. Default: 10.
  • --assembler {fermilite,spades}. Default: fermilite. Spades will need to be installed on the host.
  • --spades_mode {wgs,sc,rna}. If using Spades assembler, either use default WGS mode, Single Cell mode (spades.py --sc) or RNA mode (spades.py --rna). Use SC or RNA mode if your input is from a viral sequencing with very uneven and deep coverage. Set --assembly_cov to some high value if using SC or RNA mode.
  • --spades_options SPADES_OPTIONS. Extra options to pass to Spades assembler. Sensible default options will be picked based on --spades_mode argument. Anything set here will replace the defaults completely. Example: --spades_options '--phred-offset 33'

Other options:

  • --threads INT. Experimental. Number of threads. Will run clusters in parallel, but not minimap. Default: 1.
  • --assembled_threshold FLOAT (between 0 and 1). If proportion of gene assembled (regardless of into how many contigs) is at least this value then the flag gene_assembled is set. Default: 0.95.
  • --gene_nt_extend INT. Max number of nucleotides to extend ends of gene matches to look for start/stop codons. Default: 30.
  • --unique_threshold FLOAT (between 0 and 1). If proportion of bases in gene assembled more than once is <= this value, then the flag unique_contig is set. Default: 0.03.
  • --force. Overwrite output directory, if it already exists
  • --noclean. Do not clean up intermediate files
  • --tmp_dir TMP_DIR. Existing directory in which to create a temporary directory used for local assemblies.
  • --verbose. Be verbose.