Softwares - FranckLejzerowicz/metagenomix GitHub Wiki

Softwares currently available to use in metagenomix. To use a software in your pipeline, write its name, as it appears at the start of each bullet below, in the pipeline configuration file.

Note that some tools are holistic, meaning that they run on all samples, or on all groups of samples that are pooled for co-assembly (please read about the co-assembly pooling mechanism). Such holistic softwares are labelled with a cyclone emoji 🌀, and softwares that are not yet implemented are labelled with the construction emoji 🚧.

Sequence filtering

  • filtering performs the filtering of fastq files so that every sequence that aligns to any of the passed databases is discarded
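
The discard rule above can be illustrated with a minimal sketch. This is not the metagenomix implementation: the function name, the read identifiers and the database names ("human", "phix") are hypothetical, and the alignment hits are assumed to be already available as sets of read identifiers per database.

```python
def discard_aligned(read_ids, hits_per_database):
    """Keep only the reads that align to none of the databases.

    read_ids: ordered list of read identifiers (hypothetical input).
    hits_per_database: dict mapping a database name to the set of
        read identifiers that aligned to it (assumed precomputed).
    """
    aligned = set()
    for hits in hits_per_database.values():
        # A read is discarded as soon as it hits any one database.
        aligned.update(hits)
    return [read for read in read_ids if read not in aligned]


reads = ["r1", "r2", "r3", "r4"]
hits = {"human": {"r2"}, "phix": {"r2", "r4"}}
kept = discard_aligned(reads, hits)
print(kept)
```

A read that hits several databases (here "r2") is discarded once, and the original read order is preserved for the reads that are kept.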

Pre-processing

  • count was developed for metagenomix specifically to count the reads in sequence data files
  • fastqc A quality control analysis tool for high throughput sequencing data
  • fastp provides fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance (paper)
  • atropos for specific, sensitive, and speedy trimming of NGS reads (paper)
  • cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads (paper)
  • kneadata performs quality control on metagenomic and metatranscriptomic sequencing data.
  • hifiadapterfilt Convert .bam to .fastq and remove reads with remnant PacBio adapter sequences (paper)
  • mapdamage2 tracks and quantifies damage patterns in ancient DNA sequences (paper)
  • 🚧 berokka trims, circularises and orients long read bacterial genome assemblies (no paper yet)

Ecological distance measure

  • 🌀 simka is a de novo comparative metagenomics tool that represents each dataset as a k-mer spectrum and computes several classical ecological distances between them (paper)

Paired-read merging

  • flash Merge paired-end reads from next-generation sequencing experiments (paper)
  • pear Assembles Illumina paired-end reads if the DNA fragment sizes are smaller than twice the length of reads (paper)
  • ngmerge Operates on paired-end high-throughput sequence reads in two distinct "stitch" and "adapter-removal" modes (paper)
  • bbmerge Merge two overlapping paired reads into a single read (paper)

Alignment

  • bowtie2 Perform sequence alignment against custom, bowtie2-formatted databases (paper)
  • minimap2 aligns DNA or mRNA sequences against a large reference database (paper)
  • 🚧 salmon Wicked-fast transcript quantification from RNA-seq data (paper)
  • 🚧 kallisto quantifying abundances of transcripts from RNA-Seq data and other sequencing reads using pseudoalignment (paper)

Reads mapping

Two "softwares" (i.e., names to add in the pipeline configuration file) were developed for this general task: aligning the sequences obtained as output of any step onto the sequences obtained as output of another step (as long as both outputs are generated earlier in the order of the pipeline configuration):

  • mapping_* will align the specific preprocessing reads specified as suffix, using different aligners, such as BWA, bowtie2, minimap2 (see user-parameters configuration file):
    • mapping_fastp will map the (suffix=) fastp reads (if obtained in the pipeline configuration)
    • mapping_filtering will map the (suffix=) filtering reads (if obtained in the pipeline configuration)
    • mapping_ ...
  • pysam_* can only be used after mapping_*, as it is meant to count the mapped reads; counting is possible for any target specified as suffix (docs):
    • pysam (default: no suffix) will count reads for each of the entries used as reference for the above mapping (e.g., reads per contig, if mapping_fastp was done on an assembly).
    • pysam_prodigal (suffix: prodigal) will count reads for each of the proteins predicted based on the contigs used as reference (if mapping_fastp was done on the assembly used to predict proteins).
    • pysam_metawrap_refine (suffix: metawrap_refine) will count reads for each of the refined bins obtained from the contigs (if mapping_fastp was done on the assembly used to bin).
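
The naming and ordering rules above can be sketched as follows. This is only an illustration under stated assumptions, not how metagenomix parses its configuration: the helper names `split_step` and `check_order` are hypothetical, and the pipeline is assumed to be available as an ordered list of step names.

```python
def split_step(name):
    """Split a step name into (base, suffix).

    'mapping_fastp' -> ('mapping', 'fastp'); 'pysam' -> ('pysam', None).
    Names that are neither mapping nor pysam steps are returned unchanged.
    """
    for base in ("mapping", "pysam"):
        if name == base:
            return base, None
        if name.startswith(base + "_"):
            return base, name[len(base) + 1:]
    return name, None


def check_order(steps):
    """Return a list of problems found in an ordered list of step names."""
    problems = []
    seen = set()    # step names encountered so far
    mapped = False  # becomes True once a mapping_* step was encountered
    for step in steps:
        base, suffix = split_step(step)
        if base == "mapping":
            # The reads named by the suffix must be produced earlier.
            if suffix is not None and suffix not in seen:
                problems.append(f"{step}: '{suffix}' is not produced earlier")
            mapped = True
        elif base == "pysam" and not mapped:
            # Counting mapped reads requires a previous mapping_* step.
            problems.append(f"{step}: no mapping_* step comes before it")
        seen.add(step)
    return problems


good = ["fastp", "spades", "prodigal", "mapping_fastp", "pysam_prodigal"]
bad = ["pysam", "mapping_filtering"]
print(check_order(good))
print(check_order(bad))
```

The sketch checks only the two constraints stated above (the mapping suffix names reads obtained earlier, and pysam_* follows mapping_*); it does not verify that the pysam suffix matches the reference used for the mapping.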

Profiling

  • midas is an integrated pipeline that leverages >30,000 reference genomes to estimate bacterial species abundance and strain-level genomic variation, including gene content and SNPs, from shotgun metagenomes (paper)
  • midas2 is an integrated pipeline for profiling strain-level genomic variations in shotgun metagenomic data (same workflow as MIDAS, but uses larger MIDAS Reference Databases (MIDASDBs) and is faster and more scalable) (paper)
  • kraken2 is a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences (paper)
  • bracken is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample (paper)
  • 🚧 humann Abundance profiling of microbial taxa and metabolic pathways. (paper)
  • shogun is a modular, accurate and scalable framework for microbiome quantification (paper)
  • 🌀 woltka is a versatile program for determining the structure and functional capacity of microbiomes (paper)
  • 🚧 metaxa2 improves the identification and taxonomic classification of small and large subunit rRNA in metagenomic data (paper)
  • 🚧 phylophlan is an integrated pipeline for large-scale phylogenetic profiling of genomes and metagenomes (paper)
  • 🚧 phyloflash Reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset. (paper)
  • 🚧 mocat2 Taxonomic and functional abundance profiling, reads assembler and gene prediction. (paper)
  • 🚧 motus estimates relative taxonomic abundance of known and currently unknown microbial community members using metagenomic shotgun sequencing data (paper)
  • 🚧 ngless is a domain-specific language for next-generation sequencing data processing (paper)
  • 🚧 ngmetaprofiler is a profiler for metagenomics based on NGLess (paper)
  • 🚧 kaiju is a fast and sensitive taxonomic classifier for metagenomics (paper)
  • 🚧 metaphlan4 profiles the composition of microbial communities (Bacteria, Archaea and Eukaryotes) from metagenomic shotgun sequencing data (i.e. not 16S) with species-level resolution (paper)
  • 🚧 closedref picks OTUs using a closed reference and constructs an OTU table (paper)

Pooling

Assembling

  • spades is an assembly toolkit containing various assembly pipelines (paper)
  • 🌀 quast evaluates genome/metagenome assemblies by computing various metrics (paper)
  • plass assembles short read sequencing data on a protein level (paper)
  • flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies (paper)
  • canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION) (paper)
  • necat is an error correction and de-novo assembly tool for Nanopore long noisy reads (paper)
  • megahit is an ultra-fast and memory-efficient NGS assembler (paper)
  • unicycler is a hybrid assembly pipeline for bacterial genomes (paper)
  • metamic identifies and corrects misassemblies of (meta)genomic assemblies (paper)
  • 🚧 tricycler generates consensus long-read assemblies for bacterial genomes (paper)

Binning

  • metawrap Flexible pipeline for genome-resolved metagenomic data analysis. (paper)
  • binspreader refines metagenome-assembled genomes (MAGs) obtained from existing tools (paper)
  • yamb Yet Another Metagenome Binner - semi-automatic pipeline for metagenomic contigs binning. (paper)
  • 🚧 groopm Metagenomic binning suite. (paper)
  • 🚧 vamb Variational autoencoder for metagenomic binning. (paper)
  • 🚧 mycc combines genomic signatures, marker genes and optional contig coverages within one or multiple samples (paper)
  • 🚧 semibin for binning with deep learning, handles both short and long reads (paper)
  • 🚧 metabinner ensemble method to recover individual genomes from complex microbial communities (paper)

Annotation

Protein prediction

  • prodigal performs fast, reliable protein-coding gene prediction for prokaryotic genomes (paper)

Protein annotation

  • The following two search_ analyses were developed for metagenomix specifically:
  • eggnogmapper uses precomputed orthologous groups and phylogenies from the eggNOG database (http://eggnog5.embl.de) to transfer functional information from fine-grained orthologs only (paper)
  • 🚧 rundbcan Search for CAZymes (paper)
  • 🚧 metaclade2 Multi-source domain annotation. (paper)
  • 🚧 grasp2 is a gene-centric homolog search tool (paper)
  • 🚧 graspx is a guided reference-based assembler of short peptides (paper)
  • 🚧 signalp predicts the presence of signal peptides and location of their cleavage sites in proteins (paper)
  • 🚧 itasser predicts protein structure and annotates function based on protein structure (paper)
  • 🚧 🌀 metamarker is a de novo pipeline to discover novel metagenomic biomarkers. (paper)
  • 🚧 srst2 reports the presence of Sequence Types and/or reference genes (paper)

rRNA

  • barrnap predicts the location of ribosomal RNA genes in genomes (no paper)

Repeats

  • trf analyzes DNA sequences to find tandem repeats (paper)
  • kmerssr finds Simple Sequence Repeats (SSRs) in a sequence (presumably of DNA or RNA) (paper)
  • divissr is a DNA tandem repeat identification tool (paper)

Circularity

  • ccfind detects circular complete genomes with clues of terminal redundancy, originally designed for identification of complete virus genomes from metagenome assembly (paper)
  • circlator circularizes genome assemblies (paper)
  • 🚧 circlemap is a method for circular DNA detection based on probabilistic mapping of ultrashort reads (paper)

Eukaryotic sequences

  • tiara is based on deep learning to identify eukaryotic sequences in the metagenomic data (paper)

Genomes

  • prokka performs rapid prokaryotic genome annotation (paper)

Antibiotic-resistance genes

  • 🌀 antismash allows the rapid genome-wide identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genomes (paper)
  • staramr scans genome contigs against the ResFinder, PlasmidFinder, and PointFinder databases (paper)
  • deeparg predicts ARGs from metagenomes using deep learning (paper)
  • metamarc is a set of profile Hidden Markov Models developed for the purpose of screening and profiling resistance genes in DNA-based metagenomic data (paper)
  • karga is a k-mer-based antibiotic gene resistance analyzer, a multi-platform Java toolkit for identifying ARGs from metagenomic short read data (paper)
  • kargva identifies antibiotic resistance from sequencing data conferred by point mutations in bacterial genes (paper)
  • abritamr runs AMRfinderPlus and collates results into functional classes (paper)
  • 🚧 amrfinderplus Identify acquired ARGs in bacterial protein and/or assembled nucleotide sequences as well as known resistance-associated point mutations for several taxa. (paper)
  • abricate allows for mass screening of contigs for antimicrobial resistance or virulence genes (no paper yet)
  • hamronization parses multiple antimicrobial resistance analysis reports into a common data structure (no paper)
  • metacompare estimates the potential for ARGs to be disseminated into human pathogens from a given environmental sample using resistome risk prioritization (paper)
  • 🚧 resfinder identifies acquired antimicrobial resistance genes in total or partial sequenced isolates of bacteria (paper)
  • 🚧 amrplusplus2 is an easy to use app that identifies and characterizes resistance genes within sequence data (paper)
  • 🚧 argsoap annotates and classifies antibiotic resistance gene-like sequences from metagenomic data (paper)
  • 🚧 ariba identifies antibiotic resistance genes by running local assemblies (paper)

Integrons

Macromolecular systems and pathways

  • macsyfinder detects macromolecular systems in protein datasets using systems modelling and similarity search (paper)
  • 🌀 diting infers and compares biogeochemical pathways in metagenomic data (paper)
  • 🚧 tmhmm Prediction of transmembrane helices in proteins. (paper)
  • 🚧 ioncom Ligand-specific method for small ligand (including metal and acid radical ions) binding site prediction. (paper)
  • 🚧 deeptmhmm is a deep learning model for transmembrane topology prediction and classification (paper)
  • 🚧 keggcharter represents genomic potential and transcriptomic expression into KEGG pathways (paper)

Plasmids

  • plasforest is a random forest classifier of contigs to identify contigs of plasmid origin in contig and scaffold genomes (paper)
  • mobsuite allows for clustering, reconstruction and typing of plasmids from draft assemblies (paper)
  • plasmidfinder identifies and types plasmid replicons in whole-genome sequencing data (paper)
  • platon Identifies and characterizes bacterial plasmid-borne contigs from short-read draft assemblies (paper)
  • genomad identifies mobile genetic elements (paper)
  • 🚧 plasclass classifies sequences of plasmid or chromosomal origin (paper)
  • 🚧 oritfinder identifies origin of transfers in DNA sequences of bacterial mobile genetic elements (paper)
  • 🚧 deeplasmid Separates plasmids from chromosomal sequences (ML). (paper)
  • 🚧 rfplasmid predicts plasmid contigs from assemblies (paper)
  • 🚧 plasflow predicts plasmid sequences in metagenomic assemblies (paper)
  • 🚧 pprmeta identifies phages and plasmids from metagenomic fragments using deep learning (paper)

Viruses

  • viralverify classifies contigs (output of metaviralSPAdes or other assemblers) as viral, non-viral or uncertain, based on gene content (paper)
  • 🚧 coconet is a binning method for viral metagenomes (paper)
  • 🚧 threecac ("3CAC") is a three-class classifier designed to classify contigs in mixed metagenome assemblies as phages, plasmids, chromosomes, or uncertain (paper)
  • 🚧 deepvirfinder Identifies viruses from metagenomic data using deep learning. (paper)
  • 🚧 wish predicts prokaryotic hosts from metagenomic phage contigs (paper)
  • 🚧 virstrain identifies RNA virus at the strain-level based on short reads (paper)

Genome operation

  • 🌀 drep dereplicates the binned genomes to obtain MAGs across samples (paper)
  • 🚧 skani computes fast, robust ANI and aligned fraction for metagenomic genomes and contigs (paper)

Genome quality check and annotation

  • checkm provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes (paper)
  • checkm2 assesses the quality of metagenome-derived genome bins using machine learning (paper)
  • 🚧 gtdbtk allows for the taxonomic classification of bacterial and archaeal genomes based on GTDB. (paper)
  • 🚧 busco assesses genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs (paper)
  • 🚧 pirate identifies and classifies orthologous gene families in bacterial pangenomes over a wide range of sequence similarity thresholds (paper)

Strain-level analysis

  • lorikeet is a within-species variant analysis pipeline for metagenomic communities that utilizes both long and short reads (no paper yet)
  • 🌀 strainphlan tracks individual strains across a large set of samples (paper)
  • 🚧 instrain allows for strain-level analyses for co-occurring genome populations (paper)
  • 🚧 panphlan is a strain-level profiler for gene composition of individual strains (paper)
  • 🚧 strainsifter detects a bacterial strain in one or more metagenome(s) (paper)
  • 🚧 strainpro allows for strain-level profiling (paper)

Pipeline suites

Visualisation

  • 🚧 graphlan High-quality circular representations of taxonomic and phylogenetic trees.
  • 🚧 mummer2circos Circular bacterial genome plots based on BLAST or NUCMER/PROMER alignments.