Utility Commands - Ecogenomics/CheckM GitHub Wiki

CheckM also provides a number of additional commands that may be useful for exploring genome bins. Some of these commands also produce summary statistics of sequences required by specific plotting commands.

unbinned

Given a set of genome bins along with a FASTA file of sequences, determines which of these sequences are not currently contained in a genome bin. This is useful for determining which sequences from an assembly were not binned by an automated binning algorithm.

Example: > checkm unbinned ./bins seqs.fna unbinned.fna unbinned_stats.tsv
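
The idea behind this command can be sketched in a few lines of Python. This is only an illustration, not CheckM's implementation: the function names are hypothetical, FASTA records are passed in as lists of lines rather than read from files, and sequences are matched by the first whitespace-delimited token of the header line, which is an assumption.

```python
def read_fasta_ids(lines):
    """Yield the sequence ID (first token after '>') of each FASTA record."""
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]


def unbinned_ids(assembly_lines, bins):
    """Return IDs of assembly sequences that appear in none of the bins.

    assembly_lines: lines of the assembly FASTA file
    bins: iterable of FASTA line lists, one per genome bin
    """
    binned = set()
    for bin_lines in bins:
        binned.update(read_fasta_ids(bin_lines))
    # Preserve the order in which sequences occur in the assembly.
    return [sid for sid in read_fasta_ids(assembly_lines) if sid not in binned]


assembly = [">c1 len=4", "ACGT", ">c2", "GGCC", ">c3", "TTAA"]
bins = [[">c1", "ACGT"], [">c3", "TTAA"]]
print(unbinned_ids(assembly, bins))
```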

coverage

Produces coverage profiles for all sequences within a set of genome bins. This command requires sorted and indexed BAM files, such as those obtained by mapping reads with a tool such as BWA. Coverage profiles are required for a number of the plots produced by CheckM.

Example: > checkm coverage ./bins coverage.tsv example_1.bam example_2.bam

tetra

Produces tetranucleotide signatures for all sequences within a FASTA file. Tetranucleotide signatures are required for a number of the plots produced by CheckM.

Example: > checkm tetra seqs.fna tetra.tsv
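
For readers unfamiliar with the term, a tetranucleotide signature is the frequency profile of all 256 possible 4-mers in a sequence. The sketch below shows one simple way to compute it. It is an assumption-laden illustration rather than CheckM's code: the function name is hypothetical, windows containing characters other than A, C, G, or T are skipped, and no merging of reverse-complement 4-mers is performed, which may differ from what CheckM writes to tetra.tsv.

```python
from collections import Counter
from itertools import product


def tetra_signature(seq):
    """Return the relative frequency of every 4-mer over A, C, G, T."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return {k: 0.0 for k in kmers}
    return {k: counts[k] / total for k in kmers}


signature = tetra_signature("ACGTACGT")
print(signature["ACGT"])
```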

profile

Produces a table indicating the percentage of reads mapped to an assembly which are assigned to each genome bin or to unbinned contigs. This information is also used to determine the percentage of each genome bin relative to all genome bins under consideration. This is a useful indication of the relative percentage of different populations within a community when the majority of populations are represented by a genome bin and the majority of reads map to the assembly. If the majority of reads do not map to the assembly, these results will not be a reliable indication of the relative proportion of different populations. This command requires a file indicating the coverage profile of all sequences within the genome bins. This file can be created with the coverage command described above.

The columns of the output table are:

  • Bin Id: unique identifier of bin
  • Bin size (Mbp): size of bin in Mbp
  • mapped reads: number of reads mapped to contigs comprising the bin
  • % mapped reads: (reads mapped to bin)/(total number of reads mapped to assembly)
  • % binned populations: estimates the proportion of a bin relative to all recovered bins. This is determined from the percentage of reads mapped to a bin, normalized for the size of the bin and considering only reads mapped to bins. Specifically, it is calculated as [(% mapped reads)/(bin size)]*(1/C), where C is the sum of size-adjusted bin coverages over all bins (i.e., the sum of (% mapped reads)/(bin size) over all bins).
  • % community: estimates the proportion of a bin relative to the number of reads mapped to assembled contigs, adjusted for the size of the bin. Assuming the majority of reads map to an assembled contig, this is an estimate of the relative proportion of a bin (i.e., population) within the community. It is calculated as (% binned populations) * (100 - (percentage of reads assigned to unbinned contigs)).

Example: > checkm profile coverage.tsv
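
The column definitions above can be worked through with a small Python sketch. It is not CheckM's code: the function name, the input layout, and the read counts are made up for illustration. One assumption about units is made explicit in the code: the [(% mapped reads)/(bin size)]*(1/C) term is treated as a fraction between 0 and 1, reported as a percentage for % binned populations, and multiplied by (100 - % unbinned) to give % community.

```python
def profile(bins, unbinned_reads):
    """Compute the profile columns for each bin.

    bins: dict mapping bin ID -> (bin size in Mbp, mapped read count)
    unbinned_reads: reads mapped to contigs that belong to no bin
    """
    total = sum(reads for _, reads in bins.values()) + unbinned_reads
    pct_mapped = {b: 100.0 * reads / total for b, (_, reads) in bins.items()}
    # Size-adjusted coverage: (% mapped reads) / (bin size).
    size_adjusted = {b: pct_mapped[b] / bins[b][0] for b in bins}
    c = sum(size_adjusted.values())  # the normalizing constant C
    pct_unbinned = 100.0 * unbinned_reads / total

    rows = {}
    for b in bins:
        fraction = size_adjusted[b] / c  # assumed to be a 0-1 fraction
        rows[b] = {
            "pct_mapped": pct_mapped[b],
            "pct_binned_pop": 100.0 * fraction,
            "pct_community": fraction * (100.0 - pct_unbinned),
        }
    return rows


# Hypothetical input: two bins plus 200 reads on unbinned contigs.
rows = profile({"bin_A": (2.0, 600), "bin_B": (4.0, 200)}, unbinned_reads=200)
for bin_id, row in rows.items():
    print(bin_id, row)
```

With these made-up numbers, bin_A receives 60% of mapped reads and bin_B 20%. After adjusting for bin size, bin_A accounts for 30/35 of the binned populations, and the % community values of the two bins sum to 80, the share of reads not assigned to unbinned contigs.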

join_tables

Joins two tab-separated values tables. The first column of each table is used as a unique identifier for joining the tables. A typical use of this command is to join the profile table produced by the profile command with the default output of the qa command.

Example: > checkm join_tables table1.tsv table2.tsv
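
Conceptually, the join matches rows by the value in the first column. The following Python sketch shows that idea under several assumptions that the text above does not confirm: both tables have a header row, only identifiers present in both tables are kept, and the key column of the second table is dropped from the result. The function name and the table contents are hypothetical.

```python
import csv
import io


def join_on_first_column(text1, text2):
    """Join two TSV tables (given as strings) on their first column."""
    rows1 = [r for r in csv.reader(io.StringIO(text1), delimiter="\t") if r]
    rows2 = [r for r in csv.reader(io.StringIO(text2), delimiter="\t") if r]
    # Combined header: all columns of table 1, then table 2 minus its key.
    joined = [rows1[0] + rows2[0][1:]]
    lookup = {r[0]: r[1:] for r in rows2[1:]}
    for row in rows1[1:]:
        if row[0] in lookup:  # assumed behavior: keep only shared IDs
            joined.append(row + lookup[row[0]])
    return joined


table1 = "Bin Id\tCompleteness\nbin1\t95.0\nbin2\t80.0\n"
table2 = "Bin Id\t% community\nbin2\t12.5\nbin1\t40.0\n"
for row in join_on_first_column(table1, table2):
    print("\t".join(row))
```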

ssu_finder

Identifies SSU (16S and 18S) rRNA genes residing on sequences if these sequences are contained within a genome bin.

Example: > checkm ssu_finder seqs.fna ./bins ./ssu_finder