Analyzing MetaCompass results - marbl/MetaCompass GitHub Wiki

Analyze results

The MetaCompass output folder contains four result files:

metacompass_assembly_stats.tsv   - Assembly statistics 
metacompass.final.ctg.fa         - Assembled contigs in fasta format
metacompass.genomes_coverage.txt - Coverage per genome (breadth by default)
metacompass_summary.tsv          - De novo and Reference-guided assembly contigs list

Assembly statistics

The header of "metacompass_assembly_stats.tsv" is as follows:

File	# Contigs	Total Size(Kbp)	Min Size	Max Size(Kbp)	Average Size	Median Size	N50	# N50 contigs	Size at 1Mbp (Kbp)	Number @ 1Mbp	Size at 2Mbp (Kbp)	Number @ 2Mbp	Size at 4Mbp (Kbp)	Number @ 4Mbp	Size at 10Mbp (Kbp)	Number @ 10Mbp	GC content [%]

These headers are further described in the table below:

Header | Description | Example
--- | --- | ---
File | Name of the FASTA file | tutorial_example1/thao2000.0.assembly.out/contigs.final.fasta
# Contigs | Number of contigs | 1
Total Size(Kbp) | Total assembly size in Kbp | 157534
Min Size | Size of the shortest contig | 157534
Max Size(Kbp) | Size of the largest contig | 157534
Average Size | Average contig size | 157534.00
Median Size | Median contig size | 157534.00
N50 | Length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly size | 157534
# N50 contigs | Also known as L50: the smallest number of contigs whose summed length makes up half of the total assembly size (from https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50) | 1
Size at 1Mbp (Kbp) | Size of the largest contig C such that the sum of all contigs larger than C exceeds 1Mbp | 0.00
Number @ 1Mbp | Number of contigs larger than 1Mbp | 0
Size at 2Mbp (Kbp) | Size of the largest contig C such that the sum of all contigs larger than C exceeds 2Mbp | 0.00
Number @ 2Mbp | Number of contigs larger than 2Mbp | 0
Size at 4Mbp (Kbp) | Size of the largest contig C such that the sum of all contigs larger than C exceeds 4Mbp | 0.00
Number @ 4Mbp | Number of contigs larger than 4Mbp | 0
Size at 10Mbp (Kbp) | Size of the largest contig C such that the sum of all contigs larger than C exceeds 10Mbp | 0.00
Number @ 10Mbp | Number of contigs larger than 10Mbp | 0
GC content [%] | Number of G and C nucleotides divided by the total number of nucleotides, as a percentage | 14.55
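
To make the N50, L50, and GC content definitions concrete, here is a minimal Python sketch that computes them from a list of contig lengths and a sequence string. The function names `n50_l50` and `gc_percent` are hypothetical and are not part of MetaCompass; how MetaCompass itself treats ambiguous bases such as N is not stated here, so this sketch simply counts them in the denominator.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


def gc_percent(sequence):
    """Return the GC content of a sequence as a percentage."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Assumption: every character, including N, counts toward the total.
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)
```

For the single-contig example in the table, `n50_l50([157534])` gives an N50 of 157534 and an L50 of 1, which matches the example values for "N50" and "# N50 contigs".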

Assembled contigs in fasta format

The MetaCompass FASTA file "metacompass.final.ctg.fa" contains a combination of reference-guided contigs, assembled from reads that align to reference genomes, and de novo contigs, assembled from unmapped reads (see MEGAHIT for more information). Contig names follow a different format depending on the assembly method. For example, reference-guided contig names start with the reference genome accession number, followed by a contig ID number and "_pilon". Two examples of FASTA headers, the first reference-guided and the second de novo:

>NC_018417.1_0_pilon
>k99_234 flag=1 multi=628.3844 len=471
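
The naming convention above can be used to tell the two contig types apart. The following Python sketch is one possible way to do it, assuming that every reference-guided name ends in `_<contig id>_pilon` as in the example and that anything else is a de novo contig. The function name `classify_contig` is hypothetical.

```python
import re

# Assumption: reference-guided names look like <accession>_<number>_pilon.
GUIDED_NAME = re.compile(r"(?P<accession>.+)_(?P<contig_id>\d+)_pilon")


def classify_contig(header):
    """Classify a FASTA header line as reference-guided or de novo."""
    # Drop the leading '>' and any description after the first whitespace.
    name = header.lstrip(">").split()[0]
    match = GUIDED_NAME.fullmatch(name)
    if match:
        return {
            "method": "reference-guided",
            "accession": match.group("accession"),
            "contig_id": int(match.group("contig_id")),
        }
    return {"method": "de novo", "name": name}
```

Matching the suffix from the right keeps underscores inside the accession intact, so `NC_018417.1_0_pilon` yields the accession `NC_018417.1` and contig ID `0`.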

Assembly Coverage

"metacompass.genomes_coverage.txt" contains the breadth of coverage per reference genome. Multimapped reads are assigned to the reference genome with the highest breadth of coverage. This file header is as follows:

Ref_id	bases	Ref_length	coverage

These headers are further described in the table below:

Header | Description | Example
--- | --- | ---
Ref_id | Accession number of the reference genome | NC_018417.1
bases | Number of read bases aligned to the genome in bp | 157537
Ref_length | Size of the reference genome in bp | 157543
coverage | Breadth of coverage | 0.999962
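
In the example row, the coverage value is consistent with dividing "bases" by "Ref_length" (157537 / 157543 ≈ 0.999962). The Python sketch below recomputes that ratio for each row as a sanity check. It assumes the file is tab-separated with the header row shown above; the function name `read_coverage` is hypothetical, and the in-memory example only reuses the values from the table.

```python
import csv
import io


def read_coverage(handle):
    """Read coverage rows and recompute breadth as bases / Ref_length."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        bases = int(row["bases"])
        ref_length = int(row["Ref_length"])
        rows.append({
            "Ref_id": row["Ref_id"],
            "reported": float(row["coverage"]),
            # Assumption: breadth is the covered fraction of the reference.
            "recomputed": bases / ref_length,
        })
    return rows


# Example row taken from the table above.
example = io.StringIO(
    "Ref_id\tbases\tRef_length\tcoverage\n"
    "NC_018417.1\t157537\t157543\t0.999962\n"
)
rows = read_coverage(example)
```

To run this on a real result, pass an open file handle for "metacompass.genomes_coverage.txt" instead of the in-memory example.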

Contigs summary

"metacompass_summary.tsv" contains metadata per assembled contig. This file header is as follows:

contig ID	contig size	reference genome	position start	position end	genome name

These headers are further described in the table below:

Header | Description | Example
--- | --- | ---
contig ID | Name of the contig | NC_018417.1_0_pilon
contig size | Size of the contig in bp | 157534
reference genome | Accession number of the reference genome | NC_018417.1
position start | Start of the alignment | 6
position end | End of the alignment | 157540
genome name | Name of the reference genome | Candidatus Carsonella ruddii HT isolate Thao2000, complete genome

Note that only reference-guided contigs contain all fields. De novo contigs contain only a contig ID and a contig size.
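
Because de novo rows carry fewer fields, a parser has to handle both row shapes. The Python sketch below shows one way to split the summary into the two groups. It assumes the file is tab-separated, that the first row is the header shown above, and that de novo rows have missing or empty reference fields. The function name `split_summary` is hypothetical, and the de novo row in the example is illustrative rather than real output.

```python
import csv
import io


def split_summary(handle):
    """Split summary rows into reference-guided and de novo contigs."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumption: the first row is the header
    guided, de_novo = [], []
    for fields in reader:
        fields = [field.strip() for field in fields]
        if not any(fields):
            continue  # skip blank lines
        record = {"contig_id": fields[0], "size": int(fields[1])}
        if len(fields) >= 6 and fields[2]:
            # Reference-guided rows carry all six fields.
            record.update(
                reference=fields[2],
                start=int(fields[3]),
                end=int(fields[4]),
                genome_name=fields[5],
            )
            guided.append(record)
        else:
            de_novo.append(record)
    return guided, de_novo


example = io.StringIO(
    "contig ID\tcontig size\treference genome\tposition start\tposition end\tgenome name\n"
    "NC_018417.1_0_pilon\t157534\tNC_018417.1\t6\t157540\t"
    "Candidatus Carsonella ruddii HT isolate Thao2000, complete genome\n"
    "k99_234\t471\n"  # illustrative de novo row
)
guided, de_novo = split_summary(example)
```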