Analyzing MetaCompass results - marbl/MetaCompass GitHub Wiki
Analyze results
The MetaCompass output folder contains four result files:
metacompass_assembly_stats.tsv - Assembly statistics
metacompass.final.ctg.fa - Assembled contigs in fasta format
metacompass.genomes_coverage.txt - Coverage per genome (breadth by default)
metacompass_summary.tsv - De novo and Reference-guided assembly contigs list
Assembly statistics
"metacompass_assembly_stats.tsv" header is as follows:
File # Contigs Total Size(Kbp) Min Size Max Size(Kbp) Average Size Median Size N50 # N50 contigs Size at 1Mbp (Kbp) Number @ 1Mbp Size at 2Mbp (Kbp) Number @ 2Mbp Size at 4Mbp (Kbp) Number @ 4Mbp Size at 10Mbp (Kbp) Number @ 10Mbp GC content [%]
These headers are further described in the table below:
Header | Description | Example |
---|---|---|
File | Name of fasta file | tutorial_example1/thao2000.0.assembly.out/contigs.final.fasta |
# Contigs | Number of contigs | 1 |
Total Size(Kbp) | Total assembly size in Kbp | 157534 |
Min Size | Size of the shortest contig | 157534 |
Max Size(Kbp) | Size of the largest contig | 157534 |
Average Size | Average contig size | 157534.00 |
Median Size | Median contig size | 157534.00 |
N50 | Length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly size | 157534 |
# N50 contigs | Also known as L50. L50 count is defined as the smallest number of contigs whose length sum produces N50 (from https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50) | 1 |
Size at 1Mbp (Kbp) | represents the size of the largest contig C such that the sum of all contigs larger than C exceeds 1Mbp | 0.00 |
Number @ 1Mbp | represents the number of contigs larger than 1Mbp | 0 |
Size at 2Mbp (Kbp) | represents the size of the largest contig C such that the sum of all contigs larger than C exceeds 2Mbp | 0.00 |
Number @ 2Mbp | represents the number of contigs larger than 2Mbp | 0 |
Size at 4Mbp (Kbp) | represents the size of the largest contig C such that the sum of all contigs larger than C exceeds 4Mbp | 0.00 |
Number @ 4Mbp | represents the number of contigs larger than 4Mbp | 0 |
Size at 10Mbp (Kbp) | represents the size of the largest contig C such that the sum of all contigs larger than C exceeds 10Mbp | 0.00 |
Number @ 10Mbp | represents the number of contigs larger than 10Mbp | 0 |
GC content [%] | The GC content (percentage) is the number of GC nucleotides divided by the total nucleotides. | 14.55 |
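
The N50 and "# N50 contigs" (L50) columns can be checked by hand from a list of contig lengths. The following is a minimal sketch of the standard definitions, not the code MetaCompass itself uses; the function name `n50_l50` is hypothetical.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    # Walk the contigs from longest to shortest.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that brings the running sum to half of the
        # assembly size defines N50; the number of contigs used is L50.
        if running >= half_total:
            return length, count


# Single-contig assembly, as in the example column of the table above.
print(n50_l50([157534]))
```

For the single 157534 bp contig in the example, both values follow directly: N50 is the contig length and L50 is 1.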
Assembled contigs in fasta format
The MetaCompass fasta file "metacompass.final.ctg.fa" contains a combination of reference-guided contigs, assembled from reads aligning to reference genomes, and de novo contigs, assembled from unmapped reads (see MEGAHIT for more info). Contig names have a different format depending on the assembly method. For example, reference-guided contig names start with the reference genome accession number, followed by a contig ID number and "_pilon". Two examples of fasta headers (the first reference-guided, the second de novo):
>NC_018417.1_0_pilon
>k99_234 flag=1 multi=628.3844 len=471
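
The naming convention above can be used to separate the two kinds of contigs. The sketch below assumes that every reference-guided name has the form `<accession>_<contig number>_pilon`, as in the example header, and treats everything else as de novo. The function name `classify_header` and the regular expression are illustrative, not part of MetaCompass.

```python
import re

# Assumed pattern, based on the example header: accession, contig number, "_pilon".
REF_GUIDED = re.compile(r"^(?P<accession>.+)_(?P<index>\d+)_pilon$")


def classify_header(header):
    """Classify one fasta header line by assembly method."""
    # Drop the leading ">" and keep only the name before the first space.
    name = header.lstrip(">").split()[0]
    match = REF_GUIDED.match(name)
    if match:
        return {
            "contig": name,
            "method": "reference-guided",
            "reference": match.group("accession"),
        }
    return {"contig": name, "method": "de novo", "reference": None}


for line in (">NC_018417.1_0_pilon", ">k99_234 flag=1 multi=628.3844 len=471"):
    print(classify_header(line))
```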
Assembly Coverage
"metacompass.genomes_coverage.txt" contains the breadth of coverage per reference genome. Multimapped reads are assigned to the reference genome with the highest breadth of coverage. This file header is as follows:
Ref_id bases Ref_length coverage
These headers are further described in the table below:
Header | Description | Example |
---|---|---|
Ref_id | Accession number of reference genome | NC_018417.1 |
bases | Number of read bases aligned to the genome in bp | 157537 |
Ref_length | Size of Reference genome in bp | 157543 |
coverage | Breadth of coverage; in the example row this equals bases / Ref_length | 0.999962 |
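
As a quick sanity check, the coverage column can be recomputed from the other two numeric columns. This is a minimal sketch that assumes the file is tab-delimited with the header row shown above, and that breadth is `bases / Ref_length`, which matches the example row. The function name `read_coverage` is hypothetical.

```python
import csv
import io


def read_coverage(handle):
    """Parse a coverage table and recompute breadth per reference genome."""
    rows = []
    # Assumption: columns are separated by tabs.
    for row in csv.DictReader(handle, delimiter="\t"):
        bases = int(row["bases"])
        ref_length = int(row["Ref_length"])
        rows.append({
            "ref_id": row["Ref_id"],
            "reported": float(row["coverage"]),
            "recomputed": bases / ref_length,
        })
    return rows


# The example row from the table above, used as inline test data.
sample = "Ref_id\tbases\tRef_length\tcoverage\nNC_018417.1\t157537\t157543\t0.999962\n"
rows = read_coverage(io.StringIO(sample))
for r in rows:
    print(r["ref_id"], r["reported"], round(r["recomputed"], 6))
```

With a real run, replace the `io.StringIO` object with an open file handle to "metacompass.genomes_coverage.txt".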
Contigs summary
"metacompass_summary.tsv" contains metadata per assembled contig. This file header is as follows:
contig ID contig size reference genome position start position end genome name
These headers are further described in the table below:
Header | Description | Example |
---|---|---|
contig ID | Name of the contig, as in the fasta header | NC_018417.1_0_pilon |
contig size | Size of contig in bp | 157534 |
reference genome | Accession number | NC_018417.1 |
position start | Start of alignment | 6 |
position end | end of alignment | 157540 |
genome name | Name of the reference genome | Candidatus Carsonella ruddii HT isolate Thao2000, complete genome |
Note that only reference-guided contigs contain all of this information. De novo contigs contain only a contig ID and a contig size.
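
Because de novo rows carry fewer fields, a parser needs to tolerate short lines. The sketch below assumes a tab-delimited file with one header row and the column order listed above. The function name `read_summary` and the de novo sample row are illustrative only.

```python
import csv
import io

FIELDS = ["contig ID", "contig size", "reference genome",
          "position start", "position end", "genome name"]


def read_summary(handle):
    """Parse the contig summary, padding short (de novo) rows with empty fields."""
    reader = csv.reader(handle, delimiter="\t")  # assumption: tab-delimited
    next(reader, None)  # skip the header row
    records = []
    for cells in reader:
        if not cells:
            continue  # ignore blank lines
        cells = [c.strip() for c in cells]
        cells += [""] * (len(FIELDS) - len(cells))
        record = dict(zip(FIELDS, cells))
        # A contig counts as reference-guided when a reference accession is present.
        record["reference_guided"] = bool(record["reference genome"])
        records.append(record)
    return records


sample = (
    "\t".join(FIELDS) + "\n"
    "NC_018417.1_0_pilon\t157534\tNC_018417.1\t6\t157540\t"
    "Candidatus Carsonella ruddii HT isolate Thao2000, complete genome\n"
    "k99_234\t471\n"  # hypothetical de novo row: ID and size only
)
records = read_summary(io.StringIO(sample))
for rec in records:
    print(rec["contig ID"], rec["contig size"], rec["reference_guided"])
```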