03 Explore input data - GeertsManon/EEG_Metagenomics GitHub Wiki
Introduction
Understanding your input data is essential for interpreting the results in Anvi'o, so let's explore these files first before diving into Anvi'o. Go to the folder exercises:
cd exercises
We will be working with four different input files, each providing a different layer of information about our metagenomic sample. Together, they form the foundation for binning and downstream analysis in Anvi'o:
- an assembly file containing the reconstructed contigs
- a mapping file recording how reads align to those contigs
- a read depth summary file
- a contigs database that organizes everything in a structured format
Input data
1. Assembly
File name: contigs.fasta
A FASTA file is a text format for storing DNA or protein sequences. In the context of metagenome assembly, it contains the reconstructed sequences (contigs) that the assembler created by piecing together overlapping sequencing reads. Each contig represents a continuous stretch of DNA sequence from an organism in your sample. The file has a simple structure: header lines start with > and contain the sequence name, followed by lines of nucleotide sequence (A, T, G, C).
You can explore the first lines by typing:
head contigs.fasta
This will output:
>c_000000000001
GCGAAGGCAAGCCAGAGATCAGCGGCAGGACGACGGCGCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGT...
2. Mapping
File name: mappedReadsToContigs.bam
A BAM (Binary Alignment Map) file is a compressed binary format that stores alignment information - essentially a record of where each sequencing read maps to the assembled contigs. After assembly, we realign the original reads to the contigs to calculate sequencing depth. The BAM file logs each read's position, orientation, and alignment quality, with each line showing the read, its mapping site, and the mapping quality. This enables us to determine the number of reads supporting each base in the assembly. Using this data, we calculate the average read depth per contig, reflecting how many times a specific DNA segment has been sequenced.
Don't try to open the BAM file; you need specialized tools to read it. If you're curious, this is what it looks like:
LH00478:262:22JHYLLT4:6:1202:33304:9012 1:N:0:TACACCAT+CAACGACT 145 c_000000000001 1 44 12=1X138= c_000000000157 4817 0 GCGAAGGCAAGCTAGAGATCAGCGGCAGGACGACGGCGCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGAC IIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIII9IIIIIII9IIIII9IIIIIIIII9IIIII9IIIIIIIII9II9IIIIIIIII-IIIIII9I9IIIIIIIIIIIIIIIIIIII9IIIIIIII9IIII9I99I9III-99IIIIIII NM:i:1 AM:i:44
LH00478:262:22JHYLLT4:6:2293:31038:24141 1:N:0:TACACCAT+CAACGACT 163 c_000000000001 11 45 151= = 63 203 GCCAGAGATCAGCGGCAGGACGACGGCGCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:45
LH00478:262:22JHYLLT4:6:1325:49181:7023 1:N:0:TACACCAT+CAACGACT 99 c_000000000001 38 45 151= = 142 255 GCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII--IIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:45
LH00478:262:22JHYLLT4:6:1335:45985:8690 1:N:0:TACACCAT+CAACGACT 89 c_000000000001 55 45 151= = 55 0 TGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9II9I9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIII99I9-IIIIIIIIIIIIIIIIIIIIIIIII9IIIIII NM:i:0 AM:i:0
LH00478:262:22JHYLLT4:6:2293:31038:24141 1:N:0:TACACCAT+CAACGACT 83 c_000000000001 63 45 151= = 11 -203 AACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAAT IIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:45
LH00478:262:22JHYLLT4:6:2193:33029:21423 1:N:0:TACACCAT+CAACGACT 99 c_000000000001 67 44 116= = 67 116 CCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:44
LH00478:262:22JHYLLT4:6:2193:33029:21423 1:N:0:TACACCAT+CAACGACT 147 c_000000000001 67 44 116= = 67 -116 CCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAG IIIIIIIIIIII9IIIIIIIIIIIIIIIIII9IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:44
LH00478:262:22JHYLLT4:6:1465:2942:7612 1:N:0:TACACCAT+CAACGACT 153 c_000000000001 71 45 151= = 71 0 TCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAATGACGATGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIII NM:i:0 AM:i:0
LH00478:262:22JHYLLT4:6:2418:9084:9923 1:N:0:TACACCAT+CAACGACT 163 c_000000000001 73 45 151= = 102 180 CGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAATGACGATGCCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:45
LH00478:262:22JHYLLT4:6:1209:9084:2583 1:N:0:TACACCAT+CAACGACT 153 c_000000000001 86 45 151= = 86 0 GCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAATGACGATGCCCGAGATGCTCGTGC IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 AM:i:0
3. Read depth
File name: readDepthPerContig.txt
While the BAM file contains detailed position-by-position alignment information, a read depth statistics file provides summary metrics for each contig. Instead of storing millions of individual read alignments, it calculates useful statistics such as average read depth, the percentage of the contig covered by reads, and read depth variability. This makes it much easier to quickly assess the quality and reliability of each assembled contig.
You can explore the first lines by typing:
head readDepthPerContig.txt | column -t
This will output:
#ID Avg_fold Length Ref_GC Covered_percent Covered_bases Plus_reads Minus_reads Read_GC Median_fold Std_Dev
c_000000000001 10.6976 25924 0.0000 100.0000 25924 938 946 0.6895 11 4.48
c_000000000002 9.2982 20637 0.0000 100.0000 20637 651 651 0.6892 9 4.13
c_000000000003 8.5586 18718 0.0000 100.0000 18718 548 545 0.7063 8 4.27
c_000000000004 9.2294 17934 0.0000 99.8773 17912 561 559 0.6982 9 3.79
c_000000000005 8.7689 17558 0.0000 99.7095 17507 512 527 0.6759 9 3.77
c_000000000006 9.1415 17296 0.0000 100.0000 17296 544 542 0.6934 9 4.03
c_000000000007 10.0901 16779 0.0000 99.5471 16703 570 578 0.6810 9 4.60
c_000000000008 11.9507 16643 0.0000 100.0000 16643 674 680 0.6692 12 4.38
c_000000000009 10.3193 16186 0.0000 99.4131 16091 562 569 0.6972 10 4.37
Key columns:
Avg_fold: Average read depth of a certain contigLength: Contig length in base pairsCovered_percent: What percentage of the contig is covered by reads. This is a useful metric. When the assembly is correct, the reads cover the entire contig. If a segment of the contig suddenly lacks mapped reads—reducing theCovered_percent—that part is likely unreliable or an artificially assembled DNA sequence.
❓ Question 1: Look at the coverage information for contig
c_000000005985. What is the average fold coverage (Avg_fold) for this contig?❓ Question 2: How many bases of this contig are covered by reads (
Covered_bases)?💡 Tip: Use
grepto search for this specific contig in thereadDepthPerContig.txtfile.
4. Contigs Database
File name: contigs.db
The contigs.db is a SQLite database created by Anvi'o to organize all analysis results in a structured, queryable format. Currently, it includes the contigs from contigs.fasta along with gene predictions or gene calls. During this session, we'll incorporate additional layers of information, such as taxonomic annotations and manual binning decisions.
You can explore the contigs database by typing:
anvi-db-info contigs.db
This will output:
DB Info (no touch)
===============================================
Database Path ................................: contigs.db
description ..................................: [Not found, but it's OK]
db_type ......................................: contigs (variant: unknown)
version ......................................: 24
DB Info (no touch also)
===============================================
project_name .................................: DRC_Cave_A
contigs_db_hash ..............................: hash1042bd48
split_length .................................: 20000
kmer_size ....................................: 4
num_contigs ..................................: 8516
total_length .................................: 29138064
num_splits ...................................: 8719
gene_level_taxonomy_source ...................: None
gene_function_sources ........................: None
genes_are_called .............................: 1
external_gene_calls ..........................: 0
external_gene_amino_acid_seqs ................: 0
skip_predict_frame ...........................: 0
splits_consider_gene_calls ...................: 1
scg_taxonomy_was_run .........................: 0
scg_taxonomy_database_version ................: None
trna_taxonomy_was_run ........................: 0
trna_taxonomy_database_version ...............: None
reaction_network_ko_annotations_hash .........: None
reaction_network_kegg_database_release .......: None
reaction_network_modelseed_database_sha ......: None
reaction_network_consensus_threshold .........: None
reaction_network_discard_ties ................: None
creation_date ................................: 1769435245.95836
* Please remember that it is never a good idea to change these values. But in some
cases it may be absolutely necessary to update something here, and a
programmer may ask you to run this program and do it. But even then, you
should be extremely careful.
AVAILABLE GENE CALLERS
===============================================
* 'pyrodigal-gv' (32,197 gene calls)
AVAILABLE FUNCTIONAL ANNOTATION SOURCES
===============================================
* No functional annotations found in this contigs database :/
AVAILABLE HMM SOURCES
===============================================
* This contigs db does not have HMMs :/
❓ Question 3: How many contigs does this assembly contain?
❓ Question 4: What is the total length (in bp) of all contigs combined (i.e., the size of your metagenome)?
Small recap
Here's how your data was created (don't worry, this was done for you!):
Raw Reads (DRC_cave_A sample)
↓
Quality Filtering & Trimming
↓
Metagenomic Assembly
↓
Mapping reads back to contigs
↓
Coverage calculation
↓
YOUR STARTING POINT: Unbinned contigs ready for manual binning in Anvi'o!