03 Explore input data - GeertsManon/EEG_Metagenomics GitHub Wiki

Introduction

Understanding your input data is essential for interpreting the results in Anvi'o, so let's explore these files first before diving into Anvi'o. Go to the folder exercises:

cd exercises

We will be working with four different input files, each providing a different layer of information about our metagenomic sample. Together, they form the foundation for binning and downstream analysis in Anvi'o:

  • an assembly file containing the reconstructed contigs
  • a mapping file recording how reads align to those contigs
  • a read depth summary file
  • a contigs database that organizes everything in a structured format

Input data

1. Assembly

File name: contigs.fasta

A FASTA file is a text format for storing DNA or protein sequences. In the context of metagenome assembly, it contains the reconstructed sequences (contigs) that the assembler created by piecing together overlapping sequencing reads. Each contig represents a continuous stretch of DNA sequence from an organism in your sample. The file has a simple structure: header lines start with > and contain the sequence name, followed by lines of nucleotide sequence (A, T, G, C).

You can explore the first lines by typing:

head contigs.fasta

This will output:

>c_000000000001
GCGAAGGCAAGCCAGAGATCAGCGGCAGGACGACGGCGCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGT...

2. Mapping

File name: mappedReadsToContigs.bam

A BAM (Binary Alignment Map) file is a compressed binary format that stores alignment information - essentially a record of where each sequencing read maps to the assembled contigs. After assembly, we realign the original reads to the contigs to calculate sequencing depth. The BAM file logs each read's position, orientation, and alignment quality, with each line showing the read, its mapping site, and the mapping quality. This enables us to determine the number of reads supporting each base in the assembly. Using this data, we calculate the average read depth per contig, reflecting how many times a specific DNA segment has been sequenced.

Don't try to open the BAM file; you need specialized tools to read it. If you're curious, this is what it looks like:

LH00478:262:22JHYLLT4:6:1202:33304:9012   1:N:0:TACACCAT+CAACGACT  145  c_000000000001  1   44  12=1X138=  c_000000000157  4817  0     GCGAAGGCAAGCTAGAGATCAGCGGCAGGACGACGGCGCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGAC  IIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIII9IIIIIII9IIIII9IIIIIIIII9IIIII9IIIIIIIII9II9IIIIIIIII-IIIIII9I9IIIIIIIIIIIIIIIIIIII9IIIIIIII9IIII9I99I9III-99IIIIIII  NM:i:1  AM:i:44
LH00478:262:22JHYLLT4:6:2293:31038:24141  1:N:0:TACACCAT+CAACGACT  163  c_000000000001  11  45  151=       =               63    203   GCCAGAGATCAGCGGCAGGACGACGGCGCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAA  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:0  AM:i:45
LH00478:262:22JHYLLT4:6:1325:49181:7023   1:N:0:TACACCAT+CAACGACT  99   c_000000000001  38  45  151=       =               142   255   GCAAACTTGATCGGCGCTGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCG  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII--IIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:0  AM:i:45
LH00478:262:22JHYLLT4:6:1335:45985:8690   1:N:0:TACACCAT+CAACGACT  89   c_000000000001  55  45  151=       =               55    0     TGCGACGCAACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9II9I9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIII99I9-IIIIIIIIIIIIIIIIIIIIIIIII9IIIIII  NM:i:0  AM:i:0
LH00478:262:22JHYLLT4:6:2293:31038:24141  1:N:0:TACACCAT+CAACGACT  83   c_000000000001  63  45  151=       =               11    -203  AACTCCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAAT  IIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:0  AM:i:45
LH00478:262:22JHYLLT4:6:2193:33029:21423  1:N:0:TACACCAT+CAACGACT  99   c_000000000001  67  44  116=       =               67    116   CCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAG                                     IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII                                     NM:i:0  AM:i:44
LH00478:262:22JHYLLT4:6:2193:33029:21423  1:N:0:TACACCAT+CAACGACT  147  c_000000000001  67  44  116=       =               67    -116  CCAATCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAG                                     IIIIIIIIIIII9IIIIIIIIIIIIIIIIII9IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII                                     NM:i:0  AM:i:44
LH00478:262:22JHYLLT4:6:1465:2942:7612    1:N:0:TACACCAT+CAACGACT  153  c_000000000001  71  45  151=       =               71    0     TCCGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAATGACGATGC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIII  NM:i:0  AM:i:0
LH00478:262:22JHYLLT4:6:2418:9084:9923    1:N:0:TACACCAT+CAACGACT  163  c_000000000001  73  45  151=       =               102   180   CGCGGTGCCCGTGGCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAATGACGATGCCC  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:0  AM:i:45
LH00478:262:22JHYLLT4:6:1209:9084:2583    1:N:0:TACACCAT+CAACGACT  153  c_000000000001  86  45  151=       =               86    0     GCCGCGCTCCACGTCGGGTTGAGGTGGATCGAGATCGAGCCCGTGGTGCCGATGGCGTCTTTGGACGACACGTGAATGCTGCCTGTCGTCTCCACAGAGCCCGTCGAAATGTACTTGGTCGATGCAATGACGATGCCCGAGATGCTCGTGC  IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII-IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII  NM:i:0  AM:i:0

3. Read depth

File name: readDepthPerContig.txt

While the BAM file contains detailed position-by-position alignment information, a read depth statistics file provides summary metrics for each contig. Instead of storing millions of individual read alignments, it calculates useful statistics such as average read depth, the percentage of the contig covered by reads, and read depth variability. This makes it much easier to quickly assess the quality and reliability of each assembled contig.

You can explore the first lines by typing:

head readDepthPerContig.txt | column -t

This will output:

#ID             Avg_fold  Length  Ref_GC  Covered_percent  Covered_bases  Plus_reads  Minus_reads  Read_GC  Median_fold  Std_Dev
c_000000000001  10.6976   25924   0.0000  100.0000         25924          938         946          0.6895   11           4.48
c_000000000002  9.2982    20637   0.0000  100.0000         20637          651         651          0.6892   9            4.13
c_000000000003  8.5586    18718   0.0000  100.0000         18718          548         545          0.7063   8            4.27
c_000000000004  9.2294    17934   0.0000  99.8773          17912          561         559          0.6982   9            3.79
c_000000000005  8.7689    17558   0.0000  99.7095          17507          512         527          0.6759   9            3.77
c_000000000006  9.1415    17296   0.0000  100.0000         17296          544         542          0.6934   9            4.03
c_000000000007  10.0901   16779   0.0000  99.5471          16703          570         578          0.6810   9            4.60
c_000000000008  11.9507   16643   0.0000  100.0000         16643          674         680          0.6692   12           4.38
c_000000000009  10.3193   16186   0.0000  99.4131          16091          562         569          0.6972   10           4.37

Key columns:

  • Avg_fold: Average read depth of a certain contig
  • Length: Contig length in base pairs
  • Covered_percent: What percentage of the contig is covered by reads. This is a useful metric. When the assembly is correct, the reads cover the entire contig. If a segment of the contig suddenly lacks mapped reads—reducing the Covered_percent—that part is likely unreliable or an artificially assembled DNA sequence.

Question 1: Look at the coverage information for contig c_000000005985. What is the average fold coverage (Avg_fold) for this contig?

Question 2: How many bases of this contig are covered by reads (Covered_bases)?

💡 Tip: Use grep to search for this specific contig in the readDepthPerContig.txt file.

4. Contigs Database

File name: contigs.db

The contigs.db is a SQLite database created by Anvi'o to organize all analysis results in a structured, queryable format. Currently, it includes the contigs from contigs.fasta along with gene predictions or gene calls. During this session, we'll incorporate additional layers of information, such as taxonomic annotations and manual binning decisions.

You can explore the contigs database by typing:

anvi-db-info contigs.db

This will output:

DB Info (no touch)
===============================================
Database Path ................................: contigs.db                                                                                                                                                                                                             
description ..................................: [Not found, but it's OK]                                                                                                                                                                                               
db_type ......................................: contigs (variant: unknown)                                                                                                                                                                                             
version ......................................: 24                                                                                                                                                                                                                     

                                                                                                                                                                                                                                                                       
DB Info (no touch also)
===============================================
project_name .................................: DRC_Cave_A                                                                                                                                                                                                             
contigs_db_hash ..............................: hash1042bd48                                                                                                                                                                                                           
split_length .................................: 20000                                                                                                                                                                                                                  
kmer_size ....................................: 4                                                                                                                                                                                                                      
num_contigs ..................................: 8516                                                                                                                                                                                                                   
total_length .................................: 29138064                                                                                                                                                                                                               
num_splits ...................................: 8719                                                                                                                                                                                                                   
gene_level_taxonomy_source ...................: None                                                                                                                                                                                                                   
gene_function_sources ........................: None                                                                                                                                                                                                                   
genes_are_called .............................: 1                                                                                                                                                                                                                      
external_gene_calls ..........................: 0                                                                                                                                                                                                                      
external_gene_amino_acid_seqs ................: 0                                                                                                                                                                                                                      
skip_predict_frame ...........................: 0                                                                                                                                                                                                                      
splits_consider_gene_calls ...................: 1                                                                                                                                                                                                                      
scg_taxonomy_was_run .........................: 0                                                                                                                                                                                                                      
scg_taxonomy_database_version ................: None                                                                                                                                                                                                                   
trna_taxonomy_was_run ........................: 0                                                                                                                                                                                                                      
trna_taxonomy_database_version ...............: None                                                                                                                                                                                                                   
reaction_network_ko_annotations_hash .........: None                                                                                                                                                                                                                   
reaction_network_kegg_database_release .......: None                                                                                                                                                                                                                   
reaction_network_modelseed_database_sha ......: None                                                                                                                                                                                                                   
reaction_network_consensus_threshold .........: None                                                                                                                                                                                                                   
reaction_network_discard_ties ................: None                                                                                                                                                                                                                   
creation_date ................................: 1769435245.95836                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                       
* Please remember that it is never a good idea to change these values. But in some
  cases it may be absolutely necessary to update something here, and a
  programmer may ask you to run this program and do it. But even then, you
  should be extremely careful.

                                                                                                                                                                                                                                                                       
AVAILABLE GENE CALLERS
===============================================
* 'pyrodigal-gv' (32,197 gene calls)                                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                                       
AVAILABLE FUNCTIONAL ANNOTATION SOURCES
===============================================
* No functional annotations found in this contigs database :/                                                                                                                                                                                                          

                                                                                                                                                                                                                                                                       
AVAILABLE HMM SOURCES
===============================================
* This contigs db does not have HMMs :/ 

Question 3: How many contigs does this assembly contain?

Question 4: What is the total length (in bp) of all contigs combined (i.e., the size of your metagenome)?

Small recap

Here's how your data was created (don't worry, this was done for you!):

Raw Reads (DRC_cave_A sample)
    ↓
Quality Filtering & Trimming
    ↓
Metagenomic Assembly
    ↓
Mapping reads back to contigs
    ↓
Coverage calculation
    ↓
YOUR STARTING POINT: Unbinned contigs ready for manual binning in Anvi'o!