Combining taxonomic annotations across MAGs, contigs and reads

Authors: Robyn Wright
Modifications by: NA
Based on initial versions by: NA

Please note: We are still testing/developing this, so use with caution :)

This workflow is even more experimental/in development than most other sections of this Wiki. The aim is to combine taxonomic profiles across MAGs, contigs and reads for each sample, and it assumes that you have already assembled contigs and constructed MAGs from your samples. Analyses are often carried out separately for MAGs and for reads, with neither really capturing the whole microbiome: many reads can't be classified because they belong to novel taxa, and although MAGs are easier to classify because they incorporate more information, many reads won't be binned into MAGs, particularly if they belong to rare taxa. This issue is described in this paper, but we wanted to apply this kind of pipeline using the tools that we usually use. It follows the general workflow of:

  1. Run general QC on samples
  2. Assemble contigs and MAGs from samples
    1. Classify MAGs using GTDB-Tk (see the example command after this list)
  3. Remove reads that map to either contigs or MAGs from files (and get counts of reads that map to MAGs/contigs)
  4. Classify the remaining reads as well as the contigs taxonomically using Kraken 2
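For step 2.1, the GTDB-Tk command isn't shown elsewhere on this page, so a minimal sketch of the sort of command we'd run is below. It assumes the MAG FASTA files have been collected into a mags/ folder (as in the Python code further down) and that the GTDB-Tk reference data have already been downloaded; the output directory name gtdbtk_out matches what is read in step 3 of this workflow.

conda activate gtdbtk # hypothetical environment name
# classify_wf assigns GTDB taxonomy to each genome in mags/; note that newer GTDB-Tk
# versions may also require --skip_ani_screen or --mash_db
gtdbtk classify_wf --genome_dir mags/ --extension fa --out_dir gtdbtk_out --cpus 24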

The classification of the reads/contigs is also a multi-step process, and one that we're still refining:

  1. Filter the reads with a large database that contains genomes from a wide span of the tree of life (we use one constructed from NCBI RefSeq V214 genomes - all archaeal/bacterial/viral genomes and only complete eukaryotic genomes) - this will separate them into:
    1. Unclassified reads
    2. Bacterial reads
    3. Archaeal reads
    4. Eukaryotic reads
    5. Viral reads
  2. Take the unclassified + archaeal + bacterial reads/contigs and run them through the Kraken 2 GTDB database
  3. Take the eukaryotic reads/contigs and run them through the Kraken 2 micro-eukaryote database
    • Optional: Run GeCoCheck on the GTDB/MicroEuk output
  4. Run Bracken on the GTDB/MicroEuk classified reads
  5. Collate all taxonomic classifications for MAGs/contigs/reads

1. Get reads that map to MAGs/contigs

  • Get all contig names that are included in MAGs
  • Filter contigs file to those that are included in MAGs and those that aren't included in MAGs
  • Go through all of the BAM files and record which contigs each read maps to. For each read:
    • Keep only alignments with mapping quality >= 30
    • If a read maps to multiple contigs, keep the best hit (highest mapping quality)
  • For each mapped read, store the best contig and whether that contig belongs to a MAG or only to the unbinned contigs

Python commands used:

#conda activate GeCoCheck-v0.0.5
import os
from Bio import SeqIO
import pickle
import pysam
from collections import defaultdict

# Copy the contigs FASTA file for each MAG from the anvi'o summary into a mags/ folder,
# renaming each file to MAG<bin number>.fa
mags = os.listdir('anvio/FINAL_2500_filtered_wtax/summary/bin_by_bin/')
for mag in mags:
  contigs = 'anvio/FINAL_2500_filtered_wtax/summary/bin_by_bin/'+mag+'/'+mag+'-contigs.fa'
  new_fn = 'MAG'+mag.split('_')[-1]
  os.system('cp '+contigs+' mags/'+new_fn+'.fa')

# Get the names of all contigs that are included in MAGs
mags = os.listdir('mags/')
all_contig_names = []
mag_contig = {}
for mag in mags:
  this_mag_contigs = []
  for record in SeqIO.parse('mags/'+mag, "fasta"):
    this_mag_contigs.append(record.id)
    all_contig_names.append(record.id)
  mag_contig[mag.replace('.fa', '')] = set(this_mag_contigs)

# Reverse lookup: contig name -> MAG name
contig_mag = {}
for mag in mag_contig:
  for cntg in mag_contig[mag]:
    contig_mag[cntg] = mag

all_contig_names = set(all_contig_names)

# Split the full assembly into contigs that are included in MAGs and contigs that aren't
contigs = 'anvio/megahit_out/final.contigs.fixed.fa'
contigs_in_mags, contigs_not_in_mags = [], []
contigs_not_mags_names = []

for record in SeqIO.parse(contigs, "fasta"):
  if record.id in all_contig_names:
    contigs_in_mags.append(record)
  else:
    contigs_not_in_mags.append(record)
    contigs_not_mags_names.append(record.id)
    
SeqIO.write(contigs_in_mags, "contigs/contigs_in_mags.fasta", "fasta")
SeqIO.write(contigs_not_in_mags, "contigs/contigs_not_in_mags.fasta", "fasta")

contigs_not_mags_names = set(contigs_not_mags_names)

samples = ['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12']
for sample in samples:
  print(sample)
  bam_f = 'anvio/bam_files/'+sample+'.bam'
  # Collect all alignments with mapping quality >= 30 for each read
  reads = defaultdict(list)
  with pysam.AlignmentFile(bam_f, "rb") as bam_file:
    for read in bam_file:
      if read.mapping_quality >= 30:
        reads[read.query_name].append([read.reference_name, read.mapping_quality])
  # For reads with multiple alignments, keep the contig with the highest mapping quality
  best_reads = {}
  for read in reads:
    if len(reads[read]) == 1:
      best_name = reads[read][0][0]
    else:
      best_hit, best_name = 0, ''
      for hit in reads[read]:
        if hit[1] > best_hit:
          best_hit = hit[1]
          best_name = hit[0]
    # Record whether the best contig is part of a MAG or only an unbinned contig
    if best_name in contig_mag:
      best_reads[read] = [best_name, contig_mag[best_name]]
    else:
      best_reads[read] = [best_name, 'Contig']
  # Write read name, best contig and MAG name (or 'Contig') to a per-sample file
  with open('read_mapping/'+sample+'.txt', 'w') as f:
    for read in best_reads:
      f.write(read+'\t'+best_reads[read][0]+'\t'+best_reads[read][1]+'\n')

Next, we filter the read files to reads that are in contigs (or MAGs) and reads that aren't in contigs:

from Bio import SeqIO

samples = ['D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12']
for sample in samples:
  reads = set()
  for row in open('read_mapping/'+sample+'.txt', 'r'):
    reads.add(row.split('\t')[0])
  reads_contigs_mags, reads_no_contigs = [], []
  for record in SeqIO.parse('joined_reads/'+sample+'.fq', "fastq"):
    if record.id in reads:
      reads_contigs_mags.append(record)
    else:
      reads_no_contigs.append(record)
  print(sample, len(reads_contigs_mags), len(reads_no_contigs))
  SeqIO.write(reads_contigs_mags, "read_mapping/"+sample+"_in_contigs.fastq", "fastq")
  SeqIO.write(reads_no_contigs, "read_mapping/"+sample+"_not_in_contigs.fastq", "fastq")
 

Get the counts of reads in contigs/MAGs:

#conda activate kraken2-v2.17.1
import os
import pandas as pd
from collections import defaultdict

# Count how many reads mapped to each MAG (and to each unbinned contig) per sample
files = [f for f in os.listdir('read_mapping/') if '.txt' in f]
count_list = []
for f in files:
  counts = defaultdict(int)
  total_count = 0
  for row in open('read_mapping/'+f, 'r'):
    this_row = row.replace('\n', '').split('\t')
    if this_row[2] == 'Contig':
      # Read mapped to an unbinned contig: count it against the contig name
      counts[this_row[1]] += 1
      total_count += 1
    else:
      # Read mapped to a contig within a MAG: count it against the MAG name
      counts[this_row[2]] += 1
      total_count += 1
  print(f, total_count)
  this_counts = pd.DataFrame.from_dict(counts, orient='index', columns=[f.replace('.txt', '')])
  # Merge the per-sample counts into a single table
  if isinstance(count_list, list):
    count_list = this_counts
  else:
    count_list = pd.concat([count_list, this_counts]).fillna(value=0)
    count_list = count_list.groupby(by=count_list.index).sum()

count_list.to_csv('MAG_contig_read_counts.csv')

At the end of this, the files that we will be using are:

  • Contigs that weren't included in MAGs
  • Reads that weren't included in contigs (and therefore not in MAGs)

2. Classify taxonomy

Classify with Kraken 2 + NCBI RefSeq V214

conda activate kraken2-v2.17.1

mkdir kraken2_outraw
mkdir kraken2_kreport
    
parallel -j 2 --eta 'kraken2 --db /bigpool/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
    --threads 12 --output kraken2_outraw/{/.}.kraken --report kraken2_kreport/{/.}.kreport \
    --use-names --confidence 0 --memory-mapping {}' \
    ::: read_mapping/*.fastq
    
kraken2 --db /bigpool/ramdisk/RefSeqV214_5Oct2022_AllBacArcVir_CompEuk_KrakOther/ \
    --threads 24 --output kraken2_outraw/contigs_not_in_mags.kraken --report kraken2_kreport/contigs_not_in_mags.kreport \
    --use-names --confidence 0 --memory-mapping contigs/contigs_not_in_mags.fasta
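If you want to sanity-check how the reads were split across the unclassified/bacterial/archaeal/eukaryotic/viral categories before extracting them, the domain-level lines of the Kraken 2 reports can be pulled out directly. A quick example is below; the file name is just illustrative, and column 4 of a kreport is the rank code (U = unclassified, D = domain):

awk -F'\t' '$4=="U" || $4=="D"' kraken2_kreport/D1_not_in_contigs.kreport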

Use KrakenTools to filter the read files

conda activate GeCoCheck-v0.0.5
cp /home/shared/IMR_shared/KielReeseMGS_RunNS194_analysis/extract_kraken_reads.py .

mkdir seqs_for_gtdb_classification

#get bacterial and archaeal reads
parallel -j 13 --progress --eta 'python extract_kraken_reads.py -k kraken2_outraw/{/.}.kraken -s {} -t 2 2157 -o seqs_for_gtdb_classification/{/.}.fasta -r kraken2_kreport/{/.}.kreport --include-children --append' ::: read_mapping/*.fast*

#get unclassified reads
parallel -j 13 --progress --eta 'python extract_kraken_reads.py -k kraken2_outraw/{/.}.kraken -s {} -t 0 -o seqs_for_gtdb_classification/{/.}_unclassified.fasta -r kraken2_kreport/{/.}.kreport --append' ::: read_mapping/*.fast*

You should now have, for each input file, a FASTA containing the bacterial + archaeal reads and a separate FASTA containing the unclassified reads.
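The eukaryotic reads/contigs (step 3 of the classification overview above) aren't used further on this page, but they could be pulled out in the same way for classification against the micro-eukaryote database. A sketch based on the commands above, using the Eukaryota taxid (2759) and a hypothetical output folder name:

mkdir seqs_for_microeuk_classification

#get eukaryotic reads
parallel -j 13 --progress --eta 'python extract_kraken_reads.py -k kraken2_outraw/{/.}.kraken -s {} -t 2759 -o seqs_for_microeuk_classification/{/.}.fasta -r kraken2_kreport/{/.}.kreport --include-children --append' ::: read_mapping/*.fast*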

Classify unclassified + bacteria + archaea with GTDB

conda activate kraken2-v2.17.1

mkdir kraken2_outraw_gtdb
mkdir kraken2_kreport_gtdb

parallel -j 2 --eta --progress 'kraken2 --db /bigpool/ramdisk/k2_gtdb_genome_reps_20250609/ \
    --threads 12 --output kraken2_outraw_gtdb/{/.}.kraken --report kraken2_kreport_gtdb/{/.}.kreport \
    --use-names --confidence 0 --memory-mapping {}' \
    ::: seqs_for_gtdb_classification/*.fasta
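If you extracted eukaryotic sequences as sketched earlier, they would be classified in the same way against the micro-eukaryote database (the database path here is a placeholder; substitute the location of your MicroEuk Kraken 2 database):

mkdir kraken2_outraw_microeuk
mkdir kraken2_kreport_microeuk

parallel -j 2 --eta --progress 'kraken2 --db /path/to/MicroEuk_kraken2_db/ \
    --threads 12 --output kraken2_outraw_microeuk/{/.}.kraken --report kraken2_kreport_microeuk/{/.}.kreport \
    --use-names --confidence 0 --memory-mapping {}' \
    ::: seqs_for_microeuk_classification/*.fasta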

Run Bracken:

mkdir bracken_out_gtdb

parallel -j 12 'bracken -d /bigpool/shared/kraken_databases/k2_gtdb_genome_reps_20250609/ \
    -i {} -o bracken_out_gtdb/{/.}.genus.bracken \
    -r 150 -l G -t 10' \
    ::: kraken2_kreport_gtdb/*.kreport
    
rm kraken2_kreport_gtdb/*_bracken_genuses.kreport

parallel -j 12 'bracken -d /bigpool/shared/kraken_databases/k2_gtdb_genome_reps_20250609/ \
    -i {} -o bracken_out_gtdb/{/.}.phylum.bracken \
    -r 150 -l P -t 10' \
    ::: kraken2_kreport_gtdb/*.kreport
    
rm kraken2_kreport_gtdb/*_bracken_phylums.kreport

parallel -j 12 'bracken -d /bigpool/shared/kraken_databases/k2_gtdb_genome_reps_20250609/ \
    -i {} -o bracken_out_gtdb/{/.}.species.bracken \
    -r 150 -l S -t 10' \
    ::: kraken2_kreport_gtdb/*.kreport
    
rm kraken2_kreport_gtdb/*_bracken_species.kreport
rm bracken_out_gtdb/contigs_not_in_mags*

mkdir bracken_out_merged_gtdb

combine_bracken_outputs.py --files bracken_out_gtdb/*.genus.bracken -o bracken_out_merged_gtdb/merged_output.genus.bracken
combine_bracken_outputs.py --files bracken_out_gtdb/*.phylum.bracken -o bracken_out_merged_gtdb/merged_output.phylum.bracken
combine_bracken_outputs.py --files bracken_out_gtdb/*.species.bracken -o bracken_out_merged_gtdb/merged_output.species.bracken

3. Combine taxonomy for reads and MAGs

import pandas as pd
import os
from Bio import SeqIO
import math

# Per-sample counts of reads mapped to each MAG/contig, and the merged Bracken species table for the unmapped reads
mc_counts = pd.read_csv('MAG_contig_read_counts.csv', index_col=0, header=0)
bracken = pd.read_csv('bracken_out_merged_gtdb/merged_output.species.bracken', index_col=0, header=0, sep='\t')

read_counts = {}
fastq = os.listdir('joined_reads/')
for f in fastq:
  print(f)
  count = 0
  for record in SeqIO.parse('joined_reads/'+f, "fastq"):
    count += 1
  read_counts[f.replace('.fq', '')] = count
  
read_counts_df = pd.DataFrame({k: [v] for k, v in read_counts.items()}, index=['Reads'])
read_counts_df.to_csv('Joined_reads_counts.csv')

# Keep only the per-sample read-count columns (those containing 'bracken_num') from the merged
# Bracken output and rename them to the sample names (the part of the column name before the first '_')
bracken_rename, bracken_keeping = {}, ['taxonomy_id']
for col in bracken.columns:
  if 'bracken_num' in col:
    bracken_keeping.append(col)
    bracken_rename[col] = col.split('_')[0]

bracken = bracken.loc[:, bracken_keeping].rename(columns=bracken_rename)
bracken['Reads_Contigs_MAGs'] = 'Reads'
bracken = bracken.reset_index()
  
# Parse the Kraken 2 GTDB output for the unbinned contigs: contig name -> [taxon name, taxid]
kraken_contigs = {}
for row in open('kraken2_outraw_gtdb/contigs_not_in_mags.kraken', 'r'):
  this_row = row.replace('\n', '').split('\t')
  tid = this_row[2].split('(taxid ')
  kraken_contigs[this_row[1]] = [tid[0].strip(), tid[1].replace(')', '')]

# Get the lowest classified taxonomic level for each MAG from the GTDB-Tk summaries
gtdb_mags = {}
bac_mags = pd.read_csv('gtdbtk_out/gtdbtk.bac120.summary.tsv', index_col=0, header=0, sep='\t')
arc_mags = pd.read_csv('gtdbtk_out/gtdbtk.ar53.summary.tsv', index_col=0, header=0, sep='\t')
mags = pd.concat([bac_mags, arc_mags])
for row in mags.index.values:
  # e.g. 'd__Bacteria;p__...;s__' -> last non-empty rank name
  tax = mags.loc[row, 'classification'].split('__')
  tax = [t.split(';')[0] for t in tax]
  tax = [t for t in tax if t != '']
  gtdb_mags[row] = tax[-1]

# Add the taxon name (and taxid, for contigs) to each MAG/contig row of the counts table
mc_counts['name'] = ''
mc_counts['taxonomy_id'] = ''
for row in mc_counts.index.values:
  if row in kraken_contigs:
    mc_counts.loc[row, 'name'] = kraken_contigs[row][0]
    mc_counts.loc[row, 'taxonomy_id'] = kraken_contigs[row][1]
  elif row in gtdb_mags:
    mc_counts.loc[row, 'name'] = gtdb_mags[row]

# Index both tables by taxon name; keep the original MAG/contig identifier in the 'Reads_Contigs_MAGs' column
mc_counts = mc_counts.reset_index()
mc_counts = mc_counts.set_index('name')
mc_counts = mc_counts.rename(columns={'index':'Reads_Contigs_MAGs'})
bracken = bracken.set_index('name')

mc_counts_orig = pd.read_csv('MAG_contig_read_counts.csv', index_col=0, header=0)
mc_counts_orig.loc['Sum', :] = mc_counts_orig.sum(axis=0)

bracken_sum = bracken.loc[:, ['D'+str(a) for a in range(1,13)]].sum(axis=0)

samples = ['D'+str(a) for a in range(1,13)]
#account for the counts in contigs and MAGs being paired not single reads
for sample in samples:
  mc_counts[sample] = mc_counts[sample]*2

sample_sums = {}
for sample in samples:
  sample_sums[sample] = [sum(mc_counts[sample]), sum(bracken[sample])]

all_counts = pd.concat([mc_counts, bracken])

for sample in samples:
  all_counts.loc['Total', sample] = sum(sample_sums[sample])
  all_counts.loc['Unclassified', sample] = read_counts[sample]-sum(sample_sums[sample])
  all_counts.loc['Mapped to MAGs/contigs', sample] = sample_sums[sample][0]
  all_counts.loc['Reads', sample] = sample_sums[sample][1]
  
all_counts.to_csv('Reads_Contigs_MAGs_counts.csv')

all_counts = all_counts.drop(['Total', 'Unclassified', 'Mapped to MAGs/contigs', 'Reads'])
all_counts = all_counts.rename(index={'':'Unclassified'})

# Collapse taxa that were classified in more than one of reads/contigs/MAGs into a single row per taxon
taxa = list(set(list(all_counts.index.values)))
tax_list = list(all_counts.index.values)
new_df = []
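# Each row appended to new_df has the layout:
# [taxon name, 'Reads' flag, contig names (';'-joined), MAG names (';'-joined),
#  number of contigs, taxonomy_id, D1 ... D12 counts]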
for tax in taxa:
  if tax_list.count(tax) == 1:
    if all_counts.loc[tax, 'Reads_Contigs_MAGs'] == 'Reads':
      part_list = [tax, 'Reads', '', '', 0]
    elif 'MAG' in all_counts.loc[tax, 'Reads_Contigs_MAGs']:
      part_list = [tax, '', '', all_counts.loc[tax, 'Reads_Contigs_MAGs'], 0]
    else:
      part_list = [tax, '', all_counts.loc[tax, 'Reads_Contigs_MAGs'], '', 1]
    new_df.append(part_list+list(all_counts.loc[tax, ['taxonomy_id']+['D'+str(a) for a in range(1,13)]]))
  else:
    tid = list(set(list(all_counts.loc[tax, 'taxonomy_id'].values)))
    this_tax = [tax, '', '', '', 0, tid[0]]
    mags, contigs = [], []
    for a in range(1,13):
      this_tax.append(sum(all_counts.loc[tax, 'D'+str(a)].values))
    for val in all_counts.loc[tax, 'Reads_Contigs_MAGs'].values:
      if val == 'Reads':
        this_tax[1] = 'Reads'
      elif 'MAG' in val:
        mags.append(val)
      else:
        contigs.append(val)
    this_tax[2] = ';'.join(contigs)
    this_tax[3] = ';'.join(mags)
    this_tax[4] = len(contigs)
    new_df.append(this_tax)