01b_Getting_ASVs_for_Meadow_Mags - esogin/seagrassOmics GitHub Wiki

Input for 16S tree

Created May 28, 2019

Updated: May 28, 2019

We are now working with two datasets - full length 16S - and a set of 114 bins that originated from the same input samples. Both datasets have been annotated using the GBTCK tool kit so we can get the same taxonomic strings for the input data.

Expected problem: Often times there are more full length ASVs associated to the same taxonomy. e.g., for the tax string: d__Bacteria;p__Desulfobacterota;c__Desulfobacteria;o__Desulfobacterales;f__BuS5;g__;s__ there are ~1000 ASV tags that correspond to that taxa string. 😒

Working solution: Take the Fasta sequence from each of the ASVs that we can identify with the same string. Using the Illumina libraries, map the cleaned library reads to the 16S ASV sequences using a high identity threshold. Determine which ASV tag is located in what habitat to try to narrow down the number of sequences needed for the 16S tree(s).

To do:

Match all taxonomy strings of meadow bins to ASV tags
Get fasta sequences of ASVs of interest
Map library reads to new Fasta file
Plot results

Data processing notes:

1. Match all taxonomy strings of meadow bins to ASV tags ✅

Taxonomic string of bins is likely less complete then of the 16S string (16S is full length from PacBio, more accurate prediction)- because of this here are some searching rules:

Look for taxonomic string at lowest level of bin (e.g., if genius is classified, look there first). -> if hit, copy ASV and list bin -> If no hit, look for next taxonomic rank, e.g., family, and assign bin/asv ID only if there is no genus reported. The logic: the bin got a genus level rank, but I can't find it in the ASV, perhaps the ASV is incorrectly classified.
If searching at the family level - and no genus is listed for bin, select all genuses at that family level

2. Get Fasta of ASVs of interest ✅

First I copied unique ASV.fa file to Cologne & list of target ASVs identified in 1.
Then I fixed the headers of the ASVs file so I can subset with my list sed 's/;.*//g' ASVs.fa > ASVs_fixed.fa
Subset the Fasta file with the asvs of interest seqtk subseq ASVs_fixed.fa target_asvs > meadow_asvs.fa
Check that we got the right number of sequences grep '>' meadow_asvs.fa | wc -l and 'wc -l target_asvs'

Good news! Both numbers match we can move to step 3!

3. Map cleaned sequences of metagenomes to the ASV Fasta file

This uses BBMap with a mini of 0.99. Hopefully we get hits!

bbmap.sh in=$in1 in2=$in2 ref=$ref minid=0.99 covstats="$lib"_COVSTATS scafstats="$lib"_scafstats.txt statsfile="$lib"_stderr