02_Gene Catalog - esogin/seagrassOmics GitHub Wiki

Gene Catalog Approach

Maggie Sogin

Created: May 19, 2019

Updated: July 19, 2019

The goal of this approach is to created a gene catalog that can be used to map our libraries against in order to address the overall question: are there more sugar related genes inside then outside a seagrass meadow.

This page is divided into several parts:

Dereplicate sequences ✅ , cluster sequences ✅, call genes ✅
Annotation ✅
Remapping ✅
Getting accurate featureCounts

1. Dereplicate sequences, cluster sequences, call genes ✅

Before running these scripts, the assembles of all the individual libraries and the coassembly were combed into a single Fasta file. It was important to rename all the Fasta headers to avoid duplicates in the file names. Because there could be several of the same assembly, we first dereplicated the sequences using search.

vsearch --derep_fulllength allcontigs_renamed.fa --output clusters/rep_set.fa  --relabel group --relabel_keep

Next, we wanted to cluster sequences together using vsearch uclust algorithm. However, this never completed in a reasonable time frame likely because we simply have too much data at hand. We choose to proceed instead with using mmseqs to cluster sequences based on proteins (https://github.com/soedinglab/MMseqs2). This was quite fast!

mmseqs easy-linclust clusters/rep_set.fa clusters tmp

From the clustered sequences, we called genes using prodigal.

prodigal -i clusters/clusters_rep_seq.fasta -o genes/coords.gbk  -a genes/orfs.faa  -p meta -d genes/genes.fa

UPDATE Because I want to feed data into feature counts, needed to get a off file, I have already done that annotation and mapping with Cazymes from the previous prodigal call. I re-ran prodigial as below.

prodigal -i clusters/clusters_rep_seq.fasta -o genes/coords.gff  -a genes/orfs.faa  -p meta -d genes/genes.fa -f gff

I checked one sequence to make sure it was the same (with the same file headers) - now will compare files using:

cmp --silent $old $new || echo "files are different"

Result: files are different. Re-do annotation and mapping steps starting with 2.1.

2. Annotation

This I am struggling with the most. I am currently running Prokka, but I worry this will take a long time.

A meta genome prokka call prokka clusters_rep_seq.fasta --outdir result --metagenome --cpus 48 --mincontiglen 500

UPDATE: Prokka failed to annotate CDS, see Prokka page for trouble shooting Prokka hits

Instead, I would like to implement the CAZy database comparison. However, run dbcan fails on my dataset (too big), the hmmscaner appears to have worked but I am a bit concerned about the error message in the log file (Error: Unexpected error 6 in opening hmm file caZy/dbCAN-fam-HMMs.txt.)

Next steps in getting annotations: (My Plan)

2.1. Try to re-implement the hmmscan option with the cazymes database. Start from scratch. 2.2. Also compare gene library to other database (e.g., transporter database)

After I have the annotations: 2.3. Assign gene calls to fasta headers in gene library (help???)

2.1 re-implement annotations with cazymes

Download and format cazyme database from stratch. Use the "old" database (dbcan2 download)form the web page and follow instructions here - dbCAN2 readme

The database used is

dbCAN HMMdb release 7.0 08/25/2018

After running, parse output to only include values that have E-value < 1e-18 and coverage > 0.35 as suggested for bacteria. Different enzyme classes also have different best thresholds.

** We have also performed evaluation for the five CAZyme classes separately, which suggests that the best threshold varies for different CAZyme classes (please see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4132414/ for details). Basically to annotate GH proteins, one should use a very relax coverage cutoff or the sensitivity will be low (Supplementary Tables S4 and S9); (ii) to annotate CE families a very stringent E-value cutoff and coverage cutoff should be used; otherwise the precision will be very low due to a very high false positive rate (Supplementary Tables S5 and S10)

Follow these instructions: http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-old@UGA/readme.txt

general call: finished:

hmmscan -cpu 48 --domtblout catalog.out.dm - db/dbCAN-HMMdb-V7.txt orfs.faa > catalog.out

After running the hmmscan, use the parser to get the data of interest

./hmmscan_parsher.sh cazy.out.dm > parsed

rename headers (of gene file!!) with cazy annotations
remap libraries to gene file with cazy annotations
run featurecounts

Renaming fasta headers in gene file

fix fasta headers for matching

 sed 's/#.*//g' genes.fa > genes.fixed.fa

Get key out from cazy out put formant needs to be in: key_name keyname_GENECALL

cut -f 3,1 cazy.out.parsed > names.txt
awk '{ print $2 " " $1}' names.txt > temp
sed 's/ /\t/g' temp > names.fixed.txt
#concatenates names and annotation 
awk '{print $0, $1 "_" $NF}' names.fixed.txt  > temp 
# adds tab between fields
sed 's/ /\t/g' temp > temp2 
cut -f 1,3 temp > new_headers.txt

rename fast headers with seqkit https://github.com/shenwei356/seqkit

Check first it will work for sequences of interest

cut -f 1 names.fixed.txt > name.list
seqtk subseq genes.fixed.fa name.list > cazy_seqs.fa
seqkit replace -p '(.+)$' -r '{kv}' -k new_headers.txt cazy_seqs.fa --keep-key > cazy_seqs.out.fa

Re-apply to entire gene catalog

seqkit replace -p "(group)" -r "{kv}_$2" -k new_headers.txt  genes.fixed.fa --keep-key  > genes_cazy_annotation.fa

Re-map cleaned libraries to gene file with cazy annotations

cd /scratch/sogin/tmp.$JOB_ID/maps/
libs=$(echo 3847_{A..I})
for lib in $libs; do
    cp ${lib}_highfreq_kmers_1.00.0_0.cor.fastq.gz;
    cp ${lib}_highfreq_kmers_2.00.0_0.cor.fastq.gz;
    bbmap.sh in=${lib}_highfreq_kmers_1.00.0_0.cor.fastq.gz in2=${lib}_highfreq_kmers_2.00.0_0.cor.fastq.gz ref=genes_cazy_annotation.fa covstats="$lib"_covstats out="$lib".bam scafstats="$lib"_scafstats.txt statsfile="$lib"_stderr;
done
rm *fastq.gz

Update: July 10, 2019 Note: The above for-loop is taking a) a long time and b) I think the wrong files. We want to map all the raw reads to the dataset, not just the normalized ones.

Here is my alternative approach:

Template script:

mkdir /scratch/sogin/tmp.$JOB_ID/ -p; 
rsync -a /opt/extern/bremen/symbiosis/sogin/Data/SedimentMG/processed_reads/libraries/library_3847/gene_catalog2/cazymes/maps_nn/ /scratch/sogin/tmp.$JOB_ID/
cd /scratch/sogin/tmp.$JOB_ID/
#
echo "map to catalog";
#
#cd /scratch/sogin/tmp.$JOB_ID/maps/
libs=LIB
cp /opt/extern/bremen/symbiosis/sogin/Data/SedimentMG/processed_reads/libraries/library_3847/$lib/qc/"$lib"_q2_ktrimmed.fq.gz ./;
bbmap.sh in=${lib}_q2_ktrimmed.fq.gz ref=genes_cazy_annotation.fa covstats="$lib"_covstats out="$lib".bam scafstats="$lib"_scafstats.txt statsfile="$lib"_stderr;
rm *fastq.gz
rm genes_cazy_annotation.fa
rm -r ref/
#
mkdir /opt/extern/bremen/symbiosis/sogin/Data/SedimentMG/processed_reads/libraries/library_3847/gene_catalog/cazymes/maps_nn/"$lib"/
rsync -a /scratch/sogin/tmp.$JOB_ID/ /opt/extern/bremen/symbiosis/sogin/Data/SedimentMG/processed_reads/libraries/library_3847/gene_catalog/cazymes/maps_nn/"$lib"/;
rm /scratch/sogin/tmp.$JOB_ID -R;
echo "job finished: "
date

4. featureCounts to get read counts per library

Feature counts was easily downloaded into tools directory and should be useable (v 1.6.4).

might need to annotate GTF file with annotation entires from cazyme hmmscan runs.

Run with a test (reduced input dataset)

#Remove additional information not needed
sed '/# .*/d' test.gff > test2.gff
featureCounts -T 12 -t CDS -g ID -a test2.gff -o counts.txt 3847_A.bam

error message:

featureCounts: input-files.c:2889: SAM_pairer_get_next_read_BIN: Assertion `l_name < 256' failed.'

Try sorting and indexing bam files first

samtools sort 3847_A.bam > 3847_A_sorted.bam
samtools index 3847_A_sorted.bam

#run feature counts
featureCounts -T 12 -t CDS -g ID -a test2.gff -o counts.txt 3847_A_sorted.bam

still failed:

featureCounts: input-files.c:2889: SAM_pairer_get_next_read_BIN: Assertion `l_name < 256' failed.

Try to use entire catalog instead of just test run

sed '/# .*/d' coords.gff > coords_for_featureCounts.gff
featureCounts -T 12 -t CDS -g ID -a coords_for_featureCounts.gff -o counts.txt 3847_A_sorted.bam

I believe I have the wrong mapping files, files were mapped to catalog with gene calls associated, gff coordinations file does not have correct names. Re-do mapping to non-annotated gene catalog:

Redo mapping template script:

lib=LIB;
cp /opt/extern/bremen/symbiosis/sogin/Data/SedimentMG/processed_reads/libraries/library_3847/$lib/qc/"$lib"_q2_ktrimmed.fq.gz ./;
bbmap.sh in=${lib}_q2_ktrimmed.fq.gz ref=genes.fa covstats="$lib"_covstats out="$lib".bam scafstats="$lib"_scafstats.txt statsfile="$lib"_stderr;

This wasn't the wrong mapping file program but really related to how feature counts is made.

I found this thread: https://groups.google.com/forum/#!topic/subread/do5lV0OBJBk

I reinstalled featureCounts (1.6.1) from the GitHub repo and tried to change the required line. Update: still didn't not work.

Gene catalog is likely too big to work effectively with. This approach is reworked in 02B_gene catalog 2.0

##END