02_Gene Catalog 2.0 - esogin/seagrassOmics GitHub Wiki
Reformulating the Gene Catalog Approach
Maggie Sogin
Created: July 23, 2019
Updated: August 5, 2019
Note: sequences used to create the gene catalog were first dereplicated and clustered before moving on. Briefly, here are the commands used to initiate this process, as implemented in the initial gene catalog scripts.
vsearch --derep_fulllength allcontigs_renamed.fa --output clusters/rep_set.fa --relabel group --relabel_keep
mmseqs easy-linclust clusters/rep_set.fa clusters tmp
##1. Remove short sequences from clustered rep set of sequences
reformat.sh in=clusters_rep_seq.fasta minlength=500 out=clusters_rep_seq_500bp.fasta
reformat.sh in=clusters_rep_seq_500bp.fasta out=clusters_rep_seq_500bp_fixed.fasta addunderscore
Number of sequences in: 73,357,971; number of sequences out: 12,599,825
##2. Call prodigal
prodigal -i clusters_rep_seq_500bp.fasta -o genes/coords.gff -a genes/orfs.faa -p meta -d genes/genes.fa -f gff;
number of proteins: 21,077,130 (25% of original protein file!)
##3. Get a non-redundant gene catalog.
There are several different approaches to doing this:
- blat: I tried this first but received a segmentation fault. I believe I needed more memory. Will think of strategies to implement it, since I am not sold yet on the other two options.
- cd-hit: It is running but will likely take a long time to finish.
- mmseqs: Ran very quickly!
Here are my notes. Caveat: all published gene catalogues do this step with either BLAT or cd-hit. cd-hit will take a while...
cd-hit-est -i genes/genes.fa -o genes/nr_genes.fa -c 0.95 -T 12 -aS 0.9 -n 8 -M 0 -G 0;
--> cd-hit slow, try working with blat again
blat genes.fa genes.fa -minIdentity=95 -t=dna -tileSize=11 -stepSize=10 output.psl
--> Blat ran out of memory on himem machine.
Try reclustering a second time with mmseqs to get non-redundant genes out.
mmseqs easy-linclust genes/genes.fa clusters tmp
--> mmseqs worked very quickly (< 2 mins)
Nr. sequences in: 21,077,130
Nr. sequences out: 17,947,592
Get rep sequence headers and subset protein file
grep ">" clusters_rep_seq.fasta | sed 's/^>//' | cut -d' ' -f1 > rep_seq_headers
- subset protein sequences with the list of fasta headers
seqtk subseq genes/orfs.faa rep_seq_headers > orfs_rep_seqs.fa
Nr. seqs: 17947592
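One gotcha worth recording: seqtk subseq expects bare sequence names in the name list, without the ">" prefix or the prodigal description fields. A toy check of that convention (file contents are made up, and the awk step stands in for seqtk so the sketch runs without it installed):

```shell
# Toy protein file with prodigal-style headers (names followed by descriptions).
cat > toy_orfs.faa <<'EOF'
>gene1_1 # 1 # 99 # 1
MKV
>gene2_1 # 100 # 199 # 1
MLL
EOF

# Bare names, no ">" and no description -- the format seqtk subseq expects.
grep ">" toy_orfs.faa | sed 's/^>//' | cut -d' ' -f1 | head -1 > keep_names

# Emulate seqtk subseq with awk: keep records whose name is in the list.
awk 'NR==FNR { keep[$1]; next }
     /^>/    { p = (substr($1, 2) in keep) }
     p' keep_names toy_orfs.faa > toy_subset.faa
```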
##4. Annotate
New idea for annotation pipeline:
4a. Annotate with emapper.py
It appears this is also a very memory intensive program. One solution: split up files into chunks and then merge.
Try splitting up the file
split -l 200000 -a 3 -d orfs_rep_seqs.fa input_file.chunk
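One caveat with split -l: if any FASTA records wrap their sequence over multiple lines, a fixed line count can cut a record in half. A record-boundary-aware split, sketched with toy data and a chunk size of 2 records (awk only, no extra tools assumed):

```shell
# Toy multi-line FASTA: seq2 wraps over two sequence lines.
cat > toy.fa <<'EOF'
>seq1
ACGT
>seq2
GGCC
AATT
>seq3
TTTT
EOF

# Start a new chunk file only at a header line, every n records.
awk -v n=2 '/^>/ { if (count % n == 0) file = sprintf("toy.chunk%03d.fa", count / n); count++ }
            { print > file }' toy.fa
```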
# generate all the commands that should be distributed in the cluster
for i in *chunk*; do python2 ./../emapper.py -i $i -m diamond --no_annot --no_file_comments --cpu 24 -o $i; done
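To actually distribute the chunks on the cluster rather than running the loop serially, one option is to write one command per line and hand the file to an array job. A sketch (the chunk names are toy stand-ins; the flags mirror the loop above):

```shell
# One emapper command per line, for a scheduler (e.g. a SLURM/SGE array job) to
# pick up. Chunk names here are placeholders for the real split output.
: > emapper_cmds.txt
for i in input_file.chunk000 input_file.chunk001 input_file.chunk002; do
  echo "python2 ./../emapper.py -i $i -m diamond --no_annot --no_file_comments --cpu 24 -o $i" >> emapper_cmds.txt
done
```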
still needs to be done:
cat *.chunk*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10
4b. Select genes with EC numbers matching CAZymes
4c. Get genes out of the catalog and re-annotate with run_dbcan
4d. If feeling ambitious, annotate with Pfam categories
4e. Annotate with KEGG (this is actually taken care of in the emapper results)
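For the CAZyme selection step, one possible sketch is to filter the emapper annotations table on EC number and feed the matching gene IDs back through seqtk. Both the column position of the EC field and the use of 3.2.1.- (glycoside hydrolases) as a rough CAZyme proxy are assumptions here, not the run_dbcan definition:

```shell
# Toy stand-in for an emapper .annotations table; the real column order depends
# on the emapper version, so check the file header before relying on column 3.
printf 'gene1\tseed1\t3.2.1.4\ngene2\tseed2\t2.7.1.1\ngene3\tseed3\t3.2.1.91\n' > toy.emapper.annotations

# Keep query names whose EC falls in 3.2.1.- (glycoside hydrolases).
awk -F'\t' '$3 ~ /^3\.2\.1\./ { print $1 }' toy.emapper.annotations > cazy_candidates

# Then pull those genes out of the catalog for run_dbcan, e.g.:
# seqtk subseq genes/genes.fa cazy_candidates > cazy_genes.fa
```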
One really appealing alternative would be to annotate with Blast2GO, but I would need a license.
##5. Re-map libraries to the gene catalog (make sure to fix catalog names first)
##6. Implement featureCounts using a new SAF with merged results from the emapper pipeline and the CAZymes results
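For the featureCounts step, a SAF can be built directly from the catalog if each gene ends up as its own reference sequence after re-mapping (an assumption about the intended layout; the gene names and lengths below are toy values):

```shell
# Toy gene/length table standing in for the real catalog.
printf 'gene1\t300\ngene2\t450\n' > toy_gene_lengths.tsv

# SAF: GeneID, Chr, Start, End, Strand. With one gene per reference sequence,
# each SAF feature simply spans its gene end to end.
{ printf 'GeneID\tChr\tStart\tEnd\tStrand\n'
  awk -F'\t' '{ printf "%s\t%s\t1\t%s\t+\n", $1, $1, $2 }' toy_gene_lengths.tsv
} > catalog.saf

# Counting would then look like (not run here):
# featureCounts -F SAF -a catalog.saf -o counts.txt sample1.bam sample2.bam
```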