02_Gene Catalog 2.0 - esogin/seagrassOmics GitHub Wiki
Reformulating the Gene Catalog Approach
Maggie Sogin
Created: July 23, 2019
Updated: August 5, 2019
Note: sequences used to create the gene catalog were first dereplicated and clustered before moving on. Briefly, here are the commands used to initiate this process, as implemented in the initial gene catalog scripts.
vsearch --derep_fulllength allcontigs_renamed.fa --output clusters/rep_set.fa --relabel group --relabel_keep
mmseqs easy-linclust clusters/rep_set.fa clusters tmp
##1. Remove short sequences from clustered rep set of sequences
reformat.sh in=clusters_rep_seq.fasta minlength=500 out=clusters_rep_seq_500bp.fasta
reformat.sh in=clusters_rep_seq_500bp.fasta out=clusters_rep_seq_500bp_fixed.fasta addunderscore
Number of sequences in: 73,357,971; number of sequences out: 12,599,825
##2. Call prodigal
prodigal -i clusters_rep_seq_500bp.fasta -o genes/coords.gff -a genes/orfs.faa -p meta -d genes/genes.fa -f gff;
number of proteins: 21,077,130 (25% of original protein file!)
##3. Get a non-redundant gene catalog.
There are several different approaches to doing this:
- blat: I tried this first but received a segmentation fault. I believe I needed more memory. Will think of strategies to implement it, since I am not sold yet on the other two options.
- cd-hit: It is running but will likely take a long time to finish.
- mmseqs: Ran very quickly!
Here are my notes. Caveat: all published gene catalogues do this step with either BLAT or cd-hit. cd-hit will take a while...
cd-hit-est -i genes/genes.fa -o genes/nr_genes.fa -c 0.95 -T 12 -aS 0.9 -n 8 -M 0 -G 0;
--> cd-hit slow, try working with blat again
blat genes.fa genes.fa -minIdentity=95 -t=dna -tileSize=11 -stepSize=10 output.psl
--> Blat ran out of memory on himem machine.
Try reclustering a second time with mmseqs to get non-redundant genes out.
mmseqs easy-linclust genes/genes.fa clusters tmp
--> mmseqs worked very quickly (< 2 mins)
Nr. sequences in: 21,077,130
Nr. sequences out: 17,947,592
Get rep sequence headers and subset protein file
grep ">" clusters_rep_seq.fasta | sed 's/^>//' | cut -d' ' -f1 > rep_seq_headers
- subset protein sequences with the list of fasta headers
seqtk subseq genes/orfs.faa rep_seq_headers > orfs_rep_seqs.fa
Nr. seqs: 17947592
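One gotcha worth recording: seqtk subseq expects bare sequence names in the name list, without the ">" prefix or the prodigal description fields. A toy check of that convention (file contents are made up, and the awk step stands in for seqtk so the sketch runs without it installed):

```shell
# Toy protein file with prodigal-style headers (names followed by descriptions).
cat > toy_orfs.faa <<'EOF'
>gene1_1 # 1 # 99 # 1
MKV
>gene2_1 # 100 # 199 # 1
MLL
EOF

# Bare names, no ">" and no description -- the format seqtk subseq expects.
grep ">" toy_orfs.faa | sed 's/^>//' | cut -d' ' -f1 | head -1 > keep_names

# Emulate seqtk subseq with awk: keep records whose name is in the list.
awk 'NR==FNR { keep[$1]; next }
     /^>/    { p = (substr($1, 2) in keep) }
     p' keep_names toy_orfs.faa > toy_subset.faa
```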
##4. Annotate
New idea for annotation pipeline:
4a. Annotate with emapper.py
It appears this is also a very memory intensive program. One solution: split up files into chunks and then merge.
Try splitting up the file
split -l 200000 -a 3 -d orfs_rep_seqs.fa input_file.chunk
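One caveat with split -l: if any FASTA records wrap their sequence over multiple lines, a fixed line count can cut a record in half. A record-boundary-aware split, sketched with toy data and a chunk size of 2 records (awk only, no extra tools assumed):

```shell
# Toy multi-line FASTA: seq2 wraps over two sequence lines.
cat > toy.fa <<'EOF'
>seq1
ACGT
>seq2
GGCC
AATT
>seq3
TTTT
EOF

# Start a new chunk file only at a header line, every n records.
awk -v n=2 '/^>/ { if (count % n == 0) file = sprintf("toy.chunk%03d.fa", count / n); count++ }
            { print > file }' toy.fa
```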
# generate all the commands that should be distributed in the cluster
for i in *chunk*; do python2 ./../emapper.py -i $i -m diamond --no_annot --no_file_comments --cpu 24 -o $i; done
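To actually distribute the chunks on the cluster rather than running the loop serially, one option is to write one command per line and hand the file to an array job. A sketch (the chunk names are toy stand-ins; the flags mirror the loop above):

```shell
# One emapper command per line, for a scheduler (e.g. a SLURM/SGE array job) to
# pick up. Chunk names here are placeholders for the real split output.
: > emapper_cmds.txt
for i in input_file.chunk000 input_file.chunk001 input_file.chunk002; do
  echo "python2 ./../emapper.py -i $i -m diamond --no_annot --no_file_comments --cpu 24 -o $i" >> emapper_cmds.txt
done
```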
still needs to be done:
cat *.chunk*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10
4b. Select genes with EC numbers matching CAZymes
4c. Get genes out of the catalog and re-annotate with run_dbcan
4d. If feeling ambitious, annotate with Pfam categories
4e. Annotate with KEGG (this is actually taken care of in the emapper results)
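For the CAZyme selection step, one possible sketch is to filter the emapper annotations table on EC number and feed the matching gene IDs back through seqtk. Both the column position of the EC field and the use of 3.2.1.- (glycoside hydrolases) as a rough CAZyme proxy are assumptions here, not the run_dbcan definition:

```shell
# Toy stand-in for an emapper .annotations table; the real column order depends
# on the emapper version, so check the file header before relying on column 3.
printf 'gene1\tseed1\t3.2.1.4\ngene2\tseed2\t2.7.1.1\ngene3\tseed3\t3.2.1.91\n' > toy.emapper.annotations

# Keep query names whose EC falls in 3.2.1.- (glycoside hydrolases).
awk -F'\t' '$3 ~ /^3\.2\.1\./ { print $1 }' toy.emapper.annotations > cazy_candidates

# Then pull those genes out of the catalog for run_dbcan, e.g.:
# seqtk subseq genes/genes.fa cazy_candidates > cazy_genes.fa
```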
One really appealing alternative would be to annotate with Blast2GO, but I would need a license.
##5. Re-map libraries to the gene catalog (make sure to fix catalog names first)
##6. Implement featureCounts using a new SAF with merged results from the emapper pipeline and the CAZymes results
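For the featureCounts step, a SAF can be built directly from the catalog if each gene ends up as its own reference sequence after re-mapping (an assumption about the intended layout; the gene names and lengths below are toy values):

```shell
# Toy gene/length table standing in for the real catalog.
printf 'gene1\t300\ngene2\t450\n' > toy_gene_lengths.tsv

# SAF: GeneID, Chr, Start, End, Strand. With one gene per reference sequence,
# each SAF feature simply spans its gene end to end.
{ printf 'GeneID\tChr\tStart\tEnd\tStrand\n'
  awk -F'\t' '{ printf "%s\t%s\t1\t%s\t+\n", $1, $1, $2 }' toy_gene_lengths.tsv
} > catalog.saf

# Counting would then look like (not run here):
# featureCounts -F SAF -a catalog.saf -o counts.txt sample1.bam sample2.bam
```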