MAKER P on BNB server - warelab/NAM_annotation GitHub Wiki

MAKER-P run on BNB

The below protocol describes the steps to run MAKER-P pipeline on CSHL BNB cluster. The protocol uses maize genome for annotation using MAKER-P

Prerequisites:

  • MAKER-P installed along with dependencies. The current installation is at /sonas-hs/ware/hpc_norepl/data/programs/maker_v3_updated/maker
  • To make this simpler source the bash profile file source /sonas-hs/ware/hpc/home/kchougul/.bash_profile. This will set all the paths and environment variables for tools needed to run MAKER-P
  • Genome fasta to annotate. Make sure the fasta headers are proper i.e there is no missing newline character. You can check if there are any issues using grep "[ATGC]>[ATGC]*" testing.fasta. Also if there are non ATGC characters present in the sequence please inspect or remove it using sed -e '/^[^>]/s/[^ATGCatgc]/N/g' testing.fasta
  • Evidence data for aligning to genome (EST/mRNA/cDNA/proteins). A baseline evidence data used for b73 annotation can be found at /sonas-hs/ware/hpc_norepl/data/kapeel/b73_NAM/standarized_evidence_set

Step 1. Create a working directory

Before running maker please check the avaible disk space. For a standard MAKER-P run using maize genome requires ~300Gb for a complete run. Check the disk space by loggin into filezone1 server and running the below command.

kchougul@filezone1:~$ df -h | grep "ware"

You may choose to run MAKER-P on /sonas-hs/ware/hpc/data or /sonas-hs/ware/hpc_norepl/data. You should have a directory

$ mkdir -P /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/CML333
$ cd /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/CML333

Step 2. Splitting genome fasta file

A general observation if you have a assembly with scaffold/contig > 1000 no need of splitting, but if you have < 100 scaffold/contig split the genome fasta as below

$ mkdir contig_fasta
$ cp <YOUR_GENOME_FILE> contig_fasta
$ contig_fasta
$ fasta_tool split cml333consensus_v2.fasta

This particular assembly has 53 contigs named Contig[0..52].fasta

Step 3. Creating maker control files

$ cd ..; maker -CTL

this will create the following files

  • maker_opts.ctl - gives location of input files (genome and evidence) and sets options that affect MAKER behavior
  • maker_exe.ctl - gives path information for the underlying executables.
  • maker_bopt.ctl - sets parameters for filtering BLAST and Exonerate alignment results
  • maker_evm.ctl - sets parameters adding weights to evidence modeller tool

Provide the file and tools paths within maker_opts.ctl and `maker_exe.ctl

Here is full control file:

maker_exe.ctl

#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/blast/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/blast/bin/blastn #location of NCBI+ blastn executable
blastx=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/blast/bin/blastx #location of NCBI+ blastx executable
tblastx=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/blast/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
prerapsearch= #location of prerapsearch executable
rapsearch= #location of rapsearch executable
RepeatMasker=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/RepeatMasker/RepeatMasker #location of RepeatMasker executable
exonerate=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/exonerate/bin/exonerate #location of exonerate executable

#-----Ab-initio Gene Prediction Algorithms
snap=/mnt/grid/ware/hpc_norepl/data/data/programs/maker_v3_updated/maker/bin/../exe/snap/snap #location of snap executable
gmhmme3= #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus=/sonas-hs/ware/hpc/home/mcampbel/applications/augustus-3.1/bin/augustus #location of augustus executable
fgenesh=/sonas-hs/ware/hpc_norepl/data/programs/fgenesh_v8/fgenesh_suite_v8.0.0a/fgenesh #location of fgenesh executable
evm= #location of EvidenceModeler executable
tRNAscan-SE=/sonas-hs/ware/hpc/home/mcampbel/applications/tRNAscan-SE-1.23/bin/tRNAscan-SE #location of trnascan executable
snoscan= #location of snoscan executable

#-----Other Algorithms
probuild= #location of probuild executable (required for genemark)

Note: most of the paths will be prefilled by MAKER when you run maker -CTL command. The path that needs to be added is for fgenesh and augustus tool

maker_opts.ctl

genome= #genome sequence (fasta file or fasta embeded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anyything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est=/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/Bo_curated_iso-seq/final.filtered.fa:wang_curated_isoseq,/sonas-hs/ware/hpc_norepl/data/kapeel/b73_NAM/standarized_evidence_set/genbank_fl_cdnas_ATCG_only.fasta:flc,/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/NAM_mikado_assemblies/CML333/CML333_mikado.transcripts.fasta:mikado #set of ESTs or assembled mRNA-seq in fasta format
altest=/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/alt_Flcdna/Rice_flcdna_ncbi.fasta:Rice_flc,/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/Bo_curated_iso-seq/sorghum.filtered.Iso-seq.fa:Sb_flc,/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/Bo_curated_iso-seq/sorghum.filtered.reddy.fa:Sb_reddy #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closly relate species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein=/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/Bo_curated_iso-seq/curated_proteins/Arabidopsis_thaliana.prot2.fasta:AT,/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/Bo_curated_iso-seq/curated_proteins/Oryza_sativa_japonica.prot2.fasta:OS,/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8/evidence/Bo_curated_iso-seq/curated_proteins/Sorghum_bicolor.prot2.fasta:SB #protein sequence file in fasta format (i.e. from mutiple organisms)
protein_gff=  #aligned protein homology evidence from an external GFF3 file
#-----Repeat Masking (leave values blank to skip repeat masking)
model_org= #select a model organism for RepBase masking in RepeatMasker
rmlib=/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/repeat_lib_curated_by_shujun/maizeTE10102014.RMname.nogene #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)
#-----Gene Prediction
snaphmm= #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species=B73_prot-rna_20190904.3 #Augustus gene prediction species model
fgenesh_par_file=/sonas-hs/ware/hpc_norepl/data/programs/fgenesh_v8/fgenesh_suite_v8.0.0a/Zeamays.mpar.dat.new #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
run_evm=0 #run EvidenceModeler, 1 = yes, 0 = no
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
trna=0 #find tRNAs with tRNAscan, 1 = yes, 0 = no
snoscan_rrna= #rRNA file to have Snoscan find snoRNAs
snoscan_meth= #-O-methylation site fileto have Snoscan find snoRNAs
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file
#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=1 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=1 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=1 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=1 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)
split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
min_intron=20 #minimum intron length (used for alignment polishing)
single_exon=1 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=1 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=0 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=0 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

Note:

  • Provide full paths somehoe MAKER-P on BNB does not recognize relative paths
  • multiple proteins/EST files can be added separated by , and tagged using :<tag> at the end of file name
  • the TMP location can be empty which will default to /tmp on the systems else choose a desired location with enough space
  • CPU=1, this setting is desired if one is running MAKER-P in MPI mode
  • Details on EST and flcDNA used as evindences
* genbank_maize_FLC(69,163) from original B73 project (Genbank)
* Filtered Iso-seq data (46,311) Remove intron retention (splice information)
* RNAseq Mikado transcripts NAM lines (using accession specific evidence)
* Curated protein evidence AT, Sb, Os  (two longest peptides) 
* Alt specie flcDNA
** Rice- (32,148);Sorghum Iso-seq- Bo(37,655); Reddy’s (31,881).
* the iso-seq and alt specie transcripts were filtered to Remove intron retention using the protocol described here:https://github.com/warelab/Maize-B73-RefGen_v5/issues/16
* The augustus HMM was generate using the BRAKER tool using Mikado rna-seq and proteins.

Step 4. preparing qsub script

To submit maker jobs we need to prepare a qsub script to schedule jobs. More details on how to submit a job on BNB see this documentation: http://intranet.cshl.edu/administration/information-technology/hpcc/uge.

For our maize run we will use MPI and submit parallel array job submission. Documentation on array jobs: https://github.com/warelab/annotation_workflows/wiki/maker_run_BNB_cluster

cat submit_maker_here.sh

#!/bin/bash
# specify BASH shell
#$ -S /bin/bash
# run job in the current working directory where qsub is executed from
#$ -cwd
# Tell the SGE that this is an array job, with "tasks" to be numbered 
#$ -t 1-53
#  specify that the job requires 3GB of memory for each process
#$ -pe mpi 16 -l m_mem_free=2G
# make sure that there is at least 20G of tmp space on the node
#$ -l tmp_free=200G

module load openmpi-x86_64
source /sonas-hs/ware/hpc/home/kchougul/.bash_profile
export LD_PRELOAD=/sonas-hs/ware/hpc/home/mcampbel/applications/openmpi-1.8.8/build/lib/libmpi.so

let i=$SGE_TASK_ID-1
/sonas-hs/ware/hpc/home/mcampbel/applications/openmpi-1.8.8/build/bin/mpiexec -mca btl ^openib -n 16 /sonas-hs/ware/hpc/home/mcampbel/applications/maker/bin/maker -g contig_fasta/Contig$SGE_TASK_ID.fasta -fix_nucleotides -TMP $TMPDIR

array job specification is represernted by -t option. Here we use 1-53 instead of 0-52.

Note the -t parameter can be modified to submit 10 array jobs at a time to avoid overload on the system : -t 1-53:10

Step 5. submitting qsub job and monitoring status of the job.

Submit qsub script $ qsub submit_maker_here.sh

To check status of array jobs: qstat -u kchougul

To check if MAKER finishes for each contig/array job there will 2 log file submit_maker_here.sh.e[0-9]* submit_maker_here.sh.oe[0-9]. Once MAKER-P finishes check the status of submit_maker_here.sh.e[0-9] file. You should see the following message

$ tail -f submit_maker_here.sh.e42180609
adding statistics to annotations
Calculating annotation quality statistics
choosing best annotation set
Choosing best annotations
processing chunk output
processing contig output
Maker is now finished!!!
Start_time: 1509292679
End_time:   1509303084
Elapsed:    10405

You can also choose to see the maker output directory if the annotation gff and fasta files are been created. $ ls Contig*/*out*/*data*/*/*/*/*gff | wc -l This should match the number of contigs in the assembly. If for some reason a contig fails or gets stuck for more than 2 days kill the job using qdel <Job-ID> and resubmit by editing the qsub script and specifying the contig number under -t option.

Step 6. Merge gff and fasta files generated from MAKER-P run

A typical set of outputs for a contig looks like this:

kchougul@bnbdev2:/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/CML333$ ls Contig*/*out*/*data*/*/*/Contig0
Contig0.gff
Contig0.maker.augustus_masked.proteins.fasta
Contig0.maker.augustus_masked.transcripts.fasta
Contig0.maker.fgenesh_masked.proteins.fasta
Contig0.maker.fgenesh_masked.transcripts.fasta
Contig0.maker.non_overlapping_ab_initio.proteins.fasta
Contig0.maker.non_overlapping_ab_initio.transcripts.fasta
Contig0.maker.proteins.fasta
Contig0.maker.transcripts.fasta
run.log
theVoid.Contig0

To merge gff's and extract only gene features and remove nucleotide seq

$ gff3_merge -g -n -o CML33.maker.gene_only.gff Contig*/*out*/*data*/*/*/*/*gff

To merge gff's and extract all features and remove nucleotide seq

$ gff3_merge -n -o CML33.maker.all.gff Contig*/*out*/*data*/*/*/*/*gff

To merge fasta

$ cat Contig*/*out*/*data*/*/*/*/*maker.augustus_masked.proteins.fasta > CML333.maker.augustus_masked.proteins.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.augustus_masked.transcripts.fasta > CML333.maker.augustus_masked.transcripts.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.fgenesh_masked.proteins.fasta > CML333.maker.fgenesh_masked.proteins.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.fgenesh_masked.transcripts.fasta > CML333.maker.fgenesh_masked.transcripts.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.non_overlapping_ab_initio.proteins.fasta > CML333.maker.non_overlapping_ab_initio.proteins.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.non_overlapping_ab_initio.transcripts.fasta > CML333.maker.non_overlapping_ab_initio.transcripts.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.proteins.fasta > CML333.maker.proteins.fasta
$ cat Contig*/*out*/*data*/*/*/*/*maker.transcripts.fasta > CML333.maker.transcripts.fasta

To extract repeatmasked genome seq and gff

$ cat Contig*/*out*/*data*/*/*/*/*/*query.masked.fasta > CML333.maker.repeatmasked.fasta
$ cat Contig*/*out*/*data*/*/*/*/*/*query.masked.gff  > CML333.maker.repeatmasked.gff

Step 7. move all the output results to final output directory

$ mkdir maker_final_annotations
$ mv *gff *fasta maker_final_annotations
⚠️ **GitHub.com Fallback** ⚠️