PASA runs for NAM genomes - warelab/NAM_annotation GitHub Wiki
The PASA tool will be used to update the MAKER-P annotation. The inputs and the steps to run PASA are listed below.
- repeatmasked genome from MAKER
- transcript sequences
  - genbank_maize_FLC (69,163) from the original B73 project (GenBank)
  - Filtered Iso-seq data (46,311), with intron-retention transcripts removed (based on splice information)
  - Maize ESTs from GenBank (2,019,896), identified using this search: (EST[Keyword]) AND maize[Organism]
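The three transcript sets are combined into a single FASTA file before cleaning; their counts sum to 2,135,370, which matches the number of sequences seqclean reports below. A minimal sketch of that step, assuming each set sits in its own FASTA file and each file ends with a newline (the function name is hypothetical):

```shell
# Concatenate several FASTA files into one and print the record count.
# Usage: combine_fasta OUTPUT INPUT [INPUT...]
combine_fasta() {
    local out=$1
    shift
    cat "$@" > "$out"
    # Each FASTA record starts with a '>' header line.
    grep -c '^>' "$out"
}
```

For example, `combine_fasta maize.flc.iso.est.combined.fasta flc.fasta isoseq.filtered.fasta est.fasta`, where the three input names are placeholders and only the combined file name comes from this page.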
Run the seqclean utility on your transcripts like so:
../PASApipeline/bin/seqclean maize.flc.iso.est.combined.fasta
This generates several output files, including maize.flc.iso.est.combined.fasta.cln and maize.flc.iso.est.combined.fasta.clean. Both of these can be used as inputs to PASA (we will use maize.flc.iso.est.combined.fasta.clean).
Output summary:
**************************************************
Sequences analyzed: 2135370
-----------------------------------
valid: 1871612 (325878 trimmed)
trashed: 263758
**************************************************
----= Trashing summary =------
by 'low_qual': 1126
by 'dust': 759
by 'short': 178149
by 'shortq': 83724
------------------------------
Output file containing only valid and trimmed sequences: maize.flc.iso.est.combined.fasta.clean
For trimming and trashing details see cleaning report : maize.flc.iso.est.combined.fasta.cln
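The summary above can be checked for internal consistency with a little shell arithmetic. This is only a sanity-check sketch; the numbers are copied from the report and the variable names are arbitrary:

```shell
# Numbers copied from the seqclean summary above.
analyzed=2135370
valid=1871612
trashed=263758
# Per-reason trash counts: low_qual, dust, short, shortq.
trash_sum=$((1126 + 759 + 178149 + 83724))

# Both checks should hold if the summary is internally consistent.
[ $((valid + trashed)) -eq "$analyzed" ] && echo "valid + trashed = analyzed"
[ "$trash_sum" -eq "$trashed" ] && echo "trash reasons sum to trashed"
```

Since the .clean file holds only the valid sequences, `grep -c '^>' maize.flc.iso.est.combined.fasta.clean` should report the same number as the "valid" line.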
Before running the transcript alignment step, copy the files annotCompare.config and alignAssembly.config from the PASA installation folder and edit the copies.
Edit the database path:
# database settings
DATABASE=/home/kchougul/NAM_PASA_runs/sqlite/<specie>.sqlite
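The DATABASE line can also be set without opening an editor. A minimal sketch, assuming the config copy contains a single line starting with `DATABASE=` and that the new path contains no `|` or `&` characters; `set_pasa_database` is a hypothetical helper, not part of PASA:

```shell
# Set the DATABASE line of a PASA config file in place.
# Usage: set_pasa_database CONFIG SQLITE_PATH
set_pasa_database() {
    local config=$1 db=$2
    # Replace whatever follows "DATABASE=" with the new path.
    # '|' is the sed delimiter because the path contains '/'.
    # The original file is kept as CONFIG.bak.
    sed -i.bak "s|^DATABASE=.*|DATABASE=${db}|" "$config"
}
```

The same call would be applied to both alignAssembly.config and annotCompare.config so that the two steps point at the same database.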
- Create a sqlite folder and set its permissions:
$ mkdir sqlite; chmod -R 777 sqlite
- Run the PASA alignment assembly pipeline like so:
../PASApipeline/Launch_PASA_pipeline.pl -c alignAssembly.config -C -R -g M162W.maker.repeatmasked.fasta -t maize.flc.iso.est.combined.fasta.clean -T -u maize.flc.iso.est.combined.fasta --ALIGNERS blat,gmap --CPU 5
The '--ALIGNERS' can take values 'gmap', 'blat', or 'gmap,blat', in which case both aligners will be executed in parallel. The CPU setting determines the number of threads to be used for each process. This is passed on to GMAP to indicate the thread count. In the case of BLAT, the transcript database is split into CPU number of partitions and each partition is searched separately and in parallel using BLAT. Also, note that if 'gmap,blat' is specified, then you may have up to 2*CPU number of processes running simultaneously.
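The upper bound described above is simply the number of aligners times the CPU value. A small sketch of that arithmetic (the function name is hypothetical):

```shell
# Upper bound on simultaneous aligner processes for a PASA run.
# Usage: max_aligner_processes ALIGNERS CPU   (ALIGNERS as passed to --ALIGNERS)
max_aligner_processes() {
    local aligners=$1 cpu=$2 n
    # Count the comma-separated aligner names.
    n=$(printf '%s\n' "$aligners" | tr ',' '\n' | grep -c .)
    echo $((n * cpu))
}

max_aligner_processes blat,gmap 5   # prints 10
```

With the command above ('--ALIGNERS blat,gmap --CPU 5'), up to 10 processes may therefore be running at once, which is worth keeping in mind when requesting cores on a shared machine.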
This executes the following operations, generating the corresponding output files:
- Aligns the all_transcripts.fasta file to genome_sample.fasta (here, maize.flc.iso.est.combined.fasta.clean and M162W.maker.repeatmasked.fasta) using the specified alignment tools. Files generated include:
  - 'm162W.sqlite.validated_transcripts.gff3,.gtf,.bed': the valid alignments from BLAT and GMAP
  - 'm162W.sqlite.failed_gmap/blat_alignments.gff3,.gtf,.bed': the alignments that fail the validation test
  - 'alignment.validations.output': tab-delimited format describing the alignment validation results
- The valid alignments are clustered into piles based on genome alignment position, and the piles are assembled using the PASA alignment assembler. Files generated include:
  - 'm162W.sqlite.assemblies.fasta': the PASA assemblies in FASTA format
  - 'm162W.sqlite.pasa_assemblies.gff3,.gtf,.bed': the PASA assembly structures
  - 'm162W.sqlite.pasa_alignment_assembly_building.ascii_illustrations.out': descriptions of alignment assemblies and how they were constructed from the underlying transcript alignments
  - 'm162W.sqlite.pasa_assemblies_described.txt': tab-delimited format describing the contents of the PASA assemblies, including the identity of those transcripts that were assembled into the corresponding structure
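Before moving on, it can help to confirm that the main output files listed above were actually written. A minimal sketch, assuming the files sit in the current directory and use the database name as prefix; `check_pasa_outputs` is a hypothetical helper and only checks a subset of the files:

```shell
# Report which of the main PASA output files are missing or empty.
# Usage: check_pasa_outputs PREFIX   (PREFIX is e.g. m162W.sqlite)
check_pasa_outputs() {
    local prefix=$1 missing=0 suffix
    for suffix in validated_transcripts.gff3 assemblies.fasta \
                  pasa_assemblies.gff3 pasa_assemblies_described.txt; do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "${prefix}.${suffix}" ]; then
            echo "missing or empty: ${prefix}.${suffix}"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing files (0 means all present).
    return "$missing"
}
```

Running `check_pasa_outputs m162W.sqlite` prints nothing and returns 0 when all four files are present.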
Incorporating PASA Assemblies into Existing Gene Predictions, Changing Exons, Adding UTRs and Alternatively Spliced Models
The PASA software can update any preexisting set of protein-coding gene annotations to incorporate the PASA alignment evidence, correcting exon boundaries, adding UTRs, and adding models for alternative splicing based on the PASA alignment assemblies generated above.
- Loading your preexisting protein-coding gene annotations
~/PASApipeline/scripts/Load_Current_Gene_Annotations.dbi -c alignAssembly.config -g M162W.maker.repeatmasked.fasta -P M162W.maker.gene_only.gff3
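This page does not say how M162W.maker.gene_only.gff3 was produced. One common way to reduce a full MAKER GFF3 to its gene models is to keep only the features whose source column is "maker"; the sketch below shows that approach as an assumption, not as the method actually used here (`maker_gene_models` is a hypothetical name):

```shell
# Keep GFF3 header lines and features whose source column (column 2) is "maker";
# stop at an embedded ##FASTA section if one is present.
# Usage: maker_gene_models IN.gff3 > OUT.gff3
maker_gene_models() {
    awk -F'\t' '
        /^##FASTA/ { exit }          # ignore embedded sequence
        /^#/       { print; next }   # keep header and comment lines
        $2 == "maker"                # keep MAKER gene-model features
    ' "$1"
}
```

Evidence features such as repeat or alignment matches carry other source names and are dropped, leaving the gene, mRNA, exon, CDS and UTR records.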
- Performing an annotation comparison and generating an updated gene set
Now that the original annotations are loaded, we can perform a comparison of the PASA alignment assemblies to these preexisting gene annotations, to identify cases where updates can be automatically performed to gene structures in order to incorporate the transcript alignments.
Run the annotation comparison like so:
~/PASApipeline/Launch_PASA_pipeline.pl -c annotCompare.config -A -g M162W.maker.repeatmasked.fasta -t maize.flc.iso.est.combined.fasta
Once the annotation comparison is complete, PASA will output a new GFF3 file that contains the PASA-updated version of the genome annotation, including those gene models successfully updated by PASA and those that remained untouched. This file will be named '${mysql_db}.gene_structures_post_PASA_updates.$pid.gff3', where ${mysql_db} is the database name (here the SQLite database, which gives the 'm162W.sqlite' prefix seen in the output files above) and $pid is the process ID for this annotation comparison computation.
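Because the process ID in the file name changes with every run, the updated GFF3 is easiest to pick up by modification time. A minimal sketch, assuming the output sits in the current directory; `latest_pasa_update` is a hypothetical helper:

```shell
# Print the most recently modified PASA-updated GFF3 for a database prefix.
# Usage: latest_pasa_update PREFIX   (PREFIX is e.g. m162W.sqlite)
latest_pasa_update() {
    local prefix=$1
    # ls -t sorts newest first; errors are silenced when nothing matches.
    ls -t "${prefix}".gene_structures_post_PASA_updates.*.gff3 2>/dev/null | head -n 1
}
```

`latest_pasa_update m162W.sqlite` prints the newest matching file name, or nothing if the annotation comparison has not produced one yet.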
Note:
- We additionally decided to include NAM Mikado transcripts to update the annotations.
- The protocol above was repeated using the NAM Mikado transcripts as evidence and the GFF output of the PASA runs above as the starting annotation.