Ensembl Core DB Gene Load - warelab/NAM_annotation GitHub Wiki

This SOP explains the steps that follow the annotation pipeline: loading genes into the Ensembl core database, improving/refining the models, running InterProScan on the proteins, and dumping GFF and FASTA files out of the database.

Progress tracking sheet


1. Load genes to Ensembl core database with QC

Remove the old genes from the core database, then load the new gene sets generated by Kapeel's annotation pipeline. These genes are the union of the Mikado and BRAKER annotations, further filtered into the following three sets.

The symlinks in `/sonas-hs/ware/hpc_norepl/data/weix/NAM/` to the data are

a. high-confidence coding genes: 1. AED < 0.75 and 2. phylogenetically conserved within the Andropogoneae

UnionFilteredModels -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/aed_lt_75_androp_abv

b. maize-specific coding genes: 1. AED < 0.75 and 2. both start and stop codons present in the original model

LineageBothStartStop -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/lineage_specific/lt_75/both-cds-start-stop

c. maize-specific non-coding genes: 1. AED < 0.75 and 2. missing the start codon, the stop codon, or both in the original model

LineageNonCoding -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/lineage_specific/lt_75/with-either-cds-start-or-stop-or-both


  • The script is /sonas-hs/ware/hpc_norepl/data/weix/NAM/scripts/load_all.sh

    arg1: NAM line name, for example cml52

    arg2: logic_name of the old gene set to delete, for example mikado_gene

    The script creates a directory named after the NAM line (e.g. cml52/), removes the old gene set under the logic_name "mikado_gene" from the existing Ensembl core database, and loads the three gene sets: coding genes under the logic_name "cshl_gene" and non-coding genes under "cshl_noncoding_gene". The script is interactive: before each load it asks the user to confirm that the input file is correct and continues only after receiving 'Y'. After each load, the translations are checked against the protein sequences and the output is written to files named translation.*.out. Each load takes about 3 hours.

    When the script is done running, check the three $nam/badGenes2del files and make sure they are all empty; a non-empty file means the load did not pass the quality check and calls for investigation.
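The post-run QC check can be scripted instead of eyeballed. A minimal sketch; the `check_badgenes` helper name is made up here, and the exact badGenes2del file layout is assumed from the description above:

```shell
# check_badgenes: verify that every badGenes2del file under a NAM line
# directory is empty. Returns non-zero if any file is missing or non-empty,
# which means the load did not pass QC and needs investigation.
check_badgenes() {
    local dir="$1" f status=0
    for f in "$dir"/*badGenes2del*; do
        if [ ! -e "$f" ]; then
            echo "WARN: no badGenes2del files found in $dir"
            return 1
        elif [ -s "$f" ]; then
            echo "FAIL: $f is non-empty"
            status=1
        else
            echo "OK: $f is empty"
        fi
    done
    return $status
}
```

Usage after a batch run: `for line in $(cat TropicalNAM); do check_badgenes $line; done`.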

  • The working directory is /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore

  • An example of running the loading in batch

     cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore
     for line in `cat TropicalNAM`; do
         echo $line
         ../scripts/load_all.sh $line mikado_gene
     done
    

2. Gene Model polish

Now hand the core database over to Andrew; he will run his CDS-fix pipeline to improve the models, then dump the proteins for InterProScan and biotype reassignment.

First, create a backup of the relevant database tables:

mysqldump -h bhsqldw1 -u plensembl -pAudreyII --add-drop-table zea_mayscml228_core_3_87_1 exon exon_transcript gene transcript translation > cml228.backup.before.sql
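To run the backup across many lines, a small wrapper can help. A sketch: the `db_for_line` naming pattern is inferred from the cml228 example above and is an assumption, not a verified rule for every line:

```shell
# db_for_line: map a NAM line name to its core database name, following the
# zea_mayscml228_core_3_87_1 pattern (assumed to hold for all lines).
db_for_line() { echo "zea_mays${1}_core_3_87_1"; }

# backup_line: dump the gene-model tables for one line before the CDS fix,
# mirroring the single-line mysqldump command above.
backup_line() {
    mysqldump -h bhsqldw1 -u plensembl -pAudreyII --add-drop-table \
        "$(db_for_line "$1")" exon exon_transcript gene transcript translation \
        > "$1.backup.before.sql"
}
```

Batch usage: `for line in $(cat TropicalNAM); do backup_line $line; done`. Restoring is the reverse direction: `mysql -h bhsqldw1 -u plensembl -pAudreyII $(db_for_line cml228) < cml228.backup.before.sql`.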

To split the gene list into 3 jobs, run:

~olson/setup_CDSFix_array.sh /sonas-hs/ware/hpc_norepl/data/weix/NAM/registry/nam.reg zea_mays cml228 3

Dump GFF for gffQC:

perl /sonas-hs/ware/hpc/home/olson/src/warelab/gramene-ensembl/scripts/dump-scripts/sharon_gff_dump_by_logicname_nam.pl -o . -dbname zea_mayscml228_core_3_87_1 -logicname cshl_gene >& cml228.dump_gff &

3. Gene biotype reassignment

Protein-coding genes with very short proteins (< 50 aa) need to be reassigned to a non-coding biotype. Kapeel identified those genes; the symlink to that directory is

     transcripts_lt_50aa -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/transcripts_lt_50aa
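One way to apply the reassignment is to turn that list into SQL against the core schema's `gene.biotype` / `gene.stable_id` columns. A sketch under loud assumptions: the helper name is invented, the list is assumed to carry one stable ID in its first column, and the target biotype name ("noncoding" here) is a placeholder to verify before running anything against the database:

```shell
# make_biotype_sql: emit one UPDATE statement per stable ID in the input
# list, flipping its biotype. The target biotype and the one-ID-per-line
# list format are assumptions -- verify both before applying the SQL.
make_biotype_sql() {
    local list="$1" biotype="${2:-noncoding}" gene rest
    while read -r gene rest; do
        [ -n "$gene" ] || continue
        printf "UPDATE gene SET biotype='%s' WHERE stable_id='%s';\n" \
            "$biotype" "$gene"
    done < "$list"
}
```

Hypothetical usage: `make_biotype_sql transcripts_lt_50aa/cml52.list > cml52.biotype.sql`, then review and feed the file to mysql.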

4. Run the GFF3QC tool kit to fix gene model problems such as overlapping or duplicated genes.

 The following files were saved to the `cshl_coding_corrected` folder.

 1. the list of changed genes; these will be removed from the database
 2. a corrected GFF file with all protein-coding genes, both unchanged and changed (the changed ones have different gene names)
 3. a complete protein FASTA file for validation

 There was confusion about the phase definition used by the GFF tools versus the Ensembl database. They seem to differ slightly, so we decided to ignore the potential phase problems detected by `gramene-ensembl/scripts/QC-scripts/correct_phase.pl` and skip this step.

  Command:
  cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore

  for line in `cat TropicalNAM`; do
      echo $line
      ../scripts/replace_coding_corrected.sh $line
  done

5. Assign gene ids and dump gff

Andrew assigns new names to the genes and dumps a new complete GFF to share with collaborators, along with the protein FASTA using the new names.

~olson/assign_stable_IDs.pl --registry /sonas-hs/ware/hpc_norepl/data/weix/NAM/registry/nam.reg --species zea_mayscml228 --prefix Zm00022ab --bylogicname cshl_gene --bylogicname cshl_noncoding_gene >& cml228.assign_stable_IDs.stderr

Dump GFF for gffQC:

perl /sonas-hs/ware/hpc/home/olson/src/warelab/gramene-ensembl/scripts/dump-scripts/sharon_gff_dump_by_logicname_nam.pl -o . -dbname zea_mayscml228_core_3_87_1 -logicname cshl_gene >& cml228.dump_gff &

Dump FASTA for InterProScan:

perl ~olson/dump_all.pl --registry /sonas-hs/ware/hpc_norepl/data/weix/NAM/registry/nam.reg --species zea_mayscml228 --bylogicname cshl_gene --bylogicname cshl_noncoding_gene --outfile cml228 >& dump_all.cml228 &

gffQC revealed more problems, especially ema0003 (feature not contained within parent feature). Kapeel fixed those genes and put the fixed GFF and FASTA under

/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/ensembl_dumps/new_gene_id_assign/gff3_fix_files/ema0003

Use the following script to replace those genes

/sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore$ ../scripts/replace_ema0003_corrected.sh cml52

6. Run InterProScan on the proteins

/sonas-hs/ware/hpc/home/weix/gramene-ensembl/scripts/misc-scripts/setup_interpro_array.sh ARG1 ARG2 ARG3

     ARG1: project directory name
     ARG2: input protein fasta file
     ARG3: Number of jobs to submit

Example:

/sonas-hs/ware/hpc/home/weix/gramene-ensembl/scripts/misc-scripts/setup_interpro_array.sh interproscan-oh43 ../cshl_coding_corrected/oh43.corrected.pep.fasta 100

Command:
  cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore

for line in `cat TropicalNAM`; do
    echo $line
    /sonas-hs/ware/hpc/home/weix/gramene-ensembl/scripts/misc-scripts/setup_interpro_array.sh interproscan-$line ../protein_dump/${line}.pep.fasta 100
done
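setup_interpro_array.sh presumably splits the protein FASTA into the requested number of chunks for the array job. A minimal sketch of such a round-robin split (not the actual script, which may work differently):

```shell
# split_fasta: distribute FASTA records round-robin into n chunk files under
# outdir, roughly what an array-job setup script needs to do. Relies on awk's
# redirection semantics: each output file is truncated on first write and
# appended to afterwards.
split_fasta() {
    local fasta="$1" n="$2" outdir="$3"
    mkdir -p "$outdir"
    awk -v n="$n" -v dir="$outdir" '
        /^>/ { f = dir "/chunk_" (i++ % n) ".fasta" }  # new record -> pick a chunk
        { print > f }
    ' "$fasta"
}
```

For example, `split_fasta oh43.pep.fasta 100 chunks/` would produce chunks/chunk_0.fasta through chunks/chunk_99.fasta, one per array task.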

7. Andrew runs his TRaCE pipeline to assign canonical transcripts

cd /sonas-hs/ware/hpc_norepl/data/olson/NAM/TRaCE/CML228

cat /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore/interproscan-cml228/output/interproscan-cml228.* > interproscan.tsv

../replace.pl ../../dump/zea_mayscml228_core_3_87_1.gff 0 ../chr_names_lut.txt 0 1 > models.chr.gff

~olson/rank_transcripts.pl models.chr.gff interproscan.tsv 0.5 0.5 0.2 0.7 stringtie/* > rt.txt

perl ~olson/src/warelab/gramene-ensembl/scripts/load-scripts/set_canonical_transcripts_from_TRaCE.pl --debug -r ../../../../weix/NAM/registry/nam.reg -s zea_mayscml228 -t rt.txt >& set.stdall

8. Dump genes into the following formats

    1) GFF with canonical transcripts marked (InterPro domains are not actually in the GFF; they are only in the TSV files)
    2) FASTA files for cDNA, CDS, and protein
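A quick sanity check on those final dumps: for the protein-coding set, the cDNA, CDS, and protein FASTA files for a line should all contain the same number of records. A sketch; the `cml228.cdna.fasta`-style file names in the usage line are assumptions:

```shell
# fasta_count: number of records in a FASTA file.
fasta_count() { grep -c '^>' "$1"; }

# dumps_consistent: succeed only if all given FASTA files have the same
# record count; prints the per-file counts either way.
dumps_consistent() {
    local first n f
    first=$(fasta_count "$1")
    for f in "$@"; do
        n=$(fasta_count "$f")
        echo "$f: $n records"
        [ "$n" -eq "$first" ] || return 1
    done
}
```

Hypothetical usage: `dumps_consistent cml228.cdna.fasta cml228.cds.fasta cml228.pep.fasta` before sharing the dumps with collaborators.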