Ensembl Core DB Gene Load - warelab/NAM_annotation GitHub Wiki
This SOP explains the steps that follow the annotation pipeline: loading genes into the Ensembl core database, improving/refining the models, running InterProScan on the proteins, and dumping gff and fasta files out of the database.
1. Load genes into the Ensembl core database, with QC
Remove the old genes from the core database, then load the new gene sets generated by Kapeel's annotation pipeline. These genes are the union of the mikado and braker annotations, further filtered into the following three sets.
The symlinks in `/sonas-hs/ware/hpc_norepl/data/weix/NAM/` to the data are
a. this set represents the high-confidence coding genes: 1. AED < 0.75 and 2. phylogenetically conserved in Andropogoneae
UnionFilteredModels -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/aed_lt_75_androp_abv
b. this set represents the maize-specific coding genes: 1. AED < 0.75 and 2. both start and stop codons present in the original model
LineageBothStartStop -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/lineage_specific/lt_75/both-cds-start-stop
c. this set is the maize-specific non-coding genes: 1. AED < 0.75 and 2. missing the start codon, the stop codon, or both in the original model
LineageNonCoding -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/lineage_specific/lt_75/with-either-cds-start-or-stop-or-both
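The split between sets b and c hinges on whether the original model has both a start and a stop codon. A minimal sketch of that test on a toy single-line CDS fasta (the real classification was done upstream in Kapeel's pipeline; sequence names here are made up):

```shell
# Toy CDS fasta: m1 has ATG...TAA (complete), m2 lacks a stop codon.
printf '>m1\nATGAAACCCTAA\n>m2\nATGAAACCC\n' > toy.cds.fasta
awk '/^>/{id=substr($0,2); next}
     {start = (substr($0,1,3) == "ATG")
      stop  = (substr($0,length($0)-2) ~ /^(TAA|TAG|TGA)$/)
      print id, (start && stop ? "complete" : "partial")}' toy.cds.fasta \
  > toy_classify.txt
cat toy_classify.txt
```

A "complete" model would fall into set b, a "partial" one into set c (assuming it also passes AED < 0.75).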
The script is `/sonas-hs/ware/hpc_norepl/data/weix/NAM/scripts/load_all.sh`
arg1: NAM line name, for example: cml52
arg2: the old gene set logic_name to delete, for example: mikado_gene
The script creates a directory named after the NAM line (such as cml52/), removes the old gene set loaded under the logic_name "mikado_gene" from the existing Ensembl core database, then loads the three gene sets: coding genes under the logic_name "cshl_gene" and non-coding genes under "cshl_noncoding_gene". The script is interactive: before each load it asks the user to confirm that the input file is correct, and continues only after receiving 'Y'. After each load, the translations are checked against the protein sequences, with the output written to files named translation.*.out. Each load takes about 3 hours.
When the script is done running, check the three $nam/badGenes2del files and make sure they are all empty; a non-empty file means the set did not pass the quality check and calls for investigation.
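This post-load check can be scripted. A sketch, demonstrated on hypothetical fixture files standing in for the real $nam/badGenes2del outputs:

```shell
# Fixture: one empty file (passes QC) and one non-empty file (fails QC).
nam=$(mktemp -d)
: > "$nam/badGenes2del.coding"                  # empty -> passes
echo "suspect_gene_1" > "$nam/badGenes2del.nc"  # non-empty -> fails
fail=0
for f in "$nam"/badGenes2del*; do
  if [ -s "$f" ]; then
    echo "QC FAILED: $f is not empty"; fail=1
  else
    echo "QC OK: $f"
  fi
done
echo "fail=$fail" > qc_status.txt
```

A non-zero `fail` means at least one set needs investigation before moving on.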
The working directory is `/sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore`.

An example of running the loading in batch:

```sh
cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore
for line in `cat TropicalNAM`
do
  echo $line
  ../scripts/load_all.sh $line mikado_gene
done
```
2. Gene Model polish
Now hand the core database over to Andrew; he will run his CDS-fix pipeline to improve the models and dump the proteins for InterProScan and biotype reassignment.
First create a backup of the relevant db tables:

```sh
mysqldump -h bhsqldw1 -u plensembl -pAudreyII --add-drop-table \
  zea_mayscml228_core_3_87_1 exon exon_transcript gene transcript translation \
  > cml228.backup.before.sql
```
To split the gene list into 3 jobs, run:

```sh
~olson/setup_CDSFix_array.sh /sonas-hs/ware/hpc_norepl/data/weix/NAM/registry/nam.reg zea_mays cml228 3
```
Dump gff for gffQC:

```sh
perl /sonas-hs/ware/hpc/home/olson/src/warelab/gramene-ensembl/scripts/dump-scripts/sharon_gff_dump_by_logicname_nam.pl -o . -dbname zea_mayscml228_core_3_87_1 -logicname cshl_gene >& cml228.dump_gff &
```
3. Gene biotype reassign
Really short protein-coding genes (< 50 aa) need to be reassigned to a non-coding biotype. Kapeel identified those genes; the symlink for that directory is
transcripts_lt_50aa -> /sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/transcripts_lt_50aa
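For illustration, the < 50 aa scan can be sketched with awk over a protein fasta (toy data and made-up sequence names; the real lists come from Kapeel's pipeline):

```shell
# Toy protein fasta: geneA_T001 is 33 aa (too short), geneB_T001 is ~90 aa.
cat > toy.pep.fasta <<'EOF'
>geneA_T001
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>geneB_T001
MSTNPKPQRKTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERSQPR
GRRQPIPKARRPEGRTWAQPGYPWPLYGNEG
EOF
# Sum sequence length per record; report records shorter than 50 aa.
awk '/^>/{if (id != "" && len < 50) print id; id=substr($0,2); len=0; next}
     {len += length($0)}
     END{if (id != "" && len < 50) print id}' toy.pep.fasta \
  > transcripts_lt_50aa.txt
cat transcripts_lt_50aa.txt
```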
4. Run the GFF3QC toolkit to fix gene model problems such as overlapping or duplicated genes.
The following files were saved to the `cshl_coding_corrected` folder:
1. the list of changed genes, which will be removed from the database
2. a corrected gff file with all protein-coding genes, both unchanged and changed (the changed ones have different gene names)
3. a complete protein fasta file for validation
There was confusion about the phase definition between the GFF tools and the Ensembl database. They seem to differ slightly, so we decided to ignore the potential phase problems detected by `gramene-ensembl/scripts/QC-scripts/correct_phase.pl` and skip this step.
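For context on that mismatch, as I read the GFF3 spec, the phase of a CDS segment is the number of bases to skip to reach the next codon boundary, i.e. (3 - cumulative coding length so far mod 3) mod 3, while Ensembl's exon phase counts the bases of the split codon carried over from the previous exon (cumulative length mod 3) — complementary notions. A sketch computing GFF3-style phase for hypothetical + strand CDS segments:

```shell
# Hypothetical CDS segments of one mRNA; phase_i = (3 - cum % 3) % 3,
# where cum is the total CDS length before segment i (GFF3 convention).
awk 'BEGIN{
  n = split("100-198 300-450 600-700", seg, " "); cum = 0
  for (i = 1; i <= n; i++) {
    split(seg[i], c, "-")
    printf "CDS %s phase %d\n", seg[i], (3 - cum % 3) % 3
    cum += c[2] - c[1] + 1
  }
}' > phases.txt
cat phases.txt
```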
Command:

```sh
cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore
for line in `cat TropicalNAM`
do
  echo $line
  ../scripts/replace_coding_corrected.sh $line
done
```
5. Assign gene IDs and dump gff
Andrew assigns new names to the genes and dumps a new complete gff to share with collaborators, along with the protein fasta using the new names.
```sh
~olson/assign_stable_IDs.pl --registry /sonas-hs/ware/hpc_norepl/data/weix/NAM/registry/nam.reg --species zea_mayscml228 --prefix Zm00022ab --bylogicname cshl_gene --bylogicname cshl_noncoding_gene >& cml228.assign_stable_IDs.stderr
```
Dump gff for gffQC:

```sh
perl /sonas-hs/ware/hpc/home/olson/src/warelab/gramene-ensembl/scripts/dump-scripts/sharon_gff_dump_by_logicname_nam.pl -o . -dbname zea_mayscml228_core_3_87_1 -logicname cshl_gene >& cml228.dump_gff &
```
Dump fasta for InterProScan:

```sh
perl ~olson/dump_all.pl --registry /sonas-hs/ware/hpc_norepl/data/weix/NAM/registry/nam.reg --species zea_mayscml228 --bylogicname cshl_gene --bylogicname cshl_noncoding_gene --outfile cml228 >& dump_all.cml228 &
```
gffQC revealed more problems, especially ema0003 (feature not contained within its parent feature). Kapeel fixed those genes and put the corrected gff and fasta under
/sonas-hs/ware/hpc_norepl/data/kapeel/NAM/NAM_Canu1.8_new_runs/mikado_braker.evd.ab.nam.models/aed_lt_75/mikado_braker_pasa_union_all/gff/new/final/ensembl_dumps/new_gene_id_assign/gff3_fix_files/ema0003
Use the following script to replace those genes:

```sh
cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore
../scripts/replace_ema0003_corrected.sh cml52
```
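The authoritative ema0003 check is the GFF3QC toolkit's; a minimal sketch of the same idea — flag mRNAs whose coordinates extend beyond their parent gene's span — on a toy gff with made-up feature IDs:

```shell
# Toy gff: mrna1 ends at 600 but its parent gene1 ends at 500 (violation);
# mrna2 is fully contained in gene2 (ok). Columns are tab-delimited.
{
  printf 'chr1\tt\tgene\t100\t500\t.\t+\t.\tID=gene1\n'
  printf 'chr1\tt\tmRNA\t100\t600\t.\t+\t.\tID=mrna1;Parent=gene1\n'
  printf 'chr1\tt\tgene\t1000\t2000\t.\t+\t.\tID=gene2\n'
  printf 'chr1\tt\tmRNA\t1000\t2000\t.\t+\t.\tID=mrna2;Parent=gene2\n'
} > toy.gff
# Record each gene's span, then flag mRNAs outside their parent's span.
awk -F'\t' '$3=="gene"{g=$9; sub(/.*ID=/,"",g); sub(/;.*/,"",g); s[g]=$4; e[g]=$5}
  $3=="mRNA"{p=$9; sub(/.*Parent=/,"",p); sub(/;.*/,"",p)
    if ($4 < s[p] || $5 > e[p]) print "ema0003", $9}' toy.gff > ema0003.txt
cat ema0003.txt
```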
6. Run InterProScan on the proteins

```sh
/sonas-hs/ware/hpc/home/weix/gramene-ensembl/scripts/misc-scripts/setup_interpro_array.sh ARG1 ARG2 ARG3
```

ARG1: project directory name
ARG2: input protein fasta file
ARG3: number of jobs to submit

Example:

```sh
/sonas-hs/ware/hpc/home/weix/gramene-ensembl/scripts/misc-scripts/setup_interpro_array.sh interproscan-oh43 ../cshl_coding_corrected/oh43.corrected.pep.fasta 100
```
Command:

```sh
cd /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore
for line in `cat TropicalNAM`
do
  echo $line
  /sonas-hs/ware/hpc/home/weix/gramene-ensembl/scripts/misc-scripts/setup_interpro_array.sh interproscan-$line ../protein_dump/${line}.pep.fasta 100
done
```
7. Andrew runs his TRaCE pipeline to assign canonical transcripts

```sh
cd /sonas-hs/ware/hpc_norepl/data/olson/NAM/TRaCE/CML228
cat /sonas-hs/ware/hpc_norepl/data/weix/NAM/UnionCore/interproscan-cml228/output/interproscan-cml228.* > interproscan.tsv
../replace.pl ../../dump/zea_mayscml228_core_3_87_1.gff 0 ../chr_names_lut.txt 0 1 > models.chr.gff
~olson/rank_transcripts.pl models.chr.gff interproscan.tsv 0.5 0.5 0.2 0.7 stringtie/* > rt.txt
perl ~olson/src/warelab/gramene-ensembl/scripts/load-scripts/set_canonical_transcripts_from_TRaCE.pl --debug -r ../../../../weix/NAM/registry/nam.reg -s zea_mayscml228 -t rt.txt >& set.stdall
```
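This is not TRaCE itself (TRaCE ranks transcripts using domain coverage and expression evidence from the stringtie inputs), but a naive stand-in that just picks the longest protein per gene illustrates what "choosing a canonical transcript" means. Toy fasta with a hypothetical geneID_Tnnn naming scheme:

```shell
# Toy proteins: gene g1 has two isoforms, g1_T002 is longer.
printf '>g1_T001\nMKTAYIAKQ\n>g1_T002\nMKTAYIAKQRQISFVK\n>g2_T001\nMSTNP\n' \
  > toy.pep.fasta
# Track the longest protein seen per gene; emit one canonical pick per gene.
awk '/^>/{id=substr($0,2); g=id; sub(/_T[0-9]+$/,"",g); next}
     {if (length($0) > best[g]) {best[g]=length($0); canon[g]=id}}
     END{for (g in canon) print g, canon[g]}' toy.pep.fasta | sort > canonical.txt
cat canonical.txt
```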
8. Dump the genes in the following formats
1) gff with canonical transcripts flagged (no InterPro domains in the gff; those are only in the tsv files)
2) fasta files for cDNA, CDS, and protein