9.2_GO_enrichment - bennestor/hakea_genome GitHub Wiki
GO terms of Hakea genes with eggNOG mapper
I used eggNOG (http://eggnog-mapper.embl.de/) filtered to just Viridiplantae mapper to generate GO terms, but there are lots of mappers around.
- 45,786 sequences were annotated
- 24,052 sequences annotated with GO terms
Used this script: https://philippbayer.github.io/Amphibolis_Posidonia_Comparison/GOenrichment.html#Revigo
Main scripts are topgo.R and revigo.R
plotrevigo.R is handy for enrichment network figures
GO enrichment of genes specific to Hakea (gene families highly divergent from other species)
#Extract Hakea gene ids from orthogroups cat Orthogroups.txt | grep -Fw -f hakea_specific.ids | tr ' ' '\n' | grep -E -v 'O.*' | sort -u > hakea_specific_gene.txt #5868 sequences #Run topGO script in R compare('hakea_specific_gene.txt', './topgo_out/GO_results_hakea_specific_genes.csv', allgenes, annAT) #Run Revigo script in R write.csv((results_list[["GO_results_hakea_specific_genes.csv"]] %>% dplyr::filter(Eliminated == FALSE) %>% dplyr::select(TermID, Name, Value, Frequency) %>% arrange(Value)), "Revigo_hakea_specific_genes.csv")
GO enrichment of genes specific to Proteaceae GO enrichment is done on the Hakea gene IDs as a representative for the overall enriched genes in the Proteaceae-specific orthogroup (since I donโt have GO terms for Telopea and Macadamia).
#Get ids of Proteaceae-specific genes cat Orthogroups.GeneCount.tsv | awk -F "\t" '$5>0 && $6>0 && $8>0 && $2==0 && $3==0 && $4==0 && $7==0 && $9==0' | cut -f 1 > proteaceae_specific.ids #516 orthogroups #Extract Hakea gene IDs from orthogroups (gemoma_nr.ids is a list of Hakea gene IDs, can make by grepping โ>โ then tr -d โ>โ or similar). cat Orthogroups.txt | grep -Fw -f proteaceae_specific.ids | tr ' ' '\n' | grep -Fw -f gemoma_nr.ids | sort -u > proteaceae_specific_gene.txt #1962 sequences #Run topGO script in R compare('proteaceae_specific_gene.txt', './topgo_out/GO_results_proteaceae_specific_genes.csv', allgenes, annAT) #Run revigo script in R write.csv((results_list[["GO_results_proteaceae_specific_genes.csv"]] %>% dplyr::filter(Eliminated == FALSE) %>% dplyr::select(TermID, Name, Value, Frequency) %>% arrange(Value)), "Revigo_proteaceae_specific_genes.csv")
Genes in expanded gene families in Hakea
#Getting orthogroup IDs from CAFE output cat Base_change.tab | cut -f 1,5 | grep "+" | grep -v "+0" | cut -f 1 > expanded_orthogroups.ids #Append ids of large filtered families cat Base_change.tab | cut -f 1,5 | grep "+" | grep -v "+0" | cut -f 1 >> ../k2_e_1/expanded_orthogroups.ids #Extract Hakea gene IDs from significantly expanded orthogroups cat Orthogroups.txt | grep -Fw -f significant_orthogroups.ids | grep -Fw -f expanded_orthogroups.ids | tr ' ' '\n' | grep -Fw -f gemoma_nr.ids | sort -u > expanded_genes.txt #527 significantly expanded orthogroups #8104 hakea genes significantly expanded #Run topGO script in R compare('expanded_genes.txt', './topgo_out/GO_results_expanded_genes.csv', allgenes, annAT) #Run Revigo script in R write.csv((results_list[["GO_results_expanded_genes.csv"]] %>% dplyr::filter(Eliminated == FALSE) %>% dplyr::select(TermID, Name, Value, Frequency) %>% arrange(Value)), "Revigo_expanded_genes.csv")
Genes in contracted gene families in Hakea
#Append ids of large filtered families cat Base_change.tab | cut -f 1,5 | grep "\-" | cut -f 1 >>../k2_e_1/contracted_orthogroups.ids #Extract Hakea gene IDs from contracted orthogroups cat Orthogroups.txt | grep -Fw -f significant_orthogroups.ids | grep -Fw -f contracted_orthogroups.ids | tr ' ' '\n' | grep -Fw -f gemoma_nr.ids | sort -u > contracted_genes.txt #748 significantly contracted orthogroups #1487 Hakea genes #Run topGO script in R compare('contracted_genes.txt', './topgo_out/GO_results_contracted_genes.csv', allgenes, annAT) #Run Revigo script in R write.csv((results_list[["GO_results_contracted_genes.csv"]] %>% dplyr::filter(Eliminated == FALSE) %>% dplyr::select(TermID, Name, Value, Frequency) %>% arrange(Value)), "Revigo_contracted_genes.csv")
Genes in expanded and contracted gene families in Proteaceae IDs of expanded gene families can be retrieved and then used similarly to the above methods to filter Orthogroups.txt and extract a list of expanded genes.
#Append IDs of large filtered families cat Base_change.tab | cut -f 1,12 | grep "+" | grep -v "+0" | cut -f 1 >> ../k2_e_1/proteaceae_expanded_orthogroups.ids
Similarly, for contracted genes
#Append ids of large filtered families cat Base_change.tab | cut -f 1,12 | grep "\-" | cut -f 1 >>../k2_e_1/proteaceae_contracted_orthogroups.ids