B1 IV: Functional analysis - BDC-training/VT25 GitHub Wiki
Course: VT25 Bioinformatics 1 (SC00037)
The aim of this practical is to introduce you to databases and their corresponding web tools for performing functional analysis on a set of genes:
1. STRING. A database of known and predicted protein-protein interactions
2. The Gene ontology resource. The world’s largest source of information on the functions of genes
3. DAVID Bioinformatics Resources. A knowledgebase with a set of functional annotation tools
4. Reactome. An open-source, open access, manually curated and peer-reviewed pathway database
5. Enrichr. A suite of gene set enrichment analysis tools (OPTIONAL)
Whenever we end up with a list of proteins or genes, we would like to investigate their functions and/or the pathways they belong to; in other words, perform a functional analysis.
The data set used in the proteomics analysis was from HeLa cells. The control group was grown in normal conditions, and the treatment group was grown in hypoxic conditions, that is, with a limited amount of oxygen. Let's try different tools for analysing functional consequences, using the DE proteins you obtained in the previous exercise. If you do not have the file, you can download it here: results_proteomics.xlsx
Open the Excel file and answer the following questions.
Q1. What is the accession ID of the most significant differentially expressed protein?
Q2. What are the accession IDs of the most up- and down-regulated proteins that are also significant? Use fdr < 0.05 as the threshold.
Go to UniProt.
Q3. What are their full names and their functions?
Remember that from most databases you can access information from other databases. In this case, gene ontology terms, pathways, etc. are scattered throughout the entry. Besides, UniProt has a tool for mapping IDs from one format to another.
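The selection logic behind Q1 and Q2 can be sketched in a few lines of Python. This is only an illustration on made-up rows: the column names accession, logFC and fdr, and the PROT_* identifiers, are assumptions and may not match the headers in results_proteomics.xlsx.

```python
# Toy rows standing in for the spreadsheet. The keys "accession", "logFC"
# and "fdr" are assumed names, and the values are invented for illustration.
rows = [
    {"accession": "PROT_A", "logFC": 2.1, "fdr": 0.001},
    {"accession": "PROT_B", "logFC": -3.4, "fdr": 0.020},
    {"accession": "PROT_C", "logFC": 4.0, "fdr": 0.300},
    {"accession": "PROT_D", "logFC": 0.8, "fdr": 0.0004},
    {"accession": "PROT_E", "logFC": -1.2, "fdr": 0.049},
]

FDR_THRESHOLD = 0.05


def most_significant(rows):
    """Return the row with the smallest fdr value."""
    return min(rows, key=lambda r: r["fdr"])


def extremes_among_significant(rows, threshold=FDR_THRESHOLD):
    """Return the most up- and most down-regulated rows with fdr < threshold."""
    significant = [r for r in rows if r["fdr"] < threshold]
    up = max(significant, key=lambda r: r["logFC"])
    down = min(significant, key=lambda r: r["logFC"])
    return up, down


top = most_significant(rows)
up, down = extremes_among_significant(rows)
print("most significant:", top["accession"])
print("most up-regulated (significant):", up["accession"])
print("most down-regulated (significant):", down["accession"])
```

Note that PROT_C has the largest fold change in the toy data but is skipped because its fdr is above the threshold, which is the point of filtering before ranking by fold change.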
Let's look at STRING. Here you can upload more than one protein at once. Upload the 20 most significant genes. Hint: Check the left menu.
Note that here edges do not necessarily mean that proteins physically bind to each other.
Note also that not all interactions may be shown.
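If you prefer to prepare the input programmatically, the sketch below picks the top entries by fdr and builds a request URL for the STRING REST API instead of using the web form. The accession and fdr keys and the example accessions are assumptions for illustration, not course data; the parameter names (identifiers, species, add_nodes) follow my reading of the STRING API documentation and should be checked there before use. The code only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"

# Example accessions with invented fdr values; not taken from the course file.
rows = [
    {"accession": "P04637", "fdr": 0.01},
    {"accession": "Q16665", "fdr": 0.001},
    {"accession": "P38398", "fdr": 0.2},
]


def top_n_by_fdr(rows, n=20):
    """Return the accessions of the n rows with the smallest fdr."""
    ranked = sorted(rows, key=lambda r: r["fdr"])
    return [r["accession"] for r in ranked[:n]]


def string_network_url(identifiers, species=9606, add_nodes=0, output_format="tsv"):
    """Build a STRING 'network' request URL (assumed parameter names).

    Identifiers are joined with a carriage return, which urlencode turns
    into %0D; 9606 is the NCBI taxonomy ID for Homo sapiens.
    """
    params = {
        "identifiers": "\r".join(identifiers),
        "species": species,
        "add_nodes": add_nodes,
    }
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


ids = top_n_by_fdr(rows, n=2)
url = string_network_url(ids)
print(url)
```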
- Go to Settings
- Change max number of interactors to show to 50
- Update
You can check the color coding under Legend to see what kind of interactions there are in the graph.
Q4. How is this different from the previous network?
Let's limit the type of interactions:
- Under Settings -> active interaction sources
- Tick only Databases and Experiments
- Change the max number of interactors to show back to -none /query proteins only, otherwise there will be a lot of data to digest!
- Update
Q5. Do any of the proteins functionally interact? List them, indicating which type of interaction they have.
Have a look under the Analysis tab. Make a note of the type of information displayed.
- Go to Settings
- Change max number of interactors to show to 50
- Check the Analysis tab again.
Q6. Why do you have different information?
GOrilla is one of many tools for identifying and visualizing enriched GO terms in ranked lists of proteins/genes. It can be used in two ways:
- with one single ranked list (ranked by adjusted p-value/fdr), or
- with two lists: your chosen significant set and a background
To run it in the first mode:
- Sort your list of DE proteins by fdr
- Go to the GOrilla webpage
- Select Single ranked list of genes as running mode
- Paste your ranked list of protein names (alternatively, you could upload a file with the UniProt accession IDs)
- Select Process as Ontology
- Under Advanced parameters tick Show output also in REViGO
- Run the analysis
Q7. What is the most enriched GO term?
- Click Visualize output in REViGO
- Use default settings and start REViGO
Q8. Based on the scatterplot, what is the most enriched GO term? What is the interpretation of this graph?
- Extract the genes from your list that are significant using the threshold fdr < 0.05
- Divide the list into up-regulated and down-regulated proteins
- Go to the GOrilla webpage
- Select Two unranked lists of genes as running mode
- Paste the down-regulated UniProt accession IDs as the Target set
- Paste all the protein accession IDs as the background set
- Select Process as Ontology
- Under Advanced parameters tick Show output also in REViGO
- Run the analysis
- Run REViGO and take a snapshot of the Interactive Graph
Now repeat the analysis for the up-regulated dataset.
Q9. What are the most enriched GO terms for the up-regulated and down-regulated datasets? What are your conclusions from the interactive graphs?
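The target and background lists for the two-list mode can be prepared as sketched below. The keys accession, logFC and fdr are assumed column names, the rows are invented, and the sketch assumes that a positive logFC means up-regulated in the treatment group; check the sign convention of your own results first.

```python
# Invented rows; key names are assumptions, not the real spreadsheet headers.
rows = [
    {"accession": "PROT_A", "logFC": 2.1, "fdr": 0.001},
    {"accession": "PROT_B", "logFC": -3.4, "fdr": 0.020},
    {"accession": "PROT_C", "logFC": 4.0, "fdr": 0.300},
    {"accession": "PROT_D", "logFC": 0.8, "fdr": 0.0004},
    {"accession": "PROT_E", "logFC": -1.2, "fdr": 0.049},
]


def split_target_and_background(rows, threshold=0.05):
    """Return (up, down, background) lists of accession IDs.

    The background is every protein in the table, significant or not.
    """
    background = [r["accession"] for r in rows]
    significant = [r for r in rows if r["fdr"] < threshold]
    up = [r["accession"] for r in significant if r["logFC"] > 0]
    down = [r["accession"] for r in significant if r["logFC"] < 0]
    return up, down, background


up, down, background = split_target_and_background(rows)

# One ID per line, ready to paste into the Target set and background set boxes.
print("Target set (down-regulated):")
print("\n".join(down))
print("Background set:")
print("\n".join(background))
```

Using all quantified proteins as the background, rather than the whole genome, keeps the comparison restricted to proteins that could have been detected in the experiment.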
PANTHER is another tool that can be used to perform an overrepresentation analysis of GO terms.
- Go to the PANTHER webpage
Under the Gene List Analysis:
- Paste your list of significant protein names, using fdr < 0.05 (alternatively, you could upload a file with the UniProt accession IDs)
- Select ID list under List Type
- Select the corresponding organism
- Under Select Analysis tick Statistical overrepresentation test
- Choose GO biological process as Annotation set
- Click Submit
- Select Homo sapiens genes as Default whole-genome lists
- Select Fisher's exact test as Test type
- Select Calculate False Discovery Rate as Correction
- Launch the analysis
Q10. What does the +/- column indicate?
Q11. What are the most significant overrepresented GO terms? Are they the same as the ones you got with the single ranked list of genes in GOrilla?
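To make the two statistical options selected above more concrete, here is a minimal sketch of an overrepresentation test for a single GO term followed by a false discovery rate correction. It uses a one-sided Fisher's exact test written as a hypergeometric tail, and the Benjamini-Hochberg procedure for the correction; both are assumptions about how such a test is commonly set up, and PANTHER's own implementation may differ in detail. All numbers are invented.

```python
from math import comb


def fisher_overrepresentation_p(hits, list_size, term_size, background_size):
    """One-sided p-value: probability of seeing at least `hits` term members
    in a random list of `list_size` genes drawn from the background."""
    max_hits = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented example: 10 background genes, 4 annotated with the term,
# 3 genes in our list, 2 of which carry the annotation.
p = fisher_overrepresentation_p(hits=2, list_size=3, term_size=4, background_size=10)
fdr = benjamini_hochberg([0.01, 0.04, 0.03])
print(p)
print(fdr)
```

The correction matters because one test is run per GO term, so thousands of raw p-values are produced for a single gene list.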
Besides the functions of genes annotated as GO terms, proteins and genes can also be placed in pathways. REACTOME is a curated, open-source pathway database that can be used for overrepresentation analysis and that can show us a visualization of how molecules interact with each other.
- Go to REACTOME
- Select Analysis tools
- Paste the list of significant proteins
- Click Continue
- Leave the default values
- Click Analyse!
In the detail panel (lower panel) you will see a sorted list with the most enriched pathways at the top. Click on one of them and see how this pathway is placed in the event hierarchy (left panel). The view port (central panel) will zoom in to the corresponding pathway.
Q12. What does the yellow coloring in the view port mean?
Q13. How many proteins from our dataset were found in the most enriched pathway? What proteins are those?
Now click on the node representing the most enriched pathway in the view port. We will zoom in to the specific reactions. Take a snapshot of the pathway. Hint: there is an export diagram icon at the top right of the viewer.
There are some proteins/complexes that are filled (or half filled) in a darker yellow color. Zoom in to any of the aminoacyl-tRNA synthetase multienzyme complexes. Hover over the complex and a blue arrow will appear. Click on it and you will get the proteins that belong to that complex.
Q14. Why are some proteins highlighted and others are not?
Search for the VARS protein. Hint: you could use the Search in the diagram icon at the top left of the viewer.
You will see a 6, which indicates the number of interactions this protein has. Click on it.
Q15. Which proteins interact with VARS?
Let's zoom out. In the Event hierarchy, click on each one of the parents of the Cytosolic tRNA aminoacylation pathway and take a snapshot of the highest level pathway.
Q16. Which is the highest level pathway?
Q17. Considering the pathways with an FDR < 0.001, what are the common high level pathways?
DAVID is the Database for Annotation, Visualization and Integrated Discovery. It provides a comprehensive set of functional annotation tools for any given gene list. Among other things, you can identify enriched biological GO terms, visualize genes on BioCarta & KEGG pathway maps and convert gene identifiers from one type to another.
- Go to the DAVID webpage
- Under Shortcut to DAVID Tools select Functional Annotation Clustering
- Under the Upload panel, paste the list of the significant proteins
- Select UNIPROT_ACCESSION as Identifier
- Select Gene List as List Type
- Submit
Click on Gene_Ontology within the Annotation Summary Results:
Q18. Why are there some terms in red?
Click on the Chart button of the GOTERM_BP_1. This will open a window with the GO terms of all your proteins.
- Display the Options menu
- Set the EASE field to 0.001. EASE is a modified Fisher Exact p-value for gene-enrichment analysis
- Make sure only Fold Enrichment and FDR are selected
- Rerun
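The two statistics used in the options above can be sketched as follows. The sketch assumes, following my reading of the DAVID documentation, that the EASE score is a one-sided Fisher's exact test computed after removing one gene from the list hits, and that Fold Enrichment is the proportion of list genes in the term divided by the proportion of background genes in the term. Function and argument names are hypothetical and the numbers are invented.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table
        a = list genes in term        b = background genes in term
        c = list genes not in term    d = background genes not in term
    i.e. the probability of observing a or more list genes in the term."""
    n = a + b + c + d
    in_term = a + b
    in_list = a + c
    total = comb(n, in_list)
    tail = sum(
        comb(in_term, k) * comb(n - in_term, in_list - k)
        for k in range(a, min(in_term, in_list) + 1)
    )
    return tail / total


def ease_score(a, b, c, d):
    """Assumed EASE definition: the same test with one list hit removed,
    which makes the score more conservative than the plain p-value."""
    return fisher_right_tail(max(a - 1, 0), b, c, d)


def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """(list_hits / list_total) divided by (pop_hits / pop_total)."""
    return (list_hits / list_total) / (pop_hits / pop_total)


plain = fisher_right_tail(2, 2, 1, 5)
ease = ease_score(2, 2, 1, 5)
fold = fold_enrichment(3, 300, 40, 30000)
print(plain, ease, fold)
```

Because one hit is removed before testing, terms supported by only a handful of genes are penalised more strongly than terms supported by many.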
The _1 points to the highest level of the GO tree and thus provides low specificity, so just have a look at the resulting terms in your Functional Annotation chart.
Check the Chart for the GOTERM_BP_3:
Q19. At which level do you find the terms you found at the Gene Ontology webpage?
Let's investigate the Pathways: look at the Chart from the REACTOME_PATHWAY and rerun with the options we used for the GOTERMS.
Q20. Do you find the top pathways you got at the REACTOME webpage?
Q21. Are there any relevant OMIM terms related to your dataset?
As you may see, there tends to be a lot of data to analyse, especially because there is redundancy. The Functional Annotation Clustering groups similar annotations together, making it easier to browse the results. As an example:
- Check that only the REACTOME_PATHWAY is selected (unselect everything else)
- Click the Functional Annotation Clustering button at the bottom of the page
- Display the Options menu
- Set the EASE field to 0.001
- Make sure only Fold Enrichment and FDR are selected
- Rerun
Q22. How many clusters were found? Do you think they are correctly clustered?
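To get a feeling for what "similar annotations" can mean, the sketch below scores how strongly two annotation terms share genes using Cohen's kappa. As far as I know DAVID's clustering is based on kappa statistics, but this is a simplified illustration with invented gene and term names, not a reimplementation of DAVID's algorithm.

```python
def kappa(genes, term_a, term_b):
    """Cohen's kappa for the agreement of two terms over a gene list.

    Each gene is either annotated or not annotated with each term;
    kappa is 1 for identical annotation and near 0 for chance overlap.
    """
    n = len(genes)
    both = sum(1 for g in genes if g in term_a and g in term_b)
    only_a = sum(1 for g in genes if g in term_a and g not in term_b)
    only_b = sum(1 for g in genes if g in term_b and g not in term_a)
    neither = n - both - only_a - only_b

    observed = (both + neither) / n
    expected = (
        (both + only_a) * (both + only_b)
        + (only_b + neither) * (only_a + neither)
    ) / n**2
    if expected == 1:
        # Both terms cover all genes or no genes: treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented gene list and term memberships.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
pathway_x = {"g1", "g2", "g3"}
pathway_y = {"g1", "g2", "g4"}
pathway_z = {"g4", "g5", "g6"}

print(kappa(genes, pathway_x, pathway_x))
print(kappa(genes, pathway_x, pathway_y))
print(kappa(genes, pathway_x, pathway_z))
```

Terms whose pairwise score exceeds a chosen cutoff would end up in the same group, which is why redundant pathway entries tend to collapse into one cluster.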
Enrichr is a suite of gene set enrichment analysis tools and, as of today, it has 192 different library reference sets.
Input the list of significant proteins, analyze and browse the results. Note that you need the ENTREZ gene symbols as input; you can use DAVID's Gene ID Conversion tool for this purpose.
Q23. Are the results similar to what you have obtained so far?
Developed by Katarina Truvé and Marcela Dávila, 2018. Modified by Marcela Dávila, 2022.