B1 IV: Functional analysis

Course: VT25 Bioinformatics 1 (SC00037)


The aim of this practical is to introduce you to databases and their corresponding web tools for performing functional analysis on a set of genes:

1. STRING. A database of known and predicted protein-protein interactions

2. The Gene Ontology resource. The world’s largest source of information on the functions of genes

3. DAVID Bioinformatics Resources. A knowledgebase with a set of functional annotation tools

4. Reactome. An open-source, open-access, manually curated and peer-reviewed pathway database

5. Enrichr. A suite of gene set enrichment analysis tools (OPTIONAL)




Whenever we end up with a list of proteins or genes, we want to investigate their functions and the pathways they belong to; in other words, we want to perform a functional analysis.

The data set used in the proteomics analysis came from HeLa cells. The control group was grown in normal conditions, and the treatment group was grown in hypoxic conditions, that is, with a limited amount of oxygen. Let's try different tools for analysing functional consequences, using the differentially expressed (DE) proteins you obtained in the previous exercise. If you do not have the file, you can download it here: results_proteomics.xlsx

Protein Interaction

Open the Excel file and answer the following questions.

Q1. What is the accession ID of the most significant differentially expressed protein?

Q2. What are the accession IDs of the most up- and down-regulated proteins that are also significant? Use fdr < 0.05 as the threshold.
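If you would rather answer Q1 and Q2 programmatically than by sorting in Excel, the lookups can be sketched in Python. This is only a sketch under assumptions: the rows are made-up placeholders, and the field names `accession`, `logFC` and `fdr` are hypothetical, so adapt them to the column headers in your own file.

```python
# Toy rows standing in for the spreadsheet; accessions and values are placeholders.
rows = [
    {"accession": "PROT_A", "logFC": 2.1, "fdr": 0.001},
    {"accession": "PROT_B", "logFC": -1.7, "fdr": 0.020},
    {"accession": "PROT_C", "logFC": 3.5, "fdr": 0.300},
    {"accession": "PROT_D", "logFC": -0.4, "fdr": 0.0004},
]

FDR_THRESHOLD = 0.05

# Q1: the most significant protein is the one with the smallest fdr.
most_significant = min(rows, key=lambda r: r["fdr"])

# Q2: keep only significant proteins, then take the extreme fold changes.
significant = [r for r in rows if r["fdr"] < FDR_THRESHOLD]
most_up = max(significant, key=lambda r: r["logFC"])
most_down = min(significant, key=lambda r: r["logFC"])

print("most significant:", most_significant["accession"])
print("most up-regulated:", most_up["accession"])
print("most down-regulated:", most_down["accession"])
```

Note that PROT_C has the largest fold change in the toy data but is ignored for Q2 because it does not pass the fdr threshold.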

UniProt

Go to UniProt

Q3. What are their full names and functions?

Remember that from most databases you can access information from other databases. In this case, Gene Ontology terms, pathways, etc. are scattered throughout the entry. In addition, UniProt has a tool for mapping IDs from one format to another!

STRING

Let's look at STRING. Here you can upload more than one protein at once. Upload the 20 most significant proteins. Hint: Check the left menu.

Note that here edges do not necessarily mean that proteins physically bind to each other.

Note also that not all interactions may be shown.

  • Go to Settings
  • Change max number of interactors to show to 50
  • Update

You can check the color coding under Legend to see what kind of interactions there are in the graph.
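The same kind of query can, in principle, be made from a script. The sketch below only builds a request URL and sends nothing. The endpoint and the parameter names (`identifiers`, `species`, `add_nodes`) are written as I recall them from the STRING API documentation and should be treated as assumptions to verify there; `add_nodes` appears to play a role similar to the "max number of interactors to show" setting. The identifiers are placeholders.

```python
from urllib.parse import urlencode


def string_network_url(identifiers, species=9606, add_nodes=0):
    """Build a STRING network request URL (assumed endpoint; nothing is sent)."""
    params = {
        # Assumption: STRING expects identifiers separated by carriage returns.
        "identifiers": "\r".join(identifiers),
        # NCBI taxonomy ID; 9606 is Homo sapiens.
        "species": species,
        # Assumption: number of extra interactors added around the query proteins.
        "add_nodes": add_nodes,
    }
    return "https://string-db.org/api/tsv/network?" + urlencode(params)


# Placeholder identifiers; replace them with your own accession IDs.
url = string_network_url(["PROT_A", "PROT_B"], add_nodes=50)
print(url)
```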

Q4. How is this different from the previous network?

Let's limit the type of interactions:

  • Under Settings -> active interaction sources
  • Tick only Databases and Experiments
  • Change the max number of interactors to show back to "none / query proteins only"; otherwise there will be a lot of data to digest!
  • Update

Q5. Do any of the proteins functionally interact? List them, indicating which type of interaction they have.

Have a look under the Analysis tab. Make a note of the type of information displayed.

  • Go to Settings
  • Change max number of interactors to show to 50
  • Check the Analysis tab again.

Q6. Why do you have different information?

GO analysis with GOrilla

GOrilla is one of many tools for identifying and visualizing enriched GO terms in ranked lists of proteins/genes. It can be used in two ways:

  • one single ranked list (ranked by adjusted p-value/fdr) or
  • upload two lists, your chosen significant set and a background
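To get a feeling for how a single ranked list can be tested without choosing a cutoff in advance, here is a minimal sketch of a minimum hypergeometric (mHG) score, the kind of statistic GOrilla is based on. It is an illustration only: GOrilla additionally turns this score into an exact p-value that accounts for having tried every cutoff, which is not done here. The label lists are toy data, where 1 means "annotated with the GO term" and the order is the ranking.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when drawing n items from N, of which B are annotated."""
    favourable = sum(
        comb(B, k) * comb(N - B, n - k) for k in range(b, min(n, B) + 1)
    )
    return favourable / comb(N, n)


def mhg_score(labels):
    """Smallest hypergeometric tail over all cutoffs of a ranked 0/1 list."""
    N, B = len(labels), sum(labels)
    best, hits = 1.0, 0
    for n, label in enumerate(labels, start=1):
        hits += label  # annotated items among the top n
        best = min(best, hypergeom_tail(N, B, n, hits))
    return best


# Toy rankings: the term is concentrated at the top vs. spread over the list.
top_heavy = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
spread = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]

print(mhg_score(top_heavy), mhg_score(spread))
```

A term whose proteins cluster at the top of the ranking gets a smaller score than one whose proteins are scattered, which is why the order of your list matters in this running mode.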

Single ranked list of genes

  • Sort your list of DE proteins by fdr
  • Go to the GOrilla webpage
  • Select the Single ranked list of genes as running mode
  • Paste your ranked list of protein names (alternatively, you can upload a file with the UniProt accession IDs)
  • Select Process as Ontology
  • Under Advanced parameters tick Show output also in REViGO
  • Run the analysis

Q7. What is the most enriched GO term?

  • Click Visualize output in REViGO
  • Use default settings and start REViGO

Q8. Based on the scatterplot, what is the most enriched GO term? What is the interpretation of this graph?

Two unranked lists of genes

  • Extract the genes from your list that are significant using the threshold fdr < 0.05
  • Divide the list in up-regulated and down-regulated proteins
  • Go to the GOrilla webpage
  • Select the Two unranked lists of genes as running mode
  • Paste the down-regulated UniProt accession IDs as the Target set
  • Paste all the protein accession IDs as the background set
  • Select Process as Ontology
  • Under Advanced parameters tick Show output also in REViGO
  • Run the analysis
  • Run REViGO and take a snapshot of the Interactive Graph

Now repeat the analysis for the up-regulated dataset.
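The target and background sets for both runs can be prepared in one go. The sketch below is an assumption-laden example: the rows are placeholders, the field names `accession`, `logFC` and `fdr` are hypothetical, and a positive `logFC` is assumed to mean up-regulation in the treatment group.

```python
# Toy rows standing in for the spreadsheet; accessions and values are placeholders.
rows = [
    {"accession": "PROT_A", "logFC": 2.1, "fdr": 0.001},
    {"accession": "PROT_B", "logFC": -1.7, "fdr": 0.020},
    {"accession": "PROT_C", "logFC": 3.5, "fdr": 0.300},
    {"accession": "PROT_D", "logFC": -0.4, "fdr": 0.0004},
]

FDR_THRESHOLD = 0.05

# Background set: every protein that was measured, significant or not.
background = [r["accession"] for r in rows]

# Target sets: significant proteins split by the direction of change.
significant = [r for r in rows if r["fdr"] < FDR_THRESHOLD]
up_regulated = [r["accession"] for r in significant if r["logFC"] > 0]
down_regulated = [r["accession"] for r in significant if r["logFC"] < 0]

# One ID per line, ready to paste into the Target set / Background set boxes.
print("\n".join(down_regulated))
```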

Q9. What are the most enriched GO terms for the up-regulated and down-regulated datasets? What are your conclusions from the interactive graphs?

GO analysis with PANTHER

PANTHER is another tool that can be used to perform an overrepresentation analysis of GO terms.

  • Go to the PANTHER webpage

Under the Gene List Analysis:

  • Paste your list of significant protein names, using fdr < 0.05 (alternatively, you can upload a file with the UniProt accession IDs)
  • Select ID list under List Type
  • Select the corresponding organism
  • Under Select Analysis tick Statistical overrepresentation test
  • Choose GO biological process as Annotation set
  • Click Submit
  • Select Homo sapiens genes as Default whole-genome lists
  • Select Fisher's exact test as Test type
  • Select Calculate False Discovery Rate as Correction
  • Launch the analysis
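To make the two statistical choices above more concrete, here is a minimal sketch of a one-sided Fisher's exact test for over-representation followed by a Benjamini-Hochberg false discovery rate correction. It is not PANTHER's own implementation; the function names, term names and counts are all made up for illustration, and the list is assumed to be a subset of the reference list.

```python
from math import comb


def overrepresentation_p(ref_total, ref_hits, list_total, list_hits):
    """One-sided Fisher's exact p-value: P(X >= list_hits) under the hypergeometric null."""
    upper = min(list_total, ref_hits)
    favourable = sum(
        comb(ref_hits, k) * comb(ref_total - ref_hits, list_total - k)
        for k in range(list_hits, upper + 1)
    )
    return favourable / comb(ref_total, list_total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy counts (placeholders): term -> (hits in your list, hits in the reference list).
REF_TOTAL, LIST_TOTAL = 20000, 100
toy_terms = {"term_1": (12, 400), "term_2": (3, 500), "term_3": (6, 150)}

names = list(toy_terms)
pvals = [
    overrepresentation_p(REF_TOTAL, toy_terms[t][1], LIST_TOTAL, toy_terms[t][0])
    for t in names
]
for name, p, q in zip(names, pvals, benjamini_hochberg(pvals)):
    print(f"{name}\tp={p:.3g}\tfdr={q:.3g}")
```

The correction matters because one test is run per GO term, so some small raw p-values are expected by chance alone.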

Q10. What does the +/- column indicate?

Q11. What are the most significant overrepresented GO terms? Are they the same as the ones you got with the single ranked list of genes in GOrilla?

Pathway analysis with Reactome

Besides the functions of genes annotated as GO terms, proteins and genes can also be placed in pathways. Reactome is a curated, open-source pathway database that can be used for overrepresentation analysis and that can visualize how molecules interact with each other.

  • Go to Reactome
  • Select Analysis tools
  • Paste the list of significant proteins
  • Click Continue
  • Leave the default values
  • Click Analyse!

In the details panel (lower panel) you will see a sorted list with the most enriched pathways at the top. Click on one of them and see how this pathway is placed in the event hierarchy (left panel). The view port (central panel) will zoom in to the corresponding pathway.

Q12. What does the yellow coloring in the view port mean?

Q13. How many proteins from our dataset were found in the most enriched pathway? What proteins are those?

Now click on the node representing the most enriched pathway in the view port. This will zoom in to the specific reactions. Take a snapshot of the pathway. Hint: there is an export diagram icon at the top right of the viewer.

Some proteins/complexes are filled (or half filled) in a darker yellow color. Zoom in to any of the aminoacyl-tRNA synthetase multienzyme complexes. Hover over the complex and a blue arrow will appear; click on it to see the proteins that belong to that complex.

Q14. Why are some proteins highlighted and others are not?

Search for the VARS protein. Hint: you could use the Search in the diagram icon at the top left of the viewer.

You will see a 6, which indicates the number of interactions this protein has. Click on it.

Q15. Which proteins interact with VARS?

Let's zoom out. In the Event hierarchy, click on each of the parents of the Cytosolic tRNA aminoacylation pathway and take a snapshot of the highest level pathway.

Q16. Which is the highest level pathway?

Q17. Considering the pathways with an FDR < 0.001, what are the common high level pathways?

Functional annotation with DAVID

DAVID is the Database for Annotation, Visualization and Integrated Discovery. It provides a comprehensive set of functional annotation tools for any given gene list. Among other things, you can identify enriched GO terms, visualize genes on BioCarta & KEGG pathway maps and convert gene identifiers from one type to another.

  • Go to the DAVID webpage
  • Under Shortcut to DAVID Tools select Functional Annotation Clustering
  • Under the Upload panel, paste the list of the significant proteins
  • Select UNIPROT_ACCESSION as Identifier
  • Select Gene List as List Type
  • Submit

Click on Gene_Ontology within the Annotation Summary Results:

Q18. Why are there some terms in red?

Click on the Chart button of the GOTERM_BP_1. This will open a window with the GO terms of all your proteins.

  • Display the Options menu
  • Set the EASE field to 0.001. EASE is a modified Fisher's exact p-value used for gene-enrichment analysis.
  • Make sure only Fold Enrichment and FDR are selected
  • Rerun
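The two statistics kept in the chart can be sketched as follows. The EASE score is computed here as the Fisher's exact upper-tail p-value after removing one hit from your list, which makes it more conservative than the plain test; DAVID's exact contingency-table layout may differ slightly, so treat this as one reading of the definition rather than DAVID's implementation. Function names and counts are toy values.

```python
from math import comb


def upper_tail(pop_total, pop_hits, list_total, list_hits):
    """P(X >= list_hits) under the hypergeometric null."""
    favourable = sum(
        comb(pop_hits, k) * comb(pop_total - pop_hits, list_total - k)
        for k in range(list_hits, min(list_total, pop_hits) + 1)
    )
    return favourable / comb(pop_total, list_total)


def ease_score(pop_total, pop_hits, list_total, list_hits):
    """Upper-tail p-value after removing one hit from the list (more conservative)."""
    return upper_tail(pop_total, pop_hits, list_total, max(list_hits - 1, 0))


def fold_enrichment(pop_total, pop_hits, list_total, list_hits):
    """Fraction of the list in the term divided by the fraction of the background."""
    return (list_hits / list_total) / (pop_hits / pop_total)


# Toy counts: 3 of 300 list genes in a term that holds 40 of 30000 background genes.
counts = dict(pop_total=30000, pop_hits=40, list_total=300, list_hits=3)

fisher_p = upper_tail(**counts)
ease = ease_score(**counts)
fold = fold_enrichment(**counts)
print(f"Fisher p={fisher_p:.3g}  EASE={ease:.3g}  fold enrichment={fold:.3g}")
```

Because one hit is removed before testing, terms supported by very few proteins are penalized the most, so lowering the EASE threshold mainly filters out weakly supported terms.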

The _1 suffix refers to the highest level of the GO tree and therefore provides low specificity. Just have a look at the resulting terms in your Functional Annotation chart.

Check the Chart for the GOTERM_BP_3:

Q19. At which level do you find the terms you found at the Gene Ontology webpage?

Let's investigate the pathways: open the Chart for REACTOME_PATHWAY and rerun it with the same options we used for the GO terms.

Q20. Do you find the top pathways you got at the Reactome webpage?

Q21. Are there any relevant OMIM terms related to your dataset?

As you can see, there tends to be a lot of data to analyse, especially because of redundancy. The Functional Annotation Clustering groups similar annotations together, making it easier to browse the results. As an example:

  • Check that only the REACTOME_PATHWAY is selected (unselect everything else)
  • Click the Functional Annotation Clustering button at the bottom of the page
  • Display the Options menu
  • Set the EASE field to 0.001
  • Make sure only Fold Enrichment and FDR are selected
  • Rerun

Q22. How many clusters were found? Do you think they are correctly clustered?

OPTIONAL: Enrichr

Enrichr is a suite of gene set enrichment analysis tools; as of today, it has 192 different reference gene-set libraries.

Input the list of significant proteins, analyze and browse the results. Note that you need the ENTREZ gene symbols as input; you can use DAVID's Gene ID Conversion tool for this purpose.

Q23. Are the results similar to what you have obtained so far?



Developed by Katarina Truvé and Marcela Dávila, 2018. Modified by Marcela Dávila, 2022.
