Insights! - bcb420-2023/Angela_Uzelac GitHub Wiki

Objective

Note down big ideas throughout the course.

Insight 1

It seems that the biggest challenge in bioinformatics, and probably the biggest source of pain for bioinformaticians, is inconsistency in data across databases. I've learned in previous courses and seen directly in this course that the same gene or protein has different IDs in different databases, and a lot of energy and brain power goes into converting those IDs so that the data is consistent throughout an experiment, whether by using an existing conversion tool such as BioMart or by writing a new algorithm. I now see that another major problem is that the data submitted to GEO is not standardized at many levels. So even something like pulling a dataset from GEO is not that easy because, for example, organism names are not always written the same way. There are many factors to take into consideration when doing bioinformatics analyses, so we must always take extra care.
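
As a rough illustration of the ID conversion problem described above, here is a minimal Python sketch. It assumes a lookup table has already been exported from a conversion tool; the function name `convert_ids`, the toy table, and the version suffix on the first ID are all made up for the example. The point is only that unmapped IDs should be collected and inspected rather than silently dropped.

```python
def convert_ids(gene_ids, mapping):
    """Convert gene IDs with a lookup table, keeping track of failures."""
    converted = {}
    unmapped = []
    for gene_id in gene_ids:
        # Drop a trailing version suffix such as ".17" before the lookup.
        base_id = gene_id.split(".")[0]
        if base_id in mapping:
            converted[gene_id] = mapping[base_id]
        else:
            unmapped.append(gene_id)
    return converted, unmapped


# Toy lookup table (assumption); a real one would come from a conversion tool.
ensembl_to_symbol = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}

converted, unmapped = convert_ids(
    ["ENSG00000141510.17", "ENSG00000012048", "ENSG_NOT_IN_TABLE"],
    ensembl_to_symbol,
)
print("converted:", converted)
print("unmapped:", unmapped)
```

Reporting how many IDs failed to map is probably the most useful part, since that number says how much data is lost before the analysis even starts.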

Insight 2

Bioinformatics analyses are not as rigid as I thought. For example, I thought that a gene only counts as differentially expressed if the difference between its expression values in control vs disease has a p-value of less than 0.05. But this threshold is just a commonly chosen and widely accepted convention. It can be flexible, and a gene that misses the threshold is not necessarily unchanged between conditions. There are so many factors that must be taken into account, and experiments are never perfect. Even imperfect data can be useful in achieving our final goal and can tell us a story.
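
To see how much the choice of cutoff matters, a small Python sketch like the following can help. The p-values are invented for the example and `count_significant` is a hypothetical helper, not part of any package used in the course.

```python
def count_significant(p_values, thresholds=(0.01, 0.05, 0.1)):
    """Count how many p-values fall below each candidate threshold."""
    return {t: sum(p < t for p in p_values) for t in thresholds}


# Made-up p-values, one per gene, purely for illustration.
p_values = [0.001, 0.03, 0.049, 0.06, 0.2]

counts = count_significant(p_values)
for threshold, n_genes in counts.items():
    print(f"p < {threshold}: {n_genes} genes")
```

In this toy case the gene at 0.06 is excluded at 0.05 but included at 0.1, which is exactly the kind of borderline gene that a rigid cutoff hides.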

Insight 3

There is rarely one best tool for all experiments; everything depends on context. For example, there are many different types of gene expression data, each with its own benefits (the volume of the data, whether we're looking at a single cell, etc.). Another example is normalization methods for RNA-seq data, which are chosen depending on what they need to normalize across (for example, library size or gene length). There are also many tools for gene set enrichment analysis. One big factor is whether you are using a thresholded or non-thresholded gene list, but you also must consider things like where the annotations come from.
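
The difference between a thresholded and a non-thresholded gene list can be sketched in a few lines of Python. This is only an illustration: the gene names and values are invented, and the rank score used here (the sign of the log fold change times -log10 of the p-value) is one common choice that I am assuming, not the only option.

```python
import math


def thresholded_list(results, p_cutoff=0.05):
    """Keep only the genes that pass the p-value cutoff (unordered set of hits)."""
    return [gene for gene, (log_fc, p) in results.items() if p < p_cutoff]


def ranked_list(results):
    """Keep every gene, ordered from most up-regulated to most down-regulated."""
    scores = {
        # -log10(p) gives the magnitude, the fold-change sign gives the direction.
        gene: math.copysign(-math.log10(p), log_fc)
        for gene, (log_fc, p) in results.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented results: gene -> (log fold change, p-value).
results = {
    "geneA": (2.0, 0.001),
    "geneB": (-1.5, 0.01),
    "geneC": (0.2, 0.5),
}

print(thresholded_list(results))
print([gene for gene, score in ranked_list(results)])
```

The thresholded list drops geneC entirely, while the ranked list keeps it in the middle, which is why the two kinds of list call for different enrichment tools.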

Insight 4

There is a huge benefit for tools to have associated R packages because then developers don't have to switch between programs to run an analysis. Instead, they can simply have all their scripts in one place and don't have to manually gather data and input it somewhere else.

Insight 5

Remember for the future: it is really important to choose data sources that are updated regularly. We saw with the DAVID data source how detrimental it can be when data is not updated frequently. I also noticed that several times in this course we chose tools that use GO annotations because GO is frequently updated. We live in a world where new discoveries are being made at a huge rate, so data gets outdated faster than ever, and outdated data can lead to less accurate results.

Insight 6

Whenever we do anything to our data, the whole way from raw data to the final results, it is a really good idea to always explore what each change has done to the data. I have personally run into several mishaps while cleaning, normalizing, and scoring my data, and it was very beneficial to do things like actually open the data frames or run some statistical analysis to see if the data makes sense so far. Additionally, it's always a good idea to graph these intermediates because they can give us a great overview of the data and how it has changed.
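
One way to make this habit concrete is to print the same small summary after every step. The Python sketch below is a hypothetical helper with invented values; it only uses the standard library, and the steps shown (dropping missing values, then a log2 transform with a pseudocount of 1) are assumptions chosen for the example.

```python
import math
from statistics import median


def summarize(step_name, values):
    """Return basic sanity-check numbers for one intermediate result."""
    present = [v for v in values if v is not None]
    return {
        "step": step_name,
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "median": median(present),
        "max": max(present),
    }


# Invented raw counts for one sample, with one missing value.
raw = [0, 10, None, 1000]
cleaned = [v for v in raw if v is not None]
logged = [math.log2(v + 1) for v in cleaned]  # pseudocount avoids log2(0)

for name, data in [("raw", raw), ("cleaned", cleaned), ("log2", logged)]:
    print(summarize(name, data))
```

If the number of rows or the range of values changes in a way I didn't expect between two steps, that is the signal to open the data frame and look.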

Insight 7

It's important to remember that sometimes less is more. Just because something outputs more results doesn't mean they are good results. For example, in my work, I had what I thought were low numbers of significantly differentially expressed genes. But what I didn't realize is that these "low numbers" could actually give me more focused results when I did pathway analysis. Also, when we were looking at the results from g:Profiler, it helped to decrease the maximum term size because the remaining terms were less vague and more specific to our gene lists. In this class, luckily, we were given recommendations for result sizes that would give us reasonable explanations, but I think that in the future, when we are working in the field by ourselves, it will be hard to determine what the best number or size of results is. It's probably subjective, and maybe it's best to run analyses several times with different parameters.
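
The term size idea can be sketched as a simple filter run with a few different limits. In this Python example the enrichment records, the field name `term_size`, and the cutoffs are all assumptions made up for illustration; they are not the actual output format of any tool.

```python
def filter_by_term_size(terms, min_size=5, max_size=200):
    """Keep enrichment terms whose annotated gene count is within the limits."""
    return [t for t in terms if min_size <= t["term_size"] <= max_size]


# Invented enrichment results; term_size = number of genes annotated to the term.
terms = [
    {"name": "very broad term", "term_size": 5000},
    {"name": "mid-sized term", "term_size": 150},
    {"name": "specific term", "term_size": 40},
]

# Re-run the same filter with several limits to see how the results change.
for max_size in (10000, 200, 100):
    kept = filter_by_term_size(terms, max_size=max_size)
    print(max_size, [t["name"] for t in kept])
```

Comparing the outputs side by side is a cheap way to judge whether a stricter limit removes only the vague terms or also something I care about.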

Insight 8

The only networks that I have learned about or dealt with in the past have been interactions between genes, or better said, the nodes represented genes, transcripts, or proteins, and the edges represented physical interactions between them. It was really interesting to learn about the different ways we can use networks, and some of the examples that the professor talked about seem like they have significant applications in the field of bioinformatics. I am really intrigued by the abstraction where nodes represent gene sets and edges represent the genes that the gene sets share. I think this is harder to wrap my head around as I am used to one node being one physical entity.
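
To help myself picture the abstraction where nodes are gene sets and edges are shared genes, here is a minimal Python sketch. The gene sets are invented, and using the Jaccard index with a cutoff of 0.25 to decide whether two sets are connected is my assumption for the example, not necessarily what any particular tool does.

```python
from itertools import combinations


def gene_set_network(gene_sets, min_similarity=0.25):
    """Build edges between gene sets that share enough genes.

    Each edge is (set_a, set_b, shared_genes, jaccard_index).
    """
    edges = []
    for (name_a, genes_a), (name_b, genes_b) in combinations(gene_sets.items(), 2):
        union = genes_a | genes_b
        if not union:
            continue  # two empty sets have no defined similarity
        shared = genes_a & genes_b
        jaccard = len(shared) / len(union)
        if jaccard >= min_similarity:
            edges.append((name_a, name_b, sorted(shared), jaccard))
    return edges


# Invented gene sets: each node of the network is a whole set, not one gene.
gene_sets = {
    "set1": {"A", "B", "C"},
    "set2": {"B", "C", "D"},
    "set3": {"X", "Y"},
}

for edge in gene_set_network(gene_sets):
    print(edge)
```

Here set1 and set2 are connected because they share B and C, while set3 stays an isolated node because it overlaps with nothing.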

Insight 9

In my opinion, the cool thing about this course, as well as bioinformatics as a whole, is that I can get so caught up in designing algorithms, writing scripts, running statistical analyses, plotting, and interpreting many numbers that I forget the actual purpose of what I'm doing. I frequently forget to think about the biological context and what these results mean for actual human disease. After all, the reason I love doing this is the healthcare applications. So I am trying to refer back to the paper associated with my data set more frequently, first to better understand the results, and second to appreciate the potential these findings could have for therapeutics research. It's always rewarding for me to look back at the big picture and think about why doing these analyses is important.