Data Analysis Project 2 - OscarO-0/BAT102_oscarosuna GitHub Wiki

#Project 2. Impact of the number of citations on long term database availability

##Research aim: Investigate if the popularity of databases (the number of citations of the database) impacts the availability of databases.

##Research questions:

  1. How many databases have a very high number of citations (more than 100 citations), a medium range number of citations (10 to 100 citations), a low citation number (less than 10 citations), and no citations?

##What did you calculate?

The number of databases with a high number of citations was 852. The number with a medium range of citations was 1053. The number with low citations was 106, and the number with no citations was 77.

##How did you calculate it?

I calculated this from the database for project 2, using Excel formulas to count the databases that fall into each citation range.
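The same binning can be sketched outside Excel. The Python below is only an illustration of the citation ranges defined in research question 1, not the formulas actually used. The function name `citation_category` and the sample citation counts are hypothetical and are not taken from the project dataset.

```python
from collections import Counter


def citation_category(citations: int) -> str:
    """Assign a citation count to one of the four ranges from question 1."""
    if citations > 100:   # very high: more than 100 citations
        return "high"
    if citations >= 10:   # medium: 10 to 100 citations
        return "medium"
    if citations >= 1:    # low: fewer than 10 citations, but at least one
        return "low"
    return "none"         # no citations


# Hypothetical citation counts, NOT values from the project dataset.
sample_citations = [250, 101, 100, 10, 9, 1, 0]

# Count how many databases fall into each range.
category_counts = Counter(citation_category(c) for c in sample_citations)
print(category_counts)
```

The boundary values (100, 10, and 0) are included in the sample on purpose, because they are the cases where a range formula is easiest to get wrong.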

##What are your observations and conclusions?

Overall, the medium range was the largest group (1053), but the highly cited group (852) was not far behind. This suggests that the databases in my dataset were relatively popular. The low-citation and no-citation groups were much smaller.

  2. How many databases with a very high number of citations (more than 100 citations) are old databases (published more than 10 years ago)?

##What did you calculate?

66.67% of databases with a very high range of citations are more than 10 years old.

##How did you calculate it?

I used Excel formulas to count the number of databases that are both more than 10 years old and highly cited. Then I divided by the number of highly cited databases to get the proportion.
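As a rough sketch of that calculation (count the databases that are both old and highly cited, then divide by the number of highly cited ones), the Python below works on a few made-up records. The function name, the field names `citations` and `year`, the `reference_year` parameter, and all sample values are assumptions for illustration only. They are not the project data, and the result is not the figure reported above.

```python
def share_old_among_highly_cited(records, reference_year,
                                 min_citations=100, min_age=10):
    """Fraction of highly cited databases published more than min_age years ago."""
    # Keep only databases with more than min_citations citations.
    highly_cited = [r for r in records if r["citations"] > min_citations]
    if not highly_cited:
        return 0.0  # avoid dividing by zero when no database qualifies
    # Of those, keep the ones published more than min_age years ago.
    old = [r for r in highly_cited if reference_year - r["year"] > min_age]
    return len(old) / len(highly_cited)


# Hypothetical records, NOT values from the project dataset.
sample = [
    {"citations": 500, "year": 2001},
    {"citations": 150, "year": 2005},
    {"citations": 120, "year": 2008},
    {"citations": 300, "year": 2018},
    {"citations": 40, "year": 2000},   # old, but not highly cited
]

share = share_old_among_highly_cited(sample, reference_year=2022)
print(f"{share:.2%}")
```

Note that the denominator is the number of highly cited databases, not the whole dataset, which matches the division described above.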

##Observations and conclusions?

This shows that about two thirds of the highly cited databases are also more than 10 years old, which is clearly more than half. I wonder what this proportion would look like with a larger dataset. Would it level off toward half or less than half? Or would it increase to an even larger proportion?

  3. Are databases with a high or medium number of citations less susceptible to being discontinued than databases with low or no citations?

##What did you calculate?

In my dataset, highly cited databases were discontinued significantly less often than the others. Databases with a medium number of citations were not discontinued significantly less often than databases with low or no citations.

##How did you calculate it?

I used Excel "IF" formulas on the old dataset. Unfortunately, my old dataset was out of order, so I had to pick out a lot of the data by hand, which leaves room for miscalculations.
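One way to reduce the risk of hand-picking errors is to compute the discontinued share per citation group directly from the rows, which does not depend on row order. The Python below is a minimal sketch under assumptions: each row is a hypothetical `(citations, is_discontinued)` pair, the function names are invented for illustration, and the sample rows are not from the old dataset.

```python
from collections import defaultdict


def citation_category(citations: int) -> str:
    """Assign a citation count to one of the four ranges from question 1."""
    if citations > 100:
        return "high"
    if citations >= 10:
        return "medium"
    if citations >= 1:
        return "low"
    return "none"


def discontinued_rate_by_category(records):
    """Return the fraction of discontinued databases in each citation range."""
    totals = defaultdict(int)
    discontinued = defaultdict(int)
    for citations, is_discontinued in records:
        category = citation_category(citations)
        totals[category] += 1
        if is_discontinued:
            discontinued[category] += 1
    # Only categories that actually occur are reported, so no division by zero.
    return {cat: discontinued[cat] / totals[cat] for cat in totals}


# Hypothetical (citations, is_discontinued) rows, NOT the project data.
sample = [
    (300, False), (150, False), (120, True),
    (50, True), (20, False),
    (5, True),
    (0, True), (0, False),
]

rates = discontinued_rate_by_category(sample)
for category, rate in rates.items():
    print(f"{category}: {rate:.0%} discontinued")
```

Comparing these per-group rates side by side is what the research question asks for. Whether a difference counts as significant would still need a statistical test, which this sketch does not perform.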

##Observations and conclusions?

A significant proportion of highly cited databases are not discontinued, which supports our hypothesis for highly cited databases. I am still not sure why we used a different dataset for this question than for the first two, since it makes the results of the three questions harder to relate to each other. Furthermore, the dataset was out of order, which made the calculations difficult.