09 10 22: Data Analysis Overview and Hypothesis - rolandgriggs/URE2022_XGXaviGriggs GitHub Wiki
Project 4. Impact of the number of database updates on long term database availability
Research Aim :
Investigate whether the number of database updates impacts the availability of databases.
Research questions:
- How many databases have a high number of published updates (more than 5 updates), a medium number of updates (2 to 5 updates), and no updates?
Hypothesis : The majority of databases will have a medium number of updates.
Methods : Count the number of databases with more than 5 updates, with 2 to 5 updates, and with no updates.
Potential visualizations: Column chart
- How many databases (published more than 10 years ago) have recent updates (updates in the past 10 years)?
Hypothesis : The majority of databases published more than 10 years ago will not have recent updates.
Methods : Count the databases that are more than 10 years old and have at least one update in the past 10 years.
Potential visualizations: Line graph
- What is the proportion of available/unavailable databases among those with a high number of updates, a medium number of updates, or no updates?
Hypothesis : Databases that have high or medium updates are more likely to be available.
Methods : Number of databases that are available/unavailable in proportion to their number of updates.
Potential visualizations: Scatterplot
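The three methods above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual analysis code: the number of updates is taken to be `Nb_of_articles - 1`, the reference year is 2022, publication dates are reduced to hypothetical `first_year` and `last_year` fields, and databases with exactly one update (a case the bins above do not cover) get a separate `single` label. The sample rows are made up.

```python
from collections import Counter

REFERENCE_YEAR = 2022  # assumption: chosen to match the available_2022 variable


def update_category(nb_of_articles):
    """Bin a database by its number of published updates.

    Assumption: updates = Nb_of_articles - 1 (one article means no updates).
    """
    updates = nb_of_articles - 1
    if updates > 5:
        return "high"
    if updates >= 2:
        return "medium"
    if updates == 1:
        return "single"  # assumption: this case is not covered by the stated bins
    return "none"


def has_recent_update(first_year, last_year, nb_of_articles, window=10):
    """True if the database is older than `window` years and was updated within it."""
    is_old = REFERENCE_YEAR - first_year > window
    updated_recently = nb_of_articles > 1 and REFERENCE_YEAR - last_year <= window
    return is_old and updated_recently


def availability_by_category(rows):
    """Return the share of available databases for each update category."""
    totals, available = Counter(), Counter()
    for row in rows:
        category = update_category(row["Nb_of_articles"])
        totals[category] += 1
        available[category] += row["available_2022"]  # True counts as 1
    return {category: available[category] / totals[category] for category in totals}


# Made-up rows, for illustration only
rows = [
    {"first_year": 2005, "last_year": 2020, "Nb_of_articles": 8, "available_2022": True},
    {"first_year": 2008, "last_year": 2008, "Nb_of_articles": 1, "available_2022": False},
    {"first_year": 2010, "last_year": 2014, "Nb_of_articles": 3, "available_2022": True},
    {"first_year": 2015, "last_year": 2015, "Nb_of_articles": 1, "available_2022": True},
]

category_counts = Counter(update_category(r["Nb_of_articles"]) for r in rows)
recent_count = sum(
    has_recent_update(r["first_year"], r["last_year"], r["Nb_of_articles"]) for r in rows
)
shares = availability_by_category(rows)
```

Here `category_counts` feeds the column chart, `recent_count` answers the second question, and `shares` gives the availability proportion per update category.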
Dataset description:
2343 entries (1 entry per database). Databases that were never published online were excluded.
Variables included:
- db_id : Unique identifier for the database in the JL_DB dataset
- resource_name : Name of the database
- first_publication : Date of the first article publication of the database
- Nb_of_articles : Number of publications for that database. If equal to 1, the database had no published updates; if greater than 1, the database was updated.
- last_publication : Date of the last publication. Equal to first publication if only one article was published for that database.
- available_2022 : TRUE if the database is available online in 2022, FALSE if not
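A minimal sketch of how these variables might be read and checked, assuming the dataset is exported as CSV with the column names listed above. The ISO date format, the `TRUE`/`FALSE` spelling, the derived `nb_updates` field, and the sample rows are all assumptions made for illustration; the real file may differ.

```python
import csv
import io
from datetime import date

# Made-up sample in the assumed layout; real values and date format may differ.
SAMPLE = """db_id,resource_name,first_publication,Nb_of_articles,last_publication,available_2022
1,ExampleDB,2004-01-01,4,2016-01-01,TRUE
2,OtherDB,2011-01-01,1,2011-01-01,FALSE
"""


def parse_row(row):
    """Convert the string fields of one CSV row into typed values."""
    record = {
        "db_id": row["db_id"],
        "resource_name": row["resource_name"],
        "first_publication": date.fromisoformat(row["first_publication"]),
        "last_publication": date.fromisoformat(row["last_publication"]),
        "Nb_of_articles": int(row["Nb_of_articles"]),
        "available_2022": row["available_2022"].strip().upper() == "TRUE",
    }
    # Derived field (assumption): every article after the first counts as one update.
    record["nb_updates"] = record["Nb_of_articles"] - 1
    return record


def is_consistent(record):
    """Check that a single article implies equal first and last publication dates.

    For databases with several articles, the check that the last date is not
    earlier than the first is an added assumption, not a documented rule.
    """
    if record["Nb_of_articles"] == 1:
        return record["first_publication"] == record["last_publication"]
    return record["last_publication"] >= record["first_publication"]


records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
inconsistent_ids = [r["db_id"] for r in records if not is_consistent(r)]
```

Running the consistency check before the analysis flags entries whose dates contradict their article count, so they can be reviewed rather than silently binned.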