Data Analysis - alexneubert/BAT102_ANalexandrian GitHub Wiki

Project 4. Impact of the number of database updates on long-term database availability

Research aim:

Investigate whether the number of database updates affects the availability of databases.

Research questions:

  1. How many databases have a high number of published updates (more than 5 updates), a medium number of updates (2 to 5 updates), and no updates?

Hypothesis: Most databases are likely to have few updates, given the effort involved in maintaining and updating them. Databases with a high number of updates are expected to be less common than those with a medium number of updates or none.

Methods: Data Grouping: Classify the databases into three groups:

  • High updates: more than 5 updates
  • Medium updates: between 2 and 5 updates
  • No updates: 0 updates
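The grouping step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `n_updates` field and the sample records are hypothetical. The three groups as defined leave databases with exactly 1 update unassigned, so the sketch labels them `unclassified` rather than guessing where they belong.

```python
from collections import Counter


def update_group(n_updates: int) -> str:
    """Map an update count to the groups named in the Methods."""
    if n_updates > 5:
        return "high"
    if 2 <= n_updates <= 5:
        return "medium"
    if n_updates == 0:
        return "none"
    # Exactly 1 update is not covered by the three groups in the text,
    # so it is kept separate here (an assumption of this sketch).
    return "unclassified"


# Hypothetical records; "n_updates" is an assumed column name.
records = [
    {"db_id": 1, "n_updates": 0},
    {"db_id": 2, "n_updates": 3},
    {"db_id": 3, "n_updates": 8},
    {"db_id": 4, "n_updates": 1},
    {"db_id": 5, "n_updates": 0},
]

# Count how many databases fall into each group.
group_counts = Counter(update_group(r["n_updates"]) for r in records)
print(dict(group_counts))
```

The resulting counts per group are the values a bar chart of update ranges would plot.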

Potential visualizations: Bar Chart: A simple bar chart showing the number of databases in each update range (high, medium, no updates).

  2. How many databases published more than 10 years ago have recent updates (updates in the past 10 years)?

Hypothesis: Many older databases may not receive frequent updates. However, a significant number of essential or widely used databases are expected to still be updated regularly.

Methods:

  • Data Filtering: Identify databases published more than 10 years ago.
  • Recent Updates Check: From this filtered group, count how many databases have had updates in the last 10 years.
  • Time Analysis: Compare the publication date of each database with the date of its most recent update.
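The three steps above could look roughly like the following sketch. The field names `year_published` and `year_last_update`, the sample records, and the reference year are all assumptions made for illustration. Dates are simplified to years, and a database that was never updated is represented by `None`.

```python
def old_with_recent_update(records, reference_year, window=10):
    """Return (old databases, old databases updated within the window)."""
    cutoff = reference_year - window
    # Data Filtering: published more than `window` years ago.
    old = [r for r in records if r["year_published"] < cutoff]
    # Recent Updates Check: last update falls inside the window.
    recent = [
        r for r in old
        if r["year_last_update"] is not None and r["year_last_update"] >= cutoff
    ]
    return old, recent


# Hypothetical records; both year fields are assumed column names.
records = [
    {"db_id": 1, "year_published": 2005, "year_last_update": 2020},
    {"db_id": 2, "year_published": 2008, "year_last_update": 2010},
    {"db_id": 3, "year_published": 2018, "year_last_update": 2022},
    {"db_id": 4, "year_published": 2001, "year_last_update": None},
]

old, recent = old_with_recent_update(records, reference_year=2024)

# Time Analysis: years between publication and the most recent update.
gaps = {
    r["db_id"]: r["year_last_update"] - r["year_published"]
    for r in old
    if r["year_last_update"] is not None
}
print(len(old), len(recent), gaps)
```

Whether the boundary year counts as "recent" (`>=` versus `>`) is a choice the analysis would need to fix explicitly; the sketch includes it.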

Potential visualizations: Line Graph: Showing the trend of updates over time for databases published more than 10 years ago.

  3. What is the proportion of available and unavailable databases with high, medium, or no updates?

Hypothesis: Databases with more frequent updates are expected to have higher availability due to active maintenance. Those with few or no updates may face issues with long-term availability.

Methods:

  • Data Grouping: Group the databases based on their update frequency (high, medium, no updates).
  • Availability Check: Determine the current availability of each database.
  • Proportion Calculation: Calculate the proportion of available vs. unavailable databases for each group (high, medium, no updates).
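As a rough sketch of the proportion calculation, assuming each record already carries its update group and an availability flag (the `group` and `available` fields and the sample values are hypothetical, not taken from the dataset):

```python
from collections import defaultdict


def availability_by_group(records):
    """Return the share of available databases within each update group."""
    totals = defaultdict(int)
    available = defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        if r["available"]:
            available[r["group"]] += 1
    # Proportion available per group; the unavailable share is 1 minus this.
    return {g: available[g] / totals[g] for g in totals}


# Hypothetical records; "group" and "available" are assumed column names.
records = [
    {"db_id": 1, "group": "high", "available": True},
    {"db_id": 2, "group": "high", "available": True},
    {"db_id": 3, "group": "medium", "available": True},
    {"db_id": 4, "group": "medium", "available": False},
    {"db_id": 5, "group": "none", "available": False},
]

proportions = availability_by_group(records)
print(proportions)
```

Each group's available and unavailable shares sum to 1, which is the form a stacked bar chart per update category expects.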

Potential visualizations: Stacked Bar Chart: Showing the proportion of available and unavailable databases within each update frequency category.

Dataset description:

2355 entries (1 entry per database). Databases that were never published online were excluded.

Variables included:

  • db_id : Unique identifier for the database in the JL_DB dataset
  • resource_name : Name of the database