HW4 ‐ Cleaning and Preparing the Bing COVID‐19 Data for Analysis - yentingw-gin/BIDD330_Spring2025_Reporting GitHub Wiki

Overview of the Bing COVID-19 Dataset

The Bing COVID-19 dataset, provided by Microsoft through its Azure Open Datasets platform. It contains daily numbers of confirmed cases, deaths, and recoveries from trusted sources around the world.

In HW3, we used the SQL Server Import and Export Wizard to upload the raw CSV file. While this approach is quick and easy, one drawback is that it automatically converts all fields to the string data type. This can be a problem when we want to perform calculations or filter data accurately, especially for numeric and date fields. In addition to this, there are other data quality issues in the dataset that also need to be addressed as part of the cleanup process.

For our final project, we’ll be using the Bing COVID-19 dataset to build a data warehouse and create reports. But before diving into analysis or building dashboards, we need to clean the data to ensure it’s consistent and ready for use in a BI environment.


Data Cleanup Areas in the Bing COVID-19 Dataset

1. Data Type Issues: Quantitative and Date Fields Stored as Strings

Some numeric fields such as confirmed and deaths, as well as dates fields like updated are automatically converted to strings when uploading the Bing COVID-19 CSV file using SQL Server Import and Export Wizard. This leads to issues in sorting, mathematical operations, and time-based analysis.

Solution:

  • If using SQL, convert the data type using CAST function during ETL processing

2. Empty or Null Fields

Several entries in the dataset contain missing values. For example, the admin_region_1 column has blank fields in many rows. These gaps can cause issues when trying to group or filter the data by geographic regions or when joining it with other datasets that rely on complete location hierarchies. To identify where these missing values occur, we can use the following snippet:

SELECT
  Top 100 *
FROM 
  [dbo].[Raw_BingCovid]
WHERE 
  country_region = 'United States'
  AND admin_region_1 = ''
  AND country_region <> ''
Go

Solution:

  • Remove incomplete records, impute values from other sources, or replace them with placeholders such as "Unknown"

3. Unicode Issue: Non-English Country Names

Some country or region names in the dataset appear in non-English characters. However, because the data is stored using the VARCHAR data type instead of NVARCHAR, these characters are displayed incorrectly or replaced with unreadable symbols. This can lead to issues when standardizing geographic names for analysis or visualization, or joining the data with external sources using English and properly encoded names.

Solution:

  • Ensure to store the text fields containing non-English characters using Unicode-compatible types like NVARCHAR
  • For the final, the non-English country names can be irrelevant as the scope is limited to analyzing data within the United States

Final Notes

The above data-cleaning steps are necessary to get the Bing COVID-19 dataset ready for the final project. These actions are essential to ensure the data is clean and structured properly for loading into a data warehouse.