HW4 ‐ Cleaning and Preparing the Bing COVID‐19 Data for Analysis - yentingw-gin/BIDD330_Spring2025_Reporting GitHub Wiki
Overview of the Bing COVID-19 Dataset
The Bing COVID-19 dataset, provided by Microsoft through its Azure Open Datasets platform. It contains daily numbers of confirmed cases, deaths, and recoveries from trusted sources around the world.
In HW3, we used the SQL Server Import and Export Wizard to upload the raw CSV file. While this approach is quick and easy, one drawback is that it automatically converts all fields to the string data type. This can be a problem when we want to perform calculations or filter data accurately, especially for numeric and date fields. In addition to this, there are other data quality issues in the dataset that also need to be addressed as part of the cleanup process.
For our final project, we’ll be using the Bing COVID-19 dataset to build a data warehouse and create reports. But before diving into analysis or building dashboards, we need to clean the data to ensure it’s consistent and ready for use in a BI environment.
Data Cleanup Areas in the Bing COVID-19 Dataset
1. Data Type Issues: Quantitative and Date Fields Stored as Strings
Some numeric fields such as confirmed
and deaths
, as well as dates fields like updated
are automatically converted to strings when uploading the Bing COVID-19 CSV file using SQL Server Import and Export Wizard. This leads to issues in sorting, mathematical operations, and time-based analysis.
Solution:
- If using SQL, convert the data type using
CAST
function during ETL processing
2. Empty or Null Fields
Several entries in the dataset contain missing values. For example, the admin_region_1
column has blank fields in many rows. These gaps can cause issues when trying to group or filter the data by geographic regions or when joining it with other datasets that rely on complete location hierarchies. To identify where these missing values occur, we can use the following snippet:
SELECT
Top 100 *
FROM
[dbo].[Raw_BingCovid]
WHERE
country_region = 'United States'
AND admin_region_1 = ''
AND country_region <> ''
Go
Solution:
- Remove incomplete records, impute values from other sources, or replace them with placeholders such as "Unknown"
3. Unicode Issue: Non-English Country Names
Some country or region names in the dataset appear in non-English characters. However, because the data is stored using the VARCHAR
data type instead of NVARCHAR
, these characters are displayed incorrectly or replaced with unreadable symbols. This can lead to issues when standardizing geographic names for analysis or visualization, or joining the data with external sources using English and properly encoded names.
Solution:
- Ensure to store the text fields containing non-English characters using Unicode-compatible types like
NVARCHAR
- For the final, the non-English country names can be irrelevant as the scope is limited to analyzing data within the United States
Final Notes
The above data-cleaning steps are necessary to get the Bing COVID-19 dataset ready for the final project. These actions are essential to ensure the data is clean and structured properly for loading into a data warehouse.