4. Process Phase of Data Analytics - pssawyer/Cyclistic-Case-Study GitHub Wiki

Process

Steps taken to process data for the analyze phase of data analysis.

  1. Tools utilized for data cleaning:
    • The intial data cleaning and analysis will be performed in Microsoft Excel because the data sets were provided as CSV and gaining a better understanding of the data sets early should lead to better data cleaning and analysis in RStudio.
    • Final data cleaning and analysis will be performed in Posit Cloud's RStudio, which provides a more robust set of features than spreadsheet software and capability to manipulate the data sets when combined.
      • Both data sets for 2019 and 2020 will have more than 790,000 combined observations and RStudio I belive is the approriate tool to manipulate the larger number of observations.
  2. To ensure data integrity I will or have:
    • Sorted the data to find any obvious patterns or outliers.
      • There are some large tripduration times in the 2019 data set, when performing the intial sort in Microsoft Excel.
    • Filtered the data to find any missing or obvious erroneous data.
      • There is one row or observation of data that contained blanks "end_station_name," "end_station_id," "end_lat," and "end_lng." This observation is incomplete and will be removed in Microsoft Excel before importing the data set into RStudio.
    • Validated that all columns are formatted with the appropriate format, that is all numbers are formatted as numbers, all datetime columns are formatted as YYYY-MM-DD HH:MM:SS and any text columns are formatted as general in excel.
    • Identified the variables that will be retained between the two data sets, and variables (columns) that will be added to the data sets to ensure that the data sets when merged will have consistent columns. Intial data cleaning and manipulation will be performed in Microsoft Excel and will be joined/merged in RStudio. Below is a full breakdown of the columns and how they will be cleaned and manipulated:
      • Q1 2019:
        • [1] "trip_id"
          • Transformed:
            • Unique ID for each trip, similar to 2020 data set "ride_id" column. Both columns will be merged when joining the data into "ride_id" to ensure forward compatibility with any future data analysis.
        • [2] "start_time"
          • Transformed:
            • Unique DateTime identical to 2020 data set "started_at" column. Both columns will be merged when joining the data into "started_at" to ensure forward compatibility with any future data analysis.
        • [3] "end_time"
          • Transformed:
            • Unique DateTime identical to 2020 data set "ended_at" column. Both columns will be merged when joining the data into "ended_at" to ensure forward compatibility with any future data analysis.
        • [N/A] "bikeid"
          • Removed:
            • No comparable variable in 2020 data set. Will be removed during initial cleaning in Microsoft Excel.
        • [4] "tripduration"
          • Transformed:
            • No comparable variable in 2020 data set. Will require arithmetic operation in 2020 data set to create prior to joining. New 2020 variable column will be merged with "tripduration" in new column name "ride_length" to keep schema naming conventions consistent. This column will be renamed to "ride_length" in preparation for the data merge. Recalcualted this by subtracting "started_at" variable from "ended_at" variable for each observation. The result is expressed as time format of HH:MM:SS.
        • [5] "day_of_week"
          • New column to create a variable based on the DateTime from the "started_at" column to determine the days of the week that each trip started, expressed as the name of the day that week the trip started. This was accomplished in Excel utilizing the text formula, specically "=TEXT([@[started_at]],"dddd")." Validated against the "day_of_week_num" column observation for each row by filtering data by each day and comparing to that column.
        • [6] "day_of_week_num"
          • Added:
            • New column to create a variable based on the DateTime from the "started_at" column to determine the days of the week that each trip started, expressed as 1 = Sunday, 2 = Monday and so forth. This was accomplished in Excel utilizing the weekday formula, specifically "=WEEKDAY([@[started_at]],1)." Validated against the "day_of_week" column observation for each row by filtering data by each number and comparing to the day in that column.
        • [7] "from_station_name"
          • Transformed:
            • Unique station name for bike departure station. Identical to 2020 data set "start_station_name" column. Both columns will be merged when joining the data into "start_station_name" to ensure forward compatibility with any future data analysis. Moved to column 6 to match 2020 data set variable orders.
        • [8] "from_station_id"
          • Transformed:
            • Unique station id for bike departure station. Identical to 2020 data set "start_station_id" column. Both columns will be merged when joining the data into "start_station_id" to ensure forward compatibility with any future data analysis. Moved to column 7 to match 2020 data set variable orders.
        • [9] "to_station_name"
          • Transformed:
            • Unique station name for bike arrival station. Identical to 2020 data set "end_station_name" column. Both columns will be merged when joining the data into "end_station_name" to ensure forward compatibility with any future data analysis. Moved to column 8 to match 2020 data set variable orders.
        • [10] "to_station_id"
          • Transformed:
            • Unique station id for bike arrival station. Identical to 2020 data set "end_station_id" column. Both columns will be merged when joining the data into "end_station_id" to ensure forward compatibility with any future data analysis. Moved to column 9 to match 2020 data set variable orders.
        • [11] "usertype"
          • Transformed:
            • Both current two options of "Subscriber" or "Customer," will be changed to reflect 2020 data set values of "Member" or "Casual." After the changes, both columns will be merged when joining the data into "member_casual" to ensure forward compatibility with any future data analysis. Used "IF" formulas, specifically "=IF([@[member_casual]]="Subscriber","Member","Casual")" in an adjoining column in excel to match "Subscriber" to "Member" and "Customer" to "Casual." Validated the result by filtering by "Subscriber," then validated all results in the ajoining columen were "Members" and repeated that process for "Customer" with "Casual." Then copied the values from the new column and pasted them into the "member_casual" column as values.
        • [N/A] "gender"
          • Removed:
            • No comparable variable in 2020 data set, and is Personally Identifiable Information, that will be removed during intial cleaning in Microsoft Excel.
        • [N/A]"birthyear"
          • Removed:
            • No comparable variable in 2020 data set, and is Personally Identifiable Information, that will be removed during intial cleaning in Microsoft Excel.
        • Columns data:
          • 3 Columns removed
          • 9 Columns transformed
          • 2 Columns added
          • 11 Total Columns after cleaning
      • Q1 2020:
        • [1] "ride_id"
          • Unique ID for each trip, similar to 2019 data set "trip_id" column. Both columns will be merged when joining the data into "ride_id" to ensure forward compatibility with any future data analysis.
        • [2] "started_at"
          • Unique DateTime identical to 2019 data set "start_time" column. Both columns will be merged when joining the data into "started_at" variable.
        • [3] "ended_at"
          • Unique DateTime identical to 2019 data set "end_time" column. Both columns will be merged when joining the data into "ended_at" variable.
        • [4] "ride_length"
          • Added:
            • New column in the 2020 data set that calculates the duration of a ride based on subtracting "started_at" variable from "ended_at" variable for each observation. The result is expressed as time format of HH:MM:SS.
        • [5] "day_of_week"
          • New column to create a variable based on the DateTime from the "started_at" column to determine the days of the week that each trip started, expressed as the name of the day that week the trip started. This was accomplished in Excel utilizing the text formula, specically "=TEXT([@[started_at]],"dddd")." Validated against the "day_of_week_num" column observation for each row by filtering data by each day and comparing to that column.
        • [6] "day_of_week_num"
          • Added:
            • New column to create a variable based on the DateTime from the "started_at" column to determine the days of the week that each trip started, expressed as 1 = Sunday, 2 = Monday and so forth. This was accomplished in Excel utilizing the weekday formula, specifically "=WEEKDAY([@[started_at]],1)." Validated against the "day_of_week" column observation for each row by filtering data by each number and comparing to the day in that column.
        • [7] "start_station_name"
          • Unique station name for bike departure station. Identical to 2019 data set "from_station_name" column. Both columns will be merged when joining the data into "start_station_name."
        • [8] "start_station_id"
          • Unique station id for bike departure station. Identical to 2019 data set "from_station_name" column. Both columns will be merged when joining the data into "start_station_name."
        • [9] "end_station_name"
          • Unique station name for bike arrival station. Identical to 2019 data set "to_station_name" column. Both columns will be merged when joining the data into "end_station_name."
        • [10] "end_station_id"
          • Unique station id for bike arrival station. Identical to 2019 data set "to_station_name" column. Both columns will be merged when joining the data into "end_station_name."
        • [11] "member_casual"
          • Nominal data that is identical to the 2019 data set "usertype" column. No changes required.
        • [12] "rideable_type"
          • No comparable variable in 2019 data set. Will be kept for future schema compatibility with newly collected data, but moved to column 11 to ensure consistency with schema from the cleaned 2019 data set.
        • [13] "start_lat"
          • No comparable variable in 2019 data set. Will be kept for future schema compatibility with newly collected data, but moved to column 12 to ensure consistency with schema from the cleaned 2019 data set.
        • [14] "start_lng"
          • No comparable variable in 2019 data set. Will be kept for future schema compatibility with newly collected data, but moved to column 13 to ensure consistency with schema from the cleaned 2019 data set.
        • [15] "end_lat"
          • No comparable variable in 2019 data set. Will be kept for future schema compatibility with newly collected data, but moved to column 14 to ensure consistency with schema from the cleaned 2019 data set.
        • [16] "end_lng"
          • No comparable variable in 2019 data set. Will be kept for future schema compatibility with newly collected data, but moved to column 14 to ensure consistency with schema from the cleaned 2019 data set.
        • Columns data:
          • 0 Columns removed
          • 0 Columns transformed
          • 3 Columns added
          • 16 Total Columns after cleaning

Next Step is the Analyze Phase of Data Analytics.