3. Prepare Phase of Data Analytics - pssawyer/Cyclistic-Case-Study GitHub Wiki
Prepare
Parameters to prepare the data set for the processing phase of data analysis.
I will examine the data in Microsoft Excel and in RStudio, hosted on Posit Cloud.
I will be using Q1 data from 2019 and 2020, supplied through the Google Data Analytics Professional Certificate course on Coursera. The data is publicly available and provided by Motivate International Inc. under this license.
Due to privacy restrictions, we cannot connect casual riders' pass purchases to rider credit card numbers, and thus cannot determine whether casual riders are in a Cyclistic service area.
Data Preparation Steps:
The data is originally located in an AWS S3 repository provided by Motivate International Inc. and was then downloaded to my local network for preparation.
The data sets are organized by the below schema for each CSV file provided:
Any schema or variable differences will be noted in the processing phase, after closer examination of the data sets.
The 2019 Q1 CSV follows the below schema:
[1] "trip_id"
Integer - 7 to 8 digits, no blanks and no duplicates. This value is unique.
[2] "start_time"
DateTime - Format: YYYY-MM-DD HH:MM:SS
[3] "end_time"
DateTime - Format: YYYY-MM-DD HH:MM:SS
[4] "bikeid"
Integer - 1 to 4 digits. 4769 bikes in this data set.
[5] "tripduration"
Time in seconds, validated by subtracting the start_time column observations from the end_time column observations.
[6] "from_station_id"
Integer - 1 to 3 digits
[7] "from_station_name"
Alpha-numeric - variable length up to 43 characters, in closest cross street format.
[8] "to_station_id"
Integer - 1 to 3 digits
[9] "to_station_name"
Alpha-numeric - variable length up to 43 characters, in closest cross street format.
[10] "usertype"
Alpha Characters, two options only "Subscriber" or "Customer." No blanks or "NA."
[11] "gender"
Alpha Characters, three options: "Male," "Female" or blank. Blanks are likely from riders declining to self-identify.
[12] "birthyear"
Year, including blanks. Blanks are likely from riders declining to self-identify.
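The tripduration check described for column [5] can be sketched in a few lines. This is a minimal Python illustration using only the standard library, not the Excel or R workflow used in this case study. The function names, the one-second tolerance, and the sample row are assumptions made for the example.

```python
from datetime import datetime

# Matches the YYYY-MM-DD HH:MM:SS format listed in the schema.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def duration_seconds(start_time: str, end_time: str) -> float:
    """Return end_time minus start_time, in seconds."""
    start = datetime.strptime(start_time, TIME_FORMAT)
    end = datetime.strptime(end_time, TIME_FORMAT)
    return (end - start).total_seconds()


def tripduration_matches(row: dict, tolerance: float = 1.0) -> bool:
    """Compare the recorded tripduration with the computed duration.

    The tolerance (in seconds) is an assumption for this sketch.
    """
    computed = duration_seconds(row["start_time"], row["end_time"])
    return abs(computed - float(row["tripduration"])) <= tolerance


# Hypothetical row for illustration; not taken from the real data set.
sample_row = {
    "start_time": "2019-01-01 00:04:37",
    "end_time": "2019-01-01 00:11:07",
    "tripduration": "390",
}
print(tripduration_matches(sample_row))
```

Rows where the function returns False would be candidates for closer inspection in the processing phase.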
The 2020 Q1 CSV follows the below schema:
[1] "ride_id"
Alpha-numeric - 16 characters, no blanks and no duplicates. This value is unique.
[2] "rideable_type"
Alpha Characters - 11 characters, only one value, "docked_bike," and no blanks.
[3] "started_at"
DateTime - Format: YYYY-MM-DD HH:MM:SS
[4] "ended_at"
DateTime - Format: YYYY-MM-DD HH:MM:SS
[5] "start_station_name"
Alpha-numeric - variable length up to 43 characters, in closest cross street format.
[6] "start_station_id"
Integer - 1 to 3 digits
[7] "end_station_name"
Alpha-numeric - variable length up to 43 characters, in closest cross street format.
[8] "end_station_id"
Integer - 1 to 3 digits
[9] "start_lat"
Floating point, including positive and negative numbers depending on the latitude of the starting station.
[10] "start_lng"
Floating point, including positive and negative numbers depending on the longitude of the starting station.
[11] "end_lat"
Floating point, including positive and negative numbers depending on the latitude of the ending station.
[12] "end_lng"
Floating point, including positive and negative numbers depending on the longitude of the ending station.
[13] "member_casual"
Alpha Characters, two options only "Member" or "Casual." No blanks or "NA."
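To confirm that each CSV file actually carries the columns listed above, the header row can be compared against the expected schema. The following is a hedged Python sketch using only the standard library; the function name `header_differences` and the in-memory sample file are assumptions for illustration, while the column names are copied from the two schemas above.

```python
import csv
import io

SCHEMA_2019 = [
    "trip_id", "start_time", "end_time", "bikeid", "tripduration",
    "from_station_id", "from_station_name", "to_station_id",
    "to_station_name", "usertype", "gender", "birthyear",
]
SCHEMA_2020 = [
    "ride_id", "rideable_type", "started_at", "ended_at",
    "start_station_name", "start_station_id", "end_station_name",
    "end_station_id", "start_lat", "start_lng", "end_lat", "end_lng",
    "member_casual",
]


def header_differences(csv_file, expected):
    """Return (missing, unexpected) column names for a CSV file object."""
    header = next(csv.reader(csv_file))
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected


# Hypothetical in-memory file standing in for the 2020 Q1 CSV header.
sample_2020 = io.StringIO(",".join(SCHEMA_2020) + "\n")
print(header_differences(sample_2020, SCHEMA_2020))
```

Running the same check with the 2019 schema against a 2020 file lists exactly the columns that differ between the two years.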
During examination:
No reliability issues were observed with the data sets.
The data sets are provided by the first party company that collected the data and are in their original schemas.
The data sets are comprehensive specifically to trip data but include very little personally identifiable information.
Specifically, the 2019 Q1 data set contains gender and birthyear information.
The data sets are current for the purposes of this case study.
The data source is cited as Motivate International Inc., which is the original source of the data.
Data integrity has been validated by sorting, filtering and removing blanks.
Specifically, the only blanks encountered were in the 2020 Q1 data set, in the variable columns "end_station_name," "end_station_id," "end_lat," and "end_lng."
The data sets should help provide insight into how casual and member riders differ by providing observations on start and end stations, weekdays and trip durations. I believe I will find high-volume start/end stations or trip durations that show differences between members and casual riders, which can be used to create a marketing campaign or initiative to convert more casual riders to members.
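The blank check mentioned above (sorting and filtering for blanks per column) can also be expressed as a short script. This is a minimal Python sketch using the standard library, not the method used in this case study. Treating both empty cells and the literal text "NA" as blank is an assumption, as are the function name and the sample rows.

```python
import csv
import io
from collections import Counter


def count_blanks(csv_file, blank_values=("", "NA")):
    """Count blank cells per column in a CSV file object."""
    blanks = Counter()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if column is None:
                continue  # extra cells beyond the header are ignored
            if value is None or value.strip() in blank_values:
                blanks[column] += 1
    return dict(blanks)


# Hypothetical rows for illustration; not taken from the real data set.
sample = io.StringIO(
    "ride_id,end_station_name,end_station_id,end_lat,end_lng\n"
    "A1,Example St & Sample Ave,1,41.9,-87.6\n"
    "B2,,,,\n"
)
print(count_blanks(sample))
```

Columns that never appear in the result contain no blanks, which gives a quick per-column summary to carry into the processing phase.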
Problems with the data sets include:
Blanks that must be accounted for.
Differing schemas between the 2019 and 2020 data sets.
During the processing phase, the data sets will be joined, and variables will be normalized to the variable names used from 2020 onward.
Any newly created variables, such as trip duration, will follow the same naming convention as the 2020 column names, that is, underscores between words.
2020 data set does not have a trip duration variable.
I will calculate this and add the variable for each observation based on "started_at" and "ended_at" datetime variables.
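The renaming and trip duration steps described above can be sketched as follows. This is a hedged Python illustration using only the standard library, not the processing code for this case study. The column mapping is inferred from the two schemas listed earlier, and the new column name `trip_duration`, the function names, and the sample row are assumptions.

```python
from datetime import datetime

# Assumed mapping from 2019 Q1 column names to their 2020 Q1 counterparts.
RENAME_2019_TO_2020 = {
    "trip_id": "ride_id",
    "start_time": "started_at",
    "end_time": "ended_at",
    "from_station_id": "start_station_id",
    "from_station_name": "start_station_name",
    "to_station_id": "end_station_id",
    "to_station_name": "end_station_name",
    "usertype": "member_casual",
}

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def normalize_row(row):
    """Rename 2019-style keys to 2020 names; other keys pass through."""
    return {RENAME_2019_TO_2020.get(key, key): value for key, value in row.items()}


def add_trip_duration(row):
    """Return a copy of the row with trip_duration in whole seconds."""
    started = datetime.strptime(row["started_at"], TIME_FORMAT)
    ended = datetime.strptime(row["ended_at"], TIME_FORMAT)
    return {**row, "trip_duration": int((ended - started).total_seconds())}


# Hypothetical 2019-style row for illustration only.
row_2019 = {
    "trip_id": "1234567",
    "start_time": "2019-01-01 00:04:37",
    "end_time": "2019-01-01 00:11:07",
    "usertype": "Subscriber",
}
print(add_trip_duration(normalize_row(row_2019)))
```

Columns without a 2020 counterpart, such as "bikeid," keep their original names in this sketch and would need a separate decision during processing.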
There are two variables with personally identifiable information in the 2019 data set that will be removed: