title: "Lesson 3: Cleaning Solutions" - guillermopetcho/Coursera-Certificate----Data-Analytics-Google GitHub Wiki


title: "Lesson 3: Cleaning Solutions"

output: html_document

Cleaning data solutions

This document contains the solutions for the cleaning data activity. You can use these solutions to check your work and ensure that your code is correct or troubleshoot your code if it is returning errors. If you haven't completed the activity yet, we suggest you go back and finish it before reading the solutions.

If you experience errors, remember that you can search the internet and the RStudio community for help: https://community.rstudio.com/#

Step 1: Load packages

Start by installing the required packages. If you have already installed and loaded tidyverse, skimr, and janitor in this session, feel free to skip the code chunks in this step.

install.packages("tidyverse")
install.packages("skimr")
install.packages("janitor")

Once a package is installed, you can load it by running the library() function with the package name inside the parentheses:

library(tidyverse)
library(skimr)
library(janitor)

Step 2: Import data

The data in this example is originally from the article Hotel Booking Demand Datasets (https://www.sciencedirect.com/science/article/pii/S2352340918315191), written by Nuno Antonio, Ana Almeida, and Luis Nunes for Data in Brief, Volume 22, February 2019.

The data was downloaded and cleaned by Thomas Mock and Antoine Bichat for #TidyTuesday during the week of February 11th, 2020 (https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-02-11/readme.md).

You can learn more about the dataset here: https://www.kaggle.com/jessemostipak/hotel-booking-demand

In the chunk below, you will use the read_csv() function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called bookings_df:

bookings_df <- read_csv("hotel_bookings.csv")

Step 3: Getting to know your data

Before you start cleaning your data, take some time to explore it. You can use several functions that you are already familiar with to preview your data, including the head() function in the code chunk below:

head(bookings_df)

You can summarize or preview the data with the str() and glimpse() functions to get a better understanding of the data by running the code chunks below:

str(bookings_df)
glimpse(bookings_df)

You can also use colnames() to check the names of the columns in your data set. Run the code chunk below to find out the column names in this data set:

colnames(bookings_df)

Use the skim_without_charts() function from the skimr package by running the code below:

skim_without_charts(bookings_df)

Step 4: Cleaning your data

Based on your notes you are primarily interested in the following variables: hotel, is_canceled, lead_time. Create a new data frame with just those columns, calling it trimmed_df.

trimmed_df <- bookings_df %>% 
  select(hotel, is_canceled, lead_time)

Rename the variable 'hotel' to be named 'hotel_type' to be crystal clear on what the data is about:

trimmed_df %>% 
  select(hotel, is_canceled, lead_time) %>% 
  rename(hotel_type = hotel)

In this example, you can combine the arrival month and year into one column using the unite() function:

example_df <- bookings_df %>%
  select(arrival_date_year, arrival_date_month) %>% 
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"), sep = " ")

Step 5: Another way of doing things

You can also use themutate() function to make changes to your columns. Let's say you wanted to create a new column that summed up all the adults, children, and babies on a reservation for the total number of people. Modify the code chunk below to create that new column:

example_df <- bookings_df %>%
  mutate(guests = adults + children + babies)

head(example_df)

Great. Now it's time to calculate some summary statistics! Calculate the total number of canceled bookings and the average lead time for booking - you'll want to start your code after the %>% symbol. Make a column called 'number_canceled' to represent the total number of canceled bookings. Then, make a column called 'average_lead_time' to represent the average lead time. Use the summarize() function to do this in the code chunk below:

example_df <- bookings_df %>%
  summarize(number_canceled = sum(is_canceled),
            average_lead_time = mean(lead_time))

head(example_df)