7.3.2. Cleaning data
An important part of cleaning data is making sure that all of your files are accurately named. Although individual preferences will vary a bit, most analysts generally agree that file names should be accurate, consistent, and easy to read. This reading provides some general guidelines for you to follow when naming or renaming your data files.
When you first start working with R (or any other programming language, analysis tool, or platform, for that matter), you or your company should establish naming conventions for your files. This helps ensure that anyone reviewing your analysis–yourself included–can quickly and easily find what they need. Next are some helpful “do’s” and “don’ts” to keep in mind when naming your files.
Do
- Keep your filenames to a reasonable length
- Use underscores and hyphens for readability
- Start or end your filename with a letter or number
- Use a standard date format when applicable; example: YYYY-MM-DD
- Use filenames for related files that work well with default ordering; example: in chronological order, or logical order using numbers first
Examples of good filenames |
---|
2020-04-10_march-attendance.R |
2021_03_20_new_customer_ids.csv |
01_data-sales.html |
02_data-sales.html |
Don't
- Use unnecessary additional characters in filenames
- Use spaces or “illegal” characters; examples: &, %, #, <, or >
- Start or end your filename with a symbol
- Use incomplete or inconsistent date formats; example: M-D-YY
- Use filenames for related files that do not work well with default ordering; examples: a random system of numbers or date formats, or using letters first
Examples of filenames to avoid |
---|
4102020marchattendance.R |
_20210320*newcustomeridsforfebonly.csv |
firstfile_for_datasales/1-25-2020.html |
secondfile_for_datasales/2-5-2020.html |
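The guidelines above can be sketched in a few lines of base R. This is only an illustration: `make_filename()` is a hypothetical helper, and the descriptions passed to it are made up. It builds names with a YYYY-MM-DD prefix, shows that default sorting then matches chronological order, and flags names that contain spaces or symbols.

```r
# Hypothetical helper: prefix a description with a YYYY-MM-DD date
make_filename <- function(date, description, ext) {
  paste0(format(as.Date(date), "%Y-%m-%d"), "_", description, ".", ext)
}

files <- c(
  make_filename("2021-03-20", "new-customer-ids", "csv"),
  make_filename("2020-04-10", "march-attendance", "csv"),
  make_filename("2020-12-01", "december-attendance", "csv")
)

# With YYYY-MM-DD prefixes, default sorting is also chronological
sorted_files <- sort(files)
sorted_files

# TRUE marks a name containing a space or a character other than
# letters, digits, ".", "_" or "-"
has_bad_chars <- grepl("[^A-Za-z0-9._-]", c("01_data-sales.html", "first file&data.html"))
has_bad_chars
```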
These resources include more info about some of the file naming standards discussed here, and provide additional insights into best practices.
- How to name files: this resource from Speaker Deck is a playful take on file naming. It includes several slides with tips and examples for how to accurately name lots of different types of files. You will learn why filenames should be both machine readable and human readable.
- File naming and structure: this resource from the Princeton University Library provides an easy-to-scan list of best practices, considerations, and examples for developing file naming conventions.
You might remember that an operator is a symbol that identifies the type of operation or calculation to be performed in a formula. In an earlier video, you learned how to use the assignment and arithmetic operators to assign variables and perform calculations. In this reading, you will review a detailed summary of the main types of operators in R, and learn how to use specific operators in R code.
In R, there are four main types of operators:
- Arithmetic
- Relational
- Logical
- Assignment
Review the specific operators in each category and check out some examples of how to use them in R code.
Arithmetic operators
Arithmetic operators let you perform basic math operations like addition, subtraction, multiplication, and division.
The table below summarizes the different arithmetic operators in R. The examples used in the table are based on the creation of two variables: x equals 2 and y equals 5. Note that you use the assignment operator to store these values:
x <- 2
y <- 5
Operator | Description | Example Code | Result/Output |
---|---|---|---|
+ | Addition | x + y | [1] 7 |
- | Subtraction | x - y | [1] -3 |
* | Multiplication | x * y | [1] 10 |
/ | Division | x / y | [1] 0.4 |
%% | Modulus (returns the remainder after division) | y %% x | [1] 1 |
%/% | Integer division (returns an integer value after division) | y %/% x | [1] 2 |
^ | Exponent | y ^ x | [1] 25 |
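Modulus and integer division are easiest to understand together. The short sketch below reuses the same x and y values; the names `quotient` and `remainder` are chosen only for illustration.

```r
x <- 2
y <- 5

quotient  <- y %/% x   # whole-number part of 5 divided by 2
remainder <- y %% x    # what is left over after that division

quotient
remainder

# Recombining the two pieces recovers the original value of y
quotient * x + remainder == y
```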
Relational operators
Relational operators, also known as comparators, allow you to compare values. Relational operators identify how one R object relates to another—like whether an object is less than, equal to, or greater than another object. The output for relational operators is either TRUE or FALSE (which is a logical data type, or boolean).
The table below summarizes the six relational operators in R. The examples used in the table are based on the creation of two variables: x equals 2 and y equals 5. Note that you use the assignment operator to store these values.
x <- 2
y <- 5
If you perform calculations with each operator, you get the following results. In this case, the output is boolean: TRUE or FALSE. Note that the [1] that appears before each output is used to represent how output is displayed in RStudio.
Operator | Description | Example Code | Result/Output |
---|---|---|---|
< | Less than | x < y | [1] TRUE |
> | Greater than | x > y | [1] FALSE |
<= | Less than or equal to | x <= 2 | [1] TRUE |
>= | Greater than or equal to | y >= 10 | [1] FALSE |
== | Equal to | y == 5 | [1] TRUE |
!= | Not equal to | x != 2 | [1] FALSE |
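Relational operators also work on whole vectors, comparing each element in turn. The sketch below uses a made-up `scores` vector to show this, and confirms that the result of a comparison is a logical value.

```r
x <- 2
y <- 5

# The result of a comparison can be stored like any other value
result <- x < y
class(result)

# Made-up values: the comparison is applied to every element
scores <- c(4, 9, 2, 7)
passed <- scores >= 5
passed

# A logical vector can be used to keep only the matching elements
scores[passed]
```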
Logical operators
Logical operators allow you to combine logical values. Logical operators return a logical data type or boolean (TRUE or FALSE). You encountered logical operators in an earlier reading, Logical operators and conditional statements, but here is a quick refresher.
The table below summarizes the logical operators in R.
Operator | Description |
---|---|
& | Element-wise logical AND |
&& | Logical AND |
\| | Element-wise logical OR |
\|\| | Logical OR |
! | Logical NOT |
Next, check out some examples of how logical operators work in R code.
Element-wise logical AND (&) and OR (|)
You can illustrate logical AND (&) and OR (|) by comparing numerical values. Create a variable x that is equal to 10.
x <- 10
The AND operator returns TRUE only if both individual values are TRUE.
x > 2 & x < 12
[1] TRUE
10 is greater than 2 and 10 is less than 12. So, the operation evaluates to TRUE.
The OR operator (|) works in a similar way to the AND operator (&). The main difference is that just one of the values of the OR operation needs to be TRUE for the entire OR operation to evaluate to TRUE. Only if both values are FALSE will the entire OR operation evaluate to FALSE.
Now try an example with the same variable (x <- 10):
x > 2 | x < 8
[1] TRUE
10 is greater than 2, but 10 is not less than 8. But since at least one of the values (10 > 2) is TRUE, the OR operation evaluates to TRUE.
Logical AND (&&) and OR (||)
The main difference between element-wise logical operators (&, |) and logical operators (&&, ||) is the way they apply to operations with vectors. The operations with single signs, AND (&) and OR (|), examine all the elements of each vector. The operations with double signs, AND (&&) and OR (||), are designed for single logical values: in versions of R before 4.3.0 they only examine the first element of each vector, and from R 4.3.0 onward they stop with an error if either side has more than one element.
For example, imagine you are working with two vectors that each contain three elements: c(3, 5, 7) and c(2, 4, 6). The element-wise logical AND (&) will compare the first element of the first vector with the first element of the second vector (3 & 2), the second element with the second element (5 & 4), and the third element with the third element (7 & 6).
Now check out this example in R code.
First, create two variables, x and y, to store the two vectors:
x <- c(3, 5, 7)
y <- c(2, 4, 6)
Then run the code with a single ampersand (&). The output is boolean (TRUE or FALSE).
x < 5 & y < 5
[1] TRUE FALSE FALSE
When you compare each element of the two vectors, the output is TRUE, FALSE, FALSE. The first element of both x (3) and y (2) is less than 5, so this is TRUE. The second element of x is not less than 5 (it’s equal to 5) but the second element of y is less than 5, so this is FALSE (because you used AND). The third element of both x and y is not less than 5, so this is also FALSE.
Now, run the same operation using the double ampersand (&&). The output shown here is what versions of R before 4.3.0 return:
x < 5 && y < 5
[1] TRUE
In this case, R only compares the first elements of each vector: 3 and 2. So, the output is TRUE because 3 and 2 are both less than 5. R 4.2.0 added a warning for this usage, and R 4.3.0 and later stop with an error instead, because && expects a single TRUE or FALSE value on each side.
Depending on the type of work you do, you might make use of single sign operators more often than double sign operators. But it is helpful to know how all of the operators work regardless.
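Double sign operators are most at home with single TRUE or FALSE values, for example inside an if statement. The following is a minimal sketch of that pattern, reusing the x and y vectors from above. The name `first_pair_small` is only illustrative, and `all()` is the base R function that collapses a logical vector into one value.

```r
x <- c(3, 5, 7)
y <- c(2, 4, 6)

# Compare single elements so each side of && has length one
first_pair_small <- x[1] < 5 && y[1] < 5
first_pair_small

# all() reduces an element-wise comparison to a single TRUE or FALSE,
# which makes the result safe to combine with &&
every_value_small <- all(x < 10) && all(y < 10)

if (every_value_small) {
  message("Every element of x and y is less than 10")
}
```

Written this way, the code behaves the same in older and newer versions of R, because && never receives a vector with more than one element.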
Logical NOT (!)
The NOT operator simply negates the logical value, and evaluates to its opposite. In R, zero is considered FALSE and all non-zero numbers are considered TRUE.
For example, apply the NOT operator to your variable (x <- 10):
!(x < 15)
[1] FALSE
The NOT operation evaluates to FALSE because it takes the opposite logical value of the statement x < 15, which is TRUE (10 is less than 15).
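A few more NOT examples may help, including the zero and non-zero rule mentioned above. This is a small sketch; the result names are only for illustration.

```r
x <- 10

# x > 15 is FALSE, so NOT turns it into TRUE
not_greater <- !(x > 15)
not_greater

# Numbers are converted to logical values first:
# zero counts as FALSE, any non-zero number counts as TRUE
not_zero <- !0
not_five <- !5
not_zero
not_five

# NOT is applied to each element of a logical vector
flipped <- !c(TRUE, FALSE, TRUE)
flipped
```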
Assignment operators
Assignment operators let you assign values to variables.
In many programming languages you can just use the equal sign (=) to assign a variable. For R, the best practice is to use the arrow assignment (<-). Technically, the single arrow assignment can be used in the left or right direction. But the rightward assignment is not generally used in R code.
You can also use the double arrow assignment, known as a scoping assignment. But the scoping assignment is for advanced R users, so you won’t learn about it in this reading.
The table below summarizes the assignment operators and example code in R. Notice that the output for each variable is its assigned value.
Operator | Description | Example Code (after running the sample code, typing x will generate the output in the next column) | Result/Output |
---|---|---|---|
<- | Leftwards assignment | x <- 2 | [1] 2 |
<<- | Leftwards assignment | x <<- 7 | [1] 7 |
= | Leftwards assignment | x = 9 | [1] 9 |
-> | Rightwards assignment | 11 -> x | [1] 11 |
->> | Rightwards assignment | 21 ->> x | [1] 21 |
The operators you learned about in this reading are a great foundation for using operators in R.
Check out the article about R Operators on the R Coder website for a comprehensive guide to the different types of operators in R. The article includes lots of useful coding examples, and information about miscellaneous operators, the infix operator, and the pipe operator.
So far, you’ve learned a lot about the importance of cleaning data and how to do it in spreadsheets and SQL. In this activity, you’ll follow a scenario and clean real data in R.
By the time you complete this activity, you will learn more about data cleaning functions in R and apply this know-how to import, preview, and perform calculations on different data sets. You can use these techniques to gain initial insights into your data, which will help you analyze data throughout your career.
To start, log into your RStudio (Posit) Cloud account. Open the project you will work on in the activity with this link, which opens in a new tab. If you haven't gone through this process already, at the top right portion of the screen you will see a "red stamp" indicating this project as a Temporary Copy. Click on the adjacent button, Save a Permanent Copy, and the project will be saved in your main dashboard for use with future lessons. Once that is completed, navigate to the file explorer in the bottom right and click on the following: Course 7 -> Week 3 -> Lesson3_Clean.Rmd.
The .csv file, hotel_bookings.csv, is also located in this folder.
If you have trouble finding the correct activity, check out this step-by-step guide on how to navigate in RStudio (Posit) Cloud. Make sure to select the correct R markdown (Rmd) file. The other Rmd files will be used in different activities.
If you are using RStudio Desktop, you can download the Rmd file and the data for this activity directly here:
You can also find the Rmd file with the solutions for this activity here:
Carefully read the instructions in the comments of the Rmd file and complete each step. Some steps may be as simple as running pre-written code, while others may require you to write your own functions. After you finish the steps in the Rmd file, return here to confirm that your work is complete.
In Step 5 of this activity, you created the number_canceled column to represent the total number of canceled bookings. What value was returned in this column?
A. 40234
B. 44224
C. 49550
D. 52965
The correct answer is B. 44224. Explanation: The number returned in the number_canceled column should be 44,224, which represents the total number of canceled hotel bookings. By cleaning and manipulating the data, you were able to answer an important question about it. Going forward, you can use what you know about data cleaning from past courses to help you learn how to clean data in R.
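A column like number_canceled can be produced with summarize() from dplyr. The sketch below is not the code from the Rmd file: it uses a tiny made-up data frame in place of hotel_bookings.csv and assumes that cancellations are stored as 0 or 1 in a column named is_canceled and that lead times are in a column named lead_time.

```r
library(dplyr)

# Made-up stand-in for the hotel bookings data (assumed column names)
bookings_df <- data.frame(
  hotel       = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  is_canceled = c(1, 0, 1, 0),   # 1 = canceled, 0 = not canceled
  lead_time   = c(30, 7, 120, 45)
)

# Summing a 0/1 column counts the rows where the value is 1
example_df <- bookings_df %>%
  summarize(
    number_canceled   = sum(is_canceled),
    average_lead_time = mean(lead_time)
  )

example_df
```

With the real data set, the same pattern returns one row whose number_canceled value is the total count of canceled bookings.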
Coming up in the next video, you are going to learn how to transform data in R. The video will be using manually entered data instead of a data set from an R package.
If you would like to follow along with the video in your own RStudio console, you can copy and paste the following code to enter the data and create a data frame:
id <- c(1:10)
name <- c("John Mendes", "Rob Stewart", "Rachel Abrahamson", "Christy Hickman", "Johnson Harper", "Candace Miller", "Carlson Landy", "Pansy Jordan", "Darius Berry", "Claudia Garcia")
job_title <- c("Professional", "Programmer", "Management", "Clerical", "Developer", "Programmer", "Management", "Clerical", "Developer", "Programmer")
employee <- data.frame(id, name, job_title)
Then, you can perform the functions from the video in your own console to practice transforming and cleaning data in R! Practicing along with the video will help you explore how these functions are supposed to work while also executing them yourself. You can also use this data frame to practice more after the video.
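As one possible practice step, the name column can be split into two columns with separate() from tidyr. This is a sketch on a shortened copy of the data frame above; the new column names first_name and last_name are chosen here for illustration, and the video may use different steps.

```r
library(tidyr)

# Shortened copy of the employee data frame
employee <- data.frame(
  id        = 1:3,
  name      = c("John Mendes", "Rob Stewart", "Rachel Abrahamson"),
  job_title = c("Professional", "Programmer", "Management")
)

# Split name at the space into two new columns
employee_split <- separate(
  employee,
  name,
  into = c("first_name", "last_name"),
  sep  = " "
)

employee_split
```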
When organizing or tidying your data using R, you might need to convert wide data to long data or long to wide. Recall that this is what data looks like in a wide format spreadsheet:
Wide data has observations across several columns. Each column contains data from a different condition of the variable. In this example the columns are different years.
Now check out the same data in a long format:
To review what you already learned about the difference, long data has all the observations in a single column, and the variable conditions are placed into separate rows.
There are compelling reasons to use both formats. But as an analyst, it is important to know how to tidy data when you need to. In R, you may have a data frame in a wide format that has several variables and conditions for each variable. It might feel a bit messy.
That’s where pivot_longer() comes in. As part of the tidyr package, you can use this R function to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns. Similarly, if you want to convert your data to have more columns and fewer rows, you would use the pivot_wider() function.
To learn more about these two functions and how to apply them in your R programming, check out these resources:
- Pivoting: Consider this a starting point for tidying data through wide and long conversions. This web page is taken directly from tidyr package information at tidyverse.org. It explores the components of the pivot_longer and pivot_wider functions using specific details, examples, and definitions.
- CleanItUp 5: R-Ladies Sydney: Wide to Long to Wide to…PIVOT: This resource gives you additional details about the pivot_longer and pivot_wider functions. The examples provided use interesting datasets to illustrate how to convert data from wide to long and back to wide.
- Plotting multiple variables: This resource explains how to visualize wide and long data, with ggplot2 to help tidy it. The focus is on using pivot_longer to restructure data and make similar plots of a number of variables at once. You can apply what you learn from the other resources here for a broader understanding of the pivot functions.
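Before opening those resources, a small round trip may make the two functions more concrete. The sketch below uses made-up values with one column per year; the column names country, year, and population are assumptions chosen for illustration.

```r
library(tidyr)

# Made-up wide data: one column per year
wide_df <- data.frame(
  country = c("A", "B"),
  `2019`  = c(100, 200),
  `2020`  = c(110, 210),
  check.names = FALSE   # keep the year column names as written
)

# Wide to long: the year columns become rows
long_df <- pivot_longer(
  wide_df,
  cols      = -country,
  names_to  = "year",
  values_to = "population"
)
long_df

# Long to wide: the year values become columns again
wide_again <- pivot_wider(
  long_df,
  names_from  = year,
  values_from = population
)
wide_again
```

The long version has more rows and fewer columns (4 rows, 3 columns), while the wide version has one row per country.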
What’s the best use for this command?
Categorize each R command by its functional category.
- Clean: rename_with(), rename(), clean_names(), select(), glimpse(), skim_without_charts()
- Organize: filter(), mean(), summarize(), group_by(), drop_na(), arrange(), max()
- Transform: unite(), mutate(), separate()
Data analysts are cleaning their data in R. They want to be sure that their column names are unique and consistent to avoid any errors in their analysis. What R function can they use to do this automatically?
A. rename_with()
B. rename()
C. select()
D. clean_names()
The correct answer is D. clean_names()
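To see what clean_names() does, here is a minimal sketch that assumes the janitor package is installed. The data frame and its column names are made up for illustration.

```r
library(janitor)

# Made-up data frame with inconsistent column names
messy <- data.frame(
  "Customer ID" = c(1, 2),
  "First Name"  = c("Ana", "Li"),
  "Total Sales" = c(250, 410),
  check.names = FALSE   # keep the spaces and capitals as written
)

cleaned <- clean_names(messy)

# Column names are now lowercase with underscores instead of spaces
names(cleaned)
```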
You are working with the penguins dataset. You want to use the arrange() function to sort the data for the column bill_length_mm in ascending order. You write the following code:
penguins %>%
Add a single code chunk to sort the column bill_length_mm in ascending order. Note: DO NOT write the above code penguins %>% into your answer as it has already been pre-written into the code chunk.
penguins %>% arrange(bill_length_mm)
Output:
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <int> <int>
1 Adelie Dream 32.1 15.5 188 3050
2 Adelie Dream 33.1 16.1 178 2900
3 Adelie Torgersen 33.5 19.0 190 3600
4 Adelie Dream 34.0 17.1 185 3400
5 Adelie Torgersen 34.1 18.1 193 3475
6 Adelie Torgersen 34.4 18.4 184 3325
7 Adelie Biscoe 34.5 18.1 187 2900
8 Adelie Torgersen 34.6 21.1 198 4400
9 Adelie Torgersen 34.6 17.2 189 3200
10 Adelie Biscoe 35.0 17.9 190 3450
# ... with 334 more rows, and 2 more variables: sex <chr>, year <int>
What is the shortest bill length in mm?
A. 33.5
B. 32.1
C. 34.0
D. 33.1
The correct answer is B. 32.1. Explanation: You add the code chunk arrange(bill_length_mm) to sort the column bill_length_mm in ascending order. The correct code is penguins %>% arrange(bill_length_mm). Inside the parentheses of the arrange() function is the name of the variable you want to sort. The code returns a tibble that displays the data for bill_length_mm from shortest to longest. The shortest bill length is 32.1mm.
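The same function can sort in the other direction when the column is wrapped in desc(). The sketch below uses a small made-up data frame rather than the full penguins dataset, so the values are illustrative only.

```r
library(dplyr)

# Made-up sample with the same column name as the penguins data
penguin_sample <- data.frame(
  species        = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 46.1, 49.0)
)

# Ascending order (the default): shortest bill first
ascending <- penguin_sample %>% arrange(bill_length_mm)
ascending

# Descending order: longest bill first
descending <- penguin_sample %>% arrange(desc(bill_length_mm))
descending
```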
Data analysts are working with customer information from their company’s sales data. The first and last names are in separate columns, but they want to create one column with both names instead. Which of the following functions can they use?
A. separate()
B. arrange()
C. unite()
D. select()
The correct answer is C. unite(). Explanation: The unite() function can be used to combine columns.
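A minimal sketch of that scenario, using unite() from tidyr on a made-up customers data frame. The column names first_name, last_name, and full_name are assumptions for illustration.

```r
library(tidyr)

# Made-up customer data with names in separate columns
customers <- data.frame(
  first_name = c("Ana", "Li"),
  last_name  = c("Souza", "Chen"),
  city       = c("Lisbon", "Taipei")
)

# Combine the two name columns into one, separated by a space
customers_united <- unite(customers, "full_name", first_name, last_name, sep = " ")

customers_united
```

By default unite() drops the original columns, so the result keeps only full_name and city.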