7.3.2. Cleaning data - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Question

Which of the following functions returns a summary of the data frame, including the number of columns and rows? Select all that apply.

  • rename()
  • skim_without_charts()
  • clean_names()
  • glimpse()

Correct. The skim_without_charts() and glimpse() functions both return a summary of the data frame, including the number of columns and rows.

Question

The rename_with() function can be used to reformat column names to be upper or lower case.

  • True
  • False

Correct. The rename_with() function can be used to reformat column names to be upper or lower case.
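As a minimal sketch of this answer, assuming the dplyr package is installed, rename_with() can apply toupper or tolower to every column name. The data frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical data frame with mixed-case column names
customers <- data.frame(
  First_Name = c("Ana", "Ben"),
  Last_Name  = c("Lee", "Kim")
)

# rename_with() applies a function to every column name
upper_names <- rename_with(customers, toupper)
lower_names <- rename_with(customers, tolower)

names(upper_names)  # "FIRST_NAME" "LAST_NAME"
names(lower_names)  # "first_name" "last_name"
```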

File-naming conventions

An important part of cleaning data is making sure that all of your files are accurately named. Although individual preferences will vary a bit, most analysts generally agree that file names should be accurate, consistent, and easy to read. This reading provides some general guidelines for you to follow when naming or renaming your data files.

What’s in a (file)name?

When you first start working with R (or any other programming language, analysis tool, or platform, for that matter), you or your company should establish naming conventions for your files. This helps ensure that anyone reviewing your analysis–yourself included–can quickly and easily find what they need. Next are some helpful “do’s” and “don’ts” to keep in mind when naming your files.

Do

  • Keep your filenames to a reasonable length
  • Use underscores and hyphens for readability
  • Start or end your filename with a letter or number
  • Use a standard date format when applicable; example: YYYY-MM-DD
  • Use filenames for related files that work well with default ordering; example: in chronological order, or logical order using numbers first

Examples of good filenames:

  • 2020-04-10_march-attendance.R
  • 2021_03_20_new_customer_ids.csv
  • 01_data-sales.html
  • 02_data-sales.html

Don't

  • Use unnecessary additional characters in filenames
  • Use spaces or “illegal” characters; examples: &, %, #, <, or >
  • Start or end your filename with a symbol
  • Use incomplete or inconsistent date formats; example: M-D-YY
  • Use filenames for related files that do not work well with default ordering; examples: a random system of numbers or date formats, or using letters first

Examples of filenames to avoid:

  • 4102020marchattendance.R
  • _20210320*newcustomeridsforfebonly.csv
  • firstfile_for_datasales/1-25-2020.html
  • secondfile_for_datasales/2-5-2020.html
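The character rules in the lists above can be sketched as a small check in base R. The helper name is_tidy_filename is hypothetical, and it only tests which characters appear and where. It does not judge length, date format, or readability, so a name like 4102020marchattendance.R would still pass.

```r
# Hypothetical helper: checks only the character rules from the lists above
is_tidy_filename <- function(filename) {
  # Drop the extension so the check applies to the name itself
  stem <- tools::file_path_sans_ext(filename)
  # Must start and end with a letter or number; only letters, numbers,
  # underscores, and hyphens are allowed in between
  grepl("^[A-Za-z0-9]([A-Za-z0-9_-]*[A-Za-z0-9])?$", stem)
}

candidates <- c(
  "2020-04-10_march-attendance.R",
  "01_data-sales.html",
  "_20210320*newcustomeridsforfebonly.csv",
  "new customer ids.csv"
)

is_tidy_filename(candidates)
# TRUE TRUE FALSE FALSE
```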

Additional resources

These resources include more info about some of the file naming standards discussed here, and provide additional insights into best practices.

  • How to name files: this resource from Speaker Deck is a playful take on file naming. It includes several slides with tips and examples for how to accurately name lots of different types of files. You will learn why filenames should be both machine readable and human readable.
  • File naming and structure: this resource from the Princeton University Library provides an easy-to-scan list of best practices, considerations, and examples for developing file naming conventions.

More on R operators

You might remember that an operator is a symbol that identifies the type of operation or calculation to be performed in a formula. In an earlier video, you learned how to use the assignment and arithmetic operators to assign variables and perform calculations. In this reading, you will review a detailed summary of the main types of operators in R, and learn how to use specific operators in R code.

Operators

In R, there are four main types of operators:

  1. Arithmetic
  2. Relational
  3. Logical
  4. Assignment

Review the specific operators in each category and check out some examples of how to use them in R code.

Arithmetic operators

Arithmetic operators let you perform basic math operations like addition, subtraction, multiplication, and division.

The table below summarizes the different arithmetic operators in R. The examples used in the table are based on the creation of two variables: x equals 2 and y equals 5. Note that you use the assignment operator to store these values:

x <- 2

y <- 5
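With those two variables, the standard arithmetic operators in R can be tried out as follows. This is an illustrative sketch rather than a copy of the course table. The comments show the value each expression returns.

```r
x <- 2
y <- 5

x + y    # addition: 7
x - y    # subtraction: -3
x * y    # multiplication: 10
x / y    # division: 0.4
x %% y   # modulus (remainder after dividing x by y): 2
x %/% y  # integer division: 0
x ^ y    # exponent: 32
```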

Relational operators

Relational operators, also known as comparators, allow you to compare values. Relational operators identify how one R object relates to another—like whether an object is less than, equal to, or greater than another object. The output for relational operators is either TRUE or FALSE (which is a logical data type, or boolean).

The table below summarizes the six relational operators in R. The examples used in the table are based on the creation of two variables: x equals 2 and y equals 5. Note that you use the assignment operator to store these values.

x <- 2

y <- 5

If you perform calculations with each operator, you get the following results. In this case, the output is boolean: TRUE or FALSE. Note that the [1] that appears before each output is used to represent how output is displayed in RStudio.
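A short sketch of the six relational operators, using the same x and y. The specific comparisons are chosen here for illustration, and the comments show the logical value each one returns.

```r
x <- 2
y <- 5

x < y   # less than: TRUE
x > y   # greater than: FALSE
x <= 2  # less than or equal to: TRUE
x >= y  # greater than or equal to: FALSE
x == y  # equal to: FALSE
x != y  # not equal to: TRUE
```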

Logical operators

Logical operators allow you to combine logical values. Logical operators return a logical data type or boolean (TRUE or FALSE). You encountered logical operators in an earlier reading, Logical operators and conditional statements, but here is a quick refresher.

The table below summarizes the logical operators in R.

Next, check out some examples of how logical operators work in R code.

Element-wise logical AND (&) and OR (|)

You can illustrate logical AND (&) and OR (|) by comparing numerical values. Let’s create a variable x that is equal to 10.

x <- 10

The AND operator returns TRUE only if both individual values are TRUE.

x > 2 & x < 12

[1] TRUE

10 is greater than 2 and 10 is less than 12. So, the operation evaluates to TRUE.

The OR operator (|) works in a similar way to the AND operator (&). The main difference is that just one of the values of the OR operation needs to be TRUE for the entire OR operation to evaluate to TRUE. Only if both values are FALSE will the entire OR operation evaluate to FALSE.

Let’s try an example with the same variable (x <- 10):

x > 2 | x < 8

[1] TRUE

10 is greater than 2, but 10 is not less than 8. But since at least one of the values (10>2) is TRUE, the OR operation evaluates to TRUE.

Logical AND (&&)  and OR (||)

The main difference between element-wise logical operators (&, |) and logical operators (&&, ||) is the way they apply to operations with vectors. The operations with single signs, AND (&) and OR (|), examine all the elements of each vector and return one result per element. The operations with double signs, AND (&&) and OR (||), work on single values: in R versions before 4.3.0 they examine only the first element of each vector, and from R 4.3.0 onward they raise an error when either side has more than one element.

For example, imagine you are working with two vectors that each contain three elements: c(3, 5, 7) and c(2, 4, 6). The element-wise logical AND (&) will compare the first element of the first vector with the first element of the second vector (3&2), the second element with the second element (5&4), and the third element with the third element (7&6).

Let’s check out this example in R code.

First, create two variables, x and y, to store the two vectors:

x <- c(3, 5, 7)

y <- c(2, 4, 6)

Then run the code with a single ampersand (&). The output is boolean (TRUE or FALSE).

x < 5 & y < 5

[1]  TRUE FALSE FALSE

When you compare each element of the two vectors, the output is TRUE, FALSE, FALSE. The first element of both x (3) and y (2) is less than 5, so this is TRUE. The second element of x is not less than 5 (it’s equal to 5) but the second element of y is less than 5, so this is FALSE (because we used AND). The third element of both x and y is not less than 5, so this is also FALSE.

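Now, let’s run the same operation using the double ampersand (&&). In R versions before 4.3.0, the output is:

x < 5 && y < 5

[1] TRUE

In this case, R only compares the first elements of each vector: 3 and 2. So, the output is TRUE because 3 and 2 are both less than 5. Note that R 4.2.0 adds a warning here, and R 4.3.0 and later stop with an error instead, because && expects a single value on each side.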

Depending on the type of work you do, you might make use of single sign operators more often than double sign operators. But it is helpful to know how all of the operators work regardless.
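Because recent R versions (4.3.0 and later) raise an error when && or || is given a vector longer than one element, a common pattern is to collapse an element-wise result with all() or any() first, or to pick out single elements explicitly. A minimal sketch using the same vectors:

```r
x <- c(3, 5, 7)
y <- c(2, 4, 6)

# Element-wise: one result per pair of elements
both_small <- x < 5 & y < 5   # TRUE FALSE FALSE

# Collapse to a single TRUE/FALSE before using the result in a condition
all(both_small)  # FALSE: not every pair passes
any(both_small)  # TRUE: at least one pair passes

# && is safe when each side is a single value
x[1] < 5 && y[1] < 5  # TRUE
```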

Logical NOT (!)

The NOT operator simply negates the logical value, and evaluates to its opposite. In R, zero is considered FALSE and all non-zero numbers are considered TRUE.

For example, let’s apply the NOT operator to our variable (x <- 10):

!(x < 15)

[1] FALSE

The NOT operation evaluates to FALSE because it takes the opposite logical value of the statement x < 15, which is TRUE (10 is less than 15).
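A few more illustrative cases show how NOT treats numbers and vectors. The comments show the value each expression returns.

```r
x <- 10

!(x < 15)        # FALSE, because x < 15 is TRUE
!TRUE            # FALSE
!0               # TRUE: zero is treated as FALSE
!5               # FALSE: non-zero numbers are treated as TRUE
!c(TRUE, FALSE)  # FALSE TRUE: NOT works element by element
```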

Assignment operators

Assignment operators let you assign values to variables.

In many programming languages, you can just use the equal sign (=) to assign a variable. For R, the best practice is to use the arrow assignment (<-). Technically, the single arrow assignment can be used in the left or right direction, but the rightward assignment is not generally used in R code.

You can also use the double arrow assignment, known as a scoping assignment. But the scoping assignment is for advanced R users, so you won’t learn about it in this reading.

The table below summarizes the assignment operators and example code in R. Notice that the output for each variable is its assigned value.
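A brief sketch of the single-arrow and equal-sign forms described above. The variable names are arbitrary, and the scoping assignment is left out because this reading does not cover it.

```r
x <- 2   # leftward assignment (the recommended style)
3 -> y   # rightward assignment (valid, but rarely used)
z = 4    # the equal sign also assigns at the top level

x  # 2
y  # 3
z  # 4
```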

The operators you learned about in this reading are a great foundation for using operators in R.

Additional resource

Check out the article about R Operators on the R Coder website for a comprehensive guide to the different types of operators in R. The article includes lots of useful coding examples, and information about miscellaneous operators, the infix operator, and the pipe operator.

Wide to long with tidyr

When organizing or tidying your data using R, you might need to convert wide data to long data or long to wide. Recall that this is what data in a wide format looks like in a spreadsheet:

Wide data has observations across several columns. Each column contains data from a different condition of the variable. In this example, each column represents a different year.

Now check out the same data in a long format:

And, to review what you already learned about the difference, long data has all the observations in a single column, and variables in separate columns.

The pivot_longer and pivot_wider functions

There are compelling reasons to use both formats. But as an analyst, it is important to know how to tidy data when you need to. In R, you may have a data frame in a wide format that has several variables and conditions for each variable. It might feel a bit messy.

That’s where pivot_longer() comes in. This function is part of the tidyr package, and you can use it to lengthen the data in a data frame by increasing the number of rows and decreasing the number of columns. Similarly, if you want to convert your data to have more columns and fewer rows, you would use the pivot_wider() function.
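The two conversions can be sketched as a round trip, assuming the tidyr package is installed. The data frame, its values, and the column names year and value are invented for illustration.

```r
library(tidyr)

# Hypothetical wide data: one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2019` = c(100, 200),
  `2020` = c(110, 210),
  check.names = FALSE  # keep the year column names as written
)

# Wide to long: the year columns become rows
long <- pivot_longer(wide, cols = -country,
                     names_to = "year", values_to = "value")

# Long back to wide: one column per year again
wide_again <- pivot_wider(long, names_from = year, values_from = value)

dim(wide)  # 2 rows, 3 columns
dim(long)  # 4 rows, 3 columns
```

The long version has more rows and the same data: each country now appears once per year instead of once overall.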

Additional resources

To learn more about these two functions and how to apply them in your R programming, check out these resources:

  • Pivoting: Consider this a starting point for tidying data through wide and long conversions. This web page is taken directly from tidyr package information at tidyverse.org. It explores the components of the pivot_longer and pivot_wider functions using specific details, examples, and definitions.
  • CleanItUp 5: R-Ladies Sydney: Wide to Long to Wide to…PIVOT: This resource gives you additional details about the pivot_longer and pivot_wider functions. The examples provided use interesting datasets to illustrate how to convert data from wide to long and back to wide.
  • Plotting multiple variables: This resource explains how to visualize wide and long data, with ggplot2 to help tidy it. The focus is on using pivot_longer to restructure data and make similar plots of a number of variables at once. You can apply what you learn from the other resources here for a broader understanding of the pivot functions.

Test your knowledge on cleaning data

TOTAL POINTS 3

Question 1

A data analyst is cleaning their data in R. They want to be sure that their column names are unique and consistent to avoid any errors in their analysis. What R function can they use to do this automatically?

  • rename_with()
  • rename()
  • clean_names()
  • select()

Correct. The clean_names() function will automatically make sure that column names are unique and consistent.

Question 2

You are working with the penguins dataset. You want to use the arrange() function to sort the data for the column bill_length_mm in ascending order. You write the following code:

penguins %>%

Add a code chunk to sort the column bill_length_mm in ascending order.

penguins %>% arrange(bill_length_mm)

# A tibble: 344 × 8
   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
     <chr>     <chr>          <dbl>         <dbl>             <int>       <int>
1   Adelie     Dream           32.1          15.5               188        3050
2   Adelie     Dream           33.1          16.1               178        2900
3   Adelie Torgersen           33.5          19.0               190        3600
4   Adelie     Dream           34.0          17.1               185        3400
5   Adelie Torgersen           34.1          18.1               193        3475
6   Adelie Torgersen           34.4          18.4               184        3325
7   Adelie    Biscoe           34.5          18.1               187        2900
8   Adelie Torgersen           34.6          21.1               198        4400
9   Adelie Torgersen           34.6          17.2               189        3200
10  Adelie    Biscoe           35.0          17.9               190        3450
# ... with 334 more rows, and 2 more variables: sex <chr>, year <int>

What is the shortest bill length in mm?

  • 33.5
  • 33.1
  • 32.1
  • 34.0

Correct. You add the code chunk arrange(bill_length_mm) to sort the column bill_length_mm in ascending order. The correct code is penguins %>% arrange(bill_length_mm). Inside the parentheses of the arrange() function is the name of the variable you want to sort. The code returns a tibble that displays the data for bill_length_mm from shortest to longest. The shortest bill length is 32.1 mm.
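For comparison, here is a small sketch of ascending and descending sorts, assuming the dplyr package is installed. The birds data frame is a made-up stand-in, not the real penguins dataset. Wrapping the column in desc() reverses the order.

```r
library(dplyr)

# Hypothetical measurements, not the real penguins data
birds <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, 46.5, 32.1)
)

shortest_first <- birds %>% arrange(bill_length_mm)        # ascending (default)
longest_first  <- birds %>% arrange(desc(bill_length_mm))  # descending

shortest_first$bill_length_mm[1]  # 32.1
longest_first$bill_length_mm[1]   # 46.5
```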

Question 3

A data analyst is working with customer information from their company’s sales data. The first and last names are in separate columns, but they want to create one column with both names instead. Which of the following functions can they use?

  • separate()
  • select()
  • arrange()
  • unite()

Correct. The unite() function can be used to combine columns.
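A minimal sketch of this answer, assuming the tidyr package is installed. The customer names and column names are invented. separate() is included to show the reverse operation.

```r
library(tidyr)

# Hypothetical customer data with names in separate columns
customers <- data.frame(
  first_name = c("Ana", "Ben"),
  last_name  = c("Lee", "Kim")
)

# Combine the two columns into one, separated by a space
combined <- unite(customers, "full_name", first_name, last_name, sep = " ")
combined$full_name  # "Ana Lee" "Ben Kim"

# Split the combined column back into two
split_again <- separate(combined, full_name,
                        into = c("first_name", "last_name"), sep = " ")
```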


Take a closer look at the data

Working with biased data

Every data analyst will encounter an element of bias at some point in the data analysis process. That’s why it’s so important to understand how to identify and manage biased data whenever possible.

In this reading, you will read a real-life example of an analyst who discovered bias in their data, and learn how they used R to address it.

Addressing biased data with R

This scenario was shared by a quantitative analyst who collects data from people all over the world. They explain how they discovered bias in their data, and how they used R to address it:

“I work on a team that collects survey-like data. One of the tasks my team does is called a side-by-side comparison. For example, we might show users two ads side-by-side at the same time. In our survey, we ask which of the two ads they prefer. In one case, after many iterations, we were seeing consistent bias in favor of the first item. There was also a measurable decrease in the preference for an item if we swapped its position to second.

So we decided to add randomization to the position of the ads using R. We wanted to make sure that the items appeared in the first and second positions with similar frequencies. We used sample() to inject a randomization element into our R programming. In R, the sample() function allows you to take a random sample of elements from a data set. Adding this piece of code shuffled the rows in our data set randomly. So when we presented the ads to users, the positions of the ads were now random and controlled for bias. This made the survey more effective and the data more reliable.”

Key takeaways

The sample() function is just one of many functions and methods in R that you can use to address bias in your data. Depending on the kind of analysis you are conducting, you might need to incorporate some advanced processes in your programming. Although this program won’t cover those kinds of processes in detail, you will likely learn more about them as you get more experience in the data analytics field.
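The analyst's actual code is not shown, so the following is only a sketch of how sample() could be used for this kind of randomization in base R. The trials data frame and its column names are hypothetical, and set.seed() is there only to make the sketch reproducible.

```r
set.seed(123)  # only so the sketch gives the same result each run

# Hypothetical survey trials: each row pairs the same two ads
trials <- data.frame(
  trial_id = 1:6,
  ad_x = "ad_A",
  ad_y = "ad_B"
)

# Shuffle the row order: sample(n) returns a random permutation of 1..n
shuffled <- trials[sample(nrow(trials)), ]

# Randomly decide which ad appears in the first position for each trial
trials$first_position <- sample(c("ad_A", "ad_B"),
                                size = nrow(trials), replace = TRUE)

# Check how often each ad lands in the first position
table(trials$first_position)
```

With many trials, each ad should appear first about half the time, which is what keeps position from favoring one ad.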

To learn more about bias and data ethics, check out these resources:

  • Bias function: This web page is a good starting point to learn about how the bias function in R can help you identify and manage bias in your analysis.
  • Data Science Ethics: This online course provides slides, videos, and exercises to help you learn more about ethics in the world of data analytics. It includes information about data privacy, misrepresentation in data, and applying ethics to your visualizations.

Test your knowledge on R functions

TOTAL POINTS 3

Question 1

Which of the following functions can a data analyst use to get a statistical summary of their dataset? Select all that apply.

  • sd()
  • mean()
  • cor()
  • ggplot2()

Correct. The sd(), cor(), and mean() functions can provide a statistical summary of the dataset using standard deviation, correlation, and mean.

Question 2

A data analyst inputs the following command:  quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y)).  Which of the functions in this command can help them determine how strongly related their variables are?

  • sd(y)
  • cor(x,y)
  • mean(y)
  • sd(x)

Correct. The cor() function returns the correlation between two variables. This determines how strong the relationship between those two variables is.
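The same statistics can be sketched in base R with the built-in anscombe dataset, on the assumption that quartet in the question is a tidy form of Anscombe's quartet. The name stats_by_set and the loop over column suffixes are choices made for this illustration.

```r
# anscombe is a built-in dataset with columns x1..x4 and y1..y4
stats_by_set <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x), sd_x = sd(x),
    mean_y = mean(y), sd_y = sd(y),
    cor_xy = cor(x, y)  # strength of the linear relationship
  )
}))

round(stats_by_set, 2)
```

All four sets share nearly identical means, standard deviations, and correlations (about 0.82), which is why the summary statistics alone cannot distinguish them.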

Question 3

Fill in the blank: The bias function compares the actual outcome of the data with the _____ outcome to determine whether or not the model is biased.

  • predicted
  • final
  • probable
  • desired

Correct. The bias function compares the actual outcome of the data with the predicted outcome to determine whether or not the model is biased.
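The idea can be sketched in base R as the average difference between actual and predicted values. The helper mean_bias and the numbers are hypothetical. Packages that provide a bias function may use a different argument order or sign convention, so check their documentation.

```r
# Hypothetical helper: average of (actual - predicted)
mean_bias <- function(actual, predicted) {
  mean(actual - predicted)
}

actual    <- c(10, 12, 14)
predicted <- c(11, 12, 16)

mean_bias(actual, predicted)
# -1: on average, the predictions run 1 unit above the actual values
```

A result close to zero suggests the predictions are not systematically too high or too low.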
