7.3.3. Take a closer look at the data - quanganh2001/Google-Data-Analytics-Professional-Certificate-Coursera GitHub Wiki

Working with biased data

Every data analyst will encounter bias at some point in the data analysis process. That’s why it’s so important to understand how to identify and manage biased data whenever possible. You might recall we explored bias in detail in Course 3 of this program. In this reading, you will explore a real-life example of an analyst who discovered bias in their data, and learn how they used R to address it.

Addressing biased data with R

This scenario was shared by a quantitative analyst who collects data from people all over the world. They explain how they discovered bias in their data, and how they used R to address it:

“I work on a team that collects survey-like data. One of the tasks my team does is called a side-by-side comparison. For example, we might show users two ads side-by-side at the same time. In our survey, we ask which of the two ads they prefer. In one case, after many iterations, we were seeing consistent bias in favor of the first item. There was also a measurable decrease in the preference for an item if we swapped its position to second.

So we decided to add randomization to the position of the ads using R. We wanted to make sure that the items appeared in the first and second positions with similar frequencies. We used sample() to inject a randomization element into our R programming. In R, the sample() function allows you to take a random sample of elements from a data set. Adding this piece of code shuffled the rows in our data set randomly. So when we presented the ads to users, the positions of the ads were now random and controlled for bias. This made the survey more effective and the data more reliable.”
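The randomization described above can be sketched in a few lines of R. This is a minimal illustration, not the analyst's actual code: the `pairs` data frame, its column names, and the ad labels are all made up for the example, and the fixed seed is there only so the result is reproducible.

```r
# Hypothetical ad pairs; names and values are made up for illustration.
set.seed(42)  # fixed seed only so the example is reproducible

pairs <- data.frame(
  pair_id = 1:6,
  ad_a = paste0("ad_", 1:6, "a"),
  ad_b = paste0("ad_", 1:6, "b"),
  stringsAsFactors = FALSE
)

# Shuffle the row order: sample(n) returns a random permutation of 1..n.
shuffled <- pairs[sample(nrow(pairs)), ]

# Randomly decide, pair by pair, which ad is shown in the first position.
swap <- sample(c(TRUE, FALSE), nrow(shuffled), replace = TRUE)
shuffled$first  <- ifelse(swap, shuffled$ad_b, shuffled$ad_a)
shuffled$second <- ifelse(swap, shuffled$ad_a, shuffled$ad_b)

print(shuffled[, c("pair_id", "first", "second")])
```

Over many surveys, each ad should land in the first position about half the time, so a preference for "whatever comes first" no longer favors one particular ad.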

Key takeaways

The sample() function is just one of many functions and methods in R that you can use to address bias in your data. Depending on the kind of analysis you are conducting, you might need to incorporate some advanced processes in your programming. Although this program won’t cover those kinds of processes in detail, you will likely learn more about them as you get more experience in the data analytics field.

To learn more about bias and data ethics, check out these resources:

  • Bias function: This web page is a good starting point to learn about how the bias function in R can help you identify and manage bias in your analysis.
  • Data Science Ethics: This online course provides slides, videos, and exercises to help you learn more about ethics in the world of data analytics. It includes information about data privacy, misrepresentation in data, and applying ethics to your visualizations.

Hands-On Activity: Changing your data

Activity overview

By now, you have learned many ways to change and work with data in a variety of settings, including spreadsheets and RStudio. In this activity, you’ll follow through a real-world scenario and practice manipulating and changing real data in R.

Upon completing this activity, you will know how to use functions to manipulate your data and use statistical summaries to explore your data. This will enable you to use R for more complex tasks in your career as a data analyst and help you gain initial insights into data that you can share with your stakeholders.

Working in RStudio Cloud

To start, log into your RStudio (Posit) Cloud account and open the project for this activity using the link provided, which opens in a new tab. If you haven't gone through this process already, you will see a "red stamp" at the top right of the screen marking this project as a Temporary Copy. Click the adjacent button, Save a Permanent Copy, and the project will be saved in your main dashboard for use with future lessons. Once that is done, go to the file explorer in the bottom right and click through the following: Course 7 -> Week 3 -> Lesson3_Change.Rmd.

The .csv file that you will need, hotel_bookings.csv, is also located in this folder.

If you have trouble finding the correct activity, check out this step-by-step guide on how to navigate in RStudio (Posit) Cloud. Make sure to select the correct R markdown (Rmd) file. The other Rmd files will be used in different activities.

If you are using RStudio Desktop, you can download the Rmd file and the data for this activity directly here:

Lesson3_Change

hotel_bookings

Lesson3_Change_Solutions

Carefully read the instructions in the comments of the Rmd file and complete each step. Some steps may be as simple as running pre-written code, while others may require you to write your own functions. After you finish the steps in the Rmd file, return here to confirm that your work is complete.

Confirmation

What is the average lead time for a hotel booking in this data set?

A. 104.0114

B. 45.0283

C. 100.0011

D. 14.0221

The correct answer is A. 104.0114. Explain: The average lead time is 104.0114 days. You were able to calculate this using the mean() function on the lead_time column in the data set. Going forward, you can apply the functions you used in this activity to future projects to change and analyze your data.
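As a minimal sketch of that calculation, the snippet below applies mean() to a lead_time column. The values are made up so the example runs without the course file, so the result will not match 104.0114; with the real data you would first load hotel_bookings.csv.

```r
# With the course data you would load the file first, for example:
# bookings <- read.csv("hotel_bookings.csv")

# Made-up values standing in for the lead_time column (in days).
bookings <- data.frame(lead_time = c(10, 45, 120, 200, NA))

# na.rm = TRUE skips missing values; without it, any NA makes the mean NA.
avg_lead_time <- mean(bookings$lead_time, na.rm = TRUE)
print(avg_lead_time)  # 93.75 for these made-up values
```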

Compare data cleaning on different platforms

In the past few Discussion Prompts, you have been comparing R to other data analytics tools. Now, it’s time to consider the similarities and differences between R and spreadsheets for data-cleaning processes.

Write a response of two or more paragraphs (100-150 words total) discussing what you have noticed while cleaning data. Then, visit the discussion forum to review what others have written, and respond to at least two posts with your own thoughts.

Test your knowledge on R functions

Question 1

Which of the following functions can a data analyst use to get a statistical summary of their dataset? Select all that apply.

  • cor()
  • mean()
  • sd()
  • ggplot2()

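The correct answers are cor(), mean(), and sd(). Explain: The sd(), cor(), and mean() functions can provide a statistical summary of the dataset using standard deviation, correlation, and mean. ggplot2 is a visualization package, not a summary function.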
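To see the three summary functions side by side, here is a short sketch on two made-up vectors. The variable names and values are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Made-up values for illustration only.
hours <- c(1, 2, 3, 4, 5)
score <- c(52, 55, 61, 64, 70)

avg_hours <- mean(hours)        # arithmetic mean: 3
sd_hours  <- sd(hours)          # sample standard deviation: sqrt(2.5)
r_xy      <- cor(hours, score)  # Pearson correlation between the two vectors

print(c(mean = avg_hours, sd = sd_hours, cor = r_xy))
```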

Question 2

A data analyst inputs the following command:

quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y))

Which of the functions in this command can help them determine how strongly related their variables are?

A. mean(y)

B. sd(x)

C. cor(x, y)

D. sd(y)

The correct answer is C. cor(x, y). Explain: The cor() function returns the correlation between two variables. This determines how strong the relationship between those two variables is.
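The same kind of per-group summary can be sketched in base R without extra packages. The example below is an illustration under assumptions: it uses R's built-in anscombe data set as a stand-in for quartet, and it loops over the four x/y column pairs instead of using group_by() and summarize().

```r
# anscombe ships with base R: columns x1..x4 and y1..y4, one pair per set.
# It is used here as a stand-in for the quartet data in the question.
summary_by_set <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    sd_x   = sd(x),
    mean_y = mean(y),
    sd_y   = sd(y),
    cor_xy = cor(x, y)  # strength of the linear relationship between x and y
  )
}))

print(summary_by_set)
```

In this data the four sets have nearly identical means, standard deviations, and correlations even though they look very different when plotted, which is why a summary table works best alongside a visualization.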

Question 3

Fill in the blank: The bias function compares the actual outcome of the data with the _____ outcome to determine whether or not the model is biased.

A. desired

B. predicted

C. probable

D. final

The correct answer is B. predicted. Explain: The bias function compares the actual outcome of the data with the predicted outcome to determine whether or not the model is biased.
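The idea behind that comparison can be sketched in base R as the average difference between actual and predicted values. The mean_bias() helper below is hypothetical, written only for this illustration; it is not the bias function the question refers to, and its sign convention (actual minus predicted) is an assumption. The sales figures are made up.

```r
# Hypothetical helper for illustration; not a function from any package.
# Returns the average of (actual - predicted). A value near 0 suggests
# little systematic bias; a negative value means predictions run too high.
mean_bias <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  mean(actual - predicted)
}

# Made-up values for illustration only.
actual_sales    <- c(150, 203, 137, 247, 116, 287)
predicted_sales <- c(200, 300, 150, 250, 150, 300)

print(mean_bias(actual_sales, predicted_sales))  # -35: predictions are too high on average
```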