4.1.3.Testing your data & Consider the margin of error - sj50179/Google-Data-Analytics-Professional-Certificate GitHub Wiki

Testing your data

Statistical power

The probability of getting meaningful results from a test, written as a decimal or a percentage (for example, 0.6 → 60%)

Hypothesis testing

A way to see if a survey or experiment has meaningful results

Statistically significant

If a test is statistically significant, it means the results of the test are real and not an error caused by random chance.

Usually, you need a statistical power of at least 0.8 or 80% to consider your results statistically significant.
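
One way to get a feel for statistical power is to estimate it before running a test. The sketch below is an illustration and not part of the course material. It uses a common normal approximation for comparing two rates. The function name `approx_power`, the 5% significance threshold (`alpha`), and the example rates are all assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(rate_a, rate_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two rates.

    Normal approximation that ignores the (tiny) chance of rejecting
    in the wrong direction. All inputs are hypothetical planning values.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    # Standard error of the difference between the two observed rates.
    se = sqrt(rate_a * (1 - rate_a) / n_per_group
              + rate_b * (1 - rate_b) / n_per_group)
    return z.cdf(abs(rate_a - rate_b) / se - z_crit)


# Hypothetical plan: rates of 5% and 3%, with two possible group sizes.
for n in (1000, 4000):
    power = approx_power(0.05, 0.03, n)
    print(n, round(power, 2), "meets 0.8 threshold" if power >= 0.8 else "below 0.8")
```

Under these assumptions, a larger sample gives higher power, which is why sample size is planned before data collection.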

What to do when there is no data

Earlier, you learned how you can still do an analysis using proxy data if you have no data. You might have some questions about proxy data, so this reading will give you a few more examples of the types of datasets that can serve as alternate data sources.

Proxy data examples

Sometimes the data to support a business objective isn’t readily available. This is when proxy data is useful. Take a look at the following scenarios and where proxy data comes in for each example:

| Business scenario | How proxy data can be used |
| --- | --- |
| A new car model was just launched a few days ago and the auto dealership can’t wait until the end of the month for sales data to come in. They want sales projections now. | The analyst proxies the number of clicks to the car specifications on the dealership’s website as an estimate of potential sales at the dealership. |
| A brand new plant-based meat product was only recently stocked in grocery stores and the supplier needs to estimate the demand over the next four years. | The analyst proxies the sales data for a turkey substitute made out of tofu that has been on the market for several years. |
| The Chamber of Commerce wants to know how a tourism campaign is going to impact travel to their city, but the results from the campaign aren’t publicly available yet. | The analyst proxies the historical data for airline bookings to the city one to three months after a similar campaign was run six months earlier. |

Open (public) datasets

If you are part of a large organization, you might have access to lots of sources of data. But if you are looking for something specific or a little outside your line of business, you can also make use of open or public datasets. (You can refer to this Towards Data Science article for a brief explanation of the difference between open and public data.)

Here's an example. A nasal version of a vaccine was recently made available. A clinic wants to know what to expect for contraindications, but just started collecting first-party data from its patients. A contraindication is a condition that may cause a patient not to take a vaccine due to the harm it would cause them if taken. To estimate the number of possible contraindications, a data analyst proxies an open dataset from a trial of the injection version of the vaccine. The analyst selects a subset of the data with patient profiles most closely matching the makeup of the patients at the clinic.

There are plenty of ways to share and collaborate on data within a community. Kaggle (kaggle.com) has datasets in a variety of formats including the most basic type, Comma Separated Values (CSV) files.

CSV, JSON, SQLite, and BigQuery datasets

  • CSV: Check out this Credit card customers dataset, which has information from 10,000 customers including age, salary, marital status, credit card limit, credit card category, etc. (CC0: Public Domain, Sakshi Goyal).
  • JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain, Mitchell J).
  • SQLite: Check out this SQLite dataset for 24 years’ worth of U.S. wildfire data (CC0: Public Domain, Rachael Tatman).
  • BigQuery: Check out this Google Analytics 360 sample dataset from the Google Merchandise Store (CC0: Public Domain, Google BigQuery).

Refer to the Kaggle documentation for datasets for more information and search for and explore datasets on your own at kaggle.com/datasets.

As with all other kinds of datasets, be on the lookout for ‘Null’ and duplicate data in open datasets from databases. Keep in mind that Null can mean that the value is unassigned, or it can be the value of 0. It is important to make sure you know which is intended before you use a dataset with Null data.
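
A quick check for Null and duplicate rows might look like the sketch below. It is only an illustration: the sample data, the column names, and the set of strings treated as Null (`NULL_MARKERS`) are made up for the example. A real dataset's documentation should say how missing values are encoded.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for a downloaded CSV file.
RAW = """customer_id,age,credit_limit
1,45,12000
2,,8000
3,51,NULL
1,45,12000
4,38,0
"""

# Assumed spellings of "no value"; note that a literal 0 is NOT treated as Null.
NULL_MARKERS = {"", "null", "na", "n/a"}


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_nulls(rows):
    """Count Null-like values per column."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip().lower() in NULL_MARKERS:
                counts[column] += 1
    return dict(counts)


def find_duplicates(rows):
    """Return each row that appears more than once."""
    seen = Counter(tuple(row.items()) for row in rows)
    return [dict(key) for key, n in seen.items() if n > 1]


rows = load_rows(RAW)
print("Nulls per column:", count_nulls(rows))
print("Duplicated rows:", find_duplicates(rows))
```

Keeping 0 out of the Null markers is a deliberate choice here. Whether a 0 means "no value" or an actual zero has to be confirmed for each dataset.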

Confidence level

The probability that your sample accurately reflects the greater population

Having a 99 percent confidence level is ideal, but most industries hope for at least a 90 or 95 percent confidence level

Sample size calculator

In this reading, you will learn the basics of sample size calculators, how to use them, and how to understand the results. A sample size calculator tells you how many people you need to interview (or things you need to test) to get results that represent the target population. Let’s review some terms you will come across when using a sample size calculator:

  • Confidence level: The probability that your sample size accurately reflects the greater population.
  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population.
  • Population: This is the total number you hope to pull your sample from.
  • Sample: A part of a population that is representative of the population.
  • Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.

How to use a sample size calculator

In order to use a sample size calculator, you need to have the population size, confidence level, and the acceptable margin of error already decided so you can input them into the tool. If this information is ready to go, check out these sample size calculators below:

What to do with the results

After you have plugged your information into one of these calculators, it will give you a recommended sample size. Keep in mind, the calculated sample size is the minimum number to achieve what you input for confidence level and margin of error. If you are working with a survey, you will also need to think about the estimated response rate to figure out how many surveys you will need to send out. For example, if you need a sample size of 100 individuals and your estimated response rate is 10%, you will need to send your survey to 1,000 individuals to get the 100 responses you need for your analysis.
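
The arithmetic behind these steps can be sketched in a few lines. The reading does not say which formula a given calculator uses, so this example assumes a common one: Cochran's formula with a finite population correction and a worst-case proportion of 0.5. The function names and the population of 10,000 are hypothetical.

```python
from fractions import Fraction
from math import ceil
from statistics import NormalDist


def sample_size(population, confidence_level, margin_of_error, proportion=0.5):
    """Minimum sample size (assumed formula: Cochran + finite population correction)."""
    # z-score for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for a very large population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink it to account for the finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return ceil(n)


def surveys_to_send(required_responses, response_rate):
    """Surveys needed so the expected number of responses reaches the target."""
    # Fractions avoid floating-point rounding surprises before taking the ceiling.
    return ceil(Fraction(required_responses) / Fraction(str(response_rate)))


# Hypothetical inputs: population of 10,000, 95% confidence, 5% margin of error.
print(sample_size(10_000, 0.95, 0.05))
# The reading's example: 100 responses needed at a 10% response rate.
print(surveys_to_send(100, 0.10))
```

Different calculators may round differently or use other assumptions, so small differences from this sketch are expected.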

Now that you have the basics, try some calculations using the sample size calculators and refer back to this reading if you need a refresher on the definitions.

Test your knowledge on testing your data

Question 1

A research team runs an experiment to determine if a new security system is more effective than the previous version. What type of results are required for the experiment to be statistically significant?

  • Results that are real and not caused by random chance
  • Results that are hypothetical and in need of more testing
  • Results that are unlikely to occur again
  • Results that are inaccurate and should be ignored

Correct. In order for an experiment to be statistically significant, the results should be real and not caused by random chance.

Question 2

In order to have a high confidence level in a customer survey, what should the sample size accurately reflect?

  • The trends from other customer surveys
  • The entire population
  • The predictions of stakeholders
  • The most valuable members of the population

Correct. In order to have a high confidence level in a customer survey, the sample size should accurately reflect the entire population.

Question 3

A data analyst determines an appropriate sample size for a survey. They can check their work by making sure the confidence level percentage plus the margin of error percentage add up to 100%.

  • True
  • False

Correct. The confidence level percentage and margin of error percentage do not have to add up to 100%. They are independent of each other.

Consider the margin of error

Margin of error

The maximum amount that the sample results are expected to differ from those of the actual population

All about margin of error

Margin of error is the maximum amount that the sample results are expected to differ from those of the actual population. More technically, the margin of error defines a range of values below and above the average result for the sample. The average result for the entire population is expected to be within that range. We can better understand margin of error by using some examples below.

Margin of error in baseball

Imagine you are playing baseball and that you are up at bat. The crowd is roaring, and you are getting ready to try to hit the ball. The pitcher delivers a fastball traveling about 90-95 mph, which takes about 400 milliseconds (ms) to reach the catcher’s glove. You swing and miss the first pitch because your timing was a little off. You wonder if you should have swung slightly earlier or slightly later to hit a home run. That time difference can be considered the margin of error, and it tells us how close or far your timing was from the average home run swing.

Margin of error in marketing

The margin of error is also important in marketing. Let’s use A/B testing as an example. A/B testing (or split testing) tests two variations of the same web page to determine which page is more successful in attracting user traffic and generating revenue. The share of user traffic that gets monetized is known as the conversion rate. A/B testing allows marketers to test emails, ads, and landing pages to find the data behind what is working and what isn’t working. Marketers use the confidence interval (determined by the conversion rate and the margin of error) to understand the results.

For example, suppose you are conducting an A/B test to compare the effectiveness of two different email subject lines to entice people to open the email. You find that subject line A: “Special offer just for you” resulted in a 5% open rate compared to subject line B: “Don’t miss this opportunity” at 3%.

Does that mean subject line A is better than subject line B? It depends on your margin of error. If the margin of error is 2%, then subject line A’s actual open rate falls somewhere between 3% and 7%; that range is its confidence interval. Since the lower end of the interval overlaps with subject line B’s result of 3%, you can’t conclude that there is a statistically significant difference between subject lines A and B. Examining the margin of error is important when making conclusions based on your test results.
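
The subject line comparison can be written out as a small check. This sketch mirrors the reasoning in the text (does B's result fall inside A's interval?) and is not a formal significance test. The function names are hypothetical, and values are kept in percentage points so the arithmetic stays exact.

```python
def confidence_interval(rate_pct, margin_pct):
    """Return (low, high) in percentage points for a rate and its margin of error."""
    return (rate_pct - margin_pct, rate_pct + margin_pct)


def falls_within(value_pct, interval):
    """True if the value lies inside the interval, endpoints included."""
    low, high = interval
    return low <= value_pct <= high


# Subject line A: 5% open rate with a 2% margin of error.
interval_a = confidence_interval(5, 2)
# Subject line B: 3% open rate.
overlaps = falls_within(3, interval_a)

print("A's interval:", interval_a)
print("B inside A's interval:", overlaps)
if overlaps:
    print("Cannot conclude that A is better than B.")
```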

Want to calculate your margin of error?

All you need is population size, confidence level, and sample size. In order to better understand this calculator, review these terms:

  • Confidence level: A percentage indicating how likely it is that your sample accurately reflects the greater population
  • Population: The total number you pull your sample from
  • Sample: A part of a population that is representative of the population
  • Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population

In most cases, a 90% or 95% confidence level is used. But, depending on your industry, you might want to set a stricter confidence level. A 99% confidence level is reasonable in some industries, such as the pharmaceutical industry.

After you have settled on your population size, sample size, and confidence level, plug the information into a margin of error calculator like the ones below:

Key takeaway

Margin of error is used to determine how close your sample’s result is to what the result would likely have been if you could have surveyed or tested the entire population. Margin of error helps you understand and interpret survey or test results in real life. Calculating the margin of error is particularly helpful when you are given the data to analyze. After using a calculator to calculate the margin of error, you will know how much the sample results might differ from the results of the entire population.
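
As a rough illustration of what a margin of error calculator does with population size, sample size, and confidence level, the sketch below uses a standard formula for a proportion with a finite population correction. The reading does not name a formula, so this choice, the worst-case proportion of 0.5, the function name, and the example numbers are all assumptions.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(population, sample, confidence_level, proportion=0.5):
    """Margin of error for a proportion, returned as a fraction (0.05 means 5%)."""
    # z-score for a two-sided interval at the given confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Standard error of a sample proportion.
    standard_error = sqrt(proportion * (1 - proportion) / sample)
    # Finite population correction: the margin shrinks as the sample nears the population.
    correction = sqrt((population - sample) / (population - 1))
    return z * standard_error * correction


# Hypothetical survey: 370 responses drawn from a population of 10,000.
for level in (0.90, 0.95, 0.99):
    moe = margin_of_error(10_000, 370, level)
    print(f"{level:.0%} confidence: about {moe:.1%} margin of error")
```

With the sample held fixed, a stricter confidence level gives a wider margin of error.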

Test your knowledge on margin of error

Question 1

Fill in the blank: Margin of error is the _____ amount that the sample results are expected to differ from those of the actual population.

  • minimum
  • maximum
  • median
  • average

Correct. Margin of error is the maximum amount that the sample results are expected to differ from those of the actual population.

Question 2

In a survey about a new cleaning product, 75% of respondents report they would buy the product again. The margin of error for the survey is 5%. Based on the margin of error, what percentage range reflects the population's true response?

  • Between 73% and 78%
  • Between 70% and 75%
  • Between 70% and 80%
  • Between 75% and 80%

Correct. Based on the margin of error, between 70% and 80% accurately reflects the population's true response.