4.1.2. Overcoming the challenges of insufficient data

Types of insufficient data

  • Data from only one source
  • Data that keeps updating
  • Outdated data
  • Geographically-limited data

Ways to address insufficient data

  • Identify trends with the available data
  • Wait for more data if time allows
  • Talk with stakeholders and adjust your objective
  • Look for a new dataset

What to do when you find an issue with your data

When you are getting ready for data analysis, you might realize you don’t have the data you need or you don’t have enough of it. In some cases, you can use what is known as proxy data in place of the real data. Think of it like substituting oil for butter in a recipe when you don’t have butter. In other cases, there is no reasonable substitute and your only option is to collect more data.

Consider the following data issues and suggestions on how to work around them.

Data issue 1: no data

| Possible solutions | Examples of solutions in real life |
| --- | --- |
| Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. | If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees. |
| If there isn’t time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. | If you are analyzing peak travel times for commuters but don’t have the data for a particular city, use the data from another city with a similar size and demographic. |
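
As a rough illustration of the proxy-data idea in the commuter example, the sketch below picks the candidate city whose size and demographics are closest to the target city. The `City` class, the `closest_proxy` function, the similarity measure, and all figures are assumptions made up for this example; they are not part of the course material.

```python
from dataclasses import dataclass


@dataclass
class City:
    name: str
    population: int
    median_age: float


def closest_proxy(target: City, candidates: list[City]) -> City:
    """Return the candidate most similar to the target city."""

    def distance(city: City) -> float:
        # Use relative gaps so that population does not dominate median age.
        pop_gap = abs(city.population - target.population) / target.population
        age_gap = abs(city.median_age - target.median_age) / target.median_age
        return pop_gap + age_gap

    return min(candidates, key=distance)


# Hypothetical figures, for illustration only.
target = City("Target City", 500_000, 34.0)
candidates = [
    City("City A", 2_100_000, 36.5),
    City("City B", 480_000, 33.0),
    City("City C", 510_000, 45.0),
]
print(closest_proxy(target, candidates).name)
```

In practice, the choice of which attributes make two cities "similar" is a judgment call to agree on with stakeholders, and the report should state that proxy data was used.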

Data issue 2: too little data

| Possible solutions | Examples of solutions in real life |
| --- | --- |
| Do the analysis using proxy data along with actual data. | If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors. |
| Adjust your analysis to align with the data you already have. | If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only. |

Data issue 3: wrong data, including data with errors*

| Possible solutions | Examples of solutions in real life |
| --- | --- |
| If you have the wrong data because requirements were misunderstood, communicate the requirements again. | If you need the data for female voters and received the data for male voters, restate your needs. |
| Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. | If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values. |
| If you can’t correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won’t cause systematic bias. | If your dataset was translated from a different language and some of the translations don’t make sense, ignore the data with bad translation and go ahead with the analysis of the other data. |
  • *Important note: sometimes data with errors can be a warning sign that the data isn’t reliable. Use your best judgment.
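
The last workaround (ignoring the wrong data) comes with two conditions: the remaining sample must still be large enough, and dropping the records must not cause systematic bias. The following is a minimal sketch of how those two checks might look. The helper names, the threshold of 30 (borrowed from the sample-size rule of thumb in these notes), and the survey rows are all assumptions for illustration.

```python
MIN_SAMPLE_SIZE = 30  # rule of thumb from these notes, used here as an assumption


def drop_invalid(rows, is_valid):
    """Split rows into (kept, dropped) using a caller-supplied validity check."""
    kept = [r for r in rows if is_valid(r)]
    dropped = [r for r in rows if not is_valid(r)]
    return kept, dropped


def dropped_share_by_group(rows, dropped, group_key):
    """Share of rows dropped within each group; large gaps hint at systematic bias."""
    totals, losses = {}, {}
    for r in rows:
        totals[r[group_key]] = totals.get(r[group_key], 0) + 1
    for r in dropped:
        losses[r[group_key]] = losses.get(r[group_key], 0) + 1
    return {group: losses.get(group, 0) / n for group, n in totals.items()}


# Hypothetical survey rows: None marks an unusable (for example, badly translated) answer.
rows = (
    [{"region": "north", "score": 4} for _ in range(40)]
    + [{"region": "south", "score": 3} for _ in range(36)]
    + [{"region": "south", "score": None} for _ in range(4)]
)
kept, dropped = drop_invalid(rows, lambda r: r["score"] is not None)
print("still large enough:", len(kept) >= MIN_SAMPLE_SIZE)
print("share dropped per region:", dropped_share_by_group(rows, dropped, "region"))
```

If one group loses a much larger share of its rows than the others, ignoring the bad records may skew the results, and the warning in the note above applies.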

Use the following decision tree as a reminder of how to deal with data errors or not enough data:

[Figure: decision tree for dealing with data errors or not enough data]

Calculating sample size


Before you dig deeper into sample size, familiarize yourself with these terms and definitions:

| Terminology | Definitions |
| --- | --- |
| Population | The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company. |
| Sample | A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population. |
| Margin of error | Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. |
| Confidence level | How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study. |
| Confidence interval | The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error. |
| Statistical significance | The determination of whether your result could be due to random chance or not. The greater the significance, the less likely it is that the result is due to chance. |
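
To make the link between margin of error, confidence level, and confidence interval concrete, here is a minimal sketch. It assumes the survey result is a proportion from a simple random sample and uses the common normal-approximation formula, margin of error = z × sqrt(p × (1 − p) / n). The formula, the function names, and the survey figures are assumptions for illustration and do not come from the course material.

```python
from math import sqrt
from statistics import NormalDist


def z_score(confidence_level: float) -> float:
    """Two-sided critical value of the standard normal (0.95 gives about 1.96)."""
    return NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)


def margin_of_error(p_hat: float, n: int, confidence_level: float = 0.95) -> float:
    """Normal-approximation margin of error for a sample proportion."""
    return z_score(confidence_level) * sqrt(p_hat * (1 - p_hat) / n)


def confidence_interval(p_hat: float, n: int, confidence_level: float = 0.95):
    """Sample result plus and minus the margin of error."""
    moe = margin_of_error(p_hat, n, confidence_level)
    return p_hat - moe, p_hat + moe


# Hypothetical survey: 60% of 400 respondents answered "yes".
low, high = confidence_interval(0.60, 400)
print(f"margin of error: {margin_of_error(0.60, 400):.3f}")
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

Raising the confidence level widens the interval for the same sample, which is why the confidence level is chosen before the study starts.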

Things to remember when determining the size of your sample

When figuring out a sample size, here are things to keep in mind:

  • Don’t use a sample size less than 30. As a widely used rule of thumb, 30 is the smallest sample size at which the average result of a sample starts to represent the average result of a population.
  • The confidence level most commonly used is 95%, but 90% can work in some cases.

Increase the sample size to meet specific needs of your project:

  • For a higher confidence level, use a larger sample size
  • To decrease the margin of error, use a larger sample size
  • For greater statistical significance, use a larger sample size

Note: Sample size calculators use statistical formulas to determine a sample size. More about these is coming up in the course! Stay tuned.

Why a minimum sample of 30?

This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the distribution of sample averages, taken over a large number of samples, more closely resembles the normal (bell-shaped) distribution. By convention, a sample of 30 is treated as the smallest sample size for which this approximation is reliable enough; the CLT itself does not set a hard cutoff. Researchers who rely on regression analysis (statistical methods to determine the relationships between controlled and dependent variables) also prefer a minimum sample of 30.
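
The effect described by the CLT can be explored with a small simulation. The sketch below is an illustration under stated assumptions: the population is a right-skewed exponential distribution with mean 1, the function name `sample_means` is made up for this example, and the sample sizes and number of trials are arbitrary choices.

```python
import random
from statistics import mean, stdev


def sample_means(n, trials, rng):
    """Means of `trials` samples of size n drawn from a right-skewed population."""
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (2, 30, 200):
    means = sample_means(n, 2000, rng)
    # The average of the sample means stays near the population mean (1.0),
    # while their spread shrinks as the sample size grows.
    print(f"n={n:>3}  mean of means={mean(means):.3f}  spread={stdev(means):.3f}")
```

Plotting a histogram of `means` for each sample size would show the shape moving from skewed toward bell-shaped as n increases.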

Still curious? Without getting too much into the math, check out these articles:

  • Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.
  • Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.

Sample sizes vary by business problem

Sample size will vary based on the type of business problem you are trying to solve.

For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like?

Would 200 be alright if the people surveyed represented every district in the city?

Answer: It depends on the stakes.

  • A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
  • A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library

You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey.
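
To put rough numbers on the library example, the sketch below estimates the margin of error for a few sample sizes in a city of 200,000. It assumes a 95% confidence level (z of about 1.96), a simple random sample, and the most conservative case where opinion is split 50/50. The function name and the sample sizes other than 200 are assumptions for illustration.

```python
from math import sqrt

Z_95 = 1.96  # approximate critical value for a 95% confidence level


def worst_case_margin(n, population_size=None):
    """Margin of error for a proportion at 95% confidence, assuming p = 0.5."""
    margin = Z_95 * 0.5 / sqrt(n)
    if population_size is not None:
        # Finite population correction: only matters when the sample
        # is a large share of the population.
        margin *= sqrt((population_size - n) / (population_size - 1))
    return margin


for n in (200, 1_000, 2_400):
    print(n, f"{worst_case_margin(n, 200_000):.1%}")
```

Under these assumptions, 200 responses give a margin of error of roughly 7 percentage points. That may be acceptable for gauging how residents feel about the library, but it is wide for predicting a close funding vote.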

Larger sample sizes have a higher cost

You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results.

Knowing the basics is helpful

Knowing the basics will help you make the right choices when it comes to sample size. You can always raise concerns if you come across a sample size that is too small. A sample size calculator is also a great tool for this. Sample size calculators let you enter a desired confidence level and margin of error for a given population size. They then calculate the sample size needed to statistically achieve those results.
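
As a sketch of what a sample size calculator might do internally, the function below takes a population size, confidence level, and margin of error and returns a sample size. It assumes Cochran's formula for a proportion with a finite population correction and a conservative proportion of 0.5. Actual calculators may use different formulas or rounding, and the function name and defaults are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(population_size, confidence_level=0.95,
                         margin_of_error=0.05, proportion=0.5):
    """Estimate the sample size needed for a given confidence level and margin of error."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for a very large population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction shrinks the requirement for small populations.
    n = n0 / (1 + (n0 - 1) / population_size)
    return ceil(n)


print(required_sample_size(200_000))                         # 95% confidence, 5% margin
print(required_sample_size(200_000, confidence_level=0.99))  # higher confidence
print(required_sample_size(200_000, margin_of_error=0.03))   # smaller margin of error
```

For a population of 200,000 at a 95% confidence level and a 5% margin of error, this works out to 384 respondents. Raising the confidence level or lowering the margin of error increases the required sample size.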

Self-Reflection: Why pre-cleaning activities are important

Data integrity

In this activity, you will consider what you've learned about how to prepare for data analysis and deal with insufficient data. Before you get started, first reflect on the steps to take before you begin analysis of your data. In particular, consider the following:

  • Step 1: Determine data integrity. Assess the overall accuracy, consistency, and completeness of the data.
  • Step 2: Connect objectives to data. Understand how your business objectives can be served by an investigation into the data.
  • Step 3: Know when to stop collecting data.

To address these steps, data analysts complete pre-cleaning activities. These activities relate to data integrity, which is where you will spend most of your time as a junior data analyst.
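
Step 1 can be partly automated. The sketch below is one possible way to check completeness and consistency before analysis: it counts rows with missing required fields and rows that repeat an identifier. The function name, field names, and sample rows are hypothetical, and a real check would depend on the dataset.

```python
def integrity_report(rows, required_fields, key_field):
    """Count rows, rows missing any required field, and repeated key values."""
    missing = sum(
        1 for r in rows
        if any(r.get(f) in (None, "") for f in required_fields)
    )
    seen, duplicates = set(), 0
    for r in rows:
        key = r.get(key_field)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "rows": len(rows),
        "rows_missing_fields": missing,
        "duplicate_keys": duplicates,
    }


# Hypothetical employee survey rows, for illustration only.
survey = [
    {"employee_id": 1, "department": "Sales", "rating": 4},
    {"employee_id": 2, "department": "", "rating": 5},
    {"employee_id": 2, "department": "Sales", "rating": 5},
    {"employee_id": 3, "department": "Support", "rating": None},
]
print(integrity_report(survey, ["department", "rating"], "employee_id"))
```

A report like this does not prove that the data is accurate, but it flags gaps and duplicates early, before they affect the analysis.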

Reflection

Take a moment to reflect on what you’ve just reviewed: the steps to address data integrity, and the solutions for data insufficiency. Now, write a 3-5 sentence (50-100 word) response addressing why these steps are important to complete prior to data cleaning.

  • These steps are important because they increase the efficiency and success of your data analysis tasks. If you know that your data is accurate, consistent, and complete you can have confidence that your results will be valid. The stakeholders will be pleased if you connect the data to business objectives. Finally, knowing when to stop collecting data will allow you to finish your tasks in a timely manner without sacrificing data integrity.

Can you think of any problems that might occur if you don't follow these steps? What are they? Write a 5-7 sentence (100-150 word) response reflecting on these questions.

  • Failing to follow these steps can produce problems when you try to explore and analyze data. Suppose that you didn't determine data integrity. Then you may find that you are working with inaccurate or missing data. This can cause misleading results in your analysis. If you neglect to connect objectives with the data, your analysis may not be relevant to the stakeholders. Finally, not understanding when to stop collecting data can lead to unnecessary delays in completing tasks.

Test your knowledge on insufficient data

Question 1

What should an analyst do if they do not have the data needed to meet a business objective? Select all that apply.

  • Perform the analysis by finding and using proxy data from other datasets.
  • Continue with the analysis using data from less reliable sources.
  • Create and use hypothetical data that aligns with analysis predictions.
  • Gather related data on a small scale and request additional time to find more complete data.

Correct. If an analyst does not have the data needed to meet a business objective, they should gather related data on a small scale and request additional time. Then, they can find more complete data or perform the analysis by finding and using proxy data from other datasets.

Question 2

Which of the following are limitations that might lead to insufficient data? Select all that apply.

  • Outdated data
  • Duplicate data
  • Data from a single source
  • Data that updates continually

Correct. Limitations that might lead to insufficient data include data that updates continually, outdated data, and data from a single source.

Question 3

A data analyst wants to find out how many people in Utah have swimming pools. It’s unlikely that they can survey every Utah resident. Instead, they survey enough people to be representative of the population. This describes what data analytics concept?

  • Sample
  • Margin of error
  • Confidence level
  • Statistical significance

Correct. This describes a sample, which is a part of a population that is representative of the whole.