4.1.3. Overcoming the challenges of insufficient data
What to do when you find an issue with your data
When you are getting ready for data analysis, you might realize you don’t have the data you need or you don’t have enough of it. In some cases, you can use what is known as proxy data in place of the real data. Think of it like substituting oil for butter in a recipe when you don’t have butter. In other cases, there is no reasonable substitute and your only option is to collect more data.
Consider the following data issues and suggestions on how to work around them.
Data issue 1: no data
Possible Solutions | Examples of solutions in real life |
---|---|
Gather the data on a small scale to perform a preliminary analysis and then request additional time to complete the analysis after you have collected more data. | If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then, ask for another 3 weeks to collect the data from all employees. |
If there isn't time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. | If you are analyzing peak travel times for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic. |
Data issue 2: too little data
Possible Solutions | Examples of solutions in real life |
---|---|
Do the analysis using proxy data along with actual data. | If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors. |
Adjust your analysis to align with the data you already have. | If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only. |
Data issue 3: wrong data, including data with errors
Possible Solutions | Examples of solutions in real life |
---|---|
If you have the wrong data because requirements were misunderstood, communicate the requirements again. | If you need the data for female voters and received the data for male voters, restate your needs. |
Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. | If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values. |
If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won't cause systematic bias. | If your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translation and go ahead with the analysis of the other data. |
* Important note: sometimes data with errors can be a warning sign that the data isn’t reliable. Use your best judgment.
Use the following decision tree as a reminder of how to deal with data errors or not enough data:
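The three tables above can also be condensed into a small lookup. The sketch below is one way to do that in Python; the names `WORKAROUNDS` and `suggest_workarounds` are hypothetical, and the entries only paraphrase the suggestions already listed in the tables. It is a memory aid, not a replacement for judgment about whether the data is reliable.

```python
# Hypothetical lookup that paraphrases the three tables above.
WORKAROUNDS = {
    "no data": [
        "Gather data on a small scale, run a preliminary analysis, "
        "and request more time to collect the rest.",
        "If there is no time to collect data, use proxy data from other datasets.",
    ],
    "too little data": [
        "Combine proxy data with the actual data.",
        "Adjust the analysis to fit the data you have and report the limitation.",
    ],
    "wrong data": [
        "If requirements were misunderstood, communicate them again.",
        "Look for a pattern in the errors and correct them at the source.",
        "If errors cannot be corrected, ignore the bad records only when the "
        "sample is still large enough and no systematic bias is introduced.",
    ],
}


def suggest_workarounds(issue):
    """Return the list of suggested workarounds for a named data issue."""
    key = issue.strip().lower()
    if key not in WORKAROUNDS:
        raise ValueError(f"Unknown data issue: {issue!r}")
    return WORKAROUNDS[key]


for suggestion in suggest_workarounds("Too little data"):
    print("-", suggestion)
```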
Calculating sample size
Before you dig deeper into sample size, familiarize yourself with these terms and definitions:
Terminology | Definitions |
---|---|
Population | The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company |
Sample | A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population. |
Margin of error | Since a sample is used to represent a population, the sample's results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population. |
Confidence level | How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. You set the target confidence level before you start your study because it will affect how big your margin of error is at the end of your study. |
Confidence interval | The range of possible values that the population's result would be at the confidence level of the study. This range is the sample result +/- the margin of error. |
Statistical significance | The determination of whether your result could be due to random chance or not. The greater the significance, the less likely it is that the result is due to chance. |
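To make the relationship between margin of error, confidence level, and confidence interval concrete, here is a minimal Python sketch. It assumes the result is a survey proportion (for example, the share of employees who answered "yes") and uses the standard normal approximation, which is a common textbook choice rather than something this course prescribes. The function names and the example figures (60% of 400 respondents) are made up for illustration.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat, n, confidence=0.95):
    """Approximate margin of error for a sample proportion p_hat from n responses."""
    # z-score for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def confidence_interval(p_hat, n, confidence=0.95):
    """Confidence interval = sample result +/- margin of error."""
    moe = margin_of_error(p_hat, n, confidence)
    return p_hat - moe, p_hat + moe


# Illustrative numbers only: 60% of 400 respondents answered "yes".
low, high = confidence_interval(0.60, 400)
print(f"95% confidence interval: {low:.3f} to {high:.3f}")
```

With these example numbers the margin of error is a little under 5 percentage points, so the interval runs from roughly 55% to 65%.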
Things to remember when determining the size of your sample
When figuring out a sample size, here are things to keep in mind:
- Don’t use a sample size less than 30. As a rule of thumb based on the Central Limit Theorem, 30 is roughly the smallest sample size at which the average result of a sample starts to represent the average result of a population.
- The confidence level most commonly used is 95%, but 90% can work in some cases.
Increase the sample size to meet specific needs of your project:
- For a higher confidence level, use a larger sample size
- To decrease the margin of error, use a larger sample size
- For greater statistical significance, use a larger sample size
Note: Sample size calculators use statistical formulas to determine a sample size. More on these calculators is coming up later in the course. Stay tuned!
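As a rough illustration of what such a calculator might do internally, the sketch below uses a widely taught formula for estimating a proportion, with a finite population correction. This is an assumption about a typical calculator, not the specific tool used in the course. The function name `sample_size` and its defaults (95% confidence, 5% margin of error, an assumed proportion of 0.5 as the most conservative case) are chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Estimate the sample size needed to estimate a proportion.

    population: size of the whole population
    confidence: target confidence level (e.g. 0.95)
    margin:     acceptable margin of error (e.g. 0.05 for +/- 5 points)
    p:          assumed proportion; 0.5 gives the largest (safest) sample
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size for a very large population
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return math.ceil(n)


# Each line changes one input to show the effect on the required sample size.
print(sample_size(200_000))                   # baseline: 95% confidence, 5% margin
print(sample_size(200_000, confidence=0.99))  # higher confidence -> larger sample
print(sample_size(200_000, margin=0.03))      # smaller margin -> larger sample
```

The three calls mirror the bullets above: raising the confidence level or shrinking the margin of error both push the required sample size up.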
Why a minimum sample of 30?
This recommendation is based on the Central Limit Theorem (CLT) in the field of probability and statistics. As sample size increases, the distribution of sample averages more closely resembles the normal (bell-shaped) distribution, even when the underlying data is not normally distributed. A sample size of 30 is a widely used rule of thumb for the point at which this approximation becomes reasonable; the CLT itself does not set a hard cutoff. Researchers who rely on regression analysis (statistical methods to determine the relationships between independent and dependent variables) also prefer a minimum sample of 30.
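A small simulation can show the idea behind the CLT. The sketch below is illustrative only: it assumes a deliberately skewed population (an exponential distribution with a true average of 1.0), draws many samples of a given size, and looks at how the sample averages behave. The function name `sample_means`, the number of repetitions, and the seed are arbitrary choices.

```python
import random
import statistics


def sample_means(sample_size, repetitions=2000, seed=42):
    """Draw many samples from a skewed population and return each sample's mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


for n in (5, 30, 100):
    means = sample_means(n)
    print(
        f"n={n:>3}: average of sample means = {statistics.fmean(means):.3f}, "
        f"spread (std dev) = {statistics.stdev(means):.3f}"
    )
```

The average of the sample means stays close to the population average of 1.0 at every size, while the spread shrinks as the sample size grows. That shrinking spread is why larger samples represent the population more reliably.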
Still curious? Without getting too much into the math, check out these articles:
- Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit Theorem and briefly describes how it can apply to an analysis of a stock index.
- Sample Size Formula: This article by Statistics Solutions provides a little more detail about why some researchers use 30 as a minimum sample size.
Sample sizes vary by business problem
Sample size will vary based on the type of business problem you are trying to solve.
For example, if you live in a city with a population of 200,000 and get 180,000 people to respond to a survey, that is a large sample size. But without actually doing that, what would an acceptable, smaller sample size look like?
Would 200 be alright if the people surveyed represented every district in the city?
Answer: It depends on the stakes.
- A sample size of 200 might be large enough if your business problem is to find out how residents felt about the new library
- A sample size of 200 might not be large enough if your business problem is to determine how residents would vote to fund the library
You could probably accept a larger margin of error surveying how residents feel about the new library versus surveying residents about how they would vote to fund it. For that reason, you would most likely use a larger sample size for the voter survey.
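To put rough numbers on the library example, the sketch below estimates the margin of error at a 95% confidence level for a few sample sizes. It assumes a simple random sample, a yes/no style question, the conservative proportion of 0.5, and a sample that is small relative to the 200,000 residents (so no finite population correction is applied). The function name and the list of sample sizes are illustrative.

```python
import math
from statistics import NormalDist


def worst_case_margin(n, confidence=0.95):
    """Largest margin of error for a proportion estimated from n responses."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(0.5 * 0.5 / n)  # p = 0.5 maximizes the margin


for n in (200, 400, 1000, 2400):
    print(f"{n:>5} respondents: +/- {worst_case_margin(n) * 100:.1f} percentage points")
```

Under these assumptions, 200 respondents give a margin of error of about 7 percentage points. That may be acceptable for gauging general sentiment about the library, but it is too wide to call a funding vote that could be decided by a few points.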
Larger sample sizes have a higher cost
You also have to weigh the cost against the benefits of more accurate results with a larger sample size. Someone who is trying to understand consumer preferences for a new line of products wouldn’t need as large a sample size as someone who is trying to understand the effects of a new drug. For drug safety, the benefits outweigh the cost of using a larger sample size. But for consumer preferences, a smaller sample size at a lower cost could provide good enough results.
Self-Reflection: Why pre-cleaning activities are important
Overview
Now that you have learned about how to prepare for data cleaning, you can pause for a moment and reflect on these steps. In this self-reflection, you will consider your thoughts about the importance of pre-cleaning activities and respond to brief questions.
This self-reflection will help you develop insights into your own learning and prepare you to apply your knowledge of pre-cleaning activities and insufficient data to your own data cleaning work. As you answer questions—and come up with questions of your own—you will consider concepts, practices, and principles to help refine your understanding and reinforce your learning. You’ve done the hard work, so make sure to get the most out of it: This reflection will help your knowledge stick!
Review data integrity
Before data analysts can analyze data, they first need to think about and understand the data they're working with. Assessing data integrity is a key step in this process. As you've learned in previous lessons, you should complete the following tasks before analyzing data:
- Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data.
- Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.
- Know when to stop collecting data.
Data analysts perform pre-cleaning activities to complete these steps. Pre-cleaning activities help you determine and maintain data integrity, which is essential to the role of a junior data analyst.
What makes data insufficient
One of the objectives of pre-cleaning activities is to address insufficient data. Recall from previous lessons that data can be insufficient for a number of reasons. Insufficient data has one or more of the following problems:
- Comes from only one source
- Continuously updates and is incomplete
- Is outdated
- Is geographically limited
To deal with insufficient data, you can:
- Identify trends within the available data
- Wait for more data if time allows
- Discuss with stakeholders and adjust your objective
- Search for a new dataset
Reflection
Consider what you have learned about data insufficiency and the steps for how to avoid it:
- Why are pre-cleaning steps important to complete prior to data cleaning?
- What problems might occur if you don't follow these steps?
Now, write 2-3 sentences (40-60 words) in response to each of these questions.
Test your knowledge on insufficient data
Question 1
What should an analyst do if they do not have the data needed to meet a business objective? Select all that apply.
- Perform the analysis by finding and using proxy data from other datasets.
- Gather related data on a small scale and request additional time to find more complete data.
- Create and use hypothetical data that aligns with analysis predictions.
- Continue with the analysis using data from less reliable sources.
Explain: If an analyst does not have the data needed to meet a business objective, they should gather related data on a small scale and request additional time. Then, they can find more complete data or perform the analysis by finding and using proxy data from other datasets.
Question 2
Which of the following are limitations that might lead to insufficient data? Select all that apply.
- Data that updates continually
- Data from a single source
- Outdated data
- Duplicate data
Explain: Limitations that might lead to insufficient data include data that updates continually, outdated data, and data from a single source.
Question 3
A data analyst wants to find out how many people in Utah have swimming pools. It’s unlikely that they can survey every Utah resident. Instead, they survey enough people to be representative of the population. This describes what data analytics concept?
A. Margin of error
B. Confidence level
C. Statistical significance
D. Sample
The correct answer is D. Sample. Explain: This describes a sample, which is a part of a population that is representative of the whole.