Data Storytelling - Pauriccarroll/datavisualisation_udacity_q42019 GitHub Wiki

Overview:

  • Define Problem Statement.
  • Issue Trees and Building a Ghost Deck.
  • Project: Build a Data Story Midterm Project.
  • Limitations and Biases
  • Visualisations and Tying it Together
  • Project: Build a Data Story Final Project

Learning Objectives

  • Define Problem Statement: In this lesson students will understand how to clearly articulate the problem statement that is driving analysis and why it matters.
  • Issue Trees and Building a Ghost Deck: In this lesson students will learn how to build an analysis roadmap to stay efficient with their analysis time. An analysis roadmap consists of two key elements: (1) an issue tree and (2) a ghost deck.
  • Project: Build a Data Story Midterm Project.
  • Limitations and Biases: The data you work with is flawed. In this lesson students will learn where bias can be introduced in the collection, processing and analysis stages, and how to call these caveats out.
  • Visualisations and Tying it Together: This lesson ties everything together. It covers choosing effective visualisations for the problem you are solving and reviewing a data presentation end to end.
  • Project: Build a Data Story Final Project.

Projects

Build a Data Story Midterm Project

In this project, you will apply the skills you have learned in structuring a data story. You will be provided with a Movies Metadata CSV and will have the ability to define your own problem statement, the analyses you would conduct to solve the objective, and outline a set of potential solutions/actionable next steps for the target audience.

Project Steps:

  1. Define the problem statement you find particularly compelling for an executive team
  2. Build out a ghost deck that conveys the analyses you’ll need to conduct to provide actionable insights to an executive team. (FYI: The ghost deck should be a comprehensive analysis; you are encouraged to include an issue tree up front or in the appendix to help the audience understand how you structured the problem)
  3. Include an executive summary that takes the team through the problem to be solved, the key takeaways, and your recommendation
  4. Call out hypothesized biases and limitations in the dataset and why they matter to a potential recommendation

Link to data here. Project specifications [here](https://review.udacity.com/#!/rubrics/2580/view)

Build a Data Story Final Project

Your role will be to develop a final presentation of roughly 10-15 slides, together with analyses that dig into potential limitations and biases of the dataset you’re working with.

You will continue to use the same Movies Metadata CSV and conduct the EDA necessary to understand the dataset as a whole. The expected output will be to surface if the dataset is balanced, if there are anomalies in the dataset that affect the applicability of the recommendation, and the final presentation itself that will be used for a mock recommendation to a management team.

Datasets: For this project, you’ll be working with a choice of datasets. The description above relies on the Metadata_Movies.csv.

Project Steps:

  1. Conduct the analyses that you wrote out in Project 1
  2. Dig into potential limitations, biases, and data quality issues that affect the credibility of your recommendations and address why these biases affect your recommendation.
  3. Put together a final presentation that is ready to present to an executive team

Project specifications here

Content

Define problem statement:

"The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades." - Hal Varian

What is data storytelling? The ability to combine narrative with data. More specifically it is the ability to extract and convey an actionable, insightful recommendation from a previously uninterpretable dataset.

Why is data storytelling important? A story makes it clearer to the audience what is happening and can drive significant change. It also incorporates interpretation, encompasses actionability and is often very persuasive.

What goes into an effective data story? A problem statement that is well defined, actionable, in scope and important to the business. Why is defining the problem statement important? It allows you to understand whether the problem is important to the business.

Problem statements need to be:

  1. Well-scoped.
  2. Important to the business.
  3. Actionable.
  4. Specific.

Before starting a project you need to spend a lot of time bringing specificity, alignment and scope to the project plan.

Data Storytelling: The Essential Data Science Skill Everyone Needs to Know

What goes into an effective data story:

  • The Problem Statement: Clearly defined and scoped problem statement that matters to the business.
  • An Effective Analysis: Structured, hypothesis-driven analyses.
  • The "So What" Statement: A synthesised, actionable recommendation.

An effective problem statement will:

  • Be specific, measurable, and actionable.
  • Drive value to a business if solved.
  • Contain the scope to a specific timeline.
  • Allow for structured, hypothesis-driven analyses.
  • Have actionability by providing specific recommendations.

Note: Often you will need to define the problem statement from a place of ambiguity. Once you have defined the problem you will need to spend time explaining this to the management team. Here is an example of a clear problem statement:

What are the key demographic differentiators between the top 5% of players compared to the average player when considering overall rating?

Issue Trees and Build a Ghost Deck

In this section we will talk about how to structure a problem, how to develop issue trees, how to develop hypothesis-driven work and how to create a ghost deck.

Issue Tree: An issue tree is a customized framework used to analyze the root causes of problems in a case. During interviews you will be given case problems that you would need to solve on the spot. Use a graphical issue tree (or logic tree) method to break a complex problem down into its component parts. The benefit of the issue tree is that you can split the problem into subsections that are mutually exclusive. These can then be prioritised and then solved by an analyst.

Example of an issue tree in action

Ghost Deck: This is a skeleton deck that includes the following:

  • Executive Summary.
  • Overview of your analysis.
  • Blank Analysis Slides
  • Next Steps.

McKinsey Problem Solving

The eight step problem solving method

  • Solve at the first meeting with a hypothesis: Utilise MECE. Being MECE in the context of problem-solving means separating your problem into distinct, non-overlapping issues while making sure that no issues relevant to your problem have been overlooked.
  • Intuition is as important as facts: Executives make major strategic decisions based as much on gut instinct as on fact-based analysis.
  • Do your research but don’t reinvent the wheel: Rely on the data sources that are available and close to hand.
  • Tell the story behind the data: Interweave your data with a narrative, set the scene, provide different characters, introduce conflict.
  • Prewire your key people: No one likes to go into a meeting surprised about a situation. Discuss what is happening with the relevant stakeholders.
  • Start with the conclusion: Tell your audience up front what you would like them to consider, with an executive summary.
  • Hit singles: Make sure that what you focus on is a win. Find the areas where you can knock it out of the park.
  • Respect your time: How much analysis does this situation actually require in order to be successful?

Examples of an issue tree:

Can I reduce monthly expenses by 20%:

  • Can I buy fewer things: Eating expenses, shopping expenses, Grocery Expenses, Entertainment Expenses.
  • Can I buy my current items for cheaper: Discounted, Alternatives.

What drives the current market value for the Top 10% of Fifa Players:

  • Is it demographic traits that drive market value: Age, nationality, height, weight.
  • Is it evaluated skills that drive market value: Vision, positioning.
  • Is it performance by position that drives market value: RWB, LWB.
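
The issue tree above can also be written down as a small data structure, which makes it easy to list the leaf hypotheses and check that no leaf sits under two branches. This is only a sketch: the nested-dict layout and the helper names `leaves`, `is_mutually_exclusive` and `render` are illustrative assumptions, not part of the course material.

```python
# Issue tree from the notes, stored as a nested dict:
# question -> branches -> leaf features to analyse.
issue_tree = {
    "What drives the current market value for the Top 10% of Fifa players?": {
        "Is it demographic traits?": ["Age", "Nationality", "Height", "Weight"],
        "Is it evaluated skills?": ["Vision", "Positioning"],
        "Is it performance by position?": ["RWB", "LWB"],
    }
}


def leaves(tree):
    """Return every leaf in the tree, in order."""
    if isinstance(tree, list):
        return list(tree)
    found = []
    for branch in tree.values():
        found.extend(leaves(branch))
    return found


def is_mutually_exclusive(tree):
    """True when no leaf appears under more than one branch."""
    all_leaves = leaves(tree)
    return len(all_leaves) == len(set(all_leaves))


def render(tree, depth=0):
    """Return the tree as a list of indented text lines."""
    if isinstance(tree, list):
        return ["  " * depth + "- " + leaf for leaf in tree]
    lines = []
    for label, branch in tree.items():
        lines.append("  " * depth + label)
        lines.extend(render(branch, depth + 1))
    return lines


print("\n".join(render(issue_tree)))
print("Leaves are mutually exclusive:", is_mutually_exclusive(issue_tree))
```

The check only covers the "mutually exclusive" half of MECE. Whether the branches are collectively exhaustive is a judgement call that the analyst still has to make.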

What is a ghost deck

An outline of a presentation you plan in advance of making it. Typically it will include the following sections:

  • Synthesis: The problem and "so what" statement.
  • Overview of analysis: Clear flow from one thought to the next, so that the headings make sense when read by themselves.
  • Analysis 1.
  • Analysis 2.
  • Analysis 3.
  • Limitations of the analysis: Call-outs to the limitations and biases of the analysis. These can include data collection issues, data validation issues, an unrepresentative sample and data processing issues.
  • Review of next steps.

Wine Ghost Deck

  • Synthesis: The region the wine is from drives its quality score. Our recommendation is to have at least 1-2 wines from X region.
  • Overview of analysis: Price differences by quality score, regional differences by quality score, variety differences by quality score.
  • Analysis 1 - Price Differences: Price distribution differences between the top 10% of wines by quality and the remaining wines.
  • Analysis 2 - Regional Differences: Regional distribution differences between the top 10% of wines by quality and the remaining wines.
  • Analysis 3 - Variety Differences: Variety distribution differences between the top 10% of wines by quality and the remaining wines.
  • Limitation 1: Not all wines that are for purchase are present in the dataset.
  • Limitation 2: Bias in review objectivity.

Limitations & Biases:

In this section we will discuss the following topics: an overview of biases, biases in data collection, commonly asked questions and dealing with missingness.

Overview of biases:

Biased Data: Your data is flawed. It is often not representative of the population you are trying to speak about. You need to understand whether these biases are coming from:

  • Data Collection: Selection bias, response bias, missing variables, survivorship bias.
  • Data Analysis: Outliers, distribution understanding, missingness.
  • Processing Steps: Confirmation bias, overfitting/underfitting, confounding variables.

Data Collection Biases

  • Selection Bias: Non-response bias can come into play here. Basing important business decisions on research conducted using non-representative respondents is potentially disastrous.
  • Voluntary Bias: Random sampling can provide strong protection against voluntary response bias.
  • Under-coverage: When members of the population are under-represented.
  • Response Bias: Arises from leading questions and also from the social desirability of certain answers.
  • Missing Variables: Features that are not included as part of data collection but that affect the analysis and the final recommendations.
  • Survivorship Bias: Only the brands that exist in the collection today are analysed, although the brands that have churned have implications for the analysis and interpretation.

Examples of Selection Bias:

  • Non-response: Asking a survey group their salaries as part of a feature dataset, or asking a group of older residents to take a five-minute survey via a smartphone as part of a feature dataset.
  • Voluntary: Polling a college group about their political preferences as a feature set to represent the larger population's opinion.
  • Under-coverage: Phoning business people to ask about their financial growth and primarily getting responses from companies that are growing. Running polls in urban areas as part of a feature dataset.

Examples of Response Bias:

  • Leading questions: Prompt the respondent to answer in a certain way, for example "Do you have a problem with your boss?" or "Do you think the transit is rubbish?"
  • Social desirability: With salaries, for example, respondents want to be seen to be earning a lot. You can deal with this by assuring confidentiality and anonymity.
  • Missing Variables:
  • Survivorship Bias: Only focusing on the merits of the companies that remain while ignoring those that have been excluded. For example, Amazon excludes data on products which are no longer online. They spend a significant amount of time scraping data.

Checklist to avoid data collection biases:

  • What business problem are we trying to solve by collecting this data?
  • What schema will you organise the data into?
  • Collect a sample of that data. Then audit the data.
  • Scale collection as we understand the biases of the data.

Biases in Data Analysis

  1. Outliers: Identify the outliers. How do outliers impact your results?
  2. Distribution: What is the distribution across all variables? How does your understanding of the distribution impact your results?
  3. Missingness: How much missingness is in your dataset? How does your understanding of the missingness impact your results?

Outlier: An extreme data value. A common way to identify outliers uses the interquartile range (IQR), calculated by subtracting Q1 from Q3 (IQR = Q3 - Q1). Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly flagged as outliers. There are more advanced ways of detecting outliers, including the generalized ESD test, Grubbs' test, Dixon's test, the modified Thompson tau test and Peirce's criterion. Example: a person's age is recorded as 120.
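
As a minimal sketch of the IQR approach, the snippet below flags values that fall more than 1.5 × IQR outside the quartiles. The 1.5 multiplier is the common convention rather than something fixed by these notes, the `ages` list is made-up illustration data, and `iqr_outliers` is a hypothetical helper name. Quartile conventions differ between tools, so other software may give slightly different cut-offs.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up ages, including the "person age is 120" example from the notes.
ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 120]
print(iqr_outliers(ages))
```

Flagging a value is only the first step. Whether to drop, cap or keep it depends on how it affects the recommendation.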

Distribution: A set of distribution summaries across each of the variables. Summary statistics (such as the mean and median). Visualisation. Product Reviews:

Missingness

  • Missing Completely at Random (MCAR): The propensity for data to be missing is completely random, e.g. when data is missing for respondents whose responses were lost in the mail.
  • Missing at Random (MAR): The propensity for a data point to be missing has a pattern and is related to another observed variable, e.g. political opinion polls and the refusal to answer questions.
  • Missing Not at Random (MNAR): The propensity for a data point to be missing is related to the values in the data itself, e.g. data is missing on IQ and only the people with low IQ have missing observations for this type of data.
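
To see why the mechanism matters, here is a small sketch contrasting MCAR with MNAR, using the IQ example above. The scores, the 30% drop rate and the cut-off of 90 are all made-up assumptions for illustration, and MAR is left out to keep the example short.

```python
import random
import statistics

# Made-up "true" IQ scores for 15 respondents.
true_iq = [70, 78, 85, 88, 92, 95, 100, 100, 104, 108, 112, 115, 121, 127, 135]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# MCAR: every value has the same 30% chance of going missing.
mcar = [v if rng.random() > 0.3 else None for v in true_iq]

# MNAR: missingness depends on the value itself (low scores go unreported).
mnar = [v if v >= 90 else None for v in true_iq]


def observed_mean(values):
    """Mean of the values that are actually present."""
    return statistics.mean(v for v in values if v is not None)


print("True mean:         ", statistics.mean(true_iq))
print("Observed mean MCAR:", observed_mean(mcar))
print("Observed mean MNAR:", observed_mean(mnar))
```

Under MNAR the observed mean is pushed upwards because only low scores are missing, whereas under MCAR the observed mean is an unbiased (if noisier) estimate of the true mean.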

Strategies to address missingness:

  • Deletion methods: Listwise deletions, pairwise deletion.
  • Single imputation methods: Mean/mode substitution, regression imputation.
  • Model based.

Listwise Deletion: A method of handling missing data in which an entire record is excluded from analysis if any single value in it is missing. This method assumes the data is MCAR. It is relatively simple to implement. Disadvantages: reduced statistical power, it does not use all available data, and estimates may be biased if the data is not MCAR.

Mean/Mode Substitution: Replace missing values with the sample mean or mode, then run analyses as if all cases were complete. Advantage: you can use complete-case analysis methods. Disadvantage: it reduces variability and weakens covariance and correlation estimates in the data.

Pairwise Deletion: Each analysis uses all cases in which the variables of interest are present. This method also assumes the data is MCAR. Advantage: it keeps as many cases as possible for each analysis and uses all the information available to it. Disadvantage: you cannot directly compare analyses, because the sample is different each time.
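
The three strategies can be sketched on a tiny table as follows. The records and the helper names (`listwise_delete`, `pairwise_select`, `mean_substitute`) are made up for illustration, and missing values are represented as `None`.

```python
import statistics

# Made-up records; None marks a missing value.
records = [
    {"age": 25, "income": 40000, "rating": 4.0},
    {"age": 31, "income": None, "rating": 3.5},
    {"age": None, "income": 52000, "rating": 4.5},
    {"age": 45, "income": 61000, "rating": None},
    {"age": 38, "income": 58000, "rating": 5.0},
]


def listwise_delete(rows):
    """Keep only the rows with no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise_select(rows, columns):
    """Keep the rows where the columns needed for one analysis are present."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


def mean_substitute(rows, column):
    """Return copies of the rows with missing values replaced by the column mean."""
    fill = statistics.mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


print(len(listwise_delete(records)))                     # rows usable everywhere
print(len(pairwise_select(records, ["age", "income"])))  # rows usable for age vs income
print([r["income"] for r in mean_substitute(records, "income")])
```

Listwise deletion leaves fewer rows than pairwise selection here, and the mean-substituted column has a smaller variance than the observed values, which is the loss of variability described above.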

Biases in Data Insights:

Confirmation Bias: Tendency to interpret an analysis based on our own beliefs. Confirmation bias can unintentionally occur with the inclusion of domain experts. They often influence the collection process, the processing and the insight generation to an extent that confirms their own heuristics.

Overfitting vs Underfitting: Fitting a regression line without paying attention to the nuances of the data itself.

Confounding Variables: Tendency to draw an inaccurate insight due to the absence of a feature that is more relevant to the analysis. Confounding affects an analysis when the missing variable reflects a bias that was introduced at the data collection stage.

Visualisations and Tying it Together:

In this section we will talk about relationships, concepts, temporal changes, distributions, metric outputs and data normalisation. The main takeaways are:

  • Define your problem statement.
  • Build a Ghost Deck.
  • Outline the limitations and biases in the data.
  • Deliver the data presentation.

Data is not intimidating and is infused with meaning via narrative. This section addresses the following questions: Why does defining a problem statement up front matter so much? Why does building out an issue tree with structured hypotheses help you save time and stay focused? Why does a ghost deck help you think ahead on what your analysis will look and feel like? What are effective visualizations based on varying problem statements? And how does it all look together?

Visualisations

  • Relationship: When your problem statement requires understanding the relationship between 1+ features. Examples include correlation plots and regressions.
  • Comparison: When your problem statement requires the comparison of two data features or cohorts. The example here would be box plots.
  • Temporal: When your problem statement requires understanding changes to 1+ features across time. Examples include time plots.
  • Distribution: When your problem statement requires understanding the biases in your data. Examples include the mean, the mode and other averages. We can also leverage geo maps.
  • Metric Outputs: When showcasing the results to your problem statement (e.g. experimentation, prediction, performance, metric etc.).

Charts are useful in showing how our predictions compare to the actual data. Distributions allow you to compare quantities. Relationships explore how two variables relate to each other - whether both increase or whether one increases while the other decreases. The comparison of cohorts allows us to draw the comparison between the different audiences.

Use Cases:

When both your variables are numeric, use Pearson correlation or regression. When both your variables are categorical, use a chi-square test or another non-parametric test. When one variable is categorical and the other is numerical, an ANOVA test is a good choice.
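
The rule of thumb above can be sketched as a small lookup, followed by a hand-rolled Pearson correlation for the numeric case. The function names `suggest_test` and `pearson_r` and the height/weight figures are made up for illustration. In practice you would more likely reach for a statistics library than compute this by hand.

```python
import math


def suggest_test(kind_a, kind_b):
    """Map a pair of variable types to the test suggested in the notes."""
    kinds = {kind_a, kind_b}
    if kinds == {"numeric"}:
        return "Pearson correlation or regression"
    if kinds == {"categorical"}:
        return "chi-square test"
    if kinds == {"numeric", "categorical"}:
        return "ANOVA"
    raise ValueError("expected 'numeric' or 'categorical'")


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up player heights (cm) and weights (kg).
heights = [170, 175, 180, 185, 190]
weights = [68, 72, 77, 80, 88]

print(suggest_test("numeric", "numeric"))
print(round(pearson_r(heights, weights), 3))
```

A coefficient close to +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship. It says nothing about causation.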

Comparison: When your problem statement requires a comparison between two cohorts, e.g. what are the demographic differences between the top 10% of Fifa players by market value and the remaining 90% of players? One of the most powerful ways to understand comparisons is a box plot, which can be read in many ways. The centre will tell you if, on average, the cohorts are similar. The spread will tell you if they vary differently. The shape (symmetry, skewness) will indicate any asymmetry, and the plot will also highlight missing data / outliers.

Box Plot
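
A box plot is drawn from a five-number summary, so the centre and spread of two cohorts can be compared numerically before any chart is made. The sketch below does this for two made-up cohorts of player ages. The data and the `five_number_summary` helper are illustrative assumptions, and plotting tools may use a slightly different quartile convention.

```python
import statistics


def five_number_summary(values):
    """Return the min, quartiles and max that a box plot is built from."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


# Made-up ages for the two cohorts being compared.
top_10_percent = [19, 21, 22, 23, 24, 26, 28]
remaining_90_percent = [22, 25, 27, 29, 30, 32, 34, 36]

for name, cohort in [("Top 10%", top_10_percent),
                     ("Remaining 90%", remaining_90_percent)]:
    summary = five_number_summary(cohort)
    spread = summary["q3"] - summary["q1"]  # interquartile range
    print(name, summary, "IQR:", spread)
```

Comparing the two medians answers the "centre" question, and comparing the two IQRs answers the "spread" question from the paragraph above.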

Temporal: When your problem statement requires an understanding of change in 1+ features across time, e.g. how has the number of small brands changed in the past ten years compared to larger CPG strategics?

Distribution: When your problem statement requires understanding the biases in your data eg why has crime in NYC neighbourhoods decreased by 10% in the past 5 years. You can leverage R to create richer bubble visualisation for your user.

Reminder: Build out an Issue Tree

What drives the current market value of the Top 10% of Fifa Players?

  • Is it demographically driven.
  • Is it evaluated skills that drive market value.
  • Is it performance by position that drives market value.

Communicating Visualisations:

Strong communications of visualisations have the following traits:

  • Starts with the 'so what'.
  • Review the hypothesis to the 'so what', if necessary.

Reminder

An end to end data presentation includes:

  • A synthesis with a recommendation.
  • An overview of the analysis.
  • Key analysis that supports the recommendation.
  • An overview of the biases that matter to and affect the recommendation.
  • A set of proposed next steps.

Ghost Deck

Executive Summary

The key differentiators of the Top 10% of Fifa players based on market value from the remaining 90% of players are, in order of significance:

  1. The potential of the player.
  2. The position on field of the player.
    • Strikers and right midfield are more often in the Top 10% of rated players.
  3. The age of the player.
    • Top rated players are significantly younger than the rest of the player population.
  4. The ability of the player to react and stay composed.