Exploratory Data Analysis - clizarraga-UAD7/Workshops GitHub Wiki
(Image Credit: Wikimedia Commons, CC)
- Describe main characteristics of dataset: number of rows/columns, missing data, data types, preview.
- Clean corrupted data and handle missing data, invalid data types, and incorrect values.
- Visualize data distributions using the Seaborn Library: bar plots, count plots, histograms, box plots, violin plots, and more.
- Calculate and visualize correlations (relationships) between variables with the help of a heat map.
Exploratory Data Analysis is a statistical approach to analyzing data sets in order to quickly summarize their main characteristics. It may be supported with simple data visualizations such as box plots, histograms, scatter plots, cumulative distribution functions, and quantile-quantile (Q-Q) plots, among others.
John W. Tukey wrote the book Exploratory Data Analysis in 1977, where he held that too much emphasis in statistics was placed on statistical hypothesis testing and that more emphasis needed to be placed on using data to suggest hypotheses to test. Exploratory Data Analysis does not require any prior assumption about the statistical distribution of the underlying data.
Tukey suggested computing the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the two quartiles, since they are defined for all empirical distributions.
Tukey also gives a criterion for identifying outliers. If Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range, then an outlier is any value that falls below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.
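The five-number summary and the 1.5 IQR rule can be sketched with Pandas as follows. The sample values are made up for illustration, and the quartiles use Pandas' default linear interpolation, which is only one of several quartile conventions.

```python
import pandas as pd

# Hypothetical sample; 40 is deliberately far from the rest
values = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 40])

# Five-number summary: minimum, Q1, median, Q3, maximum
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
five_numbers = {
    "min": values.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": values.max(),
}

# Tukey's fences: anything outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] is flagged
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(five_numbers)
print(outliers.tolist())
```

With this sample the fences are -2 and 14, so only the value 40 is flagged.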
(Image credit: UF Biostatistics Open Learning Textbook, CC)
Histograms summarize the distribution of the data, by placing observations into intervals (bins) and counting the number of observations in each interval.
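As a rough illustration of binning, the sketch below counts observations per interval with NumPy. The observations, the choice of four bins, and the range are assumptions picked for readability.

```python
import numpy as np

# Hypothetical observations
observations = np.array([1.0, 1.5, 2.2, 2.8, 3.1, 3.3, 3.9, 4.5, 5.0])

# Split the range [1, 5] into 4 equal-width bins and count observations in each
counts, bin_edges = np.histogram(observations, bins=4, range=(1.0, 5.0))

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {count}")
```

Note that NumPy treats the last bin as closed on the right, so the value 5.0 is counted in the final interval.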
A boxplot, or box-and-whisker plot, provides a compact summary of the distribution of a variable. A standard boxplot consists of:
- a box defined by the 25th and 75th percentiles,
- a horizontal line or point on the box at the median, and
- vertical lines (whiskers) drawn from each hinge (quartile) to the extreme value.
The cumulative distribution function (CDF) is a function F(x) that gives the probability that an observation of a variable is not larger than a specified value x.
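For observed data, the CDF is usually estimated by the empirical CDF: the fraction of observations not larger than x. A minimal sketch, where the helper name `ecdf` and the data are hypothetical:

```python
import numpy as np

def ecdf(sample, x):
    """Return the fraction of observations in `sample` that are <= x."""
    sample = np.asarray(sample)
    return np.mean(sample <= x)

# Hypothetical data
data = [3, 1, 4, 1, 5, 9, 2, 6]

print(ecdf(data, 4))   # 5 of the 8 observations are <= 4
print(ecdf(data, 0))   # no observation is <= 0
print(ecdf(data, 9))   # every observation is <= 9
```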
A quantile-quantile (Q-Q) plot, or probability plot, is a graphical means of comparing a variable either to a particular theoretical distribution or to the distribution of another variable. One common application of the Q-Q plot is to check whether a variable is normally distributed.
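The pairs of points behind a normal Q-Q plot can be computed by hand, which may help clarify what the plot shows. The sketch below assumes a synthetic normal sample and the plotting positions (i - 0.5)/n, which is one common convention among several.

```python
import numpy as np
from statistics import NormalDist

# Synthetic sample drawn from a normal distribution (assumption for the demo)
rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=200)

n = len(sample)
# Plotting positions: one probability per ordered observation
probs = (np.arange(1, n + 1) - 0.5) / n
# Quantiles of the standard normal distribution at those probabilities
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
# Quantiles of the data are simply the sorted observations
observed = np.sort(sample)

# If the sample is close to normal, the points fall near a straight line,
# so the correlation between the two sets of quantiles is close to 1
r = np.corrcoef(theoretical, observed)[0, 1]
print(round(r, 3))
```

Plotting `observed` against `theoretical` as a scatter plot gives the Q-Q plot itself.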
Scatterplots are graphical displays of matched data plotted with one variable on the horizontal axis and the other variable on the vertical axis.
We list some functions that are useful in an EDA. Some of them have been used previously.
Function | Description |
---|---|
df.columns | Prints column names of dataframe |
df.compare() | Compare one dataframe with another and show differences |
df.corr() | Compute pairwise correlation between columns excluding NaN/Null values |
df.describe() | Generate descriptive statistics of numerical values |
df.dropna() | Removes rows or columns with missing values |
df.fillna() | Fill NaN/Null values using a specified method |
df.head() | Returns the first n=5 rows of a dataframe |
df.info() | Print summary of dataframe |
df.interpolate() | Fill NaN values using an interpolation method |
df.isnull().sum() | Counts the missing values in each column |
df.query() | Query the columns of a dataframe with a boolean expression |
df.sample() | Returns a random sample of rows from a dataframe |
df.shape | Prints the dimensions of a dataframe (rows, columns) |
df.tail() | Returns the last n=5 rows of a dataframe |
df.dtypes | Prints data types of each column |
pd.Series.unique() | Returns unique values from the series |
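A few of these functions are sketched below on a tiny, made-up dataframe. The column names and values are assumptions for illustration only, and filling with the median is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with some missing ages
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "sex": ["male", "female", "female", "female", "male"],
})

print(df.shape)             # (rows, columns)
print(df.dtypes)            # data type of each column
print(df.describe())        # descriptive statistics of numerical columns

missing = df.isnull().sum()                            # missing values per column
filled = df.fillna({"age": df["age"].median()})        # fill missing ages with the median
complete = df.dropna()                                 # drop rows with any missing value
expensive = df.query("fare > 50")                      # filter rows with a boolean expression
print(df["sex"].unique())                              # unique values of one column
```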
Additional Pandas Tools | Situations |
---|---|
merge, join, concatenate and compare | Forms of combining different data frames |
Working with missing data | Possible options for handling missing data |
Group by - split, apply, combine | Pandas objects can be split on any of their axes |
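The sketch below illustrates two of these tools, group by and merge, on small made-up dataframes. All names (`passengers`, `labels`, `pclass`, `class_name`) are hypothetical.

```python
import pandas as pd

# Hypothetical passenger records and a lookup table of class names
passengers = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})
labels = pd.DataFrame({
    "pclass": [1, 2, 3],
    "class_name": ["First", "Second", "Third"],
})

# Split by class, apply the mean to the fares, combine into one Series
mean_fare = passengers.groupby("pclass")["fare"].mean()

# Combine the two dataframes on their shared key
merged = passengers.merge(labels, on="pclass", how="left")

print(mean_fare)
print(merged)
```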
Optional
Another library we can use for Exploratory Data Analysis is the Sidetable Library, written by Chris Moffitt.
To install it from a Jupyter Notebook we can enter the pip command:
```
!pip install sidetable
```
or, if we are using conda, run from a terminal:
```
conda install -c conda-forge sidetable
```
After we have sidetable installed, we import it together with Pandas:
```python
import pandas as pd
import sidetable
```
The functions we will cover are:
- Freq function
- Counts function
- Missing function
- Subtotal function
The freq function returns a dataframe that conveys 3 pieces of information.
- The number of observations (i.e. rows) for each category (value_counts()).
- The percentage of each category in the entire column (value_counts(normalize=True)).
- The cumulative versions of the two above.
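To make these three pieces concrete, the following sketch builds a comparable table with plain Pandas. It is not sidetable's implementation, and the column labels and sample data are chosen here only for illustration.

```python
import pandas as pd

# Hypothetical categorical column
s = pd.Series(["Third", "First", "Third", "Second", "Third", "First"], name="class")

counts = s.value_counts()                        # observations per category
percent = s.value_counts(normalize=True) * 100   # share of each category in percent

table = pd.DataFrame({
    "count": counts,
    "percent": percent,
    "cumulative_count": counts.cumsum(),         # running total of counts
    "cumulative_percent": percent.cumsum(),      # running total of percentages
})
print(table)
```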
Another useful function of sidetable is the counts function. It returns the number of unique values in each column along with some other measures:
- The number of non-missing values in each column
- The number of unique categories in each column
- The most and least frequent categories in each column
- The number of values that belong to the most and least frequent categories
The missing function is pretty simple. It returns the count and percentage of missing values in each column.
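A similar summary can be assembled with plain Pandas, which may help show what the missing function reports. This is only a sketch on a made-up dataframe; the column labels are chosen here and are not claimed to match sidetable's output.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing values in two columns
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "deck": [None, None, "C", None],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

n_missing = df.isnull().sum()   # count of missing values per column
summary = pd.DataFrame({
    "missing": n_missing,
    "total": len(df),
    "percent": n_missing / len(df) * 100,
}).sort_values("missing", ascending=False)

print(summary)
```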
The subtotal function is best used with the groupby function of Pandas. It adds a subtotal for levels of the grouping.
This is an example of how to use sidetable, which is called through the Pandas accessor df.stb.
```python
import pandas as pd
import seaborn as sns
import sidetable  # registers the df.stb accessor on Pandas dataframes

# Load the Titanic dataset
df = sns.load_dataset('titanic')
# Build a frequency table for one column
df.stb.freq(['class'], style=True)
# Build a frequency table for a combination of columns
df.stb.freq(['sex', 'class'], style=True)
# See what data is missing
df.stb.missing()
# Group data, sum the fares, and add subtotals (stb needs a dataframe, hence [['fare']])
df.groupby(['sex', 'class'])[['fare']].sum().stb.subtotal()
```
We continue with the Titanic dataset loaded into the dataframe df.
Function | Description |
---|---|
df.stb.freq(['class'], style=True) | Similar to Pandas df['class'].value_counts(normalize=True) |
df.stb.freq(['sex', 'class'], style=True) | You can group more than one column together. |
df.stb.freq(['class'], value='fare') | With the value argument, the data is summed over another column (here fare) instead of counting rows. |
df.stb.freq(['class', 'who'], value='fare', thresh=80) | Uses thresh to set a cumulative percentage threshold: categories up to that threshold are listed and the rest are grouped into a single category. |
df.stb.freq(['class', 'who'], value='fare', thresh=80, other_label='All others') | Specifies the label used for the grouped remaining categories. |
df.stb.counts() | Shows how many unique values, most and least frequent values and total count. |
df.stb.counts(exclude='number') | Excludes numeric values (same syntax as DataFrame.select_dtypes). |
df.stb.missing(style=True) | Summary of missing values. |
df.stb.missing(clip_0=True, style=True) | Excludes variables with 0 missing values. |
df.stb.subtotal() | Adds a Grand Total row. |
(Please see more details in the sidetable documentation)
Automated EDA packages can perform EDA in a few lines of Python code.
Here is a small list of them:
The Seaborn Library is built on the general visualization library Matplotlib. Seaborn makes it easier to visualize the statistical properties of a dataset.
Seaborn can produce several types of graphics. We will only show a small set of them that can be used in performing an Exploratory Data Analysis.
Function | Description |
---|---|
Relational Plots | |
sns.scatterplot() | Basic relational plot between variables |
sns.lineplot() | Plot lines between values |
Distribution Plots | |
sns.histplot() | Basic frequency distribution plot |
sns.kdeplot() | The kernel density estimation plot |
Categorical Plots | |
sns.stripplot() | Basic distribution categorical plot |
sns.swarmplot() | Categorical plot without overlapping points |
sns.boxplot() | Categorical box plots |
sns.violinplot() | Categorical violin plots |
sns.boxenplot() | Enhanced boxplot for larger datasets |
sns.pointplot() | Point estimates and confidence intervals using scatter plot glyphs |
sns.barplot() | Point estimates and confidence intervals as rectangular bars |
sns.countplot() | Counts of observations in each categorical bin using bars |
Regression Plots | |
sns.lmplot() | Plot data and regression model fits |
Matrix Plots | |
sns.heatmap() | Plot rectangular data as a color-encoded matrix |
Multiplot grids | |
sns.FacetGrid() | Multi-plot grid for plotting conditional relationships |
sns.pairplot() | Plot pairwise relationships in a dataset |
sns.jointplot() | Draw a plot of two variables with bivariate and univariate graphs |
Optional
The seaborn.objects namespace is a new interface for making Seaborn plots. It offers a more consistent and flexible API, comprising a collection of composable classes for transforming and plotting data.
The objects interface should be imported with the following convention:
```python
import seaborn.objects as so
```
The seaborn.objects namespace is composed of classes, with Plot being the most important. You specify plots by instantiating a Plot object and calling its methods.
Object | Description |
---|---|
so.Plot() | An interface for declaratively specifying statistical graphics. |
so.Dot() | A mark suitable for dot plots or less-dense scatterplots. |
so.Dots() | A dot mark defined by strokes to better handle overplotting. |
so.Line() | A mark connecting data points with sorting along the orientation axis. |
so.Lines() | A faster but less-flexible mark for drawing many lines. |
so.Path() | A mark connecting data points in the order they appear. |
so.Paths() | A faster but less-flexible mark for drawing many paths. |
so.Dash() | A line mark drawn as an oriented segment for each datapoint. |
so.Range() | An oriented line mark drawn between min/max values. |
so.Bar() | A bar mark drawn between baseline and data values. |
so.Bars() | A faster bar mark with defaults more suitable for histograms. |
so.Area() | A fill mark drawn from a baseline to data values. |
so.Band() | A fill mark representing an interval between values. |
so.Text() | A textual mark to annotate or represent data values. |
so.Agg(func='mean') | Aggregate data along the value axis using given method. |
so.Est() | Calculate a point estimate and error bar interval. |
so.Count() | Count distinct observations within groups. |
so.Hist() | Bin observations, count them, and optionally normalize or cumulate. |
so.Perc(k=5, method='linear') | Replace observations with percentile values. |
so.PolyFit(order=2, gridsize=100) | Fit a polynomial of the given order and resample data onto predicted curve. |
so.Dodge(empty='keep', gap=0, by=None) | Displacement and narrowing of overlapping marks along orientation axis. |
so.Norm(func='max', where=None, by=None, percent=False) | Divisive scaling on the value axis after aggregating within groups. |
so.Stack() | Displacement of overlapping bar or area marks along the value axis. |
- Exploratory Data Analysis with Seaborn & Seaborn.objects
- Exploratory Data Analysis Jupyter Notebook (Pandas/Seaborn).
- 10 automated EDA libraries in one place. Implementation of Exploratory Data Analysis libraries with a few lines of Python code.
- Exploratory Data Analysis Wikipedia article.
- Exploratory Data Analysis. Causal Analysis/Diagnosis Decision Information System (CADDIS), U.S. Environmental Protection Agency.
- Exploratory Data Analysis. Engineering Statistics Handbook. NIST.gov.
- The Essential Data Science Reference Notebook. Eric Onofrey.
- Introduction to Data Visualization. Coding for Economists, Arthur Turrell.
- Pandas Cookbook
- Seaborn Visualization Library Examples
- Introduction to Seaborn Tutorial
- Seaborn.objects.Plot
- Matplotlib Visualization Library
Created: 01/30/2022 (C. Lizárraga); Last update: 02/20/2023 (C. Lizárraga)