Exploratory Data Analysis - clizarraga-UAD7/Workshops GitHub Wiki

Exploratory Data Analysis with Python and Seaborn.

Data Science Process

(Image Credit: Wikimedia Commons, CC)


Learning objectives

  • Describe main characteristics of dataset: number of rows/columns, missing data, data types, preview.
  • Clean corrupted data: handle missing data, invalid data types, and incorrect values.
  • Visualize data distributions using the Seaborn Library: bar plots, count plots, histograms, box plots, violin plots, and more
  • Calculate and visualize correlations (relationships) between variables with the help of a heat map.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis is a statistical approach to analyzing data sets in order to quickly summarize their main characteristics. It is often supported with simple data visualizations such as box plots, histograms, scatter plots, cumulative distribution functions, and quantile-quantile (Q-Q) plots, among others.

John W. Tukey wrote the book Exploratory Data Analysis in 1977, where he held that too much emphasis in statistics was placed on statistical hypothesis testing and more emphasis needed to be placed on using data to suggest hypotheses to test. Exploratory Data Analysis does not need any previous assumption on the statistical distribution of the underlying data.

Tukey suggested computing the five-number summary of numerical data: the two extremes (minimum and maximum), the median, and the two quartiles, since they are defined for all empirical distributions.

Tukey also gave a criterion for identifying outliers. If Q1 and Q3 are the first and third quartiles, and the interquartile range is IQR = Q3 - Q1, then a value is considered an outlier if it falls below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.

Tukey Outlier Criterion

(Image credit: UF Biostatistics Open Learning Textbook, CC)
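As a minimal sketch, Tukey's 1.5 IQR rule can be implemented directly with NumPy; the sample data below are made up for illustration:

```python
import numpy as np

def iqr_outliers(values):
    """Return the values flagged as outliers by Tukey's 1.5*IQR rule."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

data = [2, 3, 4, 5, 5, 6, 7, 8, 30]  # 30 sits far outside the bulk of the data
print(iqr_outliers(data))  # -> [30]
```

Here Q1 = 4 and Q3 = 7, so the fences are -0.5 and 11.5, and only 30 is flagged.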


Common plots.

Histograms summarize the distribution of the data, by placing observations into intervals (bins) and counting the number of observations in each interval.

Boxplots (box-and-whisker plots) provide a compact summary of the distribution of a variable. A standard boxplot consists of:

  • a box defined by the 25th and 75th percentiles,
  • a horizontal line or point on the box at the median, and
  • vertical lines (whiskers) drawn from each hinge (quartile) to the extreme value.

The cumulative distribution function (CDF) is a function F(x) giving the probability that an observation of the variable is not larger than the value x.

A quantile-quantile (Q-Q) plot, or probability plot, is a graphical means for comparing a variable to a particular, theoretical distribution or to compare it to the distribution of another variable. One common application of the Q-Q plot is to check whether a variable is normally distributed.

Scatterplots are graphical displays of matched data plotted with one variable on the horizontal axis and the other variable on the vertical axis.
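As a quick sketch, the empirical CDF described above can be computed directly with NumPy; the normal sample and seed here are purely illustrative:

```python
import numpy as np

# Illustrative sample: 1000 draws from a normal distribution with mean 10
rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=1000)

# Empirical CDF: for each sorted value, the fraction of observations <= it
x = np.sort(sample)
ecdf = np.arange(1, x.size + 1) / x.size

# F(x) evaluated at the mean of a symmetric distribution should be close to 0.5
print(np.interp(10.0, x, ecdf))
```

Plotting `x` against `ecdf` as a step function gives the CDF plot; a Q-Q plot instead compares these empirical quantiles against theoretical ones.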


Data analysis with Pandas

We list some Pandas functions that are useful in an EDA; some of them have appeared in earlier sections.

| Function | Description |
|---|---|
| df.columns | Prints the column names of the dataframe |
| df.compare() | Compares one dataframe with another and shows the differences |
| df.corr() | Computes pairwise correlation between columns, excluding NaN/Null values |
| df.describe() | Generates descriptive statistics of numerical columns |
| df.dropna() | Removes rows or columns with missing values |
| df.fillna() | Fills NaN/Null values using a specified method |
| df.head() | Prints the first n=5 rows of a dataframe |
| df.info() | Prints a summary of the dataframe |
| df.interpolate() | Fills NaN values using an interpolation method |
| df.isnull().sum() | Counts the missing values in each column |
| df.query() | Queries the columns of a dataframe with a boolean expression |
| df.sample() | Returns a random sample of rows from a dataframe |
| df.shape | Prints the dimensions of a dataframe (rows, columns) |
| df.tail() | Returns the last n=5 rows of a dataframe |
| df.dtypes | Prints the data type of each column |
| pd.Series.unique() | Returns the unique values of a series |

| Additional Pandas tools | Situation |
|---|---|
| merge, join, concatenate and compare | Ways of combining different dataframes |
| Working with missing data | Options available when data are missing |
| Group by: split, apply, combine | Pandas objects can be split on any of their axes |
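A minimal sketch of several of these functions in action; the small dataframe and its column names are made up for the demo:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing value in 'age'
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0],
    "fare": [7.25, 71.28, 8.05, 51.86],
    "class": ["Third", "First", "Third", "First"],
})

print(df.shape)           # (4, 3): rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # one missing value in 'age'
print(df.describe())      # summary statistics of numeric columns
clean = df.dropna()       # drop the row with the missing age
print(clean.shape)        # (3, 3)
```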

Sidetable Library for Pandas

(Optional section)

There is another library we can use for Exploratory Data Analysis: the Sidetable library, written by Chris Moffitt.

To install it from a Jupyter Notebook we can enter the pip command:

!pip install sidetable

or if we are using conda, from a terminal run

conda install -c conda-forge sidetable

Once sidetable is installed, we import it together with Pandas:

import pandas as pd
import sidetable

The functions we will cover are:

  • Freq function
  • Counts function
  • Missing function
  • Subtotal function

Freq function

The freq function returns a dataframe that conveys three pieces of information:

  • The number of observations (i.e. rows) for each category (value_counts()).
  • The percentage of each category in the entire column (value_counts(normalize=True)).
  • The cumulative versions of the two above.

Counts function

Another useful sidetable function is counts. It returns the number of unique values in each column along with some other measures:

  • The number of non-missing values in each column
  • The number of unique categories in each column
  • The most and least frequent categories in each column
  • The number of occurrences of the most and least frequent categories in each column

Missing function

The missing function is pretty simple. It returns the count and percentage of missing values in each column.

Subtotal function

The subtotal function is best used with the groupby function of Pandas. It adds a subtotal for levels of the grouping.

Here is an example of how to use sidetable, which is called through the Pandas accessor df.stb.

import pandas as pd
import seaborn as sns
import sidetable

# Load the Titanic dataset (bundled with seaborn)
df = sns.load_dataset('titanic')

# Frequency table for one column
df.stb.freq(['class'], style=True)

# Frequency table for a combination of columns
df.stb.freq(['sex', 'class'], style=True)

# Count and percentage of missing values per column
df.stb.missing()

# Group data and add subtotals per group
df.groupby(['sex', 'class'])['fare'].sum().stb.subtotal()

We continue with the Titanic dataset loaded into the dataframe df.

| Function | Description |
|---|---|
| df.stb.freq(['class'], style=True) | Similar to Pandas df['class'].value_counts(normalize=True), with count and cumulative columns added |
| df.stb.freq(['sex', 'class'], style=True) | You can group more than one column together |
| df.stb.freq(['class'], value='fare') | With a value argument, the data are summed over that column instead of counted |
| df.stb.freq(['class', 'who'], value='fare', thresh=80) | thresh sets a cumulative-percentage threshold; rows beyond it are grouped together |
| df.stb.freq(['class', 'who'], value='fare', thresh=80, other_label='All others') | other_label sets the label used for the grouped remainder |
| df.stb.counts() | Shows the number of unique values, the most and least frequent values, and the total count |
| df.stb.counts(exclude='number') | Excludes numeric columns (same syntax as DataFrame.select_dtypes) |
| df.stb.missing(style=True) | Summary of missing values |
| df.stb.missing(clip_0=True, style=True) | Excludes columns with no missing values |
| df.stb.subtotal() | Adds a Grand Total label |

(Please see more details in the sidetable documentation)


Low code EDA libraries

Automated EDA packages can perform EDA in a few lines of Python code.

Here is a small list of them:

See Low-code EDA Tools


The Seaborn Visualization Library

Please see Slides.


The Seaborn library is built on top of the general-purpose visualization library Matplotlib. Seaborn makes it easier to visualize the statistical properties of a dataset.

Seaborn can produce many types of graphics; we will show only a small subset of them that is useful for an Exploratory Data Analysis.

Seaborn standard plotting functions

| Function | Description |
|---|---|
| **Relational plots** | |
| sns.scatterplot() | Basic relational plot between variables |
| sns.lineplot() | Plots lines between values |
| **Distribution plots** | |
| sns.histplot() | Basic frequency-distribution plot |
| sns.kdeplot() | Kernel density estimation plot |
| **Categorical plots** | |
| sns.stripplot() | Basic categorical distribution plot |
| sns.swarmplot() | Categorical plot without overlapping points |
| sns.boxplot() | Categorical box plots |
| sns.violinplot() | Categorical violin plots |
| sns.boxenplot() | Enhanced box plot for larger datasets |
| sns.pointplot() | Point estimates and confidence intervals using scatter-plot glyphs |
| sns.barplot() | Point estimates and confidence intervals as rectangular bars |
| sns.countplot() | Counts of observations in each categorical bin, shown as bars |
| **Regression plots** | |
| sns.lmplot() | Plots data and regression-model fits |
| **Matrix plots** | |
| sns.heatmap() | Plots rectangular data as a color-encoded matrix |
| **Multi-plot grids** | |
| sns.FacetGrid() | Multi-plot grid for plotting conditional relationships |
| sns.pairplot() | Plots pairwise relationships in a dataset |
| sns.jointplot() | Draws a plot of two variables with bivariate and univariate graphs |

Seaborn objects interface

(Optional section)

seaborn.objects is a newer interface for making Seaborn plots. It offers a more consistent and flexible API, comprising a collection of composable classes for transforming and plotting data.

The objects interface should be imported with the following convention:

import seaborn.objects as so

The seaborn.objects interface is composed of classes, the most important being Plot. You specify plots by instantiating a Plot object and calling its methods.

| Object | Description |
|---|---|
| so.Plot() | An interface for declaratively specifying statistical graphics |
| so.Dot() | A mark suitable for dot plots or less-dense scatterplots |
| so.Dots() | A dot mark defined by strokes to better handle overplotting |
| so.Line() | A mark connecting data points, sorted along the orientation axis |
| so.Lines() | A faster but less flexible mark for drawing many lines |
| so.Path() | A mark connecting data points in the order they appear |
| so.Paths() | A faster but less flexible mark for drawing many paths |
| so.Dash() | A line mark drawn as an oriented segment for each data point |
| so.Range() | An oriented line mark drawn between min/max values |
| so.Bar() | A bar mark drawn between a baseline and data values |
| so.Bars() | A faster bar mark with defaults more suitable for histograms |
| so.Area() | A fill mark drawn from a baseline to data values |
| so.Band() | A fill mark representing an interval between values |
| so.Text() | A textual mark to annotate or represent data values |
| so.Agg(func='mean') | Aggregates data along the value axis using the given method |
| so.Est() | Calculates a point estimate and an error-bar interval |
| so.Count() | Counts distinct observations within groups |
| so.Hist() | Bins observations, counts them, and optionally normalizes or cumulates |
| so.Perc(k=5, method='linear') | Replaces observations with percentile values |
| so.PolyFit(order=2, gridsize=100) | Fits a polynomial of the given order and resamples the data onto the predicted curve |
| so.Dodge(empty='keep', gap=0, by=None) | Displaces and narrows overlapping marks along the orientation axis |
| so.Norm(func='max', where=None, by=None, percent=False) | Divisive scaling on the value axis after aggregating within groups |
| so.Stack() | Displaces overlapping bar or area marks along the value axis |

Jupyter Notebook Examples


General References

Exploratory Data Analysis

Data Visualization


Created: 01/30/2022 (C. Lizárraga); Last update: 02/20/2023 (C. Lizárraga)

CC BY-NC-SA
