Introduction to Python for Data Science - clizarraga-UAD7/Workshops GitHub Wiki

ResBaz2022 RESBAZ TUCSON MAY 23-26, 2022

An Introduction to Python for Data Science


Python is one of the most widely used programming languages in Data Science, along with R, Julia and SQL.

Python is an open-source, high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

Python was created by the Dutch programmer Guido van Rossum, releasing version 0.9 on February 20th, 1991. The current Python stable version is Python 3.10.4, released on March 24th, 2022.

According to the TIOBE Index, Python is the most popular programming language among software developers.

Python is one of the most popular languages used by data scientists for a variety of applications. Python provides great functionality for mathematics, statistics, machine learning, data visualization and scientific computing. It has a wide collection of libraries to support data science applications. See Kaggle free courses on Python.


Main Python Libraries

Since Python is developed by a wide community of developers and users, there is a set of libraries that can be used for specific tasks. We mention a few:

Machine Learning and Deep Learning are beyond the scope of this workshop.


Python Programming Environments.

There are several options for working with Python.

Working in Python.

There are two options for working in Python: offline and cloud-based platforms.


Using Jupyter Notebooks in Google Colab.

Google Colab offers a basic free Python development environment on Google Cloud. It has the advantage of storing all our files in Google Drive, and it can also save a copy of our code to GitHub.

Start your Google Colab session by logging in to the platform.

📝 Note: To execute a code cell, press SHIFT+ENTER or use the execute button.

Python basics.

Python, like any programming language, has data types and arithmetic operations.

📝 Code Style: Python Programming Best Practices
  • Use 4-space indentation, and no tabs.
  • 4 spaces are a good compromise between small indentation (allows greater nesting depth) and large indentation (easier to read). Tabs introduce confusion, and are best left out.
  • Wrap lines so that they don’t exceed 79 characters. Use \ to break a long line.
  • This helps users with small displays and makes it possible to have several code files side-by-side on larger displays.
  • Use blank lines to separate functions and classes, and larger blocks of code inside functions.
  • When possible, put comments on a line of their own. (Everything to the right of # is a comment.)
  • Use docstrings, that is, string literals that may span several lines, to document your code.
  • Use spaces around operators and after commas, but not directly inside bracketing constructs: a = f(1, 2) + g(3, 4).
  • Name your classes and functions consistently; the convention is to use UpperCamelCase for classes and lowercase_with_underscores for functions and methods.
  • Don’t use fancy encodings if your code is meant to be used in international environments. Plain ASCII works best in any case.

The Python interpreter as a Calculator.

Basic operations are allowed on a command line.

Numbers (Integers and Floating)

x = 5          # Define x = 5 as an Integer. 
print(type(x)) # Prints type "<class 'int'>"
print(x)       # Prints "5"
print(x + 1)   # Addition; prints "6"
print(x - 1)   # Subtraction; prints "4"
print(x * 2)   # Multiplication; prints "10"
print(x ** 2)  # Exponentiation; prints "25"
print(x + 1 + 3 * x) # Prints "21". The multiplication operator has precedence over addition. 
print((x + 1) + (3 * x)) # Prints "21". Same result; the parentheses make the precedence explicit. 
x += 1    # Equivalent to "x = x + 1"
print(x)  # Prints "6"
x *= 2    # Equivalent to "x = x * 2"
print(x)  # Prints "12"
y = 5/2   # Division returns a float, so y = 2.5 is a floating-point number.
print(type(y)) # Prints "<class 'float'>"
print(y, y + 1, y * 2, y ** 2) # Prints "2.5 3.5 5.0 6.25"

Python infers the type of a variable automatically from the value assigned to it.
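As a small illustration of type inference, the sketch below compares true division, floor division and the remainder operator. The variable names are hypothetical and chosen only for this example.

```python
# Types are inferred from the assigned value; no declaration is needed.
a = 7
b = 2

true_div = a / b     # True division always returns a float
floor_div = a // b   # Floor division of two integers returns an integer
remainder = a % b    # Remainder of the integer division

print(type(true_div), true_div)    # Prints "<class 'float'> 3.5"
print(type(floor_div), floor_div)  # Prints "<class 'int'> 3"
print(remainder)                   # Prints "1"

# A name can later be rebound to a value of a different type.
a = 'seven'
print(type(a))                     # Prints "<class 'str'>"
```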

Boolean or logical variables (True, False)

t = True
f = False
print(type(t)) # Prints "<class 'bool'>"
print(t and f) # Logical AND; prints "False"
print(t or f)  # Logical OR; prints "True"
print(not t)   # Logical NOT; prints "False"
print(t != f)  # Logical XOR; prints "True"

Strings

hello = 'Hello'    # String literals can use single quotes
world = "World!"    # or double quotes; it does not matter.
print(hello)       # Prints "Hello"
print(len(hello))  # String length; prints "5"
hw = hello + ' ' + world  # String concatenation
print(hw)  # prints "Hello World!"
hw2 = '%s %s %d' % (hello, world, 2)  # sprintf style string formatting
print(hw2)  # prints "Hello World! 2"
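
Besides the sprintf style shown above, Python 3.6 and later also support formatted string literals (f-strings). The following is a minimal sketch; the names hw3, value and formatted are hypothetical.

```python
hello = 'Hello'
world = 'World!'

# f-strings evaluate the expressions written inside the braces
hw3 = f'{hello} {world} {2}'
print(hw3)         # Prints "Hello World! 2"

# A format specifier follows the colon; here, two decimal places
value = 3.14159
formatted = f'{value:.2f}'
print(formatted)   # Prints "3.14"
```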

Data Structures.

By default Python has several objects to store data: lists, dictionaries, sets, and tuples.

📝 Note: Python counting starts at 0.

Lists

xs = [1, 2, 3]    # Create a list
print(xs, xs[2])  # Prints "[1, 2, 3] 3"
print(xs[-1])     # Negative indices count from the end of the list; prints "3"
xs[2] = 'foo'     # Lists can contain elements of different types
print(xs)         # Prints "[1, 2, 'foo']"
xs.append('bar')  # Add a new element to the end of the list
print(xs)         # Prints "[1, 2, 'foo', 'bar']"
x = xs.pop()      # Remove and return the last element of the list
print(x, xs)      # Prints "bar [1, 2, 'foo']"

Slicing or accessing the contents of a list.

nums = list(range(5))     # range is a built-in that yields a sequence of integers; list() turns it into a list
print(nums)               # Prints "[0, 1, 2, 3, 4]"
print(nums[2:4])          # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print(nums[2:])           # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print(nums[:2])           # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print(nums[:])            # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print(nums[:-1])          # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9]        # Assign a new sublist to a slice
print(nums)               # Prints "[0, 1, 8, 9, 4]"

You can loop over the elements of a list.

animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
# Prints "cat", "dog", "monkey", each on its own line.

nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print(squares)   # Prints [0, 1, 4, 9, 16]
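
The last example above is a list comprehension. As a hedged sketch of two related idioms, a comprehension can also filter elements with a condition, and enumerate gives access to the index while looping. The names even_squares and numbered are hypothetical.

```python
nums = [0, 1, 2, 3, 4]

# Keep only the squares of the even numbers
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print(even_squares)   # Prints "[0, 4, 16]"

# enumerate yields (index, element) pairs while looping
animals = ['cat', 'dog', 'monkey']
numbered = [f'#{idx}: {animal}' for idx, animal in enumerate(animals)]
print(numbered)       # Prints "['#0: cat', '#1: dog', '#2: monkey']"
```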

Tuples

A tuple is an (immutable) ordered list of values. A tuple is in many ways similar to a list; one of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot.

d = {(x, x + 1): x for x in range(10)}  # Create a dictionary with tuple keys
t = (5, 6)        # Create a tuple
print(type(t))    # Prints "<class 'tuple'>"
print(d[t])       # Prints "5"
print(d[(1, 2)])  # Prints "1"

Sets

A set is an unordered collection of distinct elements.

animals = {'cat', 'dog'}
print('cat' in animals)   # Check if an element is in a set; prints "True"
print('fish' in animals)  # prints "False"
animals.add('fish')       # Add an element to a set
print('fish' in animals)  # Prints "True"
print(len(animals))       # Number of elements in a set; prints "3"
animals.add('cat')        # Adding an element that is already in the set does nothing
print(len(animals))       # Prints "3"
animals.remove('cat')     # Remove an element from a set
print(len(animals))       # Prints "2"

Dictionaries

A dictionary stores (key, value) pairs.

d = {'cat': 'cute', 'dog': 'furry'}  # Create a new dictionary with some data
print(d['cat'])       # Get an entry from a dictionary; prints "cute"
print('cat' in d)     # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet'     # Set an entry in a dictionary
print(d['fish'])      # Prints "wet"
# print(d['monkey'])  # KeyError: 'monkey' not a key of d
print(d.get('monkey', 'N/A'))  # Get an element with a default; prints "N/A"
print(d.get('fish', 'N/A'))    # Get an element with a default; prints "wet"
del d['fish']         # Remove an element from a dictionary
print(d.get('fish', 'N/A')) # "fish" is no longer a key; prints "N/A"

Loop or iterate over the keys in a dictionary.

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print('A %s has %d legs' % (animal, legs))
# Prints "A person has 2 legs", "A cat has 4 legs", "A spider has 8 legs"
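
If both the key and the value are needed, the items() method avoids the extra lookup inside the loop. The sketch below also shows a dictionary comprehension; the names lines and many_legs are hypothetical.

```python
d = {'person': 2, 'cat': 4, 'spider': 8}

# items() yields (key, value) pairs, so no second lookup is needed
lines = []
for animal, legs in d.items():
    lines.append(f'A {animal} has {legs} legs')
print(lines[0])    # Prints "A person has 2 legs"

# Dictionary comprehension: keep only the animals with more than 2 legs
many_legs = {animal: legs for animal, legs in d.items() if legs > 2}
print(many_legs)   # Prints "{'cat': 4, 'spider': 8}"
```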

Functions

Python functions are defined using the def keyword.

def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print(sign(x))
# Prints "negative", "zero", "positive"

Functions can take optional keyword arguments with default values.

def hello(name, loud=False):
    if loud:
        print('HELLO, %s!' % name.upper())
    else:
        print('Hello, %s' % name)

hello('Bob') # Prints "Hello, Bob"
hello('Fred', loud=True)  # Prints "HELLO, FRED!"
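
The functions above print their result. A function can instead return a value for later use, and a docstring documents what it does. This is a minimal sketch; the function scale and its parameter factor are hypothetical names.

```python
def scale(values, factor=2.0):
    """Return a new list with every value multiplied by factor."""
    return [v * factor for v in values]

doubled = scale([1, 2, 3])             # Uses the default factor=2.0
tripled = scale([1, 2, 3], factor=3)   # Overrides the default
print(doubled)   # Prints "[2.0, 4.0, 6.0]"
print(tripled)   # Prints "[3, 6, 9]"
```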

Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.

Numpy includes a large collection of predefined mathematical functions.

Before we start working with the Numpy Library, we need to load (import) it into the working memory, by including the following line in a Jupyter Notebook code cell.

import numpy as np

where the alias or short name np is given to refer to Numpy.

import numpy as np

print('Pi number = ', np.pi) # Using the definition of Pi from Numpy
print('The square root of Pi is = ', np.sqrt(np.pi)) # Using the square root function in Numpy

Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

import numpy as np

a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

Numpy also includes a set of functions to create arrays

import numpy as np

a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[0. 0.]
                      #          [0. 0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[1. 1.]]"

c = np.full((2,2), 7)  # Create a constant array; the dtype follows the fill value
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[1. 0.]
                      #          [0. 1.]]"

e = np.random.random((2,2))  # Create an array filled with random values in [0, 1)
print(e)                     # The values differ on every run

Arrays can be sliced in much the same way as lists.

import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

Array indexing

Numpy offers several ways to index into arrays.

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"
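
When an independent array is needed, the slice can be copied explicitly. The sketch below contrasts a view with a copy; the names sub_view and sub_copy are hypothetical.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

sub_view = a[:2, 1:3]          # A slice shares memory with a
sub_copy = a[:2, 1:3].copy()   # copy() allocates independent data

sub_copy[0, 0] = 77            # Does not touch a
print(a[0, 1])                 # Prints "2"
sub_view[0, 0] = 99            # Changes a as well
print(a[0, 1])                 # Prints "99"
print(np.shares_memory(a, sub_view), np.shares_memory(a, sub_copy))  # Prints "True False"
```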

Integer array indexing

import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

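# An example of integer array indexing: pick elements (0, 0), (1, 1) and (2, 0).
print(a[[0, 1, 2], [0, 1, 0]])  # Prints "[1 4 5]"
# The line above is equivalent to building the array element by element:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"

# The same element can be reused when indexing.
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"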
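
Another of the several ways to index an array is with a boolean mask, which selects the elements that satisfy a condition. This is a minimal sketch; the names mask and selected are hypothetical.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

mask = a > 2          # Boolean array with the same shape as a
print(mask.shape)     # Prints "(3, 2)"
selected = a[mask]    # Rank 1 array with the elements where mask is True
print(selected)       # Prints "[3 4 5 6]"
```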

Datatypes

Every numpy array is a grid of elements of the same type.

import numpy as np

x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64" (the default integer type can differ by platform)

x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints "float64"

x = np.array([1, 2], dtype=np.int64)   # Force a particular datatype
print(x.dtype)                         # Prints "int64"
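
An existing array can also be converted to another datatype with astype. The sketch below, with hypothetical names x_int and mixed, shows a conversion and the automatic promotion that happens when integers and floats are combined.

```python
import numpy as np

x = np.array([1.7, 2.2, -0.5])
x_int = x.astype(np.int64)     # Conversion truncates toward zero
print(x_int, x_int.dtype)      # Prints "[1 2 0] int64"

# Mixing integers and floats promotes the result to float64
mixed = np.array([1, 2], dtype=np.int64) + np.array([0.5, 0.5])
print(mixed, mixed.dtype)      # Prints "[1.5 2.5] float64"
```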

Array math

Basic mathematical functions operate elementwise on arrays.

import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

Inner product:

import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))

Sum of elements:

import numpy as np

x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

Transposing a matrix:

import numpy as np

x = np.array([[1,2], [3,4]])
print(x)    # Prints "[[1 2]
            #          [3 4]]"
print(x.T)  # Prints "[[1 3]
            #          [2 4]]"

# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print(v)    # Prints "[1 2 3]"
print(v.T)  # Prints "[1 2 3]"
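
If a row or column vector is needed, one option is to reshape the rank 1 array into a rank 2 array first, after which the transpose behaves as expected. A minimal sketch, with hypothetical names row and col:

```python
import numpy as np

v = np.array([1, 2, 3])
row = v.reshape(1, 3)       # Shape (1, 3): a row vector
col = v.reshape(3, 1)       # Shape (3, 1): a column vector

print(row.T.shape)                  # Prints "(3, 1)"
print(np.array_equal(row.T, col))   # Prints "True"
```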

SciPy

SciPy provides algorithms for optimization, integration, interpolation, matrix equations, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.

More information on Scipy subpackages

Interpolation example.

import numpy as np
from scipy.interpolate import interp1d

x = np.linspace(0, 10, num=11, endpoint=True)
y = np.cos(-x**2/9.0)
f = interp1d(x, y)
f2 = interp1d(x, y, kind='cubic')

xnew = np.linspace(0, 10, num=41, endpoint=True)

# Make plot of interpolation
import matplotlib.pyplot as plt
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()

Solve matrix equation a*x = b.

import numpy as np
from scipy import linalg

a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])

x = linalg.solve(a, b)
print(x)
# Will print [ 2. -2.  9.]
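
A quick way to check a computed solution is to substitute it back into the equation. The sketch below repeats the system above and compares a @ x with b using np.allclose; the name residual is hypothetical.

```python
import numpy as np
from scipy import linalg

a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])
x = linalg.solve(a, b)

residual = a @ x - b           # Should be close to zero in every component
print(np.allclose(a @ x, b))   # Prints "True"
```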


Matplotlib

Matplotlib is a plotting library. We give a brief introduction to the matplotlib.pyplot module.

Matplotlib includes the Pyplot module which provides a MATLAB-like interface. Matplotlib is designed to be as usable as MATLAB, with the ability to use Python, and the advantage of being free and open-source.

📝 Most popular visualization libraries for Python

Among all visualization libraries for Python, we list some of the most popular:

  • Altair. Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite.
  • Bokeh. Bokeh is a Python library for creating interactive visualizations for modern web browsers.
  • Ggplot. Ggplot is a Python implementation of the grammar of graphics, based on ggplot2. It is not necessarily a feature-by-feature equivalent of ggplot2, but does have some overlap.
  • Matplotlib. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
  • Plotly. Plotly's Python graphing library makes interactive, publication-quality graphs.
  • Plotnine. Plotnine is an implementation of a grammar of graphics in Python, based on ggplot2 from the R programming language.
  • Seaborn. Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

Basic example plot.

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1) # Uniform array between 0 and 3Pi, with 0.1 spaced points. 
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()  # You must call plt.show() to make graphics appear.

We improve the above plot.

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()

Subplots

You can plot different things in the same figure using the subplot function.

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()

A more complete example of Matplotlib.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 100)  # Sample data.

plt.figure(figsize=(10, 6))
plt.grid(True) # Add a grid

# Labels can use LaTeX notation; raw strings (r'...') keep the backslashes intact.
plt.plot(x, np.sin(x), label=r'Sine function $\sin(x)$')
plt.plot(x, x, label='linear $x$')  # etc.
plt.plot(x, x**3, label='cubic $x^{3}$')
plt.plot(x, x-x**3/6.0, linestyle='dashed', label='$x - x^{3} / 6$')

plt.xlim([0, np.pi/2.0 ]) # We select limits for plotting
plt.ylim([0, 1.5])

# Optional: we can define where the x ticks go and how they are labeled
plt.xticks([0, 0.125*np.pi, 0.25*np.pi, 0.375*np.pi, np.pi/2],
           ['$0$', r'$\pi/8$', r'$\pi/4$', r'$3 \pi/8$', r'$\pi/2$'])

plt.xlabel('$x$')
plt.ylabel('$y = f(x)$')
plt.title("Simple function plots")
plt.legend();

Pandas

Pandas is a Python library for data manipulation and data analysis, developed in 2008 by Wes McKinney. Its basic data structures are the 1-dimensional Series and the 2-dimensional DataFrame, together with functions for manipulating numerical tables and time series.

Pandas can read a wide variety of data formats, such as comma-separated values (CSV), Microsoft Excel files, JSON, SQL tables and queries, and more. Pandas is a basic tool for data cleansing and data wrangling.

As we will see later, Pandas has a set of plotting functions based on Matplotlib that will help us visualize the dataframe under analysis.

What is a DataFrame? It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

[Figure: dataframe]

Some basic operations on a Pandas DataFrame df.

| Function | Action |
| --- | --- |
| pd.read_csv(filename) | Reads a CSV (comma-separated values) file |
| pd.read_excel(filename) | Reads an Excel file |
| Reading other file formats | |
| df.to_csv(filename) | Writes the dataframe to a file |
| df.head() | Shows the first 5 rows by default. To show n rows use df.head(n) |
| df.tail() | Shows the last 5 rows by default |
| df.shape | Gives the dataframe dimensions (rows, columns) |
| df.info() | Prints dataframe information: number of rows and columns, column names, data types, number of non-null entries and more |
| df.describe() | Returns summary statistics of the numeric columns |
| df['categorical variable'].describe() | Reports the number of entries, the number of distinct values, and the most frequent value with its frequency |
| df['categorical variable'].value_counts().head(10) | Counts the occurrences of each value of the categorical variable and shows the first 10 |
| df.columns | Gives the names of the columns |
| df.columns = ['col1','col2','col3'] | Names the columns according to the list |
| df.rename(columns={'Old1' : 'New1', 'Old2' : 'New2'}, inplace=True) | Renames some of the columns |
| df.drop_duplicates(inplace=True) | Eliminates repeated rows |
| df.isnull().sum() | Returns the number of missing values in each column |
| df.dropna() | Eliminates rows having at least one null value |
| df.dropna(axis=1) | Eliminates columns having at least one null value |
| df.mean(numeric_only=True) | Computes the arithmetic mean of each numeric column |
| df.fillna(x_mean, inplace=True) | Replaces missing values with the given mean value |
| df.corr() | Shows the correlation between numeric variables |
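
A few of the operations in the table can be tried on a small in-memory dataframe before working with a real file. The sketch below uses hypothetical toy data (the column names species and mass_g are invented for the example) with one duplicated row and one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one duplicated row
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
    'mass_g': [3700.0, 3700.0, 5000.0, np.nan, 5400.0],
})

print(df.shape)                      # Prints "(5, 2)"
n_missing = df.isnull().sum()        # Missing values per column
print(n_missing['mass_g'])           # Prints "1"

df = df.drop_duplicates()            # Removes the repeated Adelie row
mass_mean = df['mass_g'].mean()      # NaN is skipped by default
df['mass_g'] = df['mass_g'].fillna(mass_mean)
print(df['species'].value_counts())  # Gentoo appears 3 times, Adelie once
```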

Selecting information from a Pandas DataFrame

| Function | Action |
| --- | --- |
| df['B'] | Selects column 'B' |
| df[['A','B']] | Selects columns 'A' and 'B' |
| df_new = df[['A','B']] | Creates a new dataframe df_new from two selected columns of df |
| df['C'] = df['A'] + df['B'] | Creates a new column 'C' in df, the sum of columns 'A' and 'B' |
| df_copy = df.copy() | Creates a new dataframe that is a copy of df |
| df.drop('D', axis=1, inplace=True) | Removes column 'D' from df in place |
| df.loc['2'] | Returns the row of df with index label '2' |
| df.loc['2','C'] | Returns the value at index label '2' and column 'C' |
| df.iloc[2] | Returns the row at integer position 2 (the third row) |
| df.loc['2':'4'] | Returns the rows from label '2' through label '4', both included |
| df.iloc[2:4] | Returns the rows at positions 2 and 3; position 4 is excluded |
| df[df['B'] == 5.0] | Selects rows where column 'B' equals 5.0 |
| df[(df['B'] == 5.0) & (df['D'] <= 2.0)] | Selects rows that satisfy both conditions simultaneously |
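
The difference between label-based selection (loc) and position-based selection (iloc) is easiest to see on a small example. The sketch below assumes a hypothetical dataframe whose index labels are the strings '1' to '5'; the data values are invented.

```python
import pandas as pd

# Hypothetical frame whose index labels are the strings '1' to '5'
df = pd.DataFrame(
    {'B': [5.0, 1.0, 5.0, 3.0, 5.0], 'D': [1.0, 2.0, 3.0, 4.0, 2.0]},
    index=['1', '2', '3', '4', '5'],
)

by_label = df.loc['2':'4']       # Labels '2', '3', '4' (end label included)
by_position = df.iloc[2:4]       # Positions 2 and 3 (end position excluded)
print(list(by_label.index))      # Prints "['2', '3', '4']"
print(list(by_position.index))   # Prints "['3', '4']"

# Combine two conditions; each one needs its own parentheses
both = df[(df['B'] == 5.0) & (df['D'] <= 2.0)]
print(list(both.index))          # Prints "['1', '5']"
```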

The standard operators for comparing two values and for combining conditions.

| Type | Operators |
| --- | --- |
| Comparison | '<', '<=', '==', '!=', '>=', '>', or their wrappers '.lt()', '.le()', '.eq()', '.ne()', '.ge()', '.gt()' |
| Conditionals | & (and), \| (or) |

Reading a csv file into Pandas

Let's apply the above functions in an example. We can read a local file or a remote file.

# We load the libraries into running memory

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let's use Pandas to read the Penguins dataset, stored as a plain-text CSV (comma-separated values) file in a GitHub repository, and print the first 5 rows.

# Read the penguins size dataset

filename = "https://raw.githubusercontent.com/clizarraga-UAD7/Datasets/main/penguins/penguins_size.csv"

df = pd.read_csv(filename)

df.head()

Next, we can apply a set of functions to inquire about the dataframe.

# General information about the dataset
df.info()

Combining Pandas and Matplotlib

Once we have loaded a dataset into a Pandas dataframe, we can create different types of plots.

To load Numpy, Pandas, Matplotlib, and Pyplot, we enter the following import commands in a code cell:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

Some basic Matplotlib Pyplot or Pandas functions:

| Functions | Description |
| --- | --- |
| Relational Plots | |
| plt.scatter(), df.plot.scatter() | Plot y vs. x as a scatter plot with varying marker size/color |
| plt.plot(), df.plot.line() | Plot y vs. x as lines or markers |
| plt.hlines() | Plot horizontal lines |
| plt.vlines() | Plot vertical lines |
| Distribution Plots | |
| plt.hist(), df.plot.hist() | Plot a histogram |
| df.plot.kde() | Generate a Kernel Density Estimate plot using Gaussian kernels |
| Categorical Plots | |
| plt.boxplot(), df.plot.box(), df.boxplot() | Draw a box-and-whisker plot |
| plt.bar(), df.plot.bar() | Make a bar plot |
| plt.violinplot() | Make a violin plot |
| Multiplot Grids | |
| plt.subplots() | Create a figure and a set of subplots |

Exploratory Data Analysis using Pandas

Exploratory Data Analysis is an approach in statistics to analyzing data sets in order to quickly summarize their main characteristics. It may be supported by simple data visualizations such as box plots, histograms, or scatter plots, among others.

John W. Tukey wrote the book Exploratory Data Analysis in 1977, where he held that too much emphasis in statistics was placed on statistical hypothesis testing and more emphasis needed to be placed on using data to suggest hypotheses to test. Exploratory Data Analysis does not need any previous assumption on the statistical distribution of the underlying data.

Tukey suggested computing the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles, since they are defined for all empirical distributions. (This is essentially what the df.describe() function from Pandas reports, along with the count, mean and standard deviation.)

📝 Tukey's outlier criterion

Tukey also gives a criterion for defining outlier data. If Q1 and Q3 are the first and third quartiles and the interquartile range is IQR = Q3 - Q1, then an outlier is any value that falls below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.

[Figure: Tukey outlier criterion]

(Image credit: UF Biostatistics Open Learning Textbook, CC)
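
Tukey's criterion can be sketched in a few lines of Numpy. The sample below is hypothetical, and the quartiles are computed with np.percentile using its default linear interpolation, which is one of several common conventions.

```python
import numpy as np

# Hypothetical sample with one clearly extreme value
data = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, q3 = np.percentile(data, [25, 75])   # First and third quartiles
iqr = q3 - q1                            # Interquartile range
lower = q1 - 1.5 * iqr                   # Lower fence
upper = q3 + 1.5 * iqr                   # Upper fence

outliers = data[(data < lower) | (data > upper)]
print(q1, q3, iqr)    # Prints "3.75 7.25 3.5"
print(lower, upper)   # Prints "-1.5 12.5"
print(outliers)       # Prints "[30.]"
```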


Plotting examples.

Basic plots

Histograms

# Pyplot histogram
plt.grid(True)
plt.hist(df['body_mass_g'], bins=10, rwidth=.9, color='tab:blue');

# Pandas plot histogram
df.hist(column='body_mass_g', bins=10, grid=True, rwidth=.9, color='tab:blue',
        alpha=0.5);

Bar plots

# Using Pandas plot bar
df['species'].value_counts().plot(kind='bar', color='tab:blue');
df.groupby('species').body_mass_g.mean().plot.bar();

Boxplots

# Using Pandas boxplot
boxplot = df.boxplot(column=['culmen_length_mm', 'culmen_depth_mm'])

# Using Pandas boxplot, grouped by species
boxplot = df.groupby('species').boxplot(column=['flipper_length_mm'], color='blue', subplots=False);
plt.title('Penguins characteristics');
plt.ylabel('flipper_length_mm');
plt.xticks(rotation=30);

Scatter plots

# Matplotlib.Pyplot

# Define plotting variables
x = df.body_mass_g.values   
y = df.flipper_length_mm.values
Pspecies = df.species.values
Pspecies_ = np.unique(Pspecies) # Keep only unique species values
Pcolors = ["#1B9E77", "#D95F02", "#7570B3"]
Pmarkers = ["o", "^", "s"] # circle, triangle, square

fig, ax = plt.subplots(figsize=(9,6))

for species, color, marker in zip(Pspecies_, Pcolors, Pmarkers):
    idxs = np.where(Pspecies == species)
    # No legend will be generated if we don't pass label=species
    ax.scatter(
        x[idxs], y[idxs], label=species,
        s= 30, color=color, marker=marker, alpha=0.7)
    
plt.xlabel('Body mass (g)')
plt.ylabel('Flipper length (mm)')
plt.title('Matplotlib Pyplot Scatter Plot')
ax.legend();

# Pandas Plot

dfA = df[df['species'] == 'Adelie'].dropna()
dfC = df[df['species'] == 'Chinstrap'].dropna()
dfG = df[df['species'] == 'Gentoo'].dropna()

ax = dfA.plot.scatter(x='body_mass_g', y='flipper_length_mm', 
                      c='tab:green', label='Adelie', 
                      xlabel='Body mass (g)', ylabel='Flipper length (mm)',
                      title='Pandas Scatter Plot', figsize=(12, 8))
dfC.plot.scatter(x='body_mass_g', y='flipper_length_mm', ax=ax, c='tab:orange', 
                 label='Chinstrap')
dfG.plot.scatter(x='body_mass_g', y='flipper_length_mm', ax=ax, c='tab:purple', 
                 label='Gentoo')

plt.show()

Figures and Subplots

A basic plot structure example.

# Work with a copy of the dataframe df dropping all NAN 
df1 = df.copy().dropna() 

# Create an empty figure object
fig = plt.figure(figsize=(12, 8))
# Add some subplots 2 cols x 2 rows = 4 subplots, but plotting only 3 
ax1 = fig.add_subplot(2,2,1) # First subplot of row 1
ax2 = fig.add_subplot(2,2,2) # Second subplot of row 1
ax3 = fig.add_subplot(2,2,3) # Third subplot, now on row 2

# Add some content
# Histogram
_ = ax1.hist(df1['body_mass_g'], bins=20, color = 'tab:blue', alpha = 0.4)
# Scatterplot
ax2.scatter(x = df1['culmen_length_mm'], y = df1['culmen_depth_mm'], color = 'tab:orange', marker = "v", )
# Boxplot
ax3.boxplot(df1['flipper_length_mm']);

Adding ticks, labels and legends

# Add some ticks, labels and legends to the plots

# Create an empty figure object
fig = plt.figure(figsize=(12, 8), constrained_layout=False, tight_layout=False)
# Add some subplots 2 cols x 2 rows = 4 subplots, but plotting only 3 
ax1 = fig.add_subplot(2,2,1) # First subplot of row 1
ax2 = fig.add_subplot(2,2,2) # Second subplot of row 1
ax3 = fig.add_subplot(2,2,3) # Third subplot, now on row 2

# Add subplots
_ = ax1.hist(df1['body_mass_g'], bins=20, color = 'tab:blue', alpha = 0.4)
ax2.scatter(x = df1['culmen_length_mm'], y = df1['culmen_depth_mm'], color = 'tab:orange', marker = "v", )
ax3.boxplot(df1['flipper_length_mm'])

# Work on details of 1st. plot
ax1.set_xticks([3000, 4000, 5000, 6000])
ax1.set_xticklabels(['3.0', '4.0', '5.0', '6.0'], rotation=30, fontsize='small')
ax1.set_title('Penguins mass distribution (kg)')

# Work on details of 2nd. plot
ax2.set_xticks([30, 40, 50, 60])
ax2.set_xticklabels(['3.0', '4.0', '5.0', '6.0'], rotation=30, fontsize='small')
ax2.set_xlabel('Culmen length (cm)')
ax2.set_yticks([10, 15, 20, 25])
ax2.set_yticklabels(['1.0', '1.5', '2.0', '2.5'], fontsize='small')
ax2.set_ylabel('Culmen depth (cm)')
ax2.set_title('Culmen Depth vs. Length (cm)')

# Work on details of 3rd. plot
ax3.set_xlabel('')
ax3.set_yticks([160, 180, 200, 220, 240])
ax3.set_yticklabels(['16.0', '18.0', '20.0', '22.0', '24.0'], fontsize='small')
ax3.set_ylabel('Flipper length (cm)')
ax3.set_title('Flipper Length variability')

# Optional 
# Setting width and height space between subplots and general plot title
plt.subplots_adjust(left=0.13, right=0.93, top=1.0, bottom= 0.3, wspace= 0.3, hspace=0.3)
plt.suptitle('Antarctica Penguins Dataset', y=1.08, fontsize=14, fontweight='bold');

📝 Here you will find more data visualization examples with Matplotlib, Pandas, Seaborn and other libraries.


📚 Basic References

Cheat Sheets

Free Short Courses.


Created: 05/20/2022 (C. Lizárraga); Last update: 05/24/2022 (C. Lizárraga)

CC BY-NC-SA
