Introduction to Python for Data Science - clizarraga-UAD7/Workshops GitHub Wiki
RESBAZ TUCSON MAY 23-26, 2022
Python is one of the most widely used programming languages in Data Science, along with R, Julia and SQL.
Python is an open-source, high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python was created by the Dutch programmer Guido van Rossum, who released version 0.9 on February 20th, 1991. At the time of writing, the current stable version is Python 3.10.4, released on March 24th, 2022.
According to the TIOBE Index, Python is currently the most popular programming language in the software developer community.
Python is also one of the most popular languages among data scientists. It provides great functionality for mathematics, statistics, machine learning, data visualization and scientific computing, and it has a wide collection of libraries to support data science applications. See Kaggle free courses on Python.
Since Python is developed by a wide community of developers and users, there is a set of libraries that can be used for specific tasks. We mention a few:
- Numpy. The fundamental library for scientific computing.
- SciPy. Fundamental algorithms for scientific computing.
- Pandas. Basic library for Data Analysis with Python.
- Matplotlib. Basic data visualization library in Python.
- Seaborn. Specialized library for statistical data visualization.
- Scikit-learn. A Machine Learning library for Python.
- Scikit-image. A collection of algorithms for digital image processing.
- TensorFlow 2. Specialized library for Deep Learning models.
- PyTorch. Another library for Deep Learning.
- Hugging Face. A collection of Deep Learning models of the Transformer type, used for Natural Language Processing, Machine Translation and Computer Vision.
Machine Learning and Deep Learning are beyond the scope of this workshop.
There are several ways of working with Python.
- There is a command-line interface, an interactive Python shell named IPython (interactive Python).
- There are GUI (graphical user interface) options like the web-based Jupyter Lab / Jupyter Notebooks or Spyder.
There are also two kinds of platforms for running Python: offline and cloud-based.
- Offline option. You need to install Python and all the required libraries on a local machine. The Anaconda Python distribution includes all the packages needed. You can download the free academic license version.
- Cloud-based option. Again, there are several options. We recommend using Google Colab (colab.research.google.com) with your Gmail account.
Google Colab offers a basic free Python development environment on Google Cloud. It has the advantage of storing all our files in Google Drive, and it can also save a copy of our code to Github.com.
Start your Google Colab session by logging into the platform.
📝 Note
To execute a code cell, press SHIFT+ENTER or use the execute button.
Python, like any programming language, has data types and arithmetic operations.
📝 Code Style: Python Programming Best Practices
- Use 4-space indentation, and no tabs.
- 4 spaces are a good compromise between small indentation (allows greater nesting depth) and large indentation (easier to read). Tabs introduce confusion, and are best left out.
- Wrap lines so that they don’t exceed 79 characters. Use \ to break a long line.
- This helps users with small displays and makes it possible to have several code files side-by-side on larger displays.
- Use blank lines to separate functions and classes, and larger blocks of code inside functions.
- When possible, put comments on a line of their own. (Everything to the right of # is a comment.)
- Use docstrings, that is, string literals that may extend over several lines, to document your code.
- Use spaces around operators and after commas, but not directly inside bracketing constructs: a = f(1, 2) + g(3, 4).
- Name your classes and functions consistently; the convention is to use UpperCamelCase for classes and lowercase_with_underscores for functions and methods.
- Don’t use fancy encodings if your code is meant to be used in international environments. Plain ASCII works best in any case.
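The style rules above can be illustrated with a minimal sketch. The class `TemperatureReading` and the function `mean_fahrenheit` are hypothetical names invented for this illustration only; they are not part of the workshop material.

```python
"""A tiny module that follows the style rules listed above."""


class TemperatureReading:
    """Hold one temperature in degrees Celsius (hypothetical class)."""

    def __init__(self, celsius):
        self.celsius = celsius

    def to_fahrenheit(self):
        """Return the reading converted to degrees Fahrenheit."""
        return self.celsius * 9 / 5 + 32


def mean_fahrenheit(readings):
    """Return the mean of several readings, in degrees Fahrenheit."""
    # Comments sit on their own line, above the code they explain.
    total = sum(reading.to_fahrenheit() for reading in readings)
    return total / len(readings)


# A backslash continues a long expression on the next line.
total_seconds = 2 * 60 * 60 \
    + 15 * 60 \
    + 30
print(total_seconds)  # prints "8130"

readings = [TemperatureReading(0), TemperatureReading(100)]
print(mean_fahrenheit(readings))  # prints "122.0"
```

Note the UpperCamelCase class name, the lowercase_with_underscores function names, the blank lines between definitions, and the spaces around operators.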
Basic operations are allowed on a command line.
Numbers (Integers and Floating)
x = 5 # Define x = 5 as an Integer.
print(type(x)) # Prints type "<class 'int'>"
print(x) # Prints "5"
print(x + 1) # Addition; prints "6"
print(x - 1) # Subtraction; prints "4"
print(x * 2) # Multiplication; prints "10"
print(x ** 2) # Exponentiation; prints "25"
print(x + 1 + 3 * x) # Prints "21". The multiplication operator has precedence over addition.
print((x + 1) + (3 * x)) # Also prints "21". Parentheses make the order of operations explicit.
x += 1 # Equivalent to "x = x + 1"
print(x) # Prints "6"
x *= 2 # Equivalent to "x = x * 2"
print(x) # Prints "12"
y = 5/2 # The / operator always returns a floating-point number, so y = 2.5.
print(type(y)) # Prints "<class 'float'>"
print(y, y + 1, y * 2, y ** 2) # Prints "2.5 3.5 5.0 6.25"
Python automatically infers the type of a variable from the value assigned to it.
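As a small illustration of how types follow values, the sketch below reassigns one variable (the name `value` is arbitrary) and shows a few explicit conversions.

```python
# The type belongs to the value, not to the variable name.
value = 7
print(type(value))           # prints "<class 'int'>"
value = 7 / 2                # true division always returns a float
print(type(value), value)    # prints "<class 'float'> 3.5"
value = 7 // 2               # floor division of two integers returns an int
print(type(value), value)    # prints "<class 'int'> 3"
print(7 % 2)                 # remainder of the division; prints "1"

# Explicit conversions between types.
print(int(3.9))              # truncates toward zero; prints "3"
print(float(3))              # prints "3.0"
print(int("42") + 1)         # a string must be converted first; prints "43"
```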
Boolean or logical variables (True, False)
t = True
f = False
print(type(t)) # Prints "<class 'bool'>"
print(t and f) # Logical AND; prints "False"
print(t or f) # Logical OR; prints "True"
print(not t) # Logical NOT; prints "False"
print(t != f) # Logical XOR; prints "True"
hello = 'Hello' # String literals can use single quotes
world = "World!" # or double quotes; it does not matter.
print(hello) # Prints "Hello"
print(len(hello)) # String length; prints "5"
hw = hello + ' ' + world # String concatenation
print(hw) # prints "Hello World!"
hw2 = '%s %s %d' % (hello, world, 2) # sprintf style string formatting
print(hw2) # prints "Hello World! 2"
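Besides the sprintf style shown above, Python also offers f-strings and the `str.format` method. The following sketch shows both, together with a few common string methods; the variable names `count` and `pi_approx` are chosen only for this example.

```python
hello = 'Hello'
world = 'World!'
count = 2
pi_approx = 3.14159

# f-string: expressions inside {} are evaluated and inserted.
hw3 = f'{hello} {world} {count}'
print(hw3)                              # prints "Hello World! 2"

# str.format with positional placeholders.
hw4 = '{} {} {}'.format(hello, world, count)
print(hw4)                              # prints "Hello World! 2"

# Format specifiers control precision and padding.
print(f'{pi_approx:.2f}')               # prints "3.14"
print(f'{count:03d}')                   # prints "002"

# A few common string methods.
print(hello.upper(), hello.lower())     # prints "HELLO hello"
print(hw3.replace('World', 'Tucson'))   # prints "Hello Tucson! 2"
print(hw3.split())                      # prints "['Hello', 'World!', '2']"
```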
Python has several built-in objects to store data: lists, dictionaries, sets, and tuples.
📝 Note
Python indexing starts at 0.
xs = [1, 2, 3] # Create a list
print(xs, xs[2]) # Prints "[1, 2, 3] 3"
print(xs[-1]) # Negative indices count from the end of the list; prints "3"
xs[2] = 'foo' # Lists can contain elements of different types
print(xs) # Prints "[1, 2, 'foo']"
xs.append('bar') # Add a new element to the end of the list
print(xs) # Prints "[1, 2, 'foo', 'bar']"
x = xs.pop() # Remove and return the last element of the list
print(x, xs) # Prints "bar [1, 2, 'foo']"
Slicing: accessing part of the contents of a list.
nums = list(range(5)) # range generates a sequence of integers; list() turns it into a list
print(nums) # Prints "[0, 1, 2, 3, 4]"
print(nums[2:4]) # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print(nums[2:]) # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print(nums[:2]) # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print(nums[:]) # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print(nums[:-1]) # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9] # Assign a new sublist to a slice
print(nums) # Prints "[0, 1, 8, 9, 4]"
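A slice can also take a third value, the step. The sketch below, using an arbitrary list of ten integers, shows a few step slices and the fact that a slice of a list is a new list.

```python
nums = list(range(10))     # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(nums[::2])           # every second element; prints "[0, 2, 4, 6, 8]"
print(nums[1:8:3])         # from index 1 up to 8 in steps of 3; prints "[1, 4, 7]"
print(nums[::-1])          # reversed copy; prints "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"
print(nums[-3:])           # last three elements; prints "[7, 8, 9]"

# Slicing a list returns a new list, so changing it leaves nums untouched.
head = nums[:3]
head[0] = 100
print(head, nums[:3])      # prints "[100, 1, 2] [0, 1, 2]"
```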
You can loop over the elements of a list.
animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
# Prints "cat", "dog", "monkey", each on its own line.
nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print(squares) # Prints [0, 1, 4, 9, 16]
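A list comprehension can also filter elements with a condition. The sketch below compares it with the equivalent explicit loop and uses the built-in `enumerate` to get the position of each element.

```python
nums = [0, 1, 2, 3, 4]

# Keep only the squares of the even numbers.
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print(even_squares)        # prints "[0, 4, 16]"

# The same result written as an explicit loop.
even_squares_loop = []
for x in nums:
    if x % 2 == 0:
        even_squares_loop.append(x ** 2)
print(even_squares_loop)   # prints "[0, 4, 16]"

# enumerate gives access to the index of each element inside a loop.
animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print(f'#{idx + 1}: {animal}')
# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line.
```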
A tuple is an (immutable) ordered list of values. A tuple is in many ways similar to a list; one of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot.
d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys
t = (5, 6) # Create a tuple
print(type(t)) # Prints "<class 'tuple'>"
print(d[t]) # Prints "5"
print(d[(1, 2)]) # Prints "1"
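The difference between tuples and lists described above can be checked directly. This is a minimal sketch; the names `point` and `lookup` are arbitrary.

```python
point = (5, 6)
x, y = point               # tuple unpacking
print(x, y)                # prints "5 6"

try:
    point[0] = 10          # tuples cannot be modified in place
except TypeError as err:
    print('TypeError:', err)

lookup = {point: 'a tuple key works'}
try:
    lookup[[5, 6]] = 'a list key does not'
except TypeError as err:
    print('TypeError:', err)   # lists are not hashable

print(lookup[(5, 6)])      # prints "a tuple key works"
```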
A set is an unordered collection of distinct elements.
animals = {'cat', 'dog'}
print('cat' in animals) # Check if an element is in a set; prints "True"
print('fish' in animals) # prints "False"
animals.add('fish') # Add an element to a set
print('fish' in animals) # Prints "True"
print(len(animals)) # Number of elements in a set; prints "3"
animals.add('cat') # Adding an element that is already in the set does nothing
print(len(animals)) # Prints "3"
animals.remove('cat') # Remove an element from a set
print(len(animals)) # Prints "2"
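Sets also support the usual operations of set algebra. In the sketch below the sets are passed through `sorted` before printing because a set has no defined order; the example values are arbitrary.

```python
cats = {'lion', 'tiger', 'cat'}
pets = {'cat', 'dog', 'fish'}

print(sorted(cats | pets))   # union; prints "['cat', 'dog', 'fish', 'lion', 'tiger']"
print(sorted(cats & pets))   # intersection; prints "['cat']"
print(sorted(cats - pets))   # difference; prints "['lion', 'tiger']"
print(sorted(cats ^ pets))   # symmetric difference; prints "['dog', 'fish', 'lion', 'tiger']"

# Converting a list to a set removes duplicates.
species = ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Gentoo']
print(sorted(set(species)))  # prints "['Adelie', 'Chinstrap', 'Gentoo']"
```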
A dictionary stores (key, value) pairs.
d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data
print(d['cat']) # Get an entry from a dictionary; prints "cute"
print('cat' in d) # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet' # Set an entry in a dictionary
print(d['fish']) # Prints "wet"
# print(d['monkey']) # KeyError: 'monkey' not a key of d
print(d.get('monkey', 'N/A')) # Get an element with a default; prints "N/A"
print(d.get('fish', 'N/A')) # Get an element with a default; prints "wet"
del d['fish'] # Remove an element from a dictionary
print(d.get('fish', 'N/A')) # "fish" is no longer a key; prints "N/A"
Loop or iterate over the keys in a dictionary.
d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print('A %s has %d legs' % (animal, legs))
# Prints "A person has 2 legs", "A cat has 4 legs", "A spider has 8 legs"
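When both keys and values are needed, the `items` method avoids the extra lookup. The sketch below reuses the same dictionary and adds a dictionary comprehension; the name `many_legs` is chosen only for this example.

```python
d = {'person': 2, 'cat': 4, 'spider': 8}

# items() yields (key, value) pairs.
for animal, legs in d.items():
    print(f'A {animal} has {legs} legs')

# Dictionary comprehension: keep only the animals with more than 2 legs.
many_legs = {animal: legs for animal, legs in d.items() if legs > 2}
print(many_legs)           # prints "{'cat': 4, 'spider': 8}"

print(list(d.keys()))      # prints "['person', 'cat', 'spider']"
print(sum(d.values()))     # prints "14"
```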
Python functions are defined using the def keyword.
def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'
for x in [-1, 0, 1]:
    print(sign(x))
# Prints "negative", "zero", "positive"
Functions can take optional arguments that have a default value.
def hello(name, loud=False):
    if loud:
        print('HELLO, %s!' % name.upper())
    else:
        print('Hello, %s' % name)
hello('Bob') # Prints "Hello, Bob"
hello('Fred', loud=True) # Prints "HELLO, FRED!"
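A function can also return several values at once, packed in a tuple. The function `describe` below is a hypothetical helper written for this illustration.

```python
def describe(values, ndigits=2):
    """Return the (minimum, maximum, mean) of a non-empty list of numbers."""
    mean = sum(values) / len(values)
    return min(values), max(values), round(mean, ndigits)


low, high, mean = describe([2, 4, 9])      # unpack the returned tuple
print(low, high, mean)                     # prints "2 9 5.0"
print(describe([1, 2, 2], ndigits=1))      # prints "(1, 2, 1.7)"
```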
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays.
Numpy includes a large collection of predefined mathematical functions.
- Basic Mathematical Functions
- Linear Algebra functions based on the matrix algebra BLAS and numeric linear algebra LAPACK software libraries.
- Discrete Fourier Transform for spectral analysis.
- Random sampling library
- And more ...
Before we start working with the Numpy Library, we need to load (import) it into the working memory, by including the following line in a Jupyter Notebook code cell.
import numpy as np
where the alias or short name np is given to refer to Numpy.
import numpy as np
print('Pi number = ', np.pi) # Using the definition of Pi from Numpy
print('The square root of Pi is = ', np.sqrt(np.pi)) # Using the square root function in Numpy
A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
import numpy as np
a = np.array([1, 2, 3]) # Create a rank 1 array
print(type(a)) # Prints "<class 'numpy.ndarray'>"
print(a.shape) # Prints "(3,)"
print(a[0], a[1], a[2]) # Prints "1 2 3"
a[0] = 5 # Change an element of the array
print(a) # Prints "[5 2 3]"
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print(b.shape) # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0]) # Prints "1 2 4"
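The rank and shape of an array are available as attributes, and `reshape` rearranges the same values into another shape. A minimal sketch, assuming only that Numpy is installed:

```python
import numpy as np

b = np.array([[1, 2, 3], [4, 5, 6]])
print(b.ndim)              # number of dimensions (rank); prints "2"
print(b.shape)             # prints "(2, 3)"
print(b.size)              # total number of elements; prints "6"

# reshape arranges the same values with a different shape.
c = b.reshape(3, 2)
print(c.shape)             # prints "(3, 2)"
print(c[2, 1])             # prints "6"

# arange and linspace create evenly spaced rank 1 arrays.
print(np.arange(0, 10, 2))     # prints "[0 2 4 6 8]"
print(np.linspace(0, 1, 5))    # five points from 0 to 1, both ends included
```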
Numpy also includes a set of functions to create arrays
import numpy as np
a = np.zeros((2,2)) # Create an array of all zeros
print(a) # Prints "[[0. 0.]
# [0. 0.]]"
b = np.ones((1,2)) # Create an array of all ones
print(b) # Prints "[[1. 1.]]"
c = np.full((2,2), 7) # Create a constant array; the dtype follows the fill value (integer here)
print(c) # Prints "[[7 7]
# [7 7]]"
d = np.eye(2) # Create a 2x2 identity matrix
print(d) # Prints "[[1. 0.]
# [0. 1.]]"
e = np.random.random((2,2)) # Create an array filled with random values in the interval [0, 1)
print(e) # If run many times will give different results
Arrays can be sliced in the same way as lists. Since arrays may be multidimensional, you give one slice for each dimension.
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
# [6 7]]
b = a[:2, 1:3]
#### Array indexing
Numpy offers several ways to index into arrays.
# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
print(a[0, 1]) # Prints "2"
b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1]) # Prints "77"
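If the original array must stay unchanged, an explicit copy of the slice can be made. The sketch below contrasts a view with a copy; the names `view` and `independent` are chosen only for this example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]                  # a slice shares memory with a
independent = a[:2, 1:3].copy()    # copy() allocates new memory

independent[0, 0] = -1             # does not touch a
print(a[0, 1])                     # prints "2"

view[0, 0] = 77                    # modifies a as well
print(a[0, 1])                     # prints "77"

print(np.shares_memory(a, view))           # prints "True"
print(np.shares_memory(a, independent))    # prints "False"
```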
Integer array indexing
import numpy as np
a = np.array([[1,2], [3, 4], [5, 6]])
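# Integer array indexing: select the elements at (row, column)
# positions (0, 0), (1, 1) and (2, 0).
print(a[[0, 1, 2], [0, 1, 0]]) # Prints "[1 4 5]"
# Equivalent to building the array element by element:
print(np.array([a[0, 0], a[1, 1], a[2, 0]])) # Prints "[1 4 5]"
print(np.array([a[0, 1], a[0, 1]])) # Prints "[2 2]"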
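Another way to index into an array is with a boolean mask, which selects the elements that satisfy a condition. A minimal sketch on the same array:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

mask = a > 2               # boolean array with the same shape as a
print(mask.tolist())       # prints "[[False, False], [True, True], [True, True]]"
print(a[mask])             # rank 1 array of the selected values; prints "[3 4 5 6]"
print(a[a % 2 == 0])       # even values only; prints "[2 4 6]"
```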
Every numpy array is a grid of elements of the same type.
import numpy as np
x = np.array([1, 2]) # Let numpy choose the datatype
print(x.dtype) # Prints "int64" on most platforms
x = np.array([1.0, 2.0]) # Let numpy choose the datatype
print(x.dtype) # Prints "float64"
x = np.array([1, 2], dtype=np.int64) # Force a particular datatype
print(x.dtype) # Prints "int64"
Basic mathematical functions operate elementwise on arrays.
import numpy as np
x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)
# Elementwise sum; both produce the array
# [[ 6.0 8.0]
# [10.0 12.0]]
print(x + y)
print(np.add(x, y))
# Elementwise difference; both produce the array
# [[-4.0 -4.0]
# [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))
# Elementwise product; both produce the array
# [[ 5.0 12.0]
# [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))
# Elementwise division; both produce the array
# [[ 0.2 0.33333333]
# [ 0.42857143 0.5 ]]
print(x / y)
print(np.divide(x, y))
# Elementwise square root; produces the array
# [[ 1. 1.41421356]
# [ 1.73205081 2. ]]
print(np.sqrt(x))
Inner product:
import numpy as np
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])
# Inner product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))
# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))
# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
# [43 50]]
print(x.dot(y))
print(np.dot(x, y))
Sum of elements:
import numpy as np
x = np.array([[1,2],[3,4]])
print(np.sum(x)) # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0)) # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1)) # Compute sum of each row; prints "[3 7]"
Transposing a matrix:
import numpy as np
x = np.array([[1,2], [3,4]])
print(x) # Prints "[[1 2]
# [3 4]]"
print(x.T) # Prints "[[1 3]
# [2 4]]"
# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print(v) # Prints "[1 2 3]"
print(v.T) # Prints "[1 2 3]"
SciPy provides algorithms for optimization, integration, interpolation, matrix equations, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems.
More information on Scipy subpackages
Interpolation example.
import numpy as np
from scipy.interpolate import interp1d
x = np.linspace(0, 10, num=11, endpoint=True)
y = np.cos(-x**2/9.0)
f = interp1d(x, y)
f2 = interp1d(x, y, kind='cubic')
xnew = np.linspace(0, 10, num=41, endpoint=True)
# Make plot of interpolation
import matplotlib.pyplot as plt
plt.plot(x, y, 'o', xnew, f(xnew), '-', xnew, f2(xnew), '--')
plt.legend(['data', 'linear', 'cubic'], loc='best')
plt.show()
Solve the matrix equation a x = b.
import numpy as np
from scipy import linalg
a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])
x = linalg.solve(a, b)
print(x)
# Will print [ 2. -2.  9.]
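A solution of a linear system can be checked by substituting it back into the equation. The sketch below does this for the system above; note that it uses `numpy.linalg.solve` instead of `scipy.linalg.solve`, a swap made only so that the example needs Numpy alone.

```python
import numpy as np

a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])

# numpy.linalg.solve is used here in place of scipy.linalg.solve.
x = np.linalg.solve(a, b)

# Substitute the solution back: the matrix product a @ x should reproduce b.
residual = a @ x - b
print(np.allclose(a @ x, b))   # prints "True"
print(np.abs(residual).max())  # a value at or very close to zero
```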
Matplotlib is a plotting library. We give a brief introduction to the matplotlib.pyplot module.
Matplotlib includes the Pyplot module, which provides a MATLAB-like interface. Matplotlib is designed to be as usable as MATLAB, with the ability to use Python, and the advantage of being free and open-source.
📝 Most popular visualization libraries for Python
Among all visualization libraries for Python, we list some of the most popular:
- Altair. Altair is a declarative statistical visualization library for Python, based on Vega and Vega-Lite.
- Bokeh. Bokeh is a Python library for creating interactive visualizations for modern web browsers.
- Ggplot. Ggplot is a Python implementation of the grammar of graphics of ggplot2. It is not necessarily a feature-by-feature equivalent of ggplot2, but does have some overlap.
- Matplotlib. Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
- Plotly. Plotly's Python graphing library makes interactive, publication-quality graphs.
- Plotnine. Plotnine is an implementation of a grammar of graphics in Python. It is based on ggplot2, which is used in the R programming language.
- Seaborn. Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Basic example plot.
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1) # Uniform array between 0 and 3Pi, with 0.1 spaced points.
y = np.sin(x)
# Plot the points using matplotlib
plt.plot(x, y)
plt.show() # You must call plt.show() to make graphics appear.
We improve the above plot.
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Plot the points using matplotlib
plt.plot(x, y_sin)
plt.plot(x, y_cos)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.title('Sine and Cosine')
plt.legend(['Sine', 'Cosine'])
plt.show()
You can plot different things in the same figure using the subplot function.
import numpy as np
import matplotlib.pyplot as plt
# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)
# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')
# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')
# Show the figure.
plt.show()
A more complete example of Matplotlib.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 5, 100) # Sample data.
plt.figure(figsize=(10, 6))
plt.grid(True) # Add a grid
# Raw strings (r'...') keep the LaTeX backslashes from being read as escape sequences.
plt.plot(x, np.sin(x), label=r'Sine function $\sin(x)$') # Plot some data. Can use LaTeX notation.
plt.plot(x, x, label='linear $x$') # etc.
plt.plot(x, x**3, label='cubic $x^{3}$')
plt.plot(x, x-x**3/6.0, linestyle='dashed', label='$x - x^{3} / 6$')
plt.xlim([0, np.pi/2.0 ]) # We select limits for plotting
plt.ylim([0, 1.5])
# Optional: we can define where the x ticks go and how they are labeled
plt.xticks([0, 0.125*np.pi, 0.25*np.pi, 0.375*np.pi, np.pi/2], # tick positions
           [r'$0$', r'$\pi/8$', r'$\pi/4$', r'$3 \pi/8$', r'$\pi/2$']) # tick labels
plt.xlabel('$x$')
plt.ylabel('$y = f(x)$')
plt.title("Simple function plots")
plt.legend();
Pandas is a Python library developed in 2008 by Wes McKinney for data manipulation and data analysis. Its basic data structures are the 1-dimensional Series and the 2-dimensional DataFrame.
Pandas can read a wide variety of data formats, such as comma separated values (csv), Microsoft Excel files, JSON, SQL tables and queries, and more. Pandas is a base tool for data cleansing and data wrangling.
Pandas provides data structures and functions for manipulating numerical tables and time series.
As we will see further, Pandas has a set of plotting functions based on Matplotlib that will help us visualize the analyzed dataframe.
What is a DataFrame? It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.
Function | Action |
---|---|
pd.read_csv(filename) | Reads a CSV (comma separated values) file |
pd.read_excel(filename) | Reads an Excel file |
Reading other file formats | |
df.to_csv(filename) | Writes the dataframe to a CSV file |
df.head() | Shows the first 5 rows by default. If you wish to show n rows, use df.head(n) |
df.tail() | Shows the last 5 rows by default |
df.shape | Returns the dataframe dimensions (rows, columns) |
df.info() | Prints out dataframe information: number of rows, columns, names, data types, number of non-null entries and more |
df.describe() | Returns summary statistics of the numeric variables |
df['categorical variable'].describe() | Returns how many values there are, how many are different, and the most frequent one |
df['categorical variable'].value_counts().head(10) | Counts the number of occurrences of each category and shows the 10 most frequent |
df.columns | Returns the names of the columns |
df.columns = ['col1','col2','col3'] | Names the columns according to the list |
df.rename(columns={'Old1' : 'New1', 'Old2' : 'New2'}, inplace=True) | Renames some of the columns |
df.drop_duplicates(inplace=True) | Eliminates repeated rows |
df.isnull().sum() | Returns the number of missing values in each variable |
df.dropna() | Returns the dataframe without the rows having at least one null value |
df.dropna(axis=1) | Returns the dataframe without the columns having at least one null value |
df.mean() | Computes the arithmetic mean of each column (use numeric_only=True if the dataframe also has non-numeric columns) |
df.fillna(x_mean, inplace=True) | Replaces missing values with the given value x_mean |
df.corr() | Shows the correlation between variables |
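A few of the functions in the table can be tried on a small dataframe built by hand. The measurements below are hypothetical, invented only for this illustration, and the name `df_demo` is an assumption chosen to avoid clashing with the `df` used elsewhere.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, invented only for this illustration.
df_demo = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3700.0, np.nan, 5000.0, 5200.0, 5000.0],
    'flipper_length_mm': [181.0, 186.0, 210.0, 215.0, 210.0],
})

print(df_demo.shape)                     # prints "(5, 3)"
print(df_demo.isnull().sum())            # one missing value, in body_mass_g
print(df_demo['species'].value_counts()) # 3 Gentoo, 2 Adelie

# Drop the repeated row first, then the row with a missing value.
clean = df_demo.drop_duplicates().dropna()
print(clean.shape)                       # prints "(3, 3)"
print(clean['body_mass_g'].mean())       # mean of 3700, 5000 and 5200
```

Here `drop_duplicates` and `dropna` are called without `inplace=True`, so they return new dataframes and `df_demo` itself is left unchanged.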
Function | Action |
---|---|
df['B'] | Selects column 'B' |
df[['A','B']] | Selects columns 'A' and 'B' |
df_new = df[['A','B']] | Creates a new dataframe df_new composed of two selected columns of df |
df['C'] = df['A'] + df['B'] | Creates a new column in df, being the sum of columns 'A' and 'B' |
df_copy = df.copy() | Creates a new dataframe, a copy of the existing df |
df.drop('D', axis=1, inplace=True) | Eliminates column 'D' and redefines df |
df.loc['2'] | Returns the df row with index label '2' |
df.loc['2','C'] | Returns the specific value of df with index label '2' and column 'C' |
df.iloc[2] | Returns the row at integer position 2 (the third row) |
df.loc['2':'4'] | Returns the rows with index labels '2', '3', and '4' (the end label is included) |
df.iloc[2:4] | Returns the rows at integer positions 2 and 3 (position 4 is excluded) |
df[df['B'] == 5.0] | Selects rows where df['B'] equals 5.0 |
df[(df['B'] == 5.0) & (df['D'] <= 2.0)] | Selects rows that satisfy both conditions simultaneously |
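The difference between label-based selection with `loc` and position-based selection with `iloc` is easiest to see on a small example. The dataframe below is hypothetical; its index labels are the strings '1' to '4', so labels and positions do not coincide.

```python
import pandas as pd

# Hypothetical dataframe with string labels as its index.
df_demo = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0, 4.0],
     'B': [5.0, 5.0, 6.0, 5.0],
     'D': [1.0, 3.0, 2.0, 2.0]},
    index=['1', '2', '3', '4'],
)

print(df_demo.loc['2', 'A'])              # label '2', column 'A'; prints "2.0"
print(df_demo.iloc[2]['A'])               # third row by position; prints "3.0"
print(list(df_demo.loc['2':'4'].index))   # end label included; prints "['2', '3', '4']"
print(list(df_demo.iloc[2:4].index))      # end position excluded; prints "['3', '4']"

# Rows that satisfy both conditions simultaneously.
selected = df_demo[(df_demo['B'] == 5.0) & (df_demo['D'] <= 2.0)]
print(list(selected.index))               # prints "['1', '4']"
```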
The standard operators for comparing two values and for combining conditions are the following.
Operators | |
---|---|
Comparison | '<', '<=', '==', '!=', '>=', '>' |
or their wrappers | '.lt()', '.le()', '.eq()', '.ne()', '.ge()', '.gt()' |
Conditionals | & (and), \| (or) |
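The operators above act element by element on a Pandas Series and return a Series of booleans. A minimal sketch with hypothetical values; the `~` (not) operator is an addition that does not appear in the table.

```python
import pandas as pd

s = pd.Series([1.0, 5.0, 7.0, 5.0])   # hypothetical values

print((s >= 5.0).tolist())                  # prints "[False, True, True, True]"
print(s.ge(5.0).tolist())                   # same result with the wrapper method
print(((s > 1.0) & (s < 7.0)).tolist())     # and; prints "[False, True, False, True]"
print(((s == 1.0) | (s == 7.0)).tolist())   # or; prints "[True, False, True, False]"
print((~s.eq(5.0)).tolist())                # not; prints "[True, False, True, False]"
```

Each condition is wrapped in parentheses because `&` and `|` bind more tightly than the comparison operators.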
Let's apply the above functions in an example. We can read either a local file or a remote file.
# We load the libraries into running memory
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Let's use Pandas to read the Penguins dataset, a CSV (comma separated values) file stored as plain text in a Github repository, and print the first 5 rows.
# Read the penguins size dataset
filename = "https://raw.githubusercontent.com/clizarraga-UAD7/Datasets/main/penguins/penguins_size.csv"
df = pd.read_csv(filename)
df.head()
Next, we can apply a set of functions to inquire about the dataframe.
# General information about the dataset
df.info()
Once we have loaded a dataset into a Pandas dataframe, we can create different types of plots.
To load Numpy, Pandas, Matplotlib, and its Pyplot module, we enter the following import commands in a code cell:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
Some basic Matplotlib Pyplot or Pandas functions:
Functions | Description |
---|---|
Relational Plots | |
plt.scatter() | Plotting y vs. x as a scatter plot with varying marker size/color. |
df.plot.scatter() | |
plt.plot() | Plotting y vs. x, as lines or markers |
df.plot.line() | |
plt.hlines() | Plot horizontal lines |
plt.vlines() | Plot vertical lines |
Distribution Plots | |
plt.hist() | Plot a histogram |
df.plot.hist() | |
df.plot.kde() | Generate a Kernel Density Estimate plot using Gaussian kernels. |
Categorical Plots | |
plt.boxplot() | Draw a box and whisker plot |
df.plot.box() | |
df.boxplot() | |
plt.bar() | Make a bar plot |
df.plot.bar() | |
plt.violinplot() | Make a violin plot |
Multiplot Grids | |
plt.subplots() | Creates a figure and a set of subplots |
Exploratory Data Analysis is a statistical approach to analyzing data sets in order to quickly summarize their main characteristics. It may be supported with simple data visualizations like box plots, histograms, or scatter plots, among others.
John W. Tukey wrote the book Exploratory Data Analysis in 1977, where he held that too much emphasis in statistics was placed on statistical hypothesis testing and more emphasis needed to be placed on using data to suggest hypotheses to test. Exploratory Data Analysis does not need any previous assumption on the statistical distribution of the underlying data.
Tukey suggested computing the five-number summary of numerical data: the two extremes (maximum and minimum), the median, and the quartiles, since they are defined for all empirical distributions. (This is basically what you get when using the df.describe() function from Pandas.)
📝 Tukey's outlier criterion
Tukey also gives a criterion for defining outlier data. If Q1 and Q3 are the first and third quartiles, and the interquartile range is IQR = Q3 - Q1, then an outlier is a value that falls below Q1 - 1.5 IQR or above Q3 + 1.5 IQR.
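The five-number summary and the outlier criterion can be computed in a few lines. This is a minimal sketch on a hypothetical sample; it assumes the default linear interpolation of `np.percentile`, and other quartile conventions can give slightly different fences.

```python
import numpy as np

# Hypothetical sample with one suspiciously large value.
data = np.array([2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])

# Five-number summary: minimum, first quartile, median, third quartile, maximum.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])
print(minimum, q1, median, q3, maximum)   # prints "2.0 4.0 6.0 8.0 30.0"

# Outlier fences at 1.5 times the interquartile range.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(lower_fence, upper_fence)           # prints "-2.0 14.0"

# Values outside the fences are flagged as outliers.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(outliers)                           # prints "[30.]"
```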
(Image credit: UF Biostatistics Open Learning Textbook, CC)
# Pyplot histogram
plt.grid(True)
plt.hist(df['body_mass_g'], bins=10, rwidth=.9, color = 'tab:blue' );
# Pandas plot histogram
df.hist(column='body_mass_g', bins=10, grid=True, rwidth=.9, color='tab:blue',
alpha=0.5, fill='blue');
# Using Pandas plot bar
df['species'].value_counts().plot(kind='bar', color = 'tab:blue');
df.groupby('species').body_mass_g.mean().plot.bar();
# Using Pandas boxplot
boxplot = df.boxplot(column=['culmen_length_mm', 'culmen_depth_mm'])
# Using Pandas boxplot
boxplot = df.groupby('species').boxplot(column=['flipper_length_mm'], color='blue', subplots=False);
plt.title('Penguins characteristics') ;
plt.ylabel('flipper_length_mm');
plt.xticks(rotation=30);
# Matplotlib.Pyplot
# Define plotting variables
x = df.body_mass_g.values
y = df.flipper_length_mm.values
Pspecies = df.species.values
Pspecies_ = np.unique(Pspecies) # Keep only unique species values
Pcolors = ["#1B9E77", "#D95F02", "#7570B3"]
Pmarkers = ["o", "^", "s"] # circle, triangle, square
fig, ax = plt.subplots(figsize=(9,6))
for species, color, marker in zip(Pspecies_, Pcolors, Pmarkers):
    idxs = np.where(Pspecies == species)
    # No legend will be generated if we don't pass label=species
    ax.scatter(
        x[idxs], y[idxs], label=species,
        s=30, color=color, marker=marker, alpha=0.7)
plt.xlabel('Body mass (g)')
plt.ylabel('Flipper length (mm)')
plt.title('Matplotlib Pyplot Scatter Plot')
ax.legend();
# Pandas Plot
dfA = df[df['species'] == 'Adelie'].dropna()
dfC = df[df['species'] == 'Chinstrap'].dropna()
dfG = df[df['species'] == 'Gentoo'].dropna()
ax = dfA.plot.scatter(x='body_mass_g', y='flipper_length_mm',
c='tab:green', label='Adelie',
xlabel='Body mass (g)', ylabel='Flipper length (mm)',
title='Pandas Scatter Plot', figsize=(12, 8))
dfC.plot.scatter(x='body_mass_g', y='flipper_length_mm', ax=ax, c='tab:orange',
label='Chinstrap')
dfG.plot.scatter(x='body_mass_g', y='flipper_length_mm', ax=ax, c='tab:purple',
                 label='Gentoo')
plt.show()
A basic plot structure example.
# Work with a copy of the dataframe df dropping all NAN
df1 = df.copy().dropna()
# Create an empty figure object
fig = plt.figure(figsize=(12, 8))
# Add some subplots 2 cols x 2 rows = 4 subplots, but plotting only 3
ax1 = fig.add_subplot(2,2,1) # First subplot of row 1
ax2 = fig.add_subplot(2,2,2) # Second subplot of row 1
ax3 = fig.add_subplot(2,2,3) # Third subplot, now on row 2
# Add some content
# Histogram
_ = ax1.hist(df1['body_mass_g'], bins=20, color = 'tab:blue', alpha = 0.4)
# Scatterplot
ax2.scatter(x = df1['culmen_length_mm'], y = df1['culmen_depth_mm'], color = 'tab:orange', marker = "v", )
# Boxplot
ax3.boxplot(df1['flipper_length_mm']);
# Add some ticks, labels and legends to the plots
# Create an empty figure object
fig = plt.figure(figsize=(12, 8), constrained_layout=False, tight_layout=False)
# Add some subplots 2 cols x 2 rows = 4 subplots, but plotting only 3
ax1 = fig.add_subplot(2,2,1) # First subplot of row 1
ax2 = fig.add_subplot(2,2,2) # Second subplot of row 1
ax3 = fig.add_subplot(2,2,3) # Third subplot, now on row 2
# Add subplots
_ = ax1.hist(df1['body_mass_g'], bins=20, color = 'tab:blue', alpha = 0.4)
ax2.scatter(x = df1['culmen_length_mm'], y = df1['culmen_depth_mm'], color = 'tab:orange', marker = "v", )
ax3.boxplot(df1['flipper_length_mm'])
# Work on details of 1st. plot
ax1.set_xticks([3000, 4000, 5000, 6000])
ax1.set_xticklabels(['3.0', '4.0', '5.0', '6.0'], rotation=30, fontsize='small')
ax1.set_title('Penguins mass distribution (kg)')
# Work on details of 2nd. plot
ax2.set_xticks([30, 40, 50, 60])
ax2.set_xticklabels(['3.0', '4.0', '5.0', '6.0'], rotation=30, fontsize='small')
ax2.set_xlabel('Culmen length (cm)')
ax2.set_yticks([10, 15, 20, 25])
ax2.set_yticklabels(['1.0', '1.5', '2.0', '2.5'], fontsize='small')
ax2.set_ylabel('Culmen depth (cm)')
ax2.set_title('Culmen Depth vs. Length (cm)')
# Work on details of 3rd. plot
ax3.set_xlabel('')
ax3.set_yticks([160, 180, 200, 220, 240])
ax3.set_yticklabels(['16.0', '18.0', '20.0', '22.0', '24.0'], fontsize='small')
ax3.set_ylabel('Flipper length (cm)')
ax3.set_title('Flipper Length variability')
# Optional
# Setting width and height space between subplots and general plot title
plt.subplots_adjust(left=0.13, right=0.93, top=1.0, bottom= 0.3, wspace= 0.3, hspace=0.3)
plt.suptitle('Antarctica Penguins Dataset', y=1.08, fontsize=14, fontweight='bold');
📝 Here you will find more data visualization examples with Matplotlib, Pandas, Seaborn and other libraries.
- Python Tutorial. Python.org.
- Numpy Tutorial. Numpy.org.
- SciPy User's Guide. SciPy.org.
- Matplotlib Tutorial. Matplotlib.org.
- Getting started with Pandas. Pydata.org.
- Seaborn User's guide and tutorial. Pydata.org.
- Jupyter Notebook. Cheat Sheet. Datacamp.
- Python 3. Cheat Sheet. Laurent Pointal. Mémento v.2.0.6.
- Python Basics. Data Science Cheat Sheet. Dataquest.
- Python Intermediate. Data Science Cheat Sheet. Dataquest.
- Importing Data. Python for Data Science Cheat Sheet. Datacamp.
- Numpy. Data Analysis in Python. Cheat Sheet. Datacamp.
- Numpy. Data Science Cheat Sheet. Dataquest.
- Pandas. Data Science Cheat Sheet. Datacamp.
- Pandas. Data Wrangling in Python. Cheat Sheet. Datacamp.
- Pandas. Data Science Cheat Sheet. Dataquest.
- Matplotlib. Plotting in Python. Cheat Sheet. Datacamp.
- Seaborn Statistical Data Visualization. Cheat Sheet. Datacamp.
- Introduction to Programming. Kaggle.com.
- Python. Kaggle.com.
- Data Cleaning. Kaggle.com.
- Pandas. Kaggle.com.
Created: 05/20/2022 (C. Lizárraga); Last update: 05/24/2022 (C. Lizárraga)