Python Data Analytics - JohnHau/mis GitHub Wiki
1. Introduction
Data analysis is the process of extracting information from raw data. It aims to build a model with predictive power. In parallel, data visualization aims to present the data graphically so that you can easily understand its meaning. At the end of a data analysis, you should have a model and a set of graphical displays that allow you to predict the responses given the inputs.
To undertake data analysis, you need knowledge in these areas:

Programming (in Python, R or Matlab), e.g., web scraping, which collects data by recognizing specific occurrences of HTML tags within a web page.
Mathematics and Statistics: in particular, Bayesian methods, regression and clustering.
Machine Learning and Artificial Intelligence.
Domain knowledge of the field under study.

2. Tools and Packages
2.1 Jupyter Notebook
Jupyter Notebook is a great tool for data analysis in Python. When installed via the Anaconda distribution, it comes bundled with the common Python data analytics packages. Read "Jupyter Notebook" on how to install and get started.
2.2 SciPy
SciPy (@ https://www.scipy.org) is a set of open-source Python libraries specialized for mathematics, science and engineering. It consists of many Python packages.
We will use the following packages for data analysis:
NumPy (@ http://www.numpy.org/): the fundamental package for numerical computation. It defines the n-dimensional array (ndarray) and its basic operations.
Pandas (@ http://pandas.pydata.org/): provides high-performance, easy-to-use 2D tabular data structures (DataFrame) and tools for analyzing them.
Matplotlib (@ https://matplotlib.org/): supports comprehensive 2D plotting and rudimentary 3D plotting.
scikit-learn (@ https://scikit-learn.org/stable/): a collection of algorithms and tools for machine learning.
Jupyter Notebook (@ http://jupyter.org/): a web application that allows you to document your computation in an easily reproducible form.

In addition, SciPy also includes:
SciPy Library (@ https://www.scipy.org/scipylib/index.html): a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics and more.
SymPy (@ https://www.sympy.org/en/index.html): symbolic mathematics and algebra.
scikit-image (@ https://scikit-image.org/): a collection of algorithms for image processing.
Nose (@ https://nose.readthedocs.io/en/latest/): a framework for testing Python code, being phased out in favor of pytest (@ https://docs.pytest.org/en/latest/).
h5py (@ http://www.h5py.org/) and PyTables (@ http://www.pytables.org/): both can access data stored in the HDF5 format.

Installation
(For Windows/Mac/Ubuntu) I suggest that you install Jupyter Notebook (via Python 3's Anaconda distribution), which comes bundled with most of the Python data analysis packages.
(For Ubuntu) To install all the packages:
$ sudo apt-get install python-numpy python-scipy python-matplotlib python-pandas python-sympy python-nose
$ sudo apt-get install python3-numpy python3-scipy python3-matplotlib python3-pandas python3-sympy python3-nose
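After installing, it may help to confirm which of the packages listed above your Python 3 interpreter can actually find. The following is a minimal sketch using only the standard library. The helper name installed_versions is made up for illustration, and the distribution names are assumed to match the names used on PyPI.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None   # not installed in this environment
    return versions


# Assumed distribution names (hypothetical list for illustration)
packages = ["numpy", "scipy", "matplotlib", "pandas", "sympy"]
report = installed_versions(packages)
for name, version in report.items():
    print(f"{name}: {version if version else 'not found'}")
```

A package reported as "not found" is simply not visible to the interpreter that ran the script; it may still be installed for a different Python version.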
3. Matplotlib
References:

Matplotlib mother site @ http://matplotlib.org/index.html.
Matplotlib beginner's guide @ http://matplotlib.org/users/beginner.html.
Matplotlib is a Python 2D plotting library for generating plots, such as histograms, power spectra, bar charts, error charts, scatter plots, and more. It can be used in interactive environments, including Python scripts, the Python command-line shells, the Jupyter Notebook, web application servers, and graphical user interface toolkits, across platforms (Windows, Unix, Mac). It also produces quality figures in various hardcopy formats, such as PDF, PNG, SVG.
3.1 The matplotlib.pyplot Module
The matplotlib.pyplot module is a collection of command-style functions that make Matplotlib work like MATLAB.
Include the following import statement to use the module:
import matplotlib.pyplot as plt

3.2 Get Started
Simplest Plot
The simplest example to plot a line is as follows. Try it out on Jupyter Notebook and Python's command-line shell, and observe the output.
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4, 5, 6, 7], [7, 8, 6, 5, 2, 2, 4], 'b*-')
[<matplotlib.lines.Line2D object at ...>]
plt.show()
Customizing Your Figure: Setting Title, X-Y Axis, Legend
You can customize the figure, such as adding a title and setting the axes and legend, via dedicated functions/commands. For example,

import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4, 5, 6, 7], [7, 8, 6, 5, 2, 2, 4], 'b*-', label='Major')  # "label" used for legend
[<matplotlib.lines.Line2D object at ...>]  # Return a list of "Line2D" objects
plt.plot([1, 2, 3, 4, 5, 6, 7], [3, 1, 1, 3, 4, 3, 5], 'ro-', label='Minor')  # Another line
[<matplotlib.lines.Line2D object at ...>]
plt.title('My Star Plot')
Text(0.5,1,'My Star Plot')  # Return a "Text" object
plt.xlabel('Some X (unit)')
<matplotlib.text.Text object at ...>  # Return a "Text" object
plt.ylabel('Some Y (unit)')
<matplotlib.text.Text object at ...>
plt.axis([1, 7, 0, 9])  # [xmin, xmax, ymin, ymax]
[1, 7, 0, 9]
plt.legend()
<matplotlib.legend.Legend object at ...>  # Return a "Legend" object
plt.savefig('PlotStars.png', dpi=600, format='png')
plt.show()  # Show figure, clear figure and free memory
For example,
import matplotlib.pyplot as plt
plt.figure(1)  # Start figure 1. Same as plt.figure(). Returns a figure object
plt.subplot(2, 1, 1)  # 2 rows, 1 column, start subplot 1. Same as plt.subplot(211)
<matplotlib.axes._subplots.AxesSubplot object at ...>  # Return an axes object
plt.plot([1, 2, 3, 4, 5, 6, 7], [7, 8, 6, 5, 2, 2, 4], 'b*-', label='Major')
[<matplotlib.lines.Line2D object at ...>]
plt.title('Sub-Plot 1 Title')
Text(0.5,1,'Sub-Plot 1 Title')
plt.legend()
<matplotlib.legend.Legend object at ...>
plt.subplot(2, 1, 2)  # 2 rows, 1 column, start subplot 2. Same as plt.subplot(212)
<matplotlib.axes._subplots.AxesSubplot object at ...>  # Return an axes object
plt.plot([1, 2, 3, 4, 5, 6, 7], [3, 1, 1, 3, 4, 3, 5], 'ro-', label='Minor')
[<matplotlib.lines.Line2D object at ...>]
plt.title('Sub-Plot 2 Title')
Text(0.5,1,'Sub-Plot 2 Title')
plt.legend()
<matplotlib.legend.Legend object at ...>
plt.tight_layout()  # Prevent subplots overlap
plt.savefig('Plot2x1.png', dpi=600, format='png')  # Save this figure
plt.figure(2)  # Start figure 2
plt.plot([1, 2, 3, 4, 5], [1, 3, 2, 7, 5], 'ro-')  # subplot 1 created automatically as the current axes
plt.show()
You can also retrieve the handles (references) to the figure and sub-plots (axes), and use the axes in plotting. For example,
import matplotlib.pyplot as plt
# Create a figure and sub-plots of 2 rows by 2 columns.
# Retrieve the handles of the figure and the subplot axes
fig1, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
fig1  # Figure object
ax1   # subplots are AxesSubplot objects
ax1.plot([1, 2, 3, 4, 5], [1, 3, 2, 7, 5], 'ro-')
[<matplotlib.lines.Line2D object at ...>]
ax1.set_title('AX1 Title')
Text(0.5,1,'AX1 Title')
ax2.plot([1, 2, 3, 4, 5], [8, 5, 2, 3, 3], 'gx-')
[<matplotlib.lines.Line2D object at ...>]
ax2.set_title('AX2 Title')
Text(0.5,1,'AX2 Title')
ax3.plot([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], 'bo-')
[<matplotlib.lines.Line2D object at ...>]
ax3.set_title('AX3 Title')
Text(0.5,1,'AX3 Title')
ax4.plot([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], 'rx-')
[<matplotlib.lines.Line2D object at ...>]
ax4.set_title('AX4 Title')
Text(0.5,1,'AX4 Title')
plt.tight_layout()  # Prevent subplots overlap
plt.show()
3.3 The plot() Function
The plot() function has these signatures:

help(plt.plot)
plot([x], y, [fmt], [**kwargs])  # Single line or point
plot([x1], y1, [fmt1], [x2], y2, [fmt2], ..., [**kwargs])  # Multiple lines or points
For example,

plot(y): plot y with x=range(len(y))=[0, 1, 2, ..., len-1], where y can be an array (line-plot) or a scalar (point-plot).
plot(x, y): plot y against x, where x and y can be arrays (line-plot) or scalars (point-plot).
plot(x, y, fmt): plot y against x using the format string, e.g., 'bo-' for blue circles with a solid line, 'r+' for red pluses.
plot(x1, y1, fmt1, x2, y2, fmt2, ...): plot yn vs. xn using the respective format strings (multiple lines or multiple points).

Line's Properties: Color, Marker and Line Style
Lines are represented as Line2D objects. You can use a format string to specify the color, marker and line style.
The color abbreviations are:
'r' (red), 'g' (green), 'b' (blue)
'c' (cyan), 'm' (magenta), 'y' (yellow)
'k' (black) and 'w' (white)

The markers are:

'.' (point marker), ',' (pixel marker), '*' (star marker), '+' (plus marker), 'x' (cross marker)
'o' (circle marker), 's' (square marker), 'h' (hexagon1 marker), 'H' (hexagon2 marker), 'd' (thin-diamond marker), 'D' (diamond marker)
'v' (triangle-down marker), '^' (triangle-up marker), '<' (triangle-left marker), '>' (triangle-right marker)
'1' (tri-down marker), '2' (tri-up marker), '3' (tri-left marker), '4' (tri-right marker)
'|' (vline marker), '_' (hline marker)

The line styles are:

'-' or 'solid'
'--' or 'dashed'
'-.' or 'dashdot'
':' or 'dotted'

Setting Line's Properties
The function plot() returns a list of Line2D objects (see above examples), each of which has these properties:

color (or c)
marker, markersize (or ms), markerfacecolor (or mfc), markeredgecolor (or mec), markeredgewidth (or mew)
linestyle (or ls), linewidth (or lw)
others

You can set the line's properties:
Using keyword arguments of plot(), e.g.,

plt.plot([1, 2, 3, 4, 5], [5, 1, 2, 4, 3], color='green', marker='o', markerfacecolor='blue', markersize=12, linestyle='dashed')
plt.show()

Using Line2D's setters set_xxx() for each property, e.g.,

line, = plt.plot([1, 2, 3, 4, 5], [5, 1, 2, 4, 3])
# plot() returns a list of Line2D objects - a one-item list in this plot
# Retrieve a reference to the Line2D by unpacking the one-item list
line.set_color('y')  # same as line.set_color('yellow')
line.set_linestyle('-.')  # same as line.set_linestyle('dashdot')
line.set_marker('*')  # star marker
plt.show()

Using the setp() (set property) function, e.g.,

lines = plt.plot([1, 2, 3, 4, 5], [5, 1, 2, 4, 3], [1, 2, 3, 4, 5], [2, 4, 6, 3, 4])  # 2-item list
lines
[<matplotlib.lines.Line2D object at ...>, <matplotlib.lines.Line2D object at ...>]
plt.setp(lines, color='r', marker='+')  # Applicable to single line or list of lines
[None, None, None, None]
plt.show()

3.4 Working with Texts
The following functions return a Text object:
title(str): Set the title.
xlabel(str), ylabel(str): Set the x-axis and y-axis labels.
text(xPos, yPos, str): Draw str at (xPos, yPos).
annotate(str, xy=(x, y), xytext=(x, y)): Annotate the point at xy, with str placed at xytext, with an optional arrow.

You can include optional keyword arguments in the above functions, such as fontsize, color, etc.

Example: text() and annotate()
x = range(1, 6)  # [1, 2, 3, 4, 5]
y = [5, 2, 4, 1, 6]
ytexts = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
plt.plot(x, y, 'ro-')
[<matplotlib.lines.Line2D object at ...>]
for i in range(len(x)):
    plt.text(x[i], y[i]+0.1, ytexts[i], horizontalalignment='center', verticalalignment='bottom')
Text(1,5.1,'First')
Text(2,2.1,'Second')
......
plt.annotate('Annotate Third', xy=(x[2], y[2]), xytext=(x[2]+0.5, y[2]+1), arrowprops={'facecolor':'black', 'shrink':0.05, 'width':1})
Text(3.5,5,'Annotate Third')
plt.show()

3.5 Logarithmic and Non-Linear Axis
xscale(scale), yscale(scale): the available scales are 'linear', 'log', 'symlog' (symmetric log).
[TODO] Examples
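As one possible illustration of the scale options above, the sketch below plots the same exponential data twice, once with the default linear y-axis and once with yscale('log'). The data and the variable names ax_linear and ax_log are made up for illustration, and the Agg backend is selected only so that the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)
y = np.exp(x)                   # illustrative data that grows exponentially

plt.figure()

plt.subplot(1, 2, 1)            # left: default linear scale
plt.plot(x, y, 'b-')
plt.title('Linear y-axis')
ax_linear = plt.gca()           # keep a handle to the current axes

plt.subplot(1, 2, 2)            # right: logarithmic y-axis
plt.plot(x, y, 'r-')
plt.yscale('log')               # applies to the current axes only
plt.title('Log y-axis')
ax_log = plt.gca()

plt.tight_layout()
print(ax_linear.get_yscale(), ax_log.get_yscale())
```

On the log axis the exponential curve appears as a straight line, which is often the reason for choosing this scale.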
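3.6 Saving the Figures: savefig()

help(plt.savefig)
savefig(fname, dpi=None, facecolor='w', edgecolor='w', orientation='portrait', papertype=None, format=None, transparent=False, bbox_inches=None, pad_inches=0.1, frameon=None)

The output file formats include PNG, PDF, SVG and EPS, set via the keyword format=xxx. The signature above comes from an older Matplotlib release; some keywords, such as papertype and frameon, are no longer accepted by newer releases, so check help(plt.savefig) for your version.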
For example,
plt.plot([1, 2, 3, 4, 5], [5, 2, 4, 3, 2], 'ro-')
[<matplotlib.lines.Line2D object at ...>]
plt.savefig('test.pdf', dpi=600, format='pdf')
plt.savefig('test.png', dpi=600, format='png')
plt.show()
# You cannot issue show() before savefig(),
# as show() clears the figure and frees the memory

3.7 Configuration File "matplotlibrc"
You can configure Matplotlib via the configuration file "matplotlibrc".
You can check the location of "matplotlibrc" via:
import matplotlib
matplotlib.matplotlib_fname()
......
[TODO]
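The settings stored in "matplotlibrc" are also exposed at run time through matplotlib.rcParams. The sketch below, which is only one way to experiment with them, overrides two line settings temporarily with rc_context() instead of editing the file. The chosen values are arbitrary, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend, not needed in Jupyter
import matplotlib.pyplot as plt

print(matplotlib.matplotlib_fname())              # path of the active matplotlibrc file

default_width = matplotlib.rcParams['lines.linewidth']   # value loaded from the configuration

# Override two settings only inside the "with" block
with matplotlib.rc_context({'lines.linewidth': 2.5, 'lines.linestyle': '--'}):
    line, = plt.plot([1, 2, 3, 4, 5], [5, 2, 4, 3, 2])

print(line.get_linewidth(), line.get_linestyle())  # the line picked up the overrides
print(matplotlib.rcParams['lines.linewidth'] == default_width)  # the configuration is restored
```

Settings that should apply to every session belong in the "matplotlibrc" file itself, using the same key names.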
4. NumPy
References:

NumPy mother site @ http://www.numpy.org/.
NumPy User Guide @ http://docs.scipy.org/doc/numpy-dev/user/
NumPy (which stands for Numerical Python @ http://www.numpy.org/) is the foundation library for scientific computing in Python. It provides data structures and high-performance functions that the standard Python does not provide. NumPy defines a data structure called ndarray which is an N-dimensional array to support matrix operations, basic linear algebra, basic statistical operations, Fourier transform, random number capabilities and much more. NumPy uses pre-compiled numerical routines (most of them implemented in C code) for high-performance operations. It also supports vector (or parallel) computations.
4.1 The numpy Package
NumPy is distributed in the Python package numpy. You need to import the package:

import numpy as np

4.2 The numpy.ndarray Class
At the core of NumPy is a class called ndarray for modeling homogeneous n-dimensional arrays and matrices. Unlike Python's list, but like C/C++/Java's array:
ndarray has a fixed size at creation.
ndarray contains elements of the same data type.

The ndarray has these attributes:

ndarray.dtype: data type of the elements. Recall that ndarray contains elements of the same type (unlike Python's list). You can use the Python built-in types such as int, float, bool, str and complex; or NumPy's types, such as int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64, complex64, complex128, with the specified bit-size.
ndarray.shape: a tuple of n positive integers (d0, d1, ..., dn-1) that specifies the size of each dimension. E.g., for a 2D matrix with n rows and m columns, shape is a tuple (n, m). In NumPy, dimensions are called axes. (NumPy's dimension is different from the mathematical dimension!) The number of axes is the rank. The length of axis-0 is d0, the length of axis-1 is d1, and so on.
ndarray.ndim: rank (number of axes, length of shape). NumPy's rank is different from Linear Algebra's rank (number of independent vectors)!
ndarray.size: total number of elements, same as the product of shape.
ndarray.itemsize: size in bytes of each element (all elements have the same type).
ndarray.data: the buffer containing the actual elements.

4.3 Creating an ndarray and Checking its Attributes
There are a few ways to create a NumPy ndarray.
Creating an Array 1: numpy.array(lst, [dtype=None]) -> ndarray
You can use NumPy's function array() to create and initialize an ndarray object from a Python list/tuple. You can use the optional keyword argument dtype to specify the data type instead of taking the default data type.

For example,
import numpy as np
help(np.array)
......

m1 = np.array([11, 22, 33])
m1
array([11, 22, 33])  # ndarray is printed with prefix array()
type(m1)
<class 'numpy.ndarray'>
m1.shape  # dimension
(3,)  # shape is a tuple of dimensions
m1.dtype  # data type
dtype('int32')
m1.itemsize
4  # 4 bytes (32 bits) for int32
m1.ndim  # rank (number of axes)
1
m1.size  # total number of elements
3
m1.data
<memory at ...>

m2 = np.array([1.1, 2.2, 3])
m2
array([1.1, 2.2, 3. ])
m2.dtype
dtype('float64')  # default floats are float64

m3 = np.array([1, 2.2, 3], dtype=complex)
m3
array([ 1.0+0.j, 2.2+0.j, 3.0+0.j])
m3.dtype
dtype('complex128')

m4 = np.array(['a', 'bb', 'ccc'])
m4
array(['a', 'bb', 'ccc'], dtype='<U3')  # little-endian Unicode 3-character string
m4.dtype
dtype('<U3')

m5 = np.array((11, 22, 33))  # Can also use a tuple
m5
array([11, 22, 33])

m6 = np.array([[11, 22, 33], [44, 55, 66]])  # 2D array from a list of lists
m6
array([[11, 22, 33],
       [44, 55, 66]])
m6.shape  # dimensions
(2, 3)  # rows, columns
m6.ndim  # number of dimensions, or rank, or number of axes
2

m7 = np.array([(1, 2), [3, 4], (5, 6)], dtype=float)
m7
array([[1., 2.],
       [3., 4.],
       [5., 6.]])
m7.dtype
dtype('float64')
m7.shape
(3, 2)
m7.ndim
2  # rank (2 axes)

NumPy's Data Types
NumPy supports Python's built-in data types (such as int, float, bool, complex, and str). It also introduces its own scalar data types:
Signed Integers: int8, int16, int32, int64, int_ (default integer type, same as C's long, normally either int64 or int32), intc (same as C's int), intp (integers used for indexing, same as C's ssize_t, normally either int32 or int64)
Unsigned Integers: uint8, uint16, uint32, uint64
Floating-point numbers: float16, float32, float64, float_ (default, same as float64)
Boolean: bool_ (True or False)
Complex numbers: complex64, complex128, complex_ (default, same as complex128)
Strings: str, unicode, unicode_

Note that the aliases float_, complex_ and unicode_ were removed in NumPy 2.0; use float64, complex128 and str_ there instead.

Creating an Array 2:
numpy.ones(shape) -> ndarray: Return a new array of the given shape, filled with 1.
numpy.zeros(shape) -> ndarray: Return a new array of the given shape, filled with 0.
numpy.empty(shape) -> ndarray: Return a new array of the given shape, uninitialized.
numpy.full(shape, fill_value) -> ndarray: Return a new array of the given shape, filled with fill_value.
numpy.diag(lstDiag) -> ndarray: Return a new array with the given diagonal elements.
numpy.ones_like(a) -> ndarray: Return a new array of the same shape and type as a, filled with 1.
numpy.zeros_like(a) -> ndarray: Return a new array of the same shape and type as a, filled with 0.
numpy.empty_like(a) -> ndarray: Return a new array of the same shape and type as a, uninitialized.
numpy.full_like(a, fill_value) -> ndarray: Return a new array of the same shape and type as a, filled with fill_value.

The functions ones() and zeros() create an array full of ones and zeros respectively. The function empty() creates a new array of the given shape and type, without initializing its entries. The default type is float64, unless overridden with the keyword dtype. For example,
import numpy as np
help(np.ones)
m1 = np.ones((3, 5))  # takes a shape tuple in row-major order
m1
array([[ 1., 1., 1., 1., 1.],
       [ 1., 1., 1., 1., 1.],
       [ 1., 1., 1., 1., 1.]])
m1.dtype
dtype('float64')

help(np.zeros)
m2 = np.zeros((2, 3, 4), dtype=np.int32)  # 3D array
m2
array([[[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]],
[[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]]])
m2.dtype
dtype('int32')

help(np.full)
m3 = np.full((2, 5), 99)
m3
array([[99, 99, 99, 99, 99],
       [99, 99, 99, 99, 99]])

help(np.empty)
m4 = np.empty((2, 3, 2, 2))  # A 4D array
m4  # Contents not initialized
array([[[[4.65302447e-312, 0.00000000e+000],
[0.00000000e+000, 1.53527001e-311]],
[[0.00000000e+000, 1.00000000e+000],
[0.00000000e+000, 0.00000000e+000]],
[[1.00000000e+000, 0.00000000e+000],
[0.00000000e+000, 0.00000000e+000]]],
[[[0.00000000e+000, 1.00000000e+000],
[1.01007000e-311, 0.00000000e+000]],
[[2.49009086e-321, 4.94065646e-324],
[0.00000000e+000, 1.53526866e-311]],
[[1.53526866e-311, 0.00000000e+000],
[0.00000000e+000, 0.00000000e+000]]]])
m4.dtype
dtype('float64')

help(np.diag)
m5 = np.diag([11, 22, 33])  # Create a diagonal 2D array
m5
array([[11, 0, 0],
       [ 0, 22, 0],
       [ 0, 0, 33]])

help(np.zeros_like)
m6 = np.zeros_like(m5)  # Same shape and type
m6
array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])
m6.dtype
dtype('int32')

Creating an Array 3:
numpy.arange([start=0], stop, [step=1]) -> ndarray_1D
numpy.linspace(start, stop, num) -> ndarray_1D
ndarray.reshape(newShape)

NumPy provides a function numpy.arange(start, stop, step) to create a 1D ndarray in the range [start, stop), analogous to Python's built-in range(start, stop, step). Unlike range(), which accepts only int, arange() also accepts float for start, stop and step. For example,
help(np.arange)
m1 = np.arange(1, 11)  # start included, stop excluded
m1
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m1.dtype
dtype('int32')

m2 = np.arange(5)  # default start is 0, step is 1
m2
array([0, 1, 2, 3, 4])

m3 = np.arange(1.5, 8.5)
m3
array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5])
m4 = np.arange(1, 10, 0.8)
m4
array([1. , 1.8, 2.6, 3.4, 4.2, 5. , 5.8, 6.6, 7.4, 8.2, 9. , 9.8])

a1 = range(1.5, 8.5)
TypeError: 'float' object cannot be interpreted as an integer

You can then use ndarray.reshape(newShape) to reshape the 1D ndarray into an N-D ndarray. For example,
help(np.reshape)

m5 = np.arange(10).reshape(2, 5)
m5
array([[0, 1, 2, 3, 4],
       [5, 6, 7, 8, 9]])

m6 = m5.reshape(1, -1)  # -1 lets NumPy infer the length of that axis
m6
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])

NumPy also provides a similar function called numpy.linspace(start, stop, num) to create an evenly-spaced (linearly-spaced) 1D float ndarray in the interval [start, stop]. By default, stop is included, but you can exclude it via the keyword endpoint=False. The linspace() function takes the number of points, whereas arange() takes the step size.
For example,
help(np.linspace)
m1 = np.linspace(1, 2, 10)
m1
array([ 1. , 1.11111111, 1.22222222, 1.33333333, 1.44444444, 1.55555556, 1.66666667, 1.77777778, 1.88888889, 2. ])
m1.dtype
dtype('float64')
m2 = np.linspace(1, 2, 10, endpoint=False)  # Exclude end-point
m2
array([1. , 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
m3 = np.linspace(1, 10, 10, dtype=int)  # Set data type
m3
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
m4 = np.linspace(1, 10, 10).reshape(2, 5)
m4
array([[ 1., 2., 3., 4., 5.],
       [ 6., 7., 8., 9., 10.]])

m5 = np.linspace(-np.pi, np.pi, 361)
m5
array([-3.14159265, .... 3.14159265])

Creating an Array 4: Creating Random Array via the numpy.random.xxx() -> ndarray
The numpy.random module supports random number generation. You can draw samples from commonly-used distributions like Uniform, Normal (Gaussian), Binomial and Poisson.
Uniformly Distributed:
numpy.random.rand(d0, d1, ..., dn-1) -> ndarray: Uniformly distributed floats in [0.0, 1.0), providing the dimensions.
numpy.random.random(size=None) -> ndarray: Uniformly distributed floats in [0.0, 1.0), providing the size as a scalar or the shape as a tuple.
numpy.random.random_sample(size=None) -> ndarray: same as above.
numpy.random.uniform(low=0.0, high=1.0, size=None) -> ndarray: Uniformly distributed floats in [low, high).
numpy.random.randint(low, high=None, size=None) -> ndarray: Uniformly distributed integers in [low, high).
help(np.random.rand)
m1 = np.random.rand(2, 3)  # Specify the dimensions
m1
array([[0.57877041, 0.93898599, 0.15998744],
       [0.5195182 , 0.79441764, 0.47046495]])
m1.dtype
dtype('float64')

help(np.random.random)
m2 = np.random.random()  # One sample (default)
m2
0.8530312529958475  # Scalar, NOT array
m3 = np.random.random(5)  # 1D ndarray
m3
array([ 0.31007576, 0.21615439, 0.26983623, 0.44427757, 0.35548085])
m4 = np.random.random((2, 4))  # ndarray of given shape
m4
array([[ 0.45519034, 0.97199324, 0.49615973, 0.5377464 ],
       [ 0.1057191 , 0.900195 , 0.7685127 , 0.23238175]])

help(np.random.uniform)
m5 = np.random.uniform(5, 10, (2, 4))  # low, high, shape
m5
array([[8.39092855, 5.95135548, 7.21166273, 6.46086279],
       [9.7510942 , 5.99099363, 9.9313887 , 6.75191231]])
help(np.random.randint)
m6 = np.random.randint(1, 101, (2, 4))
m6
array([[68, 97, 84, 55],
       [49, 57, 28, 87]])
m7 = np.random.randint(1, 101, 10)
m7
array([37, 34, 57, 60, 26, 34, 46, 73, 59, 96])
m8 = np.random.randint(50, size=(2, 5))  # [0, 50)
m8
array([[16, 48, 9, 3, 22],
       [19, 20, 16, 17, 11]])

Normal (Gaussian) Distributed:
numpy.random.randn(d0, d1, ..., dn-1) -> ndarray: Standard normal distribution (mean=0, standard deviation=1), providing the dimensions.
numpy.random.normal(loc=0.0, scale=1.0, size=None) -> ndarray: Normal (Gaussian) distribution, with mean loc and standard deviation scale.

help(np.random.randn)
m1 = np.random.randn(2, 5)
m1
array([[-0.36150823, -2.02660018, -0.38235962, 0.64032599, 0.23108273],
       [-0.31966815, 1.3190811 , 0.49096282, 0.01427582, -1.35702935]])

help(np.random.normal)
m2 = np.random.normal()
m2
-0.355415080976361  # Scalar, NOT array
m3 = np.random.normal(size=10)
m3
array([-0.78298485, 0.53316234, 0.07914094, 0.88850953, 1.05475548, 0.84182328, 0.0081135 , -0.28555631, -0.04288513, -0.36058967])
m4 = np.random.normal(size=(2, 3))
m4
array([[-1.24201626, -0.66748844, 0.3602864 ],
       [-0.97706347, 1.02509533, 0.08946322]])

m5 = np.random.normal(50, 15, 10)  # mean 50, standard deviation 15, 10 samples
m5
array([ 49.57202009, 57.63097904, 51.33961472, 22.0570641 , 65.46613523, 35.14129408, 61.97144885, 56.32118504, 75.82942142, 40.70516785])
m6 = np.random.normal(5, 2, (2, 4))
m6
array([[5.09802446, 1.74155424, 3.87027413, 3.87650247],
       [5.50037146, 6.61549043, 6.9740259 , 5.04622304]])

Binomial Distributed:
numpy.random.binomial(n, p, size=None) -> ndarray: Binomial distribution for n trials with p probability of success.

help(np.random.binomial)
m1 = np.random.binomial(1, 0.5, 10)  # 10 samples, each from 1 trial with probability of success 0.5
m1
array([1, 1, 1, 1, 0, 1, 0, 1, 1, 1])
m2 = np.random.binomial(5, 0.2, (2, 4))  # each sample counts the successes in 5 trials
m2
array([[2, 0, 0, 0],
       [1, 1, 1, 2]])

Poisson Distributed:
numpy.random.poisson(lam=1.0, size=None) -> ndarray: Poisson distribution with parameter lambda.

help(np.random.poisson)
m1 = np.random.poisson(1, 15)
m1
array([2, 2, 2, 2, 0, 1, 2, 1, 0, 1, 0, 0, 3, 0, 0])
m2 = np.random.poisson(5, (2, 5))
m2
array([[ 4, 6, 5, 11, 5],
       [ 7, 4, 3, 7, 7]])

Permutation:
numpy.random.permutation(x) -> ndarray
np.random.permutation(10)
array([0, 8, 2, 5, 3, 6, 7, 9, 1, 4])

np.random.permutation([1, 3, 8, 11, 15])
array([ 8, 3, 11, 15, 1])

m1 = np.arange(12).reshape(3, 4)
m1
array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11]])
np.random.permutation(m1)  # Return a new array, shuffled along axis-0 (rows)
array([[ 8, 9, 10, 11],
       [ 4, 5, 6, 7],
       [ 0, 1, 2, 3]])

4.4 Accessing the ndarray
Accessing the ndarray 1: Multi-Dimensional Indexing [i, j, ...] and Slicing [m1:n1:step1, m2:n2:step2, ...]
You can apply indexing and slicing to NumPy's ndarray, similar to Python's list, but extended to multiple dimensions.
m1 = np.arange(1, 13).reshape(3, 4)  # 2D
m1
array([[ 1, 2, 3, 4],
       [ 5, 6, 7, 8],
       [ 9, 10, 11, 12]])
m1.shape
(3, 4)

m1[0, 2]
3
m1[0, -1]  # Can use negative index starting from the end
4

m1[0]  # Same as m1[0, :] or m1[0, 0:m1.shape[1]]
       # Missing trailing index same as : (all elements)
array([1, 2, 3, 4])
m1[0, :]  # Same as above
array([1, 2, 3, 4])
m1[:, 0]  # Same as m1[0:m1.shape[0], 0]
array([1, 5, 9])  # Column becomes row
m1[-1]  # Can use negative index, same as m1[-1, :]
array([ 9, 10, 11, 12])
m1[:, -1]
array([ 4, 8, 12])
m1[0, 1:]
array([2, 3, 4])
m1[0:2, 1:3]
array([[2, 3],
       [6, 7]])
m1[0:2, 0:3:2]
array([[1, 3],
       [5, 7]])
m1[::2, ::2]  # Alternate rows and columns
array([[ 1, 3],
       [ 9, 11]])

m1[::-1]
array([[ 9, 10, 11, 12],
       [ 5, 6, 7, 8],
       [ 1, 2, 3, 4]])
m1[::-1, ::-1]
array([[12, 11, 10, 9],
       [ 8, 7, 6, 5],
       [ 4, 3, 2, 1]])
m1[::-2, ::-2]
array([[12, 10],
       [ 4, 2]])
lst = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # A list of lists
lst[0]
[1, 2, 3, 4]  # Element is a list
lst[0, 0]  # not truly 2D
TypeError: list indices must be integers or slices, not tuple
lst[0][0]
1
lst[::2]
[[1, 2, 3, 4], [9, 10, 11, 12]]
lst[::2][::2]
[[1, 2, 3, 4]]

Accessing the ndarray 2: Indexing with list
You can provide a list as an index (this is not supported in Python's list). For example,
m1 = np.arange(12).reshape(3, 4)
m1
array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11]])
m1[:, [1, 3]]  # Select a list of columns
array([[ 1, 3],
       [ 5, 7],
       [ 9, 11]])
m1[[0, 1], :]  # Select a list of rows
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
m1[[2, 0], :]  # Select a list of rows and re-arrange
array([[ 8, 9, 10, 11],
       [ 0, 1, 2, 3]])

m1[[0, 1], [1, 3]]  # Elements [0, 1] and [1, 3]
array([1, 7])
m1[[0, 1, 2], [1, 3, 1]]  # Elements [0, 1], [1, 3] and [2, 1]
array([1, 7, 9])
m1[[0, 1], [1, 3, 1]]
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (2,) (3,)

Accessing the ndarray 3: Filtering (Selection) via a boolean list/ndarray
You can filter a NumPy ndarray with a boolean list or ndarray. This is not supported in Python's list.
m1 = np.arange(12).reshape(3, 4)
m1
array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11]])

m1[[True, False, False]]  # Filtering rows according to bool value (axis-0)
array([[0, 1, 2, 3]])
m1[[True, False, True]]
array([[ 0, 1, 2, 3],
       [ 8, 9, 10, 11]])

m1[:, [True, False, True, False]]  # Filtering columns (axis-1)
array([[ 0, 2],
       [ 4, 6],
       [ 8, 10]])

filter = np.array([[True, False, True, False], [True, False, True, False], [True, False, True, False]])
filter
array([[ True, False, True, False],
       [ True, False, True, False],
       [ True, False, True, False]])
m1[filter]
array([ 0, 2, 4, 6, 8, 10])  # 1D result
m1 > 6
array([[False, False, False, False],
       [False, False, False, True],
       [ True, True, True, True]])  # result is an ndarray
m1[m1 > 6]  # filtering with a boolean ndarray
array([ 7, 8, 9, 10, 11])

4.5 The ndarray's Operators
The Overloaded Element-wise Assignment Operator (=) for Multi-dimensional Indexing and Slicing
The ndarray's assignment operator (=) is overloaded to support element-wise assignment for indexing and slicing. This is not supported in Python's list.
m1 = np.arange(1, 10).reshape((3, 3))  # 2D
m1
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

m1[0, 0] = 99
m1
array([[99, 2, 3],
       [ 4, 5, 6],
       [ 7, 8, 9]])

m1[::2, ::2] = 0  # Alternate rows and columns
m1
array([[0, 2, 0],
       [4, 5, 6],
       [0, 8, 0]])

m1[[True, False, False]] = 88  # Selecting rows via a bool list on axis-0
m1
array([[88, 88, 88],
       [ 4, 5, 6],
       [ 0, 8, 0]])

m1 = 0  # But this re-assigns the variable to a new value
m1
0
lst = [1, 2, 3, 4, 5, 6, 7, 8, 9]
lst[0] = 99  # Supports indexing with assignment
lst
[99, 2, 3, 4, 5, 6, 7, 8, 9]
lst[0:2] = 0  # No element-wise slicing assignment
TypeError: can only assign an iterable
lst[0:2] = [0, 0]  # Support slicing assignment with list
lst
[0, 0, 3, 4, 5, 6, 7, 8, 9]
lst[0:2] = [0]
lst
[0, 3, 4, 5, 6, 7, 8, 9]
lst = 0
lst
0  # But this re-assigns the variable to scalar 0

The Overloaded Element-wise Arithmetic Operators (+, -, *, /, //, %, **) and Compound Arithmetic Assignment Operators (+=, -=, *=, /=, //=, %=, **=)
The arithmetic operators such as '+', '-', '*', '/', '//', '%' and '**' are overloaded to apply element-wise. The compound assignment operators, such as '+=' and '-=', are also supported. This is not supported in Python's list.

For example,

m1 = np.array([[1, 2, 3], [4, 5, 6]])
m1
array([[1, 2, 3],
       [4, 5, 6]])
m2 = np.array([[11, 12, 13], [14, 15, 16]])
m2
array([[11, 12, 13],
       [14, 15, 16]])
m1 + m2
array([[12, 14, 16],
       [18, 20, 22]])
m1 - m2
array([[-10, -10, -10],
       [-10, -10, -10]])
m1 * m2  # element-wise multiplication (not matrix multiplication)
array([[11, 24, 39],
       [56, 75, 96]])
m2 / m1  # float divide
array([[ 11. , 6. , 4.33333333],
       [ 3.5 , 3. , 2.66666667]])
m2 // m1  # floor integer divide
array([[11, 6, 4],
       [ 3, 3, 2]], dtype=int32)
m2 % m1  # modulus (remainder)
array([[0, 0, 1],
       [2, 0, 4]], dtype=int32)
m2 ** m1  # exponential (power)
array([[ 11, 144, 2197],
       [ 38416, 759375, 16777216]], dtype=int32)

np.add(m1, m2)
array([[12, 14, 16],
       [18, 20, 22]])
np.subtract(m1, m2)
array([[-10, -10, -10],
       [-10, -10, -10]])
np.multiply(m1, m2)
array([[11, 24, 39],
       [56, 75, 96]])
np.divide(m2, m1)
array([[11. , 6. , 4.33333333],
       [ 3.5 , 3. , 2.66666667]])
np.floor_divide(m2, m1)
array([[11, 6, 4],
       [ 3, 3, 2]], dtype=int32)
np.mod(m2, m1)
array([[0, 0, 1],
       [2, 0, 4]], dtype=int32)
np.power(m2, m1)
array([[ 11, 144, 2197],
       [ 38416, 759375, 16777216]], dtype=int32)

m1 + 80  # the scalar is applied to every element
array([[81, 82, 83],
       [84, 85, 86]])

m1 += m2
m1
array([[12, 14, 16],
       [18, 20, 22]])

m3 = np.array([[1, 2, 3], [4, 5, 6]])
m3
array([[1, 2, 3],
       [4, 5, 6]])
m3 += 1  # Python does not support ++, use m3 += 1, or m3 = m3 + 1
m3
array([[2, 3, 4],
       [5, 6, 7]])
m3 -= 1
m3
array([[1, 2, 3],
       [4, 5, 6]])

lst1 = [1, 2, 3]
lst2 = [4, 5, 6]
lst1 + lst2
[1, 2, 3, 4, 5, 6]  # Concatenation, NOT element-wise addition
lst1 * lst2
TypeError: can't multiply sequence by non-int of type 'list'

The Overloaded Element-wise Comparison Operators (==, !=, <, >, <=, >=)
You can also apply comparison operators, such as ==, !=, <, <=, >, >=, element-wise. This is not supported in Python's list.
For example,
m1 = np.array(1, 222, 13], [44, 5, 66) m2 = np.array(11, 12, 13], [14, 15, 16) m1 < m2 array([[ True, False, False], [False, True, False]]) m1 == m2 array([[False, False, True], [False, False, False]])
m1 == 44 array([[False, False, False], [ True, False, False]])
m1[m1 < m2] array([1, 5]) numpy.any(a, axis=None), ndarray.any(axis=None): Test if ANY element along a given axis evaluates to True. numpy.all(a, axis=None), ndarray.all(axis=None): Test if ALL elements along a given axis evaluates to True. m1 = np.arange(10).reshape(2, 5) m1 array([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]) m1 > 3 # element-wise comparison array([[False, False, False, False, True], [ True, True, True, True, True]]) (m1 > 3).any() # or True (m1 > 3).all() # and False
np.any([-1, 0, 5]) # 0 evaluates to False True
np.all([-1, 0, 5]) False
np.all([[True, False, True], [True, True, False]], axis=0) # Column-wise
array([ True, False, False])
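As a small sketch of how the comparison operators, boolean indexing, any() and all() fit together, the snippet below builds a mask from two conditions and then summarizes it. The variable names are illustrative only and do not come from the text above. Note that element-wise conditions are combined with & and | (each condition in parentheses), not with Python's and/or.

```python
import numpy as np

m = np.arange(10).reshape(2, 5)      # [[0 1 2 3 4], [5 6 7 8 9]]

# Combine element-wise conditions with & (and), | (or), ~ (not).
mask = (m > 2) & (m % 2 == 0)        # even values greater than 2
selected = m[mask]                   # boolean indexing returns a 1D array

rows_with_hit = mask.any(axis=1)     # does each row contain at least one match?
cols_all_hit = mask.all(axis=0)      # does every row match in a given column?
count = np.count_nonzero(mask)       # number of True entries in the mask

print(selected, rows_with_hit, cols_all_hit, count)
```

Counting the True entries of a mask this way is often handier than filtering first and taking the length of the result.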
4.7 The ndarray's Functions Multiplication: numpy.dot(a, b) The numpy.dot() performs different operations depending on the dimension of the array. It is NOT always the dot product or matrix multiplication.
v1 = np.array([1, 2, 3]) v2 = np.array([4, 5, 6]) m1 = np.arange(1, 10).reshape(3, 3) m1 array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) m2 = np.arange(9, 0, -1).reshape(3, 3) m2 array([[9, 8, 7], [6, 5, 4], [3, 2, 1]])
help(np.dot)
np.dot(v1, v2) 32
np.dot(m1, m2) array([[ 30, 24, 18], [ 84, 69, 54], [138, 114, 90]]) np.matmul(m1, m2) # matrix multiplication array([[ 30, 24, 18], [ 84, 69, 54], [138, 114, 90]]) m1 @ m2 array([[ 30, 24, 18], [ 84, 69, 54], [138, 114, 90]])
np.dot(2, m1) array([[ 2, 4, 6], [ 8, 10, 12], [14, 16, 18]]) 2 * m1 array([[ 2, 4, 6], [ 8, 10, 12], [14, 16, 18]]) np.multiply(m1, 2) array([[ 2, 4, 6], [ 8, 10, 12], [14, 16, 18]])
np.dot(m1, v1) array([14, 32, 50])
np.dot(v1, m1) array([30, 36, 42])
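The cases above can be summarized in a short sketch that uses the @ operator next to numpy.dot(). The names v and m are illustrative. For 1D and 2D operands the two agree; the one difference shown here is that only numpy.dot() accepts a scalar operand.

```python
import numpy as np

v = np.array([1, 2, 3])
m = np.arange(1, 10).reshape(3, 3)

inner = v @ v            # 1D @ 1D: inner product (a scalar)
mat_vec = m @ v          # 2D @ 1D: matrix-vector product, shape (3,)
vec_mat = v @ m          # 1D @ 2D: vector-matrix product, shape (3,)
mat_mat = m @ m          # 2D @ 2D: matrix multiplication, shape (3, 3)

# np.dot agrees with @ in all of these cases ...
same = np.array_equal(np.dot(m, v), mat_vec) and np.array_equal(np.dot(m, m), mat_mat)

# ... but only np.dot accepts a scalar operand; `2 @ m` raises an error.
scaled = np.dot(2, m)

print(inner, mat_vec, vec_mat, same, scaled.shape)
```

For plain scaling, 2 * m is the clearer spelling; the scalar form of numpy.dot() is shown only to illustrate that its behavior depends on the operands.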
4.8 Universal Functions (ufunc) and Aggregate Functions
A universal function (ufunc) operates on each element of an array and returns a new array of the same shape, for example, numpy.sin(ndarray) and numpy.sqrt(ndarray).

An aggregate function operates on an array and returns a single result, for example, numpy.sum(ndarray), numpy.min(ndarray) and numpy.mean(ndarray). You can apply it to the entire array, or along a particular axis with the keyword argument axis=n.

NumPy's Aggregate Statistical Functions
sum(), mean(), std(), min(), max()
cumsum() (cumulative sum; unlike the others, it returns an array of running totals rather than a single value)
More
You can invoke these functions via either NumPy's module-level functions or ndarray's member methods, e.g., ndarray.sum() or numpy.sum(ndarray).
For examples,
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m1
array([[11, 22, 33], [44, 55, 66]])
m1.sum()
231
np.sum(m1) # Same as above
231
m1.min()
11
np.max(m1)
66
m1.sum(axis = 0) # sum column-wise array([55, 77, 99]) np.sum(m1, axis = 0) # Same as above array([55, 77, 99]) m1.sum(axis = 1) # sum row-wise array([ 66, 165]) m1.cumsum(axis = 0) # cumulative sum column-wise array([[11, 22, 33], [55, 77, 99]]) m1.cumsum(axis = 1) # cumulative row-wise array([[ 11, 33, 66], [ 44, 99, 165]]) m1.cumsum() # default, operate on a flatten array array([ 11, 33, 66, 110, 165, 231], dtype=int32) m1.ravel() # flatten the array array([11, 22, 33, 44, 55, 66]) NumPy's Universal Mathematical Functions NumPy provides mathematical functions, such as:
numpy.sin(ndarray), numpy.cos(ndarray), numpy.tan(ndarray) numpy.exp(ndarray), numpy.sqrt(ndarray) numpy.pi, numpy.e more These functions are NumPy's module-level functions. They operate on each element of the array and return an array of the same size.
For examples,
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m1
array([[11, 22, 33], [44, 55, 66]])
np.sqrt(m1)
array([[3.31662479, 4.69041576, 5.74456265], [6.63324958, 7.41619849, 8.1240384 ]])
np.exp(m1)
array([[5.98741417e+04, 3.58491285e+09, 2.14643580e+14], [1.28516001e+19, 7.69478527e+23, 4.60718663e+28]])
np.sin(m1)
array([[-0.99999021, -0.00885131, 0.99991186], [ 0.01770193, -0.99975517, -0.02655115]])
Iterator
m1 = np.array([[11, 22, 33], [44, 55, 66]])
for row in m1: print(row, type(row)) [11 22 33] <class 'numpy.ndarray'> [44 55 66] <class 'numpy.ndarray'>
for row in m1: for col in row: print(col, end=', ') 11, 22, 33, 44, 55, 66,
for item in m1.flat: print(item, end=' ')
11 22 33 44 55 66
In general, avoid iterating over the elements: an explicit Python loop is much slower than the equivalent vectorized operation.
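To make the advice about avoiding iteration concrete, here is a minimal sketch that computes the same quantity twice: once with an explicit loop over m.flat, and once with a single vectorized expression. The task (sum of the squares of the even elements) is only an illustrative choice.

```python
import numpy as np

m = np.array([[11, 22, 33], [44, 55, 66]])

# Loop version: visit every element explicitly.
total_loop = 0
for item in m.flat:
    if item % 2 == 0:
        total_loop += item ** 2

# Vectorized version: filter, square and sum without a Python-level loop.
total_vec = int(np.sum(m[m % 2 == 0] ** 2))

print(total_loop, total_vec)
```

Both versions give the same total; the vectorized form is shorter and leaves the looping to NumPy's compiled code.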
4.9 numpy.apply_along_axis(func, axis, ndarray) Apply the given func along the axis for the ndarray. For examples,
m1 = np.array([[1, 2, 3], [4, 5, 6]])
np.apply_along_axis(np.sum, 0, m1) # axis-0 is column-wise
array([5, 7, 9]) # return an ndarray
np.apply_along_axis(np.sum, 1, m1) # axis-1 is row-wise
array([ 6, 15])
np.apply_along_axis(lambda x: print(x, type(x)), 0, m1) [1 4] <class 'numpy.ndarray'> [2 5] <class 'numpy.ndarray'> [3 6] <class 'numpy.ndarray'> array([None, None, None], dtype=object)
np.apply_along_axis(lambda v: v+1, 0, m1) # v and v+1 is ndarray array([[2, 3, 4], [5, 6, 7]])
np.apply_along_axis(lambda v: v.max()-v.min(), 0, m1) # range array([3, 3, 3]) 4.10 More NumPy's Functions Shape (Dimension) Manipulation reshape(): return an array with modified shape. resize(): modifies this array. ravel(): flatten the array. transpose() You can invoke these functions via NumPy's module-level function or ndarray member functions, e.g., numpy.reshape(ndarray, newShape) or ndarray.reshape(newShape).
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m2 = m1.reshape(3, 2) # Return a new array m2 array([[11, 22], [33, 44], [55, 66]]) m1 array([[11, 22, 33], [44, 55, 66]]) m3 = np.reshape(m1, (3, 2)) # using NumPy's module-level function m3 array([[11, 22], [33, 44], [55, 66]])
m1.resize(3, 2) # Resize this array m1 array([[11, 22], [33, 44], [55, 66]]) m1.shape = (2, 3) # Same as resize() m1 array([[11, 22, 33], [44, 55, 66]])
m1.ravel() # Flatten to 1D; returns a new array object and leaves m1's shape unchanged
array([11, 22, 33, 44, 55, 66])
m1.resize(6) # Same result as ravel(), but modifies m1 in place
m1
array([11, 22, 33, 44, 55, 66])
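Two details of the shape functions above are easy to miss, so here is a short sketch under illustrative names: a dimension given as -1 is inferred from the total size, and reshape() or ravel() on a contiguous array typically returns a view that shares data with the original, whereas flatten() always returns a copy.

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)

flat = m.reshape(-1)        # -1: infer the length (6) from the total size
tall = m.reshape(-1, 2)     # infer the row count: shape (3, 2)

# flat is a view here, so writing through it changes m as well.
flat[0] = 99

# flatten() always copies, so this write does not reach m.
copied = m.flatten()
copied[1] = -1

print(tall.shape, m[0, 0], m[0, 1], np.shares_memory(flat, m))
```

When an independent array is needed after reshaping, calling copy() on the result makes that explicit.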
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m1
array([[11, 22, 33], [44, 55, 66]])
m1 = m1.transpose() # transpose() returns a transposed view; m1 is rebound to it
m1
array([[11, 44], [22, 55], [33, 66]])
Stacking Arrays
numpy.vstack(tup): stack 2 or more arrays vertically.
numpy.hstack(tup): stack 2 or more arrays horizontally.
numpy.column_stack(tup): stack 2 or more 1D arrays as columns.
numpy.row_stack(tup): stack 2 or more 1D arrays as rows (an alias of vstack(), deprecated as of NumPy 2.0).
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m2 = np.arange(6).reshape(2, 3)
m2
array([[0, 1, 2], [3, 4, 5]])
np.vstack((m1, m2)) array([[11, 22, 33], [44, 55, 66], [ 0, 1, 2], [ 3, 4, 5]])
np.hstack((m1, m2)) array([[11, 22, 33, 0, 1, 2], [44, 55, 66, 3, 4, 5]])
v1 = np.array([1, 2, 3, 4]) v2 = np.array([11, 12, 13, 14]) v3 = np.array([21, 22, 23, 24]) np.row_stack((v1, v2, v3)) array([[ 1, 2, 3, 4], [11, 12, 13, 14], [21, 22, 23, 24]]) np.column_stack((v1, v2, v3)) array([[ 1, 11, 21], [ 2, 12, 22], [ 3, 13, 23], [ 4, 14, 24]]) Splitting an Array numpy.hsplit(arr, sections): split horizontally into equal partitions numpy.vsplit(arr, sections): split vertically into equal partitions. numpy.split(arr, sections, axis=0): split into equal partitions along the axis. numpy.array_split(arr, sections, axis=0): For examples,
m1 = np.arange(1, 13).reshape(3, 4) m1 array([[ 1, 2, 3, 4], [ 5, 6, 7, 8], [ 9, 10, 11, 12]])
np.hsplit(m1, 2) [array([[ 1, 2], [ 5, 6], [ 9, 10]]), array([[ 3, 4], [ 7, 8], [11, 12]])]
a, b = np.hsplit(m1, 2) # with assignment a array([[ 1, 2], [ 5, 6], [ 9, 10]]) b array([[ 3, 4], [ 7, 8], [11, 12]])
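numpy.array_split() is listed above without a description, so the following sketch shows how it differs from numpy.split(): it tolerates a size that does not divide evenly. It also shows that split() accepts a list of split positions instead of a section count. The 1D array v is an illustrative stand-in.

```python
import numpy as np

v = np.arange(10)

parts = np.array_split(v, 3)     # sizes 4, 3, 3: the leading parts take the extras
sizes = [len(p) for p in parts]

try:
    np.split(v, 3)               # 10 elements cannot form 3 equal parts
    split_failed = False
except ValueError:
    split_failed = True

# A list of indices splits at those positions instead of into N equal parts.
head, middle, tail = np.split(v, [2, 7])   # v[:2], v[2:7], v[7:]

print(sizes, split_failed, head, middle, tail)
```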
np.vsplit(m1, 3) # must be equal partitions
[array([[1, 2, 3, 4]]), array([[5, 6, 7, 8]]), array([[ 9, 10, 11, 12]])]
Filling an Array with a Scalar: fill()
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m1
array([[11, 22, 33], [44, 55, 66]])
m1.fill(0)
m1
array([[0, 0, 0], [0, 0, 0]])
Copying an array: copy()
Assigning one array to another variable via the assignment operator (=) simply assigns the reference, e.g.,
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m2 = m1
m2
array([[11, 22, 33], [44, 55, 66]])
m2 is m1
True # Same reference (pointing to the same object)
m1[0, 0] = 99 m2 array([[99, 22, 33], [44, 55, 66]]) To generate a new copy, use copy() function:
m1 = np.arange(1, 9).reshape(2, 4) m1 array([[1, 2, 3, 4], [5, 6, 7, 8]])
m2 = m1.copy() m1 is m2 False # holding different objects m1[0, 0] = 99 # modify m1 m2 array([[1, 2, 3, 4], # m2 not affected [5, 6, 7, 8]])
m3 = np.copy(m1) # using NumPy's module-level function m3 array([[99, 2, 3, 4], [ 5, 6, 7, 8]]) m3 is m1 False view(): creates a new array object that looks at the same data, i.e., shallow copy. A slice of array produces a view. copy(): makes a complete (deep) copy of the array and its data. 4.11 Vectorization and Broadcasting NumPy makes full use of vectorization in its implementation, where you do not need to use an explicit loop to iterate through the elements of an ndarray. For example, you can simply write m1 + m2 to perform element-wise addition, instead of writing an explicit loop.
Broadcasting allows NumPy to carry out some operations between two (or more) arrays of different shapes, subject to certain constraints.
In NumPy, two arrays are compatible if the lengths of each dimension (shape) are the same, or one of the lengths is 1. For example, suppose that m1's shape is (3, 4, 1) and m2's shape is (3, 1, 4), m1 and m2 are compatible because d0 has the same length, and one of the lengths on d1 and d2 is 1.
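The compatibility rule can be checked without building large arrays. The sketch below assumes NumPy 1.20 or later, where numpy.broadcast_shapes() is available; it returns the shape of the broadcast result or raises ValueError when the shapes are incompatible.

```python
import numpy as np

# Shapes from the example above: compatible, and the result takes
# the larger length in each dimension.
shape = np.broadcast_shapes((3, 4, 1), (3, 1, 4))

# A shorter shape is padded with 1s on the left before the comparison.
shape2 = np.broadcast_shapes((3, 4), (4,))

try:
    np.broadcast_shapes((3, 4), (3,))     # trailing lengths 4 and 3 differ
    compatible = True
except ValueError:
    compatible = False

a = np.ones((3, 4, 1))
b = np.ones((3, 1, 4))
print(shape, shape2, compatible, (a + b).shape)
```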
Broadcasting is carried out on NumPy as illustrated in the following example:
m1 = np.arange(1, 13).reshape(3, 4) m1 array([[ 1, 2, 3, 4], [ 5, 6, 7, 8], [ 9, 10, 11, 12]]) m2 = np.array([1, 1, 1, 1]) m1.shape (3, 4) m2.shape (4,) m1 + m2 array([[ 2, 3, 4, 5], [ 6, 7, 8, 9], [10, 11, 12, 13]]) Clearly, m1 and m2 have different shapes, but NumPy is able to carry out the addition via broadcasting. The steps for broadcasting is as follows:
1. If the arrays have different ranks (numbers of dimensions), pad the shape of the lower-rank array with 1s on the left. In the example, m2's shape (4,) is treated as (1, 4). Now, m1 and m2 are compatible.
2. If the arrays are compatible, stretch each dimension of length 1 to match the other array through repetition. Hence, m2 is extended to: array([[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]])
3. NumPy is now able to carry out the addition, element-wise.
However, the operation will fail if the arrays are not compatible, for example,
m1 = np.arange(1, 13).reshape(3, 4) m1 array([[ 1, 2, 3, 4], [ 5, 6, 7, 8], [ 9, 10, 11, 12]]) m3 = np.array([2, 2, 2]) m3 array([2, 2, 2]) m1 + m3 ValueError: operands could not be broadcast together with shapes (3,4) (3,) 4.12 Structured Arrays An ndarray can hold records, typically in the form of tuples, instead of plain scalar. It is called structured array. For example,
m1 = np.array([(1, 'a', 1.11), (2, 'b', 2.22)]) m1 array([['1', 'a', '1.11'], ['2', 'b', '2.22']], dtype='<U11')
m1 = np.array([(1, 'a', 1.11), (2, 'b', 2.22)], dtype=('i4, U11, f8')) m1 array([(1, 'a', 1.11), (2, 'b', 2.22)], dtype=[('f0', '<i4'), ('f1', '<U11'), ('f2', '<f8')]) m1.shape (2,) # 1D of tuples m1[0] (1, 'a', 1.11) m1[0, 0] # 1D IndexError: too many indexes for array
m2 = np.array([(1, 'a', 1.11), (2, 'b', 2.22)], dtype=[('idx', 'i4'), ('v1', 'U11'), ('v2', 'f8')]) m2 array([(1, 'a', 1.11), (2, 'b', 2.22)], dtype=[('idx', '<i4'), ('v1', '<U11'), ('v2', '<f8')]) m2.shape (2,)
m2['idx'] array([1, 2])
m2['v1'] array(['a', 'b'], dtype='<U11') m2['v2'] array([1.11, 2.22]) 4.13 Saving/Loading from Files Saving/Loading from Files in Binary Format: save() and load() NumPy provides a pair of functions called load() and save() for reading and writing an ndarray in binary format. For example,
m1 = np.random.rand(3, 4) m1 array([[0.72197242, 0.90794499, 0.07341204, 0.59910337], [0.37028474, 0.82666762, 0.68453112, 0.80082228], [0.53934751, 0.89862448, 0.78529266, 0.8680931 ]])
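np.save('data', m1) # Writes data.npy; the '.npy' extension is appended automatically
m2 = np.load('data') # FileNotFoundError: the file on disk is named data.npy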
m2 = np.load('data.npy') m2 array([[0.72197242, 0.90794499, 0.07341204, 0.59910337], [0.37028474, 0.82666762, 0.68453112, 0.80082228], [0.53934751, 0.89862448, 0.78529266, 0.8680931 ]]) Saving/Loading from Text File: savetxt(), loadtxt(), and genfromtxt() NumPy provides a pair of functions called savetxt() and loadtxt() to save/load an ndarray from a text file, such as CSV (Comma-Separated Values) or TSV (Tab-Separated Values). For example,
m1 = np.arange(1, 11).reshape(2, 5) m1 array([[ 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10]]) np.savetxt('data.csv', m1, fmt='%d', delimiter=',')
m2 = np.loadtxt('data.csv', delimiter=',') m2 array([[ 1., 2., 3., 4., 5.], [ 6., 7., 8., 9., 10.]]) m3 = np.loadtxt('data.csv', delimiter=',', dtype='int') # Set data type m3 array([[ 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10]]) NumPy provides another function called genfromtxt() to handle structured arrays. For example, create the following CSV file called data1.csv with missing data points and header:
i1,i2,f1,f2,u1,u2
1,,3.33,4.44,'a1','a2'
6,7,,9.99,,'b2'
m1 = np.genfromtxt('data1.csv', delimiter=',', names=True, dtype=('i4, i4, f4, f8, U11, U11')) m1 array([(1, -1, 3.33, 4.44, 'aa1', 'aa2'), (6, 7, nan, 9.99, '', 'bb2')], dtype=[('i1', '<i4'), ('i2', '<i4'), ('f1', '<f4'), ('f2', '<f8'), ('u1', '<U11'), ('u2', '<U11')]) # Structured array of tuples of records # Missing int is replaced by -1, missing float by nan (not a number), missing string by empty string m1['i2'] # index by column name array([-1, 7]) m1['f1'] array([3.33, nan], dtype=float32) m1['u1'] array(['aa1', ''], dtype='<U11') m1[1] # usual indexing (6, 7, nan, 9.99, '', 'bb2') 4.14 Statistical Operations NumPy provides statistical functions such as:
sum(), min(), max()
amin(), amax(), ptp() (range of values)
nanmin(), nanmax(): ignore nan
average(): weighted average
mean(), median(), std(), var(), percentile()
nanmean(), nanmedian(), nanstd(), nanvar(), nanpercentile(): ignore nan
corrcoef() (correlation coefficient), correlate() (cross-correlation between two 1D arrays), cov() (covariance)
histogram(), histogram2d(), histogramdd(), bincount(), digitize()
You can invoke most of these functions via ndarray's member function ndarray.func(*args), or NumPy's module-level function numpy.func(ndarray, *args).
For examples,
m1 = np.array([[11, 22, 33], [44, 55, 66]])
m1
array([[11, 22, 33], [44, 55, 66]])
m1.mean() # All elements, using ndarray member function 38.5 np.mean(m1) # Using NumPy's module-level function 38.5 m1.mean(axis = 0) # Over the rows array([ 27.5, 38.5, 49.5]) np.mean(m1, axis = 0) array([27.5, 38.5, 49.5]) m1.mean(axis = 1) # Over the columns array([ 22., 55.]) 4.15 Linear Algebra numpy.transpose(): numpy.trace(): numpy.eye(dim): create an identity matrix numpy.dot(a1, a2): compute the dot product. For 1D, it is the inner product. For 2D, it is equivalent to matrix multiplication. numpy.linalg.inv(m): compute the inverse of matrix m numpy.linalg.eig(m): compute the eigenvalues and right eigenvectors of square matrix m. numpy.linalg.solve(a, b): Solving system of linear equations ax = b.
a = np.array([[1, 3, -2], [3, 5, 6], [2, 4, 3]])
a
array([[ 1, 3, -2], [ 3, 5, 6], [ 2, 4, 3]])
b = np.array([[5], [7], [8]])
b
array([[5], [7], [8]])
x = np.linalg.solve(a, b)
x
array([[-15.], [ 8.], [ 2.]])
np.dot(a, x) # matrix multiplication ax (=b)
array([[5.], [7.], [8.]])
np.linalg.inv(a) array([[ 2.25, 4.25, -7. ], [-0.75, -1.75, 3. ], [-0.5 , -0.5 , 1. ]])
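The results above can be verified numerically. The sketch below reuses the same a and b and compares with numpy.allclose(), since floating-point results are rarely exact. The result names (inv_ok, solve_ok, same_x) are illustrative.

```python
import numpy as np

a = np.array([[1, 3, -2], [3, 5, 6], [2, 4, 3]])
b = np.array([[5], [7], [8]])

a_inv = np.linalg.inv(a)
x = np.linalg.solve(a, b)

# Compare with a tolerance rather than with ==.
inv_ok = np.allclose(a @ a_inv, np.eye(3))    # a times its inverse is the identity
solve_ok = np.allclose(a @ x, b)              # x satisfies ax = b
same_x = np.allclose(a_inv @ b, x)            # inv(a) @ b also solves ax = b

print(inv_ok, solve_ok, same_x)
```

Although inv(a) @ b yields the same x here, calling solve() directly is the usual choice because it avoids forming the inverse explicitly.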
eig = np.linalg.eig(a) eig (array([ 0.41742431, 9.58257569, -1. ]), # eigenvalues array([[-0.92194876, 0.15950867, 0.85435766], # eigenvectors corresponding to eigenvalues [ 0.32226296, 0.82139716, -0.51261459], [ 0.21484197, 0.54759811, 0.08543577]]))
np.dot(a, eig[1][:, 0]) # column 0 array([-0.38484382, 0.13452039, 0.08968026]) np.dot(eig[0][0], eig[1][:, 0]) # Scalar multiplication array([-0.38484382, 0.13452039, 0.08968026]) 4.16 Performance and Vectorization NumPy provides pre-compiled numerical routines (most of them implemented in C code) for high-performance operations, and supports vector (or parallel) computations.
For example, we use the following programs to compare the performance of NumPy's ndarray and Python's array (list):
import numpy as np
import time

size = 10000000
#size = 100000000

def using_python_array():
    startTime = time.time()
    lst1 = list(range(size)) # Python's list (range() alone is a lazy sequence in Python 3)
    lst2 = list(range(size))
    lst3 = []
    for i in range(len(lst1)): # Sequential
        lst3.append(lst1[i] + lst2[i])
    return time.time() - startTime

def using_numpy_array():
    startTime = time.time()
    m1 = np.arange(size) # NumPy's ndarray
    m2 = np.arange(size)
    m3 = m1 + m2 # Overloaded operator for element-wise addition (vectorized)
    return time.time() - startTime

t_python = using_python_array()
t_numpy = using_numpy_array()
print('Python Array:', t_python)
print('NumPy Array:', t_numpy)
print('Ratio: ', t_python // t_numpy)
#size = 10000000
#Python Array: 3.6722664833068848
#NumPy Array: 0.06250667572021484
#Ratio: 58

#size = 100000000
#Python Array: 38.09505248069763
#NumPy Array: 0.6761398315429688
#Ratio: 56
Vectorized Scalar Function: numpy.vectorize(func) -> func
An ordinary function written for scalars usually cannot be applied directly to a list or an array. You can wrap it with numpy.vectorize(func) to obtain a version that is applied element-wise. For example,
def myfunc(x): return x + 1
myfunc(5) 6
myfunc([1, 2, 3]) TypeError: can only concatenate list (not "int") to list
v_myfunc = np.vectorize(myfunc)
v_myfunc([1, 2, 3, 4]) array([2, 3, 4, 5]) # return a NumPy's array
m1 = np.array([[11, 22, 33], [44, 55, 66]])
v_myfunc(m1)
array([[12, 23, 34], [45, 56, 67]])
def my_absdiff(a, b): return a-b if a > b else b-a my_absdiff(5, 2) 3 my_absdiff(2, 5) 3 my_absdiff = np.vectorize(my_absdiff) # Same function name my_absdiff([1, 2, 3, 4, 5], 3) array([2, 1, 0, 1, 2])
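One caveat is worth a sketch: according to the NumPy documentation, numpy.vectorize() is provided mainly for convenience, not performance, because it is essentially a loop over the elements. Where a built-in array expression exists, it is usually preferable. The sketch below reproduces my_absdiff() in three ways; the result names are illustrative.

```python
import numpy as np

def my_absdiff(a, b):
    # Scalar-only logic: the `if` needs a single True/False value.
    return a - b if a > b else b - a

v_absdiff = np.vectorize(my_absdiff)

a = np.array([1, 2, 3, 4, 5])

via_vectorize = v_absdiff(a, 3)              # convenient, but loops in Python
via_ufunc = np.abs(a - 3)                    # same result from built-in array operations
via_where = np.where(a > 3, a - 3, 3 - a)    # element-wise if/else without a loop

print(via_vectorize, via_ufunc, via_where)
```

numpy.where() is the general tool for turning a scalar if/else into an element-wise expression.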
5. NumPy and Matplotlib
The plot() function can handle NumPy's ndarray, just like Python's list.
plot([x], y, [fmt], **kwargs) # Single line or point
These examples were developed and tested in Jupyter Notebook, which is convenient and productive. [TODO] Share the notebook.
5.1 Example 1: Line Chart
import matplotlib.pyplot as plt import numpy as np
x = np.linspace(-2*np.pi, 2*np.pi, 721)
sx, cx, cx2 = np.sin(x), np.cos(x), np.cos(x**2)
plt.plot(x/np.pi, sx, color='#FF6666', label='sin(x)') plt.plot(x/np.pi, cx, color='#66FF66', label='cos(x)') plt.plot(x/np.pi, cx2, color='#6666FF', label='cos(x**2)')
plt.xlabel(r'x (
5.2 Example 2: Line Chart with x-y Axis at Zero
import matplotlib.pyplot as plt import numpy as np
x = np.linspace(-2*np.pi, 2*np.pi, 721)
y = np.sin(3*x)/x
ax = plt.subplot(1, 1, 1) ax.plot(x, y, 'r-', label='sin(3*x)/x')
ax.spines['top'].set_color('none') ax.spines['right'].set_color('none')
ax.spines['bottom'].set_position(('data', 0)) ax.spines['left'].set_position(('data', 0))
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.set_xticks([-2*np.pi, -np.pi, 0, np.pi, 2*np.pi])
ax.set_xticklabels([r'$-2\pi$', r'$-\pi$', r'$0$', r'$+\pi$', r'$+2\pi$']) # Using latex symbol
ax.set_title('Line Plot with Axis at 0 (NumPy-Matplotlib)')
plt.show()
5.4 Example 4: Bar Chart (Grouped and Stacked)
import matplotlib.pyplot as plt import numpy as np
x = np.arange(5) # [0, 1, ..., 4] y1 = np.array([1, 6, 3, 5, 2]) y2 = np.array([2, 2, 4, 5, 1]) y3 = np.array([3, 3, 2, 3, 1]) x_ticklabels = ['2020', '2021', '2022', '2023', '2024'] y_colors = ['#5B9BD5', '#ED7D31', '#70AD47'] y_labels = ['Espresso', 'Cappuccino', 'Latte']
plt.figure(figsize=(6.4, 6.4)) # in inches, default is (6.4, 4.8)
plt.subplot(2, 1, 1)
plt.bar(x, y1, color=y_colors[0], tick_label=x_ticklabels, label=y_labels[0]) plt.bar(x, y2, bottom=y1, color=y_colors[1], label=y_labels[1]) plt.bar(x, y3, bottom=y1+y2, color=y_colors[2], label=y_labels[2]) plt.xlabel('Year') plt.ylabel('Sales') plt.title('Coffee & Tea Sales (NumPy-Matplotlib Stacked Bar Chart)') plt.legend() # Extracted from plt.bar()'s label
plt.subplot(2, 1, 2) bar_width = 0.3 # 3*0.3 = 0.9
plt.bar(x, y1, bar_width, color=y_colors[0], label=y_labels[0]) plt.bar(x+bar_width, y2, bar_width, color=y_colors[1], label=y_labels[1], tick_label=x_ticklabels) plt.bar(x+2*bar_width, y3, bar_width, color=y_colors[2], label=y_labels[2]) plt.xlabel('Year') plt.ylabel('Sales') plt.title('Coffee & Tea Sales (NumPy-Matplotlib Grouped Bar Chart)') plt.legend()
plt.tight_layout() # To prevent overlapping of subplots plt.show()
To create a Pandas' Series, use the constructor:
import numpy as np
import pandas as pd
help(pd.Series)
Series(data=None, index=None, dtype=None, name=None)
# data: array-like, dict, or scalar
# index: array-like or Pandas' Index object. Same length as data. Can be non-unique.
# Defaults to Pandas' RangeIndex(0, 1, ..., n-1) if not provided
Constructing a Pandas' Series 1: Using a Value-List and an Index-List
s1 = pd.Series([5, 7, 2, 5, 3], index=['a', 'b', 'c', 'd', 'a'], name='x') # non-unique index
s1
a 5
b 7
c 2
d 5
a 3
Name: x, dtype: int64
s1.index
Index(['a', 'b', 'c', 'd', 'a'], dtype='object') # An Index object
s1.values
array([5, 7, 2, 5, 3], dtype=int64) # Data values in ndarray
s1.dtype
dtype('int64')
s1.name # column name
'x'
Accessing the Series: Indexing [idx], Dot .idx, and Slicing [start:stop:step]
s1 = pd.Series([5, 7, 2, 5, 3], index=['a', 'b', 'c', 'd', 'a'], name='x')
s1['c'] # Indexing via index 2 s1.c # Same as above 2 type(s1.c) <class 'numpy.int64'> # Scalar s1['a'] # Non-unique index a 5 a 3 Name: x, dtype: int64 s1.a # Same as above a 5 a 3 Name: x, dtype: int64 type(s1.a) <class 'pandas.core.series.Series'> # A Series
s1['b':'d'] # Slicing via index b 7 c 2 d 5 Name: x, dtype: int64 s1['b':'d':2] # Slicing with step b 7 d 5 Name: x, dtype: int64 s1['a':'b'] # Cannot use non-unique index for slicing KeyError: "Cannot get left slice bound for non-unique label: 'a'"
s1[2] # Indexing via numeric position (newer pandas versions prefer s1.iloc[2])
2
s1[-1]
3
s1[::2] # Slicing via numeric position
a 5
c 2
a 3
Name: x, dtype: int64
Selection with a List of Indexes
s1[['a', 'c']]
a 5
a 3
c 2
Name: x, dtype: int64
Element-wise Operations
s1['a'] = 0 s1 a 0 b 7 c 2 d 5 a 0 Name: x, dtype: int64 Constructing a Pandas' Series 2: From a Value-List with Default Numeric Index s1 = pd.Series([5, 7, 2, 7, 3]) s1 0 5 1 7 2 2 3 7 4 3 dtype: int64 s1.index RangeIndex(start=0, stop=5, step=1) # An iterator s1.values array([5, 7, 2, 7, 3], dtype=int64)
s1[1] 7 s1[-1] # Cannot use negative index in this case! KeyError: -1
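The KeyError above arises because plain [] indexing on a Series with integer labels is treated as a label lookup. A minimal sketch of the explicit alternatives follows: .iloc always selects by position and .loc always selects by label. The Series names s and t are illustrative.

```python
import pandas as pd

s = pd.Series([5, 7, 2, 7, 3])                         # default integer labels 0..4
t = pd.Series([5, 7, 2, 5, 3], index=list('abcda'))    # string labels, 'a' repeated

last_by_position = s.iloc[-1]     # position-based: negative positions are allowed
second_by_label = s.loc[1]        # label-based: looks up the label 1

third_of_t = t.iloc[2]            # position 2, regardless of the labels
all_a = t.loc['a']                # a non-unique label returns a Series
label_slice = t.loc['b':'d']      # label slices include the end label

print(last_by_position, second_by_label, third_of_t)
print(all_a.tolist(), label_slice.tolist())
```

Spelling out .loc or .iloc removes the ambiguity between labels and positions, at the cost of a few extra characters.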
s1[::2] 0 5 2 2 4 3 dtype: int64 Constructing a Pandas' Series 3: From a NumPy's 1D ndarray arr1d = np.array([1.1, 2.2, 3.3, 4.4]) s1 = pd.Series(arr1d, index=['a', 'b', 'c', 'd']) s1 a 1.1 b 2.2 c 3.3 d 4.4 dtype: float64
arr1d[0] = 99 s1 a 99.0 b 2.2 c 3.3 d 4.4 dtype: float64 Construct a Pandas' Series 4: From another Pandas' Series s1 = pd.Series([11, 22, 33, 44], index=['a', 'b', 'c', 'd']) s2 = pd.Series(s1) s2 a 11 b 22 c 33 d 44 dtype: int64 s2 is s1 False # different objects
s1['d'] = 88 # affects s2 too
s2
a 11
b 22
c 33
d 88
dtype: int64
Constructing a Pandas' Series 5: From a Python's Dictionary as Index-Value Pairs
data = {'a': 11, 'b': 22, 'c': 33, 'd': 44} # keys are unique in a dictionary; named data to avoid shadowing the built-in dict
s1 = pd.Series(data)
s1
a 11
b 22
c 33
d 44
dtype: int64

s2 = pd.Series(data, index=['b', 'd', 'a', 'c', 'aa'])
s2
b 22.0 # Order according to index
d 44.0
a 11.0
c 33.0
aa NaN # Missing value for this index is assigned NaN
dtype: float64 # NaN is float, all elements also converted to float
6.2 Operations on Series
Operations between a Series and a Scalar
NumPy's element-wise arithmetic operators (+, -, *, /, //, %, **) and comparison operators (==, !=, >, <, >=, <=), as well as NumPy's module-level functions (such as sum(), min(), max()), are extended to support Pandas' Series. For example,
s1 = pd.Series([5, 4, 3, 2, 1], index=['a', 'b', 'c', 'd', 'e']) s1 a 5 b 4 c 3 d 2 e 1 dtype: int64
s1 + 1 a 6 b 5 c 4 d 3 e 2
s1 > 3 a True b True c False d False e False dtype: bool s1[s1 > 3] # Filtering with boolean Series a 5 b 4 dtype: int64 Operations between Two Series are Index-based Operations between Series (+, -, /, *, **) align values based on their index, which need not be the same length. The result index will be the sorted union of the two indexes. s1 = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e']) s2 = pd.Series([4, 3, 2, 1], index=['c', 'a', 'b', 'aa']) s1 a 1 b 2 c 3 d 4 e 5 dtype: int64 s2 c 4 a 3 b 2 aa 1 dtype: int64
s1 + s2 a 4.0 # this index on both Series aa NaN # this index is not in both, assign NaN b 4.0 c 7.0 d NaN e NaN dtype: float64 # All elements converted to float, as NaN is float Statistical Methods on Series NumPy's module-level statistical functions are extended to support Pandas' Series. For examples,
s1 = pd.Series([5, 4, 3, 2, 1], index=['a', 'b', 'c', 'd', 'e']) np.sum(s1) # No pd.sum() 15 s1.sum() # Same as above. 15 np.cumsum(s1) a 5 b 9 c 12 d 14 e 15 dtype: int64 NaN (Not A Number), Inf (Positive Infinity) and -Inf (Negative Infinity) The IEEE 754 standard for floating point representation supports 3 special floating point numbers (See "Data Representation" article):
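Inf (Positive Infinity): the IEEE 754 result of 1/0; all finite floats are smaller than Inf.
-Inf (Negative Infinity): the IEEE 754 result of -1/0; all finite floats are bigger than -Inf.
NaN (Not a Number): the IEEE 754 result of 0/0.
Note that in plain Python, evaluating 1/0 raises ZeroDivisionError rather than returning inf.
For examples,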
f1, f2, f3 = float('inf'), float('-inf'), float('nan') f1, f2, f3 (inf, -inf, nan) type(f1), type(f2), type(f3) (<class 'float'>, <class 'float'>, <class 'float'>)
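import math
math.isinf(f1), math.isinf(f2), math.isinf(f3) # isinf() must be qualified with the math module
(True, True, False)

initial_value = 5
lowest, highest = min(initial_value, float('inf')), max(initial_value, float('-inf')) # do not name variables min/max, which would shadow the built-ins
lowest, highest
(5, 5)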
f11, f12, f13 = math.inf, -math.inf, math.nan f11, f12, f13 (inf, -inf, nan)
f21, f22, f23 = np.inf, -np.inf, np.nan
f21, f22, f23
(inf, -inf, nan)
In data analysis, NaN is often used to represent missing data, which needs to be excluded from statistical operations. Hence, Pandas' statistical methods exclude NaN by default, whereas the corresponding ndarray methods propagate it. For examples,
m1 = np.arange(12, dtype=float).reshape(3, 4) m1[0, 1] = np.nan # nan is a float, all elements converted to float m1 array([[ 0., nan, 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.]]) m1.sum() nan m1.sum(axis=0) array([12., nan, 18., 21.])
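When the data stays in an ndarray, the nan-aware functions listed in section 4.14 give results that skip the missing entries. A minimal sketch on the same array (named m here for illustration):

```python
import numpy as np

m = np.arange(12, dtype=float).reshape(3, 4)
m[0, 1] = np.nan

total = np.nansum(m)                 # treats nan as 0 for the sum
col_sums = np.nansum(m, axis=0)      # per-column sums with nan skipped
col_means = np.nanmean(m, axis=0)    # mean over the non-nan entries only
n_missing = int(np.isnan(m).sum())   # count the missing entries

print(total, col_sums, col_means, n_missing)
```

Note that nan == nan is False, so missing entries have to be located with numpy.isnan() rather than with the == operator.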
s1 = pd.Series([1, 2, np.nan, 4, 5]) # use np.nan; the alias np.NaN was removed in NumPy 2.0
s1
0 1.0
1 2.0
2 NaN
3 4.0
4 5.0
dtype: float64 # nan is float, all elements converted to float
s1.sum()
12.0 # nan excluded
More Statistics Methods
s1 = pd.Series([3, 2, 2, 1, np.nan, 6, 8, 4], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
s1.describe()
count 7.000000 # nan excluded
mean 3.714286
std 2.497618
min 1.000000
25% 2.000000
50% 3.000000
75% 5.000000
max 8.000000
dtype: float64
s1.mean() 3.7142857142857144 np.mean(s1) # Same as above 3.7142857142857144 s1.median() 3.0 s1.std() 2.4976179127511156 s1.var() 6.238095238095238 Correlation and Covariance between 2 Series s1 = pd.Series([3, 2, 2, 1, 6, 8, 4], index=['a', 'b', 'c', 'd', 'e', 'f', 'g']) s2 = pd.Series([1, 1, 5, 1, 6, 9, 3], index=['a', 'b', 'c', 'd', 'e', 'f', 'g']) s3 = pd.Series([1, 1, 5, 1, 6, 3, 9], index=['a', 'b', 'c', 'd', 'e', 'g', 'f']) # for verifying index-align
np.corrcoef(s1, s2) array([[1. , 0.85044339], [0.85044339, 1. ]]) # 2D array np.corrcoef(s1, s3) array([[1. , 0.3327822], [0.3327822, 1. ]])
# Different results, non-index-align for NumPy's operations
s1.cov(s2) 6.57142857142857 s1.cov(s3) 6.57142857142857 # Same result, index-align
s1.corr(s2) # No np.corr() 0.8504433897747548 # Scalar s1.corr(s3) 0.8504433897747548 The Pandas' Series Member Functions unique(), value_counts(), isin(), isnull(), notnull() s1 = pd.Series([1, 2, 2, 1, 3, 3, 1], index=['a', 'a', 'b', 'c', 'c', 'b', 'a'])
s1 a 1 a 2 b 2 c 1 c 3 b 3 a 1 dtype: int64
s1.unique() # filter unique values array([1, 2, 3], dtype=int64)
s1.duplicated() # Check duplicated values a False a False b True # value 2 duplicated c True c False b True a True dtype: bool
s1.value_counts() # unique value vs counts 1 3 3 2 2 2 dtype: int64
s1.isin([2, 3, 4]) # Check if the value is in the given list element-wise, return bool a False a True b True c False c True b True a False dtype: bool s1[s1.isin([2, 3, 4])] # Filter with a boolean Series a 2 b 2 c 3 b 3 dtype: int64
s2 = pd.Series([1, 2, np.nan, 4, 5])
s2.isnull() # Check if value is NaN element-wise 0 False 1 False 2 True 3 False 4 False dtype: bool s2.notnull() # Inverse of isnull() 0 True 1 True 2 False 3 True 4 True dtype: bool s2[s2.notnull()] # Filter out NaN 0 1.0 1 2.0 3 4.0 4 5.0 dtype: float64 Sorting: sort_index(), sort_values() Ranking: rank() s1 = pd.Series([3, 2, 2, 1, 6, 8, 4], index=['a', 'd', 'b', 'c', 'c', 'e', 'a']) s1.sort_index() a 3 a 4 b 2 c 1 c 6 d 2 e 8 dtype: int64 s1.sort_values() c 1 d 2 b 2 a 3 a 4 c 6 e 8 dtype: int64 s1.rank() a 4.0 d 2.5 b 2.5 c 1.0 c 6.0 e 7.0 a 5.0 dtype: float64 6.3 Pandas' Categorical Data Type A categorical variable takes on a limited, and usually fixed, number of possible values. There are two kinds of categorical data:
Nominal (Unordered): e.g., gender, social class, blood type, country.
Ordinal (Ordered): e.g., "strongly agree" vs "agree", band 1, 2, 3, ...
Numerical operations (such as addition and division) cannot be applied to categorical data.
Pandas supports a "category" data type (dtype). All values of categorical data are either in categories or np.nan (for missing data).
Constructing a Categorical Series 1: Using dtype='category'
s1 = pd.Series(['a', 'b', 'c', 'd', 'a'], dtype='category') s1 0 a 1 b 2 c 3 d 4 a dtype: category Categories (4, object): [a, b, c, d] s1.dtype CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False)
s1.cat.categories Index(['a', 'b', 'c', 'd'], dtype='object') s1.cat.ordered False s1.cat.codes 0 0 1 1 2 2 3 3 4 0 dtype: int8
s1.value_counts() a 2 d 1 c 1 b 1 dtype: int64 Notes:
The categories are inferred from the data Always "Unordered" Constructing a Categorical Series 2: Using a CategoricalDtype
from pandas.api.types import CategoricalDtype cat = CategoricalDtype(categories=['b', 'c', 'd'], ordered=True) cat CategoricalDtype(categories=['b', 'c', 'd'], ordered=True) # ordered
s1 = pd.Series(['a', 'b', 'c', 'a'], dtype=cat) s1 0 NaN # no category 1 b 2 c 3 NaN dtype: category Categories (3, object): [b < c < d] s1.min(), s1.max() (nan, 'c') Constructing a Categorical Series 2: Converting using astype() s1 = pd.Series(['a', 'b', 'b', 'a', 'c']) s1 0 a 1 b 2 b 3 a 4 c dtype: object s2 = s1.astype('category') s2 0 a 1 b 2 b 3 a 4 c dtype: category Categories (3, object): [a, b, c] # Unordered s2.dtype CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
from pandas.api.types import CategoricalDtype cat = CategoricalDtype(categories=['b', 'c', 'd'], ordered=True) s3 = pd.Series(['a', 'b', 'c', 'a']) s3 = s3.astype(cat) s3 0 NaN 1 b 2 c 3 NaN dtype: category Categories (3, object): [b < c < d] Constructing a Categorical Series 3: via Pandas' Categorical()
cat = pd.Categorical(['a','b','c','b'], ordered=True, categories=['c', 'b', 'a']) cat [a, b, c, b] Categories (3, object): [c < b < a] # Ordered type(cat) <class 'pandas.core.arrays.categorical.Categorical'>
s1 = pd.Series(cat) s1 0 a 1 b 2 c 3 b dtype: category Categories (3, object): [c < b < a] s1.min(), s1.max() ('c', 'a') Operations on Categorical Data
s1 = pd.Series(['a', 'b', 'c', 'a']).astype(CategoricalDtype(ordered=True)) s1 0 a 1 b 2 c 3 a dtype: category Categories (3, object): [a < b < c]
s1.sort_values(inplace=True) s1 0 a 3 a 1 b 2 c dtype: category Categories (3, object): [a < b < c] [TODO] more
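As a further sketch of what an ordered categorical allows, the example below uses hypothetical survey levels (low, medium, high), which are not from the text above. Sorting follows the declared category order rather than alphabetical order, and comparisons against a category are permitted because the dtype is ordered.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical levels, declared from lowest to highest.
level = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
s = pd.Series(['medium', 'low', 'high', 'medium'], dtype=level)

by_level = s.sort_values()            # sorts by category order, not alphabetically
at_least_medium = s[s >= 'medium']    # comparisons are allowed for ordered categories
codes = s.cat.codes                   # integer position of each value in `categories`

print(by_level.tolist(), at_least_medium.tolist(), codes.tolist())
```

Alphabetical sorting would have placed 'high' first; the declared order puts it last.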
Creating a Pandas' DataFrame 1: From columns of Series, packed in a dict with Column Names
import numpy as np import pandas as pd s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c']) s1 a 1 b 2 c 3 dtype: int64 s2 = pd.Series([11, 33, 22, 44], index=['a', 'c', 'b', 'd']) s2 a 11 c 33 b 22 d 44 dtype: int64 df = pd.DataFrame({'x1': s1, 'x2': s2}) # dictionary of column-header:Series df # DataFrame is a 2D table with column header and row index # Index-align, resultant index is sorted union of both indexes x1 x2 a 1.0 11 b 2.0 22 c 3.0 33 d NaN 44 # Missing value is assigned NaN (Not A Number) which is a float # column x1 is converted to float type(df) <class 'pandas.core.frame.DataFrame'>
df.dtypes # Data types of columns x1 float64 x2 int64 dtype: object
df['x1'] # Select a column a 1.0 b 2.0 c 3.0 d NaN Name: x1, dtype: float64 df.x1 # Same as above a 1.0 b 2.0 c 3.0 d NaN Name: x1, dtype: float64 type(df['x1']) <class 'pandas.core.series.Series'> # A Series df.x1.dtype # Data type of a column dtype('float64')
df[['x1', 'x2']] # Select multiple columns with a list of column names
x1 x2
a 1.0 11
b 2.0 22
c 3.0 33
d NaN 44
df.columns # columns header Index(['x1', 'x2'], dtype='object') df.index # rows index Index(['a', 'b', 'c', 'd'], dtype='object') df.values # data array([[ 1., 11.], [ 2., 22.], [ 3., 33.], [nan, 44.]]) # Return a ndarray (of the same dtype) type(df.values) <class 'numpy.ndarray'>
df.to_csv('data_with_labels.csv') # default with column header and row index
,x1,x2 a,1.0,11 b,2.0,22 c,3.0,33 d,,44
df.to_csv('data_without_labels.csv', index=False, header=False) # No column header and row index
1.0,11 2.0,22 3.0,33 ,44 Creating a Pandas' DataFrame 2: Load from a CSV file
df1 = pd.read_csv('data_with_labels.csv') # default with column header, no row index df1 Unnamed: 0 x1 x2 0 a 1.0 11 1 b 2.0 22 2 c 3.0 33 3 d NaN 44
df2 = pd.read_csv('data_with_labels.csv', index_col=0) # First column is the row index df2 x1 x2 a 1.0 11 b 2.0 22 c 3.0 33 d NaN 44
df3 = pd.read_csv('data_without_labels.csv', names=['y1', 'y2']) # Provide column names df3 y1 y2 0 1.0 11 1 2.0 22 2 3.0 33 3 NaN 44
from io import StringIO # Python 3
rawText = StringIO("""
   x1      x2    cat
0  101.23  1.39  Medium
1  103.26  1.86  Medium
2  202.76  8.87  High
3  142.40  5.25  Medium-High
""")
rawText
<_io.StringIO object at ...>
df4 = pd.read_csv(rawText, sep=r"\s+") # 'sep' is a regex for one or more whitespace characters (raw string avoids an invalid-escape warning)
df4
   x1      x2    cat
0  101.23  1.39  Medium
1  103.26  1.86  Medium
2  202.76  8.87  High
3  142.40  5.25  Medium-High
df4.dtypes
x1     float64
x2     float64
cat    object
dtype: object
Creating a Pandas' DataFrame 3: From columns of list, packed in a dict with Column Names
lst_x1 = [1, 2, 3, 4, 5]
lst_x2 = [1.1, 2.2, 3.3, 4.4, 5.5]
lst_x3 = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame({'x1': lst_x1, 'x2': lst_x2, 'x3': lst_x3}) # dict of {columnName:lst} df x1 x2 x3 0 1 1.1 a 1 2 2.2 b 2 3 3.3 c 3 4 4.4 d 4 5 5.5 e df.dtypes x1 int64 x2 float64 x3 object dtype: object
df['x4'] = 9 # Scalar broadcasts to all rows df x1 x2 x3 x4 0 1 1.1 a 9 1 2 2.2 b 9 2 3 3.3 c 9 3 4 4.4 d 9 4 5 5.5 e 9
df['x5'] = [51, 52, 53, 54, 55] # length of list shall match index
df
   x1  x2   x3  x4  x5
0  1   1.1  a   9   51
1  2   2.2  b   9   52
2  3   3.3  c   9   53
3  4   4.4  d   9   54
4  5   5.5  e   9   55
df[['x1', 'x5', 'x2']] # Select a list of columns (note the double brackets)
   x1  x5  x2
0  1   51  1.1
1  2   52  2.2
2  3   53  3.3
3  4   54  4.4
4  5   55  5.5
df.index RangeIndex(start=0, stop=5, step=1) df.index = ['r1', 'r2', 'r3', 'r4', 'r5'] df.index Index(['r1', 'r2', 'r3', 'r4', 'r5'], dtype='object') df x1 x2 x3 x4 x5 r1 1 1.1 a 9 51 r2 2 2.2 b 9 52 r3 3 3.3 c 9 53 r4 4 4.4 d 9 54 r5 5 5.5 e 9 55
df['x3']
r1    a
r2    b
r3    c
r4    d
r5    e
Name: x3, dtype: object
df.x3 # Same as above
r1    a
r2    b
r3    c
r4    d
r5    e
Name: x3, dtype: object
df[['x1', 'x5', 'x2']] # Reorder
    x1  x5  x2
r1  1   51  1.1
r2  2   52  2.2
r3  3   53  3.3
r4  4   54  4.4
r5  5   55  5.5
df.columns Index(['x1', 'x2', 'x3', 'x4', 'x5'], dtype='object') df.columns = ['AA', 'BB', 'CC', 'DD', 'EE'] df AA BB CC DD EE r1 1 1.1 a 9 51 r2 2 2.2 b 9 52 r3 3 3.3 c 9 53 r4 4 4.4 d 9 54 r5 5 5.5 e 9 55
type(df['AA']) # Column is a Pandas' Series <class 'pandas.core.series.Series'> df['AA'] += 1 # Apply arithmetic operation df AA BB CC DD EE r1 2 1.1 a 9 51 r2 3 2.2 b 9 52 r3 4 3.3 c 9 53 r4 5 4.4 d 9 54 r5 6 5.5 e 9 55 del df['CC'] # del column df AA BB DD EE r1 2 1.1 9 51 r2 3 2.2 9 52 r3 4 3.3 9 53 r4 5 4.4 9 54 r5 6 5.5 9 55 Creating a Pandas' DataFrame 4: From NumPy's Multi-dimensional Array m = np.arange(1, 13).reshape(3, 4) m array([[ 1, 2, 3, 4], [ 5, 6, 7, 8], [ 9, 10, 11, 12]]) df1 = pd.DataFrame(m) # with default column-header and row-index
df2 = pd.DataFrame(m, index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3', 'c4'], dtype=float)
df2
    c1   c2    c3    c4
r1  1.0  2.0   3.0   4.0
r2  5.0  6.0   7.0   8.0
r3  9.0  10.0  11.0  12.0
m[0, 0] = 99 # Modify the source array
df1
   0   1   2   3
0  99  2   3   4 # This is affected when df1 shares memory with m (no copy is made by default in older pandas; with copy-on-write enabled the array is copied and df1 keeps 1)
1  5   6   7   8
2  9   10  11  12
df2
    c1   c2    c3    c4
r1  1.0  2.0   3.0   4.0 # This is not affected, because the change in dtype forces a copy
r2  5.0  6.0   7.0   8.0
r3  9.0  10.0  11.0  12.0
Creating a Pandas' DataFrame 5: From nested-list
data = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data) # default column-header and row-index
df
   0  1  2
0  1  2  3
1  4  5  6
df1 = pd.DataFrame([[1, 2, 3, 4]] * 3, columns=['a', 'b', 'c', 'd']) # Repeat the same row 3 times
df1
   a  b  c  d
0  1  2  3  4
1  1  2  3  4
2  1  2  3  4
6.5 Operations on DataFrame
Selecting a Column or a List of Columns: [colHdr|colHdrLst]
dataframe[colHdr|colHdrLst]: Access a column or a list of columns
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5], 'x2': [1.1, 2.2, 3.3, 4.4, 5.5], 'x3': ['a', 'b', 'c', 'd', 'e']}, index=['r1', 'r2', 'r3', 'r4', 'r5'])
df
    x1  x2   x3
r1  1   1.1  a
r2  2   2.2  b
r3  3   3.3  c
r4  4   4.4  d
r5  5   5.5  e
df['x2'] # Select one column with indexing
r1    1.1
r2    2.2
r3    3.3
r4    4.4
r5    5.5
Name: x2, dtype: float64
type(df['x2'])
<class 'pandas.core.series.Series'> # A one-column Series
df.x2 # Select one column with dot
r1    1.1
r2    2.2
r3    3.3
r4    4.4
r5    5.5
Name: x2, dtype: float64
df[['x3', 'x1']] # Select a list of columns (note the double brackets)
    x3  x1
r1  a   1
r2  b   2
r3  c   3
r4  d   4
r5  e   5
type(df[['x3', 'x1']])
<class 'pandas.core.frame.DataFrame'> # A multi-column DataFrame
Selecting (Filtering) Rows and Columns: loc[], iloc[], at[], iat[]
dataframe.loc[rowIdx, colHdr]: Access a group of rows and columns by label(s) or a boolean array. Allowed inputs are:
A single label, e.g., 'a'.
A list or array of labels, e.g., ['a', 'b', 'c'].
A slice object with labels, e.g., 'a':'f' (both included).
A boolean array of the same length as the axis being sliced, e.g., [True, False, True].
A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
dataframe.iloc[rowIdxI, colHdrI]: Integer-location based indexing, for selection by position. Allowed inputs are:
An integer, e.g., 5.
A list or array of integers, e.g., [4, 3, 0].
A slice object with ints, e.g., 1:7:2 (start included, end excluded).
A boolean array of the same length as the axis being sliced, e.g., [True, False, True].
A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
dataframe.at[rowIdx, colHdr]: Access a single value for a row/column label pair.
dataframe.iat[rowIdxI, colHdrI]: Access a single value for a row/column integer position pair.
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6, 7], 'x2': [1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7], 'x3': ['a', 'b', 'c', 'd', 'e', 'f', 'g']}, index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7']) df x1 x2 x3 r1 1 1.1 a r2 2 2.2 b r3 3 3.3 c r4 4 4.4 d r5 5 5.5 e r6 6 6.6 f r7 7 7.7 g
df.head(2) # First n rows, default n=5
    x1  x2   x3
r1  1   1.1  a
r2  2   2.2  b
df.tail(2) # Last n rows, default n=5
    x1  x2   x3
r6  6   6.6  f
r7  7   7.7  g
help(df.loc)
df.loc['r3'] # A single row label
x1    3
x2    3.3
x3    c
Name: r3, dtype: object
type(df.loc['r3'])
<class 'pandas.core.series.Series'> # A single row is returned as a Series
df.loc['r3':'r5'] # by row start and end index, both included
    x1  x2   x3
r3  3   3.3  c
r4  4   4.4  d
r5  5   5.5  e
type(df.loc['r3':'r5'])
<class 'pandas.core.frame.DataFrame'> # Multiple rows are returned as a DataFrame
df.loc[['r2', 'r4']] # list of indexes (note the double brackets)
    x1  x2   x3
r2  2   2.2  b
r4  4   4.4  d
df.loc['r3', 'x3'] # Single cell
'c'
df.loc[['r4', 'r2'], ['x3', 'x1']] # row and column lists
    x3  x1
r4  d   4
r2  b   2
df.loc['r2':'r4', 'x1':'x2'] # row and column slices
    x1  x2
r2  2   2.2
r3  3   3.3
r4  4   4.4
df.loc['r2':'r5':2, 'x1':'x3':2] # slices with step
    x1  x3
r2  2   b
r4  4   d
df.loc[[True, False, True, True, False, False, False]] # bool list with the same length as row axis
    x1  x2   x3
r1  1   1.1  a
r3  3   3.3  c
r4  4   4.4  d
df.loc[df['x1'] > 5] # Conditional that returns a bool Series
    x1  x2   x3
r6  6   6.6  f
r7  7   7.7  g
df.loc[lambda df: df['x1'] > 5] # A function that returns a bool Series
    x1  x2   x3
r6  6   6.6  f
r7  7   7.7  g
help(df.iloc) df.iloc[2] # by row integer index (positional indexing) x1 3 x2 3.3 x3 c Name: r3, dtype: object df.iloc[2, 2] # row and column integer indexes 'c' df.iloc[2:5] # by start (inclusive) and end (exclusive) integer index x1 x2 x3 r3 3 3.3 c r4 4 4.4 d r5 5 5.5 e df.iloc[::3, ::2] x1 x3 r1 1 a r4 4 d r7 7 g
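The difference between label slices (end included) and position slices (end excluded) is easy to miss. The sketch below rebuilds a smaller version of the DataFrame above and shows both styles selecting the same rows; `by_label`, `by_position`, `pos` and `cell` are illustrative names only.

```python
import pandas as pd

df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6, 7],
                   'x3': list('abcdefg')},
                  index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7'])

by_label = df.loc['r3':'r5']      # label slice: both ends included (r3, r4, r5)
by_position = df.iloc[2:5]        # position slice: end excluded (positions 2, 3, 4)
same_rows = by_label.equals(by_position)

# Translate labels to positions when the two styles need to be mixed
pos = df.index.get_loc('r3')
cell = df.iloc[pos, df.columns.get_loc('x3')]

print(same_rows, pos, cell)
```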
help(df.at) df.at['r3', 'x2'] 3.3 df['x2']['r3'] # Same as above 3.3 help(df.iat) df.iat[2, 1] 3.3 Testing Membership: isin() df = pd.DataFrame(np.arange(1,13).reshape(3, 4)) df 0 1 2 3 0 1 2 3 4 1 5 6 7 8 2 9 10 11 12 df.isin([5]) # Check if the data-values are in the list 0 1 2 3 0 False False False False 1 True False False False 2 False False False False df.isin([5, 8, 13]) 0 1 2 3 0 False False False False 1 True False False True 2 False False False False df[df.isin([5, 8, 13])] # Filtering with a boolean DataFrame 0 1 2 3 0 NaN NaN NaN NaN 1 5.0 NaN NaN 8.0 2 NaN NaN NaN NaN Operations on Row-Index df = pd.DataFrame(np.arange(1,13).reshape(3, 4), index=['red', 'green', 'blue']) df 0 1 2 3 red 1 2 3 4 green 5 6 7 8 blue 9 10 11 12 df.index Index(['red', 'green', 'blue'], dtype='object') df.index.is_unique True
df.drop(['green', 'red']) # Drop rows with these indexes, return a new DataFrame 0 1 2 3 blue 9 10 11 12
df.drop([0, 3], axis=1) # axis=1 for columns 1 2 red 2 3 green 6 7 blue 10 11 Arithmetic and Comparison Operations Between a DataFrame and a Scalar df = pd.DataFrame(np.arange(1,13).reshape(3, 4), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df c1 c2 c3 c4 red 1 2 3 4 green 5 6 7 8 blue 9 10 11 12 df + 10 # apply to all values element-wise c1 c2 c3 c4 red 11 12 13 14 green 15 16 17 18 blue 19 20 21 22 df * 2 c1 c2 c3 c4 red 2 4 6 8 green 10 12 14 16 blue 18 20 22 24 df < 8 c1 c2 c3 c4 red True True True True green True True True False blue False False False False Functions on DataFrame You can apply most of the NumPy's functions (such as mathematical and statistical functions) on DataFrame. For examples,
df = pd.DataFrame(np.arange(1,13).reshape(3, 4), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df c1 c2 c3 c4 red 1 2 3 4 green 5 6 7 8 blue 9 10 11 12
np.sqrt(df) c1 c2 c3 c4 red 1.000000 1.414214 1.732051 2.000000 green 2.236068 2.449490 2.645751 2.828427 blue 3.000000 3.162278 3.316625 3.464102
np.sum(df) c1 15 c2 18 c3 21 c4 24 dtype: int64 np.sum(df, axis=1) # Row-wise red 10 green 26 blue 42 dtype: int64 np.mean(df) c1 5.0 c2 6.0 c3 7.0 c4 8.0 dtype: float64 np.min(df) c1 1 c2 2 c3 3 c4 4 dtype: int32 np.cumsum(df) c1 c2 c3 c4 red 1 2 3 4 green 6 8 10 12 blue 15 18 21 24 DataFrame.apply() and DataFrame.applymap() You can apply an arbitrary function over a DataFrame via apply(func) on each column; and applymap(func) on each element.
df = pd.DataFrame(np.arange(1,13).reshape(3, 4), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df c1 c2 c3 c4 red 1 2 3 4 green 5 6 7 8 blue 9 10 11 12
help(df.apply) df.apply(np.sum) c1 15 c2 18 c3 21 c4 24 dtype: int64 df.apply(np.sum, axis=1) # Apply row-wise red 10 green 26 blue 42 dtype: int64
df.apply(lambda lst: lst.max() - lst.min()) # Find the range c1 8 c2 8 c3 8 c4 8 dtype: int64
def f(x): print(x, type(x)); return 1 # Return a scalar df.apply(f) red 1 green 5 blue 9 Name: c1, dtype: int32 <class 'pandas.core.series.Series'> # argument is a Series ...... c1 1 c2 1 c3 1 c4 1 dtype: int64 df.apply(lambda s: [1, 2, 3]) # Can return a list of the same-length c1 c2 c3 c4 red 1 1 1 1 green 2 2 2 2 blue 3 3 3 3
df.apply(lambda s: s+1) # Return a Series with incremented value c1 c2 c3 c4 red 2 3 4 5 green 6 7 8 9 blue 10 11 12 13
df.apply(lambda s: pd.Series({'min': s.min(), 'max': s.max()})) # Return a new Series c1 c2 c3 c4 min 1 2 3 4 max 9 10 11 12
df.applymap(lambda x: x+1) # Apply to each element (in pandas 2.1+, applymap() is deprecated in favor of DataFrame.map())
       c1  c2  c3  c4
red    2   3   4   5
green  6   7   8   9
blue   10  11  12  13
Statistics
df = pd.DataFrame([[4, 1, 10, 2], [6, 7, 4, 2], [8, 4, 9, 1]], index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4'])
df
       c1  c2  c3  c4
red    4   1   10  2
green  6   7   4   2
blue   8   4   9   1
help(df.describe) df.describe() c1 c2 c3 c4 count 3.0 3.0 3.000000 3.000000 mean 6.0 4.0 7.666667 1.666667 std 2.0 3.0 3.214550 0.577350 min 4.0 1.0 4.000000 1.000000 25% 5.0 2.5 6.500000 1.500000 50% 6.0 4.0 9.000000 2.000000 75% 7.0 5.5 9.500000 2.000000 max 8.0 7.0 10.000000 2.000000 df.mean() c1 6.000000 c2 4.000000 c3 7.666667 c4 1.666667 dtype: float64 df.std() # Standard deviation c1 2.00000 c2 3.00000 c3 3.21455 c4 0.57735 dtype: float64 df.var() # Variance c1 4.000000 c2 9.000000 c3 10.333333 c4 0.333333 dtype: float64 df.median() c1 6.0 c2 4.0 c3 9.0 c4 2.0 dtype: float64
df.corr() # Correlation Coefficients c1 c2 c3 c4 c1 1.000000 0.500000 -0.155543 -0.866025 c2 0.500000 1.000000 -0.933257 0.000000 c3 -0.155543 -0.933257 1.000000 -0.359211 c4 -0.866025 0.000000 -0.359211 1.000000 df.cov() # Covariance c1 c2 c3 c4 c1 4.0 3.0 -1.000000 -1.000000 c2 3.0 9.0 -9.000000 0.000000 c3 -1.0 -9.0 10.333333 -0.666667 c4 -1.0 0.0 -0.666667 0.333333 Sorting on Index and value, and Ranking df = pd.DataFrame(np.random.randint(1, 10, (3, 4)), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df c1 c2 c3 c4 red 4 5 6 8 green 8 3 7 4 blue 1 1 3 5 df.sort_index() c1 c2 c3 c4 blue 1 1 3 5 green 8 3 7 4 red 4 5 6 8 df.sort_index(axis=1, ascending=False) c4 c3 c2 c1 red 8 6 5 4 green 4 7 3 8 blue 5 3 1 1 df.sort_values('c1') c1 c2 c3 c4 blue 1 1 3 5 red 4 5 6 8 green 8 3 7 4
df.rank() c1 c2 c3 c4 red 2.0 3.0 2.0 3.0 green 3.0 2.0 3.0 1.0 blue 1.0 1.0 1.0 2.0 df.rank(axis=1) c1 c2 c3 c4 red 1.0 2.0 3.0 4.0 green 4.0 1.0 3.0 2.0 blue 1.5 1.5 3.0 4.0 Operations Between Two DataFrames df1 = pd.DataFrame(np.arange(1,13).reshape(3, 4), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df2 = pd.DataFrame(np.arange(1,10).reshape(3, 3), index=['blue', 'green', 'red'], columns=['c1', 'c2', 'c4']) df1 c1 c2 c3 c4 red 1 2 3 4 green 5 6 7 8 blue 9 10 11 12 df2 c1 c2 c4 blue 1 2 3 green 4 5 6 red 7 8 9
df1 + df2 # per [row-index, column-header] c1 c2 c3 c4 blue 10 12 NaN 15 green 9 11 NaN 14 red 8 10 NaN 13
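When one operand lacks a row or column, the `+` operator leaves NaN in the result. A possible alternative, sketched below, is the method form `DataFrame.add()` with `fill_value`, which substitutes a value for cells missing in only one of the two frames. The variable name `total` is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
                   index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4'])
df2 = pd.DataFrame(np.arange(1, 10).reshape(3, 3),
                   index=['blue', 'green', 'red'], columns=['c1', 'c2', 'c4'])

# Cells missing in exactly one frame are treated as 0 before adding,
# so column c3 (absent from df2) keeps the values of df1
total = df1.add(df2, fill_value=0)
print(total)
```

The sibling methods `sub()`, `mul()` and `div()` accept the same `fill_value` argument.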
df1 > df2 ValueError: Can only compare identically-labeled DataFrame objects df3 = pd.DataFrame(np.arange(12,0,-1).reshape(3, 4), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df3 c1 c2 c3 c4 red 12 11 10 9 green 8 7 6 5 blue 4 3 2 1 df1 > df3 c1 c2 c3 c4 red False False False False green False False True True blue True True True True df1[df1 > df3] c1 c2 c3 c4 red NaN NaN NaN NaN green NaN NaN 7.0 8.0 blue 9.0 10.0 11.0 12.0 Operations Between a DataFrame and a Series df1 = pd.DataFrame(np.arange(1,13).reshape(3, 4), index=['red', 'green', 'blue'], columns=['c1', 'c2', 'c3', 'c4']) df1 c1 c2 c3 c4 red 1 2 3 4 green 5 6 7 8 blue 9 10 11 12 s1 = pd.Series([1, 2, 3], index=['c4', 'c3', 'c2']) df1 + s1 # Apply to each row aligning the column-header c1 c2 c3 c4 red NaN 5.0 5.0 5.0 green NaN 9.0 9.0 9.0 blue NaN 13.0 13.0 13.0 df1 * s1 c1 c2 c3 c4 red NaN 6.0 6.0 4.0 green NaN 18.0 14.0 8.0 blue NaN 30.0 22.0 12.0 6.6 Handling Missing Data Missing data are assigned NaN (Not A Number). You can use functions dataframe.isnull() to check for NaN, or dataframe.fillna(value) to fill NaN with value.
df = pd.DataFrame({'c1': pd.Series([1, 2, 3]), 'c2': pd.Series([11, 22, 33, 44, 55])}) df c1 c2 0 1.0 11 1 2.0 22 2 3.0 33 3 NaN 44 4 NaN 55 len(df) 5 df.isnull() c1 c2 0 False False 1 False False 2 False False 3 True False 4 True False df[df['c1'].isnull()] c1 c2 3 NaN 44 4 NaN 55 len(df[df['c1'].isnull()]) 2 df[~df['c1'].isnull()] c1 c2 0 1.0 11 1 2.0 22 2 3.0 33 df = df[~df['c1'].isnull()] # Remove missing data rows df c1 c2 0 1.0 11 1 2.0 22 2 3.0 33
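Besides boolean filtering with `isnull()`, pandas offers a few shortcuts for missing data. The sketch below uses the same two-column DataFrame and shows counting, dropping, and two filling strategies; the result names (`n_missing`, `dropped`, and so on) are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'c1': pd.Series([1, 2, 3]),
                   'c2': pd.Series([11, 22, 33, 44, 55])})

n_missing = df.isnull().sum()                       # number of NaN per column
dropped = df.dropna()                               # drop rows containing any NaN
filled_mean = df.fillna({'c1': df['c1'].mean()})    # fill column c1 with its mean
filled_ffill = df.ffill()                           # carry the last valid value forward

print(n_missing.to_dict())
print(len(dropped))
print(filled_mean['c1'].tolist())
print(filled_ffill['c1'].tolist())
```

Which strategy is appropriate depends on the data: dropping rows loses information, while filling may bias statistics computed later.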
df = pd.DataFrame({'c1': pd.Series([1, 2, 3]), 'c2': pd.Series([11, 22, 33, 44, 55])}) df.fillna(99) c1 c2 0 1.0 11 1 2.0 22 2 3.0 33 3 99.0 44 4 99.0 55 6.7 Query the Data df = pd.DataFrame({'c1': pd.Series([1, 2, 3]), 'c2': pd.Series([11, 22, 33, 44, 55])}) df c1 c2 0 1.0 11 1 2.0 22 2 3.0 33 3 NaN 44 4 NaN 55
help(df.query) df.query('c1 < 2') c1 c2 0 1.0 11 df.query('c1 < 2 | c2 <= 32') c1 c2 0 1.0 11 1 2.0 22 df.query('~(c1 < 2 | c2 <= 32)') c1 c2 2 3.0 33 3 NaN 44 4 NaN 55 df.query('c2 > c1') c1 c2 0 1.0 11 1 2.0 22 2 3.0 33 6.8 Hierarchical Multi-Level Indexing For supporting multi-dimensional data in 2D tabular structure of DataFrame.
df = pd.DataFrame(np.random.randint(1, 10, (8, 2)), index=[['i1', 'i1', 'i1', 'i2', 'i2', 'i3', 'i3', 'i3'], ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'd']], columns=['c1', 'c2']) # Two-level row index (random values, yours will differ)
df
      c1  c2
i1 a  8   3
   b  9   8
   c  7   3
i2 a  9   3
   c  9   2
i3 a  4   3
   b  7   1
   d  5   2
df.index
MultiIndex(levels=[['i1', 'i2', 'i3'], ['a', 'b', 'c', 'd']], labels=[[0, 0, 0, 1, 1, 2, 2, 2], [0, 1, 2, 0, 2, 0, 1, 3]]) # 'labels' is called 'codes' in pandas 0.24+
df.loc['i2'] # Select by the outer level
   c1  c2
a  9   3
c  9   2
df.loc['i2', 'c'] # Select by both levels
c1    9
c2    2
Name: (i2, c), dtype: int32
df = pd.DataFrame(np.random.randint(1, 10, (5, 6)), index=[['i1', 'i1', 'i1', 'i2', 'i2'], ['a', 'b', 'c', 'a', 'c']], columns=[['c1', 'c1', 'c1', 'c2', 'c2', 'c2'], ['x', 'y', 'z', 'x', 'y', 'z']]) # Two-level row index and two-level column header
df
      c1       c2
      x  y  z  x  y  z
i1 a  2  1  9  3  2  8
   b  4  3  6  6  3  1
   c  8  4  8  7  3  5
i2 a  6  6  1  7  1  8
   c  9  1  8  7  3  5
df.index
MultiIndex(levels=[['i1', 'i2'], ['a', 'b', 'c']], labels=[[0, 0, 0, 1, 1], [0, 1, 2, 0, 2]])
df.columns
MultiIndex(levels=[['c1', 'c2'], ['x', 'y', 'z']], labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df['c2']
      x  y  z
i1 a  3  2  8
   b  6  3  1
   c  7  3  5
i2 a  7  1  8
   c  7  3  5
df['c2', 'y']
i1  a    2
    b    3
    c    3
i2  a    1
    c    3
Name: (c2, y), dtype: int32
df['c2']['y'] # Same values as above, via chained selection
i1  a    2
    b    3
    c    3
i2  a    1
    c    3
Name: y, dtype: int32
df = pd.DataFrame(np.random.randint(1, 10, 8), index=[['i1', 'i1', 'i1', 'i2', 'i2', 'i3', 'i3', 'i3'], ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'd']])
df
      0
i1 a  3
   b  3
   c  3
i2 a  1
   c  2
i3 a  9
   b  7
   d  1
df.unstack() # Pivot the inner row-index level into columns
    0
    a    b    c    d
i1  3.0  3.0  3.0  NaN
i2  1.0  NaN  2.0  NaN
i3  9.0  7.0  NaN  1.0
6.9 Grouping
df = pd.DataFrame({'Cat 1': ['A', 'C', 'B', 'A', 'B', 'C', 'D'], 'Cat 2': ['X', 'Z', 'Y', 'Y', 'X', 'Z', 'Z'], 'Value': np.random.randint(1, 10, 7)})
df
   Cat 1  Cat 2  Value
0  A      X      9
1  C      Z      9
2  B      Y      2
3  A      Y      7
4  B      X      4
5  C      Z      5
6  D      Z      2
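Because the example above uses random values, the selections are easier to follow with fixed data. The sketch below builds the same two-level row index with `pd.MultiIndex.from_arrays()` and adds a cross-section and a per-level aggregation. The level names `outer` and `inner` and the cell values are assumptions made for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['i1', 'i1', 'i1', 'i2', 'i2', 'i3', 'i3', 'i3'],
     ['a', 'b', 'c', 'a', 'c', 'a', 'b', 'd']],
    names=['outer', 'inner'])           # hypothetical level names
df = pd.DataFrame({'c1': range(1, 9), 'c2': range(11, 19)}, index=idx)

outer_i2 = df.loc['i2']                 # all rows under outer label 'i2'
one_row = df.loc[('i2', 'c')]           # one row, selected by a tuple of labels
inner_a = df.xs('a', level='inner')     # cross-section: inner label 'a' under every outer label
per_outer = df.groupby(level='outer').sum()   # aggregate over the inner level

print(outer_i2)
print(one_row.tolist())
print(inner_a['c1'].tolist())
print(per_outer['c1'].tolist())
```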
df['Cat 1'].unique()
array(['A', 'C', 'B', 'D'], dtype=object)
df['Cat 2'].unique()
array(['X', 'Z', 'Y'], dtype=object)
df['Value'].unique()
array([9, 2, 7, 4, 5], dtype=int64)
df['Cat 1'].min()
'A'
df['Value'].max()
9
df['Cat 1'].sum() # concatenation
'ACBABCD'
df['Cat 1'].describe()
count     7
unique    4
top       B
freq      2
Name: Cat 1, dtype: object
group_cat1 = df.groupby('Cat 1') group_cat1.size() Cat 1 A 2 B 2 C 2 D 1 dtype: int64 group_cat1.count() Cat 2 Value Cat 1 A 2 2 B 2 2 C 2 2 D 1 1 group_cat1.describe() Value count mean std min 25% 50% 75% max Cat 1 A 2.0 8.0 1.414214 7.0 7.5 8.0 8.5 9.0 B 2.0 3.0 1.414214 2.0 2.5 3.0 3.5 4.0 C 2.0 7.0 2.828427 5.0 6.0 7.0 8.0 9.0 D 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0 group_cat1.sum() Value Cat 1 A 16 B 6 C 14 D 2 group_cat1.aggregate(np.sum) Value Cat 1 A 16 B 6 C 14 D 2 group_cat1.agg([np.sum, np.mean]) Value sum mean Cat 1 A 16 8 B 6 3 C 14 7 D 2 2
group_cat1_cat2 = df.groupby(['Cat 1', 'Cat 2']) group_cat1_cat2.size() Cat 1 Cat 2 A X 1 Y 1 B X 1 Y 1 C Z 2 D Z 1 dtype: int64 group_cat1_cat2.count() Value Cat 1 Cat 2 A X 1 Y 1 B X 1 Y 1 C Z 2 D Z 1 group_cat1_cat2.describe() Value count mean std min 25% 50% 75% max Cat 1 Cat 2 A X 1.0 9.0 NaN 9.0 9.0 9.0 9.0 9.0 Y 1.0 7.0 NaN 7.0 7.0 7.0 7.0 7.0 B X 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0 Y 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0 C Z 2.0 7.0 2.828427 5.0 6.0 7.0 8.0 9.0 D Z 1.0 2.0 NaN 2.0 2.0 2.0 2.0 2.0
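The grouping results above can be reproduced with fixed values instead of random ones. The sketch below reuses the `Value` column as displayed earlier and adds two things not shown in the text: named aggregation (`agg(name='func')`) and `pivot_table()` as a two-way view of the same grouping. The output column names `total`, `average` and `n` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Cat 1': ['A', 'C', 'B', 'A', 'B', 'C', 'D'],
                   'Cat 2': ['X', 'Z', 'Y', 'Y', 'X', 'Z', 'Z'],
                   'Value': [9, 9, 2, 7, 4, 5, 2]})   # fixed values copied from the display above

# Named aggregation: each keyword becomes an output column
summary = df.groupby('Cat 1')['Value'].agg(total='sum', average='mean', n='count')

# Two-way grouping laid out as a table: rows by 'Cat 1', columns by 'Cat 2'
pivot = df.pivot_table(index='Cat 1', columns='Cat 2', values='Value', aggfunc='sum')

print(summary)
print(pivot)
```

Combinations that never occur, such as ('A', 'Z'), appear as NaN in the pivot table.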
7. Pandas Data Analysis
Data analysis involves these steps:
Data Preparation
Data Transformation (Map)
Data Aggregation (Reduce)
7.1 Data Preparation: Loading/Saving Data to Files
Pandas supports reading from and writing to files in many formats:
read_csv(), to_csv(): Comma-Separated Values
read_table(file, sep='\t'): sep takes a regex, such as r'\s+' (one or more whitespace characters)
read_json(), to_json()
read_html(), to_html()
read_excel(), to_excel(): MS Excel
read_hdf(), to_hdf(): Hierarchical Data Format
read_sql(), to_sql()
read_stata(), to_stata()
read_clipboard(), to_clipboard()
read_pickle(), to_pickle(): Python Object Serialization
You invoke read_xxx() via pandas (e.g., pd.read_csv()), and to_xxx() via a DataFrame (e.g., df.to_csv()).
df = pd.DataFrame(np.arange(1, 13).reshape(3, 4), index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3', 'c4']) df c1 c2 c3 c4 r1 1 2 3 4 r2 5 6 7 8 r3 9 10 11 12
df.to_csv('data.csv') ,c1,c2,c3,c4 r1,1,2,3,4 r2,5,6,7,8 r3,9,10,11,12 df.to_csv('data_no_label.csv', index=False, header=False) 1,2,3,4 5,6,7,8 9,10,11,12 df_in = pd.read_csv('data.csv') df_in Unnamed: 0 c1 c2 c3 c4 0 r1 1 2 3 4 1 r2 5 6 7 8 2 r3 9 10 11 12 df_in = pd.read_csv('data.csv', index_col=0) df_in c1 c2 c3 c4 r1 1 2 3 4 r2 5 6 7 8 r3 9 10 11 12 df_in1 = pd.read_table('data.csv', sep=',', index_col=0) df_in1 .....
df.to_json('data.json') {"c1":{"r1":1,"r2":5,"r3":9}, "c2":{"r1":2,"r2":6,"r3":10}, "c3":{"r1":3,"r2":7,"r3":11}, "c4":{"r1":4,"r2":8,"r3":12}} df_in = pd.read_json('data.json') df_in c1 c2 c3 c4 r1 1 2 3 4 r2 5 6 7 8 r3 9 10 11 12
df.to_html('data.html') # HTML table of rows and columns. Check it out
SQL - need a database connection
Excel - check out the resultant xls
[TODO]
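For the SQL case, a database connection is needed. A minimal sketch, assuming an in-memory SQLite database from Python's standard `sqlite3` module, is shown below; the table name `demo` and the index column name `row_id` are hypothetical.

```python
import sqlite3

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(3, 4),
                  index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3', 'c4'])

conn = sqlite3.connect(':memory:')            # throwaway in-memory database
df.to_sql('demo', conn, index_label='row_id') # write the DataFrame, row index stored as column 'row_id'
df_in = pd.read_sql('SELECT * FROM demo', conn, index_col='row_id')  # read it back
conn.close()

print(df_in)
```

For other database engines, pandas expects an SQLAlchemy connection instead of a raw DB-API connection.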
Pandas-Matplotlib Line Plot: sin(x), cos(x), cos(x**2) for x=[-2pi, 2pi]
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Generate x: linearly spaced in 1-degree intervals, both ends included
x = np.linspace(-2*np.pi, 2*np.pi, 721)
# Generate y's
sx, cx, cx2 = np.sin(x), np.cos(x), np.cos(x**2)
# Create Pandas DataFrame
df = pd.DataFrame({'sin(x)': sx, 'cos(x)': cx, 'cos(x**2)': cx2}, index=x)
df.index.name = 'x'
print(df.head())
# Plot through DataFrame and get axes handle for further customizing
ax = df.plot.line(title='Sines and Cosines (Pandas-Matplotlib Line Plot)', xlim=(-2*np.pi, 2*np.pi), ylim=(-1, 1))
# Set the x-tick locations and labels
ax.set_xticks([-2*np.pi, -np.pi, 0, np.pi, 2*np.pi])
ax.set_xticklabels([r'$-2\pi$', r'$-\pi$', r'$0$', r'$+\pi$', r'$+2\pi$']) # Using latex symbol
# Set ylabel. xlabel picked up from index-column's header
ax.set_ylabel('y')
plt.show()
Raw Data Types: Raw data can come in many types:
Categorical:
Nominal: no intrinsic order, e.g., cat A, B, C,...
Ordinal: has a predetermined order, e.g., band 1, 2, 3 with ordering
Numerical:
Discrete: can be counted with distinct values
Continuous: from analog measurements
Raw Data Formats: Raw data could take the format of:
CSV (Comma-Separated Values)
JSON (JavaScript Object Notation)
XLS (Excel Spreadsheet)
XML (Extensible Markup Language)
HTML (Hypertext Markup Language)
HDF (Hierarchical Data Format)
SQL (Structured Query Language)
Others
Data Analysis Process
The data analysis process consists of these stages:
Problem identification and definition.
Data preparation: gathering, extraction, cleaning, transformation.
Data exploration and visualization.
Predictive Modeling: classification models (categorical data), regression models (numeric data), clustering models (descriptive data).
Model validation and testing: training set, validation (testing) set.
Deployment and interpretation of predictive results.
9.2 The Iris Flower Dataset (for Supervised Classification)
The Iris Flower Dataset was first used by Sir Ronald Fisher in 1936. It is often also called Anderson's Iris Dataset, after the person who collected the data. The dataset has:
4 input features (the length and width of the sepals, and the length and width of the petals). Input features are numerical and continuous.
1 output target of 3 categories (species of iris: Iris setosa, Iris versicolor, and Iris virginica). Target is categorical and nominal (unordered).
150 samples, 50 samples per output category, no missing data.
You can load the iris dataset from scikit-learn as follows:
Check out the dataset
Check out the input features
Check out the output target
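A minimal sketch of loading and inspecting the dataset, assuming scikit-learn is installed. It uses `datasets.load_iris()` and the attributes `data`, `target`, `feature_names` and `target_names` of the returned object, all of which appear in the examples that follow.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Check out the dataset
print(iris.keys())

# Check out the input features: 150 samples x 4 features
print(iris.feature_names)
print(iris.data.shape)
print(iris.data[:3])

# Check out the output target: integer codes 0, 1, 2 and their species names
print(iris.target_names)
print(np.bincount(iris.target))   # number of samples per category
```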
import pandas as pd
from sklearn import datasets
# Setup DataFrame
iris = datasets.load_iris()
# Shorten the feature names to fit the display (strip the trailing ' (cm)')
for i in range(len(iris.feature_names)):
    iris.feature_names[i] = iris.feature_names[i][0:-5]
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['category'] = iris.target # Add the output target column
df.dtypes
sepal length    float64
sepal width     float64
petal length    float64
petal width     float64
category        int32
dtype: object
# Print samples
df.head()
   sepal length  sepal width  petal length  petal width  category
0  5.1           3.5          1.4           0.2          0
1  4.9           3.0          1.4           0.2          0
2  4.7           3.2          1.3           0.2          0
3  4.6           3.1          1.5           0.2          0
4  5.0           3.6          1.4           0.2          0
# Setup Pandas printing format for float
pd.options.display.float_format = '{:,.2f}'.format
# Describe ALL Categories
df.describe()
       sepal length  sepal width  petal length  petal width  category
count  150.00        150.00       150.00        150.00       150.00
mean   5.84          3.06         3.76          1.20         1.00
std    0.83          0.44         1.77          0.76         0.82
min    4.30          2.00         1.00          0.10         0.00
25%    5.10          2.80         1.60          0.30         0.00
50%    5.80          3.00         4.35          1.30         1.00
75%    6.40          3.30         5.10          1.80         2.00
max    7.90          4.40         6.90          2.50         2.00
# Describe Category 0
print('Cat 0:', iris.target_names[0])
df.loc[df['category'] == 0].describe()
Cat 0: setosa
       sepal length  sepal width  petal length  petal width  category
count  50.00         50.00        50.00         50.00        50.00
mean   5.01          3.43         1.46          0.25         0.00
std    0.35          0.38         0.17          0.11         0.00
min    4.30          2.30         1.00          0.10         0.00
25%    4.80          3.20         1.40          0.20         0.00
50%    5.00          3.40         1.50          0.20         0.00
75%    5.20          3.68         1.58          0.30         0.00
max    5.80          4.40         1.90          0.60         0.00
# Describe Category 1
print('Cat 1:', iris.target_names[1])
df.loc[df['category'] == 1].describe()
Cat 1: versicolor
       sepal length  sepal width  petal length  petal width  category
count  50.00         50.00        50.00         50.00        50.00
mean   5.94          2.77         4.26          1.33         1.00
std    0.52          0.31         0.47          0.20         0.00
min    4.90          2.00         3.00          1.00         1.00
25%    5.60          2.52         4.00          1.20         1.00
50%    5.90          2.80         4.35          1.30         1.00
75%    6.30          3.00         4.60          1.50         1.00
max    7.00          3.40         5.10          1.80         1.00
# Describe Category 2
print('Cat 2:', iris.target_names[2])
df.loc[df['category'] == 2].describe()
Cat 2: virginica
       sepal length  sepal width  petal length  petal width  category
count  50.00         50.00        50.00         50.00        50.00
mean   6.59          2.97         5.55          2.03         2.00
std    0.64          0.32         0.55          0.27         0.00
min    4.90          2.20         4.50          1.40         2.00
25%    6.23          2.80         5.10          1.80         2.00
50%    6.50          3.00         5.55          2.00         2.00
75%    6.90          3.18         5.88          2.30         2.00
max    7.90          3.80         6.90          2.50         2.00
Scatter Plot for Each of the Input Feature vs. Category
It is very hard to visualize the statistics of the categories from the tables alone. Let's do a scatter plot for each of the input features vs. category.
# Scatter plot for each of the input features vs output category
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
# Scatter plot on each of input feature columns
fig, ax = plt.subplots(2, 2, figsize=(8.0, 6.4))
fig.suptitle('Input Feature vs. Category')
for feature_col in [0, 1, 2, 3]: # for each feature
    ax_row, ax_col = feature_col//2, feature_col%2
    ax[ax_row][ax_col].scatter(iris.data[:, feature_col], iris.target, c='red', s=8)
    ax[ax_row][ax_col].set_xlabel(iris.feature_names[feature_col])
    ax[ax_row][ax_col].set_ylabel('category')
    ax[ax_row][ax_col].set_yticks([0, 1, 2])
    ax[ax_row][ax_col].set_yticklabels([0, 1, 2])
fig.tight_layout() # Prevent subplots overlap
fig.subplots_adjust(top=0.9) # Prevent figure-title overlaps
plt.show()
Observation: Each of the input features is closely related to the output category, except sepal width.
Sepal Length: cat 0 has the smallest, cat 2 has the largest.
Sepal Width: not quite related.
Petal Length: cat 0 has the smallest, cat 2 has the largest, clearly separated.
Petal Width:
Correlation coefficient is not applicable to categorical nominal (unordered) data.
KNN (K-Nearest Neighbors) is:
Non-Parametric: there is no assumption about the underlying data distribution. The KNN decision boundary could be irregular.
A Non-generalizing Lazy Learner: there is no need to train a model. It simply remembers all its training data, possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree. All computation is deferred until classification (hence "lazy learner").
KNN suffers from the "curse of dimensionality" (Euclidean distance becomes much less informative in high dimensions because all vectors are almost equidistant to the search query vector). It is also sensitive to the local structure of the data.
9.4 KNN on Iris Dataset
Scikit-learn supports KNN via module sklearn.neighbors (@ https://scikit-learn.org/stable/modules/neighbors.html).
K-Nearest Neighbors (KNN) Supervised Classification
In supervised learning, you need to provide both the input features and the output target. scikit-learn implements two nearest neighbors classifiers:
KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user. The optimal choice of the value is highly data-dependent: in general a larger k suppresses the effects of noise, but makes the classification boundaries less distinct.
RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. This is a better choice if the data is not uniformly sampled.
Three algorithms are supported: BallTree, KDTree, and a brute-force approach based on sklearn.metrics.pairwise (i.e., compare with each of the training samples), which can be chosen via keyword argument algorithm='auto'|'ball_tree'|'kd_tree'|'brute'. When the default 'auto' is used, it attempts to determine the best approach from the training data.
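The `algorithm` argument only changes how neighbors are searched, not what a nearest neighbor is, so the three explicit choices are expected to give closely matching results on a small dataset such as iris. The sketch below fits one classifier per algorithm and scores each on the training data itself; the dictionary name `train_accuracy` is illustrative. Ties among equidistant neighbors may be broken differently, so identical predictions are not guaranteed.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

train_accuracy = {}
for algorithm in ['ball_tree', 'kd_tree', 'brute']:
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(iris.data, iris.target)
    # score() returns the mean accuracy on the given samples (here, the training set)
    train_accuracy[algorithm] = knn.score(iris.data, iris.target)

for algorithm, acc in train_accuracy.items():
    print(algorithm, round(acc, 3))
```

On larger datasets the choice matters mainly for speed and memory, which is why the default `'auto'` is usually a reasonable starting point.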
The distance metric can be chosen via the keyword argument metric. Valid values include:
from scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']
from scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']
Example: Iris Flower Dataset
# KNN Supervised Classification
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
# Prepare data (features and target) for training
x = iris.data # features
y = iris.target
# Split the data into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15) # Randomized
print('Number of Training Set samples:', len(x_train))
print('Number of Test Set samples:', len(x_test))
# KNN Supervised Classifier
from sklearn.neighbors import KNeighborsClassifier
n_neighbors, weights = 5, 'uniform'
knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
knn.fit(x_train, y_train) # Provide the features and target
# Get the prediction on test set
y_predict = knn.predict(x_test)
# Compare prediction and actual
print(y_predict == y_test)
# Check Accuracy
from sklearn import metrics
print("Accuracy is:", metrics.accuracy_score(y_test, y_predict))
Number of Training Set samples: 127
Number of Test Set samples: 23
[ True False True True True True True False True True True True True True True True True True True True True True True]
Accuracy is: 0.9130434782608695
Observation: With the training-testing set split of 85%:15%, 2 of the 23 test set samples are misclassified in this run. The split is randomized, so your results will vary from run to run.
Try:
Try k (n_neighbors) of 5, 10, 15
Try weights='distance' (instead of 'uniform')
Use 10%, 15%, 20% for test set.
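The suggested experiments with `n_neighbors` and `weights` can be sketched as a small grid. Note that this sketch swaps in 5-fold cross-validation (`cross_val_score`) for the single random train/test split used above, so that the comparison does not depend on one particular split; the dictionary name `results` is illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

results = {}
for k in (5, 10, 15):
    for weights in ('uniform', 'distance'):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        # Mean accuracy over 5 stratified folds
        scores = cross_val_score(knn, iris.data, iris.target, cv=5)
        results[(k, weights)] = scores.mean()

for (k, weights), acc in sorted(results.items()):
    print(f'k={k:2d} weights={weights:8s} mean accuracy={acc:.3f}')
```

On a dataset this small the differences between settings are usually minor, so they should be read as indicative rather than conclusive.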
KNN Classifier Decision Boundary for Sepal Length/Width
# Plot the Decision Boundary using only 2 input features: Sepal length and width
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets, neighbors
# Prepare data for training - only use sepal length and width
iris = datasets.load_iris()
xy = iris.data[:, 0:2] # Input - sepal length (x) and width (y)
z = iris.target # Output - species
x_min, x_max = xy[:,0].min()-0.5, xy[:,0].max()+0.5
y_min, y_max = xy[:,1].min()-0.5, xy[:,1].max()+0.5
# Setup color meshgrid
step = 0.02
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
cmap_rgb_light = ListedColormap(['#FFCCCC','#CCFFCC','#CCCCFF']) # for 3 classes
# Run KNN supervised classifier
n_neighbors = 15 # default is 5
weights = 'uniform' # default, to try 'distance'
knn = neighbors.KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
knn.fit(xy, z) # Features and target
# Run prediction on all points on the meshgrid
z_predict = knn.predict(np.c_[xx.ravel(), yy.ravel()]) # column stack
z_predict = z_predict.reshape(xx.shape) # back to 2D
# Plot color mesh on prediction (decision boundary)
plt.pcolormesh(xx, yy, z_predict, cmap=cmap_rgb_light)
# Overlay the training points
x, y = xy[:, 0], xy[:, 1]
cmap_rgb_dark = ListedColormap(['#FF4444','#44FF44','#4444FF']) # darker
plt.scatter(x, y, c=z, cmap=cmap_rgb_dark, s=12) # s: marker size
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('KNN Classifier Decision Boundary on Sepal length and width')
plt.show()
Notes:
The decision boundary is irregular.
Try other values for n_neighbors and weights='distance'.
Try on petal length and width.
KNN - Unsupervised Nearest Neighbors
In unsupervised learning, you provide the input features, but do not need to provide the output target.
sklearn.neighbors.NearestNeighbors implements unsupervised nearest neighbors learning. It acts as a uniform interface to three algorithms: BallTree, KDTree, and a brute-force approach based on sklearn.metrics.pairwise (i.e., comparing the query with each of the training samples). The algorithm is chosen via the keyword argument algorithm='auto'|'ball_tree'|'kd_tree'|'brute'. With the default 'auto', it attempts to determine the best approach from the training data.
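The algorithm choices described above affect how the search is carried out, not which neighbors are found. As a rough illustration, the sketch below fits NearestNeighbors on the Iris features with each explicit algorithm and looks up the same query point. The query values and the `distances` dictionary are assumptions made for this illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors

iris = datasets.load_iris()

# An arbitrary query point (illustrative values), shaped as one sample x 4 features
query = np.array([[5.3, 2.1, 2.2, 2.4]])

distances = {}
for algorithm in ('ball_tree', 'kd_tree', 'brute'):
    nn = NearestNeighbors(n_neighbors=7, algorithm=algorithm)
    nn.fit(iris.data)                  # features only, no target
    dist, idx = nn.kneighbors(query)   # both have shape (1, 7)
    distances[algorithm] = dist[0]
    print(algorithm, idx[0], np.round(dist[0], 3))
```

The neighbor distances should agree across the three algorithms up to floating-point rounding. The index order may differ where two training samples are equally distant from the query.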
KNN - Find K Nearest Neighbors via Unsupervised Learning

    import numpy as np
    from sklearn import datasets
    iris = datasets.load_iris()

    from sklearn.neighbors import NearestNeighbors
    knn = NearestNeighbors()   # Construct an instance, default n_neighbors=5
    knn.fit(iris.data)         # Only training samples, no target for unsupervised training

    # Find the K-nearest neighbors for a test sample
    test = np.array([5.3, 2.1, 2.2, 2.4])
    test = test.reshape(1, -1)   # reshape to a 2D array of 1 sample x 4 features
    results = knn.kneighbors(test, 7)
    print(results)                        # (distances, indexes) of the nearest neighbors
    print(iris.data[results[1][0]])       # Get the features of the nearest neighbors
    print(iris.target[results[1][0]])     # Get the target of the nearest neighbors

Output:

    (array([[1.59059737, 1.81659021, 1.8493242 , 1.93649167, 1.97484177, 2.01494417, 2.06397674]]),   # distances to the nearest neighbors, sorted
     array([[98, 93, 57, 60, 64, 79, 59]], dtype=int64))   # indexes of the nearest neighbors
    [[5.1 2.5 3.  1.1]    # features
     [5.  2.3 3.3 1. ]
     [4.9 2.4 3.3 1. ]
     [5.  2.  3.5 1. ]
     [5.6 2.9 3.6 1.3]
     [5.7 2.6 3.5 1. ]
     [5.2 2.7 3.9 1.4]]
    [1 1 1 1 1 1 1]       # target

Reducing the Feature's Dimension via PCA

Reduce the feature dimension from 4 to 3 via PCA

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from mpl_toolkits.mplot3d import Axes3D   # registers the '3d' projection on older Matplotlib
    from sklearn import datasets, decomposition
    iris = datasets.load_iris()

    # Run PCA
    iris_reduced = decomposition.PCA(n_components=3).fit_transform(iris.data)
    print('Sample output records')
    print(iris_reduced[:5])
    np.savetxt('iris_reduced.csv', iris_reduced, delimiter=',')   # Save for further operation

    # Scatter Plot 3D
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')   # Axes3D(fig) no longer attaches the axes in recent Matplotlib
    ax.set_title('Iris Dataset reduced by PCA', size=14)
    cmap_rgb_dark = ListedColormap(['#FF4444','#44FF44','#4444FF'])
    ax.scatter(iris_reduced[:,0], iris_reduced[:,1], iris_reduced[:,2], c=iris.target, cmap=cmap_rgb_dark)
    ax.set_xlabel('1st eigenvector')
    ax.set_ylabel('2nd eigenvector')
    ax.set_zlabel('3rd eigenvector')
    plt.show()

Output:

    Sample output records
    [[-2.68412563  0.31939725 -0.02791483]    # 3D feature
     [-2.71414169 -0.17700123 -0.21046427]
     [-2.88899057 -0.14494943  0.01790026]
     [-2.74534286 -0.31829898  0.03155937]
     [-2.72871654  0.32675451  0.09007924]]

KNN Classifier with Reduced Dimension

KNN Supervised Classification on reduced dimension

    import numpy as np
    from sklearn import datasets
    iris = datasets.load_iris()

    # Prepare data (features and target) for training
    x = np.loadtxt('iris_reduced.csv', delimiter=',')   # Retrieved from saved file
    y = iris.target

    # Split the data into training set and test set
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)  # Randomized
    print('Number of Training Set samples:', len(x_train))
    print('Number of Test Set samples:', len(x_test))

    # KNN Supervised Classifier
    from sklearn.neighbors import KNeighborsClassifier
    n_neighbors, weights = 10, 'distance'
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    knn.fit(x_train, y_train)   # Provide the features and target

    # Get the prediction on test set
    y_predict = knn.predict(x_test)

    # Compare prediction and actual
    print(y_predict == y_test)

    # Check Accuracy
    from sklearn import metrics
    print('Accuracy is:', metrics.accuracy_score(y_test, y_predict))

Output:

    Number of Training Set samples: 127
    Number of Test Set samples: 23
    [ True True True True True True True True True True True True True True True True True True True True True True True]
    Accuracy is: 1.0

Observations: With the reduced dimension, the accuracy on this particular random split is 100%. The test set has only 23 samples and the split is randomized, so the figure varies from run to run.

Nearest Centroid Classifier

The NearestCentroid classifier is a simple algorithm that represents each class by the centroid of its members. It is similar to the label updating phase of the sklearn.cluster.KMeans algorithm. It has no parameters to choose, making it a good baseline classifier. It does, however, suffer on non-convex classes, as well as when classes have drastically different variances, as equal variance in all dimensions is assumed.

    NearestCentroid(metric='euclidean', shrink_threshold=None)

For example, plot the Decision Boundary for the Nearest Centroid Classifier using only 2 input features: Sepal length and width

    import numpy as np
    import matplotlib.pyplot as plt
    from matplotlib.colors import ListedColormap
    from sklearn import datasets, neighbors

    # Prepare data for training - only use sepal length and width
    iris = datasets.load_iris()
    xy = iris.data[:, 0:2]   # Input - sepal length (x) and width (y)
    z = iris.target          # Output - species
    x_min, x_max = xy[:,0].min()-0.5, xy[:,0].max()+0.5
    y_min, y_max = xy[:,1].min()-0.5, xy[:,1].max()+0.5

    # Setup color meshgrid
    step = 0.02
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
    cmap_rgb_light = ListedColormap(['#FFCCCC','#CCFFCC','#CCCCFF'])  # for 3 classes

    # Run Nearest Centroid Classifier
    clf = neighbors.NearestCentroid()
    clf.fit(xy, z)   # Features and target

    # Run prediction on all points on the meshgrid
    z_predict = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # column stack
    z_predict = z_predict.reshape(xx.shape)                 # back to 2D

    # Plot color mesh on prediction (decision boundary)
    plt.pcolormesh(xx, yy, z_predict, cmap=cmap_rgb_light)

    # Overlay the training points
    x, y = xy[:, 0], xy[:, 1]
    cmap_rgb_dark = ListedColormap(['#FF4444','#44FF44','#4444FF'])  # darker
    plt.scatter(x, y, c=z, cmap=cmap_rgb_dark, s=12)   # s: marker size
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.title('Nearest Centroid Classifier Decision Boundary on Sepal length and width')
    plt.show()

Nearest Shrunken Centroid Classifier

The NearestCentroid classifier has a parameter shrink_threshold, which implements the nearest shrunken centroid classifier. The value of each feature for each centroid is first divided by the within-class variance of that feature. It is then reduced by shrink_threshold. If the resultant value crosses zero, it is set to zero. In effect, this stops the feature from affecting the classification, which is useful for removing noisy features.

Example: Modify the above program for shrink_threshold of 0.1, 0.2, 0.3.

[TODO] Compare

KNN Regressors

Neighbors-based regression can be used in cases where the data labels are continuous rather than categorical. The label assigned to a query point is computed from the average of the labels of its nearest neighbors. scikit-learn implements two nearest neighbors regressors:

KNeighborsRegressor: based on the nearest k neighbors of each query point, where k is an integer value specified by the user.
RadiusNeighborsRegressor: based on the neighbors within a fixed radius r of the query point, where r is a floating-point value specified by the user.

Example: [TODO] The Iris Flower Dataset is not applicable to regression, as its target is categorical, not continuous.

Correlation

[TODO]

9.5 Wine Dataset

This dataset is the result of a chemical analysis of wines grown in the same region in Italy using three different cultivars.
13 Input Features: 'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline'.
Target: 3 types of cultivars ('class_0', 'class_1', 'class_2')
Samples: 178 (class_0: 59, class_1: 71, class_2: 48)

You can load the Wine dataset from scikit-learn's datasets:

    from sklearn import datasets
    wine = datasets.load_wine()
    print(wine.DESCR)

    Wine recognition dataset
KNN Supervised Classifier

KNN Classifier for Wine dataset

    from sklearn import datasets
    wine = datasets.load_wine()

    # Split data into training set and test set
    from sklearn.model_selection import train_test_split
    x_train, x_test, y_train, y_test = train_test_split(
        wine.data, wine.target, test_size=0.2)

    # Generate KNN Classifier model
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors=15, weights='distance')

    # Train the model using the training sets
    knn.fit(x_train, y_train)   # Feature, target

    # Predict the response for test dataset
    y_pred = knn.predict(x_test)

    # Evaluate model
    print(y_pred == y_test)   # Check prediction results
    from sklearn import metrics
    print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

Output:

    [ True True True False True True True True False True True True False False True True False True True False False True False False False True True True True False True True True True True False]
    Accuracy: 0.6666666666666666

Observations: Accuracy is about 67% on this random split (24 of the 36 test samples are predicted correctly).
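One plausible factor behind the modest accuracy above is that KNN relies on distances between raw feature values, and the Wine features are on very different scales (for instance, 'proline' takes values in the hundreds while 'hue' is close to 1). The sketch below is an assumption-laden illustration, not part of the original example: it compares the same classifier with and without standardizing the features via StandardScaler in a pipeline. The `random_state=0` and the variable names `raw_accuracy` and `scaled_accuracy` are introduced here only for this comparison.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = datasets.load_wine()

# random_state is an assumption of this sketch, so that both models see the same split
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=0)

# Same classifier settings as the example above, on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=15, weights='distance')
raw_knn.fit(x_train, y_train)
raw_accuracy = metrics.accuracy_score(y_test, raw_knn.predict(x_test))

# Standardize each feature (zero mean, unit variance) before computing distances;
# the scaler is fitted on the training set only, inside the pipeline
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=15, weights='distance'))
scaled_knn.fit(x_train, y_train)
scaled_accuracy = metrics.accuracy_score(y_test, scaled_knn.predict(x_test))

print('Accuracy on raw features:   ', raw_accuracy)
print('Accuracy on scaled features:', scaled_accuracy)
```

Placing the scaler inside the pipeline keeps the test set out of the scaling statistics, so the comparison stays fair.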
Case Study 2

[TODO]

REFERENCES & RESOURCES

Scikit-Learn Documentation @ https://scikit-learn.org/stable/documentation.html.
Fabio Nelli, Python Data Analytics, Apress, 2015.