Python Workshop 3 NumPy - cstkennedy/Python-Workshop GitHub Wiki


Title: Python Workshop 3 - NumPy Author: Thomas J Kennedy Email: [email protected]

Introduction

I am going to go for a Raymond Hettinger style presentation, https://www.cs.odu.edu/~tkennedy/cs330/latest/Public/languageResources/#python-programming-videos.

These materials are web-centric (i.e., do not need to be printed and are available at https://cstkennedy.dev/Python-Workshop/wiki/Python-Workshop-3-NumPy).

Who am I?

I have taught various courses, including:

  • CS 300T - Computers in Society
  • CS 333 - Programming and Problem Solving
  • CS 330 - Object Oriented Programming and Design
  • CS 350 - Introduction to Software Engineering
  • CS 410 - Professional Workforce Development I
  • CS 411W - Professional Workforce Development II
  • CS 417 - Computational Methods & Software

Most of my free time is spent writing Python 3 and Rust code, tweaking my Vim configuration, or learning a new (programming) language. My current language of interests are Rust (at the time of writing) and Python (specifically the NumPy library).

Referenced Courses & Materials

I may reference materials (e.g., lecture notes) and topics from various courses, including:

  • CS 330 - Object Oriented Programming & Design
  • CS 350 - Introduction to Software Engineering
  • CS 417 - Computational Methods & Software

I may also reference a couple examples from the previous:

Overview

What is NumPy?

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

Retrieved from https://numpy.org/doc/stable/user/whatisnumpy.html

Managing Expectations

We can only scratch the surface during a one (1) hour workshop.

For this workshop...

  • a basic understanding of Python is assumed.
  • no prior knowledge of NumPy is assumed.

A Few Short Examples

My original intention was to focus on a few short (10 to 20 line) examples that focused on arrays and statistics... with a little linear algebra at the end.

If you are interested in a collection of problems to develop your NumPy understanding... Coffee Break NumPy by Christian Mayer is an excellent read. In fact... I read the book during the 2020 Winter Break.

Let us start with a few short NumPy basics (e.g., arrays and broadcasting) then build a Matrix Solver!.

Creating Arrays

The code snippets from this section are extracted from array_creation.py.

NumPy arrays can be...

  • Initialized to all zeroes

        array_size = 8
        zeroes_array = np.zeros(array_size)
        print(zeroes_array)
        print()
    
  • Initialized to all ones

        array_size = 12
        ones_array = np.ones(array_size)
        print(ones_array)
        print()
    
  • Allocated and left uninitialized

        # Contents are "whatever happens to be in memory"
        array_size = 16
        unitialized_array = np.empty(array_size)
        print(unitialized_array)
        print()
    
  • Created from a Python List of int-s

        python_list = [2, 4, 8, 16, 32, 64]
        np_array = np.array(python_list)
        print(np_array)
        print()
    
  • Created from a Python List of float-s

        python_list = [2., 4., 8., 16., 32., 64.]
        np_array = np.array(python_list)
        print(np_array)
        print()
    

Broadcasting

The next couple snippets are extracted from broadcasting.py.

In Python... each element of a list must be updated one at a time. If a list of prices needed to be reduced by 10%, each one would need to be multiplied by 0.9 within a loop...

    prices = [1.00, 2.95, 8.40, 3.50, 3.30, 16.91]

    for idx in range(len(prices)):
        price[idx] *= 0.9

or using a list comprehension...

    prices = [1.00, 2.95, 8.40, 3.50, 3.30, 16.91]
    prices = [0.9 * price for price in prices]

    print(prices)

NumPy's broadcasting mechanic allows us to write a simple prices *= 0.9.

    prices = np.array([1.00, 2.95, 8.40, 3.50, 3.30, 16.91])
    prices *= 0.9

    print(prices)

    print()
    print("*" * 80)
    print()

The obvious benefit is less typing. The more important one is optimization. NumPy's core is implemented in C. The official NumPy Documentation provides a succinct overview in its Why is NumPy Fast? section.

How much faster is NumPy? Let us run a quick using benchmark_broadcasting.py.

    num_values = 1000000
    num_runs = 100

    def op_wrapper_py():
        prices = range(1, num_values, 1)
        prices = [0.9 * price for price in prices]

    py_list = timeit.timeit(op_wrapper_py, number=num_runs)

    def op_wrapper_np():
        prices = np.arange(0, num_values, 1, dtype=np.float64)
        prices[:] *= 0.9

    np_array = timeit.timeit(op_wrapper_np, number=num_runs)

    print(f"Python Time: {py_list:.4f}")
    print(f"NumPy Time : {np_array:.4f}")

On a Core i7-6700k... For 1 million numbers, run 100 times... The NumPy code is a little over 10 times faster than the pure Python code.

Time (sec)
Python 5.1248
NumPy 0.3168

Remaining Topics

There are a few more topics to introduce:

  • Indexing - will be covered as part of the Matrix Solver example
  • I/O
  • Index Arrays
  • Boolean (Mask) Index Arrays

Implementing a Matrix Solver

In CS 417/517 Computational Methods... I require students to implement a Matrix Solver... from scratch. NumPy provides implementations of matrix operations (e.g., multiplication). Let us implement a quick NumPy based Matrix solver, for this Discrete Case Least Squares Approximation Problem. The code snippets in this section are extracted from least_squares.py.

Getting Started

The first line is the NumPy import. By convention the numpy module is given the alias np.

import numpy as np

The first function (i.e., print_matrices) was debugging/instrumentation logic. It was used to display the XTX (X-Transpose-X) and XTY (X-Transpose-Y) matrices during development.

def print_matrices(matrix_XTX, matrix_XTY):
    """
    Print the XTX and XTY matrices
    """

    print("{:*^40}".format("XTX"))
    print(matrix_XTX)

    print()
    print("{:*^40}".format("XTY"))
    print(matrix_XTY)

Next... let us jump to the main function and...

  1. set up the input data (points)

       points = [(0., 0.), (1., 1.), (2., 4.)]
    
  2. create the X matrix

       matrix_X = np.array([[1., 0., 0.],
                            [1., 1., 1.],
                            [1., 2., 4.]])
    
  3. create the Y matrix

       matrix_Y = np.array([0,
                            1,
                            4])
    
  4. compute X-transpose

      matrix_XT = matrix_X.transpose()
    

After setting everything up... it is time for a couple matrix multiplications.

    matrix_XTX = np.matmul(matrix_XT, matrix_X)
    matrix_XTY = np.matmul(matrix_XT, matrix_Y)

The augmented XTX|XTY matrix is what we are going to solve. Note that the @ operator is often used instead of np.matmul.

The Solver

A Gaussian Elimination Matrix solver can viewed as 4 operations:

  1. Pivot
  2. Scale
  3. Eliminate
  4. Backsolve

For our implementation... these operations will be wrapped in a solve_matrix function...

def solve_matrix(matrix_XTX, matrix_XTY):
    """
    Solve a matrix and return the resulting solution vector
    """

    # Get the dimensions (shape) of the XTX matrix
    num_rows, num_columns = matrix_XTX.shape

    for i in range(0, num_rows):

Note the use of the .shape tuple.

Pivot

Pseudocode

    idx <- find_largest_row_by_col(A, col_index, num_rows)
    swap_row(A, i, idx)

Implementation

        # Find column with largest entry
        largest_idx = i
        current_col = i
        for j in range(i + 1, num_rows):

            if matrix_XTX[largest_idx, i] < matrix_XTX[j, current_col]:
                largest_idx = j

        if largest_idx != current_col:
            matrix_XTX[[i, largest_idx], :] = matrix_XTX[[largest_idx, i], :]
            matrix_XTY[i, largest_idx](/cstkennedy/Python-Workshop/wiki/i,-largest_idx) = matrix_XTY[largest_idx, i](/cstkennedy/Python-Workshop/wiki/largest_idx,-i)

Swap

Pseudocode

def scale_row(A, row_idx, num_cols, s):

    for every j in 0 to num_cols:
        A[row_idx][j] = A[row_idx][j] / s

Implementation

        # Scale
        scaling_factor = matrix_XTX[i, i]
        matrix_XTX[i, :] /= scaling_factor
        matrix_XTY[i] /= scaling_factor

Eliminate

Pseudocode

def eliminate(A, src_row_idx, num_cols, num_rows):

    start_col <- src_row_idx

    for every i in (src_row_idx + 1) to num_rows:

        s <- A[i][start_col]

        for every j in start_col to num_cols:
            A[i][j] = A[i][j] - s * A[src_row_idx][j]

        A[i][start_col] = 0

Implementation

        # Eliminate
        for row_i in range(i + 1, num_rows):
            s = matrix_XTX[row_i][i]

            matrix_XTX[row_i] = matrix_XTX[row_i] - s * matrix_XTX[i]
            matrix_XTY[row_i] = matrix_XTY[row_i] - s * matrix_XTY[i]

Backsolve

Pseudocode

def back_solve(matrix):

    augColIdx <- matrix.cols()  # the augmented column
    lastRow = matrix.rows() - 1

    for i in lastRow down to 1:
        for j <- (i - 1) down to 0:
            s <- matrix[j][i]

            matrix[j][i] -= (s * matrix[i][i])
            matrix[j][augCol] -= (s * matrix[i][augColIdx])

Implementation

def _backsolve(matrix_XTX, matrix_XTY):

    num_rows, _ = matrix_XTX.shape

    for i in reversed(range(1, num_rows)):
        for j in reversed(range(0, i)):
            s = matrix_XTX[j, i]

            matrix_XTX[j, i] -= (s * matrix_XTX[i, i])
            matrix_XTY[j] -= (s * matrix_XTY[i])

Statistics

Suppose we need to compute some grade statistics for the following data.

Name Homework 1 Homework 2 Exam 1 Exam 2
John 100 98 100 90
Tom 100 0 70 90
Bob 100 70 90 80

The first step is to convert the table into...

  • a list of exercises

    exercises = ["Homework 1", "Homework 2", "Exam 1", "Exam 2"]
    
  • a list of names

    students = ["John", "Tom", "Bob"]
    
  • a NumPy ndarray

    grades = np.array([[100., 98, 100., 90.],
                       [100.,  0., 70., 90.],
                       [100., 70., 90., 80.]])
    

NumPy provides a number of statistics operations that can be applied by axis. For a two-dimensional ndarray there are two choices:

  • by row (student)

    avg_by_student = grades.mean(axis=1)
    min_by_student = grades.min(axis=1)
    max_by_student = grades.max(axis=1)
    
    Name Avg Min Max
    John 97.0 90.0 100.0
    Tom 65.0 0.0 100.0
    Bob 85.0 70.0 100.0
  • by column (exercise)

    avg_by_exercise = grades.mean(axis=0)
    max_by_exercise = grades.max(axis=0)
    std_by_exercise = grades.std(axis=0)
    
    Exercise Avg Max Std Dev
    Homework 1 100.0 100.0 0.0
    Homework 2 56.0 98.0 41.2
    Exam 1 86.7 100.0 12.5
    Exam 2 86.7 90.0 4.7