Introduction to Python for Data Science - ua-datalab/Workshops GitHub Wiki

An Introduction to Python for Data Science

Python Ecosystem

Python is one of the most widely used programming languages in Data Science, along with R, Julia, and SQL.

Python is an open-source, high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

Python was created by the Dutch programmer Guido van Rossum, who released version 0.9.0 on February 20, 1991. When this page was created in May 2022, the current stable version was Python 3.10.4, released on March 24, 2022.

According to the TIOBE Index, Python is the most popular language among software developers.

Python is one of the most popular programming languages used by data scientists. It provides excellent functionality for mathematics, statistics, machine learning, data visualization, and scientific computing. Its wide collection of libraries makes it ideal for data science applications. See Kaggle free courses on Python.

Main Python Libraries

Thanks to Python's large community of developers and users, many specialized libraries are available for different tasks. Here are some key examples:

Numpy. The fundamental library for scientific computing.
SciPy. Fundamental algorithms for scientific computing.
Pandas. Basic library for Data Analysis with Python.
Polars. Polars is an open-source data manipulation library that's among the fastest single-machine data processing tools available.
Matplotlib. Basic data visualization library in Python.
Seaborn. Specialized library for statistical data visualization.
Scikit-learn. A Machine Learning library for Python.
Scikit-image. A library of algorithms for digital image processing.
Tensorflow. Specialized library for Deep Learning Models.
PyTorch. Another library for Deep Learning.
Hugging Face. A platform and set of libraries providing pretrained Deep Learning models, mainly Transformers, used in Natural Language Processing, Large Language Models, Machine Translation and Computer Vision.
OpenCV. A computer vision library.
YOLO. A family of models for real-time object detection and image segmentation, available through Python libraries.

Machine Learning and Deep Learning are beyond the scope of this workshop.

Python Programming Environments.

There are several options for working with Python.

There is an enhanced command-line shell for Python named IPython (Interactive Python).
There are GUI (graphical user interface) options such as the web-based JupyterLab and Jupyter Notebook, or the desktop environment Spyder.

In our workshops, we will use Jupyter Notebooks running on Google Colab.

Please see Slides

Working in Python.

There are two options for working in Python: Offline and Cloud-based platforms.

Offline method. You need to install Python and the required libraries on a local machine. The Anaconda distribution includes all the packages needed, and you can download the free academic license version.
Cloud-based option. Again, there are several options. We recommend using Google Colab (colab.research.google.com) with your Gmail account. If you are a student at the University of Arizona, you have access to cloud computing infrastructure like CyVerse, High-Performance Computing, and others.

Using Jupyter Notebooks in Google Colab.

Google Colab offers a basic free Python development environment on Google Cloud. It has the advantage of storing all our files in Google Drive, and it can save a copy of our code to GitHub to be shared with others.

Start your Google Colab session by logging into the platform.

📝 Note

To execute a code cell, press SHIFT+ENTER or click the run button.

Python basics.

Python, like any programming language, has data types and arithmetic operations.

📝 Code Style: Python Programming Best Practices

Use 4-space indentation and no tabs.
4 spaces are a good compromise between small indentation (allows greater nesting depth) and large indentation (easier to read). Tabs introduce confusion and are best left out.
Wrap lines so that they don’t exceed 79 characters. Use \ to break a long line.
This helps users with small displays and makes it possible to have several code files side-by-side on larger displays.
Use blank lines to separate functions and classes, and larger blocks of code inside functions.
When possible, put comments on a line of their own. (Everything to the right of # is a comment.)
Use docstrings, that is, string literals (often spanning several lines) placed at the start of a function, class, or module to document your code.
Use spaces around operators and after commas, but not directly inside bracketing constructs: a = f(1, 2) + g(3, 4).
Name your classes and functions consistently; the convention is to use UpperCamelCase for classes and lowercase_with_underscores for functions and methods.
Special symbols and non-Roman scripts can sometimes cause encoding and compatibility issues. Consider using Plain ASCII.
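
As a rough illustration of how several of these guidelines look together, here is a small sketch. The names Rectangle and rectangle_area are hypothetical and are not part of the workshop material.

```python
class Rectangle:
    """A simple rectangle described by its width and height."""

    def __init__(self, width, height):
        self.width = width
        self.height = height


def rectangle_area(rectangle):
    """Return the area of a Rectangle.

    A docstring is a string literal placed right after the def line.
    """
    # Comments sit on their own line, above the code they describe.
    return rectangle.width * rectangle.height


# Parentheses also let a long expression span several lines.
total_area = (rectangle_area(Rectangle(2, 3))
              + rectangle_area(Rectangle(4, 5)))
print(total_area)  # Prints "26"
```

The class uses UpperCamelCase, the function uses lowercase_with_underscores, and blank lines separate the two definitions.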

Using, storing, and accessing data with Python

Variables.

A variable has two parts: a name (a string of letters, digits, and underscores that does not start with a digit) and an associated piece of information (its value). We use the assignment operator “=” to assign values to variables in Python. For example, the line x = 5 assigns the value 5 to the variable named “x”. When we execute this line, the number is stored in the variable. Until the value is changed or the variable is deleted, the name x behaves like the value 5. We can manipulate the variable in many ways, such as performing mathematical operations with it or printing it.

Numbers (Integers and Floating)

Values assigned to numeric variables can be used for arithmetic calculations:

x = 5          # Define x = 5 as an Integer. 
print(type(x)) # Prints type "<class 'int'>"
print(x)       # Prints "5"
print(x + 1)   # Addition; prints "6"

Python automatically determines the type of a variable from the value assigned to it. Besides floats and integers, we can also create Boolean variables and strings:

Boolean or logical variables (True, False)

Boolean variables store truth values (True/False) in our programs.

t = True
f = False
print(type(t)) # Prints "<class 'bool'>"
print(t and f) # Logical AND; prints "False"
print(t or f)  # Logical OR; prints "True"
print(not t)   # Logical NOT; prints "False"
print(t != f)  # Logical XOR; prints "True"

📝 Some details about Boolean values

To assign a Boolean value to a variable, we use the keywords True or False after the assignment operator (note the capitalization).

True and False behave like the integers 1 and 0. We can assign a Boolean value to a variable, but we cannot use True or False as a variable name:

  False = 5
    File "<stdin>", line 1
    SyntaxError: cannot assign to False
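
To make the link between Boolean values and the integers 1 and 0 concrete, here is a minimal sketch. The list answers is made up for illustration.

```python
print(True == 1)     # Prints "True"
print(False == 0)    # Prints "True"
print(True + True)   # Booleans add like integers; prints "2"

# Because True counts as 1, sum() counts the True values in a list.
answers = [True, False, True, True]
n_true = sum(answers)
print(n_true)        # Prints "3"

# bool is a subclass of int, which is why this works.
print(isinstance(True, int))  # Prints "True"
```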

Strings

We use strings for text (alphanumeric) information:

hello = 'Hello'    # String literals can use single quotes
world = "World!"    # or double quotes; it does not matter.
print(hello)       # Prints "Hello"
print(len(hello))  # String length; prints "5"
hw = hello + ' ' + world  # String concatenation
print(hw)  # prints "Hello World!"
hw2 = '%s %s %d' % (hello, world, 2)  # sprintf style string formatting
print(hw2)  # prints "Hello World! 2"
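
Formatted string literals (f-strings) are a common alternative to the % style of formatting. The sketch below rebuilds the same message. The variable pi_approx is only an illustrative value.

```python
hello = 'Hello'
world = 'World!'

# Expressions inside {} are evaluated and inserted into the string.
hw3 = f'{hello} {world} {2}'
print(hw3)   # Prints "Hello World! 2"

# A format specifier after ":" controls how a value is displayed.
pi_approx = 3.14159
msg = f'Pi is roughly {pi_approx:.2f}'
print(msg)   # Prints "Pi is roughly 3.14"
```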

Numbers (Integers and Floating)

x = 5          # Define x = 5 as an Integer. 
print(type(x)) # Prints type "<class 'int'>"
print(x)       # Prints "5"
print(x + 1)   # Addition; prints "6"
print(x - 1)   # Subtraction; prints "4"
print(x * 2)   # Multiplication; prints "10"
print(x ** 2)  # Exponentiation; prints "25"
print(x + 1 + 3 * x) # Prints "21". The multiplication operator has precedence over addition. 
print((x + 1) + (3 * x)) # Also prints "21". Parentheses make the order of operations explicit. 
x += 1    # Equivalent to "x = x + 1"
print(x)  # Prints "6"
x *= 2    # Equivalent to "x = x * 2"
print(x)  # Prints "12"
y = 5/2   # Automatically defines y = 2.5 as a floating or real number.
print(type(y)) # Prints "<class 'float'>"
print(y, y + 1, y * 2, y ** 2) # Prints "2.5 3.5 5.0 6.25"

For more details, see this helpful tutorial.

Operators in Python

To build expressions and conditions, we need operators. Python offers comparison, logical, arithmetic, identity, and membership operators:

Operator type	Operators
Comparison	`'<', '<=', '==', '!=', '>=', '>'`
Comparison (method form)	`'.__lt__()', '.__le__()', '.__eq__()', '.__ne__()', '.__ge__()', '.__gt__()'`
Logical	`and, or, not`
Arithmetic	`'+', '-', '*', '/', '%' (modulus), '**' (exponentiation)`
Identity	`'is', 'is not'`
Membership	`'in', 'not in'`
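
A short sketch of how some of these operators behave in practice. The variable names are arbitrary.

```python
x = 7
in_range = x > 5 and x < 10   # Comparison combined with a logical operator
print(in_range)               # Prints "True"
print(x.__lt__(10))           # Method form of x < 10; prints "True"
print(x % 2)                  # Remainder of 7 / 2; prints "1"

a = [1, 2, 3]
b = a            # b refers to the same list object as a
c = [1, 2, 3]    # c is a different object with equal contents

print(a == c)      # Equal values; prints "True"
print(a is c)      # Not the same object; prints "False"
print(a is b)      # Same object; prints "True"
print(2 in a)      # Membership test; prints "True"
print(5 not in a)  # Prints "True"
```

The difference between == and is matters mostly for mutable objects such as lists: == compares contents, while is checks whether two names point to the same object.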

Data Structures.

Python has several built-in objects for storing collections of data: lists, tuples, sets, and dictionaries.

📝 Note

Python indexing starts at 0.

Lists

A list is created by using square brackets [].

xs = [1, 2, 3]    # Create a list
print(xs, xs[2])  # Prints "[1, 2, 3] 3"
print(xs[-1])     # Negative indices count from the end of the list; prints "3"
xs[2] = 'foo'     # Lists can contain elements of different types
print(xs)         # Prints "[1, 2, 'foo']"
xs.append('bar')  # Add a new element to the end of the list
print(xs)         # Prints "[1, 2, 'foo', 'bar']"
x = xs.pop()      # Remove and return the last element of the list
print(x, xs)      # Prints "bar [1, 2, 'foo']"

Slicing or accessing the contents of a list.

nums = list(range(5))     # range generates a sequence of integers; list() turns it into a list
print(nums)               # Prints "[0, 1, 2, 3, 4]"
print(nums[2:4])          # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print(nums[2:])           # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print(nums[:2])           # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print(nums[:])            # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print(nums[:-1])          # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9]        # Assign a new sublist to a slice
print(nums)               # Prints "[0, 1, 8, 9, 4]"
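
Slices also accept an optional third value, the step, written as start:stop:step. A minimal sketch with arbitrary variable names:

```python
nums = list(range(10))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

evens = nums[::2]        # Every second element, starting at index 0
print(evens)             # Prints "[0, 2, 4, 6, 8]"

middle = nums[2:8:3]     # From index 2 up to 8 (exclusive), in steps of 3
print(middle)            # Prints "[2, 5]"

backwards = nums[::-1]   # A negative step walks the list in reverse
print(backwards)         # Prints "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"
```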

Tuples

A tuple is an ordered sequence of values. It is immutable, i.e., once created, it cannot be modified. A tuple is in many ways similar to a list. One of the most important differences is that tuples can be used as keys in dictionaries and as elements of sets, while lists cannot. A tuple is created by using parentheses ().

d = {(x, x + 1): x for x in range(10)}  # Create a dictionary with tuple keys
t = (5, 6)        # Create a tuple
s = 5,            # A trailing comma also creates a tuple: (5,)
print(type(t))    # Prints "<class 'tuple'>"
print(type(s))    # Prints "<class 'tuple'>"
print(d[t])       # Prints "5"
print(d[(1, 2)])  # Prints "1"
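
The two properties described above, immutability and usability as dictionary keys, can be checked directly. This is a small sketch. The names point and grid are hypothetical.

```python
point = (3, 4)

# Trying to change an element of a tuple raises a TypeError.
try:
    point[0] = 10
except TypeError as err:
    print('Tuples are immutable:', err)

grid = {}
grid[(0, 1)] = 'start'       # A tuple works as a dictionary key

# A list cannot be used as a key because it is mutable (unhashable).
try:
    grid[[0, 2]] = 'end'
except TypeError as err:
    print('Lists cannot be keys:', err)

print(grid)   # Prints "{(0, 1): 'start'}"
```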

Sets

A set is an unordered collection of distinct elements. It is created by enclosing elements in curly braces {}. Note that empty braces {} create a dictionary; use set() to create an empty set.

animals = {'cat', 'dog'}
print('cat' in animals)   # Check if an element is in a set; prints "True"
print('fish' in animals)  # prints "False"
animals.add('fish')       # Add an element to a set
print('fish' in animals)  # Prints "True"
print(len(animals))       # Number of elements in a set; prints "3"
animals.add('cat')        # Adding an element that is already in the set does nothing
print(len(animals))       # Prints "3"
animals.remove('cat')     # Remove an element from a set
print(len(animals))       # Prints "2"
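
Because a set keeps only distinct elements, it is often used to remove duplicates and to compare collections. A minimal sketch with made-up data:

```python
names = ['cat', 'dog', 'cat', 'fish', 'dog']
unique_names = set(names)          # Duplicates are dropped
print(len(unique_names))           # Prints "3"

pets = {'cat', 'dog'}
swimmers = {'fish', 'dog'}

# Sets are unordered, so we sort the results before printing them.
either = sorted(pets | swimmers)   # Union: elements in at least one set
both = sorted(pets & swimmers)     # Intersection: elements in both sets
only_pets = sorted(pets - swimmers)  # Difference: in pets but not in swimmers

print(either)      # Prints "['cat', 'dog', 'fish']"
print(both)        # Prints "['dog']"
print(only_pets)   # Prints "['cat']"
```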

Dictionaries

A dictionary stores (key, value) pairs. In a list, we call an item by using its index or position. While using dictionaries, we use a pair's key in order to call the value:

d = {'cat': 'cute', 'dog': 'furry'}  # Create a new dictionary with some data
print(d['cat'])       # Get an entry from a dictionary; prints "cute"
print('cat' in d)     # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet'     # Set an entry in a dictionary
print(d['fish'])      # Prints "wet"
# print(d['monkey'])  # Would raise KeyError: 'monkey' is not a key of d
print(d.get('monkey', 'N/A'))  # Get an element with a default; prints "N/A"
print(d.get('fish', 'N/A'))    # Get an element with a default; prints "wet"
del d['fish']         # Remove an element from a dictionary
print(d.get('fish', 'N/A')) # "fish" is no longer a key; prints "N/A"
print(d.keys())                 # A dict_keys view of all keys; prints "dict_keys(['cat', 'dog'])"
key_list = list(d.keys())       # Creates a list of the keys in the dictionary
values_list = list(d.values())  # Creates a list of the values in the dictionary

Iteration and Loops

In Python, objects like lists, tuples and dictionaries provide a stream of items that can be used one after the other, automatically. We can loop over the elements of any such iterable object, in order to generate multiple, automatic outputs.

For example, a for loop over a list runs once for each item in the list:

animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)
# Prints "cat", "dog", "monkey", each on its own line.

nums = [0, 1, 2, 3, 4]
squares = []
for num in nums:
    square = num ** 2
    squares.append(square)
print(squares)   # Prints [0, 1, 4, 9, 16]

We can also use a list comprehension to write the same for loop more compactly:

nums = [0, 1, 2, 3, 4]
squares = [num ** 2 for num in nums]
print(squares)   # Prints [0, 1, 4, 9, 16]

Similarly, we can generate the squares for all numbers from 0-4 using range():

squares = [x ** 2 for x in range(5)]
print(squares)   # Prints [0, 1, 4, 9, 16]

Loop or iterate over the keys in a dictionary:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal in d:
    legs = d[animal]
    print(f'A {animal} has {legs} legs')
# Prints "A person has 2 legs", "A cat has 4 legs", "A spider has 8 legs"

Loop or iterate over the items in a dictionary, while accessing both keys and values:

d = {'person': 2, 'cat': 4, 'spider': 8}
for key, value in d.items():
    print(f'key: {key} \n value: {value}')

Python loops

Advanced operations with Numpy

Numpy is the core library for scientific computing in Python. It provides high-performance multidimensional arrays and tools for working with them. This library significantly expands Python's capabilities for data manipulation.

Numpy includes a large collection of predefined mathematical functions:

Basic Mathematical Functions
Linear Algebra functions based on the matrix algebra BLAS and numeric linear algebra LAPACK software libraries.
Discrete Fourier Transform for spectral analysis.
Random sampling library
And more ...
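
To give a feel for the categories listed above, the sketch below calls one function from each. The input values are arbitrary, and np.random.default_rng is just one of several ways Numpy offers to generate random numbers.

```python
import numpy as np

# Basic mathematical functions are applied to every element of an array.
angles = np.array([0.0, np.pi / 2])
sines = np.sin(angles)
print(sines)                  # Approximately [0. 1.]

# Linear algebra: solve the system A x = b.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
print(x)                      # Prints "[1. 2.]"

# Discrete Fourier Transform: a constant signal has only a
# zero-frequency component.
signal = np.ones(4)
spectrum = np.fft.fft(signal)
print(spectrum.real)          # Prints "[4. 0. 0. 0.]"

# Random sampling: three uniform values in [0, 1), reproducible via the seed.
rng = np.random.default_rng(seed=0)
samples = rng.random(3)
print(samples.shape)          # Prints "(3,)"
```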

Before we start working with the Numpy Library, we need to load (import) it into the current working memory, by including the following line in a Jupyter Notebook code cell.

import numpy as np

where the alias or short name np is given to refer to Numpy.

import numpy as np

print('Pi number = ', np.pi) # Using the definition of Pi from Numpy
print('The square root of Pi is = ', np.sqrt(np.pi)) # Using the square root function in Numpy

Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

import numpy as np

a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"

Numpy also includes a set of functions to create arrays

import numpy as np

a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[0. 0.]
                      #          [0. 0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[1. 1.]]"

c = np.full((2,2), 7)  # Create a constant array (integers, since 7 is an int)
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[1. 0.]
                      #          [0. 1.]]"

e = np.random.random((2,2))  # Create an array filled with random values in [0, 1),
                             # drawn from a uniform distribution
print(e)                     # Prints different values on each run

Integer array indexing

import numpy as np

a = np.array([[1,2], [3, 4], [5, 6]])

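# Integer array indexing: pick the elements at (0, 0), (1, 1), (2, 0) in one step.
print(a[[0, 1, 2], [0, 1, 0]])  # Prints "[1 4 5]"
# This is equivalent to indexing each element separately:
print(np.array([a[0, 0], a[1, 1], a[2, 0]]))  # Prints "[1 4 5]"
# The same element can be selected more than once:
print(a[[0, 0], [1, 1]])  # Prints "[2 2]"
print(np.array([a[0, 1], a[0, 1]]))  # Prints "[2 2]"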

Slicing works on arrays much as it does on lists, with one slice per dimension.

import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

# Use slicing to pull out the subarray consisting of the first 2 rows
# and columns 1 and 2; b is the following array of shape (2, 2):
# [[2 3]
#  [6 7]]
b = a[:2, 1:3]

Numpy offers several ways to index into arrays. Slicing, for example, returns a view of the original data rather than a copy:

# A slice of an array is a view into the same data, so modifying it
# will modify the original array.
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
b = a[:2, 1:3]
print(a[0, 1])   # Prints "2"
b[0, 0] = 77     # b[0, 0] is the same piece of data as a[0, 1]
print(a[0, 1])   # Prints "77"
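
When the original array must stay untouched, one option is to make an explicit copy of the slice. A minimal sketch, reusing the same array:

```python
import numpy as np

a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

view = a[:2, 1:3]          # A slice: shares memory with a
c = a[:2, 1:3].copy()      # An independent copy of the same values

c[0, 0] = 77               # Modifying the copy...
print(a[0, 1])             # ...leaves the original alone; prints "2"
print(c[0, 0])             # Prints "77"

print(np.shares_memory(a, view))  # Prints "True"
print(np.shares_memory(a, c))     # Prints "False"
```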

Numpy Datatypes

Every numpy array is a grid of elements of the same type.

import numpy as np

x = np.array([1, 2])   # Let numpy choose the datatype
print(x.dtype)         # Prints "int64" on most platforms (the default integer type)

x = np.array([1.0, 2.0])   # Let numpy choose the datatype
print(x.dtype)             # Prints default "float64"

x = np.array([1, 2], dtype=np.float64) # Force a particular datatype
print(x.dtype)                         # Prints "float64"
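
An existing array can also be converted to another datatype with the astype method. The sketch below uses arbitrary values. Note that converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

x = np.array([1.7, 2.2, -0.5])
print(x.dtype)               # Prints "float64"

xi = x.astype(np.int64)      # Returns a new array; x itself is unchanged
print(xi)                    # Prints "[1 2 0]"
print(xi.dtype)              # Prints "int64"

# Mixing integers and floats produces a float array.
mixed = np.array([1, 2.5])
print(mixed.dtype)           # Prints "float64"
```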

Array math

Basic mathematical functions operate elementwise on arrays.

import numpy as np

x = np.array([[1,2],[3,4]], dtype=np.float64)
y = np.array([[5,6],[7,8]], dtype=np.float64)

# Elementwise sum; both produce the array
# [[ 6.0  8.0]
#  [10.0 12.0]]
print(x + y)
print(np.add(x, y))

# Elementwise difference; both produce the array
# [[-4.0 -4.0]
#  [-4.0 -4.0]]
print(x - y)
print(np.subtract(x, y))

# Elementwise product; both produce the array
# [[ 5.0 12.0]
#  [21.0 32.0]]
print(x * y)
print(np.multiply(x, y))

# Elementwise division; both produce the array
# [[ 0.2         0.33333333]
#  [ 0.42857143  0.5       ]]
print(x / y)
print(np.divide(x, y))

# Elementwise square root; produces the array
# [[ 1.          1.41421356]
#  [ 1.73205081  2.        ]]
print(np.sqrt(x))

Dot product:

import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

v = np.array([9,10])
w = np.array([11, 12])

# Dot product of vectors; both produce 219
print(v.dot(w))
print(np.dot(v, w))

# Matrix / vector product; both produce the rank 1 array [29 67]
print(x.dot(v))
print(np.dot(x, v))

# Matrix / matrix product; both produce the rank 2 array
# [[19 22]
#  [43 50]]
print(x.dot(y))
print(np.dot(x, y))
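
Python also provides the @ operator for matrix multiplication, which for these arrays gives the same results as dot. The sketch below repeats the products and contrasts them with the elementwise * operator.

```python
import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])
v = np.array([9,10])
w = np.array([11, 12])

print(v @ w)   # Dot product of vectors; prints "219"
print(x @ v)   # Matrix / vector product; prints "[29 67]"
print(x @ y)   # Matrix / matrix product; prints "[[19 22]
               #                                   [43 50]]"

# In contrast, * multiplies element by element.
print(x * y)   # Prints "[[ 5 12]
               #          [21 32]]"
```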

Sum of elements:

import numpy as np

x = np.array([[1,2],[3,4]])

print(np.sum(x))  # Compute sum of all elements; prints "10"
print(np.sum(x, axis=0))  # Compute sum of each column; prints "[4 6]"
print(np.sum(x, axis=1))  # Compute sum of each row; prints "[3 7]"

Transposing a matrix:

import numpy as np

x = np.array([[1,2], [3,4]])
print(x)    # Prints "[[1 2]
            #          [3 4]]"
print(x.T)  # Prints "[[1 3]
            #          [2 4]]"

# Note that taking the transpose of a rank 1 array does nothing:
v = np.array([1,2,3])
print(v)    # Prints "[1 2 3]"
print(v.T)  # Prints "[1 2 3]"
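
If a row or column vector is needed, one possible approach is to give the rank 1 array a second dimension first. A minimal sketch:

```python
import numpy as np

v = np.array([1,2,3])
print(v.T.shape)         # Still rank 1; prints "(3,)"

row = v.reshape(1, 3)    # A rank 2 array with one row
col = v.reshape(3, 1)    # A rank 2 array with one column
print(row.T.shape)       # Transposing now has an effect; prints "(3, 1)"

col2 = v[:, np.newaxis]  # Another way to build the column
print(np.array_equal(col, col2))  # Prints "True"
```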

📚 Basic References

Python Tutorial. Python.org.
Numpy Tutorial. Numpy.org.
SciPy User's Guide. SciPy.org.
Digital Library: A Collection of Python Programming Books
UArizona DataLab Introduction to Data Science

Cheat Sheets

Jupyter Notebook. Cheat Sheet. Datacamp.
Python 3. Cheat Sheet. Laurent Pointal. Mémento v.2.0.6.
Python Basics. Data Science Cheat Sheet. Dataquest.
Python Intermediate. Data Science Cheat Sheet. Dataquest.
Importing Data. Python for Data Science Cheat Sheet. Datacamp.
Numpy. Data Analysis in Python. Cheat Sheet. Datacamp.
Numpy. Data Science Cheat Sheet. Dataquest.

Free Short Courses.

Introduction to Programming. Kaggle.com.
Python. Kaggle.com.
Data Cleaning. Kaggle.com.

Created: 05/20/2022 (C. Lizárraga); Last update: 02/05/2025 (C. Lizárraga)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2025.