Chapter 2 ‐ Python Programming for Data Science - sarahwsutton/Introduction_to_datascience_for_librarians GitHub Wiki

2.1 Introduction

This chapter includes example code snippets and code exercises. This version of the chapter includes images of the code snippets and coding exercises paired with images of their expected outputs. Readers are encouraged to download the interactive (.ipynb) version of this chapter then open it using Google Colab as is described below. To use the interactive file, click on the link then use the download icon as pictured in Figure 2.1.1 below to download a copy of the interactive file to your local machine.

Figure 2.1.1 GitHub file download link

Once you have downloaded the file you will be able to upload it to your Google Drive and click on it to open it in Google Colab. Once it is open you will be able to run the code snippets and compare the outputs you get to the outputs pictured in this version of the chapter.

2.2 Getting started in Google Colab

There are a number of platforms on which Python code can be written and run. In data science these are often called integrated development environments, or IDEs for short. An IDE is a software application that combines multiple tools into a single interface or platform. Working in a programming interface makes programming (and troubleshooting or debugging) faster and more efficient, not to mention easier to learn.

IDEs come in a variety of levels of complexity, from simple, like Google Colaboratory (Colab for short), to very robust, like Anaconda. IDEs simplify the process of sending commands to a computer's operating system. For instance, you might see a drop-down menu to use for choosing the folder into which you save the files you create. IDEs work similarly to a graphical user interface (GUI) such as those used to communicate with library databases, like the EBSCOhost platform.

In addition to IDEs, many advanced programmers will use their computer's command line interface, or CLI for short. The CLI appears as an almost blank window; see Figure 2.2.1. It is a text-based means of communicating commands to a computer's operating system. Some CLIs will appear with a black background and white or green text.

Figure 2.2.1 Command line interface on my Macbook Air.

For the exercises and examples in this book we'll be using Google Colab instead of the command line interface because you can store files on an existing Google account and because it does not require you to download and configure any new software to use it. It is recommended for this course that you use your student Google account with Colab.

Exercise 2.1 Using Google Colab - Create a New Colab Notebook

Step 1

To get started, you should navigate to and log in to your student Google account. Once there, it is recommended that you create a new folder in which to save the files you create while working through the exercises in this book.

After you've created that new folder, open it, then use the "+ New" icon to open a new Colab file. See Figure 2.3.

Figure 2.3 Opening a new Colab file

Your new file will be blank and have the name "Untitled.ipynb". It will have a single code box; see Figure 2.3.

Step 2

Take a few minutes to hover your mouse over the various icons on your screen to see what they do. In the exercises in this book we'll primarily be using the Files icon on the left hand navigation menu, the + Text and the + Code icons, and the "File" drop down menu.

Step 3

Give your file a new name by selecting the existing file name (the letters and numbers preceding the ".ipynb") then typing your new file name. Then either click on File and choose Save OR use the keyboard shortcut Ctrl+S to save your file with the new name. Check your Google Drive folder to see if the new file appeared in the correct folder.

Congratulations! You've just created your first Python notebook, which uses the file type ".ipynb".

2.3 Syntax

The Oxford English Dictionary definition of syntax is "The set of rules and principles in a language according to which words, phrases, and clauses are arranged to create well-formed sentences." In the context of grammar, to be well-formed means to conform to the rules of grammar. In Python (and other programming languages), to be well-formed means to conform to the rules of Python, that is, to conform to Python syntax. This is important to us because only statements and commands that are well-formed, that is, correctly formatted according to the rules of the Python language, will run correctly and output what is asked of them.

Indentation

Some programming languages use indentation to make the code human readable, but Python uses it to identify lines of code that "go together." It's often used for a complex command that includes multiple steps. For instance, if you want to execute one command or another depending on something you'd use a command called an "if" statement. Say you wanted to print the word "correct" if the answer to a simple mathematical operation is correct, you could ask for it this way:
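A minimal sketch of such an "if" statement follows. The arithmetic (2 + 2) and the variable name answer are chosen here only for illustration. The indented line belongs to the "if" statement and runs only when the condition is true.

```python
# Store the result of a simple mathematical operation.
answer = 2 + 2

# The indented line "goes with" the if statement:
# it runs only when the condition (answer == 4) is true.
if answer == 4:
    print("correct")
```

If the indentation before print() were removed, Python would stop with an IndentationError instead of running the code.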

NOTE: In Python, print() is a function that sends whatever you've asked it to print to your screen (inside the parentheses), not to an actual printer.


Sidebar: Elements of a Code Snippet and Output in Colab


Functions

We're talking about functions in this section about syntax because functions also have rules to be followed in order for them to work correctly.

A Python function is a pre-written piece of code that sends a command to the computer. Python syntax requires that function names contain no spaces and that, when you use a function, its name be followed by parentheses. Sometimes a function needs an argument, that is, additional information, in order to run correctly. The argument(s) are placed between the function's parentheses. When the argument is text, it must be enclosed in quotation marks. When you use an argument in a function, you are said to have passed the argument to the function.

Exercise 2.2

print() is a simple function that will send whatever argument you include between the parentheses to your screen (which you saw in the indentation example above). Let's give it a try. In the code block below, replace "your name here" with your name, keeping the quotation marks.
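The code block for this exercise can be sketched as a single line. The text between the quotation marks is the argument being passed to print().

```python
# Replace the text inside the quotation marks with your own name.
print("your name here")
```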

Comments

When you precede a line in a block of code with a hash mark, #, Python will read it as a comment and not act on whatever follows the hash mark. Comments are used to make code more readable, to explain a piece of code, and to prevent a piece of code from being executed while keeping it visible.

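A comment made with a hash mark can only be used on a single line. If you need more than one line for your comment, place a triple set of quotation marks before and after the multi-line text. Strictly speaking, Python treats this text as a string that is never used rather than as a true comment, but it is commonly used this way.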

What do you think will be output to your screen based on the following code with comments?
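A minimal sketch of such a snippet follows. The specific messages are illustrative assumptions. Try to predict the output before running it.

```python
# This whole line is a comment, so Python does not act on it.
print("Hello, library!")  # A comment can also follow code on the same line.

# print("This line is commented out.")

"""
This text sits between triple quotation marks,
so it can span several lines without being run as code.
"""
print("Done")
```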

2.4 Variables and data types

In order to work with data in Python, you have to give it a name and a value. The combination is referred to as a variable. The variable name is how you will refer to the data when you want the computer to do something with it. The value is the data itself. The value can be a single number or letter or a huge list of things.

Let's continue working with the print() function to look at a simple example of a variable.
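A minimal sketch of that example: the equals sign assigns the value 13 to the name x, and print() then displays the value stored in x.

```python
# Give the value 13 the name x.
x = 13

# Print the value stored in x (no quotation marks, because x is a variable name, not text).
print(x)
```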

You can also use Python to do other things to variables, like add them together.
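As a sketch, here is one way to add two variables. The second variable, y, and its value are hypothetical. In a notebook where x was already assigned in an earlier cell, the first line could be left out. It is repeated here only so the block runs on its own.

```python
x = 13  # assigned in the earlier example; repeated so this sketch runs on its own
y = 7   # a second variable with a hypothetical value

# Add the two variables together and print the result.
print(x + y)
```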

Notice that I didn't have to re-define x; it was still equal to 13 because, once you've assigned a value to a variable, it doesn't change unless you take an action to change it.

It's important to know that variable names are case sensitive. It's also important not to use words Python is already using as variable names. Use the code below to view a list of these words that should not be used as variable names.
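One way to view these reserved words, assuming the keyword module from Python's standard library is an acceptable route, is sketched here.

```python
import keyword

# keyword.kwlist is the list of words Python reserves for its own use.
print(keyword.kwlist)
```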

Note that the output in this image is partial. You should open the interactive version of this chapter and run this code snippet to see the whole list.

It's easy to get into the habit of using short, generic variable names like x or y, but it's not recommended. It's better to use a descriptive name because this will help you to remember what the variable contains. There are a few other rules for variable names:

  • A variable name must start with a letter or the underscore character
  • A variable name cannot start with a number
  • A variable name can only contain alpha-numeric characters and underscores (A-z, 0-9, and _ )
  • Variable names are case-sensitive (age, Age and AGE are three different variables) (W3Schools, 2025).
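The rules in this list can be checked with a short sketch. It uses the string method isidentifier(), which reports whether a piece of text is shaped like a valid name. The example names are made up for illustration.

```python
# str.isidentifier() reports whether a piece of text follows the naming rules.
print("checkout_count".isidentifier())   # letters and an underscore
print("_checkout_count".isidentifier())  # starts with an underscore
print("2nd_floor".isidentifier())        # starts with a number
print("due-date".isidentifier())         # contains a hyphen

# Names are case-sensitive, so these are three different variables.
age = 30
Age = 40
AGE = 50
print(age, Age, AGE)
```

The first two checks print True and the last two print False. Note that isidentifier() only checks the shape of a name. It does not flag Python's reserved words, which still cannot be used as variable names.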

Python categorizes data values into types in order to more easily take action on them. The actions you can take depend on the type (or category) of data. For instance, a data value could be a number or a letter. You can add one number to another number as we did above, but you can't add a number and a letter.
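A brief sketch of that difference follows. The try and except lines are not covered in this chapter. They are used here only so the error can be shown without stopping the code.

```python
# Adding one number to another number works.
print(13 + 1)

try:
    13 + "a"  # adding a number and a letter does not work
except TypeError as error:
    # Python reports a TypeError because the two data types cannot be added.
    print("TypeError:", error)
```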

Python assigns a data type to variables you create automatically based on the syntax you use to create them. In Table 2.1 below are the types of data we'll be working with, a definition of the type, the syntax for creating the data type, and an example.

Table 2.1 Python data types

You may be wondering about the syntax for integers and floats. The simple answer is that they don't require any special syntax for Python to recognize them as integers and floats.
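As a sketch, the syntax for the data types named in this section (integer, float, string, Boolean, list, and tuple) looks like this. The variable names and values are hypothetical.

```python
checkouts = 42                    # integer: a whole number, no special syntax
distance = 0.5                    # float: a number with a decimal point
title = "Beloved"                 # string: text inside quotation marks
is_available = True               # Boolean: True or False, capitalized, no quotation marks
genres = ["fiction", "poetry"]    # list: items inside square brackets
shelf = (3, "B")                  # tuple: items inside parentheses

print(checkouts, distance, title, is_available, genres, shelf)
```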

You may also be wondering about the difference between a list and a tuple. They are very similar, but tuples, like integers, floats, strings, and Booleans, are immutable. In other words, once you've created them they can't be changed. Instead, a new value is created every time you need to change one. Lists are mutable, that is, they can be changed.
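The difference can be seen in a short sketch with made-up values: a list accepts a change, while the same change to a tuple produces an error. The try and except lines are used only to display the error without stopping the code.

```python
genres_list = ["fiction", "poetry"]
genres_list.append("drama")        # a list can be changed after it is created
print(genres_list)

genres_tuple = ("fiction", "poetry")
try:
    genres_tuple[0] = "drama"      # a tuple cannot be changed
except TypeError as error:
    print("TypeError:", error)
```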

If you have a variable and aren't sure what kind of data it contains, you can use the type() function to find out.

Remember when we assigned the value 13 to the variable x above?
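A minimal sketch of type() in use follows. The value of x is repeated so the block runs on its own, and the other two values are hypothetical.

```python
x = 13  # as in the earlier example; any whole number gives the same result

print(type(x))          # <class 'int'>
print(type(0.5))        # <class 'float'>
print(type("Emporia"))  # <class 'str'>
```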

Finally, if a variable is one type and you need it to be another, you can often (but not always) "cast" it to that other type.
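Casting can be sketched with the built-in functions int(), float(), and str(). The variable names and values are hypothetical, and the last example shows a cast that is not possible.

```python
count_text = "42"              # a string that happens to contain digits
count = int(count_text)        # cast the string to an integer
print(count + 1)               # 43

print(float(count))            # 42.0
print(str(count) + " books")   # 42 books

try:
    int("forty-two")           # not every cast is possible
except ValueError as error:
    print("ValueError:", error)
```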

2.5 Functions and Methods

When I was first learning Python, I had trouble with the difference between functions and methods. As we learned in section 2.3 above, a Python function is a pre-written piece of code that sends a command to the computer. It has a name, like print, which is used to invoke it, followed by a set of parentheses in which you sometimes must insert an argument.

Strictly speaking, a method is a special kind of function, special because it doesn't stand alone but rather is associated with an object.

The term object has a special and important meaning in Python. If you Google "what is an object in Python," you may read that "an object is an instance of a class." For our (very basic) purposes, think of a class as an abstraction and an instance as one occurrence of that abstraction. For example, if car is the abstraction, then one particular Chevrolet on a car dealer's lot is one instance of the class called car.

In Python, a method is a piece of code that has a relationship with an object. For a function to work, you often have to pass it some data on which to take action. For example, print() doesn't work unless you tell it what to print by inserting some data between the parentheses. A method already has its data, in the form of the object it is attached to.

In terms of syntax, a method looks like a function but it's attached to an object by a period:

object.method()

A simple method used on a variable that is a string is .capitalize().
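As a sketch, with a made-up string value, the method is attached to the variable with a period and returns a new string whose first letter is capitalized.

```python
town = "emporia"

print(town.capitalize())  # Emporia
print(town)               # emporia (the original string is unchanged)
```

Because strings are immutable, .capitalize() does not alter town. To keep the capitalized version, assign the result to a variable.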

2.6 Python Libraries and Dataframes

Python has a lot of built-in functions that are used for simple and/or common tasks. We've already used the print() function, which outputs something to your screen (rather than printing it on a local printer). The print() function is built into Python, but there are also collections of functions called libraries that can be imported into and used in Python for various specific tasks. Pandas is the name of one of the most common Python libraries. It contains functions and code used to clean up data and prepare it for analysis as well as functions and code for conducting data analysis. We'll be using Pandas and other Python libraries.

A dataframe is a tool used in Pandas to organize data into columns and rows and then perform operations on that data. Think of it as a powerful spreadsheet. Unlike a spreadsheet, a dataframe can be used with very large sets of data, also known as big data.

Exercise 2.6.1 Create a dataframe

In this exercise, we'll make up some data, add it to a dataframe, perform some simple operations on it, and print the results to our screen.
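A minimal sketch of these steps follows. The data are invented for illustration (only the column name distance comes from this chapter), so the average it prints, 1.25, differs from the value quoted in the chapter text. The alias pd is a common convention rather than a requirement.

```python
import pandas as pd

# Made-up data (hypothetical): each key becomes a column, each list holds its rows.
data = {
    "branch": ["Main", "East", "West", "North"],
    "distance": [0.5, 1.0, 1.5, 2.0],
}

# Add the data to a dataframe.
df = pd.DataFrame(data)

# Print the whole dataframe, then the average of the distance column.
print(df)
print(df["distance"].mean())
```

In a notebook, leaving off print() on the last line displays the result in its longer form, which is where a NumPy prefix can appear.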

Not only do we see the average of the numbers in the distance column (0.5285714285714286), we can also see that the information being returned is a float (a number with a decimal point). The letters "np" at the beginning refer to NumPy, a separate library for numerical computing on which Pandas is built.

Notice that it was easy to type in a small set of data above, but that it would become cumbersome to type in a large set of data. Because dataframes can be used to analyze large data sets (some too large to fit on a single computer), large data files can be imported into a dataframe. The data sets we'll use in class are large, but not so large that they can't be accommodated on a single computer.

There are several ways to import data into a Python Pandas dataframe:

  • upload a data file to the Colab temporary files,
  • allow Colab to access data files in your Google Drive,
  • read data files from a web page.

We'll practice each one below.

Exercise 2.6.2 Create a dataframe from temporary Colab files

In this example, you'll upload an existing data file to the temporary Colab folder from your local machine, create a dataframe, save the dataframe to a file, and return the new file to your local machine. Important: Remember that none of the files you add to the temporary Colab folder will be saved after the end of your session, so you must download them before you end the session.

Download the file: Emporia_daily_temps_10_24.csv

Watch this short step-by-step video then try it yourself using the code below.
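The steps can be sketched as follows. So that the sketch runs anywhere, it writes a small stand-in file first. The stand-in's column names and values are hypothetical and are not taken from Emporia_daily_temps_10_24.csv. In Colab you would skip the stand-in and point csv_path at the uploaded file instead, which is assumed here to sit in the /content folder.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the uploaded file (hypothetical columns and values).
# In Colab, use the uploaded file instead, for example:
# csv_path = "/content/Emporia_daily_temps_10_24.csv"
folder = tempfile.mkdtemp()
csv_path = os.path.join(folder, "sample_temps.csv")
with open(csv_path, "w") as f:
    f.write("date,high,low\n2024-10-01,75,50\n2024-10-02,80,55\n")

# Create a dataframe from the CSV file and look at the first rows.
temps = pd.read_csv(csv_path)
print(temps.head())

# Save the dataframe to a new file; index=False leaves out the row numbers.
out_path = os.path.join(folder, "temps_copy.csv")
temps.to_csv(out_path, index=False)
print("Saved to", out_path)
```

In Colab, the saved file then appears in the Files pane, from which it can be downloaded to your local machine before the session ends.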

Exercise 2.6.3 Create a dataframe from a file on your Google Drive

In this example you'll use a file stored on your Google Drive to create a dataframe, save the dataframe to a file, and return the new file to your Google Drive. Use the file, Emporia_daily_temps_10_24.csv, that you downloaded in the previous exercise. This time, upload it to your Google Drive before continuing with this exercise.

Watch this short step-by-step video then try it yourself using the code below.
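A sketch of this approach follows, assuming the file was uploaded to a Drive folder called your_folder. That folder name and the output file name are hypothetical and need to be adjusted. The google.colab module exists only inside Colab, so the sketch checks for it first and simply prints a message anywhere else.

```python
import pandas as pd

try:
    from google.colab import drive  # this module exists only inside Colab
except ImportError:
    drive = None

if drive is not None:
    # Colab will ask for permission to access your Google Drive.
    drive.mount("/content/drive")

    # Hypothetical location: change your_folder to the folder holding the file.
    csv_path = "/content/drive/MyDrive/your_folder/Emporia_daily_temps_10_24.csv"
    temps = pd.read_csv(csv_path)
    print(temps.head())

    # Saving to a path under /content/drive writes the file to your Google Drive.
    temps.to_csv("/content/drive/MyDrive/your_folder/temps_copy.csv", index=False)
else:
    print("google.colab is not available; run this code inside Colab.")
```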

Exercise 2.6.4 Create a dataframe from a file on a web page

In this example you'll use a file stored on a web page to create a dataframe, save the dataframe to a file, and return the new file to your Google Drive. The web page we're using in this example is https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/county/time-series where we'll find weather data for Emporia, KS.

Watch this short step-by-step video then try it yourself using the code below.
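A sketch of the idea follows, relying on the fact that Pandas' read_csv() accepts a web address as well as a file name. The helper name load_csv_from_url and the example address are hypothetical. The real address is the CSV download link offered on the web page, and a file that begins with descriptive lines may need extra options not shown here.

```python
import pandas as pd


def load_csv_from_url(url):
    """Read a CSV file published at a web address into a dataframe."""
    return pd.read_csv(url)


# Hypothetical usage: replace the placeholder with the CSV download address.
# temps = load_csv_from_url("https://example.org/path/to/data.csv")
# print(temps.head())
# temps.to_csv("/content/drive/MyDrive/your_folder/web_temps.csv", index=False)
```

The usage lines are left as comments because they need a real address and, for the last line, a mounted Google Drive.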

2.7 Conclusion

In this chapter you've learned how to get started using the Python programming language in Google Colab. Remember that Python is a language just like English. While English is used for communication between people, Python is used for communication between people and machines. Like English, Python has its own "grammatical" rules and structures. In English it's important to know how to structure a sentence correctly in order to clearly convey your message to another person. In Python the same thing occurs. If you don't follow the rules correctly, your message may become garbled and you'll often receive an error instead of the output you were hoping for.

You've learned about the structure of the Python language and you've learned some very basic Python commands and code. You also had a chance to practice some of what you learned. You'll be applying what you learned and building on it throughout the rest of this book.

Continue on to Chapter 3: Data and Data Management