Chapter 5 ‐ Obtaining Data

5.1 Introduction

This chapter is called Obtaining Data because the word obtaining covers both collecting data (e.g., creating new data) and gaining access to and acquiring existing data. Librarians, in their role as researchers, may collect data as part of a study. As we saw in Chapter 3, in their role providing research services, librarians may assist other researchers in finding and using existing data, as well as in managing, preserving, and storing it.

Data collection methods are as myriad as research questions, and then some, since there will often be multiple ways to answer a research question using multiple types of data, depending on the researcher's philosophical approach to research. Philosophical approaches to research are extremely important for conducting research, but they are by and large beyond the scope of this book. While students using this book are strongly encouraged to take a course devoted to research philosophies and research methods as part of their studies, this book does not assume that the reader has done so. That is one of the primary reasons the exercises in this book use existing data sets, such as the weather data used in Chapters 2 and 4.

We are interested in how data science techniques may be applied in the context of libraries, so we will focus on the kinds of research questions libraries might be interested in. In Chapter 3 we reviewed some sources of data that might be relevant to libraries. In this chapter we will use two methods for obtaining data from the web: web scraping and using an application programming interface (API).

In the first section of this chapter, we will follow a scenario created by Lin and Scott (2023) in their book Hands-On Data Science for Librarians to learn about web scraping. In their scenario, you are asked to imagine that you are a librarian working in the St. Louis Public Library. You've been asked to prepare a proposal for presentation to the aldermen on the city council. In the proposal, the library is asking for additional funding for a new library program. As in Lin and Scott's scenario, our first step is to collect contact information for the aldermen from the city's web site using web scraping.

In the second section of the chapter, we'll learn to use existing data from the U.S. Census. But rather than collecting data about unemployment rates among adults in the library's service population to learn about downloading Census data (as Lin and Scott do), we will collect data about languages spoken in the homes of children in K-12 public schools in order to plan a collection of new library materials in those languages.

5.2 Web Scraping

Web scraping is, literally, asking your computer to scrape data from an existing web site. The process is used to obtain a large amount of data efficiently and then organize it for some purpose. The purpose may or may not be to directly answer a research question. For example, you might want to scrape book reviews from amazon.com to support collection development.

In the scenario we're exploring in this chapter, we're preparing a proposal to request additional funding for a new library program for parents of children in public schools whose primary language spoken at home is not English. We want to share our proposal with all members of the city's board of aldermen (similar to city councils in other places), so we will need to collect their email addresses. Given that there are only 14 aldermen for the city of St. Louis, we could go to their public-facing web pages and copy their addresses one by one. But it would be more efficient to simply scrape the data from that web page and then organize it into a usable format. Learning to do this now gives us a technique in our data science tool kits for next time, when there may be hundreds of pieces of data to be collected.

Web scraping requires programming skill that is more complex than what is taught in Chapter 2 of this book. Chapter 2 was meant to provide you with enough knowledge to read simple Python code, but not enough to write code for web scraping. Web scraping also requires some understanding of HTML (Hypertext Markup Language), which is also beyond the scope of this book. So, all of the code for web scraping will be provided and explained here. The reader is encouraged to download this chapter in its .ipynb form and run all of the code for themselves. You might even feel confident enough at the end to try it on another web page!

The URL of the web page listing the aldermen for St. Louis wards has changed since the Lin and Scott book was published: a redistricting plan (https://www.stlouis-mo.gov/government/departments/aldermen/redistricting/redistricting-2021.cfm) reduced the number of wards from 28 to 14. The current list of the 14 aldermen representing the new wards is here: https://www.stlouis-mo.gov/government/departments/aldermen/representation/index.cfm.

We begin by importing the Python libraries we'll use for web scraping and for manipulating the scraped data into a useful form. We're going to use a library you may remember from Chapter 2, Pandas, and a few new libraries: Requests, BeautifulSoup, and Regular Expressions.

The Requests library is used in Python to make HTTP requests, for example to fetch the HTML source in which many web pages are written. It is particularly useful for pulling data from web pages into a Python session. Real Python's guide to the Requests library is a good place to start learning more about it.

BeautifulSoup "is a Python package for parsing HTML and XML documents, including those with malformed markup" ("BeautifulSoup, HTML Parser," 2025). The BeautifulSoup documentation is a good place to learn more about how it works.

Regular expressions, or regex, are "a sequence of symbols and characters expressing a string or pattern to be searched for within a longer piece of text" ("Regular expressions," 2025). GeeksforGeeks has a very nice tutorial about how to use regular expressions.
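A minimal import cell, using the conventional aliases, might look like the sketch below; the exact cell in the chapter notebook may differ slightly.

```python
# Libraries used throughout this chapter (all are pre-installed in Google Colab).
import re                      # regular expressions, for pattern matching

import pandas as pd            # data frames
import requests                # fetching web pages over HTTP
from bs4 import BeautifulSoup  # parsing HTML
```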

Next, we will obtain the HTML code from the web page we want to scrape, in this case the list of current aldermen at https://www.stlouis-mo.gov/government/departments/aldermen/representation/index.cfm. This creates a Python object in our Colab session that contains the HTML data from the web page. We're naming that object "html."
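A sketch of that step, assuming the object name html used in the text:

```python
# Fetch the page listing the current aldermen and store the response in an
# object named "html", following the naming used in the text.
url = "https://www.stlouis-mo.gov/government/departments/aldermen/representation/index.cfm"
html = requests.get(url)

# The raw HTML source is in the .text attribute; printing it produces the
# lengthy output described below.
print(html.text)
```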

Don't worry if you can't read the output from this action. Remember that html is a language used mainly for communication between computers. Also notice that the output is lengthy. There is an icon at the top left of the output box where you can hide lengthy output if necessary. Hiding lengthy output sometimes makes maneuvering in the notebook easier.

Also notice that in the web version of this chapter, the images include only part of the full output for the sake of readability.

Now that we've got the HTML code from the URL, we'll use commands from BeautifulSoup to extract the data we need, that is, the aldermen's names and email addresses, from the rest of the HTML data.

The first thing we do with BeautifulSoup is create an object called 'soup' that contains the parsed HTML code from the page we just scraped. This process is sometimes referred to as "parsing the data."
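A sketch of the parsing step, assuming the html object created above; "html.parser" is Python's built-in parser, and the notebook may use a different one such as "lxml".

```python
# Parse the HTML we just fetched into a searchable BeautifulSoup object.
soup = BeautifulSoup(html.text, "html.parser")
```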

Using some simple commands from the BeautifulSoup library we can look at parts of the html code and isolate the data we need.

The results still aren't very human readable, even if you are familiar with html. But, if we separate out some of the results piece by piece using BeautifulSoup commands, we can begin to recognize some parts of the web page. Below we'll ask for just the title of the web page we scraped.
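Asking for the title is a single line, assuming the soup object created above:

```python
# Ask for just the <title> element of the page.
print(soup.title)
```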

Notice that in the output the HTML code is slightly more readable. In this case, the title that appears on the web page, "Aldermen Serving During the 2025-2026 Session," is preceded by the word "title" enclosed in angle brackets (the less than and greater than symbols) and followed by the text "/title" enclosed in the same symbols. That combination of symbols and characters tells the computer: here is where the title starts, here is the actual title, and here is where the title ends.

This is the basis for the html coding language: instructions to the computer are enclosed between the "<" and ">" symbols. These are called "html tags." There's often, but not always, a beginning tag and an ending tag. Knowing this, we can begin to use those tags to parse out the data we want from the html code. We'll start by looking for a couple other small chunks.

We find the aldermen's names in the h4 tags. Again using BeautifulSoup methods, we can isolate and refine exactly the data we want by specifying the h4 tags we are interested in.
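A sketch of how that inspection might look; that the names live in h4 tags is taken from the text above, not from checking the live page, so adjust the tag name if the markup has changed.

```python
# Look at the first h4 tag on the page, and then all of them, to see how the
# aldermen's names are marked up.
print(soup.find("h4"))
print(soup.find_all("h4"))
```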

To retrieve all of the text from each h4 tag, we use the BeautifulSoup method .select.
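One way to do that with .select; the CSS selector "h4" is an assumption about the page's markup.

```python
# .select() takes a CSS selector and returns every matching element; here we
# keep only the visible text (the aldermen's names) from each h4 tag.
names = [tag.get_text(strip=True) for tag in soup.select("h4")]
print(names)
```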

At this point we can combine our lists of aldermen names with the URLs that point to their contact information into a data frame using functions and commands from the Pandas library.
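A sketch of that step, assuming each h4 tag contains an a tag whose href is a site-relative link to the alderman's profile page; drop the domain prefix if the links turn out to be absolute URLs.

```python
# Pair each name with the link to that alderman's profile page.
links = [tag.find("a") for tag in soup.select("h4")]
urls = [
    "https://www.stlouis-mo.gov" + link["href"] if link else None
    for link in links
]

aldermen_df = pd.DataFrame({"name": names, "url": urls})
aldermen_df.head()
```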

We'll use the URLs in our dataframe to obtain the aldermen's email addresses, but first, let's add their wards to the data frame.
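One possible way to add the wards, assuming the text "Ward 1", "Ward 2", and so on appears near each alderman's name on the page; the notebook's approach may differ, so adapt the pattern to the actual markup.

```python
# Pull every ward number mentioned in the page text and attach them to the
# data frame.  pd.Series() pads with missing values if the counts differ.
ward_numbers = re.findall(r"Ward\s+(\d+)", soup.get_text())
aldermen_df["ward"] = pd.Series(ward_numbers)
```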

The last step is to obtain the aldermen's email addresses from their profile web pages. From each page we need the value of the href attribute that begins with mailto: inside an a tag.
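A sketch of that loop, using the aldermen_df built above; it assumes each profile page contains at least one mailto: link.

```python
# Visit each profile page and pull the address from the first mailto: link.
emails = []
for url in aldermen_df["url"]:
    if not url:                       # skip rows where no profile link was found
        emails.append(None)
        continue
    page_soup = BeautifulSoup(requests.get(url).text, "html.parser")
    mail_link = page_soup.find("a", href=re.compile(r"^mailto:"))
    emails.append(mail_link["href"].replace("mailto:", "") if mail_link else None)

aldermen_df["email"] = emails
aldermen_df
```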

Now that we have our data cleaned and organized in a dataframe, it's probably a good idea to export our dataframe so that we can call it up whenever we need it, rather than having to re-process it. Below is the code for saving the dataframe called aldermen_df into a .csv file. If you are exporting to the Colab temporary file storage, don't forget to download the file before leaving Colab.
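A minimal version of that export (the file name is just an example):

```python
# Save the finished data frame.  In Colab this writes to the temporary session
# storage unless you give a Google Drive path instead.
aldermen_df.to_csv("aldermen_df.csv", index=False)
```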

5.3 U.S. Census data

According to the Census Bureau's web site, "Census data covers dozens of topics across 130+ surveys and programs. Get in the weeds with more than 2.5 million tables of raw data, maps, profiles, and more" (Census, 2025). The collection is so large that, again, teaching its use is beyond the scope of this book. But it is a treasure trove of data describing the U.S. population, often broken down geographically, which makes it an extremely useful tool for libraries wishing to learn more about the populations they serve.

To illustrate this, we will continue using our learning scenario in this section. Recall that in this scenario, you were asked to imagine that you are a librarian working in the St. Louis Public Library. You've been asked to prepare a proposal for presentation to the aldermen on the city council. In the proposal, the library is asking for additional funding for a new library program supporting the people in St. Louis who speak a language other than English at home, and for a collection of new library materials in those languages. Specifically, we now want to know what those languages are and how many people speak them. We need a table of the top non-English languages spoken at home by those who speak English less than "very well" (that's table C16001 of the ACS, the American Community Survey).

Using the Census website, data.census.gov, we can find the name of the Census table that contains the data we're looking for. It's table C16001, Language Spoken at Home for the Population 5 Years and Over. Using the tools at data.census.gov, we are able to narrow the data in table C16001 to the city of St. Louis (see footnote below).


Footnote: The results of that process can be downloaded from the site, https://data.census.gov/table/ACSDT1Y2021.C16001?g=050XX00US29510, but the aim of this chapter and section is to learn to obtain that data using Python. We will use the site only for reference.

Obtain the Data

The first step, in the code block below, is to use Python to obtain the data from the Census site and use it to create a data frame called census_df.
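A sketch of one way to do this with the Census API; the 2021 ACS 1-year vintage and the St. Louis city geography codes (state 29, county-equivalent 510) are taken from the footnote URL above, and the chapter notebook may obtain the same table in a different way.

```python
# Request the full C16001 table for St. Louis city (FIPS 29510) from the
# Census API and load it into a one-row data frame.
response = requests.get(
    "https://api.census.gov/data/2021/acs/acs1",
    params={"get": "group(C16001)", "for": "county:510", "in": "state:29"},
)
rows = response.json()      # a list of lists: a header row, then one data row

census_df = pd.DataFrame([rows[1]], columns=rows[0])
census_df.shape             # a single row with all of the table's columns
```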

Clean and Transform the Data

We practiced cleaning and transforming data in Chapter 2. In this section we will apply the same ideas to make our data frame easier to read.

First, we see that our data frame contains one row with 155 columns. We'd like to have the 155 data points in a single column, one per row, rather than spread across 155 columns, so we'll transpose it.

Now we'll add the index as a second column because it contains the abbreviation for the description of each data point.

Instead of "Index" and "0", let's change our column (variable) names to 'Name' and 'Value'.

Now we want to replace the data labels (e.g., C16001_038MA) with more human-readable labels, so we create a data frame where the Name column has the variable code and the Label column has the human-readable variable name.

Download the data labels from ACSDT5Y2023.C16001-Column-Metadata.csv. Remember that once you open the file at this link, you should click on the "Download raw file" to download it to your local machine, then upload it to your Colab session or the Google drive you're using with your Colab.
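Reading the metadata file into a data frame might then look like this; the path assumes you uploaded the file to the Colab session.

```python
# Read the column metadata file (variable codes and their descriptions).
C16001_metadata = pd.read_csv("ACSDT5Y2023.C16001-Column-Metadata.csv")
C16001_metadata.head()
```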

Notice that in the census_df_transposed data frame, some of the values in the Name column include an A at the end. Because of that, they won't match identically with the values in the Name column in the C16001_metadata data frame. So we will remove those ending As.
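A sketch of that cleanup step:

```python
# Drop a trailing "A" from the variable codes so they match the metadata file.
census_df_transposed["Name"] = census_df_transposed["Name"].str.replace(
    r"A$", "", regex=True
)
```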

Also notice that the first column in C16001_metadata is "Column Name". These two words with a space between them can cause problems as we go on, so we will change the name to ColumnName.
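That rename is a single line:

```python
# Rename the awkward "Column Name" header to ColumnName.
C16001_metadata = C16001_metadata.rename(columns={"Column Name": "ColumnName"})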

Now we are ready to add a column to census_df_transposed for the descriptions of the data points. This block of code takes each value in the Name column of the census_df_transposed data frame and looks for it in the ColumnName column of the C16001_metadata data frame. When a match is found, the corresponding value in the Label column of the C16001_metadata data frame is added to a new column in the census_df_transposed dataframe.
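One way to implement that lookup is with a dictionary and the pandas .map() method; the notebook may use a merge instead.

```python
# Build a code-to-label lookup from the metadata, then map each variable code
# in census_df_transposed to its human-readable label.
label_lookup = dict(zip(C16001_metadata["ColumnName"], C16001_metadata["Label"]))
census_df_transposed["Label"] = census_df_transposed["Name"].map(label_lookup)
```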

Now let's shorten the values in the Label column of the census_df_transposed data frame to make it a little more readable. We'll remove the leading characters that are all the same, Estimate!!Total:!!, and replace them with E-, and we'll replace Margin of Error!!Total:!! with M-.
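A sketch of those replacements:

```python
# Shorten the repeated prefixes: estimates become "E-", margins of error "M-".
census_df_transposed["Label"] = (
    census_df_transposed["Label"]
    .str.replace("Estimate!!Total:!!", "E-", regex=False)
    .str.replace("Margin of Error!!Total:!!", "M-", regex=False)
)
```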

For our purpose, which is to list from highest to lowest the languages spoken at home by families in St. Louis City who don't speak English well, the only rows from census_df_transposed that we need are those where the Label column (see the sketch after this list)

  • contains the phrase '!!Speak English less than very well', and
  • contains the characters 'E-'.
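One way to apply both conditions, assuming the data frame and column names used above; the exact label wording on data.census.gov may include quotation marks, so adjust the phrase if no rows match.

```python
# Keep only the estimate rows (E-) that count people who speak English less
# than "very well".
mask = (
    census_df_transposed["Label"].str.contains("Speak English less than", na=False)
    & census_df_transposed["Label"].str.contains("E-", na=False)
)
census_df_transposed = census_df_transposed[mask].copy()
```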

Now we can remove those rows that contain None in the Value column and then sort the rest from highest to lowest on the Value column.
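A sketch of that step; converting Value to integers is an added assumption here, since the Census API returns the counts as text.

```python
# Drop rows with no value, make the counts numeric, and sort from highest to
# lowest.
census_df_transposed = census_df_transposed.dropna(subset=["Value"])
census_df_transposed["Value"] = census_df_transposed["Value"].astype(int)
census_df_transposed = census_df_transposed.sort_values("Value", ascending=False)
census_df_transposed
```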

Now we can see that the language most often spoken at home by those who speak English less than "very well" is Spanish. In fact, according to this data, there are 2,895 people in this category in the St. Louis Public Library's service population. The next largest group is those who speak Vietnamese at home. There are 1,080 of them. This information will be very useful to support our proposal for funding a new library collection to support these potential patrons. There is a bit more cleaning up to do with this table, and we'd like to create a visualization of it as well. That will be covered in Chapter 6 on visualizing data.

Let's not forget to save our clean data to a .csv file so we do not have to go through the whole process of obtaining and cleaning it again the next time we want to use it (which will be in Chapter 6).

If you save your files to your temporary Colab session files, remember that you must also download them from your session files to your local machine. Below, I'm saving the file to my ESU Google drive.
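In Colab, mounting Google Drive and writing the file might look like this; the folder and file name below are only examples.

```python
# Mount Google Drive and save the cleaned table there for reuse in Chapter 6.
from google.colab import drive
drive.mount("/content/drive")

census_df_transposed.to_csv(
    "/content/drive/MyDrive/c16001_stlouis_languages.csv", index=False
)
```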

Download the Python notebook with the code from this chapter and try it for yourself in Google Colab.

Continue on to Chapter 6.
