Chapter 6 ‐ Visualizing Data

6.1 Introduction

Visual literacy is the ability to process visual information, to comprehend and make sense of images, which is something humans are inherently very good at (Johnson, 2019). Many people make sense of an image more easily than they make sense of a page full of numbers or text. So it's no surprise that researchers use visual aids like graphs and charts to help their audiences understand their research results, especially audiences who may not share the researcher's subject expertise.

Visualization can be used just as effectively with text-based data as it can with numeric data. The Google n-gram viewer is an excellent example of visualization of text-based data. An n-gram is a sequence of n words. The term is usually used in the context of finding patterns in a corpus (body) of text or measuring how frequently the n-gram appears in it. Released in 2010, the Google n-gram viewer allows anyone with internet access to view the relative usage of any n-gram over time within Google's corpus of digitized books and other publications.
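To make the idea of an n-gram concrete, here is a minimal sketch that counts the bigrams (n = 2) in a single sentence. The helper name `ngrams` and the sample sentence are assumptions made up for illustration; this is not how the Google n-gram viewer itself is implemented.

```python
from collections import Counter


def ngrams(text, n):
    """Return a list of n-grams (tuples of n consecutive words) found in text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Illustrative sentence, not taken from any real corpus
sample = "the library is open and the library is free"
bigram_counts = Counter(ngrams(sample, 2))

# Print the most frequent bigrams first
for gram, count in bigram_counts.most_common(3):
    print(" ".join(gram), count)
```

Changing the second argument to 1 or 3 counts single words or three-word sequences instead.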

The HathiTrust maintains a similar corpus of "18+ million digitized items [from] the collections of more than 60 academic and research libraries from across North America and other countries" (HathiTrust, 2025). They have developed and maintain an online research center where scholars are free to "engage in research and development for text analysis of massive digital libraries" (HathiTrust, 2025). In this chapter we will focus on numeric data visualization. Those who are interested in a deeper introduction to textual data visualization are referred to Johnson's (2019) chapter 5.

Numeric data visualization is done for a variety of purposes from

initial evaluation of a data set to identifying patterns in data,
to ensuring that data are appropriate for a particular statistical test,
to summarizing and presenting research results.

Different tools and types of visualizations are used to accomplish each of these purposes. Some of the basic chart types are:

Column charts, sometimes called bar graphs, are used to display categories and sometimes to make comparisons between categories.
Bar charts are similar to column charts except that the categories are organized along the y-axis and the values along the x-axis.
Line charts are used to display data that is arranged in columns or rows in an array (like a data frame). They are often used to display data points over time and to compare values of two variables, one on the x-axis and another on the y-axis.
Pie or donut charts are used to display (and sometimes compare) data that is arranged in a single column or row in an array.
Scatter plots are used to display combinations of two values or variables, one on the x-axis and one on the y-axis. Unlike line charts, on a scatter plot each pair of x, y values is represented by a dot on the chart. Scatter plots are often used to identify potential relationships among pairs of variables such as correlation.
Histograms are used to display frequencies in a distribution of data values. A distribution describes all the possible or occurring values for a variable and how frequently each one occurs (365 Data Science, 2025). Histograms may be used to determine whether data follow a normal (bell) curve, which is important for many statistical tests.
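Three of the chart types listed above can be sketched with Matplotlib as follows. All of the numbers are randomly generated or made up for illustration, and the variable names are assumptions, not part of any data set used in this book.

```python
import random

import matplotlib.pyplot as plt

random.seed(0)  # make the random illustrative data repeatable

# Made-up values for illustration only
categories = ["Fiction", "Nonfiction", "Media"]
checkouts = [120, 85, 40]
x = [random.uniform(0, 10) for _ in range(50)]
y = [2 * xi + random.gauss(0, 2) for xi in x]
values = [random.gauss(50, 10) for _ in range(500)]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Column chart: one vertical bar per category
axes[0].bar(categories, checkouts)
axes[0].set_title("Column chart")

# Scatter plot: one dot per (x, y) pair
axes[1].scatter(x, y)
axes[1].set_title("Scatter plot")

# Histogram: how many values fall into each of 20 bins
counts, bin_edges, _ = axes[2].hist(values, bins=20)
axes[2].set_title("Histogram")

fig.tight_layout()
```

Swapping `bar` for `barh` in the first panel would turn the column chart into a bar chart with the categories along the y-axis.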

6.2 Using Matplotlib and Seaborn

What is Matplotlib?

Like Pandas and Requests and BeautifulSoup, Matplotlib is a library of Python commands created and grouped for a specific purpose: "Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible" (The Matplotlib Development Team, 2025).

The following is quoted directly from Scudder (2024).

What is Seaborn?

Seaborn refers to itself as "a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics."


Types of Plots in Seaborn

These are three of the primary "families" of plots available in seaborn.

relplot is used for showing relationships among variables
displot is used for showing distributions of data
catplot is used for plotting categorical data

The plot types that fall under each of these can be expected to share some underlying code and accept similar arguments. As the documentation puts it, "similar functions for similar tasks."

Importing libraries and tools into Colab

Seaborn provides a sensible means for creating matplotlib graphics, and the defaults tend to be more aesthetically pleasing with less effort. It can still be useful to directly access matplotlib features and styles, so in addition to Seaborn we will be importing parts of matplotlib.

It is possible for most Seaborn plotting functions to work with data that has been constructed or loaded using the Pandas or Numpy libraries (e.g. data frames and arrays), as well as built-in Python data structures (e.g. lists and dictionaries). In addition to Seaborn and matplotlib, we will also load in Pandas to demonstrate this.

We will also import tools from google.colab so that we are able to save figures.

This ends the direct quote from Scudder (2024).

6.3 Exercise: Visualization Practice

In this exercise we'll practice creating a visualization of the data we gathered and cleaned in chapter 5. Recall that in this scenario, you were asked to imagine that you are a librarian working in the St Louis Public Library. You've been asked to prepare a proposal for presentation to the aldermen on the city council. In the proposal, the library is asking for additional funding for a new library program supporting the people in St Louis who speak a language other than English at home in order to plan a collection of new library materials in those languages. In the exercises in chapter 5, we obtained from the U.S. Census American Community Survey (ACS) a table of the top non-English languages spoken at home by those who speak English less than very well.

After obtaining that data and creating a data frame, we cleaned it up so that it was more human readable. But we'd like to present our data as a visualization because we hope that a visualization will have more impact on the aldermen and alderwomen of St Louis city. Since our data are categorical, we could use a pie chart or a bar chart. We'll use a bar chart since our purpose in creating a visualization is to compare the counts of our categories, that is, the languages spoken at home by those in St Louis who speak English less than very well.

The first thing to do is create a new data frame. Luckily, we saved the data frame we created in the exercises from Chapter 5 to a .csv file, so all we need to do is import it into a new data frame for this session.
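A minimal sketch of this step follows. The file name `languages.csv`, the column names `language` and `speakers`, and all of the counts are assumptions for illustration, not the real ACS figures; an in-memory stand-in replaces the file so the sketch runs on its own. It draws a horizontal bar chart, following the definition of a bar chart in section 6.1.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the .csv file saved in Chapter 5. Labels and counts are
# placeholders, not real ACS data. With the real file, pass its path instead,
# e.g. pd.read_csv("languages.csv") (hypothetical file name).
csv_source = io.StringIO(
    "language,speakers\n"
    "Spanish: speak English less than very well,1000\n"
    "Vietnamese: speak English less than very well,600\n"
    "Arabic: speak English less than very well,400\n"
)
df = pd.read_csv(csv_source)

# First attempt at the chart: one horizontal bar per language
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(df["language"], df["speakers"])
fig.tight_layout()
```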

We succeeded in creating a chart, but the text on the chart is still hard to read so we'll clean it up a bit more.

Now we'll shorten the variable labels in the chart.
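One way to shorten the labels is to strip the repeated wording from each category name before plotting. The sketch below assumes placeholder labels, counts, and column names (`language`, `speakers`); the actual labels in your data frame from Chapter 5 may differ.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder labels and counts for illustration only (not real ACS figures)
df = pd.DataFrame({
    "language": [
        "Spanish: speak English less than very well",
        "Vietnamese: speak English less than very well",
        "Arabic: speak English less than very well",
    ],
    "speakers": [1000, 600, 400],
})

# Drop the repeated suffix so only the language name remains
suffix = ": speak English less than very well"
df["language"] = df["language"].str.replace(suffix, "", regex=False)

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(df["language"], df["speakers"])
ax.set_xlabel("Speakers")
ax.set_title("Languages spoken at home (placeholder data)")
fig.tight_layout()
```

Moving the shared wording into the chart title or axis label keeps the information while leaving the tick labels short.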

It might help our audience (St Louis aldermen and alderwomen) to see the exact values for each language spoken, so let's add data labels to each bar in the chart.
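Data labels can be added with Matplotlib's `Axes.bar_label`, which is available in Matplotlib 3.4 and later. This is a sketch using placeholder counts, not the real ACS values.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only (not real ACS figures)
languages = ["Spanish", "Vietnamese", "Arabic"]
speakers = [1000, 600, 400]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(languages, speakers)

# Write each bar's value just past the end of the bar
labels = ax.bar_label(bars, padding=3)

ax.set_xlabel("Speakers")
fig.tight_layout()
```

The `padding` argument sets the gap, in points, between the end of the bar and its label.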

Next, let's change the color of the bars and the font inside each bar.
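The bar color and the label font can both be set with keyword arguments, as in this sketch. The specific color, font size, and placeholder values are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only (not real ACS figures)
languages = ["Spanish", "Vietnamese", "Arabic"]
speakers = [1000, 600, 400]

fig, ax = plt.subplots(figsize=(8, 4))

# color sets the fill color of every bar
bars = ax.barh(languages, speakers, color="steelblue")

# label_type="center" places each value inside its bar; the remaining
# keyword arguments control the font of the label text
labels = ax.bar_label(
    bars,
    label_type="center",
    color="white",
    fontsize=11,
    fontweight="bold",
)

fig.tight_layout()
```

A light font on a dark bar (or the reverse) keeps the values legible when the chart is projected.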

Finally, we'll need to export the chart so we can use it in our presentation.
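Exporting can be sketched with `Figure.savefig`. The file name `languages_chart.png` and the placeholder values are assumptions; the optional download step only runs inside Colab, where the `google.colab` module exists.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only (not real ACS figures)
languages = ["Spanish", "Vietnamese", "Arabic"]
speakers = [1000, 600, 400]

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(languages, speakers)

# Save the figure as a PNG in the current working directory.
# bbox_inches="tight" trims extra white space around the chart.
out_path = "languages_chart.png"  # hypothetical file name
fig.savefig(out_path, dpi=300, bbox_inches="tight")

# Inside Colab, offer the saved file as a browser download
try:
    from google.colab import files
    files.download(out_path)
except ImportError:
    pass  # not running in Colab; the file stays in the working directory
```

A `dpi` of 300 produces an image sharp enough for slides and print; a lower value gives a smaller file.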

Before closing your Colab session, be sure that your .png file is in your Google Drive if that's where you saved it, or that you download it to your own machine if you saved it in the Colab temporary session files.

Download the Python notebook with the code for this chapter and try it for yourself in Google Colab.

Continue on to Chapter 7.

References

HathiTrust. (2025). About the collection. Retrieved May 25, 2025 from https://www.hathitrust.org/the-collection/

HathiTrust. (2025). HathiTrust Research Center: What can you do with so many books? Retrieved May 25, 2025 from http://hathitrust.org/about/research-center/

Johnson, E. O. (2019). Working as a data librarian: A practical guide. Libraries Unlimited.

The Matplotlib Development Team. (2025). matplotlib. Retrieved May 26, 2025 from https://matplotlib.org/

Scudder, P. (2024). Best of CORE forum: Python for data visualization. Retrieved May 26, 2025 from http://ala-events.zoom.us/rec/play/uPFL9YwefI2Nc--Kc4CEXOpahtNOyrASxrJWL2EYq0IjX4rY2mykCaHR0fPgDaqLmAu78dGLcGS0F8X9.-MWeucm6DBJduF64?eagerLoadZvaPages=sidemenu.billing.plan_management&accessLevel=meeting&canPlayFromShare=true&from=share_recording_detail&continueMode=true&componentName=rec-play&originRequestUrl=https%3A%2F%2Fala-events.zoom.us%2Frec%2Fshare%2F1FV9TEv_VNS58owhgjUhipIlPmwiCcx19sO5wrFD2_xwwCyEiB7lqALgOMTsbmLU.nRNKhjAnUobctlqe