Chapter 7 - Data Analysis
7.1 Introduction
Data analysis refers to manipulating and/or interacting with a set of data in some way in order to answer a research question. In the research data life cycle, analysis happens about midway through the process, after planning, collection, cleaning, and, perhaps, initial visualization. Remember that we noted in Chapter 4 that cleaning data in preparation for analysis is often the most time-consuming part of the work of a data scientist. In large part, this is because data are messy. But it is also because data analysis is very carefully planned at the start of a research project. Along with identifying a research problem and question(s), at the start of a research project the researcher will have identified what kind(s) of data will be used to answer their research questions and how those data will be used. Those decisions are often driven by the researcher's philosophical approach to research, discussed in Chapter 3, section 3.1, on types of data. A researcher may even do a pilot test of their data analysis process to ensure that the type of data to be collected will yield appropriate results.
7.2 Numeric data analysis
The analysis of numeric data, that is, quantitative data, often makes use of the branch of mathematics known as statistics. Because there are entire courses and entire degrees in statistics, and because this textbook assumes no prior mathematical or statistical knowledge, coverage of numeric data analysis here will necessarily be broad and shallow. The hope is that it will inspire a desire for additional learning in the reader. Having said that, in this section we will cover some of the basic assumptions of and approaches to statistics.
One of the most basic assumptions of statistics is that something is being measured or counted. It could be test scores or temperatures or the number of cats with six toes in Key West, FL. No matter what is being measured, when the measurements differ enough from what would happen by chance, statistics tells us that there is reason to believe that there is a relationship between or among the things that are being measured. For instance, if test scores for a group of students that received a special lesson are higher than the scores of students who did not receive the special lesson, then, everything else being equal, there may be reason to believe that there is a relationship between the special lesson and higher test scores. Often, more statistical analysis may be used to verify and/or quantify the relationship.
Another of the most basic assumptions of statistics is that just because there is reason to believe that a relationship exists does not mean that the relationship is causal. That is, just because the number of cats with six toes in Key West, FL rose between 2000 and 2025 and the price of eggs rose between 2000 and 2025 does not mean that the number of cats with six toes in Key West, FL caused the price of eggs to rise.
There are two very basic types of statistical analysis that are commonly applied to numeric data: descriptive analysis and inferential analysis.
Inferential statistical analysis is used to make inferences, that is, to draw conclusions, based on how likely it is that a particular result could have happened by chance. Chance matters because what is being measured is usually a representative group of things rather than all of the things in existence; it is rarely cost effective to measure all of the things in existence. If you've read a journal article reporting on research that used inferential statistics, you may have seen those results reported with an "alpha" of 0.05 or 0.01. The alpha figure tells you how likely it is that the result being reported could have happened randomly (rather than being caused by something). For example, if the alpha for the test scores example given above was 0.01, you could say that the likelihood that one group of students got higher test scores than the other just by chance is very small, only 1%. It also means that, everything else being equal, there is a 99% likelihood that it was the special lesson and not just chance that caused the higher test scores.
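To make the test-scores example concrete, here is a minimal sketch in Python, the language used in this book's notebooks. The scores are made up for illustration, and the independent-samples t-test from SciPy is only one of several tests that could be used; it is not the specific analysis any particular study would run.

```python
# A minimal sketch of an inferential test on two made-up groups of test scores.
from scipy import stats

lesson_group = [88, 92, 85, 91, 87, 90, 93, 86]    # hypothetical scores: students who got the special lesson
control_group = [81, 84, 79, 86, 80, 83, 85, 82]   # hypothetical scores: students who did not

alpha = 0.05                                        # threshold chosen before running the test
t_stat, p_value = stats.ttest_ind(lesson_group, control_group)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("The difference is unlikely to be due to chance alone.")
else:
    print("The difference could plausibly be due to chance.")
```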
The other basic type of analysis that we often run into is descriptive analysis. Descriptive analysis of numeric data is used to describe the data, that is, to describe the characteristics of the thing(s) being measured and the measurements themselves. Descriptive methods are related to inferential methods. For example, they may be used to determine whether a group of things (say, the pencils used in a single classroom) being measured for a research study is representative of all of the things in existence (that is, all of the pencils in the world). The group of things in the study is called the sample, and all of the things in existence is called the population. Similarity between sample and population is important when the researcher wants to be able to say that the results of studying the sample can be extrapolated, or generalized, to the entire population.
Descriptive statistics are also used to summarize a set of data and so can be a result in and of themselves. In Chapter 5 we obtained some data from the U.S. Census about the languages spoken in homes in St. Louis city. In cleaning and preparing those data, we summarized a data frame with 155 rows into a table with 13 rows. We stopped with that 13-row table because we had answered the research question: which languages are spoken in the homes where English is spoken less than very well? In that case, descriptive statistics were sufficient to answer our question.
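Descriptive summaries are easy to produce in Python. Here is a minimal sketch using pandas on a small, made-up set of measurements (not the census data from Chapter 5); the describe() method reports the count, mean, standard deviation, minimum, quartiles, and maximum in one step.

```python
# A minimal sketch of descriptive statistics on a small, made-up data set.
import pandas as pd

measurements = pd.Series([88, 92, 85, 91, 87, 90, 93, 86])

print(measurements.describe())           # count, mean, std, min, quartiles, max
print("median:", measurements.median())  # the middle value
```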
Care should be taken when using descriptive statistics. In the exercise in Chapter 5, as part of our data cleansing we removed rows that contained "margin of error" and kept only the rows that contained the estimated number of households in which each language was spoken. Margin of error represents the amount by which an estimate might be off. For instance, in the data from Chapter 5 the estimate for the number of homes in which only English was spoken was 254,492, and the margin of error for that estimate was plus or minus 2,563. In other words, the number of homes in which only English was spoken could range from 251,929 (254,492 - 2,563) to 257,055 (254,492 + 2,563), a difference of 5,126. Although that is a broad range, for our purposes in Chapter 5 it was sufficient. It is important to know about margin of error because there may be contexts in which a broad margin of error makes results less useful.
7.3 Textual data analysis
Until quite recently, textual data analysis was associated with qualitative data analysis, which meant "analysing the subjective meaning or the social production of issues, events, or practices by collecting non-standardized data and analysing texts and images" (Flick, 2022). With the advent of dramatically lower costs for massive data storage, combined with machine learning, came the ability to use computers to analyze textual data. In this context, massive data refers not only to more data than can be stored in a spreadsheet but also to more data than can be stored on a single computer. Figure 7.1 depicts a table that should give you some idea of what those amounts of data mean. This is more data than a human could possibly analyze. Being able to store such massive amounts of data led directly to being able to analyze these data using computers. Since the focus of this book is on data scientific forms of data analysis, we will focus on the use of data science (rather than qualitative) techniques for analyzing text.
Figure 7.1 Examples of Data Volumes
Source: Nasa. (n.d.). Data volume units. https://mynasadata.larc.nasa.gov/print/pdf/node/273#:~:text=Peta%2D%20means%201%2C000%2C000%2C000%2C000%2C000;%20a%20Petabyte,a%20Yottabyte%20is%201%2C000%20Zettabytes
Text mining, natural language processing, and machine learning developed out of the fields of computer science, information science, statistics, and linguistics. All three are closely related in both their development and use. Text mining leverages natural language processing techniques to analyze and understand unstructured textual data, and then uses machine learning algorithms to extract insights from the data, such as patterns, relationships, and meaning in the text. Before we can talk about how to use them, we need to define them.
- Natural language processing (NLP) is a specific area within machine learning that focuses on enabling computers to process and understand human language. NLP techniques are crucial for text mining because they allow computers to extract meaningful information from text (IBM, 2025).
- Machine learning (ML) is a broad field of artificial intelligence (AI) that allows computers to learn from data without being explicitly programmed. It involves using algorithms to identify patterns, make predictions, and automate tasks (Google Cloud, n.d.).
- Text mining applies NLP techniques and ML algorithms to extract knowledge and insights from unstructured textual data. It's essentially a data analysis approach that focuses on textual data (IBM, 2023).
We also need to define the term algorithm. An algorithm in this context means a precise set of instructions used by a computer to solve a problem or perform a calculation ("Algorithm," 2025). Algorithms are powerful, and they are used for some things you may be familiar with but, perhaps, had not thought about. For instance, algorithms are used to determine the order in which database search results are presented. Consider the last time you searched Library and Information Science Source: in what order were your search results presented? You might answer, in order of their relevance to my search terms. That's correct, but how did the EbscoHost platform decide what was more relevant and what was less relevant? The answer is that we don't know precisely, both because EbscoHost, like Google, does not share the specifics of its algorithms, and because EbscoHost allows the libraries that subscribe to its databases to make some (not all) of the decisions about relevance.
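To make the idea of an algorithm concrete, here is a deliberately simplified sketch of one way relevance ranking could work: score each document by how often the search terms appear in it, then sort by that score. Real platforms such as EbscoHost use far more sophisticated (and undisclosed) methods; this is only an illustration.

```python
# A deliberately simplified relevance-ranking algorithm:
# score each document by how many times the search terms appear, then sort by score.
def rank_by_term_frequency(documents, search_terms):
    scored = []
    for doc in documents:
        words = doc.lower().split()
        score = sum(words.count(term.lower()) for term in search_terms)
        scored.append((score, doc))
    # highest-scoring documents first
    return [doc for score, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]

docs = [
    "The public library offers story time for children.",
    "A new museum exhibit opens downtown.",
    "Library funding and library staffing were discussed at the meeting.",
]
print(rank_by_term_frequency(docs, ["library"]))
```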
With these concepts in mind, we turn to one popular kind of computer-based text analysis called sentiment analysis. Sentiment analysis is a technique that uses natural language processing and machine learning to understand the emotional tone or sentiment expressed in text data. It is currently used most often by businesses to understand customer feedback, social media conversations, and other written content by identifying positive, negative, or neutral sentiments. Also sometimes called opinion mining, sentiment analysis is said to provide objective results at speeds approaching real time on large to extremely large bodies of text. It is important to recognize that, like all computer-assisted text analysis, the results are only as good as the data they are trained on. In other words, if an algorithm is trained using biased data, the results of analyzing new data are likely to be biased as well.^*
^* For a much more in-depth look at bias in big data analysis see Catherine D'Ignazio and Lauren Klein's book Data Feminism.
Exercise 7.3
In this exercise we'll use sentiment analysis to examine a current, library-centric topic: the attitude toward libraries expressed in the New York Times during 2025.
It is important to bear in mind that this analysis is not without bias. While the New York Times claims to be independent and non-partisan in its reporting, its editorial endorsements and overall coverage often reflect a left-leaning perspective. It was selected for this exercise because of the relative ease with which its content can be gathered for analysis.
Unlike the data we obtained from the U.S. Census in Chapter 5, in order to obtain the data from NYT articles we'll need an API key. API is the acronym for application programming interface. "In the context of APIs, the word Application refers to any software with a distinct function. Interface can be thought of as a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses" (AWS, 2025). An API defines how one application, such as a Python notebook, can access data or functionality offered by another software program, such as the NYT server. When a request is made, the API processes it, executes the necessary actions, and returns a response, often in a format like JSON or XML. An API key is a unique identifier assigned to an individual or an individual machine that gives them permission to make a request for data or functionality. API keys allow the owner of the data or functionality to protect and track the usage of their data and to control the number of requests an application can make within a given time period.
The first step in this exercise is to obtain an API key from the NYT by visiting their developer portal. Follow the directions for creating an account and creating an API key unique to you. Next, install and import the Python libraries we'll need; most of them we have used in earlier chapters, but a couple are new (a setup sketch follows this list).
- time includes functions that allow us to control the speed with which requests are sent and results returned so that we do not overload the server whose content we are obtaining.
- JSON stands for JavaScript Object Notation and is a lightweight format for storing and transporting data. JSON is often used when data is sent from a server to a web page (Python Cheat Sheet, n.d.).
- NLTK stands for Natural Language Toolkit: "NLTK is a leading platform for building Python programs to work with human language data" (Natural Language Toolkit, 2024). It contains several sub-libraries that we'll use, including:
  - tokenize, which we'll use to clean up the text we obtain,
  - corpus, which contains a set of stopwords, that is, words whose meaning is not useful in determining sentiment, such as "a", "the", "and", and so on,
  - vader, a sentiment "lexicon", that is, a repository of words and relationships between words used to help computers make sense of human language. There are multiple lexicons available for this purpose; VADER is one of the more popular ones.
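Here is a minimal setup sketch for the list above, assuming you are working in Google Colab and that none of the packages or NLTK data files are installed yet; the specific NLTK downloads (punkt, stopwords, vader_lexicon) are the standard resources for the steps in this exercise.

```python
# Install the packages (uncomment in Colab if they are not already available).
# !pip install requests pandas nltk wordcloud matplotlib

import time          # pause between API requests
import json          # work with JSON responses
import requests      # send HTTP requests to the NYT API
import pandas as pd  # hold the articles in a data frame

import nltk
nltk.download('punkt')           # tokenizer models used by word_tokenize
nltk.download('stopwords')       # generic English stopword list
nltk.download('vader_lexicon')   # VADER sentiment lexicon

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
```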
Now you will store your NYT API key in a Python object for this session. Paste your actual API key over the words YOUR API KEY in the code block below.
![](https://github.com/sarahwsutton/Introduction_to_datascience_for_librarians/blob/main/7-4.png)
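The wiki shows this step as a screenshot; in case the image does not load, a minimal sketch of the same idea is below (the variable name is illustrative).

```python
# Store your personal NYT API key for this session.
# Paste your actual key over the placeholder text.
API_KEY = "YOUR API KEY"
```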
Now you're ready to run a query in the NYT. In the code block below you'll create a Python object called url which stores the URL for the query. Notice that the search term used below is 'library.' We know this because it follows the string 'articlesearch.json?q='. The server's response is stored in an object called 'response', which, in turn, is converted to a JSON object called 'data'. Finally, the JSON object is used to create a data frame in which we will work with the data.
Notes:
- "JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate" ("Introducing JSON," n.d.).
- You could easily change the search by replacing the search term 'library' with another search term.
![](https://github.com/sarahwsutton/Introduction_to_datascience_for_librarians/blob/main/7-6.png)
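Here is a sketch of the request described above, assuming the standard NYT Article Search endpoint and pandas' json_normalize for flattening the results; the exact code in the notebook screenshot may differ in its details.

```python
import requests
import pandas as pd

# Build the query URL: search the NYT Article Search API for the term 'library'.
url = ('https://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?q=library&api-key=' + API_KEY)

response = requests.get(url)   # send the request to the NYT server
data = response.json()         # convert the JSON response body to a Python dictionary

# Each page of results carries its articles in data['response']['docs'];
# flatten them into a data frame.
df = pd.json_normalize(data['response']['docs'])
print(len(df), "articles retrieved")
```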
We obtained 10 articles from the first page of results for our query, but there are more, and we want to include them in our analysis. First, we'll try adding one more page of article results.
So far, so good. Let's go for about 250 articles or 25 pages of results.
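Below is a sketch of the kind of loop being described, requesting 25 pages with a six-second pause between them; the variable names are illustrative, and, as the next paragraph explains, a loop like this can still fail.

```python
import time

# Request 25 pages of results (roughly 250 articles at 10 per page).
all_articles = []
for page in range(25):
    page_url = url + f"&page={page}"
    data = requests.get(page_url).json()
    all_articles.extend(data['response']['docs'])  # fails with a KeyError if the rate limit is hit
    time.sleep(6)                                  # pause between requests

df = pd.json_normalize(all_articles)
```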
As you can see, the code block above resulted in an error. The error is included here as an illustration of a common problem with this kind of data gathering through an API. It is what is called a rate limit error, which we received even though we included a 6-second pause between page requests. Google's AI, Gemini, explains the error this way:
The loop is attempting to fetch 25 pages of results, and the time.sleep(6) call is intended to prevent hitting the rate limit. However, the New York Times Article Search API has a rate limit of 10 requests per minute [1]. Sleeping for 6 seconds between requests means sending a request every 6 seconds, which is 10 requests per minute. While this should theoretically keep you within the limit, network latency, slight variations in execution time, or other factors might cause you to exceed the limit occasionally, especially when making a large number of requests in a loop. Therefore, the most likely cause of the error is hitting the API rate limit, resulting in a response that does not contain the expected 'response' key.
You may run the code above and then use the 'Explain error' link to view the full explanation of the error from Gemini.
The code below contains Gemini's recommended revisions to the code. In the results you should notice that some rate limit errors are still occurring but that the requests were re-submitted for those pages and we did obtain 250 articles. Also note that it took approximately 2 minutes to fully run the code (the number just below the green check mark to the left of the code block).
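Since the revised code lives in the notebook, here is only a rough sketch of the retry pattern it describes: when a page comes back without the expected 'response' key, wait and then try that page again. The specific wait times are assumptions, not Gemini's exact recommendation.

```python
import time

# Request 25 pages, retrying a page after a longer pause whenever the rate limit is hit.
all_articles = []
page = 0
while page < 25:
    page_url = url + f"&page={page}"
    data = requests.get(page_url).json()
    if 'response' in data:                      # the request succeeded
        all_articles.extend(data['response']['docs'])
        page += 1
        time.sleep(6)
    else:                                       # rate limit (or other) error: wait, then retry the same page
        print(f"Rate limit hit on page {page}; waiting before retrying...")
        time.sleep(30)

df = pd.json_normalize(all_articles)
print(len(df), "articles retrieved")
```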
Now we can reduce the size of our df by pulling out only the columns (variables) needed for our text analysis, specifically the abstract, snippet, and headline for each article.
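A sketch of that reduction step, assuming the flattened column names that json_normalize produces for NYT results (abstract, snippet, and headline.main); check df.columns if your names differ. Combining the three fields into a single text column makes the later cleaning and scoring steps simpler.

```python
# Keep only the text fields needed for the sentiment analysis.
text_df = df[['abstract', 'snippet', 'headline.main']].copy()

# Combine the three fields into one text column to analyze.
text_df['text'] = (text_df['abstract'].fillna('') + ' ' +
                   text_df['snippet'].fillna('') + ' ' +
                   text_df['headline.main'].fillna(''))
print(text_df['text'].head())
```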
The first step in cleansing textual data is to break it down into chunks, usually individual words, using functions from the NLTK's tokenize package.
Part of cleaning textual data is changing all words in the data set to lower case, because Python differentiates between upper and lower case. For example, Python would consider 'library' and 'Library' to be two different words.
Another part of cleaning textual data is removing words that hold little or no meaning; these are called stopwords, and NLTK includes (and we have already loaded) a file containing a generic list of them. You should know that in some situations you may need to create and add customized stopwords. A sketch covering all three cleaning steps follows.
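Here is a sketch of the three cleaning steps just described (tokenizing, lowercasing, and removing stopwords), applied to the combined 'text' column built above; the helper function name is illustrative.

```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def clean_text(text):
    tokens = word_tokenize(text)                          # break the text into individual words
    tokens = [t.lower() for t in tokens]                  # lowercase so 'Library' and 'library' match
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]   # drop stopwords like 'a', 'the', 'and'
    return tokens

text_df['tokens'] = text_df['text'].apply(clean_text)
print(text_df['tokens'].head())
```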
At this point we might like a visualization of our data, both to get a sense of what sentiment analysis might uncover and to determine whether there are any additional words with little meaning that need to be removed. We'll use a word cloud for this.
The code block below is modified from code written by Mingqian Liu, Xinyu Li, Xin Xiang, Yanfeng Zhang (https://github.com/XinXiang0307/MBTI-reddit-project-team-10_public).
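That code appears in the notebook; here is only a loose sketch of the same idea using the wordcloud package, building the cloud from the cleaned tokens and saving a copy to a file (the figure size, colors, and filename are arbitrary choices).

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all cleaned tokens into one long string for the word cloud.
all_words = ' '.join(text_df['tokens'].apply(' '.join))

wc = WordCloud(width=800, height=400, background_color='white').generate(all_words)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

wc.to_file('nyt_library_wordcloud.png')   # save a copy to download for a presentation or publication
```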
The word cloud looks pretty good, so we won't make any changes to our data at this point. But we will download the word cloud to our local computer in case we decide to use it in a presentation or publication.
Up to now we've collected and cleaned up our data so that it will work correctly in our analysis. As you can see, that takes a lot of time and effort, sometimes more than the analysis itself will take! Now, however, we're ready to run our sentiment analysis.
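Here is a sketch of the sentiment step as described in the next paragraph, using NLTK's VADER analyzer to score each cleaned word and tally how many lean positive or negative; scoring whole sentences or articles is another common approach.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# Score every cleaned word and tally how many lean positive, negative, or neutral.
positive_words, negative_words, neutral_words = [], [], []
for tokens in text_df['tokens']:
    for word in tokens:
        score = sia.polarity_scores(word)['compound']   # between -1 (negative) and +1 (positive)
        if score > 0:
            positive_words.append(word)
        elif score < 0:
            negative_words.append(word)
        else:
            neutral_words.append(word)

print("positive words:", len(positive_words))
print("negative words:", len(negative_words))
print("neutral words:", len(neutral_words))
```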
We have some results! We can see that there is an almost equal number of positive words and negative words in our textual data. This suggests that the NYT reporting on libraries is fairly well balanced. We can take the analysis even further by taking a look at the most frequently used words in our data.
These results show a list of the 10 most frequently used words in our data set. We might also want a visualization of this list. A bar chart would be a good choice, and we'll plot the top 25 words.
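A sketch of both the frequency count and the bar chart, using Python's Counter for the tallies and matplotlib for the plot; the figure size and labels are arbitrary choices.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Count every cleaned token across all articles.
word_counts = Counter(token for tokens in text_df['tokens'] for token in tokens)

print(word_counts.most_common(10))    # the ten most frequent words

# Bar chart of the 25 most frequent words.
words, counts = zip(*word_counts.most_common(25))
plt.figure(figsize=(10, 6))
plt.bar(words, counts)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Frequency')
plt.title('25 most frequent words in NYT articles about libraries')
plt.tight_layout()
plt.show()
```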
It's not surprising that the word 'library' is the most frequent word in our data set, since that's the term we searched for. Nor is it a surprise that the words 'new' and 'york' appear in our list of the 25 most frequent words, since the source of our data is the NYT. It is interesting that the top 25 most frequent words include 'trump' and 'congress', which suggests a relationship between libraries and the current presidential administration. Combined with the results of the sentiment analysis, specifically the almost equal number of positive and negative words in our textual data, this suggests that the relationship, as reported in the NYT, is neither overly negative nor overly positive.
As you may have noticed, there are several directions in which this analysis could be continued:
- There are some relatively meaningless words in our top 25 that might be added to a customized stopword list, for example 'also', 'first', 'one', and 'two'.
- Our original search query was for the word 'library'; if we changed it to 'libraries', or included both 'library' and 'libraries' in our search, we might see a change in our results.
- This type of analysis, especially the visualizations, might also benefit from an exploration of n-grams in the data. For example, we might find bigrams like 'public library', trigrams like 'librarian of congress', and so on (see the sketch after this list).
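Here is a sketch of a bigram count using NLTK's bigrams helper on the cleaned tokens; the same idea extends to trigrams with nltk.trigrams.

```python
from collections import Counter
from nltk import bigrams

# Count pairs of adjacent words (bigrams) across all cleaned articles.
bigram_counts = Counter()
for tokens in text_df['tokens']:
    bigram_counts.update(bigrams(tokens))

print(bigram_counts.most_common(10))   # e.g. ('public', 'library') if that pair appears often
```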
7.4 Summary
In this chapter we learned that numerical, quantitative data analysis uses statistics to describe sets of data as well as to draw inferences about whatever the data measure. We learned that statistics rest on assumptions about the data that, when they do not hold, can result in erroneous conclusions. We learned that just because there seems to be a relationship between two (or more) measurements does not mean that one thing causes another; establishing causal relationships requires additional statistical analysis.
Qualitative data analysis, including the analysis of unstructured bodies of text, has traditionally been done by humans on a relatively small scale, but the advent of inexpensive data storage, combined with advances in computing (such as text mining, natural language processing, and machine learning), has made possible the analysis of very large corpora of text. The existence of the latter processes, however, does not in any way negate the value of the former.
Download the Python notebook with the code from this chapter and try it for yourself in Google Colab.
References
365 Data Science. (2025). What is a distribution in statistics? Retrieved May 30, 2025 from https://365datascience.com/tutorials/statistics-tutorials/distribution-in-statistics/
Algorithm. (2025). In Wikipedia. http://en.wikipedia.org/wiki/Algorithm#:~:text=In%20mathematics%20and%20computer%20science,performing%20calculations%20and%20data%20processing
AWS. (2025). What is an API (application programming interface)? Retrieved May 29, 2025 from https://aws.amazon.com/what-is/api/#:~:text=API%20stands%20for%20Application%20Programming,other%20using%20requests%20and%20responses.
Google Cloud. (n.d.). What is machine learning (ML)? Retrieved May 29, 2025 from http://cloud.google.com/learn/what-is-machine-learning
Flick, U. (2022). An introduction to qualitative research (7th ed.). Sage.
IBM. (2023). Leveraging user-generated social media content with text-mining examples. Retrieved May 29, 2025 from https://www.ibm.com/think/topics/text-mining-use-cases#:~:text=streamline%20the%20process.-,What%20is%20text%20mining?,giving%20companies%20a%20competitive%20edge.
IBM. (2024). What is NLP (natural language processing)? Retrieved May 29, 2025 from http://ibm.com/think/topics/natural-language-processing#:~:text=Statistical%20NLP&text=This%20relies%20on%20machine%20learning,on%20Touch-Tone%20telephones).
Introducing JSON. (n.d.). Retrieved May 29, 2025 from https://www.json.org/json-en.html
Natural Language Toolkit. (2024). Documentation. Retrieved May 29, 2025 from http://nltk.org/
PythonCheatSheet.org. (n.d.). Python JSON module. Retrieved May 29, 2025 from https://www.pythoncheatsheet.org/modules/json-module