Chapter 4 - Preparing Data for Analysis
4.1 Introduction
This chapter begins a deep dive into the uses of data science techniques in libraries that continues over the next few chapters. Rather than starting with obtaining data, the first step in the data lifecycle, we begin with a closer look at what is arguably the most important step in using data, because knowing what our data must look like in order to conduct an analysis is helpful at every stage of the data lifecycle. Knowing what type, amount, and format of data is needed to answer a research question allows us to obtain only the data we really need, in the format we need, whether we're using existing data or collecting new data. That step is data cleansing, and it is estimated to take up as much as 80% of a data scientist's time (Sarih et al., 2019), as shown in Figure 4.1.
Figure 4.1 Percent of work time spent by data scientists.
4.2 Data Cleansing
Data cleansing is simply cleaning and organizing data so that they can be manipulated and analyzed. It "refers to the process of identifying errors, anomalies, and inconsistencies in the dataset" (Amit, 2024). Consider a spreadsheet that contains the names and addresses of the people to whom I regularly send Christmas cards. My spreadsheet includes the dates on which I sent and received cards from each person. It is important to me to be able to sort my spreadsheet by date so that I can tell who I haven't heard from in a while. But sometimes I am inconsistent in the way I enter my dates. One year I might use MM-DD-YY, the next I might use MM-DD-YYYY, and back when I started my sheet, I might have used Month Day, Year. When I try to sort (organize) my spreadsheet by the date I last received a card from someone, the sort will not be correct if I've used multiple different formats for those dates! In order to get a correctly sorted list, I have to clean up those dates so they are all expressed in the same way. Voila! Data cleansing!
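To make that concrete, here is a minimal pandas sketch using a made-up slice of the card list (the names, dates, and column labels are hypothetical, and parsing mixed date formats this way assumes pandas 2.x):

```python
import pandas as pd

# A hypothetical slice of the Christmas card spreadsheet, with dates
# entered in three different formats over the years.
cards = pd.DataFrame({
    "Name": ["Ann", "Bea", "Carl"],
    "LastReceived": ["12-20-23", "12-18-2022", "December 22, 2021"],
})

# Sorting the raw text sorts alphabetically, not chronologically.
print(cards.sort_values("LastReceived"))

# Parsing every entry into one consistent datetime type fixes the sort.
cards["LastReceived"] = pd.to_datetime(cards["LastReceived"], format="mixed")
print(cards.sort_values("LastReceived"))
```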
Cleansing data is imperative because messy data can create skewed or even false results. It is also time consuming and many data scientists' least favorite part of the research process (Press, 2013). So, what is messy data? It is not just data in conflicting formats, as was the case in my spreadsheet example. It can also include missing data points, duplicate data points, and even incorrect data points.
Steps for cleansing data
There are a few basic steps to cleansing data. What they look like in specific practice will vary depending on the type of data, but in general they are:
- Managing duplicate data. Duplicates should generally be removed.
- Managing missing data. What to do about missing values depends on the type of data, the size of the data set, and the number of data points that are missing.
- Standardizing data formats. As in the Christmas card dates example, it is usually a good idea to ensure that all data points for a particular variable are expressed the same way.
- Managing outliers. Outliers are data points that appear very different from the majority of data points for a given variable. Their management may include determining whether they are errors and whether to ignore them, both of which depend on the type of data and the size of the data set.
4.3 Data Transformation
Data transformation is related to but slightly different from data cleansing. Sometimes raw data may not come in the form needed for the tests the researcher wishes to perform. "Data transformation refers to formatting, restructuring, and modifying the original data to a more suitable or recommended format" (Amit, 2024). For instance, certain statistical tests can only be performed effectively on certain types of data. This means that raw data may need to be transformed from one type to another.
Data transformation differs from data cleansing in that it is applied to data that are already ready for analysis (i.e., already cleansed). One example of data transformation is performing a mathematical operation such as applying a logarithm to a numerical (ratio) variable in order to obtain values that follow a normal distribution*, because a normal distribution is a requirement for a number of statistical operations. But before such a transformation is performed, the data should be clean: free of duplicates, with missing values and outliers dealt with, and standardized in format.
*A normal distribution is a probability distribution that is symmetrical around the mean, creating a bell-shaped curve when graphed.
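As a concrete illustration, here is a minimal sketch of a log transformation applied to a hypothetical right-skewed variable (the variable name and values are made up for this example):

```python
import numpy as np
import pandas as pd

# A hypothetical right-skewed ratio variable, e.g. annual circulation
# counts for a handful of library branches.
circulation = pd.Series([120, 340, 560, 980, 2100, 4500, 18000])

# Applying a logarithm compresses the large values, pulling the
# distribution closer to normal. (Zeros or negative values would need
# special handling before taking the log.)
log_circulation = np.log(circulation)
print(log_circulation.round(2))
```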
Exercise 4.1 Simple Data Cleansing
Most data cleansing can be programmed. In this exercise, we will practice cleansing data using pre-written Python code.
For this simple data cleansing exercise, we'll use the weather data for Emporia, KS in October 2024. To start with, we'll load the data from the .csv file we created in the exercises in chapter 2, which I've modified slightly for this exercise. Download it from Oct_temps.csv. Upload it to either your Colab temporary files or your Google Drive, then copy the file path and enter it below.
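The loading step might look something like this sketch (the path below is a placeholder; substitute the one you just copied):

```python
import pandas as pd

# Hypothetical path; replace it with the path to your uploaded copy of Oct_temps.csv.
path = "/content/Oct_temps.csv"

df = pd.read_csv(path)
df.head()
```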
Notice the first column in the dataframe. This column is called the index, although that label doesn't appear at the top of the column. It contains a unique numerical label for each row of the dataframe. No matter how we might sort this dataframe, the row number "goes with" the row it is first assigned to.
Also notice that the first row is numbered "0." This is a Python convention. It's important to be aware of this when we start counting the total number of rows in our dataframe.
Let's check to see how many rows are in our dataframe.
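One quick way to do that (assuming the dataframe is named df, as above):

```python
# len(df) returns the number of rows; df.shape returns (rows, columns).
print(len(df))
print(df.shape)
```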
Notice that 32 rows makes sense: October has 31 days, and the first row below the column names contains text rather than weather data. This happened because the .csv file we loaded into our dataframe had two rows of column labels. So we have one row of weather data for each of the 31 days in October plus that extra row of text.
This is the first thing we're going to clean up in our dataframe. We'll do that by giving our columns more understandable labels and getting rid of that extra row of text.
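Here is a sketch of that cleanup. The shorter labels below are hypothetical and assume seven columns; adjust the list to match your file, keeping at least "Oct", "MaxTemp", "AvgTemp", and "MinTemp", since the later steps refer to those names:

```python
# Hypothetical, shorter column labels; edit the list to match your file.
df.columns = ["Oct", "MaxTemp", "AvgTemp", "MinTemp",
              "MaxWind", "AvgWind", "MinWind"]

# Drop the leftover row of label text (index 0) and renumber the rows.
df = df.drop(index=0).reset_index(drop=True)
len(df)
```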
Great! Now we have 31 rows!
Now we can start working through the steps described earlier in this chapter for cleansing data. First up is checking for duplicates. In this case, we're looking at weather data for each of the 31 days in October 2024. That tells us that the column that has to be checked for duplicates is the column called "Oct."
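A duplicate check on that column might look like this sketch (assuming the column labels used above):

```python
# Count how many values in the day-of-month column appear more than once.
print(df["Oct"].duplicated().sum())

# Display any duplicated rows directly (the result is empty if there are none).
df[df["Oct"].duplicated(keep=False)]
```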
Great, no duplicates! Next, we'll look for missing data in our dataframe.
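Pandas can count the missing values in each column for us; a minimal version looks like this:

```python
# Count the missing (NaN) values in each column.
df.isna().sum()
```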
The result shows that there are 3 missing values in our dataframe and that they're all in the column "MaxTemp."
Next we need to know which rows the missing values are in.
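One way to display just those rows:

```python
# Show only the rows where MaxTemp is missing.
df[df["MaxTemp"].isna()]
```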
The results show that the values in rows 8, 9, and 21 of the MaxTemp column are "NaN." In many programming languages, including Python, NaN stands for "not a number." In this case, Python is telling us that rows 8, 9, and 21 contain missing values.
How to handle missing values depends on your data, what you know about your data, what manipulations you're going to perform in your analysis, and how much data you have. In this case, because we have columns for AvgTemp and MinTemp, we could calculate the missing MaxTemp values.
Note that the image above doesn't include all of the output; it is just a screenshot.
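A minimal sketch of one way to do that calculation, assuming the daily average is simply the midpoint of the daily maximum and minimum (true averages are computed from many readings, so this only produces an estimate):

```python
# If AvgTemp = (MaxTemp + MinTemp) / 2, then MaxTemp = 2 * AvgTemp - MinTemp.
estimated_max = 2 * df["AvgTemp"] - df["MinTemp"]

# Fill in only the missing MaxTemp values with the estimates.
df["MaxTemp"] = df["MaxTemp"].fillna(estimated_max)

# Check the rows that used to be missing.
df.loc[[8, 9, 21]]
```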
We can see that there are now values for MaxTemp in rows 8, 9, and 21. Just for fun, we could double check the calculations for the MaxTemp in row 21:
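Using the same midpoint assumption, the check might look like this:

```python
# Recompute the estimate for row 21 by hand; it should match the value
# now shown in the dataframe (84.6).
2 * df.loc[21, "AvgTemp"] - df.loc[21, "MinTemp"]
```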
Sure enough, if we look back at the full dataframe, 84.6 is the MaxTemp for row 21.
If we were unable to calculate the missing values, another alternative would be to delete those rows with missing values using the code below.
Note that if you ran the code block above, you will have no rows to delete because we just filled them in. The code for removing rows with missing values is provided below so that you can refer back to it as needed.
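A sketch of that removal step:

```python
# Remove any rows that still contain missing values.
df = df.dropna()
len(df)
```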
The next step in data cleansing is to standardize the formatting of our variable values. We can see by looking at our df that all the columns appear to contain numbers. But remember that in Python there are several different data types that can appear as numbers, and some types of data analysis require a specific numerical data type. Let's assume that we want all our data to be of the type "float," that is, a number with a decimal point. First, we'll check the data type of each column in our df.
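A quick way to check (assuming the dataframe is still named df):

```python
# Display the data type pandas has assigned to each column.
df.dtypes
```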
The results displayed tell us that the values in the AvgTemp column are floats, and the values in the MinTemp column are integers, which is good. But the values in the other columns are objects, which won't work if we want to perform mathematical calculations on them. So, we'll convert them to floats.
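One way this conversion might be written is with pd.to_numeric, which turns whole-number columns into integers and columns containing decimals into floats, matching the result described below:

```python
# Convert every remaining object (text) column to a numeric type.
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = pd.to_numeric(df[col])

df.dtypes
```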
Great, now the values in all our columns are either integers or floats (the columns in which all the numbers were whole numbers were converted to integers instead of floats).
We're ready to look for outliers. Remember that outliers may represent incorrect data. There are several ways to check data for outliers and the method used depends on the data and its context. In this case we'll use a simple method: plotting each column over time.
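A sketch of that approach, using pandas' built-in plotting (which relies on matplotlib) to draw one panel per weather variable:

```python
import matplotlib.pyplot as plt

# Plot each remaining column against the day of the month, one panel per column.
df.plot(x="Oct", subplots=True, figsize=(8, 10))
plt.tight_layout()
plt.show()
```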
The hard part about looking for outliers in this way is that we have to look closely for values that don't make sense in their context. We can see that there are some moderately large fluctuations in our weather metrics. For instance, the average temperatures (AvgTemp) fluctuated between slightly above 75 degrees and slightly below 40 degrees. These are big fluctuations, but they're reasonable for Emporia in October. On the other hand, there might be an outlier in the maximum temperatures (MaxTemp) for October 31, where the value entered is 157. That's not only an outlier but also probably an error, since there's very little chance that the temperature in Emporia actually reached 157 degrees. The same may be true for the maximum wind speed on October 14. There again the value isn't just very different from the rest of the maximum wind speeds recorded; it's also almost impossible that the wind reached a speed of 70 miles per hour.
There are several other, more effective and efficient ways to look for outliers in numerical data, but they require a more complex understanding of statistical calculations than is assumed of readers of this introductory textbook. As we move into data analysis, if we need them, code and brief explanations for non-experts will be provided.
4.4 Conclusion
In this chapter we learned about the importance of preparing data for analysis. Cleansing data "refers to the process of identifying errors, anomalies, and inconsistencies in the dataset" (Amit, 2024), and transforming data "refers to formatting, restructuring, and modifying the original data to a more suitable or recommended format" (Amit, 2024). Performing these operations helps us not only to become familiar with our data but also to know that our data are appropriate for the analyses we wish to perform in order to answer our research questions. This knowledge can be very helpful in the data collection process because it prevents us from spending a lot of effort collecting or obtaining data on which we are unable to perform the analyses necessary for answering our research questions.
Download the Python notebook with the code for this chapter and try it for yourself in Google Colab.
References
Amit, P. (2024, February 13). Data cleaning in data science: Process, benefits and tools. UpGrad Knowledge Hut. https://www.knowledgehut.com/blog/data-science/data-cleaning
Sarih, H., Tchangani, A. P., Medjaher, K., & Pere, E. (2019). Data preparation and preprocessing for broadcast systems monitoring in PHM framework. 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), 1444–1449. https://doi.org/10.1109/codit.2019.8820370
Continue on to Chapter 5 - Obtaining Data