Introduction to Data Analysis Using Pandas - 180D-FW-2023/Knowledge-Base-Wiki GitHub Wiki
Introduction to Pandas
In the age of Big Data, there is a growing need for everyone to have basic data analysis skills. Pandas is a Python library that makes analyzing and manipulating data much simpler. Compared to other data analysis tools, such as R and SQL, many users find Pandas easier to pick up, both because its syntax is quick to learn and because it works with many different data types. Pandas's wide range of functionality lets it support all kinds of users, from engineers in industry to students. My personal introduction to Pandas came when I was working at a company and analyzing their device usage data. There I primarily used the counting and averaging functionality. Since then, I have used the data visualization functions for my courses at UCLA. Pandas's versatility and compatibility with the rest of Python make it one of the most robust and easy-to-use data analysis tools, and a must-have for anyone interested in the field. In this article, we will explore how to get set up with Pandas, along with common applications and functions.
History of Pandas
Pandas was created by Wes McKinney in 2008. Wes also helped to create Apache Arrow. Pandas is built on top of NumPy for numerical computation and relies on Matplotlib for data visualization. Before Pandas, data analysis was done in plain Python, which could be tiresome and out of reach for newer developers. This entailed writing code to split CSV files on commas and newlines, and then analyzing the data using loops. Not only was this far less efficient, it also offered little error handling when a file was poorly formatted. The Pandas library is continually being updated, with new features added as technology develops. For instance, in newer versions of Pandas, users can specify which engine to use when reading a .csv file into a dataframe. This helps when optimizing for efficiency, and one of the available engines supports multithreading.
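As a rough illustration of the contrast described above, the sketch below computes the same average twice: once by splitting text manually, and once with read_csv() while naming the parser engine explicitly. The data and column names are made up for the example. Only the "c" and "python" engines are shown because they ship with Pandas; the "pyarrow" engine, which is the one that supports multithreading, requires the separate pyarrow package.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "device,hours\nphone,3\nlaptop,5\ntablet,1\n"

# Manual approach: split on newlines and commas, then loop over the rows.
rows = [line.split(",") for line in csv_text.strip().split("\n")]
header, records = rows[0], rows[1:]
manual_mean = sum(int(r[1]) for r in records) / len(records)

# Pandas approach, choosing the parser engine explicitly.
df_c = pd.read_csv(io.StringIO(csv_text), engine="c")
df_py = pd.read_csv(io.StringIO(csv_text), engine="python")
pandas_mean = df_c["hours"].mean()

print(manual_mean, pandas_mean)
```

The manual version also has to convert strings to numbers itself, which is exactly the kind of detail read_csv() handles automatically.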
Setting Up with Pandas
To install Pandas, run the following in your command line, assuming Python and pip are already installed:
pip install pandas
Pandas Data Structures
The two key data structures in Pandas are the series and the dataframe. A series is effectively a single column of a table, that is, a one-dimensional labeled array. A dataframe represents the table itself, that is, a two-dimensional structure made up of columns. Series are used on their own less often than dataframes, since most data we need to process comes as tables rather than lists. However, if one simply wants to analyze a list of data, for instance to get information about a grade distribution, Pandas can handle that very easily.
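The grade distribution case mentioned above can be sketched with a series as follows. The scores and student labels are invented for illustration; the example also shows that selecting one column of a dataframe gives back a series.

```python
import pandas as pd

# Hypothetical exam scores; the values are made up for illustration.
grades = pd.Series([88, 92, 75, 92, 60], name="grade")

mean_grade = grades.mean()       # arithmetic mean of the scores
top_grade = grades.max()         # highest score
counts = grades.value_counts()   # how often each score occurs

# A dataframe is a collection of series that share one index.
df = pd.DataFrame({"student": ["A", "B", "C", "D", "E"], "grade": grades})
grade_column = df["grade"]       # selecting one column returns a Series

print(mean_grade, top_grade)
print(counts)
```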
Creating a Pandas Dataframe
Most of our analysis operations will be performed on a Pandas dataframe. We can create one in many ways, but the most common are from a Python dictionary or from a spreadsheet.
For dictionary:
import pandas as pd
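# Each key becomes a column name; each list holds that column's values.
# (Named "data" rather than "dict" to avoid shadowing the built-in type.)
data = {'name': ['Krisha', 'Isabella', 'Jolin', 'Maya'], 'favorite food': ['Brownies', 'Cookies', 'Pie', 'Muffins']}
df = pd.DataFrame(data)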
For spreadsheet (data.csv):
import pandas as pd
df = pd.read_csv('data.csv')
A Pandas dataframe is essentially a table or spreadsheet held in a form that Pandas can work with. One important thing to note is that we can load only specific columns when we read in a dataframe. When I was working with Pandas at a company, we had a very large dataset, and loading every column was both time-consuming and unnecessary. Selecting columns at load time lets us keep only the relevant ones. For example, we can do this:
import pandas as pd
df = pd.read_csv('data.csv', usecols = [0,1])
It's important to note that usecols accepts either a list of integer column positions or a list of column labels.
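The two forms of usecols can be compared in a small sketch. The CSV text below is a hypothetical stand-in for data.csv, read from memory so the example is self-contained.

```python
import io

import pandas as pd

# Stand-in for data.csv; the column names and values are hypothetical.
csv_text = "name,favorite food,age\nKrisha,Brownies,21\nIsabella,Cookies,22\n"

# Select the first two columns by position...
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])
# ...or select the same columns by label.
by_label = pd.read_csv(io.StringIO(csv_text), usecols=["name", "favorite food"])

print(by_position.columns.tolist())
print(by_label.equals(by_position))
```

Labels are usually the safer choice, since they keep working if the column order in the file changes.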
Key Functionality in Pandas
Data Visualization in Pandas
To visualize data in Pandas, we can call the plot() function on a dataframe or series and display the result with matplotlib.pyplot. With this function we can also specify the kind of plot (line, bar, histogram, scatter, etc.) and the titles of both the axes and the plot itself. For example:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind='scatter', x='Time', y='Count')
plt.show()
This produces a scatter plot with Time on the x-axis and Count on the y-axis.
Date and Time in Pandas
Generally, when doing data analysis, having a date and time can be helpful. To convert a dataframe column containing times, we can use the pd.to_datetime() function, which converts the column to the datetime64 type. We can also subtract datetime64 values from one another to get durations. The date and time functionality is most useful when plotting any interval-based data. For instance, if someone is keeping track of their heart rate at irregular intervals in a spreadsheet with a timestamp column (named dates here) and an HR column, we could plot it like this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
# Parse the 'dates' column into datetime64 values before plotting.
dates = pd.to_datetime(df['dates'])
plt.plot(dates, df['HR'])
plt.show()
Here, to_datetime() ensures that the dates column is a standardized set of timestamps before it is plotted.
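The conversion and subtraction described in this section can be sketched without a file or a plot. The readings below are invented, and the column names dates and HR simply follow the plotting example.

```python
import pandas as pd

# Hypothetical heart-rate log taken at irregular intervals.
df = pd.DataFrame({
    "dates": ["2023-11-01 08:00", "2023-11-01 08:45", "2023-11-01 10:15"],
    "HR": [62, 75, 68],
})

df["dates"] = pd.to_datetime(df["dates"])            # strings -> datetime64 values
gaps = df["dates"].diff()                            # time since the previous reading
total = df["dates"].iloc[-1] - df["dates"].iloc[0]   # duration of the whole log

print(df.dtypes)
print(gaps)
print(total)
```

Subtracting two timestamps yields a Timedelta, so durations can be compared or summed like ordinary numbers.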
Pivot Tables in Pandas
Pivot tables are a key part of data analysis, especially in SQL. They allow us to quickly summarize large amounts of data, making it easy to identify specific patterns or trends. The pd.pivot_table() function takes in a dataframe, along with some parameters, and outputs a summarized view of the data. The main parameters are as follows:
data: The dataframe to summarize
values: Column or columns to aggregate
index: Keys to group by on the pivot table index
aggfunc: The aggregate function applied to each group, such as "mean"
For example, to create a pivot table from a table containing classes and grades:
import pandas as pd
df = pd.read_csv('data.csv')
# Group rows by class and average the grade column within each group.
pv = pd.pivot_table(df, index=['class'], values=['grade'], aggfunc='mean')
This creates a pivot table showing the average grade for each class, without having to iterate through the table and calculate it manually. We could also change aggfunc to compute a different summary for each class, such as the max or the median. The pivot table in particular makes data analysis dramatically simpler, since otherwise we would have to write many more lines of code to achieve the same result.
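A self-contained version of this idea, with a small invented table in place of data.csv, might look like the following. It builds one pivot table with the mean and another with the max to show the effect of changing aggfunc.

```python
import pandas as pd

# Hypothetical class/grade records standing in for data.csv.
df = pd.DataFrame({
    "class": ["Math", "Math", "Physics", "Physics", "Physics"],
    "grade": [95, 85, 60, 75, 90],
})

# One row per class, holding the average grade of that class.
mean_pv = pd.pivot_table(df, index="class", values="grade", aggfunc="mean")
# Same grouping, but keeping the highest grade instead.
max_pv = pd.pivot_table(df, index="class", values="grade", aggfunc="max")

print(mean_pv)
print(max_pv)
```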
Multiprocessing with Pandas
To me, one of the best features of Pandas is its compatibility with multiprocessing in Python. By default, Pandas operations run on a single core. This works for more basic data analysis, but it can become very inefficient with larger datasets. That is where multiprocessing can help. In the past, I have used multiprocessing with Pandas, but it was a very manual process: I had to pass different sections of the dataset to different cores and have them compute in parallel. This made the processing far more efficient, but it required much more time and effort on my part. Now there is a Python library called Modin. Its modin.pandas module is designed as a drop-in replacement that mirrors the pandas API, but it makes use of all the cores on your device instead of just one. To install Modin, we can run:
pip install modin
pip install "modin[all]"
Now, instead of using import pandas as pd, we can just use import modin.pandas as pd. This makes it much easier to take full advantage of our computing power in a user-friendly way. You can see the difference made even with a simple read_csv() in this chart:
Conclusion
This article outlines the basic functions of Pandas and should give readers insight into how it may be applied to their data analysis needs. It shows how to manipulate data using pivot tables, how to work with specific data types (such as date and time), and how to visualize data using plots. We also discuss the different applications of Pandas based on an analyst's needs. While the basic functionality is enough for most students, for those in industry looking for efficiency, the ability to load only specific columns and make use of multiprocessing is a major advantage. Also, since Pandas is relatively young, more libraries and functionality are being added as the need arises. When I was working with Pandas two years ago and used multiprocessing, I did everything manually. Now there is an entire library created to simplify this task. Overall, Pandas is a great tool for data analysts of any skill level, and one that is constantly evolving.
References
Pandas: https://pandas.pydata.org
Modin: https://modin.readthedocs.io/en/stable/