Pandas: Not Just Cute Animals
1) Intro
At the heart of the booming field of data science is one powerful library: Pandas. Pandas is a Python library designed for efficient data manipulation and analysis, and it is useful across many domains, sensor data in particular. It handles sensor data well because it supports data cleaning, exploration, and statistical analysis in one place. Because it excels at structured data, IMU sensor data can be easily processed for both data visualization and gesture classification. This tutorial will help you get familiar with Pandas and perform basic gesture classification.
2) Background and Importance
Wes McKinney began developing Pandas in 2008 while working at AQR Capital Management, and it was released to the public as open source in 2009. The name "Pandas" is short for "panel data", a term for data sets that include observations over multiple time periods. Its DataFrame structure mirrors the familiar layout of spreadsheets, which makes it intuitive for users transitioning from tools like Excel. With Pandas, tasks such as data cleaning, transformation, and aggregation are straightforward thanks to its extensive library of functions and methods. It also integrates with other prominent Python libraries such as NumPy and Matplotlib, which supports end-to-end data analysis workflows. In short, Pandas streamlines intricate data operations, which has made it a popular choice among analysts, data scientists, and researchers.
Many look at Pandas and question whether other tools are better suited for sensor data; Keras and scikit-learn come to mind. Both libraries have their own strengths, particularly in machine-learning tasks, but they are not designed for handling raw sensor data, which is where Pandas excels. Sensor data typically requires preprocessing steps such as cleaning noisy data, handling missing values, and scaling or normalizing features. Pandas offers a comprehensive suite of methods for data cleaning and transformation, enabling users to prepare the sensor data efficiently before feeding it into machine learning libraries such as Keras or scikit-learn. In addition, many sensor datasets are time series, where readings are collected at regular intervals. Pandas provides robust support for time-series manipulation and analysis, including resampling, time-based indexing, and rolling window calculations, making it well suited to analyzing trends and patterns in sensor data over time.
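The time-series features mentioned above can be sketched as follows. This is a minimal illustration, not part of the original tutorial: the timestamps, the 100 ms sampling interval, and the az values are all made-up assumptions.

```python
import pandas as pd

# Hypothetical accelerometer readings sampled every 100 ms (values are made up)
timestamps = pd.date_range("2024-01-01 00:00:00", periods=10, freq="100ms")
readings = pd.DataFrame(
    {"az": [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0]},
    index=timestamps,
)

# Time-based indexing: slice by timestamp labels (both endpoints are included)
first_half_second = readings.loc[timestamps[0]:timestamps[4]]

# Resampling: average the 100 ms readings into 500 ms bins
resampled = readings.resample("500ms").mean()

# Rolling window: moving average over the 3 most recent readings
rolling_avg = readings["az"].rolling(window=3).mean()

print(first_half_second)
print(resampled)
print(rolling_avg)
```

The first two entries of the rolling average are NaN because a full window of three readings is not yet available at those points.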
Key Features of Pandas:
DataFrame: The primary data structure used in Pandas is called a DataFrame. It is a two-dimensional labeled data structure with rows and columns, similar to an Excel spreadsheet.
Series: Each column of a Pandas DataFrame is a Series, a one-dimensional labeled array.
Data Cleaning: Pandas offers tools for handling missing data, removing duplicates, and transforming data.
Data Manipulation: Pandas provides a wide range of functions for data manipulation, including merging, reshaping, slicing, indexing, and filtering data, some of which we will go over in this tutorial.
Integration: Pandas seamlessly integrates with other Python libraries, such as Matplotlib for visualization and scikit-learn for machine learning. This integration is one of the many reasons why Pandas is useful for sensor data and gesture classification.
Active Community: As an open-source library, there is a huge community ready to answer questions and many examples on the web for common use cases.
Input/Output: Pandas supports reading and writing data from various file formats, including CSV, Excel, SQL databases, JSON, and HTML.
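A few of the features listed above (DataFrame, Series, and Input/Output) can be seen together in a short sketch. The CSV text here is a made-up stand-in for a sensor log file, and io.StringIO is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a sensor log file
csv_text = "time,ax,ay\n0,0.1,0.5\n1,0.2,0.6\n2,0.3,0.7\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Selecting a single column returns a Series
ax_series = df["ax"]
print(type(df).__name__, type(ax_series).__name__)

# Write the same data back out as a JSON string, one record per row
json_text = df.to_json(orient="records")
print(json_text)
```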
2) Installation
Installation is straightforward using pip, Python's package manager:
pip install pandas
3) Simple Tutorial
Let's start with a basic example to understand some of Pandas' capabilities:
In this example, we will create a DataFrame, access it, and sort data in different ways. This will serve as a precursor to accessing and modifying a DataFrame object necessary for sensor data handling.
Creating a DataFrame: We create a DataFrame df from dictionary data.
import pandas as pd
# Create a DataFrame
data = {'Name': ['Greg', 'Sunay', 'Kia', None],
'Age': [20, None, 21, 24],
'Gender': ['Male', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
Displaying DataFrame information: We use head() to display the first few rows and info() to get basic information about the DataFrame.
# Visualize the first few rows of the DataFrame
print("DataFrame Head:")
print(df.head())
# Display basic information about the DataFrame
# (info() prints its summary directly and returns None, so no print() is needed)
print("\nDataFrame Info:")
df.info()
Accessing specific columns: We access the 'Name' column using bracket notation. The same can be done for 'Age' and 'Gender'.
# Accessing specific columns (series)
print("\nAccessing Specific Columns:")
print(df['Name'])
Sorting by a column: We sort the DataFrame by the 'Age' column using sort_values().
# Sorting by a column
sorted_df = df.sort_values(by='Age')
print("\nSorted DataFrame:")
print(sorted_df)
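Filtering rows based on condition: We filter the DataFrame to include only rows where the 'Age' is greater than 20. Filtering returns a new DataFrame containing only the matching rows; the original df is unchanged. Rows with a missing 'Age' do not satisfy the condition, so they are left out as well.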
# Filtering rows based on condition
filtered_df = df[df['Age'] > 20]
print("\nFiltered DataFrame:")
print(filtered_df)
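Filters can also combine several conditions. The sketch below reuses the same sample data and is only an illustration; the variable name males_over_20 is introduced here and does not appear elsewhere in this tutorial.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Greg', 'Sunay', 'Kia', None],
                   'Age': [20, None, 21, 24],
                   'Gender': ['Male', 'Male', 'Male', 'Female']})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
males_over_20 = df[(df['Age'] > 20) & (df['Gender'] == 'Male')]
print(males_over_20)

# A comparison against a missing value is False, so the row with no Age is excluded
print(df['Age'] > 20)

# The original DataFrame still holds all four rows
print(len(df))
```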
4) Advanced Tutorial
Building upon the basics, let's explore more advanced features:
In this example, we will show specifically how to clean invalid data, export data, and do statistical analysis. This will serve as a precursor to preprocessing and analyzing real sensor data.
Filling Missing Values: There are many statistical methods for filling in missing data. In this example, we fill missing/invalid values using .fillna() with the average of all the valid data, computed with .mean(). Other options include the mode, a single constant value, or a random value.
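# Filling missing values
# (assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas versions and may not update df)
df['Age'] = df['Age'].fillna(df['Age'].mean())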
print("\nDataFrame after Filling Missing Values:")
print(df)
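For comparison, here is a rough sketch of a few other fill strategies applied to the same ages. Which one is appropriate depends on the data; the variable names are chosen only for this illustration.

```python
import pandas as pd

ages = pd.Series([20, None, 21, 24], name='Age')

filled_median = ages.fillna(ages.median())  # median of the valid values
filled_constant = ages.fillna(0)            # a single constant value
filled_forward = ages.ffill()               # carry the previous valid value forward
filled_interp = ages.interpolate()          # linear interpolation between neighbors

print(pd.DataFrame({'median': filled_median,
                    'constant': filled_constant,
                    'ffill': filled_forward,
                    'interpolate': filled_interp}))
```

Forward fill and interpolation tend to make more sense for ordered data such as sensor readings than for an unordered column like ages; they are shown here only to compare the results side by side.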
Dropping Rows with Missing Values: If it doesn't make sense to fill in a missing value, we can remove that data point entirely using the dropna() function.
# Dropping rows with missing values
df = df.dropna()
print("\nDataFrame after Dropping Rest of Missing Values:")
print(df)
Exporting Data: The versatility of Pandas allows us to export the data created, modified, and analyzed by Pandas into many different formats. In this example, we export the data to a .csv file that can be opened in Excel.
# Exporting Data
df.to_csv('geniuses.csv', index=False) # Save DataFrame to a CSV file
Grouping and Calculating: To compare statistics across categories, we can group the rows by a feature and then compute statistics for each group. To compute the statistics of a feature over the whole DataFrame instead, simply omit .groupby().
# Group by gender and calculate average age
avg_age = df.groupby('Gender')['Age'].mean()
print("\nAverage Age by Gender:")
print(avg_age)
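Several statistics can be computed per group in one call with .agg(). The sketch below uses a complete version of the sample data; the name 'Ana' and the age 22 are made-up values added so that no entries are missing.

```python
import pandas as pd

# Sample data with no missing values ('Ana' and the age 22 are hypothetical)
df = pd.DataFrame({'Name': ['Greg', 'Sunay', 'Kia', 'Ana'],
                   'Age': [20, 22, 21, 24],
                   'Gender': ['Male', 'Male', 'Male', 'Female']})

# One row per gender, one column per statistic
stats_by_gender = df.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(stats_by_gender)

# Without groupby, the same statistic is computed over the whole column
print(df['Age'].mean())
```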
Full code for General Tutorial found in Appendix A.
5) Advanced IMU Data Processing Tutorial
Now with some basic knowledge of some Pandas functions, let's explore advanced IMU data processing techniques:
In this example, we will go over populating a Pandas DataFrame with IMU data and doing simple data analysis on it, such as computing averages, cleaning data, and sorting data. Assume we have our own IMU data stored in a dictionary called IMU.
Adding New Data to DataFrame: In this first section, we create a DataFrame object already containing some IMU sensor data and a sample IMU dictionary object that could be sent over through MQTT. We show how to add a new row of data to an existing DataFrame using loc[]. This can be used to continuously add new data to a DataFrame for time series analysis.
import pandas as pd
# Create DataFrame with past IMU Values (one invalid accelerometer value and one instance of invalid gyroscope values)
data = {"time": [0,1,2,3],
"ax": [0.1, None, 0.3, 0.4],
"ay": [0.5, 0.6, 0.7, 0.8],
"az": [1.0, 1.1, 1.2, 1.3],
"gx": [10, 20, None, 40],
"gy": [50, 60, None, 80],
"gz": [100, 110, None, 130]}
imu_df = pd.DataFrame(data)
# Sample IMU data dictionary per one message from MQTT
IMU = { "time": 4,
"ax": 0.1,
"ay": 0.4,
"az": 1.4,
"gx": 18,
"gy": 60,
"gz": 140}
#Add one row of IMU values to DataFrame
imu_df.loc[len(imu_df)] = IMU
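When several messages arrive at once, they can be added in a single step instead of row by row. The following is a sketch under the assumption that the messages have been collected into a list of dictionaries; the list name messages and its values are hypothetical.

```python
import pandas as pd

imu_df = pd.DataFrame({"time": [0, 1], "ax": [0.1, 0.2], "gx": [10, 20]})

# Hypothetical batch of IMU messages collected since the last update
messages = [
    {"time": 2, "ax": 0.3, "gx": 30},
    {"time": 3, "ax": 0.4, "gx": 40},
]

# Build a DataFrame from the batch and append it in one operation;
# ignore_index=True renumbers the rows 0..n-1
new_rows = pd.DataFrame(messages)
imu_df = pd.concat([imu_df, new_rows], ignore_index=True)
print(imu_df)
```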
Displaying Sensor Data: We now show what the DataFrame object looks like after updating new sensor values.
# Display DataFrame
print("IMU DataFrame:\n", imu_df)
Cleaning Invalid Values: Sensor data can contain invalid points or rows of data that can mess up gesture recognition. As such, we can either fill in the missing data or remove those rows of data entirely. In this section, we first fill in any missing values of the AX feature using the average AX across all of the data stored in the DataFrame. Then we remove any extra invalid data using dropna().
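# Filling in Missing Data
# (assign the result back; fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update imu_df)
imu_df["ax"] = imu_df["ax"].fillna(imu_df["ax"].mean())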
print("Filling Missing Data:\n", imu_df)
# Removing Extra Invalid Data
imu_df.dropna(inplace=True)
print("Removing Extra Invalid:\n", imu_df)
Preliminary Statistical Analysis: Once our DataFrame object is cleaned and ready to be processed, we can do some initial statistical analysis and check out what the averages and modes of each feature are. We do this through the mean() and mode() functions.
# Finding Averages of Each Feature
feature_averages = imu_df.mean()
print("\nAverages of Each Feature:")
print(feature_averages)
# Finding Mode of Each Feature
feature_modes = imu_df.mode()
print("\nModes of Each Feature:")
print(feature_modes)
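As a quick illustration of what these calls return, the sketch below uses a small made-up DataFrame. It also shows describe(), which is not used elsewhere in this tutorial, as one way to get several summary statistics at once.

```python
import pandas as pd

# Small made-up sample: 'ax' has a repeated value, 'gx' has none
imu_df = pd.DataFrame({"ax": [0.1, 0.3, 0.4, 0.1],
                       "gx": [10.0, 20.0, 40.0, 18.0]})

# describe() reports count, mean, std, min, quartiles, and max per column
summary = imu_df.describe()
print(summary)

# mode() returns a DataFrame: a column whose values are all unique lists every
# value, and columns with fewer modes are padded with NaN
modes = imu_df.mode()
print(modes)
```

Because continuous sensor readings rarely repeat exactly, the mode is often less informative than the mean or median for this kind of data.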
Sorting Data: Now that we know some statistics, we may want to sort the existing data based on the highest rotational movement, to see when a certain gesture may have occurred. In this section, we sort by highest gyroscopic rotation along the X, Y, and Z axes using sort_values() and print each sorted DataFrame.
# Sorting by Highest Gyroscopic Rotation
sorted_df = imu_df.sort_values(by='gz', ascending=False)
print("\nDataFrame Sorted by Highest Z-Gyroscopic Rotation:")
print(sorted_df)
sorted_df = imu_df.sort_values(by='gy', ascending=False)
print("\nDataFrame Sorted by Highest Y-Gyroscopic Rotation:")
print(sorted_df)
sorted_df = imu_df.sort_values(by='gx', ascending=False)
print("\nDataFrame Sorted by Highest X-Gyroscopic Rotation:")
print(sorted_df)
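Instead of sorting by each axis separately, one option is to rank rows by a combined rotation magnitude. This is a sketch, not part of the original example: the g_mag column name is an assumption, and the data mirrors the valid rows left after cleaning above.

```python
import numpy as np
import pandas as pd

imu_df = pd.DataFrame({"time": [0, 1, 3, 4],
                       "gx": [10, 20, 40, 18],
                       "gy": [50, 60, 80, 60],
                       "gz": [100, 110, 130, 140]})

# Combined rotation rate across all three axes (Euclidean norm)
imu_df["g_mag"] = np.sqrt(imu_df["gx"]**2 + imu_df["gy"]**2 + imu_df["gz"]**2)

# nlargest returns the n rows with the largest values in the given column
top_two = imu_df.nlargest(2, "g_mag")
print(top_two)
```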
Full code for IMU Data Processing found in Appendix B.
6) Data Visualization
Because Pandas specializes in organizing and handling large data sets, it pairs naturally with plotting libraries such as Matplotlib. Matplotlib is a widely used Python library for creating static, animated, and interactive visualizations. In conjunction with Pandas, users can easily generate plots directly from manipulated DataFrames. Since Pandas and Matplotlib often appear together in data analysis projects, we will now walk through some basic techniques for using Pandas with Matplotlib.
For context, because Matplotlib was originally written as an open-source alternative to MATLAB, the two share many similarities in how plotting objects are structured. Matplotlib uses the ‘pyplot’ module as the tool for creating and managing plots.
Here are the key objects in the pyplot interface:
matplotlib.pyplot.figure: Figure is the top-level container window. It includes everything visualized in a plot (including one or more Axes). [6]
matplotlib.pyplot.axes: Axes contain most of the elements in a plot: Axis (X,Y, or Z), Tick Marks/Labels, accompanying Titles/Legends, etc. This is the area where data is plotted. [6]
Similar to Pandas, installation is through pip:
pip install matplotlib
At the beginning of your script, be sure to import the pyplot module from Matplotlib:
from matplotlib import pyplot as plt
Using this ‘plt’ alias, there are a few common functions [7]:
Plotting:
plt.plot(): Plot y versus x as lines and/or markers.
plt.scatter(): A scatter plot of y vs x with varying marker size and/or color.
plt.bar(): Makes a bar chart.
plt.hist(): Plots a histogram.
plt.pie(): Plots a pie chart.
Customization:
plt.title(): Set a title for the axes.
plt.xlabel(): Sets the label for the x-axis.
plt.ylabel(): Sets the label for the y-axis.
plt.xlim(): Get or set the x limits of the current axes.
plt.ylim(): Get or set the y limits of the current axes.
Figure and Axes:
plt.figure(): Create a new figure.
plt.subplot(): Add a subplot to the current figure.
plt.subplots(): Create a figure and a set of subplots.
Displaying and Saving:
plt.show(): Display a figure.
plt.savefig(): Save the current figure.
In this example, we show a simple way to plot sample IMU data by pulling columns from a Pandas DataFrame and drawing the accelerometer and gyroscope readings on separate subplots. We also customize the plots with different line colors, legends, and labels that make the data easier to read.
import pandas as pd
import matplotlib.pyplot as plt
# Sample IMU Data
data = {
"time": list(range(10)),
"ax": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
"ay": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4],
"az": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9],
"gx": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
"gy": [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
"gz": [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
}
# Pandas Data Frame
imu_df = pd.DataFrame(data)
# Setting up figures
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))
# Plotting Acceleration Data
ax1.plot(imu_df['time'], imu_df['ax'], label='ax', color='red')
ax1.plot(imu_df['time'], imu_df['ay'], label='ay', color='green')
ax1.plot(imu_df['time'], imu_df['az'], label='az', color='blue')
ax1.set_title('Acceleration Data')
ax1.set_xlabel('Time (s)')
ax1.set_ylabel('Acceleration (m/s²)')
ax1.legend()
ax1.grid(True)
# Plotting Gyroscope Data
ax2.plot(imu_df['time'], imu_df['gx'], label='gx', color='red')
ax2.plot(imu_df['time'], imu_df['gy'], label='gy', color='green')
ax2.plot(imu_df['time'], imu_df['gz'], label='gz', color='blue')
ax2.set_title('Gyroscope Data')
ax2.set_xlabel('Time (s)')
ax2.set_ylabel('Angular Velocity (deg/s)')
ax2.legend()
ax2.grid(True)
plt.show()
This implementation calls the customization methods with the “set_” prefix (such as set_title()) on each Axes object so that every call targets a specific subplot; the plt versions without the prefix act on whichever axes is currently active, which changes as subplots are created. This example uses customized subplots to effectively visualize IMU data. Because IMU information can be difficult to analyze, using Matplotlib with Pandas DataFrames makes it easier to track trends and patterns that give intuitive insight into behavior over time. This is especially helpful for gesture recognition and verification, since plotting the sensor data makes it easier to identify the distinctive patterns associated with specific gestures.
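For quick looks at the data, a DataFrame can also be plotted through its own plot() method, which uses Matplotlib underneath. The following is a minimal sketch: the Agg backend and the in-memory buffer are assumptions made so the code runs without a display or a file on disk.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display required
import pandas as pd

imu_df = pd.DataFrame({
    "time": [0, 1, 2, 3],
    "ax": [0.1, 0.2, 0.3, 0.4],
    "ay": [0.5, 0.6, 0.7, 0.8],
    "az": [1.0, 1.1, 1.2, 1.3],
})

# DataFrame.plot draws one line per column listed in y and returns the Axes
axes = imu_df.plot(x="time", y=["ax", "ay", "az"], title="Acceleration Data", grid=True)
axes.set_xlabel("Time (s)")
axes.set_ylabel("Acceleration (m/s²)")

# Save to an in-memory buffer; pass a filename instead to write an image file
buffer = io.BytesIO()
axes.figure.savefig(buffer, format="png")
```

The returned Axes object accepts the same set_ methods used in the subplot example, so both styles can be mixed.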
7) Gesture Detection from Advanced IMU Data
Now that we have a good example of how to preprocess, clean, find and visualize statistical data on IMU sensor data, let's proceed with gesture detection:
In this example, we will detect a simple gesture of orienting the IMU from flat to upside down on the Z axis within a window of 10 IMU readings. We will also check whether, over the last 10 IMU readings, the average gyroscope reading on the x-axis (gx) is at least 10 deg/s. This indicates high rotational movement; if so, we record the time at which the highest rotation occurred to mark a gesture.
Initializing: We do the same form of initializing and adding new data to the DataFrame object as earlier. All data points are valid since we assume that all preprocessing and cleaning steps have already occurred as in the previous example.
import pandas as pd
# Window length
window_len = 10
# Create DataFrame with past IMU Values (all values are valid in this example)
data = {"time": [0,1,2,3],
"ax": [0.1, 0.2, 0.3, 0.4],
"ay": [0.5, 0.6, 0.7, 0.8],
"az": [.85, 1.1, 0.5, 0.2],
"gx": [10, 20, 60, 40],
"gy": [50, 60, 120, 80],
"gz": [100, 110, 150, 130]}
imu_df = pd.DataFrame(data)
# Sample IMU data dictionary per one message from MQTT
IMU = { "time": 4,
"ax": 0.1,
"ay": 0.4,
"az": -0.95,
"gx": 18,
"gy": 60,
"gz": 140}
imu_df.loc[len(imu_df)] = IMU
Data Managing: To keep track of only the past 10 IMU readings, we use our DataFrame object as a queue. We use .iloc[] to drop the oldest row of data once we have more than 10 readings.
rotation_gestures = []
orientation_gestures = []
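# Removing Data based on Window
if imu_df.shape[0] > window_len:
    # Drop the oldest row, then renumber the index so that a later
    # imu_df.loc[len(imu_df)] = IMU appends a new row instead of overwriting one
    imu_df = imu_df.iloc[1:].reset_index(drop=True)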
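Put in a loop, the queue behavior might look like the sketch below. It uses a window of 3 instead of 10 to keep the output short, and the incoming readings are made up; tail() is used here as an alternative to .iloc[] for keeping the most recent rows.

```python
import pandas as pd

window_len = 3  # small window for illustration only
imu_df = pd.DataFrame({"time": [0, 1, 2], "gx": [10, 20, 60]})

# Hypothetical incoming readings as (time, gx) pairs
incoming = [(3, 40), (4, 18)]

for t, gx in incoming:
    # Append the new reading as the last row
    imu_df.loc[len(imu_df)] = [t, gx]
    # Keep only the newest window_len rows and renumber the index from 0,
    # so the next append lands on a fresh row label
    imu_df = imu_df.tail(window_len).reset_index(drop=True)

print(imu_df)
```

Without the index reset, the row labels would no longer start at 0, and the label len(imu_df) could match an existing row, causing the append to overwrite that row instead of adding a new one.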
Orientation Gesture Recognition: Now to move on to actual gesture recognition. We define a threshold for gesture detection based on accelerometer readings along the z-axis (az). We iterate through the DataFrame using .iterrows() to detect gestures by identifying threshold crossings in the 'az' column within a certain time window. Detected gestures are represented by start and end times using the 'time' column.
# Detect Gestures based on Threshold Crossing
in_gesture = False
start_index = None
for index, row in imu_df.iterrows():
    # Gesture starts when the IMU is upright (az above +0.9)
    if row['az'] > 0.9 and not in_gesture:
        in_gesture = True
        start_index = row['time']
    # Gesture ends when the IMU is upside down (az below -0.9)
    elif row['az'] < -0.9 and in_gesture:
        in_gesture = False
        end_index = row['time']
        orientation_gestures.append((start_index, end_index))
Rotation Gesture Recognition: To detect this gesture, we use some statistical analysis to find the average of the 'gx' feature in the past 10 IMU readings. If our average meets a threshold (10), we then do some more analysis to find the time at which the max 'gx' occurred, using idxmax(), to give us this high average. We then add the result to the gestures list to be used for further analysis.
# Detect Gestures based on Statistical Analysis
window_average = imu_df["gx"].mean()
if window_average >= 10:
    # Look up the time of the largest gx reading in the window
    idx_max = imu_df["gx"].idxmax()
    time = imu_df.loc[idx_max, 'time']
    if time not in rotation_gestures:
        rotation_gestures.append(time)
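The same idea can be evaluated at every reading with a rolling window rather than once per batch. This is a sketch with made-up gx values and a window of 3; the threshold of 10 matches the one used above.

```python
import pandas as pd

# Made-up gx readings with a burst of rotation in the middle
imu_df = pd.DataFrame({"time": range(8),
                       "gx": [2, 3, 1, 30, 45, 25, 2, 1]})
window_len = 3
threshold = 10

# Average of the most recent window_len readings at each row
# (NaN until a full window is available)
rolling_avg = imu_df["gx"].rolling(window=window_len).mean()

# Rows where the recent average meets the threshold
active = imu_df[rolling_avg >= threshold]
print(active)

# Time of the single largest gx reading
peak_time = imu_df.loc[imu_df["gx"].idxmax(), "time"]
print(peak_time)
```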
Displaying: These gestures are then printed for further analysis or processing.
# Output Detected Gestures
print("\nDetected Orientation Gestures:")
for gesture in orientation_gestures:
    print("Gesture from time ", gesture[0], " to ", gesture[1])
print("\nDetected Rotation Gestures:")
for time in rotation_gestures:
    print("Gesture at time ", time)
Full code for IMU Gesture Recognition found in Appendix C.
8) Conclusion:
In this tutorial, we showcased how Pandas can manage IMU sensor datasets and, together with Matplotlib, visualize them with ease. Through examples, we illustrated how Pandas cleans sensor data with missing values and produces analysis-ready statistics. Pandas also offers robust support for time-series analysis, including time-based indexing and rolling-window calculations, which makes it well suited to handling sensor data over time. Additionally, we explored its integration with Matplotlib, which allows for clear and appealing representations of acquired data. Even on its own, Pandas offers enough tools to preprocess data and implement simple gesture recognition, which makes it a good fit for multi-disciplinary projects. With built-in methods to calculate averages, maximums, and threshold-based filters, Pandas is a strong choice for many applications, especially those involving IMU sensor data.
9) References:
[1] https://en.wikipedia.org/wiki/Pandas_(software)
[2] https://pandas.pydata.org/docs/user_guide/index.html#user-guide
[3] https://pypi.org/project/pandas/
[4] https://www.nvidia.com/en-us/glossary/pandas-python/
[5] https://fabsta.github.io/blog/python/cheat%20sheet/2020/03/23/data-science-snippets.html
[7] https://matplotlib.org/3.1.1/api/pyplot_summary.html
10) Appendices:
Appendix A
import pandas as pd
# Create a DataFrame
data = {'Name': ['Greg', 'Sunay', 'Kia', None],
'Age': [20, None, 21, 24],
'Gender': ['Male', 'Male', 'Male', 'Female']}
df = pd.DataFrame(data)
# Visualize the first few rows of the DataFrame
print("DataFrame Head:")
print(df.head())
# Display basic information about the DataFrame
# (info() prints its summary directly and returns None, so no print() is needed)
print("\nDataFrame Info:")
df.info()
# Accessing specific columns (series)
print("\nAccessing Specific Columns:")
print(df['Name'])
# Sorting by a column
sorted_df = df.sort_values(by='Age')
print("\nSorted DataFrame:")
print(sorted_df)
# Filtering rows based on condition
filtered_df = df[df['Age'] > 20]
print("\nFiltered DataFrame:")
print(filtered_df)
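# Filling missing values
# (assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas versions and may not update df)
df['Age'] = df['Age'].fillna(df['Age'].mean())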
print("\nDataFrame after Filling Missing Values:")
print(df)
# Dropping rows with missing values
df = df.dropna()
print("\nDataFrame after Dropping Rest of Missing Values:")
print(df)
# Exporting Data
df.to_csv('geniuses.csv', index=False) # Save DataFrame to a CSV file
# Group by gender and calculate average age
avg_age = df.groupby('Gender')['Age'].mean()
print("\nAverage Age by Gender:")
print(avg_age)
Appendix B
import pandas as pd
# Create DataFrame with past IMU Values (one invalid accelerometer value and one instance of invalid gyroscope values)
data = {"time": [0,1,2,3],
"ax": [0.1, None, 0.3, 0.4],
"ay": [0.5, 0.6, 0.7, 0.8],
"az": [1.0, 1.1, 1.2, 1.3],
"gx": [10, 20, None, 40],
"gy": [50, 60, None, 80],
"gz": [100, 110, None, 130]}
imu_df = pd.DataFrame(data)
# Sample IMU data dictionary per one message from MQTT
IMU = { "time": 4,
"ax": 0.1,
"ay": 0.4,
"az": 1.4,
"gx": 18,
"gy": 60,
"gz": 140}
#Add one row of IMU values to DataFrame
imu_df.loc[len(imu_df)] = IMU
# Display DataFrame
print("IMU DataFrame:\n", imu_df)
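# Filling in Missing Data
# (assign the result back; fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not update imu_df)
imu_df["ax"] = imu_df["ax"].fillna(imu_df["ax"].mean())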
print("Filling Missing Data:\n", imu_df)
# Removing Extra Invalid Data
imu_df.dropna(inplace=True)
print("Removing Extra Invalid:\n", imu_df)
# Finding Averages of Each Feature
feature_averages = imu_df.mean()
print("\nAverages of Each Feature:")
print(feature_averages)
# Finding Mode of Each Feature
feature_modes = imu_df.mode()
print("\nModes of Each Feature:")
print(feature_modes)
# Sorting by Highest Gyroscopic Rotation
sorted_df = imu_df.sort_values(by='gz', ascending=False)
print("\nDataFrame Sorted by Highest Z-Gyroscopic Rotation:")
print(sorted_df)
sorted_df = imu_df.sort_values(by='gy', ascending=False)
print("\nDataFrame Sorted by Highest Y-Gyroscopic Rotation:")
print(sorted_df)
sorted_df = imu_df.sort_values(by='gx', ascending=False)
print("\nDataFrame Sorted by Highest X-Gyroscopic Rotation:")
print(sorted_df)
Appendix C
import pandas as pd
# Window length
window_len = 10
# Create DataFrame with past IMU Values (all values are valid in this example)
data = {"time": [0,1,2,3],
"ax": [0.1, 0.2, 0.3, 0.4],
"ay": [0.5, 0.6, 0.7, 0.8],
"az": [.85, 1.1, 0.5, 0.2],
"gx": [10, 20, 60, 40],
"gy": [50, 60, 120, 80],
"gz": [100, 110, 150, 130]}
imu_df = pd.DataFrame(data)
# Sample IMU data dictionary per one message from MQTT
IMU = { "time": 4,
"ax": 0.1,
"ay": 0.4,
"az": -0.95,
"gx": 18,
"gy": 60,
"gz": 140}
imu_df.loc[len(imu_df)] = IMU
rotation_gestures = []
orientation_gestures = []
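# Removing Data based on Window
if imu_df.shape[0] > window_len:
    # Drop the oldest row, then renumber the index so that a later
    # imu_df.loc[len(imu_df)] = IMU appends a new row instead of overwriting one
    imu_df = imu_df.iloc[1:].reset_index(drop=True)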
# Detect Gestures based on Threshold Crossing
in_gesture = False
start_index = None
for index, row in imu_df.iterrows():
    if row['az'] > 0.9 and not in_gesture:
        in_gesture = True
        start_index = row['time']
    elif row['az'] < -0.9 and in_gesture:
        in_gesture = False
        end_index = row['time']
        orientation_gestures.append((start_index, end_index))
# Detect Gestures based on Statistical Analysis
window_average = imu_df["gx"].mean()
if window_average >= 10:
    # Look up the time of the largest gx reading in the window
    idx_max = imu_df["gx"].idxmax()
    time = imu_df.loc[idx_max, 'time']
    if time not in rotation_gestures:
        rotation_gestures.append(time)
# Output Detected Gestures
print("\nDetected Orientation Gestures:")
for gesture in orientation_gestures:
    print("Gesture from time ", gesture[0], " to ", gesture[1])
print("\nDetected Rotation Gestures:")
for time in rotation_gestures:
    print("Gesture at time ", time)