Imbalanced Datasets - ofithcheallaigh/masters_project GitHub Wiki
Introduction
The issue of imbalanced datasets came up in the project when I started to look at processing data to train a model with a view to determining if a particular grid had an object or not.
The way the data was collected meant that the dataset with an object in place would contain 9 grids, each with data for an object, while a dataset with no object in place would also contain data from 9 grids. This is fine when the whole dataset is being processed, as is the case for a straight binary classification of "Is there an object in place or not". However, the issue gets more complicated as we start to look at doing a grid-by-grid analysis.
So, let's take a slight step back before moving forward. Suppose, for example, that a full dataset gathered when an object was in place contains information for 9 grids and 9000 samples in total, with each grid contributing 1000 samples: 1000 samples for Grid 1, 1000 samples for Grid 2, and so on. We also have another dataset with samples from when no object was in place. Again, there are 9000 samples gathered across 9 grids, with 1000 samples per grid.
In the initial "Is there an object there?" analysis, the whole of both datasets was used because the 'response' variable was either a 'yes' or a 'no'. In that situation, 9000 samples were being processed for each dataset. But in a grid-by-grid approach, we are looking to model whether an object is present in Grid x or not. For this, the grid number becomes the 'response'. We therefore cannot have two datasets sharing the same grid numbers (i.e. 1 to 9), so the grid number for the "no object" data is changed to 0.
Now, if we feed both datasets into the algorithm, and if we take Grid 1 as an example, we have 1000 samples from Grid 1 being modelled against 9000 samples for Grid 0. This means that the Grid 0 data becomes the majority or dominant dataset, and therefore creates an imbalance.
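As a quick sanity check, the minority share in this hypothetical 1000-vs-9000 split can be computed directly (the counts below are the example figures from above, not real data):

```python
import pandas as pd

# Hypothetical counts mirroring the Grid 1 vs Grid 0 example above
counts = pd.Series({"Grid 0": 9000, "Grid 1": 1000})

# The minority class (Grid 1) makes up only 10% of the combined data
minority_fraction = counts.min() / counts.sum()
print(f"{minority_fraction:.0%}")  # prints "10%"
```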
Why is this a problem?
According to the Google Developers Foundational Course on Machine Learning [1], there are three degrees of imbalance: mild, moderate and extreme. They break the degree of imbalance down as follows:
| Degree of Imbalance | Size of Minority Class |
|---|---|
| Mild | 20% to 40% of the dataset |
| Moderate | 1% to 20% of the dataset |
| Extreme | <1% of the dataset |
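These thresholds can be captured in a small helper. The function name `imbalance_degree` is mine, and the cut-offs follow the Google Developers course bands (mild 20% to 40%, moderate 1% to 20%, extreme below 1%):

```python
def imbalance_degree(minority_fraction):
    """Map a minority-class share (0.0 to 1.0) onto the degrees above."""
    if minority_fraction < 0.01:
        return "extreme"
    if minority_fraction < 0.20:
        return "moderate"
    if minority_fraction <= 0.40:
        return "mild"
    return "balanced"

# The 1000-vs-9000 grid example gives a 10% minority share
print(imbalance_degree(1000 / 10000))  # prints "moderate"
```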
The above table tells us that working with any imbalanced dataset is an issue, but the impact varies with the degree of imbalance. This is because classification models in ML are generally built on the assumption that there will be an equal number of data points for each class [2]. If a dataset is imbalanced, the model will spend most of its training time on the majority class, and can end up largely ignoring the minority class.
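A toy illustration of why this matters, using hypothetical labels shaped like the grid example (9000 "no object" samples against 1000 "Grid 1" samples): a degenerate model that always predicts the majority class still looks accurate on paper.

```python
import numpy as np

# Hypothetical labels: 9000 "no object" samples (0) vs 1000 "Grid 1" samples (1)
y_true = np.array([0] * 9000 + [1] * 1000)

# A degenerate model that always predicts the majority class...
y_pred = np.zeros_like(y_true)

# ...still reaches 90% accuracy while never detecting a single object
accuracy = float((y_pred == y_true).mean())
print(accuracy)  # 0.9
```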
Balancing the datasets
To balance the datasets for this research, the data gathered for no object will be added to the object datasets, with the no-object data being assigned the grid position of 0. This was achieved with the following bit of code, where `dataset1` and `dataset2` are the pandas DataFrames for the object and no-object data, and are assumed to be loaded already:
```python
import pandas as pd

def modify_to_grid_zero_fn(data):
    # Replace the original grid numbers (1 to 9) with grid 0 for the no-object data
    data = data.copy()
    data['Grid'] = 0
    return data

# Record how many samples each grid holds in both datasets
grid_dataset1_lengths = []
grid_dataset2_lengths = []
for x in range(1, 10):
    grid_dataset1_lengths.append(len(dataset1[dataset1['Grid'] == x]))
    grid_dataset2_lengths.append(len(dataset2[dataset2['Grid'] == x]))

## Balancing dataset
# With the per-grid lengths known, build a grid-0 dataset whose total size matches
# a single grid: take an equal share of samples from each of the 9 no-object grids
data_points = sum(grid_dataset2_lengths) / len(grid_dataset2_lengths)  # average samples per grid
data_points = data_points / len(grid_dataset2_lengths)  # share to take from each grid
data_points = int(data_points)

# Now to build the grid 0 dataset
new_dataset = pd.DataFrame()
for x in range(1, 10):
    temp_data = dataset2[dataset2['Grid'] == x].iloc[:data_points, :]
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    new_dataset = pd.concat([new_dataset, temp_data])

dataset2 = modify_to_grid_zero_fn(new_dataset)
# Stack one dataset on top of the other; pd.concat keeps the column labels,
# which np.vstack would discard
data = pd.concat([dataset1, dataset2], ignore_index=True)
```
With `data` now in place, this can be saved to a `.csv` file using the `to_csv` method.
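As a minimal sketch of the save step (the column names follow the commented-out header in the snippet above and are an assumption about the real data; the filename is one of those listed below):

```python
import pandas as pd

# Stand-in for the combined dataset; real data would come from the balancing step
data = pd.DataFrame({
    "Channel1": [0.12, 0.34],
    "Channel2": [0.56, 0.78],
    "LabelObject": [1, 0],
    "Grid": [1, 0],
})

# index=False keeps the pandas row index out of the CSV
data.to_csv("grid0_closeddoor_clearhallway.csv", index=False)
```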
When created, these datasets were given the following names:

- `grid0_closeddoor_clearhallway.csv`: the object data is the closed door, and the no-object data is from a clear hallway
- `grid0_displaystand_clearhallway.csv`: the object data is the display stand, and the no-object data is from a clear hallway
- `grid0_largebin_clearhallway.csv`: the object data is the large bin, and the no-object data is from a clear hallway
- `grid0_storagebox_clearhallway.csv`: the object data is the storage box, and the no-object data is from a clear hallway
The `grid0` in the name indicates that the no-object data has been added. The screenshot below shows an example of this dataset, where the transition happens from grid 9 to grid 0. This screenshot is from `grid0_closeddoor_clearhallway.csv`.
Sources
[1] "Imbalanced Data." Machine Learning Foundational Course, Google Developers. Jul. 18, 2022. https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data (accessed Feb. 8, 2023).
[2] J. Brownlee. "A Gentle Introduction to Imbalanced Classification - MachineLearningMastery.com." MachineLearningMastery.com. https://machinelearningmastery.com/what-is-imbalanced-classification/ (accessed Feb. 8, 2023).