Imbalanced Datasets - ofithcheallaigh/masters_project GitHub Wiki

Introduction

The issue of imbalanced datasets came up in the project when I started to look at processing data to train a model with a view to determining if a particular grid had an object or not.

The way the data was collected meant that the dataset with an object in place would contain 9 grids, each with data for an object, while a dataset with no object in place would also contain data from 9 grids. This is fine when the whole dataset is being processed, as is the case for a straight binary classification of "Is there an object in place or not". However, the issue gets more complicated as we start to look at doing a grid-by-grid analysis.

So, let's take a slight step back before moving forward. Say, for example, that a full dataset gathered when an object was in place contains information for 9 grids and 9000 samples in total, with each grid contributing 1000 samples: 1000 samples for Grid 1, 1000 for Grid 2, and so on. We also have another dataset with samples from when no object was in place. Again, there are 9000 samples gathered across 9 grids, with 1000 samples per grid.

In the initial "Is there an object there?" analysis, the whole of both datasets was used because the 'response' variable was simply a 'yes' or a 'no'. In that situation, 9000 samples were being processed for each dataset. But in a grid-by-grid approach, we are looking to model whether an object is present in Grid x or not. For this, the grid number becomes the 'response', and we therefore cannot have two datasets sharing the same grid numbers (i.e. 1 to 9). So, the grid number for the "no object" data is changed to 0.

Now, if we feed both datasets into the algorithm, and if we take Grid 1 as an example, we have 1000 samples from Grid 1 being modelled against 9000 samples from Grid 0. This means that the Grid 0 data becomes the majority, or dominant, class, which creates an imbalance.
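To make the imbalance concrete: with 1000 Grid 1 samples against 9000 relabelled Grid 0 samples, the minority class makes up only 10% of the combined data. A quick check using the sample counts from the example above:

```python
grid1_samples = 1000   # object present in Grid 1
grid0_samples = 9000   # all nine no-object grids relabelled to grid 0

minority_fraction = grid1_samples / (grid1_samples + grid0_samples)
print(f"Minority class share: {minority_fraction:.0%}")  # -> Minority class share: 10%
```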

Why is this a problem?

According to the Google Developers Foundational Course on Machine Learning [1], there are three degrees of imbalance: mild, moderate and extreme. They break the degree of imbalance down as follows:

| Degree of Imbalance | Size of Minority Class |
| --- | --- |
| Mild | 20% to 40% of the dataset |
| Moderate | 1% to 20% of the dataset |
| Extreme | <1% of the dataset |

The above table tells us that working with any imbalanced dataset is an issue, but the impact varies with the degree of imbalance. This is because classification models in ML are generally built on the assumption that there will be a roughly equal number of data points for each class [2]. If a dataset is imbalanced, the model will spend most of its training time on the majority class, and may learn comparatively little about the minority class.
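The thresholds in the table can be encoded in a small helper function. This is only a sketch: the function name and the 'balanced' label for anything above 40% are hypothetical, while the boundaries follow the Google course's bands (mild 20% to 40%, moderate 1% to 20%, extreme below 1%):

```python
def imbalance_degree(minority_count, total_count):
    """Classify the degree of imbalance using the Google Developers thresholds."""
    fraction = minority_count / total_count
    if fraction < 0.01:
        return 'extreme'
    elif fraction < 0.20:
        return 'moderate'
    elif fraction <= 0.40:
        return 'mild'
    return 'balanced'

# The grid-by-grid case: 1000 Grid 1 samples vs 9000 Grid 0 samples
print(imbalance_degree(1000, 10000))  # -> moderate
```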

Balancing the datasets

To balance the datasets for this research, the data gathered with no object in place is added to the object dataset, with the no-object data assigned the grid position of 0. This was achieved with the following code, where dataset1 and dataset2 are the DataFrames for the object and no-object data respectively:

```python
import pandas as pd


def modify_to_grid_zero_fn(data):
    data = data.drop(['Grid'], axis=1)  # Dropping the original grid column (grid numbers 1 to 9)
    data['Grid'] = 0  # Replacing it with a new grid column set to 0
    return data


# Count the samples in each grid (1 to 9) for both datasets
grid_dataset1_lengths = []
grid_dataset2_lengths = []
for x in range(1, 10):
    grid_dataset1_lengths.append(len(dataset1[dataset1['Grid'] == x]))
    grid_dataset2_lengths.append(len(dataset2[dataset2['Grid'] == x]))

## Balancing dataset
# The grid 0 data should be the same size as a single grid's data in dataset1,
# so take the mean grid length and divide it by the number of grids: this is
# the number of samples to take from each of the nine no-object grids
data_points = sum(grid_dataset2_lengths) / len(grid_dataset2_lengths)
data_points = int(data_points / len(grid_dataset2_lengths))

# Build the reduced no-object dataset from an equal slice of each grid
new_dataset = pd.DataFrame()
for x in range(1, 10):
    temp_data = dataset2[dataset2['Grid'] == x].iloc[:data_points, :]
    new_dataset = pd.concat([new_dataset, temp_data])

dataset2 = modify_to_grid_zero_fn(new_dataset)

data = pd.concat([dataset1, dataset2])  # Stack one dataset on top of the other
```

With the data now in place, it can be saved to a .csv file using the DataFrame's to_csv() method.
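As a minimal sketch of the save step, with two tiny DataFrames standing in for the object and no-object data (index=False keeps the pandas row index out of the file):

```python
import pandas as pd

# Hypothetical stand-ins for the stacked object and no-object data
dataset1 = pd.DataFrame({'Channel1': [0.1, 0.2], 'Grid': [1, 2]})
dataset2 = pd.DataFrame({'Channel1': [0.3, 0.4], 'Grid': [0, 0]})

data = pd.concat([dataset1, dataset2])
data.to_csv('grid0_closeddoor_clearhallway.csv', index=False)
```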

When created, these datasets were given the following names:

  • grid0_closeddoor_clearhallway.csv: The object data is the closed door, and the no-object data is from a clear hallway
  • grid0_displaystand_clearhallway.csv: The object data is the display stand, and no-object data is from a clear hallway
  • grid0_largebin_clearhallway.csv: The object data is the large bin, and no-object data is from a clear hallway
  • grid0_storagebox_clearhallway.csv: The object data is the storage box, and no-object data is from a clear hallway

The grid0 in the name indicates that the no-object data has been added. The screenshot below shows an example of this dataset, where the transition happens from grid 9 to grid 0:

This screenshot is from grid0_closeddoor_clearhallway.csv.

Sources

[1] “Imbalanced Data.” Machine Learning, Google Developers. Jul. 18, 2022. https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data (accessed Feb. 08, 2023).

[2] J. Brownlee. "A Gentle Introduction to Imbalanced Classification - MachineLearningMastery.com." MachineLearningMastery.com. https://machinelearningmastery.com/what-is-imbalanced-classification/ (accessed Feb. 8, 2023).