Data Analysis: Machine Learning Algorithms

Introduction

This section presents the results of the initial analysis, in which the datasets were processed to train models.

The data was first processed to understand how well various models could be trained to detect whether an object was in place. After that, the data was used to determine how well a model could be trained to detect an object based on the grid it was placed in.

Please note both my MSc Machine Learning module and Deep Learning module were used as sources for this analysis.

Classification Analysis

Binary Search

The datasets were first processed through several ML classification algorithms to understand how well the ML techniques detect the presence of an object. In this analysis, each dataset for an object is processed alongside the dataset for no object.

The models selected for the task were:

  • Logistic Regression
  • Decision Tree Classifier
  • K-Nearest Neighbours
  • Linear Discriminant Analysis
  • Gaussian Naive Bayes

With the datasets loaded into the Python program, they were combined into one DataFrame, and the DataFrame was assigned column names:

import numpy as np
import pandas as pd
data = np.vstack((dataset1,dataset2,dataset3,dataset4)) # 4 individual datasets
data = pd.DataFrame(data) # Convert to DataFrame
data.columns=["Channel1","Channel2","LabelObject","Grid"] # Assign column names

The features of interest were assigned to a variable to allow for ease of selection:

feature_names = ['Channel1','Channel2'] # Features we are interested in

Next, the LabelObject, which in the original dataset was a Yes, for an object being present, and a No, for no object being present, was replaced with a 1 and a 0:

# Replaces Yes/No with 1/0. Required because some algorithms will not accept categorical data
# Assigning the result back avoids relying on an in-place edit of a column view
data['LabelObject'] = data['LabelObject'].map({'Yes': 1, 'No': 0})

The next step is to pick out the data we want for processing, and then split the data into training and testing datasets:

# X and y are what will be passed through the algorithms to train the model
X = data[feature_names]
y = data['LabelObject'] # Use if carrying out a binary search

# Creating Training and Test sets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) # Default split: 75% training, 25% testing
scaler=MinMaxScaler() # Scale each feature to [0, 1] because there can be a lot of variability in the data values
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Then we process the data through the models to determine how well they perform:

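from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB

# The sections below generate the model accuracy scores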
# Model: Logistic regression
logReg = LogisticRegression()
logReg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'.format(logReg.score(X_train, y_train))) 
print('Accuracy of Logistic regression classifier on test set: {:.2f}'.format(logReg.score(X_test, y_test)))   

# Model: Decision tree
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(X_train, y_train))) 
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(X_test, y_test))) 

# Model: KNN
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train))) 
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test))) 

# Model: Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'.format(lda.score(X_train, y_train))) 
print('Accuracy of LDA classifier on test set: {:.2f}'.format(lda.score(X_test, y_test))) 

# Model: Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'.format(gnb.score(X_train, y_train)))
print('Accuracy of GNB classifier on test set: {:.2f}'.format(gnb.score(X_test, y_test)))

Please note the accuracy results are printed to two decimal places (the {:.2f} format) and are reported as percentage values in the results. A screenshot of the printout generated in the IDE is shown below:

The results of this analysis can be seen below:

We can see from this that the Decision Tree Classifier and the K-Nearest Neighbours algorithms score 100% accuracy for both the training and the test datasets. Gaussian Naive Bayes produces the lowest accuracy value, 69%, when trying to detect the display stand.
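
The five per-model sections above repeat the same fit-and-score pattern, so they can also be written as a loop. The sketch below is only an illustration: it uses synthetic two-channel data (X_demo and y_demo are assumptions, not the project datasets), and the models dictionary and results names are hypothetical. It prints the scores directly as percentages.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the two sensor channels (assumption, not project data)
rng = np.random.default_rng(0)
X_demo = np.vstack((rng.normal(30, 5, (200, 2)), rng.normal(80, 5, (200, 2))))
y_demo = np.array([1] * 200 + [0] * 200)  # 1 = object, 0 = no object

X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test = scaler.transform(X_test)        # reuse the training min/max values

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "K-NN": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "GNB": GaussianNB(),
}

# Fit each model and store (training accuracy, test accuracy)
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train {results[name][0]:.0%}, test {results[name][1]:.0%}")
```

Keeping the scores in a dictionary makes it easier to build a results table later without copying values from the console.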

Grid Search

These results indicate that, using the available datasets, classification models are able to detect if an object is there or not. The next step is to understand if these models can detect an object based on the grid location from which the data was collected. To do this, the datasets required a modification to ensure they were balanced (a discussion on this can be found here).
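
The balancing method actually used in the project is covered in the linked discussion and is not reproduced here. As a general illustration only, one common way to check class balance and downsample every grid to the size of the smallest one is sketched below. The demo DataFrame and its values are made up for the example.

```python
import pandas as pd

# Hypothetical, deliberately imbalanced example (not project data)
demo = pd.DataFrame({
    "Channel1": range(12),
    "Channel2": range(12, 24),
    "Grid": [1] * 6 + [2] * 4 + [3] * 2,
})

# Number of rows per grid before balancing
print(demo["Grid"].value_counts())

# Downsample each grid to the size of the smallest grid
smallest = demo["Grid"].value_counts().min()
balanced = demo.groupby("Grid").sample(n=smallest, random_state=0)

# Number of rows per grid after balancing
counts = balanced["Grid"].value_counts()
print(counts)
```

Downsampling discards data, so other approaches (such as collecting more samples for the smaller grids) may be preferable when the datasets are small.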

The same process as detailed for the binary search is followed for the grid search, with one slight change in the code to use the grid number instead of the object label: y = data['Grid']. This y is then fed into the train_test_split function to create the datasets.

Running the models produces the following results:

This analysis shows that, as in the binary search, the models are able to detect the presence of an object with high accuracy. As in the binary analysis, the Decision Tree Classifier and the K-Nearest Neighbours classifier produce an accuracy of 100% for both the training and testing datasets.

The fact that most of the results for the binary analysis and the grid analysis are the same for the training and test datasets could be because there is little variation in the data: neither the measurement setup nor the object moves during data collection (that is, they do not move once placed in a grid). As a result, the split generated for training "looks" a lot like the split used for testing. One way to investigate this is to pass the algorithms data that they have not seen previously.

The outcome of this analysis will guide the future of the research. Depending on what it indicates, there may be no need to implement an object detection system using machine learning techniques when a threshold value enclosed in an if statement could suffice, i.e. if the reading is less than the threshold, there is an object. While this may be enough to detect an object, it does not help a person navigate a hallway. However, a threshold could be helpful in a system algorithm as a lead-in to the navigation section: first detect an object, then try to navigate around that object.
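
A minimal sketch of the threshold idea is shown below. The cut-off value, the function name object_present, and the choice to flag an object when either channel drops below the threshold are all assumptions made for illustration. A real threshold would have to be chosen from the recorded data.

```python
THRESHOLD = 50.0  # hypothetical cut-off, not derived from the project data


def object_present(channel1, channel2, threshold=THRESHOLD):
    """Return True when either channel reading falls below the threshold."""
    if min(channel1, channel2) < threshold:
        return True
    return False


# Example readings (made up for illustration)
print(object_present(30.0, 80.0))
print(object_present(70.0, 80.0))
```

A rule like this needs no training, but it only answers "is something there?" and carries no information about where the object is.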

Unseen data investigation

To be able to pass unseen data to the algorithm, the code has to be modified slightly. The data is imported as normal, but the datasets are processed differently, as shown below:

data = np.vstack((dataset2,dataset3,dataset4)) # 3 datasets used for training
data = pd.DataFrame(data) # Convert to DataFrame
data.columns=["Channel1","Channel2","LabelObject","Grid"] # Assign column names
data['LabelObject'] = data['LabelObject'].map({'Yes': 1, 'No': 0}) # Same 1/0 encoding as the unseen data

# For new_data
new_data = dataset1 # The dataset held back as unseen data
new_data = pd.DataFrame(new_data) # Convert to DataFrame
new_data.columns=["Channel1","Channel2","LabelObject","Grid"] # Assign column names
new_data['LabelObject'] = new_data['LabelObject'].map({'Yes': 1, 'No': 0})

We can see here that once all datasets are imported, three of the datasets are assigned to data, and one is assigned to new_data. It is this new_data which will be the unseen data.

Now the data needs to be prepared for the train_test_split function:

# X and y are what will be passed through the algorithms to train the model
X = data[feature_names]
y = data['LabelObject']

X_new_data = new_data[feature_names]
y_new_data = new_data['LabelObject']

# Creating Training and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) # Default split: 75% training, 25% testing
scaler=MinMaxScaler() # Using a scaler because there can be a lot of variability in the data values
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Split the held-out dataset; its test portion is the data the models have not seen during training
X_train_new_data, X_test_new_data, y_train_new_data, y_test_new_data = train_test_split(X_new_data, y_new_data, random_state=0)
scaler=MinMaxScaler() # A second scaler, fitted on the held-out data (this replaces the scaler fitted above)
X_train_new_data = scaler.fit_transform(X_train_new_data)
X_test_new_data = scaler.transform(X_test_new_data)

We can see that there are now two calls to train_test_split. This allows the generation of unseen test data.
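
The code above fits a second MinMaxScaler on the unseen data, so the training and unseen features are scaled with different minimum and maximum values. One alternative, which is not what the original analysis did and may give different accuracy figures, is to reuse the scaler fitted on the training data and score the whole unseen dataset. The sketch below uses synthetic data, and the names X_seen, X_unseen, and unseen_accuracy are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): three "seen" objects pooled together,
# and one "unseen" object whose readings are shifted slightly
X_seen = np.vstack((rng.normal(30, 5, (150, 2)), rng.normal(80, 5, (150, 2))))
y_seen = np.array([1] * 150 + [0] * 150)
X_unseen = np.vstack((rng.normal(40, 5, (50, 2)), rng.normal(80, 5, (50, 2))))
y_unseen = np.array([1] * 50 + [0] * 50)

scaler = MinMaxScaler()
X_seen_scaled = scaler.fit_transform(X_seen)   # learn min/max from the seen data only
X_unseen_scaled = scaler.transform(X_unseen)   # apply the same min/max to the unseen data

model = LogisticRegression().fit(X_seen_scaled, y_seen)
unseen_accuracy = model.score(X_unseen_scaled, y_unseen)
print(f"Accuracy on unseen object: {unseen_accuracy:.2f}")
```

With this approach the second train_test_split is not needed, because every row of the held-out dataset is unseen by the model.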

The next important step is to process the models:

# The sections below generate the model accuracy scores
# Model: Logistic regression
logReg = LogisticRegression()
logReg.fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'.format(logReg.score(X_train, y_train))) 
# Below is used for the unseen data
print('Accuracy of Logistic regression classifier on test set: {:.2f}'.format(logReg.score(X_test_new_data, y_test_new_data))) 

# Model: Decision tree
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'.format(clf.score(X_train, y_train))) 
# Below is used for the unseen data
print('Accuracy of Decision Tree classifier on test set: {:.2f}'.format(clf.score(X_test_new_data, y_test_new_data))) 

# Model: KNN
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train))) 
# Below is used for the unseen data
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test_new_data, y_test_new_data))) 

# Model: Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print('Accuracy of LDA classifier on training set: {:.2f}'.format(lda.score(X_train, y_train)))  
# Below is used for the unseen data
print('Accuracy of LDA classifier on test set: {:.2f}'.format(lda.score(X_test_new_data, y_test_new_data))) 

# Model: Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print('Accuracy of GNB classifier on training set: {:.2f}'.format(gnb.score(X_train, y_train)))
# Below is used for the unseen data
print('Accuracy of GNB classifier on test set: {:.2f}'.format(gnb.score(X_test_new_data, y_test_new_data))) 

With this, the system is trained using one set of data and tested using unseen data.

The algorithms were run for each of the following situations:

  1. Run 1
    • Trained on: closed door, display stand, large bin datasets
    • Tested on: storage box dataset
  2. Run 2
    • Trained on: closed door, storage box, large bin datasets
    • Tested on: display stand dataset
  3. Run 3
    • Trained on: closed door, storage box, display stand datasets
    • Tested on: large bin dataset
  4. Run 4
    • Trained on: large bin, storage box, display stand datasets
    • Tested on: closed door dataset
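
The four runs listed above follow a leave-one-object-out pattern. As a rough sketch, scikit-learn's LeaveOneGroupOut can generate these splits automatically when each row is tagged with the object it came from. The data here is synthetic, the groups array is an assumption, and a pipeline is used so that the scaler is fitted on the training objects only, which differs from the code shown earlier.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

objects = ["closed door", "display stand", "large bin", "storage box"]
rng = np.random.default_rng(0)

# Build one synthetic dataset per object (assumption, not project data):
# 50 "object" rows and 50 "no object" rows each
X_parts, y_parts, group_parts = [], [], []
for i, name in enumerate(objects):
    X_parts.append(np.vstack((rng.normal(30 + 5 * i, 5, (50, 2)),
                              rng.normal(80, 5, (50, 2)))))
    y_parts.append(np.array([1] * 50 + [0] * 50))
    group_parts.append(np.array([name] * 100))

X_all = np.vstack(X_parts)
y_all = np.concatenate(y_parts)
groups = np.concatenate(group_parts)

# Each split trains on three objects and tests on the remaining one
scores = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X_all, y_all, groups):
    held_out = groups[test_idx][0]
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
    model.fit(X_all[train_idx], y_all[train_idx])
    scores[held_out] = model.score(X_all[test_idx], y_all[test_idx])
    print(f"Tested on {held_out}: {scores[held_out]:.2f}")
```

The same loop works for the grid analysis by swapping the label array for the grid numbers.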

The results of the binary search are shown below:

Now the analysis shows a difference between the training results and the test results. As before, the Decision Tree Classifier and the K-Nearest Neighbours algorithms produce the highest training results, at 100%, but there is more variation in the test results.

Now, the same process is followed for the grid analysis. For this, the following change to the code is required:

X = data[feature_names]
y = data['Grid']

X_new_data = new_data[feature_names]
y_new_data = new_data['Grid']

With this change completed, the following results are found:

These results show a familiar pattern with the Decision Tree Classifier and the K-Nearest Neighbours algorithms, in that the training accuracy is 100%. However, there is a large drop in the accuracy scores for the grid analysis when using unseen data.

This analysis raises a few questions. First, having a threshold value alone to help detect an object is possible, but it does not help with navigation. Second, machine learning techniques alone are not sufficient to detect objects on which the algorithms were not trained. This second point leaves two possible solutions:

  1. Train the machine learning algorithms on data from as many objects as possible
  2. Use deep learning techniques to detect objects.

Obviously, training the algorithms on a massive amount of data is not practicable, so training neural networks is the next step to investigate.