Data Analysis: Extra Analysis

Introduction

This section of the wiki contains some additional data analysis that was carried out during the initial look at the dataset.

Confusion Matrix and Trees

The confusion matrix and representations of the decision trees can be generated in MATLAB. The code for doing this is shown below. The code can be modified to predict for the grids or for LabelObject as needed. For example, the lines

% cm = confusionchart(testData.Grid,predicted_Y);
cm = confusionchart(testData.LabelObject,predicted_Y);

can be swapped in and out (using commenting), depending on whether you want to look at the grid analysis or the binary (i.e. LabelObject) analysis.

% Standard Opening
clear;
clc;
close all;

% Setting up colours for plotting
colourArray = ["#F73309", "#27F10E", "#EBF10E", "#EB0EF1", "#0E14F1", "#A4A5BC", "#ADF5CA", "#8B0E31", "#43F9FD"];

data_folder = "[path\to\my\folder]\MATLAB\Data";
run_folder = pwd;

cd(data_folder);

% Get data
closedDoor = readtable("grid0_closeddoor_clearhallway.csv");
displayStand = readtable("grid0_displaystand_clearhallway.csv");
largeBin = readtable("grid0_largebin_clearhallway.csv");
storageBox = readtable("grid0_storagebox_clearhallway.csv");

% Bring all the data together into one table
inputTable = vertcat(closedDoor,displayStand,largeBin,storageBox);


% ~~ Predictors and Response ~~
% Processes the data for training
predictorNames = {'Channel1','Channel2'};
toNormalise = inputTable(:,predictorNames);
predictors = inputTable(:,predictorNames);  % Could also be predictors = inputTable(:,[3,5,12]);

% response = inputTable.Grid;
response = inputTable.LabelObject;

% ~~ Train the classifier ~~
% This code specifies all the classifier options and trains the classifier
% trainedDecisionTreeModel = fitctree(predictors,response,'OptimizeHyperparameters','auto');
trainedDecisionTreeModel = fitctree(predictors,response);
validationAccuracy = 1 - loss(trainedDecisionTreeModel,predictors,response);

% ~~ Graphic display of the tree ~~
view(trainedDecisionTreeModel,'mode','graph')

predictedY = resubPredict(trainedDecisionTreeModel); % Predictions on the training data (resubstitution)

% ~~ Use Train/Test to evaluate the model performance ~~
% Split the data randomly into train and test groups, on a 70%/30% split
% First, get the size of the data
[m,n] = size(inputTable);

% Generate a vector containing a random permutation of the integers from 1 to m without repeating
idx = randperm(m);


splitPercentage = 0.70; % Sets the split percentage value
m1 = round(splitPercentage*m); % m1 is the number of training rows

% Now split the data
trainingData = inputTable(idx(1:m1),:);
testData = inputTable(idx(m1+1:end),:);

% Build a new tree on the training datasets only
predictors = trainingData(:, predictorNames);
% response = trainingData.Grid;
response = trainingData.LabelObject;

trainedDecisionModel1 = fitctree(predictors,response);

% Compute the accuracy on the training data
validationAccuracy1 = 1 - loss(trainedDecisionModel1,predictors,response);

% Performance evaluation on the test data
% Predict the labels of the test data
predicted_Y = predict(trainedDecisionModel1,testData(:,predictorNames));

% Create a confusion matrix chart from the true labels and the predicted labels
% cm = confusionchart(testData.Grid,predicted_Y);
cm = confusionchart(testData.LabelObject,predicted_Y);

The above code will produce a decision tree view, which can be seen below:

This view is what is produced when looking at the binary (i.e. LabelObject) classification. As can be seen, this is quite a dense image, which gives an indication of how much work the decision tree does to come to a solution. The tree can be pruned, and zoomed in on, to give a clearer image of the early stages:

This image shows how the initial split is made on Channel1; as the tree goes deeper, it can be seen that the process starts to bring in Channel2 to make decisions.
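
One way to produce this pruned view is MATLAB's prune function, applied to the trained tree. The sketch below is a minimal example; the pruning level of 5 is illustrative and can be adjusted to suit.

% Prune the full tree so the early splits are easier to read
% (the pruning level of 5 is illustrative)
prunedTree = prune(trainedDecisionTreeModel,'Level',5);
view(prunedTree,'mode','graph')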

We are also able to look at the confusion matrix for this analysis:

Finally, accuracy scores can be extracted for this decision tree model:

This shows that both models, which use the data in a slightly different way, produce accuracy scores of 0.9993.
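
The scores above are held in the validationAccuracy and validationAccuracy1 variables from the code; a simple way of displaying them in the Command Window is sketched below.

% Display the accuracy scores for both models
fprintf('Resubstitution accuracy (full dataset): %.4f\n', validationAccuracy);
fprintf('Accuracy of the 70%%/30%% split model:   %.4f\n', validationAccuracy1);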

The same analysis can be carried out for the grid analysis. This can be done in code by using Grid as the response instead of LabelObject:

response = inputTable.Grid;

The decision tree is shown below:

It appears that this tree has more decisions being made, which makes sense given that this is a grid analysis rather than a binary analysis. The confusion matrix is shown below:
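
Alongside the chart, the overall test accuracy for the grid analysis can be computed directly from the confusion matrix counts. A minimal sketch is shown below, assuming the training and prediction steps above have been re-run with Grid as the response.

% Overall accuracy computed from the confusion matrix counts
C = confusionmat(testData.Grid,predicted_Y);  % rows are the true grids, columns the predicted grids
gridAccuracy = sum(diag(C)) / sum(C,'all');   % correct predictions divided by total predictions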

Cross Fold Analysis and Hold Outs

A cross-fold analysis and a hold-out evaluation can also be carried out. To understand the impact of cross-fold validation, we can vary the number of folds and measure the accuracy (a sketch of such a loop is shown after the code).

The code to achieve this is shown below:

data_folder = "D:\Courses\UUJ\Research Project\masters_project\MATLAB\Data";
run_folder = pwd;

cd(data_folder);
% [file,path] = uigetfile('*.csv','Select One or More Files','MultiSelect','on');
% input_table = readtable(strcat(path,file));

closedDoor = readtable("grid0_closeddoor_clearhallway.csv");
displayStand = readtable("grid0_displaystand_clearhallway.csv");
largeBin = readtable("grid0_largebin_clearhallway.csv");
storageBox = readtable("grid0_storagebox_clearhallway.csv");

inputTable = vertcat(closedDoor,displayStand,largeBin,storageBox);

predictorNames = {'Channel1','Channel2'};
toNormalise = inputTable(:,predictorNames);
N = normalize(toNormalise,'range');  % Scale the predictors to the range [0,1] for kNN
predictors = N;

% response = inputTable.Grid;
response = inputTable.LabelObject;

mdl = fitcknn(predictors,response,'NumNeighbors',2); % Will be 2 or 10, depending on binary or grid

% Now we will carry out k-fold cross-validation of the model using crossval
cvmdl1 = crossval(mdl,'KFold',10);

% We can use Holdout to keep 30% of the data for evaluation
cvmdl2 = crossval(mdl,'Holdout',0.3);

% Output the loss and accuracy of the classifier for both the 10-fold
% cross-validation and the holdout validation
cvm1loss = kfoldLoss(cvmdl1,'LossFun','classiferror');
Accuracy1 = 1 - kfoldLoss(cvmdl1,'LossFun','classiferror');

cvm2loss = kfoldLoss(cvmdl2,'LossFun','classiferror');
Accuracy2 = 1 - kfoldLoss(cvmdl2,'LossFun','classiferror');

The accuracy results can be seen below:
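
As noted above, the fold count can be varied to measure its impact on accuracy. The sketch below wraps the cross-validation step in a loop over a set of illustrative fold counts, assuming mdl is the kNN model trained above.

% Vary the number of folds and record the accuracy for each setting
% (the fold counts chosen here are illustrative)
foldCounts = [2 5 10 20];
foldAccuracy = zeros(size(foldCounts));

for i = 1:numel(foldCounts)
    cvmdl = crossval(mdl,'KFold',foldCounts(i));
    foldAccuracy(i) = 1 - kfoldLoss(cvmdl,'LossFun','classiferror');
end

% Tabulate the fold counts against the measured accuracy
table(foldCounts',foldAccuracy','VariableNames',{'Folds','Accuracy'})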

Now a deeper analysis will be done to understand how well the data is classified using a number of machine learning algorithms.