Machine Learning for Categorical Classification: Seed Types and Colour Recognition Models - ivinnyaraujo/dataengineer-datascience-python GitHub Wiki

This project employs supervised machine learning techniques to address two distinct classification use cases:

Binary classification of pumpkin seeds using logistic regression
Multiclass colour recognition using discriminant analysis

All data processing and analysis were conducted in RStudio using a suite of R packages: caret for model evaluation, nnet for multinomial logistic regression, tidyverse for data wrangling, ggplot2 for visualisations, dplyr for data manipulation, pROC for ROC analysis, and MASS for discriminant analysis. RStudio is an excellent environment for exploratory analysis and academic projects, but moving to production requires a more robust framework. For enterprise deployment, consider automating the workflow on a platform such as Microsoft Fabric Notebooks (a natural fit for data developers working with the Microsoft stack), which integrates with Power BI for data visualisation.

The full code and dataset can be found here.

Logistic Regression & Discriminant Analysis R Code

library(caret)    # generating and evaluating the confusion matrix
library(nnet)     # neural network models, used here for multinom()
library(tidyverse) # Collection of packages for data manipulation, visualisation, and modeling
library(ggplot2)  # For plotting decision boundaries and other visualisations
library(dplyr)    # data manipulation (filtering, grouping, summarising, etc.)
library(pROC)     # For ROC curve analysis and evaluating model performance
library(MASS)     # For Quadratic Discriminant Analysis (QDA) and other statistical modeling tools

# Part A - Logistic Regression: Automatically Detecting Seed Types

# Load data
seeds_training <- read.csv("seeds_training.csv", stringsAsFactors = FALSE)
seeds_test <- read.csv("seeds_test.csv", stringsAsFactors = FALSE)

# Encode the class label as a factor variable
seeds_training$Class <- as.factor(seeds_training$Class)
seeds_test$Class <- as.factor(seeds_test$Class)

# Checking the classes distribution
table(seeds_training$Class)
prop.table(table(seeds_training$Class))

# Fit logistic regression model
model <- glm(Class ~ ., data = seeds_training, family = binomial)

# Predict probabilities
probabilities <- predict(model, newdata = seeds_test, type = "response")

# Create density plot of predicted probabilities by true class
data.frame(prob = probabilities, true_class = seeds_test$Class) %>%
  ggplot(aes(x = prob, fill = true_class)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0.5, linetype = "dashed", color = "red") +
  labs(title = "Probability Distribution by Seed Class", fill = "True Seed Class") + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Generate ROC Curve
roc_obj <- roc(seeds_test$Class, probabilities)
plot(roc_obj, print.thres = TRUE)
mtext("ROC Curve with Optimal Threshold", side = 3, line = 3, font = 2, cex = 1.2)

# Evaluate accuracy across thresholds
thresholds <- seq(0.1, 0.9, by = 0.05)
accuracies <- sapply(thresholds, function(t) {
  mean(ifelse(probabilities > t, "Urgup Sivrisi", "Cercevelik") == seeds_test$Class)
})

# Optimal threshold
best_threshold <- round(thresholds[which.max(accuracies)], 2)
print(paste("Optimal threshold:", best_threshold))

# Generate plot
data.frame(Threshold = thresholds, Accuracy = accuracies) %>%
  ggplot(aes(Threshold, Accuracy)) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = best_threshold, color = "red") +
  labs(title = "Accuracy Across Thresholds") + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

# Convert probabilities to class predictions
predicted_classes <- ifelse(probabilities > best_threshold, "Urgup Sivrisi", "Cercevelik")

# Calculate accuracy
accuracy <- mean(predicted_classes == seeds_test$Class)
print(paste("GLM Model Accuracy:", round(accuracy * 100, 2), "%"))

# Plot actual vs predicted class counts for glm model
results_df <- data.frame(
  Actual = seeds_test$Class,
  Predicted = predicted_classes
)

# Plot
ggplot(results_df, aes(x = Actual, fill = Predicted)) +
  geom_bar(position = "stack") +
  scale_fill_manual(values = c("Cercevelik" = "red3", "Urgup Sivrisi" = "green4"),
                    name = "Predicted Class",
                    labels = c("Cercevelik (Predicted)", "Urgup Sivrisi (Predicted)")) +
  labs(title = "Stacked Bar Chart of Actual vs Predicted (GLM Model)",
       x = "Actual Class",
       y = "Count") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    legend.title = element_text(face = "bold"),
    legend.position = "right"
  )

# Multinomial logistic regression model
model_multinom <- multinom(Class ~ ., data = seeds_training)

# Predict class labels on the test set (predict() dispatches to the multinom method)
predictions_model_multinom <- predict(model_multinom, newdata = seeds_test)

# Check model accuracy
accuracy_model_multinom <- sum(predictions_model_multinom == seeds_test$Class) / 
                          nrow(seeds_test)
print(paste("Multinom Model Accuracy:", 
           round(accuracy_model_multinom * 100, 2), "%"))

# Both models distinguish Urgup Sivrisi from Cercevelik seeds well on the test set
# (glm = 88.4%; multinom = 87.4%). The multinomial (multinom) model classifies the
# seed types successfully, but its iterative optimisation adds unnecessary
# complexity for a binary task. The standard logistic regression (glm) is slightly
# more accurate, simpler, and faster to train on larger datasets, so the glm model
# is used for this binary classification.

# glm model confusion matrix
conf_matrix_glm <- confusionMatrix(
  factor(predicted_classes, levels = levels(seeds_test$Class)),
  seeds_test$Class
)
print(conf_matrix_glm)


# Calculate variable importance scores from the glm model -
# varImp() computes importance metrics for each predictor
var_importance <- varImp(model)

# Extract the top 3 predictors:
# 1. rownames(var_importance): gets feature names
# 2. order(-var_importance$Overall): sorts by importance (descending)
# 3. [1:3] selects the top 3
top_features <- rownames(var_importance)[order(-var_importance$Overall)][1:3]

print(paste("Top 3 features:", paste(top_features, collapse = ", ")))

# Part B - Discriminant Analysis: Predicting Color Name from RGB Values

# Read data
colors_train <- read.csv("colors_train.csv")
colors_test <- read.csv("colors_test.csv")

# Count classes
num_classes <- length(unique(colors_train$color)) # number of distinct color labels
cat("There are", num_classes, "classes in the dataset.\n")

# Convert color to factor
colors_train$color <- as.factor(colors_train$color)

# Fit the QDA model using the training data
qda_model <- qda(color ~ r + b, data = colors_train)

# Make predictions on the test data
test_predictions <- predict(qda_model, colors_test)$class

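# Create confusion matrix (align test labels to the training factor levels)
conf_matrix <- confusionMatrix(
  factor(test_predictions, levels = levels(colors_train$color)),
  factor(colors_test$color, levels = levels(colors_train$color))
)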
conf_matrix

# Create grid of R and B values
rb_grid <- expand.grid(r = seq(0, 255, by = 5), b = seq(0, 255, by = 5))

# Predict the classifications
grid_predictions <- predict(qda_model, rb_grid)$class
grid_data <- cbind(rb_grid, color = grid_predictions)

# Plot decision boundaries
ggplot() +
  geom_tile(
    data = grid_data,
    aes(x = r, y = b, fill = color)
  ) +
  scale_fill_manual(
    values = c(
      "red" = "red",
      "blue" = "blue",
      "purple" = "purple", 
      "pink" = "pink",
      "brown" = "brown"
    )
  ) +
  labs(
    title = "QDA Decision Boundaries for Color Classification",
    x = "Red (R)",
    y = "Blue (B)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(
      hjust = 0.5,
      face = "bold"
    )
  )

# Classifying the RGB point (200, 0, 200)
# Define the test point (G is fixed at 0, so only R and B are needed)
test_point <- data.frame(r = 200, b = 200)

# Color class prediction
prediction <- predict(qda_model, test_point)
predicted_color <- as.character(prediction$class)
cat("The predicted color is:", predicted_color, "\n")

Feature Engineering

Feature engineering is the process of transforming raw data into meaningful variables that enhance a model's predictive performance. It involves selecting, modifying, or creating new features to improve classification accuracy, such as handling missing data through imputation, scaling numerical values for consistency, encoding categorical variables into numerical representations, and generating interaction terms to capture complex relationships. By refining relevant attributes, feature engineering helps models identify patterns more effectively.
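The steps named above (imputation, scaling, encoding, interaction terms) can be sketched in base R. This is a minimal illustration on a made-up data frame: the `seeds_demo` object, its column names, and its values are assumptions for the example and are not taken from the project's dataset.

```r
# Toy data frame (hypothetical values, not the project's data)
seeds_demo <- data.frame(
  Area      = c(56000, 76000, NA, 81000, 69000),
  Perimeter = c(890, 1060, 980, 1120, 1010),
  Shape     = c("round", "oval", "oval", "round", "oval"),
  stringsAsFactors = FALSE
)

# 1. Impute missing numeric values with the column median
seeds_demo$Area[is.na(seeds_demo$Area)] <- median(seeds_demo$Area, na.rm = TRUE)

# 2. Scale numeric columns to mean 0 and standard deviation 1
num_cols <- c("Area", "Perimeter")
seeds_demo[num_cols] <- lapply(seeds_demo[num_cols],
                               function(x) as.numeric(scale(x)))

# 3. Encode the categorical column as a factor
seeds_demo$Shape <- factor(seeds_demo$Shape)

# 4. Build a design matrix with dummy coding for Shape and an
#    Area x Perimeter interaction term
design <- model.matrix(~ Area * Perimeter + Shape, data = seeds_demo)
print(colnames(design))
```

Median imputation and z-score scaling are only one reasonable choice each; other strategies may suit a given dataset better.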

Part A: Seed Classification via Logistic Regression

Goal: Automatically differentiate between Çerçevelik and Ürgüp Sivrisi pumpkin varieties using morphological features.

Conducted feature engineering to enhance class separability
Implemented and compared binary logistic regression (GLM) with multinomial regression, selecting GLM for its interpretability (88.4% accuracy)
Optimised the classification threshold through ROC curve analysis (optimal threshold = 0.44) to balance Type I and Type II errors
Identified the most significant predictors to classify seed types

Results: The GLM model, using all attributes to classify seed types, achieved 88.4% accuracy on the test set.
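Threshold selection can follow more than one criterion, and the criteria do not always agree. The sketch below uses simulated data (an assumption; `demo`, `x`, and `y` are hypothetical stand-ins for the seed measurements) and base R only. It compares the threshold that maximises Youden's J (sensitivity + specificity - 1), a common ROC-based criterion, with the threshold that maximises plain accuracy.

```r
set.seed(42)

# Simulated binary data (hypothetical; not the pumpkin seed dataset)
n <- 400
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
demo <- data.frame(x = x, y = y)

# Fit a logistic regression and obtain predicted probabilities
fit <- glm(y ~ x, data = demo, family = binomial)
p <- predict(fit, type = "response")

# For each candidate threshold, compute sensitivity, specificity and accuracy
thresholds <- seq(0.05, 0.95, by = 0.01)
metrics <- t(sapply(thresholds, function(thr) {
  pred <- as.integer(p > thr)
  sens <- sum(pred == 1 & demo$y == 1) / sum(demo$y == 1)
  spec <- sum(pred == 0 & demo$y == 0) / sum(demo$y == 0)
  c(threshold = thr, sensitivity = sens, specificity = spec,
    accuracy = mean(pred == demo$y))
}))
metrics <- as.data.frame(metrics)

# Youden's J balances both error types; accuracy weights every case equally
metrics$youden <- metrics$sensitivity + metrics$specificity - 1
best_youden   <- metrics$threshold[which.max(metrics$youden)]
best_accuracy <- metrics$threshold[which.max(metrics$accuracy)]

cat("Threshold maximising Youden's J:", best_youden, "\n")
cat("Threshold maximising accuracy:  ", best_accuracy, "\n")
```

When the two thresholds differ, the choice depends on whether false positives and false negatives carry equal cost for the application.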

Part B: Colour Classification using Discriminant Analysis

Goal: Classify colour names from RGB values (with G=0) using reduced-dimensionality input (R, B)

Applied Quadratic Discriminant Analysis (QDA) to model non-linear decision boundaries in colour space. QDA suits this classification task because it accommodates non-linear boundaries between colour categories. Linear Discriminant Analysis (LDA) assumes that all classes share one covariance matrix, which yields linear decision boundaries; QDA instead estimates a separate covariance matrix for each class. Transitions between colours in the (R, B) space do not follow a linear pattern, so QDA's class-specific covariances give it the flexibility to capture these variations with curved decision boundaries.
Achieved 95% classification accuracy with robust per-class performance
Plotted decision boundaries to visualise effective separation of colour categories

Results: A high-performance classifier validated by correct prediction of (200, 0, 200) as "pink"
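The difference between LDA and QDA shows most clearly when classes share a centre but differ in spread. The following sketch uses simulated data (an assumption; the class names `tight` and `wide` and all values are hypothetical, not the colour dataset) and the `lda()` and `qda()` functions from MASS.

```r
library(MASS)  # provides lda() and qda()

set.seed(1)

# Two simulated classes with the same centre but very different spread
n <- 500
tight <- data.frame(r = rnorm(n, 128, 10), b = rnorm(n, 128, 10), color = "tight")
wide  <- data.frame(r = rnorm(n, 128, 40), b = rnorm(n, 128, 40), color = "wide")
demo  <- rbind(tight, wide)
demo$color <- factor(demo$color)

# LDA pools one covariance matrix; QDA estimates one per class
lda_fit <- lda(color ~ r + b, data = demo)
qda_fit <- qda(color ~ r + b, data = demo)

# Training accuracy of each model
lda_acc <- mean(predict(lda_fit, demo)$class == demo$color)
qda_acc <- mean(predict(qda_fit, demo)$class == demo$color)

cat("LDA training accuracy:", round(lda_acc, 3), "\n")
cat("QDA training accuracy:", round(qda_acc, 3), "\n")
```

Because the class means coincide, a linear boundary cannot separate the groups, while QDA can place a closed curve around the tighter class. Real colour data will rarely be this extreme, so the gap between the two models is usually smaller.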

Technical Notes

Data Preparation: Explicit factor encoding of categorical responses ensures proper model interpretation by converting categorical text labels (like "red"/"blue") into predefined, numerically indexed categories that machine learning algorithms can process correctly
Algorithm Selection: Rigorous comparison of model complexity versus generalisability
Performance Validation: Comprehensive evaluation using balanced accuracy, ROC analysis, and confusion matrices
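Two of the notes above lend themselves to a short base R illustration: how a factor maps text labels to integer codes, and how balanced accuracy is derived from a confusion matrix. The labels and predictions below are made up for the example.

```r
# Character labels become a factor with alphabetically ordered levels
colour_labels <- c("red", "blue", "red", "pink")
f <- factor(colour_labels)
levels(f)      # "blue" "pink" "red"
as.integer(f)  # 3 1 3 2

# Fixing the levels keeps predictions and reference labels aligned,
# even if a class happens to be absent from one of them
actual    <- factor(c("A", "A", "A", "A", "B", "B"), levels = c("A", "B"))
predicted <- factor(c("A", "A", "A", "B", "B", "A"), levels = c("A", "B"))

cm <- table(Predicted = predicted, Actual = actual)

# Treat "A" as the positive class
sensitivity <- cm["A", "A"] / sum(cm[, "A"])          # 3 / 4
specificity <- cm["B", "B"] / sum(cm[, "B"])          # 1 / 2
balanced_accuracy <- (sensitivity + specificity) / 2  # 0.625
accuracy <- sum(diag(cm)) / sum(cm)                   # 4 / 6

print(cm)
cat("Balanced accuracy:", balanced_accuracy, "\n")
```

With unequal class sizes, as here, balanced accuracy (0.625) is lower than plain accuracy (about 0.667) because it gives the weaker minority-class performance equal weight.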

Conclusion

This project shows the importance of feature selection, algorithm choice, and threshold optimisation when developing classification models, and of balancing accuracy requirements with implementation constraints. It also shows how optimal model performance is influenced by three key factors:

Problem constraints (binary vs. multiclass classification requirements)
Data topology (linear vs. non-linear feature separability)
Computational trade-offs (model complexity vs. interpretability needs)

Overall:

GLM provides the best accuracy-interpretability balance for binary classification
QDA offers better non-linear separation capabilities for multiclass scenarios