Home - WHOIGit/ifcb_classifier GitHub Wiki

IFCB Classifier

Image classification with neural nets is a process by which input images can be binned into certain categories.
This repository hosts an image-classification program designed to be trained and run on plankton images from an IFCB data source. Please check out the links on the sidebar to go into more depth on each topic.

Terminology

  • IFCB - Imaging FlowCytobot. A device used to continually record microscope imagery of oceanic plankton.
  • bin - The default output format for an IFCB. A bin is a file containing a collection of plankton images. This project uses pyifcb for reading IFCB bins.
  • NN - Neural Net. A computational model that processes data through an interconnected series of nodes. This web of nodes is likened to how neurons in a brain are connected, hence the name. This project uses PyTorch and pytorch-lightning.
  • CNN - Convolutional Neural Net. A neural net that performs "convolutions" on the data it processes; CNNs are typically associated with image processing.
  • Classification - the act of assigning something to a particular category. In this project we are classifying images of plankton into one of many user-defined categories, for example the plankton's taxonomic group.
  • class - a particular category/group/taxon that the classification task may assign an input to.
  • model - in this context, a model is a CNN which can perform a desired image-classification task. Creating a good model involves iterative training.
  • training dataset and validation dataset - labeled data used to train a model and to evaluate it during training, respectively.
  • Dataset Directory - A directory whose immediate sub-directories represent classes. The image files within these class directories are the data used during training and correspond to their respective class. Also known as "labeled data".
  • HPC - High Performance Computer. Typically an institution-owned cluster of computers with lots of computing power.
  • SLURM - a task-queuing program typically used to manage jobs submitted to an HPC.
  • SBATCH - the command used to submit job scripts to SLURM.
  • SCRATCH - a workspace/network directory designated for working data on the WHOI HPC.
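To illustrate the Dataset Directory convention, the stdlib-only sketch below builds a hypothetical layout (the class names and file names are made up) and shows how class labels follow directly from the sub-directory names:

```python
import pathlib
import tempfile

# Hypothetical Dataset Directory: each immediate sub-directory is a class,
# and the files inside it are that class's labeled images.
root = pathlib.Path(tempfile.mkdtemp())
for cls in ["Dinophysis", "Ditylum", "detritus"]:
    (root / cls).mkdir()
    (root / cls / "sample_00.png").touch()  # stand-in for a real image file

# Derive the class list from the directory names, as a training tool would.
classes = sorted(p.name for p in root.iterdir() if p.is_dir())
labeled = {c: sorted(f.name for f in (root / c).iterdir()) for c in classes}
print(classes)              # ['Dinophysis', 'Ditylum', 'detritus']
print(labeled["detritus"])  # ['sample_00.png']
```

This is the same directory-per-class convention used by common image-dataset loaders, which is what makes the labeled data self-describing.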

Overview

CNN image classification is a two-step process. A model must first be trained on a body of labeled data in order for it to perform well at the desired task; the trained model can then be applied to a body of raw or unclassified data. This kind of NN training is called "supervised learning", because the correct classification options are prescribed by the user.
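The train-then-run pattern can be seen in miniature with a toy stand-in classifier (this is not the project's CNN; the classes and feature values are hypothetical):

```python
# Toy stand-in for the supervised two-step workflow (not the project's CNN):
# step 1 "trains" on labeled data, step 2 "runs" the trained model on new data.
labeled = {
    "small_plankton": [1.0, 1.2, 0.9],  # user-prescribed class -> examples
    "large_plankton": [5.0, 4.8, 5.3],
}

# Step 1, training: fit the model (here, one mean "centroid" per class).
centroids = {c: sum(v) / len(v) for c, v in labeled.items()}

# Step 2, running: classify unlabeled inputs with the trained model.
def classify(x):
    return min(centroids, key=lambda c: abs(centroids[c] - x))

print(classify(1.1))  # small_plankton
print(classify(4.9))  # large_plankton
```

A CNN replaces the centroid fit with many iterations of gradient-based training, but the user-facing workflow is the same: train on labeled data, then run on new data.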

Model training is a computationally intensive process. Although the core model training/running processes in this project can run on various platforms, many of the features are designed to leverage the computational resources of the WHOI HPC.

User-Command Files

  • neuston_net.py - This program is used to TRAIN and RUN image-classification models.
  • neuston_sbatch.py - This program is used to streamline the process of submitting neuston_net.py commands to a SLURM-enabled computer via SBATCH.
  • neuston_util.py - This program is used to create default configuration templates and calculate values in support of a model training.
  • neuston_onnx.py - This program is used to convert trained models to the more portable .onnx format. It can also run onnx models.
  • Other neuston_*.py files, e.g. neuston_callback.py, neuston_data.py, and neuston_models.py, are support modules and are not meant to be run directly.

Directories

The following are the default directories for this project's data inputs and outputs. The defaults can be changed with command-line arguments at runtime.

  • training-data - Directory for training datasets.
  • training-output - Directory where training results are written.
  • run-data - Directory for datasets to be classified.
  • run-output - Directory where classification results are written.