TRAIN dataset_params - WHOIGit/ifcb_classifier GitHub Wiki

Datasets

Specifying a dataset is done via the SRC positional argument of neuston_net TRAIN

neuston_net.py TRAIN SRC MODEL TRAINING_ID

SRC may be a Dataset Directory, or a Dataset CSV.

  • Dataset Directory - a directory whose sub-directories are the dataset's class labels. Images in these sub-directories are used for training and validation.
  • Dataset CSV - a csv file that specifies multiple Dataset Directories, and how classes from each ought to be combined. For more details, including how to create a dataset CSV, see neuston_util.py MAKE_DATASET_CONFIG
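
For reference, a minimal Dataset Directory might look like the following (the class names and filenames here are purely illustrative):

```
training-data/           <-- pass this path as SRC
├── Chaetoceros/         <-- sub-directory name becomes the class label
│   ├── img_001.png
│   └── img_002.png
└── Ditylum/
    ├── img_101.png
    └── img_102.png
```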

Dataset Parameters

The following are flags used to adjust a dataset during training.

Dataset Adjustments:
  --seed SEED             Set a specific seed for deterministic output & dataset-splitting reproducibility.
  --split T:V             Ratio of images per-class to split randomly into Training and Validation datasets.
                          Randomness affected by SEED. Default is "80:20"
  --class-config CSV COL  Skip and combine classes as defined by column COL of a CSV configuration file.
  --class-min MIN         Exclude classes with fewer than MIN instances. Default is 2.
  --class-max MAX         Limit classes to a MAX number of instances. 
                          If multiple datasets are specified with a dataset-configuration csv, 
                          classes from lower-priority datasets are truncated first.

Images for Training and Validation (--split)

The IFCB Classifier uses Supervised Learning to train its image-classification neural net models. This technique requires a body of labeled data that teaches the neural net which "class" or grouping an image belongs to. If allowed to train indefinitely, a neural net will learn to correctly identify images in the training dataset but will fail when presented with novel data, i.e. it will fail to generalize its classification task. To prevent this, a validation dataset is commonly used. The validation dataset is not used to train the model's weights and biases; instead, it is used to evaluate the performance of the in-training neural net on data the net has not directly trained on. During training, when the model's performance on the validation dataset stops improving and starts to worsen, we know the model has passed the threshold beyond which it is no longer generalizing (a good place to stop training).

Neuston Net automatically selects a proportion of images from the SRC dataset to separate out into a Training dataset and a Validation dataset. It does so on a per-class basis, such that the proportion of training and validation images stays consistent within a given class (particularly important when the number of instances/images between classes varies widely). The images are selected at random (but see below), and the ratio of images between the Training and Validation datasets can be set using --split. By default, the SRC dataset is split "80:20": 80% of images are used for Training, and the remaining 20% are used for Validation.
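
The per-class splitting behavior described above can be sketched as follows. This is a minimal, hypothetical helper for illustration only, not the actual neuston_net implementation:

```python
import random

def split_per_class(class_to_images, split="80:20", seed=None):
    """Split each class's image list into training/validation by the given ratio.
    Sketch of the behavior described above; not neuston_net's actual code."""
    train_pct = int(split.split(":")[0])   # e.g. "80:20" -> 80
    rng = random.Random(seed)              # seeded for reproducibility (see --seed)
    train, val = {}, {}
    for label, images in class_to_images.items():
        shuffled = list(images)
        rng.shuffle(shuffled)              # random selection of images
        cut = round(len(shuffled) * train_pct / 100)
        train[label] = shuffled[:cut]      # first 80% -> Training
        val[label] = shuffled[cut:]        # remaining 20% -> Validation
    return train, val
```

Because the ratio is applied per class, a class with 50 images contributes 40/10 and a class with 1000 images contributes 800/200, so no class is over- or under-represented in validation.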

Randomization (--seed)

A number of variables are subject to randomness when a training run is initiated. For instance, images are split randomly into the training and validation datasets. This behavior is normal, but it affects repeatability. To account for this, the randomization can be set to a known integer seed value, which ensures that all random effects are reproducible. By default, neuston_net records its random seed in the args.yml file and internally in the .ptl file. A seed can be specified manually using the --seed flag.

Note: if other input parameters differ, trainings with the same seed value may still produce different output. For example, the specific images that get split into the Training and Validation datasets may differ significantly if the SRC images differ or are affected by other runtime parameters like --class-max.
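
The following sketch illustrates the principle with Python's random module alone (neuston_net seeds more machinery than this, so treat it only as an analogy):

```python
import random

def split_order(n_images, seed):
    """Deterministically shuffle image indices for a given seed."""
    rng = random.Random(seed)
    order = list(range(n_images))
    rng.shuffle(order)
    return order

# Same seed and same inputs -> the identical split order every run.
assert split_order(10, seed=42) == split_order(10, seed=42)

# Same seed but different inputs (e.g. a different image count after
# --class-max truncation) will generally yield a different order,
# which is why identical seeds alone do not guarantee identical runs.
```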

Configuring Classes (--class-config --class-min --class-max)

SRC may be further adjusted at runtime, allowing a user to rename, combine, skip, and limit certain classes.

  • --class-min sets a minimum number of instances required for a given class to be included during training. Images in excluded classes are not used during training and excluded classes do not feature in the list of possible output classes.
  • --class-max randomly truncates, on a per-class basis, the number of instances of each class to the specified maximum integer.
  • --class-config allows for class names to be combined, renamed, and skipped via a CSV-file and specified configuration COLumn header. See neuston_util.py MAKE_CLASS_CONFIG for more details.
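
The combined effect of --class-min and --class-max can be sketched like this. The helper below is hypothetical and for illustration only, not neuston_net's actual implementation:

```python
import random

def apply_class_limits(class_to_images, class_min=2, class_max=None, seed=None):
    """Drop classes with fewer than class_min images, and randomly truncate
    any class with more than class_max images down to class_max."""
    rng = random.Random(seed)   # truncation is random, so seeding matters here too
    kept = {}
    for label, images in class_to_images.items():
        if len(images) < class_min:
            continue            # excluded classes never reach training or output
        images = list(images)
        if class_max is not None and len(images) > class_max:
            images = rng.sample(images, class_max)  # random per-class truncation
        kept[label] = images
    return kept
```

With class_min=2 and class_max=5, a class holding a single image is dropped entirely, a class of 3 passes through untouched, and a class of 10 is randomly cut down to 5.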