UTIL DATASET_CONFIG - WHOIGit/ifcb_classifier GitHub Wiki

Dataset Config CSV

A Dataset Config CSV allows two or more Dataset Directories to be combined at runtime. This is advantageous because data from multiple datasets can be used without manually moving innumerable small image files around or fussing with minor class-directory name differences. A Dataset Config CSV is used by specifying it as the SRC in neuston_net.py SRC MODEL TRAIN_ID.

Consider the following example Dataset Directories D1 & D2, and the Dataset Config CSV D1D2_config.csv.

path/to/
├─ D1/
│  ├─ amoeba/
│  ├─ Diatom/
│  ├─ Ciliate/
│  ├─ unknown1/
│  └─ bad/
└─ D2/
   ├─ Amoeba/
   ├─ Diatom/
   ├─ Euglena/
   ├─ unknown1/
   ├─ other/
   └─ bad/

        ,   path/to/D1,    path/to/D2
  amoeba,       Amoeba,             0
  Amoeba,            0,             1
  Diatom,            1,             1
 Ciliate,            1,             0
 Euglena,            0,             1
unknown1,    Unknown_A,     Unknown_B
   other,            0,             0
     bad,            0,             0

Note the following properties common to all Dataset Config CSVs:

  • The first column bears all the possible class names from the Dataset Directories one seeks to combine.
    • Class names are case sensitive and duplicate names are invalid.
    • The very first cell may be left blank. It is of no consequence.
  • Subsequent columns each represent the data contribution from a Dataset Directory.
    • Column headers must be a path to a valid Dataset Directory.
    • Column headers may be prefixed with an integer-and-colon to indicate dataset priority, e.g. 1:path/to/D1. See "Dataset Priority" below.
  • A 0 in a dataset column cell indicates that the corresponding class should be excluded from that dataset OR that the class doesn't appear in that dataset.
  • A 1 in a dataset column cell indicates that the class exists in the dataset and should be included.
  • Text in a dataset column cell renames and includes the class, possibly allowing it to be combined with a class from a different dataset.

The example Dataset Config CSV above:

  1. Renames amoeba from D1 to Amoeba so that it matches D2. The Amoeba classes from D1 and D2 are combined.
  2. Includes the Diatom class from both D1 and D2, combining them.
  3. Includes Ciliate and Euglena, which appear only in D1 and D2 respectively. A 0 must be placed in any dataset column that a class does not belong to.
  4. Renames unknown1 on a per-dataset basis. unknown1 in D1 and D2 represents two different unknown species that happen to share the same placeholder label, so renaming them to Unknown_A and Unknown_B keeps them from being combined.
  5. Excludes other from D2.
  6. Excludes bad from both D1 and D2, although it exists in both.

The resultant dataset will feature the following classes:

  • Amoeba - from D1 and D2
  • Diatom - from D1 and D2
  • Ciliate - from D1
  • Euglena - from D2
  • Unknown_A - from D1's unknown1
  • Unknown_B - from D2's unknown1
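
The cell rules above can be sketched in a few lines of Python. This is only an illustration of the 0 / 1 / rename semantics described on this page, not the code neuston_net.py actually runs; the function name resolve_classes and the inline CONFIG_TEXT string are hypothetical.

```python
import csv
import io
from collections import defaultdict

# The example config from this page, inlined so the sketch is self-contained.
CONFIG_TEXT = """\
        ,   path/to/D1,    path/to/D2
  amoeba,       Amoeba,             0
  Amoeba,            0,             1
  Diatom,            1,             1
 Ciliate,            1,             0
 Euglena,            0,             1
unknown1,    Unknown_A,     Unknown_B
   other,            0,             0
     bad,            0,             0
"""

def resolve_classes(config_text):
    """Map each output class to the (dataset, source class) pairs feeding it.

    Hypothetical helper: applies the 0 / 1 / rename rules to every cell.
    """
    reader = csv.reader(io.StringIO(config_text))
    rows = [[cell.strip() for cell in row] for row in reader if row]
    header, body = rows[0], rows[1:]
    datasets = header[1:]  # the very first cell is ignored
    merged = defaultdict(list)
    for row in body:
        src_class, cells = row[0], row[1:]
        for dataset, cell in zip(datasets, cells):
            if cell == "0":
                continue  # class excluded from, or absent in, this dataset
            # "1" keeps the name; any other text renames the class
            out_class = src_class if cell == "1" else cell
            merged[out_class].append((dataset, src_class))
    return dict(merged)

result = resolve_classes(CONFIG_TEXT)
for name, sources in result.items():
    print(name, sources)
```

Running this against the example config yields exactly the six classes listed above, with Amoeba and Diatom each drawing from both datasets.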

Making a baseline Dataset Config CSV using neuston_util.py

A baseline Dataset Config CSV for a given set of Dataset Directories can be generated using neuston_util.py MAKE_DATASET_CONFIG. The command automatically generates the first column from all classes available across the specified datasets, and populates the dataset columns with 1s (class appears in this dataset) and 0s (class does not appear in this dataset). Although a baseline Dataset Config CSV can be used as-is, it is up to the user to review and further edit the CSV to make sure class names and class exclusions are satisfactory.

  • Editing a 1 to a 0 will exclude that class from that dataset.
  • Editing a 1 to some text will rename that class, possibly allowing it to be combined with another class in the first column.
  • Cells with 0s should not be edited, since in a baseline config a 0 means the class does not exist in that dataset.
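
To make the baseline layout concrete, here is a minimal Python sketch that builds the same kind of table from a set of directories. It is an approximation for illustration only, not the implementation behind MAKE_DATASET_CONFIG: the function name make_baseline_rows is hypothetical, and the sorted row order is an assumption (the real command may order rows differently).

```python
import csv
import sys
import tempfile
from pathlib import Path

def make_baseline_rows(dataset_paths):
    """Return a header row plus one row per class, with "1"/"0" per dataset.

    Hypothetical helper: a class is any immediate subdirectory of a dataset.
    """
    classes_per_dataset = [
        {p.name for p in Path(d).iterdir() if p.is_dir()} for d in dataset_paths
    ]
    all_classes = sorted(set().union(*classes_per_dataset))  # ordering is assumed
    rows = [[""] + [str(d) for d in dataset_paths]]
    for name in all_classes:
        rows.append(
            [name] + ["1" if name in found else "0" for found in classes_per_dataset]
        )
    return rows

# Build throwaway copies of the D1 and D2 layouts used on this page.
with tempfile.TemporaryDirectory() as tmp:
    layout = {
        "D1": ["amoeba", "Diatom", "Ciliate", "unknown1", "bad"],
        "D2": ["Amoeba", "Diatom", "Euglena", "unknown1", "other", "bad"],
    }
    for dataset, class_names in layout.items():
        for class_name in class_names:
            (Path(tmp) / dataset / class_name).mkdir(parents=True)
    baseline = make_baseline_rows([Path(tmp) / "D1", Path(tmp) / "D2"])

csv.writer(sys.stdout).writerows(baseline)
```

Note that in this baseline amoeba and Amoeba stay separate rows and bad is marked 1 in both datasets; merging and excluding classes is the manual editing step described above.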

Example Usage

./neuston_util.py MAKE_DATASET_CONFIG path/to/D1 path/to/D2 -o D1D2_config.csv

usage: neuston_util.py MAKE_DATASET_CONFIG [-h] [-o OUTFILE] PATH [PATH ...]

positional arguments:
  PATH                  List of dataset paths. Space deliminated. 
                        You may optionally prefix the paths with "n:" where n is an integer priority value. 

optional arguments:
  -h, --help            show this help message and exit
  -o OUTFILE            Specify an output file. If unset, outputs to stdout.

Misc.

A Dataset Config CSV may be used by neuston_util.py MAKE_CLASS_CONFIG to create a baseline Class Config CSV suitable for use in the --class-config argument. See the Class Config CSV documentation for details.

Dataset configuring occurs at runtime before any further class changes are made by --class-config, --class-min, or --class-max (but see Dataset Priority below for details).

Dataset Priority

Dataset Priority only comes into play when --class-max is used to cap class instances to some maximum count. When --class-max is invoked, it is sometimes desirable to prioritize inclusion of samples from one dataset over another. Priority is specified in a Dataset Config CSV by adding an integer-and-colon prefix (e.g. 1: or 2:) to the dataset column headers, where smaller values have higher priority.
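
As a rough illustration of the header format, the prefix can be separated from the path as follows. The function name split_priority is hypothetical, and representing "no priority given" as None is an assumption made for this sketch.

```python
import re

def split_priority(header):
    """Split "N:path" into (N, "path"); headers without a prefix give (None, path).

    Hypothetical helper, not part of neuston_util.py.
    """
    match = re.match(r"^\s*(\d+):(.*)$", header)
    if match:
        return int(match.group(1)), match.group(2).strip()
    return None, header.strip()

print(split_priority("1:path/to/D1"))
print(split_priority("path/to/D2"))
```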

Consider the following example where:

  • The header from D1D2_config.csv is modified to , 1:path/to/D1, 2:path/to/D2
  • amoeba from D1 has 6000 samples
  • Amoeba from D2 has 6000 samples
  • Diatom from D1 has 12000 samples
  • Diatom from D2 has 12000 samples
  • the class-instance maximum is set as --class-max 10000

The resultant dataset will have:

  • 10000 Amoeba instances (6000 from D1, 4000 from D2 selected at random)
  • 10000 Diatom instances (all from D1, selected at random, none from D2)

Classes whose combined instance counts across datasets don't exceed --class-max will not be truncated, regardless of priority.

Multiple datasets can have the same level of priority. When no priorities are specified, all datasets have equal priority. If only some datasets have priority values, datasets without priority values are assigned a lower priority level than those with one. When datasets with the same level of priority are truncated to some maximum amount, the instances are selected from them at random. The random selection can be replicated if need be using --seed.
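
The priority rules can be sketched as follows. This is a simplified model of the behavior described in this section, not the actual neuston_net.py code: cap_class and its arguments are hypothetical, datasets without a priority are assumed to rank after all prioritized ones, and the seed here will not reproduce the selections made by the real --seed option.

```python
import random
from collections import Counter

def cap_class(samples_by_dataset, priorities, class_max, seed=None):
    """Select up to class_max samples, filling from higher-priority datasets first.

    samples_by_dataset: {dataset: [sample, ...]} for one class
    priorities: {dataset: int or None}; smaller ints win, None ranks last (assumed)
    Returns a list of (dataset, sample) pairs.
    """
    rng = random.Random(seed)
    # Group datasets that share a priority level.
    levels = {}
    for dataset in samples_by_dataset:
        prio = priorities.get(dataset)
        key = float("inf") if prio is None else prio
        levels.setdefault(key, []).append(dataset)
    chosen = []
    for key in sorted(levels):
        remaining = class_max - len(chosen)
        if remaining <= 0:
            break
        # Datasets at the same level are pooled and sampled together at random.
        pool = [(d, s) for d in levels[key] for s in samples_by_dataset[d]]
        if len(pool) > remaining:
            pool = rng.sample(pool, remaining)
        chosen.extend(pool)
    return chosen

priorities = {"D1": 1, "D2": 2}
amoeba = {"D1": list(range(6000)), "D2": list(range(6000))}
diatom = {"D1": list(range(12000)), "D2": list(range(12000))}

amoeba_counts = Counter(d for d, _ in cap_class(amoeba, priorities, 10000, seed=0))
diatom_counts = Counter(d for d, _ in cap_class(diatom, priorities, 10000, seed=0))
print(dict(amoeba_counts))
print(dict(diatom_counts))
```

With the numbers from the example above, this selects 6000 Amoeba samples from D1 and 4000 from D2, and all 10000 Diatom samples from D1.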