neuston_net RUN - WHOIGit/ifcb_classifier GitHub Wiki

Running a Trained Model

Once a .ptl model is trained with neuston_net TRAIN, it can be used with neuston_net RUN to perform inference on ifcb bins and unlabeled images. As output, the command creates one or more classification result files containing each image's determined class, its confidence score, and other run/model metadata.

Required Parameters

neuston_net RUN requires the following input parameters, in this order:

SRC - input data
MODEL - a .ptl model file
RUN_ID - a label for this inference run (included in result metadata and output options)

Input Data

neuston_net RUN accepts as SRC either a single ifcb bin-id, a .txt text file listing bin-ids (newline-delimited), or a directory containing ifcb bins (directories are accessed recursively). ifcb bin-ids must be prefixed with the path to the bin's actual files on disk (a bin comprises three files bearing the same bin-id; see pyifcb for details on ifcb bins). It is also possible to run inference on regular image files instead of bins using the --type img flag, though note that this affects output options.
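
The three accepted forms of SRC can be illustrated with a minimal sketch. This is not the tool's own code: `resolve_bins` is a hypothetical helper, and the `.adc`, `.hdr`, and `.roi` extensions are an assumption based on the usual IFCB file layout rather than something stated on this page.

```python
from pathlib import Path

# Assumption: a bin is three files sharing one stem; these extensions follow
# the usual IFCB layout and are not stated on this page.
BIN_EXTENSIONS = {".adc", ".hdr", ".roi"}


def resolve_bins(src):
    """Return the path-prefixed bin ids that a given SRC value refers to."""
    src = Path(src)
    if src.is_dir():
        # Directories are searched recursively; each bin is reported once
        # even though it is made of three files.
        stems = {p.with_suffix("") for p in src.rglob("*")
                 if p.suffix in BIN_EXTENSIONS}
        return sorted(str(s) for s in stems)
    if src.suffix == ".txt":
        # Text file: one path-prefixed bin id per line, blank lines ignored.
        lines = src.read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    # Otherwise treat SRC as a single path-prefixed bin id.
    return [str(src)]
```

In this sketch a directory yields each bin once, with its path prefix and without a file extension, which mirrors the path-prefixed bin-id form described above.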

Filtering (`--filter IN|OUT`)

It is possible to further tune which bins or images you wish to run inference on using the --filter flag. You can exclude particular bins/images using the --filter OUT option. Conversely, with --filter IN you can exclude all bins/images except the ones you specify. More than one filter value may be given sequentially on the command line. The filter option will also accept a .txt file (newline-delimited) of filter values. You do NOT need to specify a bin or image's filepath when filtering.

Note: For images, if any of the values being filtered for appears in an image filename, that file matches the filter.

Examples:

--filter IN bin1 bin2 - limit processing to just bin1 and bin2
--filter IN list-of-binID.txt - limit processing to the list of bins in the .txt file
--filter OUT badbin - if you know that you don't want to classify data from badbin, you can filter it out
--type img --filter IN 2021-03-22 - assuming a multi-year directory of images for SRC and that image filenames are formatted to include a date, this filter option will only process images with "2021-03-22" in the filename, i.e. images from March 22nd 2021
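
The IN/OUT semantics above can be sketched in a few lines of Python. This is an illustration only: `load_keywords` and `apply_filter` are hypothetical names, and matching bins by substring is an assumption (the page only states substring matching for image filenames).

```python
from pathlib import Path


def load_keywords(values):
    """Expand filter values; an existing .txt file is read as a keyword list."""
    keywords = []
    for value in values:
        if value.endswith(".txt") and Path(value).is_file():
            lines = Path(value).read_text().splitlines()
            keywords.extend(line.strip() for line in lines if line.strip())
        else:
            keywords.append(value)
    return keywords


def apply_filter(names, mode, values):
    """Keep (IN) or drop (OUT) the names that contain any keyword."""
    if mode not in ("IN", "OUT"):
        raise ValueError("mode must be IN or OUT")
    keywords = load_keywords(values)
    kept = []
    for name in names:
        # Match on the final path component only, so keywords never need
        # to include a directory path.
        matched = any(k in Path(name).name for k in keywords)
        if matched == (mode == "IN"):
            kept.append(name)
    return kept
```

For example, `apply_filter(["data/bin1", "data/badbin"], "OUT", ["badbin"])` keeps only `data/bin1` in this sketch.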

Reprocessing (`--clobber`)

By default, if a bin's target output file already exists, processing for that bin is skipped. This is practical for picking up where processing left off if a run is interrupted before completion. To disable this behavior and overwrite any existing files, use --clobber. This behavior is NOT enabled for --type img.
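
The skip-unless-clobber rule amounts to a simple existence check per bin. The sketch below assumes a hypothetical `bins_to_process` helper that receives a mapping from bin id to its target output path; it is not taken from the tool's source.

```python
from pathlib import Path


def bins_to_process(bin_outputs, clobber=False):
    """Select the bins to run, given a mapping of bin id -> output file path."""
    selected = []
    for bin_id, outfile in bin_outputs.items():
        if Path(outfile).exists() and not clobber:
            # Output already present: skip it so an interrupted run can
            # resume without redoing finished bins.
            continue
        selected.append(bin_id)
    return selected
```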

Output Options

By default, one output file is created per bin. The directory it gets saved under is determined by --outdir and --outfile. OUTDIR defines the root folder inference results get saved under, and OUTFILE specifies the filetype, filename, and any bin-based directory structure beyond OUTDIR. Note the formatting tags in the {curly braces} which get replaced by actual values at output.

`--outdir`

Default value: run-output/{RUN_ID}/v3/{MODEL_ID}/. MODEL_ID is the same as the model id in the MODEL's metadata and RUN_ID is of course provided directly in the neuston_net RUN command.

`--outfile`

Default value: "D{BIN_YEAR}/D{BIN_DATE}/{BIN_ID}_class.h5". This creates a year/date/file directory structure under OUTDIR. There are three available output formats: HDF .h5, MATLAB .mat, and JSON .json. {BIN_YEAR}, {BIN_DATE}, and {BIN_ID} are replaced with a given bin's collection year, collection date, and bin id respectively. When processing bins, {BIN_ID} is required. Additionally, {INPUT_SUBDIRS} is an available formatting tag whose value is a bin's parent directory filepath (after/not-including SRC).

When processing images, no formatting tags are available. The default is img_results.json.
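
To make the tag substitution for bins concrete, here is a rough sketch of how OUTDIR and OUTFILE could combine into one output path. The `output_path` helper is hypothetical, and it assumes bin ids of the form `DYYYYMMDDTHHMMSS_IFCBnnn` so that the year and date can be sliced from the id; the tool may derive these values differently.

```python
from pathlib import PurePosixPath

DEFAULT_OUTDIR = "run-output/{RUN_ID}/v3/{MODEL_ID}"
DEFAULT_OUTFILE = "D{BIN_YEAR}/D{BIN_DATE}/{BIN_ID}_class.h5"


def output_path(bin_id, run_id, model_id, input_subdirs="",
                outdir=DEFAULT_OUTDIR, outfile=DEFAULT_OUTFILE):
    """Expand the formatting tags for one bin and return its output path."""
    if "{BIN_ID}" not in outfile:
        raise ValueError("OUTFILE must include {BIN_ID} when processing bins")
    # Assumption: ids look like DYYYYMMDDTHHMMSS_IFCBnnn, so characters 1-4
    # are the year and characters 1-8 are the date.
    tags = {
        "BIN_ID": bin_id,
        "BIN_YEAR": bin_id[1:5],
        "BIN_DATE": bin_id[1:9],
        "INPUT_SUBDIRS": input_subdirs,
    }
    root = outdir.format(RUN_ID=run_id, MODEL_ID=model_id)
    return str(PurePosixPath(root) / outfile.format(**tags))
```

With the defaults and a made-up bin id `D20210322T123456_IFCB010`, run id `run1`, and model id `modelA`, this sketch returns `run-output/run1/v3/modelA/D2021/D20210322/D20210322T123456_IFCB010_class.h5`.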

Usage

neuston_net.py RUN path/to/SRC path/to/MODEL RUN_ID

usage: neuston_net.py RUN [-h] [--type {bin,img}] [--outdir OUTDIR] [--outfile OUTFILE] 
                          [--filter IN|OUT [KEYWORD ...]] [--clobber] SRC MODEL RUN_ID

positional arguments:
  SRC                   Resource(s) to be classified. Accepts a bin, an image, a text-file, or a directory. 
                        Directories are accessed recursively
  MODEL                 Path to a previously-trained model file
  RUN_ID                Run ID. Used by --outdir

optional arguments:
  -h, --help            show this help message and exit
  --type {bin,img}      File type to perform classification on. Defaults is "bin"
  --outdir OUTDIR       Default is "run-output/{RUN_ID}/v3/{MODEL_ID}"
  --outfile OUTFILE     Name/pattern of the output classification file. 
                        If TYPE==bin, files are created on a per-bin basis. 
                        OUTFILE must include "{BIN_ID}", which will be replaced with the a bin's id. 
                        A few patters are recognized: {BIN_ID}, {BIN_YEAR}, {BIN_DATE}, {INPUT_SUBDIRS}. 
                        A few output file formats are recognized: .json, .mat, and .h5 (hdf). 
                        Default for TYPE==bin is "D{BIN_YEAR}/D{BIN_DATE}/{BIN_ID}_class.h5"; 
                        Default for TYPE==img is "img_results.json".
  --filter IN|OUT [KEYWORD ...]
                        Explicitly include (IN) or exclude (OUT) bins or image-files by KEYWORDs. 
                        KEYWORD may also be a text file containing KEYWORDs, line-deliminated.
  --clobber             If set, already processed bins in OUTDIR are reprocessed. 
                        By default, if an OUTFILE exists already the associated bin is not reprocessed.
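
When the command is launched from a script, the options listed above can be assembled into an argument list. The sketch below is an assumption-laden example: `build_run_command` is a hypothetical helper, and invoking the script through `python` is a guess about the local setup. It places the three positional arguments before the options because `--filter` takes a variable number of keywords, and a parser in this style would typically read any values that follow it as further keywords.

```python
import shlex


def build_run_command(src, model, run_id, type_="bin", outdir=None,
                      outfile=None, filter_mode=None, filter_values=(),
                      clobber=False):
    """Return the argument list for one neuston_net.py RUN call."""
    # Assumption: the script is started with the `python` on PATH.
    cmd = ["python", "neuston_net.py", "RUN", src, model, run_id]
    if type_ != "bin":
        cmd += ["--type", type_]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    if outfile is not None:
        cmd += ["--outfile", outfile]
    if filter_mode is not None:
        cmd += ["--filter", filter_mode, *filter_values]
    if clobber:
        cmd.append("--clobber")
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted preview; pass the list to subprocess.run to execute.
    print(shlex.join(build_run_command("path/to/SRC", "path/to/MODEL", "RUN_ID")))
```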