Cluster Tool - MarineBioAcousticsRC/Triton GitHub Wiki


The Cluster Tool Remora provides user interfaces and command line tools for identifying recurrent signal types in large acoustic datasets using unsupervised machine learning. A detailed description of the method and an illustrative use case are available in Frasier et al. 2017.

How It Works
How To Use It
- Step 1: Add Cluster Tool Remora to Triton
- Step 2: Cluster Bins
- Step 3: Composite Clusters
- Step 4: Post-Clustering Options

How It Works

This tool currently operates on short impulsive acoustic events, such as echolocation clicks, propeller cavitation noise, snapping shrimp signals, and echosounders. Events are compared to each other based on a variety of features, including spectral content, waveform shape, and temporal regularity.

Differences between events are used to represent the input dataset as a network in which "nodes" represent detections and the length of "edges" connecting nodes represents the degree of similarity between them, with shorter edges indicating higher similarity. A two-stage unsupervised clustering algorithm is used to automatically group sets of highly similar nodes in order to identify recurring signal types across a large dataset.
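As a rough illustration of the network idea (not the Remora's actual implementation), the sketch below builds a small similarity network from synthetic spectra. The variable names, the correlation-based edge length, and the pruning threshold are all assumptions chosen for this example.

```matlab
% Illustrative sketch only: synthetic data, hypothetical names and threshold.
rng(1);                                   % reproducible synthetic data
f = linspace(10, 100, 64);                % assumed frequency axis (kHz)
typeA = exp(-((f - 30) / 8).^2);          % signal type with a peak near 30 kHz
typeB = exp(-((f - 70) / 8).^2);          % signal type with a peak near 70 kHz
spectra = [repmat(typeA, 5, 1); repmat(typeB, 5, 1)] + 0.05 * randn(10, 64);

% Edge length = 1 - correlation, so similar spectra are joined by short edges.
R = corrcoef(spectra');                   % 10 x 10 correlation between events
edgeLen = 1 - R;

% Prune long edges so that only highly similar nodes remain connected.
maxEdge = 0.5;                            % assumed pruning threshold
adjacency = edgeLen < maxEdge & ~eye(10); % logical adjacency, no self-loops

% Rows 1-5 and rows 6-10 each form a densely connected group.
disp(adjacency);
```

In this toy network, the two groups of nodes that a clustering algorithm would recover are visible directly as two blocks in the adjacency matrix.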

Considerations

Input files are expected in TPWS format. TPWS files are easily generated by running a detector such as the SPICE Detector Remora on your audio data.
This tool aims to identify the most common signal types in the input dataset. It is not appropriate for finding unique or rare signal types.

:warning: This tool is NOT a complete classifier. It outputs a subset of representative examples of each recurring signal type. After identifying the classes of signals in your dataset, you will need to train a classifier to recognize those classes and then run that classifier across the entire dataset.

How To Use It

Step 1: Add Cluster Tool Remora to Triton

To download or clone the Triton repository with the Cluster Tool Remora, see the quick setup section, then follow the instructions for adding the Remora to Triton.

Step 2: Cluster Bins

This routine steps through the dataset in short sequential time intervals (usually a few minutes), summarizing the features of the distinct detection types within each interval.

In the Control Window, use the Remoras pull-down menu, and select Cluster Tool > Cluster Bins. This will bring up the Cluster Bins interface, in which bin-level clustering preferences can be configured. Hover your cursor over the field names to reveal tooltips with detailed information on the various parameters.

Preferred settings can be saved or loaded using the Save/Load Settings dropdown menu in the upper left corner of the interface. Settings can be loaded from either Matlab text files (.m format) or Matlab data files (.mat format). Settings saved through the dropdown menu will be stored in .mat format.

Output

This routine outputs one Matlab file per input TPWS file containing a binData structure in which each row represents the signal type(s) found in one time interval as well as other metadata associated with that interval. Variables contained in the binData structure include:

clickSubset = Indices of the events in the time interval that were used for clustering.

clickClusterIds = Indices of detections assigned to each cluster. This index is relative to clickSubset.

nIsolated = Indices of events isolated from clusters.

sumSpec = Mean spectrum of each cluster identified within the interval.

nSpec = Number of events associated with each mean spectrum.

percSpec = Percentage of clustered events assigned to each cluster. Assuming any subset is randomized, this percentage can be used to estimate the fraction of events in the time interval associated with each signal type.
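As a small worked example of that estimate, the sketch below uses invented counts for a single interval. It assumes the percentage is expressed as a fraction between 0 and 1; check your own output files to confirm how percSpec is scaled.

```matlab
% Hypothetical values for one time interval (not taken from real data).
nSpec   = [120 60 20];      % events assigned to each of three clusters
nClicks = 1000;             % total events detected in the interval

% Share of clustered events per cluster (assumed here to be a 0-1 fraction).
percSpec = nSpec / sum(nSpec);          % approximately [0.6 0.3 0.1]

% If the clustered subset was a random sample of the interval,
% scale the shares up to the full interval.
estEventsPerType = percSpec * nClicks;  % approximately [600 300 100]
disp(estEventsPerType);
```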

nClicks = Number of events in each interval.

tInt = Start and end times of each interval stored as Matlab datenumbers.

dtt = Inter-detection interval for each cluster.

cRate = Detection rate distribution for each cluster.

clusteredTF = Flag indicating whether or not the data in the interval were clustered (No = 0, Yes = 1). Data are not clustered if the number of events in the interval is less than the minimum number of events per cluster.

clickTimes = Individual times of events in each cluster.

envDur = Distribution of timeseries envelope durations for each cluster.

envMean = Mean envelope shape for each cluster.

Additional variables in the output files include:

p = Parameters used at runtime.

TPWSfilename = Path and filename of the associated TPWS file.

f = Frequency vector associated with the output spectra.
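The sketch below shows one way to work with a binData structure once an output file has been loaded. To stay self-contained it builds a mock structure array with a few of the fields described above; all values are invented, and a real file would normally be read with `load` first.

```matlab
% Mock binData with a few of the documented fields; values are invented.
binData = struct( ...
    'tInt',        {[738000.000 738000.003], [738000.003 738000.006]}, ...
    'nClicks',     {250, 3}, ...
    'clusteredTF', {1, 0});

% Keep only the intervals that were actually clustered.
wasClustered  = [binData.clusteredTF] == 1;
clusteredBins = binData(wasClustered);

% Convert the datenum interval bounds to readable timestamps.
for k = 1:numel(clusteredBins)
    t = clusteredBins(k).tInt;
    fprintf('%s to %s: %d events\n', datestr(t(1)), datestr(t(2)), ...
        clusteredBins(k).nClicks);
end
```

Filtering on clusteredTF first avoids indexing into cluster-level fields for intervals that had too few events to be clustered.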

Command Line Use

The bin clustering step can also be run in the Matlab command line as

ct_cluster_bins(<settings_file_path>)

Step 3: Composite Clusters

This routine operates on the output of Cluster Bins, loading all binned data from the input folder and clustering on average features and feature distributions to identify the main classes of signals in the dataset.

In the Control Window, use the Remoras pull-down menu and select Cluster Tool > Composite Clusters. This will bring up the Composite Clusters interface, in which cross-bin clustering preferences can be configured. Hover your cursor over the field names to reveal tooltips with detailed information on the various parameters.

Preferred settings can be saved or loaded using the Save/Load Settings dropdown menu in the upper left corner of the interface. Settings can be loaded from either Matlab text files (.m format) or Matlab data files (.mat format). Settings saved through the dropdown menu will be stored in .mat format.

Output

Files:

If the "save output" checkbox is checked, this routine will save one *_types_all.mat file with summary information for all clusters formed, as well as individual *_type<N>.mat files containing summary information for each cluster.

Individual cluster output files contain the following variables:

s = composite clusters settings.

p = cluster bins settings (transferred from input files).

inFileList = A list of the bin files that were used as input for Composite Clusters.

TPWSList = A list of the TPWS files associated with the input bin files (this is included to facilitate recovering the individual detections associated with the composite cluster).

thisType = A structure containing features and metadata describing the cluster. Fields include:

tIntMat = Time of each bin in cluster.

clickTimes = Times of all individual detections contributing to the cluster.

fileNumExpand = Time of each bin in cluster.

Tfinal{1} = Mean spectrum of cluster.

Tfinal{2} = Mean ICI distribution of cluster.

Tfinal{3} = First difference of the mean spectrum.

Tfinal{4} = Modal ICI values of all bins in cluster.

Tfinal{5} = Mean spectra for each bin in cluster.

Tfinal{6} = Index of bin file that each bin came from (relative to inFileList).

Tfinal{7} = Time of each bin in cluster.

Tfinal{8} = Primary index of bins in this cluster relative to the input set.

Tfinal{9} = Subindex of bins in this cluster (useful for cases where multiple mean spectra per bin were found).
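As an example of using these fields, the sketch below counts how many bins in a cluster came from each input bin file, using the file index described for Tfinal{6}. The structure is mocked with invented values and hypothetical file names; in practice thisType and inFileList would come from a loaded *_type<N>.mat file.

```matlab
% Mock inputs; file names and indices are invented for illustration.
inFileList = {'binsA.mat', 'binsB.mat', 'binsC.mat'};
thisType.Tfinal = cell(1, 9);
thisType.Tfinal{6} = [1 1 2 2 2 3];   % bin file index of each bin in the cluster

% Count the bins contributed by each input file.
fileIdx = thisType.Tfinal{6}(:);
binsPerFile = accumarray(fileIdx, 1, [numel(inFileList) 1]);

for k = 1:numel(inFileList)
    fprintf('%s: %d bins\n', inFileList{k}, binsPerFile(k));
end
```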

Command Line Use

The composite clustering step can also be run in the Matlab command line as

[exitCode,ccOutput] = ct_composite_clusters(<settings_file_path>)

where

exitCode = Exit status of the routine: 1 (success) or 0 (failure).

ccOutput = A structure with the variables listed above as well as

outputDataFile = Name and path of saved data file.

partitions = A cell array that contains bin indices for each partition iteration, if multiple iterations were run.
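A minimal sketch of a scripted call that checks the exit code is shown below. The settings file path is hypothetical, and the sketch assumes outputDataFile is stored as a character vector. The `exist` guard is only there so the snippet fails gracefully when the Remora is not on the Matlab path.

```matlab
% Hypothetical settings file path; replace with your own.
settingsFile = fullfile('settings', 'my_composite_cluster_settings.m');

% Only call the routine if the Cluster Tool Remora is on the Matlab path.
if exist('ct_composite_clusters', 'file') == 2
    [exitCode, ccOutput] = ct_composite_clusters(settingsFile);
    if exitCode == 1
        % Assumes outputDataFile is a character vector.
        fprintf('Saved output to %s\n', ccOutput.outputDataFile);
    else
        warning('Composite clustering did not complete successfully.');
    end
else
    warning('ct_composite_clusters was not found on the Matlab path.');
end
```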

Step 4: Post-Clustering Options

Several options are available after Composite Clusters has run.

Save labeled bin times to CSV

Exports a text file with bin times and associated cluster numbers.

Save labels for TPWS

Saves *_ID1.mat files to accompany the TPWS files. This is useful for sanity checking the identified clusters in context. Remember, however, that only a subset of detections will be labeled, because this is not a complete classification algorithm.

Save individual cluster files for classifier training

This tool will create one subfolder per cluster within the output folder and fill each with representative examples of the associated cluster. You can rearrange files, merge folders, rename folders, or add examples from other clustering iterations and datasets to the folders prior to training a classifier using the NeuralNet Remora.

Remove types and re-cluster

Lets you check the boxes of clusters that you want to remove from the set, then re-run Composite Clusters without those bins. This is useful for removing noise or highly unusual signals that might blur the distinctions between more similar signal types.