Module structure and data storage retrieval

See also our Orientation to CellProfiler code page

Introduction

We have provided module templates that illustrate the basic structure that is described below: imagetemplate.py and measurementtemplate.py.

CellProfiler splits internal data into two broad categories:

Pipeline data: Data that does not change from run to run, i.e., the modules in your pipeline and their settings.
Workspace data: Data that is produced and saved during the course of your run, e.g., images, objects and measurements.

CellProfiler's modules store their configuration in the form of settings, which are attributes of the module. You can access them from your code using self.[my_setting]. CellProfiler's modules never store workspace data in module attributes - they use the workspace or image set list that is passed to their methods to store workspace data.

pipeline

Pipeline utility functions
Pipeline modification functions
Running a pipeline

Module

Module settings
- upgrade_settings and variable revision number
Modules and measurements
Modules and pipeline execution
- Display during execution

settings

Setting Types

workspace
objects

Inside cellprofiler.objects

cpimage

image
ImageSet
- ImageSetProvider
ImageSetList

measurements

Hierarchy of feature names
HDF5 Measurement and Workspace Format

pipeline

cellprofiler.pipeline is the Python class CellProfiler uses to save the list of modules to run. You might need the pipeline to look at modules that run before your module, for instance, to see what measurements those modules produce. Very few modules need to look into the pipeline; none modify the pipeline. The pipeline is also used to load and save pipeline files to and from disk and to run your pipeline file.

Pipeline utility functions

The following is a list of pipeline functions that can be called from test code or from your module:

File handling
- load: Loads a pipeline file into the pipeline
- save: Saves the pipeline to a file
- save_measurements: Saves measurements to a .mat file (and embeds the pipeline in the measurement file)
Data access
- modules: Returns a list of the modules in your pipeline
- get_measurement_columns: Returns a list of the measurements your pipeline makes
- test_valid: Throws an exception if any pipeline modules are in invalid states (for instance, if the user entered an invalid file name in the LoadData module).

Pipeline modification functions

The pipeline is the center of CellProfiler's UI. Some parts of the UI plug into the pipeline to find out about changes and events. Others make changes to the pipeline. The pipeline itself keeps track of changes in a way that lets you undo the changes.

add_module: Adds a module to the pipeline. The module has a one-based index: module_num. The pipeline uses this index to determine where to insert a module, so the caller should set the module's module_num to the insert point before calling add_module.
remove_module: Removes a module according to its module_num. For instance, pipeline.remove_module(1) removes the first module in your pipeline.
clear: Removes all modules.
edit_module: Registers a change in the module's settings (the actual edit is done by the caller and edit_module is called afterwards).
move_module: Moves a module up or down one position.

You can undo edits to the pipeline. You can also group a number of edits together to form a single undo action that will undo the whole group of edits (for instance, a user might delete several modules with one keystroke - the user interface groups all of the remove_module edits into one undo action).

undo: Undoes the most recent undoable action
start_undoable_action: Marks the start of a group of edits to be bundled into a single action
stop_undoable_action: Marks the end of the action
undo_action: Returns an informative string that describes the current action that can be undone
has_undo: True if the pipeline has something to undo
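
The grouping behavior described above can be illustrated with a toy undo log. This is only a sketch of the idea, not CellProfiler's implementation: the UndoLog class, its record method and the remove helper are hypothetical, and only the names start_undoable_action, stop_undoable_action, undo and has_undo are borrowed from the list above.

```python
class UndoLog:
    """Toy undo log: each entry is (description, list of undo callables)."""

    def __init__(self):
        self._actions = []       # completed undoable actions
        self._open_group = None  # edits collected since start_undoable_action

    def start_undoable_action(self):
        self._open_group = []

    def record(self, undo_fn):
        """Register the inverse of an edit that was just made (hypothetical helper)."""
        if self._open_group is not None:
            self._open_group.append(undo_fn)
        else:
            self._actions.append(("edit", [undo_fn]))

    def stop_undoable_action(self, description="group of edits"):
        group, self._open_group = self._open_group, None
        if group:
            self._actions.append((description, group))

    def has_undo(self):
        return len(self._actions) > 0

    def undo(self):
        # Undo the edits of the last action in reverse order
        _, undo_fns = self._actions.pop()
        for undo_fn in reversed(undo_fns):
            undo_fn()


modules = ["LoadData", "ColorToGray", "SaveImages"]
log = UndoLog()


def remove(index):
    """Remove a module (0-based here) and remember how to put it back."""
    name = modules.pop(index)
    log.record(lambda: modules.insert(index, name))


log.start_undoable_action()
remove(2)
remove(1)
log.stop_undoable_action("Delete two modules")
after_delete = list(modules)
log.undo()  # one call restores both modules
```

Because both removals were recorded between start_undoable_action and stop_undoable_action, a single undo restores the original module list.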

You can receive notifications when something happens to your pipeline. To do this, write a callback function like this:

def callback(pipeline, event):
    ...

pipeline.add_listener(callback)

At this point, the callback function will receive notifications when the pipeline is edited and when exceptions happen when the pipeline is run.
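
As a rough picture of what happens on the other side of add_listener, here is a minimal sketch of the listener pattern. ToyPipeline, notify_listeners and the tuple used as an event are assumptions made for illustration; the real pipeline and its event objects are richer than this.

```python
class ToyPipeline:
    """Hypothetical stand-in for the pipeline's listener bookkeeping."""

    def __init__(self):
        self._listeners = []
        self._modules = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def notify_listeners(self, event):
        # Every registered callback receives the pipeline and the event
        for callback in self._listeners:
            callback(self, event)

    def add_module(self, name):
        self._modules.append(name)
        self.notify_listeners(("module_added", name))


received = []


def callback(pipeline, event):
    received.append(event)


pipeline = ToyPipeline()
pipeline.add_listener(callback)
pipeline.add_module("ColorToGray")
```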

Running a pipeline

CellProfiler runs the pipeline by calling methods on the pipeline. A run is a nested hierarchy.

On the outside is the run itself.
The run is composed of groups, and each group is composed of image sets.
Each image set is a collection of images that is passed from module to module as the pipeline is run.
Groups are collections of the image sets whose metadata matches the grouping criteria (for instance, the image sets with the same plate metadata value).
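
To make the grouping idea concrete, the sketch below groups image sets by a metadata key. The dictionaries, the group_image_sets function and the use of 1-based image set numbers are assumptions for illustration, not CellProfiler's data structures.

```python
from collections import defaultdict

# Hypothetical metadata for three image sets
image_sets = [
    {"Plate": "P1", "Well": "A01"},
    {"Plate": "P2", "Well": "A01"},
    {"Plate": "P1", "Well": "A02"},
]


def group_image_sets(image_sets, keys):
    """Map each tuple of grouping values to the image set numbers that share it."""
    groups = defaultdict(list)
    for number, metadata in enumerate(image_sets, start=1):
        groups[tuple(metadata[key] for key in keys)].append(number)
    return dict(groups)


groupings = group_image_sets(image_sets, ["Plate"])
```

Here the two image sets from plate P1 end up in one group and the image set from plate P2 in another.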

Below are the methods called while running a pipeline; the calling order corresponds to the hierarchy order:

run: Runs the pipeline from start to finish or over a subset of image sets or groups
run_with_yield: A version of run that yields after running each module (to be UI-friendly)
- prepare_run: Build the run's image set list
- get_groupings: Find the groups and grouping criteria from within the image set list
  - prepare_group: Build the image sets from the grouping criteria. Modules that aggregate over groups (for instance, CorrectIlluminationCalculate) can initialize a group during this stage. For each image set in the group, the pipeline then calls each module's run and display methods, passing a workspace that holds the data for that image set.
  - post_group: Perform any necessary post-group processing. Modules that aggregate can write their results during this stage. For instance, SaveImages writes out aggregate images during post_group.
- post_run: Tell each module to perform post-run processing (for instance, ExportToSpreadsheet writes its spreadsheets in post_run)
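
The calling order can be sketched with a stand-in module that only records the calls it receives. RecordingModule and run_pipeline are hypothetical, and the signatures are simplified: the real methods receive a workspace or image set list rather than bare image set numbers.

```python
class RecordingModule:
    """Hypothetical stand-in for a module: it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def prepare_run(self):
        self.calls.append("prepare_run")

    def get_groupings(self):
        self.calls.append("get_groupings")

    def prepare_group(self, group):
        self.calls.append("prepare_group")

    def run(self, image_number):
        self.calls.append("run %d" % image_number)

    def display(self, image_number):
        self.calls.append("display %d" % image_number)

    def post_group(self, group):
        self.calls.append("post_group")

    def post_run(self):
        self.calls.append("post_run")


def run_pipeline(module, groups):
    """Drive one module through a run; groups is a list of lists of image set numbers."""
    module.prepare_run()
    module.get_groupings()
    for group in groups:
        module.prepare_group(group)
        for image_number in group:
            module.run(image_number)
            module.display(image_number)
        module.post_group(group)
    module.post_run()


module = RecordingModule()
run_pipeline(module, [[1, 3], [2]])
```

After the call, module.calls lists prepare_run and get_groupings once, then a prepare_group / run / display / post_group sequence per group, and finally post_run.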

Module

cellprofiler_core.module.Module is the Python class that represents a CellProfiler module. Modules have settings. These settings hold the parameters that control the module's operation. They are loaded and saved in the pipeline file and are displayed in the UI.

A module has methods that do things as the pipeline is run. The module also has methods that tell pipeline users about what the module does. If you write a module, you'll have to fill in some or all of these methods in order to get your module to communicate with other modules and the user interface.

Module settings

A module tells CellProfiler about its settings using the following methods:

create_settings: CellProfiler calls create_settings when it makes a new instance of a module. This is the place to create the settings that you want a user to see in the new module. Look at cellprofiler.modules.colortogray for a simple example of how to create settings.
settings: CellProfiler calls settings whenever it wants to find out which settings should be saved to the pipeline. You should return a list of the settings that you want saved.
visible_settings: CellProfiler calls visible_settings whenever it wants to find out which settings to display in its user interface. You can omit visible_settings - if you do, CellProfiler will use settings instead. You can implement visible_settings if you want to show different settings in different circumstances. For instance, cellprofiler.modules.colortogray can either combine the color channels to form one grayscale image or it can split each channel into a separate image. ColorToGray uses visible_settings to either show the split options or the combine options; users only see the settings they have to set.
validate_module: CellProfiler calls validate_module whenever it wants to determine whether the module's settings are valid. You might have some combinations of settings that won't work in your pipeline. You can implement validate_module and raise the ValidationError exception if the settings are wrong. See the LoadData module for an example.
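
The difference between settings and visible_settings can be sketched without CellProfiler installed. ToySetting and ToyColorToGray are hypothetical stand-ins, and the setting names are invented for illustration; the point is only that settings returns everything that is saved while visible_settings depends on the current values.

```python
class ToySetting:
    """Hypothetical stand-in for a setting: a prompt text plus a value."""

    def __init__(self, text, value):
        self.text = text
        self.value = value


class ToyColorToGray:
    def __init__(self):
        self.create_settings()

    def create_settings(self):
        self.mode = ToySetting("Conversion method", "Combine")
        self.red_weight = ToySetting("Relative weight of the red channel", 1.0)
        self.red_name = ToySetting("Name the red output image", "OrigRed")

    def settings(self):
        # Everything is saved to the pipeline, whatever the current mode is
        return [self.mode, self.red_weight, self.red_name]

    def visible_settings(self):
        # Only show the options that apply to the chosen mode
        if self.mode.value == "Combine":
            return [self.mode, self.red_weight]
        return [self.mode, self.red_name]


module = ToyColorToGray()
shown_when_combining = [s.text for s in module.visible_settings()]
module.mode.value = "Split"
shown_when_splitting = [s.text for s in module.visible_settings()]
```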

Modules can handle some complex setting situations as a pipeline is loaded. A module can make itself compatible with older versions of itself by implementing upgrade_settings (see the EnhanceEdges module for an example). A module can adjust its settings in order to handle different numbers of inputs and outputs using prepare_settings (see the Morph module for an example). A module can call help_settings when the help text for the module needs to be ordered for display differently than that given by settings.

Being a good citizen - upgrade_settings and variable revision number

Modules change over time. Developers add settings and this changes how the modules are loaded and saved in a pipeline. You can be a good citizen and make a module capable of loading settings saved with older versions using variable_revision_number and upgrade_settings. Each module has a variable_revision_number and this number is saved in the pipeline. The number identifies the format in which the settings were saved. If you change settings, you should increase the variable_revision_number in your module and you should change or override upgrade_settings to account for the changes. For instance, here are the settings for a hypothetical module:

variable_revision_number = 1
def create_settings(self):
    self.wants_something = cps.Binary('Do you want something?')
def settings(self):
    return [ self.wants_something ]

Perhaps we decide in a later version that we want a choice of This, That or The other instead of a simple yes/no. The code might look like this:

variable_revision_number = 2
def create_settings(self):
    self.what_do_i_want = cps.Choice('What do you want?', ['This', 'That', 'The other'])
def settings(self):
    return [self.what_do_i_want]

Now what happens when we load the old version? Binary is saved as either "Yes" or "No", and neither is a valid value for the new Choice. Your module will show an error and your pipeline will be broken. But say that the owner of the module knows that version 1's "No" is version 2's "This" and version 1's "Yes" is version 2's "That". We can use upgrade_settings to fix it.

def upgrade_settings(self, setting_values, variable_revision_number, module_name, from_matlab):
    if variable_revision_number == 1:
        # The first slot in the setting values was "wants_something"
        # and has changed, so we have to rewrite the settings.
        if setting_values[0] == cps.NO:
            setting_values = ["This"]
        else:
            setting_values = ["That"]
        variable_revision_number = 2
    return setting_values, variable_revision_number, from_matlab

The module will now load the old version, read the old setting value and translate to the new format.
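
One way to convince yourself that an upgrade works is to exercise the translation on its own. The sketch below is a simplified, standalone copy of the logic (no module class, no module_name or from_matlab arguments) and assumes, as described above, that a Binary setting is saved as "Yes" or "No".

```python
NO = "No"  # assumed saved form of an unchecked Binary setting


def upgrade_settings(setting_values, variable_revision_number):
    """Simplified standalone version of the version 1 -> 2 translation."""
    if variable_revision_number == 1:
        setting_values = ["This" if setting_values[0] == NO else "That"]
        variable_revision_number = 2
    return setting_values, variable_revision_number


old_no = upgrade_settings(["No"], 1)
old_yes = upgrade_settings(["Yes"], 1)
current = upgrade_settings(["The other"], 2)  # already current: left untouched
```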

Modules and measurements

Modules can make measurements when they process image sets and groups. Other modules want to know about these measurements; for instance, ExportToDatabase wants to make tables based on these measurements before the run starts. Every module that makes measurements implements the following methods:

get_measurement_columns: This returns a list that has the measurement's name, the object being measured and the data type of the measurement.
get_category: Given an object, this returns the measurement categories that the module makes on that object. For instance, IdentifyPrimaryObjects makes Location_CenterX measurements and these are in the "Location" category.
get_measurement: Given an object and a category, this method returns the feature name. For example, the feature name of the measurement Location_CenterX is 'CenterX'.

A module can break its measurement names into category and measurement. It can specialize them further by implementing get_measurement_objects, get_measurement_images and/or get_measurement_scales. For example, the MeasureTexture module lets you measure an object's texture at different scales and in different images. MeasureTexture's measurements show up in an easy-to-navigate hierarchy in the ClassifyObjects module because it implements get_measurement_images and get_measurement_scales.
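
The split of a measurement name into category and feature can be sketched as follows. The column tuples, the data type strings and the helper names (split_feature_name, categories_for, features_for) are illustrative assumptions, not the signatures of the module methods.

```python
def split_feature_name(feature_name):
    """Split 'Location_CenterX' into ('Location', 'CenterX')."""
    category, _, rest = feature_name.partition("_")
    return category, rest


# Hypothetical measurement columns: (object name, feature name, data type)
columns = [
    ("Nuclei", "Location_CenterX", "float"),
    ("Nuclei", "Location_CenterY", "float"),
    ("Image", "Count_Nuclei", "integer"),
]


def categories_for(columns, object_name):
    """Categories of the measurements made on one object."""
    return sorted({split_feature_name(feature)[0]
                   for obj, feature, _ in columns if obj == object_name})


def features_for(columns, object_name, category):
    """Feature names within one category for one object."""
    return [split_feature_name(feature)[1]
            for obj, feature, _ in columns
            if obj == object_name and split_feature_name(feature)[0] == category]
```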

Modules and pipeline execution

You write a module in order to have it do something when it is executed during the course of a pipeline run. Each module implements some of the pipeline execution methods; CellProfiler calls these as it executes. CellProfiler calls pipeline execution methods with an image set list (prepare_run) or with a workspace (all other methods). Your module can get the results of prior modules from the image set list or workspace and can add its results to the image set list or workspace.

The following methods are called during the course of a run:

prepare_run: Called at the start of a run to find out the image sets in the run
get_groupings: Called after prepare_run to find out how image sets are grouped
prepare_group: Called before running each group
run: Called once per image set. This is where most modules do their work.
display: Called if your module produces a display
post_group: Called just after running each group
post_run: Called at the conclusion of a run

Display during execution

CellProfiler uses the matplotlib library for display. Matplotlib has plotting graphics patterned after those in the MATLAB language. You have three display choices in your module:

No display: Your module doesn't have a display window and produces no feedback.
Interactive display: Your module has an interactive user interface which lets the user provide input during the course of a run. IdentifyObjectsManually is an example of this sort of module.
Informational display: Your module displays information or images after analysis.

Informational displays have some advantages. The user interface is live during calculation because the "run" method is executed in a worker thread while the user interface runs in the main thread. Informational displays are slightly harder to code, but can be accomplished with a little planning. First, you have to implement the is_interactive method as follows:

class MyModule(cellprofiler_core.module.Module):
    ...
    def is_interactive(self):
        return False

This tells CellProfiler that your module's run method can be executed in the worker thread. Next, you can optionally store intermediate results in the workspace during run. Many modules (for example, MeasureImageQuality) collect a table of results for display. These are stored in the workspace's display_data. Here's an example:

def run(self, workspace):
    # Initialize the statistics with a header
    workspace.display_data.statistics = [("Measurement", "Value")]
    ...
    workspace.display_data.labels = labels
    for measurement_name, value in ...:
        workspace.display_data.statistics.append((measurement_name, value))

You implement the display separately. CellProfiler makes it easy to create a montage of images, tables and other displays in a window that's reserved for your module (see the documentation for Figure for a full list). CellProfiler lays out your display in a grid where each grid cell is a subplot. You choose which grid cell displays which plot using x and y coordinates. Here's an example with an image and a table laid out side-by-side:

def display(self, workspace):
    f = workspace.create_or_find_figure(subplots = (1,2))
    f.subplot_imshow_labels(0, 0, workspace.display_data.labels)
    f.subplot_table(0, 1, workspace.display_data.statistics)

settings

Settings store your module's parameters in the pipeline file. They tell the user interface how the parameters should be displayed and edited. They give names to your pipeline's data and link one module's inputs to another module's outputs. They complain when a user enters invalid data.

You can see a full list of settings in settings.py. All settings have text that is displayed to the left of the setting and HTML documentation that appears in a window when the user presses the help button to the right of the setting. Almost all settings have a value which holds the parameter that's stored in the pipeline and made available inside your module.

Settings are initialized inside your module's create_settings method. Here's a simple example:

import cellprofiler_core.setting as cps
...
class MyModule(Module):
    def create_settings(self):
        self.my_parameter = cps.Text(
            "Enter a value:", "Default",
            doc = """This setting controls the ....""")

    def run(self, workspace):
        my_value = self.my_parameter.value

The first parameter in cps.Text is the prompt text. The second is the initial value for the setting. "doc" is the documentation for the setting.

Setting Types

CellProfiler has settings that handle a number of different kinds of inputs such as numeric values, ranges, choices and yes/no questions. It also has some settings that hook into the CellProfiler data: the measurements, images and objects. Finally, it has settings whose only purpose is display or UI interaction. Here's a list:

Simple Settings

Simple settings are designed to capture values similar to Python's built-in Boolean, string and numeric types. You can use these in comparisons in your code as if they were built-ins, but you have to use the .value attribute during numeric operations. Here's an example:

...
def create_settings(self):
    self.wants_automatic_threshold = cps.Binary("Calculate threshold automatically?", True)
    self.manual_threshold = cps.Float("Enter threshold:", .5)
...
def run(self, workspace):
    if self.wants_automatic_threshold:
        t = self.calculate_threshold(workspace)
        adjustment = self.calculate_adjustment(workspace)
    else:
        t = self.manual_threshold
        adjustment = self.manual_threshold.value * 1.5
    mask = image_pixels > t
    adjusted_image = image_pixels * adjustment

The different types of simple settings are:

Text: Value is a string, displayed in an edit box.
Integer: The Integer's value is an integer, displayed in an edit box. You can set the minimum and maximum acceptable values using the minval and maxval keywords.
Float: Value is a decimal or floating point number. Otherwise, it's similar to the Integer setting.
IntegerRange, FloatRange: These settings hold a pair of integers or floats that represent the lower and upper bounds of a range.
Binary: Represents a yes/no or True / False value, displayed as a checkbox.
Choice: Represents one of a set of possible choices, displayed as a drop-down choice box.

Buttons and Dividers

You might want to do something if the user presses a button, e.g., add another image or read a file.

DoSomething: Displays a button that does something when the user presses it. Below is example code that displays a message box when the button is pressed:

...
def create_settings(self):
    self.my_value = cps.Text("Enter something", "Default")
    self.hello_world = cps.DoSomething("Press me, please", self.hello_world_pressed)
    ...
def settings(self):
    # Leave out hello_world... there's no value to save or load
    return [self.my_value]
... 
def visible_settings(self):
    # Include it here though because we want it to be displayed
    return [self.my_value, self.hello_world]
...
def hello_world_pressed(self):
    import wx
    wx.MessageBox("You entered " + self.my_value.value, "Hello, world")

The second argument to DoSomething is the function to run when the button is pressed.

Divider: Displays a horizontal line between the previous and following settings.

Additional Settings

CellProfiler has settings that give users a sophisticated and useful interface intended for specific situations. These include file choosers, color map choosers and regular expression editors:

FilenameText: Displays the file name dialog when the user presses the browse button next to the edit box. At its simplest, the button just stores the file name in the edit box, but you can get it to save the path and do other, more complicated things. Look at LoadData's create_settings method for a pretty complex example of how you can get the FilenameText setting to interact with other settings in your module.

The FilenameText constructor has some optional parameters that can help you:

get_directory_fn: Supply a function here that returns the initial directory for the file dialog
set_directory_fn: FilenameText calls this function after running the file dialog. It passes the function the directory that contains the filename that the user picked.
browse_msg: A message that's displayed in the caption of the file dialog.
exts: A list of acceptable extensions, for example, [("Text files (*.txt)","*.txt"),("All files (*.*)","*.*")]
RegexpText: Displays a regular expression editor when the user presses the browse button. The editor tells the user if the regular expression is valid and it displays the fields that would be captured by groupings if the regexp was applied to the example text.

You can supply the example text if you pass RegexpText an optional get_example_fn parameter. This should be a function that returns the example text when called. LoadImages' create_settings method has an example of this.

Colormap: Lets the user pick one of Matplotlib's color maps. The ConvertObjectsToImage module is an example of how the Colormap setting might be used.

Settings that interact with your pipeline

Modules get their data from images, objects and/or measurements. CellProfiler has settings that supply names for images and objects and that let users choose from the names supplied by prior modules.

Provider settings: These supply a name for an image or object that the module creates. The most widely used providers are the ImageNameProvider and ObjectNameProvider which provide names for the images and objects created by a module. These names will appear in the choice boxes of the appropriate subscribers. Here's an example of a module that supplies an image of all zeros to subsequent modules:

import numpy as np
import cellprofiler_core.image as cpi
import cellprofiler_core.setting as cps

class ZeroImage(Module):
    ...
    def create_settings(self):
        self.image_name = cps.ImageNameProvider("Image name:", "Zeros")
    ...
    def run(self, workspace):
        # Wrap the array in an Image before adding it to the image set
        workspace.image_set.add(self.image_name.value, cpi.Image(np.zeros((100, 100))))

Subscriber settings: These look for matching provider settings in prior modules. The most widely used subscribers are the ImageNameSubscriber and ObjectNameSubscriber which subscribe to names of images and objects created by a prior module. Here's an example of how to retrieve an image inside your module:

import cellprofiler_core.setting as cps
class FindImageMaximum(Module):
    ...
    def create_settings(self):
        self.image_name = cps.ImageNameSubscriber("Image name:")
    ...
    def run(self, workspace):
        my_image = workspace.image_set.get_image(self.image_name.value)
        my_pixels = my_image.pixel_data
        maximum = my_pixels.max()
        ...
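
The way provider names reach subscribers can be pictured with plain data. This is a sketch under assumptions: the list of dictionaries and the available_names function are hypothetical, and the only behavior modeled is that a subscriber sees the names provided by modules that come earlier in the pipeline.

```python
# Hypothetical pipeline: each module lists the image names it provides
pipeline = [
    {"module": "LoadData", "provides": ["OrigColor"]},
    {"module": "ColorToGray", "provides": ["OrigGray"]},
    {"module": "FindImageMaximum", "provides": []},
]


def available_names(pipeline, module_index):
    """Names offered to a subscriber in the module at module_index (0-based here)."""
    names = []
    for module in pipeline[:module_index]:
        names.extend(module["provides"])
    return names


choices_for_find_maximum = available_names(pipeline, 2)
```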

Measurement setting: Supplies a measurement name to your module. You might want to feed the measurement results of one module into another; CalculateMath is an obvious example of a module that does this, but you might have more subtle reasons for using measurements such as adjusting a calculation based on some arbitrary measurement made on an object.

Measurements are made on images and objects. You have to supply an object name or the keyword, "Image", to the Measurement setting to tell it where to look for measurements. This is done using the object_fn argument to Measurement. Here's a comprehensive example:

import cellprofiler_core.module
import cellprofiler_core.setting as cps
import cellprofiler_core.measurement as cpmeas
...
class UseMeasurement(cellprofiler_core.module.Module):
    ...
    def create_settings(self):
        self.wants_image_measurement = cps.Binary("Use an image measurement?", True)
        # We need an object if the user doesn't want an image measurement
        self.object_name = cps.ObjectNameSubscriber("Object name:")
        #
        # Python lets you define functions wherever you want
        # and they get to use the stuff that's lying around
        # inside your function. This lets you write your code
        # right next to where it's used. That makes reading the
        # code simpler.
        #
        def object_fn():
            if self.wants_image_measurement:
                return cpmeas.IMAGE
            else:
                return self.object_name.value

        self.measurement = cps.Measurement("Measurement:", object_fn)
    ...
    def run(self, workspace):
        object_name = (cpmeas.IMAGE if self.wants_image_measurement else self.object_name.value)
        value = workspace.measurements.get_current_measurement(object_name, self.measurement.value)
        ...

workspace

The workspace is a central location for the data needed to process an image set.

A module's run method gets called with the workspace for the current image set.
A module's post_group method gets called with the workspace for the last image set in the group.
A module's post_run method gets called with the workspace for the last image set in the run.

The workspace has properties that provide access to the image set and the image set's object set. It also has properties that provide access to the run's measurements, the image set list and the pipeline:

workspace.image_set: Holds the images for the image set. Use workspace.image_set.get_image(...) to get an image by name. Use workspace.image_set.add(...) to add a new image to the image set.
workspace.object_set: Holds the objects for the image set. Use workspace.object_set.get_objects(...) to get objects by name. Use workspace.object_set.add_objects(...) to add new objects to the image set's object set.
workspace.measurements: Holds the run's measurements
workspace.image_set_list: Holds the list of image sets
workspace.pipeline: Holds the pipeline being run

Finally, the workspace controls a module's figure window.

objects

Segmentation (and other operations) group collections of pixel positions into objects. CellProfiler represents this data as a two-dimensional array of integer values. The special value, "0", marks a position as being outside of any object; all pixels with the same value, other than zero, are part of the same object. This value is used in CellProfiler to represent the object; for instance, the value is used as the one-based index into the object's measurements and the value is used to identify the object as being the parent of some other object. We refer to the array as the "labels matrix" in the code.
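
A small worked example may help with the one-based indexing. In the sketch below, plain nested lists stand in for the two-dimensional array so that it has no dependencies; object_areas is a hypothetical helper, not part of CellProfiler.

```python
# A tiny labels matrix: 0 is background, 1 and 2 are objects
labels = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
]


def object_areas(labels):
    """Return a list where entry i - 1 is the pixel count of object i."""
    object_count = max(max(row) for row in labels)
    areas = [0] * object_count
    for row in labels:
        for label in row:
            if label != 0:  # skip background
                areas[label - 1] += 1
    return areas


areas = object_areas(labels)
```

The label value minus one is the position of the object's measurement in the result, which is what "one-based index" means here.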

Each image set's workspace has an object set. This object set is a dictionary that links the name for an object to its representation. The ObjectSet and Objects classes are defined in cellprofiler.objects. You can get objects by name like this:

my_objects = workspace.object_set.get_objects("my_objects")

or you can put new ones that you make into the workspace like this:

workspace.object_set.add_objects(my_objects, "my_objects")

You might want to look at how objects are added in a fairly simple module like FilterObjects.

Inside cellprofiler.objects

You get an instance of cellprofiler.objects.Objects when you call get_objects, not the labels matrix. Most often, you'll only want the labels matrix, which is accessible through the "segmented" property. For instance, you might zero out all pixels in an image that lie outside of any labeled object like this:

my_objects = workspace.object_set.get_objects("my_objects")
pixels[my_objects.segmented == 0] = 0

"segmented" represents the final segmentation of your image; typically, there are parts of your image that should be ignored based on the segmentation, for instance, objects partially outside of the field of view. Your segmented objects and the objects to be ignored are accessible through the "unedited_segmented" property which is a labels matrix with both unfiltered and filtered objects labeled. CellProfiler uses this to allow the unfiltered objects to compete for pixels in a secondary segmentation with the filtered objects; otherwise, the unfiltered objects would extend to cover the space taken up by the filtered objects.

There is a third labels matrix, "small_removed_segmented". This is a labels matrix that has both the unfiltered objects and all filtered objects except for those that were filtered because they were too small. This labels matrix is useful for analyzing secondary images if the small objects are unfortunate artifacts of segmentation.

You can access the image that was used during segmentation through the "parent_image" property. This image may have secondary properties that are useful during downstream analysis. The image mask and cropping can be used to exclude areas in other channels from consideration during measurement.

cpimage

cpimage holds the three classes that define how images are handled in CellProfiler: Image, ImageSet and ImageSetList. These correspond to three levels of image hierarchy: a single image, a set of images that are processed together by one iteration of the pipeline and the list of all image sets to be run by the pipeline. A module builds the image set list during prepare_run, inserts images into image sets during prepare_run and run and processes the images during run.

Image

The main purpose of Image is to hold onto one image array. This array can either be a two-dimensional matrix representing a single channel of detection data or a three-dimensional multichannel or color image where the third dimension indexes the channel or color. The array can be composed of boolean, integer or floating point values. Images loaded from disk are normalized so that the detector's minimum value is a floating point zero and the maximum value is a floating point one. You can retrieve the array through the image's pixel_data attribute. You should treat the pixel_data as read-only; copy the pixel data before modifying it.
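
The normalization can be sketched for the simple case of unsigned integer data. This assumes the detector's range is 0 to 2**bit_depth - 1; the normalize function is hypothetical and real image loading handles more cases than this.

```python
def normalize(raw_values, bit_depth):
    """Scale unsigned integer intensities so 0 maps to 0.0 and the largest value to 1.0."""
    max_value = 2 ** bit_depth - 1
    return [value / max_value for value in raw_values]


normalized = normalize([0, 51, 255], bit_depth=8)
```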

Image holds additional attributes for the image:

mask: the mask is a boolean two-dimensional array that defines a region of interest for the image. CellProfiler imaging algorithms use the mask to determine which values (mask == True) are to be processed and which values (mask == False) are to be ignored. The has_mask property will be true if the image has a mask; if not, the region of interest is the whole image and the mask will be entirely True.
crop_mask: an image might have been cropped by having its edges trimmed away. CellProfiler saves a crop_mask that describes how this was done: the mask is True in the areas that were not cropped and False in the areas that were cropped. An algorithm might take two images and one of these images might be cropped and another might be the full, uncropped size. The algorithm can crop the full size image using the cropped image's cropping mask; you do this by calling crop_image_similarly to trim the full size image.
masking_objects: CellProfiler can define an image's region of interest as the part of the image within some set of objects. CellProfiler will save the objects with the image when it masks - masking_objects is an instance of cellprofiler.objects.Objects in this case and the image's has_masking_objects property is True. You can get the labels for the objects directly through the image's labels property; labels will be '1' within the image's region of interest and '0' outside of it if the image was not masked with objects.
parent_image: derived images (for example, a smoothed, cropped or masked image) will have a parent_image. This is the original image that CellProfiler processed to come up with the derived image. Images loaded from disk will not have a parent image (parent_image is None, has_parent_image == False). The image's file_name and path_name are the file and path of the image's primordial ancestor: the original parent loaded from disk.
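
The idea behind cropping a full-size image with another image's crop_mask can be sketched with nested lists. This is a toy version of the concept, not the implementation of crop_image_similarly: the crop_with_mask function is hypothetical and assumes the mask marks a rectangular region.

```python
def crop_with_mask(pixels, crop_mask):
    """Keep only the rows and columns in which crop_mask has a True value."""
    rows = [i for i, row in enumerate(crop_mask) if any(row)]
    cols = [j for j in range(len(crop_mask[0])) if any(row[j] for row in crop_mask)]
    return [[pixels[i][j] for j in cols] for i in rows]


full_size = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
# True where the cropped image kept its pixels, False where edges were trimmed
crop_mask = [
    [False, False, False, False],
    [False, True, True, False],
    [False, True, True, False],
]
cropped = crop_with_mask(full_size, crop_mask)
```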

Here's a typical example of how an image might be used to generate a derived image:

import numpy as np
import cellprofiler_core.image as cpi
pixel_data = img.pixel_data                           # The image intensity data (assumed to be 2d)
mask = img.mask                                       # The image's ROI, if any
std_img = np.zeros(pixel_data.shape)                  # Set don't-care pixels to 0
if np.any(mask):                                      # Make sure there is some ROI
    mean = np.mean(pixel_data[mask])                  # Figure out how many STD each pixel is
    std = np.std(pixel_data[mask])                    # from the mean. [mask] only considers
    std_img[mask] = (pixel_data[mask] - mean) / std   # pixels in the ROI
std_image = cpi.Image(std_img, parent_image = img)    # Wrap the computed array; the parent's mask becomes the new image's mask

ImageSet

CellProfiler keeps your images in an ImageSet as it executes your pipeline. You can fetch an image from the image set like this:

image_name = self.orig_image_name.value       # my module's image name subscriber
image = image_set.get_image(image_name)       # I get the image from the image set

Some algorithms need grayscale, binary or color images; for instance, a morphological skeletonization algorithm operates on binary images. You can get a grayscale image by supplying the must_be_grayscale keyword - the image set will combine channels for a color image and it will convert true and false to 1 and 0 for a binary image:

grayscale_image = image_set.get_image(image_name, must_be_grayscale = True)

Similarly, you can supply the must_be_color or must_be_binary keywords. These guarantee that an image has 3 color channels or that it is a two-dimensional boolean array; ImageSet will raise an exception if the image you retrieve is not the correct type.

You can put an image into the image set like this:

image_name = self.output_image_name.value     # my module's image name provider
image_set.add(image_name, image)              # I add an image to the image set

Each image set has a dictionary of metadata keys and values: ImageSet.keys. You can use this dictionary to find the image sets that match your metadata values. For instance, one module might load images that have "Plate" and "Well" metadata and another module might load per-plate illumination correction images that only have "Plate" metadata. You can match the illumination correction images with the corresponding regular images by matching the "Plate" metadata values (you may want to use the LoadData module instead, which lets you explicitly specify which images should be loaded for each image set).
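The matching described above can be sketched with plain dictionaries. This is only an illustration of the idea, not CellProfiler's implementation: the matches helper and the sample key dictionaries are hypothetical stand-ins for the metadata held in ImageSet.keys.

```python
def matches(keys, wanted):
    """Return True if every key/value pair in wanted also appears in keys."""
    return all(keys.get(k) == v for k, v in wanted.items())

# Hypothetical metadata for three regular image sets (Plate and Well)
image_set_keys = [
    {"Plate": "P1", "Well": "A01"},
    {"Plate": "P1", "Well": "A02"},
    {"Plate": "P2", "Well": "A01"},
]

# Hypothetical metadata for a per-plate illumination correction image (Plate only)
illum_keys = {"Plate": "P1"}

# Indexes of the image sets that should receive this illumination image
matching_indexes = [
    i for i, keys in enumerate(image_set_keys) if matches(keys, illum_keys)
]
```

Only the keys present in the smaller dictionary take part in the comparison, which is why an image tagged with "Plate" alone can match every well on that plate.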

ImageSetProvider

Behind the scenes, ImageSet uses ImageProviders. These are promises of images - for images that are loaded from disk or are calculated from aggregates, the ImageProvider doesn't actually have an image until the first time that someone asks for the image. You can use an ImageProvider to efficiently supply the same image to every image set in a group:

def prepare_group(self, pipeline, image_set_list, grouping, image_numbers):
    ...
    for image_number in image_numbers:
        image_set_index = image_number - 1 # a legacy of CellProfiler's Matlab roots is 1-based indexed image_numbers
        image_set = image_set_list.get_image_set(image_set_index)
        image_set.add_provider(my_provider)

Typically, you'll use the VanillaImageProvider if you use one at all. This is just a plain-vanilla image provider that holds an already-created image. You can look at loadimages.py and makeprojection.py for more complex examples.
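The "promise" idea behind a provider can be sketched in a few lines. The LazyProvider class below is hypothetical and is not part of CellProfiler; it only shows the pattern of deferring image creation until the first request and then reusing the result.

```python
class LazyProvider:
    """Hypothetical stand-in for an image provider: builds its image on first request."""

    def __init__(self, name, make_image):
        self.name = name
        self._make_image = make_image  # zero-argument callable that creates the image
        self._image = None             # nothing is built until someone asks
        self.calls = 0                 # how many times the image was actually built

    def provide_image(self):
        if self._image is None:
            self._image = self._make_image()
            self.calls += 1
        return self._image


# The same provider object could be handed to every image set in a group
provider = LazyProvider("IllumCorrection", lambda: [[1.0, 1.0], [1.0, 1.0]])
first = provider.provide_image()
second = provider.provide_image()
```

Because every image set shares one provider, the expensive work (loading from disk or aggregating) happens at most once per group.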

ImageSetList

The ImageSetList holds the ImageSets for a run. The ImageSetList is saved when you create a batch file and is restored when the batch file is run. CellProfiler stores your module's dictionary in the image set list; this dictionary is saved and restored as well.

The image set list operates in one of two modes: by image set index or by key. The first module in your pipeline controls the image set list's mode: if it asks for an image set by image set index, then subsequent image sets are matched by index; if it asks for an image set by key, then subsequent image sets are matched by key. You can call ImageSetList.get_image_set to either create a new image set or retrieve an existing image set by either key or image set index. If you use a key, you pass in a key/value dictionary and ImageSetList will find the image set that has the same values for your keys.

Most modules make little or no use of image set lists; typically the image set list is used behind the scenes by CellProfiler and modules use the image set that's in the workspace.

measurements

CellProfiler saves measurements made on images and objects in the Measurements structure which is accessible through the workspace. Measurements organizes measurements by object name, feature and image set. Image measurements are stored using the special object name, "Image" (cellprofiler.measurements.IMAGE is the symbolic name) and there are methods for saving and retrieving image measurements and object measurements. Each measurement is named by a measurement name; the measurement name should describe the measurement, for instance, "Intensity_MeanIntensity_GFP" is the feature name for the mean intensity measurement taken on the GFP channel.

Most modules only save measurements. You can use the add_image_measurement method to record an image-wide measurement made on the current image set. For instance:

import numpy as np
image_name = self.image_name.value
image = workspace.image_set.get_image(image_name)
standard_deviation = np.std(image.pixel_data)
measurement_name = "Statistics_StandardDeviation_" + image_name
workspace.measurements.add_image_measurement(measurement_name, standard_deviation)

You can use the add_measurement method to record object measurements for the current image set. You should store object measurements in a one-dimensional numpy array with one element per object. For instance:

import numpy as np
from cellprofiler.cpmath.cpmorphology import centers_of_labels
objects_name = self.objects_name.value
objects = workspace.object_set.get_objects(objects_name)
labels = objects.segmented
i, j = np.mgrid[0:labels.shape[0], 0:labels.shape[1]]  # pixel coordinate grids
i_center, j_center = centers_of_labels(labels)
assert isinstance(i_center, np.ndarray)            # make sure that we're storing an array
assert tuple(i_center.shape) == (np.max(labels),)  # make sure it's 1-d and has one measurement per label
measurement_name = "Location_Center_Y"
workspace.measurements.add_measurement(objects_name, measurement_name, i_center)

Some modules use measurements made by prior modules. You can get the image and object measurements for the current image set using get_current_image_measurement and get_current_measurement. For example:

objects_name = self.objects_name.value
speckles_name = self.speckles_name.value
# Get the total # of speckles in the image (a single number)
total_speckles = workspace.measurements.get_current_image_measurement("Count_"+speckles_name)
scount = "Children_" + speckles_name + "_Count"
# Get the numpy array of speckle counts for each object
speckles_per_object = workspace.measurements.get_current_measurement(objects_name, scount)
fraction_of_speckles = speckles_per_object.astype(float) / float(total_speckles)

Hierarchy of feature names

CellProfiler has a standardized nomenclature for features; this helps out in the GUI and helps organize measurements during analysis. A measurement name has several parts, some of which may not be present. By convention, the parts are separated by the underbar ("_") character.

Category: This is the general category of the measurement; it might indicate the module or general use of the measurement. Examples are "Metadata", "Intensity" and "Location"
Feature: This names the algorithm or method that was used to make the measurement. Examples are "MeanIntensity" for the "Intensity" category, "Well" for the "Metadata" category and "Center_X" for the "Location" category.
Image: Measurements are often made on the intensity information from a particular image. Examples are "GFP" for the "MeanIntensity" measurement (the mean intensity of the GFP channel) or "Actin" for the "Texture_Gabor_Actin_5" measurement.
Object: Object measurements can be made relative to other objects. For instance, the IdentifySecondaryObjects module might save the object number of a secondary object's parent (say its nucleus) in the "Parent_Nucleus" measurement. "Nucleus" is the object part of the measurement name.
Scale: The "Scale" part of the measurement can be used to capture the scale in pixels that was used when making a measurement. It can also be used as a catch-all for parameters used when making the measurement in order to distinguish two measurements made using the same algorithm, image and/or object. For instance, texture measurements are made by correlating pixels that are a certain distance away from each other; "3" and "5" in "Texture_Gabor_Actin_3" and "Texture_Gabor_Actin_5" designate two measurements of the Gabor transform, made on the Actin image at scales of 3 and 5.

Modules define the categories, feature names, objects, images and scales using the methods, "get_categories", "get_measurements", "get_measurement_images", "get_measurement_objects" and "get_measurement_scales". Modules also define the measurements they output using "get_measurement_columns". A module must report each measurement it makes using "get_measurement_columns". It should report each measurement it makes using the other methods in order to let the user select those measurements in the user interface. The measuretexture.py module is a good example of a module that defines these methods.
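The naming convention can be sketched as a small helper that joins whichever parts are present with underbars. The function make_measurement_name is hypothetical (it is not a CellProfiler API), and the order in which the optional image, object and scale parts are joined is an assumption made for this illustration.

```python
def make_measurement_name(category, feature, image=None, obj=None, scale=None):
    """Join the parts that are present with "_" (hypothetical helper)."""
    parts = [category, feature, image, obj, scale]
    return "_".join(str(p) for p in parts if p is not None)


intensity_name = make_measurement_name("Intensity", "MeanIntensity", image="GFP")
texture_name = make_measurement_name("Texture", "Gabor", image="Actin", scale=3)
location_name = make_measurement_name("Location", "Center_X")
```

Note that a feature such as "Center_X" contains an underbar itself, so splitting a full measurement name on "_" does not reliably recover its parts. That is one reason the get_categories family of methods exists to report the parts explicitly.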

HDF5 Measurement and Workspace Format

Measurements are stored in an HDF5 file. The workspace is also stored in an HDF5 file using an augmented measurements file format. HDF5 files are composed of three types of data:

Groups: conceptually similar to folders, groups are containers that can hold other groups or datasets.
Datasets: datasets are N-dimensional arrays that can hold string and numeric data.
Attributes: attributes are small taglike items of data that can be used to decorate groups or datasets.

Each group, dataset and attribute has a name that can be used to fetch it from its container. The names can be strung together to describe the path to an item, for instance, "/Measurements/2013-04-10-15-16-00/Images/Metadata_Plate" refers to the Metadata_Plate group within the Images group, etc.

All Measurements files and workspaces have a dataset named "/Version" which contains a single integer version number. The current version number is "1".

Measurements group

Measurements are stored inside the "Measurements" group. We have added a two-level hierarchy to the group to allow a single file to store results from any number of CellProfiler runs, but at present, each measurements file contains only one set of measurements. Each set of measurements is designated with a timestamp whose format is YYYY-MM-DD-HH-mm-SS where YYYY is year, MM is month, DD is day, HH is a 0-23 hour, mm is minute and SS is second. The latest measurement set is the last alphabetically so the convention is to sort them all and take the last if there's any ambiguity about which should be used.
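As a small sketch of the convention, the timestamp layout maps onto a strftime pattern, and because every field is zero-padded and ordered from most to least significant, a plain string sort is also a chronological sort. The names TIMESTAMP_FORMAT and latest_timestamp are illustrative only.

```python
from datetime import datetime

# YYYY-MM-DD-HH-mm-SS expressed as a strftime pattern (24-hour clock)
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"


def latest_timestamp(names):
    """Return the most recent timestamp group name: the last one alphabetically."""
    return sorted(names)[-1]


stamp = datetime(2013, 4, 10, 15, 16, 0).strftime(TIMESTAMP_FORMAT)
newest = latest_timestamp(
    ["2013-04-10-15-16-00", "2013-04-11-09-40-00", "2012-12-31-23-59-59"]
)
```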

Each measurement object has a corresponding group within the timestamp group. There are measurement objects for each of the "objects" created by identify and similar modules. The measurements also contain three special "objects", Experiment, Images and Relationships:

Images: one data element is recorded per feature per image. Crucial image measurements are:
- ImageNumber: the image number for each image set. Image numbers uniquely identify an image set and are used elsewhere to refer to the image set.
- GroupNumber: for grouped image sets, all image sets in the same group have the same group number
- GroupIndex: for grouped image sets, image sets in the same group are numbered by group index
- URL_image: Each input image in the image set has a URL or URL-encoded pathname. For instance, if a channel is named, "DNA", the path to its file is stored in the URL_DNA measurement.
- Series_image and Frame_image: for .flex files, .tif stacks and movies, the Frame_ measurement specifies the zero-based index of the image plane to be used in the image set. .flex files can contain multiple stacks. Each stack has a series index which is recorded by the Series_ measurement.
Experiment: one data element is recorded per experiment. Experiment contains several important data items, recorded as string measurements:
- Pipeline_Pipeline: This measurement contains the pipeline text for the pipeline that was used to produce the analysis.
- ChannelType_: The image set for an analysis is stored in the images table - one URL per channel is stored per image set. The experiment table has one ChannelType_image-name measurement per channel with an indicator of the type of data for the channel, e.g. "Grayscale", "Color" or "Object".
- Metadata_Tags: The Metadata_Tags measurement contains a JSON-encoded list of the metadata tags that uniquely identify an image set for image sets ordered by metadata.
Relationships: The Relationships group and its subgroups are used to store object-object relationships. Examples of relationships are frame-to-frame linking of objects in a time series or neighbor relationships recorded by the RelateObjects or MeasureObjectNeighbors modules.

Measurement HDF5 structure

CellProfiler uses two datasets to encode measurements, index and data. A measurement's datasets are stored in the group, "/Measurements/timestamp/object-name/feature-name".

The index dataset is an Nx3 chunked array that is used to address the range of measurement values for each image set. Each row of the array is a 3-tuple of image number, start and end of the range of values within the data dataset. For object measurements, the first element of the range corresponds to the value recorded for object # 1 and subsequent objects' values appear afterwards. A single measurement is recorded in the data dataset for image measurements and a single measurement overall is recorded for experiment measurements.

The data dataset can have a numeric or variable-length string type. Strings are stored using the utf-8 encoding.

Example: image set 1 has three objects and image set 2 has two. The AreaShape_Area measurement for the Nucleus object is recorded: image # 1's objects have areas of 25, 32, and 41 and image # 2's objects have areas of 27 and 35. The timestamp is "2013-04-11-09-40-00". The datasets for these are

Image number	Start	Stop
1	0	3
2	3	5

Index	0	1	2	3	4
data	25	32	41	27	35
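The lookup implied by the two tables above can be sketched with plain Python lists standing in for the HDF5 datasets. The function values_for_image is hypothetical, and the sketch assumes the stop value is exclusive, as the example rows (0, 3) and (3, 5) suggest.

```python
# Rows of the index dataset: (image number, start, stop)
index = [(1, 0, 3), (2, 3, 5)]
# The data dataset: all values for all image sets, concatenated
data = [25, 32, 41, 27, 35]


def values_for_image(index, data, image_number):
    """Return the measurement values recorded for one image set."""
    for number, start, stop in index:
        if number == image_number:
            return data[start:stop]  # stop is treated as exclusive (assumption)
    return []                        # no row means nothing was recorded


areas_image_1 = values_for_image(index, data, 1)
areas_image_2 = values_for_image(index, data, 2)
```

The first element of each returned slice is the value for object # 1 in that image set, the second is object # 2, and so on.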

Relationships HDF5 structure

A relationship is recorded by a module and the relationship has a name and relates one object to another. Consequently, a relationship measurement is a conceptual tuple of module, relationship name, first object name and second object name. Each measurement instance records the image number and object number of the two related objects. CellProfiler creates a nested group structure to encode the parts of the tuple. The path to the relationship dataset is "/Measurements/timestamp/Relationships/module-number/relationship-name/object1-name/object2-name". module-number is the module number of the module that created the relationship - the actual module can be found by searching the pipeline for the module with the corresponding module number. relationship-name is the name assigned to the relationship by the module, for instance, "Parent". object1-name and object2-name are the names of the objects being related.

The following datasets are stored per-relationship:

ImageNumber_First - the image number of the image set containing the first object
ObjectNumber_First - the object number of the first object
ImageNumber_Second - the image number of the image set containing the second object
ObjectNumber_Second - the object number of the second object

The datasets are one-dimensional arrays of integer values. There is no guaranteed ordering of the relationship records.
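Since the four datasets are parallel arrays, one relationship record is the set of values found at the same position in each. The sketch below uses plain lists with made-up values in place of the HDF5 datasets; related_to is a hypothetical helper that scans all records because no ordering is guaranteed.

```python
# Made-up parallel arrays: record k is the k-th element of each list
image_number_first = [1, 1, 1]
object_number_first = [1, 1, 2]
image_number_second = [1, 1, 1]
object_number_second = [1, 2, 3]


def related_to(image_number, object_number):
    """Return (image number, object number) of every second object related to the given first object."""
    records = zip(
        image_number_first,
        object_number_first,
        image_number_second,
        object_number_second,
    )
    return [
        (i2, o2)
        for i1, o1, i2, o2 in records
        if i1 == image_number and o1 == object_number
    ]


children_of_first = related_to(1, 1)
children_of_second = related_to(1, 2)
```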

FileList

The FileList is the list of active image files that are the input to the Images module. Both the file list and the collected OME-XML metadata are stored in the HDF5 file. The file list's group structure mirrors the directory structure of the collected files - in the HDF5 file, the default file list's root is in the group, "/FileList/Default" (other file lists can be accessed using optional arguments to HDF5FileList).

Groups and datasets in the FileList are tagged with metadata attributes that provide a context for their interpretation. Each group and dataset has a Class attribute which is one of:

FileListGroup - indicates that the group represents a folder in a file list hierarchy
VStringArrayIndex - indicates that a dataset contains indexes into a VStringArrayData dataset. The dataset is typically named, "index".
VStringArrayData - indicates that a dataset is a vector of string data. The dataset is typically named, "data".

Each path to an image is URL-encoded. This produces an ASCII 8-bit character string that is capable of encoding unicode file names. HDF5 group names are restricted to a subset of ASCII, so each of the directory parts undergoes a secondary encoding where characters other than "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-+.%=" are encoded as a two-digit hexadecimal number, preceded by backslash. For instance, the DOS path, "c:\Users\developer", is URL-encoded as "file:///C:/Users/developer". By convention, the URL scheme is encoded as the first group in the path and the slashes that follow the scheme are included in the second group of the path. Subsequent slashes are used as splitpoints for the grouping. The above DOS path is broken into the groups, "file", "///C:", "Users", "developer". "///C:" must be escape-encoded to "\2f\2f\2fC\3a", so the path to the file list in the HDF5 file is "/FileList/Default/file/\2f\2f\2fC\3a/Users/developer".
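The splitting and escaping steps described above can be sketched as follows. The function names url_to_groups and encode_part are hypothetical and this is not the HDF5FileList code; it simply reproduces the worked example from the rules as stated.

```python
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-+.%="
)


def encode_part(part):
    """Escape every disallowed character as a backslash plus two hex digits."""
    return "".join(c if c in ALLOWED else "\\%02x" % ord(c) for c in part)


def url_to_groups(url):
    """Split a URL into its group names, keeping the leading slashes with the second group."""
    scheme, rest = url.split(":", 1)
    stripped = rest.lstrip("/")
    slashes = rest[: len(rest) - len(stripped)]
    parts = stripped.split("/")
    parts[0] = slashes + parts[0]
    return [scheme] + parts


groups = url_to_groups("file:///C:/Users/developer")
hdf5_path = "/FileList/Default/" + "/".join(encode_part(g) for g in groups)
```

The "%" character is in the allowed set, so the percent-escapes produced by URL encoding pass through the secondary encoding unchanged.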

VStringArrays

A VStringArray is an encoding of an array of variable-length strings. Two datasets define a VStringArray - the index dataset and the data dataset. The index dataset is an Nx2 integer array that gives the begin and end indices into the data dataset of the nth string in the array. Strings are encoded using utf-8. The null string is represented as an index row whose end is zero and whose beginning is after the end (e.g. 2147483647, 0).
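Decoding a VStringArray can be sketched with a bytes object standing in for the data dataset and a list of pairs standing in for the index dataset. The function decode_vstring_array is hypothetical; the sketch assumes that the data dataset is a flat run of utf-8 bytes, that the end index is exclusive, and that a null string is best represented in Python as None.

```python
def decode_vstring_array(index, data):
    """Turn (begin, end) index rows plus a flat byte buffer into a list of strings."""
    strings = []
    for begin, end in index:
        if begin > end:
            strings.append(None)  # null string: beginning is after the end
        else:
            strings.append(bytes(data[begin:end]).decode("utf-8"))
    return strings


data = "DNA.tifGFP.tif".encode("utf-8")
index = [(0, 7), (7, 7), (7, 14), (2147483647, 0)]
decoded = decode_vstring_array(index, data)
```

The row (7, 7) shows the difference between an empty string, whose begin and end are equal, and a null string, whose begin lies past its end.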

Directories

If a FileListGroup contains a VStringArrayIndex and VStringArrayData, then the corresponding VStringArray is interpreted as an alphabetically-ordered listing of the corresponding directory. Each directory also includes a metadata group. This group holds a second VStringArray that holds the OME-XML metadata for the correspondingly indexed file.

Schematic

The following is a sequence diagram of a headless run of CellProfiler. More stuff happens than what is depicted, for instance each module's particular .run functionality is not shown.

Broad Strokes

NamesAndTypes