Quality control of trained models

This page is under construction

Always evaluate your model

The reliable implementation of DL methods depends on a careful evaluation of the models’ output performance; we call this step quality control (QC). Quantitative QC is crucial to avoid using models that produce low-quality images and artifacts, especially when they may not be easily identifiable by simple visual assessment.

All ZeroCostDL4Mic notebooks contain a QC section (Section 5 of each notebook) that allows the quantitative assessment of any trained model. This section typically has two parts:

  • Inspection of the training and validation loss over the number of epochs trained.
  • Evaluation of the network output by comparing model predictions against ground-truth targets. This evaluation is done using quantitative quality metrics.

The calculated metrics help users improve their models, either by comparing models trained with different hyperparameters or by exploring how well a model applies to data different from that on which it was trained (generalization).

Inspection of the loss function

The first performance metrics shown to users are the loss curves for model training and validation. They allow users to determine whether training has converged and whether the model is overfitting, which is identifiable by an increasing divergence between the validation and training losses. This divergence appears when the model learns features specific to the training dataset instead of general features applicable to all similar datasets, thereby preventing it from generalizing to unseen data, a common problem for deep neural networks.
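
As an illustration, a minimal sketch for plotting these curves is shown below. It assumes the training history was exported to a CSV file with loss and val_loss columns (as Keras's CSVLogger would produce); the file name and path are placeholders, and the exact log format depends on the notebook used.

```python
# Minimal sketch: plot training vs validation loss from a CSV log.
# Assumes a CSV with 'loss' and 'val_loss' columns; the path below is only an example.
import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("Quality Control/training_evaluation.csv")  # hypothetical path

plt.figure(figsize=(6, 4))
plt.plot(history["loss"], label="training loss")
plt.plot(history["val_loss"], label="validation loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.yscale("log")  # a log scale often makes convergence easier to judge
plt.legend()
plt.title("Training vs validation loss")
plt.show()
```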

The authors of CycleGAN and pix2pix do not recommend visually inspecting the loss curves to evaluate the training quality achieved with these networks. Therefore, these training curves are not displayed in the pix2pix and CycleGAN notebooks. This information is, however, saved in a log file (in the same folder as the trained model) and can easily be accessed after training is completed.

Evaluation of image quality metrics

The best way to assess the quality of the output generated by a model is to compare its predictions on unseen data to the corresponding ground-truth target images. This can also be performed in Section 5 of the ZeroCostDL4Mic notebooks.

The metrics used in individual notebooks vary depending on the task performed by each network (and therefore the type of output image). For networks producing a grey-scale image, e.g. CARE, Noise2Void, U-Net, and Label-free prediction (fnet), the metrics used are SSIM (structural similarity) and RSE (root squared error).

For networks producing a binary or semantic image, e.g., StarDist, the metric used is IoU (intersection over union). The YOLOv2 notebook uses the mean average precision score (mAP) as in Everingham et al., reflecting the validity of the bounding box positions and the corresponding classification. More information on how these metrics are implemented can be found in the ZeroCostDL4Mic manuscript.

In the pix2pix and CycleGAN notebooks, a model checkpoint is saved every five epochs. In these notebooks, the QC section evaluates the output from each checkpoint on the QC dataset in order to identify the optimal model. This evaluation is also based on quantitative image quality metrics, as sketched below.
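
The rough sketch below illustrates the idea of ranking checkpoints by a quality metric. It assumes each checkpoint's prediction has already been saved to a folder named checkpoint_<epoch> next to a shared ground-truth image; this file layout is hypothetical and the actual evaluation code in the notebooks differs.

```python
# Rough sketch: rank pre-computed checkpoint predictions by mean SSIM.
# The 'QC/checkpoint_*/prediction.tif' layout is hypothetical.
import glob
import numpy as np
from skimage import io
from skimage.metrics import structural_similarity

def norm(img):
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min())

target = norm(io.imread("QC/ground_truth.tif"))

scores = {}
for pred_path in glob.glob("QC/checkpoint_*/prediction.tif"):
    pred = norm(io.imread(pred_path))
    scores[pred_path] = structural_similarity(target, pred, data_range=1.0)

best = max(scores, key=scores.get)
print(f"Best checkpoint: {best} (mSSIM = {scores[best]:.3f})")
```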

Quality metrics used in the notebooks

The SSIM (structural similarity) map

The SSIM metric is used to evaluate whether two images contain the same structures. It is a normalized metric, and an SSIM of 1 indicates perfect similarity between two images. Therefore, for SSIM, the closer to 1, the better. SSIM can also be evaluated locally, which allows an SSIM map to be built. These maps are constructed by calculating the SSIM metric for each pixel, considering the structural similarity in the neighbourhood of that pixel (typically an 11x11 pixel window centred on the pixel of interest).

The overall metric mSSIM reported in the notebooks is the mean SSIM value calculated across the entire image.
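
A minimal sketch of how an SSIM map and the mSSIM value can be computed with scikit-image is shown below. The file names are placeholders, and the normalization and SSIM parameters used in the notebooks may differ slightly.

```python
# Minimal sketch: SSIM map and mean SSIM (mSSIM) with scikit-image.
# File names are placeholders; normalization may differ from the notebooks.
import numpy as np
from skimage import io
from skimage.metrics import structural_similarity

def norm(img):
    # Rescale to [0, 1] so both images share the same data range.
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min())

prediction = norm(io.imread("prediction.tif"))
target = norm(io.imread("ground_truth.tif"))

# full=True returns the local SSIM map; win_size=11 matches the typical
# 11x11 pixel neighbourhood mentioned above.
mssim, ssim_map = structural_similarity(target, prediction, data_range=1.0,
                                        win_size=11, full=True)
print(f"mSSIM = {mssim:.3f}")  # 1.0 means perfect structural similarity
```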

The RSE (Root Squared Error) map

This map displays the root of the squared difference between the normalized prediction and target, or between the source and the target (essentially the absolute value of the pixel-wise difference between the two images). In this case, the closer the RSE is to 0, the better. A perfect agreement between target and prediction therefore leads to an RSE map showing zeros everywhere (dark).

NRMSE (normalized root mean squared error) summarizes the pixel-wise differences between the two images in a single value. Good agreement yields low NRMSE scores.

PSNR (Peak signal-to-noise ratio) expresses the difference between the ground truth and the prediction (or source input) in decibels, computed from the peak pixel value and the mean squared error (MSE) between the images. The higher the score, the better the agreement.
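
The short sketch below shows one way to compute these three quantities with scikit-image. The file names are placeholders, and the normalization used here (rescaling to [0, 1]) is an assumption that may differ from the notebooks.

```python
# Minimal sketch: RSE map, NRMSE and PSNR between a normalized prediction
# and ground-truth image. File names are placeholders.
import numpy as np
from skimage import io
from skimage.metrics import normalized_root_mse, peak_signal_noise_ratio

def norm(img):
    img = img.astype(np.float32)
    return (img - img.min()) / (img.max() - img.min())

prediction = norm(io.imread("prediction.tif"))
target = norm(io.imread("ground_truth.tif"))

# RSE map: per-pixel root of the squared difference (the absolute difference).
rse_map = np.sqrt((target - prediction) ** 2)

nrmse = normalized_root_mse(target, prediction)                      # lower is better
psnr = peak_signal_noise_ratio(target, prediction, data_range=1.0)   # higher (dB) is better
print(f"NRMSE = {nrmse:.3f}, PSNR = {psnr:.2f} dB")
```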

Intersection over Union (IoU)

The Intersection over Union metric quantifies the percentage of overlap between the target mask and the network output. The closer to 1, the better the performance. This metric can be used to assess how accurately a model predicts segmentations, as in the sketch below.
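
The sketch below shows how IoU can be computed from two binary masks; the mask file names are placeholders.

```python
# Minimal sketch: IoU between a ground-truth mask and a predicted mask.
# File names are placeholders; masks are assumed to be binary (background = 0).
import numpy as np
from skimage import io

gt_mask = io.imread("ground_truth_mask.tif") > 0
pred_mask = io.imread("predicted_mask.tif") > 0

intersection = np.logical_and(gt_mask, pred_mask).sum()
union = np.logical_or(gt_mask, pred_mask).sum()
iou = intersection / union if union > 0 else 1.0  # both empty -> perfect agreement
print(f"IoU = {iou:.3f}")  # 1.0 means perfect overlap
```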

mAP score, Precision and Recall

These scores are available in the YOLOv2 notebook. If you want to read in more detail about these scores, we recommend this brief explanation.

mAP score: This refers to the mean average precision of the model on the given dataset. This value indicates how precise the class predictions on this dataset are when compared to the ground truth. Values closer to 1 indicate a good fit.

Precision: This is the proportion of correct classifications (true positives) among all the predictions made by the model.

Recall: This is the proportion of true positives detected by the model among all the detectable objects in the data.
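
As an illustration of how precision and recall relate to the detection counts, a minimal sketch is shown below. The counts are made-up example values, and the mAP computation used in the notebook involves additional steps (IoU thresholding of bounding boxes and averaging precision over recall levels and classes).

```python
# Minimal sketch: precision and recall from detection counts.
# The counts below are made-up example values.
true_positives = 80    # detections matching a ground-truth box (e.g. IoU >= 0.5)
false_positives = 10   # detections with no matching ground-truth box
false_negatives = 20   # ground-truth boxes the model missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```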