01.Association10.Observer agreement - sporedata/researchdesigneR GitHub Wiki

1. Use cases: in which situations should I use this method?

2. Input: what kind of data does the method require?

  1. Two or more raters, using a categorical or continuous metric
  2. No external gold standard

3. Algorithm: how does the method work?

Model mechanics

Studies of observer agreement are important whenever a classification is made by a person other than the patient. For example, a clinician might classify a medical image (MRI, CT, etc.) in relation to disease stage, or classify a patient in relation to their impairment status. Importantly, observer agreement is different from patient self-report, where patients rate their own status. The inherent issue in observer agreement is that different observers will have different opinions and ratings when looking at the same phenomenon, and there is no gold standard against which their ratings can be compared. In other words, the only available measure is how much the observers agree or disagree with one another, with high rates of agreement being desirable. For many clinical areas, interobserver reliability tends to display high levels of heterogeneity [1].

Describing in words

Describing in images

Describing with code
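
A minimal sketch of the basic idea, assuming simulated ratings and the irr R package; the rater variables and numbers below are purely illustrative:

```r
library(irr)

set.seed(42)
# Two raters classifying the same 100 images into four disease stages (simulated data)
rater1 <- sample(1:4, 100, replace = TRUE)
# Rater 2 mostly agrees with rater 1, otherwise rates at random
rater2 <- ifelse(runif(100) < 0.7, rater1, sample(1:4, 100, replace = TRUE))
ratings <- data.frame(rater1, rater2)

agree(ratings)   # raw percentage agreement (does not discount chance agreement)
kappa2(ratings)  # Cohen's kappa for two raters (chance-corrected)
```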

Breaking down equations

Suggested companion methods

  • Observer agreement is often measured through surveys. For example, one could use one of the many open-source PACS (picture archiving and communication) systems to measure agreement among radiologists.
  • These methods can also be used in image recognition applications.

Learning materials

  1. Books

  2. Articles

4. Output: how do I interpret this method's results?

Mock conclusions or most frequent format for conclusions reached at the end of a typical analysis.

  • The agreement rate was [PERCENTAGE], with a Cohen's kappa coefficient of [VALUE]
  • The intraclass correlation coefficient was [VALUE]

Typical tables and plots and corresponding text description

  • Cohen’s kappa coefficient estimates [2]

Metaphors

  1. Observer agreement reflects how likely two people with different backgrounds are to have the same perspective on a single event or outcome. Professionals with diverse backgrounds tend to disagree, for example, when providing patient assessments.

Reporting guidelines

Methods are often split by categorical and continuous scales. For example, two pathologists looking at a single slide and attempting to establish its grade on a scale of one through four would be assessed using a categorical method. The most common method is the kappa statistic, which measures agreement while "discounting" the agreement expected by chance. For example, if a classification has only two categories and both observers use each category about equally often, they will agree by pure chance roughly 50% of the time. Kappa accounts for that chance agreement and typically ranges from 0 (no agreement beyond chance) to 1 (perfect agreement), with negative values indicating agreement worse than chance. For ordinal classifications with three or more categories, kappa coefficients can also be weighted, accounting for the fact that a rating closer to another represents a smaller degree of disagreement; weights are usually assigned in a linear or quadratic form. Of note, association tests such as McNemar's test do not measure agreement; kappa statistics should be preferred.
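
A minimal sketch of unweighted and weighted kappa for the slide-grading example above, assuming hypothetical grades and the irr R package (kappa2 accepts "unweighted", "equal" (linear), and "squared" (quadratic) weights):

```r
library(irr)

# Hypothetical grades (scale 1-4) assigned by two pathologists to the same ten slides
grades <- data.frame(
  pathologist1 = c(1, 2, 2, 3, 4, 3, 2, 1, 4, 3),
  pathologist2 = c(1, 2, 3, 3, 4, 2, 2, 1, 3, 3)
)

kappa2(grades, weight = "unweighted")  # all disagreements count equally
kappa2(grades, weight = "equal")       # linear weights: near-misses penalized less
kappa2(grades, weight = "squared")     # quadratic weights
```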

For continuous measurements (for example, two pathologists measuring the size of a tumor), the classical metric is the intra-class correlation coefficient (ICC). As a caveat, simple correlation statistics should not be used as a measure of agreement, since two sets of ratings can be highly correlated while not agreeing at all; the ICC measures agreement rather than mere association.
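
A minimal sketch of the intra-class correlation coefficient for the tumor-size example, again with hypothetical measurements and the irr package; the model, type, and unit arguments should be chosen to match the actual study design:

```r
library(irr)

# Hypothetical tumor sizes (mm) measured by two pathologists on the same six lesions
sizes <- data.frame(
  pathologist1 = c(12.1, 15.4, 9.8, 22.3, 18.0, 11.5),
  pathologist2 = c(12.5, 15.0, 10.2, 21.8, 18.9, 11.1)
)

# Two-way model, absolute agreement, single measurement
icc(sizes, model = "twoway", type = "agreement", unit = "single")

# For contrast: a Pearson correlation can be high even when the raters differ systematically
cor(sizes$pathologist1, sizes$pathologist2)
```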

When agreement studies are conducted within trials, it is preferable that labeling be centralized within a small team. This structure keeps labeling logistics more agile and easier to control, and tends to yield better interobserver reliability indices. Additional resources to increase interobserver reliability include:

  1. Educational material that can be consulted every time reliability starts decreasing
  2. Statistical quality control of reliability indices throughout the labeling period
  3. Use a three-tier system whenever possible: first, start the labeling with automated segmentation algorithms as the initial step in the workflow [3]. Second, have a group of researchers continuously annotate a random sample of the automated labels as a statistical quality control method, using an online annotation platform [4]. Last, have board-certified radiologists or pathologists label a random sample of the researchers' annotations, serving as the gold standard (ground truth) for the whole system. For fairly simple diagnoses, it has been suggested that the second step could be performed by untrained observers [5].

Reporting guidelines include:

  1. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed [6].
  2. Validating Whole Slide Imaging for Diagnostic Purposes in Pathology: Guideline from the College of American Pathologists Pathology and Laboratory Quality Center [7].

Reporting guidelines for Results

Results are often presented as kappa coefficient values for categorical variables and intra-class correlation coefficients for numeric ones. Numeric variables can be displayed as scatterplots with splines representing the agreement between any two raters. Bland-Altman plots are also frequently used for continuous variables.

Plots can be generated for agreement on continuous variables (a Bland-Altman plot, or even a simple scatterplot) as well as for nominal variables (an agreement plot).

Figure: Bland-Altman plot of gestational age assessments [9].

The Bland-Altman plot is a graphical method to compare two measurement techniques; it describes the agreement between two quantitative measurements [10].

The resulting graph is an XY scatterplot, with the Y-axis representing the difference between the two paired measurements (A - B) and the X-axis representing the average of these measurements ((A + B) / 2). The differences can alternatively be shown as percentages or ratios, and the values from the first or second method can be used on the X-axis instead of the mean of the two [10].

Lines parallel to the X-axis indicate agreement: the solid line corresponds to zero difference between the methods, and the dotted lines represent the limits of agreement. Of note, the Bland-Altman plot does not by itself establish whether the agreement is sufficient to use either method interchangeably; it simply quantifies the bias and provides a range of agreement that includes 95 percent of the differences between one measurement and the other. Therefore, the best way to use this approach is to define a priori the limits of maximum acceptable difference based on biologically and analytically relevant criteria [10].

In the figure above, between approximately 30 and 45 weeks most of the results fall within the established limits of agreement. Outside this range (i.e., below 30 or above 45 weeks), the differences are much larger.
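
A minimal sketch of how a Bland-Altman plot can be built in base R; methodA and methodB below are simulated, hypothetical paired measurements:

```r
# Simulated paired measurements from two hypothetical methods, A and B
set.seed(1)
methodA <- rnorm(50, mean = 100, sd = 10)
methodB <- methodA + rnorm(50, mean = 1, sd = 4)  # method B differs slightly on average

differences <- methodA - methodB
means       <- (methodA + methodB) / 2
bias        <- mean(differences)
loa         <- bias + c(-1.96, 1.96) * sd(differences)  # 95% limits of agreement

plot(means, differences,
     xlab = "Mean of the two methods ((A + B) / 2)",
     ylab = "Difference between methods (A - B)",
     main = "Bland-Altman plot")
abline(h = 0, lty = 1)     # zero-difference line
abline(h = bias, lty = 2)  # mean difference (bias)
abline(h = loa, lty = 3)   # limits of agreement
```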

5. SporeData-specific

Templates

Data science functions

Data science packages

  • The irr package has functions for both continuous and categorical observer agreement [8].
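
A brief sketch of irr usage with more than two raters, based on the diagnoses and anxiety example datasets that ship with the package:

```r
library(irr)

data(diagnoses)                  # 30 patients rated by 6 raters into 6 diagnostic categories
kappam.fleiss(diagnoses)         # Fleiss' kappa for multiple raters
kappam.fleiss(diagnoses[, 1:2])  # restricted to the first two raters

data(anxiety)                    # 20 subjects rated by 3 raters on an ordinal anxiety scale
icc(anxiety, model = "twoway", type = "agreement", unit = "single")
```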

General description

Clinical areas of interest

Variable categories

Linkage to other datasets

Limitations

Related publications

SporeData data dictionaries

References

[1] Siddiqui MR, Gormly KL, Bhoday J, Balyansikova S, Battersby NJ, Chand M, Rao S, Tekkis P, Abulafi AM, Brown G. Interobserver agreement of radiologists assessing the response of rectal cancers to preoperative chemoradiation using the MRI tumour regression grading (mrTRG). Clinical radiology. 2016 Sep 1;71(9):854-62.
[2] Baccini A, Barabesi L, De Nicolao G. On the agreement between bibliometrics and peer review: Evidence from the Italian research assessment exercises. PloS one. 2020 Nov;15(11):e0242520.
[3] Harris RJ, Teng P, Nagarajan M, Shrestha L, Lu X, Ramakrishna B, Lu P, Sanford T, Clem H, McRoberts M, Goldin J. High-throughput image labeling and quality control for clinical trials using machine learning. Int. J. Clin. Trials. 2018 Oct;5(4):161.
[4] Rubin DL, Akdogan MU, Altindag C, Alkim E. ePAD: An image annotation and analysis platform for quantitative imaging. Tomography. 2019 Mar;5(1):170.
[5] Nguyen TB, Wang S, Anugu V, Rose N, McKenna M, Petrick N, Burns JE, Summers RM. Distributed human intelligence for colonic polyp classification in computer-aided detection for CT colonography. Radiology. 2012 Mar;262(3):824-33.
[6] Kottner J, Audigé L, Brorson S, Donner A, Gajewski BJ, Hróbjartsson A, Roberts C, Shoukri M, Streiner DL. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. International journal of nursing studies. 2011 Jun 1;48(6):661-71.
[7] Pantanowitz L, Sinard JH, Henricks WH, Fatheree LA, Carter AB, Contis L, Beckwith BA, Evans AJ, Lal A, Parwani AV. Validating whole slide imaging for diagnostic purposes in pathology: guideline from the College of American Pathologists Pathology and Laboratory Quality Center. Archives of Pathology and Laboratory Medicine. 2013 Dec;137(12):1710-22.
[8] Gamer M. irr: Various coefficients of interrater reliability and agreement. 2010.
[9] Rada S, Gamper J, González R, Mombo-Ngoma G, Ouédraogo S, Kakolwa MA, ... & Ramharter M. Concordance of three alternative gestational age assessments for pregnant women from four African countries: A secondary analysis of the MIPPAD trial. PloS one. 2018 Aug 6;13(8):e0199243.
[10] Giavarina D. Understanding Bland Altman analysis. Biochemia medica. 2015 Jun;25(2):141-151.
