08. Latent variable modeling / 04. Item response theory

1. Use cases: in which situations should I use this method?

  • When you want to use Computerized Adaptive Testing (CAT) or customized scales rather than standard scales (see the sketch below).
  • When differential item functioning (DIF) might be a concern.
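
For a flavor of what CAT looks like in code, here is a minimal sketch using the mirt [4] and mirtCAT R packages. All item parameters, the sample, and the stopping rules are simulated for illustration rather than taken from any particular study:

```r
library(mirt)
library(mirtCAT)

set.seed(42)

# Simulate responses to 20 dichotomous items and fit a 2PL model
a <- matrix(rlnorm(20, meanlog = 0.2, sdlog = 0.3))  # discriminations
d <- matrix(rnorm(20))                               # intercepts
dat <- simdata(a, d, N = 500, itemtype = "dich")
fit <- mirt(dat, model = 1, itemtype = "2PL", verbose = FALSE)

# Administer a CAT to the first simulated respondent: select the most
# informative item at each step (criteria = "MI"); stop after 8 items or
# once the standard error of the trait estimate drops below 0.4
res <- mirtCAT(mo = fit, method = "EAP", criteria = "MI",
               design = list(max_items = 8, min_SEM = 0.4),
               local_pattern = dat[1, , drop = FALSE])
summary(res)  # items administered, theta estimates, standard errors
```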

2. Input: what kind of data does the method require?

  1. A sample of approximately 400 patients with complete measurements, although this number can be reduced if a Bayesian approach is used [1].
  2. If data from more than one scale are to be used, then the datasets should share a group of anchor items. Anchor items are questions that are common across the different datasets and that are themselves free from differential item functioning (DIF); see the sketch below.
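
A minimal sketch, using the mirt R package [4], of what such a linked dataset can look like. The item names (anchor1, x1, y1, ...) and the responses are simulated assumptions; the anchor items are what makes the concurrent calibration possible:

```r
library(mirt)
set.seed(1)

# Two samples measured on a common latent trait with 9 items in total:
# 3 shared anchors, 3 unique to scale X, 3 unique to scale Y
n_x <- 200; n_y <- 200
a <- matrix(rlnorm(9, meanlog = 0.2, sdlog = 0.2))  # discriminations
d <- matrix(rnorm(9))                               # intercepts
full <- simdata(a, d, N = n_x + n_y, itemtype = "dich")
colnames(full) <- c("anchor1", "anchor2", "anchor3",
                    "x1", "x2", "x3", "y1", "y2", "y3")

# Impose the linking structure: the scale X sample never saw the y items
# and the scale Y sample never saw the x items; anchors are seen by all
combined <- full
combined[1:n_x, c("y1", "y2", "y3")] <- NA
combined[(n_x + 1):(n_x + n_y), c("x1", "x2", "x3")] <- NA

# Concurrent calibration: the anchor items place both samples (and hence
# both scales) on the same latent metric despite the structural missingness
fit <- mirt(combined, model = 1, itemtype = "2PL")
coef(fit, IRTpars = TRUE, simplify = TRUE)
```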

3. Algorithm: how does the method work?

Model mechanics

  • Compared with traditional methods for measuring self-reported constructs (quality of life, depression, anxiety, ability to perform activities of daily living, etc.), Item Response Theory (IRT) provides the following advantages:

    1. Each item (question) is evaluated in isolation, meaning that the scale does not have to be a monolith; for example, a researcher can build assessment tools that are personalized to individual patients or groups.
    2. When combined with Computerized Adaptive Testing (CAT), measurements can be faster (fewer questions, decreasing patients' response burden) and more precise (smaller measurement error); see the CAT sketch in the use cases section above.
    3. Since the analysis is conducted at the item level rather than at the scale level, different scales measuring the same construct can be combined, creating a cross-walk scoring algorithm.
    4. It is possible to evaluate whether questions contain differential item functioning (DIF). DIF is a situation where individuals with the same underlying level of the construct (latent variable) obtain different scores as a function of measurement bias. A classic example is a shoulder function scale asking whether the person can reach behind his or her back. Since women might perform that motion more often than men (for example, when fastening a bra), women would be more likely to endorse that item than men with the same underlying shoulder function, pulling the two groups' scores apart even when their function is identical (a code sketch of a DIF test follows this list).
  • The downside is that a larger sample is required to estimate the properties of each question, although this requirement can be somewhat mitigated by the use of Bayesian item response theory [1].
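
As referenced in point 4 above, here is a minimal sketch of a DIF test using the mirt R package [4]. The group labels, the item parameters, and the choice of items 1-9 as DIF-free anchors are all simulated assumptions for illustration:

```r
library(mirt)
set.seed(7)

# Simulate two groups answering 10 dichotomous items; item 10 is given a
# lower intercept in the second group, i.e., built-in DIF
a <- matrix(rep(1.2, 10))
d_men <- matrix(rnorm(10)); d_women <- d_men
d_women[10] <- d_women[10] - 1
dat <- rbind(simdata(a, d_men,   300, itemtype = "dich"),
             simdata(a, d_women, 300, itemtype = "dich"))
group <- rep(c("men", "women"), each = 300)

# Multiple-group 2PL with items 1-9 treated as DIF-free anchors; group
# means and variances are freed so groups may differ on the latent trait
mg <- multipleGroup(dat, model = 1, group = group,
                    invariance = c(colnames(dat)[1:9],
                                   "free_means", "free_var"))

# Likelihood-ratio test of DIF in the slope (a1) and intercept (d) of item 10
DIF(mg, which.par = c("a1", "d"), items2test = 10)
```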

  • 1-PL (1-parameter logistic) models are designed for dichotomous items, so using them for ordinal items would force us to re-categorize the responses, leading to a loss of information; an ordinal model (aka Samejima's graded response model) is a better option there (see the sketch below). 1-PL models only provide information about the level of the construct at which each item sits (difficulty parameter), whereas 2-PL models also provide information about the slope (discrimination parameter), and 3-PL models add a lower asymptote (guessing parameter). 4-PL models add an upper asymptote, the "carelessness" parameter, which accounts for respondents who do not pay attention to what they answer.
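
A minimal sketch, using the mirt R package [4], of the difference between fitting ordinal items with Samejima's graded model and dichotomizing them first; all item parameters are simulated:

```r
library(mirt)
set.seed(3)

# Simulate 6 four-category ordinal items (3 thresholds per item)
a <- matrix(rlnorm(6, meanlog = 0.2, sdlog = 0.2))
d <- matrix(rnorm(6 * 3), nrow = 6)
d <- t(apply(d, 1, sort, decreasing = TRUE))  # graded model needs ordered intercepts
responses <- simdata(a, d, N = 400, itemtype = "graded")

# The ordinal (Samejima) model keeps every response category...
graded_fit <- mirt(responses, model = 1, itemtype = "graded")

# ...whereas dichotomizing the same items (here at >= category 2) before
# fitting a 2PL throws information away
dich_fit <- mirt((responses >= 2) * 1, model = 1, itemtype = "2PL")

coef(graded_fit, IRTpars = TRUE, simplify = TRUE)  # slopes (a) and thresholds (b)
```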

  • Read about the differences between 1-PL and Rasch models here. The most visible difference is the scaling constant 1.7 in the 1-PL formula (not present in Rasch), which rescales the logistic curve so that it approximates the normal-ogive metric; more broadly, the Rasch tradition fixes the model and asks whether the data fit it, whereas the 1-PL/IRT tradition fits the model to the sample. The formulas are written out below.
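
For reference, the two response functions are commonly written as follows, where θ is the latent trait, b_i is the difficulty of item i, and 1.7 is the scaling constant:

```latex
P_{\text{1-PL}}(X_i = 1 \mid \theta) = \frac{1}{1 + e^{-1.7\,(\theta - b_i)}}
\qquad
P_{\text{Rasch}}(X_i = 1 \mid \theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}}
```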

  • Items can be evaluated graphically by creating information curves. These are derived from item characteristic curves and can be used to assemble a custom scale (see the sketch after this list):

    1. An item's information curve is derived from its characteristic curve: information peaks where the ICC is steepest (for a 2-PL item, the information at trait level θ is a² · P(θ)(1 - P(θ))). The area under the information curve is interpreted as the total amount of information carried by the item.
    2. To create a customized scale, you assemble a group of items that together cover the region of the latent construct matching the population you are about to assess. Thus, if respondents are expected to have a low degree of the construct, you pick items whose information curves peak in that region.
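
A minimal sketch, with simulated items and the mirt R package [4], of how information curves can be inspected when assembling a custom scale:

```r
library(mirt)
set.seed(5)

a <- matrix(rlnorm(8, meanlog = 0.2, sdlog = 0.3))
d <- matrix(rnorm(8))
dat <- simdata(a, d, N = 400, itemtype = "dich")
fit <- mirt(dat, model = 1, itemtype = "2PL", verbose = FALSE)

# Information curves for all items on a single panel
plot(fit, type = "infotrace")

# Numeric information for item 1 across the trait range, e.g., to check
# whether it covers the low end of the construct
theta <- matrix(seq(-4, 4, by = 0.1))
info_item1 <- iteminfo(extract.item(fit, 1), Theta = theta)
theta[which.max(info_item1)]  # trait level where item 1 is most informative
```
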
  • In ordinal and nominal IRT models, you end up with multiple logistic curves in the item's plot rather than a single Item Characteristic Curve (ICC). These curves relate to the response alternatives: in an ordinal IRT model (aka Samejima's graded response model), each category boundary is modeled like a dichotomous item, so the curve for a given alternative shows the probability of responding in that category or higher given the latent trait. The same is repeated for all the other alternatives. Since polytomous items tend to carry more information than dichotomous ones, their information curves tend to have a greater area under the curve. Here is a good tutorial.

  • If an item has polytomous (i.e., more than two) response options, e.g., Likert scales, the interpretation of ICCs is slightly different in that the ICC plots the expected item score over the range of the trait. Therefore, to depict the probability of endorsing each response category for a polytomous item, categorical response curves (CRCs) can be plotted, one curve for each response category [6].

  • When modeling a scale whose Likert items have four response alternatives, CRCs will show a sigmoid curve at each extreme and two roughly bell-shaped curves for the two categories in the middle. Suppose your IRT model measures knee pain levels, and one question in the questionnaire is "How often do you feel pain when you need to stand up?", with response options "1 = never / 2 = rarely / 3 = sometimes / 4 = always". The two extreme alternatives have S-shaped curves for the following reason: for "never," there is no alternative to its left, so if your latent trait sits at the very low end, your probability of choosing that category approaches 100%, i.e., the curve starts near 1.0. The mirror image is true for "always." For the categories in the middle, for example "rarely," the probability starts low on the left because another alternative ("never") precedes it; the probability of endorsing "rarely" then peaks, and at some point it starts decreasing because respondents become more likely to endorse "sometimes." This rise and fall is why the middle curves end up looking bell-shaped. The sketch below reproduces these shapes for a simulated item.
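
A minimal sketch with the mirt R package [4] that plots these curves for simulated four-category items:

```r
library(mirt)
set.seed(9)

# Simulate 5 four-category Likert items and fit Samejima's graded model
a <- matrix(rlnorm(5, meanlog = 0.3, sdlog = 0.2))
d <- t(apply(matrix(rnorm(5 * 3), nrow = 5), 1, sort, decreasing = TRUE))
dat <- simdata(a, d, N = 500, itemtype = "graded")
fit <- mirt(dat, model = 1, itemtype = "graded", verbose = FALSE)

# Category response curves for item 1: sigmoid curves at the extremes
# ("never"/"always"), bell-shaped curves for the middle categories
itemplot(fit, item = 1, type = "trace")

# Expected item score across the trait range (the polytomous analogue of
# a dichotomous item's ICC)
itemplot(fit, item = 1, type = "score")
```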

Reporting guidelines for Methods

Applying item response theory (IRT) modeling to questionnaire development, evaluation, and refinement

Applying item response theory and computer adaptive testing: the challenges for health outcomes assessment

Data science packages

  • mirt: Multidimensional Item Response Theory (R package) [4], along with the package author's lectures and educational material [5].

Suggested companion methods

Learning materials

  1. Books

    • Bayesian Item Response Modeling: Theory and Applications [1].
    • Item Response Theory for Psychologists [2].
    • The Basics of Item Response Theory Using R [3].
  2. Articles combining theory and scripts

References

[1] Fox JP. Bayesian Item Response Modeling: Theory and Applications. Springer Science & Business Media; 2010 May 19.

[2] Embretson SE, Reise SP. Item Response Theory for Psychologists. Psychology Press; 2013 Sep 5.

[3] Baker FB, Kim SH. The Basics of Item Response Theory Using R. New York: Springer; 2017 Apr 25.

[4] Chalmers RP. mirt: Multidimensional Item Response Theory. Journal of Statistical Software; 2012; 48(6), 1–29.

[5] Chalmers P. mirt lectures and educational material. 2017.

[6] Nguyen TH, Han HR, Kim MT, Chan KS. An introduction to item response theory for patient-reported outcome measurement. The Patient. 2014;7(1):23–35.
