Psychometric Methods

Traditional approaches to psychometric analyses of inventories, such as the BDI and CES-D, employ omnibus statistics, such as item-total correlations or reliability coefficients, that average across levels of individual variation. Typically, scores on items from the BDI and the CES-D are summed and compared across groups. How item performance may vary as a function of depressive severity within groups is ignored. Summing item scores in this way assumes that items are equally informative about how depressed an individual is, and averaging across a continuum of variation assumes that items are uniformly effective at all levels of depressive severity. These assumptions are unlikely to be valid. Indeed, how individuals endorse items on a depression inventory may vary across items on a single measure of depression, across measures of depression, across levels of depressive severity, and across different groups of individuals.
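
To make the point about omnibus statistics concrete, the following is a minimal sketch of a corrected item-total correlation, written in Python with only the standard library. The response matrix and the function names (`pearson`, `corrected_item_total`) are hypothetical and chosen for illustration; they are not taken from any of the cited studies.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def corrected_item_total(responses, item):
    """Correlate one item with the sum of the remaining items."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    return pearson(item_scores, rest_scores)

# Hypothetical data: 6 respondents x 3 items, each item scored 0-3.
responses = [
    [0, 0, 1],
    [1, 0, 1],
    [1, 1, 2],
    [2, 1, 2],
    [2, 2, 3],
    [3, 3, 3],
]

# Summed scores weight every item equally.
totals = [sum(row) for row in responses]

# One coefficient per item, pooled over all respondents.
r_item0 = corrected_item_total(responses, 0)
```

The statistic reduces each item to a single number pooled over every respondent, so it cannot show whether the item behaves differently among mildly and severely depressed individuals.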


Despite the considerable progress in the assessment of depression, there are a number of enduring questions concerning most, if not all, measures of depression. These include how effective response options are at different levels of depressive severity (option effectiveness), how effective scales are at detecting differences in depressive severity (scale discriminability), and whether certain groups of individuals endorse items differently (differential item functioning). Techniques based on item response models promise to resolve many of these issues because they model how individuals endorse items and options as a function of some ability, trait, or condition. Consequently, analytic techniques based on item response models, which evaluate item performance as a function of depressive severity, are not only helpful in addressing problems facing many areas of research; they are essential to resolving these issues adequately. In the quantitative papers I have written, I have tried to address both substantive issues about how specific scales perform, including the Hamilton Rating Scale for Depression (HRSD), the Beck Depression Inventory (BDI), and the Center for Epidemiologic Studies Depression Scale (CES-D), and theoretical issues, such as how to equate two different scales as a function of depression (Santor et al., 1995) or how to evaluate whether a priori weights for individual items are supported by the data (Santor et al., 1994). I have also used similar techniques to examine the relation between item performance and scale length (Santor, Zuroff, & Fielding, 1998), as well as the relation between rates of symptom reduction achieved during treatment and the severity of post-treatment relapse (Santor & Segal, 1999). More recently, I have published a more conceptual article on the measurement of depression (Santor, Gregus & Welch, 2006), for which five commentaries were invited and to which I provided a rejoinder (Santor, 2006).
These techniques were recently used to evaluate the performance of scales assessing depression (HRSD; Santor et al., 2008) and symptoms of schizophrenia (PNSS; Santor et al., 2007) using large datasets from Eli Lilly. They were also used in a very large series of articles assessing health care measures, written with Jeanne Haggerty, which after some five years of work are now in press at Healthcare Policy (2011).


The studies I have published to date are among the few studies to adequately distinguish group mean differences from item bias, and they are also among the first to apply nonparametric item response models to the analysis of test data. First, finding an overall mean difference between two groups does not demonstrate bias, nor does failing to find a difference preclude the possibility of bias. Detecting item bias depends on evaluating differential item functioning between groups that have been equated along some underlying continuum, such as depressive severity. Second, the majority of studies using item response models have employed parametric models based on a family of logistic functions. Despite their widespread use, parametric approaches to item response analyses are not without limitations. Item analyses based on parametric models are normative tests of item performance. The effectiveness of any item will be a question not only of (a) the utility of the item itself as an indicator of depression but also of (b) the model chosen and (c) the sample selected. Items and scales that correspond to a logistic model may be useful, but unless items are constructed with a logistic model in mind, items not meeting the criteria of the model chosen may still be useful items, either within a restricted range of depression or within particular samples. Moreover, parametric models based on logistic functions run the risk of failing to account for features such as curve asymmetry, as well as radical departures from monotonicity or unity. Nonparametric models offer the advantage of locally estimating the relation between an item and some underlying trait, ability, or condition at various levels of that trait, ability, or severity.
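
The idea of locally estimating an item's relation to severity can be sketched with kernel smoothing, one common way to build a nonparametric item characteristic curve. The Python example below is only an illustration under stated assumptions: the data are invented, the severity proxy is a hypothetical summed score, the function name `smoothed_item_curve` and the bandwidth are arbitrary choices, and it is not presented as the exact procedure used in the studies cited here.

```python
import math

def gaussian_kernel(u):
    """Unnormalised Gaussian weight for a standardised distance u."""
    return math.exp(-0.5 * u * u)

def smoothed_item_curve(severity, item_scores, grid, bandwidth):
    """Kernel-weighted (Nadaraya-Watson) average item score at each grid point.

    No logistic form is imposed: each estimate depends mainly on
    respondents whose severity lies close to the grid point.
    """
    curve = []
    for point in grid:
        weights = [gaussian_kernel((s - point) / bandwidth) for s in severity]
        weighted_sum = sum(w * y for w, y in zip(weights, item_scores))
        curve.append(weighted_sum / sum(weights))
    return curve

# Hypothetical data: a severity proxy and a 0/1 endorsement of one item.
severity = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
endorsed = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

# Estimate the probability of endorsement at low, middle, and high severity.
grid = [2.0, 5.0, 8.0]
curve = smoothed_item_curve(severity, endorsed, grid, bandwidth=1.5)
```

Under the same assumptions, estimating such a curve separately for two groups on a common severity grid and comparing the curves point by point gives a rough check for differential item functioning, because the groups are compared at matched levels of severity rather than on their overall means.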


Analytic techniques originally used in these papers (Santor, Ramsay, & Zuroff, 1995; Santor, Zuroff, Ramsay, Cervantes & Palacios, 1997) have since been adopted by other researchers and have been used in developing the major revision of the Beck Depression Inventory (Beck, Steer, & Brown, 1996). This work has also been cited in two Psychological Bulletin articles (Flett, Vredenberg, & Krames, 1997; Hartung & Widiger, 1998).