Item response modeling of inter- and intra-rater variations in severity and local dependence

Wen Chung WANG

Research output: Contribution to conference › Paper


Many assessments require raters to make subjective judgments, such as essay items, job interviews, and performance appraisals in beauty, dance, singing, or sports contests. To reduce rater error and increase reliability, an item response or a performance is often graded by multiple raters; the resulting scores are referred to as repeated ratings. Within the framework of item response theory (IRT), the facets model (Linacre, 1989) is commonly fit to rater data. The facets model considers only inter-rater variation in severity, because each rater is assigned a single fixed-effect parameter (Dk) to describe his or her severity. A rater might, however, hold different degrees of severity over the course of the rating process. To capture both inter-rater and intra-rater variation in severity, Wang and Wilson (2005) added to the facets model a set of random-effect parameters describing the interaction between person n and rater k, each assumed to follow a normal distribution with mean zero and a rater-specific variance. That variance describes the intra-rater variation in severity of rater k: the larger the variance, the greater the intra-rater variation. Although these formulations consider both inter- and intra-rater variation in severity, they fail to account for local dependence among repeated ratings, which is likely to occur when raters interact with one another, for example, through group discussion before scoring individually. Even without formal discussion, a rater might be aware of other raters’ views (e.g., through interaction or body language). Such discussion or awareness can turn “independent” judgments into “dependent” ones. That is, ratings given to the same performance by different raters may correlate more highly than ratings given to different performances by different raters, even conditional on person ability, item difficulty, and rater severity.
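The equations themselves are not reproduced in this abstract. A standard rendering of the two models just described, with notation reconstructed from the text (person n, item i, rater k; the symbols for ability, item difficulty, and the random effect are assumptions), would be:

```latex
% Facets model (Linacre, 1989): only a fixed severity parameter D_k per rater
\log\frac{P_{nik}}{1 - P_{nik}} = \theta_n - \delta_i - D_k

% Wang & Wilson (2005) extension: a random person-by-rater effect
\log\frac{P_{nik}}{1 - P_{nik}} = \theta_n - \delta_i - D_k - \gamma_{nk},
\qquad \gamma_{nk} \sim N(0, \sigma_k^2)
```

Here σk² is the intra-rater variance for rater k described above; polytomous versions of both models would additionally carry step (threshold) parameters for the rating categories.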
To resolve this problem, we developed the generalized rater model (GRM), which adds an interaction effect between person n and rater k within item i; the other parameters are defined as above. The variance of this interaction term describes the magnitude of local dependence among repeated ratings given to item i: the larger the variance, the stronger the local dependence. The parameters of the GRM can be estimated with the freeware WinBUGS. A simulation study was conducted, and the results demonstrated good parameter recovery. A real data set was then analyzed, in which a total of 46 senior high schools in Taiwan submitted proposals to bid for creativity-education grants provided by the Taiwan government. Each proposal consisted of four parts, and each part was evaluated by a group of 12 experts on four or five criteria using a 5-point rating scale. Local dependence was likely to occur because the experts were allowed and encouraged to share their views with one another before giving a rating. We were especially interested in local dependence among repeated ratings within each of the four parts, as well as in inter- and intra-rater variations in severity. The proposed models (Equations 4 and 5 of the paper) were fit to the data and showed good fit.
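The GRM's defining feature, as described, is that the random person-by-rater effect is nested within each item, so its variance is item-specific. A sketch of this form (a reconstruction from the prose; the symbols are assumptions, not taken from the paper) is:

```latex
% Generalized rater model (GRM): person-by-rater effect within item i
\log\frac{P_{nik}}{1 - P_{nik}} = \theta_n - \delta_i - D_k - \gamma_{nik},
\qquad \gamma_{nik} \sim N(0, \sigma_i^2)
```

A large σi² means that the 12 experts' ratings of part i of the same proposal move together beyond what ability, difficulty, and severity explain, which is exactly the local dependence induced by pre-rating discussion.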


Conference: 2012 Annual Meeting of the American Educational Research Association: “Non Satis Scire: To Know Is Not Enough”
Abbreviated title: AERA 2012


Wang, W.-C. (2012, April). Item response modeling of inter- and intra-rater variations in severity and local dependence. Paper presented at the 2012 Annual Meeting of the American Educational Research Association (AERA): “Non Satis Scire: To Know Is Not Enough,” Vancouver, British Columbia, Canada.

