Do Interrater Correlations Estimate the Reliability of Job Performance Ratings?

assumptions of classical true score model

arnold on Oct 09, 2008

It depends on who you ask. Some researchers would say yes. They argue that correlations between ratings provided by multiple raters provide the “correct” estimates of the reliability of job performance ratings (Schmidt & Hunter, 1996; Viswesvaran et al., 1996). However, Murphy and DeShon (2000) provide an interesting insight into why this may not be the case. As they succinctly conclude in their paper, “interrater correlations are not reliability coefficients, and should not be interpreted as such” (p. 898). The rest of this post summarizes how they arrive at that conclusion. (Please read their article, cited at the end of this post, for the specifics of the argument.)

When correcting for measurement error in ratings of job performance, the choice of methods for estimating reliability is especially important. Two methods are widely used to estimate the reliability of performance ratings. The first, internal consistency, can be used to estimate intrarater reliability, which reflects the consistency of ratings provided by the same rater (for example, when the same rater completes the same assessment on two or more occasions). The second, agreement between raters, can be used to estimate interrater reliability, which reflects the extent of agreement or disagreement between similarly situated raters about individuals’ levels of job performance (Murphy & DeShon, 2000, p. 874).
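As a rough illustration of the two approaches (the data below are invented and the function names are mine, not anything used in the article), this Python sketch computes an internal consistency estimate (Cronbach's alpha) from one rater's item-level ratings and an interrater correlation between two raters' overall ratings:

```python
# Illustrative sketch: contrasting an internal consistency (intrarater) estimate
# with an interrater correlation on made-up performance ratings.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 ratees, each rated on 5 performance items by rater 1.
true_perf = rng.normal(size=50)
items_rater1 = true_perf[:, None] + rng.normal(scale=0.8, size=(50, 5))

def cronbach_alpha(items):
    """Internal consistency across rating items for a single rater."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# A second, similarly situated rater with her own leniency and random error.
items_rater2 = true_perf[:, None] + 0.5 + rng.normal(scale=0.8, size=(50, 5))

overall1 = items_rater1.mean(axis=1)
overall2 = items_rater2.mean(axis=1)
interrater_r = np.corrcoef(overall1, overall2)[0, 1]

print(f"internal consistency (alpha): {cronbach_alpha(items_rater1):.2f}")
print(f"interrater correlation:       {interrater_r:.2f}")
```

The two numbers estimate different things, which is exactly why the choice between them matters when correcting for measurement error.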

Researchers have often taken the position that correlations of ratings between raters (interrater correlations) provide the correct estimate of the reliability of job performance ratings (Schmidt & Hunter, 1996). Murphy and DeShon (2000) argue that correlations between raters can be the result of factors other than measurement error or true performance. This matters because “the choice of reliability estimates has a substantial impact on the conclusions one is likely to draw from research involving performance ratings” (Murphy & DeShon, 2000, p. 874). The correction for attenuation can be roughly five times larger when interrater reliability estimates are used than when internal consistency estimates are used.
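To see why this matters for corrections, recall the standard correction for attenuation, which divides an observed correlation by the square root of the assumed reliability of the ratings. The numbers below are purely illustrative (an observed correlation of .25, the interrater value of .52 cited later in this post, and a hypothetical internal consistency value of .86):

```latex
\[
\hat{\rho}_{xy} = \frac{r_{xy}}{\sqrt{r_{yy}}},
\qquad
\frac{.25}{\sqrt{.52}} \approx .35
\quad \text{vs.} \quad
\frac{.25}{\sqrt{.86}} \approx .27 .
\]
```

The lower the reliability estimate in the denominator, the larger the corrected correlation becomes, which is why the choice of estimate drives the conclusions.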

There are three key assumptions to the argument that interrater correlations provide reasonable estimates:

1. Similarly situated raters in organizations can be considered to function as alternate forms of the same measurement instrument.
2. Agreement between raters reflects true score variance in ratings.
3. Disagreement between raters is best conceptualized as measurement error (i.e., variance that is unrelated to true performance).

Murphy and De Shon’s (2000) arguments against these assumptions:

1. Judges are not equally knowledgeable and they observe different behaviors; different judges are not parallel forms; and rating is a complexly motivated activity driven by variables that have little to do with ratees’ performance.
2. The assumption that agreement between raters is necessarily due to ratees’ true performance levels is not well supported.
3. Generalizability theory suggests that variance due to raters is probably not “true score” variance, but it is also not necessarily a source of random measurement error.

They also evaluate rater effects against the assumptions of classical reliability theory:

1. Measurement errors have a mean of zero (unlikely to hold for ratings from two raters).
2. Measurement error is uncorrelated with true scores.
3. Measurement errors on two tests or measures are uncorrelated. (For assumptions 2 and 3, both characteristics of the rater and characteristics of the rating context are likely to substantially affect interrater correlations.)

The classical true score model of reliability suggests that the true score of the person being rated drives the correlation between two raters and that any observed differences between raters reflect random measurement error. Put simply: if the correlation is high, it is because the raters captured the true score; if the raters disagree, the disagreement is random measurement error. The authors do not agree and propose a more complex set of factors to explain why raters may agree or disagree.
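In symbols (a standard textbook presentation of the classical model, sketched here to make the assumptions explicit rather than quoted from the article), each rater's score is modeled as the ratee's true score plus random error, and only under these assumptions does the interrater correlation equal the reliability coefficient:

```latex
\[
X_1 = T + E_1, \qquad X_2 = T + E_2,
\]
\[
E(E_j) = 0, \qquad \mathrm{Cov}(T, E_j) = 0, \qquad \mathrm{Cov}(E_1, E_2) = 0,
\]
\[
\rho_{X_1 X_2}
= \frac{\mathrm{Cov}(X_1, X_2)}{\sigma_{X_1}\,\sigma_{X_2}}
= \frac{\sigma_T^2}{\sigma_X^2}
= \rho_{XX'} \quad \text{(the reliability coefficient),}
\]
```

provided the two raters behave as parallel forms with equal variances. Murphy and DeShon's point is that systematic rater effects and shared biases violate exactly these assumptions.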

Why raters might agree in their evaluations (four categories):

1. Shared goals and biases (e.g., both raters are trying to motivate ratees or maintain positive relationships)
2. Shared perceptions of the organization and of its appraisal systems (raters hold the same view of the organization)
3. Shared frames of reference (similar standards or expectations)
4. Shared relationships with ratees (e.g., both raters like the same employee)

Why raters might disagree in their evaluations (four categories):

1. Systematic differences in what is observed (different facets and amounts of behavior)
2. Systematic differences in access to information other than observations of performance (indirect reports from other employees and customers)
3. Systematic differences in expertise in interpreting what is observed (different observers may have different knowledge about what the observed results mean)
4. Systematic differences in evaluating what is observed (even especially knowledgeable observers may apply different standards)

As the preceding section shows, there are numerous sources of variance in ratings that are not directly a function of true performance, yet are not so idiosyncratic to individual raters that they can be treated as random measurement error. With generalizability theory, however, “the rater variance and the error variance can be separated in this design and that leads to the question of whether the rater variance should be lumped together with the error variance when estimating the generalizability coefficient (which is equivalent to the reliability coefficient in this design), or whether rater effects should be considered as variance that is neither true score nor error” (Murphy & DeShon, 2000, p. 885). “The assumption that ratings reflect true performance and random measurement error is just that—that is, an assumption that has been widely accepted, but that has little empirical or theoretical support” (Murphy & DeShon, 2000, p. 890). Because of this, a corrected correlation can be interpreted appropriately only if you start from the assumption that performance ratings are a true measure of performance and that the only error is random measurement error. Interrater correlations cannot be interpreted as reliability coefficients. What the coefficient can tell you, from the domain-sampling point of view, is the extent to which inferences based on the sample of raters can be generalized to the population of potential raters.
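To make the quoted point concrete, here is the standard variance decomposition for a single-facet person × rater generalizability design (a textbook formulation, not taken from the article), with the two coefficients one obtains depending on whether rater variance is counted as error:

```latex
\[
\sigma^2_X = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e},
\]
\[
\text{rater variance counted as error:}\quad
\Phi = \frac{\sigma^2_p}{\sigma^2_p + \frac{\sigma^2_r}{n_r} + \frac{\sigma^2_{pr,e}}{n_r}},
\qquad
\text{rater variance set aside:}\quad
E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \frac{\sigma^2_{pr,e}}{n_r}}.
\]
```

Here $\sigma^2_p$ is person (ratee) variance, $\sigma^2_r$ is rater variance, $\sigma^2_{pr,e}$ is the person-by-rater interaction confounded with residual error, and $n_r$ is the number of raters averaged over. The two coefficients diverge whenever rater variance is large, which is the crux of the quoted passage.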

Since the average correlation among raters has been found to be .52 (Viswesvaran et al., 1996), the good news is that you can be confident the reliability of ratings is at least .52; random measurement error will not account for all of the remaining 48% of the variance. Interrater correlations can substantially underestimate reliability, and corrections based on interrater correlations may spuriously inflate our estimates of the correlation between ratings and other measures.
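As a side note on the domain-sampling reading above, the Spearman-Brown formula (a standard psychometric result, used here only as an illustration and not as part of Murphy and DeShon's argument) shows what happens when ratings are pooled: treating .52 as the expected correlation between two similarly situated raters, it gives the expected correlation between the averages of k raters drawn from the same pool.

```python
# Spearman-Brown step-up (standard psychometric formula; illustration only,
# not a result from Murphy & DeShon, 2000).
def spearman_brown(single_rater_r: float, n_raters: int) -> float:
    """Expected correlation between averages of n_raters parallel ratings."""
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)

for k in (1, 2, 4, 8):
    print(f"average of {k} rater(s): {spearman_brown(0.52, k):.2f}")
# -> 0.52, 0.68, 0.81, 0.90
```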

Reference:

Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not estimate the reliability of job performance ratings. Personnel Psychology, 53, 873–900.
