5. Which measure(s) of intercoder reliability should researchers use?
There are literally dozens of different measures, or indices, of intercoder
reliability. Popping (1988) identified 39 different "agreement indices" for
coding nominal categories alone, a count that excludes several techniques for
interval- and ratio-level data. But only a handful of these techniques are
widely used.
In communication the most widely used indices are the following (see the
sketch after this list):
Percent agreement
Holsti's method
Scott's pi (π)
Cohen's kappa (κ)
Krippendorff's alpha (α)
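To make the differences among some of these concrete, here is a minimal
sketch in Python of percent agreement, Scott's pi, and Cohen's kappa for two
coders assigning nominal categories. It assumes both coders code every unit
(in which case Holsti's method reduces to percent agreement); the coder data
and function names are hypothetical, not drawn from the sources cited. Pi and
kappa differ only in how they estimate chance agreement: pi pools both
coders' codes into one distribution, while kappa multiplies each coder's own
marginal proportions.

from collections import Counter

def percent_agreement(c1, c2):
    # Proportion of units on which the two coders chose the same category.
    return sum(a == b for a, b in zip(c1, c2)) / len(c1)

def scotts_pi(c1, c2):
    # Scott's pi: expected agreement from the pooled distribution of all codes.
    n = len(c1)
    p_o = percent_agreement(c1, c2)
    pooled = Counter(c1) + Counter(c2)  # 2n codes in total
    p_e = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return (p_o - p_e) / (1 - p_e)

def cohens_kappa(c1, c2):
    # Cohen's kappa: expected agreement from each coder's own marginals.
    n = len(c1)
    p_o = percent_agreement(c1, c2)
    m1, m2 = Counter(c1), Counter(c2)
    p_e = sum((m1[c] / n) * (m2[c] / n) for c in set(m1) | set(m2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from two coders for ten units, two categories.
coder1 = ["pos"] * 6 + ["neg"] * 4
coder2 = ["pos"] * 4 + ["neg"] * 6

print(percent_agreement(coder1, coder2))  # 0.8
print(scotts_pi(coder1, coder2))          # 0.6
print(cohens_kappa(coder1, coder2))       # about 0.615

With these data pi and kappa diverge (0.600 versus roughly 0.615) precisely
because the coders' marginal distributions differ; when the two coders'
marginals are identical, the two indices coincide.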
Just some of the indices proposed, and in some cases widely
used, in other fields are Perreault and Leigh's (1989) Ir
measure; Tinsley and Weiss's (1975) T index; Bennett,
Alpert, and Goldstein's (1954) S index; Lin's (1989) concordance
coefficient; Hughes and Garrett's (1990) approach based
on Generalizability Theory; and Rust and Cooil's (1994) approach
based on "Proportional Reduction in Loss" (PRL).
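Of these, Bennett, Alpert, and Goldstein's S is the simplest to state: it
corrects percent agreement using a fixed chance term of 1/q, where q is the
number of available categories. A minimal sketch, reusing percent_agreement
and the hypothetical coder data from the code above:

def bennetts_s(c1, c2, q):
    # S treats chance agreement as 1/q for q available categories,
    # regardless of how often the coders actually used each one.
    p_o = percent_agreement(c1, c2)
    return (p_o - 1 / q) / (1 - 1 / q)

print(bennetts_s(coder1, coder2, q=2))  # 0.6 with two available categories
print(bennetts_s(coder1, coder2, q=3))  # 0.7 if a third, unused category was available

Note how S rises simply because more categories were available; this
dependence on q is one of the index's commonly cited limitations.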
It would be nice if there were one universally accepted index
of intercoder reliability. But despite all the effort that scholars,
methodologists, and statisticians have devoted to developing and testing
indices, there is no consensus on a single "best" one.
While Cohen's kappa has been recommended by several writers (e.g., Dewey
(1983) argued that, despite its drawbacks, kappa should still be "the
measure of choice"), and the index appears to be commonly used in research
that involves the coding of behavior (Bakeman, 2000), others (notably
Krippendorff, 1978, 1987) have argued that its characteristics make it
inappropriate as a measure of intercoder agreement.