Statistical Concepts Series
1 From the Department of Radiology (H.L.K.) and MCP Hahnemann School of Public Health (M.P.), University of Pennsylvania Medical Center, 3600 Market St, Suite 370, Philadelphia, PA 19104. Received November 21, 2001; revision requested January 29, 2002; revision received March 4; accepted March 11. Supported by grant P01 CA 53141 from the National Cancer Institute, National Institutes of Health, U.S. Public Health Service, Department of Health and Human Services. Address correspondence to H.L.K. (e-mail: kundel@rad.upenn.edu).
| ABSTRACT |
|---|
and weighted κ. Examples from the imaging literature illustrate the method of calculation and the effects of both disease prevalence and the number of rating categories. Other measures of agreement that are used less frequently, including multiple-rater κ, are referenced and described briefly. © RSNA, 2003
Index terms: Diagnostic radiology, observer performance; Statistical analysis
| INTRODUCTION |
|---|
We illustrate the standard methods for description of agreement in regard to categorical data and point out the advantages and disadvantages of the use of these methods. We refer to some of the less common, although not less important, methods but do not describe them. Then we describe some current developments in methods for use of agreement to estimate accuracy. The discussion is limited to data that can be assigned to categories, such as positive or negative; high, medium, or low; class I-V. Data, such as lesion volume or heart size, that are collected on a continuous scale are more appropriately analyzed with methods of correlation.
| MEASUREMENT OF AGREEMENT OF TWO READERS |
|---|
The negative agreement, which we will call pneg, can be calculated in a similar way as follows:
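The overall, positive, and negative agreement can be sketched in a few lines of Python. The cell labels `a`, `b`, `c`, `d` (both readers positive, reader 1 positive only, reader 2 positive only, both readers negative) and the counts in the usage line are assumptions made for illustration; they are not the data of Table 1. The formulas used are the usual proportions of specific agreement, in which the denominator sums the two readers' totals for the response in question.

```python
def agreement_indices(a, b, c, d):
    """Return (po, ppos, pneg) for a 2 x 2 table of two readers.

    a: both positive, b: reader 1 positive only,
    c: reader 2 positive only, d: both negative.
    """
    n = a + b + c + d
    po = (a + d) / n                  # overall proportion of agreement
    ppos = 2 * a / (2 * a + b + c)    # agreement on positive readings
    pneg = 2 * d / (2 * d + b + c)    # agreement on negative readings
    return po, ppos, pneg


# Hypothetical counts, chosen only to show the call.
po, ppos, pneg = agreement_indices(a=10, b=5, c=5, d=80)
print(f"po = {po:.2f}, ppos = {ppos:.2f}, pneg = {pneg:.2f}")
```

With many negative cases, pneg stays high even when ppos is modest, which is why the two are worth reporting separately.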
| COHEN κ |
|---|
κ has been developed as a measure of agreement that is corrected for chance. The κ is calculated by subtracting the proportion of the readings that are expected to agree by chance, which we will call pe, from the overall agreement, po, and dividing the remainder by the proportion of cases on which agreement is not expected to occur by chance. This is demonstrated in Equation (1) as follows:
$$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (1)$$
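As a minimal sketch, the verbal definition above translates directly into a small Python function. The function name and the guard against pe = 1 are assumptions added for illustration; the three calls show the limiting values of the index.

```python
def kappa(po, pe):
    """Chance-corrected agreement from observed (po) and expected (pe) agreement."""
    if pe >= 1.0:
        # Nothing is left to agree on beyond chance, so the ratio is undefined.
        raise ValueError("kappa is undefined when chance agreement pe equals 1")
    return (po - pe) / (1.0 - pe)


print(kappa(0.5, 0.5))   # observed agreement equals chance agreement
print(kappa(1.0, 0.5))   # perfect agreement
print(kappa(0.0, 0.5))   # complete disagreement
```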
Another way to view κ is that if the readers read different images and the readings were paired, some agreement, namely po, would be observed. The observed agreement would occur purely by chance. The agreement that is expected to occur by chance, which we shall designate pe, can be calculated. When the readings of different images are compared, the observed value, namely the po, should equal the expected value, pe, because there is no agreement beyond chance and κ is zero.
The joint agreement that is expected because of chance is calculated for each combination by multiplying the corresponding total responses of the two readers, which are contained in the marginal totals of the data table. From Table 1, the agreement expected by chance for the joint positive and joint negative responses is calculated with the following equation:
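The marginal-total calculation can be sketched as follows. The counts are hypothetical and are not the cell values of Table 1; the function name and the table layout (rows for reader 1, columns for reader 2) are likewise assumptions made for the example.

```python
def expected_agreement(table):
    """Chance-expected agreement pe for a square table of counts.

    table[i][j] is the number of cases that reader 1 placed in
    category i and reader 2 placed in category j.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]          # reader 1 marginal totals
    col_totals = [sum(col) for col in zip(*table)]    # reader 2 marginal totals
    # For each category, multiply the two readers' totals, then sum and normalize.
    return sum(r * c for r, c in zip(row_totals, col_totals)) / n**2


# Hypothetical counts: rows = reader 1 (positive, negative),
# columns = reader 2 (positive, negative).
counts = [[10, 5],
          [5, 80]]
pe = expected_agreement(counts)
print(f"pe = {pe:.3f}")
```

Because only the marginal totals enter the result, two tables with the same totals have the same pe regardless of how the individual cases are paired.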
The value for κ is 0.31, as calculated with Equation (1).
The standard error, which we will call SE, of κ for a 2 x 2 table can be estimated with the following equation:
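One simple large-sample approximation that is often quoted for this purpose is SE(κ) = sqrt(po(1 - po) / (n(1 - pe)^2)), where n is the number of cases. That this is the equation intended here is an assumption, and the input values in the sketch below are hypothetical.

```python
import math


def kappa_se_simple(po, pe, n):
    """Simple large-sample approximation to the standard error of kappa.

    Assumed form: sqrt(po * (1 - po) / (n * (1 - pe)**2)).
    """
    return math.sqrt(po * (1.0 - po) / (n * (1.0 - pe) ** 2))


# Hypothetical values: 100 cases, po = 0.90, pe = 0.745.
se = kappa_se_simple(po=0.90, pe=0.745, n=100)
print(f"SE = {se:.3f}")
```

The standard error shrinks with the square root of the number of cases and grows as pe approaches 1, that is, when prevalence is very high or very low.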
A more accurate and more complicated equation for the standard error of κ can be found in most books about statistics (6,7).
The 95% CIs of κ can be calculated as follows:
$$\text{95\% CI} = \kappa \pm 1.96 \times \text{SE}(\kappa)$$
For example, the 95% CIs are 0.31 - 1.96 x 0.14 = 0.04 and 0.31 + 1.96 x 0.14 = 0.58.
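The same arithmetic can be checked with a short Python sketch; the function name and the default multiplier argument are assumptions made for illustration.

```python
def kappa_ci(kappa, se, z=1.96):
    """Normal-approximation confidence limits for kappa (z = 1.96 for 95%)."""
    half_width = z * se
    return kappa - half_width, kappa + half_width


lower, upper = kappa_ci(0.31, 0.14)
print(f"{lower:.2f} to {upper:.2f}")   # 0.04 to 0.58
```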
Thus, what is the meaning of a κ of 0.31, together with an overall agreement of 0.85? The calculated value of κ can range from -1.00 to +1.00, but for practical purposes the range from zero to +1.00 is of interest. A κ of zero means that there is no agreement beyond chance, and a κ of 1.00 means that there is perfect agreement. Interpretations of intermediate values are subjective. Table 2 shows the strength of agreement beyond chance for various ranges of κ that were suggested by Landis and Koch (8). The choice of intervals is entirely arbitrary but has become ingrained with frequent usage. The values calculated from Table 1 show that there is good overall agreement (po = 0.85) but only fair chance-corrected agreement (κ = 0.31). This paradoxical result is caused by the high prevalence of negative cases.
Prevalence effects can lead to situations in which the values of κ do not correspond with intuition (9,10). This is illustrated with the data in Tables 3 and 4, which were extrapolated, with a bit of adjustment to make the numbers come out even, from a data set collected during a study of readings of portable chest images obtained in a medical intensive care unit (11). Table 3 shows the agreement of the reports of two of the readers concerning the position of tubes and catheters. An incorrectly positioned tube or catheter was defined as a positive reading. Table 4 shows the agreement in regard to the reports of the same two readers about the presence of radiographic signs of congestive heart failure. The example was chosen because the actual values of κ for the two diagnoses were very close.
The readings about the position of tubes and catheters have a κ of 0.52, which happens to be the same κ as that for congestive heart failure. The result is not intuitively appealing, because a relatively simple decision such as that about the location of a catheter tip should have a higher index of agreement than a more difficult decision such as that concerning a diagnosis of congestive heart failure. Feinstein and Cicchetti (9) have pointed out the paradox of high overall agreement and low κ, and Cicchetti and Feinstein (10) suggest that when investigators report the results of studies of agreement they should include the three indices of κ, positive agreement, and negative agreement. We agree that this is a useful way of showing agreement data, because it provides more details about where disagreements occur and alerts the reader to the possibility of effects caused by prevalence or prior knowledge.
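The prevalence effect can be reproduced with a small sketch that reports the three indices together. Both tables below are hypothetical (they are not the data of Tables 3 and 4) and were chosen so that the overall agreement is 0.90 in each; the cell labels `a`, `b`, `c`, `d` (both positive, reader 1 positive only, reader 2 positive only, both negative) are an assumed notation.

```python
def three_indices(a, b, c, d):
    """Return (kappa, ppos, pneg) for a 2 x 2 table of two readers."""
    n = a + b + c + d
    po = (a + d) / n
    # Chance agreement from the marginal totals of the two readers.
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (po - pe) / (1.0 - pe)
    ppos = 2 * a / (2 * a + b + c)
    pneg = 2 * d / (2 * d + b + c)
    return kappa, ppos, pneg


# Same overall agreement (0.90), different prevalence of positive readings.
low_prevalence = three_indices(a=5, b=5, c=5, d=85)
balanced = three_indices(a=45, b=5, c=5, d=45)

for name, (k, ppos, pneg) in (("low prevalence", low_prevalence),
                              ("balanced", balanced)):
    print(f"{name}: kappa = {k:.2f}, ppos = {ppos:.2f}, pneg = {pneg:.2f}")
```

In the low-prevalence table κ drops to about 0.44 although po is unchanged, and the low ppos shows that the disagreements are concentrated in the positive readings.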
| WEIGHTED κ FOR MULTIPLE CATEGORIES |
|---|
κ can be calculated for two readers who report results with multiple categories. As the number of categories increases, the value of κ decreases because there is more room for disagreement with more categories. However, when findings are reported by using a ranked variable, the relative importance of disagreement between categories may not be the same for adjacent categories as it is for distant categories. Two readers who consistently disagree about minimal and moderate categories would have the same value for κ calculated in the usual way as would two readers who consistently disagree about minimal and severe categories. A method for calculation of κ has been developed that allows for differences in the importance of disagreements. The usual approach is to assign weights between 1.00 and zero to each agreement pair, where 1.00 represents perfect agreement and zero represents no agreement. Assignment of weights can be very subjective and can confuse comparison of κ values between studies in which different weights were used. For theoretical reasons, Fleiss (7) suggests assignment of weights as follows:
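$$w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2}$$
where w_ij is the weight for the cell in row i and column j, and k is the number of categories.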
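A short sketch of these quadratic agreement weights, w = 1 - (i - j)^2 / (k - 1)^2 for the cell in row i and column j of a k-category table, is given below. The function name is an assumption; for four categories the first row reproduces the weights 1.00, 0.89, 0.56, and 0 that are used in the Appendix.

```python
def agreement_weights(k):
    """Quadratic agreement weights for k ordered categories (k >= 2)."""
    return [[1.0 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)]
            for i in range(k)]


for row in agreement_weights(4):
    print([round(w, 2) for w in row])
```

The weight depends only on how many categories apart the two readings are, so the matrix is symmetric with 1.00 on the diagonal.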
A calculation of weighted κ by using four categories is presented in the Appendix. In the example in the Appendix, the categories of absent, minimal, moderate, and severe are used. The weighted and unweighted values for po and κ are included in Table 6. The calculations were repeated by collapsing the data for four categories first into three and then into two categories: first, the minimal and moderate categories were combined; then the minimal, moderate, and severe categories were combined, which leaves two categories that are equivalent to normal and abnormal. Table 6 shows that the value of κ increases as the number of categories is decreased, thus indicating better agreement when the fine distinctions are eliminated. The weighted κ is greater than the unweighted κ when multiple categories are used and is the same as the unweighted κ when only two categories are used. Some investigators prefer to use multiple categories because they are a better reflection of actual clinical decisions, and if sensible weighting can be achieved, the weighted κ may reflect the actual agreement better than does the unweighted κ.
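The comparison of weighted and unweighted κ can be sketched as follows. The 4 x 4 table of counts is hypothetical (it is not the Appendix data), the default weights are the quadratic agreement weights 1 - (i - j)^2 / (k - 1)^2, and passing an identity weight matrix gives the unweighted index; the function and variable names are assumptions made for the example.

```python
def weighted_kappa(table, weights=None):
    """Weighted kappa for a k x k table of counts (rows: reader 1, columns: reader 2)."""
    k = len(table)
    if weights is None:
        # Quadratic agreement weights: 1 on the diagonal, 0 for the most distant pair.
        weights = [[1.0 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)]
                   for i in range(k)]
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(col) / n for col in zip(*table)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    po_w = sum(weights[i][j] * table[i][j] / n for i, j in cells)       # observed
    pe_w = sum(weights[i][j] * row_p[i] * col_p[j] for i, j in cells)   # expected
    return (po_w - pe_w) / (1.0 - pe_w)


# Hypothetical counts for the categories absent, minimal, moderate, severe.
counts = [[30, 5, 1, 0],
          [5, 15, 4, 1],
          [1, 4, 10, 5],
          [0, 1, 5, 13]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

kw = weighted_kappa(counts)
ku = weighted_kappa(counts, weights=identity)
print(f"weighted kappa = {kw:.2f}, unweighted kappa = {ku:.2f}")
```

Because most disagreements in this table fall in adjacent categories, they receive partial credit and the weighted value is higher. With only two categories the quadratic weights reduce to the identity matrix, so the two indices coincide.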
| ESTIMATION OF κ FOR MULTIPLE READERS |
|---|
With multiple readers, one approach is to calculate κ for pairs of readers and then compute an average κ for all possible pairs (12-14). Fleiss (7) describes a method for calculation of a κ index for multiple readers. It has not been used very much in diagnostic imaging, although it has been reported in some studies along with values for weighted κ (15).
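The pairwise approach can be sketched in Python as follows. This illustrates the averaging of two-reader κ values only; it is not the multiple-reader index described by Fleiss. The reader labels, the ratings, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kappa_two_readers(x, y):
    """Unweighted kappa for two equally long lists of categorical readings."""
    n = len(x)
    po = sum(1 for xi, yi in zip(x, y) if xi == yi) / n
    cx, cy = Counter(x), Counter(y)
    # Chance agreement: product of the two readers' totals for each category.
    pe = sum(cx[cat] * cy[cat] for cat in cx) / n**2
    if pe == 1.0:
        raise ValueError("kappa is undefined when both readers use a single category")
    return (po - pe) / (1.0 - pe)


def mean_pairwise_kappa(readings):
    """Average kappa over all possible pairs of readers."""
    pairs = list(combinations(readings, 2))
    return sum(kappa_two_readers(readings[a], readings[b]) for a, b in pairs) / len(pairs)


# Hypothetical yes-no readings (1 = positive) of four cases by three readers.
readings = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
}
mean_k = mean_pairwise_kappa(readings)
print(f"mean pairwise kappa = {mean_k:.2f}")
```

Here readers A and B agree perfectly (κ = 1), whereas each agrees with reader C no better than chance (κ = 0), so the average over the three pairs is one third.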
| ADVANTAGES AND DISADVANTAGES OF THE κ INDEX |
|---|
κ has the advantage that it is corrected for the agreement that is expected by chance, and there is an accepted method for computing confidence limits and for statistical testing. The main disadvantage of κ is that the scale is not free of dependence on disease prevalence or the number of rating categories. As a consequence, it is difficult to interpret the meaning of any absolute value of κ, although it is still useful in experiments in which a control for prevalence and for the number of categories is used. The prevalence bias makes it difficult to compare the results of clinical studies where disease prevalence may vary; for example, this may occur in studies about the screening and diagnosis of breast cancer. To prevent misunderstanding when generalizations are made, the disease prevalence should always be reported when κ is used.
| RELATIONSHIP BETWEEN AGREEMENT AND ACCURACY |
|---|
κ has been shown to be inconsistent with accuracy as measured by the area under the receiver operating characteristic curve (16) and should not be used as a surrogate for accuracy. Different areas under the receiver operating characteristic curve can have the same κ, and the same areas under the receiver operating characteristic curve can have different κ values. For example, Taplin et al (14) studied the accuracy and agreement of single- and double-reading screening mammograms by using the area under the receiver operating characteristic curve and κ. The study included 31 radiologists who read 120 mammograms. The mean area under the receiver operating characteristic curve for single-reading mammograms was 0.85, and that for double-reading mammograms was 0.87. However, the average unweighted κ for patients with cancer was 0.41 for single-reading mammograms and 0.71 for double-reading mammograms. The average unweighted κ for patients without cancer was 0.26 for single-reading mammograms and 0.34 for double-reading mammograms. Double reading of mammograms resulted in better agreement but not in better accuracy.
If we assume that agreement implies accuracy, then we can use measurements of observed agreement to set a lower limit for accuracy. Suppose two readers agree with respect to interpretation in 50% of the cases; then, by implication, they are both correct in the cases about which they agree (50% of the total), and in each case about which they disagree only one of them is correct, so that on average each reader is correct in half of those cases (25% of the total). Therefore, the overall accuracy of the readings is 75%. Typically, in radiology, observed between-reader agreement is 70%-80%, implying an accuracy of 85%-90% (ie, 70% + 30%/2 to 80% + 20%/2).
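The arithmetic of this argument is simple enough to sketch in code. The function name is an assumption, and the calculation holds only under the stated premise that readers who agree are correct and that exactly one reader is correct when they disagree.

```python
def accuracy_from_agreement(agreement):
    """Accuracy implied by between-reader agreement under the stated premise.

    Agreed cases count as correct; each reader is credited with
    half of the cases about which the readers disagree.
    """
    return agreement + (1.0 - agreement) / 2.0


for a in (0.50, 0.70, 0.80):
    print(f"agreement {a:.0%} -> implied accuracy {accuracy_from_agreement(a):.0%}")
```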
Some new approaches to estimation of accuracy from agreement have been proposed. These approaches are based on the assumption that when a majority of readers agree about a diagnosis they are likely to be right (4,17). We have proposed the use of a technique called mixture distribution analysis (4,18). At least five readers report the cases by using either a yes-no response or a rating scale. The agreement of the group of readers about each case is fit to a mathematical model, with the assumption that the sample was drawn from a population that consists of easy normal, easy abnormal, and hard cases. A computer program locates the population that best fits the sample and calculates an overall measure of performance that we call the relative percentage agreement. We have found that the relative percentage agreement has values similar to those obtained by using receiver operating characteristic curve analysis with proved cases (18,19).
| CONCLUSION |
|---|
| APPENDIX |
|---|
The results for observed weighted proportions are presented in Table A4. The expected agreement is calculated by multiplying the row and column total for each cell of the 4 x 4 table by the corresponding weighting factor. The calculations for the first row are as follows: (0.42 x 0.38) x 1.00 = 0.16, (0.42 x 0.22) x 0.89 = 0.08, (0.42 x 0.15) x 0.56 = 0.03, and (0.42 x 0.25) x 0 = 0.
Applying Equation (1) to the weighted values, we get a weighted κ index of 0.76.
The unweighted κ can be calculated by using the sum of the diagonal cells in Table A2, or 0.31 + 0.07 + 0.04 + 0.13 = 0.55, to calculate the observed agreement and the sum of the diagonal cells in Table A4 with regard to expected weighted proportions, or 0.16 + 0.05 + 0.03 + 0.04 = 0.28, to calculate the expected agreement. The unweighted κ is 0.37.
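The unweighted calculation can be checked with a few lines of Python that use only the rounded diagonal values quoted above; the variable names are assumptions. With these rounded inputs the ratio is 0.375, which agrees with the reported 0.37 to within the rounding of the inputs.

```python
# Diagonal cells quoted in the text (rounded to two decimal places).
observed_diagonal = [0.31, 0.07, 0.04, 0.13]
expected_diagonal = [0.16, 0.05, 0.03, 0.04]

po = sum(observed_diagonal)   # observed agreement, about 0.55
pe = sum(expected_diagonal)   # chance-expected agreement, about 0.28
kappa_unweighted = (po - pe) / (1.0 - pe)

print(f"po = {po:.2f}, pe = {pe:.2f}, kappa = {kappa_unweighted:.3f}")
```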
The calculation of the appropriate standard error and the use of the standard error for testing either the hypothesis that κ is different from zero or that κ is different from a value other than zero is beyond the scope of this article but can be found in most basic statistical texts (6,7).
GLOSSARY
Below is a list of common terms and definitions related to the measurement of observer agreement.
Accuracy. This value is the likelihood of the interpretation being correct when compared with an independent standard.
Agreement. This term represents the likelihood that one reader will indicate the same responses as another reader.
Attributes. An attribute is a categorical variable that represents a property of the object being imaged (eg, tumor descriptors such as mass, calcification, and architectural distortion).
Categorical variables. Categorical variables are variables that can be assigned to specific categories. Categorical variables can be either ranked variables or attributes.
κ. The κ value is an overall measure of agreement that is corrected for agreement by chance. It is sensitive to disease prevalence.
Marginal sums. A marginal sum is the sum of the responses in a single row or column of the data table, and it represents the total response of one of the readers.
Measurement variable. Measurement variables are variables that can be measured or counted. They are generally divided into continuous variables (eg, lesion diameter or volume) and discrete variables (eg, number of lesions, expressed as whole numbers but never as decimal fractions).
Prevalence. Prevalence is the proportion of a particular class of cases in the population being studied.
Ranked variables. Ranked variables are categorical variables that have a natural order, such as stage of a disease, histologic grade, or discrete severity index (ie, mild, moderate, or severe).
Reliability. Reliability is the likelihood that one reader will provide the same responses as those provided by a large consensus group.
Weighted κ. The weighted κ is an overall measure of agreement that is corrected for agreement by chance; a weighting factor is applied to each pair of disagreements to account for the importance of the disagreement.
| REFERENCES |
|---|