Radiology
Published online before print June 20, 2003, 10.1148/radiol.2282011860
(Radiology 2003;228:303-308.)
© RSNA, 2003


Statistical Concepts Series

Measurement of Observer Agreement¹

Harold L. Kundel, MD and Marcia Polansky, ScD

1 From the Department of Radiology (H.L.K.) and MCP Hahnemann School of Public Health (M.P.), University of Pennsylvania Medical Center, 3600 Market St, Suite 370, Philadelphia, PA 19104. Received November 21, 2001; revision requested January 29, 2002; revision received March 4; accepted March 11. Supported by grant P01 CA 53141 from the National Cancer Institute, National Institutes of Health, U.S. Public Health Service, Department of Health and Human Services. Address correspondence to H.L.K. (e-mail: kundel@rad.upenn.edu).


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MEASUREMENT OF AGREEMENT OF...
 COHEN κ
 WEIGHTED κ FOR MULTIPLE...
 ESTIMATION OF κ FOR...
 ADVANTAGES AND DISADVANTAGES OF...
 RELATIONSHIP BETWEEN AGREEMENT...
 CONCLUSION
 APPENDIX
 REFERENCES
 
Statistical measures are described that are used in diagnostic imaging for expressing observer agreement in regard to categorical data. The measures are used to characterize the reliability of imaging methods and the reproducibility of disease classifications and, occasionally and with great care, as a surrogate for accuracy. The review concentrates on the chance-corrected indices, κ and weighted κ. Examples from the imaging literature illustrate the method of calculation and the effects of both disease prevalence and the number of rating categories. Other measures of agreement that are used less frequently, including multiple-rater κ, are referenced and described briefly.

Index terms: Diagnostic radiology, observer performance • Statistical analysis


    INTRODUCTION
The statistical analysis of observer agreement in imaging is generally performed for three reasons. First, observer agreement provides information about the reliability of imaging diagnosis. A reliable method should produce good agreement when used by knowledgeable observers. Second, observer agreement can be used to check the consistency of a method for classification of an abnormality that indicates the extent or severity of disease (1) and to determine the reliability of various signs of disease (2). It can also be used to compare the performance of humans and computers (3). Third, observer agreement can provide a general estimate of the value of an imaging technique when the lack of an independent method of proving the diagnosis precludes the measurement of sensitivity and specificity or the more general receiver operating characteristic curve. In many clinical situations, imaging provides the best evidence of abnormality. Furthermore, even if an independent method for obtaining proof exists, it may be difficult to use. A biopsy cannot be performed on every suspected lesion to obtain a specific tissue diagnosis. As we will demonstrate, currently popular measures of agreement do not necessarily reflect accuracy. However, there are statistical techniques that use the agreement of multiple expert readers (4) or the agreement of multiple tests (5) to estimate the underlying accuracy of the test.

We illustrate the standard methods for description of agreement in regard to categorical data and point out the advantages and disadvantages of the use of these methods. We refer to some of the less common, although not less important, methods but do not describe them. Then we describe some current developments in methods for use of agreement to estimate accuracy. The discussion is limited to data that can be assigned to categories, such as positive or negative; high, medium, or low; class I–V. Data, such as lesion volume or heart size, that are collected on a continuous scale are more appropriately analyzed with methods of correlation.


    MEASUREMENT OF AGREEMENT OF TWO READERS
Consider readings of the same 150 images that are reported as either positive or negative by two readers. The results are shown in Table 1 as joint agreement in a 2 x 2 format, with the responses of each reader as marginal totals. Three general indices of agreement can be derived from Table 1. The overall proportion of agreement, which we will call po, is calculated as follows:
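The equation is not reproduced in this copy. A reconstruction from the definition in the text follows; the cell labels are our assumption: a is the number of images both readers call positive, d the number both call negative, b and c the two kinds of disagreement, and N = a + b + c + d.

```latex
p_o = \frac{a + d}{N}
```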



 
TABLE 1. Joint Judgment of Two Readers about Same 150 Images

 
The proportion is useful for calculations, but the result is usually expressed as a percentage. A po of 0.85 indicates that the two readers agree in regard to 85% of their interpretations. If the number of negative readings is large relative to the number of positive readings, the agreement in regard to negative readings will dominate the value of po and may give a false impression of performance. For example, suppose that 90% of the cases are actually negative, and two readers agree about all of the negative interpretations but disagree about the positive interpretations. The overall agreement will be at least 90% and may be greater depending on the number of positive interpretations on which they agree. As an alternative to the overall agreement, the positive and negative agreement can be estimated separately. This will give an indication of the type of decision on which readers disagree. The positive agreement, which we will call ppos, is the number of positive readings that both readers agree on divided by all of the positive readings for both readers. For the data in Table 1, the positive agreement is calculated with the following equation:
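The equation is missing from this copy. Following the verbal definition above, and assuming the cell labels a (both readers positive), b and c (the two kinds of disagreement), and d (both readers negative), it can be reconstructed as:

```latex
p_{pos} = \frac{a + a}{(a + b) + (a + c)} = \frac{2a}{2a + b + c}
```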

The negative agreement, which we will call pneg, can be calculated in a similar way as follows:
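The equation is missing from this copy. By symmetry with the positive agreement, and with the same assumed cell labels (a both positive, b and c disagreements, d both negative), a reconstruction is:

```latex
p_{neg} = \frac{d + d}{(c + d) + (b + d)} = \frac{2d}{2d + b + c}
```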

In the example given in Table 1, although the two readers agree 85% of the time overall, they agree on positive interpretations only 39% of the time, whereas they agree on negative interpretations 92% of the time. The advantage of calculation of ppos and pneg is that any imbalance in the proportion of positive and negative responses becomes apparent, as in the example. The disadvantage is that confidence intervals (CIs) cannot be calculated.
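The three indices can be computed in a few lines. The sketch below is illustrative only: the function name and the cell counts are hypothetical. The counts were chosen to sum to 150 and to be consistent with the summary values quoted in the text; they are not taken from Table 1, which is not reproduced in this copy.

```python
def agreement_indices(a, b, c, d):
    """Return (po, ppos, pneg) for a 2 x 2 agreement table.

    a = both readers positive, d = both readers negative,
    b = reader 1 positive / reader 2 negative,
    c = reader 1 negative / reader 2 positive.
    """
    n = a + b + c + d
    po = (a + d) / n                  # overall proportion of agreement
    ppos = 2 * a / (2 * a + b + c)    # agreement on positive readings
    pneg = 2 * d / (2 * d + b + c)    # agreement on negative readings
    return po, ppos, pneg


# Hypothetical counts (assumption), chosen to match the reported summaries.
po, ppos, pneg = agreement_indices(a=7, b=10, c=12, d=121)
print(f"po={po:.2f}  ppos={ppos:.2f}  pneg={pneg:.2f}")
```

With these assumed counts, the large number of joint negative readings dominates po, which is the imbalance that ppos and pneg are meant to expose.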


    COHEN κ
Some of the observer agreement concerning findings of imaging tests can be caused by chance. For example, chance agreement occurs when the readers know in advance that most of the cases are negative and they adopt a reading strategy of reporting a case as negative whenever they are in doubt. Both will have a large percentage of negative agreements because of prior knowledge of the prevalence of negative cases, not because of information obtained from viewing of the images. An index called κ has been developed as a measure of agreement that is corrected for chance. The κ is calculated by subtracting the proportion of the readings that are expected to agree by chance, which we will call pe, from the overall agreement, po, and dividing the remainder by the proportion of cases on which agreement is not expected to occur by chance. This is demonstrated in Equation (1) as follows:
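Equation (1) is missing from this copy. From the verbal definition above (chance agreement subtracted from observed agreement, divided by the proportion not expected to agree by chance), it can be written as:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e} \tag{1}
```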

Another way to view κ is to suppose that the readers read different images and that the readings were then paired. Some agreement, namely po, would still be observed, but it would occur purely by chance. The agreement that is expected to occur by chance, which we shall designate pe, can be calculated. When the readings of different images are compared, the observed value, po, should equal the expected value, pe, because there is no agreement beyond chance, and κ is zero.

The joint agreement that is expected because of chance is calculated for each combination with multiplication of the total responses of each reader contained in the marginal totals of the data table. From Table 1, the agreement expected by chance for the joint positive and joint negative responses is calculated with the following equation:
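The equation is missing from this copy. A reconstruction, assuming the cell labels a (both positive), b (reader 1 positive, reader 2 negative), c (reader 1 negative, reader 2 positive), d (both negative), and N = a + b + c + d, so that the marginal totals are a + b and c + d for reader 1 and a + c and b + d for reader 2:

```latex
p_e = \frac{(a + b)(a + c)}{N^2} + \frac{(c + d)(b + d)}{N^2}
```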

The value for κ is 0.31, as is calculated with this equation:
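The worked equation is missing from this copy. The calculation can be sketched as follows; the function name and the cell counts are hypothetical (the counts were chosen to be consistent with the reported po of 0.85 and κ of 0.31, and are not taken from Table 1).

```python
def cohen_kappa_2x2(a, b, c, d):
    """Return (po, pe, kappa) for a 2 x 2 agreement table.

    a = both positive, b = reader 1 positive only,
    c = reader 2 positive only, d = both negative.
    """
    n = a + b + c + d
    po = (a + d) / n
    # Chance agreement: products of the readers' marginal totals.
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    kappa = (po - pe) / (1 - pe)
    return po, pe, kappa


# Hypothetical counts (assumption), consistent with the reported summaries.
po, pe, kappa = cohen_kappa_2x2(a=7, b=10, c=12, d=121)
print(f"po={po:.2f}  pe={pe:.2f}  kappa={kappa:.2f}")
```

Note that the function divides by 1 − pe, so it is undefined when both readers place every case in the same single category.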

The standard error, which we will call SE, of κ for a 2 x 2 table can be estimated with the following equation:
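The equation is missing from this copy. One commonly used large-sample approximation, which we assume is the one intended because it gives a value close to the 0.14 used in the example below, is the following, where N is the number of cases:

```latex
SE(\kappa) = \sqrt{\frac{p_o (1 - p_o)}{N (1 - p_e)^2}}
```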

A more accurate and more complicated equation for the standard error of κ can be found in most books about statistics (6,7).

The 95% CIs of κ can be calculated as follows:
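The equation is missing from this copy. The form below is inferred from the numeric example that follows in the text, which uses the normal deviate 1.96:

```latex
95\%\ \mathrm{CI} = \kappa \pm 1.96 \times SE(\kappa)
```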

For example, the 95% CIs are 0.31 - 1.96 x 0.14 = 0.04 and 0.31 + 1.96 x 0.14 = 0.58.
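The standard error and confidence limits can be sketched in code as follows. The standard error formula used here is the simple large-sample approximation sqrt(po(1 − po) / (N(1 − pe)²)), which is our assumption about the missing equation; the input values come from hypothetical cell counts (7, 10, 12, 121) that are consistent with the reported summaries but are not taken from Table 1.

```python
import math


def kappa_with_ci(po, pe, n, z=1.96):
    """Return (kappa, se, lower, upper) using a simple approximate SE."""
    kappa = (po - pe) / (1 - pe)
    se = math.sqrt(po * (1 - po) / (n * (1 - pe) ** 2))
    return kappa, se, kappa - z * se, kappa + z * se


# Values derived from hypothetical counts a=7, b=10, c=12, d=121 (assumption).
n = 150
po = (7 + 121) / n
pe = (17 * 19 + 133 * 131) / n**2
kappa, se, lower, upper = kappa_with_ci(po, pe, n)
print(f"kappa={kappa:.2f}  SE={se:.2f}  95% CI=({lower:.2f}, {upper:.2f})")
```

Because the sketch carries unrounded intermediate values, its upper limit can differ in the second decimal place from the figure in the text, which is computed from the rounded values 0.31 and 0.14.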

What, then, is the meaning of a κ of 0.31, together with an overall agreement of 0.85? The calculated value of κ can range from -1.00 to +1.00, but for practical purposes the range from zero to +1.00 is of interest. A κ of zero means that there is no agreement beyond chance, and a κ of 1.00 means that there is perfect agreement. Interpretations of intermediate values are subjective. Table 2 shows the strength of agreement beyond chance for various ranges of κ that were suggested by Landis and Koch (8). The choice of intervals is entirely arbitrary but has become ingrained with frequent usage. The values calculated from Table 1 show that there is good overall agreement (po = 0.85) but only fair chance-corrected agreement (κ = 0.31). This paradoxical result is caused by the high prevalence of negative cases. Prevalence effects can lead to situations in which the values of κ do not correspond with intuition (9,10). This is illustrated with the data in Tables 3 and 4 that were extrapolated, with a bit of adjustment to make the numbers come out even, from a data set collected during a study of readings in regard to portable chest images obtained in a medical intensive care unit (11). Table 3 shows the agreement of the reports of two of the readers concerning the position of tubes and catheters. An incorrectly positioned tube or catheter was defined as a positive reading. Table 4 shows the agreement in regard to the reports of the same two readers about the presence of radiographic signs of congestive heart failure. The example was chosen because the actual values of κ for the two diagnoses were very close.



 
TABLE 2. Guidelines for Strength of Agreement Indicated with κ Values

 


 
TABLE 3. Joint Judgment of Two Readers about Position of Tubes and Catheters on 100 Portable Chest Images

 


 
TABLE 4. Joint Judgment of Two Readers about Presence of Signs of Congestive Heart Failure on 100 Portable Chest Images

 
The agreement indices for the two types of readings are shown in Table 5. The overall agreement (95%) for the position of tubes and catheters is very high, but so is the agreement according to chance (90%) calculated from the marginal values in Table 3. This results in a low κ of 0.52, which happens to be the same κ as that for congestive heart failure. The result is not intuitively appealing, because a relatively simple decision such as that about the location of a catheter tip should have a higher index of agreement than a more difficult decision such as that concerning a diagnosis of congestive heart failure. Feinstein and Cicchetti (9) have pointed out the paradox of high overall agreement and low κ, and Cicchetti and Feinstein (10) suggest that when investigators report the results of studies of agreement they should include the three indices of κ, positive agreement, and negative agreement. We agree that this is a useful way of showing agreement data, because it provides more details about where disagreements occur and alerts the reader to the possibility of effects caused by prevalence or prior knowledge.
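The prevalence effect can be seen by comparing two tables that have the same overall agreement. The sketch below uses two hypothetical 2 x 2 tables of 100 cases each (they are illustrative assumptions, not the data of Tables 3 and 4): one with few positive readings and one with balanced readings. It reports κ together with positive and negative agreement, as recommended above.

```python
def report(a, b, c, d):
    """Return po, pe, kappa, ppos, and pneg for a 2 x 2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return {
        "po": po,
        "pe": pe,
        "kappa": (po - pe) / (1 - pe),
        "ppos": 2 * a / (2 * a + b + c),
        "pneg": 2 * d / (2 * d + b + c),
    }


# Hypothetical tables (assumptions): same overall agreement, different prevalence.
skewed = report(a=3, b=2, c=3, d=92)      # few positive readings
balanced = report(a=47, b=2, c=3, d=48)   # about half positive readings

for name, r in (("skewed", skewed), ("balanced", balanced)):
    print(name, {key: round(value, 2) for key, value in r.items()})
```

Both hypothetical tables have 95% overall agreement, but chance agreement is far higher in the skewed table, so its κ is much lower; the low positive agreement in that table shows where the readers actually disagree.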



 
TABLE 5. Indices of Agreement for Readings of Two Radiologists Regarding Portable Chest Images for Position of Tubes and Catheters and Signs of Congestive Heart Failure

 

    WEIGHTED κ FOR MULTIPLE CATEGORIES
The κ can be calculated for two readers who report results with multiple categories. As the number of categories increases, the value of κ decreases because there is more room for disagreement with more categories. However, when findings are reported by using a ranked variable, the relative importance of disagreement between categories may not be the same for adjacent categories as it is for distant categories. Two readers who consistently disagree about minimal and moderate categories would have the same value for κ calculated in the usual way as would two readers who consistently disagree about minimal and severe categories. A method for calculation of κ has been developed that allows for differences in the importance of disagreements. The usual approach is to assign weights between 1.00 and zero to each agreement pair, where 1.00 represents perfect agreement and zero represents no agreement. Assignment of weights can be very subjective and can confuse comparison of κ values between studies in which different weights were used. For theoretical reasons, Fleiss (7) suggests assignment of weights as follows:
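The equation, which the Appendix refers to as Equation (4), is missing from this copy. The quadratic form below is a reconstruction; it reproduces the weights quoted in the Appendix for four categories (0.89 for adjacent categories, 0.56 for categories two steps apart, and zero for absent versus severe).

```latex
w_{ij} = 1 - \frac{(i - j)^2}{(k - 1)^2} \tag{4}
```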

where w represents weight, i is the number of the row, j is the number of the column, and k is the total number of categories. The weighting is called quadratic because of the squared terms. An example of the method for calculation of weighted κ by using four categories is presented in the Appendix. In the example in the Appendix, the categories of absent, minimal, moderate, and severe are used. The weighted and unweighted values for po and κ are included in Table 6. The calculations were repeated by collapsing the data for four categories first into three and then into two categories: First, minimal and moderate categories were combined; then minimal, moderate, and severe categories were combined, which is equivalent to the use of normal and abnormal categories. Table 6 shows that the value of κ increases as the number of categories is decreased, thus indicating better agreement when the fine distinctions are eliminated. The weighted κ is greater than the unweighted κ when multiple categories are used and is the same as the unweighted κ when only two categories are used. Some investigators prefer to use multiple categories because they are a better reflection of actual clinical decisions, and if sensible weighting can be achieved, the weighted κ may reflect the actual agreement better than does the unweighted κ.



 
TABLE 6. Comparison of Unweighted and Weighted po and κ Calculated by Using Four-, Three-, and Two-Response Categories

 

    ESTIMATION OF κ FOR MULTIPLE READERS
When multiple readers are used, some authors calculate the values of κ for pairs of readers and then compute an average κ for all possible pairs (12–14). Fleiss (7) describes a method for calculation of a κ index for multiple readers. It has not been used very much in diagnostic imaging, although it has been reported in some studies along with values for weighted κ (15).
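The pairwise-averaging approach can be sketched as follows. This is only the simple average over all reader pairs described above, not the multiple-reader κ of Fleiss; the function names and the readings (1 = positive, 0 = negative, for eight cases) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(r1, r2):
    """Unweighted kappa for two equal-length sequences of category labels."""
    n = len(r1)
    po = sum(x == y for x, y in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each reader's marginal category counts.
    pe = sum(c1[cat] * c2[cat] for cat in c1) / n**2
    return (po - pe) / (1 - pe)


def mean_pairwise_kappa(readers):
    """Average kappa over all possible pairs of readers."""
    kappas = [cohen_kappa(x, y) for x, y in combinations(readers, 2)]
    return sum(kappas) / len(kappas)


# Hypothetical readings of the same eight cases by three readers (assumption).
reader_a = [1, 1, 0, 0, 1, 0, 0, 0]
reader_b = [1, 0, 0, 0, 1, 0, 0, 1]
reader_c = [1, 1, 0, 0, 0, 0, 0, 0]
mean_kappa = mean_pairwise_kappa([reader_a, reader_b, reader_c])
print(f"mean pairwise kappa = {mean_kappa:.2f}")
```

The average hides the spread between pairs, so it may be worth reporting the individual pairwise values as well.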


    ADVANTAGES AND DISADVANTAGES OF THE κ INDEX
The κ index has the advantage that it is corrected for agreement due to chance, and there is an accepted method for computing confidence limits and for statistical testing. The main disadvantage of κ is that the scale is not free of dependence on disease prevalence or the number of rating categories. As a consequence, it is difficult to interpret the meaning of any absolute value of κ, although it is still useful in experiments in which a control for prevalence and for the number of categories is used. The prevalence bias makes it difficult to compare the results of clinical studies where disease prevalence may vary; for example, this may occur in studies about the screening and diagnosis of breast cancer. The disease prevalence should always be reported when κ is used to prevent misunderstanding when one is trying to make generalizations.


    RELATIONSHIP BETWEEN AGREEMENT AND ACCURACY
High accuracy implies high agreement, but high agreement does not necessarily imply high accuracy. There is no direct way to infer the accuracy in regard to an image-reading task from reader agreement. Accuracy can only be implied from agreement, with the assumption that when readers agree they must be correct. We frequently make this assumption by seeking a consensus diagnosis or by obtaining a second opinion, but it is not always correct. The κ has been shown to be inconsistent with accuracy as measured by the area under the receiver operating characteristic curve (16) and should not be used as a surrogate for accuracy. Different areas under the receiver operating characteristic curve can have the same κ, and the same areas under the receiver operating characteristic curve can have different κ values. For example, Taplin et al (14) studied the accuracy and agreement of single- and double-reading screening mammograms by using the area under the receiver operating characteristic curve and κ. The study included 31 radiologists who read 120 mammograms. The mean area under the receiver operating characteristic curve for single-reading mammograms was 0.85, and that for double-reading mammograms was 0.87. However, the average unweighted κ for patients with cancer was 0.41 for single-reading mammograms and 0.71 for double-reading mammograms. The average unweighted κ for patients without cancer was 0.26 for single-reading mammograms and 0.34 for double-reading mammograms. Double reading of mammograms resulted in better agreement but not in better accuracy.

If we assume that agreement implies accuracy, then we can use measurements of observed agreement to set a lower limit for accuracy. Suppose two readers agree with respect to interpretation in 50% of the cases; then, by implication, they are both correct in the 50% of the cases about which they agree, and each of them is correct in half (25% of the total) of the cases about which they disagree. Therefore, the overall accuracy of the readings is 75%. Typically, in radiology, observed between-reader agreement is 70%–80%, implying an accuracy that is 85%–90% (ie, 70% + 30%/2 to 80% + 20%/2).
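The arithmetic of this argument is simple enough to state as a one-line function. The sketch below (the function name is hypothetical) assumes, as the text does, that readers are correct whenever they agree and that a reader is correct in half of the cases on which the two readers disagree.

```python
def implied_accuracy(agreement):
    """Accuracy implied by a between-reader agreement proportion.

    Assumes readers are correct whenever they agree and that a reader is
    correct in half of the cases on which the readers disagree.
    """
    return agreement + (1.0 - agreement) / 2.0


for agreement in (0.50, 0.70, 0.80):
    print(f"agreement {agreement:.0%} -> implied accuracy "
          f"{implied_accuracy(agreement):.0%}")
```

The result depends entirely on the assumption that agreement implies correctness, which the preceding paragraph warns is not always true.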

Some new approaches to estimation of accuracy from agreement have been proposed. These approaches are based on the assumption that when a majority of readers agree about a diagnosis they are likely to be right (4,17). We have proposed the use of a technique called mixture distribution analysis (4,18). At least five readers report the cases by using either a yes-no response or a rating scale. The agreement of the group of readers about each case is fit to a mathematical model, with the assumption that the sample was drawn from a population that consists of easy normal, easy abnormal, and hard cases. A computer program locates the population that best fits the sample and calculates an overall measure of performance that we call the relative percentage agreement. We have found that the relative percentage agreement has values similar to those obtained by using receiver operating characteristic curve analysis with proved cases (18,19).


    CONCLUSION
Formal evaluations of imaging technology by using reader agreement started in 1947 with the publication of an article about tuberculosis case finding by using four different chest imaging systems (20). The author of an editorial that accompanied the article expressed surprise that there was so much disagreement (21). History repeated itself when an article about agreement in screening mammography that showed considerable reader variability (22) was published; this article was accompanied by an editorial in which the author expressed surprise in regard to the extent of disagreement (23). The consensus of a group of physicians is frequently the only basis for determination of a difficult diagnostic decision. Studies of pathologists who classify cancer have shown levels of disagreement that are similar to those associated with hard decisions in radiology (24). Agreement usually results from informal discussion; however, the method used to obtain agreement can have a large influence on the decision outcome (25). Formal procedures that are used to achieve agreement have been proposed (26); although they can minimize individual bias in achieving a consensus, they are rarely used. We hope that this brief review will stimulate greater use of existing statistics for characterization of agreement and further exploration of new methods.


    APPENDIX
Consider a data set in Table A1 that consists of four categories. The frequencies in Table A1 are converted into proportions, which are included in Table A2, by dividing the data by the total number of cases.



 
TABLE A1. Frequency of Responses of Two Readers Who Rated a Disease as Absent, Minimal, Moderate, or Severe

 


 
TABLE A2. Proportion of Responses of Two Readers Who Rated a Disease as Absent, Minimal, Moderate, or Severe

 
Table A3 shows the quadratic weights calculated by using Equation (4), as presented earlier, w(i,j) = 1 − (i − j)²/(k − 1)², where w represents weight, i is the number of the row, j is the number of the column, and k is the total number of categories. It is assumed that disagreement between adjacent categories (eg, the weight for absent versus minimal is 0.89) is not as important as that between distant categories (eg, the weight for absent versus severe is zero).



 
TABLE A3. Quadratic Weights for 4 x 4 Table
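The table itself is not reproduced in this copy. The weights can be regenerated with a short sketch, assuming the quadratic form w = 1 − (i − j)²/(k − 1)², which matches the values 1.00, 0.89, 0.56, and zero quoted in the text; the function name is hypothetical.

```python
def quadratic_weights(k):
    """Return the k x k matrix of quadratic agreement weights."""
    return [
        [1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)]
        for i in range(k)
    ]


weights = quadratic_weights(4)
for row in weights:
    print([round(w, 2) for w in row])
```

Only the difference i − j enters the formula, so it makes no difference that the code counts rows and columns from zero rather than from one.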

 
The weighted observed agreement is calculated by multiplying the proportion of responses in each cell of the 4 x 4 table by the corresponding weighting factor. The calculations for the first row are as follows: 0.31 x 1.00 = 0.31, 0.09 x 0.89 = 0.08, 0.02 x 0.56 = 0.01, and 0 x 0 = 0.

The results for observed weighted proportions are presented in Table A4. The expected agreement is calculated by multiplying the row and column total for each cell of the 4 x 4 table by the corresponding weighting factor. The calculations for the first row are as follows: (0.42 x 0.38) x 1.00 = 0.16, (0.42 x 0.22) x 0.89 = 0.08, (0.42 x 0.15) x 0.56 = 0.03, and (0.42 x 0.25) x 0 = 0.



 
TABLE A4. Weighted Proportion of Observed and Expected Responses

 
The results for expected weighted proportions are presented in Table A4. The sum of all of the cells in regard to observed weighted proportions (sum, 0.93) in Table A4 is the weighted observed agreement, which we call po(w), and the sum of all of the cells in regard to expected weighted proportions (sum, 0.70) in Table A4 is the weighted expected agreement, which we call pe(w). When we apply the equation for κ to the weighted values, we get a weighted κ index of 0.76, which is calculated with the following equation: weighted κ = [po(w) − pe(w)]/[1 − pe(w)].
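The steps of the Appendix (convert frequencies to proportions, weight the observed and expected proportions, sum them, and apply the κ equation) can be sketched as follows. The function names and the 4 x 4 frequency table are hypothetical; the table is illustrative only and is not the data of Table A1, which is not reproduced in this copy.

```python
def weighted_kappa(table, weights):
    """Weighted kappa for a k x k frequency table and a k x k weight matrix."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[cell / n for cell in row] for row in table]            # proportions
    row_tot = [sum(p[i]) for i in range(k)]                      # reader 1 marginals
    col_tot = [sum(p[i][j] for i in range(k)) for j in range(k)]  # reader 2 marginals
    cells = [(i, j) for i in range(k) for j in range(k)]
    po_w = sum(weights[i][j] * p[i][j] for i, j in cells)
    pe_w = sum(weights[i][j] * row_tot[i] * col_tot[j] for i, j in cells)
    return (po_w - pe_w) / (1 - pe_w)


def quadratic_weights(k):
    return [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]


def identity_weights(k):
    # Full credit on the diagonal only, which gives the unweighted kappa.
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]


# Hypothetical ratings (absent, minimal, moderate, severe) for 100 cases.
table = [
    [30, 8, 2, 0],
    [5, 10, 6, 1],
    [1, 4, 9, 4],
    [0, 1, 5, 14],
]
kappa_w = weighted_kappa(table, quadratic_weights(4))
kappa_u = weighted_kappa(table, identity_weights(4))
print(f"weighted kappa = {kappa_w:.2f}, unweighted kappa = {kappa_u:.2f}")
```

For a 2 x 2 table the quadratic weights reduce to the identity weights, which is why the weighted and unweighted κ coincide when only two categories are used.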

An unweighted κ can be calculated by using the sum of the diagonal cells in Table A2, or 0.31 + 0.07 + 0.04 + 0.13 = 0.55, to calculate the observed agreement and the sum of the diagonal cells in Table A4 with regard to expected weighted proportions, or 0.16 + 0.05 + 0.03 + 0.04 = 0.28, to calculate the expected agreement. The unweighted κ is 0.37.

The calculation of the appropriate standard error and the use of the standard error for testing either the hypothesis that κ is different from zero or that κ is different from a value other than zero are beyond the scope of this article but can be found in most basic statistical texts (6,7).

GLOSSARY
Below is a list of common terms and definitions related to the measurement of observer agreement.

Accuracy.—This value is the likelihood of the interpretation being correct when compared with an independent standard.

Agreement.—This term represents the likelihood that one reader will indicate the same responses as another reader.

Attributes.—An attribute is a categorical variable that represents a property of the object being imaged (eg, tumor descriptors such as mass, calcification, and architectural distortion).

Categorical variables.—Categorical variables are variables that can be assigned to specific categories. Categorical variables can be either ranked variables or attributes.

κ.—The κ value is an overall measure of agreement that is corrected for agreement by chance. It is sensitive to disease prevalence.

Marginal sums.—A marginal sum is the sum of the responses in a single row or column of the data table, and it represents the total response of one of the readers.

Measurement variable.—Measurement variables are variables that can be measured or counted. They are generally divided into continuous variables (eg, lesion diameter or volume) and discrete variables (eg, number of lesions, expressed as whole numbers but never as decimal fractions).

Prevalence.—Prevalence is the proportion of a particular class of cases in the population being studied.

Ranked variables.—Ranked variables are categorical variables that have a natural order, such as stage of a disease, histologic grade, or discrete severity index (ie, mild, moderate, or severe).

Reliability.—Reliability is the likelihood that one reader will provide the same responses as those provided by a large consensus group.

Weighted κ.—The weighted κ is an overall measure of agreement that is corrected for agreement by chance; a weighting factor is applied to each pair of disagreements to account for the importance of the disagreement.


    REFERENCES

  1. Baker JA, Kornguth PJ, Floyd CE. Breast imaging reporting and data system standardized mammography lexicon: observer variability in lesion description. AJR Am J Roentgenol 1996; 166:773-778.[Abstract/Free Full Text]
  2. Markus JB, Somers S, Franic SE, et al. Interobserver variation in the interpretation of abdominal radiographs. Radiology 1989; 171:69-71.[Abstract/Free Full Text]
  3. Tiitola M, Kivisaari L, Tervahartiala P, et al. Estimation or quantification of tumour volume? CT study on irregular phantoms. Acta Radiol 2001; 42:101-105.[Medline]
  4. Polansky M. Agreement and accuracy: mixture distribution analysis. In: Beutel J, VanMeter R, Kundel H, eds. Handbook of imaging physics and perception. Bellingham, Wash: Society of Professional Imaging Engineers, 2000; 797-835.
  5. Henkelman RM, Kay I, Bronskill MJ. Receiver operating characteristic (ROC) analysis without truth. Med Decis Making 1990; 10:24-29.
  6. Agresti A. Categorical data analysis New York, NY: Wiley, 1990; 366-370.
  7. Fleiss JL. Statistical methods for rates and proportions 2nd ed. New York, NY: Wiley, 1981; 212-236.
  8. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33:159-174.[CrossRef][Medline]
  9. Feinstein A, Cicchetti D. High agreement but low kappa. I. The problem of two paradoxes. J Clin Epidemiol 1990; 43:543-549.
  10. Cicchetti D, Feinstein A. High agreement but low kappa. II. Resolving the paradoxes. J Clin Epidemiol 1990; 43:551-558.
  11. Kundel HL, Gefter W, Aronchick J, et al. Relative accuracy of screen-film and computed radiography using hard and soft copy readings: a receiver operating characteristic analysis using bedside chest radiographs in a medical intensive care unit. Radiology 1997; 205:859-863.
  12. Epstein DM, Dalinka MK, Kaplan FS, et al. Observer variation in the detection of osteopenia. Skeletal Radiol 1986; 15:347-349.
  13. Herman PG, Khan A, Kallman CE, et al. Limited correlation of left ventricular end-diastolic pressure with radiographic assessment of pulmonary hemodynamics. Radiology 1990; 174:721-724.
  14. Taplin SH, Rutter CM, Elmore JG, et al. Accuracy of screening mammography using single versus independent double interpretation. AJR Am J Roentgenol 2000; 174:1257-1262.
  15. Robinson PJ, Wilson D, Coral A, et al. Variation between experienced observers in the interpretation of accident and emergency radiographs. Br J Radiol 1999; 72:323-330.
  16. Swets JA. Indices of discrimination or diagnostic accuracy: their ROCs and implied models. Psychol Bull 1986; 99:100-117.
  17. Uebersax JS. Modeling approaches for the analysis of observer agreement. Invest Radiol 1992; 27:738-743.
  18. Kundel HL, Polansky M. Mixture distribution and receiver operating characteristic analysis of bedside chest imaging using screen-film and computed radiography. Acad Radiol 1997; 4:1-7.
  19. Kundel HL, Polansky M. Comparing observer performance with mixture distribution analysis when there is no external gold standard. In: Kundel HL, ed. Medical imaging 1998: image perception. Bellingham, Wash: Society of Professional Imaging Engineers, 1998; 78-84.
  20. Birkelo CC, Chamberlain WE, Phelps PS, et al. Tuberculosis case finding: a comparison of the effectiveness of various roentgenographic and photofluorographic methods. JAMA 1947; 133:359-366.
  21. The "personal equation" in the interpretation of a chest roentgenogram (editorial). JAMA 1947; 133:399-400.
  22. Elmore JG, Wells CK, Lee CH, et al. Variability in radiologists’ interpretation of mammograms. N Engl J Med 1994; 331:1493-1499.
  23. Kopans DB. Accuracy of mammographic interpretation (editorial). N Engl J Med 1994; 331:1521-1522.
  24. Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 1977; 33:363-374.
  25. Revesz G, Kundel HL, Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest Radiol 1983; 18:194-198.
  26. Hillman BJ, Hessel SJ, Swensson RG, Herman PG. Improving diagnostic accuracy: a comparison of interactive and Delphi consultations. Invest Radiol 1977; 12:112-115.



This article has been cited by other articles:


H. Akiba, M. Tamakawa, H. Hyodoh, K. Hyodoh, N. Yama, T. Nonaka, Y. Minamida, M. Hashimoto, and M. Hareyama
Assessment of Dural Arteriovenous Fistulas of the Cavernous Sinuses on 3D Dynamic MR Angiography
AJNR Am. J. Neuroradiol., October 1, 2008; 29(9): 1652 - 1657.


K. Tanitame, K. Sasaki, T. Sone, S. Uyama, M. Sumida, T. Ichiki, and K. Ito
Anterior Chamber Configuration in Patients with Glaucoma: MR Gonioscopy Evaluation with Half-Fourier Single-Shot RARE Sequence and Microscopy Coil
Radiology, October 1, 2008; 249(1): 294 - 300.


S. Y. Kim, S. H. Park, E. K. Choi, S. S. Lee, K. H. Lee, J. C. Kim, C. S. Yu, H. C. Kim, A. Y. Kim, and H. K. Ha
Automated Carbon Dioxide Insufflation for CT Colonography: Effectiveness of Colonic Distention in Cancer Patients with Severe Luminal Narrowing
Am. J. Roentgenol., March 1, 2008; 190(3): 698 - 706.


A. Taylor, E. V. Garcia, J. N. G. Binongo, A. Manatunga, R. Halkar, R. D. Folks, and E. Dubovsky
Diagnostic Performance of an Expert System for Interpretation of 99mTc MAG3 Scans in Suspected Renal Obstruction
J. Nucl. Med., February 1, 2008; 49(2): 216 - 224.


D. S. Gierada, T. K. Pilgram, M. Ford, R. M. Fagerstrom, T. R. Church, H. Nath, K. Garg, and D. C. Strollo
Lung Cancer: Interobserver Agreement on Interpretation of Pulmonary Findings at Low-Dose CT Screening
Radiology, December 1, 2007; 246(1): 265 - 272.


E. Just da Costa e Silva and G. Alves Pontes da Silva
Eliminating Unenhanced CT When Evaluating Abdominal Neoplasms in Children
Am. J. Roentgenol., November 1, 2007; 189(5): 1211 - 1214.


S. K. Hofkes, B. J. Iskandar, P. A. Turski, L. R. Gentry, J. B. McCue, and V. M. Haughton
Differentiation between Symptomatic Chiari I Malformation and Asymptomatic Tonsilar Ectopia by Using Cerebrospinal Fluid Flow Imaging: Initial Estimate of Imaging Accuracy
Radiology, November 1, 2007; 245(2): 532 - 540.


A. Suzuki, Y. Nakamoto, T. Terauchi, M. Kawamoto, Y. Okumura, Y. Suzuki, T. Sato, N. Takahashi, J. Lee, M. Senda, et al.
Inter-observer Variations in FDG-PET Interpretation for Cancer Screening
Jpn. J. Clin. Oncol., August 18, 2007; (2007) hym064v1.


J. L. Alcazar, M. Garcia-Manero, and R. Galvan
Three-Dimensional Sonographic Morphologic Assessment of Adnexal Masses: A Reproducibility Study
J. Ultrasound Med., August 1, 2007; 26(8): 1007 - 1011.


S. Suzuki, S. Furui, K. Okinaga, T. Sakamoto, J. Murata, A. Furukawa, and Y. Ohnaka
Differentiation of Femoral Versus Inguinal Hernia: CT Findings
Am. J. Roentgenol., August 1, 2007; 189(2): W78 - W83.


R. KUMAR, R. A. BUMB, N. A. ANSARI, R. D. MEHTA, and P. SALOTRA
CUTANEOUS LEISHMANIASIS CAUSED BY LEISHMANIA TROPICA IN BIKANER, INDIA: PARASITE IDENTIFICATION AND CHARACTERIZATION USING MOLECULAR AND IMMUNOLOGIC TOOLS
Am J Trop Med Hyg, May 1, 2007; 76(5): 896 - 901.


H.J. Cloft, T. Kaufmann, and D.F. Kallmes
Observer Agreement in the Assessment of Endovascular Aneurysm Therapy and Aneurysm Recurrence
AJNR Am. J. Neuroradiol., March 1, 2007; 28(3): 497 - 500.


H. D. Stupak, T. H. M. Moulthrop, P. Wheatley, A. V. Tauman, and C. M. Johnson Jr
Calcium Hydroxylapatite Gel (Radiesse) Injection for the Correction of Postrhinoplasty Contour Deficiencies and Asymmetries
Arch Facial Plast Surg, March 1, 2007; 9(2): 130 - 136.


Y. J. Kim, S. S. Raman, N. C. Yu, K. J. To'o, R. Jutabha, and D. S. K. Lu
Esophageal Varices in Cirrhotic Patients: Evaluation with Liver CT
Am. J. Roentgenol., January 1, 2007; 188(1): 139 - 144.


T. A. Jaffe, L. C. Martin, C. M. Miller, K. M. Franklin, E. M. Merkle, W. M. Thompson, R. C. Nelson, D. M. DeLong, and E. K. Paulson
Abdominal Pain: Coronal Reformations from Isotropic Voxels with 16-Section CT--Reader Lesion Detection and Interpretation Time
Radiology, January 1, 2007; 242(1): 175 - 181.


S. Kim, Y.-M. Huh, H.-T. Song, S.-A. Lee, J.-W. Lee, J. E. Lee, I. H. Chung, and J.-S. Suh
Chronic Tibiofibular Syndesmosis Injury of Ankle: Evaluation with Contrast-enhanced Fat-suppressed 3D Fast Spoiled Gradient-recalled Acquisition in the Steady State MR Imaging
Radiology, January 1, 2007; 242(1): 225 - 235.


G. Andreisek, T. Pfammatter, K. Goepfert, D. Nanz, P. Hervo, R. Koppensteiner, and D. Weishaupt
Peripheral Arteries in Diabetic Patients: Standard Bolus-Chase and Time-resolved MR Angiography
Radiology, December 19, 2006; (2006) 2422051111.


R. de Silva, L. F. Gutierrez, A. N. Raval, E. R. McVeigh, C. Ozturk, and R. J. Lederman
X-Ray Fused With Magnetic Resonance Imaging (XFM) to Target Endomyocardial Injections: Validation in a Swine Model of Myocardial Infarction
Circulation, November 28, 2006; 114(22): 2342 - 2350.