Radiology
(Radiology. 1999;212:401-410.)
© RSNA, 1999


Special Report

Ovarian Cancer: Comparison of Observer Performance for Four Methods of Interpreting CT Scans1

Patrick J. Fultz, MD, Christina V. Jacobs, MD 2, W. Jackson Hall, PhD, Ronald Gottlieb, MD, Deborah Rubens, MD, Saara M. S. Totterman, MD, PhD, Steven Meyers, MD, PhD, Cynthia Angel, MD, Giuseppe Del Priore, MD, MPH 3, David P. Warshal, MD 4, Kelly H. Zou, PhD 5 and David E. Shapiro, PhD 6

1 From the Depts of Radiology (P.J.F., C.V.J., R.G., D.R., S.M.S.T., S.M.), Obstetrics and Gynecology (C.A., G.D.P., D.P.W.), and Biostatistics (W.J.H., K.H.Z., D.E.S.), University of Rochester Medical Center, 601 Elmwood Ave, Rochester, NY 14642. Received May 7, 1997; revision requested May 27; final revision received Sep 2, 1998; accepted Feb 12, 1999. Supported in part by a grant from Innovations in Patient Care Program, Univ of Rochester Medical Center. Address reprint requests to P.J.F.


    Abstract
PURPOSE: To assess the effects of four interpretative methods on observers' mean sensitivity and specificity by using computed tomography (CT) of ovarian carcinoma as a model.

MATERIALS AND METHODS: CT scans in 98 patients with ovarian carcinoma and 49 women who were disease free were retrospectively reviewed by four experienced blinded radiologists to compare single-observer reading, single-observer reading with an anatomic checklist, paired-observer reading (simultaneous double reading), and replicated reading (combination of two independent readings). Confidence level scoring was used to identify three possible disease forms in each patient: extranodal tumor, lymphadenopathy, and ascites. Patient conditions were then categorized as abnormal or normal.

RESULTS: There were no significant improvements in sensitivity or specificity for classification of patient conditions as abnormal or normal when comparing single-observer interpretation with single-observer interpretation with a checklist or paired-observer interpretation. Although there was no significant improvement in the mean sensitivity (94% vs 93%) by using the replicated reading method, there was a statistically significant improvement in mean specificity (85% vs 79%) for the replicated readings compared with single-observer interpretations (P < .05).

CONCLUSION: Diagnostic aids such as checklists and paired simultaneous readings did not lead to an improved mean observer performance for experienced readers. However, an increase in the mean specificity occurred with replicated readings.

Index terms: Diagnostic radiology, observer performance, 852.12112 • Images, interpretation, 852.12112 • Ovary, CT, 852.12112 • Ovary, neoplasms, 852.32


    Introduction
More than 50 years have passed since Birkelo et al (1) illuminated inconsistencies among radiologists in interpretations of radiographs of the chest for the assessment of tuberculosis. In that study, 1,256 radiographs were obtained for screening for tuberculosis, and five readers independently interpreted the radiographs. The number of positive diagnoses for the five readers ranged from 56 to 100. Since the initial work by Birkelo et al (1), a number of studies (2–6) have raised concerns regarding the variability and/or accuracy of radiologic interpretations for examinations such as chest radiography, mammography, computed tomography (CT), and magnetic resonance (MR) imaging.

Ovarian carcinoma is estimated to be the fifth leading cancer cause of death in women in the United States (7). While CT has some inherent limitations in the detection of ovarian carcinoma and its metastases, it is often used preoperatively and/or in postoperative follow-up in these patients (8,9). Findings of a previous study (5) of CT in patients with ovarian carcinoma showed that in paired comparisons of three independent reviews of the images in 100 cases, there were up to 26 cases with discrepancy between two reviewers for the presence of a mass. We therefore chose to use CT in the setting of ovarian carcinoma as a model to investigate the effects of four methods of image interpretation on observer performance.

This project was designed to assess the effects of a variety of methods of image interpretation on interobserver agreement, mean sensitivity, and mean specificity. Methods to potentially improve the efficacy of standard single-observer image interpretation included an observer checklist, paired simultaneous reading by two observers, and combining two independent readings in the form of a replicated reading. Comparison of these various methods was performed by using a group of experienced radiologists (R.G., D.R., S.M.S.T., S.M.) in an effort to identify the most efficacious approach to CT interpretations in patients with ovarian carcinoma.


    MATERIALS AND METHODS
Patient Population
Our focus was on the effects of four interpretative methods on mean observer performance with use of CT of ovarian cancer as a model and not on the accuracy of CT as a modality in this setting. Therefore, we included only patients with disease judged as recognizable at CT in the study group and only patients whose CT scans showed no recognizable disease in the control group.

CT studies from January 1990 to February 1995 in 98 patients with a history of active ovarian cancer and in 19 patients with disease in remission were selected from our gynecologic oncology clinic and surgery records. All patients who presented to the clinic and those who underwent surgery performed by our gynecologic oncologists (including C.A., G.D.P., D.P.W.) were identified from a computer printout of patients examined in this time frame. Thirty women without a history of gynecologic malignancy who had undergone body CT examinations during this same period were selected from the general adult patient population. Therefore, the study comprised 147 patients (98 with abnormal and 49 with normal conditions) who had undergone abdominal and/or pelvic CT.

The 98 patients with abnormal CT findings were the 17 patients with ovarian cancer at initial presentation and the 81 patients with disease recurrence recognizable at CT examination. Normal CT examinations were those in the 19 patients with ovarian cancer in remission and the 30 women with no history of gynecologic malignancy who underwent abdominal or pelvic CT examinations at our hospital for various indications.

The mean age of patients with ovarian cancer was 60 years (SD, 12 years; age range, 30–82 years); the mean age of patients free of disease was 62 years (SD, 14 years; age range, 30–86 years).

The patients with ovarian cancer and those free of ovarian cancer after treatment were identified from consecutive CT examinations that fulfilled specific diagnostic criteria for confirmation of active disease or remission at the time of the CT examination. The inclusion criteria in 58 patients included timely (within less than 1 month) surgical and/or percutaneous biopsy confirmation of CT findings. The inclusion criteria for the remaining 59 patients included findings of serial clinical examinations and serum tumor marker analysis and/or follow-up CT examinations over intervals of at least 6 months that corroborated the findings of the specific CT examinations included in this study.

A total of 261 patients with a history of ovarian cancer were excluded because of lack of complete CT examinations (some patients underwent no CT or only limited CT for percutaneous biopsy guidance), the presence of concomitant disease (eg, other malignancy, abscess, or lymphocele) at CT, insufficient follow-up to document the true status of the disease, and/or surgically documented disease with no recognizable disease at CT.

By using the aforementioned diagnostic information, the confirmation of CT findings for each of the 147 patients was established by means of a consensus review of all data by investigators (P.J.F., C.V.J.) who were not involved in the subsequent blinded comparative CT examination interpretations. By using all available clinical and radiologic data, the CT scans in each of the 147 patients were then reviewed to create the standard interpretations to which the blinded interpretations were compared.

Three forms of possible disease were defined: extranodal tumor mass, nodal disease, and ascites. When an abnormality was present, it was localized to the upper or lower part of the abdomen or to the upper or lower part of the pelvis so that four sites of abnormality were possible for each form of disease. The total numbers of normal and abnormal areas for each of the three forms of disease are presented in Table 1.


TABLE 1. Patients with Abnormalities for Each of Three Forms of Disease
 
The majority of patients with a history of ovarian carcinoma had typical epithelial cell cancer; the remaining patients had tumors of low malignant potential or nonepithelial malignancies.

All 147 CT examinations (CT HiSpeed Advantage, HiLight Advantage, model 9800, and model 8800 scanners; GE Medical Systems, Milwaukee, Wis) were performed at our hospital between January 1990 and February 1995. More than half the patients (n = 78) received contrast material intravenously, orally, and rectally, with most of the remaining patients receiving contrast material intravenously and orally (n = 46). The remaining examinations (n = 23) were performed with at least one of these three routes of contrast material administration. Most (n = 117) examinations were performed with 10-mm contiguous images, with the remaining examinations performed with a variety of section thicknesses (5 or 7 mm) and scanning intervals (7 or 10 mm).

CT Scan Review and Replicated Reading Method
Images from the 147 CT examinations were intermixed randomly, and the examinations were subdivided into three groups of 49 examinations each (groups 1, 2, and 3). Images from these 147 examinations were then reviewed without clinical information in a blinded retrospective manner by four attending cross-sectional imaging radiologists (including R.G., D.R., S.M.S.T., S.M.) in three sessions at three different times. Subsequently, in a fourth reading session, each of the four observers reviewed images from 30 of the CT examinations (10 from each of the three original groups of examinations) to allow calculation of intraobserver agreement. Consecutive reading sessions were separated by at least 2 months. This CT scan review method is summarized in Figure 1.



Figure 1. Diagram shows the format for the review of images from CT examinations and the reader (observer) assignments. For reading sessions 1-3, the 147 examinations were divided into three groups of examinations: group 1 (G1), 49 examinations; group 2 (G2), 49 examinations; and group 3 (G3), 49 examinations.

 
With this format, a different interpretative method was used by each reader for each of the three groups of CT examinations within each session. The randomized order within each reading session and the 2-month interval between the different review methods applied to a given group of examinations diminished the potential for learning biases (reading-order effects) that could be introduced by using only one specific order of application of the various methods.

Three direct methods of interpretation were employed in the first three reading sessions, and the four radiologists actively participated in each of these three image reviews by using confidence level scoring. The three direct methods were (a) the standard single-observer approach, (b) the single-observer interpretation with a predesigned CT checklist (Fig 2), and (c) simultaneous consensus interpretation by two observers (paired reading). By using this review format, images from each of the 147 examinations were reviewed three times by each observer (alone, with a checklist, and as pairs of readers) in the first three reading sessions, with each session separated by at least 2 months.



Figure 2. Abdominal-pelvic CT checklist. Modified from Kinard et al (10). GI = gastrointestinal, IVC = inferior vena cava.

 
As noted earlier, the fourth reading session was an independent review of images from 30 of the original 147 CT examinations. This final interpretative method was a form of replicated reading (11). The replicated reading used in this study was a mathematic retrospective pairing of earlier single-observer interpretations in all possible combinations. For the purposes of this analysis, the original confidence level scores were dichotomized into a "yes or no" response (discussed further in Statistical Methods, Sensitivity and Specificity). Because these artificial reader pairings will at times result in disagreement, a pseudoarbitration method (12) was used when the pair disagreed about a classification as normal or abnormal. Therefore, each replicated reading pair was evaluated by alternately combining the readings of this pair with those of each of the two remaining readers, who then served as an arbitrator only when a disagreement occurred between the original pair of readers. The arbitrator's patient assessment thus served as a tiebreaker when needed. This design yielded a total of 12 possible reader combinations when dispute arbitrators were included in the replicated reading calculations. A mean sensitivity and specificity were then calculated from these 12 reader combinations.
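
The pairing and pseudoarbitration logic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the software used in the study: the function names, the toy readings, and the boolean encoding (True = abnormal) are assumptions made for the example.

```python
from itertools import combinations

READERS = (1, 2, 3, 4)  # reader labels as in the text


def replicated_combinations(readers=READERS):
    """Return (reader_a, reader_b, arbitrator) triples.

    Six unordered reader pairs, each with two possible arbitrators,
    give the 12 combinations mentioned in the text.
    """
    triples = []
    for a, b in combinations(readers, 2):
        for arbitrator in readers:
            if arbitrator not in (a, b):
                triples.append((a, b, arbitrator))
    return triples


def replicated_call(calls, a, b, arbitrator):
    """Combine two dichotomized readings; the arbitrator breaks ties only."""
    if calls[a] == calls[b]:
        return calls[a]
    return calls[arbitrator]


def sensitivity_specificity(decisions, truth):
    """Return (sensitivity, specificity) for boolean decisions vs truth."""
    true_pos = sum(1 for d, t in zip(decisions, truth) if d and t)
    true_neg = sum(1 for d, t in zip(decisions, truth) if not d and not t)
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    return true_pos / n_pos, true_neg / n_neg


# Hypothetical toy data: one dict of single-observer calls per patient.
truth = [True, True, False, False]
readings = [
    {1: True, 2: True, 3: True, 4: True},
    {1: True, 2: False, 3: True, 4: False},
    {1: False, 2: False, 3: False, 4: False},
    {1: True, 2: False, 3: False, 4: False},
]

results = []
for a, b, arbitrator in replicated_combinations():
    decisions = [replicated_call(calls, a, b, arbitrator) for calls in readings]
    results.append(sensitivity_specificity(decisions, truth))

mean_sensitivity = sum(s for s, _ in results) / len(results)
mean_specificity = sum(s for _, s in results) / len(results)
print(len(results), mean_sensitivity, mean_specificity)
```

Because the arbitrator is consulted only when the pair disagrees, each of the 12 combinations behaves like a majority vote among three readers, which is one way to see why isolated false-positive calls by a single reader tend to be outvoted.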

The four individual readers were randomly designated as reader 1, 2, 3, or 4. The pairs of observers for the double-reading sessions were randomly assigned and were constant throughout the study. The four readers varied in duration of specialized body imaging experience from 2 to 10 years when this study was initiated. The random reader assignments coincidentally resulted in pairing the two radiologists least experienced in body imaging (2 and 5 years of experience) and the two radiologists most experienced (9 and 10 years).

Prior to the reading sessions, the observers received an overview of the use of a data collection form and confidence level scoring method: 0 indicated definitely absent, 1 indicated probably absent, 2 indicated possibly present, 3 indicated probably present, and 4 indicated definitely present. The observers assessed each patient CT study for the presence of three categories of possible disease by using confidence level scoring for each disease category, regardless of the number of sites involved. These three categories of disease included extranodal tumor mass, nodal disease, and ascites (5). At subsequent questioning, there was also agreement among all the readers that in a "forced choice" situation (normal vs abnormal) a confidence level score of 0 or 1 generally corresponded to a diagnosis of normal and scores of 2–4 corresponded to a diagnosis of abnormal. The division between abdomen and pelvis was defined as the iliac crest, and the division between the upper and lower parts of the abdomen or pelvis was the midway point in each area. If a mass crossed a boundary, it was scored in each area. Actual observer time per examination interpretation was also recorded.

Only those observers who used the CT checklist (Fig 2) in a given session received instruction regarding its use. Observers could use the checklist as a simple reminder to review all listed anatomic areas, or they could formally complete the form and then transfer information to the data collection form. Other detailed information given to all observers included instructions to omit chest cavity findings from their analysis, a review of CT nodal size criteria for this study, and instructions that cysts contiguous with the liver margin and all mesenteric nodules should be considered extranodal tumor masses and that centrally located simple liver cysts, along with hepatic hemangiomas and renal or adrenal lesions, should be considered incidental findings. In addition, a review of a pseudoarbitration method (12) was provided to resolve potential disputes during the paired-observer simultaneous reading sessions.

Statistical Methods
Modeling the confidence level scores.—To assess the magnitude and statistical significance of the effects of various factors on the ordinal scores assigned by the readers for each of the three forms of disease, we fit a linear model to the scores (SAS Institute, "SAS/STAT Software: Changes and Enhancements through Release 6.11," SAS Institute, Cary, NC, 1996) (Appendix 1). For this purpose only, we treated the scores as if they were continuous because, to our knowledge, software for modeling ordinal scores in a complex design (with pairing) is not currently available. Results from ordinal regression modeling of subsets of the data agreed well with results from continuous modeling.

The results of these analyses helped in guiding the agreement and observer performance analyses. Moreover, other results of tests of significance about various measures of observer performance were in agreement with results of tests associated with this modeling.

Intraobserver and interobserver agreement.—Intraobserver agreement information for the single-observer review method for each of the three forms of disease was obtained by comparing earlier single-reader observations (from reading sessions 1–3) with individual reinterpretations of examinations of 30 patients in reading session 4, the final session. By using the various combinations of paired comparisons, the mean interobserver agreement was calculated for each of the three direct interpretative methods (single observer, single observer with anatomic checklist, and paired observers). Intraobserver agreement and interobserver agreement were initially calibrated by using the weighted κ statistic (13), in which we assigned weight w to a disagreement for which the two readings were w confidence levels apart. Also, a simple κ analysis for the three direct interpretative methods was performed for dichotomized patient condition categorizations of normal versus abnormal.
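
A weighted κ with disagreement weights equal to the distance between confidence levels can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name and toy ratings are hypothetical, the weights are the linear weights |i − j| implied by the description above, and the code is not the software used in the study. With two levels (normal vs abnormal) the same function reduces to the simple κ statistic.

```python
def weighted_kappa(ratings_a, ratings_b, n_levels=5):
    """Weighted kappa with disagreement weight |i - j| between levels i and j.

    ratings_a and ratings_b are equal-length sequences of integer levels
    in the range 0 .. n_levels - 1 (0-4 for the confidence scale in the text).
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("need two non-empty rating lists of equal length")

    # Observed joint proportions of (rating by A, rating by B).
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1.0 / n

    # Marginal proportions define the chance-expected joint proportions.
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]

    observed_disagreement = sum(
        abs(i - j) * observed[i][j]
        for i in range(n_levels) for j in range(n_levels)
    )
    expected_disagreement = sum(
        abs(i - j) * row[i] * col[j]
        for i in range(n_levels) for j in range(n_levels)
    )
    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when all ratings are identical")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical dichotomized calls (0 = normal, 1 = abnormal) from two readings.
first_reading = [0, 0, 1, 1]
second_reading = [0, 1, 1, 1]
print(weighted_kappa(first_reading, second_reading, n_levels=2))
```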

Sensitivity and specificity.—Analysis of sensitivity and specificity was carried out to estimate observer performance in the context of the more commonly used measures of a method's utility. In this analysis, sensitivity and specificity were defined with respect to the true status of the patient's condition (diseased vs normal).

To achieve a dichotomy of yes or no responses (ie, disease present or absent) for this analysis, the observers' confidence level scores for a patient's three possible forms of disease were redefined as absent (score of 0 or 1) or present (score of 2–4). Our readers were in agreement that this was the most appropriate format for dichotomizing their confidence level responses.

The localization data for each of the three forms of disease in each of the four body areas were then tabulated by categorizing the observer responses for detection of disease as true-positive, true-negative, false-positive, or false-negative findings in a comparison with our previously defined standards of reference (see Materials and Methods, Patient Population). Subsequent data analysis for the 98 patients with abnormal and the 49 patients with normal CT findings (in contrast with normal and abnormal areas on CT scans) was accomplished by using a 0 or 1 confidence level score for all of the three forms of disease as a diagnosis of normal and scores of 2–4 for any of the three forms of disease as a diagnosis of abnormal.

The mean sensitivities and specificities for a given reading method were obtained by averaging the various observers' respective results within that reading method.
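
The dichotomization rule and the averaging step can be written out as a short Python sketch. The names (FORMS, patient_abnormal, and so on) and the example values are hypothetical and chosen only for illustration; the threshold of 2 is the one stated in the text.

```python
FORMS = ("extranodal", "nodal", "ascites")
PRESENT_THRESHOLD = 2  # scores 2-4 = present, scores 0-1 = absent (per the text)


def form_present(score):
    """Dichotomize one confidence level score (0-4)."""
    if score not in range(5):
        raise ValueError("confidence level must be an integer from 0 to 4")
    return score >= PRESENT_THRESHOLD


def patient_abnormal(scores):
    """A patient is called abnormal if any of the three forms is present."""
    return any(form_present(scores[form]) for form in FORMS)


def outcome(called_abnormal, truly_abnormal):
    """Label one reading against the reference standard."""
    if called_abnormal:
        return "TP" if truly_abnormal else "FP"
    return "FN" if truly_abnormal else "TN"


def sensitivity_specificity(outcomes):
    """Return (sensitivity, specificity) from a list of outcome labels."""
    tp, fp = outcomes.count("TP"), outcomes.count("FP")
    fn, tn = outcomes.count("FN"), outcomes.count("TN")
    return tp / (tp + fn), tn / (tn + fp)


def mean_over_observers(per_observer):
    """Average (sensitivity, specificity) pairs across observers."""
    n = len(per_observer)
    return (sum(s for s, _ in per_observer) / n,
            sum(s for _, s in per_observer) / n)


# Hypothetical example: one patient and two observers' summary results.
print(patient_abnormal({"extranodal": 1, "nodal": 0, "ascites": 2}))
print(mean_over_observers([(0.9, 0.8), (1.0, 0.6)]))
```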

When comparing two or more proportions (eg, sensitivities across observers or across interpretative methods), McNemar χ² tests were used. When comparing mean proportions (that is, proportions averaged across observers), mean proportions and differences in mean proportions were evaluated relative to an estimated standard error (Appendix 2). All tests accounted for inherent dependence because several methods were evaluated in the same patients. Effects were considered statistically significant if P values (two-sided) were less than .05 (with no adjustments for multiple comparisons).
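
For two paired proportions, the McNemar χ² statistic depends only on the discordant counts. The sketch below shows the standard two-method form of the test; the function name, the counts, and the optional continuity correction are assumptions for illustration, because the text does not state whether a correction was applied or how the test was extended to more than two proportions.

```python
import math


def mcnemar_test(b, c, continuity_correction=False):
    """McNemar chi-square test for paired binary outcomes.

    b = cases correct with method 1 but not method 2,
    c = cases correct with method 2 but not method 1.
    Returns (statistic, two-sided P value) on 1 degree of freedom.
    """
    n_discordant = b + c
    if n_discordant == 0:
        return 0.0, 1.0
    difference = abs(b - c)
    if continuity_correction:
        difference = max(difference - 1, 0)
    statistic = difference ** 2 / n_discordant
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical discordant counts for two interpretative methods.
print(mcnemar_test(15, 5))
```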

Receiver operating characteristic analysis.—In addition, receiver operating characteristic (ROC) analyses were performed on the data from the three direct interpretative methods to analyze the confidence level scores without dichotomizing the observer responses. Separate analyses were carried out for each observer and each method, but only pooled (across observers) data curves are presented, as conclusions from the others were similar to those from the traditional sensitivity and specificity analyses. A ROC analysis could not be performed for the replicated reading methods, as the data for this interpretative form were obtained by dichotomizing the original confidence level scores.

ROC curves were constructed by using ROCFIT (Metz CE, Shen JH, Wang PL, Kronman HB, Fortran program ROCFIT, Department of Radiology, University of Chicago, Ill, 1994) and CORROC2 (Metz CE, Shen JH, Kronman HB, Wang PL, Fortran program CORROC2, Department of Radiology, University of Chicago, Ill, 1994) software (14,15) for fitting binormal models to rating data. ROC curves were constructed for each reader or reader pair and for each of the first three direct interpretative methods. When comparing two interpretative methods, CORROC2 software was used, with recognition that data from the same 147 patients were being studied. To construct a summary curve, by summarizing over readers, we used the ad hoc method of simply pooling all readings together as if they were single readings in different patients rather than multiple readings in the same patients. No significance testing was done with these summary curves.

ROC analyses were performed by using the overall classification of patient conditions as abnormal or normal (rather than abnormalities in separate forms of disease). The overall confidence level score (rating) of a patient condition along the confidence level spectrum from abnormal to normal was defined as the highest of the scores for the three forms of disease.
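
The overall rating rule and the way rating data yield ROC operating points can be illustrated with a short sketch. Note that this example computes empirical operating points and a trapezoidal area, which is a simpler substitute for the binormal fitting performed with ROCFIT and CORROC2; all names and data are hypothetical.

```python
def overall_rating(extranodal, nodal, ascites):
    """Overall patient rating = highest of the three form-specific scores."""
    return max(extranodal, nodal, ascites)


def empirical_roc(ratings, truth, thresholds=(4, 3, 2, 1)):
    """Return (FPR, TPR) points obtained by calling 'abnormal' when rating >= t."""
    n_pos = sum(1 for t in truth if t)
    n_neg = len(truth) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(thresholds), reverse=True):
        tp = sum(1 for r, d in zip(ratings, truth) if r >= threshold and d)
        fp = sum(1 for r, d in zip(ratings, truth) if r >= threshold and not d)
        points.append((fp / n_neg, tp / n_pos))
    points.append((1.0, 1.0))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the operating points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical overall ratings for four diseased and four disease-free patients.
ratings = [4, 3, 2, 0, 1, 2, 0, 0]
truth = [True, True, True, True, False, False, False, False]
points = empirical_roc(ratings, truth)
print(points, trapezoid_auc(points))
```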

To compare curves, three summary characteristics were considered: area under the curve, true-positive rate at a false-positive rate of 0.1, and false-positive rate at a true-positive rate of 0.9. When comparing characteristics of two curves, significance testing was done by comparing the difference between characteristics with an estimated standard error. When comparing characteristics of three (or four) curves simultaneously, we treated the sum of squared deviations from the mean characteristic divided by a pooled variance estimate as a χ² with 2 (or 3) degrees of freedom. Similar significance testing methods were used for comparing true-positive rates and false-positive rates across readers or across methods. Effects were considered statistically significant if P values were less than .05.
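
The two significance tests described in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the study's computation: the pooled variance is taken here as the simple mean of the per-curve variance estimates (the text does not specify how it was formed), and all function names and input values are hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability of a chi-square variable for df = 1, 2, or 3."""
    if x < 0:
        raise ValueError("chi-square statistic cannot be negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2.0))
                + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))
    raise ValueError("only df = 1, 2, or 3 are implemented in this sketch")


def compare_two(value_1, value_2, se_difference):
    """Difference between two characteristics relative to its standard error."""
    z = (value_1 - value_2) / se_difference
    return z, math.erfc(abs(z) / math.sqrt(2.0))  # two-sided P value


def homogeneity_test(values, variances):
    """Sum of squared deviations from the mean over a pooled variance.

    Assumption: pooled variance = mean of the per-curve variance estimates.
    Referred to a chi-square with (number of curves - 1) degrees of freedom.
    """
    k = len(values)
    mean_value = sum(values) / k
    pooled_variance = sum(variances) / k
    statistic = sum((v - mean_value) ** 2 for v in values) / pooled_variance
    return statistic, chi2_upper_tail(statistic, k - 1)


# Hypothetical areas under three curves with their variance estimates.
print(homogeneity_test([0.95, 0.93, 0.96], [0.0004, 0.0005, 0.0003]))
```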

Analysis of the time to read images.—Each observer's mean time to read images from an examination for each of the four methods of interpretation was compared to those of the other observers. The mean times to read images from examinations for all observers combined within each of the various reading methods were also compared. The replicated reading session times are the sum of the mean times to read for each single observer that were artificially paired to create the various replicated readings. Some of the time data are not available because the data collection form was simplified during the first reading session; therefore, 60% (294 of 490) of the image interpretation times from reading session 1 were disregarded. Furthermore, the time to read was not recorded by the observers on some occasions.


    RESULTS
Modeling the Confidence Level Scores
The modeling of the confidence level scores (Appendix 1) led to the following general conclusions, which are consistent among the three forms of disease (details omitted). There was a slight increase in confidence level scores from reading session 1 to reading session 2 and a further increase at reading session 3 (ie, there was a trend of observers perceiving more abnormal findings in the later reading sessions), but these effects were not statistically significant. Consequently, no attempts at adjustments for reading sessions were made in further analyses. There were scoring differences among readers; some tended to give higher scores than others, but generally in a consistent way across methods; this is addressed further below.

When comparing confidence level scores among the three direct interpretive methods for each reader, no statistically significant differences were found. Other comparisons of the methods are considered below.

Intraobserver and Interobserver Agreement
Table 2 summarizes the two κ analyses of intraobserver and interobserver agreement for recognition of each of the three forms of disease and for categorization of patient conditions as normal or abnormal for the three direct interpretative methods. There was very good intraobserver agreement for categorization of patient conditions with respect to extranodal tumor masses (mean weighted κ statistic of 0.79) and ascites (mean weighted κ statistic of 0.88). There was less intraobserver agreement among observers for recognition of nodal disease; the mean weighted κ statistic of 0.49 reflects a moderate or fair level of mean intraobserver agreement. Generally, the intraobserver agreement was greater than interobserver agreement for the three possible forms of disease.


TABLE 2. κ Analysis of Intraobserver and Interobserver Agreement for the Three Forms of Disease and for Categorization of Normal and Abnormal CT Studies
 
The κ analysis for intraobserver and interobserver agreement for the dichotomized confidence level scores (disease present or absent) yielded a mean intraobserver κ of 0.69. This was similar to the mean interobserver agreements for each of the three direct interpretative methods, which ranged from 0.64 to 0.75.

Sensitivity and Specificity
The means and ranges across observers for sensitivity and specificity for the various interpretative methods are shown in Table 3. There were statistically significant differences among observers for sensitivity (range, 87%–97%, P < .01) and specificity (range, 65%–92%, P < .01), both for single reading and reading with the checklist, but there were no statistically significant differences among the first three methods for any particular observer. The largest differences in sensitivity and specificity between single readers for classification of patient conditions as abnormal or normal were for observers 1 and 3: Sensitivities for these observers were 97% and 87%, respectively, and specificities were 67% and 92%, respectively.


TABLE 3. Sensitivity and Specificity for Each Interpretative Technique
 
When sensitivity or specificity was averaged over observers, neither single-observer reading with a checklist nor paired-observer reading yielded a statistically significant improvement over single-observer interpretation. There was a statistically significant improvement in mean specificity (84.7% vs 79.1%) for the replicated reading method when compared with the single-observer reading. However, there was no statistically significant improvement in the mean sensitivity (93.9% vs 93.1%).

We estimate (by using information in Table 3 [details omitted]) that the power to detect a 5% improvement in sensitivity in any of the alternative reading methods compared with the single observer method was 90% or more; the power to detect a 3% improvement was around 50%. We estimate that the power to detect a 17% improvement in specificity in any of the alternative reading methods compared with the single-observer method was 90% or more; the power to detect a 10% improvement was around 50%.

ROC Analysis
Figure 3 presents the ROC analyses of abnormal versus normal patient condition assessment for each observer by using each of the three direct interpretative methods; ROC curves of reading method 1 (single observer) were compared with those of reading method 2 (checklist) and with reading method 3 (paired reading). There were no statistically significant differences between these curves or in the areas under the curves for any reader when comparing method 1 with method 2 or method 1 with method 3.



Figure 3. ROC curves and areas under the curves (AUC) for overall patient CT study evaluation (abnormal vs normal) for each observer in comparisons of single observer with single observer plus checklist and of single observer with paired observers. (a, b) Observer 1. (c, d) Observer 2. (e, f) Observer 3. (g, h) Observer 4. FPR = false-positive rate, TPR = true-positive rate.

 
Table 4 presents the true-positive rate at a false-positive rate of 0.1, and Table 5 presents the false-positive rate at a true-positive rate of 0.9 for each observer by using each reading method. These were determined from fitted binormal ROC curves. By using these measures, the checklist was generally unhelpful or detrimental. Pairing observers 1 and 2 had almost no effect, but pairing observers 3 and 4 resulted in false-positive rates and true-positive rates that were, approximately, means of the rates for each observer. This improved the performance of observer 3 at the expense of that of observer 4.


 
TABLE 4. ROC Analysis: True-Positive Rates for Each Observer and Reading Method at a False-Positive Rate of 0.1
 

 
TABLE 5. ROC Analysis: False-Positive Rates for Each Observer and Reading Method at a True-Positive Rate of 0.9
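
The operating points in Tables 4 and 5 are read off fitted binormal ROC curves. The following is a minimal sketch of how such points can be computed, assuming the conventional two-parameter binormal form TPR = Φ(a + b·Φ⁻¹(FPR)). The function names and the values of `a` and `b` are hypothetical and are not fitted values from this study.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def tpr_at_fpr(a, b, fpr):
    """True-positive rate on a binormal ROC curve at a given false-positive rate."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpr))


def fpr_at_tpr(a, b, tpr):
    """False-positive rate on the same curve at a given true-positive rate."""
    return _STD_NORMAL.cdf((_STD_NORMAL.inv_cdf(tpr) - a) / b)


def auc(a, b):
    """Area under a binormal ROC curve with intercept a and slope b."""
    return _STD_NORMAL.cdf(a / math.sqrt(1.0 + b * b))


# Hypothetical binormal parameters, chosen only for illustration.
a, b = 2.0, 1.0
print(f"TPR at FPR = 0.1: {tpr_at_fpr(a, b, 0.1):.3f}")
print(f"FPR at TPR = 0.9: {fpr_at_tpr(a, b, 0.9):.3f}")
print(f"AUC: {auc(a, b):.3f}")
```

With a = 0 and b = 1 the curve reduces to the chance diagonal (TPR equals FPR, area 0.5), which is a convenient sanity check on the formulas.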
 
Figure 4 is the overall ROC analysis of abnormal versus normal patient condition assessments for the three direct interpretative methods constructed by pooling all readings by using each interpretative method. The curves are very similar; if anything, the single-observer curve appears best. No significance testing was feasible here, due to the pooling across readers. However, no statistically significant differences were found when analyzed reader by reader (Fig 3).



 
Figure 4. ROC curves and areas under the curve (AUC) for categorizing abnormal and normal patient CT studies for the three interpretative methods. These curves were constructed by pooling all readings for each method. FPR = false-positive rate, TPR = true-positive rate.

 
Analysis of the Time to Read Images
The mean time to read images from an examination for each of the interpretative methods was as follows: single observer, 3.44 minutes (range, 2.35–4.89 minutes); checklist, 3.99 minutes (range, 3.54–4.61 minutes); paired observers, 5.07 minutes (range, 3.12–6.99 minutes); and replicated readings, 6.53 minutes (range, 5.04–8.02 minutes). This last time estimate does not include the additional time commitment of the third independent reader when the pseudoarbitration method was applied. There were statistically significant differences in these times between all observers within and among reading methods. As expected, on average it was more time-consuming to read with a checklist or as pairs of readers and, because of the design, for replicated readings.


    DISCUSSION
 
Other investigators have also addressed various aspects of radiologic diagnostic efficacy as they relate to interobserver agreement, sensitivity, and specificity, and some (5,16) have evaluated methods to improve on these measures of observer performance. In a previous study (5), investigators assessed interobserver reliability for the interpretation of CT scans of advanced ovarian cancer. These authors did not assess the effect of improving CT techniques or other approaches for resolving discrepancies. We found somewhat lower but nevertheless striking levels of disagreement between various pairs of independent observers (up to 17% [25 of 147]) for the presence of a mass. It is of interest that interobserver agreement has also been investigated in a wide variety of clinical situations (17,18), with similarly striking variability among individual observers.

Other investigators (1,3,4) have shown a wide spectrum of observer sensitivities and specificities for other radiologic examinations. To our knowledge, much of the previous literature on this topic has been in the area of mammography. In one review (3) of 10 radiologists' independent mammographic interpretations in 150 patients (for a mixture of screening and diagnostic studies), there was a median sensitivity for a diagnosis of "suggestive of cancer" of 70% with a range of 37%–85% and a median specificity of 93% with a range of 85%–99%. Furthermore, in 33% of patients recommended for additional work-up in this study (3), at least one of the 10 radiologists differed with regard to the side of the possible abnormality. In a different project (4) in which the variability in breast cancer detection was assessed for a group of 108 radiologists reviewing images from 79 screening mammographic examinations, a range of sensitivity of 47%–100% and a range of specificity of 36%–99% were found. In our study, the results were less variable than the aforementioned mammographic assessments; the individual reader sensitivities ranged from 87% to 97% and the individual reader specificities ranged from 67% to 92% for classification of patient conditions as abnormal or normal.

Reproducibility of interpretation results is an important feature of diagnostic efficacy. We used the weighted κ statistic for analysis of intraobserver and interobserver agreements or reliabilities for our confidence level scoring system. The critical issue in the comparison of these weighted κ values relates to determining which methods of interpretation will yield the highest level of interobserver agreement. However, neither using the checklist when interpreting studies nor the paired readings yielded an increase in mean interobserver agreement over that for the standard independent interpretations for our experienced readers. Given the lack of improvement in mean sensitivity and specificity for the three direct interpretative methods in our study, a more detailed discussion of the differences in interobserver agreement among those methods becomes moot. We did not assess interobserver agreement for the replicated reading method because by design this method forces interpretations to be in greater agreement.
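
For readers unfamiliar with the weighted κ statistic, the following is a minimal sketch of one common form of it. It assumes linear disagreement weights and ratings coded as integers 0 to k-1; the weighting scheme, the function name, and the five-level toy ratings are illustrative assumptions, not details taken from this study.

```python
def weighted_kappa(ratings_a, ratings_b, k):
    """Weighted kappa for two raters scoring the same cases on k ordered levels.

    Assumes linear disagreement weights |i - j| / (k - 1); other weightings
    (for example quadratic) are also in common use.
    """
    n = len(ratings_a)
    # Observed joint proportions of (rater A level, rater B level).
    observed = [[0.0] * k for _ in range(k)]
    for i, j in zip(ratings_a, ratings_b):
        observed[i][j] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed disagreement and weighted chance-expected disagreement.
    obs_dis = sum(abs(i - j) / (k - 1) * observed[i][j]
                  for i in range(k) for j in range(k))
    exp_dis = sum(abs(i - j) / (k - 1) * row[i] * col[j]
                  for i in range(k) for j in range(k))
    if exp_dis == 0:
        return float("nan")  # undefined when both raters use a single level
    return 1.0 - obs_dis / exp_dis


# Hypothetical confidence scores on a five-level scale (0-4) for six cases.
reader_1 = [0, 1, 2, 3, 4, 4]
reader_2 = [0, 1, 2, 4, 4, 3]
print(round(weighted_kappa(reader_1, reader_2, 5), 3))
```

A value of 1 indicates perfect agreement and a value near 0 indicates agreement no better than expected from the marginal rating frequencies.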

Several previous investigators (10–12,19–31) have addressed methods ranging from simple to more complex to improve individual observer performances. These methods have included use of various checklists; simultaneous pairing of observers; sequential pairing of observers, with the second independent observer either aware or unaware of the first interpretation; replicated readings; group consensus; group consultation followed by independent diagnoses; the Delphi technique; computerized noninteractive consultation; mathematic combinations of different interpretations; computerized sequential decision making; and computer-assisted interpretations (10–12,19–31). We chose to emphasize three of these methods to potentially improve the single-observer performance of CT interpretation in light of the improved observer performance previously reported with these methods (6,10,12,21,22).

A checklist worksheet has been suggested as a method to improve diagnostic accuracy for body CT examinations (10). It is of interest that, on the basis of our ROC analysis, three of our four readers had poorer results when using a checklist than they did with their independent readings.

Furthermore, on the basis of the ROC analysis, pairing our observers had variable effects. There was no benefit to pairing observers 1 and 2. Pairing observers 3 and 4 yielded false-positive rates and true-positive rates that were approximately the means of the two observers' individual rates, with the performance of observer 3 improving at the expense of that of observer 4. The net result was no significant improvement in overall mean sensitivity or specificity. Representative examples of false-negative and false-positive diagnoses in our study are illustrated in Figures 5 and 6.



 
Figure 5. False-negative diagnosis. CT image shows a tumor implant (arrow) at the psoas margin that was unrecognized at six of 10 reviews.

 


 
Figure 6. False-positive diagnosis. Pelvic CT image shows a mildly asymmetric vaginal cuff (arrow), which was mistaken for a tumor implant at three of 10 reviews. A vaginal tampon (arrowhead) is present on the right.

 
The only interpretative method in our study that yielded an improvement in mean observer performance (via an improved mean specificity) was the replicated reading method. In a prior study (12) of 100 randomly selected chest radiographs independently interpreted by eight radiologists, various pairings of the independent interpretations were examined to produce independent replicated "double readings." The pseudoarbitration method was used as a third independent opinion to settle disputes that arose with the reader pairings; this yielded a 37% improvement in accuracy (12).

An earlier investigation compared single readings of chest photofluorograms obtained for tuberculosis screening with a combination of two separate interpretations (replicated readings) to improve accuracy (20). In those cases where the combined single readings differed, a third opinion was used, with an additional 10% improvement in sensitivity when compared with the sensitivity of individual interpretations and without a change in specificity (20).

Sequential double reading, where the second observer often had knowledge of the original interpretation (21), and independent sequential double readings (22) of screening mammograms have yielded increases in cancer detection sensitivity of 10%–15%, with variable effects on specificity. However, using data from their previous mammographic study (4), Beam et al (23) examined the effects of a form of independent double reading and concluded that radiologists may form complementary or noncomplementary pairs. The average radiologist in their study (23) had an increase in the true-positive rate of 0.11 accompanied by an increase in the false-positive rate of 0.07. As in our study, some observer pairings in their study resulted in no change or in small changes in sensitivities and specificities.

The accuracy of radiologic examinations for recognizing abnormalities requires the identification and proper interpretation of various examination findings. Some studies of observer performance have purposely focused on interpretation by identifying the radiologic abnormality for the observers (24,28). In addition, the cognitive methods used to influence observer performance have had varied levels of complexity. We focused on three of the simpler interpretative methods (checklists, paired observers, and replicated readings) supplementary to the standard single-observer approach and used experienced observers who were required to both identify and interpret abnormal findings. Contrary to what might be inferred from most of the published literature, these approaches are not always successful in improving the performance of experienced observers.

There is also a potential cost in time that may occur with any supplemental interpretative methods. It should be noted that in the context of a study such as this, observers' times to read do not incorporate the time involved in developing a protocol or monitoring CT examinations, patient preparation, review of clinical data or prior studies, report dictation, and review or consultation with clinicians. In our investigation, the greatest cost in terms of physician time occurred with paired and replicated readings. In the absence of documented improvements in sensitivity and/or specificity, the cost of paired readings is not justified in this patient population for our readers. Replicated readings, or some modification thereof, may be more justifiable.

The limitations of studies such as ours are well outlined in the project of Elmore et al (3), who assessed variability of mammographic interpretations. These limitations include the effects of a study situation, where radiologists may be more (or less) diligent in their examination reviews than they are in daily practice and where the images from previous examinations are unavailable for comparison (3).

Another limitation of our study was that each disease site (upper vs lower abdomen or pelvis) was not documented with surgery in every case. As our focus was on observers categorizing patients with any evidence of ovarian cancer on CT scans versus patients with no disease on CT scans, the effect of lack of surgical confirmation at all anatomic sites was minimized. We believe our combined verification criteria listed in Materials and Methods were sufficiently accurate standards of reference for the CT findings to enable us to address the objectives of this study.

It is also important to note that these results may apply only to our four readers for this particular indication for CT examination. In addition, the results might have differed if only the more difficult cases had been addressed. Furthermore, a variety of CT scanning protocols were used in our patient population, which may have influenced our results. Continued refinements and standardization of CT protocols should serve to minimize observer variability and improve observer performance.

Most research in radiology focuses on applications of technologic improvements in complex imaging modalities including CT, MR imaging, and ultrasonography. Less effort has been devoted to improving the methods of image interpretation on the part of the radiologist, the human element in the process. Methods to improve interobserver reliability and observer performance for complex examinations, such as CT, may improve on delays in diagnosis, which contribute to delays in hospital discharge and lead to additional costly diagnostic testing. Independent replicated readings slightly increased the mean specificity of the CT interpretations, but this was the most time-consuming interpretative method.

Diagnostic aids such as anatomic checklists and paired simultaneous readings did not lead to improved mean interobserver agreement, sensitivity, or specificity for experienced readers in our study. Potential benefits of various forms of paired reading and other diagnostic aids must be weighed carefully against the increased physician time commitment. While some methods may assist less experienced readers or even experienced individuals for certain possible diseases or radiologic examinations, their universal application may not be efficacious. Certain interpretative methods, such as the integration of independent readings to form a noninteractive replicated reading, may be helpful for improving observer performance for experienced readers.


    Appendix 1
 
For each of the three forms of disease (with each form analyzed separately), there were 1,470 confidence level scores (10 observer-group combinations in each of three reading sessions, with 49 patients per group [Fig 1]). Each score was represented as a base mean score for single-observer readings of normal CT studies in the first reading session, plus linear and quadratic effects for the true number of areas with abnormalities in patients with abnormalities, plus an effect of deviation from the single-observer method for checklist or for paired-reading methods, plus linear and quadratic effects for second and third reading sessions, plus a random patient effect for each of the 147 patients, plus a random observer effect for each of the 10 observer-method combinations, plus a random error.

Differing error variances for differing methods and normal-versus-diseased conditions were allowed, as were correlations between random effects involving the same observers. The 27 parameters (seven fixed effects, 11 variances of random effects, three correlation coefficients, and six error variances) were fitted by using mixed-effect linear-model methods, and then submodels were fitted after the finding of many nonsignificant parameters.
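
As a reading aid, the model described above can be written schematically as follows. The symbols and index layout are notation assumed here for illustration and are not the authors' own; coding the reading session as s = 0, 1, 2 is likewise an assumption.

```latex
% Schematic form of the Appendix 1 model; all symbols are assumed notation.
% y_{pos}: confidence score for patient p, observer-method combination o, session s
% a_p:     true number of areas with abnormalities in patient p (0 if normal)
% m(o):    reading method used in observer-method combination o
\[
  y_{pos} = \mu + \beta_1 a_p + \beta_2 a_p^2
          + \delta_{m(o)} + \gamma_1 s + \gamma_2 s^2
          + u_p + w_o + \varepsilon_{pos}
\]
% \mu:                base mean (single observer, normal study, first session)
% \delta_{m(o)}:      0 for single observer, \delta_C for checklist, \delta_P for paired reading
% u_p, w_o:           random patient and random observer-method effects
% \varepsilon_{pos}:  random error, with variance allowed to differ by method and disease status
```

Under this notation the seven fixed effects are μ, β₁, β₂, δ_C, δ_P, γ₁, and γ₂, which agrees with the parameter count given in the text.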

The true number of areas with abnormalities and the random patient effects removed a major part of the variability of the scores, which enabled inference about the effects of interest—in particular, about the reading methods.


    Appendix 2
 
  There were four reading methods evaluated (Table 3), with four versions (readers) for the single-observer method, four versions for the checklist method, two versions for the paired-observer method, and 12 versions for the replicated reading method (six different pairings, each with two possible tiebreakers). We refer to these 22 versions as "systems." Of special interest are the mean sensitivity and mean specificity of each of the four methods (with averaging over the versions) and the differences between two such mean sensitivities or mean specificities. Other comparisons may be treated similarly.
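
The count of 22 systems can be checked by enumeration. The sketch below uses hypothetical labels; the only facts taken from the text are the four readers, the two simultaneous pairings (observers 1 and 2, observers 3 and 4), and the rule that each replicated pairing has two possible tiebreakers.

```python
from itertools import combinations

readers = [1, 2, 3, 4]

single = [("single", r) for r in readers]
checklist = [("checklist", r) for r in readers]
# The two simultaneous pairings described in the Results.
paired = [("paired", (1, 2)), ("paired", (3, 4))]
# Replicated readings: each pair of independent readers, with either of the
# two remaining readers acting as the tiebreaker.
replicated = [
    ("replicated", pair, tiebreaker)
    for pair in combinations(readers, 2)
    for tiebreaker in readers
    if tiebreaker not in pair
]

systems = single + checklist + paired + replicated
print(len(replicated), len(systems))
```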

To assess precision, a standard error is needed. Since all systems were evaluated in the same 98 patients with disease or the same 49 healthy persons, dependencies need to be accounted for. We illustrate with consideration of the difference between mean paired-observer sensitivity and mean single-observer sensitivity in six of the 22 systems.

We labeled the two versions of paired observers as systems 1 and 2 and the four versions of single observers as systems 3–6. We let $p_i$ represent the proportion of the 98 patients with disease classified correctly with system $i$, and, for later use, we let $p_{ij}$ represent the proportion classified correctly with both system $i$ and system $j$. (Note that $p_i = p_{ii}$.) Then, interest centered on the difference $d$ in mean sensitivities: $d = (p_1 + p_2)/2 - (p_3 + p_4 + p_5 + p_6)/4$. Writing this as the inner product of two vectors, we had $d = c'p$, with $c$ having elements 1/2, 1/2, -1/4, -1/4, -1/4, and -1/4 and with $p$ having elements $p_1$ to $p_6$.

We let $V$ represent the $6 \times 6$ matrix of variances and covariances, with $(i,j)$-element $v_{ij} = \mathrm{cov}(p_i, p_j)$. Then, the variance of the difference $d$ was $c'Vc$ (the double summation $\sum_i \sum_j c_i c_j v_{ij}$). Moreover, the variance $v_{ii}$ of $p_i$ was estimated by using the familiar $p_i(1 - p_i)/n$, while the covariance $v_{ij}$ was $(p_{ij} - p_i p_j)/n$; here $n$ is 98. This enabled estimation of the variance of $d$, and the desired standard error was the square root thereof. Finally, $d$ was evaluated relative to its standard error by reference to a standard normal distribution. (Notice that this method uses two $22 \times 22$ matrices of proportions $p_{ij}$, one for sensitivities and one for specificities.)
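
The calculation just described can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' software: the function name is hypothetical, the input is assumed to be one row of 0/1 "classified correctly" flags per patient, and the four-patient, two-system data set is a toy example chosen so the arithmetic is easy to follow.

```python
import math


def contrast_test(correct, c):
    """Test a linear contrast of correlated proportions.

    correct: one row per patient, each row a list of 0/1 flags (1 = classified
             correctly), one flag per system, all systems read on the same patients.
    c:       contrast weights, one per system.
    Returns (d, standard error, z, two-sided P value from a standard normal).
    """
    n = len(correct)
    k = len(c)
    # p[i]: proportion classified correctly with system i.
    p = [sum(row[i] for row in correct) / n for i in range(k)]
    # pij[i][j]: proportion classified correctly with both systems i and j.
    pij = [[sum(row[i] * row[j] for row in correct) / n for j in range(k)]
           for i in range(k)]
    # v[i][j] = (p_ij - p_i p_j) / n; on the diagonal this is p_i (1 - p_i) / n.
    v = [[(pij[i][j] - p[i] * p[j]) / n for j in range(k)] for i in range(k)]
    d = sum(c[i] * p[i] for i in range(k))
    var_d = sum(c[i] * c[j] * v[i][j] for i in range(k) for j in range(k))
    if var_d <= 0:
        return d, 0.0, float("nan"), float("nan")
    se = math.sqrt(var_d)
    z = d / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return d, se, z, p_value


# Toy data: 4 patients, 2 systems, contrast = system 1 minus system 2.
# For the comparison in the text, each row would hold six flags (n = 98 rows)
# and c would be [1/2, 1/2, -1/4, -1/4, -1/4, -1/4].
toy = [[1, 1], [1, 0], [1, 0], [0, 0]]
d, se, z, p_value = contrast_test(toy, [1.0, -1.0])
print(d, se, z, round(p_value, 4))
```

Because every system is read on the same patients, the covariance terms matter: treating the proportions as independent would drop the off-diagonal elements and misstate the standard error.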

To evaluate differences among several means simultaneously, an appropriate quadratic form is constructed and evaluated against a $\chi^2$ distribution.


    Acknowledgments
 
We thank Margaret Kowaluk, BS, for secretarial assistance in the preparation of this manuscript and Roberta Montanaro for clerical assistance.


    Footnotes
 
2 Current addresses: Elmendorf Hospital, Alaska.

3 Dept of Obstetrics and Gynecology, New York University Medical School, NY.

4 Dept of Obstetrics and Gynecology, Cooper Health System, Camden, NJ.

5 Dept of Health Care Policy, Harvard Medical School and Dept of Radiology, Brigham and Women's Hospital, Boston, Mass.

6 Center for Biostatistics in AIDS Research, Harvard School of Public Health, Boston, Mass.

Author contributions: Guarantor of integrity of entire study, P.J.F.; study concepts, P.J.F., C.V.J.; study design, P.J.F., D.E.S., W.J.H.; definition of intellectual content, P.J.F., W.J.H.; literature research, P.J.F., W.J.H., K.H.Z.; clinical studies, P.J.F., C.V.J., R.G., D.R., S.M.S.T., S.M., C.A., G.D.P., D.P.W.; data acquisition, P.J.F., C.V.J., R.G., D.R., S.M.S.T., S.M., C.A., G.D.P., D.P.W.; data analysis, P.J.F., W.J.H., K.H.Z., C.V.J.; statistical analysis, W.J.H., K.H.Z.; manuscript preparation, P.J.F., W.J.H., K.H.Z., D.E.S.; manuscript editing, P.J.F., C.V.J., W.J.H., R.G., S.M.S.T., G.D.P., D.P.W., K.H.Z., D.E.S.; manuscript review, all authors


    References
 

  1. Birkelo CC, Chamberlain WE, Phelps PS, Schools PE, Zacks D, Yerushalmy J. Tuberculosis case findings: a comparison of the effectiveness of various roentgenographic and photofluorographic methods. JAMA 1947; 133:359-366.
  2. Muhm JR, Miller WE, Fontana RS, et al. Lung cancer detected during a screening program using four-month chest radiographs. Radiology 1983; 148:609-615.
  3. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994; 331:1493-1499.
  4. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by U.S. radiologists. Arch Intern Med 1996; 156:209-213.
  5. Warde P, Rideout DF, Herman S, et al. Computed tomography in advanced ovarian cancer: inter- and intraobserver reliability. Invest Radiol 1986; 21:31-33.
  6. Wakeley CJ, Jones AM, Kabala JE, Prince D, Goddard PR. Audit of the value of double reading magnetic resonance imaging films. Br J Radiol 1995; 68:358-360.
  7. Landis SH, Murray T, Bolden S, Wingo PA. Cancer statistics, 1998. Cancer 1998; 47:6-29.
  8. Forstner R, Hricak H, White S. CT and MRI of ovarian cancer. Abdom Imaging 1995; 20:2-8.
  9. Nelson RC, Frederick MG. Peritoneal carcinomatosis. Abdom Imaging 1995; 20:56-57.
  10. Kinard RE, Orrison WW, Brogdon BG. The value of a worksheet in reporting body-CT examinations. AJR 1986; 147:848-849.
  11. Metz CE, Shen JH. Gains in accuracy from replicated readings of diagnostic images: prediction and assessment in terms of ROC analysis. Med Decis Making 1992; 12:60-75.
  12. Hessel SJ, Herman PG, Swensson RG. Improving performance by multiple interpretations of chest radiographs: effectiveness and cost. Radiology 1978; 127:589-594.
  13. Fleiss JL. Statistical methods for rates and proportions. 2nd ed. New York, NY: Wiley, 1981.
  14. Dorfman DD, Alf E. Maximum-likelihood estimation of parameters of method data. J Math Psych 1969; 6:487-496.
  15. Metz CE, Wang PL, Kronman HB. A new approach for testing the significance of differences between ROC curves measured from correlated data. In: Proceedings of the Eighth Information Processing in Medical Imaging Conference. The Hague, the Netherlands: Nijhoff, 1984; 432-454.
  16. Webb RW, Sarin M, Zerhouni EA, Heelan RT, Glazer GM, Gatsonis C. Interobserver variability in CT and MR staging of lung cancer. J Comput Assist Tomogr 1993; 17:841-846.
  17. Koran LM. The reliability of clinical methods, data and judgments. I. N Engl J Med 1975; 293:642-648.
  18. Koran LM. The reliability of clinical methods, data and judgments. II. N Engl J Med 1975; 293:695-701.
  19. Markus JB, Somers S, O'Mally BP, Stevenson GW. Double-contrast barium enema studies: effect of multiple reading on perception error. Radiology 1990; 175:155-156.
  20. Yerushalmy J, Harkness JT, Cope JH, Kennedy BR. The role of dual reading in mass radiography. Am Rev Tuberc 1950; 61:443-463.
  21. Anderson ED, Muir BB, Walsh JS, Kirkpatrick AE. The efficacy of double reading mammograms in breast screening. Clin Radiol 1994; 49:248-251.
  22. Thurfjell EL, Lernevall KA, Taube AA. Benefit of independent double reading in a population-based mammography screening program. Radiology 1994; 191:241-244.
  23. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol 1996; 3:891-897.
  24. Hillman BJ, Swensson RG, Hessel SJ, Gerson DE, Herman PG. The value of consultation among radiologists. AJR 1976; 127:807-809.
  25. Hillman BJ, Hessel SJ, Swensson RG, Herman PG. Improving diagnostic accuracy: a comparison of interactive and Delphi consultations. Invest Radiol 1977; 12:112-115.
  26. Curtis PB, Ferrell WR, Hillman BJ. Improved imaging diagnosis by sequentially combined confidence judgments. Invest Radiol 1988; 23:342-347.
  27. Ferrell WR, Hillman BJ, Brewer ML, Amendola MA, Thornbury JR. Interactive, mathematical, and sequential consultative methods in diagnosing renal masses on excretory urograms. Invest Radiol 1989; 24:456-462.
  28. Getty DJ, Pickett RM, D'Orsi CJ, Swets JA. Enhanced interpretation of diagnostic images. Invest Radiol 1988; 23:240-252.
  29. Swets JA, Getty DJ, Pickett RM, D'Orsi CJ, Seltzer SE, McNeil BJ. Enhancing and evaluating diagnostic accuracy. Med Decis Making 1991; 11:9-18.
  30. Baker JA, Kornguth PJ, Lo JY, Floyd CE, Jr. Artificial neural network: improving the quality of breast biopsy recommendations. Radiology 1996; 198:131-135.
  31. Seltzer SE, Getty DJ, Tempany CM, et al. Staging prostate cancer with MR imaging: a combined radiologist-computer system. Radiology 1997; 202:219-226.


