Special Reports
1 From Department of Radiology, Imaging Research, Suite 4200, University of Pittsburgh, 300 Halket St, Pittsburgh, PA 15213-3180. Supported by grant CA84507 from the National Cancer Institute, National Institutes of Health. Received June 14, 2002; revision requested August 8; revision received September 6; accepted October 21. Address correspondence to (e-mail: gurd@msx.upmc.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: A multiobserver multiabnormality receiver operating characteristic (ROC) study was conducted to assess the effect of prevalence on observer performance. Fourteen observers, including eight faculty members, two fellows, and four residents, interpreted 1,632 posteroanterior chest images at five prevalence levels by using a nested study design. Performance comparisons were accomplished by using a multireader multicase approach to assess the effect of prevalence from 28% (69 of 249) to 2% (31 of 1,577) on diagnostic accuracy. The mean times required to review and report a case were analyzed and compared for different levels of prevalence and readers' experience.
RESULTS: Area under the ROC curve demonstrated that, under the experimental conditions of the study, no significant effect could be measured as a function of prevalence (P > .05) for any abnormality, group of cases, or readers. There were no significant differences (P > .05) in the mean times required to review and report cases at different prevalence levels and with different groups of readers.
CONCLUSION: The consistency in the results and the size of this study suggest that with laboratory conditions, if a prevalence effect exists, it is quite small in magnitude; hence, it will not likely alter conclusions derived from such studies.
© RSNA, 2003
Index terms: Diagnostic radiology, observer performance; Receiver operating characteristic (ROC) curve; Statistical analysis
| INTRODUCTION |
|---|
The actual prevalence of a given abnormality in the clinical setting may vary considerably depending on the particular practice, the demographics of the population served, and the type of procedures being reviewed, such as screening versus diagnostic examinations. For practical reasons of study efficiency, in most laboratory experiments highly selected and enriched sets of difficult positive and negative cases are used; for example, the fraction of actually positive subtle cases and difficult negative cases is substantially higher than that seen in the clinical environment.
The effect of prevalence on observer performance (generally referred to as the "prevalence effect") has not been well studied. Although it is often cited as a potential bias and a limitation to generalizability in many studies, there is little evidence as to its existence or magnitude with respect to either detection or classification tasks (2,8–10). In theory, barring observer behavioral effects, the results of ROC analysis should be independent of disease prevalence. However, both limited experimental data from studies in which this issue is explored and data from retrospective reviews of detection rates in screening environments suggest that there may be a measurable and potentially substantial effect that must be taken into account (11–15).
Despite the fundamental nature of this issue and the possible existence of a prevalence effect, no experimental data exist in which the same group of experienced readers interpreted large sets of cases over a wide range of prevalence levels. Hence, the purpose of our study was to measure observer performance at various levels of prevalence.
| MATERIALS AND METHODS |
|---|
The total numbers of nodules, cases of pneumothorax, cases of interstitial disease, cases of alveolar disease, and rib fractures depicted on the core (or nested) group of images were 38, 37, 41, 32, and 31, respectively (Table 2). In 28 cases, more than one abnormality was depicted: two abnormalities in 21 cases and three in seven cases. Fifty of the images in the core group were negative for all five abnormalities. This case mix was designed to broaden the range of abnormalities that could be reported and to make it more difficult for the observers to estimate the frequency distribution of each abnormality.
Selection of Observers and Prestudy Training
Fourteen observers were selected for the study. Four were 3rd-year radiology residents at the beginning of the reading sessions, two were radiology fellows, and eight were board-certified faculty radiologists with experience in reading posteroanterior chest radiographs that ranged from 2 to 25 years. All continue to read chest radiographs during periodic rotation in the emergency department in addition to their regular duties. We selected this group of observers to assess the effect, if any, that may be associated with observers' training level and daily experience with reading posteroanterior chest radiographs. Observers were not made aware of the aims of the study or the prevalence levels to expect in any reading session. All observers received a detailed "Instruction to Observers" document to review. The document included a clear definition of the abnormalities in question, and a set of subtle and typical cases was used to demonstrate the types of cases to be included and to familiarize observers with the use of the computerized scoring form. The document also described in detail a step-by-step process for reviewing and rating cases during a session.
Performance of Study
The study was a five-mode comparison with varying levels of prevalence for each of the abnormalities. Fourteen readers viewed and rated each case five times. The reading sessions lasted for 18 months. Each reading session included approximately 50 randomized cases from only one mode. The study design allowed for case randomization within a mode and a session for each observer. Mode counterbalancing was implemented to decrease any reading-order effects. Observers completed readings of all cases in one mode (ie, one level of prevalence) before they were permitted to continue the study after a predetermined minimum period of 2 weeks between modes. Given the large number of cases and the complexity of the reading tasks, our experience indicated that this was sufficient time to ensure that individual cases were generally not remembered. Readers were allowed to spend as much time as desired on each image. During a reading session, observers were presented with a stack of envelopes, each containing one original conventional chest radiograph. These envelopes were arranged in the order of the designated interpretations for that session. Observers reported the results for each case on our computerized scoring form by using a computer mouse (7). Five sliding scales, one for each abnormality, were presented. The radiologists slid an indicator along the scale from 0 to 100 to indicate the likelihood (ie, probability) of the presence or absence of the abnormality in question. The study management software recorded the time required to review and report each case.
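To make the link between the 0 to 100 ratings and an ROC summary concrete, the following is a minimal sketch of a nonparametric (Mann-Whitney) estimate of the area under the ROC curve. It is not the method used in this study, which relied on the multireader multicase approach of Dorfman et al; the function name `empirical_auc` and all ratings shown are hypothetical.

```python
from typing import Sequence


def empirical_auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Mann-Whitney estimate of the area under the ROC curve.

    pos: ratings (0-100) for cases in which the abnormality is present.
    neg: ratings (0-100) for cases in which the abnormality is absent.
    Returns the fraction of (positive, negative) pairs in which the
    positive case received the higher rating, counting ties as one half.
    """
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical slider ratings for one reader and one abnormality.
nodule_present = [85, 60, 92, 40]
nodule_absent = [10, 35, 40, 5, 20]

auc = empirical_auc(nodule_present, nodule_absent)

# Lower the prevalence by repeating the same negative ratings three times.
# The estimate is unchanged, which illustrates why ROC area is, in theory,
# independent of prevalence when reader behavior stays the same.
auc_diluted = empirical_auc(nodule_present, nodule_absent * 3)
```

In this toy example the two estimates agree because the added negative cases receive exactly the same ratings as the original ones. Whether real observers rate cases the same way when abnormalities become rare is the behavioral question the study addresses.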
Of the 38 nodules depicted on the core images (Table 2), 26 were malignant and 12 were benign. In all modes but the enriched mode (ie, mode 1), actually negative cases were added to the core group to decrease (ie, "dilute") prevalence. The nested design was implemented for modes 2–5 as well; namely, all 244 cases in mode 2 were included in mode 3, all 394 cases in mode 3 were included in mode 4, and all 744 cases in mode 4 were included in mode 5.
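The dilution described above can be sketched with simple arithmetic. The example below assumes that every case added to the core group is negative for the abnormality in question, so the number of positive cases stays fixed while the total grows. The case totals (194, 244, 394, and 744) and the count of 38 nodules are taken from the text; the resulting fractions are illustrative only and are not the per-mode prevalence values reported in Table 2, which is not reproduced here.

```python
from typing import Dict, Sequence


def prevalence(n_positive: int, n_total: int) -> float:
    """Fraction of cases in a set that depict the abnormality."""
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("counts must satisfy 0 <= n_positive <= n_total, n_total > 0")
    return n_positive / n_total


def dilution_table(n_positive: int, totals: Sequence[int]) -> Dict[int, float]:
    """Prevalence for each nested set size, assuming all added cases are negative."""
    return {total: prevalence(n_positive, total) for total in totals}


# Set sizes quoted in the text: the 194 core cases, then the nested sets
# of 244, 394, and 744 cases that are shared by successive modes.
nested_totals = [194, 244, 394, 744]

# 38 nodules are depicted on the core images (assumed: none in added cases).
nodule_table = dilution_table(38, nested_totals)

for total, fraction in nodule_table.items():
    print(f"{total:4d} cases: {fraction:.1%} with nodules")
```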
Data Analyses
In our primary analyses, the Az values for the five modalities and 14 readers for the detection of each of the five disease categories were compared. We performed this analysis by using the method of Dorfman et al (17), which is a multireader multicase method that takes into consideration the correlation between readers who read the same set of cases. The first analysis was performed only with the 194 core cases that were read in all modes. Negative cases were defined as all cases in which the specific abnormality in question was not depicted, even if other abnormalities were present. However, the analysis was repeated for each subgroup that was common to more than one mode (ie, 244 cases for modes 2–5, 394 cases for modes 3–5, and 744 cases for modes 4–5). In addition, the analysis was repeated by using the analysis-of-variance method described by Obuchowski (18) and Obuchowski and Rockette (19). We also included a test for linear trend by incorporating the method proposed by Abelson and Tukey (20) into the procedure described by Obuchowski (18) and Obuchowski and Rockette (19). The data were also analyzed with only "pure" negative cases (namely, only those with negative findings for all five abnormalities in question). In an attempt to identify potential biases, we tested the data for reading-order effect and mean time required to read cases. The data were tested for trends in the Az when the cases were segmented according to the order in which each case was read (eg, first time, second time, etc), regardless of the specific mode. This was performed to determine possible case retention (ie, memorization or learning effects). The time (ie, seconds) to review and rate cases was averaged for each reader and each mode after exclusion of all measurements of 300 seconds or greater (ie, less than 2% [236 of 13,580] of all cases).
The exclusion was instituted on the basis of the assumption, which was verified experimentally as well, that these excessively long times were the result of interruptions such as phone calls during the session. The test of Page (21) was used to determine whether there was a trend in the mean time used to read cases for the different reading modes. Both the mean time to review cases and the Az for faculty versus fellows and residents were compared by using the method of generalized estimating equations (22). The statistical power of selected alternative hypotheses was estimated by using the method described by Obuchowski (18).
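The timing analysis described above can be sketched in two steps: drop measurements of 300 seconds or greater and average the rest per reader and mode, then summarize any trend across modes with Page's L statistic. This is a minimal sketch under stated assumptions, not the study's software: the record layout, the function names, and the sample values are hypothetical, and only the statistic is computed (no P value).

```python
from collections import defaultdict
from statistics import mean

CUTOFF_SECONDS = 300  # measurements of 300 s or greater are excluded, per the text


def mean_times(records):
    """Average reading time per (reader, mode) after applying the cutoff.

    records: iterable of (reader, mode, seconds) tuples (hypothetical layout).
    """
    kept = defaultdict(list)
    for reader, mode, seconds in records:
        if seconds < CUTOFF_SECONDS:
            kept[(reader, mode)].append(seconds)
    return {key: mean(values) for key, values in kept.items()}


def average_ranks(values):
    """Rank values from 1 to k; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def page_l(rows):
    """Page's L: sum over columns of (column index) * (column rank sum).

    rows: one row per reader, columns in the hypothesized increasing order.
    """
    rank_sums = [0.0] * len(rows[0])
    for row in rows:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return sum((j + 1) * total for j, total in enumerate(rank_sums))


# Hypothetical timing records for two readers and two modes.
records = [("A", 1, 40), ("A", 1, 60), ("A", 1, 450), ("A", 2, 50),
           ("B", 1, 30), ("B", 2, 90), ("B", 2, 300)]
means = mean_times(records)
rows = [[means[(reader, mode)] for mode in (1, 2)] for reader in ("A", "B")]
statistic = page_l(rows)
```

A large value of L relative to its null distribution would suggest that reading times increase in the hypothesized mode order. Turning L into a P value requires tables or a normal approximation, which this sketch omits.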
| RESULTS |
|---|
level, there were no statistically significant differences in performance among the five prevalence levels (P = .94, .20, .06, .88, and .15 for nodules, pneumothorax, interstitial disease, alveolar disease, and rib fractures, respectively).
As seen in Tables 3–7, the mean observer performance levels of faculty tended to be higher than those of fellows and residents. Although the number of readers was small and interreader variability was quite large, the results were statistically significant for nodules (P = .03) and showed borderline significance for cases of alveolar disease (P = .06) and rib fractures (P = .07). No effect was observed (P > .17) when the cases were analyzed on the basis of the order of reading rather than on the basis of the prevalence level. Hence, no reading-order effect or learning effect could be identified. None of the analyses of the other nested groups yielded significant differences between modes (P > .12 for all 15 comparisons of three groups and five abnormalities).
When the data were analyzed by using the approach of Obuchowski (18) and of Obuchowski and Rockette (19) and with only the 50 cases in which findings were negative for all five abnormalities, the results were not substantially affected. The probability of detecting a difference varies for the different hypotheses and is a function of the alternative hypotheses, the number of cases with disease and the number of cases without disease, the covariance structure, and the type I error. Only the type I error remains the same for differing hypotheses. However, we conservatively estimated that with only the core cases, the probability of detection of a consistent linear trend of a 0.02 change in the Az between consecutive prevalence levels, or a total of 0.08 change between the lowest and highest Az levels, ranges from 83% to 95% for the five disease categories.
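The alternative hypothesis used for the power estimate can be written out numerically. The sketch below builds five Az values that rise by 0.02 per prevalence level (a total change of 0.08) and applies the maximin linear-trend contrast coefficients attributed to Abelson and Tukey. The starting Az of 0.80 is hypothetical, the function names are illustrative, and the power calculation itself is not attempted here because it depends on the covariance structure, which is not available in this text.

```python
from math import sqrt
from typing import List


def abelson_tukey_contrast(k: int) -> List[float]:
    """Maximin contrast coefficients for a monotonic trend over k ordered levels.

    c_j = sqrt((j - 1) * (1 - (j - 1) / k)) - sqrt(j * (1 - j / k)), j = 1..k
    """
    if k < 2:
        raise ValueError("need at least two ordered levels")
    return [
        sqrt((j - 1) * (1 - (j - 1) / k)) - sqrt(j * (1 - j / k))
        for j in range(1, k + 1)
    ]


def linear_az(start: float, step: float, k: int) -> List[float]:
    """Az values that change by a constant step between consecutive levels."""
    return [start + step * i for i in range(k)]


coeffs = abelson_tukey_contrast(5)   # one coefficient per prevalence level
az = linear_az(0.80, 0.02, 5)        # 0.80 is a hypothetical starting value

total_change = az[-1] - az[0]        # four steps of 0.02
contrast_value = sum(c * a for c, a in zip(coeffs, az))
```

The coefficients sum to zero, so the contrast is zero when Az is identical at all five levels and grows as the values follow a consistent monotonic trend.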
The mean time spent in viewing and rating the images of a case varied substantially for different readers and ranged from 27 seconds ± 4 to 96 seconds ± 23. This large variability was observed for both positive and negative cases. However, the mean time over all readers and cases was not significantly affected (P > .05) by prevalence. Mean times in seconds for all readers for the core cases were 55 ± 17, 57 ± 21, 55 ± 23, 56 ± 27, and 55 ± 16 for modes 1–5, respectively. When we assigned the readers to two groups, that is, faculty and all others (ie, fellows and residents), the mean reading time for all readers and modes was 51 seconds ± 17 for the faculty and 62 seconds ± 25 for the fellows and residents. The difference was not significant (P = .27) by using a generalized estimating equation approach, where the mean time over all cases was compared for the group of faculty and all others.
| DISCUSSION |
|---|
The possible effect of prevalence on observer performance studies has been mentioned in many studies as a potential bias that may impose a serious impediment to generalizability of laboratory results to the clinical environment (8–10). Laboratory observer performance studies generally require some form of a checklist-type scoring (ie, rating) to enable an ROC-type analysis. Hence, the prevalence effect or lack thereof in this setting may be different from that which may be exhibited in the clinical environment. The effect has been assumed to be a possibility in both types of environments but was never carefully studied. Some indirect supporting information was used in the past to infer the potential effect (15). However, because of the cost and complexity associated with the performance of this type of study, available experimental data are sparse and limited (12).
In this study, we attempted to determine the magnitude of the prevalence effect, if any, for a specific set of experimental conditions. By default, our results may not be generalizable to the general clinical environment or to any reading conditions that do not require a formatted checklist-type response. The case mix used in this study is different from (ie, generally more subtle than) that which is typically seen in the clinical environment. Nonetheless, our study findings clearly demonstrated that, with a laboratory condition and a wide range of cases, abnormalities, and observer experiences, the prevalence effect could not be identified. This finding is in full concordance with the theoretical underpinnings of the ROC approach to performance assessment. The consistency in our results and the relatively large number of readers and cases indicate that, if a prevalence effect exists, it is likely to be small in magnitude; hence, it will not likely alter conclusions derived from such studies. To the extent measured here, our study findings demonstrated that the observers' level of training and experience affects detection performance, albeit not in regard to any measurable prevalence effect in the laboratory experiment. The range of Az values for individual readers and abnormalities clearly indicates that the cases were not very easy to diagnose. Validation of these results with a different set of cases, abnormalities, and observers may be important if we are to largely ignore this effect in future studies.
Although this study was not performed in a double-blind manner and required a checklist-type response, its results provide a data point in support of a fundamental assumption and thereby lend some validation to the results of numerous laboratory experiments performed over many decades.
| FOOTNOTES |
|---|
Author contributions: Guarantor of integrity of entire study, D.G.; study concepts, D.G., H.E.R.; study design, D.G., H.E.R., C.R.F.; literature research, D.G., D.R.A.; clinical studies, D.R.A., A.B., J.K.B., G.B., C.A.B., M.L.B., P.L.D., J.V.F., S.K.G., S.K., J.M.L., B.M.M., F.L.T., T.E.W.; data acquisition and analysis/interpretation, D.G., H.E.R., C.R.F.; statistical analysis, D.G., H.E.R.; manuscript preparation and definition of intellectual content, D.G., H.E.R.; manuscript editing, revision/review, and final version approval, all authors.