Editorials
1 From the Department of Radiology, University of Pittsburgh School of Medicine, Imaging Research, Suite 4200, 300 Halket St, Pittsburgh, PA 15213. Received September 1, 2004; accepted September 27. Address correspondence to the author (e-mail: gurd@upmc.edu).
I recently attended a clinical conference where a well-known investigator reviewed some of the scientific presentations given at the recent American Society of Clinical Oncology meeting in New Orleans. One of the highlights of the review was a randomized study of a cancer vaccine that showed essentially no effect in the treatment group. The survival curves of the two arms (at least as shown at our review conference) overlapped to such an extent that one could state that there was no difference between them. The excitement the study generated (per the reviewer) was caused by the fact that, in a poststudy analysis, a "clear" difference was shown in one subset of the studied population. "This is very enlightening," I said, "but does this mean that the vaccine is actually harmful to the rest of the population, who are not a part of this subset?" The short silence in the audience was quite noticeable. I write this editorial to illustrate a point. One has to be careful when using subsets of cases in poststudy analyses to demonstrate, after the fact, what one would like to state, without regard to what such statements imply for the study as a whole and, in particular, for the population not included in the subset in question.
We recently completed a study about the effect of computer-aided detection (CAD) in our clinical practice. We were truly disappointed to see a lower than expected effect in the detection of additional cancers in general and in the detection of masses in particular (1). The editorial describing our article claimed that, had we corrected for the changing fraction of repeat versus initial screening examinations, we would have obtained an increase in recall rates (2). I can only guess that the authors of the editorial wanted to highlight the possible decrease in performance (due to decrease in specificity) with the use of CAD. Our response (3) was that, in this observational study, we chose not to adjust for confounding variables, since we did not have complete information on the history of all women in regard to prior breast examinations that may have been performed elsewhere. In addition, had we done what was suggested (namely, corrected for this particular confounding variable), we would have shown that cancer detection rates were also affected in a similar manner (namely, sensitivity would have been similarly increased). Unfortunately, this was not the end of the story. The data in our study were reanalyzed for the subset of "low-volume" readers by those who would like to demonstrate the positive effect of CAD on cancer detection (4). They appropriately suggested that, if one does correct for this confounder, there is an increase in cancer detection for the subset of low-volume readers (albeit not statistically significant, because of a small sample size). The authors (of this letter to the editor) chose not to mention that, indeed, the recall rates in this subset actually increased significantly (P < .001), as well. 
The net result, for the most part, was that the effect of CAD in the subset of low-volume readers could be quite similar to simply telling these readers to be more conservative or to change the threshold for recall of patients for additional diagnostic procedures (4).
I convey the above point to illustrate that a journey in search of a statement one wishes to make is always interesting, but often is not a good practice. There are issues that need to be carefully addressed when this is done. If one looks hard enough, there are often subsets that would support the desired statement. First, statements claiming that "there is an observed effect in a small subset, but in order to demonstrate statistical significance, one would need a much larger study" (ie, a population five or six times that of our study sample) are inappropriate. The nature of these analyses is such that it is not at all clear whether the observation made regarding the particular subset would hold true when a larger study of this type is performed. Second, what could one say about the remainder of the population that is not included in the subset? Namely, in regard to our study, if there is no effect in the population as a whole but there is an effect in the subset, does this mean that CAD is actually bad for high-volume readers? After all, in our study, the performance of high-volume readers did decline with the use of CAD in regard to both detection rates and recall rates. Third, is this subset affected in a manner that is truly different from the effect on the rest of the population, or do they just operate at a different point on the same (or a similar) receiver operating characteristic (ROC) curve? Isolating one issue to make a statement one wishes to make (for whatever reasons) is not very helpful in the long run, unless all the issues that may be related to such a statement are carefully considered.
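The claim that a determined search will often turn up a supportive subset can be illustrated with a small simulation. The sketch below is not drawn from any of the studies cited here: the number of subsets, the arm sizes, the event rate, and the function names are all hypothetical, and the subsets are assumed to be non-overlapping and independent. Both arms are generated with the same event rate, so any "significant" subset is spurious by construction.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(events_a, events_b, n_per_arm):
    """Two-sided p value of a pooled two-proportion z test (equal arm sizes)."""
    pooled = (events_a + events_b) / (2 * n_per_arm)
    se = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (events_a - events_b) / n_per_arm / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fraction_with_spurious_subset(num_studies=500, num_subsets=10,
                                  n_per_arm=100, event_rate=0.3,
                                  alpha=0.05, seed=0):
    """Fraction of simulated no-effect studies in which at least one
    subset shows p < alpha. All parameter values are illustrative."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        for _ in range(num_subsets):
            # Both arms share the same true event rate (no real effect).
            a = sum(rng.random() < event_rate for _ in range(n_per_arm))
            b = sum(rng.random() < event_rate for _ in range(n_per_arm))
            if two_proportion_p_value(a, b, n_per_arm) < alpha:
                hits += 1
                break  # one "significant" subset is enough to be reported
    return hits / num_studies


print("one prespecified subset:", fraction_with_spurious_subset(num_subsets=1))
print("best of ten subsets:   ", fraction_with_spurious_subset(num_subsets=10))
```

Under these assumptions, the second fraction should be several times the first, even though no subset differs from any other in truth.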
A different example that should not be ignored is the current effort to demonstrate the advantages of full-field digital mammography (FFDM) over screen-film mammography (5–8). In a series of studies in this area, it has been demonstrated repeatedly that recall rates are largely associated with cancer detection rates (6–8). In several of these studies, when the former (recall rate) is higher, the latter (detection rate) follows (5,6,8). However, since these studies are designed to assess the possible benefits of FFDM, the investigators often focus on the detection rate because this is the more obviously important aspect of a breast cancer screening program (8). It is imperative, when the results of these studies are presented, that the authors consider whether the measured effect is likely the result of a truly better technology that leads to better observer performance or simply the result of readers operating at a different level (point) on the ROC curve. Unfortunately, even if the results of the latest study in question (8) did reach statistical significance (which they did not), the question of actual possible improvement remains because of the significant increase in recall rates in the population subset described (patients aged 50–69 years) (8). A similar effect, albeit in the opposite direction (namely, a lower recall rate and a lower detection rate), was associated with the use of FFDM as described by another group (6). The fundamental question is whether one can see more with one method of detection or display than with the other, and not whether a change in the operating point along the same ROC curve results in more cancers being detected. If the results are largely the latter, it could be a lot cheaper and simpler to train observers to be more conservative (namely, to recall more patients with less suspicious findings) than to implement a very expensive technology that, in addition, requires significant incremental effort during the transition period.
One has to ascertain appropriate data regarding the type of cases (cancers) that are detected (visible) with one type of detection system, compared with the other. If these differences are not well characterized, our ability to define true progress, when incremental improvements with new technologies and practices in this field are often relatively small, diminishes substantially. There may be a number of reasons that are not related to observer performance yet are compelling reasons why technologies or practices such as CAD or FFDM should be implemented clinically, but these are rarely discussed in many of our studies.
There is an appropriate alternative practice with which performance should be compared in many technology and practice assessments. In the case of CAD, the change in detection, when CAD is introduced after an initial review without CAD, is often not the only change one sees. Often the new practice leads to higher detection rates, but frequently this happens at the cost of a corresponding increase in recall rates for false-positive findings. The question is whether and by how much this new practice is better than simply lowering the threshold of what one chooses as a reason to recall; hence, did one actually move the operating point to a higher ROC curve? Similar appropriate alternatives (and hypotheses) should be used when comparing FFDM with screen-film mammography. When the alternative technologies or practices that are compared result in different false-positive (eg, recall) rates and one does not measure directly the full performance curves in question (curves for ROC or derivatives thereof), a direct comparison of measures of sensitivity levels alone (eg, detection rates) is not optimal and frequently is not valid. The more appropriate comparison would be with the expected level of performance (sensitivity) of one practice (eg, without CAD or when using screen-film mammography) with adjustment for the same rate of false-positive findings as that measured for the new technology or practice (eg, with CAD or when using FFDM).
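The comparison described above can be sketched numerically. The example assumes an equal-variance binormal ROC model, which is a modeling convenience and not something measured in any of the studies discussed, and it treats the recall rate for false-positive findings as the false-positive fraction. The function names and the two operating points are hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def separation_from_point(fpr, tpr):
    """Separation (d') implied by one operating point, assuming an
    equal-variance binormal ROC curve."""
    return _STD.inv_cdf(tpr) - _STD.inv_cdf(fpr)


def sensitivity_at_fpr(d_prime, fpr):
    """Sensitivity expected on the curve with separation d_prime at the
    given false-positive fraction."""
    return _STD.cdf(d_prime + _STD.inv_cdf(fpr))


def adjusted_gain(old_point, new_point):
    """Sensitivity of the new practice minus the sensitivity the old
    practice would be expected to reach at the same false-positive rate.
    A value near zero suggests a shift along the same ROC curve."""
    old_fpr, old_tpr = old_point
    new_fpr, new_tpr = new_point
    d_old = separation_from_point(old_fpr, old_tpr)
    return new_tpr - sensitivity_at_fpr(d_old, new_fpr)


# Hypothetical (false-positive rate, sensitivity) pairs, not study data.
without_new_practice = (0.08, 0.75)
with_new_practice = (0.11, 0.80)

raw_gain = with_new_practice[1] - without_new_practice[1]
print("raw difference in sensitivity:    ", round(raw_gain, 3))
print("difference at matched recall rate:",
      round(adjusted_gain(without_new_practice, with_new_practice), 3))
```

The point of the sketch is only that the raw difference and the matched difference can diverge; when full ROC curves have been measured directly, they should be used in place of the binormal assumption.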
It is as important, perhaps, that we not forget the statistical nature of many of these analyses. If we perform a large enough number of studies to evaluate the hypothesis in question (eg, "FFDM is better than screen-film mammography"), we can be sure that at least some of these studies will demonstrate statistically significant differences and enable us to find what we wish to find. This is a natural consequence of the hypothesis testing format in which the type I error is set at a level (typically, α = .05) for the individual study without consideration of the possibility that several hypotheses may be tested in other studies. However, as the number of studies increases, it is the consistency in results among studies that has to be investigated (eg, through meta-analysis or a derivative thereof) to ensure that the results are valid and do not represent but a single study (which may actually be an "outlier"). Unfortunately, we all tend to use the one study that demonstrated the desirable effect as a guide toward the future (or as implicit truth), rather than an ensemble of studies with various results that address the issue at hand.
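The arithmetic behind this point is simple if one assumes, for illustration only, that the studies are independent and that there is no true difference. The probability that at least one of k such studies reaches significance at level α is then 1 − (1 − α)^k. The function name below is hypothetical.

```python
def prob_at_least_one_significant(num_studies, alpha=0.05):
    """Probability that at least one of num_studies independent studies
    is significant at level alpha when no true difference exists."""
    return 1.0 - (1.0 - alpha) ** num_studies


for k in (1, 5, 10, 20):
    print(k, "studies:", round(prob_at_least_one_significant(k), 3))
# 1 studies: 0.05
# 5 studies: 0.226
# 10 studies: 0.401
# 20 studies: 0.642
```

With 20 independent studies of a nonexistent effect, the chance that at least one appears "positive" is therefore close to two in three, which is why consistency across studies matters more than any single result.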
This author has spent most of his time and effort in recent years in attempting to develop new technologies and establish better practices for earlier detection of breast cancer, including a significant effort in the development and assessment of both CAD and digital acquisition technologies. Therefore, it is with great pain that I write these lines. I am a true believer in, and a supporter of, new technologies. There is no doubt in my mind that both CAD and FFDM are the future. However, we all have to be critical of our own science, or eventually we and society will pay the price for statements we make that are not fully supported or consistently sustained by the data we ascertain.
FOOTNOTES
Author stated no financial relationship to disclose.
REFERENCES