Breast Imaging
1 From the H. Lee Moffitt Cancer Center & Research Institute, 12902 Magnolia Dr, Tampa, FL 33612-9497. From the 2002 RSNA scientific assembly. Received December 4, 2002; revision requested February 6, 2003; final revision received May 13; accepted May 19. Supported by National Cancer Institute grant CA-74110. Address correspondence to C.A.B. (e-mail: beamca@moffitt.usf.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Percentiles of accuracy based on a random sample of 110 U.S. radiologists were used to examine the number of radiologists who would need to be restricted from providing mammographic interpretation to increase median accuracy from 66% to 67%, 71%, and 76%. In addition, reading volume data recorded for the sampled readers were used to project the percentage reduction in service volume (mammograms per year) that would result from restriction. Characteristics of participating radiologists were compared with those of nonparticipating radiologists by using χ² testing and analysis of variance to assess the external validity of the results.
RESULTS: To increase median accuracy by 1% (from 66% to 67%) would require prohibiting about 2,200 U.S. radiologists (ie, the 11% in the lowest quantile for accuracy) from performing mammographic interpretation and would result in a reduction of yearly service volume of approximately 10%. An increase in median accuracy of 5% (to 71%) would require prohibiting about 6,000 U.S. radiologists (ie, 30%) from performing this service, with an accompanying volume reduction of 25%. An increase in median accuracy of 10% (to 76%) would require prohibiting about 11,400 practicing U.S. radiologists (ie, 57%) from performing this service and would diminish the national service capacity by 50%.
CONCLUSION: These data show that implementation of proscriptive health care policies based on accuracy would diminish the service capacity of screening mammography in the United States.
© RSNA, 2003
Index terms: Breast radiography, utilization, 00.11; Diagnostic radiology, observer performance; Radiology and radiologists; Radiology and radiologists, departmental management
| INTRODUCTION |
|---|
Most recently, the factor of the minimum requirement of cases has been discussed. As compared with that in countries with mass screening programs, the caseload requirement for the American radiologist is minimal. In Sweden, mass screening is performed at select sites, with only "expert" radiologists (those specializing in breast imaging) interpreting the images (1). In the United Kingdom, where high-volume screening programs exist, the radiologist must interpret a minimum of 5,000 mammograms per year (6). In the Canadian province of British Columbia, the recommended minimum number of cases interpreted per year is 2,500 (3). Although this is half the number required in the United Kingdom, it is more than five times the recommended number in the United States. In the United States, the Food and Drug Administration (FDA), in enforcing the Mammography Quality Standards Act, requires every radiologist to read a minimum of 960 mammograms during a 24-month period (7).
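As a rough check of the comparison above, the U.S. requirement can be put on a per-year basis. This is only an illustrative calculation from the figures quoted in the text, assuming the 24-month minimum is spread evenly over 2 years:

$$\frac{960}{2} = 480 \text{ mammograms per year}, \qquad \frac{2{,}500}{480} \approx 5.2, \qquad \frac{5{,}000}{480} \approx 10.4.$$

On this reading, the British Columbia recommendation is a little more than five times the annualized U.S. minimum, and the United Kingdom requirement is roughly 10 times that minimum.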
Although volume has been cited as an important factor in improving the sensitivity and specificity of mammography (1,2), it is important to note that radiologists in the United States work in different financial and legal environments than do radiologists in countries with socialized medicine (8). Therefore, a simple comparison of specificity and sensitivity between radiologists in different countries that is based on volume of cases may not be valid. Additionally, patients in the United States demand convenient access to medical care (8) and may not be willing to travel to specialized mammography centers where high volumes of mammograms are "batched" for interpretation.
Given the uncertain connection between volume and skill, we might wonder whether qualifying radiologists on the basis of volume is a reliable foundation for an effective health care policy. One might ask instead, "Why not qualify radiologists directly on the basis of ability?"
The idea of qualifying examinations is not new to the field of medicine. However, such examinations rarely serve a proscriptive function. For example, the subspecialty board examination in radiology tests for a basic fund of knowledge in the field, but not passing the examination does not necessarily prohibit the physician from practicing radiology in the United States. The PERFORMS 2 test, administered in the United Kingdom by the National Health Service Breast Screening Program, is taken electively by radiologists and serves as a teaching tool as well as a skill-assessment tool. Recently, the American College of Radiology Committee on Mammography Interpretive Skills Assessment introduced a similar voluntary self-evaluation test with feedback and a scoring system that enables a radiologist to compare his or her scores with those of other radiologists across the country. However, this test was not designed to be used as a proscriptive health care policy tool.
Currently in the United States, proscriptive policies limiting the practice of medicine to only those who pass an examination do not exist. The implementation of proscriptive policies for American radiologists could conceivably improve the accuracy of mammographic interpretations while decreasing both patient recall and false-positive biopsy rates. Yet the very nature of proscription is to restrict access, and we must simultaneously be concerned with the potential for reduction in volume and services that such a policy might bring to mammographic screening in the United States. In sum, restricting the number of radiologists who can interpret mammograms may also restrict the access of American women to mammography. An important question is, therefore, what cost will there be in terms of access to qualify radiologists on the basis of their skill?
The purpose of our study was to evaluate the potential effect of proscriptive health care policies directed toward improving screening mammogram interpretation in the United States.
| MATERIALS AND METHODS |
|---|
Radiologists
Radiologists were recruited to participate in the Variability in Diagnostic Interpretation, or VIDI, screening mammography study (9). VIDI is a research program devoted to the population-based assessment of interpretation variability in diagnostic medicine. Participants for the screening mammography study were from randomly sampled mammography facilities accredited by the FDA as of January 1, 1998. Stratified random sampling of the 9,916 geographically contiguous accredited facilities ensured approximately equal representation across four geographic regions defined by the U.S. Census Bureau. Facilities in each of the four geographic regions were additionally stratified according to the minority composition of local screening populations to yield a total of eight "strata." Minority composition was categorized on the basis of the percentage of minorities in the population in the zip code area of the facility (obtained from U.S. Census Bureau reports) as either "less than 50% nonwhite" or "more than 50% nonwhite." Thus, stratified sampling was performed and yielded approximately equal numbers of facilities within each of the eight strata. On average, each facility we sampled reported having two radiologists involved in mammography, and, thus, we estimated there to be approximately 20,000 radiologists in the U.S. population.
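The population estimate above appears to follow from the facility count and the reported average number of radiologists per facility. The following arithmetic is a sketch under that assumption; the text does not spell out the calculation:

$$9{,}916 \text{ facilities} \times 2 \text{ radiologists per facility} = 19{,}832 \approx 20{,}000 \text{ radiologists}.$$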
All radiologists at each randomly sampled facility were invited to participate. The procedure for recruitment began with a letter to the lead interpreting physician at a sampled facility that asked him or her to distribute our recruitment material to all radiologists who interpret mammograms for their facility. In this way, we sampled not only permanent faculty members but also locum tenens radiologists. The recruitment material explained the study and the requirements for and benefits of participation in the study and asked the radiologists whether they would be willing to participate if randomly sampled. In all, 412 radiologists were contacted, and 292 (71%) expressed willingness to participate in the study if sampled. These 292 radiologists, grouped by facility, provided our frame for random sampling. Again, we sampled facilities (and, hence, willing radiologists within facilities) within the strata formed by geographic region and minority composition to arrive at approximately equal numbers of radiologists per stratum.
Cases
One hundred forty-eight index mammography cases were randomly selected from the records of a large screening program affiliated with the University of Pennsylvania; these 148 cases represented results of examinations performed between 1993 and 1997. All mammograms selected for this study were reviewed for quality (positioning, compression, exposure level, contrast, and artifacts) by E.F.C., who is director of the Breast Imaging Program at the University of Pennsylvania. No cases were rejected because of poor technical quality.
Sample cases were stratified on the basis of disease status (ie, with cancer or cancer free), which was determined at biopsy or after a minimum follow-up period of 2 years, as well as on the basis of patient age. Stratification was performed by using the electronic patient information and biopsy databases maintained by the Breast Imaging Program. Once the cases were stratified, sampling was performed at random within strata. Differences in case availability prevented us from meeting our initial goal of having equal numbers of cases within strata for each disease status.
Original film mammograms were used in the reading study. To parallel usual clinical practice, comparison original film mammograms were also provided when available. Comparison mammograms were available for 67 cases (45%). Each set of mammograms had been obtained at low-dose screen-film mammography performed with dedicated mammography units and single-emulsion film. Each set consisted of mediolateral oblique and craniocaudal views of each breast. The index examination of a woman was defined as the one whose results led to the first biopsy in those women who underwent biopsy or as the next-to-last examination for those women who were followed up for at least 2 years and did not undergo biopsy. A comparison examination was defined as the screening examination immediately prior to the index examination.
Reading Study
All radiologists interpreted the mammograms in a controlled reading environment during two 3-hour periods. The reading was performed entirely in a room dedicated solely to the study that permitted the investigators to control ambient light. Readers traveled to a central site at which the controlled reading room was located. Eight readers participated at a time.
Case images were mounted in random sequence on dedicated mammography alternators (RADX, Houston, Tex). The only information presented to the reader was the age of the patient. Before reading, radiologists were instructed that the case set did not have the mix expected in a typical screening population (ie, about two to six cancers per 1,000 individuals). Results of pilot studies performed by the authors have established that this instruction adequately controls context bias (10) (details are available upon request to C.A.B.). Readers were oriented by means of supervised hands-on experience; they reviewed a set of practice cases before beginning the review of the study cases. The practice case set did not include any cases used for the reading study.
The reading data were immediately input to a database with laptop computers. A custom computer program operating in real time during the reading session captured the reading data described below and ensured data reliability by way of several programmed checks for completeness and inconsistency.
Readers were asked to (a) identify findings, (b) make a recommendation for further work-up, (c) report what they believed would be the result of additional work-up, and (d) give a subjective assessment of the presence of breast cancer for each case. Responses to item d were reported by using an 11-point scale (in which a score of 0 represented "definitely normal" and a score of 11 represented "definitely cancer"). Responses to item c involved use of the BI-RADS scale (11) and were used in the receiver operating characteristic (ROC) curve analysis in this study. This analysis is described later.
Reader Factors
Two surveys were used to collect data about the readers in our study. One survey was used to collect data about each individual reader and another to collect data about the facility with which the radiologist indicated affiliation. Among other things, radiologists were asked to report their recent reading volume, which is the total number of mammograms (both screening and diagnostic) read in the year prior to their participation in the study. All survey items were self reported and not independently verified.
Statistical Analysis
The qualitative characteristics (eg, sex, race) of the radiologists who participated in our study were summarized with percentages and compared by means of the χ² test with those of radiologists who were contacted but who did not participate in the study. Quantitative characteristics (eg, age, recent reading volume) of participants were summarized with means, SDs, and ranges. The mean values for the participating and the nonparticipating radiologists were statistically compared with analysis of variance.
We also characterized our reader sample group by computing the performance characteristics (ie, sensitivity and specificity) for each reader who interpreted our case set and then summarizing the reader sample group data with the mean, median, and range of these characteristics. In our study, radiologist sensitivity was computed as the proportion of women with breast cancer who were recommended for recall by the radiologist. Radiologist specificity was computed as the proportion of women without breast cancer who were not recommended for further work-up by the radiologist.
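The two definitions above can be illustrated with a minimal Python sketch. The function name, argument names, and toy data are hypothetical and are not taken from the study; the sketch only shows how the two proportions would be computed for one reader from per-case truth and recall recommendations.

```python
def sensitivity_specificity(truth, recalled):
    """Compute (sensitivity, specificity) for one reader.

    truth[i]    -- True when case i has breast cancer (hypothetical input format)
    recalled[i] -- True when the reader recommended further work-up for case i
    """
    pairs = list(zip(truth, recalled))
    true_pos = sum(1 for t, r in pairs if t and r)
    false_neg = sum(1 for t, r in pairs if t and not r)
    true_neg = sum(1 for t, r in pairs if not t and not r)
    false_pos = sum(1 for t, r in pairs if not t and r)
    # Sensitivity: proportion of cancer cases recommended for recall.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: proportion of cancer-free cases not recommended for work-up.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Toy data (invented for illustration): 4 cancer cases, 6 cancer-free cases.
truth = [True, True, True, True, False, False, False, False, False, False]
recalled = [True, True, True, False, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(truth, recalled)
```

In this toy example the reader recalls three of four cancers and one of six cancer-free cases.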
The distribution of ages in the study cases was summarized with percentages in each disease group (with cancer or cancer free) to reflect the sampling plan used in selecting the cases.
Radiologist accuracy was measured by using the partial area under the binormal ROC curve (12,13). This measurement can be interpreted as the average sensitivity of the radiologist when he or she is reading with at least 90% specificity (14–16). Technical details about the method of computation are presented in the Appendix.
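To make the accuracy measure concrete, the sketch below numerically averages sensitivity over the false-positive fractions from 0 to 0.10 (ie, specificity of at least 90%) for a binormal ROC curve. It assumes the conventional binormal parameterization TPF = Φ(a + b·Φ⁻¹(FPF)); the parameters `a` and `b`, the function names, and the midpoint-rule integration are illustrative assumptions and are not claimed to be the computation the authors used.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and SD 1


def binormal_tpf(fpf, a, b):
    """Sensitivity (TPF) on a binormal ROC curve at false-positive fraction fpf."""
    return _STD_NORMAL.cdf(a + b * _STD_NORMAL.inv_cdf(fpf))


def average_sensitivity(a, b, max_fpf=0.10, steps=10_000):
    """Average TPF over FPF in (0, max_fpf], ie, specificity >= 1 - max_fpf.

    Uses a midpoint rule so that FPF = 0, where inv_cdf is undefined,
    is never evaluated. Dividing the partial area by max_fpf turns it
    into an average sensitivity.
    """
    width = max_fpf / steps
    partial_area = 0.0
    for i in range(steps):
        fpf = (i + 0.5) * width
        partial_area += binormal_tpf(fpf, a, b) * width
    return partial_area / max_fpf


# Hypothetical reader parameters, chosen only for illustration.
chance_reader = average_sensitivity(a=0.0, b=1.0)   # ROC curve on the diagonal
better_reader = average_sensitivity(a=1.5, b=1.0)
```

For a reader performing at chance (a = 0, b = 1), sensitivity equals the false-positive fraction, so the average over this range is 0.05; any reader with discriminating ability scores higher.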
Results in our sample group of radiologists yielded an estimate of "quantiles of accuracy" in the U.S. population of radiologists. A quantile of accuracy represents the accuracy value associated with a cumulative proportion of the population. For example, the median of a population is the value such that 0.50 of the population is less than or equal to that value. The median is, therefore, the 0.50 quantile of the population distribution.
The quantiles of accuracy estimated from our data were then used to estimate the number of radiologists who would need to be restricted from providing mammographic interpretation to increase median accuracy by 1%, 5%, and 10%. In the Appendix, we show that, to shift the median upward to the accuracy value that currently lies p% of the population above it, the lower 2p% of the population has to be restricted. A detailed example of this computation is given in the next section.
Reading volume data recorded from the sampled readers were used to project the percentage reduction in service volume (ie, mammograms read per year) that would result from restriction. This was accomplished by computing the proportion of total reading volume attributed to each reader in the sample and then summing these proportions for the readers in the percentage of radiologists who would be restricted from reading mammograms in the proposed proscriptive policy.
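The volume projection described above can be sketched as follows. The function name, the `(accuracy, annual_volume)` tuple format, and the toy numbers are hypothetical; the sketch simply ranks readers by accuracy, restricts the lowest fraction, and sums their share of total reading volume.

```python
def projected_volume_reduction(readers, fraction_restricted):
    """Share of total reading volume lost when the least accurate readers are restricted.

    readers             -- list of (accuracy, annual_volume) tuples (hypothetical format)
    fraction_restricted -- proportion of readers, lowest accuracy first, to restrict
    """
    ranked = sorted(readers, key=lambda reader: reader[0])  # lowest accuracy first
    n_restricted = round(fraction_restricted * len(ranked))
    total_volume = sum(volume for _, volume in ranked)
    lost_volume = sum(volume for _, volume in ranked[:n_restricted])
    return lost_volume / total_volume


# Toy sample (invented numbers): accuracy and mammograms read per year.
sample = [(0.50, 500), (0.60, 1000), (0.70, 1500), (0.80, 2000)]
reduction = projected_volume_reduction(sample, 0.50)
```

In this toy sample, restricting the lower half of readers removes 1,500 of 5,000 annual readings. The loss in volume need not equal the fraction of readers restricted, because readers differ in how many mammograms they read.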
| RESULTS |
|---|
χ² test). This situation reflects differences in the availability of original mammograms after the already age-stratified cases were randomly selected. The mammograms of younger women with breast cancer tended more often to be in clinical use than the mammograms of older women with breast cancer.
"middle" of the distribution upward to where the vertical arrow intersects the plot. By following the left-facing arrow, it can be seen that this action equates to shifting the middle of the distribution to the value that is currently approximately the 75th percentile in the population (ie, 0.75 on the vertical axis). In other words, this health care policy goal requires shifting the median up past 25% of the data. As shown in the Appendix, for each 1% of the population past which the median is to be shifted, 2% of the population must be eliminated from service. Therefore, our data suggest that to increase the median accuracy in the United States by 10% by using proscription would require the restriction of approximately 50% (two times the desired 25%) of presently active interpreting radiologists.
Figure 2 summarizes the implications of the previous analysis for proscriptive health care policies designed to achieve various improvements in median accuracy: To increase median accuracy by 1% (from 66% to 67%) would require restricting 11% of U.S. radiologists in the lowest quantile for accuracy (about 2,200 in an approximate U.S. population of 20,000 radiologists who interpret mammograms) and would result in a reduction in yearly service volume of approximately 10%. An increase in median accuracy of 5% (to 71%) would require restricting 30% of radiologists (about 6,000 physicians), with an accompanying volume reduction of 25%. An increase in median accuracy of 10% (to 76%) would require the restriction of 57% of practicing U.S. radiologists (about 11,400 physicians) and would diminish the national service capacity by 50%.
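The head counts quoted above follow directly from the restricted percentages and the approximate population of 20,000 interpreting radiologists. The short Python sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
# Approximate number of U.S. radiologists who interpret mammograms (from the text).
POPULATION = 20_000

# Desired gain in median accuracy (percentage points) -> share of radiologists restricted.
restricted_share = {1: 0.11, 5: 0.30, 10: 0.57}

# Number of radiologists restricted under each policy goal.
restricted_count = {
    gain: round(POPULATION * share) for gain, share in restricted_share.items()
}
```

This gives about 2,200, 6,000, and 11,400 radiologists for gains of 1%, 5%, and 10%, respectively, matching the figures in the text.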
| DISCUSSION |
|---|
pick up the slack.
Given the many disincentives to read mammograms already, the requirements of having to qualify and then take on a much increased caseload would likely lead to increased radiologist attrition. If great enough, such attrition could create a cascade of increased reading demand for those radiologists wishing to continue interpreting mammograms and, hence, even greater disincentive. Efforts should focus on improving the tools and skills of currently practicing radiologists. In addition, more advanced practice management paradigms ought to be considered. For example, one management paradigm might be to achieve the goal of maintaining access to quality mammography through the planned redistribution of caseload within the facility. With this strategy, mammographic interpretation would be shifted away from those radiologists who did not meet the proscriptive threshold for performance to those who did. The former individuals would then pick up the resulting slack in other modalities and disease areas. The mammography service volume provided by the facility would be maintained. Another way to implement this management goal would be to require double reading of cases initially read by radiologists who did not meet the proscriptive threshold. The radiologist performing the second reading would be provided with appropriate acknowledgment of status and remuneration for their senior role. Some of the data required for this management paradigm should already be available, since current Mammography Quality Standards Act rules require each mammography facility to track outcomes of abnormal mammograms separately for each radiologist. This approach to auditing (ie, at the level of the radiologist) should be extended beyond collection of biopsy outcomes to include additional clinically useful performance parameters (eg, cancer detection rate, recall rate, etc).
Our study had several limitations that must be kept in mind when interpreting our findings. We did not attempt to measure or account for statistical error or uncertainty in our estimates and, hence, in our findings. Although this is a limitation of our study, we believe this was appropriate because the goal of this analysis was to investigate the potential implications of proscriptive health care policies. We believe that the point we make with this analysis is sufficient to dissuade serious consideration of proscriptive policies. However, should it be decided that such policies deserve further consideration, we point out that the next step must then be to specify the desired increase in median accuracy. This specification is, to say the least, enormously difficult because it involves specifying universally acceptable societal-level valuations of health care outcome. It is beyond the intended scope of this article to offer such a valuation.
One might also be concerned about the representativeness of our findings. However, we have provided data confirming that the physician sample was representative of the entire U.S. population of interpreting radiologists. Our case set was randomly sampled and hence should be representative of cases in typical screening populations, having approximately equal representation of women from each of the three age groups. Of course, the case mix was enriched with cancers, but this should not bias our assessment of ROC curves because such measurements are conditional on disease status and, therefore, yield estimates of sensitivity and specificity in an independent and unbiased fashion. One might be concerned with how well our measurement of accuracy reflects actual screening performance in the field. This is indeed a concern and a limitation of any study in which performance measured in the "laboratory" is used to estimate performance achieved in real practice. Finally, recent reading volume was used in our projections of future effect on service volume. Therefore, any changes in reading volume that occurred immediately after our study are not incorporated into our estimate of the effect of the proscriptive health care policy on total service capacity. For example, three randomly selected radiologists reported zero reading volume because one had been a resident and two had just returned to performing mammography. Our projection of the loss to total service volume from the restriction of any of these three radiologists would therefore be zero as well.
On the other hand, the experimental methods we used are well established and yield what might be considered to be the most optimistic appraisal because extraneous sources of variation were minimized. This is because our experimental conditions optimized reader performance: the radiologists read in optimal lighting conditions with state-of-the-art mammography alternators. They could focus entirely on the task and were not interrupted with the usual things that interfere with a radiologist's concentrated reading of mammograms. Furthermore, our use of the ROC curve controlled for any "overreading" or "underreading" that the radiologists might have performed in response to being subjects in an experiment. Thus, we believe our data capture the "essential state of practice." And, despite these conditions, the spread in average sensitivity among radiologists was great enough to provide compelling evidence against proscriptive health care policy approaches for improving mammographic interpretation in the United States.
It is also important to point out that "average sensitivity" in our study referred to average sensitivity in the context of screening. In screening, the central decision is whether or not to conduct additional work-up (ie, the "callback decision"). It is not the goal of screening mammographic interpretation to provide a definitive diagnosis or to recommend biopsy without further consideration. Thus, a true-positive result in screening occurs whenever a woman with breast cancer is given a recommendation for additional work-up. However, in our study, determination of true-positive results was performed without reference to correct localization of the cancer by the radiologist, and, therefore, our estimates of radiologist screening sensitivity might be positively biased.
Authors of articles published within the past 5 years have reported that there is appreciable variability in the interpretative skills of the radiologist reading mammograms (2,3,9). This finding has even reached the lay press. In a recent front-page New York Times article (17), radiologists were cited as "the weak link" in mammographic screening programs. The result of these developments is, naturally, a desire to implement far-reaching interventions to improve the interpretation of screening mammograms for all American women. Logically, two options are available: One option is to restrict radiologists from interpreting screening mammograms on the basis of their skill (proscription), and the other option is to develop interventions targeted at improving the skills of those practicing radiologists most in need of training (prescription). On the basis of our study data, our conclusion is that prescription is by far preferable to proscription because the cost of the latter, in terms of reduced access to mammography for American women, is, we believe, unacceptable. In sum, we conclude that efforts should focus on improving the tools and skills of practicing radiologists while maintaining the access of American women to screening mammography.
| APPENDIX |
|---|
Determining Population Percentage That Would Need to be Restricted to Achieve Health Care Policy Goals
We first establish some terminology (Figure A1). Problem: We wish to move the median accuracy of the parent population upward by means of the restriction of a lower percentage of the population. What percentage should be restricted?
Therefore, $[N - 2N(1 - p)]/N = 1 - 2(1 - p)$ is the proportion of the parent population that must be restricted to "move" the median from $Q_{.50}$ to $Q_p$.
Now, let $d = p - .50$, where $d$ represents the amount by which we wish to shift the population median, in terms of proportions in the parent population. And, therefore, $p = .50 + d$.
Then, the quantity $1 - 2(1 - p)$ can be reexpressed as
$$1 - 2(1 - p) = 1 - 2[1 - (.50 + d)] = 1 - 2(.50 - d) = 2d.$$
In other words, to shift the median upward by the amount $d(100\%)$, the lower $2d(100\%)$ of the parent population must be restricted.
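The result can be checked numerically with a small Python sketch. The population here is a hypothetical set of 1,000 distinct accuracy ranks, chosen only so that quantiles are easy to read off; the function name is likewise illustrative.

```python
from statistics import median


def restrict_lowest(values, fraction):
    """Return the values that remain after removing the lowest `fraction` of them."""
    ranked = sorted(values)
    n_removed = round(fraction * len(ranked))
    return ranked[n_removed:]


# Hypothetical parent population: 1,000 readers with accuracy ranks 1..1000,
# so the value at the 0.60 quantile is 600.
population = list(range(1, 1001))

d = 0.10                                  # desired shift, as a proportion of the population
remaining = restrict_lowest(population, 2 * d)
new_median = median(remaining)            # median of the readers who remain
```

Restricting the lowest 20% leaves 800 readers (ranks 201 to 1,000), and their median falls between ranks 600 and 601, that is, at the 0.60 quantile of the original population, as the derivation predicts for d = 0.10.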
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Author contributions: Guarantor of integrity of entire study, C.A.B.; study concepts, C.A.B., E.F.C., E.A.S.; study design, C.A.B.; literature research, C.A.B., S.P.W.; data acquisition, C.A.B.; data analysis/interpretation, C.A.B., E.F.C., E.A.S.; statistical analysis, C.A.B.; manuscript preparation and definition of intellectual content, C.A.B.; manuscript editing, revision/review, and final version approval, all authors
| REFERENCES |
|---|