Breast Imaging
1 From the Departments of Radiology (N.K.) and Epidemiology and Biostatistics (J.D.M.O., A.L.M.V.) and the National Expert and Training Center for Breast Cancer Screening (J.H.C.L.H., R.H.), University Medical Center Nijmegen, Geert Grooteplein 18, 6525 GA Nijmegen, the Netherlands; and the Department of Public Health, National Evaluation Team for Breast Cancer Screening (J.H.G., H.J.d.K.). From the 2001 RSNA scientific assembly. Received November 30, 2001; revision requested January 10, 2002; final revision received August 5; accepted August 22. Supported by the National Evaluation Team for Breast Cancer Screening. Address correspondence to N.K. (e-mail: n.karssemeijer@rad.umcn.nl).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Screening mammograms from 500 cases were collected; 125 of these cases were screening-detected cancers, and 125 were interval cancers. Previously obtained screening mammograms (ie, prior mammograms) were available in all cases. All mammograms were analyzed by a CAD system, which detected mass regions and assigned a level of (cancer) suspicion to each mass. Ten experienced screening radiologists read the prior mammograms. For independent interpretation with CAD, the suspicion rating assigned to each finding by the radiologist was weighted with the CAD output at the area of the finding. CAD markers on areas that were not reported by the radiologist were not used. Independent double reading was implemented by using a rule to combine the levels of suspicion assigned to findings by two radiologists. Results were evaluated by using localized-response receiver operating characteristic analysis.
RESULTS: In a total of 141 cases, there was a visible abnormality at the location of the cancer on the prior mammogram, and 115 of these were classified as mass cases. For prior mammograms that depicted masses, the mean sensitivity of the radiologists, as averaged among the false-positive rates lower than 10%, was 39.4%; this increased by 7.0% with CAD and by 10.5% with double reading. Differences among single, double, and CAD readings were statistically significant (P < .001).
CONCLUSION: Although independent double reading yields the best detection performance, the presence and probability of CAD mass markers can improve mammogram interpretation.
© RSNA, 2003
Index terms: Breast neoplasms, diagnosis, 00.31, 00.32; Cancer screening, 00.11; Computers, diagnostic aid
| INTRODUCTION |
|---|
The causes of these false-negative screening examinations are not clear. When previously obtained screening mammograms (hereafter referred to as prior mammograms) from cancer cases are retrospectively reviewed and show substantial abnormality, it is often suggested that these abnormalities were overlooked, whereas the more subtle findings are classified as misinterpretations (5). Such classifications, however, might be misleading, because a retrospective review of known abnormalities is very different from a screening of unknown cases with a cancer prevalence of about 0.5%.
To reduce the number of false-negative interpretations at screening mammography, computer-aided detection (CAD) methods have been developed. These systems are primarily designed to generate prompts at suspicious areas on mammograms and have been shown to be effective in detecting cancers at a stage earlier than that at which radiologists detect malignancies (6–8). Typically, these prompting systems operate with high sensitivity, but their specificity is moderate. The idea behind these CAD systems is that when a radiologist carefully inspects mammographic areas that are prompted with CAD, the risk of overlooking a substantial abnormality is minimized.
It has been suggested that the potential contribution of CAD can be estimated from the sensitivity of CAD in identifying lesions missed at screening (6), with the assumption that these missed lesions were overlooked by the radiologist. However, the extent to which false-negative interpretations are really caused by radiologist oversight is not known. In a large prospective study (9), the sensitivity of screening increased by 19.5% with use of CAD. This result showed that errors due to oversight are frequent, but it also appeared to indicate that the increase in sensitivity was due mainly to improved detection of microcalcifications. The benefit of CAD in the detection of masses, architectural distortions, and focal asymmetries of the breast (all of which are categorized into one group and referred to as masses in this article) still remains an open issue. In general, radiologists find it more difficult to use the CAD prompts for mass detection than the CAD prompts for microcalcification detection. The reason for this may be that mass detection problems are related more to the interpretation of complex mammographic regions than to the search and detection of small signal intensities that are hampered by low contrast and noise.
Studies of perception in radiology have been conducted by using eye trackers, with which investigators have classified observer errors into three categories: search errors, detection errors, and interpretation errors (10,11). Search errors are those in which foveal sight never reached the lesion. Detection errors are those involving missed lesions that were only briefly focused on by the eye but for which the visual dwell time was shorter than a given threshold time. A threshold time of 1 second is typically chosen; shorter dwell times are thought to be too brief to enable one to fully perceive the typical signs that would make the lesion suspicious for cancer. Lesions that are inspected longer than 1 second and either not reported or incorrectly classified are considered to be so owing to interpretation errors. These lesions are consciously evaluated but acted on inappropriately. Most current CAD systems are designed to prevent search and detection errors; however, some study results suggest that radiologists can also increase their performance when they use computers to help them interpret detected lesions (12–14).
Without considering recorded eye movements, one may define search and detection errors as those that occur when a radiologist does not report the presence of a visible lesion and interpretation errors as those that occur when the lesion is reported but not considered actionable. It should be noted, however, that such definitions make sense only when radiologists are asked to report any mammographic region that they closely inspect.
The purpose of our study was to evaluate the use of a CAD system (designed for mammographic mass detection) to help improve mass interpretation and to compare CAD results with independent double-reading results.
| MATERIALS AND METHODS |
|---|
Currently, about 800,000 screening mammograms are obtained annually in this program nationwide. About 8,800 (1.1%) of these mammograms are referred to a general hospital for further investigation, that is, additional imaging and/or biopsy. About half of these cases turn out to be breast cancer. The average positive predictive value of biopsy for screening-detected breast cancer cases in the Netherlands is around 70%. To get a clear understanding of the data set that we collected, it is important to note that two mammographic views (mediolateral oblique and craniocaudal) are obtained at the initial screening in this program, whereas only mediolateral views are obtained at subsequent screenings, unless there is an indication that additional craniocaudal views would be beneficial.
We collected the screening mammograms from women screened between 1997 and 1999 in five of nine regions in the Netherlands. We chose only five regions to reduce our administrative workload. The district of Nijmegen was chosen because mammograms were easily accessible there. The other four districts were chosen at random. From all of the women in these five regions in whom breast cancer was detected, 125 positive screening-detected cases and 125 interval cancers were randomly chosen; we selected only those cases in which the mammograms acquired from at least two prior screening periods were available. Cases with insufficient image quality or poor positioning technique also were excluded. All women whose mammograms were included in the study gave their consent by completing a questionnaire at screening, in which they granted us permission to use their mammograms for scientific and educational purposes. Institutional review board approval was not required.
In the 250 positive cases, mammograms were obtained at three times: The diagnostic mammogram was that obtained at the time of cancer detection, and the mammograms obtained during the two screening sessions before cancer was detected are referred to as the prior and reference (obtained at screening before prior mammograms) mammograms. The diagnostic mammograms were either clinical mammograms from the interval cancer cases or screening mammograms of the screening-detected cancers. For each case, the average time between subsequent screening mammogram acquisitions was 2 years. For the interval cancers, the time between the diagnostic and prior mammogram acquisitions ranged from 3 to 26 months (average, 14 months).
The 250 normal (ie, negative) cases were randomly selected among women from the same five regions and during the same period as the cancer cases. With use of this selection process, biases due to variation in image quality, variation in equipment and film manufacturers, and the use of patient labels and identification markers were avoided. For each normal case, three screening mammograms were obtained: those that corresponded to the two prior screening rounds of the positive cases and the screening mammogram of the next round. The criterion for inclusion was that all three screening mammograms were reported as normal. The cases that were referred but involved benign lesions or that were reported as normal after additional imaging or biopsy were not included. This does not mean that there were no benign lesions in the studied data set. To maintain a low referral rate, screening radiologists in the Netherlands do not refer patients who have lesions that are judged to be benign by both readers.
The total number of mammograms collected for the study was 1,500. From the 500 prior mammograms, an additional craniocaudal view was available in 62 positive cases and 35 negative (ie, normal) cases. The main reason that more positive cases were found among the craniocaudal views is that positive cases more often show a dense breast tissue pattern, which is an indication for radiographers to obtain a craniocaudal view. The unequal number of craniocaudal views among the positive and negative cases did not bias our study. This ratio reflects screening practice, in which readers are aware that four-view cases are associated with a higher probability of cancer. Among the 500 reference mammograms, craniocaudal views were obtained in 235 cases. Many of these cases were those of women undergoing their first screening. Among the 250 diagnostic examinations of the positive cases, there were also 235 cases in which a craniocaudal view was obtained.
A radiologist (J.H.C.L.H.), whose experience includes the reading of more than 10,000 screening mammograms annually since 1975, reviewed all the positive cases. We will refer to this radiologist as the study radiologist. The diagnostic mammograms and pathology reports were used to draw the locations of the cancers on paper printouts of the diagnostic and prior mammograms. When there was no visible sign of cancer on the prior mammogram, we annotated the cancer location by visually matching the lesion depicted on the diagnostic mammogram with the corresponding region of the lesion on the prior mammogram. No annotation was made when the cancer was hidden on the diagnostic mammogram as well. In some cases, a lesion was visible on the reference mammogram, but the location of these lesions was not annotated. The study radiologist classified each case as obvious, minimal sign, or not visible on the basis of the visibility of the cancer at the prior screening (ie, on the prior mammogram).
We digitized all of the images used in the study (a total of 3,732 screen films) by using a digitizing system (ImageChecker M1000, version 2.0; R2 Technology, Sunnyvale, Calif) equipped with a film scanner (Lumisys LS85; Lumisys, Sunnyvale, Calif). We archived the digitized images after averaging the spatial resolution down to 100 µm per pixel. The annotations made by the study radiologist were converted from paper printouts to a digital format by a research assistant.
Observer Study
A panel of 10 radiologists, not including the study radiologist, was invited to perform a blinded review of the collected mammograms. These radiologists had at least 5 years of screening experience and read 3,000–10,000 mammographic studies per year. Each reader independently read the original prior mammograms from all 500 cases during 10 sessions spread out over 2 days. The cases were presented on dedicated mammography film alternators in a random order. Mounting both current and prior mammograms is standard in the Dutch screening program. Therefore, to mimic daily screening practice as closely as possible, for each case, the mammogram obtained during the screening session previous to the prior screening session (ie, the reference mammogram) was also mounted on the alternator so that the radiologist could assess mammographic changes over time.
The 10 radiologists were instructed to use a very low threshold when deciding which findings to report so that they could analyze afterward all regions that they interpreted as possible locations of cancer. They were provided with a scoring form for each case; this form included a printed copy of the prior mammogram. They were asked to draw the contours of each finding on the mediolateral oblique and craniocaudal views, classify the visible signs of the abnormality according to one or more predefined categories, and assess the likelihood (ie, suspicion) of malignancy.
In this study, we used five major categories of reported findings: mass, microcalcifications, architectural distortion, asymmetry, or "other." To assess the likelihood of malignancy of each finding, we used a scale with 14 levels that ranged from 0% to 100%. The scale had nine categories with a linear increase from 10% to 90% and four categories to subdivide the low and high ends of the scale: 1%, 2%, 5%, and 95%. We included the latter categories to avoid getting many responses in the highest and lowest categories. This choice was motivated by experience in previous experiments, in which some radiologists strongly preferred to use categories that were close to both ends of the scale.
In the analysis of data, a suspicion level score of 0 was assigned when a region that was identified by the other radiologists was not scored by one radiologist. It should be noted that indicated levels of the likelihood of malignancy were used only as reference points to help radiologists use the scale in a consistent manner (ie, not varying over time). Before the image interpretation sessions started, the readers were assured that their interpretation of the absolute levels of the scale would not influence the analysis of their results, as long as they did not change their use of the scale during the experiment and spread their scores over the full width.
The time to read images had to be limited for practical purposes. The radiologists could spend a maximum of 1.5 hours to review each batch of 50 cases; this is considerably longer than the time they are allotted in routine screening, at which reading of more than 100 cases per hour is common. More time was needed because many cases were abnormal and because our scoring form to report findings was more extensive than that used in regular screening.
Independent Double Reading
Independent double-reading results were derived by using a rule to combine reader (cancer) suspicion scores (15,16), that is, without the consensus or arbitration that is more common in practice. We computed the independent double-reading results by using the combination of results of two radiologists and averaging the levels of suspicion of their findings. Before combining the reader scores, we correlated the findings on the basis of their locations. Findings were combined only if they were sufficiently close to each other: The distance between the centers of the annotations had to be less than 2.5 cm. Otherwise, the findings were considered to be unrelated.
We computed the average double-reading result for each finding by considering each possible pair of radiologists. Thus, we paired each of the 10 readers with the other nine readers to obtain nine possible double-reading results. The arithmetic mean of these nine results was used to compute each reader's overall double-reading result. The simple arithmetic mean was used to compute the average of the suspicion ratings. When only one of the radiologists in a pair marked a particular finding, two methods of computing the average were investigated: With the first method, the level of suspicion assigned by the radiologist who did not annotate the finding was considered to be zero. This way, the findings seen by only one of the two radiologists were strongly downgraded. With the second method, the rating of the radiologist who did mark the lesion was considered to be the unmodified combined rating.
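The pairing rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `combine_pair`, the tuple layout `(x_cm, y_cm, suspicion)`, and the greedy nearest-neighbor matching are hypothetical choices, not details taken from the study; only the 2.5-cm distance limit and the two ways of treating findings marked by a single reader come from the text.

```python
from math import hypot

MATCH_DISTANCE_CM = 2.5  # from the text: centers must be closer than 2.5 cm


def combine_pair(findings_a, findings_b, unmatched="zero"):
    """Combine the findings of two readers into double-reading scores.

    Each finding is a tuple (x_cm, y_cm, suspicion).
    unmatched="zero": a finding marked by one reader is averaged with 0.
    unmatched="keep": a finding marked by one reader keeps its rating.
    Matching is greedy (nearest unused finding), which is an assumption.
    """
    combined = []
    used_b = set()
    for xa, ya, sa in findings_a:
        best, best_d = None, MATCH_DISTANCE_CM
        for j, (xb, yb, _) in enumerate(findings_b):
            d = hypot(xa - xb, ya - yb)
            if j not in used_b and d < best_d:
                best, best_d = j, d
        if best is not None:
            used_b.add(best)
            # Both readers marked the region: plain arithmetic mean.
            combined.append((xa, ya, (sa + findings_b[best][2]) / 2))
        else:
            combined.append((xa, ya, sa / 2 if unmatched == "zero" else sa))
    # Findings of reader B that reader A did not mark.
    for j, (xb, yb, sb) in enumerate(findings_b):
        if j not in used_b:
            combined.append((xb, yb, sb / 2 if unmatched == "zero" else sb))
    return combined
```

To obtain a reader's overall double-reading result, such a function would be applied to that reader and each of the other nine readers in turn, and the nine resulting performance figures would then be averaged.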
When computing the average score of the radiologist pairs, we also had to take into account that radiologists used the scale that we provided in different ways. Those who marked a large number of findings liberally used the lowest categories of the scale, whereas others who marked fewer findings used the 1%, 2%, and 5% categories of the 14-level scale less often. Furthermore, some radiologists distributed their findings more evenly across the upper end of the scale than others. It was important to address this latter issue without introducing noise as a result of the variability in use of the lower end of the scale. To facilitate evenly weighted average ratings, we computed an adjustment factor for each radiologist. We calculated the adjusted score by dividing the suspicion rating assigned to each of a reader's findings by the arithmetic mean of the 25 highest suspicion ratings of the noncancerous findings.
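As a minimal sketch of this normalization, the following Python functions divide each rating by the mean of a reader's 25 highest ratings for noncancerous findings. The function names and the list-based inputs are illustrative assumptions; the text specifies only the divisor.

```python
def adjustment_factor(noncancer_ratings, top_n=25):
    """Mean of the top_n highest ratings a reader gave to noncancerous findings."""
    top = sorted(noncancer_ratings, reverse=True)[:top_n]
    if not top or sum(top) == 0:
        raise ValueError("reader has no nonzero ratings for noncancerous findings")
    return sum(top) / len(top)


def adjust_scores(ratings, noncancer_ratings, top_n=25):
    """Divide each of a reader's ratings by that reader's adjustment factor."""
    factor = adjustment_factor(noncancer_ratings, top_n)
    return [r / factor for r in ratings]
```

Because only the highest false-positive ratings enter the factor, differences in how liberally readers used the 1%, 2%, and 5% categories do not affect it.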
Computer-aided Detection
We processed all cases by using the R2 ImageChecker, version 2.0 system with a special installation of the software provided by R2 Technology so that we could archive the detected regions with a level of importance assigned to each region and its coordinates. It should be noted that in the near future, archiving such data will be common because this process is part of the DICOM (digital imaging and communications in medicine) standard that has been developed for CAD in mammography. The level of importance of regions is determined by using a detection algorithm and is normally used only internally to select the markers to be displayed for the radiologists by using a threshold. Detection algorithms involve the classification of cancers into two categories: masses and microcalcifications.
The mass detection algorithm encompasses a wide range of mammographic abnormalities, from spiculated masses and architectural distortions to ill-defined masses and focal asymmetries. This algorithm is not geared for the classification of detected lesions into benign and malignant types. Nevertheless, the level of importance of a marked region roughly corresponds to the probability that the region is cancerous. It should be noted that the CAD algorithm in this study did not involve the use of temporal information. Thus, it did not make use of the reference mammograms like the radiologists did. Only the CAD results obtained from the prior mammograms were used for simulation of reading with CAD.
In this study, we used only the output of the mass detection algorithm. We converted the level of importance of the regions detected by the CAD system into a standardized level of normality by using all images from the normal cases in this study. For a given CAD marker, the standardized level of normality was defined as the average number of noncancerous regions per image that were marked by the CAD system as having at least the level of importance of the region at hand. Note that this standardization was independent of a particular database of abnormalities with subjective annotations. Only a large representative data set of normal mammograms, which is much easier to obtain, is required for this procedure.
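The standardization can be illustrated with a short Python sketch. It assumes that the CAD level of importance is available as a number for every marker on the normal images; the function names and the closure-based lookup are hypothetical and serve only to show the counting rule.

```python
from bisect import bisect_left


def make_normality_lookup(normal_marker_levels, n_normal_images):
    """Build a function mapping a CAD importance level to a level of normality.

    normal_marker_levels: importance levels of all CAD regions found on the
    normal images. The returned value is the average number of such regions
    per image whose importance is at least the given level.
    """
    levels = sorted(normal_marker_levels)

    def level_of_normality(importance):
        # Number of normal-image regions with importance >= the given level.
        count = len(levels) - bisect_left(levels, importance)
        return count / n_normal_images

    return level_of_normality
```

A marker with a high level of importance therefore receives a low level of normality, because few regions on normal images reach that level.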
Simulation of Reading with CAD
By independently combining the findings of the radiologists with the detection results of the CAD program, one can simulate mammogram readings with CAD and investigate the use of CAD markers as a possible way of improving mammogram interpretations. There are many ways this can be done. In this study, we restricted ourselves by considering only those areas that the observer detected and annotated. As a consequence, we ignored the possible true-positive findings of the CAD system that the radiologist overlooked. Furthermore, we looked at only the CAD mass detection results and therefore included only the findings of the radiologists that were assigned to at least one of the three categories: mass, architectural distortion, or asymmetry. Thus, we excluded all microcalcification findings.
By using the standardized level of normality of the regions inspected by the CAD mass detection algorithm, we implemented two methods of simulating radiologists assisted by CAD. The idea behind the first method is that the presence of a mass detection prompt in a region that a radiologist inspects would make that region more suspicious, whereas the absence of a prompt would suggest a lower likelihood of the region being cancerous. The second method refines this idea by giving less weight to mass markers with a higher standardized level of normality. Mathematically, the combined level of suspicion $S_{R+\mathrm{CAD}}$ is computed by using the equation $S_{R+\mathrm{CAD}} = S_R + f(L)$, where $L$ is the level of normality of the region, $S_R$ is the level of suspicion assigned by the reader, and $f(L)$ is the CAD weight function. In cases in which both craniocaudal and mediolateral oblique views were available and the CAD system hit (ie, identified) the region on both views, the level of the mass marker with the lowest level of normality was assigned to the finding. When the CAD system did not hit the region, the finding was determined to have the highest level of normality and thus maximally degraded the reader's score.
To determine if a mass marker corresponded to a finding of a reader, we used the distance between the center of a mass in the area drawn by the reader and the location of the mass marker. If this distance was smaller than 1.5 cm, a hit was counted; otherwise, the marker was considered to be unrelated to the finding.
The two methods of simulating radiologists assisted by CAD were implemented as two different CAD weight functions: a step function and a function that decreases linearly with log(L). As the cutoff point for the step function, we used 0.4 marked normal regions per image, which corresponds to the default setting of the R2 ImageChecker system for mass markers. This leaves one free parameter for each of the functions: the height and the slope of the step and linear functions, respectively. These parameters were determined by maximizing the average performance of the radiologists combined with the performance of the CAD system. Examples of the CAD weight functions are shown in Figure 1. It should be noted that adding a constant to the weight functions did not influence the results, because we determined performance as a function of a decision threshold (described in Data Analysis section). The weight functions for obtaining the CAD results were the same for all radiologists. However, as explained earlier, we took into account differences in the way radiologists used the scale to report suspicious findings. This way, the relative weight of CAD was similar for each observer.
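The two weight functions and the additive combination rule might look as follows in Python. This is a sketch under assumptions: the base-10 logarithm, the inclusive cutoff, the lower clipping value `l_min`, and all parameter values in the usage are hypothetical; the text fixes only the 0.4 cutoff and the general shape of each function.

```python
import math

STEP_CUTOFF = 0.4  # marked normal regions per image (default mass-marker setting)


def step_weight(level_of_normality, height):
    """Step function: add `height` when a displayed prompt is present."""
    return height if level_of_normality <= STEP_CUTOFF else 0.0


def log_linear_weight(level_of_normality, slope, l_min=1e-3):
    """Weight that decreases linearly with log(L); l_min avoids log(0)."""
    return -slope * math.log10(max(level_of_normality, l_min))


def combined_suspicion(reader_score, level_of_normality, weight):
    """Combined score: reader's (adjusted) suspicion plus the CAD weight f(L)."""
    return reader_score + weight(level_of_normality)
```

For example, `combined_suspicion(1.0, 0.1, lambda l: log_linear_weight(l, 2.0))` adds a weight of about 2 to a reader score of 1. Since a constant offset does not change the ranking of findings, only the height or slope matters when performance is measured as a function of a decision threshold.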
Data Analysis
We used the outlines drawn by the study radiologist who reviewed the cases to determine the cancer locations. By using the contours of the findings that the observers annotated on small printouts of the mammograms, we defined the true-positive findings as those that were within a certain distance from the true cancer location. A hit was counted when the center of a mass finding was closer than 2.5 cm to the center of a mass of true cancer. It should be noted that on the basis of this definition, our results did not depend on the size of the annotated findings or on the size of truth annotations made by the study radiologist. We chose the distance criterion of 2.5 cm as a compromise to balance the risk of counting accidental hits with the risk of not counting true annotated findings of cancer.
For microcalcifications and architectural distortions in particular, there was considerable variation in the size of the annotated findings. Therefore, we could not set a very strict criterion. The risk of accidental hits with our criterion was mostly related to very small cancers. In this regard, it should be noted that the fraction of cancers with an annotation size of less than 1 cm in diameter was only 6.4%. The average size of the finding annotations made by the study radiologist was 1.9 cm. In our analysis, we did not include cases of cancers that were retrospectively classified as not visible on the prior mammograms. This way, the risk of counting accidental hits was reduced.
In a similar way, we defined a true-positive finding of the CAD system as that when a marker was less than 1.5 cm from the center of a mass of true cancer. Because the CAD system had more false-positive findings than the radiologists, we used a stricter criterion to reduce the chance of counting an accidental hit. We computed case sensitivity; this means that a hit was counted when a lesion was seen on either the mediolateral or the craniocaudal view. To judge the stand-alone performance of the CAD system, free-response receiver operating characteristic curves were constructed and yielded the sensitivity as a function of the number of false-positive markers per image. This was done for both the diagnostic and the prior mammograms.
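The two distance criteria and the case-based sensitivity rule can be sketched as follows. The data layout (dictionaries keyed by view name, centers as `(x_cm, y_cm)` tuples) and the function names are assumptions made for illustration; the 2.5-cm and 1.5-cm limits are those stated above.

```python
from math import hypot

READER_HIT_CM = 2.5  # criterion for findings annotated by radiologists
CAD_HIT_CM = 1.5     # stricter criterion for CAD markers


def is_hit(center, truth_center, max_distance_cm):
    """True when a finding or marker lies close enough to the cancer center."""
    dx = center[0] - truth_center[0]
    dy = center[1] - truth_center[1]
    return hypot(dx, dy) < max_distance_cm


def case_detected(marks_by_view, truth_by_view, max_distance_cm):
    """Case sensitivity: a hit on any available view counts as a detection."""
    for view, truth_center in truth_by_view.items():
        for mark in marks_by_view.get(view, []):
            if is_hit(mark, truth_center, max_distance_cm):
                return True
    return False
```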
To obtain an estimate of the false-positive fraction, we used only the normal cases. This enabled us to prevent ambiguities in the annotations of abnormalities from influencing our study results. For instance, a radiologist might have annotated a multifocal lesion by outlining more than one finding, which easily could have led to findings being counted as false-positive if they did not correspond exactly to the truth annotations made by the study radiologist.
During screening, the false-positive rate generally is lower than 10%. It ranges from 1% to 4% in European screening programs to around 8% on average in the United States (6). Therefore, we were particularly interested in the part of the localized-response receiver operating characteristic curves that represents sensitivity at a recall rate of less than 10%. The mean sensitivity for this range of recall rates was used as a measure of performance. For statistical analysis of differences among single, double, and single-plus-CAD readings, we used analysis of variance with a linear mixed model. To determine which reading conditions differed, we used the Tukey-Kramer contrast test.
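One way to compute such a mean sensitivity is to integrate the curve over false-positive fractions from 0 to 0.10 and divide by the width of that interval. The Python sketch below assumes one score per case (for cancer cases, the rating of the correctly localized finding, or 0 if none; for normal cases, the highest rating of any finding, or 0 if none) and linear interpolation between curve points; both are simplifying assumptions, not the exact procedure of the study.

```python
def mean_sensitivity(cancer_scores, normal_scores, max_fpf=0.10):
    """Mean sensitivity for false-positive fractions between 0 and max_fpf."""
    thresholds = sorted(set(cancer_scores) | set(normal_scores), reverse=True)
    points = [(0.0, 0.0)]  # (false-positive fraction, sensitivity)
    for t in thresholds:
        if t <= 0:
            continue  # a score of 0 means nothing was reported
        fpf = sum(s >= t for s in normal_scores) / len(normal_scores)
        sens = sum(s >= t for s in cancer_scores) / len(cancer_scores)
        points.append((fpf, sens))

    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpf:
            break
        if x1 > max_fpf:
            # Cut the last segment at max_fpf by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpf - x0) / (x1 - x0)
            x1 = max_fpf
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid rule

    last_x, last_y = points[-1]
    if last_x < max_fpf:
        # Curve ends early: sensitivity stays at its final value.
        area += (max_fpf - last_x) * last_y
    return area / max_fpf
```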
In addition to performing localized-response receiver operating characteristic analysis, we computed a summary score for each of the abnormalities that were visible on the prior mammograms by taking the average suspicion rating of the 10 readers. This enabled us to rank the cancers according to their level of suspiciousness and perform a separate analysis of a subset of the cancers. To get an indication of how often the cancers that were the most visible on prior mammograms were missed by the readers owing to oversight, we calculated how often the 50 least subtle cancers were seen by the readers. With the assumption that these more obvious cancers would most likely be annotated by any reader once he or she had seen them, the number of times that these cancers were not reported could be regarded as an indication of the number of oversight errors.
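The ranking step can be sketched as follows, assuming that each cancer has one rating per reader and that a rating of 0 means the reader did not report the lesion. The function name and the dictionary layout are hypothetical.

```python
def oversight_counts(ratings_by_cancer, top_n=50):
    """Count non-reports among the top_n cancers with the highest mean rating.

    ratings_by_cancer: {cancer_id: [rating of each reader, 0 if not reported]}
    Returns {cancer_id: number of readers who did not report the cancer}.
    """
    summary = {c: sum(r) / len(r) for c, r in ratings_by_cancer.items()}
    least_subtle = sorted(summary, key=summary.get, reverse=True)[:top_n]
    return {
        c: sum(1 for rating in ratings_by_cancer[c] if rating == 0)
        for c in least_subtle
    }
```

Summing the returned counts gives the number of reader-case combinations in which an obvious cancer went unreported, which is the quantity used above as an indication of oversight errors.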
| RESULTS |
|---|
Figure 2 shows localized-response receiver operating characteristic curves representing the mass detection performance of the radiologists. The vertical axis shows the detection sensitivity, and the horizontal axis shows the fraction of normal cases that would have been referred owing to a false-positive finding. The results indicate a considerable variation in skill among the radiologists. At a false-positive referral rate of 3%, which is common in European screening programs, the detection sensitivity of the radiologists ranged from 27% to 54%. For all radiologists, a strong increase in sensitivity was observed as the referral rate increased. At a 20% referral rate, the average sensitivity was 68 of 115 cases (59%). This indicates that much of the reader variability could be attributed to interpretation skills rather than to search or detection problems.
| DISCUSSION |
|---|
We investigated the effectiveness of independent double reading by combining the suspicion scores of findings of the different radiologists. Double reading significantly improved their detection performances, even though the way in which reader scores were combined was simple. According to the data shown in Figure 3, there were some abnormalities that were obvious to most of the radiologists but not marked by some of them. These cases might have been classified as "overlooked" by those who did not mark them. With our score averaging scheme, assigning a level of suspicion of zero for the radiologist who did not mark the lesion greatly reduced the levels of suspicion of such findings. For such cases, a real double reading with a consensus meeting or arbitration (ie, with inclusion of a third radiologist who makes the decision) would probably lead to better results. It appeared, however, that the averaging procedure that we implemented yielded significantly better results than the alternative method of using the level of the reader who did see the finding as the combined reader result. These results indicate that the positive effect of down weighting the false-positive lesions marked by only one radiologist was much stronger than the negative effect of down weighting the true-positive lesions seen by only one radiologist. The results also demonstrate that the effect of misclassifying cases was stronger than the effect of missing cases owing to oversight.
The fact that we observed a benefit from independent double reading contradicts the results of a study performed by Taplin et al (15), in which the independent combining of the assessments of two readers did not lead to an increase in observer performance. Also, Jiang et al (16), in a study on the characterization of microcalcifications in which various ways of combining reader scores were used, did not observe a positive effect from independent double reading. Our results are in accordance, however, with other reports of double reading in breast cancer screening (4,18–20). The fact that Taplin et al (15) did not have improved performance with independent double reading might be explained by the fact that they did not use localization of the findings to combine reader assessments, but rather they combined assessments of mammograms in a regular receiver operating characteristic analysis. It also may be possible that their negative results were obtained because they did not average the ratings of the radiologists, but rather they used the most abnormal result of the two assessments for each case.
We investigated the potential benefit of using a CAD system to interpret mammographic masses by combining the level of normality of regions inspected by the CAD system with the radiologists' judgments. The CAD system's placements of markers in locations where the radiologist did not report a finding were ignored. A weight function was used to weight the outcomes of the CAD system with the normalized levels of suspicion that were assigned to findings by the radiologists. The results in Figure 4 show that there was a large benefit in using the CAD system in this way. For recall rates larger than 5%, the single-reading results improved to a level that was comparable to that of the independent double-reading results.
One might argue that the double-reading method that we implemented is not optimal and that in real practice the improvement achieved with double reading might be better with use of consensus or arbitration instead of the independent combination of the single-reading results. However, the CAD reading that we implemented was not optimal either, because there were unused true cancerous regions that were detected by the CAD system but overlooked by the radiologist (eg, not annotated). In other words, there was no way with this simulation that CAD could help us avoid oversight errors, even though this is generally considered the most important role of current CAD systems. The results of this study indicate that the use of a CAD system may have as much potential to help improve breast mass interpretation as it does to help avoid search errors.
The schemes that we used to combine the reader scores and the CAD markers were complex because correlation of the finding locations was required. In addition, the suspicion level scale that we used had many more levels than the ones used in actual screening practice. This makes the clinical use of the procedure that we designed impractical. However, our intention was not to design practical schemes for combining independent reader results or for independent reading with CAD; our goal was to systematically study the potential benefit of using such strategies. We believe that the consensus double reading currently practiced in many breast cancer screening programs probably is somewhat better than the independent-reading combination approach that we studied. Also, with regard to CAD mass interpretation, one might argue that the independent combination of suspicion levels is not ideal. A radiologist who knows the strengths and weaknesses of a particular CAD system might be able to use CAD more optimally. In particular, the fact that the readers interpreted mammograms from two screening sessions, whereas the CAD system interpreted mammograms from only one screening session, was a weakness of our study. The growth of masses, which is generally recognized as an important indicator of malignancy, was not recognized as a measure of suspicion by the CAD system; this limitation reduced the weight of this feature in the combined scores.
We estimated the parameters of the CAD weight functions by maximizing the mean benefit obtained from using CAD for the whole group of readers. In practice, the step function can be implemented by displaying only the CAD marks on regions below a threshold level; current prompting systems are designed this way. Radiologists may use the presence or absence of a marker on regions that they inspect to help them make a decision when they are in doubt. However, in our study, the use of a continuous function to weight the CAD levels of normality with the radiologists' ratings led to better results. These results suggest that displaying the importance of mass marks should be investigated.
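The difference between a step weight function and a continuous one can be sketched as follows. The functional forms used in the study are not given in this section, so everything here is an assumption made for illustration: the CAD level of normality is taken to lie between 0 and 1, the combined score is taken to be the radiologist's rating multiplied by a weight, and the logistic shape, the threshold, the lower weight `low`, and the `steepness` parameter are all hypothetical.

```python
import math
from typing import Callable

def step_weight(normality: float, threshold: float = 0.5, low: float = 0.5) -> float:
    """Hypothetical step weight: full weight where a CAD marker would be shown
    (normality below the threshold), reduced weight elsewhere."""
    return 1.0 if normality < threshold else low

def continuous_weight(normality: float, threshold: float = 0.5,
                      low: float = 0.5, steepness: float = 10.0) -> float:
    """Hypothetical smooth weight: logistic transition from 1 down to `low`."""
    s = 1.0 / (1.0 + math.exp(steepness * (normality - threshold)))
    return low + (1.0 - low) * s

def combine_with_cad(rating: float, normality: float,
                     weight_fn: Callable[[float], float]) -> float:
    """Weight a radiologist's normalized suspicion rating by the CAD output
    at the location of the finding (assumed multiplicative combination)."""
    return rating * weight_fn(normality)

# A finding rated 0.5 by the radiologist in a region the CAD system considers fairly normal.
score_step = combine_with_cad(0.5, 0.8, step_weight)
score_cont = combine_with_cad(0.5, 0.8, continuous_weight)
```

With the step function, a region just above and a region far above the threshold receive the same weight; the continuous form grades the weight with the CAD output, which corresponds to the idea of conveying the importance of a mark rather than only its presence.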
The series of prior mammograms that we collected can be regarded as a representative sample from the Dutch screening program. It should be noted, however, that there were more interval cancers in our series than would have been obtained by means of random sampling. In actual practice, 36% of the cancers in regular participants in the Dutch screening program are interval cases, whereas 64% are screening-detected malignancies (21). We found that the rates of detection of interval and screening-detected cancers on prior mammograms were very similar. Therefore, we do not believe that our results would have been different if the selected cases had more accurately reflected the case distribution in real practice. However, one should be aware that the organization of a screening program affects the types of cases that one collects.
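The reasoning about case mix can be made concrete with a small reweighting sketch. The 36% and 64% proportions come from the text, and the equal split in the study set follows from the 125 screening-detected and 125 interval cancers that were collected; the two detection rates below are hypothetical placeholders chosen only to be similar to each other, and are not results from the study.

```python
def reweighted_rate(rate_screen: float, rate_interval: float,
                    frac_interval: float = 0.36) -> float:
    """Overall detection rate for a given fraction of interval cancers."""
    return (1.0 - frac_interval) * rate_screen + frac_interval * rate_interval

# Hypothetical, similar detection rates on prior mammograms for the two groups.
rate_screen_detected = 0.40
rate_interval_cancer = 0.38

study_mix = reweighted_rate(rate_screen_detected, rate_interval_cancer, frac_interval=0.5)
practice_mix = reweighted_rate(rate_screen_detected, rate_interval_cancer, frac_interval=0.36)
difference = abs(practice_mix - study_mix)
```

When the two group-specific rates are close, shifting the mix from 50% to 36% interval cancers changes the overall rate only marginally (here by less than half a percentage point), which is the basis for the expectation stated above.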
When the threshold for referral is relatively high and the screening interval is longer, visible abnormalities on prior mammograms will be less subtle on average; however, one might expect double reading to enable the detection of these less obvious abnormalities on prior mammograms. The subtlety of cancers in a given study set will have a major influence on radiologist performance, so one should be cautious when applying our results outside of the contexts described herein. However, we believe that the relative differences among single reading, double reading, and CAD reading that we observed are valid.
It should be noted that the CAD system that we used was not trained to classify or separate benign from malignant lesions. This system often marks benign abnormalities such as lymph nodes and cysts as highly abnormal. A radiologist can easily identify such marks as irrelevant and ignore them, but with our simple scheme, the CAD outputs were weighted independently of the radiologists' interpretations. The effectiveness of CAD in helping to improve breast mass detection is expected to increase when CAD systems are further developed and trained to distinguish benign abnormalities from malignancies.
| FOOTNOTES |
|---|
Abbreviation: CAD = computer-aided detection
Author contributions: Guarantor of integrity of entire study, N.K.; study concepts, N.K.; study design, all authors; literature research, N.K., J.H.C.L.H.; experimental studies, J.D.M.O., J.H.C.L.H., N.K., J.H.G., R.H.; data acquisition, J.D.M.O., N.K.; data analysis/interpretation, N.K.; statistical analysis, N.K.; manuscript preparation, definition of intellectual content, and editing, N.K.; manuscript revision/review, N.K., R.H., J.H.C.L.H., J.D.M.O., A.L.M.V.; final version approval, all authors.