Radiology
Published online before print February 28, 2003, 10.1148/radiol.2271011962
(Radiology 2003;227:192-200.)
© RSNA, 2003


Breast Imaging

Computer-aided Detection versus Independent Double Reading of Masses on Mammograms1

Nico Karssemeijer, PhD, Johannes D. M. Otten, MS, Andre L. M. Verbeek, MD, PhD, Johanna H. Groenewoud, MD, PhD, Harry J. de Koning, MD, PhD, Jan H. C. L. Hendriks, MD, PhD and Roland Holland, MD, PhD

1 From the Departments of Radiology (N.K.) and Epidemiology and Biostatistics (J.D.M.O., A.L.M.V.) and the National Expert and Training Center for Breast Cancer Screening (J.H.C.L.H., R.H.), University Medical Center Nijmegen, Geert Grooteplein 18, 6525 GA Nijmegen, the Netherlands; and the Department of Public Health, National Evaluation Team for Breast Cancer Screening (J.H.G., H.J.d.K.). From the 2001 RSNA scientific assembly. Received November 30, 2001; revision requested January 10, 2002; final revision received August 5; accepted August 22. Supported by the National Evaluation Team for Breast Cancer Screening. Address correspondence to N.K. (e-mail: n.karssemeijer@rad.umcn.nl).


    ABSTRACT
PURPOSE: To evaluate the use of a computer-aided detection (CAD) system (designed for mammographic mass detection) to help improve mass interpretation and to compare CAD results with independent double-reading results.

MATERIALS AND METHODS: Screening mammograms from 500 cases were collected; 125 of these cases were screening-detected cancers, and 125 were interval cancers. Previously obtained screening mammograms (ie, prior mammograms) were available in all cases. All mammograms were analyzed by a CAD system, which detected mass regions and assigned a level of (cancer) suspicion to each mass. Ten experienced screening radiologists read the prior mammograms. For independent interpretation with CAD, the suspicion rating assigned to each finding by the radiologist was weighted with the CAD output at the area of the finding. CAD markers on areas that were not reported by the radiologist were not used. Independent double reading was implemented by using a rule to combine the levels of suspicion assigned to findings by two radiologists. Results were evaluated by using localized-response receiver operating characteristic analysis.

RESULTS: In a total of 141 cases, there was a visible abnormality at the location of the cancer on the prior mammogram, and 115 of these were classified as mass cases. For prior mammograms that depicted masses, the mean sensitivity of the radiologists, as averaged among the false-positive rates lower than 10%, was 39.4%; this increased by 7.0% with CAD and by 10.5% with double reading. Differences among single, double, and CAD readings were statistically significant (P < .001).

CONCLUSION: Although independent double reading yields the best detection performance, the presence and probability of CAD mass markers can improve mammogram interpretation.

Index terms: Breast neoplasms, diagnosis, 00.31, 00.32 • Cancer screening, 00.11 • Computers, diagnostic aid


    INTRODUCTION
Breast cancer screening programs have been established as an effective way to reduce mortality from breast cancer. It is well known that the effectiveness of these programs strongly depends on high acceptance by the target population and on high quality of the screening procedure. Although a lot of attention has been focused on technical quality assurance to guarantee optimal mammographic image quality, the quality of mammographic interpretation now seems to be the weakest link in the process. Review studies (1–6) have revealed that observer errors are frequent in breast cancer screening. On the basis of the results of these studies, it is estimated that 20%–30% of cancers could be detected in an earlier screening without an increase in the recall rate to an unacceptable level.

The causes of these false-negative screening examinations are not clear. When previously obtained screening mammograms (hereafter referred to as prior mammograms) from cancer cases are retrospectively reviewed and show substantial abnormality, it is often suggested that these abnormalities were overlooked, whereas the more subtle findings are classified as misinterpretations (5). Such classifications, however, might be misleading, because a retrospective review of known abnormalities is very different from a screening of unknown cases with a cancer prevalence of about 0.5%.

To reduce the number of false-negative interpretations at screening mammography, computer-aided detection (CAD) methods have been developed. These systems are primarily designed to generate prompts at suspicious areas on mammograms and have been shown to be effective in detecting cancers at a stage earlier than that at which radiologists detect malignancies (6–8). Typically, these prompting systems operate with high sensitivity, but their specificity is moderate. The idea behind these CAD systems is that when a radiologist carefully inspects mammographic areas that are prompted with CAD, the risk of overlooking a substantial abnormality is minimized.

It has been suggested that the potential contribution of CAD can be estimated from the sensitivity of CAD in identifying lesions missed at screening (6), with the assumption that these missed lesions were overlooked by the radiologist. However, the extent to which false-negative interpretations are really caused by radiologist oversight is not known. In a large prospective study (9), the sensitivity of screening increased by 19.5% with use of CAD. This result showed that errors due to oversight are frequent, but it also appeared to indicate that the increase in sensitivity was due mainly to improved detection of microcalcifications. The benefit of CAD in the detection of masses, architectural distortions, and focal asymmetries of the breast (all of which are categorized into one group and referred to as masses in this article) still remains an open issue. In general, radiologists find it more difficult to use the CAD prompts for mass detection than the CAD prompts for microcalcification detection. The reason for this may be that mass detection problems are related more to the interpretation of complex mammographic regions than to the search and detection of small signal intensities that are hampered by low contrast and noise.

Studies of perception in radiology have been conducted by using eye trackers, with which investigators have classified observer errors into three categories: search errors, detection errors, and interpretation errors (10,11). Search errors are those in which foveal sight never reached the lesion. Detection errors are those involving missed lesions that were focused on by the eye only briefly, with a visual dwell time shorter than a given threshold time. A threshold time of 1 second is typically chosen; dwell times below this threshold are thought to be too short to enable one to fully perceive the typical signs that would make the lesion suspicious for cancer. Lesions that are inspected longer than 1 second and either not reported or incorrectly classified are attributed to interpretation errors. These lesions are consciously evaluated but acted on inappropriately. Most current CAD systems are designed to prevent search and detection errors; however, some study results suggest that radiologists can also increase their performance when they use computers to help them interpret detected lesions (12–14).

Without considering recorded eye movements, one may define search and detection errors as those that occur when a radiologist does not report the presence of a visible lesion and interpretation errors as those that occur when the lesion is reported but not considered actionable. It should be noted, however, that such definitions make sense only when radiologists are asked to report any mammographic region that they closely inspect.

The purpose of our study was to evaluate the use of a CAD system (designed for mammographic mass detection) to help improve mass interpretation and to compare CAD results with independent double-reading results.


    MATERIALS AND METHODS
Case Collection and Annotation
All of the mammographic cases used in this study were from the Dutch Breast Cancer Screening Program. This is a population-based program that was gradually implemented in the Netherlands. Initiated in 1989, the program reached full coverage of the population in 1998. All women aged 50–75 years are invited every 2 years to participate in this program. The attendance rate is almost 80%. Mammographic examinations are performed in screening units operated by radiographers only; most of these units are mobile. The mammograms are transferred to reading centers, where they are double read by trained screening radiologists. It should be noted that these radiologists do not have the option to perform short-term follow-up or additional imaging.

Currently, about 800,000 screening mammograms are obtained annually in this program nationwide. About 8,800 (1.1%) of these mammograms are referred to a general hospital for further investigation (that is, additional imaging and/or biopsy). About half of these cases turn out to be breast cancer. The average positive predictive value of biopsy for screening-detected breast cancer cases in the Netherlands is around 70%. To get a clear understanding of the data set that we collected, it is important to note that two mammographic views (mediolateral oblique and craniocaudal) are obtained at the initial screening in this program, whereas only mediolateral oblique views are obtained at subsequent screenings, unless there is an indication that additional craniocaudal views would be beneficial.

We collected the screening mammograms from women screened between 1997 and 1999 in five of nine regions in the Netherlands. We chose only five regions to reduce our administrative workload. The district of Nijmegen was chosen because mammograms were easily accessible there. The other four districts were chosen at random. From all of the women in these five regions in whom breast cancer was detected, 125 positive screening-detected cases and 125 interval cancers were randomly chosen; we selected only those cases in which the mammograms acquired from at least two prior screening periods were available. Cases with insufficient image quality or poor positioning technique also were excluded. All women whose mammograms were included in the study gave their consent by completing a questionnaire at screening, in which they granted us permission to use their mammograms for scientific and educational purposes. Institutional review board approval was not required.

In the 250 positive cases, mammograms were obtained at three times: The diagnostic mammogram was that obtained at the time of cancer detection, and the mammograms obtained during the two screening sessions before cancer was detected are referred to as the prior and reference (obtained at screening before prior mammograms) mammograms. The diagnostic mammograms were either clinical mammograms from the interval cancer cases or screening mammograms of the screening-detected cancers. For each case, the average time between subsequent screening mammogram acquisitions was 2 years. For the interval cancers, the time between the diagnostic and prior mammogram acquisitions ranged from 3 to 26 months (average, 14 months).

The 250 normal (ie, negative) cases were randomly selected among women from the same five regions and during the same period as the cancer cases. With use of this selection process, biases due to variation in image quality, variation in equipment and film manufacturers, and the use of patient labels and identification markers were avoided. For each normal case, three screening mammograms were obtained: those that corresponded to the two prior screening rounds of the positive cases and the screening mammogram of the next round. The criterion for inclusion was that all three screening mammograms were reported as normal. The cases that were referred but involved benign lesions or that were reported as normal after additional imaging or biopsy were not included. This does not mean that there were no benign lesions in the studied data set. To maintain a low referral rate, screening radiologists in the Netherlands do not refer patients who have lesions that are judged to be benign by both readers.

The total number of mammograms collected for the study was 1,500. Among the 500 prior mammograms, an additional craniocaudal view was available in 62 positive cases and 35 negative (ie, normal) cases. The main reason that craniocaudal views were more often available in positive cases is that positive cases more often show a dense breast tissue pattern, which is an indication for radiographers to obtain a craniocaudal view. The unequal number of craniocaudal views among the positive and negative cases did not bias our study. This ratio reflects screening practice, in which readers are aware that four-view cases are associated with a higher probability of cancer. Among the 500 reference mammograms, craniocaudal views were obtained in 235 cases. Many of these cases were those of women undergoing their first screening. Among the 250 diagnostic examinations of the positive cases, there were also 235 cases in which a craniocaudal view was obtained.

A radiologist (J.H.C.L.H.), whose experience includes the reading of more than 10,000 screening mammograms annually since 1975, reviewed all the positive cases. We will refer to this radiologist as the study radiologist. The diagnostic mammograms and pathology reports were used to draw the locations of the cancers on paper printouts of the diagnostic and prior mammograms. When there was no visible sign of cancer on the prior mammogram, we annotated the cancer location by visually matching the lesion depicted on the diagnostic mammogram with the corresponding region of the lesion on the prior mammogram. No annotation was made when the cancer was hidden on the diagnostic mammogram as well. In some cases, a lesion was visible on the reference mammogram, but the location of these lesions was not annotated. The study radiologist classified each case as obvious, minimal sign, or not visible on the basis of the visibility of the cancer at the prior screening (ie, on the prior mammogram).

We digitized all of the images used in the study (a total of 3,732 screen films) by using a digitizing system (ImageChecker M1000, version 2.0; R2 Technology, Sunnyvale, Calif) equipped with a film scanner (Lumisys LS85; Lumisys, Sunnyvale, Calif). We archived the digitized images after averaging the spatial resolution down to 100 µm per pixel. The annotations made by the study radiologist were converted from paper printouts to a digital format by a research assistant.

Observer Study
A panel of 10 radiologists, not including the study radiologist, was invited to perform a blinded review of the collected mammograms. These radiologists had at least 5 years of screening experience and read 3,000–10,000 mammographic studies per year. Each reader independently read the original prior mammograms from all 500 cases during 10 sessions spread out over 2 days. The cases were presented on dedicated mammography film alternators in a random order. Mounting both current and prior mammograms is standard in the Dutch screening program. Therefore, to mimic daily screening practice as closely as possible, for each case, the mammogram obtained during the screening session previous to the prior screening session (ie, the reference mammogram) was also mounted on the alternator so that the radiologist could assess mammographic changes over time.

The 10 radiologists were instructed to use a very low threshold when deciding which findings to report so that all regions that they interpreted as possible locations of cancer could be analyzed afterward. They were provided with a scoring form for each case; this form included a printed copy of the prior mammogram. They were asked to draw the contours of each finding on the mediolateral oblique and craniocaudal views, classify the visible signs of the abnormality according to one or more predefined categories, and assess the likelihood (ie, suspicion) of malignancy.

In this study, we used five major categories of reported findings: mass, microcalcifications, architectural distortion, asymmetry, or "other." To assess the likelihood of malignancy of each finding, we used a scale with 14 levels that ranged from 0% to 100%. The scale had nine categories with a linear increase from 10% to 90% and four categories to subdivide the low and high ends of the scale: 1%, 2%, 5%, and 95%. We included the latter categories to avoid getting many responses in the highest and lowest categories. This choice was motivated by experience in previous experiments, in which some radiologists strongly preferred to use categories that were close to both ends of the scale.

In the analysis of data, a suspicion level score of 0 was assigned when a region that was identified by the other radiologists was not scored by one radiologist. It should be noted that indicated levels of the likelihood of malignancy were used only as reference points to help radiologists use the scale in a consistent manner (ie, not varying over time). Before the image interpretation sessions started, the readers were assured that their interpretation of the absolute levels of the scale would not influence the analysis of their results, as long as they did not change their use of the scale during the experiment and spread their scores over the full width.

The time to read images had to be limited for practical purposes. The radiologists could spend a maximum of 1.5 hours to review each batch of 50 cases; this is considerably longer than the time they are allotted in routine screening, at which reading of more than 100 cases per hour is common. More time was needed because many cases were abnormal and because our scoring form to report findings was more extensive than that used in regular screening.

Independent Double Reading
Independent double-reading results were derived by using a rule to combine reader (cancer) suspicion scores (15,16); that is, results were determined without the consensus or arbitration that is more common in practice. We computed the independent double-reading results by using the combination of results of two radiologists and averaging the levels of suspicion of their findings. Before combining the reader scores, we correlated the findings on the basis of their locations. Findings were combined only if they were sufficiently close to each other. We used the distance between the centers of the annotations. To combine findings, this distance had to be less than 2.5 cm. Otherwise, the findings were considered to be unrelated.

We computed the average double-reading result for each finding by considering each possible pair of radiologists. Thus, we paired each of the 10 readers with the other nine readers to obtain nine possible double-reading results. The arithmetic mean of these nine results was used to compute each reader’s overall double-reading result. The simple arithmetic mean was used to compute the average of the suspicion ratings. When only one of the radiologists in a pair marked a particular finding, two methods of computing the average were investigated: With the first method, the level of suspicion assigned by the radiologist who did not annotate the finding was considered to be zero. This way, the findings seen by only one of the two radiologists were strongly downgraded. With the second method, the rating of the radiologist who did mark the lesion was considered to be the unmodified combined rating.
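The pairing rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the finding representation as (x, y, suspicion) tuples in centimeters, the greedy nearest-center matching, and the function name `combine_pair` are assumptions made for illustration. Only the 2.5-cm distance criterion and the two ways of handling unmatched findings come from the text.

```python
import math

MATCH_DISTANCE_CM = 2.5  # pairing criterion stated in the text


def combine_pair(findings_a, findings_b, zero_if_unmatched=True):
    """Combine two readers' findings into double-reading scores.

    Each finding is a tuple (x_cm, y_cm, suspicion). Matching is greedy
    by nearest annotation center, which is an assumption of this sketch.
    zero_if_unmatched=True  -> first method (missing rating counts as 0)
    zero_if_unmatched=False -> second method (single rating kept as is)
    """
    combined = []
    unused_b = list(findings_b)
    for xa, ya, sa in findings_a:
        best, best_d = None, MATCH_DISTANCE_CM
        for fb in unused_b:
            d = math.hypot(xa - fb[0], ya - fb[1])
            if d < best_d:  # must be closer than 2.5 cm
                best, best_d = fb, d
        if best is not None:
            unused_b.remove(best)
            combined.append((xa, ya, (sa + best[2]) / 2.0))
        else:
            combined.append((xa, ya, sa / 2.0 if zero_if_unmatched else sa))
    # Findings marked only by the second reader.
    for xb, yb, sb in unused_b:
        combined.append((xb, yb, sb / 2.0 if zero_if_unmatched else sb))
    return combined
```

With the first method, a finding marked by only one reader keeps half of its rating; with the second method, it keeps the full rating.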

When computing the average score of the radiologist pairs, we also had to take into account that radiologists used the scale that we provided in different ways. Those who marked a large number of findings liberally used the lowest categories of the scale, whereas others who marked fewer findings used the 1%, 2%, and 5% categories of the 14-level scale less often. Furthermore, some radiologists distributed their findings more evenly across the upper end of the scale than others. It was important to address this latter issue without introducing noise as a result of the variability in use of the lower end of the scale. To facilitate evenly weighted average ratings, we computed an adjustment factor for each radiologist. We calculated the adjusted score by dividing the suspicion rating assigned to each of a reader’s findings by the arithmetic mean of the 25 highest suspicion ratings of the noncancerous findings.
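As a rough sketch of this per-reader adjustment, the Python fragment below divides each rating by the mean of a reader's 25 highest ratings on noncancerous findings. The function names and the handling of readers with fewer than 25 such findings (all available ratings are used) are assumptions of this illustration, not details given in the article.

```python
def adjustment_factor(noncancer_ratings, top_n=25):
    """Mean of the top_n highest ratings a reader gave to noncancerous findings."""
    top = sorted(noncancer_ratings, reverse=True)[:top_n]
    if not top or sum(top) == 0:
        raise ValueError("reader has no nonzero noncancerous ratings")
    return sum(top) / len(top)


def adjust_scores(ratings, noncancer_ratings, top_n=25):
    """Rescale all of a reader's ratings by that reader's adjustment factor."""
    factor = adjustment_factor(noncancer_ratings, top_n)
    return [r / factor for r in ratings]
```

Because only the highest noncancerous ratings enter the factor, variation in how often a reader uses the 1%, 2%, and 5% categories does not affect the rescaling.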

Computer-aided Detection
We processed all cases by using the R2 ImageChecker, version 2.0 system with a special installation of the software provided by R2 Technology so that we could archive the detected regions with a level of importance assigned to each region and its coordinates. It should be noted that in the near future, archiving such data will be common because this process is part of the DICOM (digital imaging and communications in medicine) standard that has been developed for CAD in mammography. The level of importance of regions is determined by using a detection algorithm and is normally used only internally to select the markers to be displayed for the radiologists by using a threshold. Detection algorithms involve the classification of cancers into two categories: masses and microcalcifications.

The mass detection algorithm encompasses a wide range of mammographic abnormalities, from spiculated masses and architectural distortions to ill-defined masses and focal asymmetries. This algorithm is not geared for the classification of detected lesions into benign and malignant types. Nevertheless, the level of importance of a marked region roughly corresponds to the probability that the region is cancerous. It should be noted that the CAD algorithm in this study did not involve the use of temporal information. Thus, it did not make use of the reference mammograms like the radiologists did. Only the CAD results obtained from the prior mammograms were used for simulation of reading with CAD.

In this study, we used only the output of the mass detection algorithm. We converted the level of importance of the regions detected by the CAD system into a standardized level of normality by using all images from the normal cases in this study. For a given CAD marker, the standardized level of normality was defined as the average number of noncancerous regions per image that were marked by the CAD system as having at least the level of importance of the region at hand. Note that this standardization was independent of a particular database of abnormalities with subjective annotations. Only a large representative data set of normal mammograms, which is much easier to obtain, is required for this procedure.
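The standardization can be sketched in a few lines of Python. The input format (a flat list of importance values for all CAD markers found on the normal images) and the function name are hypothetical; the definition itself follows the paragraph above.

```python
def level_of_normality(importance, normal_marker_importances, n_normal_images):
    """Average number of CAD-marked regions per normal image whose
    importance is at least that of the marker at hand."""
    count = sum(1 for v in normal_marker_importances if v >= importance)
    return count / n_normal_images
```

A marker with high importance therefore receives a low level of normality, because few regions on normal images reach that importance.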

Simulation of Reading with CAD
By independently combining the findings of the radiologists with the detection results of the CAD program, one can simulate mammogram readings with CAD and investigate the use of CAD markers as a possible way of improving mammogram interpretations. There are many ways this can be done. In this study, we restricted ourselves by considering only those areas that the observer detected and annotated. As a consequence, we ignored the possible true-positive findings of the CAD system that the radiologist overlooked. Furthermore, we looked at only the CAD mass detection results and therefore included only the findings of the radiologists that were assigned to at least one of the three categories: mass, architectural distortion, or asymmetry. Thus, we excluded all microcalcification findings.

By using the standardized level of normality of the regions inspected by the CAD mass detection algorithm, we implemented two methods of simulating radiologists assisted with CAD. The idea behind the first method is that the presence of a mass detection prompt in a region that a radiologist inspects would make that region more suspicious, whereas the absence of a prompt would suggest a lower likelihood of the region being cancerous. The second method refines this idea by giving less weight to mass markers with a higher standardized level of normality. Mathematically, the combined level of suspicion $S_{R+\mathrm{CAD}}$ is computed by using the equation $S_{R+\mathrm{CAD}} = S_R + f(L)$, where $L$ is the level of normality of the region, $S_R$ is the level of suspicion assigned by the reader, and $f(L)$ is the CAD weight function. In cases in which both craniocaudal and mediolateral oblique views were available and the CAD system hit (ie, identified) the region on both views, the level of the mass marker with the lowest level of normality was assigned to the finding. When the CAD system did not hit the region, the finding was determined to have the highest level of normality and thus maximally degraded the reader’s score.

To determine if a mass marker corresponded to a finding of a reader, we used the distance between the center of a mass in the area drawn by the reader and the location of the mass marker. If this distance was smaller than 1.5 cm, a hit was counted; otherwise, the marker was considered to be unrelated to the finding.

The two methods of simulating radiologists assisted by CAD were implemented as two different CAD weight functions: a step function and a function of the linear decrease with log(L). As the cutoff point for the step function, we used 0.4 marked normal region per image, which corresponds to the default setting of the R2 ImageChecker system for mass markers. This leaves one free parameter for each of the functions: the height and the slope of the step and linear functions, respectively. These parameters were determined by maximizing the average performance of the radiologists combined with the performance of the CAD system. Examples of the CAD weight functions are shown in Figure 1. It should be noted that adding a constant to the weight functions did not influence the results, because we determined performance as a function of a decision threshold (described in the Data Analysis section). The weight functions for obtaining the CAD results were the same for all radiologists. However, as explained earlier, we took into account differences in the way radiologists used the scale to report suspicious findings. This way, the relative weight of CAD was similar for each observer.
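A minimal Python sketch of the two weight functions and of the combination with a reader score follows. Only the 0.4 cutoff and the rule of taking the lowest level of normality across views come from the text. The height and slope defaults, the base-10 logarithm, the lower clamp `L_min`, the inclusive comparison at the cutoff, and all function names are illustrative assumptions; the fitted parameter values are not reported in this section.

```python
import math

STEP_CUTOFF = 0.4  # marked normal regions per image, from the text


def step_weight(L, height=1.0, cutoff=STEP_CUTOFF):
    """Step function: add a fixed bonus when the marker is below the cutoff.
    The height is a free parameter; 1.0 is a placeholder."""
    return height if L <= cutoff else 0.0


def linear_log_weight(L, slope=1.0, L_min=1e-3):
    """Weight that decreases linearly with log(L).
    The slope is a free parameter; base 10 and the L_min clamp
    (which avoids log(0)) are assumptions of this sketch."""
    return -slope * math.log10(max(L, L_min))


def combine_with_cad(reader_score, marker_levels, weight, L_max):
    """Reader score plus f(L). marker_levels holds the levels of normality
    of CAD markers hitting the finding (one per view); if there is no hit,
    the highest level of normality L_max is used."""
    L = min(marker_levels) if marker_levels else L_max
    return reader_score + weight(L)
```

Because performance is evaluated as a function of a decision threshold, shifting either function by a constant would leave the resulting curves unchanged, as the text notes.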



Figure 1. Graph shows the step (solid line) and linear (dashed line) CAD weight functions used to combine the mass marker output with the observer data. The horizontal axis represents the level of normality of the CAD markers, expressed as the number of normal (ie, noncancerous) regions per image that would be hit by the CAD system on average when only the markers on regions with levels of suspicion smaller than that level of normality would be displayed.

 
Data Analysis
Reader detection sensitivity (with or without CAD) at a given decision threshold was computed as the fraction of cases that would be recalled by the observer because of an abnormality detected at the correct location. In other words, if a reader reported a finding hitting a cancer and this finding was assigned a level of suspicion that exceeded the decision threshold, the case in which the finding was reported was counted as true-positive. A case was counted as false-positive when it had at least one false-positive finding that exceeded the threshold. By varying the decision threshold, we constructed localized-response receiver operating characteristic curves that showed sensitivity as a function of the false-positive fraction (17). It should be noted that these curves are not the same as standard receiver operating characteristic curves, because the cancer location is verified for determination of the sensitivity: When the fraction of false-positive findings reaches 100%, the sensitivity does not approach 100%, as would be the case with a receiver operating characteristic curve.
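The construction of such a curve can be sketched in Python as follows. The inputs are assumed to be pre-computed per case: for each cancer case, the highest suspicion among findings that hit the cancer (0 if none), and for each normal case, the highest suspicion among all findings (0 if none). These data structures and the function names are hypothetical; the false-positive fraction is taken from normal cases only, as described later in this section.

```python
def lroc_point(cancer_hit_scores, normal_max_scores, threshold):
    """Return (false-positive fraction, sensitivity) at one decision threshold.
    A case counts when its score exceeds the threshold."""
    tp = sum(1 for s in cancer_hit_scores if s > threshold)
    fp = sum(1 for s in normal_max_scores if s > threshold)
    return fp / len(normal_max_scores), tp / len(cancer_hit_scores)


def lroc_curve(cancer_hit_scores, normal_max_scores):
    """Sweep the threshold over all observed scores, from strict to lenient."""
    thresholds = sorted(set(cancer_hit_scores) | set(normal_max_scores),
                        reverse=True)
    return [lroc_point(cancer_hit_scores, normal_max_scores, t)
            for t in thresholds]
```

In this sketch, a cancer that no finding hits has a score of 0 and never exceeds any threshold, so the curve levels off below 100% sensitivity.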

We used the outlines drawn by the study radiologist who reviewed the cases to determine the cancer locations. By using the contours of the findings that the observers annotated on small printouts of the mammograms, we defined the true-positive findings as those that were within a certain distance from the true cancer location. A hit was counted when the center of a mass finding was closer than 2.5 cm to the center of a mass of true cancer. It should be noted that on the basis of this definition, our results did not depend on the size of the annotated findings or on the size of truth annotations made by the study radiologist. We chose the distance criterion of 2.5 cm as a compromise to balance the risk of counting accidental hits with the risk of missing true annotated findings of cancer.

For microcalcifications and architectural distortions in particular, there was considerable variation in the size of the annotated findings. Therefore, we could not set a very strict criterion. The risk of accidental hits with our criterion was mostly related to very small cancers. In this regard, it should be noted that the fraction of cancers with an annotation size of less than 1 cm in diameter was only 6.4%. The average size of the finding annotations made by the study radiologist was 1.9 cm. In our analysis, we did not include cases of cancers that were retrospectively classified as not visible on the prior mammograms. This way, the risk of counting accidental hits was reduced.

In a similar way, we defined a true-positive finding of the CAD system as that when a marker was less than 1.5 cm from the center of a mass of true cancer. Because the CAD system had more false-positive findings than the radiologists, we used a stricter criterion to reduce the chance of counting an accidental hit. We computed case sensitivity; this means that a hit was counted when a lesion was seen on either the mediolateral oblique or the craniocaudal view. To judge the stand-alone performance of the CAD system, free-response receiver operating characteristic curves were constructed and yielded the sensitivity as a function of the number of false-positive markers per image. This was done for both the diagnostic and the prior mammograms.

To obtain an estimate of the false-positive fraction, we used only the normal cases. This enabled us to prevent ambiguities in the annotations of abnormalities from influencing our study results. For instance, a radiologist might have annotated a multifocal lesion by outlining more than one finding, which easily could have led to findings being counted as false-positive if they did not correspond exactly to the truth annotations made by the study radiologist.

During screening, the false-positive rate generally is lower than 10%. It ranges from 1%–4% in European screening programs to around 8% on average in the United States (6). Therefore, we were particularly interested in the part of the localized-response receiver operating characteristic curves that represents sensitivity at a recall rate of less than 10%. The mean sensitivity over this range of recall rates was used as a measure of performance. For statistical analysis of differences among single, double, and single CAD readings, we used analysis of variance with a linear mixed model. To determine which reading conditions differed, we used the Tukey-Kramer contrast test.
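As an illustration of the performance measure, the mean sensitivity over a range of false-positive fractions can be computed as the area under the curve in that range divided by the width of the range. The text does not state how the mean was computed numerically; the sketch below assumes a curve given as sampled points, linear interpolation between them, and the trapezoid rule. The function name and arguments are hypothetical.

```python
def mean_sensitivity(fp_fractions, sensitivities, upper=0.10):
    """Average height of a piecewise-linear curve over [0, upper].

    Assumes fp_fractions is increasing, starts at 0, and reaches at least
    `upper`. Both lists must have the same length.
    """
    points = list(zip(fp_fractions, sensitivities))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= upper:
            break
        if x1 > upper:
            # Clip the last segment at `upper` by linear interpolation.
            y1 = y0 + (y1 - y0) * (upper - x0) / (x1 - x0)
            x1 = upper
        area += 0.5 * (y0 + y1) * (x1 - x0)  # trapezoid rule
    return area / upper
```

Restricting the average to false-positive fractions below 10% keeps the comparison within the recall rates that occur in screening practice.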

In addition to performing localized-response receiver operating characteristic analysis, we computed a summary score for each of the abnormalities that were visible on the prior mammograms by taking the average suspicion rating of the 10 readers. This enabled us to rank the cancers according to their level of suspiciousness and perform a separate analysis of a subset of the cancers. To get an indication of how often the cancers that were the most visible on prior mammograms were missed by the readers owing to oversight, we calculated how often the 50 least subtle cancers were seen by the readers. With the assumption that these more obvious cancers would most likely be annotated by any reader once he or she had seen them, the number of times that these cancers were not reported could be regarded as an indication of the number of oversight errors.
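The ranking and the oversight count described above can be sketched as follows. The data layout is hypothetical, and scoring an unreported lesion as 0 when averaging is an assumption; the text does not say how unreported lesions entered the summary score.

```python
def rank_by_mean_suspicion(ratings_by_lesion, top_n=50):
    """Return lesion ids ordered from most to least suspicious by mean rating.

    ratings_by_lesion: hypothetical dict of lesion id -> list of per-reader
    ratings, with 0 entered where a reader did not report the lesion
    (an assumption, not stated in the text).
    """
    means = {lesion: sum(r) / len(r) for lesion, r in ratings_by_lesion.items()}
    ranked = sorted(means, key=means.get, reverse=True)
    return ranked[:top_n]


def oversight_count(ratings_by_lesion, lesions):
    """Count reader-lesion pairs in which a listed lesion was not reported."""
    return sum(1 for lesion in lesions for r in ratings_by_lesion[lesion] if r == 0)
```

For example, with three readers and ratings `{"a": [80, 0, 60], "b": [10, 0, 0], "c": [90, 90, 70]}`, the two least subtle lesions are `c` and `a`, and one non-report is counted among them.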


    RESULTS
 
Retrospective Review
At review, it appeared that in five cases that were positive for cancer, the abnormality could not be determined with certainty. These cases were not included in further analyses. All of the other positive cases were confirmed at biopsy. For 13 of the interval cancers, no annotations were made because the cancer was hidden on the diagnostic and prior mammograms. In retrospect, 141 cases showed a visible abnormality at the location of the cancer on the prior mammogram. For the analyses in this study, we excluded all cases in which microcalcifications were the predominant sign. This left 115 cases with visible masses. The malignancies and sizes of these cancers when they were detected are listed in Table 1.



 
TABLE 1. Malignancy Characteristics and Histologic Sizes of Visible Masses

 
Single Reading
All radiologists finished the reading in approximately 1 hour per batch of 50 cases. The 10 radiologists reported a total of 3,806 findings, or an average of 0.77 finding per case (range, 0.52–1.20). There were 1,089 true-positive findings, 125 of which were excluded because microcalcifications were the only sign.

Figure 2 shows localized-response receiver operating characteristic curves representing the mass detection performance of the radiologists. The vertical axis shows the detection sensitivity, and the horizontal axis shows the fraction of normal cases that would have been referred owing to a false-positive finding. The results indicate a considerable variation in skill among the radiologists. At a false-positive referral rate of 3%, which is common in European screening programs, the detection sensitivity of the radiologists ranged from 27% to 54%. For all radiologists, a strong increase in sensitivity was observed as the referral rate increased. At a 20% referral rate, the average sensitivity was 68 of 115 cases (59%). This indicates that much of the reader variability could be attributed to interpretation skills rather than to search or detection problems.



 
Figure 2. Graph illustrates the mean performance of 10 screening radiologists in reading 115 prior mammograms from cancer cases with a visible mass, architectural distortion, or asymmetry, and 250 mammograms from normal cases (solid line). The dashed lines show the range of performance (minimum and maximum).

 
To determine how often the most obvious cancers in our series were reported, we ranked the 115 visible masses according to the average level of suspicion that was assigned to them by the readers. Subsequently, we selected the 50 masses that received the highest average suspicion rating. Figure 3 shows how often these cases were reported. Eleven of these 50 cases were missed once, eight cases were missed twice, and one case was missed three times. In total, these cancers went unreported by a radiologist 30 times, or in 6% of the 500 readings (50 cases, 10 readers) of the mammograms depicting these lesions.
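The totals quoted above follow directly from the miss counts. The short check below uses only numbers given in the text; the variable names are illustrative, and the count of cases reported by every reader assumes that the three categories listed are exhaustive.

```python
# times missed -> number of cases, as reported in the text
missed_counts = {1: 11, 2: 8, 3: 1}
n_cases, n_readers = 50, 10

total_misses = sum(times * cases for times, cases in missed_counts.items())
total_readings = n_cases * n_readers
miss_rate = total_misses / total_readings

# Assumes no case was missed more than three times.
cases_never_missed = n_cases - sum(missed_counts.values())

print(total_misses, total_readings, miss_rate, cases_never_missed)
```

This gives 11 + 16 + 3 = 30 misses in 500 readings, a rate of 6%.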



 
Figure 3. Graph shows the numbers of times the radiologists reported cancers in a subset of 50 positive cases, which collectively were determined to be cases of the most obvious cancers on prior mammograms.

 
Independent Double Reading
In Figure 4, the mean results of independent double reading are compared with the mean results of single reading. Double-reading results were obtained by matching the findings of every reader with the findings of all other readers. The error bars represent the estimated SDs of the means. We investigated the differences between the independent double-reading results and the single-reading results by using the mean sensitivity for the range of a 0%–10% false-positive fraction as a measure of performance. Independent double reading led to an increased mean sensitivity during this interval, from 39.4% to 49.9%. Using the alternative rule of combination by applying the single-reading score when only one radiologist reported a lesion turned out to be a poorer strategy. Double reading implemented in this way was not more effective than single reading.



 
Figure 4. Graph illustrates mean sensitivities for the detection of visible masses on prior mammograms at single reading, independent double reading, and independent CAD reading, as functions of the false-positive fraction. Performance improved both with CAD and double reading. The improved performance with CAD was a result of improved lesion characterization and did not include the potential reduction in oversight errors.

 
Simulated Reading with CAD
The stand-alone performance of the CAD mass detection algorithm, as measured from the database of 115 visible mass cases and 250 normal cases, is shown in Figure 5. For comparison, the mass detection performance with the diagnostic mammograms from the same cases also is shown. It should be noted that the large difference in CAD performance between the prior and diagnostic mammograms was partly due to the fact that for most of the prior mammograms, only mediolateral oblique views were available, and this decreased the probability that the CAD system detected the lesions. At a rate of 0.5 false-positive case per mammogram, the sensitivity of the CAD system was 71 of 115 cases (62%) for the prior mammograms and 108 of 115 cases (94%) for the diagnostic mammograms.



 
Figure 5. Graph illustrates detection performance of the CAD system in 115 study cases of visible masses on prior mammograms, as a function of the false-positive mass marker rate. For comparison, the performance of the CAD system in the same cases but with use of the diagnostic mammograms, in which the cancers were detected later, also is shown. The difference in performance between the two sets of mammograms was due to the increased visibility of the cancers on the diagnostic images and the fact that most of the prior mammograms were only mediolateral-oblique views. In the observer study, we used only the CAD results obtained from the prior mammograms.

 
Figure 4 shows the mean results obtained by combining the observer ratings with the output of the CAD mass detection program from the prior mammograms. These are the mean performance curves obtained with the linear weight function. Results for each individual reader are listed in Table 2. With use of CAD, the performance of all the radiologists increased, including those who had the best single-reading performance in our study. Measured as the increase in mean sensitivity at false-positive rates below 10%, the mean improvement in detection performance was 7.0%. With the step function used to weight the CAD markers instead of the linear weight function, the effect of CAD was smaller: the mean increase in sensitivity was 3.2% (Table 2).



 
TABLE 2. Mean Sensitivities in the Range of False-Positive Fractions Lower Than 10% for Single, Independent, and CAD Readings

 
Statistical Analysis
At analysis of variance, the three reading conditions—single reading, double reading, and CAD reading (ie, with linear weight function)—differed significantly (P < .001). By testing the statistical significance of differences among the mean sensitivities for the different conditions, we found that both double reading and CAD reading were better than single reading (Tukey-Kramer test, P < .001). According to the results of our analysis, independent double reading appeared to be significantly better than CAD reading (Tukey-Kramer test, P = .009). The data in Figure 4 indicate that this difference was due mostly to the difference in performance at a false-positive rate of less than 5%.


    DISCUSSION
 
In breast cancer screening programs, many cancers may be classified retrospectively as "late detected," despite double reading. The results of this study indicate that a large fraction of screening errors may be attributed to difficulties in interpretation. In other words, with use of a relatively high threshold for referral (about 1% in the described screening program), many cancers that are considered by the radiologists to be potential abnormalities at the time of screening are not referred for diagnostic workup. In our screening study, the major sign indicative of lesions on prior screening mammograms was a mass. Only a few cases in which microcalcifications were the only sign could be classified as misinterpreted.

We investigated the effectiveness of independent double reading by combining the suspicion scores of findings of the different radiologists. Double reading significantly improved their detection performances, even though the way in which reader scores were combined was simple. According to the data shown in Figure 3, there were some abnormalities that were obvious to most of the radiologists but not marked by some of them. These cases might have been classified as "overlooked" by those who did not mark them. With our score averaging scheme, assigning a level of suspicion of zero for the radiologist who did not mark the lesion greatly reduced the levels of suspicion of such findings. For such cases, a real double reading with a consensus meeting or arbitration (ie, with inclusion of a third radiologist who makes the decision) would probably lead to better results. It appeared, however, that the averaging procedure that we implemented yielded significantly better results than the alternative method of using the level of the reader who did see the finding as the combined reader result. These results indicate that the positive effect of down weighting the false-positive lesions marked by only one radiologist was much stronger than the negative effect of down weighting the true-positive lesions seen by only one radiologist. The results also demonstrate that the effect of misclassifying cases was stronger than the effect of missing cases owing to oversight.
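The two combination rules compared in this paragraph can be written down in a few lines. This is a sketch under stated assumptions, not the study's implementation: findings of the two readers are assumed to have been matched by location already, an unreported finding is represented as `None`, and the alternative rule is assumed to average the scores when both readers reported the finding. The function names are hypothetical.

```python
def combine_average(score_a, score_b):
    """Averaging rule: a reader who did not mark the finding contributes 0."""
    a = 0.0 if score_a is None else score_a
    b = 0.0 if score_b is None else score_b
    return (a + b) / 2.0


def combine_keep_single(score_a, score_b):
    """Alternative rule: if only one reader reported, keep that reader's score."""
    if score_a is None and score_b is None:
        return 0.0
    if score_a is None:
        return score_b
    if score_b is None:
        return score_a
    return (score_a + score_b) / 2.0
```

With the averaging rule, a finding rated 80 by one reader and unreported by the other gets a combined score of 40, whereas the alternative rule keeps it at 80. The first rule therefore down weights every finding marked by only one reader, true-positive or false-positive alike, which is the trade-off the paragraph discusses.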

The fact that we observed a benefit from independent double reading contradicts the results of a study performed by Taplin et al (15), in which the independent combining of the assessments of two readers did not lead to an increase in observer performance. Also, Jiang et al (16), in a study on the characterization of microcalcifications in which various ways of combining reader scores were used, did not observe a positive effect from independent double reading. Our results are in accordance, however, with other reports of double reading in breast cancer screening (4,18–20). The fact that Taplin et al (15) did not have improved performance with independent double reading might be explained by the fact that they did not use localization of the findings to combine reader assessments, but rather they combined assessments of mammograms in a regular receiver operating characteristic analysis. It also may be possible that their negative results were obtained because they did not average the ratings of the radiologists, but rather they used the most abnormal result of the two assessments for each case.

We investigated the potential benefit of using a CAD system to interpret mammographic masses by combining the level of normality of regions inspected by the CAD system with the radiologists’ judgments. The CAD system’s placements of markers in locations where the radiologist did not report a finding were ignored. A weight function was used to weight the outcomes of the CAD system with the normalized levels of suspicion that were assigned to findings by the radiologists. The results shown in Figure 4 show that there was a large benefit in using the CAD system in this way. For recall rates larger than 5%, the single-reading results improved to a level that was comparable to that of the independent double-reading results.

One might argue that the double-reading method that we implemented is not optimal and that in real practice the improvement achieved with double reading might be better with use of consensus or arbitration instead of the independent combination of the single-reading results. However, the CAD reading that we implemented was not optimal either, because there were unused true cancerous regions that were detected by the CAD system but overlooked by the radiologist (ie, not annotated). In other words, there was no way with this simulation that CAD could help us avoid oversight errors, even though this is generally considered the most important role of current CAD systems. The results of this study indicate that the use of a CAD system may have as much potential to help improve breast mass interpretation as it does to help avoid search errors.

The schemes that we used to combine the reader scores and the CAD markers were complex owing to the fact that correlation of the finding locations was required. In addition, the suspicion level scale that we used had many more levels than the ones used in actual screening practice. This makes the clinical use of the procedure that we designed impractical. However, our intention was not to design practical schemes for combining the results of independent reader results or for independent reading with CAD; our goal was to systematically study the potential benefit of using such strategies. We believe that the consensus double reading currently practiced in many breast cancer screening programs probably is somewhat better than the independent-reading combination approach that we studied. Also, with regard to CAD mass interpretation, one might argue that the independent combination of suspicion levels is not ideal. A radiologist who knows the strengths and weaknesses of a particular CAD system might be able to use CAD more optimally. In particular, the fact that the readers interpreted mammograms from two screening sessions, whereas the CAD system interpreted mammograms from only one screening was a weakness of our study. The growth of masses, which is generally recognized as an important indicator of malignancy, was not recognized as a measure of suspicion by the CAD system; this limitation reduced the weight of this feature in the combined scores.

We estimated the parameters of the CAD weight functions by maximizing the mean benefit obtained from using CAD for the whole group of readers. In practice, the step function can be implemented by displaying only the CAD marks on regions below a threshold level; current prompting systems are designed this way. Radiologists may use the presence or absence of a marker on regions that they inspect to help them make a decision when they are in doubt. However, in our study, the use of a continuous function to weight the CAD levels of normality with the radiologists’ ratings led to better results. These results suggest that displaying the importance of mass marks should be investigated.

The series of prior mammograms that we collected can be regarded as a representative sample from the Dutch screening program. It should be noted, however, that there were more interval cancers in our series than would have been obtained by means of random sampling. In actual practice, 36% of the cancers in regular participants in the Dutch screening program are interval cases, whereas 64% are screening-detected malignancies (21). We found that the rates of detection of interval and screening-detected cancers on prior mammograms were very similar. Therefore, we do not believe that our results would have been different if the selected cases had more accurately reflected the case distribution in real practice. However, one should be aware that the organization of a screening program affects the types of cases that one collects.

When the threshold for referral is relatively high and the screening interval is longer, visible abnormalities on prior mammograms will be less subtle on average; however, one might expect double reading to enable the detection of these less obvious abnormalities on prior mammograms. The subtlety of cancers in a given study set will have a major influence on radiologist performance, so one should be cautious when applying our results outside of the contexts described herein. However, we believe that the relative differences among single reading, double reading, and CAD reading that we observed are valid.

It should be noted that the CAD system that we used was not trained to classify or separate benign from malignant lesions. This system often marks benign abnormalities such as lymph nodes and cysts as highly abnormal. A radiologist can easily identify such marks as irrelevant and ignore them, but with our simple scheme, the CAD outputs were weighted independently of the radiologists’ interpretations. The effectiveness of CAD in helping to improve breast mass detection is expected to increase when CAD systems are further developed and trained to distinguish benign abnormalities from malignancies.


    ACKNOWLEDGMENTS
 
The authors thank Henny Rijken of the National Expert and Training Center for Breast Cancer Screening, Nijmegen, the Netherlands, for her contribution in organizing the observer study and Kathy O’Shaughnessy of R2 Technology, for her valuable comments on the manuscript.


    FOOTNOTES
 
N.K. is a consultant and stockholder of R2 Technology.

Abbreviation: CAD = computer-aided detection

Author contributions: Guarantor of integrity of entire study, N.K.; study concepts, N.K.; study design, all authors; literature research, N.K., J.H.C.L.H.; experimental studies, J.D.M.O., J.H.C.L.H., N.K., J.H.G., R.H.; data acquisition, J.D.M.O., N.K.; data analysis/interpretation, N.K.; statistical analysis, N.K.; manuscript preparation, definition of intellectual content, and editing, N.K.; manuscript revision/review, N.K., R.H., J.H.C.L.H., J.D.M.O., A.L.M.V.; final version approval, all authors.


    REFERENCES
 

  1. van Dijck JAAM, Verbeek ALM, Hendriks JHCL, Holland R. The current detectability of breast cancer in a mammographic screening program. Cancer 1993; 72:1933-1938.
  2. Harvey JE, Fajardo LL, Inis CA. Previous mammograms in patients with impalpable breast carcinoma: retrospective vs blinded interpretation. AJR Am J Roentgenol 1993; 161:1167-1172.
  3. Jones R, McLean L, Young J. Proportion of cancers detected at first incident screen which were false negative at the prevalent screen. Breast 1996; 5:339-343.
  4. Blanks RG, Wallis MG, Moss SM. A comparison of cancer detection rates achieved by breast cancer screening programmes by number of readers, for one and two view mammography: results from the UK National Health Service breast screening programme. J Med Screen 1998; 5:195-201.
  5. Vitak B. Invasive interval cancers in the Ostergotland mammographic screening programme: radiological analysis. Eur Radiol 1998; 8:639-646.
  6. Burhenne LJW, Wood SA, D’Orsi CJ, et al. The potential contribution of computer aided detection to the sensitivity of screening mammography. Radiology 2000; 215:554-562.
  7. te Brake GM, Karssemeijer N. Automated detection of breast carcinomas that were not detected in a screening program. Radiology 1998; 207:465-471.
  8. Birdwell RL, Ikeda DM, O’Shaughnessy KF, Sickles EA. Mammographic characteristics of 115 missed cancers later detected with screening mammography and the potential utility of computer-aided detection. Radiology 2001; 219:192-202.
  9. Freer TW, Ulissey MJ. Screening mammography with computer-aided detection: prospective study of 12,860 patients in a community breast center. Radiology 2001; 220:781-786.
  10. Kundel H, Nodine C, Carmody D. Visual scanning, pattern recognition and decision-making in pulmonary nodule detection. Invest Radiol 1978; 13:175-181.
  11. Nodine CF, Mello-Thoms C, Weinstein SP, et al. Blinded review of retrospectively visible unreported breast cancers: an eye-position analysis. Radiology 2001; 221:122-129.
  12. Jiang YL, Nishikawa RM, Schmidt RA, Metz CE, Giger ML, Doi K. Improving breast cancer diagnosis with computer-aided diagnosis. Acad Radiol 1999; 6:22-33.
  13. Veldkamp WJH, Karssemeijer N, Otten JD, Hendriks JH. Automated classification of clustered microcalcifications into malignant and benign types. Med Phys 2000; 27:2600-2608.
  14. Chan HP, Sahiner B, Helvie MA, et al. Improvement of radiologists’ characterization of mammographic masses by using computer-aided diagnosis: an ROC study. Radiology 1999; 212:817-827.
  15. Taplin SH, Rutter CM, Elmore JG, Seger D, White D, Brenner RJ. Accuracy of screening mammography using single versus independent double interpretation. AJR Am J Roentgenol 2000; 174:1257-1262.
  16. Jiang Y, Nishikawa RM, Schmidt RA, Metz CE, Doi K. Relative gains in diagnostic accuracy between computer-aided diagnosis and independent double reading. SPIE Med Imaging 2000; 3981:10-15.
  17. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996; 23:1709-1725.
  18. Thurfjell EL, Lernevall KA, Taube AAS. Benefit of independent double reading in a population-based mammography screening program. Radiology 1994; 191:241-244.
  19. Anderson ED, Muir BB, Walsh JS, Kirkpatrick AE. The efficacy of double reading mammograms in breast screening. Clin Radiol 1994; 49:248-251.
  20. Brown J, Bryan S, Warren R. Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms. BMJ 1996; 312:809-812.
  21. Fracheboud J, de Koning HJ, Beemsterboer PM, et al. Interval cancers in the Dutch breast cancer screening programme. Br J Cancer 1999; 81(5):912-917.


