Breast Imaging
1 From the Department of Radiology, University of Michigan Medical Center, CGC B2102, Box 0904, 1500 E Medical Center Dr, Ann Arbor, MI 48109-0904. From the 2001 RSNA scientific assembly. Received June 18, 2001; revision requested August 8; revision received November 7; accepted January 7, 2002. Supported by USPHS grant CA 48129 and research grant DAMD 17-96-1-6254 from the U.S. Army Medical Research and Materiel Command. N.P. supported by the Whitaker Foundation and USPHS grant CA 79943. B.S. supported by Career Development Award DAMD 17-96-1-6012. L.M.H. supported by USAMRMC grant DAMD 17-98-1-8211. Address correspondence to N.P. (e-mail: petrick@umich.edu).
| ABSTRACT |
|---|
MATERIALS AND METHODS: Digitized mammograms were processed with an adaptive enhancement filter followed by a local border refinement stage. Features were then extracted from each detected structure and used to identify potential masses. The performance of the algorithm was evaluated in independent cases obtained from 263 patients from two institutions. Each case contained one or more pathologically proved breast masses. Contralateral mammograms obtained in the same patients that did not contain a visible lesion were used to estimate the CAD marker rate for the algorithm. The tradeoff between detection sensitivity and the number of CAD marks was analyzed in this study.
RESULTS: Malignant masses were detected with the computer in 87% (135 of 156), 83% (130 of 156), and 77% (120 of 156) of the malignant cases at CAD marker rates of 1.5, 1.0, and 0.5 marks per mammogram, respectively. The difference between malignant mass-detection performance in subsets of cases collected at each institution was found to be less than 1%. The detection accuracy for benign masses was lower than that for malignant masses.
CONCLUSION: This mass-detection algorithm had a high sensitivity for detection of malignant masses. It may be useful as a second opinion in mammographic interpretation.
© RSNA, 2002
Index terms: Breast neoplasms, diagnosis, 00.31, 00.32; Breast neoplasms, radiography, 00.112; Computers, diagnostic aid
| INTRODUCTION |
|---|
Efforts to evaluate the usefulness of CAD in reducing the rate of missed cancers are ongoing. A prospective study of 12,860 patients in a community breast cancer center that used a commercial CAD system (ImageChecker V2.0; R2 Technologies, Los Altos, Calif) reported a cancer detection rate of 81.6% (40 of 49), with eight of the cancers initially detected only with the CAD system (6). This corresponds to a 20% (41 vs 49) increase in the number of cancers detected. These results demonstrate that use of a CAD system can reduce the rate of missed cancers when CAD results are used as a second opinion, even if not all cancers can be detected with the CAD system.
These results do not distinguish between cancers that appear on mammograms as masses alone, microcalcification clusters alone, or as a combination of mass and cluster. We define a "preoperative mass" as a palpable or nonpalpable mass that is identified during clinical or mammographic evaluation and either is selected for biopsy based on the results of the examination or is followed up and proves to be benign. Castellino et al (7) reported that the latest version of the R2 ImageChecker achieved a sensitivity for mass detection of 85.7% at a marker rate of 0.5 mark per image for 677 preoperative masses; this represents an improvement over the sensitivity of 74.7% at a marker rate of 1.0 mark per image achieved in a previous release (V2.0). Researchers who evaluated the Second Look system (CADx Medical Systems, Laval, Quebec, Canada) reported a mass-detection sensitivity of 84% at a marker rate of 1.1 marks per image with mammograms obtained from a database of 149 preoperative masses (8).
The purpose of this study was to evaluate the performance of our CAD mass-detection algorithm in marking preoperative masses.
| MATERIALS AND METHODS |
|---|
Training Cases
The clinical mammograms used for training the algorithm parameters, referred to as the training cases, were selected from the files of patients who had undergone mammographic evaluation and biopsy at our institution. A multiple-reading paradigm, in which a resident or fellow previewed each case and an official interpretation was then rendered by an attending radiologist, was typically used during the initial evaluation of each case.
The mammograms were acquired with MinR/MinR or MinR/MRE screen-film systems (Eastman Kodak, Rochester, NY) with dedicated processing. Series of consecutive malignant and consecutive benign masses from several years were collected with a computerized biopsy registry. The selection criterion used by the radiologists was that a biopsy-proved mass smaller than 2.5 cm appeared on the mammogram. Cases with microcalcifications or architectural distortions without a visible mass were excluded, as were cases with masses larger than 2.5 cm. The data set consisted of 253 mammograms in 102 patients who were examined between 1981 and 1989. The training set included 128 malignant and 125 benign masses. Sixty-three of the malignant masses and six of the benign masses were judged to be spiculated by a Mammography Quality Standards Act (MQSA)-approved radiologist.
The mammograms were digitized with a DIS-1000 laser film scanner (Lumisys, Sunnyvale, Calif) with a pixel size of 100 µm and 12-bit gray-level resolution. The gray levels were linearly proportional to optical density (OD) in the 0.1-2.8 OD range and gradually fell off in the 2.8-3.5 OD range.
Independent Test Cases
We analyzed the performance of the trained mass-detection algorithm with independent mammographic cases. These cases were collected from two different institutions (the University of Michigan Medical Center and the University of South Florida) and were not used in the training process. Series of consecutive malignant and consecutive benign masses were collected with a biopsy registry from each institution, in a similar manner to the process used in the collection of the training cases. Refer to the previous discussion on training-case selection for more details.
The first group of mammograms of preoperative masses, referred to as group 1, was selected from the files of 127 patients who underwent mammographic evaluation and biopsy at our institution (institution 1) between 1990 and 1999. The group 1 cases came from the same institution as the training cases and had at least one proven breast mass visible at mammography. Again, during the initial evaluation of these cases, a resident or fellow typically previewed each case; an official interpretation was then rendered by an attending radiologist (prior to MQSA in 1994) or an MQSA-approved radiologist.
Each case consisted of a single craniocaudal view and either a mediolateral oblique view or a lateral view of the breast containing the mass. For simplicity, we will refer to all views other than the craniocaudal view as the mediolateral oblique view in the following discussions, with the understanding that this also includes some lateral views. If both breasts of a patient had a mass, each breast was considered to be a separate case for data analysis. With this breast-based definition, a total of 138 cases (276 mammograms) were available.
The mammograms were acquired with MinR/MRE screen-film systems with dedicated processing in the years before 1997 (154 mammograms) and a Kodak 2000 screen-film system (Eastman Kodak) during and after 1997 (122 mammograms). Each case contained one or more preoperative masses that were identified prospectively during initial clinical evaluation or mammographic interpretation. The independent group 1 mammograms were digitized with an LS 85 laser film scanner (Lumisys) at 50 µm and 12-bit gray-level resolution. The gray levels were calibrated to be linearly proportional to OD in the 0.1-4.0 OD range. The images were reduced to a 100-µm pixel size by averaging 2 × 2 pixel neighborhoods before mass detection was performed.
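The reduction from a 50-µm to a 100-µm pixel size by 2 × 2 averaging can be illustrated with a minimal sketch. This is not the authors' code; the function name, the list-of-rows image representation, and the handling of odd image dimensions (trailing rows or columns are dropped) are assumptions made for illustration only.

```python
def downsample_2x2(image):
    """Average non-overlapping 2 x 2 pixel neighborhoods of a 2-D image.

    `image` is a list of equal-length rows of gray levels.
    Assumption: a trailing odd row or column is dropped, because the
    source does not say how image borders were treated.
    """
    rows = len(image) - len(image) % 2
    cols = (len(image[0]) - len(image[0]) % 2) if image else 0
    reduced = []
    for r in range(0, rows, 2):
        out_row = []
        for c in range(0, cols, 2):
            # Sum the four pixels of the block, then take their mean.
            total = (image[r][c] + image[r][c + 1]
                     + image[r + 1][c] + image[r + 1][c + 1])
            out_row.append(total / 4.0)
        reduced.append(out_row)
    return reduced
```

Each output pixel then covers four input pixels, so a 50-µm scan becomes a 100-µm image with half the number of rows and columns.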
Clinical cases from a public database available from the University of South Florida (USF) (institution 2) were also analyzed (9). We evaluated 142 craniocaudal and mediolateral oblique mammogram pairs obtained at USF in 136 patients between 1992 and 1998. These 142 USF cases will be referred to as the group 2 cases in the following discussions. Each group 2 case contained at least one proven breast mass visible at mammography. Additional information on the USF database can be found in the literature (9). For compatibility with the group 1 database, we selected only those USF mammograms digitized with the Lumisys 200 laser film scanner. Again, this scanner digitized the images at 50 µm and 12-bit gray-level resolution, but the gray levels were calibrated to be linearly proportional to OD in the 0.1-3.6 OD range. The group 2 cases came from a different institution than the training cases.
We used lesion-free mammograms of the breast contralateral to those breasts that contained an abnormality to estimate the CAD marker rate for the algorithm. These mammograms are referred to as normal cases in this study. In our analysis, "normal" implies only that a mammogram did not contain a visible mass at the time of the mammographic examination and at the time of a second review by an MQSA-approved radiologist during data collection. A total of 251 mammograms from the 127 group 1 patients and 252 mammograms from the 136 group 2 patients were included as normal mammograms. There were fewer normal than abnormal mammograms because seven of the 263 combined group 1 and group 2 patients had visible lesions in both the right and left breasts and because not all contralateral mammograms were digitized.
Table 1 summarizes the group 1 and 2 test cases used to evaluate the mass-detection algorithm. It includes the numbers of malignant and benign masses separated by whether they were visible in both views or in only a single view. Figure 1 shows the distributions of lesion subtlety (1 = subtle, 5 = obvious) on the mammograms obtained from the group 1 and 2 databases, as ranked by a radiologist (M.A.H. for the group 1 mammograms) who evaluated each individual mass.
The subtlety ratings for all group 2 masses were retrieved from the USF database and were also based on a five-point rating system. However, the USF ratings for the group 2 cases did not use the same subtlety definitions as those described earlier in this paragraph for the group 1 cases. Instead, the ratings were defined as follows: 1, subtle; 2, twice as subtle as rating 1; 3, three times as subtle as rating 1; and so forth.
The mammographic lesion size of each group 1 mass was measured by the radiologist during initial case evaluation. The malignant group 1 masses had a mean size, SD, and median size of 15.4 mm, 12.0 mm, and 12.0 mm, respectively. The benign group 1 masses had a mean size, SD, and median size of 13.4 mm, 11.8 mm, and 10.0 mm, respectively. Radiologist-measured mass sizes were not used for the group 2 cases because we found that the boundaries of the masses, which were hand-drawn by the reviewing radiologists, were much larger than the actual mammographic lesion size. Therefore, mass size information is not reported for the group 2 cases.
The institutional review board of our institution did not require the collection of racial or ethnic information from these patients, so no statistics on racial or ethnic composition are available for the group 1 cases. However, because the cases were randomly sampled from the records of patients undergoing mammography at our hospital, the racial and ethnic composition of the group of patients in this study is expected to be similar to that of our patient population. The ethnicity statistics for our mammography screening patient population in 1998 and 1999 are given in Table 2. Table 2 also includes the patient ethnicity statistics for the group 2 cases, which were provided in the USF public database.
Algorithm training. The computer program was trained with the entire training data set of 253 mammograms. The training process included adjusting the filters, clustering, selected features, and classification thresholds. Once training was completed, the parameters and all thresholds were fixed for testing. The training data set was then resubstituted into the algorithm and was found to have a mammogram-based (ie, when each mass on each mammogram was considered as an independent sample) training sensitivity of 81% (205 of 253) overall and 85% (109 of 128) for malignant masses. The mass-detection algorithm produced 2.9 marks per mammogram on average at this sensitivity level in the training cases. It is important to note that the detection classifiers considered only classification between breast masses and normal tissue, not classification between malignant and benign masses. Therefore, no distinction was made between malignant and benign masses in the training process.
Definition of True-Positive and False-Positive Markers
For the group 1 cases, the smallest bounding box containing the entire mass identified by a radiologist was used as the truth. For the group 2 cases, we used a bounding box around the radiologist-outlined mass region provided with each image. Our definition of a true-positive finding was based on the percentage of overlap between the bounding box of an identified structure and the bounding box of the true mass. On the basis of findings in the training set, we chose an overlap threshold of 25%. This value corresponds to the minimum overlap between the bounding box of a detected object and the bounding box of a true mass for the object to be considered as a true-positive detection. The 25% threshold was selected because it was found to match well with true-positive visual identifications. The detected objects were first labeled automatically by the computer with this criterion. All of the true-positive masses were then visually reviewed to make sure that the program highlighted the true lesion and not a neighboring structure. Marks that were found to match neighboring structures were eliminated as true-positive marks.
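The automatic labeling step of this criterion can be sketched as follows. The text does not state which area normalizes the overlap or whether the 25% threshold is inclusive, so this illustration assumes that overlap is the intersection area divided by the area of the true-mass bounding box and that a value equal to the threshold counts as a detection. The function names and the (x0, y0, x1, y1) box format are hypothetical, and the subsequent visual review by a radiologist is not modeled.

```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); zero if degenerate."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def overlap_fraction(detected, truth):
    """Intersection area divided by the area of the true-mass box.

    Assumption: the source does not say which box normalizes the
    overlap; the true-mass box is used here for illustration.
    """
    intersection = (
        max(detected[0], truth[0]),
        max(detected[1], truth[1]),
        min(detected[2], truth[2]),
        min(detected[3], truth[3]),
    )
    truth_area = box_area(truth)
    if truth_area == 0:
        return 0.0
    return box_area(intersection) / truth_area


def is_true_positive(detected, truth, threshold=0.25):
    """Label a detected object against one true mass (inclusive threshold)."""
    return overlap_fraction(detected, truth) >= threshold
```

For example, a detected box that covers one quarter of the true-mass box sits exactly at the threshold and is labeled true-positive under these assumptions, whereas a box that only clips a corner is not.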
The number of false-positive marks produced by the algorithm was determined by counting the markings produced in normal cases. We used a total of 251 normal mammograms from group 1 and 252 normal mammograms from group 2 to estimate the marker rate. The true-positive fraction, calculated from the abnormal cases, and the average number of marks per image, calculated from the normal cases, were determined for a fixed set of thresholds at the final texture-classification stage. The true-positive fraction and the average number of marks per mammogram as the decision threshold varied were then used to plot the free-response receiver operating characteristic (FROC) performance curves for malignant and benign masses with the different data sets.
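The construction of an FROC curve from these two quantities can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes that each detected object carries a classifier score for which larger values are more mass-like, and the function name, argument names, and data layout are hypothetical.

```python
def froc_points(lesion_scores, normal_mark_scores, n_normal_images, thresholds):
    """Return (marks per image, true-positive fraction) for each threshold.

    lesion_scores: one entry per true mass in the abnormal cases, holding
        the highest score of any detection that hit the mass, or None if
        the mass was never detected.
    normal_mark_scores: scores of every mark produced on normal mammograms.
    n_normal_images: number of normal mammograms used for the marker rate.
    Assumption: a mark is kept when its score is >= the decision threshold.
    """
    points = []
    n_lesions = len(lesion_scores)
    for t in thresholds:
        # True-positive fraction comes from the abnormal cases only.
        hits = sum(1 for s in lesion_scores if s is not None and s >= t)
        # Marker rate comes from the normal cases only.
        marks = sum(1 for s in normal_mark_scores if s >= t)
        points.append((marks / n_normal_images, hits / n_lesions))
    return points
```

Lowering the threshold moves along the curve toward higher sensitivity at the cost of more marks per mammogram, which is the tradeoff the FROC plot displays.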
| RESULTS |
|---|
Results are also presented for two different true-positive scoring methods. The individual scoring method considers each mass on a mammogram or in a case as a different true-positive finding. The grouped scoring method considers all malignant masses on a mammogram or in a case as a single true-positive finding (5). The rationale for group scoring is that a radiologist may not need to be alerted to all malignant lesions in a mammogram or case before taking action. Therefore, multiple detections in a mammogram or case may not substantially enhance the power of CAD.
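The difference between the two scoring methods can be made concrete with a short sketch. The data layout (one list of detection flags per case or mammogram, one flag per malignant mass) and the function names are assumptions chosen for illustration.

```python
def individual_sensitivity(cases):
    """Individual scoring: every mass is its own true-positive opportunity.

    cases: list of lists of booleans, one inner list per case (or
    mammogram) and one flag per malignant mass, True if it was detected.
    """
    total_masses = sum(len(case) for case in cases)
    detected = sum(1 for case in cases for flag in case if flag)
    return detected / total_masses


def grouped_sensitivity(cases):
    """Grouped scoring: a case counts once if any of its masses is detected."""
    detected_cases = sum(1 for case in cases if any(case))
    return detected_cases / len(cases)
```

With three cases in which two of four masses are detected, and those two lie in different cases, individual scoring gives 2 of 4 and grouped scoring gives 2 of 3. Grouped sensitivity is therefore never lower than individual sensitivity on the same detections.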
FROC performance curves, which were calculated on the basis of individual mass scoring, are shown in Figure 3 for the group 1 cases. Similar data are presented for the group 2 cases in Figure 4. These figures include per case and per mammogram performance curves for the detection of both malignant and benign masses and are included to show the true-positive fraction achievable for a large range of marker rates.
The per case and per mammogram FROC performance curves in malignant masses, calculated on the basis of grouped mass scoring, are shown in Figure 5. These curves show how the true-positive fraction varies as a function of the marker rate for group scoring, which was expected to be our most clinically relevant measure of algorithm performance. It is evident that the algorithm provided consistent malignant mass-detection performance for both independent test sets over a wide range of marker rates.
| DISCUSSION |
|---|
The estimated performance values for our detection algorithm compare well with published performance results for commercial CAD vendors (an 85.7% true-positive fraction at 0.5 mark per image with the R2 program and an 84% true-positive fraction at 1.1 marks per image with the CADx program [7,8]), as well as with results for other algorithms currently being developed in research laboratories. This indicates that our mass-detection algorithm could be beneficial to radiologists as a second opinion. It also indicates that, although different methods are used in different detection programs, they all may be able to result in effective CAD and may lead to improvements in mammographic screening if the algorithms are properly trained.
We first compared the performance of the algorithm in malignant lesions with its performance in benign lesions and found that the detection performance in malignant masses is better than that in benign masses. One possible cause for the performance differences between benign and malignant mass detection is a difference in lesion subtlety. We observed a small difference in the subtlety ratings between the benign and malignant masses in the group 1 database, with the malignant masses being slightly more obvious than the benign masses. This same trend holds for the group 2 masses as well. However, it should be noted that the subtlety distributions between group 1 and group 2 differ considerably, as will be discussed in the following paragraphs. The observed difference in subtlety ratings between benign and malignant masses for both groups 1 and 2 is not particularly large, so a subtlety difference does not seem to fully explain the large disparities observed in the FROC curves.
Another factor that probably contributed to the observed difference is that malignant masses are more likely to be spiculated than are benign masses; the performance of our algorithm in spiculated masses is superior to its performance in nonspiculated masses. It is evident that the detection algorithm is better suited for detecting spiculated masses, especially at the lower marker rates, although no special efforts were made to train the algorithm to detect spiculated masses. We surmise that the texture-analysis function of the algorithm acquired a higher sensitivity to spiculated masses during the training process because of the relatively large fraction of spiculated lesions in the training set. Even though the detection algorithm had a higher sensitivity in detecting spiculated masses, the large number of nonspiculated masses in the training set (182 of 253) still trained the algorithm to be sensitive to nonspiculated malignant masses. The sizable difference between the curves for detection of malignant and benign nonspiculated masses suggests that some additional, as yet undetermined, factors may also have contributed to the observed performance difference between the assessment of malignant masses and the assessment of benign masses.
We also observed differences in algorithm performance between masses in group 1 and masses in group 2. The performance rates in detecting malignant masses were quite similar between the two groups, but the detection of benign lesions differed considerably between the groups. One potential factor is that 94% (147 of 157) of the benign masses in the group 1 database were later selected for biopsy. This high rate of biopsy of benign lesions suggests that the group 1 masses were judged by the radiologist to be similar enough to malignant masses to warrant biopsy (ie, the vast majority of the lesions were American College of Radiology Breast Imaging Reporting and Data System [BI-RADS] categories 4 and 5). We therefore can expect the detection performance in these benign masses to be somewhat similar to that in the malignant masses for group 1. The number of biopsies of benign lesions was not available in the group 2 database, but it is likely that a smaller fraction of the benign lesions were selected for biopsy, resulting in the presence of a larger fraction of BI-RADS category 2 or 3 lesions in this group. If this is true, then the characteristics of the group 2 benign masses would not have matched the characteristics of the benign masses in our training set very well; the group 2 benign masses therefore may have been more difficult to detect.
Another factor that may have contributed to this performance difference is a difference in the OD ranges of the digitizers used to acquire the cases at each institution. The OD ranges were 0-3.5, 0-4.0, and 0-3.6 for the Lumisys digitizers used to digitize the training, group 1, and group 2 mammograms, respectively. The smaller OD range of the digitizer used to digitize the group 2 mammograms may have caused a decrease in the detection performance for subtle low-density lesions compared with the group 1 performance in similar cases. However, the group 2 digitizer had an advantage in many of the cases because it better matched the OD range of the digitizer used to acquire the training set. Because of the presence of other factors such as case variability, it is difficult to distinguish the relative importance of these competing effects on the performance of the algorithm.
When we compared the subtlety ratings between the group 1 and group 2 databases, we observed a large disparity in the radiologists' ratings. One may conclude that the group 2 cases were easier than the group 1 cases in terms of both malignant and benign masses. However, this does not agree with our detection results. The detection performance in the group 1 benign cases was much better than that in the group 2 benign cases, even though the group 1 lesions were rated as more subtle. The more "obvious" malignant masses in the group 2 database resulted in only a small (1%-2%) gain in the detection performance when compared with the group 1 malignant cases. Likewise, visual comparison of the cases did not reveal such a large difference between the databases. The group 2 subtlety distribution does not match well with what is expected in clinical practice because it is highly skewed toward obvious. One would expect that a randomly drawn sample from the patient population would follow a distribution more similar to the group 1 histogram. Therefore, the subtlety difference was most likely caused by a difference in the subjective criteria used to define lesion subtlety instead of a true difference in subtlety between the cases. It is likely that the individual radiologists at the different institutions used different scales. The radiologists reading cases from institution 1 appeared to have spread their subtlety ratings across the multiple categories, while the radiologists at institution 2 seemed to have basically used a binary decision of visible or not visible.
The results suggest that caution must be taken when comparing detection results obtained in cases from different databases. Even if subtlety ratings are available, the rating criteria may be subject to large inter- and intraobserver variations. This is especially true if the databases are collected by different institutions. Comparisons between lesions rated at a single institution with a consistent rating criterion (eg, comparing malignant and benign lesions from the same data set) are less problematic.
The preoperative masses evaluated in this preliminary study were all characterized during mammographic evaluation on the basis of multiple reading of the case by a resident or fellow and an attending MQSA-approved radiologist. Clearly, CAD was not used as an aid to the radiologist during initial case interpretation. The collected data were simply used to characterize the expected performance of the algorithm and to provide a benchmark for comparison with other CAD algorithms. Evaluation studies are now underway to estimate how well our mass-detection algorithm performs with mammograms in which the lesions are not initially deemed actionable. Good CAD performance in these cases may lead to earlier cancer detection.
Simply evaluating CAD performance in preoperative and early cases will not directly measure the effectiveness of our algorithm as an aid to the radiologist. The true clinical performance of a CAD scheme must be established through a properly designed prospective clinical study such as the one reported in reference 6. This type of prospective study will be undertaken in the future to determine if our CAD algorithm aids radiologists in detecting breast cancer earlier and if it affects their recall rate. We are also developing new techniques to both improve the detection performance and reduce the marker rate of the algorithm by fusing single-view information and information from different mammographic views of the same breast (18,19).
| ACKNOWLEDGMENTS |
|---|
| FOOTNOTES |
|---|
Abbreviations: BI-RADS = Breast Imaging Reporting and Data System, CAD = computer-aided diagnosis, DWCE = density-weighted contrast enhancement, FROC = free-response receiver operating characteristic, MQSA = Mammography Quality Standards Act, OD = optical density, USF = University of South Florida
Author contributions: Guarantor of integrity of entire study, N.P.; study concepts and design, N.P., H.P.C., B.S., M.A.H.; literature research, N.P., H.P.C.; clinical studies, H.P.C., N.P., M.A.H.; data acquisition and analysis/interpretation, all authors; statistical analysis, N.P.; manuscript preparation, N.P.; manuscript definition of intellectual content, N.P., H.P.C., B.S., M.A.H.; manuscript editing, revision/review, and final version approval, all authors.
| REFERENCES |
|---|