Opinion
1 From the Department of Radiology, Harvard Medical School, and Department of Radiology, Massachusetts General Hospital, Avon Foundation Comprehensive Breast Evaluation Center, Wang Ambulatory Care Center, Suite 240, 15 Parkman Street, Boston, MA 02114 (D.B.K.); Department of Radiology, Washington University Medical Center, St Louis, Mo (B.M.); and Department of Radiology, Mount Sinai School of Medicine, New York, NY (S.A.F.). Received October 1, 2002; revision requested December 12; revision received March 19, 2003; accepted April 28. Address correspondence to D.B.K. (e-mail: kopans.daniel@mgh.harvard.edu).
| ABSTRACT |
|---|
© RSNA, 2003
Index terms: Breast radiography, utilization, 00.11; Cancer screening; Opinions
| INTRODUCTION |
|---|
In the past, much of health care was based on anecdotal experience. In the absence of science this is not unreasonable. However, evidence-based guidelines have replaced anecdotes in modern medicine. The requirements for evidence of benefit from an intervention are most critical for screening tests that affect otherwise healthy individuals.
Most imaging tests have been developed and are used to diagnose diseases among individuals who are ill. With the development of faster computed tomography (CT) scanners, magnetic resonance imaging systems, and positron emission tomography scanners, a great deal of interest in using these technologies to screen for various diseases has developed. Screening tests differ from diagnostic studies in that they are usually applied in the evaluation of healthy individuals. More and more healthy individuals are being subjected to these tests to try to find disease before the individuals become clinically ill. Furthermore, the vast majority of those who are screened do not have the disease being sought. This introduces a new way of looking at diseases and requires a different level of evidence for determining the efficacy of the screening test.
In his 2001 American Roentgen Ray Society presidential address (3), Robert Stanley, MD, discussed some of the issues associated with new screening tests for lung cancer, colon cancer, and coronary artery disease. Obuchowski et al (4) reviewed what they believed to be the "Ten Criteria for Effective Screening." It is somewhat surprising that the one imaging test that has undergone rigorous analysis as a screening technique, mammography, was barely mentioned in these reviews. Obuchowski and colleagues stated that "mammography is not an ideal screening model" (4).
In fact, although breast cancer screening may not be the ideal test, the issues associated with mammographic screening make it an excellent model for understanding the requirements for screening. Those involved in breast cancer screening have had firsthand experience for several decades with the pitfalls that lie in the way of demonstrating the efficacy of a screening test. X-ray mammography for detecting breast cancer has been studied in greater detail than any other test, and the problems that have been encountered in demonstrating its screening efficacy are basic in the demonstration of efficacy for any newer screening tests.
The complexities involved in demonstrating the efficacy of a screening test were highlighted by a controversy that recently arose almost 40 years after the first randomized controlled trial (RCT) of breast cancer screening. Two analysts argued that five of the seven RCTs of screening were not properly performed, and, therefore, their data and conclusions were not valid (5,6). Since the two remaining trials, they claimed, did not show a benefit, they concluded that mammographic screening was not justified. Although their concerns ultimately proved to be either unfounded or inconsequential (7,8), their review caused a great deal of confusion and consternation (9,10). The controversy underscored the importance of carefully designed RCTs and the proper execution and strict monitoring of these trials. It is hoped that the evaluation of new technologies that are being proposed for screening can benefit from the experience gained in validating mammographic screening for breast cancer.
Screening is not confined to the detection of cancer. The concepts involved in the search for occult breast cancer can be applied to the search for other processes such as tuberculosis or hypertension. There are general concepts that apply to all screening tests and some that are more specifically applicable to screening for cancer. Since many of the recent efforts have been to develop imaging tests to screen for colon cancer and lung cancer, we will confine our comments to the issues involved with cancer screening. However, most, if not all, of these concepts also apply to screening that is performed to assess for other life-threatening processes (such as imaging the coronary arteries to detect coronary artery disease) and even to nonimaging tests, such as prostate-specific antigen (PSA) testing for detecting prostate cancer.
| SOME BASIC CONCEPTS |
|---|
A screening test for cancer is similar to a mechanical screen over a window. The principle is to allow what is desirable (air and light) through the screen while filtering out what is not desirable (insects). The degree of filtration (ie, detection) depends on the size of the openings in the screen. To prevent very small insects from passing through, the openings must be very small, but this will also reduce the flow of air and the amount of light that enters the room. Larger openings will let more air and light through, but they will also let some of the insects through. This concept of screen size is analogous to the threshold for intervention in a cancer screening test. If the threshold for intervention is low (ie, small abnormalities are aggressively evaluated), then many noncancerous lesions will be caught in the screen. If the threshold is elevated to reduce the false-positive results, then small cancers will pass through and be missed. Efficacious screening has to balance the benefits and the risks.
Among the many requirements that have been suggested for ensuring that a cancer screening test is efficacious (11) are the following:
1. The cancer must be fairly common, so that the benefit to those with cancer will offset the inconvenience, cost, and harms that will be incurred by the many individuals who do not have the cancer.
2. The new test must be able to reveal the disease earlier in its growth than does the customary way in which the disease is found.
3. The cancer must have an effective treatment (either it is curable when found earlier, or there is treatment that can result in delayed death for a reasonable number of individuals).
4. The value of detecting the cancer at this earlier time outweighs the risks and costs generated by screening.
| THE IDEAL SCREENING TEST FOR CANCER |
|---|
| DOES EARLIER DETECTION MEAN THAT THE COURSE OF THE DISEASE IS ALTERED? |
|---|
Some cancers are indolent and may not affect an individual during his or her lifetime. Tumors have been found at autopsy in individuals who died from some other cause. Since these tumors never affected the individual during life, finding them would actually have been detrimental to the individual (resulting in unnecessary treatment, anxiety, etc). Furthermore, "competing" causes of death must also be taken into account. Screening someone for breast cancer who has chronic congestive heart failure and a life expectancy of less than 5 years is unlikely to result in a prolongation of the individual's life. Thus, for a screening test to be efficacious, it must be shown that its use actually alters the natural history of the cancer in a way that is beneficial to the patient.
Furthermore, because most individuals who are screened will not actually derive any benefit from the test (most will not develop the cancer), the benefit for the few whose disease course is influenced by earlier detection must outweigh the harms that the test might cause to the others.
| INCREASED SURVIVAL DOES NOT PROVE THE EFFICACY OF SCREENING |
|---|
Assume there are two individuals who are identical twins (A and B). They are both going to develop a cancer in the same organ on the same date, and these cancers are going to grow at the same rate. If left untreated, the cancers would kill the twins 15 years after the first cell began to proliferate. Five years after the cancer begins to grow, twin A undergoes a screening test. The cancer is detected before the twin has any symptoms, and the twin undergoes treatment, lives another 10 years, and then succumbs to the cancer. Since we usually only know the time from detection to the time of death (the survival time), the data would show that twin A "survived" 10 years.
Twin B does not want to undergo the cancer test, waits until symptoms develop, and then is diagnosed with cancer 8 years after the first cell began to proliferate. The cancer is treated, but the twin still dies 7 years later. Since twin B's survival from diagnosis to death was 7 years, while twin A's was 10 years, survival results suggest that twin A's survival was 3 years longer than twin B's, and twin A would appear to have derived a benefit from the screening test. However, adding up all the years, it is clear that there was no benefit from the test. Both actually died at the same time, 15 years after the cancer began. The screening test only made twin A aware of the cancer 3 years earlier, and finding the cancer earlier had no effect on the twin's mortality. This has been termed lead time bias and is one of the reasons why comparing survival data (the time from diagnosis to the time of death) can be misleading. Lead time bias is the main reason why survival data are usually insufficient for proving a benefit from a screening test. It is not enough to show that cancers can be detected earlier. It must be shown that detecting them earlier prolongs life.
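The arithmetic of the twin example can be laid out explicitly. The short Python sketch below uses only the numbers given in the text; the variable names are illustrative and are not part of any published method.

```python
# Timeline of the twin example, in years since the first cancer cell
# began to proliferate. All values come from the example in the text.
DEATH_AT = 15      # both twins die 15 years after onset
DETECTED_A = 5     # twin A: cancer found by the screening test
DETECTED_B = 8     # twin B: cancer found when symptoms develop

# "Survival" is measured from diagnosis to death.
survival_a = DEATH_AT - DETECTED_A
survival_b = DEATH_AT - DETECTED_B

# Lead time is how much earlier screening moved the date of diagnosis.
lead_time = DETECTED_B - DETECTED_A

print(f"Twin A survival after diagnosis: {survival_a} years")
print(f"Twin B survival after diagnosis: {survival_b} years")
print(f"Apparent survival gain: {survival_a - survival_b} years")
print(f"Lead time: {lead_time} years")
print(f"Actual gain in life: {DEATH_AT - DEATH_AT} years")
```

The apparent survival gain equals the lead time exactly, while the date of death is unchanged.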
| OVERDIAGNOSIS (OR PSEUDODISEASE) BIAS |
|---|
Mammographic screening, for example, reveals cancer while it is still in the milk ducts. This is known as ductal carcinoma in situ (DCIS). Because the majority of breast cancers almost certainly begin in the epithelial cells of the ducts (12), most breast cancers begin as intraductal lesions. However, it is unclear which intraductal cancers will progress to become invasive lesions. Because breast cancer can only become lethal by infiltrating into the tissue around the ducts so that it can gain access to the lymphatic vessels and vascular supply that will allow it to spread to other organs, DCIS is a nonlethal lesion. Because the cells of these in situ lesions look like the cells of invasive cancer, DCIS, until recently, has been treated like invasive cancer (13). For many years mastectomy was used to treat these very early lesions. On the basis of the uncertainty as to what percentage of DCIS lesions will progress to invasive cancer and the fact that it is likely that not all DCIS lesions will progress, many have criticized mammographic screening because it led to overdiagnosis and "overtreatment" (14).
There is still great uncertainty over the importance of these lesions. Given enough time, a large percentage may progress to invasive cancers, but it is likely that many will not (15). If DCIS is considered a "real" cancer, but few cases actually lead to lethal lesions, then finding DCIS with the screening test will bias the results. The test will appear to result in saving more lives by revealing "cancers" that would never have taken a life even if they were not discovered by the test. This has been called overdiagnosis bias.
It can confuse the interpretation of data in the following way: Suppose that there are 100 cancers diagnosed each year before the introduction of a new screening test, and 50% of the individuals with this diagnosis die from their cancer. Then the screening test is used. The number of cancers diagnosed each year doubles, and it is found that the death rate has been cut in half. On the surface, it would seem as if the screening test had provided a major benefit: a 50% reduction in deaths. Closer inspection could, however, reveal overdiagnosis bias. If none of the additionally diagnosed cancers had lethal potential, then they would contribute to the denominator of the ratio of deaths to the number of individuals with cancer, but they would not contribute to the numerator. Thus, the rate of death before screening would be 50/100, or 50%, while the rate of death after screening would be 50/200, or 25%. Superficially, it would seem that the use of screening had halved the number of deaths, when in fact screening had only revealed more cancers that had no lethal potential, and the absolute number of deaths remained unchanged.
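The same example can be written as a short calculation. This Python sketch uses the figures from the paragraph above; the function name `case_fatality` is an illustrative choice.

```python
def case_fatality(deaths, diagnosed):
    """Deaths divided by the number of individuals diagnosed with cancer."""
    return deaths / diagnosed

# Before screening: 100 cancers diagnosed per year, 50 deaths.
deaths_before, diagnosed_before = 50, 100
# After screening: diagnoses double, but the extra 100 cancers have no
# lethal potential, so the number of deaths does not change.
deaths_after, diagnosed_after = 50, 200

rate_before = case_fatality(deaths_before, diagnosed_before)
rate_after = case_fatality(deaths_after, diagnosed_after)

print(f"Death rate among diagnosed, before screening: {rate_before:.0%}")
print(f"Death rate among diagnosed, after screening:  {rate_after:.0%}")
print(f"Change in absolute number of deaths: {deaths_after - deaths_before}")
```

Only the denominator changed. A comparison of absolute deaths in the whole population, rather than deaths per diagnosed case, is not affected by this bias.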
| LENGTH BIAS SAMPLING |
|---|
| SELECTION BIAS |
|---|
| RCT IS THE ONLY ACCEPTED METHOD FOR VALIDATING THE EFFICACY OF A SCREENING TEST |
|---|
So that we can understand an RCT, let us return to the example of the twins with cancer. The only way we would have known that twin A actually benefited from screening was if this twin had actually outlived twin B. Since in that situation the only difference between the two individuals was that one had been screened and the other had not, then we could attribute the longevity of the first twin to screening.
Because twins like those described above do not exist in real life, the relationship must be simulated by using the laws of probability. If a large enough number of individuals are randomly separated into two groups, then every individual in one group will have a "twin" in the other. In other words, for every individual who is destined (assuming no intervention is performed) to die at a certain time from a certain type of cancer in one group, there will be a twin in the other group who is destined to die from the same type of cancer at the same time in the future. If a new intervention (like screening) works, then, as the two groups are followed up over time, there will be fewer deaths among the study (screened) group than among the control (unscreened) group. If the numbers are large enough, or if the reduction in death is large enough, that the difference cannot easily be attributed to chance, then the difference is said to be statistically significant, and the efficacy of the test has been demonstrated.
It can easily be appreciated that, even if the cancer is fairly common, the numbers of individuals that need to be included in a trial must be enormous. There have to be enough individuals so that, over the course of the trial, there will be enough participants who develop the cancer that is being studied and enough individuals who will die of the cancer such that the reduction in deaths from the intervention will be statistically significant. The greater the benefit from screening, the smaller the number of study participants needed to prove the benefit. If the actual benefit is small, then a much larger study is needed.
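To give a rough sense of how trial size depends on the size of the benefit, the following Python sketch applies the standard normal-approximation formula for comparing two proportions. It is an illustration only. The control-group risk of death (0.5% over the trial), the 5% two-sided significance level, and the 80% power are assumptions chosen for the example and are not taken from any of the trials discussed here. The sketch also ignores noncompliance and loss to follow-up.

```python
from math import ceil
from statistics import NormalDist

def participants_per_group(control_death_risk, relative_reduction,
                           alpha=0.05, power=0.80):
    """Approximate participants needed in each arm of a two-arm trial.

    Uses the usual normal approximation for the difference between two
    proportions. All inputs are hypothetical planning values.
    """
    p_control = control_death_risk
    p_screened = control_death_risk * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_screened * (1 - p_screened)
    return ceil((z_alpha + z_power) ** 2 * variance
                / (p_control - p_screened) ** 2)

# Hypothetical: 0.5% of controls die of the cancer during the trial.
for reduction in (0.30, 0.20, 0.10):
    n = participants_per_group(0.005, reduction)
    print(f"{reduction:.0%} mortality reduction: about {n:,} per group")
```

Under these assumptions the required number rises steeply as the assumed benefit shrinks, which is the relationship described above.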
| BLINDED RANDOMIZATION IS CRITICAL |
|---|
Nonblinded randomization compromised the results of the Canadian National Breast Screening Study (8,17). Women with advanced breast cancers were placed in greater numbers into the screened group than into the control group. The trial was compromised because women who already had advanced breast cancers and thus could not benefit from screening were allowed to participate in a screening trial. Each of the women first underwent a clinical breast examination (so that those with lumps were identifiable), and then the allocation to one group or the other was performed with open lists, making it possible to just skip a line on the open lists to place a woman with a lump in the screened group. Obviously, this compromises the results of the trial.
One measure of how well the random allocation process went is to see if the demographic features (eg, age distribution, family income, number of children) of the two groups are the same. However, with a compromise such as that in the Canadian National Breast Screening Study, shifting a few women with advanced cancer from one group to the other will have a dramatic impact on the mortality rates of the groups but will have no influence on the overall demographics because the other thousands of women involved will dilute the effect. Randomization must be completely blinded.
| STATISTICAL POWER |
|---|
The statistical power of a study is critical. Failure to appreciate its importance has confused many analysts. If a study includes just enough individuals to show a benefit for the entire population that participated in the trial, then a retrospective analysis of data from a subgroup within the trial may not show benefit merely because there were not enough individuals in that group to permit a statistically significant result. This is precisely what happened in the breast cancer screening trials in which data in women aged 40–49 years were analyzed, retrospectively, as a separate group (20). There were far too few women in that age group to demonstrate significant results in the early years of follow-up (20).
The fact that the benefit did not achieve statistical significance (because the trials did not include sufficient numbers of women to be able to show a significant benefit in this subgroup in the early years of follow-up) was misinterpreted by some analysts as meaning that there was no benefit. Following up the women for a longer period of time meant that there were more deaths from breast cancer and more patient years such that the benefit became significant (21).
Obviously, if numbers were not critical, then only very small trials would be needed. Retrospective subgroup analysis of data that lack the statistical power to permit accurate analysis should only be used to raise the next research question; it is scientifically unjustified to use these analyses to make medical recommendations. Investigators who wish to analyze data by subgroups (eg, age, sex) must plan at the outset to be sure that there will be sufficient numbers of participants in these groups to permit legitimate analysis.
| STATISTICAL SIGNIFICANCE |
|---|
There are mathematical formulas that are used to estimate the likelihood that a result is due to chance. These "tests of significance" are based on varying assumptions, but the basic principle behind them can be seen in the following example: If we are told that there were seven cancer deaths in the screened group and 10 cancer deaths in the unscreened control group, most of us would say that seven versus 10 was not a big difference and could easily be due to chance. However, if there were 70 deaths in the screened group and 100 deaths in the control group, even though the ratio is exactly the same, we would be more inclined to believe that there was a real benefit on the basis of the larger numbers. "Statistical significance" is merely a way of showing this mathematically.
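One simple way to put numbers on this example is an exact binomial test. The Python sketch below is an illustration and not the method used in any particular trial. It assumes that the two groups are the same size and are followed up for the same time, so that, if screening had no effect, each cancer death would be equally likely to fall in either group. The function name `two_sided_p` is illustrative.

```python
from math import comb

def two_sided_p(deaths_screened, deaths_control):
    """Exact two-sided binomial P value for a split of deaths between
    two equally sized groups, under the hypothesis of no effect."""
    total = deaths_screened + deaths_control
    smaller = min(deaths_screened, deaths_control)
    # Probability of a split at least this lopsided in one direction.
    one_tail = sum(comb(total, k) for k in range(smaller + 1)) / 2 ** total
    return min(1.0, 2 * one_tail)

p_small = two_sided_p(7, 10)
p_large = two_sided_p(70, 100)

print(f"7 vs 10 deaths:   P = {p_small:.3f}")
print(f"70 vs 100 deaths: P = {p_large:.3f}")
```

With the same ratio of deaths, the smaller trial gives a P value well above the conventional .05 threshold, whereas the larger trial gives a value below it.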
| POINT ESTIMATES AND THE TIME TO SHOW A BENEFIT |
|---|
If we return to the basis of RCTs and the analogy of the twins, the problem is clear. For a benefit to be shown for a screened individual who would have died without screening, that individual's "twin" in the unscreened group must die from the cancer. If the screening test enables the interruption of only moderately growing and slow-growing but lethal cancers, it may take many years for the control twin to die and the benefit to be revealed. Thus, the follow-up time is very important. A benefit may be overlooked if the follow-up is too short. In fact, an "early benefit" would be difficult to explain. For an early benefit to appear in a cancer screening trial, the cancer in the screened "twin" would have to be detected and treated just before it metastasized, while the cancer in the unscreened "twin" would have to metastasize soon after and kill the unscreened twin fairly quickly. Because length bias sampling means that periodic screening is more likely to interrupt moderately growing and slow-growing cancers, an early benefit soon after the start of screening would actually be unlikely.
| NONCOMPLIANCE AND CONTAMINATION |
|---|
Although counting noncompliers with the screened group and counting contaminators as if they were unscreened controls seems nonsensical, it is necessary because not doing so could introduce major biases. If, for example, individuals who were destined to have poor-prognosis cancers refused screening and were consequently not included in the study group data, this could make screening appear successful when it was merely the fact that these individuals with bad cancers had refused to participate.
Because women in the mammographic screening trials were not forced to undergo screening if they were allocated to the screened group and because those allocated to the control group were not prevented from being screened, the mammography trial results are reported as showing a benefit for women who were "invited" to be screened as opposed to those who actually were screened. Many women who died from breast cancer in the screened group had actually refused screening, while some in the unscreened control group may have been saved because of mammograms obtained outside the screening trial. Because proper analysis requires that such women still be counted with the group to which they had been allocated, it is fairly certain that the mammographic screening trial results represent an underestimation of the true benefit from screening.
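The direction of this underestimation can be illustrated with a simple calculation. The Python sketch below is a deliberately simplified model, and every number in it is hypothetical. It assumes that screening lowers the risk of cancer death by a fixed fraction in anyone who is actually screened, and that noncompliers and contaminators have the same underlying risk as everyone else. The second assumption is made for simplicity, and the preceding paragraphs explain why it may not hold in practice.

```python
def observed_reduction(true_reduction, compliance, contamination):
    """Mortality reduction seen when groups are analyzed as allocated.

    true_reduction: fractional drop in cancer deaths among those screened
    compliance:     fraction of the invited group actually screened
    contamination:  fraction of the control group screened outside the trial
    All three inputs are hypothetical values for illustration.
    """
    # Relative death rates in each arm (unscreened population = 1.0).
    invited_rate = 1 - compliance * true_reduction
    control_rate = 1 - contamination * true_reduction
    return 1 - invited_rate / control_rate

true_benefit = 0.30
seen = observed_reduction(true_benefit, compliance=0.70, contamination=0.20)

print(f"True reduction among women actually screened: {true_benefit:.0%}")
print(f"Reduction observed for women invited:         {seen:.0%}")
```

Under these assumptions the observed benefit for the invited group is smaller than the benefit among women who are actually screened, and it equals the true benefit only with full compliance and no contamination.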
| SCREENING INTERVAL |
|---|
In general, increasing the time between screening tests increases the number of cancers that become evident between screening tests (interval cancers). This means that the value of the screening test is diminished because an increasing number of individuals gain no benefit from the test. Because the cost of screening and the secondary costs that it creates (eg, for additional testing, biopsies) represent major challenges to its use, the goal is to try to determine the optimal screening interval (ie, lives saved vs time between screening examinations). At some point there is insufficient incremental value to justify decreasing the time between screening examinations (few if any additional lives are saved beyond those saved with the longer interval). It is often the cost of the test (including harms to the patients as well as economic cost) that dictates a compromise in which health planners support a longer interval than would be medically ideal (an ideal interval being one that results in the greatest number of lives saved).
When a new test is introduced that reveals cancers earlier than they would be revealed without the test, the first round of screening will not only reveal the cancers that have just reached the threshold where the test can demonstrate them but will also reveal cancers that have been accumulating in the population because the old method could not reveal the cancers until they reached an even larger size. Thus, in the first round of screening (prevalence screening), many more cancers will be detected than will be found at subsequent screening examinations if the time between screening examinations is not too long. If screening is repeated, and the time between screening examinations is appropriate, then most of the next group of cancers detected will have just entered the detectable phase. These new cancers are the incident cases that are newly discovered after each screening interval.
The period of time during which a cancer is detectable with a test before it is clinically evident is called the sojourn time. If, for example, a cancer becomes palpable at 1.5 cm, is detectable with mammography at 0.5 cm, and takes 2 years to grow from 0.5 to 1.5 cm, then the sojourn time is 2 years. If screening takes place every 3 years, then many cancers that were not detected at one screening examination will already have become palpable before the next screening examination, and the screening test will be less effective than it would have been if the screening interval were shorter. If there are too many cancers that become evident in the interval between screening examinations (ie, the screening interval is too long), then the screening test will have little value. Performing more frequent screening is generally better than having longer intervals between screening examinations, but the practical "best" time between screening examinations is a balance between the most effective period for reducing deaths through early detection and the cost of performing more frequent testing. To intercept cancers earlier, the screening interval should be less than half the sojourn time (26).
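The relationship between sojourn time and screening interval in the example above can be sketched numerically. The following Python sketch rests on strong simplifying assumptions that are not in the text: every cancer has the same sojourn time, cancers enter the detectable phase at a uniform rate over time, and the test detects every cancer that is in its detectable phase on the day of screening. The function name is illustrative.

```python
def fraction_screen_detected(sojourn_years, interval_years):
    """Fraction of cancers found at screening rather than in the interval.

    A cancer is caught only if a screening examination falls inside its
    sojourn window. With uniform onset, that happens with probability
    sojourn / interval, capped at 1.
    """
    return min(1.0, sojourn_years / interval_years)

SOJOURN = 2.0  # years, from the example in the text (0.5 cm to 1.5 cm)

for interval in (3.0, 2.0, 1.0):
    caught = fraction_screen_detected(SOJOURN, interval)
    print(f"Screening every {interval:.0f} years: "
          f"{caught:.0%} screen detected, {1 - caught:.0%} interval cancers")
```

With a 2-year sojourn time and a 3-year interval, about one third of cancers in this simplified model surface between examinations. The model shows only whether a cancer is caught at all. It does not show how early in the sojourn window the cancer is caught, which is the issue behind the recommendation for an interval shorter than half the sojourn time.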
| THRESHOLDS FOR INTERVENTION |
|---|
We are unaware of a screening test that does not yield false-positive and false-negative results. Probably the best-known cancer screening test is the Papanicolaou (Pap) test for cancer of the uterine cervix. Depending on how the test is interpreted, as many as 10% of women will have falsely positive smears and will be recalled for additional testing that turns out to yield negative results. This is usually due to the fact that for all screening tests there is often an overlap in characteristics between benign and malignant lesions. Some lesions (eg, spiculated masses at mammography) have a very high probability of being cancer, while others (eg, well-circumscribed masses at mammography) have a low probability of being cancer, but a small risk nonetheless. If the goal is to find all cancers, then the threshold for intervention must be low and the false-positive call rate will be high. If the goal is to minimize the false-positive call rate, then the threshold for intervention is high and cancers will be allowed to slip through the screening test (27). Until a perfect screening test is developed, there is usually no way to avoid this relation