Radiology
DOI: 10.1148/radiol.2292021272
(Radiology 2003;229:319-327.)
© RSNA, 2003


Opinion

Screening for "Cancer": When is it Valid?—Lessons from the Mammography Experience1

Daniel B. Kopans, MD, FACR, Barbara Monsees, MD and Stephen A. Feig, MD

1 From the Department of Radiology, Harvard Medical School, and Department of Radiology, Massachusetts General Hospital, Avon Foundation Comprehensive Breast Evaluation Center, Wang Ambulatory Care Center, Suite 240, 15 Parkman Street, Boston, MA 02114 (D.B.K.); Department of Radiology, Washington University Medical Center, St Louis, Mo (B.M.); and Department of Radiology, Mount Sinai School of Medicine, New York, NY (S.A.F.). Received October 1, 2002; revision requested December 12; revision received March 19, 2003; accepted April 28. Address correspondence to D.B.K. (e-mail: kopans.daniel@mgh.harvard.edu).


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 SOME BASIC CONCEPTS
 THE IDEAL SCREENING TEST...
 DOES EARLIER DETECTION MEAN...
 INCREASED SURVIVAL DOES NOT...
 OVERDIAGNOSIS (OR PSEUDODISEASE)...
 LENGTH BIAS SAMPLING
 SELECTION BIAS
 RCT IS THE ONLY...
 BLINDED RANDOMIZATION IS...
 STATISTICAL POWER
 STATISTICAL SIGNIFICANCE
 POINT ESTIMATES AND THE...
 NONCOMPLIANCE AND CONTAMINATION
 SCREENING INTERVAL
 THRESHOLDS FOR INTERVENTION
 HOW TO MEASURE THE...
 UNEXPECTED CONSEQUENCES OF...
 GENERALIZABILITY
 BEWARE OF THE ARTIFICIAL...
 "AGE CREEP"
 SURROGATE END POINTS
 REFERENCES
 
There is increasing interest in the development of imaging tests to screen for diseases such as cancer. Mammographic screening for breast cancer has undergone greater scrutiny than any other test. Many important lessons have been learned from the issues that have been raised with regard to mammographic screening. Those interested in developing new screening tests can learn from the mammography experience.


Index terms: Breast radiography, utilization, 00.11 • Cancer screening • Opinions


    INTRODUCTION
 
Santayana’s well-worn suggestion "Those who cannot remember the past are condemned to repeat it" (1) is often forgotten. There has been a great deal of recent interest in the use of imaging tests to screen for undetected disease. The rationale for this is the intuitively obvious belief that it must be better to detect cancer earlier. The most recent of several articles by Kolb et al (2) suggesting that ultrasonography (US) can reveal unsuspected cancers in women with radiographically dense breasts is an example. The study left the impression that whole-breast US screening is beneficial, when in fact its design could not show that screening with US has any value.

In the past, much of health care was based on anecdotal experience. In the absence of science this is not unreasonable. However, evidence-based guidelines have replaced anecdotes in modern medicine. The requirements for evidence of benefit from an intervention are most critical for screening tests that affect otherwise healthy individuals.

Most imaging tests have been developed and are used to diagnose diseases among individuals who are ill. With the development of faster computed tomography (CT) scanners, magnetic resonance imaging systems, and positron emission tomography scanners, a great deal of interest in using these technologies to screen for various diseases has developed. Screening tests differ from diagnostic studies in that they are usually applied in the evaluation of healthy individuals. More and more healthy individuals are being subjected to these tests to try to find disease before the individuals become clinically ill. Furthermore, the vast majority of those who are screened do not have the disease being sought. This introduces a new way of looking at diseases and requires a different level of evidence for determining the efficacy of the screening test.

In his 2001 American Roentgen Ray Society presidential address (3), Robert Stanley, MD, discussed some of the issues associated with new screening tests for lung cancer, colon cancer, and coronary artery disease. Obuchowski et al (4) reviewed what they believed to be the "Ten Criteria for Effective Screening." It is somewhat surprising that the one imaging test that has undergone rigorous analysis as a screening technique—mammography—was barely mentioned in these reviews. Obuchowski and colleagues stated that "mammography is not an ideal screening model" (4).

In fact, although breast cancer screening may not be the ideal test, the issues associated with mammographic screening make it an excellent model for understanding the requirements for screening. Those involved in breast cancer screening have had firsthand experience for several decades with the pitfalls that lie in the way of demonstrating the efficacy of a screening test. X-ray mammography for detecting breast cancer has been studied in greater detail than any other test, and the problems that have been encountered in demonstrating its screening efficacy are basic in the demonstration of efficacy for any newer screening tests.

The complexities involved in demonstrating the efficacy of a screening test were highlighted by a controversy that recently arose almost 40 years after the first randomized controlled trial (RCT) of breast cancer screening. Two analysts argued that five of the seven RCTs of screening were not properly performed, and, therefore, their data and conclusions were not valid (5,6). Since the two remaining trials, they claimed, did not show a benefit, they concluded that mammographic screening was not justified. Although their concerns ultimately proved to be either unfounded or inconsequential (7,8), their review caused a great deal of confusion and consternation (9,10). The controversy underscored the importance of carefully designed RCTs and the proper execution and strict monitoring of these trials. It is hoped that the evaluation of new technologies that are being proposed for screening can benefit from the experience gained in validating mammographic screening for breast cancer.

Screening is not confined to the detection of cancer. The concepts involved in the search for occult breast cancer can be applied to the search for other processes such as tuberculosis or hypertension. There are general concepts that apply to all screening tests and some that are more specifically applicable to screening for cancer. Since many of the recent efforts have been to develop imaging tests to screen for colon cancer and lung cancer, we will confine our comments to the issues involved with cancer screening. However, most, if not all, of these concepts also apply to screening that is performed to assess for other life-threatening processes—such as imaging the coronary arteries to detect coronary artery disease—and even to nonimaging tests, such as prostate specific antigen (PSA) testing for detecting prostate cancer.


    SOME BASIC CONCEPTS
 
The use of imaging to screen for cancer differs from most other uses of imaging in that ostensibly healthy people are evaluated in an effort to find a disease early in its course. The requirements for a screening test differ from those used to evaluate individuals who are already ill. In the latter situation, the intervention is limited to those with a definite problem in an effort to help them recover. Most individuals who undergo a screening test, however, are not sick. Therefore, a false-positive test (a test that suggests disease when the patient does not actually have the disease) can cause "harms" such as anxiety or even morbidity from a biopsy or diagnostic procedure needed to establish whether or not the disease is truly present. These harms never would have occurred if the screening test had not been administered. The degree of risk that is acceptable for patients who are already ill is usually greater than what would be acceptable for a healthy population that is being screened.

A screening test for cancer is similar to a mechanical screen over a window. The principle is to allow what is desirable (air and light) through the screen while filtering out what is not desirable (insects). The degree of filtration (ie, detection) depends on the size of the openings in the screen. To prevent very small insects from passing through, the openings must be very small, but this will also reduce the flow of air and the amount of light that enters the room. Larger openings will let more air and light through, but they will also let some of the insects through. This concept of screen size is analogous to the threshold for intervention in a cancer screening test. If the threshold for intervention is low (ie, small abnormalities are aggressively evaluated), then many noncancerous lesions will be caught in the screen. If the threshold is elevated to reduce the false-positive results, then small cancers will pass through and be missed. Efficacious screening has to balance the benefits and the risks.
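The trade-off in the window-screen analogy can be sketched with a few made-up numbers. The lesion sizes below are hypothetical and chosen only for illustration, and using size as the sole criterion for intervention is a simplifying assumption; the point is merely that moving the threshold trades false-positive results against missed cancers.

```python
# Hypothetical lesion sizes in millimetres; these values are illustrative only.
cancers_mm = [4, 6, 9, 12, 15]
benign_mm = [3, 4, 5, 5, 6, 7, 8, 10]


def screen(threshold_mm):
    """Work up every lesion at or above the threshold.

    Returns (false positives, missed cancers) for that threshold.
    """
    false_positives = sum(size >= threshold_mm for size in benign_mm)
    missed_cancers = sum(size < threshold_mm for size in cancers_mm)
    return false_positives, missed_cancers


results = {t: screen(t) for t in (4, 8, 12)}
for threshold, (fp, missed) in results.items():
    print(f"threshold {threshold:>2} mm: {fp} false positives, {missed} missed cancers")
```

With these assumed values, the lowest threshold misses no cancers but works up seven benign lesions, while the highest threshold produces no false-positive results but misses three of the five cancers.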

Among the many requirements that have been suggested for ensuring that a cancer screening test is efficacious (11) are the following:

1. The cancer must be fairly common, so that the benefit to those with cancer will offset the inconvenience, cost, and harms that will be incurred by the many individuals who do not have the cancer.

2. The new test must be able to reveal the disease earlier in its growth than does the customary way in which the disease is found.

3. The cancer must have an effective treatment (either it is curable when found earlier, or there is treatment that can result in delayed death for a reasonable number of individuals).

4. The value of detecting the cancer at this earlier time outweighs the risks and costs generated by screening.


    THE IDEAL SCREENING TEST FOR CANCER
 
Although there is no screening test that can achieve the following ideal criteria, these are, nevertheless, the goals that should be sought for any cancer screening test. An ideal screening test will (a) reveal all cancers at a time when they are curable, (b) yield no false-positive results, (c) yield no false-negative results, and (d) be harmless.


    DOES EARLIER DETECTION MEAN THAT THE COURSE OF THE DISEASE IS ALTERED?
 
One of the most difficult concepts to understand and accept is that merely finding a cancer earlier does not mean that the patient will benefit. Finding it earlier may not be early enough. Most solid cancers are thought to arise from a single cell. It is likely that, through damage or mutation, the DNA of the cell is altered, allowing unrestrained proliferation. Most cancers do not kill by merely destroying the organ in which they arise; they kill through metastatic spread that destroys other organs. Simply finding cancers earlier or at a smaller size may not alter the course of the disease. If a cancer has metastasized even before the new screening test can reveal it, then detecting the cancer earlier will have no life-saving benefit unless treatment can destroy early metastases. Unless a screening test results in a beneficial alteration in the natural history of the disease, it may not only be of no benefit but may also (if it causes harm to some of the healthy individuals being tested) clearly cause more harm than good.

Some cancers are indolent and may not affect an individual during his or her lifetime. Tumors have been found at autopsy in individuals who died from some other cause. Since these tumors never affected the individual during life, finding them would actually have been detrimental to the individual (resulting in unnecessary treatment, anxiety, etc). Furthermore, "competing" causes of death must also be taken into account. Screening someone for breast cancer who has chronic congestive heart failure and a life expectancy of less than 5 years is unlikely to result in a prolongation of the individual’s life. Thus, for a screening test to be efficacious, it must be shown that its use actually alters the natural history of the cancer in a way that is beneficial to the patient.

Furthermore, because most individuals who are screened will not actually derive any benefit from the test (most will not develop the cancer), the benefit for the few whose disease course is influenced by earlier detection must outweigh the harms that the test might cause to the others.


    INCREASED SURVIVAL DOES NOT PROVE THE EFFICACY OF SCREENING
 
Superficially, it would seem that a new test must be beneficial if it reveals cancers earlier than the usual means of detection and this results in longer survival. Survival is the time from the detection and diagnosis of a cancer to the individual’s death from that cancer. One patient with a cancer who survives for a longer period of time than another might be thought to have had more successful care (eg, earlier detection or better treatment). This may not be true, because several biases can confound the analysis of survival studies. One of the major biases is known as lead time bias. The following example not only explains lead time bias but also is critical for understanding why RCTs are needed to overcome the various biases that confuse the interpretation of screening data.

Assume there are two individuals who are identical twins (A and B). Both are going to develop a cancer in the same organ on the same date, and these cancers are going to grow at the same rate. If left untreated, the cancers would kill the twins 15 years after the first cell began to proliferate. Five years after the cancer begins to grow, twin A undergoes a screening test. The cancer is detected before the twin has any symptoms, and the twin undergoes treatment, lives another 10 years, and then succumbs to the cancer. Since we usually only know the time from detection to the time of death (the survival time), the data would show that twin A "survived" 10 years.

Twin B does not want to undergo the cancer test, waits until symptoms develop, and then is diagnosed with cancer 8 years after the first cell began to proliferate. The cancer is treated, but the twin still dies 7 years later. Since twin B’s survival from diagnosis to death was 7 years, while twin A’s was 10 years, survival results suggest that twin A’s survival was 3 years longer than twin B’s, and twin A would appear to have derived a benefit from the screening test. However, adding up all the years, it is clear that there was no benefit from the test. Both actually died at the same time, 15 years after the cancer began. The screening test only made twin A aware of the cancer 3 years earlier, and finding the cancer earlier had no effect on the twin’s mortality. This has been termed lead time bias and is one of the reasons why comparing survival data—the time from diagnosis to the time of death—can be misleading. Lead time bias is the main reason why survival data are usually insufficient for proving a benefit from a screening test. It is not enough to show that cancers can be detected earlier. It must be shown that detecting them earlier prolongs life.
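The arithmetic of the twin example can be laid out as a short sketch. The numbers are those given in the example; the `Patient` class and its field names are illustrative choices, not part of any established method.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    """One twin; times are years since the first cancer cell began to proliferate."""
    name: str
    detected_at: int  # year in which the cancer was diagnosed
    died_at: int      # year of death from the cancer

    @property
    def survival(self) -> int:
        """Survival as usually reported: time from diagnosis to death."""
        return self.died_at - self.detected_at


twin_a = Patient("A (screened)", detected_at=5, died_at=15)
twin_b = Patient("B (symptomatic)", detected_at=8, died_at=15)

apparent_gain = twin_a.survival - twin_b.survival    # difference in reported survival
lead_time = twin_b.detected_at - twin_a.detected_at  # how much earlier A was diagnosed
true_gain = twin_a.died_at - twin_b.died_at          # difference in actual date of death

print(f"apparent survival gain: {apparent_gain} years")
print(f"lead time:              {lead_time} years")
print(f"life actually gained:   {true_gain} years")
```

The apparent 3-year survival advantage is exactly the 3-year lead time, and the life actually gained is zero.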


    OVERDIAGNOSIS (OR PSEUDODISEASE) BIAS
 
There are other screening phenomena that can be misleading. For example, not all cancers are lethal. Some individuals who die from other causes are discovered, at autopsy, to have a cancer that never influenced them while they were alive. If a screening test had revealed these nonlethal cancers, it would have caused unnecessary anxiety. The individuals might have been subjected to treatments that they did not need, and the treatments might have been costly and harmful, without producing any benefit. Overdiagnosis is a bias because, if these nonlethal cancers would never have been detected without screening, counting them makes the test appear to save lives when it is instead revealing cancers that would never have taken a life.

Mammographic screening, for example, reveals cancer while it is still in the milk ducts. This is known as ductal carcinoma in situ (DCIS). Because the majority of breast cancers almost certainly begin in the epithelial cells of the ducts (12), most breast cancers begin as intraductal lesions. However, it is unclear which intraductal cancers will progress to become invasive lesions. Because breast cancer can only become lethal by infiltrating into the tissue around the ducts so that it can gain access to the lymphatic vessels and vascular supply that will allow it to spread to other organs, DCIS is a nonlethal lesion. Because the cells of these in situ lesions look like the cells of invasive cancer, DCIS, until recently, has been treated like invasive cancer (13). For many years mastectomy was used to treat these very early lesions. On the basis of the uncertainty as to what percentage of DCIS lesions will progress to invasive cancer and the fact that it is likely that not all DCIS lesions will progress, many have criticized mammographic screening because it led to overdiagnosis and "overtreatment" (14).

There is still great uncertainty over the importance of these lesions. Given enough time, a large percentage may progress to invasive cancers, but it is likely that many will not (15). If DCIS is considered a "real" cancer, but few cases actually lead to lethal lesions, then finding DCIS with the screening test will bias the results. The test will appear to result in saving more lives by revealing "cancers" that would never have taken a life even if they were not discovered by the test. This has been called overdiagnosis bias.

Overdiagnosis bias can confuse the interpretation of data in the following way: Suppose that 100 cancers are diagnosed each year before the introduction of a new screening test, and 50% of the individuals with this diagnosis die from their cancer. Then the screening test is introduced. The number of cancers diagnosed each year doubles, and the death rate among those diagnosed is found to have been cut in half. On the surface, it would seem as if the screening test had provided a major benefit, a 50% reduction in deaths. Closer inspection could, however, reveal overdiagnosis bias. If none of the additionally diagnosed cancers had lethal potential, then they would contribute to the denominator of the ratio of deaths to the number of individuals with cancer, but they would not contribute to the numerator. Thus, the rate of death before screening would be 50/100, or 50%, while the rate of death after screening would be 50/200, or 25%. Superficially, it would seem that screening had halved the number of deaths, when in fact screening had only revealed more cancers that had no lethal potential, and the absolute number of deaths remained unchanged.
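The same example can be written out as a few lines of arithmetic. The figures are those used in the text; the variable names are illustrative.

```python
# Before screening: 100 cancers diagnosed per year, half of them fatal.
diagnosed_before = 100
deaths_before = 50

# After screening: diagnoses double, but every additional cancer is nonlethal.
extra_nonlethal = 100
diagnosed_after = diagnosed_before + extra_nonlethal
deaths_after = deaths_before  # the nonlethal cancers add no deaths

rate_before = deaths_before / diagnosed_before  # deaths per diagnosed cancer
rate_after = deaths_after / diagnosed_after

print(f"death rate among diagnosed, before: {rate_before:.0%}")
print(f"death rate among diagnosed, after:  {rate_after:.0%}")
print(f"change in absolute deaths:          {deaths_after - deaths_before}")
```

Only the denominator changes, so the rate falls from 50% to 25% while the number of deaths stays at 50.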


    LENGTH BIAS SAMPLING
 
Another problem with screening is that it is usually performed on a periodic basis. An understanding of how screening works makes it apparent that periodic testing has a greater chance of detecting slower-growing cancers than faster-growing cancers. Assume that a screening test is performed today and there are two individuals with cancers that are just below the threshold of detection so that neither cancer is detected. One is a fast-growing, more lethal cancer and the other is slower growing and indolent. The first cancer grows so quickly that it becomes clinically apparent in the interval before the next screening test (ie, the individual will develop a sign or symptom that indicates the presence of the cancer). The slower-growing cancer does not become clinically apparent before the next screening test, at which time it is detected with the test. Thus, when individuals whose cancers were detected with the screening test are compared with those whose cancers were not detected with the screening test, the individuals with screening-detected cancers have better survival simply because they have more indolent cancers. The fact that periodic screening is more likely to detect slower-growing, more indolent cancers is known as length bias sampling.
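A minimal sketch of length bias sampling follows, under stated assumptions that go beyond the two-patient example: cancers cross the detection threshold at times spread evenly across the screening interval, the screening test detects every cancer above the threshold, and each cancer has a fixed "sojourn" time between becoming detectable and becoming clinically apparent. The interval and sojourn values are hypothetical.

```python
def screen_detected_fraction(sojourn_years, interval_years, steps=1000):
    """Fraction of cancers found at the next screen rather than by symptoms.

    A cancer becomes detectable at `onset` (measured from the previous
    screen) and becomes clinically apparent `sojourn_years` later. It is
    screen detected only if the next screen arrives first.
    """
    detected = 0
    for i in range(steps):
        onset = interval_years * i / steps
        clinically_apparent = onset + sojourn_years
        if clinically_apparent >= interval_years:
            detected += 1
    return detected / steps


INTERVAL = 1.0       # hypothetical: screening once a year
FAST_SOJOURN = 0.5   # hypothetical fast-growing cancer, apparent after 6 months
SLOW_SOJOURN = 3.0   # hypothetical slow-growing cancer, apparent after 3 years

fast = screen_detected_fraction(FAST_SOJOURN, INTERVAL)
slow = screen_detected_fraction(SLOW_SOJOURN, INTERVAL)

# If fast and slow cancers are equally common in the population,
# this is the share of screen-detected cancers that are slow growing.
slow_share = slow / (slow + fast)

print(f"fast cancers screen detected: {fast:.0%}")
print(f"slow cancers screen detected: {slow:.0%}")
print(f"slow share of screen-detected cancers: {slow_share:.0%}")
```

Under these assumptions the screen-detected group is enriched for slow-growing cancers, even though both types are equally common in the population, so its survival would look better regardless of any effect of treatment.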


    SELECTION BIAS
 
Another factor that can compromise the validity of a study of screening is selection bias. For example, volunteers can be asked to participate in a study. People who take special interest in their health (and volunteer for studies) tend to be healthier than the average individual. The results of the new test may appear to show a benefit, but the results may be due to the fact that the volunteers are healthier to begin with and might have better results for this reason. Conversely, a study may reveal no benefit if only individuals who already have a clinically evident problem are in the study. Women with advanced cancers, for example, will not benefit from screening. Allowing these women to participate in a screening study, as happened in the Canadian National Breast Screening Study, can bias the results (16).


    RCT IS THE ONLY ACCEPTED METHOD FOR VALIDATING THE EFFICACY OF A SCREENING TEST
 
For all of the reasons described above, it is generally not possible to rely on survival data or even mortality data from screening trials that lack an unscreened control group. It may be impossible to eliminate all forms of bias, but the only way to eliminate those listed above is through the use of RCTs.

So that we can understand an RCT, let us return to the example of the twins with cancer. We could have known that twin A benefited from screening only if this twin had actually outlived twin B. Because in that situation the only difference between the two individuals would be that one had been screened and the other had not, we could attribute the longer life of the screened twin to screening.

Because twins like those described above do not exist in real life, the relationship must be simulated by using the laws of probability. If a large enough number of individuals are randomly separated into two groups, then every individual in one group will have a "twin" in the other. In other words, for every individual who is destined (assuming no intervention is performed) to die at a certain time from a certain type of cancer in one group, there will be a twin in the other group who is destined to die from the same type of cancer at the same time in the future. If a new intervention (like screening) works, then, as the two groups are followed up over time, there will be fewer deaths among the study (screened) group than among the control (unscreened) group. If the numbers are large enough, or if the reduction in death is large enough, that the difference cannot easily be attributed to chance, then the difference is said to be statistically significant, and the efficacy of the test has been demonstrated.
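The idea that random allocation creates matched "twins" can be illustrated with a small simulation. Every number here is an assumption made for illustration: the population size, the 1% of individuals destined to die of the cancer, and the 30% of those deaths that screening is supposed to avert.

```python
import random

rng = random.Random(2003)  # fixed seed so the sketch is reproducible

N = 100_000          # hypothetical number of trial participants
P_DESTINED = 0.01    # hypothetical share destined to die of the cancer if unscreened
EFFECT = 0.30        # hypothetical share of those deaths that screening averts

# Hidden fate of each participant; unknown to those doing the allocation.
destined = [rng.random() < P_DESTINED for _ in range(N)]

# Random allocation: shuffle everyone, then split down the middle.
order = list(range(N))
rng.shuffle(order)
study, control = order[: N // 2], order[N // 2:]

destined_study = sum(destined[i] for i in study)
destined_control = sum(destined[i] for i in control)

# Follow-up: the control group dies as destined; in the screened group
# each destined death is averted with probability EFFECT.
deaths_control = destined_control
deaths_study = sum(1 for i in study if destined[i] and rng.random() >= EFFECT)

print(f"destined to die, study vs control: {destined_study} vs {destined_control}")
print(f"observed deaths, study vs control: {deaths_study} vs {deaths_control}")
```

Random allocation leaves the two groups with nearly the same number of individuals destined to die, so the lower number of deaths in the screened group can be attributed to screening rather than to how the groups were formed.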

It can easily be appreciated that, even if the cancer is fairly common, the numbers of individuals that need to be included in a trial must be enormous. There have to be enough individuals so that, over the course of the trial, there will be enough participants who develop the cancer that is being studied and enough individuals who will die of the cancer such that the reduction in deaths from the intervention will be statistically significant. The greater the benefit from screening, the smaller the number of study participants needed to prove the benefit. If the actual benefit is small, then a much larger study is needed.
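The relationship between the size of the benefit and the size of the trial can be sketched with a standard normal-approximation formula for comparing two proportions. This is one common textbook approximation and not necessarily the method used in any of the mammography trials; the control-group death risk of 0.4%, the two-sided significance level of .05, and the power of 80% are assumptions chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def participants_per_group(control_risk, relative_reduction, alpha=0.05, power=0.80):
    """Approximate participants needed in each group of a two-group trial.

    control_risk: probability of dying of the cancer during the trial, unscreened.
    relative_reduction: fraction of those deaths that screening prevents.
    """
    screened_risk = control_risk * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    variance = control_risk * (1 - control_risk) + screened_risk * (1 - screened_risk)
    difference = control_risk - screened_risk
    return ceil((z_alpha + z_power) ** 2 * variance / difference ** 2)


CONTROL_RISK = 0.004  # hypothetical: 4 cancer deaths per 1,000 unscreened participants

sizes = {r: participants_per_group(CONTROL_RISK, r) for r in (0.40, 0.25, 0.10)}
for reduction, n in sizes.items():
    print(f"{reduction:.0%} mortality reduction: about {n:,} participants per group")
```

Under these assumptions the required number of participants rises steeply as the expected mortality reduction shrinks.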


    BLINDED RANDOMIZATION IS CRITICAL
 
Although RCTs are the best way to determine whether or not there is a benefit from an intervention such as a screening test, RCTs are not always perfect. There are very strict rules that need to be applied in their design, execution, and analysis, or they run the risk of being compromised. One of the critical elements of a trial is that the allocation of individuals to either the study group or the control group must be truly random. Otherwise, biases can enter into the trial that compromise its results. Random allocation is critical to ensure the equal distribution of "twins" between the two groups. To achieve a random distribution of participants, allocation must be completely blinded. Those performing the allocation must know nothing about the participants, so that there is no way to compromise the random allocation, consciously or otherwise. Obviously, assigning individuals with small cancers to the screened group and/or individuals with large, more advanced cancers to the control group will bias the results in favor of the screening test, while the reverse situation will make it appear that the screening test has no benefit or even has a detrimental effect.

Nonblinded randomization compromised the results of the Canadian National Breast Screening Study (8,17). Women with advanced breast cancers were placed in greater numbers into the screened group than into the control group. The trial was compromised because women who already had advanced breast cancers and thus could not benefit from screening were allowed to participate in a screening trial. Each of the women first underwent a clinical breast examination (so that those with lumps were identifiable), and then the allocation to one group or the other was performed with open lists, making it possible to just skip a line on the open lists to place a woman with a lump in the screened group. Obviously, this compromises the results of the trial.

One measure of how well the random allocation process went is to see if the demographic features (eg, age distribution, family income, number of children) of the two groups are the same. However, with a compromise such as that in the Canadian National Breast Screening Study, shifting a few women with advanced cancer from one group to the other will have a dramatic impact on the mortality rates of the groups but will have no influence on the overall demographics because the other thousands of women involved will dilute the effect. Randomization must be completely blinded.
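The arithmetic behind this point can be sketched with a short calculation. Every number below (group size, number of deaths, number of women shifted, ages) is a hypothetical assumption chosen for illustration and is not taken from the Canadian trial or any other study.

```python
# Illustrative sketch only: all values are hypothetical assumptions.

def death_rate_ratio(deaths_screened, deaths_control, n_screened, n_control):
    """Death rate in the screened group divided by the death rate in the control group."""
    return (deaths_screened / n_screened) / (deaths_control / n_control)

n_per_group = 25_000      # women allocated to each group (assumed)
deaths_per_group = 40     # cancer deaths per group if screening had no effect (assumed)
mean_age = 45.0           # mean age in each group under truly random allocation (assumed)

# Truly random allocation of a test with no effect: equal death rates.
balanced = death_rate_ratio(deaths_per_group, deaths_per_group, n_per_group, n_per_group)

# Suppose 8 women with advanced, ultimately fatal cancers who would have fallen
# in the control group are placed in the screened group instead.
shifted = 8
biased = death_rate_ratio(deaths_per_group + shifted, deaths_per_group - shifted,
                          n_per_group + shifted, n_per_group - shifted)

# Even if all 8 shifted women were 49 years old, the mean age barely moves.
age_screened = (n_per_group * mean_age + shifted * 49) / (n_per_group + shifted)

print(f"death rate ratio with random allocation: {balanced:.2f}")
print(f"death rate ratio after shifting {shifted} women: {biased:.2f}")   # about 1.5
print(f"change in mean age of screened group: {age_screened - mean_age:.4f} years")
```

Under these assumptions, moving 8 women among 50,000 makes the screened group appear to have about 50% more deaths, while the mean age of the group changes by roughly one thousandth of a year.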


    STATISTICAL POWER
 
Before a study begins, estimates must be made as to how many participants will be needed in the trial. The statistical power of a projected study is determined ahead of time so that sufficient numbers of participants will be included. This is done by predicting the expected benefit (ie, percentage reduction in deaths) and then calculating how many participants will be needed to endow the study with a high probability of showing benefit with given estimates of cancer incidence and expected death rates. Since RCTs are very expensive, these estimates are often used to choose the lowest number of participants that is likely to show the expected benefit. The "power" of data to "prove" a benefit is also used to determine the success of a trial after the study has been concluded. If the estimated benefit proves to have been too high and the actual benefit is lower, the trial will not be able to show statistically significant results and a real benefit may be overlooked (18,19).
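As a rough illustration of such a calculation, the sketch below uses the standard normal-approximation formula for comparing two proportions. It is not the method used to plan any particular trial. The function name, the control-group death risk of 4 per 1,000, the 5% two-sided significance level, and the 80% power are all assumptions chosen for the example.

```python
from math import ceil
from statistics import NormalDist

def participants_per_group(control_death_risk, expected_reduction,
                           alpha=0.05, power=0.80):
    """Approximate number of participants needed in EACH group.

    control_death_risk: assumed probability of dying from the cancer during
        follow-up without screening (hypothetical input).
    expected_reduction: predicted fractional reduction in deaths (0.30 = 30%).
    """
    p1 = control_death_risk
    p2 = control_death_risk * (1 - expected_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)            # desired probability of showing benefit
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n_30 = participants_per_group(0.004, 0.30)   # planners predict a 30% reduction
n_15 = participants_per_group(0.004, 0.15)   # the real reduction is only 15%

print(f"needed per group if the benefit is 30%: {n_30}")
print(f"needed per group if the benefit is 15%: {n_15}")
```

With these assumed inputs, halving the expected benefit more than quadruples the number of participants required, which is why a trial sized for an optimistic estimate can miss a smaller but real benefit.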

The statistical power of a study is critical. Failure to appreciate its importance has confused many analysts. If a study includes just enough individuals to show a benefit for the entire population that participated in the trial, then a retrospective analysis of data from a subgroup within the trial may not show benefit merely because there were not enough individuals in that group to permit a statistically significant result. This is precisely what happened in the breast cancer screening trials in which data in women aged 40–49 years were analyzed, retrospectively, as a separate group (20). There were far too few women in that age group to demonstrate significant results in the early years of follow-up (20).

The fact that the benefit did not achieve statistical significance (because the trials did not include sufficient numbers of women to be able to show a significant benefit in this subgroup in the early years of follow-up) was misinterpreted by some analysts as meaning that there was no benefit. Following up the women for a longer period of time meant that there were more deaths from breast cancer and more patient years such that the benefit became significant (21).

Obviously, if numbers were not critical, then only very small trials would be needed. Retrospective subgroup analysis of data that lack the statistical power to permit accurate analysis should only be used to raise the next research question; it is scientifically unjustified to use these analyses to make medical recommendations. Investigators who wish to analyze data by subgroups (eg, age, sex) must plan at the outset to be sure that there will be sufficient numbers of participants in these groups to permit legitimate analysis.
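The loss of power in a subgroup can be sketched with the same kind of normal approximation. The figures are hypothetical: a trial of 37,000 participants per group, a control-group death risk of 4 per 1,000, a true 30% reduction in deaths, and a subgroup containing one quarter of the participants with the same death risk as the whole trial (in practice a younger subgroup would usually have fewer deaths, which would make matters worse).

```python
from math import sqrt
from statistics import NormalDist

def approximate_power(n_per_group, control_death_risk, true_reduction, alpha=0.05):
    """Approximate probability that a two-sided test at level alpha shows a
    significant difference, given n_per_group participants in each group."""
    p1 = control_death_risk
    p2 = control_death_risk * (1 - true_reduction)
    standard_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p2) / standard_error - z_alpha)

whole_trial = approximate_power(37_000, 0.004, 0.30)
subgroup = approximate_power(37_000 // 4, 0.004, 0.30)   # one quarter of the participants

print(f"power for the whole trial: {whole_trial:.2f}")
print(f"power for the subgroup:    {subgroup:.2f}")
```

Under these assumptions the whole trial has roughly an 80% chance of showing the benefit, while the subgroup has well under a 50% chance of showing exactly the same benefit. A nonsignificant subgroup result is therefore the expected outcome, not evidence of no benefit.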


    STATISTICAL SIGNIFICANCE
 
As is well known, many events are due to chance alone. A coin can be flipped and turn up heads repeatedly, but the chance of this continuing to happen diminishes with an increasing number of flips. Similarly, in a screening trial, there may be fewer deaths in one group or the other on the basis of chance alone. When the balance shifts back and forth, it is termed statistical fluctuation and is usually due to the small numbers involved. It is particularly evident in the early years of follow-up in screening trials, when the number of deaths is small. Early on, there can even appear to be more deaths among the screened women due to statistical fluctuation. This occurred in some of the mammography trials (22,23), but if there is a real benefit and the numbers are sufficiently large (usually with longer follow-up), the "truth" will appear beyond chance.
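How often chance alone could produce an apparent excess of deaths among screened women can be estimated with a simple binomial calculation. This is a minimal sketch that assumes two groups of equal size, a true 30% reduction in deaths from screening, and a fixed total number of deaths at the time of analysis. The totals of 17 and 170 deaths are hypothetical.

```python
from math import comb

def chance_of_apparent_harm(total_deaths, true_reduction):
    """Probability that more than half of all deaths fall in the screened group
    even though screening truly reduces deaths by `true_reduction`."""
    # With equal-sized groups, each death belongs to the screened group
    # with this probability.
    p_screened = (1 - true_reduction) / (2 - true_reduction)
    more_than_half = total_deaths // 2 + 1
    return sum(
        comb(total_deaths, k) * p_screened**k * (1 - p_screened)**(total_deaths - k)
        for k in range(more_than_half, total_deaths + 1)
    )

early = chance_of_apparent_harm(17, 0.30)    # early follow-up, few deaths
late = chance_of_apparent_harm(170, 0.30)    # longer follow-up, many deaths

print(f"chance of an apparent excess with 17 deaths:  {early:.3f}")
print(f"chance of an apparent excess with 170 deaths: {late:.3f}")
```

With few deaths, an apparent excess in the screened group is far from rare despite a real benefit; with ten times as many deaths it becomes very unlikely.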

There are mathematical formulas that are used to estimate the likelihood that a result is due to chance. These "tests of significance" are based on varying assumptions, but the basic principle behind them can be seen in the following example: If we are told that there were seven cancer deaths in the screened group and 10 cancer deaths in the unscreened control group, most of us would say that seven versus 10 was not a big difference and could easily be due to chance. However, if there were 70 deaths in the screened group and 100 deaths in the control group, even though the ratio is exactly the same, we would be more inclined to believe that there was a real benefit on the basis of the larger numbers. "Statistical significance" is merely a way of showing this mathematically.
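The example can be checked with one of the simplest such tests. The sketch below assumes that the two groups are the same size, so that if screening had no effect each death would be equally likely to fall in either group. It then computes an exact two-sided binomial P value. This is one reasonable choice of test, not necessarily the one used in any of the trials.

```python
from math import comb

def two_sided_p_value(deaths_screened, deaths_control):
    """Exact two-sided binomial P value for the split of deaths between two
    equal-sized groups, under the hypothesis that screening has no effect."""
    total = deaths_screened + deaths_control
    smaller = min(deaths_screened, deaths_control)
    # Probability of a split at least this lopsided in one direction...
    tail = sum(comb(total, k) for k in range(smaller + 1)) / 2**total
    # ...doubled to allow for lopsidedness in either direction.
    return min(1.0, 2 * tail)

p_small = two_sided_p_value(7, 10)      # about 0.63
p_large = two_sided_p_value(70, 100)    # below 0.05

print(f"7 vs 10 deaths:   P = {p_small:.3f}")
print(f"70 vs 100 deaths: P = {p_large:.3f}")
```

The ratio of deaths is identical in the two cases, but only the larger numbers give a result that is unlikely to be due to chance at the conventional 5% level.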


    POINT ESTIMATES AND THE TIME TO SHOW A BENEFIT
 
Although the numbers keep changing as the populations being studied are followed over time (deaths from most cancers continue to occur over time and not all at once), the data must be analyzed at some point (or points) in time. These are called the point estimates. If the data are analyzed too soon, the number of deaths will be small, and the differences between the two groups may be small. Consequently, the early point estimates may not be significant, producing misleading analyses. Even experts have overlooked this important fact (24).

If we return to the basis of RCTs and the analogy of the twins, the problem is clear. For a benefit to be shown for a screened individual who would have died without screening, that individual’s "twin" in the unscreened group must die from the cancer. If the screening test enables the interruption of only moderately growing and slow-growing but lethal cancers, it may take many years for the control twin to die and the benefit to be revealed. Thus, the follow-up time is very important. A benefit may be overlooked if the follow-up is too short. In fact, an "early benefit" would be difficult to explain. For an early benefit to appear in a cancer screening trial, the cancer in the screened "twin" would have to be detected and treated just before it metastasized, while the cancer in the unscreened "twin" would have to metastasize soon after and kill the unscreened twin fairly quickly. On the basis of the fact that length bias sampling means that periodic screening is more likely to interrupt moderately growing and slow-growing cancers, an early benefit soon after the start of screening would actually be unlikely.


    NONCOMPLIANCE AND CONTAMINATION
 
Factors that can confuse results are called confounding factors. These can greatly weaken the power of a trial. For example, there is usually no way to prevent individuals who are supposed to be in the unscreened control group from going out on their own and having the screening test. Even if an individual’s life is saved by this, the proper analysis of RCT results requires that this person be counted with the unscreened group. This "contamination" weakens the ability of the trial to show a benefit, and larger numbers of participants are needed to compensate (25). Similarly, some individuals in the screening group refuse to be screened. Nevertheless, to avoid introducing a bias, these individuals must be counted as having been screened even if they die from the cancer. This is called noncompliance, and increased numbers of participants are needed to overcome the dilutional effects of this confounding factor.

Although counting noncompliers with the screened group and counting contaminators as if they were unscreened controls seems nonsensical, it is necessary because not doing so could introduce major biases. If, for example, individuals who were destined to have poor-prognosis cancers refused screening and were consequently excluded from the study group data, screening could appear successful when the apparent benefit merely reflected the removal of these individuals with bad cancers from the screened group.

Because women in the mammographic screening trials were not forced to undergo screening if they were allocated to the screened group and because those allocated to the control group were not prevented from being screened, the mammography trial results are reported as showing a benefit for women who were "invited" to be screened as opposed to those who actually were screened. Many women who died from breast cancer in the screened group had actually refused screening, while some in the unscreened control group may have been saved because of mammograms obtained outside the screening trial. Because proper analysis requires that such women still be counted with the group to which they had been allocated, it is fairly certain that the mammographic screening trial results represent an underestimation of the true benefit from screening.
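The size of this underestimation can be sketched with a deliberately simple model. It assumes that women who refuse screening, and control women who seek screening on their own, have the same underlying risk of death as everyone else, which may not hold in practice. The 30% true reduction, 70% compliance, and 20% contamination are hypothetical values.

```python
def observed_reduction(true_reduction, compliance, contamination):
    """Reduction in deaths seen when comparing 'invited' with 'not invited' women.

    true_reduction: fractional reduction in deaths among women actually screened.
    compliance:     fraction of the invited group that is actually screened.
    contamination:  fraction of the control group screened outside the trial.
    """
    invited_relative_deaths = 1 - compliance * true_reduction
    control_relative_deaths = 1 - contamination * true_reduction
    return 1 - invited_relative_deaths / control_relative_deaths

diluted = observed_reduction(0.30, compliance=0.70, contamination=0.20)   # about 0.16
ideal = observed_reduction(0.30, compliance=1.0, contamination=0.0)       # the full 0.30

print(f"reduction seen for women invited to screening: {diluted:.1%}")
print(f"reduction with perfect compliance and no contamination: {ideal:.1%}")
```

In this hypothetical case a true 30% benefit for women who are actually screened appears as a benefit of only about 16% for women who are invited.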


    SCREENING INTERVAL
 
As noted earlier, length bias sampling is the concept that periodic screening is more likely to reveal moderately growing and slower-growing cancers than it is to reveal fast-growing cancers. Simply put, if the time between screening tests (screening interval) is too long, then a faster-growing cancer may be too small to be detected at one screening and will grow to a clinically detectable size before the next screening test.

In general, increasing the time between screening tests increases the number of cancers that become evident between screening tests (interval cancers). This means that the value of the screening test is diminished because an increasing number of individuals gain no benefit from the test. Because the cost of screening and the secondary costs that it creates (eg, for additional testing, biopsies) represent major challenges to its use, the goal is to try to determine the optimal screening interval (ie, lives saved vs time between screening examinations). At some point there is insufficient incremental value to justify decreasing the time between screening examinations (few if any additional lives are saved beyond those saved with the longer interval). It is often the cost of the test (including harms to the patients as well as economic cost) that dictates a compromise in which health planners support a longer interval than would be medically ideal (an ideal interval being one that results in the greatest number of lives saved).

When a new test is introduced that reveals cancers earlier than they would be revealed without the test, the first round of screening will not only reveal the cancers that have just reached the threshold where the test can demonstrate them but will also reveal cancers that have been accumulating in the population because the old method could not reveal the cancers until they reached an even larger size. Thus, in the first round of screening (prevalence screening), many more cancers will be detected than will be found at subsequent screening examinations if the time between screening examinations is not too long. If screening is repeated, and the time between screening examinations is appropriate, then most of the next group of cancers detected will have just entered the detectable phase. These new cancers are the incident cases that are newly discovered after each screening interval.
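The difference between the prevalence round and later rounds can be sketched under strong simplifying assumptions: cancers reach the size detectable with the new test at a steady rate, every cancer remains detectable for the same fixed time before it would surface clinically, and the test misses nothing. The rate of 2 per 1,000 women per year and the 2-year detectable period are hypothetical.

```python
def cancers_per_round(new_detectable_per_1000_per_year, detectable_years, interval_years):
    """Return (first-round yield, later-round yield) per 1,000 women screened."""
    # First round: everything that has accumulated in the detectable but not yet
    # clinically evident pool.
    prevalence_round = new_detectable_per_1000_per_year * detectable_years
    # Later rounds: only cancers that became detectable since the last screening
    # and have not yet surfaced clinically.
    incidence_round = new_detectable_per_1000_per_year * min(interval_years, detectable_years)
    return prevalence_round, incidence_round

first, later = cancers_per_round(2.0, detectable_years=2.0, interval_years=1.0)
print(f"first (prevalence) round: {first} per 1,000")   # 4.0
print(f"later (incidence) rounds: {later} per 1,000")   # 2.0
```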

The period of time during which a cancer is detectable with a test before it is clinically evident is called the sojourn time. If, for example, a cancer becomes palpable at 1.5 cm, is detectable with mammography at 0.5 cm, and takes 2 years to grow from 0.5 to 1.5 cm, then the sojourn time is 2 years. If screening takes place every 3 years, then many cancers that were not detected at one screening examination will already have become palpable before the next screening examination, and the screening test will be less effective than it would have been if the screening interval were shorter. If there are too many cancers that become evident in the interval between screening examinations (ie, the screening interval is too long), then the screening test will have little value. Performing more frequent screening is generally better than having longer intervals between screening examinations, but the practical "best" time between screening examinations is a balance between the most effective period for reducing deaths through early detection and the cost of performing more frequent testing. To intercept cancers earlier, the screening interval should be less than half the sojourn time (26).
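The example can be turned into a small calculation. This is a sketch under idealized assumptions: every cancer has the same sojourn time, cancers become detectable at a steady rate, and the test detects every cancer that is in its detectable phase on the day of screening. Real sojourn times vary from cancer to cancer, so the output should be read as an illustration of the trend only.

```python
def interval_cancer_fraction(sojourn_years, interval_years):
    """Fraction of cancers that surface clinically between screening examinations."""
    return max(0.0, 1 - sojourn_years / interval_years)

def mean_lead_time_years(sojourn_years, interval_years):
    """Average time by which screen-detected cancers are found ahead of
    clinical detection."""
    if interval_years <= sojourn_years:
        # Every cancer is caught, on average half an interval after it
        # becomes detectable.
        return sojourn_years - interval_years / 2
    # Only cancers whose detectable window spans a screening date are caught.
    return sojourn_years / 2

for interval in (3.0, 2.0, 1.0):
    missed = interval_cancer_fraction(2.0, interval)
    lead = mean_lead_time_years(2.0, interval)
    print(f"interval {interval:.0f} y: {missed:.0%} interval cancers, "
          f"mean lead time {lead:.1f} y")
```

With a 2-year sojourn time, screening every 3 years lets about one third of cancers surface between examinations in this model. Shortening the interval to 2 years eliminates interval cancers only in this idealized case, and shortening it further increases the time gained on each detected cancer.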


    THRESHOLDS FOR INTERVENTION
 
As pointed out earlier, the ideal cancer screening test would yield no false-positive results (when the test suggests a possible cancer but the finding proves to be benign) and no false-negative results (when the test results suggest that there is no cancer present when there actually is cancer present that is undetected with the test). A high false-positive rate due to an aggressive interpretation of the test results may be undesirable because it increases the costs that result from screening (because these lesions require additional evaluation). This also increases the harms of screening because some (if not many) of the false-positive cases will require a biopsy to show that they are not cancer. However, waiting for a cancer to develop more suspicious characteristics may allow it to spread before it is detected. An aggressive approach may be the only way to detect cancers at an early enough time to alter the natural history of the tumor. Just as the screening interval may influence the value of the test, the thresholds that are set for intervention will influence the percentage of cancers detected (ie, the sensitivity of the test) and the lives saved.
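The trade-off between sensitivity and false-positive results can be sketched with a toy example. The "suspicion scores" below are invented for illustration (0 meaning clearly benign, 10 meaning clearly malignant) and do not come from any study. They simply encode the assumption that benign and malignant lesions overlap in appearance.

```python
# Hypothetical suspicion scores assigned by a reader to 10 benign and
# 10 malignant lesions; the overlap between the two lists is deliberate.
BENIGN = [1, 1, 2, 2, 3, 3, 4, 5, 6, 7]
MALIGNANT = [3, 5, 6, 7, 8, 8, 9, 9, 10, 10]

def rates_at_threshold(threshold):
    """Return (sensitivity, false-positive rate) if every lesion scoring at or
    above `threshold` is recalled for further evaluation."""
    sensitivity = sum(score >= threshold for score in MALIGNANT) / len(MALIGNANT)
    false_positive_rate = sum(score >= threshold for score in BENIGN) / len(BENIGN)
    return sensitivity, false_positive_rate

aggressive = rates_at_threshold(3)     # (1.0, 0.6): no cancers missed, many benign recalls
conservative = rates_at_threshold(8)   # (0.6, 0.0): no benign recalls, 4 of 10 cancers missed

for threshold in range(3, 9):
    sens, fpr = rates_at_threshold(threshold)
    print(f"threshold {threshold}: sensitivity {sens:.0%}, false-positive rate {fpr:.0%}")
```

As long as the two groups of lesions overlap, no threshold in this example achieves both complete sensitivity and a zero false-positive rate.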

We are unaware of a screening test that does not yield false-positive and false-negative results. Probably the best-known cancer screening test is the Papanicolaou (Pap) test for cancer of the uterine cervix. Depending on how the test is interpreted, as many as 10% of women will have falsely positive smears and will be recalled for additional testing that turns out to yield negative results. This is usually due to the fact that for all screening tests there is often an overlap in characteristics between benign and malignant lesions. Some lesions (eg, spiculated masses at mammography) have a very high probability of being cancer, while others (eg, well-circumscribed masses at mammography) have a low probability of being cancer, but a small risk nonetheless. If the goal is to find all cancers, then the threshold for intervention must be low and the false-positive call rate will be high. If the goal is to minimize the false-positive call rate, then the threshold for intervention is high and cancers will be allowed to slip through the screening test (27). Until a perfect screening test is developed, there is usually no way to avoid this relation