Statistical Concepts Series
1 From the Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University, 600 N Wolfe St, Central Radiology Viewing Area, Rm 117, Baltimore, MD 21287. Received December 17, 2001; revision requested January 29, 2002; revision received March 7; accepted March 13. Address correspondence to the author (e-mail: jeng@jhmi.edu).
| ABSTRACT |
|---|
© RSNA, 2003
Index terms: Radiology and radiologists, research; Statistical analysis
| INTRODUCTION |
|---|
Superficial discussions of sample size determination are included in typical introductory biostatistics texts (1–3). The goal of this article is to augment these introductory discussions with additional practical material. First, the need for considering sample size will be reviewed. Second, the study design parameters affecting sample size will be identified. Third, formulae for calculating appropriate sample sizes for some common study designs will be defined. Finally, some advice will be offered on what to do if the calculated sample size is impracticably large. To assist the reader in performing the calculations described in this article and to encourage experimentation with them, a World Wide Web page has been developed that closely parallels the equations presented in this article. This page can be found at www.rad.jhmi.edu/jeng/javarad/samplesize/.
Even if a statistician is readily available, the investigator may find that a working knowledge of the factors affecting sample size will result in more fruitful communication with the statistician and in better research design. A working knowledge of these factors is also required to use one of the numerous Web pages (4–6) and computer programs (7–9) that have been developed for calculating appropriate sample sizes. It should be noted that Web pages for calculating sample size are typically limited to situations involving the well-known parametric statistics, which are those involving the calculation of summary means, proportions, or other parameters of an assumed underlying statistical distribution such as the normal, Student t, or binomial distributions. The calculation of sample size for nonparametric statistics such as the Wilcoxon rank sum test is performed by some computer programs (7,9).
| IMPORTANCE OF SAMPLE SIZE |
|---|
Sample size is important primarily because of its effect on statistical power. Statistical power is the probability that a statistical test will indicate a significant difference when there truly is one. Statistical power is analogous to the sensitivity of a diagnostic test (10), and one could mentally substitute the word "sensitivity" for the word "power" during statistical discussions.
In a study comparing two groups of individuals, the power (sensitivity) of a statistical test must be sufficient to enable detection of a statistically significant difference between the two groups if a difference is truly present. This issue becomes important if the study results were to demonstrate no statistically significant difference. If such a negative result were to occur, there would be two possible interpretations. The first interpretation is that the results of the statistical test are correct and that there truly is no statistically significant difference (a true-negative result). The second interpretation is that the results of the statistical test are erroneous and that there is actually an underlying difference, but the study was not powerful enough (sensitive enough) to find the difference, yielding a false-negative result. In statistical terminology, a false-negative result is known as a type II error. An adequate sample size gives a statistical test enough power (sensitivity) so that the first interpretation (that the results are true-negative) is much more plausible than the second interpretation (that a type II error occurred) in the event no statistically significant difference is found in the study.
It is well known that many published clinical research studies possess low statistical power owing to inadequate sample size or other design issues (11,12). One could argue that it is as wasteful and inappropriate to conduct a study with inadequate power as it is to obtain a diagnostic test of insufficient sensitivity to rule out a disease.
| PARAMETERS THAT DETERMINE APPROPRIATE SAMPLE SIZE |
|---|
Minimum Expected Difference
This parameter is the smallest measured difference between comparison groups that the investigator would like the study to detect. As the minimum expected difference is made smaller, the sample size needed to detect statistical significance increases. The setting of this parameter is subjective and is based on clinical judgment and experience with the problem being investigated. For example, suppose a study is designed to compare a standard diagnostic procedure of 80% accuracy with a new procedure of unknown but potentially higher accuracy. It would probably be clinically unimportant if the new procedure were only 81% accurate, but suppose the investigator believes that it would be a clinically important improvement if the new procedure were 90% accurate. Therefore, the investigator would choose a minimum expected difference of 10% (0.10). The results of pilot studies or a literature review can also guide the selection of a reasonable minimum difference.
Estimated Measurement Variability
This parameter is represented by the expected SD in the measurements made within each comparison group. As statistical variability increases, the sample size needed to detect the minimum difference increases. Ideally, the estimated measurement variability should be determined on the basis of preliminary data collected from a similar study population. A review of the literature can also provide estimates of this parameter. If preliminary data are not available, this parameter may have to be estimated on the basis of subjective experience, or a range of values may be assumed. A separate estimate of measurement variability is not required when the measurement being compared is a proportion (in contrast to a mean), because the SD is mathematically derived from the proportion.
Statistical Power
This parameter is the power that is desired from the study. As power is increased, sample size increases. While high power is always desirable, there is an obvious trade-off with the number of individuals that can feasibly be studied, given the usually fixed amount of time and resources available to conduct a study. In randomized controlled trials, the statistical power is customarily set to a number greater than or equal to 0.80, with many clinical trial experts now advocating a power of 0.90.
Significance Criterion
This parameter is the maximum P value for which a difference is to be considered statistically significant. As the significance criterion is decreased (made more strict), the sample size needed to detect the minimum difference increases. The significance criterion is customarily set to .05.
One- or Two-tailed Statistical Analysis
In a few cases, it may be known before the study that any difference between comparison groups is possible in only one direction. In such cases, use of a one-tailed statistical analysis, which would require a smaller sample size for detection of the minimum difference than would a two-tailed analysis, may be considered. The sample size of a one-tailed design with a given significance criterion (for example, α) is equal to the sample size of a two-tailed design with a significance criterion of 2α, all other parameters being equal. Because of this simple relationship and because truly appropriate one-tailed analyses are rare, a two-tailed analysis is assumed in the remainder of this article.
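The relationship between one- and two-tailed designs can be checked numerically. The sketch below assumes that the significance criterion enters the sample size calculation only through a cutoff on the standard normal distribution; the function name `z_crit` and its `tails` argument are illustrative and do not come from this article.

```python
from statistics import NormalDist


def z_crit(alpha: float, tails: int = 2) -> float:
    """Standard normal cutoff for significance criterion alpha.

    For a two-tailed analysis, alpha is split evenly between both tails;
    for a one-tailed analysis, all of alpha lies in a single tail.
    """
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1.0 - alpha / tails)


# A one-tailed design at alpha = .05 uses the same cutoff as a
# two-tailed design at alpha = .10, so the sample sizes agree.
one_tailed = z_crit(0.05, tails=1)
two_tailed = z_crit(0.10, tails=2)
print(round(one_tailed, 3), round(two_tailed, 3))
```

Because the two cutoffs are identical, any sample size equation that depends on the significance criterion only through this cutoff gives the same result for both designs.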
| SAMPLE SIZES FOR COMPARATIVE RESEARCH STUDIES |
|---|
For a study in which two means are compared, the equation for sample size is

$$N = \frac{4\sigma^2 \left(z_{\mathrm{crit}} + z_{\mathrm{pwr}}\right)^2}{D^2} \qquad (1)$$

where N is the total sample size (the sum of the sizes of both comparison groups), σ is the assumed SD of each group (assumed to be equal for both groups), the zcrit value is that given in Table 1 for the desired significance criterion, the zpwr value is that given in Table 2 for the desired statistical power, and D is the minimum expected difference between the two means. Both zcrit and zpwr are cutoff points along the x axis of a standard normal probability distribution that demarcate probabilities matching the specified significance criterion and statistical power, respectively. The two groups that make up N are assumed to be equal in number, and it is assumed that two-tailed statistical analysis will be used. Note that N depends only on the difference between the two means; it does not depend on the magnitude of either one.
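Because zcrit and zpwr are cutoff points on a standard normal distribution, they can also be computed directly instead of being read from a table. The following is a minimal sketch that uses Python's standard library; the function names and the particular values printed are illustrative choices and are not a reproduction of Tables 1 and 2.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, SD 1


def z_crit(alpha: float) -> float:
    """Two-tailed cutoff: alpha / 2 of the probability lies in each tail."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def z_pwr(power: float) -> float:
    """Cutoff with the desired power as the area to its left."""
    return STANDARD_NORMAL.inv_cdf(power)


# Illustrative values only; the rows chosen here are an assumption.
for alpha in (0.01, 0.05, 0.10):
    print(f"significance criterion {alpha:.2f}: zcrit = {z_crit(alpha):.3f}")
for power in (0.80, 0.85, 0.90, 0.95):
    print(f"power {power:.2f}: zpwr = {z_pwr(power):.3f}")
```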
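As an example, suppose a study is proposed to compare a vascular procedure with medical therapy, with the outcome measured in millimeters of mercury. A significance criterion of .05 and a power of 0.80 are chosen. With these assumptions, D = 10 mm Hg, σ = 15 mm Hg, zcrit = 1.960 (from Table 1), and zpwr = 0.842 (from Table 2). Equation (1) yields a sample size of N = 70.6. Therefore, a total of 70 patients (rounding N to the nearest even number) should be enrolled in the study: 35 to undergo the vascular procedure and 35 to receive medical therapy.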
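The arithmetic of this example can be reproduced with a short script. This is a sketch under stated assumptions: the function name `n_two_means` is hypothetical, the formula is N = 4σ²(zcrit + zpwr)²/D², and D = 10 mm Hg is the minimum difference that reproduces the reported N = 70.6 given σ = 15 mm Hg, zcrit = 1.960, and zpwr = 0.842.

```python
from statistics import NormalDist


def n_two_means(sd: float, diff: float,
                alpha: float = 0.05, power: float = 0.80) -> float:
    """Total sample size (both groups combined) for comparing two means.

    Assumes equal group sizes, equal SD in both groups, and a
    two-tailed analysis: N = 4 * sd^2 * (zcrit + zpwr)^2 / diff^2.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # about 1.960 for alpha = .05
    z_pwr = z.inv_cdf(power)               # about 0.842 for power = 0.80
    return 4.0 * sd ** 2 * (z_crit + z_pwr) ** 2 / diff ** 2


n = n_two_means(sd=15.0, diff=10.0)
print(round(n, 1))  # 70.6
```

Halving the minimum expected difference quadruples the required sample size, because D enters the equation as a squared term in the denominator.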
For a study in which two proportions are compared with a χ² test or a z test, which is based on the normal approximation to the binomial distribution, the equation for sample size (14) is

$$N = \frac{2\left[z_{\mathrm{crit}}\sqrt{2\bar{p}(1-\bar{p})} + z_{\mathrm{pwr}}\sqrt{p_1(1-p_1) + p_2(1-p_2)}\right]^2}{D^2} \qquad (2)$$

where p1 and p2 are the estimated proportions in the two groups, D = |p1 − p2| is the minimum expected difference, $\bar{p}$ = (p1 + p2)/2, and N, zcrit, and zpwr are defined as they are for Equation (1). The two groups comprising N are assumed to be equal in number, and it is assumed that two-tailed statistical analysis will be used. Note that in this case, N depends not only on the difference between the two proportions but also on the magnitude of the proportions themselves. Therefore, Equation (2) requires the investigator to estimate p1 and p2, as well as their difference, before performing the study. However, Equation (2) does not require an independent estimate of SD because it is calculated from p1 and p2 within the equation.
As an example, suppose a standard diagnostic procedure has an accuracy of 80% for the diagnosis of a certain disease. A study is proposed to evaluate a new diagnostic procedure that may have greater accuracy. On the basis of their experience, the investigators decide that the new procedure would have to be at least 90% accurate to be considered significantly better than the standard procedure. A significance criterion of .05 and a power of 0.80 are chosen. With these assumptions, p1 = 0.80, p2 = 0.90, D = 0.10, $\bar{p}$ = 0.85, zcrit = 1.960, and zpwr = 0.842. Equation (2) yields a sample size of N = 398. Therefore, a total of 398 patients should be enrolled: 199 to undergo the standard diagnostic procedure and 199 to undergo the new one.
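The calculation for two proportions can be sketched in the same way. The function name `n_two_proportions` is hypothetical, and the formula assumed is N = 2[zcrit√(2p̄(1 − p̄)) + zpwr√(p1(1 − p1) + p2(1 − p2))]²/D², which reproduces the reported N = 398 when zpwr = 0.842 (the cutoff that corresponds to a power of 0.80).

```python
from math import sqrt
from statistics import NormalDist


def n_two_proportions(p1: float, p2: float,
                      alpha: float = 0.05, power: float = 0.80) -> float:
    """Total sample size (both groups combined) for comparing two proportions.

    Assumes equal group sizes, a two-tailed analysis, and the normal
    approximation to the binomial distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    z_pwr = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2.0          # mean of the two proportions
    diff = abs(p1 - p2)              # minimum expected difference D
    term_crit = z_crit * sqrt(2.0 * p_bar * (1.0 - p_bar))
    term_pwr = z_pwr * sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    return 2.0 * (term_crit + term_pwr) ** 2 / diff ** 2


n = n_two_proportions(0.80, 0.90)
print(round(n))  # 398
```

Raising the power to 0.90 with the same proportions increases the required total to more than 500 patients, which illustrates the trade-off between power and feasibility.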
| SAMPLE SIZES FOR DESCRIPTIVE STUDIES |
|---|
In studies designed to estimate a mean, the equation for sample size (2,15) is

$$N = \frac{4\sigma^2 z_{\mathrm{crit}}^2}{D^2} \qquad (3)$$

where N is the sample size of the single study group, σ is the assumed SD for the group, the zcrit value is that given in Table 1, and D is the total width of the expected CI. Note that Equation (3) does not depend on statistical power because this concept only applies to statistical comparisons.
As an example, suppose a fetal sonographer wants to determine the mean fetal crown-rump length in a group of pregnancies. The sonographer would like the limits of the 95% confidence interval to be no more than 1 mm above or 1 mm below the mean crown-rump length of the group. From previous studies, it is known that the SD for the measurement is 3 mm. Based on these assumptions, D = 2 mm, σ = 3 mm, and zcrit = 1.960 (from Table 1). Equation (3) yields a sample size of N = 35. Therefore, 35 fetuses should be examined in the study.
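The crown-rump length example can be checked with a few lines of code. This sketch assumes the formula N = 4σ²zcrit²/D², where D is the total width of the CI; the function name `n_estimate_mean` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_estimate_mean(sd: float, ci_width: float,
                    confidence: float = 0.95) -> float:
    """Sample size needed to estimate a mean with a CI of total width ci_width.

    N = 4 * sd^2 * zcrit^2 / ci_width^2, where zcrit is the two-tailed
    standard normal cutoff for the chosen confidence level.
    """
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 4.0 * sd ** 2 * z_crit ** 2 / ci_width ** 2


n = n_estimate_mean(sd=3.0, ci_width=2.0)
print(round(n, 1), ceil(n))  # 34.6 35
```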
In studies designed to measure a characteristic in terms of a proportion, the equation for sample size (2,15) is
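
$$N = \frac{4 z_{\mathrm{crit}}^2\, p(1-p)}{D^2} \qquad (4)$$

where p is a preliminary estimate of the proportion to be measured, and N, zcrit, and D are defined as they are for Equation (3).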
As an example, suppose an investigator would like to determine the accuracy of a diagnostic test with a 95% CI of ±10%. Suppose that, on the basis of results of preliminary studies, the estimated accuracy is 80%. With these assumptions, D = 0.20, p = 0.80, and zcrit = 1.960. Equation (4) yields a sample size of N = 61. Therefore, 61 patients should be examined in the study.
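A corresponding sketch for a proportion is shown below. It assumes the formula N = 4zcrit²p(1 − p)/D², where D is the total width of the CI (0.20 for ±10%); the function name `n_estimate_proportion` is hypothetical.

```python
from statistics import NormalDist


def n_estimate_proportion(p: float, ci_width: float,
                          confidence: float = 0.95) -> float:
    """Sample size needed to estimate a proportion with a CI of total width ci_width.

    N = 4 * zcrit^2 * p * (1 - p) / ci_width^2. No separate SD estimate
    is needed because the variance p * (1 - p) follows from p itself.
    """
    alpha = 1.0 - confidence
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 4.0 * z_crit ** 2 * p * (1.0 - p) / ci_width ** 2


n = n_estimate_proportion(p=0.80, ci_width=0.20)
print(round(n, 1))  # 61.5
```

The computed value of about 61.5 corresponds to the N = 61 reported in the example; an investigator who prefers to round up would plan for 62 patients.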
| MINIMIZING THE SAMPLE SIZE |
|---|
Use Continuous Measurements Instead of Categories
Because a radiologic diagnosis is often expressed in terms of a binary result, such as the presence or absence of a disease, it is natural to convert continuous measurements into categories. For example, the size of a lesion might be encoded as "small" or "large." For a sample of fixed size, the use of the actual measurement rather than the proportion in each category yields more power. This is because statistical tests that incorporate the use of continuous values are mathematically more powerful than those used for proportions, given the same sample size.
Use More Precise Measurements
For studies in which Equation (1) or Equation (3) applies (the two equations that include an estimate of the SD), any way to increase the precision (decrease the variability) of the measurement process should be sought. For some types of research, precision can be increased by simply repeating the measurement. More complex equations are necessary for studies involving repeated measurements in the same individuals (17), but the basic principles are similar.
Use Paired Measurements
Statistical tests like the paired t test are mathematically more powerful for a given sample size than are unpaired tests because in paired tests, each measurement is matched with its own control. For example, instead of comparing the average lesion size in a group of treated patients with that in a control group, measuring the change in lesion size in each patient after treatment allows each patient to serve as his or her own control and yields more statistical power. Equation (1) can still be used in this case. D represents the expected change in the measurement, and σ is the expected SD of this change. The additional power and reduction in sample size are due to the SD being smaller for changes within individuals than for overall differences between groups of individuals.
Use Unequal Group Sizes
Equations (1) and (2) involve the assumption that the comparison groups are equal in size. Although it is statistically most efficient if the two groups are equal in size, benefit is still gained by studying more individuals, even if the additional individuals all belong to one of the groups. For example, it may be feasible to recruit additional individuals into the control group even if it is difficult to recruit more individuals into the noncontrol group. More complex equations are necessary for calculating sample sizes when comparing means (13) and proportions (18) of unequal group sizes.
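To illustrate the effect of unequal group sizes when comparing means, the sketch below uses a common normal-approximation formula, n1 = (1 + 1/k)σ²(zcrit + zpwr)²/D² with n2 = k·n1, where k is the ratio of the two group sizes. This formula is an assumption made for illustration and is not taken from the references cited above; with k = 1 it reduces to Equation (1). The function name `group_sizes_two_means` is hypothetical.

```python
from statistics import NormalDist


def group_sizes_two_means(sd: float, diff: float, ratio: float = 1.0,
                          alpha: float = 0.05, power: float = 0.80):
    """Return (n1, n2) for comparing two means, where n2 = ratio * n1.

    Assumes equal SD in both groups and a two-tailed analysis. The
    variance of the difference in means is sd^2 * (1/n1 + 1/n2).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    z = NormalDist()
    z_sum = z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)
    n1 = (1.0 + 1.0 / ratio) * sd ** 2 * z_sum ** 2 / diff ** 2
    return n1, ratio * n1


equal = group_sizes_two_means(sd=15.0, diff=10.0, ratio=1.0)
unequal = group_sizes_two_means(sd=15.0, diff=10.0, ratio=2.0)
print([round(x, 1) for x in equal], [round(x, 1) for x in unequal])
```

With twice as many individuals in the second group, fewer are needed in the first group, but the total across both groups is larger than with equal allocation, which is consistent with equal group sizes being the most efficient design.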
Expand the Minimum Expected Difference
Perhaps the minimum expected difference that has been specified is unnecessarily small, and a larger expected difference could be justified, especially if the planned study is a preliminary one. The results of a preliminary study could be used to justify a more ambitious follow-up study of a larger number of individuals and a smaller minimum difference.
| DISCUSSION |
|---|
Equations for calculating sample size, such as Equations (1) and (2), also provide a method for determining statistical power corresponding to a given sample size. To calculate power, solve for zpwr in the equation corresponding to the design of the study. The power can then be determined by referring to Table 2. In this way, an "observed power" can be calculated after a study has been completed, where the observed difference is used in place of the minimum expected difference. This calculation is known as retrospective power analysis and is sometimes used to aid in the interpretation of the statistical results of a study. However, retrospective power analysis is controversial because it can be shown that observed power is completely determined by the P value and therefore cannot add any additional information to its interpretation (19). Power calculations are most appropriate when they incorporate a minimum difference that is stated prospectively.
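As a sketch of the prospective use of this approach for a comparison of two means, solving N = 4σ²(zcrit + zpwr)²/D² for zpwr gives zpwr = √N·D/(2σ) − zcrit, and the power is the area under the standard normal curve to the left of zpwr. The function name `power_two_means` is illustrative, and the inputs are the values from the earlier two-means example.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_total: float, sd: float, diff: float,
                    alpha: float = 0.05) -> float:
    """Statistical power for a two-group comparison of means.

    n_total is the combined size of both (equal) groups; the analysis
    is assumed to be two-tailed with equal SD in both groups.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    z_pwr = sqrt(n_total) * abs(diff) / (2.0 * sd) - z_crit
    return z.cdf(z_pwr)  # convert the cutoff back to a probability


# Power with 70 patients in total, and the gain from doubling to 140.
print(round(power_two_means(70, sd=15.0, diff=10.0), 2))
print(round(power_two_means(140, sd=15.0, diff=10.0), 2))
```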
The accuracy of sample size calculations obviously depends on the accuracy of the estimates of the parameters used in the calculations. Therefore, these calculations should always be considered estimates of an absolute minimum. It is usually prudent for the investigator to plan to include more than the minimum number of individuals in a study to compensate for loss during follow-up or other causes of attrition.
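One simple way to build in a margin for attrition is to divide the calculated sample size by the fraction of individuals expected to complete the study. The sketch below assumes this rule of thumb; the function name `inflate_for_attrition` and the 10% dropout rate are hypothetical choices for illustration and are not recommendations from this article.

```python
from math import ceil


def inflate_for_attrition(n_required: float, dropout_rate: float) -> int:
    """Number to enroll so that about n_required individuals remain.

    dropout_rate is the anticipated fraction lost to follow-up
    (an assumed planning value, not a measured one).
    """
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n_required / (1.0 - dropout_rate))


# Enrollment target for 70 required patients with 10% expected attrition.
print(inflate_for_attrition(70, 0.10))  # 78
```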
Sample size is best considered early in the planning of a study, when modifications in study design can still be made. Attention to sample size will hopefully result in a more meaningful study whose results will eventually receive a high priority for publication.
| REFERENCES |
|---|