Statistical Concepts Series |
1 From the Department of Radiology, Boston University School of Medicine, 88 E Newton St, Atrium 2, Boston, MA 02118 (R.T.); and Health Services Research and Development Service, Department of Veterans Affairs, Washington, DC (P.E.C.). Received February 11, 2002; revision requested March 18; revision received April 1; accepted May 1. Address correspondence to R.T. (e-mail: tello@alum.mit.edu).
| ABSTRACT |
|---|
© RSNA, 2003
Index terms: Statistical analysis
| INTRODUCTION |
|---|
χ² test, which is addressed in other articles in this series. What follows in this article is a brief introduction to three techniques for testing the difference of means.
| HYPOTHESIS TESTING FOR TWO SAMPLE MEANS |
|---|
Independent Samples t Tests
If a researcher wants to compare means collected from two patient populations, a t test for independent samples will often be used. The term independent samples indicates that none of the patients in one sample are included in the second sample. The following research question will serve as an example: Are T2 values in magnetic resonance (MR) imaging of malignant hepatic masses different from those for benign hepatic masses? Table 1 presents summary statistics from two patient samples to answer this question (Figure). One sample includes only patients with malignant tumors. The second sample includes only patients with benign tumors (hemangiomas). The average T2 for malignant lesions is about 92 msec, while hemangiomas have an average T2 of 136 msec. The t test is used to test the null hypothesis that the mean T2 for malignant tumors is not different from that for benign tumors:

$$H_0: \mu_1 = \mu_2$$

To conduct this test, the difference between the two means is used in conjunction with the variation found in both samples (SD) and the sample sizes to compute a t test statistic. The t test formula is

$$t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}}$$

where $\bar{X}_1$ is the mean for sample 1, $\bar{X}_2$ is the mean for sample 2, and $S^2$ is the variance. The pooled standard error of the difference between two sample means is

$$S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where $n_1$ and $n_2$ are the two sample sizes and $S_1^2$ and $S_2^2$ are the two sample variances.
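The independent samples t statistic with a pooled standard error can be sketched in a few lines of Python. This is a minimal illustration, not the authors' calculation: the function name `pooled_t_statistic` is hypothetical, and although the two means are taken from the text, the SDs and sample sizes below are placeholder values chosen only to make the example run (they are not the values in Table 1).

```python
import math


def pooled_t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, degrees of freedom) for an equal-variance two-sample t test."""
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    # Pooled standard error of the difference between the two sample means.
    se_diff = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t = (mean1 - mean2) / se_diff
    df = n1 + n2 - 2
    return t, df


# Means (msec) come from the text; the SDs (20) and sample sizes (10)
# are HYPOTHETICAL placeholders, not the values reported in Table 1.
t, df = pooled_t_statistic(92.0, 20.0, 10, 136.0, 20.0, 10)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The resulting statistic would then be compared with the t distribution for the stated degrees of freedom, either from a table or with statistical software.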
Homogeneity of Variance
The t test assumes that the variances in each group are equal (termed homogeneous). Alternative methods can be used when this assumption is not valid. Statistical software programs often automatically compute t tests and report results for both equal and unequal variances. Alternative approximations of the t test when the variances are unequal can be found in a publication by Rosner (5). Choosing between these versions requires determining whether the variances are equal, which calls for another statistical test.
Determining whether the assumption of equal variances is valid requires the use of an F test, which is a variance ratio test (6). This calculation tests the null hypothesis that the variances are equal. If the results of the F test are statistically significant (P < .05), this suggests that the variances of the two groups are not equal. If this occurs, two recommended solutions are to either modify the t test to compensate for unequal variances or use a nonparametric test called the Mann-Whitney test (7).
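The variance ratio and one common modification of the t test for unequal variances can be sketched as follows. This is an illustration under stated assumptions: the article does not say which approximation it has in mind, so Welch's version with Welch-Satterthwaite degrees of freedom is used here as one widely used choice. The function names and the two small samples are hypothetical.

```python
import math
from statistics import mean, variance


def variance_ratio(sample1, sample2):
    """Return (F, numerator df, denominator df), larger variance on top."""
    v1, v2 = variance(sample1), variance(sample2)
    if v1 >= v2:
        return v1 / v2, len(sample1) - 1, len(sample2) - 1
    return v2 / v1, len(sample2) - 1, len(sample1) - 1


def welch_t(sample1, sample2):
    """Return (t, df) for Welch's unequal-variance t test (an assumed choice)."""
    n1, n2 = len(sample1), len(sample2)
    a = variance(sample1) / n1  # squared standard error of mean 1
    b = variance(sample2) / n2  # squared standard error of mean 2
    t = (mean(sample1) - mean(sample2)) / math.sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# HYPOTHETICAL measurements, used only to exercise the functions.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

f_stat, df_num, df_den = variance_ratio(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(f"F = {f_stat:.2f} on ({df_num}, {df_den}) df")
print(f"Welch t = {t_welch:.3f} with about {df_welch:.2f} df")
```

The F statistic would be compared with the F distribution for the two degrees of freedom to decide whether the equal-variance assumption is tenable.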
Paired t Tests
Two samples are paired when each data point of the first sample is matched and related to a data point in the second sample. This is common in studies in which measurements are collected from the same patients before and after intervention. Table 2 presents data on the size of a cancerous tumor in the same patient before and after receiving a new treatment. The research question represented in Table 2 is whether the new therapy affects tumor size. There are two groups represented by the same seven patients. One group is represented by the patient sample before therapy. The second group is represented by the same sample of patients after therapy. Before therapy, mean tumor size was 4.86 cm. After therapy, mean tumor size was 4.50 cm, representing a mean decrease of 0.36 cm. The t test is used to test the null hypothesis that the mean difference in tumor size between the groups before and after therapy is zero, with the assumption that this difference is distributed normally. The test is used to compare the observed difference obtained from the sample of seven patients with the hypothesized value of no difference in the population. The paired t test formula is

$$t = \frac{\bar{d}}{S_{\bar{d}}}$$

where $\bar{d}$ is the observed mean difference, and $S_{\bar{d}}$ is the standard error of the observed mean difference. For these data, the t test statistic is 5.21, calculated by dividing the observed mean difference of 0.36 cm by its standard error.
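The paired calculation can be sketched as follows: take the per-patient differences, then divide their mean by their standard error. The function name `paired_t_statistic` is hypothetical, and the before and after tumor sizes below are invented for illustration; they are not the seven patients in Table 2.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for a paired t test."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # One difference per patient (before minus after).
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = mean(diffs)
    # Standard error of the mean difference: SD of differences / sqrt(n).
    se_mean_diff = stdev(diffs) / math.sqrt(n)
    return mean_diff / se_mean_diff, n - 1


# HYPOTHETICAL tumor sizes in cm for five patients (not the Table 2 data).
before_cm = [5.0, 4.0, 6.0, 5.5, 4.5]
after_cm = [4.5, 3.8, 5.4, 5.2, 4.1]

t_paired, df_paired = paired_t_statistic(before_cm, after_cm)
print(f"t = {t_paired:.2f} with {df_paired} degrees of freedom")
```

Because each patient serves as his or her own control, only the variability of the differences enters the standard error, not the variability between patients.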
| ANOVA |
|---|
ANOVA calculations are best performed with statistical software. Software easily capable of calculating the t test and variances includes Excel version 5.0 (Microsoft, Bothell, Wash); more sophisticated packages for performing ANOVA include Stata version 5.0 (College Station, Tex), SPSS (Chicago, Ill), and SAS (Cary, NC). The basic approach is to compare the means and variances of independent groups to determine whether the groups are significantly different from one another. The null hypothesis proposes that the samples come from populations with the same mean and variance. The alternative hypothesis is that at least two of the means are not the same. If ANOVA is used with two groups, it produces results comparable to those obtained with the two independent samples t test.
Table 3 presents data on vertebral bone density for three groups of women (8). The research question represented in Table 3 is whether the mean vertebral bone density varies among the three groups; hence, ANOVA is used to determine if there is a significant difference. In this example, age groups of 46–55 years, 56–65 years, and 66–75 years are used to represent the three populations for patient screening. As reported in Table 3, the mean vertebral bone density is 1.12 for women 46–55 years of age, 1.13 for women 56–65 years of age, and 1.03 for women 66–75 years of age. Calculation of the test statistic involves estimation of a ratio of the variance between the groups to the variance within the groups. The ratio, called the F statistic, is compared with the F sampling distribution instead of the t distribution discussed earlier (5). An F statistic of 1.0 occurs when the variance between the groups is the same as the variance within the groups.
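The F statistic is

$$F = \frac{\sum_{k=1}^{K} n_k(\bar{X}_k - \bar{X})^2 / (K - 1)}{\sum_{k=1}^{K}\sum_{i=1}^{n_k} (X_i - \bar{X}_k)^2 / (N - K)}$$

where $\bar{X}_k$ is each group mean, $\bar{X}$ is the grand mean for all the groups [(sum of all scores)/N], $n_k$ is the number of patients in each group, K is the number of groups, and N is the total number of scores.
The sum of squares between groups is $\sum n_k(\bar{X}_k - \bar{X})^2$. The sum of squares within groups is $\sum (X_i - \bar{X}_1)^2 + \sum (X_i - \bar{X}_2)^2 + \sum (X_i - \bar{X}_3)^2$. The example presented in Table 3 can be calculated by using the F statistic formula; however, the descriptive data in Table 3 would require manipulation. For each group, the sum of squares within the group is the variance multiplied by the number of scores in the group (or by one less than that number if the variance was computed with n − 1 in the denominator).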
The F statistic in Table 3 is 4.499. This indicates that there is greater variation between the group means than within each group. In this case, there is a statistically significant difference (P < .013) between at least two of the age groups.
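The ratio of between-group to within-group variance can be sketched directly from raw scores. This is a minimal illustration with a hypothetical function name, `one_way_anova_f`, and invented scores; it does not reproduce the Table 3 data, which are reported only as summary statistics.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, between-groups df, within-groups df) for one-way ANOVA."""
    all_scores = [x for group in groups for x in group]
    grand_mean = mean(all_scores)
    k = len(groups)            # number of groups (K)
    n_total = len(all_scores)  # total number of scores (N)

    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = mean(group)
        # Between groups: group size times squared distance to grand mean.
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        # Within groups: squared distance of each score to its group mean.
        ss_within += sum((x - group_mean) ** 2 for x in group)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, k - 1, n_total - k


# HYPOTHETICAL scores for three groups (not the bone density data).
scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
f_value, df_between, df_within = one_way_anova_f(scores)
print(f"F = {f_value:.2f} on ({df_between}, {df_within}) df")
```

An F value well above 1.0, as in this toy example, indicates more variation between the group means than within the groups; the P value would come from the F distribution with the two degrees of freedom shown.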
| THE MULTIPLE COMPARISON PROBLEM |
|---|
There are corrections for this problem (11,14). Some are useful for unordered groups, such as patient height versus sex, while others are applied to ordered groups (to evaluate a trend), such as patient height versus sex when stratified by age. It is also worth noting that there is a debate about which method to use, and some hold the view that this correction is overused (13). We focus our attention solely on unordered groups and the most commonly used correction, the Bonferroni method.
The Bonferroni correction adjusts the threshold for significance (14): the adjusted threshold is equal to the desired significance level (eg, .05 or .01) divided by the number of outcome variables (comparisons) being examined. Consequently, when multiple statistical tests are conducted between the same variables, which would occur if multiple t tests were conducted for a comparison of age and bone density, the significance cut-off value is often adjusted to represent a more conservative estimate of statistical significance. One limitation of the Bonferroni correction is that reducing the level of significance associated with each test also reduces the power of the test, thereby increasing the chance of incorrectly retaining the null hypothesis.
Table 4 presents the results of t tests by using the same data presented in Table 3. Use of the common threshold of .05 would result in the conclusion that there is a significant difference in vertebral bone density between those 46–55 years of age and those 66–75 years of age (P = .021) and also between those 56–65 years of age and those 66–75 years of age (P = .006). However, compensating for multiple comparisons would reduce the threshold from .05 to .017 (.05/3). As a result, only the comparison between those 56–65 years of age and those 66–75 years of age reaches statistical significance.
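The Bonferroni adjustment in this example is simple enough to check in a few lines. The sketch below uses the two P values quoted in the text and the three pairwise comparisons among the age groups; the function name and dictionary labels are hypothetical, and the third comparison is omitted because its P value is not quoted here.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Return the per-test significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# P values quoted in the text for two of the three pairwise comparisons.
p_values = {
    "46-55 vs 66-75": 0.021,
    "56-65 vs 66-75": 0.006,
}

threshold = bonferroni_threshold(0.05, 3)  # three pairwise comparisons
significant = {label: p < threshold for label, p in p_values.items()}

print(f"adjusted threshold = {threshold:.3f}")
for label, is_significant in significant.items():
    print(label, "significant" if is_significant else "not significant")
```

Dividing by the number of comparisons keeps the overall chance of a false-positive finding near the nominal level, at the cost of reduced power for each individual test.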
In summary, it is often necessary to test hypotheses that relate a continuous outcome to an intervention (for example, tumor size versus treatment options). When more than two groups are being analyzed, the ANOVA technique may be used to test for an effect; a t test may be used if there are only two groups. A paired t test is used to examine two groups if the control population is linked on an individual basis, such as when a pre- and posttreatment comparison is made in the same patients. For radiologists, the comparisons may involve a new imaging technique, the use of contrast material, or a new MR imaging sequence.
A fundamental limitation of these tests is that they generate only an estimate of the probability that the differences observed would be due to random chance alone. This estimate is based not only on differences between means but also on sample variability and sample size. In addition, the assumption that the underlying population is distributed normally is not always appropriate, in which case special nonparametric techniques are available. In a more common misapplication, the t test is used inappropriately to compare two groups of categoric or binary data. Finally, use of the tests presented in this article is limited to comparisons between two variables (such as patient age and bone density), which may often oversimplify much more complex relationships. More complex relationships are best analyzed with other techniques, such as multiple regression or ANOVA.