Radiology
Published online before print February 28, 2003, 10.1148/radiol.2271020085
(Radiology 2003;227:1-4.)
© RSNA, 2003


Statistical Concepts Series

Hypothesis Testing II: Means¹

Richard Tello, MD, MSME, MPH and Philip E. Crewson, PhD

1 From the Department of Radiology, Boston University School of Medicine, 88 E Newton St, Atrium 2, Boston, MA 02118 (R.T.); and Health Services Research and Development Service, Department of Veterans Affairs, Washington, DC (P.E.C.). Received February 11, 2002; revision requested March 18; revision received April 1; accepted May 1. Address correspondence to R.T. (e-mail: tello@alum.mit.edu).


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 HYPOTHESIS TESTING FOR TWO...
 ANOVA
 THE MULTIPLE COMPARISON PROBLEM
 REFERENCES
 
Whenever means are reported in the literature, they are likely accompanied by tests to determine statistical significance. The t test is a common method for statistical evaluation of the difference between two sample means. It provides information on whether the means from two samples are likely to be different in the two populations from which the data originated. Similarly, paired t tests are common when comparing means from the same set of patients before and after an intervention. Analysis of variance techniques are used when a comparison involves more than two means. Each method serves a particular purpose, has its own computational formula, and uses a different sampling distribution to determine statistical significance. In this article, the authors discuss the basis behind analysis of continuous data with use of paired and unpaired t tests, the Bonferroni correction, and analysis of variance for readers of the radiology literature.


Index terms: Statistical analysis


    INTRODUCTION
To establish if there is a statistically significant difference in two groups that are measured with a continuous variable, such as patient height versus sex, a test of the hypothesis that there is no difference must be performed. In this article, we discuss the application of three commonly used methods for testing the difference of means. The three methods are the independent samples t test, the paired samples t test, and one-way analysis of variance (ANOVA). Each method is used to compare means obtained from continuous sample data, but each is designed to serve a particular purpose. All three approaches require normally distributed variables and produce an estimate of whether there is a significant difference between means. Independent samples t tests provide information on whether the means from two samples are likely to be different in the two populations from which the data originated. Paired samples t tests compare means from the same set of observations (patients) before and after an intervention is performed. ANOVA is used to test the differences between three or more sample means. The t test is a commonly used statistical test that is easy to calculate but can be misapplied (1,2). If a radiologist wishes to compare two proportions, such as the sensitivity of two tests, the appropriate test is the χ² test, which is addressed in other articles in this series. What follows in this article is a brief introduction to three techniques for testing the difference of means.


    HYPOTHESIS TESTING FOR TWO SAMPLE MEANS
The t test can be used to evaluate the difference between two means from two independent samples or between two samples for which the observations in the second sample are not independent of those in the first sample. The latter is commonly referred to as a paired t test. Both paired and unpaired t tests use the t sampling distribution to determine the P value. As reported in a previous article (3), test statistics are compared with sampling distributions to determine the probability of error. The t distribution is similar to the standard normal distribution, except that it compensates for small sample sizes (especially fewer than 30 observations). As total sample size of the two groups increases beyond 120 observations, the two sampling distributions are virtually identical. This attribute allows the t distribution to be used for all sample sizes, large or small.
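The convergence of the t distribution toward the standard normal distribution can be illustrated numerically. The following is a minimal sketch, not part of the original article: the helper names `t_pdf` and `t_two_sided_p` are hypothetical, and the tail area is obtained by simple numerical integration (Simpson's rule) rather than by a statistical library. It prints the two-sided tail area beyond 1.96 for several degrees of freedom alongside the normal value.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_two_sided_p(t, df, steps=2000):
    """Two-sided tail area beyond |t|, using Simpson's rule on [0, |t|]."""
    if steps % 2:
        raise ValueError("steps must be even for Simpson's rule")
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1.0 - 2.0 * area_zero_to_t


# Two-sided tail area beyond 1.96 under the standard normal distribution.
normal_p = 2 * (1 - NormalDist().cdf(1.96))

for df in (5, 10, 30, 120, 1000):
    print(df, round(t_two_sided_p(1.96, df), 4))
print("normal", round(normal_p, 4))
```

With few degrees of freedom the tail area is noticeably larger than the normal value, which is the sense in which the t distribution compensates for small samples; the gap shrinks as the degrees of freedom grow.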

Independent Samples t Tests
If a researcher wants to compare means collected from two patient populations, a t test for independent samples will often be used. The term independent samples indicates that none of the patients in one sample are included in the second sample. The following research question will serve as an example: Are T2s in magnetic resonance (MR) imaging of malignant hepatic masses different from those for benign hepatic masses? Table 1 presents summary statistics from two patient samples to answer this question (Figure). One sample includes only patients with malignant tumors. The second sample includes only patients with benign tumors (hemangiomas). The average T2 for malignant lesions is about 92 msec, while hemangiomas had an average T2 of 136 msec. The t test is used to test the null hypothesis that the mean T2 for malignant tumors is not different from that for benign tumors:

\[ H_0\colon \mu_1 = \mu_2 \]

To conduct this test, the difference between the two means is used in conjunction with the variation found in both samples (SD) and the sample sizes to compute a t test statistic. The t test formula is

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_{\bar{X}_1 - \bar{X}_2}} \]

where $\bar{X}_1$ is the mean for sample 1, $\bar{X}_2$ is the mean for sample 2, and $S_{\bar{X}_1 - \bar{X}_2}$ is the standard error of the difference between the two sample means. The pooled standard error of the difference between two sample means is

\[ S_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \, \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]

where $n_1$ is the size of sample 1, $n_2$ is the size of sample 2, $s_1^2$ is the variance of sample 1, and $s_2^2$ is the variance of sample 2.
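The pooled calculation can be sketched in a few lines of code. This is an illustrative example under stated assumptions, not the authors' computation: the function name `pooled_t_from_summary` is hypothetical, and the input values are invented round numbers, not the Table 1 data (which are not reproduced in the text).

```python
import math


def pooled_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Independent samples t statistic, assuming equal population variances.

    Returns the t statistic and its degrees of freedom (n1 + n2 - 2).
    """
    df = n1 + n2 - 2
    # Pooled variance: average of the two sample variances weighted by df.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two sample means.
    se_diff = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / se_diff, df


# Hypothetical summary statistics (NOT the values from Table 1).
t_stat, df = pooled_t_from_summary(mean1=10.0, sd1=2.0, n1=8,
                                   mean2=12.0, sd2=2.0, n2=8)
print(t_stat, df)
```

The resulting statistic is then compared with the t distribution with n1 + n2 − 2 degrees of freedom to obtain the P value.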



 
TABLE 1. Example of the Independent Samples (unpaired) t Test of T2s for Benign Hemangiomas and Malignant Lesions

 


 
Calculations for the independent samples t test statistic reported in Table 1.

 
As shown in Table 1 and the Figure, the calculations result in a t test statistic of 7.44, which, when compared with the t distribution, produces a P value of less than .001 (4).

Assuming a .05 cutoff P value, the null hypothesis is rejected in favor of the conclusion that on average, there is a statistically significant difference in the T2 for MR imaging of malignant tumors compared with that for benign tumors.

Homogeneity of Variance
The t test assumes that the variances in each group are equal (termed homogeneous). Alternative methods can be used when this assumption is not valid. Statistical software programs often automatically compute t tests and report results for both equal and unequal variances. Alternate approximations of the t test when the variances are unequal can be found in a publication by Rosner (5). Choosing between these results requires another statistical test to determine whether the variances are equal.

Determination of whether the assumption of equal variances is valid requires the use of an F test, which is a variance ratio test (6). This calculation tests the null hypothesis that the variances are equal. If the results of the F test are statistically significant (P < .05), this suggests that the variances of the two groups are not equal. If this occurs, two recommended solutions are to either modify the t test to compensate for unequal variances or use a nonparametric test, called the Mann-Whitney test (7).
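The two quantities discussed here can be sketched as follows. This is an illustration under assumptions, not the procedure from the cited references: the variance ratio is formed with the larger sample variance in the numerator (a common convention), and the unequal-variance modification shown is Welch's approximation with Satterthwaite degrees of freedom, which is one widely used option but is not named in the article. The function names and data are hypothetical, and no P value is computed because that would require the F and t distributions.

```python
import math
from statistics import mean, variance


def variance_ratio(sample_a, sample_b):
    """F statistic for equality of variances: larger variance over smaller.

    Returns (F, numerator df, denominator df).
    """
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate df; does not pool the variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    q_a = variance(sample_a) / n_a  # squared standard error of mean A
    q_b = variance(sample_b) / n_b  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(q_a + q_b)
    df = (q_a + q_b) ** 2 / (q_a ** 2 / (n_a - 1) + q_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical measurements, chosen only to make the arithmetic easy to follow.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]

f_stat, df_num, df_den = variance_ratio(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(f_stat, df_num, df_den)
print(t_welch, df_welch)
```

Note that the Welch degrees of freedom are generally not a whole number and are smaller than n1 + n2 − 2 when the variances differ.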

Paired t Tests
Two samples are paired when each data point of the first sample is matched and related to a data point in the second sample. This is common in studies in which measurements are collected from the same patients before and after intervention. Table 2 presents data on the size of cancerous tumors in the same patients before and after they received a new treatment. The research question represented in Table 2 is whether the new therapy affects tumor size. There are two groups represented by the same seven patients. One group is represented by the patient sample before therapy. The second group is represented by the same sample of patients after therapy. Before therapy, mean tumor size was 4.86 cm. After therapy, mean tumor size was 4.50 cm, representing a mean decrease of .36 cm. The t test is used to test the null hypothesis that the mean difference in tumor size between the groups before and after therapy is zero, with the assumption that this difference is distributed normally. The test is used to compare the observed difference obtained from the sample of seven patients with the hypothesized value of no difference in the population. The paired t test formula is

\[ t = \frac{\bar{d}}{S_{\bar{d}}} \]

where $\bar{d}$ is the observed mean difference, and $S_{\bar{d}}$ is the standard error of the observed mean difference. On the basis of a t test statistic of 5.21 with 6 degrees of freedom (Table 2), the probability of observing a difference at least this large if the null hypothesis of no change in size were true (ie, if the observed difference were due to random chance alone) is .002. Hence, the null hypothesis is rejected in favor of the conclusion that tumors shrank in patients who underwent therapy.
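The paired calculation can be sketched from raw before-and-after measurements. The example below is illustrative only: the function names are hypothetical, the four pairs of measurements are invented (they are not the Table 2 data), and the P value is obtained by simple numerical integration of the t density rather than by a statistical library. The final lines apply the same integration to the reported statistic (5.21 with 6 degrees of freedom).

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t statistic: mean difference over its standard error; df = n - 1."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    se_mean_diff = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / se_mean_diff, n - 1


def t_two_sided_p(t, df, steps=2000):
    """Two-sided tail area of Student's t beyond |t| (Simpson's rule)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps is assumed to be even
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1.0 - 2.0 * (total * h / 3)


# Hypothetical tumor sizes in cm (NOT the seven patients in Table 2).
before = [5.0, 4.0, 6.0, 5.5]
after = [4.5, 3.8, 5.4, 5.2]
t_stat, df = paired_t(before, after)
print(t_stat, df, t_two_sided_p(t_stat, df))

# The statistic reported in the text: t = 5.21 with 7 - 1 = 6 degrees of freedom.
p_reported = t_two_sided_p(5.21, 6)
print(p_reported)
```

Because each patient serves as his or her own control, only the within-patient differences enter the calculation; variation between patients does not inflate the standard error.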



 
TABLE 2. Example of the Paired t Test to Evaluate Tumor Size before and after Therapy in Seven Patients

 

    ANOVA
In the previous section, we compared the means of two normally distributed variables with the two-sample t test. When the means of more than two distributions must be compared, one-way ANOVA is used. With ANOVA, the means of two or more independent groups (each of which follows a normal distribution and has a similar SD) can be evaluated by comparing the variability between the groups with the variability within the groups.

ANOVA calculations are best performed with statistical software. Software easily capable of calculating the t test and variances includes Excel version 5.0 (Microsoft, Bothell, Wash); more sophisticated analyses, including ANOVA, can be performed with Stata version 5.0 (College Station, Tex), SPSS (Chicago, Ill), or SAS (Cary, NC). The basic approach is to compare the means and variances of independent groups to determine if the groups are significantly different from one another. The null hypothesis proposes that the samples come from populations with the same mean and variance. The alternative hypothesis is that at least two of the means are not the same. If ANOVA is used with two groups, it produces results equivalent to those obtained with the independent samples t test performed with equal variances assumed.

Table 3 presents data on vertebral bone density for three groups of women (8). The research question represented in Table 3 is whether the mean vertebral bone density varies among the three groups; hence, ANOVA is used to determine if there is a significant difference. In this example, age groups of 46–55 years, 56–65 years, and 66–75 years are used to represent the three populations for patient screening. As reported in Table 3, the mean vertebral bone density is 1.12 for women 46–55 years of age, 1.13 for women 56–65 years of age, and 1.03 for women 66–75 years of age. Calculation of the test statistic involves estimation of a ratio of the variance between the groups to the variance within the groups. The ratio, called the F statistic, is compared with the F sampling distribution instead of the t distribution discussed earlier (5). An F statistic of 1.0 occurs when the variance between the groups is the same as the variance within the groups.



 
TABLE 3. Example of ANOVA to Compare Differences in Vertebral Density in Three Age Groups of Women

 
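The F statistic is the mean squares between groups, divided by the mean squares within groups:

\[ F = \frac{MS_{\text{between}}}{MS_{\text{within}}} = \frac{\sum_{k=1}^{K} n_k(\bar{X}_k - \bar{X})^2 / (K - 1)}{\sum_{k=1}^{K} \sum_{i=1}^{n_k} (X_{ik} - \bar{X}_k)^2 / (N - K)} \]

where $\bar{X}_k$ is each group mean, $\bar{X}$ is the grand mean for all the groups [(sum of all scores)/N], $n_k$ is the number of patients in each group, K is the number of groups, and N is the total number of scores.

The sum of squares between groups is $\sum n_k(\bar{X}_k - \bar{X})^2$. The sum of squares within groups is $\sum(X_i - \bar{X}_1)^2 + \sum(X_i - \bar{X}_2)^2 + \sum(X_i - \bar{X}_3)^2$. The example presented in Table 3 can be calculated by using the F statistic formula; however, the descriptive data in Table 3 would require manipulation. The sum of squares within a group is that group's variance multiplied by one less than the number of scores in the group, $(n_k - 1)s_k^2$, when the variance is computed with the usual n − 1 denominator.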

The F statistic in Table 3 is 4.499. This indicates that there is greater variation between the group means than within each group. In this case, there is a statistically significant difference (P < .013) between at least two of the age groups.
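The manipulation of descriptive data into an F statistic can be sketched as follows. This is an illustrative example, not the authors' calculation: the function name `one_way_anova_f` is hypothetical, the three small groups are invented (they are not the Table 3 bone density data), and the variances are assumed to be sample variances computed with the n − 1 denominator.

```python
from statistics import mean, variance


def one_way_anova_f(summaries):
    """One-way ANOVA F statistic from per-group summaries.

    summaries: list of (n_k, mean_k, variance_k), with variance_k computed
    using the n - 1 denominator. Returns (F, df between, df within).
    """
    total_n = sum(n for n, _, _ in summaries)
    k = len(summaries)
    grand_mean = sum(n * m for n, m, _ in summaries) / total_n
    # Between-groups sum of squares: group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in summaries)
    # Within-groups sum of squares recovered from each group's variance.
    ss_within = sum((n - 1) * v for n, _, v in summaries)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)
    return ms_between / ms_within, k - 1, total_n - k


# Hypothetical raw scores for three groups (NOT the Table 3 data).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
summaries = [(len(g), mean(g), variance(g)) for g in groups]

f_stat, df_between, df_within = one_way_anova_f(summaries)
print(f_stat, df_between, df_within)
```

The F statistic is then compared with the F distribution with K − 1 and N − K degrees of freedom to obtain the P value.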


    THE MULTIPLE COMPARISON PROBLEM
The vertebral bone density example could also be analyzed by using three t tests (46–55-year-old group vs 56–65-year-old group; 46–55-year-old group vs 66–75-year-old group; and 56–65-year-old group vs 66–75-year-old group), an approach that is commonly taken (although often incorrectly) for simplicity of communication. Similarly, it is not uncommon for investigators who evaluate many outcomes to report statistical significance with P values at the .04 and .02 levels (9,10). This approach, however, leads to a multiple comparisons problem (11,12). In this situation, one may falsely conclude that there is a significant effect where there is none. In particular, with a .05 cutoff value for significance and 20 pairwise comparisons in which no true difference exists, one comparison would be expected to appear significant at the .05 level by chance alone (20 × .05 = 1).
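The arithmetic behind the multiple comparisons problem can be made concrete with a short sketch. It rests on two simplifying assumptions that are not stated in the article: every null hypothesis is true, and the tests are independent of one another. The function names are hypothetical.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of tests significant by chance when all nulls are true."""
    return alpha * n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false-positive result among independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 3, 20):
    print(n_tests,
          expected_false_positives(0.05, n_tests),
          round(familywise_error_rate(0.05, n_tests), 3))
```

Under these assumptions, three comparisons at the .05 level already carry roughly a 14% chance of at least one spurious finding, and 20 comparisons carry a chance well above one-half.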

There are corrections for this problem (11,14). Some are useful for unordered groups, such as patient height versus sex, while others are applied to ordered groups (to evaluate a trend), such as patient height versus sex when stratified by age. It is also worth noting that there is a debate about which method to use, and some hold the view that this correction is overused (13). We focus our attention solely on unordered groups and the most commonly used correction, the Bonferroni method.

The Bonferroni correction adjusts the threshold for significance (14): the adjusted threshold is equal to the desired significance level (eg, .05, .01) divided by the number of comparisons being made. Consequently, when multiple statistical tests are conducted between the same variables, which would occur if multiple t tests were conducted for a comparison of age and bone density, the significance cutoff value is adjusted to represent a more conservative estimate of statistical significance. One limitation of the Bonferroni correction is that by reducing the level of significance associated with each test, we have reduced the power of the test, thereby increasing the chance of incorrectly keeping the null hypothesis (a type II error).

Table 4 presents the results of t tests by using the same data presented in Table 3. Use of the common threshold of .05 would result in the conclusion that there is a significant difference in vertebral bone density between those 46–55 years of age and those 66–75 years of age (P = .021) and also between those 56–65 years of age and those 66–75 years of age (P = .006). However, compensating for multiple comparisons would reduce the threshold from .05 to .017 (.05/3). As a result, only the comparison between those 56–65 years of age and those 66–75 years of age reaches statistical significance.
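The threshold adjustment can be sketched with the two unadjusted P values quoted in the text. This is a minimal illustration with hypothetical function names; the P value for the third comparison (46–55 vs 56–65 years) is not stated in the text and is therefore omitted, although it still counts toward the three comparisons. The simple products computed by `bonferroni_adjust` need not equal the adjusted values that statistical software reports, because software may base its post hoc tests on a different error term.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def bonferroni_adjust(p_value, n_comparisons):
    """Adjusted P value for direct comparison with the original alpha."""
    return min(1.0, p_value * n_comparisons)


N_COMPARISONS = 3  # three pairwise age-group comparisons

# Unadjusted P values quoted in the text for two of the three comparisons.
reported = {
    "46-55 vs 66-75": 0.021,
    "56-65 vs 66-75": 0.006,
}

threshold = bonferroni_threshold(0.05, N_COMPARISONS)
significant = {label: p < threshold for label, p in reported.items()}
adjusted = {label: bonferroni_adjust(p, N_COMPARISONS)
            for label, p in reported.items()}

print(round(threshold, 3))
print(significant)
print(adjusted)
```

Comparing each unadjusted P value with the reduced threshold and comparing each adjusted P value with the original .05 cutoff are two equivalent ways of applying the same correction.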



 
TABLE 4. Adjustment for Multiple Comparisons of Vertebral Density according to Age Group

 
The far right column of Table 4 shows the exact probability of error with use of the Bonferroni adjustment provided by SPSS statistical software. The Bonferroni adjustment is a common option in statistical software when using ANOVA to determine if the group means are different from each other. The P value for each comparison is adjusted so it can be compared directly with the P < .05 cutoff. Again, by using P < .05 as the standard for significance, the results of the Bonferroni adjustment listed in Table 4 indicate that the only statistically significant difference is between those 56–65 years of age and those 66–75 years of age (P = .015).

In summary, it is often necessary to test hypotheses that relate a continuous outcome to an intervention (for example, tumor size versus treatment options). When more than two groups are being analyzed, the ANOVA technique may be used to test for an effect; a t test may be used if there are only two groups. A paired t test is used to examine two groups if the control population is linked on an individual basis, such as when a pre- and posttreatment comparison is made in the same patients. For radiologists, the comparisons may involve a new imaging technique, the use of contrast material, or a new MR imaging sequence.

Fundamental limitations in using these tests include the understanding that they generate an estimate of the probability that the differences observed would be due to random chance alone. This estimate is based not only on differences between means but also on sample variability and sample size. In addition, the assumption that the underlying population is distributed normally is not always appropriate—in which case, special nonparametric techniques are available. In a more common misapplication, the t test is used inappropriately to compare two groups of categoric or binary data. Finally, use of the tests presented in this article is limited to comparisons between two variables (such as patient age and bone density), which may often oversimplify much more complex relationships. More complex relationships are best analyzed with other techniques, such as multiple regression or ANOVA.


    ACKNOWLEDGMENTS
 
The authors thank Miriam Bowdre for secretarial support in preparation of the manuscript.


    FOOTNOTES
 
Abbreviation: ANOVA = analysis of variance


    REFERENCES

  1. Mullner M, Matthews H, Altman DG. Reporting on statistical methods to adjust for confounding: a cross-sectional survey. Ann Intern Med 2002; 136:122-126.
  2. Bland JM, Altman DG. One and two sided tests of significance. BMJ 1994; 309:248.
  3. Zou KH, Fielding JR, Silverman SG, Tempany CM. Hypothesis testing I: proportions. Radiology 2003; 226:609-613.
  4. Fenlon HM, Tello R, deCarvalho VLS, Yucel EK. Signal characteristics of focal liver lesions on double echo T2-weighted conventional spin echo MRI: observer performance versus quantitative measurements of T2 relaxation times. J Comput Assist Tomogr 2000; 24:204-211.
  5. Rosner B. Fundamentals of biostatistics. 4th ed. Boston, Mass: Duxbury, 1995; 270-273.
  6. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall/CRC, 1997.
  7. Rosner B. Fundamentals of biostatistics. 4th ed. Boston, Mass: Duxbury, 1995; 570-575.
  8. Bachman DM, Crewson PE. Comparison of central DXA with heel ultrasound and finger DXA for detection of osteoporosis. J Clin Densitom 2002; 5:131-141.
  9. Sheafor DH, Keogan MT, Delong DM, Nelson RC. Dynamic helical CT of the abdomen: prospective comparison of pre- and postprandial contrast enhancement. Radiology 1998; 206:359-363.
  10. Tello R, Seltzer SE. Hepatic contrast-enhanced CT: statistical design for prospective analysis. Radiology 1998; 209:879-881.
  11. Gonen M, Panageas KS, Larson SM. Statistical issues in analysis of diagnostic imaging experiments with multiple observations per patient. Radiology 2001; 221:763-767.
  12. Ware JH, Mosteller F, Delgado F, Donnelly C, Ingelfinger JA. P values. In: Bailar JC III, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, Mass: NEJM Books, 1992; 181.
  13. Perneger TV. What’s wrong with Bonferroni adjustments. BMJ 1998; 316:1236-1238.
  14. Armitage P, Berry G. Statistical methods in medical research. 3rd ed. Oxford, England: Blackwell Scientific, 1994; 331.


