MicrobiologyBytes: Maths & Computers for Biologists: ANOVA with SPSS (Updated: February 6, 2009)

ANOVA with SPSS

Never, ever, run any statistical test without performing EDA first!

What's wrong with t-tests?
Nothing, except ...
If you want to compare three groups using t-tests at the usual 0.05 level of significance, you have to compare the groups pairwise (A to B, A to C, B to C), so the chance of getting at least one false positive would be:

1 - (0.95 × 0.95 × 0.95)   =   14.3%

If you wanted to compare four groups, there would be six pairwise comparisons, so the chance of at least one false positive would be 1 - (0.95)^6 = 26%, and for five groups (ten comparisons), 40%. Not good, is it? So we use ANOVA. Never perform multiple t-tests: anyone on this module discovered performing multiple t-tests when they should use ANOVA will be shot!
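To see where these figures come from: with k groups there are k(k-1)/2 pairwise comparisons, and the chance of at least one false positive across n comparisons at the 0.05 level is 1 - 0.95^n. A quick check in plain Python:

```python
# Familywise error rate for all pairwise t-tests at alpha = 0.05
alpha = 0.05
for groups in (3, 4, 5):
    comparisons = groups * (groups - 1) // 2      # 3, 6, 10 pairwise tests
    fwer = 1 - (1 - alpha) ** comparisons         # P(at least one false positive)
    print(f"{groups} groups: {comparisons} t-tests, familywise error = {fwer:.1%}")
# 3 groups: 3 t-tests, familywise error = 14.3%
# 4 groups: 6 t-tests, familywise error = 26.5%
# 5 groups: 10 t-tests, familywise error = 40.1%
```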

ANalysis Of VAriance (ANOVA) is such an important statistical method that it would be easy to spend a whole module on this test alone. Like the t-test, ANOVA is a parametric test which assumes:

- the data is approximately normally distributed
- the variances of the groups are similar (homogeneity of variance)
- the observations are independent of one another

so it's important to carry out EDA before starting ANOVA! In fact, ANOVA is quite a robust procedure, so as long as the group sizes are similar, the test is normally reliable.

ANOVA tests the null hypothesis that the means of all the groups being compared are equal, and produces a statistic called F which plays the same role as the t-statistic from a t-test. But there's a catch. If the means of all the groups tested by ANOVA are equal, fine. But if the result tells us to reject the null hypothesis, we still don't know which of the means differ. We solve this problem by performing what is known as a "post hoc" (after the event) test.

ANOVA jargon: a factor is an independent variable (a test or treatment), and the levels of a factor are the groups being compared (e.g. the four drugs below).

The array of options for different ANOVA tests in SPSS is confusing, so I'll go through the most important bits using some examples.

One-Way / Single-Factor ANOVA:

Data:

 
Pain Scores for Analgesics

Drug          Pain scores
Diclofenac    0, 35, 31, 29, 20, 7, 43, 16
Ibuprofen     30, 40, 27, 25, 39, 15, 30, 45
Paracetamol   16, 33, 25, 32, 21, 54, 57, 19
Aspirin       55, 58, 56, 57, 56, 53, 59, 55

Since it would be unethical to withhold pain relief, there is no control group; we are simply interested in whether one drug performs better (i.e. gives a lower pain score) than another, so we need to perform a one-way (single-factor) ANOVA.
We enter this data into SPSS using dummy values (1, 2, 3, 4) for the drugs so that this categorical variable can be used in the ANOVA:

[Screenshot: SPSS Data View with the coded pain data]

It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!

EDA (Analyze: Descriptive Statistics: Explore) shows that the data is approximately normally distributed, so we can proceed with the ANOVA:

Analyze: Compare Means: One-Way ANOVA

Dependent variable: Pain Score
Factor: Drug:

[Screenshot: One-Way ANOVA dialog]

Output:

Test of Homogeneity of Variances: Pain
Levene Statistic df1 df2 Sig.
4.837 3 28 .008

The significance value for homogeneity of variances is <.05, so the variances of the groups are significantly different. Since homogeneity of variance is an assumption of ANOVA, we need to be very careful in interpreting the outcome of the test:

ANOVA: Pain

Sum of Squares df Mean Square F Sig.
Between Groups 4956.375 3 1652.125 11.967 .000
Within Groups 3865.500 28 138.054    
Total 8821.875 31      

This is the main ANOVA result. The significance value comparing the groups (drugs) is <.05, so we could reject the null hypothesis (that there is no difference in the mean pain scores for the four drugs). However, since the variances are significantly different, this might be the wrong answer. Fortunately, the Welch and Brown-Forsythe statistics can still be used in these circumstances:

Robust Tests of Equality of Means: Pain

Statistic df1 df2 Sig.
Welch 32.064 3 12.171 .000
Brown-Forsythe 11.967 3 18.889 .000

The significance values of both tests are <.05, so we still reject the null hypothesis. However, this result does not tell us which drugs are responsible for the difference, so we need the post hoc test results:

Multiple Comparisons
Dependent Variable: Pain

(I) Drug (J) Drug Mean Difference (I-J) Std. Error Sig. 95% Confidence Interval
Lower Bound Upper Bound
Tukey HSD 1 2 -8.750 5.875 .457 -24.79 7.29
3 -9.500 5.875 .386 -25.54 6.54
4 -33.500(*) 5.875 .000 -49.54 -17.46
2 1 8.750 5.875 .457 -7.29 24.79
3 -.750 5.875 .999 -16.79 15.29
4 -24.750(*) 5.875 .001 -40.79 -8.71
3 1 9.500 5.875 .386 -6.54 25.54
2 .750 5.875 .999 -15.29 16.79
4 -24.000(*) 5.875 .002 -40.04 -7.96
4 1 33.500(*) 5.875 .000 17.46 49.54
2 24.750(*) 5.875 .001 8.71 40.79
3 24.000(*) 5.875 .002 7.96 40.04
Games-Howell 1 2 -8.750 6.176 .513 -27.05 9.55
3 -9.500 7.548 .602 -31.45 12.45
4 -33.500(*) 5.194 .001 -50.55 -16.45
2 1 8.750 6.176 .513 -9.55 27.05
3 -.750 6.485 .999 -20.09 18.59
4 -24.750(*) 3.471 .001 -36.03 -13.47
3 1 9.500 7.548 .602 -12.45 31.45
2 .750 6.485 .999 -18.59 20.09
4 -24.000(*) 5.558 .014 -42.26 -5.74
4 1 33.500(*) 5.194 .001 16.45 50.55
2 24.750(*) 3.471 .001 13.47 36.03
3 24.000(*) 5.558 .014 5.74 42.26
* The mean difference is significant at the .05 level.

The Tukey test relies on homogeneity of variance, so we ignore these results. The Games-Howell post hoc test does not rely on homogeneity of variance (this is why we requested two different post hoc tests) and so can be used. SPSS kindly flags (*) which differences are significant!
Result: Drug 4 (Aspirin) produces significantly different (higher) pain scores than the other three drugs.
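As a cross-check outside SPSS, here is a minimal sketch with SciPy (assuming SciPy 1.8 or later for tukey_hsd; SciPy has no built-in Games-Howell test) that reproduces the Levene, ANOVA and Tukey results for the same pain scores:

```python
from scipy import stats

# Pain scores from the table above
diclofenac  = [0, 35, 31, 29, 20, 7, 43, 16]
ibuprofen   = [30, 40, 27, 25, 39, 15, 30, 45]
paracetamol = [16, 33, 25, 32, 21, 54, 57, 19]
aspirin     = [55, 58, 56, 57, 56, 53, 59, 55]
groups = [diclofenac, ibuprofen, paracetamol, aspirin]

# Levene's test of homogeneity of variances (SPSS uses the mean-centred form)
print(stats.levene(*groups, center="mean"))    # expect W ~ 4.84, p ~ .008

# One-way ANOVA: expect F(3, 28) ~ 11.97, p < .001
print(stats.f_oneway(*groups))

# Tukey HSD post hoc test; for Games-Howell you need a third-party package
print(stats.tukey_hsd(*groups))
```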

Formal Reporting:  When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:

There is a significant difference between the pain scores for aspirin and the other three drugs tested, F(3, 28) = 11.97, p < .05. (Given the unequal variances, Welch's F(3, 12.17) = 32.06 could be reported instead.)

 

Two-Factor ANOVA

Do anti-cancer drugs have different effects in males and females?
Data:

Drug:          cisplatin        vinblastine      5-fluorouracil
Gender:        Female   Male    Female   Male    Female   Male
Tumour Size:       65     50        70     45        55     35
                   70     55        65     60        65     40
                   60     80        60     85        70     35
                   60     65        70     65        55     55
                   60     70        65     70        55     35
                   55     75        60     70        60     40
                   60     75        60     80        50     45
                   50     65        50     60        50     40

We enter this data into SPSS using dummy values for the drugs (1, 2, 3) and genders (1,2) so the coded data can be used in the ANOVA:

[Screenshot: SPSS Data View with the coded tumour data]

It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!

EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:

Analyze: General Linear Model: Univariate

Dependent variable: Tumour Diameter
Fixed Factors: Gender, Drug:

[Screenshot: Univariate dialog]

Also select:

Post Hoc: Tukey and Games-Howell:

[Screenshot: Post Hoc dialog]

Options:

Display Means for: Gender, Drug, Gender*Drug
Descriptive Statistics
Homogeneity tests:

[Screenshot: Options dialog]

Output:

Levene's Test of Equality of Error Variances(a)
Dependent Variable: Diameter
F df1 df2 Sig.
1.462 5 42 .223
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a Design: Intercept+Gender+Drug+Gender * Drug

The significance result for homogeneity of variance is >.05, which shows that the error variance of the dependent variable is equal across the groups, i.e. the assumption of the ANOVA test has been met.

Tests of Between-Subjects Effects
Dependent Variable: Diameter
Source Type III Sum of Squares df Mean Square F Sig.
Corrected Model 3817.188(a) 5 763.438 10.459 .000
Intercept 167442.188 1 167442.188 2294.009 .000
Gender 42.188 1 42.188 .578 .451
Drug 2412.500 2 1206.250 16.526 .000
Gender * Drug 1362.500 2 681.250 9.333 .000
Error 3065.625 42 72.991    
Total 174325.000 48      
Corrected Total 6882.813 47      
a R Squared = .555 (Adjusted R Squared = .502)

The Drug effect and the Gender * Drug interaction are significant (p < .05), but there is no main effect of gender (p = .451). Note that the significant interaction means the drugs do behave differently in males and females, even though the two genders have similar tumour diameters overall. Again, this does not tell us which drugs behave differently, so again we need to look at the post hoc tests:

Multiple Comparisons
Dependent Variable: Diameter

(I) Drug (J) Drug Mean Difference (I-J) Std. Error Sig. 95% Confidence Interval
Lower Bound Upper Bound
Tukey HSD cisplatin vinblastine -1.25 3.021 .910 -8.59 6.09
5-fluorouracil 14.38(*) 3.021 .000 7.04 21.71
vinblastine cisplatin 1.25 3.021 .910 -6.09 8.59
5-fluorouracil 15.63(*) 3.021 .000 8.29 22.96
5-fluorouracil cisplatin -14.38(*) 3.021 .000 -21.71 -7.04
vinblastine -15.63(*) 3.021 .000 -22.96 -8.29
Games-Howell cisplatin vinblastine -1.25 3.329 .925 -9.46 6.96
5-fluorouracil 14.38(*) 3.534 .001 5.64 23.11
vinblastine cisplatin 1.25 3.329 .925 -6.96 9.46
5-fluorouracil 15.63(*) 3.699 .001 6.50 24.75
5-fluorouracil cisplatin -14.38(*) 3.534 .001 -23.11 -5.64
vinblastine -15.63(*) 3.699 .001 -24.75 -6.50
Based on observed means.
* The mean difference is significant at the .05 level.

In this example, we can use the Tukey or Games-Howell results. Again, SPSS helpfully flags which results have reached statistical significance. We already know from the main ANOVA table that the effect of gender is not significant, but the post hoc tests show which drugs produce significantly different outcomes.
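As a cross-check outside SPSS, here is a minimal sketch using pandas and statsmodels; the long-format DataFrame and column names are my own layout assumption, and Sum-coded contrasts with typ=3 are needed to reproduce SPSS's Type III sums of squares:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per tumour measurement, rebuilt from the table above
cells = {
    ("cisplatin",      "Female"): [65, 70, 60, 60, 60, 55, 60, 50],
    ("cisplatin",      "Male"):   [50, 55, 80, 65, 70, 75, 75, 65],
    ("vinblastine",    "Female"): [70, 65, 60, 70, 65, 60, 60, 50],
    ("vinblastine",    "Male"):   [45, 60, 85, 65, 70, 70, 80, 60],
    ("5-fluorouracil", "Female"): [55, 65, 70, 55, 55, 60, 50, 50],
    ("5-fluorouracil", "Male"):   [35, 40, 35, 55, 35, 40, 45, 40],
}
df = pd.DataFrame(
    [(drug, sex, size) for (drug, sex), sizes in cells.items() for size in sizes],
    columns=["Drug", "Gender", "Diameter"],
)

# Sum-to-zero contrasts + typ=3 match SPSS's Type III sums of squares
model = ols("Diameter ~ C(Gender, Sum) * C(Drug, Sum)", data=df).fit()
print(sm.stats.anova_lm(model, typ=3))
# Expect (as in the SPSS table): Gender F ~ 0.58 (p ~ .45),
# Drug F ~ 16.53, Gender:Drug interaction F ~ 9.33
```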

Formal Reporting:  When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:

There is a significant difference between the tumour diameters for 5-fluorouracil and the other two drugs tested, F(2, 42) = 16.53, p < .05.

 

 

Repeated Measures ANOVA

Remember that one of the assumptions of ANOVA is independence of the groups being compared. In lots of circumstances, we want to test the same subjects repeatedly, e.g. measuring the same patients before, during and after treatment, or treating the same fields in successive seasons.

This type of study reduces variability in the data and so increases the power to detect effects, but violates the assumption of independence, so as with the paired t-test, we need to use a special form of ANOVA called repeated measures. In a parametric test, the assumption that the variances of the differences between all pairs of groups are equal is called "sphericity". If sphericity is violated, the F statistic no longer follows the standard F distribution, so the significance value calculated from it cannot be trusted. SPSS includes a procedure called Mauchly's test which tells us if the assumption of sphericity has been violated:

If Mauchly’s test is significant then we cannot trust the F-ratios produced by SPSS unless we apply a correction (which, fortunately, SPSS helps us to do).

One-Way Repeated Measures ANOVA

i.e. one independent variable, e.g. pain score after surgery:

Patient1   Patient2   Patient3
1          3          1
2          5          3
4          6          6
5          7          4
5          9          1
6          10         3

This data can be entered directly into SPSS. Note that each column represents a repeated measures variable (patients in this case). There is no need for a coding variable (as with between-group designs, above):

[Screenshot: SPSS Data View]

It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret! Next:

Analyze: General Linear Model: Repeated Measures

[Screenshot: Repeated Measures Define Factor(s) dialog]

Within-Subject factor name: Patient
Number of Levels: 3 (because there are 3 patients)
Click Add, then Define (factors):

[Screenshot: Repeated Measures dialog]

There are no proper post hoc tests for repeated measures variables in SPSS. However, via the Options button, you can use the paired t-test procedure to compare all pairs of levels of the independent variable, and then apply a Bonferroni correction to the probability at which you accept any of these tests. The resulting probability value should be used as the criterion for statistical significance. A ‘Bonferroni correction’ is achieved by dividing the probability value (usually 0.05) by the number of tests conducted, e.g. if we compare all levels of the independent variable of these data, we make three comparisons and so the appropriate significance level is 0.05/3 = 0.0167. Therefore, we accept t-tests as being significant only if they have a p value <0.0167.

[Screenshot: Options dialog]
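The same Bonferroni logic is easy to sketch outside SPSS with SciPy's paired t-test; the patient columns are taken from the table above, and 0.05/3 = 0.0167 is the corrected threshold described in the previous paragraph:

```python
from itertools import combinations
from scipy import stats

# One column of pain scores per patient, from the table above
scores = {
    "Patient1": [1, 2, 4, 5, 5, 6],
    "Patient2": [3, 5, 6, 7, 9, 10],
    "Patient3": [1, 3, 6, 4, 1, 3],
}

pairs = list(combinations(scores, 2))   # 3 pairwise comparisons
alpha = 0.05 / len(pairs)               # Bonferroni-corrected threshold: 0.0167
for a, b in pairs:
    t, p = stats.ttest_rel(scores[a], scores[b])
    print(f"{a} vs {b}: t = {t:.2f}, p = {p:.4f}, significant: {p < alpha}")
```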

Output:

Mauchly's Test of Sphericity

Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon (Greenhouse-Geisser / Huynh-Feldt / Lower-bound)
patient                  .094          9.437                2    .009   .525 / .544 / .500

Mauchly’s test is significant (p <.05) so we conclude that the assumption of sphericity has not been met.

Tests of Within-Subjects Effects
Source
Type III Sum of Squares df Mean Square F Sig.
patient Sphericity Assumed 44.333 2 22.167 8.210 .008
Greenhouse-Geisser 44.333 1.050 42.239 8.210 .033
Huynh-Feldt 44.333 1.088 40.752 8.210 .031
Lower-bound 44.333 1.000 44.333 8.210 .035
Error(patient) Sphericity Assumed 27.000 10 2.700    
Greenhouse-Geisser 27.000 5.248 5.145    
Huynh-Feldt 27.000 5.439 4.964    
Lower-bound 27.000 5.000 5.400    

Because the significance values are <.05, we conclude that there was a significant difference between the three patients, but this test does not tell us which patients differed from each other. The next issue is which of the three corrections to use: going back to Mauchly's test, the Greenhouse-Geisser epsilon is .525, and when epsilon is below 0.75 the Greenhouse-Geisser correction (the most conservative) is the usual choice.

Post Hoc Tests:

Pairwise Comparisons
(I) patient (J) patient Mean Difference (I-J) Std. Error Sig.(a) 95% Confidence Interval for Difference(a)
Lower Bound Upper Bound
1 2 -2.833(*) .401 .003 -4.252 -1.415
3 .833 .946 1.000 -2.509 4.176
2 1 2.833(*) .401 .003 1.415 4.252
3 3.667 1.282 .106 -.865 8.199
3 1 -.833 .946 1.000 -4.176 2.509
2 -3.667 1.282 .106 -8.199 .865
Based on estimated marginal means
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.

Formal reporting:

Mauchly's test indicated that the assumption of sphericity had been violated (chi-square = 9.44, p < .05), therefore degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (epsilon = 0.53). The results show that the pain scores of the three patients differed significantly, F(1.05, 5.25) = 8.21, p < .05. Post hoc tests revealed that although the pain score of Patient2 was significantly higher than that of Patient1 (p = .003), Patient3's score was not significantly different from either of the other patients (both p > .05).
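If you want to reproduce the uncorrected (sphericity-assumed) F-ratio outside SPSS, statsmodels' AnovaRM fits the same one-way repeated measures model; note that it applies no Greenhouse-Geisser or Huynh-Feldt correction, so those still have to come from SPSS or be computed by hand. The long-format layout and column names here are my own assumption:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# The repeated-measures 'subject' here is the measurement occasion (rows 1-6
# of the table above); 'patient' is the within-subject factor, as in SPSS
wide = {"Patient1": [1, 2, 4, 5, 5, 6],
        "Patient2": [3, 5, 6, 7, 9, 10],
        "Patient3": [1, 3, 6, 4, 1, 3]}
df = pd.DataFrame(
    [(occasion, patient, score)
     for patient, column in wide.items()
     for occasion, score in enumerate(column, start=1)],
    columns=["occasion", "patient", "pain"],
)

res = AnovaRM(df, depvar="pain", subject="occasion", within=["patient"]).fit()
print(res)   # expect the sphericity-assumed F(2, 10) ~ 8.21, p ~ .008
```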

 

Two-Way Repeated Measures ANOVA

i.e. two independent variables:

In a study of the best way to keep fields free of weeds for an entire growing season, a farmer treated test plots in 10 fields with five different concentrations of weedkiller or five different lengths of blast with a flamethrower, giving two independent variables: the type of treatment (weedkiller or flamethrower) and the severity of treatment (five levels of each). At the end of the growing season, the number of weeds per square metre was counted. To exclude bias (e.g. a pre-existing seedbank in the soil), the farmer repeated the experiment the following year, but this time the treatments the fields received were reversed:

Treatment:    Weedkiller                     Flamethrower
Severity:      1    2    3    4    5          1    2    3    4    5
Field1        10   15   18   22   37          9   13   13   18   22
Field2        10   18   10   42   60          7   14   20   21   32
Field3         7   11   28   31   56          9   13   24   30   35
Field4         9   19   36   45   60          7   14    9   20   25
Field5        15   14   29   33   37         14   13   20   22   29
Field6        14   13   26   26   49          5   12   17   16   33
Field7         9   12   19   37   48          5   15   12   17   24
Field8         9   18   22   31   39         13   13   14   17   17
Field9        12   14   24   28   53         12   13   21   19   22
Field10        7   11   21   23   45         12   14   20   21   29

SPSS Data View:

[Screenshot: SPSS Data View]

It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret:

[Screenshot: SPSS Variable View]

Analyze: General Linear Model: Repeated Measures
Define Within Subject Factors (remember, "factor" = test or treatment):

Treatmen (2 treatments, weedkiller or flamethrower; SPSS only allows 8 characters for the name, hence "treatmen")
Severity (5 different severities):

[Screenshot: Define Factor(s) dialog]

Click Define and define Within Subject Variables:

[Screenshot: Repeated Measures dialog]

As above, there are no post hoc tests for repeated measures ANOVA in SPSS, but via the Options button we can apply a Bonferroni correction to the probability at which we accept any of the pairwise tests:

[Screenshot: Options dialog]

Output:

Mauchly's Test of Sphericity(b)
Measure: MEASURE_1

Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon (Greenhouse-Geisser / Huynh-Feldt / Lower-bound)
treatmen                 1.000         .000                 0    .      1.000 / 1.000 / 1.000
severity                 .092         17.685                9    .043    .552 /  .740 /  .250
treatmen * severity      .425          6.350                9    .712    .747 / 1.000 /  .250

The outcome of Mauchly's test is significant (p < .05) for the severity of treatment, so we need to correct the F-values for this, but not for the treatments themselves (with only two treatment levels, sphericity cannot be violated, which is why SPSS prints no significance value on that row).

Tests of Within-Subjects Effects
Source
Type III Sum of Squares df Mean Square F Sig.
treatmen Sphericity Assumed 1730.560 1 1730.560 34.078 .000
Greenhouse-Geisser 1730.560 1.000 1730.560 34.078 .000
Huynh-Feldt 1730.560 1.000 1730.560 34.078 .000
Lower-bound 1730.560 1.000 1730.560 34.078 .000
Error(treatmen) Sphericity Assumed 457.040 9 50.782    
Greenhouse-Geisser 457.040 9.000 50.782    
Huynh-Feldt 457.040 9.000 50.782    
Lower-bound 457.040 9.000 50.782    
severity Sphericity Assumed 9517.960 4 2379.490 83.488 .000
Greenhouse-Geisser 9517.960 2.209 4309.021 83.488 .000
Huynh-Feldt 9517.960 2.958 3217.666 83.488 .000
Lower-bound 9517.960 1.000 9517.960 83.488 .000
Error(severity) Sphericity Assumed 1026.040 36 28.501    
Greenhouse-Geisser 1026.040 19.880 51.613    
Huynh-Feldt 1026.040 26.622 38.541    
Lower-bound 1026.040 9.000 114.004    
treatmen * severity Sphericity Assumed 1495.240 4 373.810 20.730 .000
Greenhouse-Geisser 1495.240 2.989 500.205 20.730 .000
Huynh-Feldt 1495.240 4.000 373.810 20.730 .000
Lower-bound 1495.240 1.000 1495.240 20.730 .001
Error(treatmen*severity) Sphericity Assumed 649.160 36 18.032    
Greenhouse-Geisser 649.160 26.903 24.129    
Huynh-Feldt 649.160 36.000 18.032    
Lower-bound 649.160 9.000 72.129    

Since there was no violation of sphericity for the treatment factor, we can look at the comparison of the two treatments without any correction. The significance value (.000) shows that there was a significant difference between the two treatments; as treatment has only two levels, no post hoc test is needed: the flamethrower plots had fewer weeds, on average, than the weedkiller plots.
The output also shows the effect of the severity of treatment, but remember there was a violation of sphericity here, so we must look at the corrected F-ratios. All of the corrected values are highly significant; we use the Greenhouse-Geisser corrected values as these are the most conservative.

Pairwise Comparisons
(I) severity (J) severity Mean Difference (I-J) Std. Error Sig.(a) 95% Confidence Interval for Difference(a)
Lower Bound Upper Bound
1 2 -4.200(*) .895 .011 -7.502 -.898
3 -10.400(*) 1.190 .000 -14.790 -6.010
4 -16.200(*) 1.764 .000 -22.709 -9.691
5 -27.850(*) 2.398 .000 -36.698 -19.002
2 1 4.200(*) .895 .011 .898 7.502
3 -6.200(*) 1.521 .028 -11.810 -.590
4 -12.000(*) 1.280 .000 -16.723 -7.277
5 -23.650(*) 2.045 .000 -31.197 -16.103
3 1 10.400(*) 1.190 .000 6.010 14.790
2 6.200(*) 1.521 .028 .590 11.810
4 -5.800 1.690 .075 -12.036 .436
5 -17.450(*) 2.006 .000 -24.852 -10.048
4 1 16.200(*) 1.764 .000 9.691 22.709
2 12.000(*) 1.280 .000 7.277 16.723
3 5.800 1.690 .075 -.436 12.036
5 -11.650(*) 1.551 .000 -17.373 -5.927
5 1 27.850(*) 2.398 .000 19.002 36.698
2 23.650(*) 2.045 .000 16.103 31.197
3 17.450(*) 2.006 .000 10.048 24.852
4 11.650(*) 1.551 .000 5.927 17.373
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.

This shows that there was only one pair of severity levels with no significant difference: levels 3 and 4 (p = .075); the differences between all the other pairs of severities are significant. Together with the main effects, the conclusion is that both the type of treatment (the flamethrower plots averaged about 17 weeds per square metre against about 26 for the weedkiller plots) and the severity of treatment make a difference to weed control.

Formal report:

There was a significant main effect of the type of treatment, F(1, 9) = 34.08, p < .001.
There was a significant main effect of the severity of treatment, F(2.21, 19.88) = 83.49, p < .001.
There was also a significant treatment * severity interaction, F(4, 36) = 20.73, p < .001.
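The same two-way repeated measures model can be cross-checked outside SPSS with statsmodels' AnovaRM, which accepts two within-subject factors but, again, reports only the sphericity-assumed F-ratios (so the Greenhouse-Geisser correction for severity still has to come from SPSS). The reshaping into long format is my own assumption about the data layout:

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Weed counts for the 10 fields at severities 1-5, from the table above
weedkiller = [
    [10, 15, 18, 22, 37], [10, 18, 10, 42, 60], [7, 11, 28, 31, 56],
    [9, 19, 36, 45, 60],  [15, 14, 29, 33, 37], [14, 13, 26, 26, 49],
    [9, 12, 19, 37, 48],  [9, 18, 22, 31, 39],  [12, 14, 24, 28, 53],
    [7, 11, 21, 23, 45],
]
flamethrower = [
    [9, 13, 13, 18, 22],  [7, 14, 20, 21, 32],  [9, 13, 24, 30, 35],
    [7, 14, 9, 20, 25],   [14, 13, 20, 22, 29], [5, 12, 17, 16, 33],
    [5, 15, 12, 17, 24],  [13, 13, 14, 17, 17], [12, 13, 21, 19, 22],
    [12, 14, 20, 21, 29],
]

# Long format: one row per field x treatment x severity
rows = [
    (field + 1, name, severity + 1, counts[field][severity])
    for field in range(10)
    for name, counts in (("weedkiller", weedkiller), ("flamethrower", flamethrower))
    for severity in range(5)
]
df = pd.DataFrame(rows, columns=["field", "treatment", "severity", "weeds"])

res = AnovaRM(df, depvar="weeds", subject="field",
              within=["treatment", "severity"]).fit()
print(res)   # expect treatment F(1, 9) ~ 34.1, severity F(4, 36) ~ 83.5,
             # interaction F(4, 36) ~ 20.7 (all sphericity-assumed)
```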

 


© MicrobiologyBytes 2009.