MicrobiologyBytes: Maths & Computers for Biologists: ANOVA with SPSS  Updated: February 6, 2009
Never, ever, run any statistical test without performing EDA first!
What's wrong with t-tests?
Nothing, except ...
If you want to compare three or more groups using t-tests with the usual 0.05 level of significance, you would have to compare the groups pairwise (A to B, A to C, B to C), so the chance of getting at least one wrong result would be:

1 - (0.95 x 0.95 x 0.95) = 14.3%

If you wanted to compare four groups, there would be six pairwise comparisons, so the chance of getting a wrong result would be 1 - (0.95)^6 = 26%, and for five groups (ten comparisons), 40%. Not good, is it? So we use ANOVA. Never perform multiple t-tests: anyone on this module discovered performing multiple t-tests when they should use ANOVA will be shot!
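The familywise error calculation above generalises to any number of groups: with k groups there are k(k-1)/2 pairwise comparisons. A quick sketch in Python:

```python
from math import comb

def familywise_error(k, alpha=0.05):
    """Probability of at least one false positive when all
    k*(k-1)/2 pairwise t-tests are each run at level alpha."""
    m = comb(k, 2)               # number of pairwise comparisons
    return 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    print(k, "groups:", round(familywise_error(k) * 100, 1), "%")
    # 3 groups: 14.3%, 4 groups: 26.5%, 5 groups: 40.1%
```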
ANalysis Of VAriance (ANOVA) is such an important statistical method that it would be easy to spend a whole module on this test alone. Like the t-test, ANOVA is a parametric test which assumes:

the data are normally distributed
the variances of the groups are similar (homogeneity of variance)
the groups are independent

so it's important to carry out EDA before starting ANOVA! In fact, ANOVA is quite a robust procedure, so as long as the groups are similar, the test is normally reliable.
ANOVA tests the null hypothesis that the means of all the groups being compared are equal, and produces a statistic called F which is equivalent to the t-statistic from a t-test. But there's a catch. If the means of all the groups tested by ANOVA are equal, fine. But if the result tells us to reject the null hypothesis, we still don't know which of the means differ. We solve this problem by performing what is known as a "post hoc" (after the event) test.
The array of options for different ANOVA tests in SPSS is confusing, so I'll go through the most important bits using some examples.
Data: Pain Scores for Analgesics

Drug          Pain Scores
Diclofenac    0, 35, 31, 29, 20, 7, 43, 16
Ibuprofen     30, 40, 27, 25, 39, 15, 30, 45
Paracetamol   16, 33, 25, 32, 21, 54, 57, 19
Aspirin       55, 58, 56, 57, 56, 53, 59, 55
Since it would be unethical to withhold pain relief, there is no control group; we are just interested in knowing whether one drug performs better (lower pain score) than another, so we need to perform a one-way (single-factor) ANOVA. We enter this data into SPSS using dummy values (1, 2, 3, 4) for the drugs so this numeric data can be used in the ANOVA:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!
EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:
Analyze: Compare Means: One-Way ANOVA
Dependent variable: Pain Score
Factor: Drug
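The same analysis can be cross-checked outside SPSS; a minimal sketch in Python, assuming scipy is available. Note that scipy's Levene test defaults to the median-centred (Brown-Forsythe) variant, so center='mean' is needed to match SPSS's Levene statistic:

```python
from scipy import stats

diclofenac  = [0, 35, 31, 29, 20, 7, 43, 16]
ibuprofen   = [30, 40, 27, 25, 39, 15, 30, 45]
paracetamol = [16, 33, 25, 32, 21, 54, 57, 19]
aspirin     = [55, 58, 56, 57, 56, 53, 59, 55]

# Levene's test of homogeneity of variances (mean-centred, as in SPSS)
W, p_lev = stats.levene(diclofenac, ibuprofen, paracetamol, aspirin,
                        center='mean')
print(f"Levene W = {W:.3f}, p = {p_lev:.3f}")  # significant: variances differ

# One-way ANOVA
F, p = stats.f_oneway(diclofenac, ibuprofen, paracetamol, aspirin)
print(f"F(3,28) = {F:.3f}, p = {p:.5f}")       # F = 11.967
```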
Output:
Test of Homogeneity of Variances

Levene Statistic   df1   df2   Sig.
4.837              3     28    .008
The significance value for homogeneity of variances is <.05, so the variances of the groups are significantly different. Since this is an assumption of ANOVA, we need to be very careful in interpreting the outcome of this test:
ANOVA

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   4956.375         3    1652.125      11.967   .000
Within Groups    3865.500         28   138.054
Total            8821.875         31
This is the main ANOVA result. The significance value comparing the groups (drugs) is <.05, so we could reject the null hypothesis (there is no difference in the mean pain scores with the four drugs). However, since the variances are significantly different, this might be the wrong answer. Fortunately, the Welch and Brown-Forsythe statistics can still be used in these circumstances:
Robust Tests of Equality of Means

                 Statistic   df1   df2      Sig.
Welch            32.064      3     12.171   .000
Brown-Forsythe   11.967      3     18.889   .000
The significance value of these are both <.05, so we still reject the null hypothesis. However, this result does not tell us which drugs are responsible for the difference, so we need the post hoc test results:
Multiple Comparisons (Dependent Variable: Pain Score)

Test           (I) Drug   (J) Drug   Mean Difference (I-J)   Std. Error   Sig.    95% CI Lower   Upper
Tukey HSD      1          2          -8.750                  5.875        .457    -24.79         7.29
               1          3          -9.500                  5.875        .386    -25.54         6.54
               1          4          -33.500(*)              5.875        .000    -49.54         -17.46
               2          1          8.750                   5.875        .457    -7.29          24.79
               2          3          -.750                   5.875        .999    -16.79         15.29
               2          4          -24.750(*)              5.875        .001    -40.79         -8.71
               3          1          9.500                   5.875        .386    -6.54          25.54
               3          2          .750                    5.875        .999    -15.29         16.79
               3          4          -24.000(*)              5.875        .002    -40.04         -7.96
               4          1          33.500(*)               5.875        .000    17.46          49.54
               4          2          24.750(*)               5.875        .001    8.71           40.79
               4          3          24.000(*)               5.875        .002    7.96           40.04
Games-Howell   1          2          -8.750                  6.176        .513    -27.05         9.55
               1          3          -9.500                  7.548        .602    -31.45         12.45
               1          4          -33.500(*)              5.194        .001    -50.55         -16.45
               2          1          8.750                   6.176        .513    -9.55          27.05
               2          3          -.750                   6.485        .999    -20.09         18.59
               2          4          -24.750(*)              3.471        .001    -36.03         -13.47
               3          1          9.500                   7.548        .602    -12.45         31.45
               3          2          .750                    6.485        .999    -18.59         20.09
               3          4          -24.000(*)              5.558        .014    -42.26         -5.74
               4          1          33.500(*)               5.194        .001    16.45          50.55
               4          2          24.750(*)               3.471        .001    13.47          36.03
               4          3          24.000(*)               5.558        .014    5.74           42.26

* The mean difference is significant at the .05 level.
The Tukey test relies on homogeneity of variance, so we ignore these results. The Games-Howell post hoc test does not rely on homogeneity of variance (this is why we used two different post hoc tests) and so can be used. SPSS kindly flags (*) which differences are significant!
Result: Drug 4 (Aspirin) produces a significantly different result from the other three drugs:
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:
There is a significant difference between the pain scores for aspirin and the other three drugs tested, F(3,28) = 11.97, p < .05.
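The Welch statistic in the robust tests table can also be reproduced by hand. The sketch below implements the standard Welch one-way ANOVA formula in plain Python (scipy is assumed available, only for the p-value):

```python
from scipy import stats

def welch_anova(*groups):
    """Welch's heteroscedasticity-robust one-way ANOVA.
    Returns (F, df1, df2, p)."""
    k = len(groups)
    n = [len(g) for g in groups]
    m = [sum(g) / len(g) for g in groups]
    v = [sum((x - mi) ** 2 for x in g) / (len(g) - 1)   # sample variances
         for g, mi in zip(groups, m)]
    w = [ni / vi for ni, vi in zip(n, v)]               # weights n_i / s_i^2
    sw = sum(w)
    mw = sum(wi * mi for wi, mi in zip(w, m)) / sw      # weighted grand mean
    num = sum(wi * (mi - mw) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    h = sum((1 - wi / sw) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    den = 1 + 2 * (k - 2) / (k ** 2 - 1) * h
    df2 = (k ** 2 - 1) / (3 * h)
    F = num / den
    return F, k - 1, df2, stats.f.sf(F, k - 1, df2)

F, df1, df2, p = welch_anova(
    [0, 35, 31, 29, 20, 7, 43, 16],    # diclofenac
    [30, 40, 27, 25, 39, 15, 30, 45],  # ibuprofen
    [16, 33, 25, 32, 21, 54, 57, 19],  # paracetamol
    [55, 58, 56, 57, 56, 53, 59, 55])  # aspirin
print(f"Welch F({df1}, {df2:.3f}) = {F:.3f}, p = {p:.5f}")
```

This reproduces the Welch row of the SPSS table: F(3, 12.171) = 32.064.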
Do anticancer drugs have different effects in males and females?
Data:

Drug:         cisplatin       vinblastine     5-fluorouracil
Gender:       Female  Male    Female  Male    Female  Male
Tumour Size:  65      50      70      45      55      35
              70      55      65      60      65      40
              60      80      60      85      70      35
              60      65      70      65      55      55
              60      70      65      70      55      35
              55      75      60      70      60      40
              60      75      60      80      50      45
              50      65      50      60      50      40
We enter this data into SPSS using dummy values for the drugs (1, 2, 3) and genders (1,2) so the coded data can be used in the ANOVA:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret!
EDA (Analyze: Descriptive Statistics: Explore) shows that the data is normally distributed, so we can proceed with the ANOVA:
Analyze: General Linear Model: Univariate
Dependent variable: Tumour Diameter
Fixed Factors: Gender, Drug:
Also select:
Post Hoc: Tukey and GamesHowell:
Options:
Display Means for: Gender, Drug, Gender*Drug
Descriptive Statistics
Homogeneity tests:
Output:
Levene's Test of Equality of Error Variances(a)

F       df1   df2   Sig.
1.462   5     42    .223

Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a Design: Intercept+Gender+Drug+Gender * Drug
The significance result for homogeneity of variance is >.05, which shows that the error variance of the dependent variable is equal across the groups, i.e. the assumption of the ANOVA test has been met.
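SPSS's Levene test for a factorial design is simply a one-way Levene test across all the cells of the design. A sketch of the same check in Python, assuming scipy is available (mean-centred to match SPSS):

```python
from scipy import stats

# Tumour sizes by cell (drug x gender), eight patients per cell
cis_f = [65, 70, 60, 60, 60, 55, 60, 50]
cis_m = [50, 55, 80, 65, 70, 75, 75, 65]
vin_f = [70, 65, 60, 70, 65, 60, 60, 50]
vin_m = [45, 60, 85, 65, 70, 70, 80, 60]
fu_f  = [55, 65, 70, 55, 55, 60, 50, 50]
fu_m  = [35, 40, 35, 55, 35, 40, 45, 40]

# One-way Levene across the six cells of the 2 x 3 design
W, p = stats.levene(cis_f, cis_m, vin_f, vin_m, fu_f, fu_m, center='mean')
print(f"Levene F(5, 42) = {W:.3f}, p = {p:.3f}")  # not significant
```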
Tests of Between-Subjects Effects

Source            Type III Sum of Squares   df   Mean Square   F          Sig.
Corrected Model   3817.188(a)               5    763.438       10.459     .000
Intercept         167442.188                1    167442.188    2294.009   .000
Gender            42.188                    1    42.188        .578       .451
Drug              2412.500                  2    1206.250      16.526     .000
Gender * Drug     1362.500                  2    681.250       9.333      .000
Error             3065.625                  42   72.991
Total             174325.000                48
Corrected Total   6882.813                  47

a R Squared = .555 (Adjusted R Squared = .502)
The highlighted values are significant (<.05), but there is no effect of gender (p = 0.451). Again, this does not tell us which drugs behave differently, so again we need to look at the post hoc tests:
Multiple Comparisons (Dependent Variable: Tumour Diameter)

Test           (I) Drug         (J) Drug         Mean Difference (I-J)   Std. Error   Sig.    95% CI Lower   Upper
Tukey HSD      cisplatin        vinblastine      -1.25                   3.021        .910    -8.59          6.09
               cisplatin        5-fluorouracil   14.38(*)                3.021        .000    7.04           21.71
               vinblastine      cisplatin        1.25                    3.021        .910    -6.09          8.59
               vinblastine      5-fluorouracil   15.63(*)                3.021        .000    8.29           22.96
               5-fluorouracil   cisplatin        -14.38(*)               3.021        .000    -21.71         -7.04
               5-fluorouracil   vinblastine      -15.63(*)               3.021        .000    -22.96         -8.29
Games-Howell   cisplatin        vinblastine      -1.25                   3.329        .925    -9.46          6.96
               cisplatin        5-fluorouracil   14.38(*)                3.534        .001    5.64           23.11
               vinblastine      cisplatin        1.25                    3.329        .925    -6.96          9.46
               vinblastine      5-fluorouracil   15.63(*)                3.699        .001    6.50           24.75
               5-fluorouracil   cisplatin        -14.38(*)               3.534        .001    -23.11         -5.64
               5-fluorouracil   vinblastine      -15.63(*)               3.699        .001    -24.75         -6.50

Based on observed means.
* The mean difference is significant at the .05 level.
In this example, we can use the Tukey or GamesHowell results. Again, SPSS helpfully flags which results have reached statistical significance. We already know from the main ANOVA table that the effect of gender is not significant, but the post hoc tests show which drugs produce significantly different outcomes.
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of the F ratio and give the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case:
There is a significant difference between the tumour diameter for 5-fluorouracil and the other two drugs tested, F(5,42) = 10.46, p < .05.
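Because this design is balanced (equal cell sizes), the Type III sums of squares in the SPSS table equal the ordinary two-way sums of squares, which can be computed by hand. A sketch in plain Python:

```python
# Balanced two-way ANOVA computed by hand (2 genders x 3 drugs, n = 8
# per cell). With a balanced design these sums of squares match the
# Type III values SPSS reports.
cells = {
    ('F', 'cisplatin'):      [65, 70, 60, 60, 60, 55, 60, 50],
    ('M', 'cisplatin'):      [50, 55, 80, 65, 70, 75, 75, 65],
    ('F', 'vinblastine'):    [70, 65, 60, 70, 65, 60, 60, 50],
    ('M', 'vinblastine'):    [45, 60, 85, 65, 70, 70, 80, 60],
    ('F', '5-fluorouracil'): [55, 65, 70, 55, 55, 60, 50, 50],
    ('M', '5-fluorouracil'): [35, 40, 35, 55, 35, 40, 45, 40],
}
genders, drugs = ['F', 'M'], ['cisplatin', 'vinblastine', '5-fluorouracil']
n = 8                                      # patients per cell

def mean(xs):
    return sum(xs) / len(xs)

grand = mean([x for v in cells.values() for x in v])
g_mean = {g: mean(sum((cells[(g, d)] for d in drugs), [])) for g in genders}
d_mean = {d: mean(sum((cells[(g, d)] for g in genders), [])) for d in drugs}
c_mean = {k: mean(v) for k, v in cells.items()}

# Main effect, interaction and error sums of squares
ss_gender = n * len(drugs) * sum((g_mean[g] - grand) ** 2 for g in genders)
ss_drug = n * len(genders) * sum((d_mean[d] - grand) ** 2 for d in drugs)
ss_cells = n * sum((m - grand) ** 2 for m in c_mean.values())
ss_inter = ss_cells - ss_gender - ss_drug
ss_error = sum((x - c_mean[k]) ** 2 for k, v in cells.items() for x in v)

ms_error = ss_error / (len(cells) * (n - 1))   # error df = 42
F_gender = ss_gender / 1 / ms_error            # 0.578 (not significant)
F_drug = ss_drug / 2 / ms_error                # 16.526
F_inter = ss_inter / 2 / ms_error              # 9.333
print(F_gender, F_drug, F_inter)
```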
Remember that one of the assumptions of ANOVA is independence of the groups being compared. In lots of circumstances, we want to test the same thing repeatedly, e.g.:
This type of study reduces variability in the data and so increases the power to detect effects, but violates the assumption of independence, so as with the paired t-test, we need to use a special form of ANOVA called repeated measures. In a parametric test, the assumption that the variances of the differences between all pairs of groups are equal is called "sphericity". Violating sphericity means that the F statistic cannot be compared to the standard tables of F, and so the software cannot calculate a valid significance value. SPSS includes a procedure called Mauchly's test which tells us if the assumption of sphericity has been violated:
If Mauchly's test is significant then we cannot trust the F-ratios produced by SPSS unless we apply a correction (which, fortunately, SPSS helps us to do).
i.e. one independent variable, e.g. pain score after surgery:

Patient1   Patient2   Patient3
1          3          1
2          5          3
4          6          6
5          7          4
5          9          1
6          10         3
This data can be entered directly into SPSS. Note that each column represents a repeated measures variable (patients in this case). There is no need for a coding variable (as with betweengroup designs, above):
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret! Next:
Analyze: General Linear Model: Repeated Measures
Within-Subject factor name: Patient
Number of Levels: 3 (because there are 3 patients)
Click Add, then Define (factors):
There are no proper post hoc tests for repeated-measures variables in SPSS. However, via the Options button, you can use the paired t-test procedure to compare all pairs of levels of the independent variable, and then apply a Bonferroni correction to the probability at which you accept any of these tests. The resulting probability value should be used as the criterion for statistical significance. A ‘Bonferroni correction’ is achieved by dividing the probability value (usually 0.05) by the number of tests conducted, e.g. if we compare all levels of the independent variable of these data, we make three comparisons and so the appropriate significance level is 0.05/3 = 0.0167. Therefore, we accept t-tests as being significant only if they have a p value < 0.0167.
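The Bonferroni procedure described above can be sketched in Python, assuming scipy is available for the paired t-tests:

```python
from scipy import stats

patient1 = [1, 2, 4, 5, 5, 6]
patient2 = [3, 5, 6, 7, 9, 10]
patient3 = [1, 3, 6, 4, 1, 3]

alpha = 0.05 / 3   # Bonferroni-corrected criterion = 0.0167
pvals = {}
for name, a, b in [("1 vs 2", patient1, patient2),
                   ("1 vs 3", patient1, patient3),
                   ("2 vs 3", patient2, patient3)]:
    t, p = stats.ttest_rel(a, b)   # paired t-test for each pair of levels
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name}: t = {t:.3f}, p = {p:.4f} -> {verdict}")
    pvals[name] = p
```

Note that the 2 vs 3 comparison has an uncorrected p below 0.05 but above 0.0167, which is exactly why the correction matters.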
Output:
Mauchly's Test of Sphericity

Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon: Greenhouse-Geisser   Huynh-Feldt   Lower-bound
patient                  .094          9.437                2    .009   .525                          .544          .500
Mauchly’s test is significant (p <.05) so we conclude that the assumption of sphericity has not been met.
Tests of Within-Subjects Effects

Source           Correction           Type III Sum of Squares   df      Mean Square   F       Sig.
patient          Sphericity Assumed   44.333                    2       22.167        8.210   .008
                 Greenhouse-Geisser   44.333                    1.050   42.239        8.210   .033
                 Huynh-Feldt          44.333                    1.088   40.752        8.210   .031
                 Lower-bound          44.333                    1.000   44.333        8.210   .035
Error(patient)   Sphericity Assumed   27.000                    10      2.700
                 Greenhouse-Geisser   27.000                    5.248   5.145
                 Huynh-Feldt          27.000                    5.439   4.964
                 Lower-bound          27.000                    5.000   5.400
Because the significance values are <.05 whichever correction is applied, we conclude that there was a significant difference between the three patients, but this test does not tell us which patients differed from each other. The next issue is which of the three corrections to use; this is decided from the epsilon estimates in the Mauchly's test output (here epsilon = .525, well below 1, so the conservative Greenhouse-Geisser correction is appropriate).
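The Greenhouse-Geisser epsilon in the Mauchly table can itself be reproduced from the data: it is computed from the double-centred covariance matrix of the repeated measures. A sketch in plain Python (function name is mine):

```python
def gg_epsilon(data):
    """Greenhouse-Geisser epsilon; rows = observations, cols = conditions."""
    k = len(data[0])
    n = len(data)
    means = [sum(row[j] for row in data) / n for j in range(k)]
    # Sample covariance matrix of the k repeated-measures conditions
    S = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data)
          / (n - 1) for j in range(k)] for i in range(k)]
    r = [sum(S[i]) / k for i in range(k)]     # row means of S
    g = sum(r) / k                            # grand mean of S
    # Double-centre the covariance matrix
    C = [[S[i][j] - r[i] - r[j] + g for j in range(k)] for i in range(k)]
    tr = sum(C[i][i] for i in range(k))
    ss = sum(C[i][j] ** 2 for i in range(k) for j in range(k))
    return tr ** 2 / ((k - 1) * ss)

# Rows = the six repeated observations, columns = Patient1..Patient3
scores = [[1, 3, 1], [2, 5, 3], [4, 6, 6], [5, 7, 4], [5, 9, 1], [6, 10, 3]]
print(round(gg_epsilon(scores), 3))   # SPSS reports .525
```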
Post Hoc Tests:
Pairwise Comparisons

(I) patient   (J) patient   Mean Difference (I-J)   Std. Error   Sig.(a)   95% CI for Difference(a): Lower   Upper
1             2             -2.833(*)               .401         .003      -4.252                            -1.415
1             3             .833                    .946         1.000     -2.509                            4.176
2             1             2.833(*)                .401         .003      1.415                             4.252
2             3             3.667                   1.282        .106      -.865                             8.199
3             1             -.833                   .946         1.000     -4.176                            2.509
3             2             -3.667                  1.282        .106      -8.199                            .865

Based on estimated marginal means
* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.
Formal reporting:
Mauchly’s test indicated that the assumption of sphericity had been violated (chi-square = 9.44, p < .05), therefore degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity (epsilon = 0.53). The results show that the pain scores of the three patients differed significantly, F(1.05, 5.25) = 8.21, p < .05. Post hoc tests revealed that although the pain score of Patient2 was significantly higher than that of Patient1 (p = .003), Patient3's score was not significantly different from either of the other patients (both p > .05).
i.e. two independent variables:
In a study of the best way to keep fields free of weeds for an entire growing season, a farmer treated test plots in 10 fields with either one of five different concentrations of weedkiller (independent variable 1) or one of five different-length blasts with a flamethrower (independent variable 2). At the end of the growing season, the number of weeds per square metre was counted. To exclude bias (e.g. a pre-existing seedbank in the soil), the following year the farmer repeated the experiment, but this time the treatments the fields received were reversed:
Treatment:   Weedkiller                 Flamethrower
Severity:    1    2    3    4    5      1    2    3    4    5
Field1       10   15   18   22   37     9    13   13   18   22
Field2       10   18   10   42   60     7    14   20   21   32
Field3       7    11   28   31   56     9    13   24   30   35
Field4       9    19   36   45   60     7    14   9    20   25
Field5       15   14   29   33   37     14   13   20   22   29
Field6       14   13   26   26   49     5    12   17   16   33
Field7       9    12   19   37   48     5    15   12   17   24
Field8       9    18   22   31   39     13   13   14   17   17
Field9       12   14   24   28   53     12   13   21   19   22
Field10      7    11   21   23   45     12   14   20   21   29
SPSS Data View:
It's always a good idea to enter descriptive labels for data into the Variable View window, or the output is difficult to interpret:
Analyze: General Linear Model: Repeated Measures
Define Within-Subject Factors (remember, "factor" = test or treatment):
Treatment (2 treatments, weedkiller or flamethrower) (SPSS only allows 8 characters for the name, hence "treatmen" in the output)
Severity (5 different severities):
Click Define and define Within Subject Variables:
As above, there are no post hoc tests for repeated measures ANOVA in SPSS, but via the Options button, we can apply a Bonferroni correction to the probability at which you accept any of the tests:
Output:
Mauchly's Test of Sphericity

Within Subjects Effect   Mauchly's W   Approx. Chi-Square   df   Sig.   Epsilon: Greenhouse-Geisser   Huynh-Feldt   Lower-bound
treatmen                 1.000         .000                 0    .      1.000                         1.000         1.000
severity                 .092          17.685               9    .043   .552                          .740          .250
treatmen * severity      .425          6.350                9    .712   .747                          1.000         .250
The outcome of Mauchly’s test is significant (p <.05) for the severity of treatment, so we need to correct the Fvalues for this, but not for the treatments themselves.
Tests of Within-Subjects Effects

Source                     Correction           Type III Sum of Squares   df       Mean Square   F        Sig.
treatmen                   Sphericity Assumed   1730.560                  1        1730.560      34.078   .000
                           Greenhouse-Geisser   1730.560                  1.000    1730.560      34.078   .000
                           Huynh-Feldt          1730.560                  1.000    1730.560      34.078   .000
                           Lower-bound          1730.560                  1.000    1730.560      34.078   .000
Error(treatmen)            Sphericity Assumed   457.040                   9        50.782
                           Greenhouse-Geisser   457.040                   9.000    50.782
                           Huynh-Feldt          457.040                   9.000    50.782
                           Lower-bound          457.040                   9.000    50.782
severity                   Sphericity Assumed   9517.960                  4        2379.490      83.488   .000
                           Greenhouse-Geisser   9517.960                  2.209    4309.021      83.488   .000
                           Huynh-Feldt          9517.960                  2.958    3217.666      83.488   .000
                           Lower-bound          9517.960                  1.000    9517.960      83.488   .000
Error(severity)            Sphericity Assumed   1026.040                  36       28.501
                           Greenhouse-Geisser   1026.040                  19.880   51.613
                           Huynh-Feldt          1026.040                  26.622   38.541
                           Lower-bound          1026.040                  9.000    114.004
treatmen * severity        Sphericity Assumed   1495.240                  4        373.810       20.730   .000
                           Greenhouse-Geisser   1495.240                  2.989    500.205       20.730   .000
                           Huynh-Feldt          1495.240                  4.000    373.810       20.730   .000
                           Lower-bound          1495.240                  1.000    1495.240      20.730   .001
Error(treatmen*severity)   Sphericity Assumed   649.160                   36       18.032
                           Greenhouse-Geisser   649.160                   26.903   24.129
                           Huynh-Feldt          649.160                   36.000   18.032
                           Lower-bound          649.160                   9.000    72.129
Since there was no violation of sphericity for the treatment factor, we can look at the comparison of the two treatments without any correction. The significance value (0.000) shows that there was a significant difference between the two treatments; because this factor has only two levels, the direction of the effect can be read straight from the means (the flamethrower plots had fewer weeds). The output also tells us the effect of the severity of treatment, but remember there was a violation of sphericity here, so we must look at the corrected F-ratios. All of the corrected values are highly significant, and we use the Greenhouse-Geisser corrected values as these are the most conservative.
Pairwise Comparisons

(I) severity   (J) severity   Mean Difference (I-J)   Std. Error   Sig.(a)   95% CI for Difference(a): Lower   Upper
1              2              -4.200(*)               .895         .011      -7.502                            -.898
1              3              -10.400(*)              1.190        .000      -14.790                           -6.010
1              4              -16.200(*)              1.764        .000      -22.709                           -9.691
1              5              -27.850(*)              2.398        .000      -36.698                           -19.002
2              1              4.200(*)                .895         .011      .898                              7.502
2              3              -6.200(*)               1.521        .028      -11.810                           -.590
2              4              -12.000(*)              1.280        .000      -16.723                           -7.277
2              5              -23.650(*)              2.045        .000      -31.197                           -16.103
3              1              10.400(*)               1.190        .000      6.010                             14.790
3              2              6.200(*)                1.521        .028      .590                              11.810
3              4              -5.800                  1.690        .075      -12.036                           .436
3              5              -17.450(*)              2.006        .000      -24.852                           -10.048
4              1              16.200(*)               1.764        .000      9.691                             22.709
4              2              12.000(*)               1.280        .000      7.277                             16.723
4              3              5.800                   1.690        .075      -.436                             12.036
4              5              -11.650(*)              1.551        .000      -17.373                           -5.927
5              1              27.850(*)               2.398        .000      19.002                            36.698
5              2              23.650(*)               2.045        .000      16.103                            31.197
5              3              17.450(*)               2.006        .000      10.048                            24.852
5              4              11.650(*)               1.551        .000      5.927                             17.373

* The mean difference is significant at the .05 level.
a Adjustment for multiple comparisons: Bonferroni.
This shows that there was only one pair of severity levels for which there was no significant difference: levels 3 and 4. The differences for all the other pairs are significant. So both the type of treatment (the flamethrower outperformed the weedkiller overall) and the severity of treatment (how much weedkiller or how long a burst of flame) make a difference to weed control.
Formal report:
There was a significant main effect of the type of treatment, F(1, 9) = 34.08, p < .001. There was a significant main effect of the severity of treatment, F(2.21, 19.88) = 83.49, p < .001. There was also a significant treatment * severity interaction, F(4, 36) = 20.73, p < .001.
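Because the treatment factor has only two levels, its main effect is equivalent to a paired t-test on each field's mean weed count under the two treatments (F = t squared). A cross-check in Python, assuming scipy is available:

```python
from scipy import stats

# Weed counts per field at the five severities, for each treatment
weedkiller = [[10, 15, 18, 22, 37], [10, 18, 10, 42, 60], [7, 11, 28, 31, 56],
              [9, 19, 36, 45, 60], [15, 14, 29, 33, 37], [14, 13, 26, 26, 49],
              [9, 12, 19, 37, 48], [9, 18, 22, 31, 39], [12, 14, 24, 28, 53],
              [7, 11, 21, 23, 45]]
flame =      [[9, 13, 13, 18, 22], [7, 14, 20, 21, 32], [9, 13, 24, 30, 35],
              [7, 14, 9, 20, 25], [14, 13, 20, 22, 29], [5, 12, 17, 16, 33],
              [5, 15, 12, 17, 24], [13, 13, 14, 17, 17], [12, 13, 21, 19, 22],
              [12, 14, 20, 21, 29]]

# Average over the five severities, then a paired t-test across the 10 fields
wk_means = [sum(f) / 5 for f in weedkiller]
fl_means = [sum(f) / 5 for f in flame]
t, p = stats.ttest_rel(wk_means, fl_means)
print(f"t(9) = {t:.3f}, t^2 = {t**2:.3f}, p = {p:.5f}")  # t^2 = F(1,9) = 34.08
```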
© MicrobiologyBytes 2009.