MicrobiologyBytes: Maths & Computers for Biologists: Comparing Populations with SPSS | Updated: February 6, 2009
Never, ever, run any statistical test without performing EDA first!
Regression or Correlation? Regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate correlation if:
Calculate regressions only if:
Correlation or Regression? Correlation makes no assumption as to whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate as to the degree of association between the variables. Regression attempts to describe the dependence of a variable on one (or more) explanatory variables; it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. The best way to appreciate this difference is by example.
There are numerous methods for calculating correlations, e.g:
In SPSS:
As with other tests in SPSS, variables must be entered in separate columns. Before examining correlation, always draw a scatterplot to identify outliers and ensure the data is suitable for analysis:
Graphs: Scatter and select Simple Scatter or Matrix Scatter (for more than one pair of variables)
A correlation between two variables is called a bivariate correlation. In SPSS, select:
Analyze: Correlate: Bivariate
Select the variables you are interested in comparing and the type of correlation analysis you want to perform:
The output reports the correlation coefficient and whether the outcome is statistically significant (i.e. p <.05).
REMEMBER: Correlation indicates whether there is a relationship between two variables, but not what causes the relationship or what the relationship means! Correlation can tell you whether there is a relationship between the chicken and the egg, but not whether eggs come from chickens, or chickens come from eggs!
Formal Reporting: When you report the outcome of a correlation test, cite the value of the coefficient, degrees of freedom (N - 2) in brackets and the significance value, e.g:
There is a weak/strong/etc correlation between variableA and variableB, r = 0.55(12), p (one or two-tailed) < .05.
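To make the reported numbers less of a black box, the following is a minimal sketch of how a Pearson correlation coefficient and its degrees of freedom (N - 2) can be computed by hand. The data and the names `pearson_r`, `variable_a` and `variable_b` are invented for illustration and are not from the SPSS example; SPSS also reports a significance value, which this sketch does not calculate.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired samples."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("x and y must be paired and contain at least 3 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only
variable_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
variable_b = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

r = pearson_r(variable_a, variable_b)
df = len(variable_a) - 2  # degrees of freedom quoted in brackets when reporting r
print(f"r = {r:.3f}({df})")
```

As in SPSS, each variable is held in its own column (here, its own list), and the values are paired by position.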
Regression predicts an outcome from one or more variables. Simple regression predicts the outcome from a single variable; multiple regression combines several variables to predict an outcome. In practice, simple regression fits a straight-line statistical model: the line of best fit between the variables. The line is found by minimizing the squared differences between the datapoints and the line; these differences are known as residuals. To perform simple regression in SPSS:
Analyze: Regression: Linear
The output includes the Model Summary, e.g:
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .397 | .157 | .149 | 23.92947 |
R = the correlation coefficient
R² = the coefficient of determination. This value indicates how much of the variation in one variable is accounted for by the other variable. In the example above, R² = 0.157 (the R Square column, not Adjusted R Square), so 15.7% of the variation in the outcome is accounted for by the predictor variable (and 84.3% of the variation is due to something else!).
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 10802.625 | 1 | 10802.625 | 18.865 | .000 |
| | Residual | 57834.579 | 101 | 572.620 | | |
| | Total | 68637.204 | 102 | | | |
The ANOVA output reports whether the model results in statistically significant prediction. In this case, Significance = .000, so the outcome is statistically significant (i.e. p <.05).
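The Model Summary and ANOVA tables are linked, and it can help to see how. The sketch below recomputes R, R Square, Adjusted R Square, F and the standard error of the estimate from the sums of squares and degrees of freedom in the ANOVA table above. The variable names are illustrative, and the adjusted R Square formula used is the standard one for a model with an intercept.

```python
import math

# Values copied from the ANOVA table above
ss_regression = 10802.625
ss_residual = 57834.579
df_regression = 1
df_residual = 101

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1  # number of cases (103)

# Proportion of the total variation accounted for by the model
r_square = ss_regression / ss_total
r = math.sqrt(r_square)

# Adjusted R Square penalizes the number of predictors
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual

# F is the ratio of the regression and residual mean squares
ms_residual = ss_residual / df_residual
f_ratio = (ss_regression / df_regression) / ms_residual

# Std. Error of the Estimate is the square root of the residual mean square
std_error_estimate = math.sqrt(ms_residual)

print(f"R = {r:.3f}, R Square = {r_square:.3f}, Adjusted = {adjusted_r_square:.3f}")
print(f"F({df_regression},{df_residual}) = {f_ratio:.3f}, SEE = {std_error_estimate:.5f}")
```

Note that the degrees of freedom attached to F are those of the Regression and Residual rows (1 and 101), not the Total row.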
| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | 45.321 | 3.503 | | 12.938 | .000 |
| | Predictor | .567 | .130 | .397 | 4.343 | .000 |
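The coefficients output gives the intercept (Constant, B = 45.321) and the slope (Predictor, B = .567) of the line of best fit, so it makes specific predictions about the magnitude of the effect, and the t tests confirm the significance of the model.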
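As a rough illustration of what "specific predictions" means, the unstandardized B values define a straight line: outcome = constant + slope × predictor. The sketch below uses the values from the coefficients table; the function name `predict_outcome` and the example predictor value are assumptions made for illustration.

```python
# Coefficients copied from the table above
intercept = 45.321       # B for (Constant)
slope = 0.567            # B for Predictor
slope_std_error = 0.130  # Std. Error for Predictor

def predict_outcome(predictor_value):
    """Predicted outcome on the fitted line for a given predictor value."""
    return intercept + slope * predictor_value

# The t statistic for a coefficient is B divided by its standard error.
# Using the rounded table values gives about 4.36 rather than the
# reported 4.343, which SPSS computes from unrounded numbers.
t_slope = slope / slope_std_error

print(f"Predicted outcome at predictor = 10: {predict_outcome(10):.3f}")
print(f"t for slope (from rounded values): {t_slope:.2f}")
```

Predictions are only meaningful within the range of predictor values that were actually observed.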
Formal Reporting: When you report the outcome of simple regression, cite the value of R², the degrees of freedom (regression and residual, from the ANOVA table) and the significance value, e.g:
VariableA significantly predicted VariableB, R² = .157(1,101), p < .001.
To plot a regression line in SPSS:
Analyze: Regression: Curve Estimation: Linear (or other if required)
Assumptions: The t-test is a parametric test which assumes that the data analyzed:
If you use the t-test under other circumstances, the results will be meaningless! In other situations, non-parametric tests should be used to compare the groups, e.g:
In order to compare three or more groups, other tests must be used, e.g. ANOVA (next week). NEVER perform multiple t-tests!
The t-test is used to compare two groups and comes in three versions:
Where correlation and regression examine relationships between datasets, the t test is the primary example of a test that examines differences between datasets. As with any statistical investigation, the starting point is to think about what question you want to ask. In SPSS:
If the data does not meet the criteria for a t test, it is sometimes possible to transform it so that the test can be used, but not always. In other cases, alternative tests must be used:
| Dependency | Variable | Distribution | Test | SPSS |
|---|---|---|---|---|
| Independent Scores | Interval or Ratio | Symmetric, Homogeneous | t Test | Analyze: Compare Means: Independent-Samples T Test - Equal Variances Assumed (output) |
| | | Symmetric, Nonhomogeneous | Welch's t Test | Analyze: Compare Means: Independent-Samples T Test - Equal Variances Not Assumed (output) |
| | | Skewed | Mann-Whitney U (Wilcoxon Rank Sum Test) | Analyze: Nonparametric Tests: 2 Independent Samples: Mann-Whitney |
| | Ordinal | n/a | | |
| Related Scores | Interval or Ratio | Symmetric, Difference Scores | Paired Samples t Test | Analyze: Compare Means: Paired-Samples T Test |
| | | Nonsymmetric, Difference Scores | Wilcoxon Test for Paired Data | Analyze: Nonparametric Tests: 2 Related Samples: Wilcoxon |
| | Ordinal | n/a | | |
Performing a paired T test in SPSS is straightforward:
Analyze: Compare Means: Paired-Samples T Test
and select the pair of variables to be compared.
Performing an independent samples T test is slightly more complicated. SPSS requires that in addition to the values to be compared, a grouping variable must be entered. This is a dummy coding variable (remember, in SPSS, variables are always columns), which indicates the different groups, e.g:

Analyze: Compare Means: Independent-Samples T Test
Select the Test Variable (data), and the Grouping Variable (dummy). Click the Define Groups button and use the dummy coding variable you entered to select the groups to be compared (don't forget to label the groups in the Variable View window so you know what the output means!):
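The grouping-variable layout can be pictured as two parallel columns: one holding the measurements and one holding the dummy codes. The sketch below shows that layout and computes the equal-variances (pooled) t statistic from it. The data, the codes 1 and 2, and the function names `split_by_group` and `pooled_t` are all hypothetical, and the sketch does not compute a significance value as SPSS does.

```python
import math
from statistics import mean, variance

# Hypothetical data invented for illustration: one row per case
values = [12.1, 14.3, 13.8, 15.2, 11.9, 16.4, 17.1, 15.8, 18.0, 16.9]
group = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]  # dummy coding variable

def split_by_group(values, group, code_a, code_b):
    """Separate the test variable into two samples using the dummy codes."""
    sample_a = [v for v, g in zip(values, group) if g == code_a]
    sample_b = [v for v, g in zip(values, group) if g == code_b]
    return sample_a, sample_b

def pooled_t(sample_a, sample_b):
    """Independent-samples t statistic assuming equal variances, with its df."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_variance = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_error, n_a + n_b - 2

group_1, group_2 = split_by_group(values, group, 1, 2)
t_statistic, df = pooled_t(group_1, group_2)
print(f"t = {t_statistic:.3f}({df})")
```

The two codes passed to `split_by_group` play the same role as the values entered under Define Groups.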

The output (in this case for a paired t test) includes:
| | | Mean | N | Std. Deviation | Std. Error Mean |
|---|---|---|---|---|---|
| Pair 1 | Before Treatment | 40.0000 | 12 | 9.29320 | 2.68272 |
| | After Treatment | 47.0000 | 12 | 11.02889 | 3.18377 |

| | | N | Correlation | Sig. |
|---|---|---|---|---|
| Pair 1 | Before Treatment & After Treatment | 12 | .545 | .067 |

| | | Paired Differences: Mean | Paired Differences: Std. Deviation | Paired Differences: Std. Error Mean | 95% Confidence Interval of the Difference: Lower | 95% Confidence Interval of the Difference: Upper | t | df | Sig. (2-tailed) |
|---|---|---|---|---|---|---|---|---|---|
| Pair 1 | Before Treatment - After Treatment | -7.00000 | 9.80723 | 2.83110 | -13.23122 | -.76878 | -2.473 | 11 | .031 |
The output reports the correlation coefficient for the datasets (similarity) as well as the t test result (difference). In the above example, Sig < .05, so there is evidence of a significant difference between the means of the two groups.
Formal Reporting: When you report the outcome of the t test, cite the value of the t statistic, the number of degrees of freedom (in brackets) and significance value, e.g:
There was a significant difference in the mean scores of the patients before and after treatment, t = -2.473(11), p < .05.
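The figures in the paired t test output follow from the mean and standard deviation of the paired differences. As a sketch, the code below recomputes the standard error, t statistic, degrees of freedom and 95% confidence interval from the values in the table above. The critical value 2.201 is the standard two-tailed 5% value of t for 11 degrees of freedom taken from t tables; it is not part of the SPSS output.

```python
import math

# Values copied from the Paired Differences columns above
mean_difference = -7.0
sd_difference = 9.80723
n_pairs = 12

# Two-tailed 5% critical value of t for 11 df, from standard t tables
t_critical = 2.201

std_error = sd_difference / math.sqrt(n_pairs)
t_statistic = mean_difference / std_error
df = n_pairs - 1

# 95% confidence interval of the mean difference
ci_lower = mean_difference - t_critical * std_error
ci_upper = mean_difference + t_critical * std_error

print(f"t = {t_statistic:.3f}({df})")
print(f"95% CI: {ci_lower:.3f} to {ci_upper:.3f}")
```

Because the confidence interval does not include zero, it agrees with the significance value of .031.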
Assumptions: The χ² test is a non-parametric test which assumes that the data analyzed:
IMPORTANT:
This is an example of a non-parametric test. χ² (pronounced "kye-square") is used when data consists of nominal or ordinal variables rather than quantitative variables, i.e. when we are interested in how many members fall into given descriptive categories (not for quantitative measurements, such as weight, etc). The χ² test is by default one-tailed and should only be carried out on raw data (not percentages, proportions or other derived data). Based on the decision tree table above, a "real" statistician would advise using tests such as the Mann-Whitney U Test or Wilcoxon Test for Paired Data in cases where the t test is not appropriate. The χ² test should be reserved for categorical data, e.g. male or female, pregnant or not pregnant.
There are two versions of the χ² test in SPSS:
1. One-way chi-square goodness of fit test:
Used to analyze whether a frequency distribution for a categorical or nominal variable is consistent with expectations:
Example:
| | Observed: | Expected: |
|---|---|---|
| Boys: | 762 | 792 |
| Girls: | 822 | 792 |
| Total: | 1584 | 1584 |
As with the t-test, the data is entered into SPSS with dummy coding variables which are used to define groups (e.g. sex in this example):


Note that the "coding data" (Sex in this case) must be entered as a numerical variable not as alphanumeric data or SPSS will not perform the test. The Variable View window can be used to add labels to the coded data. After the data has been entered, click Data: Weight Cases and weight the cases by the variable to be analyzed (Freq in this case).

Note that there is no need to enter the expected data in this case since SPSS can calculate equal expected values. Select Analyze: Nonparametric Tests: Chi-square, select Test Variable (Freq in this case) and click OK. Output:
| | Observed N | Expected N | Residual |
|---|---|---|---|
| 762 | 762 | 792.0 | -30.0 |
| 822 | 822 | 792.0 | 30.0 |
| Total | 1584 | | |

| | Freq |
|---|---|
| Chi-Square | 2.273 |
| df | 1 |
| Asymp. Sig. | .132 |
So in this case, we conclude:
There is no significant difference between the numbers of observed and expected births, chi-square = 2.273(1, N=1584), p > .05.
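The goodness of fit result can be checked by hand. The sketch below recomputes the chi-square statistic for the boys and girls example, assuming equal expected frequencies as SPSS does by default. The closed-form p value used here, `erfc(sqrt(chi2 / 2))`, is valid only when df = 1; it is a shortcut chosen for this sketch and not how SPSS is documented to calculate it.

```python
import math

# Observed frequencies from the example above
observed = {"Boys": 762, "Girls": 822}

total = sum(observed.values())
expected = total / len(observed)  # equal expected frequencies (792 each)

# Sum of (observed - expected)^2 / expected over the categories
chi_square = sum((count - expected) ** 2 / expected for count in observed.values())
df = len(observed) - 1

# Upper-tail p value; this closed form holds for df = 1 only
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}({df}, N={total}), p = {p_value:.3f}")
```

The p value of about .132 is greater than .05, which is why the difference is reported as not significant.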
2. Chi square test of independence:
Used to analyze whether two categorical or nominal variables are related or associated with each other.
Example:
| | Biology: | Psychology: |
|---|---|---|
| Males: | 41 | 39 |
| Females: | 62 | 58 |
Enter the data as follows:

As in the previous examples, the dummy coding variables (Gender and Subject in this case) must be entered as numerical variables, not as alphanumeric data, or SPSS will not perform the test. Remember: The Data View window only contains data (numbers). The Variable View window should be used to add labels to the coded data so you can interpret the output. After the data has been entered, click Data: Weight Cases and weight the cases by the variable to be analyzed (Freq in this case). Then select Analyze: Descriptive Statistics: Crosstabs, assign one coding variable to Row(s) and the other to Column(s), click Statistics and tick Chi-square.
Output:
| | | Subject: Biology | Subject: Psychology | Total |
|---|---|---|---|---|
| Gender | Male | 41 | 39 | 80 |
| | Female | 62 | 58 | 120 |
| Total | | 103 | 97 | 200 |

| | Value | df | Asymp. Sig. (2-sided) | Exact Sig. (2-sided) | Exact Sig. (1-sided) |
|---|---|---|---|---|---|
| Pearson Chi-Square | .003(a) | 1 | .954 | | |
| Continuity Correction | .000 | 1 | 1.000 | | |
| Likelihood Ratio | .003 | 1 | .954 | | |
| Fisher's Exact Test | | | | 1.000 | .534 |
| Linear-by-Linear Association | .003 | 1 | .954 | | |
| N of Valid Cases | 200 | | | | |

a: 0 cells (.0%) have expected count less than 5. The minimum expected count is 38.80.
Since none of the cells have an expected count of less than 5, we can ignore Fisher's Exact test and use the (Pearson) Chi-Square result. In this case, we conclude:
There is no significant association between the gender of the students and the course taken, chi-square = .003 (df=1, N=200), p > .05.
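The Pearson Chi-Square value and the minimum expected count in the footnote can both be reproduced from the 2 × 2 table. In the sketch below, each expected count is row total × column total / N. As before, the `erfc` shortcut for the p value is an assumption of this sketch that holds only for df = 1, and the variable names are illustrative.

```python
import math

# Observed counts: rows are Male, Female; columns are Biology, Psychology
observed = [[41, 39], [62, 58]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence
expected = [[r * c / n for c in col_totals] for r in row_totals]
min_expected = min(min(row) for row in expected)

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail p value; this closed form holds for df = 1 only
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"Minimum expected count = {min_expected:.2f}")
print(f"chi-square = {chi_square:.3f} (df={df}, N={n}), p = {p_value:.3f}")
```

With a p value of about .954, the observed counts are almost exactly what independence between gender and subject would predict.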
© MicrobiologyBytes 2009.