MicrobiologyBytes: Maths & Computers for Biologists: Comparing Populations with SPSS  Updated: February 6, 2009
Never, ever, run any statistical test without performing EDA first!
Regression or Correlation? Regression and correlation are similar and easily confused, and in some situations it makes sense to perform both calculations. Calculate a correlation if neither variable is regarded as dependent on the other and you want to measure the degree of association between them. Calculate a regression only if one variable can be treated as explanatory (a predictor) and the other as a response, so that it makes sense to predict one from the other.
Correlation or Regression? Correlation makes no assumption as to whether one variable is dependent on the other(s) and is not concerned with the form of the relationship between variables; instead it gives an estimate of the degree of association between them. Regression attempts to describe the dependence of a variable on one (or more) explanatory variables; it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect.
The best way to appreciate this difference is by example. 
There are numerous methods for calculating correlations, e.g. Pearson's product-moment correlation coefficient (for parametric data), and Spearman's rho or Kendall's tau (for nonparametric data).
In SPSS:
As with other tests in SPSS, variables must be entered in separate columns. Before examining correlation, always draw a scatterplot to identify outliers and ensure the data is suitable for analysis:
Graphs: Scatter and select Simple Scatter or Matrix Scatter (for more than one pair of variables)
A correlation between two variables is called a bivariate correlation. In SPSS, select:
Analyze: Correlate: Bivariate
Select the variables you are interested in comparing and the type of correlation analysis you want to perform:
The output reports the correlation coefficient and whether the outcome is statistically significant (i.e. p <.05).
REMEMBER: Correlation indicates whether there is a relationship between two variables, but not what causes the relationship or what the relationship means! Correlation can tell you whether there is a relationship between the chicken and the egg, but not whether eggs come from chickens, or chickens come from eggs!! 
Formal Reporting: When you report the outcome of a correlation test, cite the value of the coefficient, the degrees of freedom (N − 2) in brackets and the significance value, e.g.:
There is a weak/strong/etc. correlation between variableA and variableB, r = 0.55(12), p (one- or two-tailed) < .05. 
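The arithmetic behind the coefficient is simple enough to check outside SPSS. As a minimal sketch, Pearson's r can be computed in a few lines of Python; the data values here are invented purely for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Sum of cross-products of deviations from the means
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented example: two variables measured on the same 6 subjects
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 5]
r = pearson_r(x, y)
df = len(x) - 2            # degrees of freedom reported as N - 2
print(round(r, 3), df)     # 0.792 4
```

The degrees of freedom printed here follow the N − 2 convention used in the formal reporting example above.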
Regression predicts an outcome from one or more variables. Simple regression examines the outcome from a single predictor variable; in multiple regression, several variables are combined to predict an outcome. In practice, this involves constructing a simple statistical model consisting of a straight line, the line of best fit between the variables. The line is fitted by considering the differences between the data points and the line, which are known as residuals. To perform simple regression in SPSS:
Analyze: Regression: Linear
The output includes the Model Summary, e.g:
Model  R     R Square  Adjusted R Square  Std. Error of the Estimate
1      .397  .157      .149               23.92947
R = the correlation coefficient
R^{2} = the coefficient of determination. This value indicates how much of the variation in one variable is accounted for by the other variable. In the example above, R^{2} = 0.157, so 15.7% of the variation in the outcome is accounted for by the predictor variable (and 84.3% of the variation is caused by something else!).
Model          Sum of Squares  df   Mean Square  F       Sig.
1  Regression  10802.625       1    10802.625    18.865  .000
   Residual    57834.579       101  572.620
   Total       68637.204       102
The ANOVA output reports whether the model results in statistically significant prediction. In this case, Significance = .000, so the outcome is statistically significant (i.e. p <.05).
Model          Unstandardized Coefficients  Standardized Coefficients  t       Sig.
               B       Std. Error           Beta
1  (Constant)  45.321  3.503                                           12.938  .000
   Predictor   .567    .130                 .397                       4.343   .000
The coefficients output makes specific predictions about the magnitude of the effects predicted and confirms the significance of the model.
Formal Reporting: When you report the outcome of simple regression, cite the value of R^{2}, the number of degrees of freedom and significance value, e.g:
VariableA strongly predicted VariableB, R^{2} = .157(1, 101), p < .001. 
To plot a regression line in SPSS:
Analyze: Regression: Curve Estimation: Linear (or other if required)
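For readers who want to see what SPSS is doing under the hood, a minimal least-squares fit can be sketched in Python. The example points are invented and chosen to lie exactly on a line so the expected output is obvious:

```python
def linear_fit(x, y):
    """Least-squares regression line y = a + b*x.

    Returns the intercept a, slope b and the coefficient of determination
    R^2 (the proportion of the variation in y accounted for by x).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    b = sxy / sxx                 # slope
    a = my - b * mx               # intercept
    ss_tot = sum((v - my) ** 2 for v in y)
    # Residuals: the differences between the data points and the fitted line
    ss_res = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
    r2 = 1 - ss_res / ss_tot
    return a, b, r2

# Invented example: points lying exactly on y = 1 + 2x, so R^2 is 1
a, b, r2 = linear_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b, r2)   # 1.0 2.0 1.0
```

With real, noisy data the residuals are non-zero and R^2 falls below 1, as in the Model Summary output above.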
Assumptions: The t-test is a parametric test which assumes that the data analyzed:
are measured on an interval or ratio scale
are approximately symmetrically (normally) distributed
have homogeneous variances between the groups being compared
If you use the t-test under other circumstances, the results will be meaningless! In other situations, nonparametric tests should be used to compare the groups, e.g. the Mann-Whitney U test or the Wilcoxon test (see the table below):
In order to compare three or more groups, other tests must be used, e.g. ANOVA (next week). NEVER perform multiple t-tests! 
The t-test is used to compare two groups and comes in three versions: the one-sample t-test (compares a sample mean with a known value), the independent-samples t-test (compares two unrelated groups) and the paired-samples t-test (compares two sets of scores from the same subjects).
Where correlation and regression are used to examine relationships between datasets, the t test is the primary example of how to examine differences between datasets, so, as with any statistical investigation, the starting point is to think about what question you want to ask. In SPSS:
If the data does not meet the criteria for a t test, it is sometimes possible to transform it so that the test can be used, but not always. In other cases, alternative tests must be used:
Dependency          Variable           Distribution                      Test                                     SPSS
Independent Scores  Interval or Ratio  Symmetric, Homogeneous            t Test                                   Analyze: Compare Means: Independent-Samples T Test (Equal Variances Assumed output)
                                       Symmetric, Non-homogeneous        Welch's t Test                           Analyze: Compare Means: Independent-Samples T Test (Equal Variances Not Assumed output)
                                       Skewed                            Mann-Whitney U (Wilcoxon Rank Sum Test)  Analyze: Nonparametric Tests: 2 Independent Samples: Mann-Whitney
                    Ordinal            n/a                               Mann-Whitney U                           Analyze: Nonparametric Tests: 2 Independent Samples: Mann-Whitney
Related Scores      Interval or Ratio  Symmetric, Difference Scores      Paired-Samples t Test                    Analyze: Compare Means: Paired-Samples T Test
                                       Non-symmetric, Difference Scores  Wilcoxon Test for Paired Data            Analyze: Nonparametric Tests: 2 Related Samples: Wilcoxon
                    Ordinal            n/a                               Wilcoxon Test for Paired Data            Analyze: Nonparametric Tests: 2 Related Samples: Wilcoxon
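The difference between the "Equal Variances Assumed" and "Equal Variances Not Assumed" rows of the table lies in how the t statistic is computed: the first pools the two group variances, the second (Welch's test) keeps them separate. A sketch in Python, with invented sample data:

```python
import math

def t_pooled(x, y):
    """Independent-samples t, equal variances assumed (pooled variance)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)   # sample variances
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)   # pooled variance
    return (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))

def t_welch(x, y):
    """Welch's t, equal variances NOT assumed: each group keeps its own variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((a - mx) ** 2 for a in x) / (nx - 1)
    vy = sum((b - my) ** 2 for b in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Invented groups with equal sizes and equal variances: both versions agree
g1 = [1, 2, 3]
g2 = [4, 5, 6]
print(round(t_pooled(g1, g2), 3), round(t_welch(g1, g2), 3))   # -3.674 -3.674
```

When the group sizes or variances differ, the two statistics (and their degrees of freedom) diverge, which is why SPSS prints both rows of output.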
Performing a paired T test in SPSS is straightforward:
Analyze: Compare Means: Paired-Samples T Test
and select the pair of variables to be compared.
Performing an independent-samples T test is slightly more complicated. SPSS requires that, in addition to the values to be compared, a grouping variable must be entered. This is a dummy coding variable (remember, in SPSS, variables are always columns) which indicates the group each value belongs to:
Analyze: Compare Means: Independent-Samples T Test
Select the Test Variable (data), and the Grouping Variable (dummy). Click the Define Groups button and use the dummy coding variable you entered to select the groups to be compared (don't forget to label the groups in the Variable View window so you know what the output means!):
The output (in this case for a paired t test) includes:
                          Mean     N   Std. Deviation  Std. Error Mean
Pair 1  Before Treatment  40.0000  12  9.29320         2.68272
        After Treatment   47.0000  12  11.02889        3.18377

                                            N   Correlation  Sig.
Pair 1  Before Treatment & After Treatment  12  .545         .067
        Paired Differences                                                          t       df  Sig. (2-tailed)
        Mean      Std. Deviation  Std. Error Mean  95% Confidence Interval
                                                   of the Difference
                                                   Lower      Upper
Pair 1  Before Treatment - After Treatment
        -7.00000  9.80723         2.83110          -13.23122  -.76878               -2.473  11  .031
The output reports the correlation coefficient for the datasets (similarity) as well as the t test result (difference). In the above example, Sig < .05, so there is evidence of a significant difference between the means of the two groups.
Formal Reporting: When you report the outcome of the t test, cite the value of the t statistic, the number of degrees of freedom (in brackets) and significance value, e.g:
There was a significant difference in the mean scores of the patients before and after treatment, t = -2.473(11), p < .05. 
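The t value in the output above can be reproduced from the summary statistics SPSS reports: the mean of the Before − After difference scores (40 − 47 = −7), their standard deviation, and N. A sketch in Python:

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """Paired t statistic: mean difference / standard error of the difference."""
    se = sd_diff / math.sqrt(n)   # standard error of the mean difference
    return mean_diff / se

# Summary values from the SPSS paired-samples output above
t = paired_t(-7.0, 9.80723, 12)
print(round(t, 3), 12 - 1)   # -2.473 on 11 degrees of freedom
```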
Assumptions: The χ^{2} test is a nonparametric test which assumes that the data analyzed consist of raw frequency counts of independent observations sorted into mutually exclusive categories.
IMPORTANT: the test is unreliable if any cell has an expected frequency of less than 5 (SPSS reports this in a footnote to the output).
This is an example of a nonparametric test. χ^{2} (pronounced "kye-square") is used when data consist of nominal or ordinal variables rather than quantitative variables, i.e. when we are interested in how many members fall into given descriptive categories (not in quantitative measurements, such as weight, etc.). The χ^{2} test is by default one-tailed and should only be carried out on raw data (not percentages, proportions or other derived data). Based on the decision tree table above, a "real" statistician would advise using tests such as the Mann-Whitney U Test or the Wilcoxon Test for Paired Data in cases where the t-test is not appropriate. The χ^{2} test should be reserved for categorical data, e.g. male or female, pregnant or not pregnant.
There are two versions of the χ^{2} test in SPSS:
1. One-way chi-square goodness of fit test:
Used to analyze whether a frequency distribution for a categorical or nominal variable is consistent with expectations:
Example:
         Observed:  Expected:
Boys:    762        792
Girls:   822        792
Total:   1584       1584
As with the t-test, the data is entered into SPSS with dummy coding variables which are used to define groups (e.g. sex in this example):
Note that the "coding data" (Sex in this case) must be entered as a numerical variable not as alphanumeric data or SPSS will not perform the test. The Variable View window can be used to add labels to the coded data. After the data has been entered, click Data: Weight Cases and weight the cases by the variable to be analyzed (Freq in this case).
Note that there is no need to enter the expected data in this case since SPSS can calculate equal expected values. Select Analyze: Nonparametric Tests: Chisquare, select Test Variable (Freq in this case) and click OK. Output:
       Observed N  Expected N  Residual
762    762         792.0       -30.0
822    822         792.0       30.0
Total  1584

             Freq
Chi-Square   2.273
df           1
Asymp. Sig.  .132
So in this case, we conclude:
There is no significant difference between the number of observed and expected births, chi-square = 2.273(1, N = 1584), p > .05. 
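The goodness-of-fit statistic in the output above is easy to verify by hand; a sketch in Python using the observed and expected frequencies from the example:

```python
def chi_square(observed, expected):
    """Chi-square statistic: the sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Boys and girls from the example; SPSS assumes equal expected frequencies
observed = [762, 822]
expected = [sum(observed) / 2] * 2    # 792 in each category
print(round(chi_square(observed, expected), 3))   # 2.273
```

With 2 categories there is 1 degree of freedom, matching the df row of the SPSS output.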
2. Chi-square test of independence:
Used to analyze whether two categorical or nominal variables are related or associated with each other.
Example:
          Biology:  Psychology:
Males:    41        39
Females:  62        58
Enter the data as follows:
As in the previous examples, the dummy coding variable (Gender and Subject in this case) must be entered as numerical variables not as alphanumeric data or SPSS will not perform the test. Remember: The Data View window only contains data (numbers). The Variable View window should be used to add labels to the coded data so you can interpret the output. After the data has been entered, click Data: Weight Cases and weight the cases by the variable to be analyzed (Freq in this case).
Output:
                Subject              Total
                Biology  Psychology
Gender  Male    41       39          80
        Female  62       58          120
Total           103      97          200
                              Value    df  Asymp. Sig. (2-sided)  Exact Sig. (2-sided)  Exact Sig. (1-sided)
Pearson Chi-Square            .003(a)  1   .954
Continuity Correction         .000     1   1.000
Likelihood Ratio              .003     1   .954
Fisher's Exact Test                                               1.000                 .534
Linear-by-Linear Association  .003     1   .954
N of Valid Cases              200
a: 0 cells (.0%) have expected count less than 5. The minimum expected count is 38.80. 
Since none of the cells have an expected count of less than 5, we can ignore Fisher's Exact test and use the (Pearson) ChiSquare result. In this case, we conclude:
There is no significant difference in the gender balance of students on the two courses, chi-square = .003(df = 1, N = 200), p > .05. 
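The Pearson chi-square value in the crosstab output can be reproduced directly: each expected count comes from its row total × column total / grand total, and the same (O − E)²/E sum is then taken over all four cells. A sketch in Python:

```python
def chi_square_independence(table):
    """Pearson chi-square for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand   # expected count
            stat += (o - e) ** 2 / e
    return stat

# Gender x Subject counts from the example above
table = [[41, 39],    # males:   Biology, Psychology
         [62, 58]]    # females: Biology, Psychology
print(round(chi_square_independence(table), 3))   # 0.003
```

The tiny value reflects how close the observed counts sit to the expected counts (41.2, 38.8, 61.8 and 58.2), which is why the test finds no association.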
© MicrobiologyBytes 2009.