MicrobiologyBytes: Maths & Computers for Biologists: Comparing Populations with SPSS (Updated: February 6, 2009)

Statistics with SPSS

Never, ever, run any statistical test without performing EDA first!

Correlation & Regression:

Regression or Correlation?

Regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate correlation if:

  • You measured both X and Y in each subject and wish to quantify how well they are associated.
  • Calculate the Pearson (parametric) correlation coefficient if you can assume that both X and Y are sampled from normally-distributed populations.
  • Otherwise calculate the Spearman (nonparametric) correlation coefficient.
  • Don't calculate a correlation coefficient if you manipulated the X variable (e.g. in an experiment).

Calculate regressions only if:

  • One of the variables (X) is likely to precede or cause the other variable (Y).
  • Choose linear regression if you manipulated the X variable, e.g. in an experiment. It makes a difference which variable is called X and which is called Y, as linear regression calculations are not symmetrical with respect to X and Y. If you swap the two variables, you will obtain a different regression line.
  • In contrast, correlation calculations are symmetrical with respect to X and Y. If you swap the labels X and Y, you will still get the same correlation coefficient.

Correlation or Regression?

Correlation makes no assumption as to whether one variable is dependent on the other(s) and is not concerned with describing the form of the relationship between variables; instead it estimates the degree of association between them. Regression attempts to describe the dependence of a response variable on one (or more) explanatory variables; it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. The best way to appreciate this difference is by example.
Consider samples of the leg length and skull size from a population of elephants. It would be reasonable to suggest that these two variables are associated in some way, as elephants with short legs tend to have small heads and elephants with long legs tend to have big heads. We might demonstrate an association exists by performing a correlation analysis. However, would regression be an appropriate tool to describe a relationship between head size and leg length? Does an increase in skull size cause an increase in leg length? Does a decrease in leg length cause the skull to shrink? It is meaningless to apply a causal regression analysis to these variables as they are interdependent and one is not wholly dependent on the other, but more likely some other factor that affects them both (e.g. food supply, genetic makeup).
On the other hand, consider these two variables: crop yield and temperature. They are measured independently, one by a weather station thermometer and the other by Farmer Giles' scales. While correlation analysis might show a high degree of association between these two variables, regression analysis might also be able to demonstrate the dependence of crop yield on temperature. However, careless use of regression analysis would also demonstrate that temperature is dependent on crop yield: this would suggest that if you grow really big crops you'll be guaranteed a hot summer! Dumb, or what?

There are numerous methods for calculating correlations, e.g. the Pearson (parametric) and Spearman (nonparametric) coefficients described above.

In SPSS:

As with other tests in SPSS, variables must be entered in separate columns. Before examining correlation, always draw a scatterplot to identify outliers and ensure the data is suitable for analysis:

Graphs: Scatter and select Simple Scatter or Matrix Scatter (for more than one pair of variables)

A correlation between two variables is called a bivariate correlation. In SPSS, select:

Analyze: Correlate: Bivariate

Select the variables you are interested in comparing and the type of correlation analysis you want to perform:

The output reports the correlation coefficient and whether the outcome is statistically significant (i.e. p <.05).

REMEMBER: Correlation indicates whether there is a relationship between two variables, but not what causes the relationship or what the relationship means! Correlation can tell you whether there is a relationship between the chicken and the egg, but not whether eggs come from chickens, or chickens come from eggs!!

Formal Reporting:  When you report the outcome of a correlation test, cite the value of the coefficient, degrees of freedom (N - 2) in brackets and the significance value, e.g:

There is a weak/strong/etc correlation between variableA and variableB, r = 0.55(12), p (one or two-tailed) < .05.
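For readers who want to check the arithmetic outside SPSS, the same coefficients can be reproduced with Python's SciPy library. This is only a sketch: the x and y values below are hypothetical paired measurements on 14 subjects, not data from this course.

  from scipy import stats

  # Hypothetical paired measurements (one x and one y per subject)
  x = [2.1, 3.4, 4.0, 4.8, 5.5, 6.1, 7.2, 8.0, 8.6, 9.3, 10.1, 11.0, 11.8, 12.5]
  y = [3.0, 3.1, 4.5, 4.2, 5.9, 5.5, 7.0, 7.8, 8.1, 8.8, 9.0, 10.5, 11.2, 12.0]

  r, p = stats.pearsonr(x, y)        # parametric: both variables assumed normally distributed
  rho, p_s = stats.spearmanr(x, y)   # nonparametric alternative

  n = len(x)
  print(f"Pearson r({n - 2}) = {r:.2f}, p = {p:.3f}")      # df = N - 2, as in the report above
  print(f"Spearman rho({n - 2}) = {rho:.2f}, p = {p_s:.3f}")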

 

Regression predicts an outcome from one or more variables. Simple regression predicts the outcome from a single variable; multiple regression combines several variables to predict an outcome. In practice, this involves constructing a simple statistical model consisting of a straight line, the line of best fit between the variables. The fit is assessed by considering the differences between the data points and this line, which are known as residuals. To perform simple regression in SPSS:

Analyze: Regression: Linear

The output includes the Model Summary, e.g:

Model Summary:
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .397 .157 .149 23.92947

R = the correlation coefficient
R2 = the coefficient of determination. This value indicates how much of the variation in the outcome variable is accounted for by the predictor variable. In the example above, R2 = 0.157, so 15.7% of the variation in the outcome is explained by the predictor variable (and 84.3% of the variation is due to other factors!).

ANOVA
Model Sum of Squares df Mean Square F Sig.
1 Regression 10802.625 1 10802.625 18.865 .000
Residual 57834.579 101 572.620    
Total 68637.204 102      

The ANOVA output reports whether the model results in statistically significant prediction. In this case SPSS reports Sig. = .000 (i.e. p < .001), so the model is statistically significant (p <.05).



Coefficients
Model
Unstandardized Coefficients Standardized Coefficients t Sig.
B Std. Error Beta
1 (Constant) 45.321 3.503   12.938 .000
Predictor .567 .130 .397 4.343 .000

The coefficients output gives the magnitude of the predicted effects (the regression intercept and slope) and confirms the significance of the model.

Formal Reporting:  When you report the outcome of simple regression, cite the value of R2, the number of degrees of freedom and significance value, e.g:

VariableA significantly predicted VariableB, R2 = .157 (df = 1, 101), p < .001.
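As a rough cross-check outside SPSS, simple regression can be reproduced with SciPy's linregress function. The predictor and outcome arrays below are hypothetical; with your own data the printed values should match the Model Summary, ANOVA and Coefficients tables.

  from scipy import stats

  # Hypothetical predictor and outcome measured on the same cases
  predictor = [10, 12, 15, 17, 20, 22, 25, 28, 30, 33, 35, 38]
  outcome   = [52, 50, 57, 55, 61, 58, 66, 64, 70, 67, 73, 75]

  fit = stats.linregress(predictor, outcome)

  print("B (slope)     =", round(fit.slope, 3))        # cf. Coefficients table, Predictor B
  print("Constant      =", round(fit.intercept, 3))    # cf. Coefficients table, (Constant) B
  print("R             =", round(fit.rvalue, 3))       # Model Summary, R
  print("R Square      =", round(fit.rvalue ** 2, 3))  # Model Summary, R Square
  print("Sig. of slope =", round(fit.pvalue, 4))       # same p as the ANOVA F test for one predictor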

To plot a regression line in SPSS:

Analyze: Regression: Curve Estimation: Linear (or other if required)
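Outside SPSS, a quick way to draw the equivalent plot is matplotlib; this sketch reuses the hypothetical predictor and outcome arrays from the previous example.

  import numpy as np
  import matplotlib.pyplot as plt
  from scipy import stats

  predictor = np.array([10, 12, 15, 17, 20, 22, 25, 28, 30, 33, 35, 38])
  outcome   = np.array([52, 50, 57, 55, 61, 58, 66, 64, 70, 67, 73, 75])

  fit = stats.linregress(predictor, outcome)

  plt.scatter(predictor, outcome, label="data")                       # scatterplot of the raw data
  plt.plot(predictor, fit.intercept + fit.slope * predictor, "r-",
           label="line of best fit")                                  # fitted regression line
  plt.xlabel("Predictor")
  plt.ylabel("Outcome")
  plt.legend()
  plt.show()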

 

Student's t-test:

 

Assumptions:

The t-test is a parametric test which assumes that the data analyzed:

  1. Are continuous, interval data comprising a whole population or sampled randomly from a population.
  2. Have a normal distribution.
  3. Come from groups whose sample sizes do not differ hugely.

If you use the t-test under other circumstances, the results will be meaningless!

In other situations, non-parametric tests should be used to compare the groups, e.g. the Mann-Whitney U test or the Wilcoxon test for paired data (see the decision tree below).

In order to compare three or more groups, other tests must be used, e.g. ANOVA (next week). NEVER perform multiple t-tests!

The t-test is used to compare two groups and comes in three versions: the independent-samples t test, Welch's t test (for groups with unequal variances) and the paired-samples t test (see the decision tree below).

Where correlation and regression are used to examine relationships between datasets, the t test is the primary example of how to examine differences between datasets. As with any statistical investigation, the starting point is to think about what question you want to ask.

If the data does not meet the criteria for a t test, it is sometimes possible to transform it so that the test can be used, but not always. In other cases, alternative tests must be used:

Decision Tree:

  Dependency         | Variable          | Distribution                   | Test                                    | SPSS
  Independent Scores | Interval or Ratio | Symmetric, homogeneous         | t Test                                  | Analyze: Compare Means: Independent-Samples T Test (Equal Variances Assumed row of the output)
  Independent Scores | Interval or Ratio | Symmetric, nonhomogeneous      | Welch's t Test                          | Analyze: Compare Means: Independent-Samples T Test (Equal Variances Not Assumed row of the output)
  Independent Scores | Interval or Ratio | Skewed                         | Mann-Whitney U (Wilcoxon Rank Sum) Test | Analyze: Nonparametric Tests: 2 Independent Samples: Mann-Whitney
  Independent Scores | Ordinal           | n/a                            | Mann-Whitney U (Wilcoxon Rank Sum) Test | Analyze: Nonparametric Tests: 2 Independent Samples: Mann-Whitney
  Related Scores     | Interval or Ratio | Symmetric difference scores    | Paired-Samples t Test                   | Analyze: Compare Means: Paired-Samples T Test
  Related Scores     | Interval or Ratio | Nonsymmetric difference scores | Wilcoxon Test for Paired Data           | Analyze: Nonparametric Tests: 2 Related Samples: Wilcoxon
  Related Scores     | Ordinal           | n/a                            | Wilcoxon Test for Paired Data           | Analyze: Nonparametric Tests: 2 Related Samples: Wilcoxon
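For reference only (these are not SPSS commands), each branch of the decision tree has a SciPy equivalent. The groupA/groupB and before/after values below are hypothetical scores used purely to make the sketch runnable.

  from scipy import stats

  groupA = [23, 25, 28, 30, 31, 27, 26, 29]      # hypothetical independent samples
  groupB = [31, 33, 35, 30, 36, 34, 32, 37]
  before = [40, 38, 45, 50, 36, 41, 44, 39]      # hypothetical related (paired) scores
  after  = [47, 42, 50, 55, 40, 46, 49, 44]

  # Each call returns (test statistic, p-value)
  print(stats.ttest_ind(groupA, groupB))                   # t test, equal variances assumed
  print(stats.ttest_ind(groupA, groupB, equal_var=False))  # Welch's t test, equal variances not assumed
  print(stats.mannwhitneyu(groupA, groupB))                # Mann-Whitney U, for skewed or ordinal data
  print(stats.ttest_rel(before, after))                    # paired-samples t test
  print(stats.wilcoxon(before, after))                     # Wilcoxon test for paired data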

Performing a paired T test in SPSS is straightforward:

Analyze: Compare Means: Paired-Samples T Test

and select the pair of variables to be compared.

Performing an independent samples T test is slightly more complicated. SPSS requires that in addition to the values to be compared, a grouping variable must be entered. This is a dummy coding variable (remember, in SPSS, variables are always columns) which indicates which group each case belongs to, e.g. a column coded 1 for one group and 2 for the other.

Analyze: Compare Means: Independent-Samples T Test

Select the Test Variable (data), and the Grouping Variable (dummy). Click the Define Groups button and use the dummy coding variable you entered to select the groups to be compared (don't forget to label the groups in the Variable View window so you know what the output means!).
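The grouping-variable layout itself is easy to mimic outside SPSS: one column of scores and one column of dummy codes, split by code before testing. A small sketch with hypothetical scores and codes:

  from scipy import stats

  # Long format, as in the SPSS Data View: one score column, one dummy coding column
  score = [23, 25, 28, 30, 31, 27, 31, 33, 35, 30, 36, 34]
  group = [ 1,  1,  1,  1,  1,  1,  2,  2,  2,  2,  2,  2]   # 1 = group A, 2 = group B

  group1 = [s for s, g in zip(score, group) if g == 1]
  group2 = [s for s, g in zip(score, group) if g == 2]

  t, p = stats.ttest_ind(group1, group2)      # independent-samples t test on the two groups
  print(f"t = {t:.3f}, p = {p:.3f}")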


The output (in this case for a paired t test) includes:

Paired Samples Statistics:
Mean N Std. Deviation Std. Error Mean
Pair 1 Before Treatment 40.0000 12 9.29320 2.68272
After Treatment 47.0000 12 11.02889 3.18377
Paired Samples Correlations:
N Correlation Sig.
Pair 1 Before Treatment & After Treatment 12 .545 .067






Paired Samples Test:
Paired Differences t df Sig. (2-tailed)
Mean Std. Deviation Std. Error Mean 95% Confidence Interval of the Difference
Lower Upper
Pair 1 Before Treatment - After Treatment -7.00000 9.80723 2.83110 -13.23122 -.76878 -2.473 11 .031

The output reports the correlation coefficient for the datasets (similarity) as well as the t test result (difference). In the above example, Sig < .05, so there is evidence of a significant difference between the means of the two groups.

Formal Reporting:  When you report the outcome of the t test, cite the value of the t statistic, the number of degrees of freedom (in brackets) and significance value, e.g:

There was a significant difference in the mean scores of the patients before and after treatment, t = -2.473(11), p < .05.
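The t statistic in the table above follows directly from the paired-differences summary (mean, standard deviation and N), which can be checked outside SPSS; only the figures already reported are used here.

  import math
  from scipy import stats

  mean_diff, sd_diff, n = -7.0, 9.80723, 12       # from the Paired Samples Test table
  se = sd_diff / math.sqrt(n)                     # Std. Error Mean = 2.831
  t = mean_diff / se                              # t = -2.473
  df = n - 1                                      # df = 11
  p = 2 * stats.t.sf(abs(t), df)                  # two-tailed Sig. = .031
  print(f"t({df}) = {t:.3f}, p = {p:.3f}")

  # With the raw before/after scores, stats.ttest_rel(before, after) gives the same result.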

 

 

χ²-test (chi-square test):

Assumptions:

The χ² test is a non-parametric test which assumes that the data analyzed:

  1. Consist of nominal or ordinal category variables (i.e. each case can only be in one category or another). Therefore, this test cannot be used with repeated measures (paired) study designs.
  2. Consist of entire populations or be randomly sampled from the population.
  3. No cell count should be zero (if so, use Fisher's exact test - included in the SPSS output).
  4. 80% of the expected frequencies should be 5 or more (if not, try aggregating groups or use Fisher's exact test for small sample sizes).
If you use the χ² test under other circumstances, the results will be meaningless!

IMPORTANT:

  • Note that acceptance or rejection of the null hypothesis can only be interpreted strictly in terms of the question asked, e.g. "There is a difference between the groups" or "There is no difference between the groups", and NOT EXTRAPOLATED to "There is a difference between the groups because...".

This is an example of a non-parametric test. χ² (pronounced "kye-square") is used when data consists of nominal or ordinal variables rather than quantitative variables, i.e. when we are interested in how many members fall into given descriptive categories (not for quantitative measurements, such as weight, etc). The χ² test is by default one-tailed and should only be carried out on raw data (not percentages, proportions or other derived data). Based on the decision tree table above, a "real" statistician would advise using tests such as the Mann-Whitney U Test or Wilcoxon Test for Paired Data in cases where the t test is not appropriate. The χ² test should be reserved for categorical data, e.g. male or female, pregnant or not pregnant.

There are two versions of the χ² test in SPSS:

1. One-way chi-square goodness of fit test:
Used to analyze whether a frequency distribution for a categorical or nominal variable is consistent with expectations:

Example:

Live Births at Leicester Royal Infirmary:

             Observed:   Expected:
  Boys:      762         792
  Girls:     822         792
  Total:     1584        1584

As with the t-test, the data is entered into SPSS with dummy coding variables which are used to define groups (e.g. sex in this example).


Note that the "coding data" (Sex in this case) must be entered as a numerical variable not as alphanumeric data or SPSS will not perform the test. The Variable View window can be used to add labels to the coded data. After the data has been entered, click  Data: Weight Cases  and weight the cases by the variable to be analyzed (Freq in this case).


Note that there is no need to enter the expected data in this case since SPSS can calculate equal expected values. Select Analyze: Nonparametric Tests: Chi-square, select Test Variable (Freq in this case) and click OK. Output:

Freq:
  Observed N Expected N Residual
762 762 792.0 -30.0
822 822 792.0 30.0
Total 1584    
Test Statistics:
Freq
Chi-Square 2.273
df 1
Asymp. Sig. .132

So in this case, we conclude:

There is no significant difference between the number of observed and expected births, chi-square = 2.273 (df = 1, N = 1584), p > .05.
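The same result can be checked outside SPSS with SciPy's chisquare function, using only the observed counts from the table above (equal expected frequencies are assumed by default):

  from scipy import stats

  observed = [762, 822]                     # boys, girls
  chi2, p = stats.chisquare(observed)       # expected frequencies default to 792 in each category
  print(f"chi-square = {chi2:.3f}, p = {p:.3f}")   # 2.273, .132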

 

2. Chi-square test of independence:
Used to analyze whether two categorical or nominal variables are related or associated with each other.

Example:

Students:

              Biology:   Psychology:
  Males:      41         39
  Females:    62         58

Enter the data with one row per combination of categories: a Gender column, a Subject column and a Freq column containing the counts shown above.

As in the previous examples, the dummy coding variables (Gender and Subject in this case) must be entered as numerical variables, not as alphanumeric data, or SPSS will not perform the test. Remember: the Data View window only contains data (numbers). The Variable View window should be used to add labels to the coded data so you can interpret the output. After the data has been entered, click  Data: Weight Cases  and weight the cases by the variable to be analyzed (Freq in this case).

Output:


Crosstab Count:
Subject Total
Biology Psychology
Gender Male 41 39 80
Female 62 58 120
Total 103 97 200
Chi-Square Tests:
Value df Asymp. Sig. (2-sided) Exact Sig. (2-sided) Exact Sig. (1-sided)
Pearson Chi-Square .003(a) 1 .954    
Continuity Correction .000 1 1.000    
Likelihood Ratio .003 1 .954    
Fisher's Exact Test       1.000 .534
Linear-by-Linear Association .003 1 .954    
N of Valid Cases 200        
a: 0 cells (.0%) have expected count less than 5. The minimum expected count is 38.80.

Since none of the cells have an expected count of less than 5, we can ignore Fisher's Exact test and use the (Pearson) Chi-Square result. In this case, we conclude:

There is no significant difference in the gender balance of students between the two courses, chi-square = .003 (df = 1, N = 200), p > .05.
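Again as a cross-check outside SPSS, the Pearson chi-square for this 2 x 2 table can be reproduced with SciPy. Passing correction=False gives the Pearson value; the default Yates correction instead reproduces the Continuity Correction row of the SPSS output.

  from scipy import stats

  table = [[41, 39],     # males:   Biology, Psychology
           [62, 58]]     # females: Biology, Psychology

  chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
  print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.3f}")   # .003, 1, .954
  print(expected)        # expected counts; the minimum is 38.8, as in the SPSS footnote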

 


© MicrobiologyBytes 2009.