| MicrobiologyBytes: StatsBytes: Bivariate data | Updated: February 15, 2012 | Search |
If you have data which consists of two variables (bivariate data), apart from initial EDA on each variable, the obvious thing to do is to compare one variable to the other. Before you can do that, you need to decide whether the variables are related to each other, which will help to determine what sorts of test you can use to compare them - are the variables are independent or dependent? In statistics jargon:
|
||||||
| Objective: | Measurement, normal distribution | Count, rank, or measurement with non-normal distribution | ||||
| Describe (all data types) | EDA (mean), plots | EDA (median), plots | ||||
Compare to a hypothetical distribution |
One sample t test |
Wilcoxon test, chi square goodness of fit |
||||
Compare independent variables |
Unpaired t test |
Wilcoxon test (unpaired, = Mann-Whitney test), Fisher's test (small groups) or Chi-square test of independence (large groups) / Chi-square test of homogeneity (large groups) |
||||
Compare dependent variables |
Paired t test |
Wilcoxon test (paired) |
||||
Measure association between variables |
Pearson correlation test |
Spearman correlation test |
||||
| Prediction from another variable |
Simple linear regression |
[Nonparametric regression] |
||||
| Compare 3 or more independent variables |
ANOVA |
Kruskal-Wallis test, Chi-square test of independence |
||||
| Compare 3 or more dependent variables | Repeated measures ANOVA | Friedman test | ||||
The t test is a parametric test because it examines whether there is a significant difference between the mean of the two variables being tested. The mean is a reliable indicator for data with a normal frequency distribution but not for skewed data, which is why the t test is parametric and only gives a reliable answer with normal data. There are three versions of the t test:
To run an unpaired t test, we need to know if the variances of the two variables are "equal" or are significantly different from each other. This does not affect the paired t test, but for unpaired t tests the first step is to test the variances. Example:
> var.test(x, y) #where x and y are the names of the two variables
F test to compare two variances
data: x and y
F = 0.4609, num df = 349, denom df = 439, p-value = 1.081e-13
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval: 0.3781617 0.5631120
sample estimates: ratio of variances 0.4609334
In this case, the p value for the test is < 0.05, so we reject the null hypothesis ("There is no significant difference between the variances of x and y") and assume the variances are unequal. (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject").
In R, the unpaired t test assumes the variances of the groups being test are not equal, but if this is not the case, include the var.equal=TRUE argument in the t.test() function:
> t.test(x, y, alt="less") # Use: t.test(x,y,alt="less",var.equal=TRUE) if variances are equal
Welch Two Sample t-test
data: x and y
t = -7.24, df = 769.965, p-value = 5.442e-13
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval: -Inf -9.827272
sample estimates:
mean of x 34.74286
mean of y 47.46364
In this example, the p value is < 0.05, so we reject the null hypothesis ("There is no significant difference between the means of x and y") (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject") and conclude that there is a significant different between the means of the two groups.
Paired t test - used to compare the mean of two dependent variables
The paired t test is not affected by the variances of the variables, so this does not have to be considered and we can go straight into the test. Example - consider the measurement of blood pressure in a group of patients before and after exercise:
> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = 17.175, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: 16.75431 21.13182
sample estimates: mean of the differences 18.94307
In this case, the p value is < 0.05, so we reject the null hypothesis ("There is no significant difference between means blood pressure before and after exercise") (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject") and conclude that there is a significant different between the means of the two variables.
With two variables, perhaps the most obvious question to ask is what is the relationship between them - this is easy to determine with a correlation test. Unfortunately, correlation is also the most abused statistical test. Consider this example:
In:
A Possible Effect of Electromagnetic Radiation from Mobile Phone Base Stations on the Number of Breeding House Sparrows (Passer domesticus). Joris Everaert & Dirk Bauwens (2007) Electromagnetic Biology and Medicine, 26: 63–72. doi: 10.1080/15368370701205693 data is presented which clearly shows that as the density of mobile phone masts has increased, the number of house sparrows has decreased. The Daily Mail (30 April 2007) was quick to blame the phone masts for the loss of sparrows.
> sparrows <- c(1.5,1.9,0.8,1.0,1.3,0.8) #number of sparrows at different sites
> masts <- c(0.153,0.084,0.245,0.130,0.109,0.043) #electric field strength
> plot(sparrows,masts) # This plot does not tell us very much - try it for yourself - so:
Calculate the Pearson correlation coefficient, "r":
> cor.test(sparrows, masts)
Pearson's product-moment correlation
data: sparrows and masts
t = -0.5273, df = 4, p-value = 0.6259
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.8836744 0.7018201
sample estimates: cor -0.2549572
Correlation coefficients produce values from -1 (perfect negative correlation: as one value rises, the other falls) to +1 (perfect positive correlation: both values rise in unison). This data produces a small negative value (-0.3, rounded), but is this result statistically significant? Since p = 0.6259 (>0.05), we accept the null hypothesis: "There is no significant correlation between sparrows and masts, r = 0.3 (df=4), p >0.05". There is a weak correlation, but it is not statistically significant. HOWEVER, even if the test result had been statistically significant, would this association prove that mobile phone masts harm sparrows? Of course not. There are thousands of possible "confounding factors" that this test does not consider. In response to the Daily Mail story, the RSPB stated that it believes the decline in sparrows is due to tidier gardens and better maintained housing which reduces nesting sites and availability of food. Another way of thinking about this is that correlation can tell you whether there is a relationship between the chicken and the egg, but not which came first. Please write out 100 times: "CORRELATION DOES NOT EQUAL CAUSATION" and read this: Austin Bradford Hill, (1965) The Environment and Disease: Association or Causation? Proceedings of the Royal Society of Medicine, 58, 295-300.
If we want to go beyond association (correlation) and predict to what extent the value of one variable affects the value of the other, we need to carry out what is known as simple linear regression.
Regression or Correlation? Regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate correlation if:
Calculate linear regressions only if:
Correlation makes no assumption as to whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate as to the degree of association between the variables. Regression attempts to describe the dependence of a variable on one (or more) explanatory variables; it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. The best way to appreciate this difference is by example. |
In R, linear regression, i.e. drawing a graph of the relationship between variables, is known as a linear model, function lm(). Example: Does students performance (final mark) in an Autumn term module predict their performance in a Spring term module?
> plot(autumn, spring) #plots a scatter graph of the variables, where autumn = x, spring = y
> abline(lm(spring ~ autumn)) #draws a line though the plot based on the linear model lm()
> summary(lm(spring ~ autumn)) #shows the statistics for the linear model
Call: lm(formula = spring ~ autumn)
Residuals:
Min 1Q Median 3Q Max
-15.167 -12.124 2.681 9.720 12.484
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2189 7.9921 0.027 0.978
autumn 1.0630 0.1023 10.391 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.774 on 197 degrees of freedom
Multiple R-squared: 0.3541, Adjusted R-squared: 0.3508
<2.2e-16
R-squared (the "coefficient of determination") shows how much of the variation in the dependent variable (spring) is due to the independent variable (autumn). In this example, R-squared = 0.3508, so 35.08% of the variation in the Spring marks is determined by the predictor variable (Autumn marks) - and 64.92% of the variation is caused by other factors! The p value (2.2e-16) is <0.05, so we reject the null hypothesis "Autumn marks do not predict Spring marks". However, bear in mind that this test suggests that Autumn marks are only a minor component involved in predicting Spring marks. When we formally report the outcome of simple linear regression, we cite the value of R-squared, the number of degrees of freedom and significance value, e.g: "Autumn marks predict Spring marks, R-squared = 0.3508(1,197), p < 0.05".
The function wilcox.test() performs a range of different tests in R depending on the arguments used with the function. These include the Wilcoxon signed rank test for univariate data, the Wilcoxon rank sum test of unpaired bivariate data (also called the Mann-Whitney test), and the Wilcoxon signed rank test for paired bivariate data used to compare dependent measurements (see below). The Wilcoxon test compares median values (in a similar way to how the t test compares mean values), and so is suitable for skewed data.
Example: Compare the concentration of potassium (mmol L-1) in serum samples from bears and badgers. Clearly, these variables are independent, and EDA (not shown) indicates that both variables have a negative skew.
> wilcox.test(bears, badgers)
Wilcoxon rank sum test
data: bears and badgers
W = 480, p-value = 0.3149
p > 0.05 so we accept the null hypothesis - there is no significant difference between the median serum potassium concentrations in bears and badgers.
N.B: The chi-square requires that you use numerical values, not percentages or ratios.
The chi square test is confusing because there are three different versions and you have to know which one to use to get the "right" answer:
The difference between (2) and (3) is a matter of design. This difference is subtle but important. However, both tests are performed the same way in R:
Example of chi square test of independence: Live Births at Grumbledown Infirmary:
Observed: Expected:
Boys: 762 792
Girls: 822 792
Total: 1584 1584
(expected values assume the null hypothesis that there will be no difference in the number of male and female births so are equally distributed)
This data measured two characteristics (boys and girls) from a single population (live births at Grumbledown Infirmary), so we use the chi square test of independence. We need to construct a 2x2 contingency table using R (alternatively, read in the data table from a .csv file):
> row1 <- c(762,822) #observed
> row2 <- c(792,792) #expected
> data.table <- rbind(row1,row2)
> data.table
[,1] [,2]
row1 762 822
row2 792 792
> chisq.test(data.table)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table
X-squared = 1.0622, df = 1, p-value = 0.3027
It doesn't matter which way round the rows/columns go, because the table is symmetrical it gives the same answer:
> row1 <- c(762,792) #Boys, observed, expected
> row2 <- c(822,792) #Girls, observed, expected
> data.table <- rbind(row1,row2)
> data.table
[,1] [,2]
row1 762 792
row2 822 792
> chisq.test(data.table)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table
X-squared = 1.0622, df = 1, p-value = 0.3027
Since the p value (0.3027) is greater than 0.05 (the level of significance for the test), we accept the null hypothesis that there is no significant difference between male and female of births. Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject"). As a result, we conclude:
"There is no significant difference between number of observed and expected births, chi-square = 1.06(df = 1, N=1584), p > 0.05."
This is the way you formally report a chi square test result, including the conclusion, value of chi square, degrees of freedom (1 in this case), total number of observation (N) and significance level.
Example of chi square test of homogeneity: Limpet shell colour
Dark: Light:
North-facing: 470 582
South-facing: 492 667
This data measured the characteristic shell colour(dark/light) for two populations (north and south facing), so we use the chi square test of homogeneity:
> north <- c(470,582)
> south <- c(492,667)
> data.table <- rbind(north,south)
> data.table
[,1] [,2]
north 470 582
south 492 667
> chisq.test(data.table)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table
X-squared = 1.0234, df = 1, p-value = 0.3117
Again, it dosen't matter which way round the rows/columns go, because the table is symmetrical it gives the same answer:
> data.table2 <- cbind(north,south)
> data.table2
north south
[1,] 470 492
[2,] 582 667
> chisq.test(data.table2)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table2
X-squared = 1.0234, df = 1, p-value = 0.3117
p (0.3117) > 0.05 so we accept the null hypothesis that the populations are homogeneous, (df = 1, N = 2211), chi-square = 1.02, p > 0.05.
Some of the assumptions for the chi-square test are:
If either of these is not true, we need to use an alternative test which is more accurate for smaller datasets, Fisher's Exact test. The null hypothesis is that the relative proportions of one variable are independent of the second variable.
Example: Phenotypes of the Scarlet tiger moth (Callimorpha dominula):
Genotype: White-spotted (AA) Intermediate (Aa) Little spotting (aa) Total
Number 1469 138 5 1612
In this case, the null hypothesis is that the population is in proportion to the Hardy–Weinberg principle.Observed: Expected:
AA: 1469 1467
Aa: 138 141
aa: 5 3
The test is performed in the same way as a chi square test:
> observed <- c(1469,138,5)
> expected <- c(1467,141,3)
> data.table <- rbind(observed,expected)
> fisher.test(data.table)
Fisher's Exact Test for Count Data
data: data.table
p-value = 0.8064
p (0.8064) > 0.05 so we accept the null hypothesis that the population is in proportion to the Hardy–Weinberg principle.
The above tests are used to examine independent variables. To examine two dependent variables, we use the version of the Wilcoxon signed rank test for paired data. This is done by changing the arguments used with wilcox.test (default argument paired = FALSE) using "paired = TRUE".
Example: Blood pressure measurements in patients before and after biofeedback relaxation therapy:
Before: After:
97 96
88 86
75 79
90 89
85 91
94 89
77 86
89 99
82 94
90 96before <- c(97,88,75,90,85,94,77,89,82,90)
after <- c(96,86,75,90,85,94,77,89,82,90)
> wilcox.test(before, after, paired=TRUE)
Wilcoxon signed rank test with continuity correction
data: before and after
V = 3, p-value = 0.3711
P >0.05 so we accept the null hypothesis that there is no significant difference in median blood pressure readings before and after treatment.
If EDA shows that both variables are normally-distributed, you should measure the association between them using the parametric Pearson correlation test: cor.test() However, if one or both variables are not normally-distributed, a non-parametric correlation test must be used. The most common non-parametric test used to measure the association between variables is the Spearman correlation test. This looks at the ranks (sequential order) of two variables. A useful function in R if the data is not in rank order is rank():
> a <- c(3, 1, 4, 15, 92)
> rank(a)
[1] 2 1 3 4 5 #i.e. 2nd lowest, lowest, etc
However, you don't need to use this if you use the method="spearman" argument with the cor.test() function as this is done automatically:
Example: Assessment of tuna quality (tasting panel scores):
> panelA <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
> panelB <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
> cor.test(panelA, panelB, method="spearman")
Spearman's rank correlation rho
data: panelA and panelB
S = 48, p-value = 0.0968
alternative hypothesis: true rho is not equal to 0
sample estimates: rho 0.6
P >0.05 so we accept the null hypothesis that there is no correlation between the scores given by the two panels.
Kendall's tau is an alternative non-parametric correlation test which may be preferable to Spearman for small datasets containing many "tied ranks", i.e. equal scores. In practice, it is usual to run both of these tests and use the lower value.
> cor.test(panelA, panelB, method="kendal")
Kendall's rank correlation tau
data: panelA and panelB
T = 26, p-value = 0.1194
alternative hypothesis: true tau is not equal to 0
sample estimates: tau 0.4444444
Same result in this case, p >0.05, so we accept the null hypothesis that there is no correlation between the scores given by the two panels.

StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License