MicrobiologyBytes: StatsBytes: Bivariate data Updated: February 15, 2012Search

logo

Bivariate data

Thinking about data

  1. Before you do anything, think about the data you have. What sort of data is it and where did it come from? Is it measurement data (continuous, i.e. all values possible), or is it count (unordered, discrete/non-continuous), rank (ordered, discrete/non-continuous) or nominal (names) data? Be careful with nominal data - numbers can be used as names.
  2. Next, visually inspect the data. Can you see any abnormalities? Is is correctly formatted for R (one variable per column, names, if present, correctly written)?
  3. Perform EDA using graphical tools, e.g. plot(), boxplot(), hist(), qqnorm(), qqline(), and if you still can't decide whether variables have a normal frequency distribution, shapiro.test() (check the distribution of each variable, not the entire dataset).

If you have data which consists of two variables (bivariate data), apart from initial EDA on each variable, the obvious thing to do is to compare one variable to the other. Before you can do that, you need to decide whether the variables are related to each other, which will help to determine what sorts of test you can use to compare them - are the variables are independent or dependent? In statistics jargon:

Selecting a statistical test:      Type of data:   
Univariate
Bivariate
Multivariate
Objective: Measurement, normal distribution Count, rank, or measurement with non-normal distribution
Describe (all data types) EDA (mean), plots EDA (median), plots

Compare to a hypothetical distribution

One sample t test
Wilcoxon test, chi square goodness of fit

Compare independent variables

Unpaired t test
Wilcoxon test (unpaired, = Mann-Whitney test), Fisher's test (small groups) or Chi-square test of independence (large groups) / Chi-square test of homogeneity (large groups)

Compare dependent variables

Paired t test
Wilcoxon test (paired)

Measure association between variables

Pearson correlation test
Spearman correlation test
Prediction from another variable
Simple linear regression
[Nonparametric regression]
Compare 3 or more independent variables
ANOVA
Kruskal-Wallis test, Chi-square test of independence
Compare 3 or more dependent variables Repeated measures ANOVA Friedman test

See: Which is the correct statistical test to use? (2008) British Journal of Oral and Maxillofacial Surgery 46(1): 38-41

 

Parametric tests

 

Student's t test

The t test is a parametric test because it examines whether there is a significant difference between the mean of the two variables being tested. The mean is a reliable indicator for data with a normal frequency distribution but not for skewed data, which is why the t test is parametric and only gives a reliable answer with normal data. There are three versions of the t test:

 

To run an unpaired t test, we need to know if the variances of the two variables are "equal" or are significantly different from each other. This does not affect the paired t test, but for unpaired t tests the first step is to test the variances. Example:

> var.test(x, y) #where x and y are the names of the two variables
F test to compare two variances
data: x and y
F = 0.4609, num df = 349, denom df = 439, p-value = 1.081e-13
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval: 0.3781617 0.5631120
sample estimates: ratio of variances 0.4609334

In this case, the p value for the test is < 0.05, so we reject the null hypothesis ("There is no significant difference between the variances of x and y") and assume the variances are unequal. (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject").
In R, the unpaired t test assumes the variances of the groups being test are not equal, but if this is not the case, include the var.equal=TRUE argument in the t.test() function:

> t.test(x, y, alt="less") # Use: t.test(x,y,alt="less",var.equal=TRUE) if variances are equal
Welch Two Sample t-test
data: x and y
t = -7.24, df = 769.965, p-value = 5.442e-13
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval: -Inf -9.827272
sample estimates:
mean of x 34.74286
mean of y 47.46364

In this example, the p value is < 0.05, so we reject the null hypothesis ("There is no significant difference between the means of x and y") (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject") and conclude that there is a significant different between the means of the two groups.

 

Paired t test - used to compare the mean of two dependent variables
The paired t test is not affected by the variances of the variables, so this does not have to be considered and we can go straight into the test. Example - consider the measurement of blood pressure in a group of patients before and after exercise:

> t.test(before, after, paired=TRUE)
Paired t-test
data: before and after
t = 17.175, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: 16.75431 21.13182
sample estimates: mean of the differences 18.94307

In this case, the p value is < 0.05, so we reject the null hypothesis ("There is no significant difference between means blood pressure before and after exercise") (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject") and conclude that there is a significant different between the means of the two variables.

 

Measuring association between variables (correlation)

With two variables, perhaps the most obvious question to ask is what is the relationship between them - this is easy to determine with a correlation test. Unfortunately, correlation is also the most abused statistical test. Consider this example:

Poor sparrows In: A Possible Effect of Electromagnetic Radiation from Mobile Phone Base Stations on the Number of Breeding House Sparrows (Passer domesticus). Joris Everaert & Dirk Bauwens (2007) Electromagnetic Biology and Medicine, 26: 63–72. doi: 10.1080/15368370701205693 data is presented which clearly shows that as the density of mobile phone masts has increased, the number of house sparrows has decreased. The Daily Mail (30 April 2007) was quick to blame the phone masts for the loss of sparrows.

> sparrows <- c(1.5,1.9,0.8,1.0,1.3,0.8) #number of sparrows at different sites
> masts <- c(0.153,0.084,0.245,0.130,0.109,0.043)
#electric field strength
> plot(sparrows,masts)
# This plot does not tell us very much - try it for yourself - so:

Calculate the Pearson correlation coefficient, "r":

> cor.test(sparrows, masts)
Pearson's product-moment correlation
data: sparrows and masts
t = -0.5273, df = 4, p-value = 0.6259
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.8836744 0.7018201
sample estimates: cor -0.2549572

Correlation coefficients produce values from -1 (perfect negative correlation: as one value rises, the other falls) to +1 (perfect positive correlation: both values rise in unison). This data produces a small negative value (-0.3, rounded), but is this result statistically significant? Since p = 0.6259 (>0.05), we accept the null hypothesis: "There is no significant correlation between sparrows and masts, r = 0.3 (df=4), p >0.05". There is a weak correlation, but it is not statistically significant. HOWEVER, even if the test result had been statistically significant, would this association prove that mobile phone masts harm sparrows? Of course not. There are thousands of possible "confounding factors" that this test does not consider. In response to the Daily Mail story, the RSPB stated that it believes the decline in sparrows is due to tidier gardens and better maintained housing which reduces nesting sites and availability of food. Another way of thinking about this is that correlation can tell you whether there is a relationship between the chicken and the egg, but not which came first. Please write out 100 times: "CORRELATION DOES NOT EQUAL CAUSATION" and read this: Austin Bradford Hill, (1965) The Environment and Disease: Association or Causation? Proceedings of the Royal Society of Medicine, 58, 295-300.

 

Regression

If we want to go beyond association (correlation) and predict to what extent the value of one variable affects the value of the other, we need to carry out what is known as simple linear regression.

Regression or Correlation?

Regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate correlation if:

  • You both variables X and Y were measurements and wish to quantify how well they are associated (i.e. vary together).
  • Don't calculate a correlation coefficient if you manipulated the X variable, e.g. in an experiment.

Calculate linear regressions only if:

  • One of the variables (X, the "independent variable") is likely to affect the other variable (Y, the "dependent variable").
  • A plot of the relationship between X and Y shows a straight line, not a curve or some other complex interaction. ALWAYS plot the relationship between the variables and consider the result before deciding whether to perform a regression analysis.
  • Choose regression rather than correlation if you manipulated the X variable, e.g. in an experiment.
  • In regression analysis, it makes a difference which variable is called X and which is called Y, as regression calculations are not symmetrical with respect to X and Y. If you swap the two variables, you will obtain a different regression line. In contrast, correlation calculations are symmetrical with respect to X and Y because these are independent variables. If you swap the labels X and Y, you will still get the same correlation coefficient.

Correlation makes no assumption as to whether one variable is dependent on the other(s) and is not concerned with the relationship between variables; instead it gives an estimate as to the degree of association between the variables. Regression attempts to describe the dependence of a variable on one (or more) explanatory variables; it implicitly assumes that there is a one-way causal effect from the explanatory variable(s) to the response variable, regardless of whether the path of effect is direct or indirect. The best way to appreciate this difference is by example.
Consider samples of the leg length and skull size from a population of elephants. It would be reasonable to suggest that these two variables are associated in some way, as elephants with short legs tend to have small heads and elephants with long legs tend to have big heads. We might demonstrate an association exists by performing a correlation analysis. However, would regression be an appropriate tool to describe a relationship between head size and leg length? Does an increase in skull size cause an increase in leg length? Does a decrease in leg length cause the skull to shrink? It is unwise to apply regression analysis to these variables as they are interdependent and one is not wholly dependent on the other, but more likely on some other factor that affects them both (e.g. food supply, genetic makeup, etc).
On the other hand, consider these two variables: crop yield and temperature. They are measured independently, one by a weather station thermometer and the other by Farmer Giles' scales. While correlation analysis might show a high degree of association between these two variables, regression analysis might also be able to demonstrate the dependence of crop yield on temperature. However, careless use of regression analysis might suggest that temperature is dependent on crop yield: this would suggest that if you grow really big crops you'll be guaranteed a hot summer!

In R, linear regression, i.e. drawing a graph of the relationship between variables, is known as a linear model, function lm(). Example: Does students performance (final mark) in an Autumn term module predict their performance in a Spring term module?

> plot(autumn, spring) #plots a scatter graph of the variables, where autumn = x, spring = y
> abline(lm(spring ~ autumn))
#draws a line though the plot based on the linear model lm()
> summary(lm(spring ~ autumn))
#shows the statistics for the linear model
Call: lm(formula = spring ~ autumn)
Residuals:
Min 1Q Median 3Q Max
-15.167 -12.124 2.681 9.720 12.484
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2189 7.9921 0.027 0.978
autumn 1.0630 0.1023 10.391 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.774 on 197 degrees of freedom
Multiple R-squared: 0.3541, Adjusted R-squared: 0.3508
<2.2e-16

R-squared (the "coefficient of determination") shows how much of the variation in the dependent variable (spring) is due to the independent variable (autumn). In this example, R-squared = 0.3508, so 35.08% of the variation in the Spring marks is determined by the predictor variable (Autumn marks) - and 64.92% of the variation is caused by other factors! The p value (2.2e-16) is <0.05, so we reject the null hypothesis "Autumn marks do not predict Spring marks". However, bear in mind that this test suggests that Autumn marks are only a minor component involved in predicting Spring marks. When we formally report the outcome of simple linear regression, we cite the value of R-squared, the number of degrees of freedom and significance value, e.g: "Autumn marks predict Spring marks, R-squared = 0.3508(1,197), p < 0.05".


Non-parametric tests

Wilcoxon rank sum test

The function wilcox.test() performs a range of different tests in R depending on the arguments used with the function. These include the Wilcoxon signed rank test for univariate data, the Wilcoxon rank sum test of unpaired bivariate data (also called the Mann-Whitney test), and the Wilcoxon signed rank test for paired bivariate data used to compare dependent measurements (see below). The Wilcoxon test compares median values (in a similar way to how the t test compares mean values), and so is suitable for skewed data.

Example: Compare the concentration of potassium (mmol L-1) in serum samples from bears and badgers. Clearly, these variables are independent, and EDA (not shown) indicates that both variables have a negative skew.

> wilcox.test(bears, badgers)
Wilcoxon rank sum test
data: bears and badgers
W = 480, p-value = 0.3149

p > 0.05 so we accept the null hypothesis - there is no significant difference between the median serum potassium concentrations in bears and badgers.


Chi square test

N.B: The chi-square requires that you use numerical values, not percentages or ratios.

The chi square test is confusing because there are three different versions and you have to know which one to use to get the "right" answer:

  1. If you only have one variable, use the chi square goodness of fit test to examine whether the data the data came from a specified population (i.e. observed values - the variable - against expected values).
  2. The chi square test of independence examines whether two characteristics associated with subjects in a single population are independent, i.e. whether one variable affects the other. This is calculated from a 2x2 "contingency table".
  3. The chi square test of homogeneity examines whether several populations (two or more) are similar (homogeneous) with respect to a variable.

The difference between (2) and (3) is a matter of design. This difference is subtle but important. However, both tests are performed the same way in R:

Example of chi square test of independence: Live Births at Grumbledown Infirmary:

Observed: Expected:
Boys: 762 792
Girls: 822 792
Total: 1584 1584
(expected values assume the null hypothesis that there will be no difference in the number of male and female births so are equally distributed)

This data measured two characteristics (boys and girls) from a single population (live births at Grumbledown Infirmary), so we use the chi square test of independence. We need to construct a 2x2 contingency table using R (alternatively, read in the data table from a .csv file):

> row1 <- c(762,822) #observed
> row2 <- c(792,792) #expected
> data.table <- rbind(row1,row2)
> data.table
[,1] [,2]
row1 762 822
row2 792 792
> chisq.test(data.table)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table
X-squared = 1.0622, df = 1, p-value = 0.3027

It doesn't matter which way round the rows/columns go, because the table is symmetrical it gives the same answer:

> row1 <- c(762,792) #Boys, observed, expected
> row2 <- c(822,792) #Girls, observed, expected
> data.table <- rbind(row1,row2)
> data.table
[,1] [,2]
row1 762 792
row2 822 792
> chisq.test(data.table)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table
X-squared = 1.0622, df = 1, p-value = 0.3027

Since the p value (0.3027) is greater than 0.05 (the level of significance for the test), we accept the null hypothesis that there is no significant difference between male and female of births. Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject"). As a result, we conclude:
"There is no significant difference between number of observed and expected births, chi-square = 1.06(df = 1, N=1584), p > 0.05."
This is the way you formally report a chi square test result, including the conclusion, value of chi square, degrees of freedom (1 in this case), total number of observation (N) and significance level.

 

Example of chi square test of homogeneity: Limpet shell colour
Dark: Light:
North-facing: 470 582
South-facing: 492 667

This data measured the characteristic shell colour(dark/light) for two populations (north and south facing), so we use the chi square test of homogeneity:

> north <- c(470,582)
> south <- c(492,667)
> data.table <- rbind(north,south)
> data.table
[,1] [,2]
north 470 582
south 492 667
> chisq.test(data.table)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table
X-squared = 1.0234, df = 1, p-value = 0.3117

Again, it dosen't matter which way round the rows/columns go, because the table is symmetrical it gives the same answer:

> data.table2 <- cbind(north,south)
> data.table2
north south
[1,] 470 492
[2,] 582 667
> chisq.test(data.table2)
Pearson's Chi-squared test with Yates' continuity correction
data: data.table2
X-squared = 1.0234, df = 1, p-value = 0.3117

p (0.3117) > 0.05 so we accept the null hypothesis that the populations are homogeneous, (df = 1, N = 2211), chi-square = 1.02, p > 0.05.

 

Fisher's Exact test

Some of the assumptions for the chi-square test are:

If either of these is not true, we need to use an alternative test which is more accurate for smaller datasets, Fisher's Exact test. The null hypothesis is that the relative proportions of one variable are independent of the second variable.

Example: Phenotypes of the Scarlet tiger moth (Callimorpha dominula):
Genotype: White-spotted (AA) Intermediate (Aa) Little spotting (aa) Total
Number 1469 138 5 1612
In this case, the null hypothesis is that the population is in proportion to the Hardy–Weinberg principle.

Observed: Expected:
AA: 1469 1467
Aa: 138 141
aa: 5 3

The test is performed in the same way as a chi square test:

> observed <- c(1469,138,5)
> expected <- c(1467,141,3)
> data.table <- rbind(observed,expected)
> fisher.test(data.table)
Fisher's Exact Test for Count Data
data: data.table
p-value = 0.8064

p (0.8064) > 0.05 so we accept the null hypothesis that the population is in proportion to the Hardy–Weinberg principle.



Watch this video at 480p for better resolution

 

Wilcoxon signed rank test for paired data

The above tests are used to examine independent variables. To examine two dependent variables, we use the version of the Wilcoxon signed rank test for paired data. This is done by changing the arguments used with wilcox.test (default argument paired = FALSE) using "paired = TRUE".

Example: Blood pressure measurements in patients before and after biofeedback relaxation therapy:
Before: After:
97      96
88      86
75      79
90      89
85      91
94      89
77      86
89      99
82      94
90      96

before <- c(97,88,75,90,85,94,77,89,82,90)
after <- c(96,86,75,90,85,94,77,89,82,90)
> wilcox.test(before, after, paired=TRUE)
Wilcoxon signed rank test with continuity correction
data: before and after
V = 3, p-value = 0.3711

P >0.05 so we accept the null hypothesis that there is no significant difference in median blood pressure readings before and after treatment.

 

Non-parametric correlation tests

If EDA shows that both variables are normally-distributed, you should measure the association between them using the parametric Pearson correlation test: cor.test() However, if one or both variables are not normally-distributed, a non-parametric correlation test must be used. The most common non-parametric test used to measure the association between variables is the Spearman correlation test. This looks at the ranks (sequential order) of two variables. A useful function in R if the data is not in rank order is rank():

> a <- c(3, 1, 4, 15, 92)
> rank(a)
[1] 2 1 3 4 5 #i.e. 2nd lowest, lowest, etc

However, you don't need to use this if you use the method="spearman" argument with the cor.test() function as this is done automatically:
Example: Assessment of tuna quality (tasting panel scores):

> panelA <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
> panelB <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8)
> cor.test(panelA, panelB, method="spearman")
Spearman's rank correlation rho
data: panelA and panelB
S = 48, p-value = 0.0968
alternative hypothesis: true rho is not equal to 0
sample estimates: rho 0.6

P >0.05 so we accept the null hypothesis that there is no correlation between the scores given by the two panels.

Kendall's tau is an alternative non-parametric correlation test which may be preferable to Spearman for small datasets containing many "tied ranks", i.e. equal scores. In practice, it is usual to run both of these tests and use the lower value.

> cor.test(panelA, panelB, method="kendal")
Kendall's rank correlation tau
data: panelA and panelB
T = 26, p-value = 0.1194
alternative hypothesis: true tau is not equal to 0
sample estimates: tau 0.4444444

Same result in this case, p >0.05, so we accept the null hypothesis that there is no correlation between the scores given by the two panels.

 

Self assessment questions


Creative Commons License
StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License