MicrobiologyBytes: Maths & Computers for Biologists: Correlation Updated: February 6, 2009 Search

Correlation

Further information on this topic can be found in Chapter 11 of:

CoverMaths from Scratch for Biologists

Numerical ability is an essential skill for everyone studying the biological sciences but many students are frightened by the 'perceived' difficulty of mathematics, and are nervous about applying mathematical skills in their chosen field of study. Maths from Scratch for Biologists is a highly instructive, informal text that explains step by step how and why you need to tackle maths within the biological sciences. (Amazon.co.UK)

Correlation, the relationship between two variables, is closely related to prediction.

The greater the association between variables, the more accurately we can predict the outcome of events. There is rarely an exact correlation of observed results with a mathematical function - the points never fit exactly on the line. The question is therefore whether an association between two variables could have occurred by chance.

There are numerous methods for calculating correlation, e.g:

Pearson correlation calculations are based on the assumption that both X and Y values are sampled from populations that follow a normal (Gaussian) distribution, at least approximately, although with large samples, this assumption is not too important.
Alternatively, the nonparametric Spearman correlation is based on ranking the two variables, and so makes no assumption about the distribution of the values.

A correlation analysis is performed in the same as any other statistical test of significance:

 

  1. Formulate the null hypothesis. A simpler hypothesis has priority over a more complex theory, so the null hypothesis (H0) is therefore that "There is no correlation between the datasets". You also need to set the significance level (a) before performing the test (e.g. 0.05).
  2. Calculate the correlation coefficient (r) for the test data. The Pearson formula is rather cumbersome:
Pearson calculation

so don't learn this equation - use software to perform the calculation, e.g. MSExcel (see below).

  1. Determine whether the value of r is equal to or less than the critical value required to accept the null hypothesis. To do this you need to calculate the number of degrees of freedom for the test - this takes into account the number of independent observations used in the calculation of the test statistic and is needed to find the true value in a probability table. For a one-tailed test, df = n-1, and for a two-tailed test (most usual) df=n-2. Look up the calculated value of r in a table of critical values of the correlation coefficient.
  2. When you look up the r value, ignore the sign (+/-). If you did not do this, you would need two tables of r values, one positive and one negative, but otherwise identical. The sign of the r value tells you whether the correlation is positive or negative, but can be ignored when deciding whether a result is significant or not.
  3. If the calculated value of r is less than or equal to the critical value of r, accept the null hypothesis - there is no proof of significant correlation between the variables. (Note: not "there is no correlation", simply, "there is no proof of correlation".)
    If the calculated value of r is greater than the critical value of r (from the table), reject the null hypothesis and accept the alternative hypothesis.

 

Values of r range from +1 (perfect correlation), through 0 (no correlation), to -1 (perfect negative correlation):

Correlation

 

Warning:
Correlation tests are in some ways the most misused of all statistical procedures!
They are able to show whether two variables could be connected. However, they are not able to show that the variables are not connected! If one variable depends on another, i.e. there is a causal relationship, then it is always possible to find some kind of correlation between the two variables. However, if both variables depend on a third, they can show a correlation without any causal dependency between them. Take care!

Example:

There is a direct correlation between the number of mobile phone masts and the decline in the numbers of house sparrows, Passer domesticus. But do mobile phone masts harm sparrows, or are both effects caused by something else? Or are they both completely independent observations which just happen to correlate? We don't know because correlation tests do not reveal this information - further investigation is necessary.

sparrow

 

MSExcelMSExcel has two functions which determine parametric correlation coefficients:


=CORREL(array1,array2)

where:
array1 is a cell range of values
array2 is a second cell range of values
This function returns the correlation coefficient of the array1 and array2 cell ranges.

=PEARSON(array1,array2)

where:
array1 is a set of independent values
array2 is a set of dependent values
This function returns the "Pearson product moment correlation coefficient", r, a dimensionless index which ranges from -1.0 to 1.0, and reflects the extent of a linear relationship between two data sets. You still need to formulate H0 & HA, set a and look up the calculated value in the table of critical values of r.

For most purposes, these two functions are identical.

Correlation is also one of the tools available on the MSExcel Analysis ToolPak.

 

Example:
In patients undergoing renal (kidney) dialysis, is there any association between heart rate and blood pressure?

Patient:
Heart Rate:
Blood Pressure:
1
83
141
2
86
162
3
88
161
4
92
154
5
94
171
6
98
174
7
101
184
8
114
190
9
117
187
10
121
191
  1. H0: There is no association between the variables. a=0.05
  1. Using MSExcel, r = 0.903
  1. df = 10 - 2 = 8 (2-tail)
  1. Critical value of r from table (8 df) = 0.632
  1. rcalc is greater than rcrit, so H0 is rejected - there is a significant association between the heart rate and blood pressure in these patients.

This can be seen visually by plotting a scatter graph of this data and drawing a trendline through it:

scatter graphWarning: You cannot accurately assess whether a significant correlation between variables exists by visual examination alone!

Bestsellers - Music - DVDs - Videos - Electronics
Search for ... (keywords):
Search for ... (keywords):

Bestsellers - Music - DVDs - Videos - Electronics

Nonparametric Correlation: The Spearman Test

While desirable, it is not always possible to use a parametric test such as the Pearson method. Fortunately, there are also nonparametric correlation tests, the most frequently-used of which is the Spearman test. Unfortunately, there is no built-in Spearman test in MSExcel, so you're going to have to do some work!

Calculation of the Spearman rank order correlation coefficient (rs) is used when the data consists of ordinal variables (i.e. variables with an ordered series where numbers indicate rank order only). Although this is a nonparametric statistic, it may be a better indicator than the Pearson coefficient of a non-linear relationship between two variables.

To perform the Spearman test, the data must first be converted into rank order. When converting to rank order, the smallest value on X becomes a rank of 1, etc, e.g:

x:
y:
Convert to ranks:
x:
y:
7.3
4.2
 
2
1
5.15
7.6
 
1
2
8.99
9.12
 
3
4
9.01
8.7
 
4
3

e.g:

MSExcel

=RANK(number,ref,order)

where:
number is the number whose rank you want to find
ref = array
if order is zero or omitted, MSExcel ranks number as if ref were a list sorted in descending order, if order is any nonzero value, MSExcel ranks number as if ref were a list sorted in ascending order.

The equation for the Spearman calculation is:

Spearman calculation

where:
N is the number of pairs (XY)
D is the difference between each pair (X - Y)

There is no built-in formula for the Spearman calculation in MSExcel, but you can easily perform a Spearman calculation as follows. Start by converting the data to rank order if this has not already been done, then calculate the differences between the pairs (D), the squares of the differences (D2) and the sum of the squares (SD2):

Spearman

Convert the Spearman formula (above) into an MSExcel formula:

rs = 1-(6*SD^2/(N(N^2 -1)))

After calculating the value of rs, this is compared with the critical value of r in deciding whether to accept or reject the null hypothesis. For a one-tailed test df = n-1 and for a two-tailed test (most usual) df=n-2.

Values of r range from +1 (perfect correlation), through 0 (no correlation), to -1 (perfect negative correlation). In general terms, correlation coefficients:


© MicrobiologyBytes 2009.