Linear Regression
Further information on this topic can be found in Chapter 11 of: Maths from Scratch for Biologists
Regression and correlation are related, but different, tests:
- Correlation quantifies how closely two variables are connected.
- Regression finds the line that best predicts Y from values of X.
Regression or Correlation?
Linear regression and correlation are similar and easily confused. In some situations it makes sense to perform both calculations. Calculate correlation if:
- You measured both X and Y in each subject and wish to quantify how well they are associated.
- Calculate the Pearson (parametric) correlation coefficient if you can assume that both X and Y are sampled from normally-distributed populations.
- Otherwise calculate the Spearman (nonparametric) correlation coefficient.
- Don't calculate a correlation coefficient if you manipulated the X variable (e.g. in an experiment).
Calculate linear regressions only if:
- One of the variables (X) is likely to precede or cause the other variable (Y).
- Choose linear regression if you manipulated the X variable, e.g. in an experiment. It makes a difference which variable is called X and which is called Y, as linear regression calculations are not symmetrical with respect to X and Y. If you swap the two variables, you will obtain a different regression line.
- In contrast, correlation calculations are symmetrical with respect to X and Y. If you swap the labels X and Y, you will still get the same correlation coefficient.
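The symmetry point can be checked numerically. The following is a minimal sketch using made-up data (the values and the helper names pearson_r and regression_slope are illustrative assumptions, not part of the original example): swapping X and Y leaves r unchanged but gives a different regression line.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def regression_slope(x, y):
    """Least-squares slope of the line that predicts y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical illustrative data (not from the original page).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.0, 8.5, 9.0]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)
slope_y_on_x = regression_slope(x, y)  # line predicting Y from X
slope_x_on_y = regression_slope(y, x)  # line predicting X from Y

print(f"r(X, Y) = {r_xy:.4f}   r(Y, X) = {r_yx:.4f}")
print(f"slope of Y on X = {slope_y_on_x:.4f}")
# Redrawn on the same axes, the X-on-Y line has slope 1 / slope_x_on_y.
print(f"X-on-Y line on the same axes: slope = {1 / slope_x_on_y:.4f}")
```

The two slopes only coincide when the points lie exactly on a straight line (r = +1 or -1).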
Linear regression works by minimizing the sum of the squares of the vertical distances of the points from the regression line, hence it is known as the "least squares" method. The calculation effectively minimizes the total area of the squares drawn between the data points and the regression line:

(You don't normally see the squares, or even the line unless
you choose to - they are just drawn here for illustration)
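To make the "least squares" idea concrete, here is a small sketch with made-up data (the data and the function names are assumptions for illustration). It computes the sum of squared vertical distances for the fitted line and for a few nudged lines, each of which gives a larger total.

```python
def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical illustrative data (not from the original page).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.0, 8.5, 9.0]

intercept, slope = least_squares_line(x, y)
best = sum_squared_residuals(x, y, intercept, slope)
print(f"fitted line: Y = {intercept:.2f} + {slope:.2f} X, sum of squares = {best:.3f}")

# Nudge the intercept and/or slope: every alternative line does worse.
alternatives = [
    (intercept + di, slope + ds)
    for di in (-0.5, 0.0, 0.5)
    for ds in (-0.2, 0.0, 0.2)
    if (di, ds) != (0.0, 0.0)
]
for a, b in alternatives:
    total = sum_squared_residuals(x, y, a, b)
    print(f"Y = {a:.2f} + {b:.2f} X -> sum of squares = {total:.3f}")
```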
Performing a regression analysis is similar to performing a correlation test:
- Formulate the null hypothesis. Remember how to do this:
a simpler hypothesis has priority over a more complex theory. The null hypothesis
(H0) is therefore that "Y is independent of X, therefore the
slope of the regression line is 0".
- Calculate the test statistics. A regression line is actually a running series of means of the expected value of Y for each value of X and is calculated from the following equation:

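In the usual textbook notation (the symbols a, b and the bars for means are assumed here, since the page's own notation is not shown), the least-squares line and its coefficients can be written as:

```latex
\hat{y} = a + b\,x, \qquad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
a = \bar{y} - b\,\bar{x}
```

Here b is the slope, a is the intercept, and \bar{x} and \bar{y} are the means of the X and Y values.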
However, don't learn these equations - we don't expect you to
calculate regression lines by hand! Use the MSExcel
Analysis ToolPak Regression option.
- Interpret the test statistics: (N.B. these data are for the renal dialysis example)

An MSExcel regression analysis calculates and displays the potentially confusing
set of statistics shown above, so here's what they mean:
- Multiple R: Correlation coefficient - you don't need to
do a separate correlation test.
- R Square (r2): The most important regression statistic - the square of the r value from a correlation test, it shows how closely X and Y are related. Because r is squared, values of r2 cannot be negative (remember that values of r can range from -1 to +1), and fall between 0 (no correlation) and 1 (perfect correlation).
The r2 value tells you how much your ability to predict is improved by using the regression line, compared with not using it. The least possible improvement is 0, i.e. the regression line does not help at all. The greatest possible improvement is 1, i.e. the regression line fits the data perfectly.
The value of r2 is always between 0 and 1 since the regression line is never worse than worthless (r2=0), and can't be better than perfect (r2=1).
r and r2 values only give a guide to the "goodness-of-fit" and do NOT indicate whether an association between the variables is statistically significant. For this, additional tests of significance must be performed (see ANOVA, below).
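- Adjusted R Square: r2 adjusted for the number of X variables in the analysis; mainly of interest when there is more than one X variable.
- Standard Error: the standard error of the regression, i.e. the typical vertical distance of the data points from the regression line.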
- ANOVA: An ANOVA analysis of the data is performed in order to determine whether the association between the variables is statistically significant. This is determined by the result of the F-test ("F"), and is indicated by "Significance F", the associated P value for the F test. The value of "Significance F" depends only on the results of the regression analysis; the confidence level chosen in the regression analysis dialog box affects the confidence intervals reported for the coefficients, not "Significance F". For a confidence level of 95%, if "Significance F" is <0.05, then the null hypothesis is rejected (there is a statistically significant association between X and Y). Conversely, if "Significance F" is >0.05, then the null hypothesis is not rejected (there is no statistically significant association between X and Y). In this case, "Significance F" = 0, so the null hypothesis is rejected. Phew! This agrees with the correlation result!
- t-statistic: the t-test statistic for each coefficient (intercept and slope), testing whether it differs from 0.
- P-value: the P value for that t-test.
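The statistics listed above can be reproduced by hand for a single X variable. The sketch below uses made-up data, not the renal dialysis example, and the function name regression_summary and the dictionary labels are illustrative assumptions. It stops short of the P values, which need the F or t distribution (available, for instance, as scipy.stats.f.sf and scipy.stats.t.sf if SciPy is installed).

```python
from math import sqrt


def regression_summary(x, y):
    """Main regression statistics for one X variable, with Excel-style labels."""
    n = len(x)
    k = 1  # number of X variables (simple linear regression)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    predicted = [intercept + slope * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)  # variation about the mean of Y
    ss_residual = sum((yi - pred) ** 2 for yi, pred in zip(y, predicted))
    ss_regression = ss_total - ss_residual  # variation explained by the line

    df_residual = n - k - 1
    r_square = ss_regression / ss_total
    std_error = sqrt(ss_residual / df_residual)
    return {
        "Multiple R": sqrt(r_square),
        "R Square": r_square,
        "Adjusted R Square": 1 - (1 - r_square) * (n - 1) / df_residual,
        "Standard Error": std_error,
        "F": (ss_regression / k) / (ss_residual / df_residual),
        "t Stat (slope)": slope / (std_error / sqrt(s_xx)),
    }


# Hypothetical illustrative data (not from the original page).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.5, 5.0, 8.5, 9.0]

summary = regression_summary(x, y)
for label, value in summary.items():
    print(f"{label:>18}: {value:.4f}")
```

With only one X variable, F is simply the square of the t-statistic for the slope, so the two tests always agree.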
Summary:
There are two ways to perform a regression analysis in MSExcel:
- Plot a scatter graph of the data and draw a "trendline" through it
or
- Use the MSExcel Analysis ToolPak to perform
a regression analysis. MSExcel regression analysis has some limitations:
- Accepts at most 16 X (independent) variables; the number of data points is not limited to 16 (for a single X variable, drawing a "trendline" and displaying the value of r2 is a quick alternative).
- Often chooses inappropriate scales for graphs of regression lines. This
can be easily fixed by clicking on the graph and entering the appropriate
values into the chart dialog.
Remember that linear regression is a parametric statistic and may not give reliable results if applied to skewed datasets! This is not a limitation of MSExcel - it applies to all regression analysis. In spite of these limitations, MSExcel offers a quick means of performing a regression analysis.
© MicrobiologyBytes 2009.