MicrobiologyBytes: Maths & Computers for Biologists: Descriptive Statistics | Updated: February 6, 2009
Further information on this topic can be found in Chapter 8 of: Maths from Scratch for Biologists
Numerical ability is an essential skill for everyone studying the biological sciences but many students are frightened by the 'perceived' difficulty of mathematics, and are nervous about applying mathematical skills in their chosen field of study. Maths from Scratch for Biologists is a highly instructive, informal text that explains step by step how and why you need to tackle maths within the biological sciences. (Amazon.co.UK)
Remember:
Statistics is the systematic collection and display of numerical data.
There are many different-shaped frequency distributions:

How do we describe these? By measuring the:
Central Tendency and Variability
The central tendency of a dataset, i.e. the centre of a frequency distribution, is most commonly measured by the 3 Ms: the mean (arithmetic average), the median (middle value) and the mode (most frequent value).
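As a minimal sketch of the three measures of central tendency, the Python below uses the standard `statistics` module on a small invented dataset (the `readings` values are hypothetical, chosen so that one large value pulls the mean well above the median):

```python
import statistics

# Hypothetical readings (arbitrary units), invented for illustration.
# The single large value (40) mimics a positively-skewed dataset.
readings = [2, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(readings)      # arithmetic average
median_value = statistics.median(readings)  # middle value of the sorted data
mode_value = statistics.mode(readings)      # most frequent value

print(mean_value, median_value, mode_value)  # 9 4 3
```

Here the mean (9) is larger than six of the seven values, whereas the median (4) and mode (3) stay close to the bulk of the data.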
In the results from DNA microarray experiments, many of the hybridized spots have very low fluorescence. However, a few of the spots have very strong fluorescence. This produces a strong positively-skewed distribution:

In this case, the relatively small number of high-intensity samples distorts the results. Since this is not a normal distribution, calculating the mean is misleading and statistical tests based on the normal distribution would give meaningless answers. The solution is to divide the fluorescence intensity of each sample by the median fluorescence intensity and then take the logarithm of the result. (Dividing by the median alone only rescales the data and leaves the skew unchanged.) Plotting the log-transformed values gives an approximately normal distribution.
Transforming data to allow you to use parametric statistics is completely legitimate. Different sorts of mathematical transformation work best for different datasets:
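As a rough illustration of how a transformation changes the shape of a distribution, the sketch below compares the skewness of some invented intensities before and after transformation. The `intensities` values and the `skewness` helper are assumptions made for this example, and log base 2 is only one possible choice of transformation:

```python
import math
import statistics

def skewness(values):
    """Moment coefficient of skewness: 0 for symmetric data, > 0 for a right-hand tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

# Hypothetical spot intensities (arbitrary units): mostly low, a few very high.
intensities = [10, 12, 15, 20, 25, 40, 80, 160, 640]

median = statistics.median(intensities)
scaled = [v / median for v in intensities]  # divide each value by the median
logged = [math.log2(v) for v in scaled]     # then take the log (base 2)

print(f"raw:    {skewness(intensities):.2f}")
print(f"scaled: {skewness(scaled):.2f}")
print(f"logged: {skewness(logged):.2f}")
```

Dividing by a constant leaves the skewness unchanged, while taking logarithms pulls in the long right-hand tail, so the logged values are noticeably less skewed than the raw ones.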
Whereas the central tendency is a summary measure of the overall level of a dataset, variability (or dispersion) measures the amount of scatter in a dataset.
Example:

Variability is commonly measured by three criteria:
a) Range:
The difference between the largest and the smallest value in the dataset. Since the range only takes into account two values from the entire dataset, it may be heavily influenced by outliers in the data. Therefore, the interquartile range is commonly used instead, i.e. the distance between the 25th and 75th percentiles (Q3 - Q1).
N.B. By definition, this contains the middle 50% of the datapoints in any dataset, whether or not it is normally distributed.
The semi-interquartile range, i.e. (Q3 - Q1)/2, is half the interquartile range. In a symmetrical distribution, the median ± the semi-interquartile range spans the middle 50% of the datapoints. Like the interquartile range, it is little affected by outliers.
Although the range is a crude measure of variability, it is easy to calculate and useful as an outline description of a dataset (EDA, box & whisker plot). Note that the interquartile range and semi-interquartile range are similar in concept to the median: all are based on percentiles rather than on the value of every datapoint.
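The range-based measures above can be sketched in a few lines of Python. The `data` values are invented, and the quartile method is an assumption: textbooks and software packages use slightly different conventions for locating Q1 and Q3, so other tools may give slightly different quartiles.

```python
import statistics

# Hypothetical measurements, invented for illustration; 50 is an outlier.
data = [2, 4, 4, 5, 6, 7, 8, 9, 50]

data_range = max(data) - min(data)

# Quartile conventions differ between packages; method="inclusive"
# is one common choice and is assumed here.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1        # interquartile range
semi_iqr = iqr / 2   # semi-interquartile range

print(data_range, iqr, semi_iqr)  # 48 4.0 2.0
```

The single outlier accounts for most of the range (48), but has no effect on the interquartile range (4).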
b) Variance:
A deviation score is a measure of by how much each point in a frequency distribution lies above or below the mean for the entire dataset:
$$\text{deviation score} = X - \bar{X}$$
where:
X = raw score
$\bar{X}$ = the mean
Note that if you add all the deviation scores for a dataset together, you automatically get zero, because the positive and negative deviations cancel out.
In order to define the amount of deviation of a dataset from the mean, square each deviation score and calculate the mean of the squared deviation scores, i.e. the variance:
$$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$$
where N = the number of datapoints.
Note that the variance is expressed in squared units, e.g. if the raw scores are weight in kg, the variance is in kg².
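A short worked example may help here. The sketch below computes deviation scores and the variance for a small invented set of weights (the `weights` values are hypothetical), treating the data as a whole population:

```python
import statistics

# Hypothetical weights in kg, invented for illustration.
weights = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(weights)            # 5 kg
deviations = [x - mean for x in weights]   # deviation scores, in kg
squared = [d ** 2 for d in deviations]     # squared deviation scores, in kg squared

# Variance = mean of the squared deviation scores (whole-population form).
variance = sum(squared) / len(weights)

print(sum(deviations))  # 0: positive and negative deviations cancel
print(variance)         # 4.0 (kg squared)
```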
Thus it is more useful to consider the square root of the variance, which is the:
c) Standard Deviation (SD):
To determine the standard deviation of a dataset:

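As with the other measures of data variability, the standard deviation determined from a sample (subset) of a dataset will be biased - the deviations are measured from the sample mean rather than the population mean, and extreme values are less likely to be included, so it will tend to underestimate the population standard deviation. Hence the formula is modified for samples compared with whole populations: the sum of the squared deviation scores is divided by n - 1 rather than by N.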
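The difference between the population and sample forms of the standard deviation can be seen in a minimal sketch. The `sample` values are invented, and the sample form shown is the usual one that divides by n - 1:

```python
import math

# Hypothetical measurements, invented for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
mean = sum(sample) / n
sum_sq = sum((x - mean) ** 2 for x in sample)  # sum of squared deviation scores

population_sd = math.sqrt(sum_sq / n)        # data are the whole population
sample_sd = math.sqrt(sum_sq / (n - 1))      # data are a sample of a larger population

print(population_sd)        # 2.0
print(round(sample_sd, 3))  # 2.138
```

The sample form always gives the slightly larger value, and the difference shrinks as n increases. Python's `statistics.pstdev` and `statistics.stdev` implement the two forms respectively.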
Why do we look at samples instead of whole populations?
Since the standard deviation is calculated in a similar fashion to the mean, i.e. by taking every data point in the population or sample into consideration, it is not of great value for highly skewed datasets.
Skewed datasets can often be brought closer to a normal distribution by a mathematical transformation, e.g. dividing by the median and taking logarithms, see above.
However, if you do this, you must take account of the fact that descriptors of the dataset such as the standard deviation, etc. have been altered by the transformation!

Any statistic can have a standard error.
The standard error of a statistic is the standard deviation of the sampling distribution of that statistic.
The inferential statistics involved in the construction of confidence intervals (CI) and significance testing are based on standard errors.
The standard deviation is an index of how closely individual data points cluster around the mean - SD refers to individual data points.
Standard errors are important because they reflect how much sampling fluctuation a statistic will show, i.e. how good an estimate of the population the sample statistic is - SE refers to the variability of the sample statistic (e.g. the standard error of the mean - SEM).
How good an estimate of the population mean is the mean of a sample? One way to determine this is to repeat the experiment many times and to look at how much the sample means vary around the mean of the means. However, this is tedious and frequently impossible.
Fortunately, standard errors can be calculated from a single experiment.
The standard error of any statistic depends on the sample size - in general, the larger the sample size the smaller the standard error. For the mean, thus:
$$SEM = \frac{SD}{\sqrt{n}}$$
where n = the sample size.
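The dependence of the standard error on sample size can be sketched for the mean, using the standard relationship SEM = SD / √n. The function name and the numbers below are illustrative assumptions:

```python
import math

def standard_error_of_mean(sd, n):
    """Standard error of the mean for a sample of size n with standard deviation sd."""
    return sd / math.sqrt(n)

# Same standard deviation (2.0), increasing sample size.
for n in (4, 16, 64):
    print(n, standard_error_of_mean(2.0, n))
# 4 1.0
# 16 0.5
# 64 0.25
```

Because of the square root, the sample size has to be quadrupled to halve the standard error.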

In a normal distribution:

There is less than a 1 in 20 chance of any sample falling outside ±2 SD (95% CI, P = 0.05) and less than a 1 in 100 chance of any sample falling outside ±3 SD (99% CI, P = 0.01).
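These proportions can be checked with a small sketch using the standard normal distribution from Python's `statistics` module (the helper name `fraction_within` is made up for this example):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within ±k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    inside = fraction_within(k)
    print(f"±{k} SD: {inside:.4f} inside, {1 - inside:.4f} outside")
```

The fraction outside ±2 SD is a little under 0.05 and the fraction outside ±3 SD is well under 0.01, consistent with the figures quoted above. The exact 95% limits lie at approximately ±1.96 SD.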
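Use the CONFIDENCE function, syntax: CONFIDENCE(alpha,standard_dev,size) where:
alpha = the significance level (e.g. 0.05 for a 95% CI)
standard_dev = the standard deviation of the data
size = the sample size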
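For readers working outside a spreadsheet, the sketch below shows an equivalent calculation in Python. It assumes that CONFIDENCE refers to the spreadsheet function of that name, which is based on the normal distribution and returns the half-width of the interval (the value to add to and subtract from the sample mean). The Python function name and the example figures are illustrative only:

```python
import math
from statistics import NormalDist

def confidence(alpha, standard_dev, size):
    """Half-width of a normal-based confidence interval for the mean.

    alpha        -- significance level, e.g. 0.05 for a 95% CI
    standard_dev -- standard deviation of the data (treated as known)
    size         -- sample size
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z * standard_dev / math.sqrt(size)

margin = confidence(0.05, 2.5, 50)
print(round(margin, 4))  # 0.693
```

With a sample mean of, say, 30, the 95% CI would run from about 30 - 0.693 to 30 + 0.693. For small samples, a calculation based on the t distribution gives a wider interval and may be more appropriate.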
When do you use SD, SE or CI?
Remember:
SD & SE have units - the same units that the datapoints were measured in!
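The confidence level of a CI (e.g. 95%, P = 0.05) has no units since it is a probability, but the limits of the interval are in the same units as the datapoints.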
© MicrobiologyBytes 2009.