MicrobiologyBytes: Maths & Computers for Biologists: Descriptive Statistics Updated: February 6, 2009

Descriptive Statistics

Further information on this topic can be found in Chapter 8 of:

Maths from Scratch for Biologists

Numerical ability is an essential skill for everyone studying the biological sciences but many students are frightened by the 'perceived' difficulty of mathematics, and are nervous about applying mathematical skills in their chosen field of study. Maths from Scratch for Biologists is a highly instructive, informal text that explains step by step how and why you need to tackle maths within the biological sciences. (Amazon.co.UK)

 

Remember:

Statistics is the systematic collection and display of numerical data.

There are many different-shaped frequency distributions:

Frequency distributions

 

How do we describe these? By measuring the:

Central Tendency and Variability

 

The central tendency of a dataset, i.e. the centre of a frequency distribution, is most commonly measured by the 3 Ms:

 

  • Mode: The most frequently occurring value in a dataset. Easy to determine, but subject to variation & of limited value.
  • Median: The middle value in a dataset, i.e. half the values are greater than the median and the other half are less. The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions, e.g. family income.
  • Mean: The average value of a dataset, i.e. the sum of all the data divided by the number of values.
    The arithmetic mean is commonly called the "average". When the word "mean" is used without a modifier, it usually refers to the arithmetic mean.
    The mean is a good measure of central tendency for symmetrical (e.g. normal) distributions but can be misleading in skewed distributions since it is influenced by outliers. Therefore, other statistics such as the median may be more informative for distributions such as reaction time or family income that are frequently very skewed. The mean, median, and mode are equal in symmetrical, single-peaked frequency distributions. The mean is higher than the median in positively (right) skewed distributions and lower than the median in negatively (left) skewed distributions.

    Formula for the arithmetic mean:   $\bar{X} = \frac{\sum X}{N}$
    where X is the raw data and N or n is the number of scores.

     

    • The geometric mean is the nth root of the product of the scores, e.g:
      the geometric mean of the scores: 1, 2, 3, and 4 is the 4th root of 1 * 2 * 3 * 4 which is the 4th root of 24 = 2.21.
      The geometric mean is less affected by extreme values than the arithmetic mean and is useful for some positively skewed distributions.
      What is the use of the geometric mean?
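The measures of central tendency above can be tried out with Python's standard `statistics` module. This is only a minimal sketch: the `data` list is a made-up, positively skewed dataset, not taken from any real experiment, and `geometric_mean` needs Python 3.8 or later.

```python
import statistics

# Hypothetical, positively skewed dataset (one very large value).
data = [20, 22, 22, 25, 27, 30, 200]

mode = statistics.mode(data)                # most frequently occurring value
median = statistics.median(data)            # middle value of the sorted data
mean = statistics.mean(data)                # sum of values / number of values
geo_mean = statistics.geometric_mean(data)  # nth root of the product

print(mode, median, round(mean, 1), round(geo_mean, 1))

# The worked example from the text: geometric mean of 1, 2, 3 and 4.
example = statistics.geometric_mean([1, 2, 3, 4])
print(round(example, 2))  # 2.21
```

In this skewed example the single large value pulls the arithmetic mean well above the median, while the geometric mean is pulled up less.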

 

Transforming Data:

In the results from DNA microarray experiments, many of the hybridized spots have very low fluorescence. However, a few of the spots have very strong fluorescence. This produces a strongly positively skewed distribution:

graph

In this case, the relatively small number of high-intensity samples distorts the results. Since this is not a normal distribution, calculating the mean is misleading and statistical tests based on the normal distribution would give meaningless answers. A common solution is to divide the fluorescence intensity of each sample by the median fluorescence intensity and then take the logarithm of the result. (Dividing by the median on its own only rescales the data; it is the logarithm that removes the skew.) Plotting the transformed values gives a distribution that is much closer to normal.

Transforming data to allow you to use parametric statistics is completely legitimate. Different sorts of mathematical transformation work best for different datasets.
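As a rough illustration of what a transformation does to a skewed dataset, the sketch below simulates positively skewed "intensities" and compares their skewness before and after transformation. The simulated data, the `skewness` helper and the choice of a base-2 logarithm are all assumptions made for this example; real microarray data may need a different treatment.

```python
import math
import random
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation divided by SD cubed."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

random.seed(1)  # fixed seed so the sketch is repeatable
# Simulated intensities: many low values and a few very high ones.
intensity = [random.lognormvariate(0, 1) for _ in range(5000)]

med = statistics.median(intensity)
scaled = [v / med for v in intensity]             # divide by the median only
logged = [math.log2(v / med) for v in intensity]  # median-scale, then take logs

skew_raw = skewness(intensity)
skew_scaled = skewness(scaled)
skew_logged = skewness(logged)
print(round(skew_raw, 2), round(skew_scaled, 2), round(skew_logged, 2))
```

Rescaling by a constant leaves the skewness unchanged, whereas the log-transformed values have a skewness close to zero, as expected for a roughly symmetrical distribution.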

Variability:

Whereas the central tendency is a summary measure of the overall level of a dataset, variability (or dispersion) measures the amount of scatter in a dataset.

Example:

Variability

Variability is commonly measured by three criteria:

a) Range:

The difference between the largest and the smallest value in the dataset. Since the range only takes into account two values from the entire dataset, it may be heavily influenced by outliers in the data. Therefore, the interquartile range is commonly used instead, i.e. the distance between the 25th and 75th percentiles (Q3 - Q1). N.B. By definition, this contains the middle 50% of the datapoints in any dataset, whatever the shape of its distribution.
The semi-interquartile range, i.e. (Q3 - Q1)/2, is half the interquartile range; in a symmetrical distribution, the median ± the semi-interquartile range spans the middle 50% of the datapoints. Like the interquartile range, it is little affected by outliers.
Although the range is a crude measure of variability, it is easy to calculate and useful as an outline description of a dataset (EDA, box & whisker plot). Note that the interquartile range and semi-interquartile range are based on percentiles, as is the median (the 50th percentile).
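The range and interquartile range can be sketched in Python as follows. The dataset is hypothetical, and note that several slightly different conventions exist for calculating quartiles; the `"inclusive"` method of `statistics.quantiles` used here is just one of them, so other software may give slightly different values of Q1 and Q3.

```python
import statistics

# Hypothetical dataset with one outlier (50).
data = [2, 4, 4, 5, 6, 7, 8, 9, 50]

data_range = max(data) - min(data)

# quantiles(n=4) returns the three quartile cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1        # interquartile range
semi_iqr = iqr / 2   # semi-interquartile range

print(data_range, q1, q2, q3, iqr, semi_iqr)
```

Here the single outlier inflates the range to 48, while the interquartile range (4.0) is unaffected by it. Q2 is the median.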

b) Variance:

A deviation score measures how far each point in a frequency distribution lies above or below the mean for the entire dataset:

$\text{deviation score} = X - \bar{X}$

where:
X = raw score
$\bar{X}$ = the mean

Note that if you add all the deviation scores for a dataset together, you always get zero, because the deviations above and below the mean cancel out exactly.
In order to define the amount of deviation of a dataset from the mean, square each deviation score first and then calculate the mean of the squared deviation scores, i.e. the variance:

$\text{variance} = \frac{\sum (X - \bar{X})^2}{N}$

Note that the variance is expressed in squared units, e.g. if the raw scores are weight in kg, the variance is in kg².

Thus it is more useful to consider the square root of the variance, which is the:

c) Standard Deviation (SD):

To determine the standard deviation of a dataset:

  1. Calculate the mean of all the scores
  2. Find the deviation of each score from the mean
  3. Square each deviation
  4. Calculate the average of the squared deviations (this is the variance)
  5. Take the square root of this average

$SD = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$
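The five steps can be followed one at a time in a short Python sketch. The scores are made up for illustration, and the calculation divides by N, i.e. it treats the data as a whole population.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

mean = sum(data) / len(data)            # step 1: mean of all the scores
deviations = [x - mean for x in data]   # step 2: deviation of each score
squared = [d ** 2 for d in deviations]  # step 3: square each deviation
variance = sum(squared) / len(data)     # step 4: average squared deviation
sd = math.sqrt(variance)                # step 5: square root

print(sum(deviations))     # the raw deviation scores sum to zero
print(mean, variance, sd)  # 5.0 4.0 2.0

# Cross-check against the standard library's population formulas.
assert math.isclose(variance, statistics.pvariance(data))
assert math.isclose(sd, statistics.pstdev(data))
```

Squaring in step 3 is what stops the positive and negative deviations from cancelling out.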

 

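As with the other measures of data variability, the standard deviation determined from a sample (subset) of a dataset will be biased: deviations are measured from the sample mean rather than the true population mean, and extreme values are less likely to be included, so it will tend to underestimate the population standard deviation. Hence the formula is modified for samples compared with whole populations: the sum of the squared deviations is divided by n - 1 instead of N.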
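The effect of the sample correction can be seen in a small simulation. This is a sketch under stated assumptions: the population is simulated (mean 30, SD 2.5), the sample size of 5 is arbitrary, and the comparison is made on variances, since that is where dividing by n - 1 removes the bias.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable
# Hypothetical population of 10,000 values.
population = [random.gauss(30, 2.5) for _ in range(10_000)]
pop_variance = statistics.pvariance(population)

n = 5
divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(5000):
    sample = random.sample(population, n)
    divide_by_n.append(statistics.pvariance(sample))         # sum of squares / n
    divide_by_n_minus_1.append(statistics.variance(sample))  # sum of squares / (n - 1)

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)
print(round(pop_variance, 2), round(avg_n, 2), round(avg_n_minus_1, 2))
```

Averaged over many small samples, the uncorrected formula comes out clearly below the population variance, while the n - 1 version lands much closer to it.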

Why do we look at samples instead of whole populations?

Since the standard deviation is calculated in a similar fashion to the mean, i.e. by taking every data point in the population or sample into consideration, it is strongly influenced by outliers and is not of great value for highly skewed datasets.

Skewed datasets can often be brought closer to a normal distribution by a mathematical transformation (see Transforming Data above).
However, if you do this, you must take account of the fact that descriptors of the dataset such as the standard deviation, etc. have been altered by the transformation!

Normal distribution

 

Standard Error (SE)

Any statistic can have a standard error.

The standard error of a statistic is the standard deviation of the sampling distribution of that statistic.

The inferential statistics involved in the construction of confidence intervals (CI) and significance testing are based on standard errors.

The standard deviation is an index of how closely individual data points cluster around the mean - SD refers to individual data points.

Standard errors are important because they reflect how much sampling fluctuation a statistic will show, i.e. how good an estimate of the population the sample statistic is - SE refers to the variability of the sample statistic (e.g. the standard error of the mean - SEM).

How good an estimate of the population mean is the mean of a sample? One way to determine this is to repeat the experiment many times and to look at the mean and spread of the resulting sample means. However, this is tedious and frequently impossible.

Fortunately, standard errors can be calculated from a single experiment.

The standard error of any statistic depends on the sample size - in general, the larger the sample size the smaller the standard error, thus:

Standard error of the mean: $SEM = \frac{SD}{\sqrt{N}}$
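The sketch below compares the two routes to the standard error of the mean: calculating it from a single experiment, and repeating the experiment many times. The population parameters (mean 30, SD 2.5), the sample size of 50 and the helper name `standard_error_of_mean` are assumptions chosen for illustration.

```python
import math
import random
import statistics

def standard_error_of_mean(sample):
    """SEM = sample standard deviation / square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

random.seed(3)  # fixed seed so the sketch is repeatable

# One "experiment": 50 measurements from a hypothetical population.
sample = [random.gauss(30, 2.5) for _ in range(50)]
sem_single = standard_error_of_mean(sample)

# The tedious alternative: repeat the experiment 2000 times and take
# the standard deviation of the 2000 sample means.
means = [statistics.fmean([random.gauss(30, 2.5) for _ in range(50)])
         for _ in range(2000)]
sd_of_means = statistics.stdev(means)

print(round(sem_single, 3), round(sd_of_means, 3))
```

Both values should be close to 2.5 / √50 ≈ 0.35, which is why a single experiment is usually enough to estimate the standard error.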

 

Confidence Intervals (CI)

Statistics means never having to say you're certain.

In a normal distribution:

Normal distribution

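There is less than a 1 in 20 chance of any single value falling outside ±2 SD of the mean (95% CI, P = 0.05) and less than a 1 in 100 chance of any single value falling outside ±3 SD (99% CI, P = 0.01). More precisely, 95% of values lie within ±1.96 SD and 99% within ±2.58 SD of the mean.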
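These proportions can be checked with the `NormalDist` class in Python's standard library (Python 3.8 or later). The helper name `fraction_within` is made up for this sketch.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, SD 1

def fraction_within(k):
    """Fraction of a normal distribution lying within ±k SD of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))

# Multiples of the SD that enclose exactly 95% and 99% of the distribution.
z95 = z.inv_cdf(0.975)
z99 = z.inv_cdf(0.995)
print(round(z95, 2), round(z99, 2))  # 1.96 2.58
```

About 95.4% of values fall within ±2 SD and about 99.7% within ±3 SD, so the "1 in 20" and "1 in 100" figures are slightly conservative.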

How to use Excel to find confidence intervals:

Use the CONFIDENCE function, syntax:

CONFIDENCE(alpha,standard_dev,size)

where:
alpha is the significance level used to compute the confidence level, e.g. an alpha of 0.05 indicates a 95 percent confidence level.
standard_dev is the population standard deviation for the dataset.
size is the sample size.
Example:
In a dataset (sample) of size 50, with a mean of 30 & population standard deviation of 2.5, we can be 95% confident that the population mean lies within CONFIDENCE(0.05,2.5,50) = 0.7 of the sample mean. In other words, the population mean is estimated as 30 ± 0.7, i.e. 29.3 to 30.7
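For readers without Excel, the same calculation can be sketched in Python. The function below is a stand-in written for this example, not Excel's own code: it assumes the value returned is the half-width of a normal-based interval, z × SD / √n, which reproduces the figure in the example above.

```python
import math
from statistics import NormalDist

def confidence(alpha, standard_dev, size):
    """Half-width of a normal-based confidence interval for a mean.

    The arguments mirror those of the Excel call described in the text.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z * standard_dev / math.sqrt(size)

half_width = confidence(0.05, 2.5, 50)
mean = 30
print(round(half_width, 1))                                       # 0.7
print(round(mean - half_width, 1), round(mean + half_width, 1))   # 29.3 30.7
```

Because the half-width is divided by √n, quadrupling the sample size halves the width of the interval.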

When do you use SD, SE or CI?

 

Remember:

SD & SE have units - the same units that the datapoints were measured in!

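A CI is a range of values and so has the same units as the statistic it describes (e.g. 30 ± 0.7); only the confidence level attached to it is a probability with no units (e.g. 95%, P = 0.05).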

 

We don't expect you to do ALL statistical calculations by hand
- use the worksheet formulae in Microsoft Excel!

 

© MicrobiologyBytes 2009.