MicrobiologyBytes: Maths & Computers for Biologists: Introduction to Statistics  Updated: February 6, 2009
Further information on this topic can be found in Chapter 7 of:
Maths from Scratch for Biologists
Statistics (the systematic collection and display of numerical data) is the most abused area of numeracy.
Unless appropriate statistical techniques are used, the results are meaningless:
Garbage In, Garbage Out!
The most important aspect of statistics is:
Statistics are a tool, not an objective in themselves; therefore, framing the question to be asked in a form appropriate for statistical analysis is a critical step.
Statistics are not a mysterious black box: you must UNDERSTAND what you are trying to do, or you will not be able to construct the right questions and apply the right techniques:
Research Question > Statistical Question > Data Collection > Statistical Conclusion > Research Conclusion
It is not sufficient to know the formulae for various statistical tests: you must know when (and when not) to apply them!
Statistical Variables:
Different classes of information are known as the variables of a dataset, e.g:
Variables may be classified as either quantitative or qualitative:
Which of the above variables are quantitative and which are qualitative?
Qualitative data can be divided into:
Nominal variables: Variables with no inherent order or ranking sequence, e.g. numbers used as names (group 1, group 2...), gender, etc.
Ordinal variables: Variables with an ordered series, e.g. "greatly dislike, moderately dislike, indifferent, moderately like, greatly like". Numbers assigned to such variables indicate rank order only; the "distance" between the numbers has no meaning.
Quantitative data can be divided into:
Interval variables: Equally spaced variables, e.g. temperature. The difference between a temperature of 66 degrees and 67 degrees is taken to be the same as the difference between 76 degrees and 77 degrees. Interval variables do not have a true zero, e.g. 88 degrees is not necessarily double the temperature of 44 degrees.
Ratio variables: Variables spaced at equal intervals with a true zero point, e.g. age.

Discrete variables: The set of all possible values which consists only of isolated points, e.g. counting variables (1, 2, 3 ...).
Continuous variables: The set of all values which consists of intervals, e.g. 0-9, 10-19, 20-29... etc.
Descriptive statistics are used to analyze the basic features of data under consideration.
Gender:    Age:   20-29   30-39   40-49   50-59   60-69
Male:             12      13      7       8       7
Female:           12      14      10      9       11
Numerical summaries:
Bar Charts / Histograms
Pie Charts
Scatter Diagrams
plus many others... (look up "Chart Types" in MS Excel Help)
The basis of most statistical investigations is construction of a frequency distribution:
the number of observations for each of the possible categories in a dataset
Age:   Frequency:
1      11
2      12
3      19
4      6
5      2
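A frequency distribution like the one above can be built by counting how often each category occurs. The following is a minimal Python sketch; the list `observations` and the function name `frequency_distribution` are hypothetical, and the raw data are made up purely so that the counts match the table.

```python
from collections import Counter

# Hypothetical raw observations, constructed to reproduce the table above.
observations = [1] * 11 + [2] * 12 + [3] * 19 + [4] * 6 + [5] * 2


def frequency_distribution(values):
    """Return {category: number of observations}, sorted by category."""
    return dict(sorted(Counter(values).items()))


freq = frequency_distribution(observations)
for category, count in freq.items():
    print(f"{category:>3} {count:>4}")
print("N =", sum(freq.values()))
```

The frequencies always sum to the number of observations, which is a useful check on any hand-built table.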

Grouped frequency distributions:
Remember that the aim of descriptive statistics is to analyze the basic features of a dataset. Complex data can be simplified by combining individual scores to form a smaller number of groups, referred to as class intervals. The intervals should be:
e.g:
Marks:    Frequency:
91-100    1
81-90     2
71-80     9
61-70     0
51-60     2
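Grouping individual scores into class intervals can be sketched in Python as follows. This assumes whole-number marks from 1 to 100 and intervals of the form 1-10, 11-20, and so on; the marks themselves and the function names are hypothetical, chosen only so that the result matches the table above.

```python
from collections import Counter


def class_interval(mark, width=10):
    """Return the (lower, upper) class interval containing a whole-number mark."""
    lower = ((mark - 1) // width) * width + 1
    return (lower, lower + width - 1)


def grouped_frequencies(marks, width=10):
    """Count marks per class interval, highest interval first.

    Empty intervals between the lowest and highest occupied interval
    are included with a frequency of 0.
    """
    counts = Counter(class_interval(m, width) for m in marks)
    lowest = min(counts)[0]
    highest = max(counts)[0]
    return {
        (lower, lower + width - 1): counts.get((lower, lower + width - 1), 0)
        for lower in range(highest, lowest - 1, -width)
    }


# Hypothetical marks for 14 students.
marks = [95, 88, 82, 80, 79, 75, 74, 73, 72, 72, 71, 71, 58, 52]

grouped = grouped_frequencies(marks)
for (lower, upper), count in grouped.items():
    print(f"{lower}-{upper}: {count}")
```

Keeping the empty 61-70 interval in the output makes the gap in the data visible rather than hiding it.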

Cumulative frequency distributions:
Cumulative frequency distributions are useful to show what proportion of a dataset lies above or below certain limits, e.g:
Marks:    Frequency:   Cumulative f:   Cumulative %:
91-100    1            50              100
81-90     9            49              98
71-80     9            40              80
61-70     13           31              62
51-60     7            18              36
41-50     6            11              22
31-40     4            5               10
21-30     0            1               2
11-20     0            1               2
1-10      1            1               2
N = 50
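The cumulative columns are running totals of the frequency column, starting from the lowest class. A minimal Python sketch, using the frequencies from the table (the names `classes` and `cumulative_table` are hypothetical):

```python
from itertools import accumulate

# (lower limit, upper limit, frequency), lowest class first.
classes = [
    (1, 10, 1), (11, 20, 0), (21, 30, 0), (31, 40, 4), (41, 50, 6),
    (51, 60, 7), (61, 70, 13), (71, 80, 9), (81, 90, 9), (91, 100, 1),
]


def cumulative_table(classes):
    """Return rows of (lower, upper, f, cumulative f, cumulative %)."""
    freqs = [f for _, _, f in classes]
    total = sum(freqs)
    running = accumulate(freqs)  # running total of frequencies
    return [
        (lower, upper, f, cum, 100 * cum / total)
        for (lower, upper, f), cum in zip(classes, running)
    ]


rows = cumulative_table(classes)
for lower, upper, f, cum, pct in reversed(rows):  # highest class first
    print(f"{lower}-{upper}: f={f}, cumulative f={cum}, cumulative %={pct:.0f}")
```

The cumulative % for a class is the percentage of cases falling in that class or below it, so the top class always reaches 100.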

What percentage of this class achieved the required pass mark of 41%?
Percentiles are points on a frequency distribution below which a specified percentage of cases in the distribution fall, e.g: a person scoring at the 75th percentile did better than 75% of those in the distribution:
(n + 1) * P / 100
where:
n = number of cases
P = desired percentile
N.B. the 25th, 50th and 75th percentiles are also referred to as quartiles (Q).
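For ungrouped data, the expression (n + 1) * P / 100 gives the position of the percentile in the sorted data, counting from 1. The sketch below applies it in Python. Interpolating linearly between neighbouring values when the position is not a whole number is an assumption made here for illustration, and the function names and data are hypothetical.

```python
def percentile_position(n, p):
    """Position (counting from 1) of the p-th percentile among n sorted cases."""
    return (n + 1) * p / 100


def percentile(values, p):
    """Value at the p-th percentile of ungrouped data.

    Assumption: a fractional position is resolved by linear interpolation
    between the two neighbouring sorted values.
    """
    data = sorted(values)
    n = len(data)
    # Keep the position inside the data (1..n).
    pos = min(max(percentile_position(n, p), 1), n)
    lower = int(pos)          # whole part of the position
    fraction = pos - lower    # fractional part of the position
    if lower >= n:
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


scores = [14, 2, 10, 6, 12, 4, 8]  # hypothetical data, n = 7

q1 = percentile(scores, 25)   # position 2
q2 = percentile(scores, 50)   # position 4 (the median)
q3 = percentile(scores, 75)   # position 6
print(q1, q2, q3)
```

Other software may use slightly different position rules, so quartiles computed by different packages do not always agree exactly.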
To calculate percentiles from grouped data an interpolation method is required:
P_{x} = LCB + (N/C * W)
where:
LCB = the value (percentage) of the lower class (frequency group) boundary
N = the number of additional cases required to make up the desired percentile
C = the number of cases in the class (frequency group)
W = the width of the class (upper limit minus lower limit, e.g. 90 - 81 = 9)
e.g:
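Marks:    Frequency:   Cumulative f:   Cumulative %:
91-100    1            50              100
81-90     9            49              98
71-80     9            40              80
61-70     13           31              62
51-60     7            18              36
41-50     6            11              22
31-40     4            5               10
21-30     0            1               2
11-20     0            1               2
1-10      1            1               2
N = 50

Examples:
90% of 50 = 45 cases: 40 cases lie below the 81-90 class, so 5 of its 9 cases are needed. P_{90} = 81 + (5/9 * 9) = 86%
50% of 50 = 25 cases: 18 cases lie below the 61-70 class, so 7 of its 13 cases are needed. P_{50} = 61 + (7/13 * 9) = 65.85%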
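The interpolation used in the two worked examples can be sketched in Python as follows. The function name `grouped_percentile` is hypothetical. The class width of 9 follows the worked examples (upper limit minus lower limit) and should be treated as an assumption of this sketch.

```python
def grouped_percentile(classes, p):
    """Estimate the p-th percentile from grouped data by interpolation.

    classes: list of (lower boundary, class width, frequency), lowest first.
    """
    total = sum(freq for _, _, freq in classes)
    target = total * p / 100      # number of cases at or below the percentile
    below = 0                     # cases in all lower classes so far
    for lower, width, freq in classes:
        if freq > 0 and below + freq >= target:
            needed = target - below   # additional cases needed from this class
            return lower + needed / freq * width
        below += freq
    raise ValueError("percentile must be between 0 and 100")


# (lower boundary, width, frequency) taken from the table above.
marks_classes = [
    (1, 9, 1), (11, 9, 0), (21, 9, 0), (31, 9, 4), (41, 9, 6),
    (51, 9, 7), (61, 9, 13), (71, 9, 9), (81, 9, 9), (91, 9, 1),
]

p90 = grouped_percentile(marks_classes, 90)
p50 = grouped_percentile(marks_classes, 50)
print(f"P90 = {p90:.2f}%, P50 = {p50:.2f}%")
```

The method assumes that cases are spread evenly within each class, which is why it only gives an estimate of the percentile.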
You may wish to note that MS Excel uses the following formulae to calculate quartiles, assuming the data are in ascending order:
(Freund, J and Perles, B (1987) "A New Look at Quartiles of Ungrouped Data", The American Statistician, 41, 3, 200-203)
Remember that statistics is the science of collecting, classifying and interpreting numerical data. Graphs make it easy to see features in numerical data.

A frequency histogram is a series of rectangles representing the frequencies of the class intervals.
The same data can also be drawn as a frequency polygon:
Which format is best?
A number of differently shaped frequency distributions are common:
In skewed (i.e. non-symmetrical) frequency distributions, the "tail" of the frequency distribution points in the direction of the skew, e.g.
Negative skew 
Positive skew 
In numerical terms, the shape of a frequency distribution is measured by two parameters:
Skewness, which is calculated from the mean (µ) and the standard deviation (s) of the data. A negative value indicates a negative skew and a positive value a positive skew. In Microsoft Excel, skew can be calculated using the SKEW function: SKEW(array).
Kurtosis, which is also calculated from the mean (µ) and the standard deviation (s) of the data. Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution (value = 0). Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution. In Microsoft Excel, kurtosis can be calculated using the KURT function: KURT(array).
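As an illustration, the two shape measures can be computed in Python as sketched below. This assumes the bias-corrected sample formulae documented for Excel's SKEW and KURT functions, which use the sample mean and the sample standard deviation; the function names and datasets are hypothetical.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """Sample skewness (assumed to match the formula documented for Excel SKEW)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def sample_kurtosis(values):
    """Sample excess kurtosis (assumed to match the formula documented for Excel KURT).

    A normal distribution scores approximately 0.
    """
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))


symmetric = [1, 2, 3, 4, 5]      # hypothetical data, no skew
right_tail = [1, 1, 1, 2, 10]    # long tail to the right: positive skew
left_tail = [1, 9, 10, 10, 10]   # long tail to the left: negative skew

print(sample_skewness(symmetric), sample_skewness(right_tail), sample_skewness(left_tail))
print(sample_kurtosis(symmetric))
```

The sign of the skewness follows the direction of the tail, and the flat, evenly spread dataset gives a negative kurtosis.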
When many independent random factors act in an additive manner to create variability, the dataset follows a bell-shaped distribution called the normal (or Gaussian, after Carl Friedrich Gauss, 1777-1855) distribution:
The normal distribution has some special mathematical properties which form the basis of many statistical tests.
Although no real datasets follow the normal distribution exactly, many kinds of data follow a distribution that is approximately Gaussian.
A normal distribution can be defined by 2 parameters, the mean and the standard deviation.
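Because the mean and standard deviation fully define a normal distribution, they are enough to calculate the proportion of values expected in any range. A minimal sketch using `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the mean of 100 and standard deviation of 15 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

# A normal distribution is fully specified by its mean and standard deviation.
dist = NormalDist(mu=100, sigma=15)  # hypothetical parameters


def proportion_within(dist, k):
    """Proportion of values within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * proportion_within(dist, k):.1f}%")
```

These proportions (roughly 68%, 95% and 99.7%) are the same for every normal distribution, whatever its mean and standard deviation.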
N.B: Not all datasets follow a normal distribution, e.g.
Statistical methods which depend on the parameters of populations or probability distributions are referred to as parametric methods. Parametric tests include:
These tests are only meaningful for continuous data which is sampled from a population with an underlying normal distribution or whose distribution can be rendered normal by mathematical transformation.
Nonparametric methods require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations:
Nonparametric methods are useful in situations where the assumptions required by parametric methods appear questionable. A few of the more commonly used nonparametric methods include:
These tests are characterised as distribution free, i.e. neither the values obtained nor the population from which the sample was drawn need have a normal distribution. Unlike parametric tests, which can give erroneous results when their assumptions are not met, these can be used regardless of the distribution of the data.
Unfortunately, they are less flexible in practice and less powerful than parametric tests. In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend using parametric methods because they tend to provide better precision.
Accuracy: a measurement of how close the average of a large set of measurements is to the true or target value.
Precision: a measure of the closeness of repeated observations to each other without reference to the true or target value, i.e. the reproducibility of the result.
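The two definitions can be expressed numerically: accuracy as the difference between the mean of the measurements and the true value, and precision as the spread of the measurements around their own mean. The sketch below uses the standard deviation as the measure of spread, which is an assumption made here for illustration; the measurements and function names are hypothetical.

```python
from statistics import mean, stdev


def accuracy_error(measurements, true_value):
    """Difference between the mean measurement and the true value (0 = accurate)."""
    return mean(measurements) - true_value


def precision_spread(measurements):
    """Standard deviation of repeated measurements (smaller = more precise)."""
    return stdev(measurements)


true_value = 10.0
# Hypothetical repeated measurements of the same quantity.
accurate_but_imprecise = [8.0, 12.0, 9.0, 11.0, 10.0]
precise_but_inaccurate = [12.1, 12.0, 11.9, 12.0, 12.0]

for name, data in [("accurate but imprecise", accurate_but_imprecise),
                   ("precise but inaccurate", precise_but_inaccurate)]:
    print(f"{name}: error = {accuracy_error(data, true_value):+.2f}, "
          f"spread = {precision_spread(data):.2f}")
```

The first set averages exactly to the true value but is widely scattered. The second is tightly clustered but consistently about 2 units too high.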
Consider the results of the Aardvark's Archery Club:
In order to choose the appropriate statistical test, you must answer two questions:
Dataset (table columns):
Measurement (from a normal distribution)
Rank, Score, or Measurement (from non-normal distribution)
Binomial (e.g. heads or tails)
Survival Time

Goal (table rows):
Describe one group:
Compare one group to a hypothetical value:
Compare two unpaired groups: Fisher's exact test (or chi-square for large samples) for binomial data
Compare two paired groups:
Compare three or more unmatched groups:
Compare three or more matched groups:
Quantify association between two variables:
Predict value from another measured variable:
Predict value from several measured or binomial variables:

On this course, we will consider only the tests in the shaded cells. For information on other statistical tests, consult the table above.
So how do you choose an appropriate statistical test?
We will investigate this further in the next session, Descriptive Statistics.
© MicrobiologyBytes 2009.