MicrobiologyBytes: Maths & Computers for Biologists: Introduction to Statistics (Updated: February 6, 2009)

Introduction to Statistics

Further information on this topic can be found in Chapter 7 of:

Maths from Scratch for Biologists

Numerical ability is an essential skill for everyone studying the biological sciences but many students are frightened by the 'perceived' difficulty of mathematics, and are nervous about applying mathematical skills in their chosen field of study. Maths from Scratch for Biologists is a highly instructive, informal text that explains step by step how and why you need to tackle maths within the biological sciences. (Amazon.co.UK)

Statistics (the systematic collection and display of numerical data) is the most abused area of numeracy.

 

97% of statistics are made up on the spot.

 

Unless appropriate statistical techniques are used, the results are meaningless:

Garbage In, Garbage Out!

The most important aspect of statistics is:

Choosing the right statistical test!

Statistics are a tool, not an objective in themselves - therefore framing the question to be asked in a form appropriate for statistical analysis is a critical step.

Statistics are not a mysterious black box - but unless you UNDERSTAND what you are trying to do, you will not be able to construct the right questions and apply the right techniques:

Research Question -> Statistical Question -> Data Collection -> Statistical Conclusion -> Research Conclusion

It is not sufficient to know the formulae for various statistical tests - you must know when (and when not) to apply them!

Statistical Variables:

Different classes of information are known as the variables of a dataset, e.g:

  • Age
  • Weight
  • Height
  • Gender
  • Marital status
  • Annual income

 

  • Variables which are experimentally manipulated by an investigator are called independent variables.
  • Variables which are measured are called dependent variables.
  • All other factors which may affect the dependent variable are called confounding, extraneous or secondary variables - unless these are the same for each group being tested, comparisons will be unreliable.

 

Variables may be classified as either quantitative or qualitative:

  • Quantitative data measures either how much or how many of something, i.e. a set of observations where any single observation is a number that represents an amount or a count.
  • Qualitative data provide labels, or names, for categories of like items, i.e. a set of observations where any single observation is a word or code that represents a class or category.

Which of the above variables are quantitative and which are qualitative?

 

Qualitative data can be divided into:

Nominal variables: Variables with no inherent order or ranking sequence, e.g. numbers used as names (group 1, group 2...), gender, etc.

Ordinal variables: Variables with an ordered series, e.g. "greatly dislike, moderately dislike, indifferent, moderately like, greatly like".  Numbers assigned to such variables indicate rank order only - the "distance" between the numbers has no meaning.

Interval variables: Equally spaced variables, e.g. temperature. The difference between a temperature of 66 degrees and 67 degrees is taken to be the same as the difference between 76 degrees and 77 degrees. Interval variables do not have a true zero, e.g. 88 degrees is not necessarily double the temperature of 44 degrees.

Ratio variables: Variables spaced at equal intervals with a true zero point, e.g. age.

 

Quantitative data can be divided into:

Discrete variables: The set of all possible values which consists only of isolated points, e.g. counting variables (1, 2, 3 ...).

Continuous variables: The set of all values which consists of intervals, e.g. 0-9, 10-19, 20-29... etc.

[Figure: classification of variables]

 

Statistical Methods:

Descriptive statistics are used to analyze the basic features of data under consideration.

 
             Age:
Gender:      20-29   30-39   40-49   50-59   60-69
Male:          12      13       7       8       7
Female:        12      14      10       9      11

 

 

Bar Charts / Histograms, Pie Charts, Scatter Diagrams:

[Figures: example bar chart, pie chart and scatter diagram]

plus many others... (look up "Chart Types" in MSExcel Help)

 

Frequency Distributions

The basis of most statistical investigations is construction of a frequency distribution:

the number of observations for each of the possible categories in a dataset

Age:          1    2    3    4    5
Frequency:   11   12   19    6    2
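
A frequency distribution is easy to tabulate in a few lines of code. A minimal sketch in Python, using the standard-library Counter (the ages below are made-up example data, not the table above):

```python
# A minimal sketch: tabulating a frequency distribution with the standard library.
# The 'ages' list is made-up example data (not the table above).
from collections import Counter

ages = [3, 1, 2, 3, 3, 5, 2, 1, 4, 3, 2, 1, 3, 4, 5]

frequency = Counter(ages)                 # observations per category
for value in sorted(frequency):
    print(f"Age {value}: {frequency[value]}")
```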

 

Grouped frequency distributions:

Remember that the aim of descriptive statistics is to analyze the basic features of a dataset. Complex data can be simplified by combining individual scores to form a smaller number of groups, referred to as class intervals. The intervals should be mutually exclusive and of equal width, e.g:

e.g:

Marks:     Frequency:
91-100         1
81-90          2
71-80          9
61-70          0
51-60          2
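
Grouping raw scores into class intervals can be done mechanically. A minimal sketch in Python (the marks below are made-up example data):

```python
# A minimal sketch: grouping raw marks into the class intervals 1-10, 11-20, ..., 91-100.
# The 'marks' list is made-up example data.
marks = [95, 84, 88, 72, 75, 77, 79, 71, 73, 74, 78, 76, 70, 65, 55, 58]

grouped = {}
for lower in range(1, 100, 10):           # class lower limits: 1, 11, 21, ..., 91
    upper = lower + 9
    grouped[f"{lower}-{upper}"] = sum(lower <= m <= upper for m in marks)

for interval, count in grouped.items():
    print(interval, count)
```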

 

Cumulative frequency distributions:

Cumulative frequency distributions are useful to show what proportion of a dataset lies above or below certain limits, e.g:

Marks:    Frequency:    Cumulative f:    Cumulative %:
91-100        1              50               100
81-90         9              49                98
71-80         9              40                80
61-70        13              31                62
51-60         7              18                36
41-50         6              11                22
31-40         4               5                10
21-30         0               1                 2
11-20         0               1                 2
1-10          1               1                 2
 
N = 50
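
Cumulative frequencies and percentages are simple running totals. A minimal sketch in Python reproducing the table above:

```python
# A minimal sketch: running (cumulative) totals and cumulative percentages for the
# grouped marks above, listed from the lowest class upwards.
classes = [("1-10", 1), ("11-20", 0), ("21-30", 0), ("31-40", 4), ("41-50", 6),
           ("51-60", 7), ("61-70", 13), ("71-80", 9), ("81-90", 9), ("91-100", 1)]

n = sum(f for _, f in classes)            # N = 50
cum = 0
for interval, f in classes:
    cum += f
    print(f"{interval}: f = {f}, cum f = {cum}, cum % = {100 * cum / n:.0f}")
```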
   

 

What percentage of this class scored the required passmark of 41%?

Percentiles are points on a frequency distribution below which a specified percentage of cases in the distribution fall, e.g: a person scoring at the 75th percentile did better than 75% of those in the distribution:

The position of the Pth percentile in the ordered data is given by:

(n + 1) * P / 100

where:
n = number of cases
P = desired percentile
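
A minimal sketch of this rule in Python, interpolating between adjacent ordered values when the position is not a whole number (the scores are made-up example data):

```python
# A minimal sketch of the (n + 1) * P / 100 rule: find the position of the Pth
# percentile in the ordered data, interpolating when it is not a whole number.
# The 'scores' list is made-up example data.
def percentile(scores, p):
    ordered = sorted(scores)
    n = len(ordered)
    position = (n + 1) * p / 100          # 1-based position in the ordered data
    lower = int(position)
    fraction = position - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

scores = [17, 24, 37, 38, 47, 51, 55, 58, 62, 66, 71, 75, 79, 83, 87]
print(percentile(scores, 25), percentile(scores, 50), percentile(scores, 75))
```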

Marks:   Cumulative f:   Cumulative %:   Percentile:
95.27         129             100
94.13         128              99
92.46         127              98
91.66         126              98
91.66         125              97
91.59         124              96
91.46         123              95
90.92         122              95
90.60         121              94
90.06         120              93
89.66         119              92
89.52         118              91
89.39         117              91
89.06         116              90
88.59         115              89
87.99         114              88
87.93         113              88
87.39         112              87
87.26         111              86
87.12         110              85
86.99         109              84
86.93         108              84
86.86         107              83
86.72         106              82
86.59         105              81
86.59         104              81
86.53         103              80
86.52         102              79
85.99         101              78
85.86         100              78
85.66          99              77
85.19          98              76              P75
85.12          97              75
84.92          96              74
84.86          95              74
84.80          94              73
84.52          93              72
84.39          92              71
84.32          91              71
83.92          90              70
83.65          89              69
83.52          88              68
83.39          87              67
82.92          86              67
82.92          85              66
82.79          84              65
82.59          83              64
82.59          82              64
82.52          81              63
82.46          80              62
81.85          79              61
81.72          78              60
81.66          77              60
81.66          76              59
81.58          75              58
81.19          74              57
80.85          73              57
80.79          72              56
80.79          71              55
80.72          70              54
80.12          69              53
80.05          68              53
79.99          67              52
79.66          66              51              P50
79.52          65              50
79.32          64              50
79.19          63              49
79.12          62              48
79.12          61              47
78.79          60              47
78.18          59              46
78.12          58              45
78.12          57              44
77.99          56              43
77.98          55              43
77.85          54              42
77.79          53              41
77.58          52              40
77.32          51              40
77.18          50              39
76.98          49              38
76.85          48              37
76.45          47              36
76.45          46              36
76.25          45              35
76.19          44              34
76.13          43              33
75.98          42              33
75.45          41              32
75.32          40              31
75.32          39              30
75.19          38              29
75.12          37              29
75.12          36              28
74.78          35              27
74.78          34              26
71.05          33              26              P25
70.98          32              25
70.78          31              24
69.51          30              23
69.05          29              22
68.91          28              22
68.05          27              21
66.31          26              20
66.11          25              19
65.71          24              19
65.38          23              18
63.51          22              17
62.37          21              16
61.71          20              16
60.38          19              15
58.64          18              14
58.32          17              13
57.91          16              12
55.84          15              12
55.04          14              11
54.37          13              10
54.17          12               9
51.84          11               9
51.84          10               8
51.17           9               7
50.97           8               6
50.84           7               5
46.64           6               5
38.30           5               4
37.83           4               3
37.02           3               2
23.96           2               2
17.35           1               1

N.B. the 25th, 50th and 75th percentiles are also referred to as quartiles (Q).

To calculate percentiles from grouped data an interpolation method is required:

  1. Find the class within which the percentile lies.
  2. Determine the percentage between the bottom of the distribution and the class containing the percentile.
  3. Determine the number of additional cases required to make up the percentile.
  4. Assume (!) the scores in the class are evenly distributed.
  5. Find the additional number of cases in the class required to make up the percentile.
  6. Add this to the number of cases between the bottom of the distribution and the class containing the percentile.

Px = LCB + (N/C * i)

where:
LCB = the value (percentage) of the lower class (frequency group) boundary
N = the number of additional cases required to make up the desired percentile
C = the number of cases in the class (frequency group)
i = the width of the class interval (9 in the examples below)

e.g:

Marks:    Frequency:    Cumulative f:    Cumulative %:
91-100        1              50               100
81-90         9              49                98
71-80         9              40                80
61-70        13              31                62
51-60         7              18                36
41-50         6              11                22
31-40         4               5                10
21-30         0               1                 2
11-20         0               1                 2
1-10          1               1                 2
            N = 50

Examples:

P90: 90% of 50 = 45 cases. The 45th case lies in the 81-90 class (40 cases fall below it), so
P90 = 81 + (5/9 * 9) = 86%

P50: 50% of 50 = 25 cases. The 25th case lies in the 61-70 class (18 cases fall below it), so
P50 = 61 + (7/13 * 9) = 65.85%
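
A minimal sketch of the interpolation formula in Python, applied to the grouped marks above (the grouped_percentile helper is purely illustrative):

```python
# A minimal sketch of Px = LCB + (N/C * i) for the grouped marks above.
# Classes are (lower class boundary, frequency), lowest first; N total = 50.
classes = [(1, 1), (11, 0), (21, 0), (31, 4), (41, 6),
           (51, 7), (61, 13), (71, 9), (81, 9), (91, 1)]
interval = 9                              # class interval width used in the notes
n_total = sum(f for _, f in classes)      # 50

def grouped_percentile(p):
    target = n_total * p / 100            # cases needed to reach the percentile
    cum = 0
    for lcb, c in classes:
        if cum + c >= target:             # the percentile lies in this class
            extra = target - cum          # additional cases required (N)
            return lcb + (extra / c) * interval
        cum += c

print(grouped_percentile(90))             # 81 + (5/9 * 9) = 86.0
print(grouped_percentile(50))             # 61 + (7/13 * 9) = 65.85 (approx.)
```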
 

 

You may wish to note that MSExcel calculates quartiles from data sorted into ascending order by taking Q1 at position (n + 3)/4, Q2 (the median) at position (n + 1)/2 and Q3 at position (3n + 1)/4, interpolating between adjacent values when the position is not a whole number:

(Freund, J and Perles, B (1987) "A New Look at Quartiles of Ungrouped Data", The American Statistician, 41, 3, 200-203)
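
In Python, numpy's default percentile method (linear interpolation) should reproduce these Excel-style quartiles. A minimal sketch with made-up data:

```python
# A minimal sketch: numpy's default percentile method (linear interpolation) should
# reproduce Excel-style quartiles. The data are made-up example values.
import numpy as np

data = [17, 24, 37, 38, 47, 51, 55, 58, 62, 66, 71, 75, 79, 83, 87]
q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)
```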

 

Frequency Graphs

Remember that statistics is the science of collecting, classifying and interpreting numerical data. Graphs make it easy to see features in numerical data.

Histograms

Marks:    f:    Cum f:    Cum %:
90-99     0      50        100
80-89     2      50        100
70-79     6      48         96
60-69     9      42         84
50-59    12      33         66
40-49    10      21         42
30-39     7      11         22
20-29     3       4          8
10-19     1       1          2
0-9       0       0          0

[Figures: frequency histogram and frequency polygon of these data]

 

A frequency histogram is a series of rectangles representing the frequencies of the class intervals.

The same data can also be drawn as a frequency polygon, in which the class frequencies are plotted at the midpoints of the class intervals and joined by straight lines.
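
Both plots are easy to reproduce in code. A minimal sketch in Python using matplotlib (assumed to be installed), plotting the grouped marks above at their class midpoints:

```python
# A minimal sketch: frequency histogram and frequency polygon of the grouped marks
# above, plotted at the class midpoints (matplotlib assumed to be installed).
import matplotlib.pyplot as plt

midpoints = [4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
freq      = [0, 1, 3, 7, 10, 12, 9, 6, 2, 0]

plt.bar(midpoints, freq, width=10, edgecolor="black", label="frequency histogram")
plt.plot(midpoints, freq, marker="o", color="red", label="frequency polygon")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```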


Which format is best?

A number of differently-shaped frequency distributions are common:

[Figure: common shapes of frequency distributions]

In skewed (i.e. non-symmetrical) frequency distributions, the "tail" of the frequency distribution points in the direction of the skew, e.g.

[Figures: positively and negatively skewed frequency distributions]

In numerical terms, the shape of a frequency distribution is measured by two parameters:

Skew

skew = Σ(x - µ)^3 / (n s^3)

where µ is the mean and s is the standard deviation. A negative value indicates a negative skew and a positive value a positive skew. In Microsoft Excel, skew can be calculated using the SKEW function: SKEW(array).

Kurtosis

kurtosis = Σ(x - µ)^4 / (n s^4) - 3

where µ is the mean and s is the standard deviation. Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution (value = 0). Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution. In Microsoft Excel, kurtosis can be calculated using the KURT function: KURT(array).
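
Outside Excel, the same statistics can be obtained with scipy. A minimal sketch in Python (made-up data; with bias=False the values should closely match Excel's SKEW and KURT):

```python
# A minimal sketch: skew and kurtosis with scipy (made-up data). With bias=False the
# values should closely match Excel's SKEW and KURT functions.
from scipy import stats

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]

print(stats.skew(data, bias=False))       # > 0 here: the tail points to the right
print(stats.kurtosis(data, bias=False))   # excess kurtosis; a normal distribution = 0
```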

 

The Normal (Gaussian) Distribution:

When many independent random factors act in an additive manner to create variability, the dataset follows a bell-shaped distribution called the normal (or Gaussian, after Carl Friedrich Gauss, 1777-1855) distribution:

[Figure: the normal (Gaussian) distribution]

The normal distribution has some special mathematical properties which form the basis of many statistical tests.

Although no real datasets follow the normal distribution exactly, many kinds of data follow a distribution that is approximately Gaussian.

A normal distribution can be defined by 2 parameters, the mean and the standard deviation.
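
As an illustration, a short Python sketch (simulated values, not real data) drawing random numbers from a normal distribution with a chosen mean and standard deviation, and checking that roughly 95% fall within two standard deviations of the mean:

```python
# A minimal sketch: simulate a normal distribution defined only by its mean and
# standard deviation, and check that roughly 95% of values lie within 2 SD of the mean.
import random

mean, sd = 170.0, 10.0                    # hypothetical values, e.g. heights in cm
sample = [random.gauss(mean, sd) for _ in range(10000)]

within = sum(mean - 2 * sd <= x <= mean + 2 * sd for x in sample) / len(sample)
print(f"Proportion within 2 SD of the mean: {within:.3f}")   # approximately 0.95
```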

 

N.B: Not all datasets follow a normal distribution.

 

Parametric & Nonparametric Methods:

Statistical methods which depend on the parameters of populations or probability distributions are referred to as parametric methods. Parametric tests include the t tests, analysis of variance (ANOVA), Pearson correlation and linear regression (the first column of the table below).

These tests are only meaningful for continuous data which is sampled from a population with an underlying normal distribution or whose distribution can be rendered normal by mathematical transformation.

 

Nonparametric methods require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations:

  1. Methods used with qualitative data.
        or:
  2. Methods used with quantitative data when no assumption can be made about the population probability distribution.

Nonparametric methods are useful in situations where the assumptions required by parametric methods appear questionable. A few of the more commonly used nonparametric methods include the Mann-Whitney, Wilcoxon, Kruskal-Wallis and Friedman tests and Spearman correlation (the second column of the table below).

These tests are characterised as distribution free - i.e. neither the values obtained nor the population from which the sample was drawn need have a normal distribution. Unlike parametric tests, which can give erroneous results when their assumptions are not met, these tests can be used safely whatever the distribution of the data.

Unfortunately, they are less flexible in practice and less powerful than parametric tests. In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend using parametric methods because they tend to provide better precision.
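
As an illustration, a minimal Python sketch (hypothetical data, scipy assumed to be installed) running a parametric test and its nonparametric counterpart on the same two unpaired groups:

```python
# A minimal sketch: a parametric test (unpaired t test) and its nonparametric
# counterpart (Mann-Whitney) applied to the same two made-up groups (scipy assumed).
from scipy import stats

group_a = [5.1, 5.3, 4.8, 5.6, 5.0, 5.2, 4.9]
group_b = [5.8, 6.0, 5.7, 6.2, 5.9, 6.1, 5.6]

t_stat, t_p = stats.ttest_ind(group_a, group_b)      # assumes normally distributed data
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)   # distribution free
print(f"Unpaired t test:   p = {t_p:.4f}")
print(f"Mann-Whitney test: p = {u_p:.4f}")
```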

 

Accuracy and Precision:

Accuracy - a measure of how close the average of a large set of measurements is to the true or target value.
Precision - a measure of the closeness of repeated observations to each other, without reference to the true or target value, i.e. the reproducibility of the result.

Consider the results of the Aardvark's Archery Club:

[Figure: accuracy and precision illustrated by the Aardvark's Archery Club targets]
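
In numerical terms, accuracy can be summarised by the difference between the mean of repeated measurements and the true value, and precision by their standard deviation. A minimal Python sketch with hypothetical measurements:

```python
# A minimal sketch: accuracy as the difference between the mean measurement and the
# true value, precision as the standard deviation of repeated measurements (made-up data).
from statistics import mean, stdev

true_value = 50.0
measurements = [52.1, 51.8, 52.3, 51.9, 52.0]        # tightly grouped but offset

bias = mean(measurements) - true_value               # accuracy (systematic error)
spread = stdev(measurements)                         # precision (reproducibility)

print(f"Bias (accuracy): {bias:+.2f}")
print(f"SD (precision):  {spread:.2f}")
```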

 

Choosing an Appropriate Statistical Test

In order to choose the appropriate statistical test, you must answer two questions: what is the goal of your analysis, and what type of data have you collected?

Goal / Dataset: | Measurement (from a normal distribution) | Rank, score, or measurement (from a non-normal distribution) | Binomial (e.g. heads or tails) | Survival time
Describe one group: | Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve
Compare one group to a hypothetical value: | One-sample t test | Wilcoxon test | Chi-square or binomial test | -
Compare two unpaired groups: | Unpaired t test | Mann-Whitney test | Fisher's exact test (or chi-square for large samples) | Log-rank test or Mantel-Haenszel
Compare two paired groups: | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression
Compare three or more unmatched groups: | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression
Compare three or more matched groups: | Repeated-measures ANOVA | Friedman test | Cochran's Q test | Conditional proportional hazards regression
Quantify association between two variables: | Pearson correlation | Spearman correlation | Contingency coefficients | -
Predict value from another measured variable: | Simple regression | Nonparametric regression | Simple logistic regression | Cox proportional hazard regression
Predict value from several measured or binomial variables: | Multiple regression | - | Multiple logistic regression | Cox proportional hazard regression
On this course, we will consider only the tests in the shaded cells.
For information on other statistical tests, consult the table above.
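
As a single worked illustration of using the table, a minimal Python sketch (hypothetical counts, scipy assumed to be installed) comparing two unpaired groups of binomial data with Fisher's exact test:

```python
# A minimal sketch: comparing two unpaired groups of binomial (yes/no) data with
# Fisher's exact test, one cell from the table above (made-up counts, scipy assumed).
from scipy import stats

#               outcome A   outcome B
contingency = [[12,         8],          # group 1
               [20,         4]]          # group 2

odds_ratio, p_value = stats.fisher_exact(contingency)
print(f"Fisher's exact test: p = {p_value:.4f}")
```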

 

So how do you choose an appropriate statistical test?

 

We will investigate this further in the next session, Descriptive Statistics.


© MicrobiologyBytes 2009.