MicrobiologyBytes: Maths & Computers for Biologists: Inferential Statistics (Updated: February 6, 2009)

Inferential Statistics
- Comparing Groups II

Further information on this topic can be found in Chapter 10 of:

Maths from Scratch for Biologists

Numerical ability is an essential skill for everyone studying the biological sciences but many students are frightened by the 'perceived' difficulty of mathematics, and are nervous about applying mathematical skills in their chosen field of study. Maths from Scratch for Biologists is a highly instructive, informal text that explains step by step how and why you need to tackle maths within the biological sciences. (Amazon.co.UK)

χ²-test (chi-squared test)

This is an example of a non-parametric test. Unlike Student's t-test, it makes no assumptions about the distribution of the data.

χ² (pronounced "kye-squared") is used when the data consist of nominal or ordinal variables rather than quantitative variables, i.e. when we are interested in how many members fall into given descriptive categories (not in quantitative measurements such as weight).

The χ² test of independence asks "Are two variables of interest independent (not related) or related (dependent)?" and deals with integers: the number of observations which fall into different, mutually exclusive categories. The test investigates whether the proportions of certain categories are different in different groups.

When the variables are independent, knowledge of one variable gives no information about the other variable.
When they are dependent, knowledge of one variable is predictive of the value of the other variable.

  • Is level of education related to level of income?
  • Is affiliation to a political party related to a person's preferred television network?
  • Is there a relationship between gender and examination performance?

The χ² test is by default one-tailed and can only be carried out on raw data (not percentages, proportions or other derived data):

[Figure: chi-squared distribution]

As with Student's t frequency distribution, you don't need to know the formula for the χ² probability density function; simply look up the value of χ² in a statistical table.

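The basis of the χ² test is:

χ² = Σ (O - E)² / E

where O is the observed frequency and E is the expected frequency in each category, summed over all categories.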

and:

H0: observed frequency - expected frequency = 0
(there is no difference between the two groups)

HA: observed frequency - expected frequency does not equal 0
(there is a difference between the two groups)
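The χ² statistic is conventionally the sum of (O - E)² / E over all categories. A minimal Python sketch of that sum follows; the function name and the counts in the usage line are hypothetical and chosen only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: 18 and 22 observed where 20 and 20 were expected.
example_chi2 = chi_square_statistic([18, 22], [20, 20])
print(round(example_chi2, 3))
```

The larger the departures of the observed counts from the expected ones, the larger the statistic, which is why only the upper tail of the distribution is of interest.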

 

The χ² test has two main uses:

  1. Comparing the distribution of one category variable (nominal or ordinal) with another.
  2. Comparing an observed distribution with a theoretically expected one.
    The expectation might be that the data would be normally distributed, or that particular attributes (e.g. treatment and disease) are independent, i.e. no closer association than might be expected by chance. In the first case, a table of values for a normal distribution would be the source of the expected values. In the second, the expected values would be calculated assuming independence (random distribution).

 

Assumptions:

The χ² test is a non-parametric test which assumes that the data analyzed:

  1. Consist of nominal or ordinal variables.
  2. Consist of entire populations or be randomly sampled from the population.
  3. No single data point should be zero (if so, use Fisher's exact test - below).
  4. All the objects counted should be independent of one another.
  5. 80% of the expected frequencies should be 5 or more (if not, try aggregating groups or use Fisher's exact test for small sample sizes).

If you use the χ² test under other circumstances, the results will be meaningless!
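The two numeric rules of thumb in the list above (no zero counts, and at least 80% of expected frequencies being 5 or more) are easy to check before running the test. The sketch below is one hedged way to do so; the function name and the first set of counts are hypothetical.

```python
def check_chi_square_assumptions(observed, expected):
    """Return warnings for the zero-count and 80%-of-expected rules of thumb."""
    warnings = []
    if any(o == 0 for o in observed):
        warnings.append("an observed count is zero")
    share_ok = sum(1 for e in expected if e >= 5) / len(expected)
    if share_ok < 0.8:
        warnings.append("fewer than 80% of expected frequencies are 5 or more")
    return warnings


# Hypothetical counts that break both rules.
print(check_chi_square_assumptions([12, 0, 9, 3], [10.0, 4.0, 6.0, 4.0]))
```

An empty list means neither rule is violated; the remaining assumptions (type of variable, sampling, independence) concern study design and cannot be checked from the counts alone.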

IMPORTANT:

  • Note that acceptance or rejection of the null hypothesis can only be interpreted strictly in terms of the question asked, e.g. "There is a difference between the groups" or "There is no difference between the groups" and NOT EXTRAPOLATED to "There is a difference between the groups because...".

 

A. Comparing the distribution of one category variable with another

Example:
Of 120 male and 100 female applicants to university, 90 male and 40 female had work experience.
Does the gender of an applicant to university correspond to whether or not they have prior work experience?

 

 

                        Work experience:
                        Yes     No      Total
Gender of applicant:
  Male                  90      30      120
  Female                40      60      100
  Total                 130     90      220

                        Work experience:
                        Yes     No      Total
Gender of applicant:
  Male                  a       b       a+b
  Female                c       d       c+d
  Total                 a+c     b+d     n

 

χ² = n(ad - bc)² / [(a+b)(c+d)(a+c)(b+d)]

df = (number of columns-1) * (number of rows-1)

For the above test, df = (2-1) * (2-1) = 1
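For a 2 x 2 table with cells a, b, c, d and total n, the standard shortcut is χ² = n(ad - bc)² / [(a+b)(c+d)(a+c)(b+d)]. The sketch below applies it to the work-experience data; it assumes no continuity (Yates) correction, which may or may not match the equation originally shown on this page, and the function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Shortcut chi-squared for the 2x2 table [[a, b], [c, d]], uncorrected."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def degrees_of_freedom(rows, columns):
    """df = (number of columns - 1) * (number of rows - 1)."""
    return (columns - 1) * (rows - 1)


chi2 = chi_square_2x2(90, 30, 40, 60)   # male yes/no, female yes/no
df = degrees_of_freedom(2, 2)
print(f"chi-squared = {chi2:.2f}, df = {df}")
```

For these counts the statistic comes to about 27.6 with one degree of freedom.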

Critical Values for the Chi-Squared Distribution

        α
df      0.995   0.99    0.975   0.95    0.9     0.1     0.05    0.025   0.01    0.005
1       0.000   0.000   0.001   0.004   0.016   2.706   3.841   5.024   6.635   7.879
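With one degree of freedom the upper-tail probability of χ² has a closed form, erfc(sqrt(χ²/2)), so the critical values in the df = 1 row can be cross-checked without a table. This is a minimal sketch using only the Python standard library; the function name is hypothetical.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Each critical value should map back to (approximately) its column heading.
for critical in (2.706, 3.841, 6.635):
    print(critical, round(p_value_df1(critical), 3))
```

The printed probabilities should be close to 0.1, 0.05 and 0.01, the α values heading those columns.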

 

 

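ALTERNATIVE METHOD:   χ² calculation using observed and expected values:

Each expected value is (row total x column total) / n, e.g. for males with work experience: 120 x 130 / 220 = 71.

            Observed:       Expected:       O - E:          (O - E)² / E
            Yes     No      Yes     No      Yes     No      Yes     No
Male:       90      30      71      49      19      -19     5.1     7.4
Female:     40      60      59      41      -19     19      6.1     8.8
Total:      130     90      130     90      0       0       11.2    16.2

χ² = 11.2 + 16.2 = 27.4. This is far greater than the critical value of 3.841 for df = 1 at α = 0.05, so the null hypothesis is rejected.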
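The observed-and-expected method can be sketched for a table of any size: expected counts come from the row and column totals, and the statistic is summed over every cell. The function names below are hypothetical, and the code keeps the expected counts unrounded, so its result differs slightly from a hand calculation that rounds them to whole numbers.

```python
def expected_counts(table):
    """Expected counts under independence: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_from_table(table):
    """Sum (O - E)^2 / E over every cell of an r x c table of counts."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


observed = [[90, 30], [40, 60]]  # rows: male, female; columns: yes, no
print([[round(e, 1) for e in row] for row in expected_counts(observed)])
print(round(chi_square_from_table(observed), 2))
```

With unrounded expected counts the statistic is about 27.6 rather than 27.4; the conclusion is unchanged.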

 

The advantage of this method is that it can be applied to problems where there are more than two groups.

 

B. Comparing an observed distribution with a theoretically expected one

Using the method of observed and expected values we can also use the χ² test to compare an observed distribution with a theoretically expected one.

Example:

Colour:     Observed:   Expected from genetic theory:
White:      380         51%
Brown:      330         40.8%
Black:      74          8.2%

Colour:     Observed:   Theoretical proportion:   Expected:           O - E   (O - E)² / E
White:      380         0.510                     400 (0.510 x 784)   -20     1.0
Brown:      330         0.408                     320 (0.408 x 784)   10      0.3125
Black:      74          0.082                     64 (0.082 x 784)    10      1.5625
Total:      784         1.0                       784                 0       2.8750
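The same calculation can be sketched in Python. The function name and the `round_expected` flag are hypothetical; the flag mirrors the worked table, which rounds each expected count to a whole number before computing (O - E)² / E.

```python
def goodness_of_fit(observed, proportions, round_expected=False):
    """Chi-squared for observed counts against theoretical proportions."""
    n = sum(observed)
    expected = [p * n for p in proportions]
    if round_expected:
        # Mirror the worked table, which uses whole-number expected counts.
        expected = [round(e) for e in expected]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


colours = [380, 330, 74]          # white, brown, black
theory = [0.510, 0.408, 0.082]    # theoretical proportions from the table
print(goodness_of_fit(colours, theory, round_expected=True))
print(round(goodness_of_fit(colours, theory), 3))
```

With rounding the function reproduces the table's total of 2.875; without it the statistic is a little smaller (about 2.77), which does not change the outcome of the test.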

 

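Critical Values for the Chi-Squared Distribution

        α
df      0.995   0.99    0.975   0.95    0.9     0.1     0.05    0.025   0.01    0.005
2       0.010   0.020   0.051   0.103   0.211   4.605   5.991   7.378   9.210   10.597

For this test there are three colour categories, so df = 3 - 1 = 2. The calculated χ² of 2.875 is below the critical value of 5.991 (α = 0.05), so the null hypothesis is not rejected.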
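The decision rule behind a critical-value table is a single comparison, and for two degrees of freedom the upper-tail probability is simply exp(-χ²/2). The sketch below, with hypothetical function names, applies both to the colour example (χ² = 2.875, critical value 5.991).

```python
import math


def p_value_df2(chi2):
    """Upper-tail probability of a chi-squared value with 2 degrees of freedom."""
    return math.exp(-chi2 / 2)


def reject_null(chi2, critical_value):
    """One-tailed rule: reject H0 only when chi-squared exceeds the critical value."""
    return chi2 > critical_value


chi2 = 2.875       # total from the colour table
critical = 5.991   # df = 2, alpha = 0.05
print(reject_null(chi2, critical), round(p_value_df2(chi2), 3))
```

Here the rule returns False and the P-value is well above 0.05, so the observed colour counts are consistent with the theoretical proportions.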

 

The χ² test is not found in the Microsoft Excel Analysis ToolPak.

MSExcel can still be useful in performing χ² tests as it saves much work in performing the calculations. However, it is necessary to construct the contingency table yourself. Two worksheet functions are relevant:

  • CHITEST(actual_range,expected_range)  Returns the χ² P-value for the test, i.e. the probability of obtaining a χ² value at least as large as the one calculated if the null hypothesis is true.
  • CHIINV(probability,degrees_freedom)   Returns the critical value of χ² for the probability (α-value) and degrees of freedom specified, e.g. CHIINV(0.05,10) = 18.30703. Compare the calculated χ² value with this critical value to decide whether the null hypothesis should be rejected.
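Outside a spreadsheet, the same two quantities can be approximated with the Python standard library. This is a sketch under stated assumptions rather than a reimplementation of Excel: `chi2_sf` and `chi2_critical` are hypothetical names, the tail probability uses the recurrence for the regularized upper incomplete gamma function (valid for whole-number df), and the critical value is found by bisection.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for chi-squared with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    while a < df / 2:
        # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1)
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


def chi2_critical(alpha, df, tol=1e-9):
    """Find x with chi2_sf(x, df) = alpha by bisection (the role CHIINV plays)."""
    lo, hi = 0.0, 1.0
    while chi2_sf(hi, df) > alpha:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chi2_sf(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(chi2_critical(0.05, 10), 3))
print(round(chi2_sf(2.875, 2), 3))
```

The first line should agree with the CHIINV(0.05,10) example to about three decimal places; the second gives the P-value that corresponds to a calculated χ² once the degrees of freedom are known.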

 


Fisher's Exact Test

 

N.B: Limitations of the χ² test:

The χ² test is a non-parametric test which assumes that the data analyzed:

  1. Consist of nominal or ordinal variables.
  2. Consist of entire populations or be randomly sampled from the population.
  3. No single data point should be zero (if so, use Fisher's exact test).
  4. All the objects counted should be independent of one another.
  5. 80% of the expected frequencies should be five or more (if not, try aggregating groups or use Fisher's exact test for small sample sizes).

Sir Ronald Aylmer Fisher (1890-1962) "the father of modern statistics"

Sir Ron developed the concept of likelihood:

The likelihood of a parameter is proportional to the probability of the data and it gives a function which usually has a single maximum value, called the maximum likelihood.

He also contributed to the development of methods suitable for small samples and studied hypothesis testing.

Fisher's exact test is an alternative to χ² for testing the hypothesis that there is a statistically significant difference between two groups. It has the advantage that it does not make any approximations (hence the name "exact" test), and so is suitable for small sample sizes.

Assumptions of Fisher's exact test:

Fisher's exact test is a non-parametric test which assumes that the data analyzed:

  1. Consist of nominal or ordinal variables.
  2. Consist of entire populations or be randomly sampled from the population, as in all significance tests.
  3. Independent observations: It is assumed that the value of the first unit sampled has no effect on the value of the second unit. Pooling data from before-after tests or matched samples would violate this assumption.
  4. Mutual exclusivity: A given case may fall in only one class.

The formula for calculating p values from Fisher's exact test is complicated. As long as the test criteria are appropriate, you can perform Fisher's test using one of the many online calculators or other statistics software (there is no built-in function for Fisher's test in MSExcel).
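For a 2 x 2 table the calculation can nevertheless be sketched in a few lines. With the row and column totals held fixed, the probability of each possible table follows the hypergeometric distribution, and one common two-sided convention sums the probabilities of all tables that are no more likely than the observed one. The function name and the counts below are hypothetical, and other software may use a different two-sided convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def table_probability(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_observed = table_probability(a)
    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    # Sum every possible table that is no more probable than the observed one.
    return sum(
        p
        for p in (table_probability(k) for k in range(k_min, k_max + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical small table: 3 and 1 in the first group, 1 and 3 in the second.
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))
```

For this small table the P-value is 34/70, about 0.49, so there is no evidence of a difference between the groups.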


© MicrobiologyBytes 2009.