MicrobiologyBytes: Maths & Computers for Biologists: Probability. Updated: January 28, 2007

Probability

Further information on this topic can be found in Chapter 9 of:

Maths from Scratch for Biologists

Numerical ability is an essential skill for everyone studying the biological sciences but many students are frightened by the 'perceived' difficulty of mathematics, and are nervous about applying mathematical skills in their chosen field of study. Maths from Scratch for Biologists is a highly instructive, informal text that explains step by step how and why you need to tackle maths within the biological sciences. (Amazon.co.UK)

OK, let's get the important stuff out of the way first:

What are your chances of winning the National Lottery jackpot?

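Prize: Jackpot (match all 6 of the 6 numbers drawn from 49)
Probability: P = (6/49) x (5/48) x (4/47) x (3/46) x (2/45) x (1/44)
Odds: 1 in 13,983,816

Prize: Match 5 numbers
Probability: P = C(6,5) x C(42,1) / C(49,6) = 252 / 13,983,816
Odds: 1 in 55,492

Prize: Match 4 numbers
Probability: P = C(6,4) x C(43,2) / C(49,6) = 13,545 / 13,983,816
Odds: 1 in 1,033

Prize: Match 3 numbers (the smallest prize)
Probability: P = C(6,3) x C(43,3) / C(49,6) = 246,820 / 13,983,816
Odds: 1 in 57

Here C(n,r) = n! / (r! x (n-r)!) is the number of ways of choosing r items from n. The match 5 figure excludes tickets that also match the bonus ball (a separate prize), which is why the sixth number is chosen from 42 rather than 43.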

With 2 draws each week, if you buy a ticket for each draw, you will, on average, win the jackpot about once every 134 thousand years (13,983,816 / 104 draws per year).
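The jackpot arithmetic can be checked with a few lines of Python. This is only a sketch: the function names are invented for illustration, and it assumes exactly 52 weeks per year and one ticket per draw.

```python
from math import comb

TOTAL_NUMBERS = 49   # numbers in the draw
PICKED = 6           # numbers chosen on one ticket

def jackpot_combinations():
    """Number of distinct 6-number tickets that can be made from 49 numbers."""
    return comb(TOTAL_NUMBERS, PICKED)

def years_between_jackpots(draws_per_week=2, weeks_per_year=52):
    """Average wait in years when buying one ticket for every draw."""
    return jackpot_combinations() / (draws_per_week * weeks_per_year)

print(jackpot_combinations())            # 13983816
print(round(years_between_jackpots()))   # 134460
```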

 

So, with that out of the way, why do we have to know about probabilities?  Because:

 

Statistical methods depend upon probability theory.

 

Probability, P = number of observations of the outcome of interest / total number of observations

or, to put it another way:

P = number of specific outcomes / total number of possible outcomes

 

The simplest way to understand probabilities is through proportional frequency:

Example:

In a group of mice there are 200 white mice and 50 brown mice, 250 in total. For a mouse picked at random:

P(white) = 200/250 = 4/5 = 0.8
P(brown) = 50/250 = 1/5 = 0.2

 

 

Replacing versus not replacing selections:

If we replace the first selection and make a second selection, then the probability of making a given selection is unaltered. Thus, in the above example the probability of picking a brown mouse is still 50/250 = 1/5 = 0.2

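If we do not replace our first selection, the probability when making the second selection will change:

Example:

If the first mouse picked is brown and is not replaced, 49 brown mice remain among 249, so the probability that the second mouse is brown falls to 49/249 = 0.197 (approximately).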

Studying repeated samples (selections) from natural populations is easier if we assume replacement occurs.
We can usually assume this if the population is large.

When the result of the first sample does not affect the probability of the result of subsequent samples, the samples are said to be independent (a requirement of many statistical tests).
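As a rough illustration of replacement versus no replacement, the sketch below works out the probability that the second mouse picked is brown, using the 50 brown and 200 white mice from the example. The function name and its arguments are made up for this illustration.

```python
from fractions import Fraction

BROWN, WHITE = 50, 200

def p_second_brown(first_was_brown, replace):
    """Probability that the second mouse picked is brown."""
    brown, total = BROWN, BROWN + WHITE
    if not replace:
        total -= 1              # one mouse has left the population
        if first_was_brown:
            brown -= 1          # and it was a brown one
    return Fraction(brown, total)

print(p_second_brown(True, replace=True))     # 1/5
print(p_second_brown(True, replace=False))    # 49/249
print(p_second_brown(False, replace=False))   # 50/249
```

With 250 mice the change is small, which is why replacement can usually be assumed when the population is large.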

 

 

Calculating the Probability of Multiple Events

The number of possible orderings (permutations) of a set of events is given by the factorial of the number of events (written as "n!"), i.e. the product of an integer and all the lower positive integers, e.g:

For 3 events (A, B, C), number of possible orderings = 3!   = 3 * 2 * 1   = 6

1: ABC
2: ACB
3: BAC
4: BCA
5: CAB
6: CBA

Note that these are different orderings, e.g. putting a meal on a plate then dropping the plate is not the same as dropping a plate then putting a meal on it!
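The six orderings of A, B and C can be generated rather than written out by hand. A minimal sketch using Python's standard library (the variable names are just for illustration):

```python
from itertools import permutations
from math import factorial

events = ["A", "B", "C"]

# Every possible order in which the three events can occur
orderings = ["".join(p) for p in permutations(events)]

print(orderings)                # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
print(len(orderings))           # 6
print(factorial(len(events)))   # 6, i.e. 3!
```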

 

Example:   A population of 50 brown mice, 200 white mice, selections with replacement:

a) Probability of 3 brown mice in 3 selections = (50/250) * (50/250) * (50/250)

= (1/5) * (1/5) * (1/5) = 0.008

b) Probability of selecting, in order, brown, brown and then white = (50/250) * (50/250) * (200/250)

= (1/5) * (1/5) * (4/5) = 0.032

c) If, however, we are not interested in the order (i.e. brown, brown, white) but just the overall outcome (i.e. 2 brown, 1 white), the probability is different:

Possible outcomes of 3 selections with replacement:

Selection outcome (1, 2, 3) | Probability of selection (1, 2, 3) | Product | Probability of outcome
B, B, B | 1/5, 1/5, 1/5 | (1/5) * (1/5) * (1/5) | 0.008
B, W, B | 1/5, 4/5, 1/5 | (1/5) * (4/5) * (1/5) | 0.032
B, B, W | 1/5, 1/5, 4/5 | (1/5) * (1/5) * (4/5) | 0.032
B, W, W | 1/5, 4/5, 4/5 | (1/5) * (4/5) * (4/5) | 0.128
W, B, B | 4/5, 1/5, 1/5 | (4/5) * (1/5) * (1/5) | 0.032
W, W, B | 4/5, 4/5, 1/5 | (4/5) * (4/5) * (1/5) | 0.128
W, B, W | 4/5, 1/5, 4/5 | (4/5) * (1/5) * (4/5) | 0.128
W, W, W | 4/5, 4/5, 4/5 | (4/5) * (4/5) * (4/5) | 0.512
TOTAL: 1.0

 

Thus, the sum of probabilities of a set of mutually exclusive, exhaustive outcomes is 1, but the probability of 2 brown mice and 1 white mouse, irrespective of the order of selection is:

Selection outcome (1, 2, 3) | Probability of selection (1, 2, 3) | Product | Probability of outcome
B, W, B | 1/5, 4/5, 1/5 | (1/5) * (4/5) * (1/5) | 0.032
B, B, W | 1/5, 1/5, 4/5 | (1/5) * (1/5) * (4/5) | 0.032
W, B, B | 4/5, 1/5, 1/5 | (4/5) * (1/5) * (1/5) | 0.032
TOTAL: 0.096

 

Note the difference between an ordered selection (probability = 0.032) and a selection irrespective of order (probability = 0.096), which is the sum of the probabilities of all the possible ordered selections.
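The same figures can be reproduced by listing every ordered outcome and adding up the ones of interest. This is a sketch only: the function names are invented, and it assumes selection with replacement so that each pick has the same probabilities.

```python
from itertools import product

P_COLOUR = {"B": 0.2, "W": 0.8}   # 50 brown and 200 white mice out of 250

def ordered_probability(sequence):
    """Multiply the probabilities of the individual, independent selections."""
    result = 1.0
    for colour in sequence:
        result *= P_COLOUR[colour]
    return result

def unordered_probability(n_brown, n_selections=3):
    """Add up every ordered outcome containing exactly n_brown brown mice."""
    return sum(
        ordered_probability(seq)
        for seq in product("BW", repeat=n_selections)
        if seq.count("B") == n_brown
    )

print(round(ordered_probability("BBW"), 3))    # 0.032
print(round(unordered_probability(2), 3))      # 0.096
print(round(sum(unordered_probability(k) for k in range(4)), 3))   # 1.0
```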

 

These examples illustrate the two rules of probability:

  1. The SUM  or  OR rule:
    The probability of any one of several distinct events is the sum of their individual probabilities, provided that the events are mutually exclusive (i.e. occurrence of one event precludes the others, e.g. selection without replacement).

  2. The PRODUCT  or  AND rule:
    The probability of several distinct events occurring successively or jointly is the product of their individual probabilities, provided that the events are independent (i.e. the outcome of one event must have no influence on the others, e.g. tossing a coin).

 

 

The Binomial Distribution

The binomial probability distribution describes what will happen when there are two possible outcomes of an event, e.g. a selected mouse is either brown or white.

Such binary variables turn out to occur quite frequently in biology.

In its simplest form, the binomial expansion summarizes the possible outcomes for any number of samples when there are only two possible outcomes (e.g. brown and white mice).

For independent events, the binomial distribution is given by:

$(P + Q)^n$

where:
P is the probability of one of the possible events
Q is the probability of the second event ( = 1 - P )
n is the number of trials in the series

For samples of 1 (n=1): $(P + Q)^1 = P + Q$

For samples of 2 (n=2): $(P + Q)^2 = P^2 + 2PQ + Q^2$

For samples of 3 (n=3): $(P + Q)^3 = P^3 + 3P^2Q + 3PQ^2 + Q^3$

etc.
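The numbers in front of each term (1, 2, 1 and 1, 3, 3, 1) can be generated for any n. A small sketch, with a hypothetical helper name, using the standard library's math.comb:

```python
from math import comb

def expansion_coefficients(n):
    """Coefficients of P^n, P^(n-1)Q, ..., Q^n in the expansion of (P + Q)^n."""
    return [comb(n, k) for k in range(n, -1, -1)]

for n in range(1, 5):
    print(n, expansion_coefficients(n))
# 1 [1, 1]
# 2 [1, 2, 1]
# 3 [1, 3, 3, 1]
# 4 [1, 4, 6, 4, 1]
```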

 

Back to the mice! These expansions of the binomial equation describe all the possible outcomes from the experiment above:

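If P = probability of a brown mouse and Q = probability of a white mouse, for 3 samples from the population (n = 3):

$P^3$ = 3 brown mice
$3P^2Q$ = 2 brown mice and 1 white mouse
$3PQ^2$ = 1 brown mouse and 2 white mice
$Q^3$ = 3 white mice

These are all the possible outcomes.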

 

In the population from which the samples were drawn, P = 50/250 = 0.2 and Q = 200/250 = 0.8, so we can calculate the expected distribution of outcomes from the binomial equation and compare the observed and expected distributions using the χ² test.

In this example we can calculate the probability of 2 brown mice and 1 white mouse being selected as:

$3P^2Q = 3(0.2)^2(0.8) = 0.096$
(note that this is the same as in the table above)

This is OK when there are a small number of samples and a small number of outcomes but gets progressively more difficult as the sample size increases. For example, try calculating how many different ways there are to select 7 brown mice and 6 white mice in 13 selections!

To perform such calculations, we can use the following equation:

Number of outcomes = $\frac{n!}{r!\,(n-r)!}$

where:
n = number of selections
r = number of one of the outcomes
(remember "!" = factorial)

Let's check this works:

For 2 brown mice and 1 white mouse:

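$\frac{3!}{2!\,(3-2)!} = \frac{6}{2 \times 1} = 3$ (i.e. BBW, BWB, WBB).

So for 7 brown mice and 6 white mice:

$\frac{13!}{7!\,(13-7)!} = \frac{6{,}227{,}020{,}800}{5{,}040 \times 720}$

$= 1{,}716$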
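Factorials this large are tedious by hand but trivial for a computer. The sketch below assumes the equation in question is the usual n! / (r! * (n-r)!), with n and r as defined above; the function name is invented for illustration.

```python
from math import comb, factorial

def number_of_outcomes(n, r):
    """Ways of getting r of one outcome in n selections: n! / (r! * (n-r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))

print(number_of_outcomes(3, 2))     # 3  (BBW, BWB, WBB)
print(number_of_outcomes(13, 7))    # 1716
print(number_of_outcomes(13, 7) == comb(13, 7))   # True: math.comb does the same job
```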

 

If we know the probability of the outcome for a single selection (e.g. probability of brown or probability of white) we can calculate the total probability for the outcome using:

$P = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$ where:
P is the total probability of the outcome (e.g. 2 brown mice and 1 white mouse)
p is the probability of the event that occurs r times
(1-p) is the probability of the event that occurs n-r times

In our example of two brown mice and 1 white mouse:

$P = \frac{3!}{2!\,1!} \times 0.2^2 \times 0.8^1 = 3 \times 0.04 \times 0.8 = 0.096$

which is as calculated above in the table or using the binomial equation $3P^2Q$.
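Putting the count of outcomes and the single-selection probabilities together gives a general-purpose function. This is a sketch assuming the standard binomial formula, with n, r and p as defined above; the function name is hypothetical.

```python
from math import comb

def binomial_probability(n, r, p):
    """Probability that an event with probability p occurs exactly r times in n trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# 2 brown mice (p = 0.2) and 1 white mouse in 3 selections
print(round(binomial_probability(3, 2, 0.2), 3))     # 0.096

# 7 brown mice and 6 white mice in 13 selections
print(round(binomial_probability(13, 7, 0.2), 5))    # 0.00576
```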

 

 

In practice, we can also look up the probability of an event from a table of binomial probabilities.

 

Suppose that 1% of a population has a characteristic under study, e.g. an inherited defect in the (mythical) stat gene which restricts the ability of carriers to understand statistics.
There are no external signs that we can use to recognise carriers, so we must select individuals from the population at random and test them.
If the sample size used is too small there is a risk of not finding any carriers; if it is too large, scarce testing resources will be wasted.
What sample size is required to give a good likelihood of sampling affected individuals?

The binomial distribution can be used in a case such as this because the variable is binary and the two outcomes are mutually exclusive, i.e. each individual either will or will not carry the defective gene.

If 1% of the population is affected then P = 0.01 (affected) and Q = 0.99 (not affected).

To find the probability of finding some (i.e. 1 or more) carriers, the easiest way is to calculate the probability of no cases (i.e. P(0)) for a given sample size, e.g. 20, and subtract it from 1. We can do this by making use of the binomial equation and setting the number of successes, r, to 0, and the number of trials, n, to 20. This will give us the probability of taking a sample of 20 individuals and finding no affected individuals:

$P(r) = \frac{n!}{r!\,(n-r)!}\, p^r (1-p)^{n-r}$

$P(0) = \frac{20!}{0!\,(20-0)!} \times 0.01^0 \times (1-0.01)^{20-0}$

$= 1 \times 1 \times 0.99^{20}$

$= 0.82$

N.B: 1! = 0! = 1
A number raised to the power 0 is 1 and a number raised to the power 1 is itself, e.g. $20^0 = 1$ and $20^1 = 20$.

Thus, if 1% of the population is affected there is an 82% chance that a sample of 20 individuals will fail to find any carriers. Consequently a sample size of 20 would appear to be too small to give a reasonable chance of finding at least one carrier.
If n = 50, P(0) = $0.99^{50}$ = 0.61, i.e. a 39% chance of finding at least one carrier.
If n = 100, P(0) = $0.99^{100}$ = 0.37, i.e. a 63% chance of finding at least one carrier.

As the percentage of affected individuals drops, the probability of missing carriers in a sample of 20 individuals increases, e.g. if only 0.1% of the population are carriers there is only a 2% chance of finding any in a sample of 20 people, i.e. P(0) = $0.999^{20}$ = 0.98.

This type of calculation can be useful to determine the minimum sample number needed to obtain at least 1 positive result from a sample for any binary variable, e.g. to find at least one affected carrier in a random sample. All that is required is the probability of the event, e.g. if 1 in 1000 of the population carry a particular genetic polymorphism, then P = 0.001.
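A sample-size calculation of this kind might be sketched as follows. The function names are invented, and the 95% target is only an assumed example of "a good likelihood"; pick whatever level suits the study.

```python
def p_at_least_one(p, n):
    """Probability of at least one positive in n independent samples: 1 - P(0)."""
    return 1 - (1 - p) ** n

def minimum_sample_size(p, target):
    """Smallest n for which the chance of at least one positive reaches target."""
    n = 1
    while p_at_least_one(p, n) < target:
        n += 1
    return n

for n in (20, 50, 100):
    print(n, round(p_at_least_one(0.01, n), 2))
# 20 0.18
# 50 0.39
# 100 0.63

print(minimum_sample_size(0.01, 0.95))    # 299
```

The simple loop is slow for very rare events but avoids rounding surprises; for p = 0.001 and the same 95% target it gives 2995.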

 

Microsoft Excel has built-in binomial probability functions:

Use the online Help to find out more about these!

BINOMDIST(number_s,trials,probability_s,cumulative)

where:
number_s = the number of successes in trials
trials = the number of independent trials
probability_s = the probability of success on each trial
cumulative = is a logical value that determines the form of the function. If cumulative is TRUE, then BINOMDIST returns the cumulative distribution function, which is the probability that there are at most number_s successes; if FALSE, it returns the probability mass function, which is the probability that there are number_s successes.

BINOMDIST is used in problems with a fixed number of tests or trials, when the outcomes of any trial are only success or failure, when trials are independent, and when the probability of success is constant throughout the experiment.

Example:
Flipping a coin can only result in heads or tails. The probability of the first flip being heads is 0.5, and the probability of 6 out of 10 flips being heads is:   BINOMDIST(6,10,0.5,FALSE)   = 0.205078
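For readers without Excel, the behaviour described above can be imitated in Python. This is a sketch of the description, not Excel's own code: the lowercase binomdist function is hypothetical and simply mirrors the argument names.

```python
from math import comb

def binomdist(number_s, trials, probability_s, cumulative):
    """Binomial probability of number_s successes (or at most number_s if cumulative)."""
    def pmf(k):
        return comb(trials, k) * probability_s**k * (1 - probability_s)**(trials - k)

    if cumulative:
        return sum(pmf(k) for k in range(number_s + 1))
    return pmf(number_s)

print(round(binomdist(6, 10, 0.5, False), 6))   # 0.205078
print(round(binomdist(6, 10, 0.5, True), 6))    # 0.828125
```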

Also:

CRITBINOM(trials,probability_s,alpha)

Returns the smallest value for which the cumulative binomial distribution is greater than or equal to a criterion value. Use this function for quality assurance applications, e.g. to determine the greatest number of defective parts that are allowed to come off an assembly line run without rejecting the entire lot.

where:
Trials is the number of trials.
Probability_s is the probability of a success on each trial.
Alpha is the criterion value.
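The search that CRITBINOM is described as performing can be sketched in Python by accumulating binomial probabilities until the criterion is reached. The critbinom function below is a hypothetical stand-in written from that description, and the numbers in the example are purely illustrative.

```python
from math import comb

def critbinom(trials, probability_s, alpha):
    """Smallest k whose cumulative binomial probability is >= alpha."""
    cumulative = 0.0
    for k in range(trials + 1):
        cumulative += comb(trials, k) * probability_s**k * (1 - probability_s)**(trials - k)
        if cumulative >= alpha:
            return k
    return trials   # only reached if rounding keeps the total just below alpha

# 6 trials, probability of success 0.5, criterion value 0.75
print(critbinom(6, 0.5, 0.75))   # 4
```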

Coincidences:

When working with larger numbers than 2 and 3, probability theory has some unexpected results. Many unexpected coincidences are merely the result of large populations.

Why do "coincidences" matter?

Because when you are trying to determine if an event is statistically significant or not, the "expected" answer can be very misleading - events which might seem very unlikely to occur by chance can do precisely that if enough cases are involved.

 

Odds Ratios

Odds ratios are widely used in medical literature because:

The odds are a way of representing probability.
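To make the link between odds and probability concrete, here is a small sketch. It assumes the usual definitions, which are not spelled out above: odds = p / (1 - p), and an odds ratio is one group's odds divided by another's. The function names and the second group's probability are invented for illustration.

```python
from fractions import Fraction

def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)

def probability(odds_value):
    """Convert odds back into a probability."""
    return odds_value / (1 + odds_value)

def odds_ratio(p_group_1, p_group_2):
    """Ratio of the odds of the event in two groups."""
    return odds(p_group_1) / odds(p_group_2)

print(odds(Fraction(1, 5)))           # 1/4, i.e. odds of 1 to 4 for a brown mouse
print(probability(Fraction(1, 4)))    # 1/5
print(odds_ratio(Fraction(1, 5), Fraction(1, 10)))   # 9/4
```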

 

Oh yes, I was going to tell you how to win the National Lottery jackpot:

Buy 14,000,000 tickets.

 

"The best way to get rich from probability theory is to find someone who knows less about it than you do"



© MicrobiologyBytes 2009.