MicrobiologyBytes: Maths & Computers for Biologists: Exploratory Data Analysis With SPSS Updated: February 6, 2009

The Normal Frequency Distribution

When many independent random factors act in an additive manner to create variability, the dataset tends to follow a bell-shaped distribution called the normal (or Gaussian) distribution, after Carl Friedrich Gauss (1777-1855):

Normal distribution
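The effect of many additive factors can be seen in a small simulation. The following is a minimal Python sketch, not part of the SPSS workflow: the function names, the choice of exponentially distributed factors and the sample sizes are all assumptions made for illustration. Each individual factor is strongly skewed, yet the sum of many of them is close to symmetric.

```python
import random
import statistics


def skewness(values):
    """Third standardised moment; 0 for a perfectly symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_additive(n_factors, n_samples, rng):
    """Each observation is the sum of n_factors independent, skewed random factors."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n_factors))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
one_factor = simulate_additive(1, 5000, rng)     # a single skewed factor
many_factors = simulate_additive(50, 5000, rng)  # 50 factors added together

print(f"skewness with 1 factor:   {skewness(one_factor):.2f}")
print(f"skewness with 50 factors: {skewness(many_factors):.2f}")
```

The skewness of the summed data is much closer to zero than that of a single factor, which is the bell shape emerging from additive variability.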

The normal distribution has some special mathematical properties which form the basis of many statistical tests. Although no real datasets follow the normal distribution exactly, many kinds of data follow a distribution that is approximately Gaussian. A normal distribution can be defined by two parameters, the mean and the standard deviation. By definition, normal frequency distributions are continuous and unimodal (not bimodal). Of course, not all datasets follow a normal distribution, e.g.

Frequency distributions
Normal distribution
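Because a normal distribution is fully defined by its mean and standard deviation, everything else about it can be derived from those two numbers. A minimal sketch using Python's standard-library `statistics.NormalDist` follows; the mean of 100 and standard deviation of 15 are arbitrary example values, and `fraction_within` is a helper name invented here.

```python
from statistics import NormalDist

# Hypothetical example values: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)


def fraction_within(d, k):
    """Fraction of the distribution lying within k standard deviations of the mean."""
    return d.cdf(d.mean + k * d.stdev) - d.cdf(d.mean - k * d.stdev)


# For a perfect normal distribution the three measures of centre coincide.
print(dist.mean, dist.median, dist.mode)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {fraction_within(dist, k):.4f}")
```

The fractions are the same whatever mean and standard deviation are chosen: roughly 68%, 95% and 99.7% of values fall within one, two and three standard deviations of the mean.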

How to recognize a normal (& non-normal) distribution:

  1. In a perfect normal frequency distribution, the mean, median and mode are equal. The data is continuous and symmetrically distributed around the central point. This does not mean that there are no outliers, but the data is not bimodal (or multimodal).
  2. In a perfect normal frequency distribution:
  3. Kolmogorov-Smirnov & Shapiro-Wilk tests: statistical tests which determine whether the distribution of a sample is significantly different from a reference distribution (here, the normal distribution).
  4. Normal Probability Plots: P-P and Q-Q plots
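The checks in the list above can be sketched outside SPSS. The Python below is an illustration under stated assumptions, not a reproduction of SPSS output: all function names are invented here, the two samples are simulated, and `ks_distance` reports only the Kolmogorov-Smirnov distance to a fitted normal curve, not a significance level.

```python
import random
import statistics
from statistics import NormalDist


def skewness(values):
    """Third standardised moment; near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def qq_points(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    n = len(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))


def qq_correlation(values):
    """Pearson correlation of the Q-Q points; close to 1 when they lie on a line."""
    xs, ys = zip(*qq_points(values))
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ks_distance(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    fitted = NormalDist(statistics.fmean(values), statistics.stdev(values))
    ordered = sorted(values)
    n = len(ordered)
    return max(
        max((i + 1) / n - fitted.cdf(x), fitted.cdf(x) - i / n)
        for i, x in enumerate(ordered)
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(50, 10) for _ in range(500)]
skewed_sample = [rng.lognormvariate(0, 1) for _ in range(500)]

for name, sample in (("normal", normal_sample), ("skewed", skewed_sample)):
    print(
        f"{name}: mean={statistics.fmean(sample):.2f} "
        f"median={statistics.median(sample):.2f} "
        f"skewness={skewness(sample):.2f} "
        f"Q-Q r={qq_correlation(sample):.3f} "
        f"KS D={ks_distance(sample):.3f}"
    )
```

For the simulated normal sample the mean and median are close, the skewness is near zero and the Q-Q points are almost perfectly linear; the skewed sample fails on every count.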

Fortunately, SPSS contains powerful, easy-to-use tools for assessing frequency distributions and helping you decide which tests to use (see below).

Parametric & Nonparametric Methods:

Statistical methods which depend on the parameters of populations or probability distributions are referred to as parametric methods. Parametric tests include:

These tests are only meaningful for numerical data which is sampled from a population with an underlying normal distribution or whose distribution can be rendered normal by mathematical transformation.

Nonparametric methods require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations:

  1. Methods used with qualitative data.
        or:
  2. Methods used with quantitative data when no assumption can be made about the population probability distribution.

Nonparametric methods are useful in situations where the assumptions required by parametric methods appear questionable. A few of the more commonly used nonparametric methods include:

These tests are characterised as distribution free - i.e. neither the values obtained nor the population from which the sample was drawn need have a normal distribution. Unlike parametric tests, which can give erroneous results when their distributional assumptions are not met, they can be used whatever the shape of the distribution, although they still assume that the observations are independent.

Unfortunately, they are less flexible in practice and, when the assumptions of the parametric test are met, less powerful than parametric tests. In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend using parametric methods because they tend to provide better precision.
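The difference between the two families can be illustrated with one statistic from each. The sketch below is an assumption-laden illustration: Welch's t statistic and the Mann-Whitney U statistic are chosen here as representatives, the data are invented, and no p-values are computed. It shows that a rank-based statistic ignores how extreme an outlier is, whereas a statistic built on means and variances does not.

```python
import statistics


def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def mann_whitney_u(a, b):
    """U statistic for sample a: count of pairs (x, y) with x > y, ties counting 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical measurements for two groups.
control = [4.1, 4.8, 5.0, 5.3, 5.9, 6.2]
treated = [6.5, 6.9, 7.4, 7.8, 8.1, 8.6]

# Same data, but the largest treated value is replaced by an extreme outlier.
# The ordering (ranks) of the observations is unchanged.
treated_outlier = treated[:-1] + [86.0]

print("U without outlier:", mann_whitney_u(treated, control))
print("U with outlier:   ", mann_whitney_u(treated_outlier, control))
print(f"t without outlier: {welch_t(treated, control):.2f}")
print(f"t with outlier:    {welch_t(treated_outlier, control):.2f}")
```

The U statistic is identical in both cases because only the ranks matter, while the t statistic shrinks sharply once the outlier inflates the variance of the treated group.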

 

Exploratory Data Analysis (EDA)

Why perform EDA?

Much of statistics is about detecting patterns - something which the human eye and brain are very good at. EDA shows you the patterns which are hidden when the data is in numerical form.

Never, ever, run any statistical test without performing EDA first!

EDA includes:

The good news is that SPSS makes EDA very easy:

Important: Treatment of Missing Values:
If you run Analyze: Descriptive Statistics: Explore, SPSS will by default "Exclude cases listwise". In practice this means that any case (row) with a missing value in any of the selected variables (columns) is dropped from the analysis of every variable. If no values are missing this makes no difference, but otherwise some of the EDA statistics will be calculated on fewer cases than are actually available for that variable. To avoid this problem, in the Analyze: Descriptive Statistics: Explore dialog, click the Options button and select: Exclude cases pairwise:

Dialog
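The difference between listwise and pairwise exclusion is easy to see on a toy dataset. The following Python sketch mimics the two strategies; it is not SPSS code, and the variable names, values and function names are all hypothetical.

```python
import statistics

# Hypothetical dataset: two variables (columns); None marks a missing value.
data = {
    "height": [1.62, 1.75, None, 1.80, 1.68],
    "weight": [58.0, None, 70.5, 82.0, 61.5],
}


def listwise_means(columns):
    """Drop every case (row) that has a missing value in ANY variable."""
    rows = zip(*columns.values())
    complete = [row for row in rows if None not in row]
    return {
        name: statistics.fmean(list(col))
        for name, col in zip(columns, zip(*complete))
    }


def pairwise_means(columns):
    """For each variable, drop only the cases where THAT variable is missing."""
    return {
        name: statistics.fmean([v for v in col if v is not None])
        for name, col in columns.items()
    }


print("listwise:", listwise_means(data))  # each mean uses the 3 complete cases
print("pairwise:", pairwise_means(data))  # each mean uses 4 available values
```

With listwise exclusion two of the five cases are discarded for both variables, even though each variable is missing only one value; pairwise exclusion keeps every value that is actually present.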

EDA on what?

As a first approximation, it may be useful to perform EDA on an entire dataset to get a quick feeling for what the data looks like. However, if you are going to be comparing groups (subsets of the data) using a parametric test, you need to ensure that each of the groups being compared has a normal frequency distribution, so you need to perform EDA on all the groups separately.
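Summarising each group separately can be sketched as follows. This is an illustrative Python example with simulated data; the group labels, the values and the function names are assumptions, and in SPSS the same split would be made within the Explore procedure itself.

```python
import random
import statistics
from collections import defaultdict


def summarise(values):
    """Basic descriptive statistics for one set of values."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


def summarise_by_group(records):
    """records is a list of (group_label, value) pairs; summarise each group."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {label: summarise(values) for label, values in groups.items()}


rng = random.Random(1)  # fixed seed so the sketch is reproducible
records = (
    [("control", rng.gauss(10.0, 1.0)) for _ in range(200)]
    + [("treated", rng.gauss(16.0, 1.0)) for _ in range(200)]
)

pooled = summarise([value for _, value in records])
by_group = summarise_by_group(records)

print("pooled:", pooled)
for label, stats in by_group.items():
    print(label, stats)
```

Here each group is individually normal, but the pooled data is bimodal with a much larger standard deviation, so looking only at the whole dataset would give a misleading picture of the groups being compared.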

 

Transforming Data

In the results from DNA microarray experiments, many of the hybridized spots have very low fluorescence. However, a few of the spots have very strong fluorescence. This produces a strongly positively skewed distribution:

graph

In this case, the relatively small number of high-intensity samples distorts the results. Since this is not a normal distribution, calculating the mean is misleading and statistical tests based on the normal distribution would give meaningless answers. The usual solution is to divide the fluorescence intensity of each sample by the median fluorescence intensity and plot the result on a logarithmic scale. Division by the median alone only rescales the data; it is the log transformation that compresses the long right tail and gives an approximately normal distribution.
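Why the mean is misleading for this kind of data can be shown with a handful of numbers. The intensities below are invented for illustration and do not come from a real microarray; the sketch also expresses each spot relative to the median, as the paragraph above describes.

```python
import statistics

# Hypothetical spot intensities: mostly low values plus a few very bright spots.
intensities = [110, 120, 125, 130, 140, 150, 160, 180, 200, 4500, 9800]

mean_intensity = statistics.fmean(intensities)
median_intensity = statistics.median(intensities)

# How many spots are dimmer than the "average" spot?
below_mean = sum(1 for value in intensities if value < mean_intensity)

# Express each spot relative to the median intensity.
relative = [value / median_intensity for value in intensities]

print(f"mean   = {mean_intensity:.1f}")
print(f"median = {median_intensity:.1f}")
print(f"{below_mean} of {len(intensities)} spots lie below the mean")
print("relative intensities:", [round(r, 2) for r in relative])
```

The two bright spots pull the mean far above the value of a typical spot, whereas the median stays in the middle of the bulk of the data, which is why it is the more sensible reference point for skewed intensities.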

Different sorts of mathematical transformation work best for different datasets:

While it is worth remembering the above suggestions, transforming a dataset is an empirical exercise - perform several of the most likely transformations and test for a normal distribution by EDA. People get understandably concerned about transforming data. Transforming data to allow you to use parametric statistics is completely legitimate as long as:

Read this excellent article. To transform variables in SPSS, choose Transform: Compute and select the options you want to construct a new variable, e.g. Ln, Lg10, Sqrt, 1/[var], etc:

Skewed dataset

Transform: Compute:

Transformation dialog

becomes:

Transformed dataset
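The empirical approach of trying several transformations and checking each result can be sketched in Python. This is an illustration, not SPSS syntax: the dictionary keys borrow the SPSS option names mentioned above, the Python `math` functions stand in for them, and the right-skewed dataset is simulated. All of these transformations assume strictly positive values.

```python
import math
import random
import statistics


def skewness(values):
    """Third standardised moment; near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Candidate transformations, labelled with the SPSS option names from the text.
TRANSFORMS = {
    "none": lambda x: x,
    "Ln": math.log,
    "Lg10": math.log10,
    "Sqrt": math.sqrt,
    "1/[var]": lambda x: 1.0 / x,
}

rng = random.Random(7)  # fixed seed so the sketch is reproducible
raw = [rng.lognormvariate(5.0, 1.0) for _ in range(2000)]  # positive, right-skewed

# Apply each transformation and record the skewness of the result.
results = {
    name: skewness([transform(x) for x in raw])
    for name, transform in TRANSFORMS.items()
}

for name, value in results.items():
    print(f"{name:8s} skewness = {value:6.2f}")

# The candidate whose skewness is closest to zero is the most symmetric.
best = min(results, key=lambda name: abs(results[name]))
print("closest to symmetric:", best)
```

For this simulated dataset the logarithmic transformations give the most symmetric result; Ln and Lg10 differ only by a constant factor, so they are equivalent for this purpose. Skewness is only one check, and the transformed data should still be examined by full EDA.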

 


© MicrobiologyBytes 2009.