MicrobiologyBytes: Maths & Computers for Biologists: Exploratory Data Analysis With SPSS | Updated: February 6, 2009
When many independent random factors act in an additive manner to create variability, the dataset follows a bell-shaped distribution called the normal (or Gaussian) distribution, after Carl Friedrich Gauss (1777-1855).
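The additive idea can be checked with a small simulation. This is only a sketch, assuming each "factor" is a uniform random effect between -1 and 1 and that 50 of them add up to one observation (both numbers are arbitrary choices for illustration, not part of the original text):

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def simulate_value(n_factors=50):
    """One observation = sum of many small, independent random effects."""
    return sum(random.uniform(-1.0, 1.0) for _ in range(n_factors))

values = [simulate_value() for _ in range(10_000)]
mean = statistics.fmean(values)
sd = statistics.stdev(values)

# In a normal distribution roughly 68% of values lie within 1 SD of the mean.
within_1sd = sum(abs(v - mean) <= sd for v in values) / len(values)
print(f"mean={mean:.2f}  sd={sd:.2f}  within 1 SD={within_1sd:.1%}")
```

Although each individual factor is uniformly (not normally) distributed, the proportion of summed values within one standard deviation of the mean should come out close to the 68% expected for a bell-shaped curve.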
The normal distribution has some special mathematical properties which form the basis of many statistical tests. Although no real datasets follow the normal distribution exactly, many kinds of data follow a distribution that is approximately Gaussian. A normal distribution can be defined by two parameters, the mean and the standard deviation. By definition, normal frequency distributions are continuous, symmetric and unimodal (a bimodal distribution, for example, is not normal). Of course, not all datasets follow a normal distribution.
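Because the mean and standard deviation fully define a normal distribution, the proportion of values in any interval follows from those two numbers alone. A minimal sketch using Python's standard `statistics.NormalDist`; the mean of 170 and standard deviation of 10 are hypothetical values chosen for illustration:

```python
from statistics import NormalDist

def fraction_within(dist, k):
    """Proportion of a normal distribution lying within k SDs of its mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

# Hypothetical population: mean 170, standard deviation 10.
heights = NormalDist(mu=170.0, sigma=10.0)

for k in (1, 2, 3):
    print(f"within {k} SD: {fraction_within(heights, k):.4f}")
```

Changing `mu` or `sigma` shifts or stretches the curve, but the proportions within 1, 2 and 3 standard deviations stay the same (about 68%, 95% and 99.7%).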
Fortunately, SPSS contains powerful tools which make it easy to assess frequency distributions and help you decide which tests to use (see below).
Statistical methods which depend on the parameters of populations or probability distributions are referred to as parametric methods. Parametric tests include:
These tests are only meaningful for numerical data sampled from a population with an underlying normal distribution, or whose distribution can be rendered normal by mathematical transformation.
Nonparametric methods require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations:
Nonparametric methods are useful in situations where the assumptions required by parametric methods appear questionable. A few of the more commonly used nonparametric methods include:
These tests are characterised as distribution free, i.e. neither the values obtained nor the population from which the sample was drawn need have a normal distribution. Unlike parametric tests, which can give erroneous results when their distributional assumptions are violated, these tests can be used whatever the shape of the distribution of the data, although they still rely on other assumptions such as independent observations.
Unfortunately, they are less flexible in practice and less powerful than parametric tests. In cases where both parametric and nonparametric methods are applicable, statisticians usually recommend using parametric methods because they tend to provide better precision.
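One way to see what "distribution free" means is that rank-based statistics do not change when the data are stretched or squashed by a transformation that preserves order. The sketch below uses the Mann-Whitney U statistic purely as an illustration (it is an assumption that this is a representative nonparametric method here), and the two groups are made-up numbers:

```python
import math
from statistics import fmean

def mann_whitney_u(a, b):
    """U statistic for sample a: count of (x, y) pairs with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Made-up data; group_a contains one extreme value.
group_a = [1.2, 2.5, 3.1, 4.8, 40.0]
group_b = [0.9, 1.1, 1.9, 2.2, 3.0]

log_a = [math.log(x) for x in group_a]
log_b = [math.log(x) for x in group_b]

# The rank-based statistic depends only on the ordering of the values.
u_raw = mann_whitney_u(group_a, group_b)
u_log = mann_whitney_u(log_a, log_b)

# The difference in means (the basis of a parametric comparison) does not.
diff_raw = fmean(group_a) - fmean(group_b)
diff_log = fmean(log_a) - fmean(log_b)

print(f"U raw={u_raw}, U after log={u_log}")
print(f"mean difference raw={diff_raw:.2f}, after log={diff_log:.2f}")
```

The U statistic is identical before and after the log transformation, whereas the difference in means is dominated by the single extreme value in the raw data. The price of using only the ordering is that the size of the differences is ignored, which is one reason these tests tend to be less powerful.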
Much of statistics is about detecting patterns, something which the human eye and brain are very good at. Exploratory data analysis (EDA) shows you the patterns which are hidden when the data is in numerical form.
Never, ever, run any statistical test without performing EDA first!
EDA includes:
The good news is that SPSS makes EDA very easy:
Important: Treatment of Missing Values:
If you run Analyze: Descriptive Statistics: Explore, SPSS will by default "Exclude cases listwise". In practice this means that any case (row) with a missing value in any of the selected variables (columns) is dropped from the analysis of all of them. If every variable has the same number of cases there is no problem, but if not, valid values are discarded and the EDA statistics are calculated from fewer cases than you actually have. To avoid this problem, in the Analyze: Descriptive Statistics: Explore dialog, click the Options button and select:
Exclude cases pairwise
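The difference between the two options can be illustrated outside SPSS. The following is a sketch of the general listwise and pairwise ideas, not of SPSS's internal code; the variable names `weight` and `length` and all the values are hypothetical, with `None` standing for a missing value:

```python
from statistics import fmean

# Hypothetical dataset: two variables with different numbers of valid cases.
rows = [
    {"weight": 10.0, "length": 1.0},
    {"weight": 12.0, "length": None},
    {"weight": 14.0, "length": 3.0},
    {"weight": None, "length": 5.0},
    {"weight": 20.0, "length": None},
]
variables = ["weight", "length"]

def listwise_means(rows, variables):
    """Drop every row that has a missing value in ANY variable, then average."""
    complete = [r for r in rows if all(r[v] is not None for v in variables)]
    return {v: fmean(r[v] for r in complete) for v in variables}

def pairwise_means(rows, variables):
    """For each variable, average all of its own valid values."""
    return {v: fmean(r[v] for r in rows if r[v] is not None) for v in variables}

print("listwise:", listwise_means(rows, variables))
print("pairwise:", pairwise_means(rows, variables))
```

With listwise exclusion only the two complete rows contribute, so three valid weights and one valid length are thrown away; with pairwise exclusion each variable is summarised from all of its own valid values.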
As a first approximation, it may be useful to perform EDA on an entire dataset to get a quick feeling for what the data looks like. However, if you are going to be comparing groups (subsets of the data) using a parametric test, you need to ensure that each of the groups being compared has a normal frequency distribution, so you need to perform EDA on all the groups separately.
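As a rough illustration of examining groups separately, the sketch below summarises each group on its own and reports a simple skewness value (0 for a perfectly symmetric distribution). The group names and measurements are invented, and skewness is only one of several checks that EDA would normally include:

```python
from collections import defaultdict
from statistics import fmean, median, pstdev

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

# Hypothetical (group, measurement) records.
records = [
    ("control", 4.0), ("control", 4.5), ("control", 5.0),
    ("control", 5.5), ("control", 6.0),
    ("treated", 1.0), ("treated", 1.2), ("treated", 1.5),
    ("treated", 2.0), ("treated", 9.0),
]

# Split the dataset into groups before summarising.
groups = defaultdict(list)
for group, value in records:
    groups[group].append(value)

summary = {
    name: {
        "n": len(vals),
        "mean": fmean(vals),
        "median": median(vals),
        "skewness": skewness(vals),
    }
    for name, vals in groups.items()
}

for name, stats in summary.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

Here the pooled data would hide the fact that one group is symmetric while the other is strongly skewed, which is exactly what a per-group check is meant to reveal.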
In the results from DNA microarray experiments, many of the hybridized spots have very low fluorescence. However, a few of the spots have very strong fluorescence. This produces a strongly positively skewed distribution:
In this case, the relatively small number of high-intensity samples distorts the results. Since this is not a normal distribution, calculating the mean is misleading and statistical tests based on the normal distribution would give meaningless answers. The solution is to divide the fluorescence intensity of each sample by the median fluorescence intensity and take the logarithm of the resulting ratio (division alone only rescales the data and leaves the skew unchanged). Plotting the result of this gives an approximately normal distribution.
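The effect of this kind of normalisation can be sketched with simulated data. The intensities below are drawn from a log-normal distribution, which is an assumption made only to mimic "many weak spots, a few very bright ones"; the use of a base-2 logarithm of the ratio is likewise an assumption of this sketch:

```python
import math
import random
from statistics import fmean, median, pstdev

random.seed(42)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

# Simulated spot intensities (assumed log-normal, not real microarray data).
intensities = [random.lognormvariate(6.0, 1.0) for _ in range(5000)]

med = median(intensities)
ratios = [x / med for x in intensities]       # rescaled, but still skewed
log_ratios = [math.log2(r) for r in ratios]   # the log removes the skew

print(f"mean={fmean(intensities):.0f}  median={med:.0f}")
print(f"skewness of ratios:     {skewness(ratios):.2f}")
print(f"skewness of log ratios: {skewness(log_ratios):.2f}")
```

The mean of the raw intensities is pulled well above the median by the bright spots. Dividing by the median centres the ratios on 1 without changing their shape, and it is the logarithm that makes the distribution roughly symmetric around 0.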
Different sorts of mathematical transformation work best for different datasets:
While it is worth remembering the above suggestions, transforming a dataset is an empirical exercise: perform several of the most likely transformations and test for a normal distribution by EDA. People get understandably concerned about transforming data. Transforming data to allow you to use parametric statistics is completely legitimate as long as:
To transform variables in SPSS, use Transform: Compute and select the options you want to construct a new variable, e.g. Ln, Lg10, Sqrt, 1/[var], etc:
[Figure: the Transform: Compute dialog and the resulting transformed variable.]
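The empirical "try several transformations and check each one" approach can also be sketched in Python. This is not SPSS syntax: the dictionary keys simply reuse the SPSS function names mentioned above as labels, the data are simulated, and using the smallest absolute skewness as the selection criterion is a simplifying assumption that does not replace a proper look at the plots:

```python
import math
import random
from statistics import fmean, pstdev

random.seed(7)  # fixed seed so the sketch is repeatable

def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

# Candidate transformations, labelled with the SPSS names used in the text.
transforms = {
    "none": lambda x: x,
    "Ln": math.log,
    "Lg10": math.log10,
    "Sqrt": math.sqrt,
    "1/[var]": lambda x: 1.0 / x,
}

# Simulated, strictly positive, positively skewed data (assumed log-normal).
data = [random.lognormvariate(2.0, 0.8) for _ in range(2000)]

results = {name: skewness([f(x) for x in data]) for name, f in transforms.items()}
best = min(results, key=lambda name: abs(results[name]))

for name, skew in results.items():
    print(f"{name:8s} skewness = {skew:6.2f}")
print("closest to symmetric:", best)
```

Ln and Lg10 differ only by a constant factor, so they give essentially the same skewness. All of these transformations except "none" require strictly positive values, so data containing zeros or negative numbers would need to be handled before trying them.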
© MicrobiologyBytes 2009.