| MicrobiologyBytes: StatsBytes: Exploratory Data Analysis (EDA) | Updated: November 8, 2011 | Search |
Numerical summaries can simplify complex data, but need to be used carefully. The mean, the median and the mode are all measures of "central tendency" single values that attempt to describe a dataset by identifying the central value within the data. But in summarizing an entire dataset into a single number, they can sometimes be more misleading than revealing. Consider this:
10 students are riding on a bus. Each student has an annual income of £10,000. What is the mean (average), the median (middle value) and the modal (most frequently occurring value) income on the bus?
Mean: £10,000
Median: £10,000
Mode: £10,000
Now suppose one of the students gets off the bus and Mark Zuckerberg (annual income £100,000,000) gets on (this happens all the time). What is the mean, the median and the modal (most frequently occurring value) income on the bus?
Mean: £10,009,000
Median: £10,000
Mode: £10,000
Hmm, the mean doesn't seem to be a very good indicator of income from the student's point of view. (Alternatively, just as the mean is "too sensitive", the mode rather insensitive, so also of limited value.) The mean is a good summary of a dataset as long as the data is symmetrical ("normally distributed"). But many datasets are not symmetrical...
When many independent random factors act in an additive manner to create variability, the dataset follows a bell-shaped distribution called the normal (or Gaussian distribution, after Carl Friedrich Gauss, 1777-1855). The normal distribution has some special mathematical properties (parameters) which form the basis of many statistical tests ("parametric tests"). Although no real datasets follow the normal distribution exactly, many kinds of data follow a distribution that is approximately Gaussian:

It is possible to overlay a normal frequency curve on a histogram like this (via http://www.statmethods.net/graphs/density.html):
> h <- hist(x)
> xfit <- seq(min(x),max(x),length=40)
> yfit <- dnorm(xfit,mean=mean(x),sd=sd(x))
> yfit <- yfit*diff(h$mids[1:2])*length(x)
> lines(xfit, yfit, col="red")

Normal frequency distributions
are continuous (not bimodal or multimodal). Of course, not all datasets follow a normal distribution.
A dataset which is not symmetrically distributed around the central value is said to be skewed. The term kurtosis is used to describe how closely values are clustered around the mean, i.e. whether the data is more "peaked" or "flatter" than a normal frequency distribution. Values of skew range from -1 (negative skew) through 0 (symmetrical) to +1 (positive skew). There is no built-in function in R to calculate skew, but the following code creates a user-defined function which calculates skew values:
> skew=function (x, na.rm=FALSE)
+ {
+ if (na.rm) x <- x[!is.na(x)] #remove missing values
+ sum((x - mean(x))^3)/(length(x) * sd(x)^3) #calculate skew
+ }
>
Consider the following three variables:

> skew=function (x, na.rm=FALSE)
+ {
+ if (na.rm) x <- x[!is.na(x)] #remove missing values
+ sum((x - mean(x))^3)/(length(x) * sd(x)^3) #calculate skew
+ }
>> skew(A, na.rm=TRUE) #negative skew
[1] -0.4353575
> skew(B, na.rm=TRUE) #nearly normal
[1] 0.1528822
> skew(C, na.rm=TRUE) #positive skew
[1] 0.7033732
As for skew, there is no built-in function in r to calculate kurtosis, but various additional packages add this capability if needed.
Why does any of this matter?
Tests which rely on the normal frequency distribution for accuracy ("parametric tests") preform calculations based on the distribution of the data, e.g. 95% of the values lie within +/- 2 standard deviations of the mean, etc. Non-normal frequency distributions affect the amount of data in the main body and the amount in the "tails" of the distribution, and so affect the accuracy of the test.
Boxplots provide a quick display of data but are not very sensitive - they may not distinguish between unimodal, bimodal or multimodal distributions so are "risky" to rely on. Example:
Basic procedure for analysis in R, read data in, then:
> attach(dataframe)
> names(dataframe)
> hist(variablename)
e.g:> graphs <- read.csv(file="graphs.csv" ,header=TRUE ,sep=",")
> attach(graphs)
> names(graphs)> hist(y1)
> hist(y2)
> boxplot(y1)
> boxplot(y2)

Stemplots (or "stem and leaf" plots) are slightly better, presenting data in the form of a crude histogram, and can easily be plotted with R:
> stem(y2)
The decimal point is 1 digit(s) to the right of the |0 | 2
0 | 779
1 | 11124
1 | 567789
2 | 00011233334
2 | 55577788999
3 | 00000111222333334444
3 | 56666677777888999
4 | 0011111222334444
4 | 555555666667777888999
5 | 00000111112233344444
5 | 555667778888999
6 | 00000112222344
6 | 5555666667899
7 | 0011112334
7 | 68889
8 | 0223
8 | 569
9 | 012
9 | 9
The problem with stemplots is that they are not very pretty and unsuitable for large datasets. The R hist() function is a much more sophisticated way of visualizing data. Unlike boxplots, histograms visualize the whole dataset rather than relying on a few measurements (min, Q1, median, Q3, max) and so do not get "tricked" by frequency distributions such as bimodal or multimodal. At the same time, histograms are a powerful and elegant way of visualizing large datasets.
To be safe, EDA uses more than one technique to test data, for example a frequency histogram and a normal probability plot such as a QQ plot. Quantile-Quantile (Q-Q) plots are scatter plots which comparing the data to a known frequency distribution. In R, qqnorm plots the data against a normal frequency distribution. A perfect normal distribution comes out as a straight 45 degree line. To make this easier to see, we add the qqline function to the graph:
> qqnorm(variablename)
> qqline(variablename)

The closer the datapoints are to a straight line, the closer the dataset is to a normal distribution. Data which is not normally-distributed shows a systematic curve away from the straight line of a theoretical normal distribution, e.g:

In addition, the Shapiro-Wilk normality test tests the null hypothesis that a sample of data comes from a normally distributed population (Shapiro, S.S., Wilk, M.B. (1965). "An analysis of variance test for normality (complete samples)". Biometrika 52 (3-4): 591–611. doi:10.1093/biomet/52.3-4.591).
The Shapiro-Wilk test calculates a statistic called W, but more importantly, also gives a probability value that tells you whether to accept the null hypothesis for the test. The null (negative) hypothesis for statistical tests is always of the form: "There is NO DIFFERENCE between ...", so for the Shapiro-Wilk test, the null hypothesis is always "There is no difference between the data tested and a normal frequency distribution". In most statistical tests, we tend to set a level of significance of 95% (the test will give the correct answer 19 times out of 20, or 95 times out of a hundred). This is converted to an alpha value of 1 - 0.95 = 0.05. If the p value for the test statistic is greater than or equal to the alpha value, we accept the null hypothesis (in this case, that there is no difference between the data tested and a normal frequency distribution), but if p is less than 0.05, we can't so we reject the null hypothesis for the test. Summary:
In R, shapiro.test() performs the Shapiro-Wilk test of normality. Example:
> shapiro.test(x1)
Shapiro-Wilk normality test
data: x1
W = 0.9606, p-value = 4.314e-08
p <0.05 so we can't be sure that variable x1 has a normal distribution.> shapiro.test(y2)
Shapiro-Wilk normality test
data: y2
W = 0.9943, p-value = 0.644
p >0.05 so we assume that variable y2 has a normal distribution.
VERY IMPORTANT:
Under some circumstances, the Shapiro-Wilk test can give a misleading answer. NEVER rely on the Shapiro-Wilk test alone, ALWAYS combine it with graphical tests (histogram, QQ plot).
Read this: Normality tests don't do what you think they do:
"The qq and histogram plots show the distribution is closer to normal than any distribution you are likely to see in the real world, but the test rejects normality with a very high degree of confidence. Does the significant test against normality mean that we should not use normal theory statistics in this case? (Another hint: the answer is no)"
EDA: Data analysis case study:
Watch this video at 480p for better resolution

StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License