| MicrobiologyBytes: StatsBytes: Univariate data | Updated: January 19, 2012 |
If you have generated some data from an experiment or are given data, before even attempting to perform any tests it is necessary to spend some time thinking about what sort of data it is, and therefore, what you can do with it. If it is your own data, you will know how it was generated, but if you have been given the data, you should try to determine what sort of data it is from the information you have. As far as R is concerned, there are five basic data types:
Numbers - real numbers (quantitative data).
Strings - any sort of character (including numbers). Strings are specified by using quotation marks: variable <- c("hello","there")
Factors - explanatory variables which tell you something about other data (metadata).
Data Frames - collections of vectors of equal length which can contain many variables, possibly of different types.
Tables
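As a rough illustration of the five types listed above, the following sketch creates one object of each kind and asks R what it is. All of the variable names and values are invented for this example.

```r
# Numbers: a numeric vector (values invented for illustration)
weights <- c(12.5, 14.1, 13.8)

# Strings: a character vector, written with quotation marks
greeting <- c("hello", "there")

# Factors: a grouping variable with a fixed set of levels
treatment <- factor(c("control", "drug", "drug"))

# Data frame: columns of equal length, possibly of different types
experiment <- data.frame(weight = weights, treatment = treatment)

# Table: counts of how often each value occurs
treatment_counts <- table(treatment)

# Ask R what each object is
class(weights)
class(greeting)
class(treatment)
class(experiment)
class(treatment_counts)
```

Checking class() (or str()) on data you have been given is a quick way to find out which of these types you are dealing with.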
When thinking about data, it is more helpful to do it in a way which allows you to think about what you can do with that data. For that reason, the next three sections are divided up by how many variables your data set has:
| Objective: | Measurement, normal distribution | Count, rank, or measurement with non-normal distribution |
|---|---|---|
| Describe (all data types) | EDA (mean), plots | EDA (median), plots |
| Compare to a hypothetical distribution | One sample t test | Wilcoxon test, chi square goodness of fit |
| Compare independent variables | Unpaired t test | Wilcoxon test (unpaired, = Mann-Whitney test), Fisher's test (small groups) or Chi-square test of independence (large groups) / Chi-square test of homogeneity (large groups) |
| Compare dependent variables | Paired t test | Wilcoxon test (paired) |
| Measure association between variables | Pearson correlation test | Spearman correlation test |
| Prediction from another variable | Simple linear regression | [Nonparametric regression] |
| Compare 3 or more independent variables | ANOVA | Kruskal-Wallis test, Chi-square test of independence |
| Compare 3 or more dependent variables | Repeated measures ANOVA | Friedman test |
At first sight, it might seem that there is not much you can do with univariate data, but this is a very common situation. The things you can do with univariate data in R (and the questions you can ask) include (see previous sections):
str()
summary()
plot()
boxplot()
hist()
stem()
qqnorm()
qqline()
shapiro.test()
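The functions listed above can be tried out on a single vector. The sketch below uses an invented sample (the name growth and the values generated by rnorm() are assumptions made only for this illustration), so substitute your own data.

```r
if (!interactive()) pdf(NULL)   # when run as a script, draw plots without writing a file

set.seed(1)                               # make the invented sample reproducible
growth <- rnorm(30, mean = 50, sd = 5)    # hypothetical measurement data

str(growth)        # structure: type and first few values
summary(growth)    # minimum, quartiles, median, mean, maximum
plot(growth)       # values in the order they were recorded
boxplot(growth)    # median, quartiles and possible outliers
hist(growth)       # frequency distribution
stem(growth)       # stem and leaf plot, printed as text
qqnorm(growth)     # quantile-quantile plot against a normal distribution
qqline(growth)     # reference line added to the Q-Q plot

normality <- shapiro.test(growth)   # Shapiro-Wilk test of normality
normality$p.value                   # a small p value suggests non-normal data
```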
These functions will allow you to make a quick and accurate description of the data you have and to think about what it is trying to tell you. But you can do more. Parametric statistical tests rely for their accuracy on assumptions about the distribution of the data, most commonly (but not exclusively) the presence of a normal frequency distribution. You can perform a parametric test on data which does not have a normal frequency distribution, but you won't know whether the result is accurate, so there is no point. Never use parametric tests on counts, ranks or names, or on measurement data (i.e. continuous numerical data) with a non-normal frequency distribution - in these cases use a non-parametric test.
USE A PARAMETRIC TEST IF YOU CAN BECAUSE THEY ARE MORE ACCURATE AND MORE POWERFUL THAN NON-PARAMETRIC TESTS.
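One way to sketch this decision in R is a small helper that checks for normality first and then picks the test. The function name one_sample_test, the 0.05 cut-off and the invented data are all assumptions made for this illustration, not part of R. With small samples a normality test has little power, so look at the plots as well.

```r
# Hypothetical helper: choose a one-sample test from a Shapiro-Wilk result
one_sample_test <- function(x, mu, alpha = 0.05) {
  normal_enough <- shapiro.test(x)$p.value >= alpha
  if (normal_enough) {
    t.test(x, mu = mu)        # parametric: compares the mean with mu
  } else {
    wilcox.test(x, mu = mu)   # non-parametric: compares the median with mu
  }
}

set.seed(42)
lengths_mm <- rnorm(20, mean = 10, sd = 1)   # invented measurement data
result <- one_sample_test(lengths_mm, mu = 9)

result$method    # which test was chosen
result$p.value   # p value from that test
```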
The steps in data analysis are always the same:
You probably think of the t test and the chi square test as tests for bivariate or multivariate data. In fact, there are several different flavours of these tests, and some are useful for univariate data. For normally distributed measurement data, we can use the one sample t test to decide if the data came from a specified population or distribution. The one-sample t test is a parametric test (requires a normal frequency distribution) which compares the mean score of a sample to a known value, usually a population mean. Example:
There is large variability in survival for prostate cancer across Europe. The mean European 5-year survival rate is 76.5%. The rates for eight English regions are: 85.8, 83.5, 87.1, 85.2, 87.3, 82.6, 86.6, 83.5%. Do these values differ significantly from the European average? Tests for normality indicate no evidence that this measurement data does not have a normal frequency distribution, so we proceed with the one-sample t test:
> cancer <- c(85.8, 83.5, 87.1, 85.2, 87.3, 82.6, 86.6, 83.5)
> t.test(cancer, mu=76.5) #mu = value of the mean we are comparing the data to
One Sample t-test
data: cancer
t = 13.6109, df = 7, p-value = 2.719e-06
alternative hypothesis: true mean is not equal to 76.5
95 percent confidence interval: 83.68854 86.71146
sample estimates: mean of x 85.2
The p value is less than 0.05, so we reject the null hypothesis that there is no significant difference between English survival rates and the European average. (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject"). The mean English 5-year survival rate (85.2%) is significantly greater than the European average.
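To see where the t value comes from, it can be worked out by hand: the difference between the sample mean and the comparison value, divided by the standard error of the mean. The sketch below repeats the calculation on the same data; the names mu0, se, t_manual and p_manual are chosen only for this illustration.

```r
cancer <- c(85.8, 83.5, 87.1, 85.2, 87.3, 82.6, 86.6, 83.5)
mu0 <- 76.5                         # European mean we are comparing the data to

n  <- length(cancer)                # 8 regions
se <- sd(cancer) / sqrt(n)          # standard error of the sample mean
t_manual <- (mean(cancer) - mu0) / se

# Two-sided p value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

t_manual   # agrees with t = 13.6109 reported by t.test()
p_manual   # agrees with p-value = 2.719e-06 reported by t.test()
```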
The non-parametric equivalents of the one sample t test are the Wilcoxon test and the chi square goodness of fit test. The R function wilcox.test() performs a non-parametric test of the median, whereas the one-sample t test examines the mean. Example:
The median time students from Grumbledown University spend on Facebook is 81 hours per month. The median times for samples from 8 different degree courses are: 96, 90, 96, 43, 23, 98, 17, 37 hours per month. Tests for normality indicate that this measurement data does not have a normal frequency distribution, and we only know the median value for the University, not the mean, so we proceed with the Wilcoxon test:
> facebook <- c(96, 90, 96, 43, 23, 98, 17, 37)
> wilcox.test(facebook, mu=81, alternative="greater") #mu = value of the median we are comparing the data to
Wilcoxon signed rank test with continuity correction
data: facebook
V = 10, p-value = 0.8833
The p value is greater than 0.05 so we accept the null hypothesis: there is no evidence that the median for our sample is greater than the median for Grumbledown University. (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject").
Note that the function wilcox.test() performs a range of different tests in R depending on the arguments used with the function. These include the Wilcoxon signed rank test for univariate data (above), the Wilcoxon rank sum test for unpaired bivariate data (also called the Mann-Whitney test), and the Wilcoxon signed rank test for paired bivariate data.
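The three uses of wilcox.test() can be sketched as follows. The vectors before and after hold invented values with no ties, used only to show which arguments select which test.

```r
# Invented example data: five measurements per group, no tied values
before <- c(12.1, 14.3, 11.8, 15.2, 13.4)
after  <- c(13.0, 15.9, 12.2, 17.5, 14.1)

# One sample: is the median of 'before' different from 13?
one_sample <- wilcox.test(before, mu = 13)

# Two independent samples: Wilcoxon rank sum (Mann-Whitney) test
unpaired <- wilcox.test(before, after)

# Two dependent samples: Wilcoxon signed rank test on the paired differences
paired <- wilcox.test(before, after, paired = TRUE)

# The 'method' element records which test R actually ran
one_sample$method
unpaired$method
paired$method
```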
The chi square goodness of fit test examines whether the data came from a specified population, so can be used to look at observed and expected values. Example: Live Births at Grumbledown Infirmary:
| Year: | Births: |
|---|---|
| 2001 | 1538 |
| 2002 | 1692 |
| 2003 | 1365 |
| 2004 | 1517 |
| 2005 | 1458 |
| 2006 | 1908 |
| 2007 | 2160 |
| 2008 | 2029 |
| 2009 | 2581 |
| 2010 | 2292 |
Question: Is the number of births increasing significantly or could this data have arisen by chance?
| Year: | Observed: | Expected: |
|---|---|---|
| 2001 | 1538 | 1854 |
| 2002 | 1692 | 1854 |
| 2003 | 1365 | 1854 |
| 2004 | 1517 | 1854 |
| 2005 | 1458 | 1854 |
| 2006 | 1908 | 1854 |
| 2007 | 2160 | 1854 |
| 2008 | 2029 | 1854 |
| 2009 | 2581 | 1854 |
| 2010 | 2292 | 1854 |
| Total: | 18540 | 18540 |
(expected values assume the null hypothesis that there will be no significant difference in the number of births from year to year, i.e. = total births/number of years)
> observed <- c(1538, 1692, 1365, 1517, 1458, 1908, 2160, 2029, 2581, 2292)
> expected <- rep(0.1,10) #Top R Tip! Expected proportions must sum to 1, but you don't have to write them all out!
> expected
[1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
> chisq.test(observed,p=expected)
Chi-squared test for given probabilities
X-squared = 799.9763, df = 9, p-value < 2.2e-16
Since the p value is less than 0.05 (the level of significance for the test), we reject the null hypothesis (no significant difference in the number of births per year). (Remember: GTA LTR "Greater Than (or equal to) Accept, Less Than Reject"). As a result, we conclude:
"There is a significant increase in the number of births per year, chi-square = 799.98 (df = 9, N = 18540), p < 0.05."
This is the way you formally report a chi square test result, including the conclusion, value of chi square, degrees of freedom (10 - 1 = 9 in this case), total number of observations (N) and significance level.
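The chi square value itself can be checked by hand: for each year, square the difference between the observed and expected counts, divide by the expected count, and add the results up. The sketch below does this for the births data; the names expected_counts, chi_manual, deg_free and p_manual are chosen only for this illustration.

```r
observed <- c(1538, 1692, 1365, 1517, 1458, 1908, 2160, 2029, 2581, 2292)

# Expected count per year under the null hypothesis: total births / number of years
expected_counts <- rep(sum(observed) / length(observed), length(observed))

# Chi square statistic: sum of (observed - expected)^2 / expected
chi_manual <- sum((observed - expected_counts)^2 / expected_counts)

deg_free <- length(observed) - 1    # 10 years - 1 = 9 degrees of freedom
p_manual <- pchisq(chi_manual, df = deg_free, lower.tail = FALSE)

chi_manual   # agrees with X-squared = 799.9763 reported by chisq.test()
deg_free
p_manual     # far below 0.05
```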

StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License