MicrobiologyBytes: StatsBytes: Multivariate data Updated: January 18, 2012Search

R

Multivariate data

REMEMBER: The initial steps in data analysis are always the same:

  1. Before you do anything, think about the data. What sort of data is it and where did it come from? Is it measurement data (continuous, i.e. all values possible), or is it count (unordered, discrete/non-continuous), rank (ordered, discrete/non-continuous) or nominal (names) data? Be careful with nominal data - numbers can be used as names.
  2. Next, visually inspect the data. Can you see any abnormalities? Is is correctly formatted for R (one variable per column, names, if present, correctly written)?
  3. Perform EDA using graphical tools, e.g. plot(), boxplot(), hist(), qqnorm(), qqline(), and if you still can't decide whether variables have a normal frequency distribution, shapiro.test() (check the distribution of each variable, not the entire dataset).
  4. Decide what sort of statistical test you are going to use, parametric or non-parametric.
Selecting a statistical test:      Type of data:   
Univariate
Bivariate
Multivariate
Objective: Measurement, normal distribution Count, rank, or measurement with non-normal distribution
Describe (all data types) EDA (mean), plots EDA (median), plots

Compare to a hypothetical distribution

One sample t test
Wilcoxon test, chi square goodness of fit

Compare independent variables

Unpaired t test
Wilcoxon test (unpaired, = Mann-Whitney test), Fisher's test (small groups) or Chi-square test of independence (large groups) / Chi-square test of homogeneity (large groups)

Compare dependent variables

Paired t test
Wilcoxon test (paired)

Measure association between variables

Pearson correlation test
Spearman correlation test
Prediction from another variable
Simple linear regression
[Nonparametric regression]
Compare 3 or more independent variables
ANOVA
Kruskal-Wallis test, Chi-square test of independence
Compare 3 or more dependent variables Repeated measures ANOVA Friedman test

See: Which is the correct statistical test to use? (2008) British Journal of Oral and Maxillofacial Surgery 46(1): 38-41

 

Parametric tests

ANOVA - ANalysis Of Variance

The t test can be used to compare two variables, but we never use it when there are more than two variables. Why not? If you want to compare three or more groups using t tests with the usual 0.05 level of significance, you would have to compare the three groups pairwise (A to B, A to C, B to C), so the chance of getting the wrong result would be:

1 - (0.95 x 0.95 x 0.95) = 14.3%

If you wanted to compare four or more groups, the chance of getting the wrong result would be 0.956 = 26%, and for five groups, 40%. Not good, is it? Never perform multiple pairwise t tests! Instead we, use ANalysis Of VAriance (ANOVA). This is such an important (and complicated) statistical method that it would be easy to devote a whole course to ANOVA alone, so we won't be doing any more than barely scratching the surface here. ANOVA works best (i.e. is most accurate) when you have a balanced design, i.e. the same number of observations in each variable.

ANOVA tests the null hypothesis that the means of all the groups being compared are equal, and produces a statistic called F which is equivalent to the t statistic from a t test. But there's a catch. If the means of all the groups tested by ANOVA are equal, fine. But if the result tells us to reject the null hypothesis, we still don't know which of the means differ. We solve this problem by performing what is known as a "post hoc" (after the event) test.

Reminder:

ANOVA jargon (sorry):
Way = an independent variable, so a one-way ANOVA has one independent variable, two-way ANOVA has two independent variables, etc. Simple ANOVA tests the hypothesis that means from two or more samples are equal (drawn from populations with the same mean). The t-test is actually a particular application of one-way ANOVA which compares two groups.
Repeated measures: Used when members of a sample are measured under different conditions - measurement of the dependent variable is repeated. Using standard ANOVA is not appropriate because it fails to take into account correlation between the repeated measures, violating the assumption of independence. This approach can be used for several reasons, e.g. where research requires repeated measures, such as longitudinal research which measures something several times. For example, if you train rats to run through a maze and time how long it takes them over several trials, the variables (each trial) are not independent because the same animals are being tested each time. This means that standard ANOVA, which is designed to compare independent variables, will be unreliable. Unfortunately, performing repeated measures ANOVA in R is very difficult, so will not be covered here. Be aware that you need to think about study design! If you really need to test a repeated measures design, you'll need to install specialist packages which extend the functionality of R, or use different software such as SPSS (instructions here).

One way ANOVA example: Pain Scores for Analgesics
Drug: Patient Pain Scores:
Diclofenac: 0, 35, 31, 29, 20, 7, 43, 16
Ibuprophen: 30, 40, 27, 25, 39, 15, 30, 45
Paracetamol: 16, 33, 25, 32, 21, 54, 57, 19
Asprin: 55, 58, 56, 57, 56, 53, 59, 55

(Since it would be unethical to withhold pain relief, there is no control group and we are just interested in knowing whether one drug performs better (lower pain score) than another, so we need to perform a one-way ANOVA.)

> Diclofenac <- c(0, 35, 31, 29, 20, 7, 43, 16)
> Ibuprophen <- c(30, 40, 27, 25, 39, 15, 30, 45)
> Paracetamol <- c(16, 33, 25, 32, 21, 54, 57, 19)
> Asprin <- c(55, 58, 56, 57, 56, 53, 59, 55)

> boxplot(Diclofenac, Ibuprophen, Paracetamol, Asprin) #Hmm, not sure? Check further:

> shapiro.test(Diclofenac)
Shapiro-Wilk normality test
data: Diclofenac
W = 0.9723, p-value = 0.9153

> shapiro.test(Ibuprophen)
Shapiro-Wilk normality test
data: Ibuprophen
W = 0.9615, p-value = 0.8247

> shapiro.test(Paracetamol)
Shapiro-Wilk normality test
data: Paracetamol
W = 0.8638, p-value = 0.1309

> shapiro.test(Asprin)
Shapiro-Wilk normality test
data: Asprin
W = 0.9769, p-value = 0.9459

There are several ways we could set up the data for ANOVA in R, but the easiest is to make two variables, one for the dependent variable (pain) and one for the independent variable (drug):

> analgesics <-read.csv(file="analgesics.csv", header=TRUE, sep=",")
> attach(analgesics)
> names(analgesics)
> analgesics
#This is the structure of the dataframe
pain drug
1 0 Diclofenac
2 35 Diclofenac
3 31 Diclofenac
4 29 Diclofenac
5 20 Diclofenac
6 7 Diclofenac
7 43 Diclofenac
8 16 Diclofenac
9 30 Ibuprophen
10 40 Ibuprophen
11 27 Ibuprophen
12 25 Ibuprophen
13 39 Ibuprophen
14 15 Ibuprophen
15 30 Ibuprophen
16 45 Ibuprophen
17 16 Paracetamol
18 33 Paracetamol
19 25 Paracetamol
20 32 Paracetamol
21 21 Paracetamol
22 54 Paracetamol
23 57 Paracetamol
24 19 Paracetamol
25 55 Asprin
26 58 Asprin
27 56 Asprin
28 57 Asprin
29 56 Asprin
30 53 Asprin
31 59 Asprin
32 55 Asprin

> analgesics.aov = aov(pain ~ drug) #Run the ANOVA test
#Format: name.aov = aov(dependent.variable ~ independent.variable)

> summary(analgesics.aov)
Df Sum Sq Mean Sq F value Pr(>F)
drug 3 4956.4 1652.12 11.967 3.203e-05 ***
Residuals 28 3865.5 138.05
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:
Asterisks indicate a significant result:
*** = 0.001
** = 0.01
* = 0.05
This result shows three asterisks "***" so we know there is a highly significant difference - but between what? To find that out, we need to run what is called a post-hoc test, Tukey's Honest Significant Difference:

> TukeyHSD(analgesics.aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = pain ~ drug)
$drug
diff lwr upr p adj
Diclofenac-Asprin -33.50 -49.540062 -17.459938 0.0000233
Ibuprophen-Asprin -24.75 -40.790062 -8.709938 0.0012824
Paracetamol-Asprin -24.00 -40.040062 -7.959938 0.0017961
Ibuprophen-Diclofenac 8.75 -7.290062 24.790062 0.4570399
Paracetamol-Diclofenac 9.50 -6.540062 25.540062 0.3857095
Paracetamol-Ibuprophen 0.75 -15.290062 16.790062 0.9992380

This gives us the p-values of all the pairwise comparisons. This tells us whether to accept (p >= 0.05) or reject (p < 0.05) the null hypothesis (there is no significant difference between the mean pain scores of the drugs). In this case, the results for asprin are significantly different from those for the other three drugs.
Formal Reporting: When we report the outcome of an ANOVA, we cite the value of F, the number of degrees of freedom, outcome (in a neutral fashion) and significance value. So in this case: "There is a significant difference between the pain scores for asprin and the other three drugs tested, F(3,28) = 11.97, p < 0.05."

ANOVA Procedure Summary:

  1. Carry out EDA to ensure each of the variables is normally distributed. (If not, use a non-parametric test - see below)
  2. Make sure you are clear about which is the dependent and which is the independent variable.
  3. Create a .csv data file, row 1 = variable names. Read into R as a dataframe, attach(), names().
  4. Run the analysis: your.aov = aov(dependent ~ independent) #IMPORTANT: In your ANOVA model, always list the variables this way round.
  5. View the result: summary(your.aov)
  6. Carry out pairwise post-hoc testing using Tukey HSD test: TukeyHSD(your.aov)
  7. Interpret the results and draw conclusions.

See also: One-way ANOVA in R – including post-hocs/t-tests and graphs

 

Two way ANOVA
One way ANOVA has one independent variable (drug, in the analgesics example above) and one independent variable (pain score). Two way ANOVA deals with more complicated experimental designs and has two independent variables, as in the following example: Do anti-cancer drugs have different effects in males and females?

Independent variables: drug, gender
Dependent variable: tumour size

> tumours <-read.csv(file="tumours.csv", header=TRUE, sep=",")
> attach(tumours)
> names(tumours)
[1] "drug" "gender" "tumoursize"
> tumours
#This is the structure of the dataframe
drug gender tumoursize
1 cisplatin male 50
2 cisplatin male 55
3 cisplatin male 80
4 cisplatin male 65
5 cisplatin male 70
6 cisplatin male 75
7 cisplatin male 75
8 cisplatin male 65
9 cisplatin female 65
10 cisplatin female 70
11 cisplatin female 60
12 cisplatin female 60
13 cisplatin female 60
14 cisplatin female 55
15 cisplatin female 60
16 cisplatin female 50
17 vinblastine male 45
18 vinblastine male 60
19 vinblastine male 85
20 vinblastine male 65
21 vinblastine male 70
22 vinblastine male 70
23 vinblastine male 80
24 vinblastine male 60
25 vinblastine female 70
26 vinblastine female 65
27 vinblastine female 60
28 vinblastine female 70
29 vinblastine female 65
30 vinblastine female 60
31 vinblastine female 60
32 vinblastine female 50
33 5fluorouracil male 35
34 5fluorouracil male 40
35 5fluorouracil male 45
36 5fluorouracil male 55
37 5fluorouracil male 35
38 5fluorouracil male 40
39 5fluorouracil male 45
40 5fluorouracil male 40
41 5fluorouracil female 55
42 5fluorouracil female 65
43 5fluorouracil female 70
44 5fluorouracil female 55
45 5fluorouracil female 55
46 5fluorouracil female 60
47 5fluorouracil female 60
48 5fluorouracil female 60

> aov.tumours = aov(tumoursize~drug*gender,data=tumours) #Perform the analysis of variance

> summary(tumours) #Show the summary table
drug gender tumoursize
5fluorouracil:16 female:24 Min. :35.00
cisplatin :16 male :24 1st Qu.:55.00
vinblastine :16 Median :60.00
Mean :59.69
3rd Qu.:66.25
Max. :85.00

> print(model.tables(aov.tumours,"means"),digits=3) #Report the means and the number of subjects/cell
Tables of means
Grand mean 59.6875
drug
5fluorouracil cisplatin vinblastine
50.9 63.4 64.7
gender
female male
60.8 58.5
drug:gender
gender
drug female male
5fluorouracil 60.0 41.9
cisplatin 60.0 66.9
vinblastine 62.5 66.9

> TukeyHSD(aov.tumours)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = tumoursize ~ drug * gender, data = tumours)
$drug
diff lwr upr p adj
cisplatin-5fluorouracil 12.50 5.374062 19.625938 0.0003245
#Significant, < 0.05
vinblastine-5fluorouracil 13.75 6.624062 20.875938 0.0000851
#Significant, < 0.05
vinblastine-cisplatin 1.25 -5.875938 8.375938 0.9049682
$gender
diff lwr upr p adj
male-female -2.291667 -7.124695 2.541362 0.3440889
#Not significant, > 0.05
$`drug:gender`
diff lwr upr p adj
cisplatin:female-5fluorouracil:female -2.131628e-14 -12.382883 12.382883 1.0000000
vinblastine:female-5fluorouracil:female 2.500000e+00 -9.882883 14.882883 0.9902703
5fluorouracil:male-5fluorouracil:female -1.812500e+01 -30.507883 -5.742117 0.0010635
#Significant, < 0.05
cisplatin:male-5fluorouracil:female 6.875000e+00 -5.507883 19.257883 0.5665334
vinblastine:male-5fluorouracil:female 6.875000e+00 -5.507883 19.257883 0.5665334
vinblastine:female-cisplatin:female 2.500000e+00 -9.882883 14.882883 0.9902703
5fluorouracil:male-cisplatin:female -1.812500e+01 -30.507883 -5.742117 0.0010635
#Significant, < 0.05
cisplatin:male-cisplatin:female 6.875000e+00 -5.507883 19.257883 0.5665334
vinblastine:male-cisplatin:female 6.875000e+00 -5.507883 19.257883 0.5665334
5fluorouracil:male-vinblastine:female -2.062500e+01 -33.007883 -8.242117 0.0001612
#Significant, < 0.05
cisplatin:male-vinblastine:female 4.375000e+00 -8.007883 16.757883 0.8961909
vinblastine:male-vinblastine:female 4.375000e+00 -8.007883 16.757883 0.8961909
cisplatin:male-5fluorouracil:male 2.500000e+01 12.617117 37.382883 0.0000052
#Significant, < 0.05
vinblastine:male-5fluorouracil:male 2.500000e+01 12.617117 37.382883 0.0000052
#Significant, < 0.05
vinblastine:male-cisplatin:male 0.000000e+00 -12.382883 12.382883 1.0000000

The post-hoc Tukey test shows the pairwise comparison of means with p values for the null hypothesis for each pair. This result indicates there is a significant difference between 5-fluorouracil and the other drugs, but no overall effect of gender.


Non-parametric tests

To compare three or more independent groups, we have a choice of two non-parametric tests, the Kruskal-Wallis test and the chi square test of independence.

Kruskal-Wallis test

The Kruskal-Wallis test is a sort of non-parametric version of one-way ANOVA which compares the median values of variables rather than the mean.

Example: Growth of pea seedlings
In an experiment to investigating the quality of three different fertilizers with different N:P:K (nitrogen(N), phosphorus(P) and potassium(K)) ratios, the growth of pea seedlings was measured. The data for this test is set up in the same way as for a one way ANOVA test (see above):

> fertilizers <-read.csv(file="fertilizers.csv", header=TRUE, sep=",")
> fertilizers
growth formula
1 14 formula1
2 15 formula1
3 17 formula1
4 19 formula1
5 15 formula1
6 18 formula1
7 15 formula1
8 12 formula1
9 15 formula1
10 14 formula1
11 21 formula2
12 20 formula2
13 15 formula2
14 16 formula2
15 22 formula2
16 19 formula2
17 16 formula2
18 22 formula2
19 22 formula2
20 13 formula2
21 12 formula3
22 6 formula3
23 6 formula3
24 13 formula3
25 6 formula3
26 8 formula3
27 9 formula3
28 7 formula3
29 10 formula3
30 8 formula3

> attach(fertilizers)
> names(fertilizers)
[1] "growth" "formula"

> fertilizers.kw = kruskal.test(growth ~ formula, data=fertilizers)
#Format similar to ANOVA: kruskal.test(dependent.variable ~ independent.variable, data=dataframe)

> fertilizers.kw
Kruskal-Wallis rank sum test
data: growth by formula
Kruskal-Wallis chi-squared = 21.0703, df = 2, p-value = 2.658e-05

P < 0.05 so there is a significant difference, but between what? To find out we need to run a post-hoc test, but unlike ANOVA (TukeyHSD), there is no in-built function in R that will conduct post-hoc analysis for the Kruskal-Wallis test. As with repeated measures ANOVA, if you find there is a significant difference between the variables, you'll need to install specialist packages which extend the functionality of R, or do the calculations manually (beyond the scope of this course). In this circumstance, it sensible to use boxplot() on the variables which should help you draw conclusions as to where significant differences occur.

 

Chi square test

An alternative to the Kruskal-Wallis test is the chi square test. There is an example of this test in the notes on bivariate data. The setup for more than two variables is very similar:

Example: Limpet shell colour

Dark: Light: Medium:
North-facing: 470 582 320
South-facing: 492 667 376

> north <- c(470,582,320)
> south <- c(492,667,376)
> data.table <- rbind(north,south)
> data.table
[,1] [,2] [,3]
north 470 582 320
south 492 667 376
> chisq.test(data.table)
Pearson's Chi-squared test
data: data.table
X-squared = 1.659, df = 2, p-value = 0.4363

 

p (0.4363) > 0.05 so we accept the null hypothesis that the north and south-facing populations of limpets are homogeneous, i.e. that the difference in proportions is not significant, (df = 2, N = 2907), p > 0.05.

 

Friedman test

The Friedman test is a non-parametric test used to compare three or more dependent groups. It is a non-parametric alternative to two-way ANOVA. This test only works when you have a balanced design (i.e. the same number of observations in each variable) but is suitable for repeated measures testing (see above). The test works by ranking the data in a similar way to the Spearman correlation test.

Example: Sewage treatment and biological oxygen demand (BOD): LINK
Effluent streams from 6 sewage treatment sites were tested for BOD before, 1 month and 6 months after a redesign to reduce the running costs of the sites. Did the redesign affect the efficiency of sewage treatment?

BOD: Dependent variable
Site, Time: Independent variables

> biological.oxygen.demand <-read.csv(file="biological.oxygen.demand.csv", header=TRUE, sep=",")

> biological.oxygen.demand
measurement site time
1 19 1 Before
2 7 1 Month1
3 12 1 Month6
4 15 2 Before
5 17 2 Month1
6 12 2 Month6
7 13 3 Before
8 5 3 Month1
9 18 3 Month6
10 6 4 Before
11 12 4 Month1
12 20 4 Month6
13 6 5 Before
14 8 5 Month1
15 7 5 Month6
16 15 6 Before
17 19 6 Month1
18 24 6 Month6

> attach(biological.oxygen.demand)
> names(biological.oxygen.demand)
[1] "measurement" "site" "time"

> friedman.test(measurement ~ site | time, data=biological.oxygen.demand)
#Format: friedman.test(dependent.variable ~ independent.variable | independent.variable, data=dataframe)

Friedman rank sum test
data: measurement and site and time
Friedman chi-squared = 6.8137, df = 5, p-value = 0.2349

p > 0.05 so we accept the null hypothesis that there is no significant differences in BOD before and after the redesign of the sites. If there had been a significant effect we could have run a post-hoc pairwise comparison with wilcox.test(), e.g:

> pairwise.wilcox.test(Value, Group, p.adj="bonferroni", exact=F)

Alternatively, draw boxplots of each variable which may help you reach conclusions.

 

Self assessment questions


Creative Commons License
StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License