MicrobiologyBytes: StatsBytes: Graphs | Updated: January 17, 2013

*"I've also learned that simple is better, which is a kind of loose generalization of less is more. The simple-is-better idea is widely applicable to the representation, analysis, and reporting of data. If, as the old cliché has it, a picture is worth a thousand words, in describing a distribution, a frequency polygon or, better still, a Tukey stem and leaf diagram is usually worth more than the first four moments, that is, the mean, standard deviation, skewness, and kurtosis. I do not question that the moments efficiently summarize the distribution or that they are useful in some analytic contexts. Statistics packages eagerly give them to us and we dutifully publish them, but they do not usually make it possible for most of us or most of the consumers of our products to see the distribution. They don't tell us, for example, that there are no cases between scores of 72 and 90, or that this score of 24 is somewhere in left field, or that there is a pile-up of scores of 9. These are the kinds of features of our data that we surely need to know about, and they become immediately evident with simple graphic representation." Things I Have Learned (So Far). Jacob Cohen (1990) American Psychologist 45(12): 1304-1312.*

One of the best things about R is the enormous range of graphs it is able to plot. This is important because the brain is very good at detecting patterns, so visual summaries are one of the best ways of simplifying data. This document describes some of the basic graph plotting functions in R - the program is capable of going far beyond what is included here, see:

> demo(graphics)

You can also save the active window (e.g. graph files) from the File menu - File: Save As

A scatter plot of x against y is a simple but powerful way of visualizing data: plot() [see: ?plot ] *Try it for yourself:*

Download this file: **graphs.csv** by right-clicking on the link and saving it to your computer. Remember to set the working directory to the folder containing the file. Then type these commands into R:

> graphs <-read.csv(file="graphs.csv",header=TRUE,sep=",")

> attach(graphs)

> names(graphs)

[1] "x1" "y1" "x2" "y2"

> str(graphs)

'data.frame': 440 obs. of 4 variables:

$ x1: int 1 1 2 3 3 3 4 4 4 4 ...

$ y1: int 1 1 1 1 3 3 3 4 4 4 ...

$ x2: int 2 2 8 10 12 12 13 14 14 15 ...

$ y2: int 2 7 7 9 11 11 11 12 14 15 ...

> plot(x1, y1) #This opens the graph window - see the result

Also try these for yourself:

> plot(graphs) #see the result

> plot(x2, y2) #see the result

> plot(x1, y1, type="l") #line graph - see the result

> plot(x1, y1, type="h") #histogram-like vertical lines - see the result

> plot(x1, y1, type="s") #stair steps - see the result

The plot() command has many additional arguments used to format and annotate the graphs produced (see: ?plot), for example:

> plot(x2, y2, main="Descriptive Title") #gives the graph a title - see the result

> plot(x2, y2, main="Descriptive Title", xlab="Labels for x axis", ylab="Labels for y axis") #label the axes - see the result

You can change colours by using "col= " e.g:

> plot(x2, y2, type="l", col="red", main="Descriptive Title", xlab="Labels for x axis", ylab="Labels for y axis") #see the result

There are plenty of colours to choose from: > colors()

You can save graphs by selecting the graph window and choosing: File menu: Save As
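Graphs can also be saved from code rather than from the menu, which helps when a script has to run without manual steps. The following is a minimal sketch using the pdf() graphics device with made-up data and a temporary file name (both are assumptions for illustration, not part of graphs.csv); devices such as png() are used in the same open, plot, close pattern.

```r
# Made-up example data (not from graphs.csv)
x <- 1:20
y <- x^2

# Hypothetical output file; tempfile() avoids overwriting anything
outfile <- tempfile(fileext = ".pdf")

pdf(file = outfile, width = 7, height = 5)   # open a PDF device (size in inches)
plot(x, y, main = "Saved from code", xlab = "x", ylab = "y")
dev.off()                                    # close the device to finish writing the file

file.exists(outfile)                         # check that the file was written
```

Nothing appears in the graph window while the file device is open; the plot goes to the file instead.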

You can also combine multiple plots into one overall graph.
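One common way to do this is the mfrow setting of par(), which splits the graph window into a grid that is filled one plot at a time. The sketch below uses made-up data (the variable names are hypothetical) and restores the original layout afterwards.

```r
# Made-up data standing in for real measurements
set.seed(1)
a <- 1:50
b <- a + rnorm(50, sd = 5)
d <- runif(50, min = 0, max = 100)

old.par <- par(mfrow = c(1, 2))    # 1 row, 2 columns; par() returns the old setting
plot(a, b, main = "Scatter plot")  # goes in the left panel
hist(d, main = "Histogram")        # goes in the right panel
par(old.par)                       # restore the original single-plot layout
```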

Boxplots (also known as box-and-whisker diagrams) are a convenient way of graphically depicting data through a five-number summary:

- the smallest observation (minimum value)
- lower quartile (Q1)
- median (Q2)
- upper quartile (Q3)
- largest observation (maximum value)

A boxplot can also indicate if any observations might be considered to be outliers (values which appear not to lie within the distribution of the rest of the data). In R, the command is boxplot() (see ?boxplot) and this can be used to plot individual variables or an entire dataframe. Using the **graphs.csv** data above:

> boxplot(x1, x2) #see the result

> boxplot(graphs) #see the result

> boxplot(graphs, horizontal=TRUE) #see the result

> boxplot(graphs, col="blue") #see colors() - see the result

(Note the outliers in x1 - observations which are numerically distant from the rest of the data)
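As a rough illustration of the five-number summary, the sketch below uses a small made-up vector (not the graphs.csv data). fivenum() returns the five numbers, and boxplot.stats() returns the values that boxplot() actually draws. With R's default settings the whiskers stop at the most extreme observation within 1.5 times the box length from the box, so a whisker end can differ from the true minimum or maximum when outliers are present.

```r
# Made-up data with one extreme value (not from graphs.csv)
v <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 40)

fivenum(v)                 # minimum, lower hinge, median, upper hinge, maximum

stats <- boxplot.stats(v)
stats$stats                # whisker ends, hinges and median as drawn by boxplot()
stats$out                  # points drawn individually as possible outliers

boxplot(v, horizontal = TRUE)
```

Here the maximum is 40, but the upper whisker ends at 10 and the value 40 is drawn as a separate point.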

stripchart() produces one dimensional scatter plots (dot plots) of data. These plots are an alternative to boxplots when sample sizes are small. Try:

> stripchart(graphs)

> stripchart(graphs, vertical = TRUE, col = "red")

R can plot pie charts but these are a **bad** way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A scatter plot or bar chart is a preferable way of displaying this type of data (*Cleveland: The elements of graphing data. Wadsworth: Monterey, CA, USA, 1985, page 264: "Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements." This statement is based on the empirical investigations of perceptual psychologists.*) So don't use pie charts! But if you must, here are a few examples:

> slices <- c(95, 5, 45)

> lbls <- c("fat", "thin", "medium")

> pie(slices, labels = lbls, main="Simple Pie Chart") # produces a simple pie chart

> pct <- round(slices/sum(slices)*100)

> lbls <- paste(lbls, pct) # add percentages to labels

> lbls <- paste(lbls,"%", sep="") # add % to labels

> pie(slices, labels = lbls, col=rainbow(length(lbls)), main="Simple Pie Chart") #chart with custom colours and annotations

Before plotting a graph, you need to think about the type of data you want to display:

- Bar charts, barplot(data), are used to display 'discrete' (discontinuous) data, e.g. eye colour, arbitrary scales (agree, disagree, etc). By convention the data type is emphasised by the use of separate bars.
- Histograms are used to display continuous data, e.g. height, weight, etc. By convention the data type is emphasised by the use of touching (connected) bars.
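For discrete data, barplot() needs the count for each category rather than the raw observations. A minimal sketch, assuming a made-up vector of eye colours (hypothetical data), is to count the categories with table() first:

```r
# Made-up eye-colour observations (hypothetical data)
eye.colour <- c("brown", "blue", "brown", "green", "blue", "brown", "hazel", "brown")

counts <- table(eye.colour)   # number of observations in each category
counts

barplot(counts, main = "Eye colour", xlab = "Colour", ylab = "Number of people")
```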

For statisticians and scientists, histograms are a major tool for data visualization. The concept of the frequency distribution (shape) of data is implicit in drawing a histogram. In R, hist() plots a frequency distribution histogram of a variable. Using the dataset graphs.csv from above:

> hist(x1) #see the result

> hist(x2, breaks=20) # asks for about 20 bars ("bins"); R treats a single number as a suggestion and picks tidy break points - see the result

If you would like to control the range of the axes on the histogram, use:

> hist(x2, xlim=c(0,100), ylim=c(0,100))

and adjust the numbers to get the axis scale you want. (Basics of Histograms - *good post on making your histograms do what you want*)

If wanted, it is possible to add an estimate of the probability density as a line, which smooths out artefacts caused by the choice of bins. To do this, the histogram is rescaled so that the total area of the bars is 1, and the density curve is drawn on top of it using the density() and lines() functions, e.g:

> hist(y1, freq=FALSE) #freq=FALSE replaces frequency counts with densities: bar heights are scaled so that the total area of the bars is one.

> lines(density(y1, na.rm=TRUE), col="red") #na.rm=TRUE drops any missing values - see the result
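The scaling can be checked directly, because hist() returns the bar heights and break points it used. The sketch below uses made-up continuous data (not graphs.csv) and adds up height times width for every bar:

```r
set.seed(42)
v <- rnorm(200, mean = 50, sd = 10)          # made-up continuous data

h <- hist(v, freq = FALSE)                   # bars show densities, not counts
lines(density(v), col = "red")               # overlay a kernel density estimate

bin.widths <- diff(h$breaks)                 # width of each bar
total.area <- sum(h$density * bin.widths)    # height x width, summed over all bars
total.area
```

The total comes to 1 (up to rounding), which is what puts the bars and the density curve on the same scale.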

Histograms are a powerful visual tool for making decisions about data. However, on their own, they do not reveal the quality of data, or whether differences between variables are statistically significant. To show that, we need to add **error bars**. Depending on what is wanted, various values can be set for the error bars. (See: Error bars in experimental biology. J Cell Biol. 2007 177(1): 7-11)

The **standard deviation (SD)** is an estimate of how far individual values differ from the population mean. The SD shows the variability of the measurements, but does not take into account sample size. To assess statistical significance, you must take into account sample size as well as variability. Therefore, SD error bars show the variability of the data (short error bars = tightly clustered, long error bars = widely spread), but observing whether SD error bars overlap or not tells you nothing about whether the difference is, or is not, statistically significant.

The **standard error of the mean (SEM)** is an estimate of how far sample means from samples of size *n* differ from the population mean.

Standard Error of the Mean = standard deviation(sample)/square root(sample size)
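Base R has no built-in function for the SEM, so the formula is usually written out by hand. A minimal sketch, using a hypothetical helper named sem() and made-up measurements, might look like this:

```r
# Hypothetical helper (not a built-in R function)
sem <- function(v) {
  v <- v[!is.na(v)]           # drop missing values first
  sd(v) / sqrt(length(v))     # SD divided by the square root of the sample size
}

sample.values <- c(4, 8, 6, 5, 3, NA, 7)   # made-up measurements, one missing

sd(sample.values, na.rm = TRUE)   # variability of the individual values
sem(sample.values)                # uncertainty in the estimate of the mean
```

Missing values are removed before counting, so the sample size reflects only the values actually measured.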

If you plot a graph with error bars corresponding to the standard error of the mean, you can get a rough indication of whether differences between values are statistically significant. When SEM error bars do not overlap, you cannot be sure that the difference between two means is statistically significant, but when they do overlap, you can be sure the difference between the two means is not statistically significant (P>0.05).

"If we want to say how widely scattered some measurements are, we use the standard deviation. If we want to indicate the uncertainty around the estimate of the mean measurement, we quote the standard error of the mean." Standard deviations and standard errors. (2005) BMJ 331: 903

*SEM Error Bars Example:*

Once again using the **graphs.csv** data:

> graphs <- read.csv(file="graphs.csv", header=TRUE, sep=",")

> attach(graphs)

> names(graphs)

[1] "x1" "y1" "x2" "y2"

> summary(graphs)

x1 y1 x2 y2

Min. : 1.00 Min. : 1.00 Min. : 2.00 Min. : 2.00

1st Qu.:20.00 1st Qu.: 21.00 1st Qu.:44.00 1st Qu.: 33.00

Median :31.00 Median : 39.00 Median :61.00 Median : 46.00

Mean :34.74 Mean : 47.46 Mean :58.09 Mean : 46.89

3rd Qu.:47.00 3rd Qu.: 77.25 3rd Qu.:74.00 3rd Qu.: 60.00

Max. :98.00 Max. :100.00 Max. :99.00 Max. : 99.00

NA's :90.00 NA's :56.00 NA's :241.00

> means <- c(34.74, 58.09, 47.46, 46.89) #make a vector containing the mean values, in the order x1, x2, y1, y2 used for the error bars and labels

Calculate the SEMs: SEMvariablename <- sd(variablename)/sqrt(length(variablename))

> SEM.x1 <- sd(x1, na.rm=TRUE)/sqrt(length(na.omit(x1)))

> SEM.x1

[1] 1.064221

> SEM.x2 <- sd(x2, na.rm=TRUE)/sqrt(length(na.omit(x2)))

> SEM.x2

[1] 1.046663

> SEM.y1 <- sd(y1, na.rm=TRUE)/sqrt(length(na.omit(y1)))

> SEM.y1

[1] 1.398042

> SEM.y2 <- sd(y2, na.rm=TRUE)/sqrt(length(na.omit(y2)))

> SEM.y2

[1] 1.381382

> error.bars <- c(1.06, 1.05, 1.40, 1.38) #make a vector containing the standard errors for each value (rounded), in the order x1, x2, y1, y2

> SEM.graph <- barplot(means, ylim=c(0,max(means)+max(error.bars))) #plot the graph. The y-axis scale will depend on the size of the longest error bar. You can change it by setting a value of ylim or format the graph how you want it, e.g. col="whatever"

> arrows(SEM.graph, means-error.bars, SEM.graph, means+error.bars, code=3, angle=90, length=.1) #add the error bars

Various annotations can be added to the graph using additional arguments for the barplot function:

> SEM.graph <- barplot(means, main="A graph with SEM error bars", sub="SEM error bars are used to show statistical significance of variables", col.sub="red", xlab="x axis", ylab="y axis", names.arg=c("x1", "x2", "y1", "y2"), ylim=c(0,max(means)+max(error.bars)+1))

> arrows(SEM.graph, means-error.bars, SEM.graph, means+error.bars, code=3, angle=90, length=.1)

> box() #add a plot frame (if wanted)

*Conclusion:*

When SEM error bars do not overlap, you cannot be sure that the difference between two means is statistically significant, but when they do overlap, you can be sure the difference between the two means is not statistically significant. In this case the SEM bars for y1 (47.46 +/- 1.40) and y2 (46.89 +/- 1.38) overlap, so y1 and y2 are not significantly different.
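Reading overlap off a graph by eye can be error-prone, so it may help to check it numerically. The sketch below takes the means from the summary() output and the rounded SEMs calculated above, and names each value so that the two vectors stay paired with the right variable. The helper bars.overlap() and the vector names are hypothetical, introduced only for this illustration.

```r
# Means and rounded SEMs as reported above, named by variable
sem.means  <- c(x1 = 34.74, y1 = 47.46, x2 = 58.09, y2 = 46.89)
sem.values <- c(x1 = 1.06,  y1 = 1.40,  x2 = 1.05,  y2 = 1.38)

# Reorder the SEMs by name so both vectors are in the same order
sem.values <- sem.values[names(sem.means)]

lower <- sem.means - sem.values   # bottom of each error bar
upper <- sem.means + sem.values   # top of each error bar

# Hypothetical helper: TRUE when the mean +/- SEM ranges of two variables overlap
bars.overlap <- function(a, b) {
  lower[[a]] <= upper[[b]] && lower[[b]] <= upper[[a]]
}

var.pairs <- combn(names(sem.means), 2)   # every pair of variables
overlap <- apply(var.pairs, 2, function(p) bars.overlap(p[1], p[2]))
data.frame(first = var.pairs[1, ], second = var.pairs[2, ], overlap = overlap)
```

Any pair reported with overlap = TRUE cannot be called significantly different on the basis of the SEM bars; for the other pairs a formal test is still needed.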

*SD Error Bars Example:*

Using data from **graphs2.txt**:

> graphs2 <- read.table(file="graphs2.txt", header=TRUE, sep="\t")

> attach(graphs2)

> names(graphs2)

[1] "Group1" "Group2" "Group3"

> str(graphs2)

'data.frame': 99 obs. of 3 variables:

$ Group1: num 62 59 53.1 54.9 59.8 ...

$ Group2: num 50.2 48.9 36.6 44.3 46.2 ...

$ Group3: num 75.8 76.7 73.3 78.3 74.7 ...

> summary(graphs2)

Group1 Group2 Group3

Min. :44.33 Min. :21.91 Min. :68.15

1st Qu.:51.45 1st Qu.:38.11 1st Qu.:75.28

Median :54.64 Median :43.80 Median :77.09

Mean :54.49 Mean :43.44 Mean :77.01

3rd Qu.:57.59 3rd Qu.:49.88 3rd Qu.:78.72

Max. :64.30 Max. :68.42 Max. :85.52

NA's : 3.00 NA's : 1.00

> means <- c(54.49, 43.44, 77.01) #make a vector containing the mean values

Calculate standard deviation for each variable:

> sd(Group1, na.rm=TRUE)

[1] 4.527885

> sd(Group2, na.rm=TRUE)

[1] 8.70127

> sd(Group3, na.rm=TRUE)

[1] 2.781995

> error.bars <- c(4.527885, 8.70127, 2.781995) #make a vector containing the standard deviations for each value

> SD.graph <- barplot(means, ylim=c(0,max(means)+max(error.bars))) #plot the graph. The y-axis scale will depend on the size of the longest error bar. You can change it by setting a value of ylim or format the graph how you want it, e.g. col="white"

> arrows(SD.graph, means-error.bars, SD.graph, means+error.bars, code=3, angle=90, length=.1) #add the error bars

Or a more elaborate version:

> SD.graph <- barplot(means, main="A graph with SD error bars", sub="SD error bars are used to show variability of data", col.sub="red", xlab="x axis", ylab="y axis", names.arg=c("Group1", "Group2", "Group3"), ylim=c(0,max(means)+max(error.bars)+1))

> arrows(SD.graph, means-error.bars, SD.graph, means+error.bars, code=3, angle=90, length=.1)

> box() #optional - add a plot frame if wanted:
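Typing the means and standard deviations in by hand works, but it is easy to mistype a value or put values in the wrong order. As an alternative sketch, sapply() can calculate one value per column, and the column names are then carried through to the bar labels. The data frame below is a made-up stand-in for graphs2 (random numbers, not the real file), and the object names are hypothetical.

```r
# Made-up stand-in for graphs2: three groups, one missing value
set.seed(7)
groups <- data.frame(
  Group1 = rnorm(30, mean = 54, sd = 4.5),
  Group2 = rnorm(30, mean = 43, sd = 8.7),
  Group3 = rnorm(30, mean = 77, sd = 2.8)
)
groups$Group1[5] <- NA

group.means <- sapply(groups, mean, na.rm = TRUE)   # one mean per column
group.sds   <- sapply(groups, sd, na.rm = TRUE)     # one SD per column

# barplot() returns the x positions of the bars, which arrows() needs
bars <- barplot(group.means, ylim = c(0, max(group.means + group.sds) + 1))
arrows(bars, group.means - group.sds, bars, group.means + group.sds,
       code = 3, angle = 90, length = 0.1)
```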

To plot as a line graph rather than a bar chart:

> means <- c(54.49, 43.44, 77.01) #make a vector containing the mean values

Calculate standard deviation for each variable:

> sd(Group1, na.rm=TRUE)

[1] 4.527885

> sd(Group2, na.rm=TRUE)

[1] 8.70127

> sd(Group3, na.rm=TRUE)

[1] 2.781995

> error.bars <- c(4.527885, 8.70127, 2.781995) #make a vector containing the standard deviations for each group

> plot(means, ylim=c(0, max(means)+max(error.bars)), type="b") # note how you can make different kinds of line graphs using the type argument

> arrows(1:3, means-error.bars, 1:3, means+error.bars, code=3, angle=90, length=.1)

*Summary:*

Use error bars corresponding to +/- SD where you want to highlight the variability of the data.

Use error bars corresponding to +/- SEM where you want to highlight significant differences between variables.

**More help:**

*Getting Started with R: An introduction for biologists* concentrates on generating graphs in R, including plotting error bars.

StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License