| MicrobiologyBytes: StatsBytes: Descriptive Statistics | Updated: February 24, 2012 |
Statistics is the systematic collection and display of (numerical) data. Even the simplest statistical analysis can make data easier to understand:
Tables are used to represent and summarize datasets, revealing patterns:
| Deaths from: | Heart Operations | Cancer | Trauma | Total: |
| Hospital 1 | 67 | 21 | 29 | 117 |
| Hospital 2 | 57 | 46 | 36 | 139 |
| Hospital 3 | 77 | 36 | 20 | 133 |
| Hospital 4 | 30 | 48 | 28 | 106 |
| Total: | 231 | 151 | 113 | 495 |
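As a minimal sketch, the same table can be rebuilt in R so that the totals are computed rather than typed in. The object name deaths is an assumption made for this illustration; the counts are the ones in the table above.

```r
# Counts copied from the table above; "deaths" is a name chosen for this sketch
deaths <- matrix(
  c(67, 21, 29,
    57, 46, 36,
    77, 36, 20,
    30, 48, 28),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Hospital 1", "Hospital 2", "Hospital 3", "Hospital 4"),
    c("Heart Operations", "Cancer", "Trauma")
  )
)

rowSums(deaths)  # total deaths per hospital
colSums(deaths)  # total deaths per cause
sum(deaths)      # grand total
```

Recomputing the margins this way is a quick check that a hand-built table adds up.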
Graphs can summarize complex data and take advantage of the brain's ability to recognize patterns.
Numerical summaries of data can simplify complex data, but need to be used carefully. The mean, the median and the mode are all measures of "central tendency": single values that attempt to describe a dataset by identifying the central value within the data. In R, these summary statistics are easy to calculate. Try it for yourself:
Download this file: measurements.csv by right-clicking on the link and saving it to your computer. Remember to set the working directory. Then type these commands into R:
measurements <- read.csv(file="measurements.csv", header=TRUE, sep=",")
attach(measurements)
names(measurements)
str(measurements)
measurements
In R, this looks like:
> measurements <- read.csv(file="measurements.csv", header=TRUE, sep=",") #create a data frame called "measurements" by reading the measurements.csv file into R
> attach(measurements) #attach the data to make all the attributes available
> names(measurements) #display the names of the variables
[1] "weight.kg" "height.m" "heart_rate"
> str(measurements) #display the structure of the dataframe
'data.frame': 36 obs. of 3 variables:
$ weight.kg : num 61.1 53.6 62.4 77.5 49.7 48.2 53.9 49.7 58.8 54.6 ...
$ height.m : num 1.86 1.95 1.77 1.8 1.62 1.59 2.01 1.78 1.92 1.75 ...
$ heart_rate: int 77 77 75 65 86 78 83 77 78 79 ...
> measurements #display the data by simply typing the name of the data frame - only useful for small datasets!
weight.kg height.m heart_rate
1 61.1 1.86 77
2 53.6 1.95 77
3 62.4 1.77 75
4 77.5 1.80 65
5 49.7 1.62 86
6 48.2 1.59 78
7 53.9 2.01 83
8 49.7 1.78 77
9 58.8 1.92 78
10 54.6 1.75 79
11 52.4 1.89 78
12 63.9 2.17 75
13 55.4 1.57 71
14 64.5 1.76 56
15 64.2 1.95 83
16 61.6 1.92 74
17 68.4 1.83 89
18 66.0 2.00 66
19 61.5 1.81 83
20 63.2 1.84 82
21 64.3 1.70 82
22 60.3 1.76 77
23 57.7 2.16 70
24 55.2 1.64 86
25 69.9 2.19 78
26 60.7 1.99 90
27 64.6 1.69 70
28 56.2 1.96 78
29 54.8 1.88 87
30 58.2 1.49 80
31 56.0 1.71 82
32 63.9 2.18 83
33 58.4 1.94 72
34 62.0 1.80 75
35 63.6 1.46 93
36 52.5 1.86 75
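For a larger dataset, printing every row is rarely helpful. Here is a minimal sketch of the usual alternatives, using a small data frame built from the first six rows listed above. The name first_rows is an assumption made for this example, so that it runs without the CSV file.

```r
# First six rows of the listing above, typed in by hand for this sketch
first_rows <- data.frame(
  weight.kg  = c(61.1, 53.6, 62.4, 77.5, 49.7, 48.2),
  height.m   = c(1.86, 1.95, 1.77, 1.80, 1.62, 1.59),
  heart_rate = c(77L, 77L, 75L, 65L, 86L, 78L)
)

head(first_rows, 3)     # show only the first three rows
nrow(first_rows)        # number of observations
first_rows$heart_rate   # pick out one variable without attach()
```

The $ form works whether or not the data frame has been attached, and avoids confusion if two attached data frames share a variable name.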
Now let's calculate some descriptive statistics. R has no built-in command to calculate the mode of a dataset. WARNING: mode() does not calculate the most frequent value; it reports (or sets) the type or storage mode of an object in R! You can write your own R function to calculate the mode if you want, but it's simpler to use another program such as Open Office Calc or a Google Docs Spreadsheet. Since the mode is such an insensitive statistic, we won't bother with it.
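If you do want the mode in R, one possible sketch of such a function is shown here. The name stat_mode is hypothetical (it is not part of R), and the function returns every value that ties for most frequent, which is only one of several reasonable choices.

```r
# Hypothetical helper, not a built-in: returns the most frequent value(s) of x
stat_mode <- function(x) {
  x <- x[!is.na(x)]                     # ignore missing values
  values <- unique(x)                   # each distinct value once
  counts <- tabulate(match(x, values))  # how often each distinct value occurs
  values[counts == max(counts)]         # keep the value(s) with the highest count
}

# The first ten heart rates from the listing above
stat_mode(c(77, 77, 75, 65, 86, 78, 83, 77, 78, 79))
```

For these ten values the result is 77, which occurs three times.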
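> sapply(measurements, mean, na.rm=TRUE) #apply mean() to each variable in turn; in current versions of R, mean() on a whole data frame returns NA with a warning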
weight.kg height.m heart_rate
59.691667 1.838889 78.055556
mean() ("average") calculates the mean value for each variable, but the median() ("middle value") command works differently:
> median(measurements,na.rm=TRUE)
[1] 59.55
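This gives us the overall median for the dataset, combining weight, height and heart rate, and is therefore useless. (Current versions of R refuse this call altogether and stop with a "need numeric data" error.) Either way, we need to ask for each variable separately: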
> median(weight.kg, na.rm=TRUE)
[1] 60.5
> median(height.m, na.rm=TRUE)
[1] 1.835
> median(heart_rate, na.rm=TRUE)
[1] 78
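The three calls above can also be collapsed into one by applying median() to each column with sapply(). This is a minimal sketch on the first six rows of the listing only, so its results differ from the full-dataset medians above; the name sample_rows is an assumption for this example.

```r
# First six rows of the dataset, typed in so the example runs on its own
sample_rows <- data.frame(
  weight.kg  = c(61.1, 53.6, 62.4, 77.5, 49.7, 48.2),
  height.m   = c(1.86, 1.95, 1.77, 1.80, 1.62, 1.59),
  heart_rate = c(77L, 77L, 75L, 65L, 86L, 78L)
)

# sapply() calls median() once per column and returns a named vector
medians <- sapply(sample_rows, median, na.rm = TRUE)
medians
```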
But there's a much easier way to do this - use the summary() command:
> summary(measurements, na.rm=TRUE)
weight.kg height.m heart_rate
Min. :48.20 Min. :1.460 Min. :56.00
1st Qu.:55.10 1st Qu.:1.740 1st Qu.:75.00
Median :60.50 Median :1.835 Median :78.00
Mean :59.69 Mean :1.839 Mean :78.06
3rd Qu.:63.90 3rd Qu.:1.950 3rd Qu.:83.00
Max. :77.50 Max. :2.190 Max. :93.00
In addition to the median and the mean, this single command also gives the minimum, the maximum and the first and third quartiles of each variable.
To divide the variables up into percentiles, use the quantile() command, e.g. for the 95th percentile, the value below which 95% of the points in the variable fall:
> quantile(weight.kg, 0.95, na.rm=TRUE)
95%
68.775
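To see where 68.775 comes from, the sketch below repeats the calculation by hand. It assumes quantile() is used with its default settings, which interpolate linearly between the two nearest sorted values; the weights are typed in from the listing above, and the variable names are chosen for this example.

```r
# The 36 weights from the listing above
weight <- c(61.1, 53.6, 62.4, 77.5, 49.7, 48.2, 53.9, 49.7, 58.8, 54.6, 52.4, 63.9,
            55.4, 64.5, 64.2, 61.6, 68.4, 66.0, 61.5, 63.2, 64.3, 60.3, 57.7, 55.2,
            69.9, 60.7, 64.6, 56.2, 54.8, 58.2, 56.0, 63.9, 58.4, 62.0, 63.6, 52.5)

p <- 0.95
sorted <- sort(weight)
h <- (length(sorted) - 1) * p + 1   # position in the sorted data: 34.25
lower <- floor(h)                   # 34th sorted value is 68.4, the 35th is 69.9

# Move a quarter of the way from the 34th to the 35th sorted value
manual <- sorted[lower] + (h - lower) * (sorted[lower + 1] - sorted[lower])
manual

unname(quantile(weight, p))         # same value from quantile()
```

Several percentiles can be requested at once by passing a vector, e.g. quantile(weight, c(0.25, 0.5, 0.75)).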
Similarly:
> min(heart_rate, na.rm=TRUE)
[1] 56
> max(heart_rate, na.rm=TRUE)
[1] 93
> range(heart_rate, na.rm=TRUE)
[1] 56 93
To calculate the total weight of all the observations:
> sum(weight.kg, na.rm=TRUE)
[1] 2148.9
But to count (not sum!) the number of observations in a variable:
> length(na.omit(weight.kg))
[1] 36
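The na.rm=TRUE argument and na.omit() only matter when a variable contains missing values (NA). A minimal sketch with a made-up vector (the values and the name x are illustrative, not taken from measurements.csv):

```r
# Illustrative vector with one missing value
x <- c(61.1, NA, 62.4, 77.5)

mean(x)               # NA: a single missing value makes the result missing
mean(x, na.rm = TRUE) # mean of the three values that are present
length(x)             # 4: length() counts the NA as well
length(na.omit(x))    # 3: count only the observed values
sum(!is.na(x))        # 3: another way to count the observed values
```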
Other commonly used descriptive statistics include the variance, var(), and the standard deviation, sd(). The variance of a dataset is the average of the squared differences from the mean (var() calculates the sample variance, which divides the sum of squared differences by n - 1 rather than n). Squaring each difference makes them all positive numbers and makes the bigger differences stand out, but can also make the value of the variance very large, and puts it in a different unit from the measurement (the square of the original unit). The standard deviation is the square root of the variance, which brings the value back to the same units as the measurement itself. Both of these values measure the spread of a dataset around the mean, i.e. they indicate how tightly the data are clustered around the mean (small value) or how scattered they are (larger value):
> var(heart_rate, na.rm=TRUE)
[1] 54.79683
> var(height.m, na.rm=TRUE)
[1] 0.03359873
> var(weight.kg, na.rm=TRUE)
[1] 38.33964
> sd(weight.kg, na.rm=TRUE)
[1] 6.191901
> sd(height.m, na.rm=TRUE)
[1] 0.1832996
> sd(heart_rate, na.rm=TRUE)
[1] 7.402488
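The relationship between the two statistics can be checked by hand. The sketch below uses only the first ten heart rates from the listing, so its results differ from the full-dataset values above; the variable names are chosen for this example.

```r
# First ten heart rates from the listing
hr <- c(77, 77, 75, 65, 86, 78, 83, 77, 78, 79)

n <- length(hr)
squared_diffs <- (hr - mean(hr))^2          # squared distance of each value from the mean
manual_var <- sum(squared_diffs) / (n - 1)  # sample variance: divide by n - 1
manual_sd <- sqrt(manual_var)               # standard deviation: back in beats per minute

c(manual_var, var(hr))  # the two variances agree
c(manual_sd, sd(hr))    # so do the two standard deviations
```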

StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License