| MicrobiologyBytes: StatsBytes: Getting started with R | Updated: February 10, 2012 |
In the past you may have "done statistics" by plugging numbers into a calculator. That's great, but it has its limitations. It's slow, boring and inaccurate for a start. In the real world, scientists use statistics software to perform statistical tests on data. There are lots of different statistics packages available, so why use R? Here are a few reasons:
If you want to know more about R, read the Wikipedia article. If you want more justification for using R, you can watch this hour-long video.
Having access to powerful software such as R doesn't absolve you from understanding statistical principles - what you can, and more importantly, what you can't do with statistics. R won't help you with that!
If R is not already installed on your computer, download and install the current version from http://www.r-project.org/
In these documents, items shown in dark red text such as this are what you enter into R, or examples of what you see as output.
R has menus but it's best to forget them - type commands in the window at the prompt (">"). I encourage you to copy commands you see here to try things out for yourself, but don't enter the prompt (">") or you'll get an error. Just enter the command itself.
In general, R is case sensitive.
Think of learning to use R as like learning a foreign language. You need to know the vocabulary and the grammar, and the more you practice, the better you get. Like a language, R has nouns, verbs and adjectives:
R assumes that the data will be in a defined location called the working directory. If it is not, you need to include the complete path to the directory in the command. However, there is an easier alternative which gets around this. Using the Preferences Menu, you can set the working directory for the program - the place where R will look for data and save output:
Don't forget which directory you have set! You can check the working directory using:
> getwd()
> setwd(dir) # sets the working directory; dir is the absolute file path of the folder, in quotes
You may find it helpful to create a folder called "R data" (or whatever you like), set it as the working directory and make sure your data is in there before starting R.
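As a rough sketch of the steps above, the following creates a folder, sets it as the working directory and checks the result. The folder name "R data" comes from the suggestion above; placing it inside tempdir() and the names old_dir and data_dir are assumptions made only so the example can run on any computer without touching your own files.

```r
old_dir <- getwd()                           # remember the current working directory
data_dir <- file.path(tempdir(), "R data")   # hypothetical location for the folder
dir.create(data_dir, showWarnings = FALSE)   # create the folder if it does not exist yet
setwd(data_dir)                              # make it the working directory
getwd()                                      # check which directory is now set
list.files()                                 # list the files R can see in that directory
setwd(old_dir)                               # return to the original directory
```

In your own work you would replace data_dir with the full path to your own data folder.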
To get help on any command in R, type ? and the name of the command, then press return, e.g:
> ?plot
If that doesn't help there is lots of information available online. In Google, type R and the name of the command you are interested in, e.g:
In R, an object is something which can hold data. An object can be a single number or an entire dataset consisting of many variables (observations).
Variable names cannot start with numbers. To make life easier for yourself, use descriptive names, e.g. height, weight, time, rather than x, y, z - this will reduce errors.
The function:
> ls()
shows a list of all variables you have defined. To display the contents of a particular variable, just type the name of the variable and press the return key.
Types of Data
R recognizes about a dozen different types of data. You'll only need to know a few of them:
| Age | Weight | Height | Gender |
|---|---|---|---|
| 18 | 150 | 65 | F |
| 21 | 160 | 68 | M |
| 45 | 180 | 65 | M |
| 54 | 205 | 69 | M |
In this table, the variables are: Age, Weight, Height and Gender.
"<-" or "=" stores data as a variable (i.e. defines a variable), but "<-" is preferred because "=" can easily be confused with "==" e.g:
> variablename <- 12 # creates "variablename" with a value of 12
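To see why "<-" and "==" should not be confused, here is a small sketch (type the lines without any prompt). The first line stores a value, the others only ask a question about it and leave the variable unchanged.

```r
variablename <- 12            # assignment: stores 12 in "variablename"
is_twelve <- variablename == 12     # comparison: is the value equal to 12? TRUE
is_thirteen <- variablename == 13   # comparison: is the value equal to 13? FALSE
variablename                  # the stored value is still 12
```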
You can type data directly into R, e.g:
> variablename <- c(1, 2, 3, 4, 5, 6)
c stands for concatenate - it joins individual values together into a single variable (a vector). An alternative function is scan() which saves you having to type the commas, e.g:
> variablename <- scan()
> 1 2 3 4 5 6
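creates a variable called "variablename" with values 1 - 6. While scan() is waiting for values, R shows a numbered prompt such as "1:" instead of ">"; press return on an empty line to finish entering data. (Variable names cannot start with numbers)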
Example: try this for yourself (in R, type these functions - don't include the ">" prompts or you will get an error. Press the return or enter key at the end of each line, and press it once more on an empty line to finish scan()):
> data1 <- scan()
> 1 2 3 4 5 6
> ls()
> list(data1)
> mean(data1)
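If you would rather not type values interactively, the same example can be sketched in a form that runs in one go. This assumes the text argument of scan(), which reads the values from a piece of text instead of the keyboard; data1_c is a hypothetical name used only for comparison.

```r
data1 <- scan(text = "1 2 3 4 5 6")   # same values, supplied as text rather than typed in
data1_c <- c(1, 2, 3, 4, 5, 6)        # the same data entered with c()
identical(data1, data1_c)             # both approaches give the same numeric values
ls()                                  # data1 and data1_c now appear in the list
mean(data1)                           # 3.5
```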
IMPORTANT: Remember! In R:
Columns = variables (remember variable names cannot start with numbers!), rows = cases.
You can create data frames directly in R, e.g:
> Pop <- c("A", "A", "A", "A", "A", "B", "B", "B", "B")
> Ht <- c(23.4, 32.9, 29.7, 38.2, 32.7, 28.4, 27.3, 27.7, 30.1)
> Sx <- c("Female", "Female", "Female", "Male", "Male", "Female", "Male", "Male", "Female")
> myDataFrame <- data.frame(Population=Pop, Height=Ht, Sex=Sx)
To view variables, simply type the name of the variable and press return (try this for yourself in R).
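To illustrate "columns = variables, rows = cases", here is a sketch that builds the small Age/Weight/Height/Gender table shown earlier as a data frame. The name people is hypothetical, chosen only for this example.

```r
people <- data.frame(
  Age    = c(18, 21, 45, 54),
  Weight = c(150, 160, 180, 205),
  Height = c(65, 68, 65, 69),
  Gender = c("F", "M", "M", "M")
)
names(people)    # the variables (columns)
nrow(people)     # the number of cases (rows): 4
people$Age       # one variable, picked out with the $ operator
people[2, ]      # one case: the second row
str(people)      # a compact overview of the whole data frame
```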
Unless the dataset is very small, entering data manually is usually a bad idea - it is time consuming and likely to create errors. Usually you will have data in another format anyway so it's best to read the data into R. Although it is possible to format data directly in R, even experienced users often prefer to use other software such as Microsoft Excel, Open Office Calc or a Google Docs Spreadsheet to format the data before reading it into R. For example, here's a handy tip:
Watch this video at 480p for better resolution
To get help with reading data into R, type:
> ?read.table
To get a dataframe into R, for data in plain text file format (.txt) use read.table, e.g:
> data2 <- read.table(file="filename.txt", header=FALSE, sep="\t") #"header=FALSE" tells R that the first row contains data points not labels
Comma-delimited files (.csv) have data items separated by commas.
Tab-delimited files (.txt) have data items separated by tabs.
Files in native .xls Microsoft Excel format can be opened in R but this is complicated, so best avoided. To make a csv file, create a standard Excel spreadsheet (.xls) and then: File: Save As/Download As. Alternatively, create or edit a spreadsheet as a Google Docs Spreadsheet then: File: Download: csv
To read .csv files into R:
> data3 <- read.csv(file="filename.csv", header=TRUE, sep=",") #"header=TRUE" tells R that the first row contains labels not data points
If the first row of the file does not contain variable names, use header=FALSE. Alternatively, R will allow you to look for the data on your computer (this can be outside the working directory, see above):
> data4 <- read.csv(file.choose(), header=TRUE)
If the data is available online, you can load it into R using the url, e.g:
> datatoplaywith <- read.csv("http://www.microbiologybytes.com/statsbytes/datatoplaywith.csv", header=TRUE, sep=",")
To read tab-delimited files:
> data5 <- read.delim(file="filename.txt", header=TRUE, sep="\t")
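If you don't have a data file to hand, the following sketch writes a tiny comma-delimited file and a tiny tab-delimited file and then reads each one back in. The data, the file names (example.csv, example.txt) and the use of tempdir() are all assumptions made for illustration; they are not files from this tutorial.

```r
example_data <- data.frame(Age = c(18, 21, 45),
                           Gender = c("Female", "Male", "Male"))   # hypothetical data
csv_file <- file.path(tempdir(), "example.csv")   # hypothetical comma-delimited file
tab_file <- file.path(tempdir(), "example.txt")   # hypothetical tab-delimited file

# Write the same data in both formats
write.csv(example_data, csv_file, row.names = FALSE)
write.table(example_data, tab_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read them back: the first row holds the labels, so header=TRUE
from_csv <- read.csv(file = csv_file, header = TRUE)
from_tab <- read.delim(file = tab_file, header = TRUE)
from_table <- read.table(file = tab_file, header = TRUE, sep = "\t")

names(from_csv)                 # "Age" "Gender"
all.equal(from_csv, from_tab)   # the two formats give the same data frame
```

The only difference between the two files is the separator, which is why read.csv and read.delim return the same result here.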
Use attach(dataframe) to make the variables in a data frame accessible by name, e.g:
> attach(data5)
Use names(dataframe) to get the variable names, e.g:
> names(data5)
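Because attach() needs care, it may help to see two alternatives alongside it. In this sketch, survey is a hypothetical data frame; the $ operator and with() reach the variables without attaching anything.

```r
survey <- data.frame(Height = c(65, 68, 65, 69),
                     Weight = c(150, 160, 180, 205))   # hypothetical data
names(survey)                  # "Height" "Weight"

mean(survey$Height)            # $ picks one variable out of the data frame: 66.75
with(survey, mean(Weight))     # with() looks up Weight inside survey: 173.75

attach(survey)                 # after attach(), the variables can be used by name
mean(Height)                   # same result as mean(survey$Height)
detach(survey)                 # detach again so the names don't cause confusion later
```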
Try it for yourself:
Download this file: datatoplaywith.csv by right-clicking on the link and saving it to your computer.
Remember to set the working directory.
Then type these commands into R (you can copy and paste them from here if you want, one line at a time, then press return):
datatoplaywith <- read.csv(file="datatoplaywith.csv", header=TRUE, sep=",")
attach(datatoplaywith)
names(datatoplaywith)
list(datatoplaywith)
datatoplaywith #does the same as list(datatoplaywith)
summary(datatoplaywith)
max(datatoplaywith)
min(datatoplaywith)
range(datatoplaywith)
list.files()
You need to try this in R for yourself, but if you'd like to watch it on video, with narration:
Watch this video at 480p for better resolution
IMPORTANT: MISSING DATA
If your variables contain missing values (shown by R as NA), many functions will not silently skip them: functions such as mean() or median() return NA instead of a number. To avoid this, you need to tell R what to do with missing values. This is normally done using the na.rm=TRUE argument in functions which require it ("rm" = remove), e.g. median(variable, na.rm=TRUE). You can see if variables contain missing values using:
> is.na(variablename)
If you get errors or NA results for some functions, check for missing values and exclude them using na.rm=TRUE.
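Here is a short sketch of what a missing value does in practice. The variable weights and its values are made up for illustration.

```r
weights <- c(150, 160, NA, 205)               # hypothetical data with one missing value
is.na(weights)                                # FALSE FALSE TRUE FALSE
n_missing <- sum(is.na(weights))              # counts the missing values: 1
mean(weights)                                 # NA, because of the missing value
mean_ok <- mean(weights, na.rm = TRUE)        # mean of the three values that are present
median_ok <- median(weights, na.rm = TRUE)    # 160
```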
To remove data frames (or variables) from memory, use:
> rm(NameOfItemToBeRemoved)
Getting data out of R: There are many ways to do this. To save a variable as a file:
> write(variablename, "filename.txt", sep = "\t")
or:
> write(variablename, "filename.csv", sep = ",")
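write() is aimed at simple variables. For a whole data frame, one option is write.csv(), sketched below. The data frame results, the file name results.csv and the use of tempdir() are assumptions made for this example.

```r
results <- data.frame(Age = c(18, 21), Weight = c(150, 160))   # hypothetical data
out_file <- file.path(tempdir(), "results.csv")                # hypothetical file name
write.csv(results, file = out_file, row.names = FALSE)         # save as a comma-delimited file
file.exists(out_file)                                          # check that the file was written
check <- read.csv(out_file, header = TRUE)                     # read it back in to confirm
check
```

Setting row.names = FALSE stops R from adding an extra column of row numbers to the file.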
You can also save the active window (e.g. graph files) using the File menu - File: Save As
When you have finished a session in R, it is good practice to type:
> detach()
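This removes the most recently attached data frame from R's search path, so its variables can no longer be accessed by name alone (so you don't get confused next time). It does not delete the data itself. To quit the program, type: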
> q()
R asks "Save workspace image? [y/n/c]: y" Answering yes "y" saves a .RData file in the working directory that contains all the data you currently have in memory. When you restart R, it will load these data back into memory for you. Alternatively, you can use:
> save(data, file="filename.Rdata")
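A saved file can be brought back with load(). The sketch below saves a variable, removes it from memory and restores it. The names mydata and rdata_file, and the use of tempdir(), are assumptions for illustration.

```r
mydata <- c(1, 2, 3)                                    # hypothetical data
rdata_file <- file.path(tempdir(), "filename.Rdata")    # hypothetical file location
save(mydata, file = rdata_file)                         # save the variable to disk
rm(mydata)                                              # remove it from memory
exists("mydata")                                        # FALSE: it is gone
load(rdata_file)                                        # restore it under its original name
mydata                                                  # the values are back
```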
The up and down arrows on the keyboard allow you to scroll through an R session and save you lots of typing! You can also copy, paste and modify previous commands for reuse.
# comment - R ignores everything after the hash sign until the next line break. It's good practice to comment your work as you go so that you can remember what you were doing.
> ?commandname #displays help for that command (with syntax, examples), e.g. > ?mean
attach() # attaches the variable names in a data frame so the variables can be accessed without having to type the data frame name. This command is controversial and needs to be used with care, also using the detach() command to clean up the workspace and avoid confusion between variables with the same name in different dataframes.
colors() # lists the colours (names) known by R
getwd() # lists the file path of the current working directory
list.files() # list files in current working directory [see getwd()]
ls() # lists all the variables you have defined
names() # lists the variables in a dataframe
quantile() # can be used to calculate percentiles of a variable
range() # returns the minimum and maximum of a variable (use diff(range()) for max - min)
rm(NameOfItemToBeRemoved) # remove variables or data frames from memory
var() # variance of a variable
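A few of the functions listed above, tried on a small made-up variable (the name height and its values are assumptions for this sketch):

```r
height <- c(65, 68, 65, 69)                      # hypothetical data
height_range <- range(height)                    # smallest and largest value: 65 69
height_spread <- diff(range(height))             # max - min: 4
height_var <- var(height)                        # variance of the variable
quantile(height, probs = c(0.25, 0.5, 0.75))     # 25th, 50th and 75th percentiles
```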
Arithmetic operators: + (plus), - (minus), * (multiply), / (divide), ^ (power) e.g.
> 14 * 403
[1] 5642
Logical operators: == (equal), != (not equal), >, <, >=, <=, ! (not), | (or), & (and)
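The operators can be tried out directly, as in this sketch. The variable names are hypothetical and are only there so that each result can be looked at afterwards.

```r
product <- 14 * 403                 # 5642
power <- 2 ^ 3                      # 8
quotient <- 7 / 2                   # 3.5
same <- 5 == 5                      # TRUE
different <- 5 != 5                 # FALSE
both <- (3 > 2) & (2 > 3)           # FALSE: "and" needs both sides to be TRUE
either <- (3 > 2) | (2 > 3)         # TRUE: "or" needs only one side to be TRUE
negated <- !(3 > 2)                 # FALSE: "not" reverses TRUE
```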
> q() # quits R - uses brackets because q is a function (= command)
Hint: The up and down arrow keys on your keyboard are your friends! They allow you to scroll back and forwards through your current R session and will save you lots of typing if you want to repeat or edit previous commands!
Two handy time-saving tips (try to remember these):
Watch this video at 480p for better resolution
Suggestion - If you're having difficulty remembering what to do each time you start R (e.g. setting the working directory, importing a .csv file, etc), write yourself a little checklist and keep it somewhere handy until these procedures become second nature each time you start up R.

StatsBytes by A.J. Cann is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License