R ‘cheats’
ONE MORE R “CHEAT” SHEET
After spending countless hours in front of my computer trying to figure out how things work in R, I thought I might ease the learning process for others by compiling this list of helpful commands. All of these things can be found elsewhere, and this list is by no means complete. I will try to keep it updated.
Last update 2012/10/05 new ggplot2 syntax
Getting started
What you need: Download and install R: http://www.r-project.org/
The functionality of R can be extended by installing additional packages. I mainly use these packages:
ggplot2 to make nice looking graphs;
grid also for plotting purposes;
gdata to import data;
lme4 for all kinds of mixed models;
multcomp for posthoc tests of linear models.
To install a package write this:
install.packages("libraryname", dependencies = TRUE)
How to load these packages
library(libraryname) #for example: library(lme4)
To always load them at startup, put this in your .Rprofile file:
local({
old <- getOption("defaultPackages")
options(defaultPackages = c(old, "ggplot2"))
})
Preparing your data table
To reduce problems with your data in R, replace any empty field with NA in Excel or OpenOffice and save your file as a txt or csv file.
Getting data inside
Choose your working directory
Menu: File/Change Dir (in the Windows GUI), or from the console: setwd("path/to/your/folder")
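Setting the working directory from code works the same way on every platform. A minimal sketch, where tempdir() is only a stand-in for your own data folder:

```r
# Remember the current folder so we can return to it afterwards
old_dir <- getwd()

# Switch to another folder (tempdir() is just a placeholder here)
setwd(tempdir())
list.files()   # files that can now be read without typing a full path

# Go back to where we started
setwd(old_dir)
```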
Import data
dataname = read.csv("filename.csv")
dataname = read.table("filename.txt", header=TRUE)
Check if your data is correctly imported (numbers are still numbers, ...)
str(dataname)
More details about the data (quantiles, mean, ...)
summary(dataname)
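To see the import and the checks together, here is a small self-contained sketch. The file contents are made-up values written to a temporary file, so nothing on your disk is touched:

```r
# Write a tiny example file; the empty field has already been replaced by NA
csv_file <- tempfile(fileext = ".csv")
writeLines(c("station,depth,count",
             "A,10,5",
             "B,20,NA",
             "C,30,12"), csv_file)

dataname <- read.csv(csv_file)

str(dataname)      # depth and count should be listed as numbers (int)
summary(dataname)  # the summary of count reports one NA
```

If a numeric column shows up as character or factor in str(), there is usually a stray text entry somewhere in that column of the file.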
Data generation and manipulation
variable1 = c(1,2,3,4,5)
variable2 = c(5,4,3,2,1)
dataname = data.frame(variable1,variable2)
data.frame combines the two vectors into one table
variable3 = seq(1,10, by = 2)
this makes a sequence from 1 to 10 in steps of 2 (1, 3, 5, 7, 9).
How to change from one format to the other
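as.numeric(data$columnname)
as.character(data$columnname)
as.factor(data$columnname)
to change a factor to a number, change it first to a character, then to a number: as.numeric(as.character(data$columnname)). Calling as.numeric() directly on a factor returns the internal level codes, not the original values.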
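The difference between the two conversions is easiest to see on a tiny factor (example values are made up):

```r
f <- factor(c("10", "20", "10"))

codes  <- as.numeric(f)                 # internal level codes: 1 2 1
values <- as.numeric(as.character(f))   # the original numbers: 10 20 10

codes
values
```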
how to refer to columns/variables of your data
dataset$columnname
How to change the column names of your dataframe.
colnames(dataname)[2] = 'copepod number'
How to change one value in the dataframe
dataname[2,1] = 11
this changes the second element of the first column ([row,column])
how to reshape a data.frame
reshape(tt, idvar=c("animal.id","station", "bottle", "sex"),
timevar="part", direction="wide")
or
library(reshape)
r4 = melt(r3, id="t")
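A small sketch of what the base reshape() call does, using a made-up long table with one id column. The column names here are only illustrative:

```r
# Long format: one row per animal and body part
long <- data.frame(
  animal.id = c(1, 1, 2, 2),
  part      = c("head", "tail", "head", "tail"),
  length    = c(2.1, 3.4, 1.9, 3.0)
)

# Wide format: one row per animal, one length column per part
wide <- reshape(long, idvar = "animal.id", timevar = "part", direction = "wide")
wide   # columns: animal.id, length.head, length.tail
```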
you can avoid always writing "dataset$" by attaching the data using:
attach(dataset)
remember to detach a dataset after you have finished, as you can get into trouble having two datasets with the same variable names.
detach(dataset)
How to make a subset of your dataset
datasetsubset = dataset[dataset$variable1 == 0 & dataset$variable2 > 2,]
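important! don't forget the comma at the end: the part before the comma selects rows, and leaving the part after it empty keeps all columns. Without the comma R tries to select columns instead of rows.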
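The same selection can also be written with subset(), which saves typing the dataset name repeatedly. A sketch on made-up values:

```r
dataset <- data.frame(variable1 = c(0, 0, 1, 0),
                      variable2 = c(1, 3, 5, 4))

# Bracket form: rows before the comma, all columns after it
by_bracket <- dataset[dataset$variable1 == 0 & dataset$variable2 > 2, ]

# subset() form: column names can be used directly
by_subset <- subset(dataset, variable1 == 0 & variable2 > 2)

by_bracket
by_subset
```

One difference to be aware of: if the condition evaluates to NA for some row, the bracket form returns a row of NAs for it, while subset() drops that row.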
How to calculate a new variable out of other variables
variable3 = variable1 + variable2
* multiply, / divide, + add, - subtract, ^ power
How to drop all NAs from the data
x1=x[!is.na(x)]
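The line above works on a single vector. For a whole data frame, na.omit() or complete.cases() drop every row that contains an NA. A minimal sketch with made-up values:

```r
# Vector case, as above
x  <- c(1, NA, 3)
x1 <- x[!is.na(x)]

# Data frame case: rows 2 and 3 each contain an NA
dataset <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))

clean <- na.omit(dataset)                      # keeps only complete rows
same  <- dataset[complete.cases(dataset), ]    # equivalent row selection
```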
get all possible ("unique") values from a variable
unique(data$variable)
how to aggregate data (mean, sum,...)
newdata = aggregate(data$x , by =list(Var1=data$Var1, Var2 = data$Var2), sum)
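aggregate() also has a formula interface that gives the same result with less typing. A sketch on a made-up table, comparing both forms:

```r
data <- data.frame(Var1 = c("a", "a", "b", "b"),
                   Var2 = c("x", "x", "x", "y"),
                   x    = c(1, 2, 3, 4))

# List form, as above
by_list <- aggregate(data$x, by = list(Var1 = data$Var1, Var2 = data$Var2), sum)

# Formula form: "x grouped by Var1 and Var2"
by_formula <- aggregate(x ~ Var1 + Var2, data = data, FUN = sum)

by_list
by_formula
```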
Simple plotting
plot(dataset$columnname~dataset$columnname)
or add ( … type="l") in case you want a line plot
boxplot(dataset$columnname~dataset$columnname)
to open a new plot window
windows() #in windows
quartz() #on a mac
to save a plot either use right-click (however only windows metafile and bitmap here) or write
savePlot("Figurename.pdf", type="pdf")
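A platform-independent alternative is to open a file device, draw the plot, and close the device again. A minimal sketch, writing to a temporary folder (the file name and plot are only examples):

```r
pdf_file <- file.path(tempdir(), "Figurename.pdf")

pdf(pdf_file, width = 6, height = 4)   # open the file; size is in inches
plot(1:10, (1:10)^2, type = "l")       # everything drawn now goes into the file
dev.off()                              # close the device, this writes the file
```

The file is only complete after dev.off(); forgetting it is a common reason for empty or unreadable PDFs.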
Statistics
Mean and standard deviation
mean(dataset$columnname, na.rm=TRUE)
sd(dataset$columnname, na.rm=TRUE)
na.rm will skip the NA values, otherwise you will get an NA as output.
Shapiro-Wilk
shapiro.test(dataset$columnname)
Kolmogorov-Smirnov
ks.test(dataset$columnname, dataset$columnname)
t-test
t.test(y ~ x, data = dataset)
Correlation
cor.test(~ y + x, data = dataset)
the default is the Pearson product-moment correlation; you can also choose Kendall by adding method = "kendall"
Test for homogeneity of variances
fligner.test(y~x, data = dataset)
Simple linear model
modelname =lm(dataset$columnname~dataset$columnname)
After you have made your model you can use summary() and anova() to get the results.
Complete model details
summary(modelname)
ANOVA Table
anova(modelname) #for lme models from the nlme package you can add type = "marginal", which tests each factor independent of interactions
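A complete run of these steps, sketched on the cars dataset that ships with R (stopping distance against speed). The dataset is only a convenient stand-in for your own data:

```r
# Fit the model; data = cars avoids repeating "cars$" in the formula
modelname <- lm(dist ~ speed, data = cars)

summary(modelname)   # coefficients, standard errors, R-squared
anova(modelname)     # ANOVA table
coef(modelname)      # just the intercept and slope
```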
Some of the models have inherent plotting functions which generate plots to test if the model is valid.
plot(modelname)
plotting the residuals of the model
plot(resid(modelname))
Mixed effect models
Assuming normally distributed data
1. using the nlme package
modelname = lme(y~x1*x2, data=dataset, random = ~ 1|r1, na.action=na.omit)
if you want to nest one factor into another write ~1|r1/r2, this would mean r2 is nested within r1. If you have repeated measures use them as if they were random factors.
2. using the lme4 package
this is more flexible as it allows for different data distribution families, but it does not give you p-values for the gaussian case. Notice that the formulation of the random effects is slightly different to the lme model. In current versions of lme4, lmer() fits only the gaussian case and no longer takes a family argument; other families are fitted with glmer().
modelname = lmer(y~x1*x2 + (1|r1), data=dataset, na.action=na.omit)
with a binomial distribution, here the data should be in a "two-column" format.
modelname = glmer(cbind(y1,y2)~x1+x2+(1|r1)+(1|r2), data=dataset, family=binomial(link=logit))
Other families to use
binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
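For the nlme variant described earlier, here is a runnable sketch using the Orthodont dataset that comes with the nlme package (growth measurements repeated within each Subject). The dataset and the chosen formula are only an illustration, not a recommendation for your own design:

```r
library(nlme)

# Fixed effects: age, Sex and their interaction; random intercept per Subject
m <- lme(distance ~ age * Sex, data = Orthodont, random = ~ 1 | Subject)

summary(m)                    # full model details
anova(m, type = "marginal")   # marginal tests, available for lme models
fixef(m)                      # fixed-effect estimates only
```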
Posthoc tests on models
load package multcomp
library(multcomp)
m1 = glht(modelname, linfct=mcp(TheFactorYouAreInterestedIn="Tukey"))
summary(m1)
Fitting an equation to data
model=nls(variable1~variable2^unknownParameter, data=dataset, start = list(unknownParameter = educatedguess))
example: model=nls(time~raddd^n, data=Bob, start = list(n=1))
model
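To try the example without real data, the following sketch simulates values that follow a power law with exponent 1.5 plus a little noise, and lets nls() recover the exponent. The data are simulated, and the names Bob, raddd and time are reused from the example above purely for illustration:

```r
set.seed(1)   # make the simulated noise reproducible

# Simulated data: time = raddd^1.5 plus small random noise
Bob <- data.frame(raddd = 1:10)
Bob$time <- Bob$raddd^1.5 + rnorm(10, sd = 0.1)

# Fit the exponent n, starting the search at n = 1
model <- nls(time ~ raddd^n, data = Bob, start = list(n = 1))

model         # prints the fitted parameter
coef(model)   # the estimate of n should be close to 1.5
```

The small noise term matters: nls() is documented to have trouble with artificial data that fit the model exactly.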
"Beautiful" plotting with ggplot2
ggplot2 has a different approach compared to the "normal" R graphics, but I think you can control more and it also has nicer default settings.
This gives you an easy scatter plot
qplot(xvariable, yvariable, data=dataset)
scatter plot with a smooth function or a regression line
qplot(xvariable, yvariable, data=dataset)+geom_smooth()
qplot(xvariable, yvariable, data=dataset)+geom_smooth(method="lm")
For a lineplot use this:
qplot(xvariable, yvariable, data=dataset, geom=c("line"))
a combination of line and point
qplot(xvariable, yvariable, data=dataset, geom=c("line", "point"))
barchart (geom "col" draws bars whose height is the y value; geom "bar" counts rows and in current ggplot2 versions does not accept a y variable)
qplot(xvariable, yvariable, data=dataset, geom=c("col"))
if you want to change the color or the shape of points, lines...
qplot(xvariable, yvariable, data=dataset, geom=c("line"),
shape = variablename, colour=variablename)
if the colour result looks weird you should try colour = factor(variablename)
to change the labels
qplot(....) + scale_x_continuous("xlabelname") +
scale_y_continuous("ylabelname")
in case your data on one axis is discrete use scale_x_discrete() instead.
how to get rid of the grey background in case the journal you want to submit to does not accept it.
qplot(xvariable, yvariable, data=dataset, geom=c("col")) + theme_bw()
The second approach to ggplot2
using ggplot; this might be more tedious in the beginning, but it will pay off because you have more control over each element in the graph.
Define what data you have
p = ggplot(data=dataset, aes(x=xvariablename, y=yvariablename))
normal plot
p + geom_point()
lineplot
p + geom_line()
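barplot (stat="identity" makes the bar height equal to the y value; by default geom_bar() counts rows and does not accept a y variable)
p + geom_bar(stat="identity")
If you want to visualize different groups, you can just use
p + geom_bar(stat="identity", aes(fill=groupname))
if you want the bars to be next to each other use
p + geom_bar(stat="identity", position="dodge", aes(fill=groupname))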
to get rid of the legend
p + theme(legend.position="none")
to change the title of the legend box
p + guides(colour = guide_legend(title = "title here"))
to rotate axis labels
p + theme(axis.text.x = element_text(angle=45, hjust=1.0))
to add a general title to the graph
p + labs(title = "New plot title")
how to limit the plotting area
p + coord_cartesian(xlim = c(-5000, 5000))
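The individual pieces can be chained into one plot. A sketch on the mtcars dataset that ships with R; the dataset, the title and the axis names are only examples:

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +                # points coloured by group
  geom_smooth(method = "lm") +                           # regression line
  labs(title = "Fuel economy against weight",
       x = "weight", y = "miles per gallon") +           # title and axis labels
  guides(colour = guide_legend(title = "cylinders")) +   # legend title
  theme_bw()                                             # white background

print(p)   # inside scripts and functions the plot must be printed explicitly
```

Because each element is added with +, you can store the base plot in p once and try out different layers or themes without repeating the data definition.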
make an area graph (filled line graph)
ggplot(data, aes(x, y)) + geom_area(aes(fill = grouping.variable),
position = "identity") +
scale_fill_manual(values = alpha(c("green", "blue"), 0.4))
error bars
mean_se = function (x, ...)
{
x = na.omit(x)
se = function(x)sqrt(var(x)/length(x))
data.frame(y=mean(x), ymin=mean(x)-se(x), ymax=mean(x)+se(x))
}
p + stat_summary(fun.data = "mean_se", colour = "red")
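A runnable version of the error-bar recipe, sketched on the mtcars dataset. The helper is given the hypothetical name mean_se_custom here so that it does not mask the mean_se() function that ggplot2 itself provides:

```r
library(ggplot2)

# Returns the mean and mean +/- one standard error in the columns
# (y, ymin, ymax) that stat_summary() expects from fun.data
mean_se_custom <- function(x, ...) {
  x <- na.omit(x)
  se <- sqrt(var(x) / length(x))
  data.frame(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)
}

# One point with error bar per number of cylinders
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  stat_summary(fun.data = mean_se_custom, colour = "red")

print(p)
```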
for more details visit the ggplot2 homepage
http://docs.ggplot2.org/