---
title: "R intro (session 2)"
author: Katia Oleinik (koleinik@bu.edu)
date: "10/19/2017"
output: html_document
---
A few useful links with many examples:
[http://r-statistics.co/ggplot2-Tutorial-With-R.html](http://r-statistics.co/ggplot2-Tutorial-With-R.html)
[http://shinyapps.org/apps/RGraphCompendium/index.php](http://shinyapps.org/apps/RGraphCompendium/index.php)
[http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual](http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual)
[http://www.cookbook-r.com/Graphs/](http://www.cookbook-r.com/Graphs/)
## Installing Packages
The following packages from R-CRAN will be used in this session. Use install.packages()
function to install them:
```{r eval=FALSE }
install.packages(c("ggplot2", "RColorBrewer", "dplyr"))
```
If the package comes from Bioconductor, then to install it:
```{r eval=FALSE}
#To build Bioconductor package:
#source("http://bioconductor.org/biocLite.R")
#biocLite("annotate")
#biocLite("geneplotter")
#biocLite("hgu95av2.db")
#biocLite("ggbio")
#biocLite("Homo.sapiens")
```
To load packages in R use library()
function. It's a common practice to have all library()
statements at the beginning of R script:
```{r warning=FALSE}
library(ggplot2)
library(RColorBrewer)
library(annotate)
#library(geneplotter)
#library(hgu95av2.db)
#library(Homo.sapiens)
#library(ggbio)
```
## Reading input datasets;
```{r }
salaries <- read.csv( "http://scv.bu.edu/classes/BI594/Salaries.csv")
```
Functions like head(), str(), summary()
are a great way to explore the dataset, learn about the size of the data, column names, type of each column and get a quick summary for each variable there.
```{r }
head( salaries )
str( salaries )
summary( salaries )
```
*Exercise*
Read and explore Ecoli dataset:
```{r echo=FALSE}
ecoli <- read.csv( "http://scv.bu.edu/classes/BI594/Ecoli_metadata.csv")
head(ecoli)
str(ecoli)
summary(ecoli)
```
Sometimes categorical variables are coded as string ("male", "female"), numeric (0, 1) or logical (TRUE, FALSE). When R reads the data by default it will treat all string variables as categorical variables.
Note: To disable this behaviour use stringAsFactors=FALSE
argument in read.csv()
and other similar functions.
Use factor()
function to convert any type of a vector to a factor:
```{r}
gender <- c(0,1,1,1,1,0,0,0) # define numeric variable
summary(gender ) # summary will calculate basic statistics for a numeric variable
gender <- factor(gender , labels=c("male", "female")) # convert vector to a categorical (factor) variable
summary(gender ) # summary produces appropriate output
```
In case of ecoli dataset some columns are not categorical variables but were read as such. We need to convert them back to "character" type:
```{r}
str(ecoli)
ecoli$sample <- as.character(ecoli$sample)
ecoli$run <- as.character(ecoli$run)
str(ecoli)
#Notice a different kind of summary output for character, factor and numeric columns
summary(ecoli)
```
## dplyr package
dplyr - R package that provide easy tools for data manipulation
```{r}
library(dplyr)
# Chose specific columns from a dataframe
select(ecoli, sample, clade, cit, genome_size)
# Use "pipeline synax" to perform several operations at once:
ecoli %>% select( sample, clade, cit, genome_size)
# Chose rows (filter the dataframe based on specific condition):
filter(ecoli, cit == "plus")
#Pipes (filter and then select columns)
ecoli %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
#To save the resulting object into a dataframe""
ecoli_citplus <- ecoli %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
# View the result:
str(ecoli_citplus )
# Split - apply - combine ( n() is a function within dplyr package to find count for each group )
ecoli %>%
group_by(cit) %>%
summarize( n() )
#To calculate mean for each subgroup
ecoli %>%
group_by(cit, clade) %>%
summarize( mean_size = mean(genome_size, na.rm = TRUE) )
```
# R graphics
> There are 3 main environments in R to create graphics:
* R base utilities
* lattice package
* ggplot2 package
There is also powerful "grid" package and a couple in-development interactive packages
### Base R Graphics
**Generic(high level) plot functions:**
* -plot: generic x-y plots, scatterplots
* -matplot
* -barplot
* -boxplot
* -hist: histogram
* -pie: pie charts
* -dotchart: cleveland dot plots
* -image, heatmap, contour, persp: image-like plots
* -qqnorm, qqline, qqplot: distribution analysis plots
* -pairs, coplot: display of multivariant data
* -dotchart
* -mosaicplot
* -stripchart
* -contour
* -hclust: hierarchial clustering
**Low level - functions that add to the existing plot created with a high-level plot functions:**
* -points
* -lines
* -abline
* -segments
* -text
* -mtext
* -title
* -axis
* -grid
* -legend
* -rug
* -polygon
* -box
```{r}
# create points for sqrt(x) and log(x) functions
x <- 1:10
y1 <- log(x)
y2 <- sqrt(x)
# First attempt to draw the points:
plot(x,y1)
lines(x,y2)
```
There are many problems with this plot: Line did not fit into the plot, log() line is displayed using points, while sqrt() is displayed as a line, there is no title or annotations.
R provides many options to every plot function to control various parameters:
```{r}
# Using R graphics parameters we can convert this plot into high quality one:
plot(x, y1, # specify the coordinates
type = "b", # type of the plot
ylim = c(min(c(y1,y2)), max(c(y1,y2))), # set the axis limits
main = "Logarithm and Square root functions", # title
ylab = "y", # # y label
col = "darkgreen", # line color
lwd = 2, # line tickness
pch=15, # symbol to display the point
cex.lab=1.5, # size of the font used for labels
font.lab = 3, # font type
cex.main = 2) # size of the font used for the title
#Additional line
lines(x,y2, # coordinates
col = "steelblue", # color
lwd =2, # line width
type="b", # line type (b - both, lines and points)
pch=19, # symbol used to display points
cex.lab=1.5) # font size for labels
#Add grid to the plot
abline(h = seq(0,3,by=0.5), v = c(0:10), col='lightgray',lty="dotted")
#Add legend
legend("topleft", # position
col=c("darkgreen","steelblue"), # colors
lty=1, # draw line
pch=c(15,19), # draw points
cex=1.5, # text size
legend=c("logarithm","square root")) # text
# Add text to the graph
text(x = 6, y = log(6), pos = 2, labels=expression(paste("y = ",ln(x))), cex=2, offset=0.75)
text(x = 6, y = sqrt(6), pos = 2, labels=expression(paste("y = ",sqrt(x))), cex=2, offset=0.75)
```
For our ecoli dataset
```{r}
plot(ecoli$genome_size,
pch=8,
main="Scatter plot of genome sizes",
cex=2,
col="steelblue",
ylab="Genome Size",
xlab="Sample")
```
**Barplot**
```{r}
# Basic Barplot
ranks <- table(salaries$rank)
barplot(ranks)
# Improved barplot
# Notice how the main plot returns the position of each bar which we will store in the variable b.
b<-barplot(ranks,
col = "darkgreen",
main = "Teaching Faculty",
sub = "Source: University Administration",
names.arg = names(ranks),
ylim = c(0, max(ranks) * 1.1),
axes = FALSE)
# Values stored in the variable b are used as x coordinates to place the text on top of each bar
text(b,ranks,
labels = ranks,
pos = 3)
```
**Colors in R**
There are many ways to specify a color in R:
* specify an index within a colorscheme
* by name
* using rgb, hsv, hex values
```{r eval=TRUE}
#1. Using index within colorscheme (palettes)
barplot(c(1:10), col=c(1:10))
#list colors in the current palette:
palette()
pie(rep(1,length(palette())),
col=palette())
```
Some available palettes:
* rainbow()
* heat.colors()
* terrain.colors()
* topo.colors()
* cm.colors()
* gray.colors()
```{r eal=FALSE}
num.col=15
barplot(c(1:num.col), col=rainbow(num.col))
barplot(c(1:num.col), col=heat.colors(num.col))
barplot(c(1:num.col), col=terrain.colors(num.col))
barplot(c(1:num.col), col=topo.colors(num.col))
barplot(c(1:num.col), col=cm.colors(num.col))
barplot(c(1:num.col), col=gray.colors(num.col))
```
A popular palettes could be found in package RColorBrewer:
```{r}
#display.brewer.all()
barplot(c(1:8), col=brewer.pal(8,"Dark2") )
barplot(c(1:9), col=brewer.pal(9,"Set1") )
```
```{r eval = FALSE}
# 2. Specify colors by name
colors() # list all colornames
```{r}
barplot(c(21,19,16,14,12,10,8,5,4,3,2,1), col=c("steelblue4",
"darkred",
"dodgerblue",
"forestgreen",
"gold",
"steelblue",
"firebrick",
"darkorange",
"darkgrey",
"tomato",
"olivedrab",
"dimgray"))
```
A lits of some of my favorite colors:
* steelblue, steelblue4
* firebrick
* tomato
* skyblue
* olivedrab
* forestgreen
* gold
* dodgerblue
* dimgray
* darkred
* darkorange
* darkgray
* springgreen
* violetred
* slateblue
* sienna
* seagreen
* sandybrown
* salmon
* saddlebrown
* maroon
* limegreen
```{r}
barplot(c(21,19,17,14,12,8,5,4,3,2,1), col=c("springgreen",
"skyblue",
"violetred",
"slateblue",
"sienna",
"seagreen",
"sandybrown",
"salmon",
"saddlebrown",
"maroon",
"limegreen"))
```
You can find a color table in this pdf document:
[http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)
```{r}
# 3. Specify rgb values (4th argument is used to control transparency)
col1 <- rgb(0.1, 0.5, 0.1, alpha=0.8)
col2 <- rgb(0.9, 0.7, 0.0)
barplot(c(7,11), col=c(col1, col2))
```
```{r eval = FALSE}
#Conversion one color format to another
col2rgb(), rgb2hsv()
# Additional help related to colors
help(package=colorspace)
```
**Boxplot**
```{r}
#Simple boxplot
boxplot(salaries$salary)
```
```{r}
#let's display 2 boxplots: one for male and one for female names
boxplot( formula = salary ~ discipline,
data = salaries,
names = c("A","B"),
col = c("gold","forestgreen"),
horizontal = TRUE,
main = "Salary range for Disciplines A and B")
legend("topright",
legend=c("A","B"),
pch=15,
col = c("gold","forestgreen"))
# Side by side boxplot for ecoli dataset
boxplot(genome_size ~ cit, ecoli) #simple
#Add colors, title, labels
boxplot(genome_size ~ cit, ecoli, col=c("pink","purple", "darkgrey"),
main="Average expression differences between celltypes", ylab="Expression")
```
**Explore various symbols**
View help of points()
function:
```{r eval=FALSE}
?points
```
```{r}
#Draw a table with all symbols
plot(c(-1, 26), 0:1, type = "n", axes = FALSE, xlab="", ylab="")
text(0:25, 0.6, 0:25, cex=0.75)
points(0:25, rep(0.5, 26), pch = 0:25, bg = "grey")
```
**graphics parameters:**
* pch: plotting symbol (default - open circle)
* lty: line type (default - solid line)
* lwd: line width
* las: orientation of the axis labels on the plot
* bg: background color
* fg: foreground color
* mar: margin size
* oma: outer margin size
```{r}
levels(salaries$rank)
col3 <- c("steelblue4","firebrick","forestgreen")
cols <- col3 [ as.numeric(salaries$rank) ]
plot(x = salaries$yrs.service,
y = salaries$salary,
pch = 19,
main = "Analysis of dependency of 2 variables",
xlab = "Years of Service",
ylab = "Salary",
col = cols,
cex=1.5)
lm.fit <- lm(salary ~ yrs.service, data = salaries)
abline(lm.fit, lty = 4, lwd = 2, col="darkgray")
legend("topleft",levels(salaries$rank),fill=col3 )
eq <- substitute(italic(salary) == a + b %.% italic(yrs.service)*","~~italic(r)^2~"="~r2,
list(a = format(coef(lm.fit)[1], digits = 2),
b = format(coef(lm.fit)[2], digits = 2),
r2 = format(summary(lm.fit)$r.squared, digits = 3)))
#R-squared: what proportion of the variation in the outcome can be explained by the covariates/predictors
# if R-squared is low, than the outcome is poorly predicted by the covariates/
text(15,170000,labels=as.expression(eq), pos=4, cex=.8 )
```
**Histogram**
```{r}
#plot simple histogram
hist(salaries$salary)
# Add color and title:
hist(salaries$salary,
main="Histogram of Salaries of professors in Disciplines A and B",
col="steelblue3",
xlab="Salary, dollars")
# Same plot with probability along the y axes
hist(salaries$salary,
main="Histogram of Salaries of professors in Disciplines A and B",
col="steelblue3",
xlab="Salary, dollars",
prob=TRUE)
#Add rugplot
rug(salaries$salary)
#Add density line
lines(density(salaries$salary), col="darkblue", lwd=2)
```
**Pairs**
```{r}
# Delays per airport
pairs( salaries[ ,c("yrs.since.phd", "yrs.service","salary") ] )
#An alternative way using formula
pairs( ~ yrs.since.phd + yrs.service + salary,
data = salaries)
```
Improve the plot appearance using custom function:
```{r}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs( ~ yrs.since.phd + yrs.service + salary,
data = salaries,
lower.panel = panel.smooth,
upper.panel = panel.cor)
```
**Multiple plots in one graph**
```{r}
#par.save <- par() # save current graphics parameters
#print( par() ) # print current graphics parameters
par( mfrow = c(1,2) ) # 1 row and 2 columns
# first graph
hist(salaries$salary,
main="Histogram of Salaries of professors in Disciplines A and B",
col="steelblue3",
xlab="Salary, dollars",
prob=TRUE)
#Add rugplot
rug(salaries$salary)
#Add density line
lines(density(salaries$salary), col="darkblue", lwd=2)
# second graph
b<-barplot(table(salaries$rank),
col=c("steelblue4","firebrick","forestgreen"),
main = "Professor Ranks",
names.arg = c("Associate Professor", "Assistant Professor", "Professor"),
ylim = c(0, max(table(salaries$rank)) * 1.25),
axes=FALSE)
text(b,table(salaries$rank),table(salaries$rank), pos=3)
mtext("Disciplines A and B")
par ( mfrow = c(1,1) )
```
```{r}
#basic plot with points
ggplot(salaries, aes( yrs.service, salary)) + geom_point()
ggplot(salaries, aes( yrs.service, salary)) +
geom_point() + # type of geom objects
theme_bw() + # theme
ggtitle(" Salary analysis") # plot title
# color the points according to the faculty type
ggplot(salaries, aes( yrs.service, salary)) +
geom_point( aes(color = factor(rank) ), size=5) + # type of geom objects
ggtitle(" Salary analysis") + # plot title
theme_bw( ) + # theme
theme ( plot.title=element_text(size=14,face="bold" ) )
#Add a regression line to the plot
ggplot(salaries, aes( yrs.service, salary)) +
geom_point( aes(color = factor(rank) ), size=5) + # type of geom objects
ggtitle(" Salary analysis") + # plot title
geom_smooth() +
theme_bw( ) + # theme
theme ( plot.title=element_text(size=14,face="bold" ) )
# Change the line to a simple lm line
ggplot(salaries, aes( yrs.service, salary)) +
geom_point( aes(color = factor(rank) ), size=5) + # type of geom objects
ggtitle(" Salary analysis") + # plot title
geom_smooth( method='lm' , col = "black") +
theme_bw( ) + # theme
theme ( plot.title=element_text(size=14,face="bold" ) )
ggplot(ecoli) +
geom_point(aes(x = sample, y= genome_size,
color = generation,
shape = cit), size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
#boxplot
ggplot(salaries, aes( x=discipline, y=salary ) ) + geom_boxplot()
#adding addional features to the boxplot
ggplot(salaries, aes( x=discipline, y=salary ) ) +
geom_boxplot() +
geom_jitter(width = 0.2)
ggplot(salaries, aes( x=rank, y=salary) ) +
geom_boxplot( varwidth = TRUE ) + #width is proportional to the square roots of the number of observations
geom_jitter(width = 0.2) +
ggtitle(" Salary analysis") +
coord_flip() +
theme ( plot.title=element_text(size=14,face="bold" ) )
ggplot(ecoli) +
geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) +
ggtitle('Boxplot of genome size by citrate mutant type') +
xlab('citrate mutant') +
ylab('genome size') +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=45, hjust=1),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
graph <- ggplot(ecoli) +
geom_boxplot(aes(x = cit, y = genome_size, fill = cit))
graph +ggtitle('Boxplot of genome size by citrate mutant type') +
xlab('citrate mutant') +
ylab('genome size') +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=45, hjust=1),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
#histogram
ggplot(salaries, aes(x=salary)) + geom_histogram()
ggplot(salaries, aes(x=salary)) + geom_histogram(binwidth=10000)
#improving histogram
ggplot(salaries, aes(x=salary)) +
geom_histogram(binwidth=10000,
color="black", fill="white")
ggplot(salaries, aes(x=salary)) +
geom_histogram(binwidth=10000,
color="black", fill="white", aes(y=..density..)) +
geom_density(alpha=.3, fill="darkorange")
```