--- title: "R intro (session 2)" author: Katia Oleinik (koleinik@bu.edu) date: "10/19/2017" output: html_document --- A few useful links with many examples: [http://r-statistics.co/ggplot2-Tutorial-With-R.html](http://r-statistics.co/ggplot2-Tutorial-With-R.html) [http://shinyapps.org/apps/RGraphCompendium/index.php](http://shinyapps.org/apps/RGraphCompendium/index.php) [http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual](http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual) [http://www.cookbook-r.com/Graphs/](http://www.cookbook-r.com/Graphs/) ## Installing Packages The following packages from R-CRAN will be used in this session. Use install.packages() function to install them: ```{r eval=FALSE } install.packages(c("ggplot2", "RColorBrewer", "dplyr")) ``` If the package comes from Bioconductor, then to install it: ```{r eval=FALSE} #To build Bioconductor package: #source("http://bioconductor.org/biocLite.R") #biocLite("annotate") #biocLite("geneplotter") #biocLite("hgu95av2.db") #biocLite("ggbio") #biocLite("Homo.sapiens") ``` To load packages in R use library() function. It's a common practice to have all library() statements at the beginning of R script: ```{r warning=FALSE} library(ggplot2) library(RColorBrewer) library(annotate) #library(geneplotter) #library(hgu95av2.db) #library(Homo.sapiens) #library(ggbio) ``` ## Reading input datasets; ```{r } salaries <- read.csv( "http://scv.bu.edu/classes/BI594/Salaries.csv") ``` Functions like head(), str(), summary() are a great way to explore the dataset, learn about the size of the data, column names, type of each column and get a quick summary for each variable there. ```{r } head( salaries ) str( salaries ) summary( salaries ) ``` *Exercise* Read and explore Ecoli dataset: ```{r echo=FALSE} ecoli <- read.csv( "http://scv.bu.edu/classes/BI594/Ecoli_metadata.csv") head(ecoli) str(ecoli) summary(ecoli) ``` Sometimes categorical variables are coded as string ("male", "female"), numeric (0, 1) or logical (TRUE, FALSE). When R reads the data by default it will treat all string variables as categorical variables. Note: To disable this behaviour use stringAsFactors=FALSE argument in read.csv() and other similar functions. Use factor() function to convert any type of a vector to a factor: ```{r} gender <- c(0,1,1,1,1,0,0,0) # define numeric variable summary(gender ) # summary will calculate basic statistics for a numeric variable gender <- factor(gender , labels=c("male", "female")) # convert vector to a categorical (factor) variable summary(gender ) # summary produces appropriate output ``` In case of ecoli dataset some columns are not categorical variables but were read as such. We need to convert them back to "character" type: ```{r} str(ecoli) ecoli$sample <- as.character(ecoli$sample) ecoli$run <- as.character(ecoli$run) str(ecoli) #Notice a different kind of summary output for character, factor and numeric columns summary(ecoli) ``` ## dplyr package dplyr - R package that provide easy tools for data manipulation ```{r} library(dplyr) # Chose specific columns from a dataframe select(ecoli, sample, clade, cit, genome_size) # Use "pipeline synax" to perform several operations at once: ecoli %>% select( sample, clade, cit, genome_size) # Chose rows (filter the dataframe based on specific condition): filter(ecoli, cit == "plus") #Pipes (filter and then select columns) ecoli %>% filter(cit == "plus") %>% select(sample, generation, clade) #To save the resulting object into a dataframe"" ecoli_citplus <- ecoli %>% filter(cit == "plus") %>% select(sample, generation, clade) # View the result: str(ecoli_citplus ) # Split - apply - combine ( n() is a function within dplyr package to find count for each group ) ecoli %>% group_by(cit) %>% summarize( n() ) #To calculate mean for each subgroup ecoli %>% group_by(cit, clade) %>% summarize( mean_size = mean(genome_size, na.rm = TRUE) ) ``` # R graphics > There are 3 main environments in R to create graphics: * R base utilities * lattice package * ggplot2 package There is also powerful "grid" package and a couple in-development interactive packages ### Base R Graphics **Generic(high level) plot functions:** * -plot: generic x-y plots, scatterplots * -matplot * -barplot * -boxplot * -hist: histogram * -pie: pie charts * -dotchart: cleveland dot plots * -image, heatmap, contour, persp: image-like plots * -qqnorm, qqline, qqplot: distribution analysis plots * -pairs, coplot: display of multivariant data * -dotchart * -mosaicplot * -stripchart * -contour * -hclust: hierarchial clustering **Low level - functions that add to the existing plot created with a high-level plot functions:** * -points * -lines * -abline * -segments * -text * -mtext * -title * -axis * -grid * -legend * -rug * -polygon * -box ```{r} # create points for sqrt(x) and log(x) functions x <- 1:10 y1 <- log(x) y2 <- sqrt(x) # First attempt to draw the points: plot(x,y1) lines(x,y2) ``` There are many problems with this plot: Line did not fit into the plot, log() line is displayed using points, while sqrt() is displayed as a line, there is no title or annotations. R provides many options to every plot function to control various parameters: ```{r} # Using R graphics parameters we can convert this plot into high quality one: plot(x, y1, # specify the coordinates type = "b", # type of the plot ylim = c(min(c(y1,y2)), max(c(y1,y2))), # set the axis limits main = "Logarithm and Square root functions", # title ylab = "y", # # y label col = "darkgreen", # line color lwd = 2, # line tickness pch=15, # symbol to display the point cex.lab=1.5, # size of the font used for labels font.lab = 3, # font type cex.main = 2) # size of the font used for the title #Additional line lines(x,y2, # coordinates col = "steelblue", # color lwd =2, # line width type="b", # line type (b - both, lines and points) pch=19, # symbol used to display points cex.lab=1.5) # font size for labels #Add grid to the plot abline(h = seq(0,3,by=0.5), v = c(0:10), col='lightgray',lty="dotted") #Add legend legend("topleft", # position col=c("darkgreen","steelblue"), # colors lty=1, # draw line pch=c(15,19), # draw points cex=1.5, # text size legend=c("logarithm","square root")) # text # Add text to the graph text(x = 6, y = log(6), pos = 2, labels=expression(paste("y = ",ln(x))), cex=2, offset=0.75) text(x = 6, y = sqrt(6), pos = 2, labels=expression(paste("y = ",sqrt(x))), cex=2, offset=0.75) ``` For our ecoli dataset ```{r} plot(ecoli$genome_size, pch=8, main="Scatter plot of genome sizes", cex=2, col="steelblue", ylab="Genome Size", xlab="Sample") ``` **Barplot** ```{r} # Basic Barplot ranks <- table(salaries$rank) barplot(ranks) # Improved barplot # Notice how the main plot returns the position of each bar which we will store in the variable b. b<-barplot(ranks, col = "darkgreen", main = "Teaching Faculty", sub = "Source: University Administration", names.arg = names(ranks), ylim = c(0, max(ranks) * 1.1), axes = FALSE) # Values stored in the variable b are used as x coordinates to place the text on top of each bar text(b,ranks, labels = ranks, pos = 3) ``` **Colors in R** There are many ways to specify a color in R: * specify an index within a colorscheme * by name * using rgb, hsv, hex values ```{r eval=TRUE} #1. Using index within colorscheme (palettes) barplot(c(1:10), col=c(1:10)) #list colors in the current palette: palette() pie(rep(1,length(palette())), col=palette()) ``` Some available palettes: * rainbow() * heat.colors() * terrain.colors() * topo.colors() * cm.colors() * gray.colors() ```{r eal=FALSE} num.col=15 barplot(c(1:num.col), col=rainbow(num.col)) barplot(c(1:num.col), col=heat.colors(num.col)) barplot(c(1:num.col), col=terrain.colors(num.col)) barplot(c(1:num.col), col=topo.colors(num.col)) barplot(c(1:num.col), col=cm.colors(num.col)) barplot(c(1:num.col), col=gray.colors(num.col)) ``` A popular palettes could be found in package RColorBrewer: ```{r} #display.brewer.all() barplot(c(1:8), col=brewer.pal(8,"Dark2") ) barplot(c(1:9), col=brewer.pal(9,"Set1") ) ``` ```{r eval = FALSE} # 2. Specify colors by name colors() # list all colornames ```{r} barplot(c(21,19,16,14,12,10,8,5,4,3,2,1), col=c("steelblue4", "darkred", "dodgerblue", "forestgreen", "gold", "steelblue", "firebrick", "darkorange", "darkgrey", "tomato", "olivedrab", "dimgray")) ``` A lits of some of my favorite colors: * steelblue, steelblue4 * firebrick * tomato * skyblue * olivedrab * forestgreen * gold * dodgerblue * dimgray * darkred * darkorange * darkgray * springgreen * violetred * slateblue * sienna * seagreen * sandybrown * salmon * saddlebrown * maroon * limegreen ```{r} barplot(c(21,19,17,14,12,8,5,4,3,2,1), col=c("springgreen", "skyblue", "violetred", "slateblue", "sienna", "seagreen", "sandybrown", "salmon", "saddlebrown", "maroon", "limegreen")) ``` You can find a color table in this pdf document: [http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf) ```{r} # 3. Specify rgb values (4th argument is used to control transparency) col1 <- rgb(0.1, 0.5, 0.1, alpha=0.8) col2 <- rgb(0.9, 0.7, 0.0) barplot(c(7,11), col=c(col1, col2)) ``` ```{r eval = FALSE} #Conversion one color format to another col2rgb(), rgb2hsv() # Additional help related to colors help(package=colorspace) ``` **Boxplot** ```{r} #Simple boxplot boxplot(salaries$salary) ``` ```{r} #let's display 2 boxplots: one for male and one for female names boxplot( formula = salary ~ discipline, data = salaries, names = c("A","B"), col = c("gold","forestgreen"), horizontal = TRUE, main = "Salary range for Disciplines A and B") legend("topright", legend=c("A","B"), pch=15, col = c("gold","forestgreen")) # Side by side boxplot for ecoli dataset boxplot(genome_size ~ cit, ecoli) #simple #Add colors, title, labels boxplot(genome_size ~ cit, ecoli, col=c("pink","purple", "darkgrey"), main="Average expression differences between celltypes", ylab="Expression") ``` **Explore various symbols** View help of points() function: ```{r eval=FALSE} ?points ``` ```{r} #Draw a table with all symbols plot(c(-1, 26), 0:1, type = "n", axes = FALSE, xlab="", ylab="") text(0:25, 0.6, 0:25, cex=0.75) points(0:25, rep(0.5, 26), pch = 0:25, bg = "grey") ``` **graphics parameters:** * pch: plotting symbol (default - open circle) * lty: line type (default - solid line) * lwd: line width * las: orientation of the axis labels on the plot * bg: background color * fg: foreground color * mar: margin size * oma: outer margin size ```{r} levels(salaries$rank) col3 <- c("steelblue4","firebrick","forestgreen") cols <- col3 [ as.numeric(salaries$rank) ] plot(x = salaries$yrs.service, y = salaries$salary, pch = 19, main = "Analysis of dependency of 2 variables", xlab = "Years of Service", ylab = "Salary", col = cols, cex=1.5) lm.fit <- lm(salary ~ yrs.service, data = salaries) abline(lm.fit, lty = 4, lwd = 2, col="darkgray") legend("topleft",levels(salaries$rank),fill=col3 ) eq <- substitute(italic(salary) == a + b %.% italic(yrs.service)*","~~italic(r)^2~"="~r2, list(a = format(coef(lm.fit)[1], digits = 2), b = format(coef(lm.fit)[2], digits = 2), r2 = format(summary(lm.fit)$r.squared, digits = 3))) #R-squared: what proportion of the variation in the outcome can be explained by the covariates/predictors # if R-squared is low, than the outcome is poorly predicted by the covariates/ text(15,170000,labels=as.expression(eq), pos=4, cex=.8 ) ``` **Histogram** ```{r} #plot simple histogram hist(salaries$salary) # Add color and title: hist(salaries$salary, main="Histogram of Salaries of professors in Disciplines A and B", col="steelblue3", xlab="Salary, dollars") # Same plot with probability along the y axes hist(salaries$salary, main="Histogram of Salaries of professors in Disciplines A and B", col="steelblue3", xlab="Salary, dollars", prob=TRUE) #Add rugplot rug(salaries$salary) #Add density line lines(density(salaries$salary), col="darkblue", lwd=2) ``` **Pairs** ```{r} # Delays per airport pairs( salaries[ ,c("yrs.since.phd", "yrs.service","salary") ] ) #An alternative way using formula pairs( ~ yrs.since.phd + yrs.service + salary, data = salaries) ``` Improve the plot appearance using custom function: ```{r} panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y)) txt <- format(c(r, 0.123456789), digits = digits)[1] txt <- paste0(prefix, txt) if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r) } pairs( ~ yrs.since.phd + yrs.service + salary, data = salaries, lower.panel = panel.smooth, upper.panel = panel.cor) ``` **Multiple plots in one graph** ```{r} #par.save <- par() # save current graphics parameters #print( par() ) # print current graphics parameters par( mfrow = c(1,2) ) # 1 row and 2 columns # first graph hist(salaries$salary, main="Histogram of Salaries of professors in Disciplines A and B", col="steelblue3", xlab="Salary, dollars", prob=TRUE) #Add rugplot rug(salaries$salary) #Add density line lines(density(salaries$salary), col="darkblue", lwd=2) # second graph b<-barplot(table(salaries$rank), col=c("steelblue4","firebrick","forestgreen"), main = "Professor Ranks", names.arg = c("Associate Professor", "Assistant Professor", "Professor"), ylim = c(0, max(table(salaries$rank)) * 1.25), axes=FALSE) text(b,table(salaries$rank),table(salaries$rank), pos=3) mtext("Disciplines A and B") par ( mfrow = c(1,1) ) ``` ```{r} #basic plot with points ggplot(salaries, aes( yrs.service, salary)) + geom_point() ggplot(salaries, aes( yrs.service, salary)) + geom_point() + # type of geom objects theme_bw() + # theme ggtitle(" Salary analysis") # plot title # color the points according to the faculty type ggplot(salaries, aes( yrs.service, salary)) + geom_point( aes(color = factor(rank) ), size=5) + # type of geom objects ggtitle(" Salary analysis") + # plot title theme_bw( ) + # theme theme ( plot.title=element_text(size=14,face="bold" ) ) #Add a regression line to the plot ggplot(salaries, aes( yrs.service, salary)) + geom_point( aes(color = factor(rank) ), size=5) + # type of geom objects ggtitle(" Salary analysis") + # plot title geom_smooth() + theme_bw( ) + # theme theme ( plot.title=element_text(size=14,face="bold" ) ) # Change the line to a simple lm line ggplot(salaries, aes( yrs.service, salary)) + geom_point( aes(color = factor(rank) ), size=5) + # type of geom objects ggtitle(" Salary analysis") + # plot title geom_smooth( method='lm' , col = "black") + theme_bw( ) + # theme theme ( plot.title=element_text(size=14,face="bold" ) ) ggplot(ecoli) + geom_point(aes(x = sample, y= genome_size, color = generation, shape = cit), size = rel(3.0)) + theme(axis.text.x = element_text(angle=45, hjust=1)) #boxplot ggplot(salaries, aes( x=discipline, y=salary ) ) + geom_boxplot() #adding addional features to the boxplot ggplot(salaries, aes( x=discipline, y=salary ) ) + geom_boxplot() + geom_jitter(width = 0.2) ggplot(salaries, aes( x=rank, y=salary) ) + geom_boxplot( varwidth = TRUE ) + #width is proportional to the square roots of the number of observations geom_jitter(width = 0.2) + ggtitle(" Salary analysis") + coord_flip() + theme ( plot.title=element_text(size=14,face="bold" ) ) ggplot(ecoli) + geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) + ggtitle('Boxplot of genome size by citrate mutant type') + xlab('citrate mutant') + ylab('genome size') + theme(panel.grid.major = element_line(size = .5, color = "grey"), axis.text.x = element_text(angle=45, hjust=1), axis.title = element_text(size = rel(1.5)), axis.text = element_text(size = rel(1.25))) graph <- ggplot(ecoli) + geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) graph +ggtitle('Boxplot of genome size by citrate mutant type') + xlab('citrate mutant') + ylab('genome size') + theme(panel.grid.major = element_line(size = .5, color = "grey"), axis.text.x = element_text(angle=45, hjust=1), axis.title = element_text(size = rel(1.5)), axis.text = element_text(size = rel(1.25))) #histogram ggplot(salaries, aes(x=salary)) + geom_histogram() ggplot(salaries, aes(x=salary)) + geom_histogram(binwidth=10000) #improving histogram ggplot(salaries, aes(x=salary)) + geom_histogram(binwidth=10000, color="black", fill="white") ggplot(salaries, aes(x=salary)) + geom_histogram(binwidth=10000, color="black", fill="white", aes(y=..density..)) + geom_density(alpha=.3, fill="darkorange") ```