--- title: "Introduction to R (Part 1)" author: "PiBS Professional Skill Set" output: html_document ---

### R as a scientific calculator ```{r comment=""} 2+3 # addition 2^3 # power log(2) # built-in functions ```

### R variables ```{r comment=""} a <- 3 b = -5 # both assignment operators are equivalent, the first one is more traditional A <- 7 # R is case sensitive ``` Avoid using names c, t, cat, F, T, D as those are built-in functions/constants The variable name can contain letters, digits, underscores and dots and start with the letter or dot The variable name cannot contain dollar sign or other special characters. ```{r comment=""} str.var <- "oxygen" # character variable num.var <- 15.99 # numerical variable bool.var <- TRUE # boolean (or logical) variable bool.var.1 <- F # logical values can be abbriviated rna.str <- "TTAGAATT" # string variable mode(rna.str) # get the type (or mode) of the object nchar(rna.str) # get the lenght of the string ```

### R help ```{r eval=FALSE} # Access help file for the R function if the name of the function is known ?sd help(sd) # Search for help ??"standard deviation" help.search("standard deviation") help.search("analysis of variance") ```

### R vectors Vector is an array of R objects of the same type: Vectors can be numeric, character or logical. ```{r comment=""} # Function c() is used to create a vector from a list of values num.vector <- c( 5,7, 9,11, 4, -1, 0) # Numeric vectors can be defined in a number of ways: vals1 <- c (2, -7, 5, 3, -1 ) # concatenation vals2 <- 25:75 # range of values vals3 <- seq(from=0, to=3, by=0.5) # sequence definition vals4 <- rep(1, times=7) # repeat value vals5 <- numeric(9) # initialize numeric vector with zeros vals6 <- rnorm(5, mean=2, sd=1.5 ) # returns normally distributed values # view the values of the vector print(vals2) ``` R allows to operate on whole vectors: ```{r comment=""} # Define 2 vectors ( first with "distance" values and one with "time valuse") dist <- c(6, 10, 5, 8, 15, 9) time <- c(0.5, 0.9, 0.4,0.7,1.2,0.85) # Compute velocity values for each dist-time pair: velocity <- dist/time # each element of the first vector will be divided by the element from the second vector with the same index print(velocity) # Define a vector with values of body temperature in Fahrenheit ftemp <- c(97.8, 99.5, 97.9, 102.1, 98.4, 97.7) # Convert ftem to the vector with values in Celsius ctemp <- (ftemp -32)/9*5 print(ctemp) ```

### R operators

              +, -, *, /  - addition, subtraction, multiplication, division
              ^ or **     - exponentiation
              %%          - modulus
              %/%         - integer division
              %in%        - membership
              :           - sequence genration
              <, <=, ==, >=, >, !=     - boolean comparative
              |, ||       - OR  vectorized/not vectorized
              &, &&       - AND

### vector slicing (subsetting) ```{r comment=""} x <- c(36.6, 38.2, 36.4, 37.9, 41.0, 39.9, 36.8, 37.5) # define a numeric vector x[2] # returns second element x[2:4] # returns second through 4th elements inclusive x[c(1,3,5)] # returns 1st, 3rd and 5th elements x[-2] # returns all but 2nd element x[c(TRUE, FALSE, TRUE, FALSE, FALSE,FALSE, TRUE, FALSE)] # returns 1st, 3rd, and 7th elements #compare each element of the vector with a value x < 37.0 #return only those elements of the vector that satisfy a specific condition x[ x < 37.0 ] ``` There are a number of functions that are useful to locate a value satidfying specific condition(s) ```{r comment=""} which.max(x) # find the (first)maximum element and return its index which.min(x) which(x >= 37.0) # find the location of all the elements that satisfy a specific condition ```

### vector functions:

               max(x),   min(x),     sum(x)
               mean(x),  sd(), median(x),  range(x)
               var(x)                            - simple variance
               cor(x,y)                          - correlation between x and y
               sort(x), rank(x), order(x)
               cumsum(), cumprod(x), cummin(x), cumprod(x)
               duplicated(x), unique(x)
               summary()

### Reading input files There are many R functions to process files with various formats: Some come from base R: * read.table() * read.csv() * read.delim() * read.fwf() * scan() and many others There are many R packages which provide additional functionality to read files. ```{r} # Read regular csv file: salt <- read.csv("http://rcs.bu.edu/classes/FC764/intersalt.csv") ``` This data contains median blood pressure, as a fuction of salt intake * b - numeric vector * bp - mean diastolic blood pressure (mm Hg) * na - mean sodium excretion (mmol/24h) * country ### Exploring R dataframes ```{r comment=""} #Look at the first and last few records head(salt) tail(salt) #Get the list of the columns names(salt) #Number of rows: nrow(salt) #Number of columns: ncol(salt) #Get the structure of the data: str(salt) #Get the summary of the data: summary(salt) # Get unique values of a vector unique(salt$country) #numeric data exploratory analysis min(salt$bp) max(salt$bp) range(salt$bp) summary(salt$bp) ``` Sometimes it is helpful to visualize the data ``` ```{r} boxplot( salt$bp ) plot(salt$na, salt$bp) ``` ### R packages ```{r eval=FALSE} #To install package install.packages("ggplot2") ``` Once package is installed it can be loaded and used: ```{r} #Load R package library(ggplot2) # Use boxplot from ggplot2 package ggplot(salt, aes( na, bp)) + geom_point() ``` ### dataframes slicing ```{r comments=""} # dataframe slicing (accessing elements) salt[2,3] # element, from the 2nd row, 3rd column salt[2,] # all elements from the 2nd row salt[ ,3] # all elements in the 3rd column salt$bp # accessing elemnts using variable name salt[salt$bp > 75, ] # list all the rows in the dataset for which variablebp is greater than 75 salt[salt$bp < 70, ] # list all the rows in the dataset for which variablebp is less than 70 ``` ### Create dataframe from vectors ```{r comments=""} organism<-c("Human","Mouse","Fruit Fly", "Roundworm","Yeast") genomeSizeBP<-c(3000000000,3000000000,135600000,97000000,12100000) estGeneCount<-c(30000,30000,13061,19099,6034) dt <- data.frame(organism=organism, genomesz=genomeSizeBP, genecount=estGeneCount) dt # Comute gene density (divide genome size by the genome count) dt$genomesz/dt$genecount # create a new column "base pairs" per gene dt$bp <- dt$genomesz/dt$genecount dt ``` ### R packages R contains many packages. Many of them can be found on the CRAN website :
[https://cran.r-project.org/web/packages/available_packages_by_name.html](https://cran.r-project.org/web/packages/available_packages_by_name.html)
Bioconductor contains many packages used in bioinformatics: [https://www.bioconductor.org/](https://www.bioconductor.org/) ```{r eval=FALSE} #To install package install.packages("dplyr") ``` After package is installed it has to be loaded before the functions it contains can be used: ```{r eval=FALSE} #load package library(dplyr) ``` *dplyr* package is a very useful package for data manipulation. * filter() to select cases based on their values. * arrange() to reorder the cases. * select() and rename() to select variables based on their names. * mutate() and transmute() to add new variables that are functions of existing variables. * summarise() to condense multiple values to a single value. * sample\_n() and sample\_frac() to take random samples. The following webpage contains easy to folow guide for this package:
[https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)

### Missing Values ```{r comment=""} x <- c(734, 145, NA, 456, NA) # define a numeric vector anyNA(x) # are there any missing data ``` To check which values are missing use *is.na()* function: ```{r comment=""} x == NA # this does not work ! - missing value cannot be compared to anything is.na(x) # check if the element in the vector is missing which(is.na(x)) # which elements are missing ``` By default mamy statistical functions will not compute if the data contain missing values: ```{r comment=""} mean(x) ``` To view the arguments that need to be used to remove missing data, read help topic for the function: ```{r eval=FALSE} ?mean ``` ```{r comment=""} #Perform computation removing the missing data mean(x, na.rm=TRUE) ```

### Hypothesis Testing in R This dataset consists of gene expression measurements for 10 genes under control and reatment conditions with 4 replicates each ```{r comment=""} genht <- read.csv("/project/scv/classes/FC764/geHT.csv") str(genht) #All control values control <- c(genht$c1, genht$c2, genht$c3, genht$c4 ) # Perform One sample t-test with a Null hypothesis that the true mean is 2000 t.test(control, mu=2000) ``` The test result with a p-value of 0.03583 rejects the null hypothesis at a critical value (alpha level) of 0.05 ```{r comment=""} # 2-sample t-test treat <- c(genht$t1, genht$t2, genht$t3, genht$t4 ) t.test(control, treat) ``` The p-value for this test very strongly rejects the null hypothesis of no difference between the mean of the treatment group and control group ```{r comment=""} # paired t-test #Suppose that control and treatment values were "before" and "after" values for each gene. We could then perform a paired t-test t.test(control, treat, paired=T) ``` ### Useful resources [BUMC SPH Basic Statistical Analysis Using the R Statistical Package](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R-Manual/R-Manual-TOC.html) [R for applied epidemiology and public health](https://epirhandbook.com/) [What statistical analysis to use](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/) [Examples of statistical analysis in R ](https://stats.idre.ucla.edu/r/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-r/)