---
title: "Introduction to R (Part 1)"
author: "PiBS Professional Skill Set"
output: html_document
---
### R as a scientific calculator
```{r comment=""}
2+3 # addition
2^3 # power
log(2) # built-in functions
```
### R variables
```{r comment=""}
a <- 3
b = -5 # both assignment operators are equivalent, the first one is more traditional
A <- 7 # R is case sensitive
```
Avoid using names c, t, cat, F, T, D as those are built-in functions/constants
The variable name can contain letters, digits, underscores and dots and
start with the letter or dot
The variable name cannot contain dollar sign or other special characters.
```{r comment=""}
str.var <- "oxygen" # character variable
num.var <- 15.99 # numerical variable
bool.var <- TRUE # boolean (or logical) variable
bool.var.1 <- F # logical values can be abbriviated
rna.str <- "TTAGAATT" # string variable
mode(rna.str) # get the type (or mode) of the object
nchar(rna.str) # get the lenght of the string
```
### R help
```{r eval=FALSE}
# Access help file for the R function if the name of the function is known
?sd
help(sd)
# Search for help
??"standard deviation"
help.search("standard deviation")
help.search("analysis of variance")
```
### R vectors
Vector is an array of R objects of the same type:
Vectors can be numeric, character or logical.
```{r comment=""}
# Function c() is used to create a vector from a list of values
num.vector <- c( 5,7, 9,11, 4, -1, 0)
# Numeric vectors can be defined in a number of ways:
vals1 <- c (2, -7, 5, 3, -1 ) # concatenation
vals2 <- 25:75 # range of values
vals3 <- seq(from=0, to=3, by=0.5) # sequence definition
vals4 <- rep(1, times=7) # repeat value
vals5 <- numeric(9) # initialize numeric vector with zeros
vals6 <- rnorm(5, mean=2, sd=1.5 ) # returns normally distributed values
# view the values of the vector
print(vals2)
```
R allows to operate on whole vectors:
```{r comment=""}
# Define 2 vectors ( first with "distance" values and one with "time valuse")
dist <- c(6, 10, 5, 8, 15, 9)
time <- c(0.5, 0.9, 0.4,0.7,1.2,0.85)
# Compute velocity values for each dist-time pair:
velocity <- dist/time # each element of the first vector will be divided by the element from the second vector with the same index
print(velocity)
# Define a vector with values of body temperature in Fahrenheit
ftemp <- c(97.8, 99.5, 97.9, 102.1, 98.4, 97.7)
# Convert ftem to the vector with values in Celsius
ctemp <- (ftemp -32)/9*5
print(ctemp)
```
### R operators
+, -, *, / - addition, subtraction, multiplication, division
^ or ** - exponentiation
%% - modulus
%/% - integer division
%in% - membership
: - sequence genration
<, <=, ==, >=, >, != - boolean comparative
|, || - OR vectorized/not vectorized
&, && - AND
### vector slicing (subsetting)
```{r comment=""}
x <- c(36.6, 38.2, 36.4, 37.9, 41.0, 39.9, 36.8, 37.5) # define a numeric vector
x[2] # returns second element
x[2:4] # returns second through 4th elements inclusive
x[c(1,3,5)] # returns 1st, 3rd and 5th elements
x[-2] # returns all but 2nd element
x[c(TRUE, FALSE, TRUE, FALSE, FALSE,FALSE, TRUE, FALSE)] # returns 1st, 3rd, and 7th elements
#compare each element of the vector with a value
x < 37.0
#return only those elements of the vector that satisfy a specific condition
x[ x < 37.0 ]
```
There are a number of functions that are useful to locate a value satidfying specific condition(s)
```{r comment=""}
which.max(x) # find the (first)maximum element and return its index
which.min(x)
which(x >= 37.0) # find the location of all the elements that satisfy a specific condition
```
### vector functions:
max(x), min(x), sum(x)
mean(x), sd(), median(x), range(x)
var(x) - simple variance
cor(x,y) - correlation between x and y
sort(x), rank(x), order(x)
cumsum(), cumprod(x), cummin(x), cumprod(x)
duplicated(x), unique(x)
summary()
### Reading input files
There are many R functions to process files with various formats:
Some come from base R:
* read.table()
* read.csv()
* read.delim()
* read.fwf()
* scan()
and many others
There are many R packages which provide additional functionality to read files.
```{r}
# Read regular csv file:
salt <- read.csv("http://rcs.bu.edu/classes/FC764/intersalt.csv")
```
This data contains median blood pressure, as a fuction of salt intake
* b - numeric vector
* bp - mean diastolic blood pressure (mm Hg)
* na - mean sodium excretion (mmol/24h)
* country
### Exploring R dataframes
```{r comment=""}
#Look at the first and last few records
head(salt)
tail(salt)
#Get the list of the columns
names(salt)
#Number of rows:
nrow(salt)
#Number of columns:
ncol(salt)
#Get the structure of the data:
str(salt)
#Get the summary of the data:
summary(salt)
# Get unique values of a vector
unique(salt$country)
#numeric data exploratory analysis
min(salt$bp)
max(salt$bp)
range(salt$bp)
summary(salt$bp)
```
Sometimes it is helpful to visualize the data
```
```{r}
boxplot( salt$bp )
plot(salt$na, salt$bp)
```
### R packages
```{r eval=FALSE}
#To install package
install.packages("ggplot2")
```
Once package is installed it can be loaded and used:
```{r}
#Load R package
library(ggplot2)
# Use boxplot from ggplot2 package
ggplot(salt, aes( na, bp)) + geom_point()
```
### dataframes slicing
```{r comments=""}
# dataframe slicing (accessing elements)
salt[2,3] # element, from the 2nd row, 3rd column
salt[2,] # all elements from the 2nd row
salt[ ,3] # all elements in the 3rd column
salt$bp # accessing elemnts using variable name
salt[salt$bp > 75, ] # list all the rows in the dataset for which variablebp is greater than 75
salt[salt$bp < 70, ] # list all the rows in the dataset for which variablebp is less than 70
```
### Create dataframe from vectors
```{r comments=""}
organism<-c("Human","Mouse","Fruit Fly", "Roundworm","Yeast")
genomeSizeBP<-c(3000000000,3000000000,135600000,97000000,12100000)
estGeneCount<-c(30000,30000,13061,19099,6034)
dt <- data.frame(organism=organism,
genomesz=genomeSizeBP,
genecount=estGeneCount)
dt
# Comute gene density (divide genome size by the genome count)
dt$genomesz/dt$genecount
# create a new column "base pairs" per gene
dt$bp <- dt$genomesz/dt$genecount
dt
```
### R packages
R contains many packages. Many of them can be found on the CRAN website :
[https://cran.r-project.org/web/packages/available_packages_by_name.html](https://cran.r-project.org/web/packages/available_packages_by_name.html)
Bioconductor contains many packages used in bioinformatics:
[https://www.bioconductor.org/](https://www.bioconductor.org/)
```{r eval=FALSE}
#To install package
install.packages("dplyr")
```
After package is installed it has to be loaded before the functions it contains can be used:
```{r eval=FALSE}
#load package
library(dplyr)
```
*dplyr* package is a very useful package for data manipulation.
* filter() to select cases based on their values.
* arrange() to reorder the cases.
* select() and rename() to select variables based on their names.
* mutate() and transmute() to add new variables that are functions of existing variables.
* summarise() to condense multiple values to a single value.
* sample\_n() and sample\_frac() to take random samples.
The following webpage contains easy to folow guide for this package:
[https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html)
### Missing Values
```{r comment=""}
x <- c(734, 145, NA, 456, NA) # define a numeric vector
anyNA(x) # are there any missing data
```
To check which values are missing use *is.na()* function:
```{r comment=""}
x == NA # this does not work ! - missing value cannot be compared to anything
is.na(x) # check if the element in the vector is missing
which(is.na(x)) # which elements are missing
```
By default mamy statistical functions will not compute if the data contain missing values:
```{r comment=""}
mean(x)
```
To view the arguments that need to be used to remove missing data, read help topic for the function:
```{r eval=FALSE}
?mean
```
```{r comment=""}
#Perform computation removing the missing data
mean(x, na.rm=TRUE)
```
### Hypothesis Testing in R
This dataset consists of gene expression measurements for 10 genes
under control and reatment conditions with 4 replicates each
```{r comment=""}
genht <- read.csv("/project/scv/classes/FC764/geHT.csv")
str(genht)
#All control values
control <- c(genht$c1, genht$c2, genht$c3, genht$c4 )
# Perform One sample t-test with a Null hypothesis that the true mean is 2000
t.test(control, mu=2000)
```
The test result with a p-value of 0.03583 rejects the null hypothesis at a critical
value (alpha level) of 0.05
```{r comment=""}
# 2-sample t-test
treat <- c(genht$t1, genht$t2, genht$t3, genht$t4 )
t.test(control, treat)
```
The p-value for this test very strongly rejects the null hypothesis of no
difference between the mean of the treatment group and control group
```{r comment=""}
# paired t-test
#Suppose that control and treatment values were "before" and "after" values for each gene. We could then perform a paired t-test
t.test(control, treat, paired=T)
```
### Useful resources
[BUMC SPH Basic Statistical Analysis Using the R Statistical Package](https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/R/R-Manual/R-Manual-TOC.html)
[R for applied epidemiology and public health](https://epirhandbook.com/)
[What statistical analysis to use](https://stats.idre.ucla.edu/other/mult-pkg/whatstat/)
[Examples of statistical analysis in R ](https://stats.idre.ucla.edu/r/whatstat/what-statistical-analysis-should-i-usestatistical-analyses-using-r/)