---
title: "Introduction to R"
subtitle: Summer Training for Research Scholar Program
format: html
editor: visual
toc: true
---

```{css, echo=FALSE}
.output {
  background-color: #FDF8E9;
  border: 1px solid #e1d8bd;
}
```

## Installation

1. Install R using the appropriate link from [The Comprehensive R Archive Network](http://cran.r-project.org/).
2. Download and install [RStudio](http://www.rstudio.com/ide) for free. Just click the “Download RStudio” button and follow the instructions.

## Basics

When you type a command at the prompt of the *Console* window and hit Enter, your computer executes the command and shows you the results. Then RStudio displays a new prompt for your next command. For example, if you type 2 + 3 and hit Enter, RStudio will display:

```{r class.output="output"}
2 + 3
```

or

```{r class.output="output"}
21 / (4 + 3)
```

You can use common functions, like `log()`, `sin()`, `cos()` and many others:

```{r class.output="output"}
log(2)
```

## Objects (or variables)

You can assign values to `R` objects using the `<-` symbol:

```{r}
height <- 70
```

`R` is case sensitive, so the variables `Height` and `height` are two distinct objects.

You can define an `R` object that stores multiple values. For example, you can create a vector that contains a sequence of integer values (notice that both the start and end values are included):

```{r class.output="output"}
id <- 1:10
id
```

Avoid using the names c, t, cat, F, T, D, as those are built-in functions/constants. A variable name can contain letters, digits, underscores and dots, and must start with a letter or a dot. A variable name cannot contain a dollar sign or other special characters.

```{r}
str.var  <- "oxygen"   # character variable
num.var  <- 15.99      # numeric variable
bool.var <- TRUE       # Boolean (or logical) variable
bool.var.1 <- F        # logical values can be abbreviated
```

## Vectors

A vector is an array of values of the same type. Vectors can be numeric, character or logical:

```{r}
# Function c() is used to create a vector from a list of values
num.vector <- c(5, 7, 9, 11, 4, -1, 0)

# Numeric vectors can be defined in a number of ways:
vals1 <- c(2, -7, 5, 3, -1)         # concatenation
vals2 <- 25:75                      # range of values
vals3 <- seq(from=0, to=3, by=0.5)  # sequence definition
vals4 <- rep(1, times=7)            # repeat a value
vals5 <- rnorm(5, mean=2, sd=1.5)   # normally distributed values
```

## R Functions

There are many functions that come with the R installation:

```{r class.output="output"}
mean(1:99)
```

## R Help

You can access a function's help topic in two ways:

```{r}
#| eval: false
?sd
help(sd)
```

If you do not know the function's name, you can search for a topic:

```{r}
#| eval: false
??"standard deviation"
help.search("standard deviation")
```

## Installing `R` packages

There are more than 20 thousand packages published officially on CRAN. You can search them [by name](https://cran.r-project.org/web/packages/available_packages_by_name.html) or [by topic](https://cran.r-project.org/web/views/). There are also many packages published on the [Bioconductor site](https://www.bioconductor.org/).

To install an R package from CRAN, use the `install.packages("package_name")` command.
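A handy pattern, sketched below with the base R function `requireNamespace()`, is to install a package only when it is not already present (the package name here is just an illustration):

```{r}
#| eval: false
# Install a package only if it is not already installed
# ("table1" is used here only as an illustrative name)
if (!requireNamespace("table1", quietly = TRUE)) {
  install.packages("table1")
}
```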
For example, you may want to install the very popular `tidyverse` collection of packages used by many data scientists:

```{r}
#| eval: false
install.packages("tidyverse")
```

Another very handy package is *table1*:

```{r class.output="output"}
#| eval: false
# To install the package
install.packages("table1")
```

Once a package is installed, it can be loaded and used:

```{r class.output="output"}
#| warning: false
# Load the R package
library(table1)
```

## Vector operations in R

```{r class.output="output"}
# Define a vector with values of body temperature in Fahrenheit
ftemp <- c(97.8, 99.5, 97.9, 102.1, 98.4, 97.7)

# Convert them to a vector with values in Celsius
ctemp <- (ftemp - 32) * 5 / 9
print(ctemp)
```

You can also perform operations on two or more vectors:

```{r class.output="output"}
# Define values for body weight and height (in kg and meters)
weight <- c(65, 80, 73, 57, 84)
height <- c(1.65, 1.80, 1.73, 1.68, 1.79)

# Calculate BMI
weight/height^2
```

## Vector Slicing (subsetting)

```{r class.output="output"}
x <- c(36.6, 38.2, 36.4, 37.9, 41.0, 39.9, 36.8, 37.5)
x[2]        # returns the second element
x[2:4]      # returns the 2nd through 4th elements inclusive
x[c(1,3,5)] # returns the 1st, 3rd and 5th elements

# compare each element of the vector with a value
x < 37.0

# return only those elements of the vector that satisfy a specific condition
x[x < 37.0]
```

There are a number of functions that are useful to locate a value satisfying specific condition(s):

```{r class.output="output"}
which.max(x)     # find the (first) maximum element and return its index
which.min(x)
which(x >= 37.0) # find the locations of all elements that satisfy a condition
```

## Useful Functions
* `max(x)`, `min(x)`, `sum(x)`
* `mean(x)`, `sd(x)`, `median(x)`, `range(x)`
* `var(x)` - simple variance
* `cor(x,y)` - correlation between x and y
* `sort(x)`, `rank(x)`, `order(x)` - sorting
* `duplicated(x)`, `unique(x)` - unique and repeated values
* `summary(x)`
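For example, applying a few of these functions to the vector `x` defined in the previous section:

```{r class.output="output"}
# Apply some of the summary functions to the vector x defined above
max(x)
sort(x)
summary(x)
```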
## Missing Values

Missing values in R are indicated with the symbol `NA`:

```{r class.output="output"}
x <- c(734, 145, NA, 456, NA)

# check if there are any missing values:
anyNA(x)
```

To check which values are missing, use the `is.na()` function:

```{r class.output="output"}
is.na(x)        # check if each element of the vector is missing
which(is.na(x)) # which elements are missing
```

By default, statistical functions will not compute if the data contain missing values:

```{r class.output="output"}
mean(x)
```

To view the arguments that need to be used to remove missing data, read the help topic for the function:

```{r eval=FALSE}
?mean
```

```{r class.output="output"}
# Perform the computation removing missing data
mean(x, na.rm=TRUE)
```

## Reading input files

There are many R functions to process files with various formats. Some come from base R:

* read.table()
* read.csv()
* read.delim()
* read.fwf()
* scan()

and many others.

```{r}
# Read a regular csv file:
salt <- read.csv("http://rcs.bu.edu/classes/STaRS/intersalt.csv")
```

There are a few R packages which provide additional functionality to read files.

```{r class.output="output"}
# Install the packages first
# You can install packages from the RStudio menu (Tools -> Install packages)
# or by executing the following R command:
#install.packages("foreign")

# Load the R package "foreign"
library(foreign)

# Read data in Stata format
swissdata <- read.dta("http://rcs.bu.edu/classes/STaRS/swissfile.dta")

# Load the R package haven to read SAS-formatted files (make sure it is installed!)
#install.packages("haven")
library(haven)

# Read data in SAS format
fhsdata <- read_sas("http://rcs.bu.edu/classes/STaRS/fhs.sas7bdat")
```

## Exploring R dataframes

There are a few very useful commands to explore R's dataframes:

```{r class.output="output"}
head(fhsdata)
tail(fhsdata)
str(fhsdata)
summary(fhsdata)

# In this dataset, "999" is used to indicate missing values
fhsdata$HEARTRTE[fhsdata$HEARTRTE==999] <- NA
fhsdata$CIGPDAY[fhsdata$CIGPDAY==999] <- NA
summary(fhsdata)
```

We can use the *table1* package to get various summaries stratifying by one or more variables:

```{r class.output="output"}
#| warning: false
# One level of stratification
table1(~ SEX + AGE + TOTCHOL | DIABETES, data=fhsdata)
```

```{r class.output="output"}
#| warning: false
# Two levels of stratification (nesting)
table1(~ AGE + TOTCHOL + HDLC | DIABETES*SEX, data=fhsdata)
```

To see more examples of *table1* package usage, see [https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html](https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html).

## Correlation

To explain this test, we will use the *salt* dataset we imported earlier:

```{r}
str(salt)
```

The correlation test is used to test the linear relationship of 2 continuous variables. We can first display the two variables using a scatter plot:

```{r}
plot(x = salt$sodium,
     y = salt$bp,
     xlab = "mean sodium excretion (mmol/24h)",
     ylab = "mean diastolic blood pressure (mm Hg)",
     main = "Diastolic Blood Pressure vs Sodium Excretion")
```

As we can see in this plot, there is little correlation between these 2 variables, and there are a few "outlier" points that will affect the correlation calculation. Let's perform the test:

```{r class.output="output"}
cor.test(salt$bp, salt$sodium)
```

The **Null Hypothesis**: the true correlation between bp and sodium (blood pressure and salt intake) is 0 (they are independent).
The **Alternative Hypothesis**: the true correlation between bp and sodium is not equal to 0.
The *p-value* is less than 0.05 and the 95% CI does not contain 0, so at the significance level of 0.05 we reject the null hypothesis and state that there is some (positive) correlation (0.359) between these 2 variables.
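As a side note, `cor.test()` returns a list-like object, so these quantities can also be extracted programmatically rather than read off the printed output:

```{r class.output="output"}
# Store the test result and pull out individual components
res <- cor.test(salt$bp, salt$sodium)
res$estimate # the correlation coefficient r
res$p.value  # the p-value
res$conf.int # the 95% confidence interval
```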
A common rule of thumb for interpreting the strength of a correlation:

* 0 < |r| <= 0.3 - weak correlation
* 0.3 < |r| <= 0.7 - moderate correlation
* |r| > 0.7 - strong correlation
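If you only need the coefficient itself to compare against these thresholds, the base R function `cor()` computes it directly:

```{r class.output="output"}
# Correlation coefficient only, without the test
cor(salt$bp, salt$sodium)
```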
**Important**:

* The order of the variables in the test is not important
* Correlation provides evidence of association, not causation!
* Correlation values are always between -1 and 1 and do not change if the units of either or both variables change
* Correlation describes a linear relationship
* Correlation is strongly affected by outliers (extreme observations)

In this case the correlation is weak, and there are a few points that significantly affect the result.

## Linear Model

```{r class.output="output"}
lm.res <- lm(bp ~ sodium, data = salt)
summary(lm.res)
```

Here we estimated the relationship between the predictor (*sodium*) and response (*bp*) variables. The summary statistics report a number of things. The *p-value* tells us if the model is statistically significant. In linear regression, the *Null Hypothesis* is that the coefficients associated with the variables are equal to zero. The multiple R-squared value is equal to the square of the correlation value we calculated in the previous test.

When assessing how well the model fits the data:

* R-squared - the higher, the better
* F-statistic - the higher, the better
* Std. Error - the closer to 0, the better

## One Sample t-Test

This test is used to test the mean of a sample from a normal distribution:

```{r class.output="output"}
t.test(salt$bp, mu=70)
```

The *null hypothesis*: the true mean is equal to 70.
The *alternative hypothesis*: the true mean is not equal to 70.

Since the *p-value* (1.703e-05) is less than 0.05 and the 95% CI does not contain the value 70, we can reject the *null hypothesis*.

## Two Sample t-Test

Let's load another dataset. It comes with R. This dataset shows the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients. There are 3 variables:

* extra: (numeric) increase in hours of sleep
* group: (factor) categorical variable indicating which drug is given
* ID: (factor) patient ID

```{r class.output="output"}
data(sleep)
head(sleep)
summary(sleep)
```

To compare the means of the 2 samples:

```{r class.output="output"}
boxplot(extra ~ group, data = sleep)
t.test(extra ~ group, data = sleep)
```

Here the *Null Hypothesis* is that the true difference between the 2 groups is equal to 0, and the *Alternative Hypothesis* is that it is not equal to 0. In this test the *p-value* is above the significance level of 0.05 and the 95% CI contains 0, so we cannot reject the null hypothesis.

Note that `t.test()` has a number of options, including *alternative*, which can be set to "two.sided", "less", or "greater", depending on which test you would like to perform. Using the option *var.equal* you can also specify whether the variances of the values in each group are equal or not.
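For example, a one-sided test that also assumes equal group variances could be requested as sketched below (whether these options are appropriate depends on your study design):

```{r}
#| eval: false
# One-sided alternative: is the mean of group 1 less than that of group 2?
t.test(extra ~ group, data = sleep, alternative = "less", var.equal = TRUE)
```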
## One-way ANOVA test

The one-way analysis of variance (*ANOVA*) test is an extension of the two-sample t-test for comparing means in datasets with more than 2 groups.

**Assumptions**:

* The observations are obtained independently and randomly from the population defined by the categorical variable
* The data of each category are normally distributed
* The populations have a common variance

```{r class.output="output"}
data("PlantGrowth")
head(PlantGrowth)
summary(PlantGrowth)
boxplot(weight~group, data=PlantGrowth)
```

As visible from the side-by-side boxplots, there is some difference in the weights of the 3 groups, but we cannot determine from the plot whether this difference is significant.

```{r class.output="output"}
aov.res <- aov(weight~group, data=PlantGrowth)
summary(aov.res)
```

As we can see from the summary output, the *p-value* is 0.0159 < 0.05, which indicates that there is a statistically significant difference in weight between these groups.

We can check the confidence intervals for the treatment parameters:

```{r class.output="output"}
confint(aov.res)
```

**Important**: In a one-way ANOVA test a small p-value indicates that some of the group means are different, but it does not say which ones! For multiple pairwise comparisons we use the **Tukey** test:

```{r class.output="output"}
TukeyHSD(aov.res)
```

This result indicates a significant difference between treatment 1 and treatment 2, with an adjusted *p-value* of 0.012.

## Chi-Squared test of independence in R

The chi-squared test is used to analyze the frequency table (or contingency table) formed by two categorical variables. It evaluates whether there is a significant association between the categories of the two variables.

```{r class.output="output"}
treat <- read.csv("http://rcs.bu.edu/classes/STaRS/treatment.csv")
head(treat)
summary(treat)
```

In the above dataset there are 2 categorical variables:

* treated - 0 or 1
* improved - 0 or 1

We would like to test if there is an improvement after the treatment, in other words, if these 2 categorical variables are dependent. First let's take a look at the tables:

```{r class.output="output"}
# Frequency table: "treated" will be rows, "improved" - columns
table(treat$treated, treat$improved)

# Proportion table
prop.table(table(treat$treated, treat$improved))
```

We can visualize this contingency table with a mosaic plot:

```{r class.output="output"}
tbl <- table(treat$treated, treat$improved)
mosaicplot(tbl, color = 2:3, las = 2, main = "Improvement vs. Treatment")
```

Compute the chi-squared test:

```{r class.output="output"}
chisq.test(tbl)
```

The *Null Hypothesis* is that the *treated* and *improved* variables are independent. In the above test the *p-value* is 0.03083, so at the significance level of 0.05 we can reject the null hypothesis.
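As a side note, when some expected cell counts are small, `fisher.test()` is a common exact alternative to the chi-squared test; it can be run on the same contingency table (a sketch):

```{r}
#| eval: false
# Fisher's exact test on the same contingency table
fisher.test(tbl)
```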
## Additional resources

1. [R for applied epidemiology and public health](https://epirhandbook.com/en/)
2. [R for data science](https://r4ds.had.co.nz/)
3. [The R Graph Gallery](https://r-graph-gallery.com/)
4. [FAQ in R and other R topics by UCLA](https://stats.oarc.ucla.edu/r/)
5. [Which Statistical Test to use](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/)