We will be using two packages:
microbenchmark: for measuring the time it takes to evaluate an expression.
profvis: for code profiling.
library(microbenchmark)
library(profvis)
Suppose we would like to measure the time it takes to run a particular line or lines within a script. We can use the system.time() function:
system.time( mad( runif(1000) ) )
#> user system elapsed
#> 0.001 0.000 0.001
The problem with this approach is that some functions run too fast to be measured this way. So we need to repeat them many times to see non-zero values in the output:
system.time( for(i in 1:100) mad( runif(1000) ) )
#> user system elapsed
#> 0.029 0.005 0.034
The user time is the CPU time charged for the execution of user instructions of the calling process. It includes the time needed for computation.
The system time is the CPU time charged for execution by the system on behalf of the calling process. It includes the time needed for memory allocation, accessing hardware (e.g., the network), reading and writing operations, etc. These operations are usually very time-consuming and should be minimized and used with care.
The elapsed time is the wall-clock time: the time from start to finish of the call.
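As a quick illustration (a minimal sketch; the exact numbers vary by machine), Sys.sleep() waits without computing, so it consumes almost no CPU time while the elapsed time reflects the full wait:
# user and system should be near zero; elapsed should be about 1 second
system.time( Sys.sleep(1) )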
If we are interested in measuring the elapsed time (and possibly comparing it with another implementation of the same task), a better approach is to use the microbenchmark() function from the package of the same name.
Note: Be careful: by default, the microbenchmark() function runs the expression a hundred times.
microbenchmark( mad( runif(1000) ), times=100 )
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> mad(runif(1000)) 150.111 162.4395 167.2235 165.6895 169.53 260.035 100
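microbenchmark() can also compare several expressions in one call, and its summary() method reports the results in a unit of our choice. A small sketch (the compared expressions here are purely illustrative):
mb <- microbenchmark( mad(runif(1000)), median(runif(1000)), times=50 )
summary(mb, unit = "ms")  # report the results in milliseconds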
You may read or hear that R loops are very slow. Indeed, the following loop seems to take forever to perform a very simple operation:
system.time({
  x <- c()
  for (i in 1:1e5) x <- c( x, i )
})
#> user system elapsed
#> 12.080 0.160 12.241
However, the real reason it takes so long to compute x here is the dynamic growing of x. Let's modify our code a little and print x after each iteration:
x <- c()
for (i in 1:7) {
  x <- c( x, i )
  print( paste("Iteration", i, " x: ", paste(x, collapse=", ")) )
}
#> [1] "Iteration 1 x: 1"
#> [1] "Iteration 2 x: 1, 2"
#> [1] "Iteration 3 x: 1, 2, 3"
#> [1] "Iteration 4 x: 1, 2, 3, 4"
#> [1] "Iteration 5 x: 1, 2, 3, 4, 5"
#> [1] "Iteration 6 x: 1, 2, 3, 4, 5, 6"
#> [1] "Iteration 7 x: 1, 2, 3, 4, 5, 6, 7"
During each iteration, R has to allocate a new, larger chunk of memory, copy the previous value of x into it, and then de-allocate the old memory. Operations with memory (especially memory allocation) are very time-consuming and should be minimized.
Let’s allocate the memory first and then fill out the array within the loop:
x <- numeric(7)
for (i in 1:7) {
  x[i] <- i
  print( paste("Iteration", i, " x: ", paste(x, collapse=", ")) )
}
#> [1] "Iteration 1 x: 1, 0, 0, 0, 0, 0, 0"
#> [1] "Iteration 2 x: 1, 2, 0, 0, 0, 0, 0"
#> [1] "Iteration 3 x: 1, 2, 3, 0, 0, 0, 0"
#> [1] "Iteration 4 x: 1, 2, 3, 4, 0, 0, 0"
#> [1] "Iteration 5 x: 1, 2, 3, 4, 5, 0, 0"
#> [1] "Iteration 6 x: 1, 2, 3, 4, 5, 6, 0"
#> [1] "Iteration 7 x: 1, 2, 3, 4, 5, 6, 7"
Measuring the time to fill out the 1e5-long x vector:
system.time({
  x <- numeric(1e5)
  for (i in 1:1e5) {
    x[i] <- i
  }
})
#> user system elapsed
#> 0.007 0.000 0.007
The same situation occurs when we use rbind() within a loop to add rows to a matrix or a data frame:
# slow!
matr <- NULL
system.time( for(i in seq(1e4)) matr <- rbind(matr, 1:10) )
#> user system elapsed
#> 0.745 0.036 0.781

# better
matr <- matrix(NA, nrow=1e4, ncol=10)
system.time( for(i in seq(1e4)) matr[i,] <- 1:10 )
#> user system elapsed
#> 0.006 0.000 0.005
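As an aside, this particular task needs no loop at all; a vectorized one-liner sketch builds the same matrix directly:
matr <- matrix(1:10, nrow=1e4, ncol=10, byrow=TRUE)  # each row is 1:10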
As we can see, the main problem was in the code, not in the loop! That said, a loop does carry some overhead, so whenever possible we should vectorize our code; still, a loop can be used safely when it is the appropriate tool.
Below is a simple case of taking advantage of R's vectorization. We compare two approaches to computing the logarithm of each element of a vector and then adding all the elements together:
x <- rnorm( 1e4, 100, 2 )

microbenchmark({
  # Slow method:
  total <- 0
  for(i in seq_along(x)) total <- total + log(x[i])
},
{
  # Faster:
  log_sum <- sum(log(x))
})
#> Unit: microseconds
#> expr
#> { total <- 0 for (i in seq_along(x)) total <- total + log(x[i]) }
#> { log_sum <- sum(log(x)) }
#> min lq mean median uq max neval cld
#> 2871.843 2928.1920 3171.2189 2971.4760 3333.7445 8537.560 100 b
#> 314.734 319.7805 335.6021 323.7565 357.1405 366.385 100 a
Let's list the operations R needs to perform using the loop approach:
- call the log() function 1e4 times, passing the input parameter each time;
- perform 1e4 additions and 1e4 assignments.
In contrast, the vectorized approach:
- calls the log() function only once;
- calls the sum() function once and performs the final assignment operation only once.
Similarly, we should avoid using an if() statement within a loop to compute one vector from another. The ifelse() function performs the same task much more efficiently:
x <- rnorm(1e5, 0, 1)
vec <- numeric(1e5)

microbenchmark(
  # Slow method: using if() within a loop
  "if:" = ( for(i in 1:length(vec)) if(x[i] < 0) vec[i] <- -1 else vec[i] <- 1 ),
  # Faster method: use the ifelse() function
  "ifelse:" = ( vec <- ifelse(x < 0, -1, 1) )
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> if: 8.998399 9.266198 9.427125 9.352911 9.45562 12.21461 100 b
#> ifelse: 2.332400 2.457369 3.378499 2.496164 4.11170 18.29508 100 a
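For a simple two-value recoding like this, even ifelse() can be beaten by plain logical indexing; a sketch under the same setup as above:
vec <- rep(1, length(x))  # start with the 'else' value
vec[x < 0] <- -1          # overwrite where the condition holds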
Tip: seq_along(x) is generally faster and safer to use than 1:length(x):
x <- rnorm(1e5)
microbenchmark( 1:length(x), seq_along(x) )
#> Unit: nanoseconds
#> expr min lq mean median uq max neval cld
#> 1:length(x) 224 240.5 336.85 257.5 332.0 3705 100 b
#> seq_along(x) 116 134.5 188.71 154.5 188.5 1856 100 a
In cases where the input vector x has zero length (as a result of a bug or an incorrect input), seq_along() will do the right thing:
x <- NULL

# Using 1:length(x), the loop will execute two times!
for ( i in 1:length(x) ) print(i)
#> [1] 1
#> [1] 0
# Using seq_along() the loop will execute 0 times
for ( i in seq_along(x) ) print(i)
Random number generation is a relatively slow operation. In this example we create the random numbers one at a time:
slow <- function(){
  vals <- numeric(1e5)
  for(i in 1:1e5) vals[i] <- rnorm(1)
}
And here we generate all the random numbers with a single rnorm() call:
fast <- function(){
  vals <- rnorm(1e5)
}
Let's use the microbenchmark() function to compare these two approaches:
microbenchmark(slow(), fast(), times=10 )
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> slow() 204.492434 205.735406 219.102406 207.380764 210.769643 321.73002 10
#> fast() 6.222917 6.232032 6.387098 6.238346 6.253368 7.72048 10
#> cld
#> b
#> a
A similar approach should be used when working with other functions, including sample().
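For instance, here is a hypothetical sketch (slow_sample() and fast_sample() are illustrative names, not from the text above) that draws all the values in a single sample() call instead of one value per iteration:
slow_sample <- function(){
  out <- numeric(1e4)
  for(i in 1:1e4) out[i] <- sample(100, 1)
  out
}
fast_sample <- function() sample(100, 1e4, replace=TRUE)
microbenchmark( slow_sample(), fast_sample(), times=10 )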
Note: There is an R package, dqrng, that can be considered for fast random number generation:
microbenchmark(slow(), fast(), "dqrng"=dqrng::dqrnorm(1e5), times=10)
#> Unit: microseconds
#> expr min lq mean median uq max
#> slow() 205924.259 207621.974 208790.460 208738.3230 210418.033 211231.077
#> fast() 6188.793 6238.763 6250.755 6246.2280 6263.155 6315.276
#> dqrng 804.661 815.949 9780.719 845.5415 855.673 90288.405
#> neval cld
#> 10 b
#> 10 a
#> 10 a
When performing matrix multiplication, consider using parentheses to minimize the number of computations the computer needs to perform. For example, matrix multiplication is associative, so the grouping of a matrix-matrix-vector product can be changed without affecting the result:
# matrix A * matrix B * vector v
A <- matrix(rnorm(10000), nrow = 100)
B <- matrix(rnorm(10000), nrow = 100)
v <- rnorm(100)
microbenchmark(
  result1 <- A %*% B %*% v,
  result2 <- A %*% (B %*% v) )
#> Unit: microseconds
#> expr min lq mean median uq max
#> result1 <- A %*% B %*% v 258.237 260.592 264.35291 264.0890 266.8995 276.933
#> result2 <- A %*% (B %*% v) 28.264 28.842 29.34891 29.0685 29.2440 43.812
#> neval cld
#> 100 b
#> 100 a
# Check if results are equal
all.equal(result1, result2)
#> [1] TRUE
In the first case, we first compute the matrix-matrix product, which yields a matrix; this matrix is then multiplied by the vector. In the second case, we first multiply B by the vector v, which yields a vector, and then multiply A by that vector. Replacing the expensive matrix-matrix product with two matrix-vector products reduces the number of scalar multiplications by a factor of about fifty (half the number of rows in the matrices).
When appropriate, use crossprod() and tcrossprod() for computing cross products. They are slightly faster than t(x) %*% y (crossprod) and x %*% t(y) (tcrossprod):
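A minimal sketch of the comparison (timings omitted; the gap depends on the BLAS library and matrix size):
x <- matrix(rnorm(1e4), nrow=100)
y <- matrix(rnorm(1e4), nrow=100)
microbenchmark( t(x) %*% y, crossprod(x, y) )
all.equal( t(x) %*% y, crossprod(x, y) )  # same result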
# Define a matrix and a data.frame
matr <- matrix(runif(1e4), nrow=100)
df <- as.data.frame(matr)
R provides two optimized functions, rowMeans() and rowSums(), to calculate the row means and row sums of a matrix or a data frame:
microbenchmark(
apply(matr, 1, mean),
rowMeans(matr) )
#> Unit: microseconds
#> expr min lq mean median uq max
#> apply(matr, 1, mean) 570.057 580.6060 589.59984 587.9365 595.4905 710.220
#> rowMeans(matr) 24.747 25.5735 26.89985 27.2620 27.7635 36.419
#> neval cld
#> 100 b
#> 100 a
Similarly, for a data frame the built-in rowMeans() is significantly faster:
microbenchmark(
apply(df, 1, mean),
rowMeans(df) )
#> Unit: microseconds
#> expr min lq mean median uq max
#> apply(df, 1, mean) 1153.350 1166.525 2580.0438 1193.2945 1258.560 134939.266
#> rowMeans(df) 465.388 481.476 523.4131 509.6755 521.514 693.704
#> neval cld
#> 100 a
#> 100 a
We see a similar time improvement with rowSums():
microbenchmark(
apply(df, 1, sum),
rowSums(df) )
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> apply(df, 1, sum) 856.596 881.8645 977.7758 895.1445 927.1920 8385.386 100
#> rowSums(df) 461.338 476.5985 493.4157 487.2330 513.5875 564.414 100
#> cld
#> b
#> a
When we need to work with columns, colMeans() and colSums() provide the same time improvement for matrices. However, when we work with a data frame, we should consider using the lapply() function:
matr <- matrix(runif(1e4), ncol=100)
df <- as.data.frame(matr)
microbenchmark(
apply(df, 2, mean),
colMeans(df),
lapply(df, mean),
times=10
)
#> Unit: microseconds
#> expr min lq mean median uq max
#> apply(df, 2, mean) 1066.836 1077.395 1101.7095 1085.0135 1109.214 1232.229
#> colMeans(df) 458.012 466.079 486.0774 470.7885 477.485 628.296
#> lapply(df, mean) 278.574 283.286 287.4123 285.4715 293.646 298.747
#> neval cld
#> 10 c
#> 10 b
#> 10 a
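If a plain (named) numeric vector rather than a list is needed, sapply() or the type-checked vapply() can be used on the columns; a small sketch:
means <- vapply(df, mean, numeric(1))  # one numeric value per column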
When working with a data frame, use column names, whenever possible, to access columns or individual values:
microbenchmark(
"[50, 3]" = df[50,3],
"$V3[50]" = df$V3[50]
)
#> Unit: nanoseconds
#> expr min lq mean median uq max neval cld
#> [50, 3] 11107 11761.5 12669.46 12035.5 12529.5 52854 100 b
#> $V3[50] 774 1066.0 1352.95 1233.0 1435.5 12708 100 a
A data frame in R is also a list: a list of columns, with each column stored separately in memory. The most efficient way to extract a column from a data frame is by using the column name:
microbenchmark(matr[ ,7], df[ ,7], df$V7)
#> Unit: nanoseconds
#> expr min lq mean median uq max neval cld
#> matr[, 7] 942 1296.0 1494.64 1433.5 1575.5 6616 100 a
#> df[, 7] 6715 7574.5 8540.65 8047.5 8533.0 43265 100 b
#> df$V7 747 938.5 1130.54 1080.0 1230.5 5087 100 a
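When the column name is stored in a variable, the [[ operator gives the same fast single-column extraction as $; a sketch:
col <- "V7"
microbenchmark( df$V7, df[[col]] )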
At the same time, accessing rows of a data frame is a very costly operation, as it requires extracting an element from each column and then creating a new data frame with a single row. Elements of a matrix, in contrast, are stored contiguously in memory as one long vector, so accessing the elements of a row and creating an output vector is very fast.
microbenchmark(matr[7, ], df[7, ])
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> matr[7, ] 1.253 2.1650 2.64029 2.7290 2.9725 6.076 100 a
#> df[7, ] 560.138 574.3515 590.34775 579.5385 595.5235 664.841 100 b
If you are working with all-numeric data, a matrix is a faster R object to use than a data.frame. Creating a data.frame and accessing its rows takes significantly more time than creating a matrix or accessing its elements.
# use a data.frame to return the result of a function
d <- function() {
  data.frame(
    col1 = sample(1:10, 30, replace = TRUE)
  )
}

# use a matrix to return the result of a function
m <- function() {
  matrix(sample(1:10, 30, replace = TRUE), nrow=30)
}

# compare the two approaches
microbenchmark(
  df = d(),
  mat = m()
)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> df 130.034 134.467 155.33058 135.7825 138.6015 1788.198 100 b
#> mat 7.121 8.341 30.29864 9.7170 13.3075 1960.421 100 a
pmax() and pmin() vs. max() and min()
max() and min() return the maximum or minimum of all the values present in their arguments.
pmax() and pmin() take one or more vectors (or matrices) as arguments and return a single vector giving the parallel maxima (or minima) of the vectors. The first element of the result is the maximum (minimum) of the first elements of all the arguments, the second element of the result is the maximum (minimum) of the second elements of all the arguments, and so on:
x <- c(4, 19, 23, 3, 20)
y <- c(9, 24, 21, 4, NA)
z <- pmax(x, y, 5)  # vector as long as the longer of x and y,
                    # where z[i] is the max of x[i], y[i], and 5
z
#> [1] 9 24 23 5 NA
When we need to compute row maxima or minima for a matrix or a data frame, the best choice between max()/min() and pmax()/pmin() depends on the shape of the data. If the data frame is "long" (relatively few columns and many more rows), pmax() and pmin() applied across the columns are much more efficient:
library(matrixStats)
x <- matrix(rnorm(1e5, 0, 1), ncol=10)
df <- as.data.frame(x)

microbenchmark(
  apply(df, 1, max),
  do.call( pmax, df ),
  rowMaxs(as.matrix(df)),  # from the matrixStats package
  times = 3)
#> Unit: microseconds
#> expr min lq mean median uq
#> apply(df, 1, max) 16939.440 17061.514 17148.314 17183.587 17252.751
#> do.call(pmax, df) 781.224 785.802 787.852 790.380 791.166
#> rowMaxs(as.matrix(df)) 964.132 972.653 1005.031 981.174 1025.481
#> max neval cld
#> 17321.915 3 b
#> 791.952 3 a
#> 1069.787 3 a
However, if the data frame has a large number of columns (is "wide"), then pmax() and pmin() may no longer be suitable for computing the row maxima and minima:
x <- matrix(rnorm(1e7, 0, 1), nrow=10)
df <- as.data.frame(x)

microbenchmark(
  apply(df, 1, max),
  do.call( pmax, df ),
  times = 3)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> apply(df, 1, max) 5.929419 6.316609 6.627789 6.703799 6.976974 7.250150 3
#> do.call(pmax, df) 2.648240 3.067606 3.289333 3.486972 3.609880 3.732787 3
#> cld
#> b
#> a
Give the for loop a [condition] argument
Consider the following data:
df <- data.frame( age = runif(10^3, min=0, max=100),
                  body.temp = runif(10^3, min=35.5, max=42),
                  bmi = rnorm(10^3, mean=24, sd=2)
)
We want to compute a vector that, for people with a BMI greater than 27, holds the result of some analysis, while for everyone else it is set to -1:
# a slow method
method1 <- function(){

  # Allocate the memory and set all values to -1
  result <- rep( -1, nrow(df) )

  # Loop through all rows of df and compute result
  for ( i in 1:nrow(df) ){
    if ( df$bmi[i] > 27 ){
      result[i] <- if ( df$age[i] > 18 ) mean(rnorm(1000, mean=df$body.temp[i])) else median(rnorm(1000, mean=df$body.temp[i]))
    }
  }
  result
}
We can first compute a logical vector that is TRUE for the rows where the BMI value is greater than 27, and then execute the loop only for the iterations where the condition is TRUE:
# a faster method
method2 <- function(){

  # Allocate the memory and set all values to -1
  result <- rep( -1, nrow(df) )
  condition <- df$bmi > 27

  # Loop only through the rows of df where the condition holds
  for ( i in (1:nrow(df))[condition] ){
    result[i] <- if ( df$age[i] > 18 ) mean(rnorm(1000, mean=df$body.temp[i])) else median(rnorm(1000, mean=df$body.temp[i]))
  }
  result
}
Let's compare the two approaches:
microbenchmark( method1(), method2() )
#> Unit: nanoseconds
#> expr min lq mean median uq max neval cld
#> method1() 7120937 7146574.0 7267934.94 7159550 7169243 17865592 100 c
#> method2() 1952371 1964580.0 2068049.88 1971768 1978381 11443255 100 b
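An equivalent and arguably clearer way to restrict the loop index is which(); a sketch (the loop body is the same as in method2()):
idx <- which(df$bmi > 27)  # integer positions where the condition holds
for ( i in idx ){
  # ... same body as in method2() ...
}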
Make your code clear and simple. Explore the functions' arguments and documentation. Avoid extra calculations and setting unnecessary attributes.
For example, the unlist() function by default returns a named vector. In many cases, we are interested in the values only. Setting use.names=FALSE will result in better performance:
# Create a list
my.list <- split( sample(1000, size = 100000, replace = TRUE), rep(1:10000, times = 10) )

# Use the unlist function to convert the list to a vector
microbenchmark(
  v <- unlist(my.list),
  v <- unlist(my.list, use.names = FALSE) )
#> Unit: microseconds
#> expr min lq mean
#> v <- unlist(my.list) 29126.201 29151.1735 29382.2819
#> v <- unlist(my.list, use.names = FALSE) 252.199 255.9045 260.1722
#> median uq max neval cld
#> 29167.0950 29190.377 41972.032 100 b
#> 258.4715 262.078 300.894 100 a
Let's use the built-in Rprof() profiler to find the bottleneck in a function that drops the rows of a matrix that contain missing values and then computes the row means:
x <- matrix(rnorm(10000000), nrow=1000)
x[ sample(1:10000000, 1000, replace=FALSE) ] <- NA

slowFun = function(x) {
  # initialize res
  res = NULL
  n = nrow(x)

  for (i in 1:n) {
    if ( !any(is.na(x[i,])) ) res = rbind(res, x[i,])
  }
  apply(res, 1, mean)
}
# Start profiler
Rprof(interval = 0.01)

# run code
x.nomiss <- slowFun(x)
# end profiler
Rprof(NULL)
# view results
summaryRprof()
#> $by.self
#> self.time self.pct total.time total.pct
#> "rbind" 10.15 94.33 10.15 94.33
#> "slowFun" 0.25 2.32 10.76 100.00
#> "aperm.default" 0.18 1.67 0.18 1.67
#> "apply" 0.12 1.12 0.30 2.79
#> "is.na" 0.06 0.56 0.06 0.56
#>
#> $by.total
#> total.time total.pct self.time self.pct
#> "slowFun" 10.76 100.00 0.25 2.32
#> "block_exec" 10.76 100.00 0.00 0.00
#> "call_block" 10.76 100.00 0.00 0.00
#> "eng_r" 10.76 100.00 0.00 0.00
#> "eval" 10.76 100.00 0.00 0.00
#> "evaluate_call" 10.76 100.00 0.00 0.00
#> "evaluate::evaluate" 10.76 100.00 0.00 0.00
#> "evaluate" 10.76 100.00 0.00 0.00
#> "handle" 10.76 100.00 0.00 0.00
#> "in_dir" 10.76 100.00 0.00 0.00
#> "knitr::knit" 10.76 100.00 0.00 0.00
#> "process_file" 10.76 100.00 0.00 0.00
#> "process_group.block" 10.76 100.00 0.00 0.00
#> "process_group" 10.76 100.00 0.00 0.00
#> "rmarkdown::render" 10.76 100.00 0.00 0.00
#> "timing_fn" 10.76 100.00 0.00 0.00
#> "withCallingHandlers" 10.76 100.00 0.00 0.00
#> "withVisible" 10.76 100.00 0.00 0.00
#> "rbind" 10.15 94.33 10.15 94.33
#> "apply" 0.30 2.79 0.12 1.12
#> "aperm.default" 0.18 1.67 0.18 1.67
#> "aperm" 0.18 1.67 0.00 0.00
#> "is.na" 0.06 0.56 0.06 0.56
#>
#> $sample.interval
#> [1] 0.01
#>
#> $sampling.time
#> [1] 10.76
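The profile shows that almost all of the time is spent in rbind(). A hedged sketch of a vectorized rewrite (fastFun() is an illustrative name, not defined above): filter the complete rows once, then use rowMeans():
fastFun <- function(x) {
  keep <- complete.cases(x)          # TRUE for rows without NA
  rowMeans( x[keep, , drop=FALSE] )  # row means without any rbind()
}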
Use the profvis package to find the bottlenecks in your code.
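A minimal usage sketch: wrap the code of interest in profvis() to get an interactive flame graph (shown in RStudio or a browser):
library(profvis)
profvis({
  x.nomiss <- slowFun(x)
})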
A number of good suggestions for making your code clean and easy to read can be found on the CRAN and Bioconductor websites: