R Programming Efficiently - gizotso/R GitHub Wiki
R works best with vectorized operations, not explicit loops!
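As a small illustration of that point, here is a minimal sketch that recodes a numeric vector twice, once with an explicit loop and once with the vectorized `ifelse()`. The vector `x` and the labels are made up for the example.

```r
# Illustrative data (hypothetical values)
x <- c(-2, 5, 0, 7, -1)

# Explicit loop: one element at a time
signs_loop <- character(length(x))
for (i in seq_along(x)) {
  signs_loop[i] <- if (x[i] > 0) "positive" else "non-positive"
}

# Vectorized: the comparison and the recoding act on the whole vector at once
signs_vec <- ifelse(x > 0, "positive", "non-positive")

identical(signs_loop, signs_vec)
```

Both versions give the same result; the vectorized one is shorter and pushes the iteration down into compiled code.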
R supports iteration directly over vectors, but not over nonvector sets (for example, a collection of separately named objects). For those we have to use indirect (but still convenient) means of looping. The lapply() and get() functions, for example, make this possible.
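A minimal sketch of this indirect looping, assuming two matrices named `u` and `v` (hypothetical names and values): we iterate over their names and use `get()` to fetch each object.

```r
# Two separately named objects (illustrative values)
u <- matrix(c(1, 2, 3, 1, 2, 4), ncol = 2)
v <- matrix(c(8, 12, 20, 15, 10, 2), ncol = 2)

# Loop over the object names; get() looks each object up by name
object_names <- c("u", "v")
totals <- sapply(object_names, function(nm) sum(get(nm)))

# lapply() works the same way but always returns a list
dims <- lapply(object_names, function(nm) dim(get(nm)))

totals
```

If the objects can be put in a list up front, `lapply(list(u = u, v = v), sum)` avoids `get()` altogether.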
apply, lapply
Explicit loops are generally slow in R, so it is better to avoid them when a vectorized or apply-style alternative exists.
Higher-order function: applies a given function to each element of a list (a "map" function).
- apply() applies a function over the margins of a matrix or an array: the rows of a matrix (MARGIN = 1) or its columns (MARGIN = 2).
- lapply() applies a function to each element of a list (for a data frame, to each column) and returns a list.
- sapply() is similar, but the output is simplified when possible. It may be a vector or a matrix depending on the function.
- tapply() applies the function within each level of a factor, i.e. "apply function BY group".
- vapply() is like sapply(), but the type and length of each result must be declared through FUN.VALUE, which makes the output predictable.
- see also aggregate()
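To make the differences in return types concrete, here is a minimal sketch that runs each function from the list above on one small data frame. The data frame `df` and its values are invented for the example.

```r
# Small illustrative data frame (hypothetical values)
df <- data.frame(
  group  = c("a", "a", "b", "b"),
  value  = c(1, 3, 10, 20),
  weight = c(2, 2, 4, 4)
)
num_cols <- df[c("value", "weight")]

col_means_list <- lapply(num_cols, mean)                          # list
col_means_vec  <- sapply(num_cols, mean)                          # named numeric vector
col_means_safe <- vapply(num_cols, mean, FUN.VALUE = numeric(1))  # same, with type checked

row_sums <- apply(as.matrix(num_cols), 1, sum)                    # one value per row

by_group <- tapply(df$value, df$group, mean)                      # mean of value BY group
agg      <- aggregate(value ~ group, data = df, FUN = mean)       # same idea, as a data frame
```

`tapply()` returns a named array, while `aggregate()` returns a data frame with one row per group, which is often easier to merge back into other data.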
apply
apply()
: acts on the rows and/or the columns of a matrix
# apply(X, MARGIN, FUN, ...), where X is the matrix, MARGIN indicates whether the action is applied to the rows (1), the columns (2) or both (c(1, 2)), and FUN is the function (or the operator, in which case it must be given in double quotes)
x <- rnorm(10, -5, 0.1)
y <- rnorm(10, 5, 2)
X <- cbind(x, y) # the columns of matrix X keep the names "x" and "y"
apply(X, 2, mean) # apply over the columns
# x y
# -4.995027 5.455004
mean(x) ##[1] -4.995027
mean(y) ##[1] 5.455004
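lapply(X, mean) # not what we want: X is a matrix, so lapply() treats it as a plain vector of 20 cells and returns 20 one-value means
lapply(data.frame(X), mean) # one mean per column, as a list (the output below comes from a different random draw than above)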
# $x
# [1] -5.001045
# $y
# [1] 4.74219
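The difference between calling `lapply()` on a matrix and on a data frame can be checked directly. This is a minimal sketch; the seed and the name `X_demo` are only there to keep the example reproducible and separate from the `X` above.

```r
set.seed(1)  # arbitrary seed, for reproducibility only
X_demo <- cbind(x = rnorm(10, -5, 0.1), y = rnorm(10, 5, 2))

# On a matrix, lapply() iterates over the individual cells
cell_means <- lapply(X_demo, mean)
length(cell_means)

# On a data frame, lapply() iterates over the columns
col_means <- lapply(data.frame(X_demo), mean)
length(col_means)

# Staying with the matrix: index the columns explicitly, or use colMeans()
col_means_by_index <- sapply(seq_len(ncol(X_demo)), function(j) mean(X_demo[, j]))
col_means_builtin  <- colMeans(X_demo)
```

For plain column means, `colMeans()` is the most direct choice; the `lapply()` forms are useful when the function applied is not a simple mean.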
lapply()
: acts on a list
# lapply() acts on a list
# sapply() is a variant that simplifies the result
forms <- list(y ~ x, y ~ poly(x, 2))
lapply(forms, lm)
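Once the models are fitted, the same family of functions can pull a summary out of each one. A minimal sketch, assuming simulated data (the seed, the sample size and the names `linear` and `quadratic` are choices made for this example):

```r
set.seed(42)  # arbitrary seed, for reproducibility only
x <- 1:20
y <- 2 + 3 * x + rnorm(20)

# A named list of formulas gives a named list of fitted models
forms <- list(linear = y ~ x, quadratic = y ~ poly(x, 2))
fits  <- lapply(forms, lm)

# One number per model: R squared and number of coefficients
r2     <- sapply(fits, function(fit) summary(fit)$r.squared)
n_coef <- vapply(fits, function(fit) length(coef(fit)), FUN.VALUE = integer(1))

r2
n_coef
```

Naming the list elements is optional, but it carries through to the results, which makes the output easier to read.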
sapply()
# sapply(): getting the number of unique values per data frame column
d1 <- data.frame(v1 = c(1, 2, 3, 2), v2 = c("a", "a", "b", "b"))
d1_nb_uniques <- sapply(d1, function(x) length(unique(x))) # named vector
d1_nb_uniques <- rapply(d1, function(x) length(unique(x))) # named vector (rapply() unlists by default)
d1_nb_uniques <- lapply(d1, function(x) length(unique(x))) # list
N <- 10
x1 <- rnorm(N)
x2 <- rnorm(N) + x1 + 1
male <- rbinom(N, 1, .48)
y <- 1 + x1 + x2 + male + rnorm(N)
mydat <- data.frame(y, x1, x2, male)
> mydat
            y         x1          x2 male
1   1.6120728  0.4086131  0.04585462    0
2   5.6836625  0.8569270  3.06609627    1
3   6.0125131  1.6433535  2.00352930    1
4   1.1058325 -0.4295005 -0.68948938    1
5   4.6543794  2.1458691  2.31930332    0
6   2.7312134  0.1874176  0.39057549    1
7   5.7486838  1.1639011  3.85327129    0
8   3.5423505  0.4021064  2.23848358    0
9   0.4873023 -0.7594786  0.52266250    0
10 -1.5319902 -1.0027291 -0.33610895    0
lapply(mydat,mean) # returns a list
# $y
# [1] 3.004602
# $x1
# [1] 0.461648
# $x2
# [1] 1.341418
# $male
# [1] 0.4
apply(mydat,2,mean) # applies the function to each column, returns a vector
y x1 x2 male
3.004602 0.461648 1.341418 0.400000
sapply(mydat,mean)
y x1 x2 male
3.004602 0.461648 1.341418 0.400000
apply(mydat,1,mean) # applies the function to each row (the output below comes from a different random draw than the mydat printed above)
##[1] 1.1654 2.8347 -0.9728 0.6512 -0.0696 3.9206 -0.2492 3.1060 2.0478 0.5116
tapply(mydat$y, mydat$male, mean) # applies the function within each level of the factor, i.e. mean(y) BY male (the output below comes from a different random draw than the mydat printed above)
0 1
1.040 5.454
x = matrix(round(rnorm(100)),10,10)
col.sums = apply(x, 2, sum)
row.sums = apply(x, 1, sum)
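For sums and means specifically, base R has dedicated functions that give the same result as the `apply()` calls above. A minimal sketch (the seed and the name `m` are choices made for this example):

```r
set.seed(123)  # arbitrary seed, for reproducibility only
m <- matrix(round(rnorm(100)), 10, 10)

# apply() versions, as above
col_sums_apply <- apply(m, 2, sum)
row_sums_apply <- apply(m, 1, sum)

# Dedicated base R equivalents
col_sums_fast <- colSums(m)
row_sums_fast <- rowSums(m)

all.equal(col_sums_apply, col_sums_fast)
all.equal(row_sums_apply, row_sums_fast)
```

`colSums()`, `rowSums()`, `colMeans()` and `rowMeans()` are implemented in compiled code, so they are usually the better choice when the function is just a sum or a mean.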
- http://www.magesblog.com/2012/01/say-it-in-r-with-by-apply-and-friends.html
- https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/
Performance, vectorized loops
- http://www.r-bloggers.com/for-loops-and-how-to-avoid-them/
- http://www.r-bloggers.com/the-performance-cost-of-a-for-loop-and-some-alternatives/
- http://fr.slideshare.net/bytemining/taking-r-to-the-limit-high-performance-computing-in-r-part-1-parallelization-la-r-users-group-727
- http://blog.revolutionanalytics.com/2010/11/loops-in-r.html
- http://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized
- http://stackoverflow.com/questions/2275896/is-rs-apply-family-more-than-syntactic-sugar
- https://stackoverflow.com/questions/4858511/recoding-numeric-vector-r
Using plyr (its successor is dplyr):
library(microbenchmark)
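A minimal sketch of how microbenchmark can be used to compare a loop, `sapply()` and a vectorized expression. The three helper functions are hypothetical names written for this example, and the code falls back to base `system.time()` if the microbenchmark package is not installed. Timings depend on the machine, so no output is shown.

```r
v <- seq_len(1e4)

# Three ways to square every element (illustrative helpers)
square_loop <- function(v) {
  out <- numeric(length(v))            # preallocate the result
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_sapply <- function(v) sapply(v, function(e) e^2)
square_vec    <- function(v) v^2

# Check that all three agree before timing them
stopifnot(isTRUE(all.equal(square_loop(v), square_vec(v))))
stopifnot(isTRUE(all.equal(square_sapply(v), square_vec(v))))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  timings <- microbenchmark::microbenchmark(
    loop       = square_loop(v),
    sapply     = square_sapply(v),
    vectorized = square_vec(v),
    times      = 20
  )
  print(timings)
} else {
  # Fallback with base R only: time 20 repetitions of each version
  print(system.time(for (k in 1:20) square_loop(v)))
  print(system.time(for (k in 1:20) square_sapply(v)))
  print(system.time(for (k in 1:20) square_vec(v)))
}
```

Checking that the candidates return the same result before timing them avoids comparing the speed of functions that do different things.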
Iterators
Loops in R are generally slow; iterators may be more efficient than loops. See this entry on the Revolution Computing blog: http://blog.revolution-computing.com/2009/07/counting-with-iterators.html (moved to http://blog.revolutionanalytics.com/2009/07/counting-with-iterators.html)