matrixStats: Benchmark report

anyMissing() benchmarks

This report benchmarks the performance of anyMissing() against alternative methods.

Alternative methods

anyNA()
any() + is.na()

The any() + is.na() combination is wrapped in a helper function, as below:

> any_is.na <- function(x) {
+     any(is.na(x))
+ }
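
As a quick sanity check, the three approaches should agree on whether a vector contains missing values. The following is a minimal sketch, not part of the original report: the test vectors x_clean and x_na are made up for illustration, and anyMissing() is only called if the matrixStats package happens to be installed.

```r
# Helper from the report: any() + is.na()
any_is.na <- function(x) {
  any(is.na(x))
}

# Illustrative inputs (not from the benchmark data)
x_clean <- c(1L, 2L, 3L)
x_na <- c(1L, NA, 3L)

# Base R alternatives
res_clean <- c(anyNA = anyNA(x_clean), any_is.na = any_is.na(x_clean))
res_na <- c(anyNA = anyNA(x_na), any_is.na = any_is.na(x_na))

# Compare against anyMissing() only when matrixStats is available
if (requireNamespace("matrixStats", quietly = TRUE)) {
  res_clean["anyMissing"] <- matrixStats::anyMissing(x_clean)
  res_na["anyMissing"] <- matrixStats::anyMissing(x_na)
}

print(res_clean)
print(res_na)
```

All entries of res_clean should be FALSE and all entries of res_na should be TRUE.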

Data type "integer"

Data

> rvector <- function(n, mode = c("logical", "double", "integer"), range = c(-100, +100), na_prob = 0) {
+     mode <- match.arg(mode)
+     if (mode == "logical") {
+         x <- sample(c(FALSE, TRUE), size = n, replace = TRUE)
+     } else {
+         x <- runif(n, min = range[1], max = range[2])
+     }
+     storage.mode(x) <- mode
+     if (na_prob > 0) 
+         x[sample(n, size = na_prob * n)] <- NA
+     x
+ }
> rvectors <- function(scale = 10, seed = 1, ...) {
+     set.seed(seed)
+     data <- list()
+     data[[1]] <- rvector(n = scale * 100, ...)
+     data[[2]] <- rvector(n = scale * 1000, ...)
+     data[[3]] <- rvector(n = scale * 10000, ...)
+     data[[4]] <- rvector(n = scale * 1e+05, ...)
+     data[[5]] <- rvector(n = scale * 1e+06, ...)
+     names(data) <- sprintf("n = %d", sapply(data, FUN = length))
+     data
+ }
> data <- rvectors(mode = "integer")
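
With the defaults scale = 10 and na_prob = 0, the list holds five vectors of lengths 1000 to 10000000, and none of them contains a missing value, so each method presumably has to inspect every element before returning FALSE. The element names used to index the data in the results follow from the sprintf() call above. A small sketch of that naming scheme (the variables sizes and labels are illustrative and do not appear in the report):

```r
scale <- 10

# Vector lengths generated by rvectors() for the default scale
sizes <- as.integer(scale * c(100, 1000, 10000, 1e+05, 1e+06))

# Names used to look up each vector, e.g. data[["n = 1000"]]
labels <- sprintf("n = %d", sizes)
print(labels)
```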

Results

n = 1000 vector

> x <- data[["n = 1000"]]
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  3053217 163.1    5709258 305.0  5709258 305.0
Vcells 32116601 245.1   54055058 412.5 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 1000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.000359	0.000365	0.0004554	0.0003780	0.0003855	0.008082
1	anyMissing	0.000934	0.000949	0.0010638	0.0009780	0.0010110	0.007846
3	any_is.na	0.002283	0.002359	0.0025444	0.0024015	0.0024980	0.011989

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.0000000
1	anyMissing	2.601671	2.600000	2.336181	2.587302	2.622568	0.9707993
3	any_is.na	6.359331	6.463014	5.587509	6.353175	6.479896	1.4834199
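
The relative times in the bottom panel appear to be the top-panel times divided, column by column, by the anyNA row. The sketch below reproduces the min and median columns for n = 1000; the matrix times is typed in by hand from the table above for illustration.

```r
# Times in milliseconds, copied from the top panel (min and median only)
times <- rbind(
  anyNA      = c(min = 0.000359, median = 0.0003780),
  anyMissing = c(min = 0.000934, median = 0.0009780),
  any_is.na  = c(min = 0.002283, median = 0.0024015)
)

# Divide each column by its anyNA entry to obtain relative times
relative <- sweep(times, MARGIN = 2, STATS = times["anyNA", ], FUN = "/")
print(round(relative, 6))
```

The anyMissing row should come out close to 2.601671 (min) and 2.587302 (median), matching the bottom panel.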

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 1000. Outliers are displayed as crosses. Times are in milliseconds.

n = 10000 vector

> x <- data[["n = 10000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050100 162.9    5709258  305  5709258 305.0
Vcells 10479057  80.0   43244047  330 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 10000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.002735	0.0027680	0.0028186	0.0027845	0.0028250	0.004756
1	anyMissing	0.005714	0.0057795	0.0059961	0.0058725	0.0059715	0.015570
3	any_is.na	0.017282	0.0176325	0.0183324	0.0177625	0.0179375	0.039756

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	2.089214	2.087970	2.127348	2.108996	2.113805	3.273760
3	any_is.na	6.318830	6.370123	6.504126	6.379063	6.349558	8.359125

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 10000. Outliers are displayed as crosses. Times are in milliseconds.

n = 100000 vector

> x <- data[["n = 100000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050172 162.9    5709258  305  5709258 305.0
Vcells 10479617  80.0   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 100000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.026117	0.0261990	0.0263233	0.026240	0.0263320	0.028237
1	anyMissing	0.052362	0.0524585	0.0534622	0.052539	0.0527015	0.088928
3	any_is.na	0.165082	0.1665155	0.2059161	0.167865	0.2737035	0.319562

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	2.004901	2.002309	2.030979	2.002248	2.001424	3.149343
3	any_is.na	6.320864	6.355796	7.822567	6.397294	10.394330	11.317137

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 100000. Outliers are displayed as crosses. Times are in milliseconds.

n = 1000000 vector

> x <- data[["n = 1000000"]]
> gc()
           used (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050244  163    5709258  305  5709258 305.0
Vcells 10479666   80   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 1000000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.259394	0.271498	0.3292668	0.3153795	0.3550220	0.741998
1	anyMissing	0.514958	0.521802	0.5666086	0.5361985	0.5619805	0.933253
3	any_is.na	1.647475	2.685594	2.9601964	2.7435330	2.9737855	12.192548

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	1.985235	1.921937	1.720819	1.700169	1.582946	1.257757
3	any_is.na	6.351246	9.891763	8.990265	8.699148	8.376342	16.432050

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 1000000. Outliers are displayed as crosses. Times are in milliseconds.

n = 10000000 vector

> x <- data[["n = 10000000"]]
> gc()
           used (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050316  163    5709258  305  5709258 305.0
Vcells 10479714   80   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 10000000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	3.192648	3.210021	3.290707	3.221774	3.278016	3.804310
1	anyMissing	5.256905	5.275116	5.377196	5.310230	5.413356	6.648744
3	any_is.na	26.604766	27.029503	29.775117	27.262952	30.022492	40.988404

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	1.646566	1.643328	1.634055	1.648231	1.651412	1.747687
3	any_is.na	8.333135	8.420352	9.048243	8.462092	9.158739	10.774202

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on integer data with n = 10000000. Outliers are displayed as crosses. Times are in milliseconds.

Data type "double"

Data

> rvector <- function(n, mode = c("logical", "double", "integer"), range = c(-100, +100), na_prob = 0) {
+     mode <- match.arg(mode)
+     if (mode == "logical") {
+         x <- sample(c(FALSE, TRUE), size = n, replace = TRUE)
+     } else {
+         x <- runif(n, min = range[1], max = range[2])
+     }
+     storage.mode(x) <- mode
+     if (na_prob > 0) 
+         x[sample(n, size = na_prob * n)] <- NA
+     x
+ }
> rvectors <- function(scale = 10, seed = 1, ...) {
+     set.seed(seed)
+     data <- list()
+     data[[1]] <- rvector(n = scale * 100, ...)
+     data[[2]] <- rvector(n = scale * 1000, ...)
+     data[[3]] <- rvector(n = scale * 10000, ...)
+     data[[4]] <- rvector(n = scale * 1e+05, ...)
+     data[[5]] <- rvector(n = scale * 1e+06, ...)
+     names(data) <- sprintf("n = %d", sapply(data, FUN = length))
+     data
+ }
> data <- rvectors(mode = "double")

Results

n = 1000 vector

> x <- data[["n = 1000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050391 163.0    5709258  305  5709258 305.0
Vcells 16035729 122.4   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 1000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.000505	0.0005205	0.0006173	0.0005375	0.0005515	0.008371
1	anyMissing	0.000937	0.0009790	0.0011860	0.0010030	0.0010505	0.016060
3	any_is.na	0.002244	0.0023690	0.0026005	0.0024175	0.0025095	0.015431

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	1.855446	1.880884	1.921335	1.866046	1.904805	1.918528
3	any_is.na	4.443564	4.551393	4.212717	4.497674	4.550317	1.843388

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 1000. Outliers are displayed as crosses. Times are in milliseconds.

n = 10000 vector

> x <- data[["n = 10000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050460 163.0    5709258  305  5709258 305.0
Vcells 16035771 122.4   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 10000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.004266	0.0042960	0.0043472	0.0043225	0.0043570	0.005866
1	anyMissing	0.005720	0.0058035	0.0059647	0.0058815	0.0059995	0.011991
3	any_is.na	0.017302	0.0175330	0.0179555	0.0176775	0.0178685	0.030253

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	1.340834	1.350908	1.372075	1.360671	1.376980	2.044153
3	any_is.na	4.055790	4.081238	4.130334	4.089647	4.101102	5.157347

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 10000. Outliers are displayed as crosses. Times are in milliseconds.

n = 100000 vector

> x <- data[["n = 100000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050532 163.0    5709258  305  5709258 305.0
Vcells 16036122 122.4   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 100000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.041876	0.0420325	0.0422432	0.0421405	0.0422655	0.044177
1	anyMissing	0.052454	0.0525240	0.0531367	0.0526505	0.0529340	0.077419
3	any_is.na	0.164455	0.1668220	0.2142914	0.1705240	0.2727585	0.282729

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	1.252603	1.249605	1.257877	1.249404	1.252416	1.752473
3	any_is.na	3.927190	3.968881	5.072808	4.046558	6.453455	6.399914

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 100000. Outliers are displayed as crosses. Times are in milliseconds.

n = 1000000 vector

> x <- data[["n = 1000000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050604 163.0    5709258  305  5709258 305.0
Vcells 16036530 122.4   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 1000000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	0.456717	0.5120315	0.5419385	0.5531520	0.5685105	0.685328
1	anyMissing	0.545810	0.5804885	0.6082866	0.6047615	0.6236790	1.252541
3	any_is.na	1.709426	2.7450395	2.7931772	2.8021115	2.8304415	9.515550

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
1	anyMissing	1.195073	1.133697	1.122427	1.093301	1.097040	1.827652
3	any_is.na	3.742856	5.361075	5.154048	5.065717	4.978697	13.884665

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 1000000. Outliers are displayed as crosses. Times are in milliseconds.

n = 10000000 vector

> x <- data[["n = 10000000"]]
> gc()
           used  (Mb) gc trigger (Mb) max used  (Mb)
Ncells  3050676 163.0    5709258  305  5709258 305.0
Vcells 16036578 122.4   34595238  264 56666022 432.4
> stats <- microbenchmark(anyMissing = anyMissing(x), anyNA = anyNA(x), any_is.na = any_is.na(x), unit = "ms")

Table: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 10000000. The top panel shows times in milliseconds and the bottom panel shows relative times.

	expr	min	lq	mean	median	uq	max
2	anyNA	5.540831	5.668875	5.864548	5.749500	6.116903	7.339590
1	anyMissing	5.925187	6.040297	6.211277	6.157669	6.256903	7.233605
3	any_is.na	27.431948	28.496416	33.181975	28.905825	35.312683	250.630057

	expr	min	lq	mean	median	uq	max
2	anyNA	1.000000	1.000000	1.000000	1.000000	1.000000	1.0000000
1	anyMissing	1.069368	1.065520	1.059123	1.070992	1.022887	0.9855598
3	any_is.na	4.950873	5.026821	5.658062	5.027537	5.772968	34.1476918

Figure: Benchmarking of anyMissing(), anyNA() and any_is.na() on double data with n = 10000000. Outliers are displayed as crosses. Times are in milliseconds.
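
To compare the runs at a glance, the median relative times of anyMissing() (relative to anyNA()) can be collected from the bottom panels of the tables above. The sketch below only tabulates numbers already reported; the data frame median_ratio is an illustrative name and not part of the report.

```r
# Median time of anyMissing() relative to anyNA(), copied from the tables
median_ratio <- data.frame(
  n       = c(1e3, 1e4, 1e5, 1e6, 1e7),
  integer = c(2.587302, 2.108996, 2.002248, 1.700169, 1.648231),
  double  = c(1.866046, 1.360671, 1.249404, 1.093301, 1.070992)
)
print(median_ratio)

# TRUE if the ratio shrinks with every increase in n
shrinks <- sapply(median_ratio[c("integer", "double")],
                  function(r) all(diff(r) < 0))
print(shrinks)
```

In these particular runs the ratio shrinks as n grows for both data types, and it is smaller for double than for integer at every size.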

Appendix

Session information

R version 3.6.1 Patched (2019-08-27 r77078)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS:   /home/hb/software/R-devel/R-3-6-branch/lib/R/lib/libRblas.so
LAPACK: /home/hb/software/R-devel/R-3-6-branch/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-6    matrixStats_0.55.0-9000 ggplot2_3.2.1          
[4] knitr_1.24              R.devices_2.16.0        R.utils_2.9.0          
[7] R.oo_1.22.0             R.methodsS3_1.7.1       history_0.0.0-9002     

loaded via a namespace (and not attached):
 [1] Biobase_2.45.0       bit64_0.9-7          splines_3.6.1       
 [4] network_1.15         assertthat_0.2.1     highr_0.8           
 [7] stats4_3.6.1         blob_1.2.0           robustbase_0.93-5   
[10] pillar_1.4.2         RSQLite_2.1.2        backports_1.1.4     
[13] lattice_0.20-38      glue_1.3.1           digest_0.6.20       
[16] colorspace_1.4-1     sandwich_2.5-1       Matrix_1.2-17       
[19] XML_3.98-1.20        lpSolve_5.6.13.3     pkgconfig_2.0.2     
[22] genefilter_1.66.0    purrr_0.3.2          ergm_3.10.4         
[25] xtable_1.8-4         mvtnorm_1.0-11       scales_1.0.0        
[28] tibble_2.1.3         annotate_1.62.0      IRanges_2.18.2      
[31] TH.data_1.0-10       withr_2.1.2          BiocGenerics_0.30.0 
[34] lazyeval_0.2.2       mime_0.7             survival_2.44-1.1   
[37] magrittr_1.5         crayon_1.3.4         statnet.common_4.3.0
[40] memoise_1.1.0        laeken_0.5.0         R.cache_0.13.0      
[43] MASS_7.3-51.4        R.rsp_0.43.1         tools_3.6.1         
[46] multcomp_1.4-10      S4Vectors_0.22.1     trust_0.1-7         
[49] munsell_0.5.0        AnnotationDbi_1.46.1 compiler_3.6.1      
[52] rlang_0.4.0          grid_3.6.1           RCurl_1.95-4.12     
[55] cwhmisc_6.6          rappdirs_0.3.1       labeling_0.3        
[58] bitops_1.0-6         base64enc_0.1-3      boot_1.3-23         
[61] gtable_0.3.0         codetools_0.2-16     DBI_1.0.0           
[64] markdown_1.1         R6_2.4.0             zoo_1.8-6           
[67] dplyr_0.8.3          bit_1.1-14           zeallot_0.1.0       
[70] parallel_3.6.1       Rcpp_1.0.2           vctrs_0.2.0         
[73] DEoptimR_1.0-8       tidyselect_0.2.5     xfun_0.9            
[76] coda_0.19-3

Total processing time was 17.93 secs.

Reproducibility

To reproduce this report, do:

html <- matrixStats:::benchmark('anyMissing')
