r - Using dplyr to take a sample from a data frame -


I have a large data frame (150,000,000 rows) in this format:

df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15)) 

pnr is a person id and x is the data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?

The following is a solution, but it is slow because of the merge statement:

prns = as.data.frame(unique(df$pnr))
names(prns)[1] = "pnr"
prns$s = rbinom(nrow(prns), 1, 0.1)
df = merge(df, prns)
df2 = df[df$s == 1, ]
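For comparison, here is a merge-free sketch of the same idea in plain dplyr, assuming the `df` defined above; the name `ids` is just illustrative and this is not from the original post:

library(dplyr)

## Draw 10% of the distinct person ids, then keep only their rows.
ids <- sample(unique(df$pnr), size = ceiling(0.1 * n_distinct(df$pnr)))
df2 <- df %>% filter(pnr %in% ids)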

i suggest "data.table" package on "dplyr" this. here's example big-ish sample data (not smaller own 15 million rows).

I'll show the right and wrong ways to do things :-)

Here's the sample data.

library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- dt <- data.frame(person = sample(10000, 1e7, TRUE),
                         value = runif(1e7))

we'll create "data.table" , set key "person". creating "data.table" takes no significant time, setting key can.

system.time(setDT(dt))
#    user  system elapsed 
#   0.001   0.000   0.001 

## Setting the key takes some time, but it's worth it
system.time(setkey(dt, person))
#    user  system elapsed 
#   0.620   0.025   0.646 

I can't think of a more efficient way to select your "person" values than the following, so I've removed these steps from the benchmarks--they are common to all approaches.

## Common to all tests...
a <- unique(mydf$person)
b <- sample(a, ceiling(.1 * length(a)), FALSE)

For convenience, the different tests are presented as functions...

## Base R #1
fun1a <- function() {
  mydf[mydf$person %in% b, ]
}

## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
  mydf[which(mydf$person %in% b), ]
}

## `filter` from "dplyr"
fun2 <- function() {
  filter(mydf, person %in% b)
}

## The "wrong" way with "data.table"
fun3a <- function() {
  dt[which(person %in% b)]
}

## The "right" (I think) way with "data.table"
fun3b <- function() {
  dt[J(b)]
}

Now, we can benchmark:

## Benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
#     expr       min        lq    median        uq       max neval
#  fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220    20
#  fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169    20
#   fun2() 381.78909 394.16716 395.49341 399.01202 417.79044    20
#  fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128    20
#  fun3b()  28.77801  28.91648  29.01535  29.37596  42.34043    20

Look at the performance you get by using "data.table" the right way! All the other approaches are impressively fast though.
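The keyed subset in fun3b() is fast because data.table can use a binary search on the sorted key instead of scanning every row. As a side note, on newer data.table versions the same join can be written with the `on` argument and no setkey() call at all; this is only a sketch of that alternative, not something benchmarked above:

## Join the sampled ids against "person" without pre-setting a key.
dt2 <- as.data.table(mydf)
res <- dt2[.(b), on = "person"]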


summary shows the results to be the same. (The row order of the "data.table" solution is different since it has been sorted by the key.)

summary(fun1a())
#      person         value          
#  Min.   :  16   Min.   :0.000002   
#  1st Qu.:2424   1st Qu.:0.250988   
#  Median :5075   Median :0.500259   
#  Mean   :4958   Mean   :0.500349   
#  3rd Qu.:7434   3rd Qu.:0.749601   
#  Max.   :9973   Max.   :1.000000   

summary(fun2())
#      person         value          
#  Min.   :  16   Min.   :0.000002   
#  1st Qu.:2424   1st Qu.:0.250988   
#  Median :5075   Median :0.500259   
#  Mean   :4958   Mean   :0.500349   
#  3rd Qu.:7434   3rd Qu.:0.749601   
#  Max.   :9973   Max.   :1.000000   

summary(fun3b())
#      person         value          
#  Min.   :  16   Min.   :0.000002   
#  1st Qu.:2424   1st Qu.:0.250988   
#  Median :5075   Median :0.500259   
#  Mean   :4958   Mean   :0.500349   
#  3rd Qu.:7434   3rd Qu.:0.749601   
#  Max.   :9973   Max.   :1.000000   
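For a stricter check than summary(), a sketch along these lines (my addition, not part of the original answer) puts both results in the same row order and compares them directly:

## Order both results by person and value, then compare ignoring row names.
x <- fun1a()
y <- as.data.frame(fun3b())
x <- x[order(x$person, x$value), ]
y <- y[order(y$person, y$value), ]
all.equal(x, y, check.attributes = FALSE)  # expected to return TRUE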
