r - Use rle to group by runs when using dplyr
In R, I want to summarize my data after grouping it based on the runs of a variable x (that is, each group corresponds to a subset of the data in which consecutive x values are the same). For instance, consider the following data frame, where I want to compute the average y value within each run of x:
    (dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
    #   x y
    # 1 1 1
    # 2 1 2
    # 3 1 3
    # 4 2 4
    # 5 2 5
    # 6 1 6
    # 7 2 7

In this example, the x variable has runs of length 3, 2, 1, and 1, taking values 1, 2, 1, and 2 in those 4 runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.
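As a quick check of the run structure described above, here is a minimal base R sketch that inspects what rle returns for this x column. The variable name `runs` is only illustrative.

```r
# Data from the example above
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() returns the length and the value of each run of
# consecutive equal elements
runs <- rle(dat$x)
runs$lengths  # 3 2 1 1
runs$values   # 1 2 1 2

# Number of runs
length(runs$lengths)  # 4
```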
It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:
    tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
    #   1   2   3   4
    # 2.0 4.5 6.0 7.0

I figured I would be able to carry this logic over to dplyr pretty directly, but my attempts so far have all ended in errors:
    library(dplyr)

    # First attempt
    dat %>%
      group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Error: cannot coerce type 'closure' to vector of type 'integer'

    # Attempt 2 -- maybe "with" is the problem?
    dat %>%
      group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
      summarize(mean(y))
    # Error: invalid subscript type 'closure'

For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:
    dat %>%
      group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
      summarize(mean(y))
    #     run mean(y)
    #   (dbl)   (dbl)
    # 1     1     2.0
    # 2     2     4.5
    # 3     3     6.0
    # 4     4     7.0

What is causing my rle-based grouping code to fail in dplyr, and is there any solution that enables me to keep using rle when grouping by run id?
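For reference, the cumsum-based run id from the question can be unpacked step by step in base R. This is only a sketch. The names `changed`, `run_cumsum`, and `run_rle` are illustrative and do not appear in the original post.

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# TRUE wherever an element differs from its predecessor
# (one flag per adjacent pair, so length(x) - 1 flags)
changed <- head(x, -1) != tail(x, -1)
changed  # FALSE FALSE TRUE FALSE TRUE TRUE

# Start at 1 and add 1 at every change to obtain the run id
run_cumsum <- cumsum(c(1, changed))
run_cumsum  # 1 1 1 2 2 3 4

# The rle-based expression from the question, evaluated outside dplyr
run_rle <- with(rle(x), rep(seq_along(lengths), lengths))

# Both approaches agree on this input
all(run_cumsum == run_rle)  # TRUE
```

The comparison uses `==` rather than `identical()` because the cumsum version is a double vector while the rle version is an integer vector.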
One option seems to be the use of {}, as in:
    dat %>%
      group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
      summarize(mean(y))
    #Source: local data frame [4 x 2]
    #
    #     yy mean(y)
    #  (int)   (dbl)
    #1     1     2.0
    #2     2     4.5
    #3     3     6.0
    #4     4     7.0

It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
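In the same spirit as the {} workaround, the rle logic could be wrapped in a small helper so the grouping expression stays short. The sketch below is an assumption, not something from the posts above: `run_id` is a hypothetical name, not part of dplyr or base R, and it is checked here only with base R's tapply.

```r
# Hypothetical helper (not part of dplyr or base R):
# run ids computed from rle(), avoiding a bare `lengths` symbol
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)
run_id(dat$x)  # 1 1 1 2 2 3 4

# Base R check of the grouped means
means <- tapply(dat$y, run_id(dat$x), mean)
means  # 2.0 4.5 6.0 7.0, named by run id 1..4
```

With dplyr, such a helper would presumably be used as group_by(run = run_id(x)). That call has not been verified here.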
I noticed that this problem occurs when using a data.frame or tbl_df as input, but not when using a tbl_dt or data.table as input:
    dat %>%
      tbl_df %>%
      group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Error: cannot coerce type 'closure' to vector of type 'integer'

    dat %>%
      tbl_dt %>%
      group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Source: local data table [4 x 2]
    #
    #      yy mean(y)
    #   (int)   (dbl)
    # 1     1     2.0
    # 2     2     4.5
    # 3     3     6.0
    # 4     4     7.0

I reported this as an issue on dplyr's GitHub page.
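One possible contributing factor, which the posts above do not confirm, is that `lengths` is also the name of a base R function (in R 3.2.0 and later). If the symbol were resolved to that function rather than to the `lengths` component of the rle result, a 'closure' error would be expected. The sketch below only demonstrates that name clash in base R. It makes no claim about dplyr's internals.

```r
# `lengths` is also a function in base R (R >= 3.2.0)
is.function(lengths)  # TRUE

# Passing the function itself where a vector of counts is expected fails
res <- tryCatch(rep(1L, lengths), error = function(e) e)
inherits(res, "error")  # TRUE
conditionMessage(res)   # the exact wording may vary by R version
```

This would also be consistent with the {} workaround succeeding, since that version refers to yy$lengths instead of a bare `lengths` symbol.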