r - Use rle to group by runs when using dplyr


In R, I want to summarize my data after grouping it based on the runs of a variable x (that is, each group corresponds to a subset of the data where consecutive x values are the same). For instance, consider the following data frame, where I want to compute the average y value within each run of x:

    (dat <- data.frame(x=c(1, 1, 1, 2, 2, 1, 2), y=1:7))
    #   x y
    # 1 1 1
    # 2 1 2
    # 3 1 3
    # 4 2 4
    # 5 2 5
    # 6 1 6
    # 7 2 7

In this example, the x variable has runs of length 3, then 2, then 1, and finally 1, taking values 1, 2, 1, and 2 in those four runs. The corresponding means of y in those groups are 2, 4.5, 6, and 7.

It is easy to carry out this grouped operation in base R using tapply, passing dat$y as the data, using rle to compute the run number from dat$x, and passing the desired summary function:

    tapply(dat$y, with(rle(dat$x), rep(seq_along(lengths), lengths)), mean)
    #   1   2   3   4
    # 2.0 4.5 6.0 7.0
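To make the grouping expression less opaque, here is a small sketch that unpacks it step by step on the same data. The variable names r and run_id are only mine, for illustration.

```r
# Same data as in the question
dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# rle() returns the length and the value of each run of x
r <- rle(dat$x)
r$lengths  # 3 2 1 1
r$values   # 1 2 1 2

# Repeat each run's index as many times as the run is long,
# which gives one run id per row of the data
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id     # 1 1 1 2 2 3 4
```

The with(rle(dat$x), ...) form in the one-liner does the same thing, just without naming the intermediate rle result.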

I figured I would be able to carry my logic over to dplyr pretty directly, but my attempts so far have all ended in errors:

    library(dplyr)

    # First attempt
    dat %>%
      group_by(with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Error: cannot coerce type 'closure' to vector of type 'integer'

    # Attempt 2 -- maybe "with" is the problem?
    dat %>%
      group_by(rep(seq_along(rle(x)$lengths), rle(x)$lengths)) %>%
      summarize(mean(y))
    # Error: invalid subscript type 'closure'

For completeness, I could reimplement the rle run id myself using cumsum, head, and tail to get around this, but it makes the grouping code tougher to read and involves a bit of reinventing the wheel:

    dat %>%
      group_by(run=cumsum(c(1, head(x, -1) != tail(x, -1)))) %>%
      summarize(mean(y))
    #     run mean(y)
    #   (dbl)   (dbl)
    # 1     1     2.0
    # 2     2     4.5
    # 3     3     6.0
    # 4     4     7.0
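As a quick sanity check that the cumsum version really produces the same run ids as the rle version, the two can be compared in base R outside of dplyr. This is only a sketch on the example vector; run_rle and run_cumsum are names I made up for it.

```r
x <- c(1, 1, 1, 2, 2, 1, 2)

# Run ids from rle: repeat each run's index by its length
run_rle <- with(rle(x), rep(seq_along(lengths), lengths))

# Run ids from cumsum: start at 1 and add 1 wherever a value
# differs from the one just before it
run_cumsum <- cumsum(c(1, head(x, -1) != tail(x, -1)))

run_rle                    # 1 1 1 2 2 3 4
run_cumsum                 # 1 1 1 2 2 3 4
all(run_rle == run_cumsum) # TRUE
```

One small difference is the type: the rle version yields integers while the cumsum version yields doubles, which matches the (dbl) column type in the output above.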

What is causing my rle-based grouping code to fail in dplyr, and is there a solution that lets me keep using rle when grouping by run id?

One option seems to be the use of {}, as in:

    dat %>%
      group_by(yy = {yy = rle(x); rep(seq_along(yy$lengths), yy$lengths)}) %>%
      summarize(mean(y))
    #Source: local data frame [4 x 2]
    #
    #     yy mean(y)
    #  (int)   (dbl)
    #1     1     2.0
    #2     2     4.5
    #3     3     6.0
    #4     4     7.0
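Along the same lines, the rle logic can be packaged in a small helper so that the grouping expression stays short. The sketch below uses a hypothetical function name, run_id, and checks it against the base R tapply result rather than inside dplyr.

```r
# Hypothetical helper (the name run_id is my own): returns one
# run id per element of v, computed with rle
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), r$lengths)
}

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# Base R check: mean of y within each run of x
means <- tapply(dat$y, run_id(dat$x), mean)
as.vector(means)  # 2.0 4.5 6.0 7.0
```

I would expect group_by(run = run_id(x)) to work in the pipeline as well, since the references to lengths are then hidden inside the function body, but I have not verified that against the dplyr version that produced the errors.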

It would be nice if future dplyr versions also had an equivalent of data.table's rleid function.
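In the meantime, rleid can be borrowed from data.table directly inside group_by. This is a minimal sketch, assuming both dplyr and data.table are installed. The column names run and mean_y are my own choices, and rleid is called with the data.table:: prefix so that data.table does not need to be attached.

```r
library(dplyr)

dat <- data.frame(x = c(1, 1, 1, 2, 2, 1, 2), y = 1:7)

# data.table::rleid() returns a run id per element, so it can be
# used as the grouping expression
res <- dat %>%
  group_by(run = data.table::rleid(x)) %>%
  summarize(mean_y = mean(y))

res$mean_y  # 2.0 4.5 6.0 7.0
```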


I noticed that this problem occurs when using a data.frame or tbl_df as input, but not when using a tbl_dt or data.table as input:

    dat %>%
      tbl_df %>%
      group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Error: cannot coerce type 'closure' to vector of type 'integer'

    dat %>%
      tbl_dt %>%
      group_by(yy = with(rle(x), rep(seq_along(lengths), lengths))) %>%
      summarize(mean(y))
    # Source: local data table [4 x 2]
    #
    #      yy mean(y)
    #   (int)   (dbl)
    # 1     1     2.0
    # 2     2     4.5
    # 3     3     6.0
    # 4     4     7.0

I reported this as an issue on dplyr's GitHub page.

