python - Dask DataFrame Groupby Partitions
I have some large CSV files (~10 GB) and would like to take advantage of Dask for analysis. However, depending on the number of partitions the Dask object is read in with, the groupby results change. My understanding was that Dask takes advantage of partitions for out-of-core processing benefits but still returns the appropriate groupby output. That doesn't seem to be the case, and I'm struggling to work out what alternate settings are necessary. Below is a small example:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': np.arange(100),
                   'b': np.random.randn(100),
                   'c': np.random.randn(100),
                   'grp1': np.repeat([1, 2], 50),
                   'grp2': np.repeat([3, 4, 5, 6], 25)})

# same frame, different partition counts
test_dd1 = dd.from_pandas(df, npartitions=1)
test_dd2 = dd.from_pandas(df, npartitions=2)
test_dd5 = dd.from_pandas(df, npartitions=5)
test_dd10 = dd.from_pandas(df, npartitions=10)
test_dd100 = dd.from_pandas(df, npartitions=100)

def test_func(x):
    # fraction of rows in the group where 'b' is positive
    x['new_col'] = len(x[x['b'] > 0.]) / len(x['b'])
    return x

test_dd1.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd2.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd5.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.45
1  1 -1.107799  1.075471     1     3     0.45
2  2 -0.719420 -0.574381     1     3     0.45
3  3 -1.287547 -0.749218     1     3     0.45
4  4  0.677617 -0.908667     1     3     0.45

test_dd10.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3      0.5
1  1 -1.107799  1.075471     1     3      0.5
2  2 -0.719420 -0.574381     1     3      0.5
3  3 -1.287547 -0.749218     1     3      0.5
4  4  0.677617 -0.908667     1     3      0.5

test_dd100.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3        0
1  1 -1.107799  1.075471     1     3        0
2  2 -0.719420 -0.574381     1     3        0
3  3 -1.287547 -0.749218     1     3        0
4  4  0.677617 -0.908667     1     3        1

df.groupby(['grp1', 'grp2']).apply(test_func).head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48
Does the groupby step operate within each partition rather than looking over the full dataframe? If so, it would be trivial to set npartitions=1, and that doesn't seem to impact performance much, but since read_csv automatically sets the number of partitions, how should I set up the call to ensure the groupby results are accurate?
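For what it's worth, this is a rough sketch of the single-partition workaround I have in mind, reusing the df and test_func defined above; the comparison against plain pandas is just the sanity check I'd want to pass:

# Rough sketch of the single-partition workaround (uses df and test_func
# from the example above): keep everything in one partition so the groupby
# sees all rows of each group at once.
test_dd_single = dd.from_pandas(df, npartitions=1)
dask_result = test_dd_single.groupby(['grp1', 'grp2']).apply(test_func).compute()

# Sanity check against plain pandas as the ground truth.
pandas_result = df.groupby(['grp1', 'grp2']).apply(test_func)
print(dask_result.sort_index().equals(pandas_result.sort_index()))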
Thanks!
I am surprised by this result. groupby.apply should return the same results regardless of the number of partitions. If you can supply a reproducible example, I encourage you to raise an issue and one of the developers will take a look.
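Something along the following lines would make a good self-contained reproduction to attach to such an issue; this is only a sketch of what I'd expect the script to look like, with the frame seeded so the numbers are identical between runs:

# Repro sketch: run the same groupby/apply at several partition counts
# and compare each result to the plain-pandas output.
import numpy as np
import pandas as pd
import dask.dataframe as dd

np.random.seed(0)
df = pd.DataFrame({'a': np.arange(100),
                   'b': np.random.randn(100),
                   'c': np.random.randn(100),
                   'grp1': np.repeat([1, 2], 50),
                   'grp2': np.repeat([3, 4, 5, 6], 25)})

def test_func(x):
    x['new_col'] = len(x[x['b'] > 0.]) / len(x['b'])
    return x

expected = df.groupby(['grp1', 'grp2']).apply(test_func)

for n in (1, 2, 5, 10, 100):
    ddf = dd.from_pandas(df, npartitions=n)
    result = ddf.groupby(['grp1', 'grp2']).apply(test_func).compute()
    match = result.sort_index().equals(expected.sort_index())
    print('npartitions=%d: matches pandas: %s' % (n, match))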