python - Dask DataFrame Groupby Partitions


I have some large CSV files (~10 GB) and would like to take advantage of Dask for analysis. However, depending on the number of partitions the Dask object is read in with, the groupby results change. My understanding was that Dask takes advantage of partitions for out-of-core processing benefits but still returns the appropriate groupby output. That doesn't seem to be the case, and I'm struggling to work out what alternate settings are necessary. Below is a small example:

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': np.arange(100),
                   'b': np.random.randn(100),
                   'c': np.random.randn(100),
                   'grp1': np.repeat([1, 2], 50),
                   'grp2': np.repeat([3, 4, 5, 6], 25)})

test_dd1 = dd.from_pandas(df, npartitions=1)
test_dd2 = dd.from_pandas(df, npartitions=2)
test_dd5 = dd.from_pandas(df, npartitions=5)
test_dd10 = dd.from_pandas(df, npartitions=10)
test_dd100 = dd.from_pandas(df, npartitions=100)

def test_func(x):
    x['new_col'] = len(x[x['b'] > 0.]) / len(x['b'])
    return x

test_dd1.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd2.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

test_dd5.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.45
1  1 -1.107799  1.075471     1     3     0.45
2  2 -0.719420 -0.574381     1     3     0.45
3  3 -1.287547 -0.749218     1     3     0.45
4  4  0.677617 -0.908667     1     3     0.45

test_dd10.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3      0.5
1  1 -1.107799  1.075471     1     3      0.5
2  2 -0.719420 -0.574381     1     3      0.5
3  3 -1.287547 -0.749218     1     3      0.5
4  4  0.677617 -0.908667     1     3      0.5

test_dd100.groupby(['grp1', 'grp2']).apply(test_func).compute().head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3        0
1  1 -1.107799  1.075471     1     3        0
2  2 -0.719420 -0.574381     1     3        0
3  3 -1.287547 -0.749218     1     3        0
4  4  0.677617 -0.908667     1     3        1

df.groupby(['grp1', 'grp2']).apply(test_func).head()

   a         b         c  grp1  grp2  new_col
0  0 -0.561376 -1.422286     1     3     0.48
1  1 -1.107799  1.075471     1     3     0.48
2  2 -0.719420 -0.574381     1     3     0.48
3  3 -1.287547 -0.749218     1     3     0.48
4  4  0.677617 -0.908667     1     3     0.48

Does the groupby step operate within each partition rather than looking over the full dataframe? In this case it's trivial to set npartitions=1, and that doesn't seem to impact performance, but since read_csv automatically sets the number of partitions, how do you set up the call to ensure the groupby results are accurate?
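For what it's worth, the only workaround I can see is to force everything into one partition after reading, roughly like the sketch below (the file path and blocksize value here are just placeholders, not my actual setup):

import dask.dataframe as dd

# blocksize controls roughly how many bytes go into each partition, so a larger
# value means fewer partitions, but a group can still be split across partitions
ddf = dd.read_csv('my_large_file.csv', blocksize=64 * 1024 * 1024)
print(ddf.npartitions)

# collapsing to a single partition gives the same groupby.apply output as pandas
# (as in the example above), but gives up the out-of-core benefit
ddf = ddf.repartition(npartitions=1)
result = ddf.groupby(['grp1', 'grp2']).apply(test_func).compute()  # test_func as defined above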

Thanks!

I am surprised by this result. Groupby.apply should return the same results regardless of the number of partitions. If you can supply a reproducible example, I encourage you to raise an issue and one of the developers will take a look.
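A minimal, seeded script along the lines below (just a sketch of what such a reproducible example could look like, reusing the frame and function from the question) would also let you check directly whether the partitioned result diverges from pandas:

import numpy as np
import pandas as pd
import dask.dataframe as dd

np.random.seed(0)  # fixed seed so every run produces the same frame

df = pd.DataFrame({'a': np.arange(100),
                   'b': np.random.randn(100),
                   'c': np.random.randn(100),
                   'grp1': np.repeat([1, 2], 50),
                   'grp2': np.repeat([3, 4, 5, 6], 25)})

def test_func(x):
    x['new_col'] = len(x[x['b'] > 0.]) / len(x['b'])
    return x

expected = df.groupby(['grp1', 'grp2']).apply(test_func)

# compare the dask result against the pandas result for several partition counts;
# sort_index() is used because dask may return the rows in a different order
for n in [1, 2, 5, 10, 100]:
    ddf = dd.from_pandas(df, npartitions=n)
    result = ddf.groupby(['grp1', 'grp2']).apply(test_func).compute()
    print(n, result.sort_index().equals(expected.sort_index()))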

