numpy - Python pandas groupby agg function on datetime columns -


there bogus behaviour @ .agg() function in pandas 0.17.1 , numpy 1.10.1. problem can found in datatype conversion. here example dataframe:

import datetime dt  df_data = [[1,1,2]     , ['foo', 'bar', 'bar'], [1,2,3]     , [dt.date(2015,1,5), dt.date(2015,1,3), dt.date(2015,1,15)]     , [dt.datetime(2015,1,5,23,16,59), dt.datetime(2015,1,5,21,16,59)     , dt.datetime(2015,1,5,20,16,59)]     , [dt.timedelta(minutes=60), dt.timedelta(minutes=120), dt.timedelta(minutes=12)]]  df_cols = ['gb', 'a_string', 'a_int', 'a_date', 'a_datetime', 'a_timedelta']  df = pd.dataframe({i:j i,j in zip(df_cols, df_data)})  print(df[df_cols]) 

the result is:

   gb a_string  a_int      a_date          a_datetime  a_timedelta 0   1      foo      1  2015-01-05 2015-01-05 23:16:59     01:00:00 1   1      bar      2  2015-01-03 2015-01-05 21:16:59     02:00:00 2   2      bar      3  2015-01-15 2015-01-05 20:16:59     00:12:00 

i want calculate example min() , other statistics. can use several groupby object attribute stats (let's see min()):

df_g = df.groupby('gb').min() print(df_g[df_cols[1:]]) 

the result correct:

   a_string  a_int      a_date          a_datetime  a_timedelta gb                                                              1       bar      1  2015-01-03 2015-01-05 21:16:59     01:00:00 2       bar      3  2015-01-15 2015-01-05 20:16:59     00:12:00 

or can use agg() function, this, datatype conversion error occurs @ timedelta objects:

df_g_2 = df.groupby('gb').agg({i:np.min in df_cols[1:]}) print(df_g_2[df_cols[1:]]) 

this result:

   a_string  a_int      a_date          a_datetime    a_timedelta gb                                                                1       bar      1  2015-01-03 2015-01-05 21:16:59  3600000000000 2       bar      3  2015-01-15 2015-01-05 20:16:59   720000000000 

the same error occurs @ datetime objects if there nat values present in group. let's see new dataframe:

df_data = [[1,1,2]     , ['foo', 'bar', 'bar'], [1,2,3]     , [dt.date(2015,1,5), dt.date(2015,1,3), dt.date(2015,1,15)]     , [pd.nat, pd.nat     , dt.datetime(2015,1,5,20,16,59)]     , [dt.timedelta(minutes=60), dt.timedelta(minutes=120), dt.timedelta(minutes=12)]]  df_cols = ['gb', 'a_string', 'a_int', 'a_date', 'a_datetime', 'a_timedelta']  df = pd.dataframe({i:j i,j in zip(df_cols, df_data)})  print(df[df_cols]) 

the result is:

   gb a_string  a_int      a_date          a_datetime  a_timedelta 0   1      foo      1  2015-01-05                 nat     01:00:00 1   1      bar      2  2015-01-03                 nat     02:00:00 2   2      bar      3  2015-01-15 2015-01-05 20:16:59     00:12:00 

the min() works perfectly:

   a_string  a_int      a_date          a_datetime  a_timedelta gb                                                              1       bar      1  2015-01-03                 nat     01:00:00 2       bar      3  2015-01-15 2015-01-05 20:16:59     00:12:00 

but agg() gives wrong values:

   a_string  a_int      a_date               a_datetime    a_timedelta gb                                                                     1       bar      1  2015-01-03                      nan  3600000000000 2       bar      3  2015-01-15 1420489019000000000.0000   720000000000 

be careful & have nice day!


Comments

Popular posts from this blog

sublimetext3 - what keyboard shortcut is to comment/uncomment for this script tag in sublime -

java - No use of nillable="0" in SOAP Webservice -

ubuntu - Laravel 5.2 quickstart guide gives Not Found Error -