numpy - Python pandas groupby agg function on datetime columns -
there bogus behaviour @ .agg()
function in pandas 0.17.1 , numpy 1.10.1. problem can found in datatype conversion. here example dataframe:
import datetime dt df_data = [[1,1,2] , ['foo', 'bar', 'bar'], [1,2,3] , [dt.date(2015,1,5), dt.date(2015,1,3), dt.date(2015,1,15)] , [dt.datetime(2015,1,5,23,16,59), dt.datetime(2015,1,5,21,16,59) , dt.datetime(2015,1,5,20,16,59)] , [dt.timedelta(minutes=60), dt.timedelta(minutes=120), dt.timedelta(minutes=12)]] df_cols = ['gb', 'a_string', 'a_int', 'a_date', 'a_datetime', 'a_timedelta'] df = pd.dataframe({i:j i,j in zip(df_cols, df_data)}) print(df[df_cols])
the result is:
gb a_string a_int a_date a_datetime a_timedelta 0 1 foo 1 2015-01-05 2015-01-05 23:16:59 01:00:00 1 1 bar 2 2015-01-03 2015-01-05 21:16:59 02:00:00 2 2 bar 3 2015-01-15 2015-01-05 20:16:59 00:12:00
i want calculate example min()
, other statistics. can use several groupby object attribute stats (let's see min()
):
df_g = df.groupby('gb').min() print(df_g[df_cols[1:]])
the result correct:
a_string a_int a_date a_datetime a_timedelta gb 1 bar 1 2015-01-03 2015-01-05 21:16:59 01:00:00 2 bar 3 2015-01-15 2015-01-05 20:16:59 00:12:00
or can use agg()
function, this, datatype conversion error occurs @ timedelta
objects:
df_g_2 = df.groupby('gb').agg({i:np.min in df_cols[1:]}) print(df_g_2[df_cols[1:]])
this result:
a_string a_int a_date a_datetime a_timedelta gb 1 bar 1 2015-01-03 2015-01-05 21:16:59 3600000000000 2 bar 3 2015-01-15 2015-01-05 20:16:59 720000000000
the same error occurs @ datetime
objects if there nat
values present in group. let's see new dataframe:
df_data = [[1,1,2] , ['foo', 'bar', 'bar'], [1,2,3] , [dt.date(2015,1,5), dt.date(2015,1,3), dt.date(2015,1,15)] , [pd.nat, pd.nat , dt.datetime(2015,1,5,20,16,59)] , [dt.timedelta(minutes=60), dt.timedelta(minutes=120), dt.timedelta(minutes=12)]] df_cols = ['gb', 'a_string', 'a_int', 'a_date', 'a_datetime', 'a_timedelta'] df = pd.dataframe({i:j i,j in zip(df_cols, df_data)}) print(df[df_cols])
the result is:
gb a_string a_int a_date a_datetime a_timedelta 0 1 foo 1 2015-01-05 nat 01:00:00 1 1 bar 2 2015-01-03 nat 02:00:00 2 2 bar 3 2015-01-15 2015-01-05 20:16:59 00:12:00
the min()
works perfectly:
a_string a_int a_date a_datetime a_timedelta gb 1 bar 1 2015-01-03 nat 01:00:00 2 bar 3 2015-01-15 2015-01-05 20:16:59 00:12:00
but agg()
gives wrong values:
a_string a_int a_date a_datetime a_timedelta gb 1 bar 1 2015-01-03 nan 3600000000000 2 bar 3 2015-01-15 1420489019000000000.0000 720000000000
be careful & have nice day!
Comments
Post a Comment