python 2.7 - Convert from irregular frequency to monthly within groupby objects in pandas dataframe -
i have dataframe df containing userid, date of observation (usually quarterly frequency, irregular) , characteristic value, example:
from util.dates import dates, to_date import pandas pd df = pd.dataframe(dict( refissuerid=[11590] * 3 + [115948] * 4, availabledate=[to_date(d) d in (20050613, 20050905, 20051214, 20040924, 20041101, 20050202,20050516)], characteristic=[0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04])) userid date characteristic 115950 6/13/2005 0.06 115950 9/5/2005 0.09 115950 12/14/2005 0.07 115948 9/24/2004 0.13 115948 11/1/2004 0.09 115948 2/2/2005 0.06 115948 5/16/2005 0.04
i trying upsample monthly frequency within userid groups. is, looking smth (sorted userid , date)
userid date characteristic month_date 115950 6/13/2005 0.06 6/30/2005 115950 6/13/2005 0.06 7/31/2005 115950 6/13/2005 0.06 8/31/2005 115950 9/5/2005 0.09 9/30/2005 115950 9/5/2005 0.09 10/31/2005 115950 9/5/2005 0.09 11/30/2005 115950 12/14/2005 0.07 12/31/2005 115950 12/14/2005 0.07 1/31/2006 115950 12/14/2005 0.07 2/28/2006 115948 9/24/2004 0.13 9/30/2004 115948 9/24/2004 0.13 10/31/2004 115948 11/1/2004 0.09 11/30/2004 115948 11/1/2004 0.09 12/31/2004 115948 11/1/2004 0.09 1/31/2005 115948 2/2/2005 0.06 2/28/2005 115948 2/2/2005 0.06 3/31/2005 115948 2/2/2005 0.06 4/30/2005 115948 5/16/2005 0.04 5/31/2005 115948 5/16/2005 0.04 6/30/2005 115948 5/16/2005 0.04 7/31/2005
note record 115948 9/24/2004 0.13
gets upsampled twice because next available date 11/1/2004
, generates month_date of 11/30/2004
in upsampled set.
tried applying resample on groupby dataframe:
newdf=df.groupby(['userid']).resample("m",fill_method='ffill')
but not produce desired result. guidance/advice appreciated.
you can use resample
reset_index
:
import pandas pd df_dg = pd.dataframe(dict( userid=[11590] * 3 + [115948] * 4, date=[20050613, 20050905, 20051214, 20040924, 20041101, 20050202,20050516], characteristic=[0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04]), columns=['userid','date','characteristic']) df_dg['date'] = pd.to_datetime(df_dg['date'], format="%y%m%d") print df_dg userid date characteristic 0 11590 2005-06-13 0.06 1 11590 2005-09-05 0.09 2 11590 2005-12-14 0.07 3 115948 2004-09-24 0.13 4 115948 2004-11-01 0.09 5 115948 2005-02-02 0.06 6 115948 2005-05-16 0.04 df_dg['date1'] = df_dg['date'] newdf = df_dg.groupby('userid').apply(lambda x: x.set_index('date').resample('m', how='first',fill_method='ffill')).reset_index(drop=true, level=0).reset_index() newdf = newdf.rename(columns={'date':'month_date', 'date1':'date'}) newdf = newdf[['userid','date','characteristic','month_date']]
print newdf userid date characteristic month_date 0 11590 2005-06-13 0.06 2005-06-30 1 11590 2005-06-13 0.06 2005-07-31 2 11590 2005-06-13 0.06 2005-08-31 3 11590 2005-09-05 0.09 2005-09-30 4 11590 2005-09-05 0.09 2005-10-31 5 11590 2005-09-05 0.09 2005-11-30 6 11590 2005-12-14 0.07 2005-12-31 7 115948 2004-09-24 0.13 2004-09-30 8 115948 2004-09-24 0.13 2004-10-31 9 115948 2004-11-01 0.09 2004-11-30 10 115948 2004-11-01 0.09 2004-12-31 11 115948 2004-11-01 0.09 2005-01-31 12 115948 2005-02-02 0.06 2005-02-28 13 115948 2005-02-02 0.06 2005-03-31 14 115948 2005-02-02 0.06 2005-04-30 15 115948 2005-05-16 0.04 2005-05-31
Comments
Post a Comment