pandas - Timestamp String to Seconds Conversion in Python -
i'm beginner in python data science. i'm working on clickstream data , want find out duration of session. find start time , end time of session. on subtraction, i'm getting wrong answer same. here data
sid tstamp itemid category 0 1 2014-04-07t10:51:09.277z 214536502 0 1 1 2014-04-07t10:54:09.868z 214536500 0 2 1 2014-04-07t10:54:46.998z 214536506 0 3 1 2014-04-07t10:57:00.306z 214577561 0 4 2 2014-04-07t13:56:37.614z 214662742 0 5 2 2014-04-07t13:57:19.373z 214662742 0 6 2 2014-04-07t13:58:37.446z 214825110 0 7 2 2014-04-07t13:59:50.710z 214757390 0 8 2 2014-04-07t14:00:38.247z 214757407 0 9 2 2014-04-07t14:02:36.889z 214551617 0 10 3 2014-04-02t13:17:46.940z 214716935 0 11 3 2014-04-02t13:26:02.515z 214774687 0 12 3 2014-04-02t13:30:12.318z 214832672 0
i referred question code- timestamp conversion
here code-
k.columns=['sid','tstamp','itemid','category'] k=k.loc[:,('sid','tstamp')] #find max timestamp idx=k.groupby(['sid'])['tstamp'].transform(max) == k['tstamp'] ah=k[idx].reset_index() #find min timestamp idy=k.groupby(['sid'])['tstamp'].transform(min) == k['tstamp'] ai=k[idy].reset_index() #grouping sid , applying count retain distinct sid values kgrp=k.groupby('sid').count() i=0 temp1,temp2 in zip(ah['tstamp'],ai['tstamp']): sv1= datetime.datetime.strptime(temp1, "%y-%m-%dt%h:%m:%s.%fz") sv2= datetime.datetime.strptime(temp2, "%y-%m-%dt%h:%m:%s.%fz") d1=time.mktime(sv1.timetuple()) + (sv1.microsecond / 1000000.0) d2=time.mktime(sv2.timetuple()) + (sv2.microsecond / 1000000.0) kgrp.loc[i,'duration']= d1-d2 i=i+1
here output.
kgrp out[5]: tstamp duration sid 1 4 359.275 2 6 745.378 3 3 1034.468
for session id 2, duration should close 6 minutes i'm getting 12 minutes. reckon i'm making silly mistake here.
also, i'm grouping sid , applying count on sid column , store each duration separate column. there easier method through can store sid (not 'tstamp' count column) , duration values?
you assigning duration value wrong label. in test data sid starts 1 i
starts 0:
# sid 1, == 0 kgrp.loc[i,'duration']= d1-d2 i=i+1
update
a more pythonic way handle :)
def calculate_duration(dt1, dt2): # calculation here, return duration in seconds k = k.loc[:, ('sid', 'tstamp')] result = k.groupby(['sid'])['tstamp'].agg({ 'duration': lambda x: calculate_duration(x.max(), x.min()), 'count': lambda x: x.count() })
Comments
Post a Comment