pandas - Memory error when running a medium-sized merge function in an IPython (Jupyter) notebook
I'm trying to merge around 100 DataFrames in a loop and I'm getting a memory error. I'm using an IPython/Jupyter notebook.
Here is a sample of the data:
    timestamp   namecoin_cap
0   2013-04-28  5969081
1   2013-04-29  7006114
2   2013-04-30  7049003

Each frame is around 1000 lines long.
Here is the error in detail; I've included the merge function.
I have searched for similar issues, but they seem to involve large arrays (>1 GB), and my data is relatively small in comparison.
Edit: This is suspicious. I wrote a beta program earlier to test 4 DataFrames and exported the result through pickle; it was 500 KB. When I try to export with 100 frames I get a memory error, and the exported file is 2 GB. I suspect that somewhere down the line my code has created some kind of loop that produces a very large file. NB: the 100 frames are stored in a dictionary.
Edit 2: I have exported the script as a .py file.
This .xlsx contains the asset names the script needs.
The script fetches data on various assets, cleans it up, and saves each asset as a DataFrame in a dictionary.
I'd be really appreciative if someone could have a look and see if there's anything wrong. Otherwise, please advise on what tests I can run.
Edit 3: I'm finding it hard to understand why this is happening. The code worked fine in the beta, and all I have done since is add more assets.
Edit 4: I ran a size check on the object (the dict of DataFrames) and it is 1,066,793 bytes.
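A per-frame breakdown can be more telling than one total for the whole dictionary. Below is a minimal sketch of such a check. The helper name `frame_sizes` and the toy `data2` contents are hypothetical stand-ins, not the poster's script.

```python
import pandas as pd


def frame_sizes(frames):
    """Return {name: (row_count, bytes)} for a dict of DataFrames."""
    return {
        name: (len(df), int(df.memory_usage(deep=True).sum()))
        for name, df in frames.items()
    }


# Hypothetical stand-in for the real dictionary of frames.
data2 = {
    "namecoin": pd.DataFrame({
        "timestamp": ["2013-04-28", "2013-04-29", "2013-04-30"],
        "namecoin_cap": [5969081, 7006114, 7049003],
    }),
}

sizes = frame_sizes(data2)
for name, (rows, nbytes) in sizes.items():
    print(name, rows, nbytes)
```

A frame with far more than the expected ~1000 rows would stand out immediately in this listing.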
Edit 5: The problem is in the merge function, at coin 37.

    for coin in coins[:37]:
        data2['merged'] = pd.merge(left=data2['merged'], right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')

This is when the error occurs. 'for coin in coins[:36]:' doesn't produce an error, but 'for coin in coins[:37]:' does. Any ideas?
Edit 6: The 36th element is 'syscoin'. I did coins.remove('syscoin'), but the memory problem still occurs. So the problem seems to be with the 36th element in coins, no matter which coin it is.
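One test that might be worth running, offered as an assumption rather than a confirmed diagnosis: if any frame contains repeated 'timestamp' values, each left merge can multiply the row count, and after a few dozen merges the result can outgrow memory regardless of which coin comes next. The sketch below uses hypothetical toy frames and a hypothetical helper name, `duplicate_timestamp_counts`, to show the check and the effect.

```python
import pandas as pd


def duplicate_timestamp_counts(frames, key="timestamp"):
    """Return {name: number of rows whose key repeats an earlier row}."""
    return {name: int(df[key].duplicated().sum()) for name, df in frames.items()}


# Hypothetical toy frames: both repeat the same timestamp once.
left = pd.DataFrame({"timestamp": ["2013-04-28", "2013-04-28"], "a_cap": [1, 2]})
right = pd.DataFrame({"timestamp": ["2013-04-28", "2013-04-28"], "b_cap": [3, 4]})

dupes = duplicate_timestamp_counts({"a": left, "b": right})
merged = pd.merge(left, right, on="timestamp", how="left")

print(dupes)        # one repeated row in each frame
print(len(merged))  # 2 x 2 matching rows give 4 rows after the merge
```

If the counts are all zero for the real frames, this explanation can be ruled out. pd.merge also accepts validate="one_to_one", which raises an error instead of silently multiplying rows.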
Edit 7: gocards' suggestions seemed to work, but the next part of the code:

    merged = data2['merged']
    merged['total_mc'] = merged.drop('timestamp', axis=1).sum(axis=1)

produces a memory error. I'm stumped.
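For reference, here is what that row-wise sum does on a small frame. This is a sketch only; the 'othercoin_cap' column is a hypothetical second asset. Printing the shape first is a cheap way to see whether the merged frame is already far larger than expected before the sum is attempted.

```python
import pandas as pd

merged = pd.DataFrame({
    "timestamp": ["2013-04-28", "2013-04-29"],
    "namecoin_cap": [5969081, 7006114],
    "othercoin_cap": [100.0, None],  # hypothetical asset with a missing day
})

# Cheap sanity check: roughly 1000 rows and one column per asset is expected.
print(merged.shape)

# Sum every market-cap column per row; missing values are skipped by default.
caps = merged.drop(columns="timestamp")
merged["total_mc"] = caps.sum(axis=1)
print(merged)
```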
In regard to storage, I would recommend using simple CSV over pickle. CSV is a more generic format. It is human readable, and you can check your data quality more easily as the data grows.

    file_template_string = '%s.csv'
    for eachkey in dfdict:
        filename = file_template_string % (eachkey)
        dfdict[eachkey].to_csv(filename)

If you need to date the files, you can put a timestamp in the filename.

    import time
    from datetime import datetime
    cur = time.time()
    cur = datetime.fromtimestamp(cur)
    file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))

There are also some obvious errors in your code.

    for coin in coins:  # lines 61, 89
    for coin in data:   # should be

    df = data2['namecoin']  # line 87
    keys = list(data2.keys())  # list() so that .remove() also works on Python 3
    keys.remove('namecoin')
    for coin in keys:
        df = pd.merge(left=df, right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')
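As an alternative to the merge loop above, the frames could be combined in one step by indexing each on 'timestamp' and concatenating column-wise. This is a different technique from the answer's repeated left merge, and it is only a sketch: the toy `data2` contents are hypothetical, and it assumes that dropping repeated timestamps is acceptable. Note that it keeps the union of all timestamps (an outer join) rather than only those of the first frame.

```python
import pandas as pd

# Hypothetical stand-in for the real dictionary of frames.
data2 = {
    "namecoin": pd.DataFrame({
        "timestamp": ["2013-04-28", "2013-04-29", "2013-04-30"],
        "namecoin_cap": [5969081, 7006114, 7049003],
    }),
    "othercoin": pd.DataFrame({
        "timestamp": ["2013-04-29", "2013-04-30"],
        "othercoin_cap": [100, 200],
    }),
}

# One row per timestamp in each frame, with the timestamp as the index.
indexed = [
    df.drop_duplicates("timestamp").set_index("timestamp")
    for df in data2.values()
]

# Align all frames on the index in a single pass.
wide = (
    pd.concat(indexed, axis=1)
    .sort_index()
    .rename_axis("timestamp")
    .reset_index()
)
print(wide)
```

Because every frame has unique index values before the concat, the result cannot grow beyond the number of distinct timestamps.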