python - what is the optimal chunksize in pandas read_csv to maximize speed?


I'm working with a 20 GB (compressed) .csv file, from which I load a couple of columns using pandas pd.read_csv() with a chunksize parameter of 10,000.

However, this parameter is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.

Any ideas?
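For reference, a minimal sketch of the kind of chunked read described above, with a placeholder file name and column names:

import pandas as pd

# Minimal sketch of the setup described above; the file and column
# names are placeholders.
reader = pd.read_csv(
    "data.csv.gz",               # the 20 GB compressed file
    usecols=["col_a", "col_b"],  # only load a couple of columns
    chunksize=10_000,            # the arbitrary value in question
)

df = pd.concat(reader, ignore_index=True)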

chunksize only tells you the number of rows per chunk, hence it's meaningless to make a rule of thumb on that number alone. To reason about memory, you'd have to convert it into a memory size per chunk or per row, by looking at your number of columns, their dtypes, and the size of each; either use df.describe(), or here's an idiom for a deeper look:

print('df memory usage per column (bytes per row)...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])

or else, use your OS (top / Task Manager / Activity Monitor) to see how much memory is actually being used, and make sure you're not using up all of your free memory; leave yourself a margin of safety on top of that.
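Putting the per-row measurement together with a memory budget, a rough sketch of deriving a chunksize might look like this; the file name, sample size and 256 MB budget are assumptions for illustration:

import pandas as pd

# Estimate bytes per row from one small sample chunk (deep=True also
# counts the Python-object overhead of string columns).
sample = next(pd.read_csv("data.csv.gz", chunksize=1_000))
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# Choose a chunksize that keeps each chunk under an assumed memory
# budget (256 MB here), leaving the rest of RAM as a margin of safety.
memory_budget = 256 * 1024 ** 2
chunksize = max(1, int(memory_budget // bytes_per_row))
print('bytes per row:', bytes_per_row)
print('chunksize for ~256 MB chunks:', chunksize)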

One big issue in pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 or 1 byte for an np.int8 column. Even one NaN value in an entire column will cause that on the whole column, and the pandas.read_csv() dtypes, converters and na_values arguments will not prevent the np.nan and will ignore the desired dtype. A workaround is to manually post-process each chunk before inserting it into the dataframe, as sketched below.
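As a sketch of that workaround (the file name, column name, sentinel fill value and target dtype below are placeholders, not from the answer):

import numpy as np
import pandas as pd

fixed_chunks = []
for chunk in pd.read_csv("data.csv.gz", chunksize=100_000):
    # NaNs force the column up to float64/object; fill them with a
    # sentinel and downcast back to the small dtype we actually want.
    chunk["some_int_col"] = chunk["some_int_col"].fillna(-1).astype(np.int8)
    fixed_chunks.append(chunk)

df = pd.concat(fixed_chunks, ignore_index=True)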

(And use the standard pandas tricks: specifying dtypes for each column, and using converters rather than pd.Categorical if you want to reduce those 48 bytes down to 1 or 4.)
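A sketch of those read_csv arguments, with a placeholder file, placeholder column names and an assumed string-to-code mapping:

import numpy as np
import pandas as pd

# Placeholder mapping from string values to small integer codes.
status_codes = {"ok": 0, "warn": 1, "error": 2}

df = pd.read_csv(
    "data.csv.gz",
    usecols=["id", "status"],
    dtype={"id": np.int32},                                    # 4 bytes per value
    converters={"status": lambda s: status_codes.get(s, -1)},  # str -> small int
)
# Converters don't control the resulting dtype, so downcast afterwards:
df["status"] = df["status"].astype(np.int8)                    # 1 byte per value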

