python - What is the optimal chunksize in pandas read_csv to maximize speed?
I am using a 20 GB (compressed) .csv file, from which I load a couple of columns using pandas pd.read_csv() with a chunksize=10,000 parameter.
However, this parameter is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.
Any ideas?
chunksize only tells you the number of rows per chunk, so it's meaningless to make a rule of thumb based on that alone. To get a memory size, you'd have to convert it into memory per chunk or per row, by looking at your number of columns, their dtypes, and the size of each; use df.describe(), or here's an idiom for per-row memory usage by column:
print('df memory usage by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
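For example, here's a minimal sketch of turning that per-row estimate into a chunksize (the file name, column names, sample size, and the 200 MB-per-chunk budget are placeholders, not from the original answer):

import pandas as pd

# Read a small sample to estimate per-row memory usage (sample size is arbitrary).
sample = pd.read_csv('data.csv.gz', usecols=['col_a', 'col_b'], nrows=100_000)

# deep=True accounts for the actual size of Python objects such as strings.
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# Pick a chunksize that keeps each chunk under a chosen memory budget.
memory_budget = 200 * 1024 ** 2          # 200 MB per chunk, arbitrary choice
chunksize = int(memory_budget // bytes_per_row)
print(f'estimated bytes per row: {bytes_per_row:.1f}, chunksize: {chunksize}')

for chunk in pd.read_csv('data.csv.gz', usecols=['col_a', 'col_b'], chunksize=chunksize):
    ...  # process each chunk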
Or else, use your OS (top / Task Manager / Activity Monitor) to see how much memory is actually being used, make sure you're not using up all your free memory, and leave yourself a margin of safety below that.
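If you'd rather check free memory programmatically than eyeball it, here's a minimal sketch using psutil (the library choice and the 50% safety margin are my assumptions, not part of the original answer):

import psutil

# Available physical memory right now, in bytes.
available = psutil.virtual_memory().available

# Margin of safety: only plan to spend a fraction of what is free (50% is arbitrary).
budget = int(available * 0.5)
print(f'available: {available / 1024**2:.0f} MB, per-chunk budget: {budget / 1024**2:.0f} MB')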
One issue is pandas's handling of missing/NaN values: Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 column or 1 byte for an np.int8 column. Even one NaN value in an entire column causes this, and the dtypes, converters, and na_values arguments of pandas.read_csv() will not prevent the np.nan and will ignore the desired dtype. A workaround is to manually post-process each chunk before inserting it into the dataframe.
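A minimal sketch of that post-processing workaround, assuming a nominally-integer column 'col_a' whose NaNs can be replaced with a sentinel of -1 (the file name, column names, and sentinel are assumptions):

import numpy as np
import pandas as pd

chunks = []
for chunk in pd.read_csv('data.csv.gz', usecols=['col_a', 'col_b'], chunksize=100_000):
    # NaNs force the column to float64; fill them, then downcast back to the intended dtype.
    chunk['col_a'] = chunk['col_a'].fillna(-1).astype(np.int32)
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)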
(And use the standard pandas tricks, like specifying dtypes for each column, and using converters rather than pd.Categorical if you want to reduce a column from 48 bytes down to 1 or 4.)
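For instance, a sketch of passing explicit dtypes and a converter to pd.read_csv (the file name, column names, and the status_codes mapping are assumptions):

import numpy as np
import pandas as pd

# Map a low-cardinality string column to small integer codes instead of storing Python strings.
status_codes = {'ok': 0, 'error': 1, 'unknown': 2}

reader = pd.read_csv(
    'data.csv.gz',
    usecols=['col_a', 'status'],
    dtype={'col_a': np.float32},                                # explicit dtype per column
    converters={'status': lambda s: status_codes.get(s, -1)},   # replace strs with integer codes
    chunksize=100_000,
)
for chunk in reader:
    chunk['status'] = chunk['status'].astype(np.int8)  # downcast the converted codes to 1 byte
    ...  # process each chunk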