Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available
I'm trying to read the data contained in a .dat file of size ~1.1GB. Because I'm doing this on a machine with 16GB of RAM, I thought it would not be a problem to read the whole file into memory at once and only process it afterwards.
To do this, I employed the slurp function found in this answer. The problem is that the code sometimes, but not always, throws a bad_alloc exception. Looking at the task manager, I see that there is at least 10GB of free memory available, so I don't see how memory could be an issue.
Here is my code that reproduces this error:
    #include <iostream>
    #include <fstream>
    #include <sstream>
    #include <string>

    using namespace std;

    int main()
    {
        ifstream file;
        file.open("big_file.dat");
        if(!file.is_open())
            cerr << "the file not found\n";

        stringstream sstr;
        sstr << file.rdbuf();
        string text = sstr.str();

        cout << "successfully read file!\n";
        return 0;
    }
What could be causing this problem? And what are the best practices to avoid it?
The fact that your system has 16GB doesn't mean any program can allocate a given amount of memory at any time. In fact, this might work on a machine that has only 512MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128GB of RAM; it's totally up to your operating system to decide how much memory is available to you here.
I'd also argue that std::string is never the data type of choice when you are dealing with a file, possibly binary, that is this large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the internal buffer becomes too small to contain the incoming bytes. Also, libc++/libc probably have their own allocators, which add some allocation overhead here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2GB of heap used for this task.
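To make the difference concrete, here is a minimal sketch of reading the file with a single allocation of the exact size, so there is neither a growing internal buffer nor a second copy. The helper name read_whole_file is made up for illustration; it is not the slurp function from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the question): read a whole file into a
// buffer that is sized once, up front, instead of growing step by step.
std::vector<char> read_whole_file(const std::string& path)
{
    // Open in binary mode, positioned at the end, so tellg() gives the size.
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file)
        throw std::runtime_error("cannot open " + path);

    const std::streamoff size = file.tellg();
    if (size < 0)
        throw std::runtime_error("cannot determine size of " + path);
    file.seekg(0, std::ios::beg);

    // The only large allocation: exactly `size` bytes.
    std::vector<char> buffer(static_cast<std::size_t>(size));
    if (size > 0) {
        if (!file.read(buffer.data(), static_cast<std::streamsize>(size)))
            throw std::runtime_error("short read on " + path);
        assert(static_cast<std::streamoff>(file.gcount()) == size);
    }
    return buffer;
}
```

This still asks for one contiguous 1.1GB block, so it can still throw bad_alloc if the OS won't hand that out; it only removes the repeated reallocation and the extra copy.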
Really, if you need to deal with data from a large binary file as something you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file and can work with it as if it were a plain array in memory, letting your OS take care of the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping of a file: mapped_file.
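As a rough illustration of what such a class does, here is a minimal read-only sketch that calls the POSIX mmap interface directly. This is an assumption on my side: it only works on POSIX systems (not on Windows), and the MappedFile class is made up for this example; it is not Boost's mapped_file.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

// Hypothetical read-only wrapper around a memory-mapped file (POSIX only).
class MappedFile {
public:
    explicit MappedFile(const std::string& path)
    {
        const int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0)
            throw std::runtime_error("cannot open " + path);

        struct stat st;
        if (::fstat(fd, &st) != 0) {
            ::close(fd);
            throw std::runtime_error("cannot stat " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);

        if (size_ > 0) {  // mapping zero bytes is an error, so skip it
            void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) {
                ::close(fd);
                throw std::runtime_error("cannot map " + path);
            }
            data_ = static_cast<const char*>(p);
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile()
    {
        if (data_ != nullptr)
            ::munmap(const_cast<char*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    std::size_t size() const { return size_; }
    const char* data() const { return data_; }

    // The file behaves like a plain array; pages are loaded on first access.
    char operator[](std::size_t i) const
    {
        assert(i < size_);
        return data_[i];
    }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Nothing is copied to the heap here: the OS pages the file in as you touch it and can drop those pages again under memory pressure.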
The file I'm reading contains a series of data in ASCII tabular form, i.e.
float1,float2\nfloat3,float4\n....
I'm browsing through the various solutions proposed to deal with this kind of problem, but I was left wondering about this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
Depends. I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables, possibly taking advantage of your OS's pre-fetching and SMP capabilities, in that you don't even get that much of a speed advantage if you'd spawn separate threads for file reading and float conversion. std::copy, used to read from a std::ifstream into a std::vector<float>, should work fine here, as long as the comma separators are accounted for.
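A minimal sketch of that incremental approach, assuming the float1,float2\n layout described above and the default "C" locale. Instead of std::copy it uses a plain extraction loop, because the commas have to be consumed explicitly; parse_pairs and parse_file are names made up for this example.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: parse "a,b\n" rows from a stream into a flat vector
// laid out as [a0, b0, a1, b1, ...]. Stops at end of input or at the first
// malformed row.
std::vector<float> parse_pairs(std::istream& in)
{
    std::vector<float> values;
    float a = 0.0f, b = 0.0f;
    char comma = '\0';
    // operator>> skips whitespace, including the '\n' that ends each row.
    while (in >> a >> comma >> b && comma == ',') {
        values.push_back(a);
        values.push_back(b);
    }
    assert(values.size() % 2 == 0);  // always whole rows
    return values;
}

// Parse straight from disk; only the vector of floats is ever held in memory.
std::vector<float> parse_file(const std::string& path)
{
    std::ifstream file(path);
    if (!file)
        throw std::runtime_error("cannot open " + path);
    return parse_pairs(file);
}
```

The ifstream buffers its reads internally, so reading row by row does not turn into one system call per row.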
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and it is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't that mean reading the file line by line, resulting in a large number of file IO operations?
Yes, and no. First, of course, you will have more context switches than you'd have if you just ordered the whole file to be read at once. But those aren't that expensive; at least, they're going to be much less expensive once you realize that most OSes and libc's know quite well how to optimize reads, and will fetch a whole lot of the file at once if you don't use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1GB in size; that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch, and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float, will still mean very little of a performance hit, because most of the time your read will return pretty much immediately, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: performance seems to be important to you (is it really that important here? You're using a brain-dead file format for interchanging floats, which is bloaty, loses information, and on top of that is slow to parse), yet you'd rather first read the whole file in at once and only then start converting it to numbers. Frankly, if performance were critical to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilo- to megabytes, read up to \n boundaries and exchanged with a thread that creates the in-memory table of floats, sounds like it would basically reduce your read+parse time to read+non-measurable, without sacrificing read performance and without needing gigabytes of RAM just to parse a sequential file.
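A rough sketch of that reader/parser split, assuming the float1,float2\n layout from the question. ChunkQueue and read_and_parse are names made up for this example, the chunk size is yours to pick, and error handling is left out (an exception inside the parser thread would terminate the program).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <istream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical hand-off queue between the reading and the parsing thread.
class ChunkQueue {
public:
    void push(std::string chunk)
    {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(chunk)); }
        cv_.notify_one();
    }
    void close()
    {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_one();
    }
    // Blocks until a chunk arrives; returns false once closed and drained.
    bool pop(std::string& out)
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

// Reads `in` in blocks of chunk_size bytes, cuts them at '\n' boundaries and
// hands them to a second thread that converts the text to floats.
std::vector<float> read_and_parse(std::istream& in, std::size_t chunk_size)
{
    assert(chunk_size > 0);
    ChunkQueue queue;
    std::vector<float> values;  // touched only by the parser until join()

    std::thread parser([&queue, &values] {
        std::string chunk;
        while (queue.pop(chunk)) {
            std::istringstream rows(chunk);
            float a = 0.0f, b = 0.0f;
            char comma = '\0';
            while (rows >> a >> comma >> b && comma == ',') {
                values.push_back(a);
                values.push_back(b);
            }
        }
    });

    std::string carry;  // incomplete last line of the previous block
    std::vector<char> block(chunk_size);
    while (in.read(block.data(), static_cast<std::streamsize>(block.size()))
           || in.gcount() > 0) {
        carry.append(block.data(), static_cast<std::size_t>(in.gcount()));
        const std::size_t cut = carry.rfind('\n');
        if (cut == std::string::npos) continue;  // no complete line yet
        queue.push(carry.substr(0, cut + 1));
        carry.erase(0, cut + 1);
    }
    if (!carry.empty()) queue.push(carry);  // last line without trailing '\n'
    queue.close();
    parser.join();
    return values;
}
```

At any time only a handful of chunks plus the resulting floats are in memory, never the whole 1.1GB of text.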
By the way, to give you an impression of how bad storing floats in ASCII is:
The typical 32-bit single-precision IEEE 754 floating point number has about 6-9 significant decimal digits. Hence, you need at least 6 characters to represent these in ASCII, 1 decimal point (.), typically 1 exponent marker (e.g. e), on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE 754 32-bit floats:
-1.23456e-10
That's an average of 11 characters.
Add 1 , or \n after every number.
Now, a character is 1B, meaning that you blow up your 4B of actual data by a factor of 3, while still losing precision.
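If you want to check that arithmetic yourself, here is a small sketch. The helper names are made up, and the %.5e format (6 significant digits in exponent notation) is my assumption about how such a file would be written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: format a float with 6 significant decimal digits.
std::string to_ascii(float x)
{
    char buf[32];
    const int n = std::snprintf(buf, sizeof buf, "%.5e", x);
    assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
    return std::string(buf, static_cast<std::size_t>(n));
}

// Bytes one value occupies in the text file, including its ',' or '\n'.
std::size_t ascii_bytes(float x)
{
    return to_ascii(x).size() + 1;
}

// True if the value comes back bit-identical after a trip through the text.
bool round_trips(float x)
{
    return std::strtof(to_ascii(x).c_str(), nullptr) == x;
}
```

For the example value above, ascii_bytes(-1.23456e-10f) comes out at 13 bytes against sizeof(float) == 4, and round_trips(1.0f / 3.0f) is false: 6 digits are not enough to get every float back exactly.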
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user who can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.