How do I read a huge .gz file (more than 5 GB uncompressed) in C?
<p>I have some .gz compressed files which are around 5-7 GB uncompressed. These are flat files.</p>
<p>I've written a program that takes an uncompressed file and reads it line by line, which works perfectly.</p>
<p>Now I want to be able to open the compressed files in memory and run my little program.</p>
<p>I've looked into zlib but I can't find a good solution.</p>
<p>Loading the entire file is impossible using gzread(gzFile,void *,unsigned), because of the 32-bit unsigned int limitation.</p>
<p>I've tried gzgets, but this almost doubles the execution time compared with reading in using gzread. (I tested on a 2 GB sample.)</p>
<p>I've also looked into "buffering", such as splitting the gzread process into multiple 2 GB chunks, finding the last newline using strrchr, and then seeking back with gzseek. But gzseek will emulate a total file decompression, which is very slow.</p>
<p>I fail to see any sane solution to this problem. I could always check whether the current line actually has a newline (a missing newline should only occur in the last, partially read line), and then read more data from the point in the program where this occurs. But this could get very ugly.</p>
<p>Does anyone have any suggestions?</p>
<p>Thanks</p>
<p>Edit: I don't need to have the entire file at once, just one line at a time, but I have a fairly huge machine, so if loading everything was the easiest option I would have no problems.</p>
<p>For all those who suggest piping to stdin, I've experienced extreme slowdowns compared to opening the file. Here is a small code snippet I made some months ago that illustrates it.</p>
<pre><code>time ./a.out 59846/59846.txt
# 59846/59846.txt
18255221

real 0m4.321s
user 0m2.884s
sys 0m1.424s

time ./a.out &lt;59846/59846.txt
18255221

real 1m56.544s
user 1m55.043s
sys 0m1.512s
</code></pre>
<p>And the source code</p>
<pre><code>#include &lt;cstdio&gt;
#include &lt;fstream&gt;
#include &lt;iostream&gt;

#define LENS 10000

int main(int argc, char **argv) {
  std::istream *pFile;
  if (argc == 2)  // if argument supplied
    pFile = new std::ifstream(argv[1], std::ios::in);
  else  // if we want to use stdin
    pFile = &amp;std::cin;

  char line[LENS];

  if (argc == 2)  // if we are using a filename, print it.
    printf("#\t%s\n", argv[1]);

  if (!*pFile) {  // test the stream state; the pointer itself is never null
    printf("Do you have permission to open file?\n");
    return 0;
  }

  int numRow = 0;
  // note: this also counts the final failed read if the file ends with a newline
  while (!pFile-&gt;eof()) {
    numRow++;
    pFile-&gt;getline(line, LENS);
  }

  if (argc == 2)
    delete pFile;
  printf("%d\n", numRow);
  return 0;
}
</code></pre>
<p>Thanks for your replies, I'm still waiting for the golden apple.</p>
<p>Edit 2: using C-style FILE pointers instead of C++ streams is much, much faster. So I think this is the way to go.</p>
<p>Thanks for all your input</p>
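<p>The "read a chunk, then carry the partial last line over to the next read" idea from the question can be sketched as below. This is only a minimal illustration under stated assumptions, not a tested zlib solution: it reads from a plain C <code>FILE *</code> with <code>fread</code>, and the names <code>for_each_line</code>, <code>line_fn</code> and <code>count_bytes</code> are made up for this example. With zlib, the <code>fread</code> call would presumably be swapped for <code>gzread</code>, which likewise returns the number of uncompressed bytes placed in the buffer, so no <code>gzseek</code> should be needed.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: receives one line WITHOUT its '\n'.
 * The line is not NUL-terminated; use `len`. */
typedef void (*line_fn)(const char *line, size_t len, void *user);

/* Example callback (hypothetical): adds up the length of every line. */
static void count_bytes(const char *line, size_t len, void *user)
{
    (void)line;
    *(size_t *)user += len;
}

/* Reads `in` in chunks of at most `cap` bytes into the caller's buffer and
 * calls `fn` once per line. A partial line at the end of a chunk is moved to
 * the front of the buffer and completed by the next read.
 * Returns the number of lines, or -1 if a single line does not fit in `cap`. */
static long for_each_line(FILE *in, char *buf, size_t cap,
                          line_fn fn, void *user)
{
    size_t have = 0;  /* bytes in buf: carried-over tail plus newly read data */
    size_t got;
    long lines = 0;

    assert(buf != NULL && cap > 0);

    /* With zlib this read would presumably become gzread(file, buf + have, ...). */
    while ((got = fread(buf + have, 1, cap - have, in)) > 0) {
        char *start = buf;
        char *nl;

        have += got;
        while ((nl = memchr(start, '\n', have - (size_t)(start - buf))) != NULL) {
            fn(start, (size_t)(nl - start), user);
            lines++;
            start = nl + 1;
        }

        have -= (size_t)(start - buf);  /* size of the unfinished tail */
        if (have == cap)
            return -1;                  /* no newline in a full buffer */
        memmove(buf, start, have);      /* carry the tail to the front */
    }

    if (have > 0) {                     /* last line had no trailing newline */
        fn(buf, have, user);
        lines++;
    }
    return lines;
}
```

<p>A caller would allocate one buffer of a few megabytes, open the input, and call <code>for_each_line(fp, buf, cap, count_bytes, &amp;total)</code>. Since only the unfinished tail is copied between reads, the buffer size is independent of the file size, which sidesteps the 32-bit length limit mentioned above.</p>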
 
