When faced with this kind of problem, you can adopt one of two methods:

1. *The method you suggested*: for each record you read, **write out the record number** (or the position returned by `ftell` on the input file) to a separate *bookmark* file. To ensure that you resume exactly where you left off, so as not to introduce duplicate records, you must `fflush` after every write (to both the bookmark and the output/reject files). This, and unbuffered write operations in general, slows down the typical (no-failure) scenario significantly. For completeness' sake, note that you have three ways of writing to your bookmark file:
   - `fopen(..., "w") / fwrite / fclose` - extremely slow
   - `rewind / truncate / fwrite / fflush` - marginally faster
   - `rewind / fwrite / fflush` - *somewhat faster*; you may skip `truncate` since the record number (or `ftell` position) will always be as long as or longer than the previous record number (or `ftell` position) and will completely overwrite it, provided you truncate the file once at startup **(this answers your original question; see the sketch below)**
2. Assume everything will go well *in most cases*; when resuming after failure, simply **count the number of records already output** (normal output plus rejects), and skip an equivalent number of records from the input file.
   - This keeps the typical (no-failure) scenario very fast, without significantly compromising performance in resume-after-failure scenarios.
   - You do not need to `fflush` files, or at least not as often. You still need to `fflush` the main output file before switching to writing to the rejects file, and `fflush` the rejects file before switching back to writing to the main output file (probably a few hundred or thousand times for a 500k-record input). Simply remove the last unterminated line from the output/reject files; everything up to that line will be consistent.

**I strongly recommend method #2.** The writing entailed by method #1 (whichever of the three variants) is extremely expensive compared to any additional (buffered) reads required by method #2 (`fflush` can take several milliseconds; multiply that by 500k and you get minutes, whereas counting the number of lines in a 500k-record file takes mere seconds and, what's more, the filesystem cache is working *with*, not against, you on that).
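For reference, here is a minimal sketch of that fastest bookmark variant (`rewind / fwrite / fflush`), assuming newline-terminated records shorter than the buffer; the file names are hypothetical:

```c
#include <stdio.h>

int main(void)
{
    FILE *in = fopen("input.dat", "r");          /* hypothetical names */
    FILE *bookmark = fopen("bookmark.dat", "w"); /* truncate once at startup */
    if (!in || !bookmark)
        return 1;

    char record[4096];
    while (fgets(record, sizeof record, in)) {
        /* ... process the record and write it to output/rejects here ... */

        long pos = ftell(in);           /* position just past the record read */
        rewind(bookmark);               /* overwrite the previous bookmark; no
                                           truncate needed, since the new value
                                           is at least as long as the old one */
        fprintf(bookmark, "%ld\n", pos);
        fflush(bookmark);               /* this per-record flush is exactly
                                           what makes method #1 so expensive */
    }
    fclose(bookmark);
    fclose(in);
    return 0;
}
```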
---

**EDIT** Just wanted to clarify the exact steps you need to implement method #2:

- When writing to the output and rejects files respectively, you only need to flush when switching from writing to one file to writing to the other. Consider the following scenario as an illustration of the necessity of these flushes-on-file-switch:
  - suppose you write 1000 records to the main output file, then
  - you have to write 1 line to the rejects file, without manually flushing the main output file first, then
  - you write 200 more lines to the main output file, without manually flushing the rejects file first, then
  - the runtime *automatically* flushes the main output file for you, because you have accumulated a large volume of data in its buffers, i.e. 1200 records
    - *but* the runtime has not yet automatically flushed the rejects file to disk for you, as its buffer only contains one record, which is not sufficient volume to trigger an automatic flush
  - your program crashes at this point
  - you resume and count 1200 records in the main output file (the runtime flushed those out for you), but 0 (!) records in the rejects file (not flushed)
  - you resume processing the input file at record #1201, assuming you only had 1200 records successfully processed to the main output file; the rejected record would be lost, and the 1200th valid record would be repeated
  - *you do not want this!*
- Now consider manually flushing when switching between the output/rejects files (as in the sketch after this list):
  - suppose you write 1000 records to the main output file, then
  - you encounter one invalid record which belongs in the rejects file; the last record was valid, so you are switching to writing to the rejects file: flush the main output file before writing to the rejects file
  - you now write 1 line to the rejects file, then
  - you encounter one valid record which belongs in the main output file; the last record was invalid, so you are switching to writing to the main output file: flush the rejects file before writing to the main output file
  - you write 200 more lines to the main output file, without manually flushing the rejects file first, then
  - assume that the runtime did not automatically flush anything for you, because the 200 records buffered since the last manual flush of the main output file are not enough to trigger an automatic flush
  - your program crashes at this point
  - you resume and count 1000 valid records in the main output file (you manually flushed those before switching to the rejects file), and 1 record in the rejects file (you manually flushed that before switching back to the main output file)
  - having counted 1001 records already processed, you correctly resume processing the input file at record #1002, which is the first valid record immediately after the invalid one
  - you reprocess the next 200 valid records because they were not flushed, but you get no missing records and no duplicates either
- If you are not happy with the interval between the runtime's automatic flushes, you may also do manual flushes every 100 or every 1000 records. This depends on whether processing a record is more expensive than flushing or not (if processing is more expensive, flush often, maybe after each record; otherwise only flush when switching between the output/rejects files).
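A minimal sketch of that flush-on-switch loop, assuming a hypothetical `is_valid` predicate and newline-terminated records shorter than the buffer:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical validity check, for illustration only:
 * require at least one tab-separated field. */
static int is_valid(const char *record)
{
    return strchr(record, '\t') != NULL;
}

int main(void)
{
    FILE *in  = fopen("input.dat", "r");   /* hypothetical names */
    FILE *out = fopen("output.dat", "a");
    FILE *rej = fopen("rejects.dat", "a");
    if (!in || !out || !rej)
        return 1;

    FILE *last = NULL;                     /* file we last wrote to */
    char record[4096];
    while (fgets(record, sizeof record, in)) {
        FILE *dest = is_valid(record) ? out : rej;
        if (last && dest != last)
            fflush(last);                  /* flush only when switching files */
        fputs(record, dest);
        last = dest;
    }

    fclose(rej);                           /* fclose flushes what remains */
    fclose(out);
    fclose(in);
    return 0;
}
```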
- Resuming from failure:
  - open the output file and the rejects file *for both reading and writing*, and begin by reading and counting each record (say in `records_resume_counter`) until you reach the end of file
  - *unless you were flushing after **each** record you output*, you will also need to perform a bit of special treatment for the last record in both the output and the rejects file:
    - before reading a record from the interrupted output/rejects file, remember the position you are at in that file (use `ftell`); let's call it `last_valid_record_ends_here`
    - read the record and validate that it is not a partial record (i.e. that the runtime did not flush the file up to the *middle* of a record)
    - *if you have one record per line, this is easily verified by checking that the last character in the record is a line feed or carriage return (`\n` or `\r`)*
      - if the record is complete, increment the records counter and proceed with the next record (or end of file, whichever comes first)
      - if the record is partial, `fseek` back to `last_valid_record_ends_here` and stop reading from this output/rejects file; do not increment the counter; proceed to the next output or rejects file unless you have gone through all of them
  - open the input file for reading and skip `records_resume_counter` records from it
    - continue processing and outputting to the output/rejects files; this will automatically append to the output/rejects file where you left off reading/counting already processed records
    - if you had to perform special processing for partially flushed records, the next record you output will overwrite its partial information from the previous run (at `last_valid_record_ends_here`); you will have no duplicate, garbage or missing records
  - the sketch below puts these resume steps together
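A sketch of that resume pass, under the same assumptions as above; the streams are kept open so that the next write lands at `last_valid_record_ends_here`, overwriting any trailing partial record:

```c
#include <stdio.h>
#include <string.h>

/* Open an interrupted output/rejects file for reading and writing, count its
 * complete records, and leave the stream positioned just after the last
 * complete one so the next write overwrites any trailing partial record. */
static FILE *open_and_count(const char *path, long *count)
{
    FILE *f = fopen(path, "r+");
    *count = 0;
    if (!f)
        return NULL;

    char record[4096];
    for (;;) {
        long last_valid_record_ends_here = ftell(f);
        if (!fgets(record, sizeof record, f)) {
            /* end of file: all records were complete; the seek also satisfies
               C's requirement of a seek between reading and writing in "r+" */
            fseek(f, 0L, SEEK_CUR);
            break;
        }
        size_t len = strlen(record);
        if (record[len - 1] != '\n' && record[len - 1] != '\r') {
            /* partial record: seek back so it gets overwritten on resume */
            fseek(f, last_valid_record_ends_here, SEEK_SET);
            break;
        }
        ++*count;
    }
    return f;
}

int main(void)
{
    long out_count, rej_count;
    FILE *out = open_and_count("output.dat", &out_count); /* hypothetical */
    FILE *rej = open_and_count("rejects.dat", &rej_count);
    FILE *in  = fopen("input.dat", "r");
    if (!out || !rej || !in)
        return 1;

    /* skip the records that were already processed in the previous run */
    long records_resume_counter = out_count + rej_count;
    char record[4096];
    for (long i = 0; i < records_resume_counter; i++)
        if (!fgets(record, sizeof record, in))
            break;

    /* ... resume the normal processing loop from here; writes to `out` and
       `rej` continue exactly where the counting stopped ... */
    fclose(in);
    fclose(out);
    fclose(rej);
    return 0;
}
```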