<p>Suppose you can fit <code>1/k</code>th of the file into memory and still have room for working data structures. The whole file can then be processed in <code>k</code> or fewer passes, as below, and depending on line lengths and sort-algorithm constants this has a chance of being much faster than sorting the whole file: sorting averages <code>O(n ln n)</code> comparisons, while the process below is <code>O(k n)</code> worst case. For example, if lines average 10 characters and there are n = 5G lines, <code>ln(n) ~ 22.3</code>. Moreover, if your output file <code>B</code> is much smaller than the input file <code>A</code>, the process will probably take only one or two passes.</p>
<p>Process:</p>
<ol>
<li>Allocate a few megabytes for an input buffer I, a few gigabytes for a result buffer R, and a gigabyte or so for a hash table H. Open input file F and output file O.</li>
<li>Repeat: fill I from F and process it into R, via step 3.</li>
<li>For each line L in I, check whether L is already in H and R. If so, go on to the next L; otherwise add L to R and its hash to H.</li>
<li>When R is full, say with M entries, write it to O. Then repeatedly fill I from F, dedup as in step 3, and write to O. At EOF(F), go to step 5.</li>
<li>Repeat (using the old O as input F and a new O for output): read M lines from F and copy them to O, then load R and H as in steps 2 and 3, and copy to EOF(F) with dedup as before. Set M to the new number of non-duplicated lines at the beginning of each O file.</li>
</ol>
<p>Note that after each pass, the first M lines of O contain no duplicates, and none of those M lines is duplicated in the rest of O. Thus at least <code>1/k</code>th of the original file is processed per pass, so processing takes at most <code>k</code> passes.</p>
<p><strong>Update 1:</strong> Instead of repeatedly writing out and reading back in the already-processed leading lines, a separate output file P should be used, to which the process buffer R is appended at the end of each pass.
This cuts the amount of reading and writing by a factor of about <code>k/2</code> when the result file B is nearly as large as A, or by somewhat less when B is much smaller than A; in no case does it increase the amount of I/O.</p>
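<p>The multi-pass scheme (in its "Update 1" form) can be sketched in Python. The names <code>dedup</code> and <code>max_unique</code> are illustrative, and for simplicity the sketch keeps whole lines in one set rather than splitting them into a result buffer R plus a hash table H; a real implementation would size the set to available memory.</p>

```python
import os
import tempfile

def dedup(src_path, dst_path, max_unique):
    """Multi-pass dedup with bounded memory (sketch of the Update 1 variant).

    Each pass keeps at most max_unique distinct lines in memory, appends
    them to dst_path, and spills the not-yet-claimed remainder to a temp
    file that becomes the input for the next pass.
    """
    work = src_path
    with open(dst_path, "w") as out:
        while True:
            seen = set()                     # plays the role of R + H combined
            fd, spill = tempfile.mkstemp(text=True)
            leftover = 0
            with open(work) as f, os.fdopen(fd, "w") as sp:
                for line in f:
                    if line in seen:
                        continue             # duplicate of a claimed line
                    if len(seen) < max_unique:
                        seen.add(line)       # claim it for the final output
                        out.write(line)
                    else:
                        sp.write(line)       # defer to a later pass
                        leftover += 1
            if work != src_path:
                os.remove(work)              # done with last pass's spill file
            if leftover == 0:
                os.remove(spill)
                return
            work = spill
```

<p>Each pass that does not finish claims exactly <code>max_unique</code> new unique lines, so a file with U unique lines needs at most <code>ceil(U / max_unique)</code> passes, matching the <code>k</code>-pass bound above; the output preserves first-occurrence order.</p>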