    <h3>Data generation</h3> <p>I created 10 files each with 300,000 lines, each line containing 11 characters (mainly digits) and a newline, using this Perl script:</p> <pre><code>#!/usr/bin/env perl use strict; use warnings; # Generate sorted, merge-worthy data die "Usage: num-files lines-per-file" unless scalar(@ARGV) == 2; my $num_files = $ARGV[0]; my $num_lines = $ARGV[1]; die "Enter a number of files between 2 and 999" unless $num_files &gt;= 2 &amp;&amp; $num_files &lt;= 999; die "Enter a number of lines between 2 and 9,999,999" unless $num_files &gt;= 2 &amp;&amp; $num_files &lt;= 9_999_999; my $base = "datafile"; my $digits = length($num_files . ''); my $format = "$base.%.${digits}d"; foreach my $i (1..$num_files) { my $file = sprintf $format, $i; open my $fh, '&gt;', $file or die "Failed to create $file"; foreach my $n (1..$num_lines) { printf $fh "%.7d-%.3d\n", $n + 60 * $i, $i; } close $fh; } </code></pre> <p>It took Perl 5.18.0 about 1.5 seconds to generate the 10 files (on a MacBook Pro, 2.3 GHz Intel Core i7, 16 GiB 1333 MHz RAM, good old-fashioned spinning magnetic disk; useful, but not incredibly powerful). The data is designed so that there is overlap between the files, but also each line is unique and identifies the file which it comes from (that's the purpose of the <code>$offset</code> and putting the sequential number within the file before the file number). In testing 3 files, say, there is some data from file 1 alone before there is data from file 1 and 2 interleaved, and then from all 3 files, then file 1 runs out, and file 2. It gives decent coverage of the different parts of the merge algorithm (but you could create more scenarios).</p> <p>I then created a <code>merge</code> program as shown in the 'Code' section below. There's nothing very fancy about it. It lifts some heap management algorithms from Jon Bentley's 'More Programming Pearls'; the originals are quoted as comments. The only tricky bit is the sense of the comparison routine: that caused me some wrong answers to start with, and the levels of indirection.</p> <h3>Timing Results</h3> <p>When run on the 10 files representing ~35 MiB of data:</p> <pre><code>$ ls -l datafile.?? -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.01 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.02 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.03 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.04 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.05 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.06 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.07 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.08 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 14:09 datafile.09 -rw-r--r-- 1 jleffler staff 3600000 Sep 15 20:37 datafile.10 $ $ for file in datafile.??; do echo $file; sort -c $file; done datafile.01 datafile.02 datafile.03 datafile.04 datafile.05 datafile.06 datafile.07 datafile.08 datafile.09 datafile.10 $ time ./merge datafile.?? &gt; datafile.out real 0m0.510s user 0m0.314s sys 0m0.074s $ time ./merge datafile.?? &gt; datafile.out real 0m0.513s user 0m0.317s sys 0m0.080s $ time sort -m datafile.?? &gt; datafile.merge real 0m10.061s user 0m9.960s sys 0m0.083s $ time sort -m datafile.?? 
### Timing Results

When run on the 10 files representing ~35 MiB of data:

```
$ ls -l datafile.??
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.01
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.02
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.03
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.04
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.05
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.06
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.07
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.08
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 14:09 datafile.09
-rw-r--r--  1 jleffler  staff  3600000 Sep 15 20:37 datafile.10
$
$ for file in datafile.??; do echo $file; sort -c $file; done
datafile.01
datafile.02
datafile.03
datafile.04
datafile.05
datafile.06
datafile.07
datafile.08
datafile.09
datafile.10
$ time ./merge datafile.?? > datafile.out

real    0m0.510s
user    0m0.314s
sys     0m0.074s
$ time ./merge datafile.?? > datafile.out

real    0m0.513s
user    0m0.317s
sys     0m0.080s
$ time sort -m datafile.?? > datafile.merge

real    0m10.061s
user    0m9.960s
sys     0m0.083s
$ time sort -m datafile.?? > datafile.merge

real    0m10.039s
user    0m9.921s
sys     0m0.082s
$ ls -l datafile.*
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.01
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.02
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.03
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.04
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.05
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.06
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.07
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.08
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 14:09 datafile.09
-rw-r--r--  1 jleffler  staff   3600000 Sep 15 20:37 datafile.10
-rw-r--r--  1 jleffler  staff  36000000 Sep 15 20:42 datafile.merge
-rw-r--r--  1 jleffler  staff  36000000 Sep 15 20:41 datafile.out
$ cmp datafile.out datafile.merge
$ sort -c datafile.out
$
```

Interpreting these results:

1. The first `ls` listing shows 10 data files.
2. The `for` loop checks that the input files are individually in sorted order.
3. The `merge` program I wrote runs in about 0.5s.
4. The system `sort` in merge mode (`-m`) runs in about 10s (and I'm astonished that it takes so long when my non-optimized code handles it so much quicker).
5. The second `ls` listing shows that the two output files are the same size.
6. The `cmp` command shows that the output files are identical.
7. The `sort -c` command shows that the output files are in sorted order.

I separately and subsequently created 101 files, each with 1,000,000 records of 12 bytes, in 52 seconds. I merged those files in 20 seconds, generating approximately 1179 MiB (1,236,000,000 bytes) of output. The system `sort` merged them in 467 seconds (7m47s), a.k.a. 'forever'. It took about 90 seconds for `sort -c` to check that the output file was in sorted order, and `cmp` took less than 2 seconds to compare the two large files.

I'm intrigued that the system `sort` is so slow on merge.
### Interpretation of the results

- On my Mac, it is possible to merge ~35 MiB of data from 10 input files in about half a second.
- The system `sort` can do the job in about ten seconds.
- By inference (given that my machine is no longer a rocket, if it ever was), it should be possible to merge ~35 MiB on a Windows machine in under twenty seconds; that allows for a factor of 40 disparity in performance, which I don't think you should need.
- So, your code taking thirty seconds is taking far too long, and the question is 'why'. You should look hard at your algorithms and data structures; the sketch after this list shows the kind of per-line cost to look for.
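For illustration (this is my sketch, not code from the question or from the program below): with k input files, choosing the next output line by scanning every file's current line costs O(k) per line, while a heap costs O(log k); over n total output lines that is O(n·k) versus O(n·log k). A sketch of the linear-scan selection, reusing the `File` type and the `inputs[1..k]` convention from the 'Code' section:

```c
/* Naive k-way merge selection: scan every input's current line for
** the smallest. With 10 files of 300,000 lines, this inner scan runs
** three million times; any per-line cost is multiplied accordingly.
** Returns the index of the input holding the smallest line, or 0
** once every input is exhausted (line == 0 marks a finished file).
*/
static size_t pick_min(File *inputs, size_t nfiles)
{
    size_t best = 0;
    for (size_t i = 1; i <= nfiles; i++)
    {
        if (inputs[i].line == 0)
            continue;
        if (best == 0 || strcmp(inputs[i].line, inputs[best].line) < 0)
            best = i;
    }
    return best;
}
```

For 10 files the scan itself is still cheap; the point is that whatever a merge does per output line happens three million times here, so avoidable per-line work dominates the run time.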
### Code

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct File
{
    const char *file;
    FILE       *fp;
    char       *line;
    size_t      reserve;    /* Space allocated for line */
    size_t      length;     /* Space used by current line */
} File;

extern void err_exit(const char *fmt, ...);
extern void read_line(File *current);
extern void write_line(File *current);
extern void heapify(size_t *heap, size_t heap_size, File *inputs);
extern void siftup(size_t *heap, size_t lo, size_t hi, File *inputs);
extern void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs);
extern int  compare(File *inputs, size_t i1, size_t i2);

const char *arg0;

int main(int argc, char **argv)
{
    File inputs[argc];
    arg0 = argv[0];

    /* Open files for reading - load first lines */
    for (int i = 1; i < argc; i++)
    {
        File *current = &inputs[i];
        current->file = argv[i];
        if ((current->fp = fopen(current->file, "r")) == 0)
            err_exit("failed to open file %s for reading", current->file);
        current->reserve = 128;
        if ((current->line = malloc(current->reserve)) == 0)
            err_exit("failed to allocate %zu bytes memory", current->reserve);
        current->length = 0;
        read_line(current);
    }

#if defined(CHECK_FUNDAMENTALS)
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s\n", i, inputs[i].length, inputs[i].file);
#endif

    size_t heap_size = argc - 1;
    size_t heap[argc];      // heap[0] unused
    for (int i = 1; i < argc; i++)
        heap[i] = i;

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap before:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

    heapify(heap, heap_size, inputs);

#if defined(CHECK_FUNDAMENTALS)
    printf("Heap after:\n");
    for (int i = 1; i < argc; i++)
        printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif

#if defined(CHECK_FUNDAMENTALS)
    printf("Compare two lines:\n");
    printf("1: %s\n", inputs[1].line);
    printf("2: %s\n", inputs[2].line);
    int r12 = compare(inputs, 1, 2);
    int r21 = compare(inputs, 2, 1);
    int r11 = compare(inputs, 1, 1);
    printf("1 vs 2: %d\n", r12);
    printf("2 vs 1: %d\n", r21);
    printf("1 vs 1: %d\n", r11);
#endif

    while (heap_size > 0)
    {
        File *current = &inputs[heap[1]];
        write_line(current);
        read_line(current);
        if (current->line == 0)
            heap[1] = heap[heap_size--];
        if (heap_size > 0)
        {
            siftdown(heap, 1, heap_size, inputs);
#if defined(CHECK_FUNDAMENTALS)
            printf("Heap check:\n");
            for (int i = 1; i < argc; i++)
                printf("%d: %zu - %s", i, heap[i], inputs[heap[i]].line);
#endif
        }
    }

    return 0;
}

int compare(File *inputs, size_t i1, size_t i2)
{
    return strcmp(inputs[i1].line, inputs[i2].line);
}

void heapify(size_t *heap, size_t heap_size, File *inputs)
{
    for (size_t i = 1; i <= heap_size; i++)
        siftup(heap, 1, i, inputs);
}

/* J Bentley: More Programming Pearls
**
** Heap in array x, indexed from 1, not 0 as is normal in C.
** Implementation will allocate but not use array[0].
**
** function siftup(l, u, i, p) {
**     # pre maxheap(l, u-1)
**     # post maxheap(l, u)
**     i = u
**     while (1) {
**         # maxheap(l, u) except between i and its parent
**         if (i <= l) break
**         p = int(i/2)
**         if (x[p] >= x[i]) break
**         swap(p, i)
**         i = p
**     }
** }
**
** function siftdown(l, u, i, c) {
**     # pre maxheap(l+1, u)
**     # post maxheap(l, u)
**     i = l
**     while (1) {
**         # maxheap(l, u) except between i and its children
**         c = 2*i
**         if (c > u) break
**         if (c + 1 <= u && x[c+1] > x[c]) c++
**         if (x[i] >= x[c]) break
**         swap(c, i)
**         i = c
**     }
** }
*/

void siftup(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = hi;
    while (1)
    {
        if (i <= lo)
            break;
        size_t p = i / 2;
        if (compare(inputs, heap[p], heap[i]) <= 0)
            break;
        size_t t = heap[p];
        heap[p] = heap[i];
        heap[i] = t;
        i = p;
    }
}

void siftdown(size_t *heap, size_t lo, size_t hi, File *inputs)
{
    size_t i = lo;
    while (1)
    {
        size_t c = 2 * i;
        if (c > hi)
            break;
        if (c + 1 <= hi && compare(inputs, heap[c+1], heap[c]) < 0)
            c++;
        if (compare(inputs, heap[i], heap[c]) <= 0)
            break;
        size_t t = heap[c];
        heap[c] = heap[i];
        heap[i] = t;
        i = c;
    }
}

void read_line(File *current)
{
    char buffer[4096];
    if (fgets(buffer, sizeof(buffer), current->fp) != 0)
    {
        size_t length = strlen(buffer) + 1;
        if (length > current->reserve)
        {
            void *space = realloc(current->line, length);
            if (space == 0)
                err_exit("failed to reallocate %zu bytes memory", length);
            current->line = space;
            current->reserve = length;
        }
        strcpy(current->line, buffer);
        current->length = length - 1;
    }
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}

void write_line(File *current)
{
    if (fwrite(current->line, sizeof(char), current->length, stdout) != current->length)
        err_exit("short write of line from %s of length %zu",
                 current->file, current->length);
}

void err_exit(const char *fmt, ...)
{
    int errnum = errno;
    va_list args;
    fprintf(stderr, "%s: ", arg0);
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    if (errnum != 0)
        fprintf(stderr, " (%d: %s)", errnum, strerror(errnum));
    putc('\n', stderr);
    exit(EXIT_FAILURE);
}
```

The code is designed to handle lines up to 4 KiB long; it would not be hard to revise it to use POSIX [`getline()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html) to handle even longer lines. It assumes that all the files can be open at once (which means an upper limit of around 250 input files on most machines without tweaking limits). It will stop if it can't open all the files, rather than doing multiple merges to intermediate files.
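The `getline()` revision mentioned above might look like this; it is a sketch, not part of the original program, and assumes a POSIX system (on some systems you may need to compile with `-D_POSIX_C_SOURCE=200809L` to expose `getline()`):

```c
/* Drop-in replacement for read_line() using POSIX getline(); the File
** struct and the rest of the program are unchanged. getline() grows
** current->line as needed, so the 4 KiB limit disappears.
*/
void read_line(File *current)
{
    ssize_t length = getline(&current->line, &current->reserve, current->fp);
    if (length != -1)
        current->length = (size_t)length;   /* includes the newline */
    else
    {
        fclose(current->fp);
        current->fp = 0;
        free(current->line);
        current->line = 0;
        current->length = 0;
        current->reserve = 0;
    }
}
```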