Of *all* the programs in this thread that I've tested so far, **the OCaml version is the fastest** and also among the **shortest**. (Line-of-code-based measurements are a little fuzzy, but it's not *clearly longer* than the Python version or the C or C++ versions, and it *is* clearly faster.)

> Note: I figured out why my earlier runtimes were so nondeterministic! My CPU heatsink was clogged with dust and my CPU was overheating as a result. Now I am getting nice deterministic benchmark times. I think I've redone all the timing measurements in this thread now that I have a reliable way to time things.

Here are the timings for the different versions so far, running on a 27-million-row, 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.

- SquareCog's Pig version: not yet tested (because I can't just `apt-get install pig`), *7* lines of code.
- mjv's pure SQL version: not yet tested, but I predict a runtime of several days; *7* lines of code.
- ygrek's OCaml version: **68.7 seconds** ±0.9 in *15* lines of code.
- My Python version: **169 seconds** ±4, or **86 seconds** ±2 with Psyco, in *16* lines of code.
- abbot's heap-based Python version: **177 seconds** ±5 in *18* lines of code, or **83 seconds** ±5 with Psyco.
- My C version below, composed with GNU `sort -n`: **90 + 5.5 seconds** (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU `sort`, in *22* lines of code (including one line of shell).
- hrnt's C++ version: **217 seconds** ±3 in *25* lines of code.
- mjv's alternative SQL-based procedural approach: not yet tested, *26* lines of code.
- mjv's first SQL-based procedural approach: not yet tested, *29* lines of code.
- peufeu's [Python version with Psyco](http://gist.github.com/194877 "My modified version as a gist"): **181 seconds** ±4, somewhere around *30* lines of code.
- Rainer Joswig's Common Lisp version: **478 seconds** (only run once) in *42* lines of code.
- abbot's `noop.py`, which intentionally gives incorrect results to establish a lower bound: not yet tested, *15* lines of code.
- Will Hartung's Java version: **96 seconds** ±10 in, according to David A. Wheeler's SLOCCount, *74* lines of code.
- Greg's Matlab version: doesn't work.
- Schuyler Erle's suggestion of using Pyrex on one of the Python versions: not yet tried.

I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some `aa` values (“players”) have thousands of lines, while others have only one.

About Psyco: I applied Psyco to my original code (and to abbot's version) by putting the code in a `main` function, which by itself cut the time down to about 140 seconds, and calling `psyco.full()` before calling `main()`. This added about four lines of code.
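Roughly, that change looks like the following (a minimal sketch of the restructuring described above, not the exact code from the thread; `main` stands in for the original filtering loop):

```python
import psyco          # Psyco: specializing JIT compiler for 32-bit Python 2.x

def main():
    # ... the original top-5-per-key filtering loop goes here, unchanged ...
    pass

if __name__ == '__main__':
    psyco.full()      # compile everything Psyco can handle
    main()            # the hot loop now runs inside a compiled function
```

Moving the loop into a function helps even without Psyco, since local-variable lookups are cheaper than globals in CPython; presumably that accounts for the drop to about 140 seconds on its own.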
I can **almost** solve the problem using GNU `sort`, as follows:

```
kragen@inexorable:~/devel$ time LANG=C sort -nr infile -o sorted

real    1m27.476s
user    0m59.472s
sys     0m8.549s
kragen@inexorable:~/devel$ time ./top5_sorted_c < sorted > outfile

real    0m5.515s
user    0m4.868s
sys     0m0.452s
```

Here `top5_sorted_c` is this short C program:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum { linesize = 1024 };

char buf[linesize];
char key[linesize];              /* last key seen */

int main() {
  int n = 0;
  char *p;

  while (fgets(buf, linesize, stdin)) {
    for (p = buf; *p && !isspace(*p); p++)  /* find end of key on this line */
      ;
    if (p - buf != strlen(key) || 0 != memcmp(buf, key, p - buf))
      n = 0;                     /* this is a new key */
    n++;

    if (n <= 5)                  /* copy up to five lines for each key */
      if (fputs(buf, stdout) == EOF) abort();

    if (n == 1) {                /* save new key in `key` */
      memcpy(key, buf, p - buf);
      key[p - buf] = '\0';
    }
  }
  return 0;
}
```

I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6 ±2.3 seconds instead of 5.5 ±0.1 seconds:

```cpp
#include <map>
#include <iostream>
#include <string>

int main() {
  using namespace std;
  int n = 0;
  string prev, aa, bb, cc;

  while (cin >> aa >> bb >> cc) {
    if (aa != prev) n = 0;
    ++n;
    if (n <= 5) cout << aa << " " << bb << " " << cc << endl;
    prev = aa;
  }
  return 0;
}
```

I did say **almost**. The problem is that `sort -n` does okay for most of the data, but it fails when it's trying to compare `0.33` with `3.78168e-05`. So to get this kind of performance and actually solve the problem, I need a better sort.
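To make the failure mode concrete (this is my own rough sketch, not one of the benchmarked programs): `sort -n` compares only the leading decimal number on each line and ignores any exponent, so `3.78168e-05` is ordered as if it were `3.78168`:

```python
# Rough illustration of why `sort -nr` mis-orders scientific notation.
# sort_n_key approximates what GNU sort -n compares: the leading decimal
# number on each line, with any exponent suffix ignored.
import re

def sort_n_key(line):
    m = re.match(r'\s*-?\d*\.?\d*', line)
    text = m.group().strip()
    return float(text) if text else 0.0

lines = ['0.33 somedata\n', '3.78168e-05 otherdata\n']

print(sorted(lines, key=sort_n_key, reverse=True))
# 3.78168e-05 sorts first because it is compared as 3.78168 -- the wrong order

print(sorted(lines, key=lambda l: float(l.split()[0]), reverse=True))
# 0.33 sorts first -- the order a correct numeric sort would produce
```

A sort that parses the whole number, exponent included, would give the second ordering, which is the ordering a "better sort" would need to produce.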
Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower; there seems to be some kind of gross inefficiency in `<iostream>`. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's `std::map` version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set *might* be bigger than my L2 cache, but as it happens it probably isn't.

Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of `operator >>`. I can identify at least 10 function calls per input byte, and `cin.sync_with_stdio(false);` doesn't help. This probably means I could get hrnt's C++ program to run substantially faster by parsing input lines more efficiently.

Edit: kcachegrind claims that hrnt's program executes 62% of its instructions (on a small 157,000-line input file) extracting `double`s from an `istream`. A substantial part of this is because the iostreams library apparently executes about 13 function calls per input byte when trying to parse a `double`. Insane. Could I be misunderstanding kcachegrind's output?

Anyway, any other suggestions?