Of *all* the programs in this thread that I've tested so far, **the OCaml version is the fastest** and also among the **shortest**. (Line-of-code-based measurements are a little fuzzy, but it's not *clearly longer* than the Python version or the C or C++ versions, and it *is* clearly faster.)

> Note: I figured out why my earlier runtimes were so nondeterministic! My CPU heatsink was clogged with dust and my CPU was overheating as a result. Now I am getting nice deterministic benchmark times. I think I've redone all the timing measurements in this thread now that I have a reliable way to time things.

Here are the timings for the different versions so far, running on a 27-million-row, 630-megabyte input data file. I'm on Ubuntu Intrepid Ibex on a dual-core 1.6GHz Celeron, running a 32-bit version of the OS (the Ethernet driver was broken in the 64-bit version). I ran each program five times and report the range of times those five tries took. I'm using Python 2.5.2, OpenJDK 1.6.0.0, OCaml 3.10.2, GCC 4.3.2, SBCL 1.0.8.debian, and Octave 3.0.1.

- SquareCog's Pig version: not yet tested (because I can't just `apt-get install pig`), *7* lines of code.
- mjv's pure SQL version: not yet tested, but I predict a runtime of several days; *7* lines of code.
- ygrek's OCaml version: **68.7 seconds** ±0.9 in *15* lines of code.
- My Python version: **169 seconds** ±4, or **86 seconds** ±2 with Psyco, in *16* lines of code.
- abbot's heap-based Python version: **177 seconds** ±5 in *18* lines of code, or **83 seconds** ±5 with Psyco.
- My C version below, composed with GNU `sort -n`: **90 + 5.5 seconds** (±3, ±0.1), but gives the wrong answer because of a deficiency in GNU `sort`, in *22* lines of code (including one line of shell).
- hrnt's C++ version: **217 seconds** ±3 in *25* lines of code.
- mjv's alternative SQL-based procedural approach: not yet tested, *26* lines of code.
- mjv's first SQL-based procedural approach: not yet tested, *29* lines of code.
- peufeu's [Python version with Psyco](http://gist.github.com/194877 "My modified version as a gist"): **181 seconds** ±4, somewhere around *30* lines of code.
- Rainer Joswig's Common Lisp version: **478 seconds** (only run once) in *42* lines of code.
- abbot's `noop.py`, which intentionally gives incorrect results to establish a lower bound: not yet tested, *15* lines of code.
- Will Hartung's Java version: **96 seconds** ±10 in, according to David A. Wheeler's SLOCCount, *74* lines of code.
- Greg's Matlab version: doesn't work.
- Schuyler Erle's suggestion of using Pyrex on one of the Python versions: not yet tried.

I suspect abbot's version comes out relatively worse for me than for them because the real dataset has a highly nonuniform distribution: as I said, some `aa` values (“players”) have thousands of lines, while others have only one.

About Psyco: I applied Psyco to my original code (and to abbot's version) by putting the code in a `main` function, which by itself cut the time down to about 140 seconds, and calling `psyco.full()` before calling `main()`. This added about four lines of code.
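Roughly, that change looks like the following (a minimal sketch of the restructuring described above, not the exact code from the thread; `main` stands in for the original filtering loop):

```python
import psyco          # Psyco: specializing JIT compiler for 32-bit Python 2.x

def main():
    # ... the original top-5-per-key filtering loop goes here, unchanged ...
    pass

if __name__ == '__main__':
    psyco.full()      # compile everything Psyco can handle
    main()            # the hot loop now runs inside a compiled function
```

Moving the loop into a function helps even without Psyco, since local-variable lookups are cheaper than globals in CPython; presumably that accounts for the drop to about 140 seconds on its own.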
I can **almost** solve the problem using GNU `sort`, as follows:

```
kragen@inexorable:~/devel$ time LANG=C sort -nr infile -o sorted

real    1m27.476s
user    0m59.472s
sys     0m8.549s
kragen@inexorable:~/devel$ time ./top5_sorted_c < sorted > outfile

real    0m5.515s
user    0m4.868s
sys     0m0.452s
```

Here `top5_sorted_c` is this short C program:

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum { linesize = 1024 };

char buf[linesize];
char key[linesize];              /* last key seen */

int main() {
  int n = 0;
  char *p;

  while (fgets(buf, linesize, stdin)) {
    for (p = buf; *p && !isspace(*p); p++)  /* find end of key on this line */
      ;
    if (p - buf != strlen(key) || 0 != memcmp(buf, key, p - buf))
      n = 0;                     /* this is a new key */
    n++;

    if (n <= 5)                  /* copy up to five lines for each key */
      if (fputs(buf, stdout) == EOF) abort();

    if (n == 1) {                /* save new key in `key` */
      memcpy(key, buf, p - buf);
      key[p - buf] = '\0';
    }
  }
  return 0;
}
```

I first tried writing that program in C++ as follows, and I got runtimes which were substantially slower, at 33.6 ±2.3 seconds instead of 5.5 ±0.1 seconds:

```cpp
#include <map>
#include <iostream>
#include <string>

int main() {
  using namespace std;
  int n = 0;
  string prev, aa, bb, cc;

  while (cin >> aa >> bb >> cc) {
    if (aa != prev) n = 0;
    ++n;
    if (n <= 5) cout << aa << " " << bb << " " << cc << endl;
    prev = aa;
  }
  return 0;
}
```

I did say **almost**. The problem is that `sort -n` does okay for most of the data, but it fails when it's trying to compare `0.33` with `3.78168e-05`. So to get this kind of performance and actually solve the problem, I need a better sort.
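To make the failure mode concrete (this is my own rough sketch, not one of the benchmarked programs): `sort -n` compares only the leading decimal number on each line and ignores any exponent, so `3.78168e-05` is ordered as if it were `3.78168`:

```python
# Rough illustration of why `sort -nr` mis-orders scientific notation.
# sort_n_key approximates what GNU sort -n compares: the leading decimal
# number on each line, with any exponent suffix ignored.
import re

def sort_n_key(line):
    m = re.match(r'\s*-?\d*\.?\d*', line)
    text = m.group().strip()
    return float(text) if text else 0.0

lines = ['0.33 somedata\n', '3.78168e-05 otherdata\n']

print(sorted(lines, key=sort_n_key, reverse=True))
# 3.78168e-05 sorts first because it is compared as 3.78168 -- the wrong order

print(sorted(lines, key=lambda l: float(l.split()[0]), reverse=True))
# 0.33 sorts first -- the order a correct numeric sort would produce
```

A sort that parses the whole number, exponent included, would give the second ordering, which is the ordering a "better sort" would need to produce.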
Anyway, I kind of feel like I'm whining, but the sort-and-filter approach is about 5× faster than the Python program, while the elegant STL program from hrnt is actually a little slower; there seems to be some kind of gross inefficiency in `<iostream>`. I don't know where the other 83% of the runtime is going in that little C++ version of the filter, but it isn't going anywhere useful, which makes me suspect I don't know where it's going in hrnt's `std::map` version either. Could that version be sped up 5× too? Because that would be pretty cool. Its working set *might* be bigger than my L2 cache, but as it happens it probably isn't.

Some investigation with callgrind says my filter program in C++ is executing 97% of its instructions inside of `operator >>`. I can identify at least 10 function calls per input byte, and `cin.sync_with_stdio(false);` doesn't help. This probably means I could get hrnt's C++ program to run substantially faster by parsing input lines more efficiently.

Edit: kcachegrind claims that hrnt's program executes 62% of its instructions (on a small 157,000-line input file) extracting `double`s from an `istream`. A substantial part of this is because the iostreams library apparently executes about 13 function calls per input byte when trying to parse a `double`. Insane. Could I be misunderstanding kcachegrind's output?

Anyway, any other suggestions?