StackOverflow2013

<p>It looks like Wes may have discovered a known issue in <code>data.table</code> when the number of unique strings (<em>levels</em>) is large: 10,000.</p> <p>Does <code>Rprof()</code> reveal most of the time spent in the call <code>sortedmatch(levels(i[[lc]]), levels(x[[rc]]))</code>? This isn't really the join itself (the algorithm), but a preliminary step.</p> <p>Recent efforts have gone into allowing character columns in keys, which should resolve that issue by integrating more closely with R's own global string hash table. Some benchmark results are already reported by <code>test.data.table()</code> but that code isn't hooked up yet to replace the levels to levels match.</p> <p>Are pandas merges faster than <code>data.table</code> for regular integer columns? That should be a way to isolate the algorithm itself vs factor issues.</p> <p>Also, <code>data.table</code> has <em>time series merge</em> in mind. Two aspects to that: i) multi column <em>ordered</em> keys such as (id,datetime) ii) fast prevailing join (<code>roll=TRUE</code>) a.k.a. last observation carried forward.</p> <p>I'll need some time to confirm as it's the first I've seen of the comparison to <code>data.table</code> as presented.</p> <hr> <p><strong>UPDATE from data.table v1.8.0 released July 2012</strong></p> <ul> <li>Internal function sortedmatch() removed and replaced with chmatch() when matching i levels to x levels for columns of type 'factor'. This preliminary step was causing a (known) significant slowdown when the number of levels of a factor column was large (e.g. &gt;10,000). Exacerbated in tests of joining four such columns, as demonstrated by Wes McKinney (author of Python package Pandas). Matching 1 million strings of which 600,000 are unique is now reduced from 16s to 0.5s, for example.</li> </ul> <p>Also in that release:</p> <ul> <li><p>character columns are now allowed in keys and are preferred to factor. 
data.table() and setkey() no longer coerce character to factor. Factors are still supported. Implements FR#1493, FR#1224 and (partially) FR#951.</p></li> <li><p>New functions chmatch() and %chin%, faster versions of match() and %in% for character vectors. R's internal string cache is utilised (no hash table is built). They are about 4 times faster than match() on the example in ?chmatch.</p></li> </ul> <p>As of Sep 2013 data.table is v1.8.10 on CRAN and we're working on v1.9.0. <strong><a href="https://r-forge.r-project.org/scm/viewvc.php/pkg/NEWS?view=markup&root=datatable" rel="noreferrer">NEWS</a></strong> is updated live.</p> <hr> <p>But as I wrote originally, above:</p> <blockquote> <p><code>data.table</code> has <em>time series merge</em> in mind. Two aspects to that: i) multi column <em>ordered</em> keys such as (id,datetime) ii) fast prevailing join (<code>roll=TRUE</code>) a.k.a. last observation carried forward.</p> </blockquote> <p>So the Pandas equi join of two character columns is probably still faster than data.table, since it sounds like it hashes the two columns combined. data.table doesn't hash the key because it has prevailing ordered joins in mind. A "key" in data.table is literally just the sort order (similar to a clustered index in SQL; i.e., that's how the data is ordered in RAM). Adding secondary keys is on the to-do list, for example.</p> <p>In summary, the glaring speed difference highlighted by this particular two-character-column test with over 10,000 unique strings shouldn't be as bad now, since the known problem has been fixed.</p>
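<p>To make the sort-order point concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration of the idea, not data.table's or pandas' actual implementation: the rows in <code>quotes</code> and the helper names <code>rolling_lookup</code> and <code>exact_lookup</code> are hypothetical. Rows kept sorted by (id, datetime) can answer a prevailing (last observation carried forward) lookup with a binary search, whereas a hash index over the same key can only answer exact matches.</p>

```python
from bisect import bisect_right

# Hypothetical rows, kept sorted by (id, datetime) like an ordered key.
# Integer timestamps are used purely to keep the example short.
quotes = [
    ("A", 1, 10.0),
    ("A", 5, 10.5),
    ("A", 9, 11.0),
    ("B", 2, 20.0),
    ("B", 7, 19.5),
]
keys = [(id_, t) for id_, t, _ in quotes]  # sorted, because quotes is sorted


def rolling_lookup(id_, t):
    """Last value observed for id_ at or before time t, or None if there is none."""
    pos = bisect_right(keys, (id_, t))  # index of the first row strictly after (id_, t)
    if pos == 0:
        return None  # nothing at or before the requested key
    prev_id, _, value = quotes[pos - 1]
    # Only carry the observation forward within the same id.
    return value if prev_id == id_ else None


# A hash index over the same key supports exact (equi) matches only.
exact = {(id_, t): value for id_, t, value in quotes}


def exact_lookup(id_, t):
    """Value stored for exactly (id_, t), or None when that key is absent."""
    return exact.get((id_, t))
```

<p>With these rows, <code>rolling_lookup("A", 6)</code> falls back to the observation at time 5, while <code>exact_lookup("A", 6)</code> finds nothing because no row has exactly that key. That trade-off is the one described above: hashing suits equi joins, sort order suits ordered and rolling joins.</p>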
