<p>[UPDATE] 2 years after the question was asked ...</p>
<p>On running the code in the question, <code>data.table</code> is now more helpful and returns this (using 1.8.2):</p>
<pre><code>Error in `[.data.table`(my.dt, , sum(dependent.variable), by = grouping.vars[i]) :
  'by' appears to evaluate to column names but isn't c() or key(). Use by=list(...) if you can. Otherwise, by=eval(grouping.vars[i]) should work. This is for efficiency so data.table can detect which columns are needed.
</code></pre>
<p>and following the advice in the error message:</p>
<pre><code>my.dt[,sum(dependent.variable), by=eval(grouping.vars[i])]
   sex   V1
1:   M 2650
2:   F 2600
</code></pre>
<hr>
<p>Old answer from Jul 2010 (<code>by</code> can now be <code>double</code> and <code>character</code>, though):</p>
<p>Strictly speaking, <code>by</code> needs to evaluate to a list of vectors, each with storage mode integer. So the numeric vector <code>age</code> could also be coerced to integer using <code>as.integer()</code>. This is because data.table uses radix sorting (very fast), but the radix algorithm is specifically for <em>integers only</em> (see Wikipedia's entry for 'radix sort'). Integer storage for key columns and ad hoc <code>by</code> is one of the reasons data.table is fast. A factor is, of course, an integer lookup to unique strings.</p>
<p>The idea behind <code>by</code> being a <code>list()</code> of expressions is that you are not restricted to column names. It is usual to write <em>expressions</em> of column names directly in the <code>by</code>. A common one is to aggregate by month; for example:</p>
<pre><code>DT[,sum(col1), by=list(region,month(datecol))]
</code></pre>
<p>or, a very fast way to group by year-month is to store dates in a non-epoch-based integer format such as yyyymmddL, as seen in some of the examples in the package, like this:</p>
<pre><code>DT[,sum(col1), by=list(region,month=datecol%/%100L)]
</code></pre>
<p>Notice how you can name the columns inside the <code>list()</code> like that.</p>
<p>To define and reuse complex grouping expressions:</p>
<pre><code>e = quote(list(region,month(datecol)))
DT[,sum(col1),by=eval(e)]
DT[,sum(col2*col3/col4),by=eval(e)]
</code></pre>
<p>Or, if the <code>by</code> expressions themselves take a long time to calculate or allocate, or you need to reuse them many times, you can evaluate them once, save the result, and reuse it for efficiency:</p>
<pre><code>byval = DT[,list(region,month(datecol))]
DT[,sum(col1),by=byval]
DT[,sum(col2*col3/col4),by=byval]
</code></pre>
<p>Please see <a href="http://datatable.r-forge.r-project.org/" rel="nofollow noreferrer">http://datatable.r-forge.r-project.org/</a> for the latest info and status. A new presentation will be up there soon, with the hope of releasing v1.5 to CRAN soon too. That release contains several bug fixes and new features, detailed in the NEWS file. The datatable-help list has about 30-40 posts a month, which may be of interest too.</p>
 
