
Writing functions (procedures) for data.table objects
    <p>In the book <em>Software for Data Analysis: Programming with R</em>, John Chambers emphasizes that functions should generally not be written for their side effects; rather, a function should return a value without modifying any variables in its calling environment. Conversely, good scripts that use data.table objects should specifically avoid object assignment with <code>&lt;-</code>, which is typically how the result of a function is stored.</p>
    <p>First, a technical question. Imagine an R function called <code>proc1</code> that accepts a <code>data.table</code> object <code>x</code> as its argument (in addition to, maybe, other parameters). <code>proc1</code> returns NULL but modifies <code>x</code> using <code>:=</code>. From what I understand, calling <code>proc1(x=x1)</code> makes a copy of <code>x1</code> just because of the way that promises work. However, as demonstrated below, the original object <code>x1</code> is still modified by <code>proc1</code>. Why/how is this?</p>
<pre><code>&gt; require(data.table)
&gt; x1 &lt;- CJ(1:2, 2:3)
&gt; x1
   V1 V2
1:  1  2
2:  1  3
3:  2  2
4:  2  3
&gt; proc1 &lt;- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
&gt; proc1(x1)
NULL
&gt; x1
   V1 V2 y
1:  1  2 2
2:  1  3 3
3:  2  2 4
4:  2  3 6
&gt;
</code></pre>
    <p>Furthermore, it seems that using <code>proc1(x=x1)</code> isn't any slower than doing the procedure directly on <code>x1</code>, indicating that my vague understanding of promises is wrong and that they work in a pass-by-reference sort of way:</p>
<pre><code>&gt; x1 &lt;- CJ(1:2000, 1:500)
&gt; x1[, paste0("V",3:300) := rnorm(1:nrow(x1))]
&gt; proc1 &lt;- function(x){
+ x[,y:= V1*V2]
+ NULL
+ }
&gt; system.time(proc1(x1))
   user  system elapsed
   0.00    0.02    0.02
&gt; x1 &lt;- CJ(1:2000, 1:500)
&gt; system.time(x1[,y:= V1*V2])
   user  system elapsed
   0.03    0.00    0.03
</code></pre>
    <p>So, given that passing a data.table argument to a function doesn't add time, it is possible to write procedures for data.table objects that combine the speed of data.table with the generalizability of a function. However, given what John Chambers said, that functions should not have side effects, is it really "ok" to write this type of procedural code in R? Why was he arguing that side effects are "bad"? If I'm going to ignore his advice, what sort of pitfalls should I be aware of? What can I do to write "good" data.table procedures?</p>
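    <p>The two styles contrasted in the question can be put side by side in a minimal sketch. It assumes only documented data.table behaviour: <code>:=</code> modifies a table in place, and <code>copy()</code> makes a deep copy. R itself does not copy an argument when it is passed to a function; it copies only when the function modifies the argument through ordinary assignment. The helper names <code>add_y_in_place</code> and <code>add_y_copy</code> are hypothetical and appear in neither the question nor data.table.</p>

```r
library(data.table)

# Hypothetical helper in the style of proc1: adds column y by reference.
add_y_in_place <- function(x) {
  x[, y := V1 * V2]  # := updates the caller's table in place
  invisible(x)       # return the table invisibly so calls can be chained
}

# Hypothetical side-effect-free variant: works on an explicit deep copy.
add_y_copy <- function(x) {
  x <- copy(x)       # the caller's table is left untouched
  x[, y := V1 * V2]
  x[]                # return the modified copy
}

# Copy-based version: the input keeps its original columns.
a <- CJ(V1 = 1:2, V2 = 2:3)
a_new <- add_y_copy(a)
print(names(a))
print(names(a_new))

# In-place version: the input itself gains the new column.
b <- CJ(V1 = 1:2, V2 = 2:3)
addr_before <- address(b)
add_y_in_place(b)
print(names(b))
print(identical(address(b), addr_before))  # compares memory addresses before and after
```

    <p>One possible convention, offered as a suggestion rather than an established rule, is to make the in-place behaviour obvious from the function name and to return the table invisibly, so that callers are not surprised by the side effect.</p>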