  1. POR: Find and add missing (/non existing) rows in time related data frame
<p>I'm struggling with the following.</p>
<p>I have a (big) data frame with the following:</p>
<ul>
<li>several columns for which the combination of columns is a 'unique' combination, say ID</li>
<li>a time related column</li>
<li>a measure related column</li>
</ul>
<p>I want to make sure that for each unique ID, for each time interval, a measure is available in the data frame. And if it is not, I want to add a 0 (or NA) measure for that time/ID.</p>
<p>To illustrate the problem, create the following <code>test</code> data frame:</p>
<pre><code>test &lt;- data.frame(
  YearWeek   = rep(c("2012-01","2012-02"), each=4),
  ProductID  = rep(c(1,2), times=4),
  CustomerID = rep(c("a","b"), each=2, times=2),
  Quantity   = 5:12
)[1:7,]

  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         2          a        6
3  2012-01         1          b        7
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         2          a       10
7  2012-02         1          b       11
</code></pre>
<p>The 8th row is left out on purpose. This way I simulate a 'missing value' (missing <code>Quantity</code>) for ID '2-b' (<code>ProductID-CustomerID</code>) for the time value "2012-02".</p>
<p>What I want to do is adjust the data.frame in such a way that for all time values (these are known, in this example just "2012-01" and "2012-02"), for all ID-combinations (these are not known upfront, but this is 'all unique ID combinations in the data frame', thus the unique set on the ID columns), a Quantity is available in the data frame.</p>
<p>For this example, this should result in the following (if we choose <code>NA</code> for the missing value; typically I want to have control over that):</p>
<pre><code>  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         2          a        6
3  2012-01         1          b        7
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         2          a       10
7  2012-02         1          b       11
8  2012-02         2          b       NA
</code></pre>
<p>The ultimate goal is to create time series for these ID combinations, and I therefore want to have Quantities for all time values.
I need to do different aggregations (on time), using different levels of IDs, from a big dataset.</p>
<p>I tried several things, for instance with <code>melt</code> and <code>cast</code> from the <code>reshape</code> package. But so far I didn't manage to do it. The next step would be writing a function with for-loops etc., but that is not really practical from a performance perspective.</p>
<p>Maybe there is an easier way to create time series instantly, given a data.frame like <code>test</code>. Does anybody have an idea on this one?</p>
<p>Thanks in advance!</p>
<p>Note that in the actual problem there are more than two 'ID columns'.</p>
<hr>
<p>EDIT:</p>
<p>I should describe the problem further. There is a difference between the 'time' column and the 'ID' columns. The first (and great!) answer to the question, by <strong>joran</strong>, may not have captured exactly what I want (and the example I gave didn't make the difference clear). I said above:</p>
<blockquote>
<p>for all ID-combinations (these are not known upfront, but this is 'all unique ID combinations in the data frame', thus the unique set on the ID columns)</p>
</blockquote>
<p>So I do not want 'all possible ID combinations' but 'all ID combinations within the data'. For each of those combinations I want a value for every unique time value.</p>
<p>Let me make it clear by expanding <code>test</code> to <code>test2</code>, as follows:</p>
<pre><code>&gt; test2 &lt;- rbind(test, c("2012-02", 3, "a", 13))
&gt; test2
  YearWeek ProductID CustomerID Quantity
1  2012-01         1          a        5
2  2012-01         2          a        6
3  2012-01         1          b        7
4  2012-01         2          b        8
5  2012-02         1          a        9
6  2012-02         2          a       10
7  2012-02         1          b       11
8  2012-02         3          a       13
</code></pre>
<p>This means I do not want a '3-b' ID combination in the resulting data frame, because this combination does not occur in <code>test2</code>.
If I use the method of the first answer I will get the following:</p>
<pre><code>&gt; vals2 &lt;- expand.grid(YearWeek = unique(test2$YearWeek), ProductID = unique(test2$ProductID), CustomerID = unique(test2$CustomerID))
&gt; merge(vals2,test2,all = TRUE)
   YearWeek ProductID CustomerID Quantity
1   2012-01         1          a        5
2   2012-01         1          b        7
3   2012-01         2          a        6
4   2012-01         2          b        8
5   2012-01         3          a     &lt;NA&gt;
6   2012-01         3          b     &lt;NA&gt;
7   2012-02         1          a        9
8   2012-02         1          b       11
9   2012-02         2          a       10
10  2012-02         2          b     &lt;NA&gt;
11  2012-02         3          a       13
12  2012-02         3          b     &lt;NA&gt;
</code></pre>
<p>So I don't want the rows <code>6</code> and <code>12</code> to be here.</p>
<p>To overcome this problem I came up with the solution below. Here I take the unique time values and the unique ID combinations separately. The difference from the approach above is thus the word 'combination': unique rows over the ID columns together, rather than unique values of each ID column.</p>
<pre><code>&gt; temp_merge &lt;- merge(unique(test2["YearWeek"]), unique(test2[c("ProductID", "CustomerID")]))
&gt; merge(temp_merge,test2,all = TRUE)
   YearWeek ProductID CustomerID Quantity
1   2012-01         1          a        5
2   2012-01         1          b        7
3   2012-01         2          a        6
4   2012-01         2          b        8
5   2012-01         3          a     &lt;NA&gt;
6   2012-02         1          a        9
7   2012-02         1          b       11
8   2012-02         2          a       10
9   2012-02         2          b     &lt;NA&gt;
10  2012-02         3          a       13
</code></pre>
<p>Any comments on this one?</p>
<p>Is this an elegant way, or are there better ways?</p>
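A minimal sketch of how the two-step merge described above could be wrapped in a reusable function, with control over the value used for missing measures. The function name `complete_time_rows` and its arguments are hypothetical (they do not come from any package), and it assumes base R only, with the time and ID columns being the only columns shared between the generated grid and the data.

```r
# Hypothetical helper (not from the original post or any package):
# add the missing time/ID rows, keeping only ID combinations that
# actually occur in the data.
complete_time_rows <- function(df, time_col, id_cols, measure_col, fill = NA) {
  # Unique time values and unique ID combinations, taken separately
  times <- unique(df[time_col])
  ids   <- unique(df[id_cols])

  # merge() with no common columns returns the Cartesian product
  grid <- merge(times, ids)

  # Keep every grid row; rows absent from df get NA in the measure column
  out <- merge(grid, df, all.x = TRUE)

  # Replace NA measures with the chosen fill value
  # (this also replaces any NA that was already present in df)
  out[[measure_col]][is.na(out[[measure_col]])] <- fill
  out
}

# The example data from the question, with the 8th row left out
test <- data.frame(
  YearWeek   = rep(c("2012-01", "2012-02"), each = 4),
  ProductID  = rep(c(1, 2), times = 4),
  CustomerID = rep(c("a", "b"), each = 2, times = 2),
  Quantity   = 5:12
)[1:7, ]

result <- complete_time_rows(test, "YearWeek",
                             c("ProductID", "CustomerID"),
                             "Quantity", fill = 0)
print(result)
```

Because the grid is built from `unique(df[id_cols])` rather than from `expand.grid` over each ID column, a combination such as '3-b' that never occurs in the data is not introduced. Passing `fill = NA` leaves the gaps as `NA` instead of 0.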
 
