Note that there are some explanatory texts on larger screens.

plurals
  1. POUse zoo read and split a data frame over a column
    text
    copied!<p>I have a table containing observations on scores of <code>restaurants</code>(identified by ID). The variable <code>mean</code> is the mean rating of reviews received in a week-long window centered on each day (i.e. from 3 days before till 3 days later), and the variable <code>count</code> is the number of reviews received in the same window (see the code below for a <code>dput</code> of a randomly-generated sample of my data frame).</p> <p>I am interested in looking at those restaurants that contain big spikes in either variable (like all of a sudden their mean rating goes up by a lot, or drops suddenly). For those restaurants, I would like to investigate what's going on by plotting the distribution (I have lots of restaurants so I can't do it manually and I have to restrict my domain for semi-manual inspection).</p> <p>Also, since my data is day-by-day, I would like it to be less granular. In particolar, I want to average all the ratings or counts for a given month in a single value.</p> <p>I think zoo should help me do it nicely: given the data frame in the example, I think I can convert it to a zoo time series which is aggregate the way I want and split the way I want by using:</p> <pre><code>z &lt;- read.zoo(df, split = "restaurantID", format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean) </code></pre> <p>however, splitting on restaurantID does not yield the expected result. What I get instead is lots of NAs:</p> <pre><code> mean.1006054 count.1006054 mean.1006639 count.1006639 mean.1006704 count.1006704 mean.1007177 count.1007177 Lug 2004 NA NA NA NA NA NA NA NA Ago 2004 NA NA NA NA NA NA NA NA Nov 2004 NA NA NA NA NA NA NA NA Gen 2005 NA NA NA NA NA NA NA NA Feb 2005 NA NA NA NA NA NA NA NA Mar 2005 NA NA NA NA NA NA NA NA mean.1007296 count.1007296 mean.1007606 count.1007606 mean.1007850 count.1007850 mean.1008272 count.1008272 Lug 2004 NA NA NA NA NA NA NA NA Ago 2004 NA NA NA NA NA NA NA NA Nov 2004 NA NA NA NA NA NA NA NA Gen 2005 NA NA NA NA NA NA NA NA Feb 2005 NA NA NA NA NA NA NA NA Mar 2005 NA NA NA NA NA NA NA NA </code></pre> <p>Note that it works if I don't split it on the restaurantID column.</p> <pre><code>df$website &lt;- NULL &gt; z &lt;- read.zoo(df, format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean) &gt; head(z) restaurantID mean count Lug 2004 1418680 3.500000 1 Ago 2004 1370457 5.000000 1 Nov 2004 1324645 4.333333 1 Gen 2005 1425933 1.920000 1 Feb 2005 1315289 3.000000 1 Mar 2005 1400577 2.687500 1 </code></pre> <p>Also, <code>plot.zoo(z)</code> works but of course the produced graph has no meaning for me. </p> <p>My questions are:</p> <p>1) How can I filter the restaurants that have the higher "month-month" spikes in either column?</p> <p>2) How can I split on restaurantID and plot the time series of only such restaurants?</p> <p><a href="http://pastebin.com/raw.php?i=4N9C0fVq" rel="nofollow">DATA HERE (wouldn't fit SO's word limit)</a></p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload