How to Convert (timestamp, value) array to timeseries
I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.

I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating-point numbers. Example:

```
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
```

The timestamps are not always the same number of seconds apart, but they're usually close. Sometimes we get duplicate datapoints submitted, sometimes we miss datapoints, etc.

My current solution takes the timestamps and:

- finds the number of seconds between each successive pair of timestamps;
- finds the median of these gaps;
- creates an array of the correct size;
- presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
- averages values that happen to go into the same time bucket;
- adds data to this array at the correct (timestamp - starttime)/median element;
- outputs a None value for any time range that has no data.

Output data has to be in the format:

```
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
```

I suspect this is a solved problem with Python Pandas (http://pandas.pydata.org/) or NumPy/SciPy.

Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on **large** numbers of sets of data.

So, I'm looking for a solution that might run faster than my pure-Python version. I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and NumPy are, ahem, "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code, but it looks cumbersome for this set of operations. Am I incorrect?

-- **Edit to show expected output** --

The median time between datapoints is 20 seconds, and half of that is 10 seconds. To make sure the bucket boundaries fall well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just made the start time the first timestamp, it would be much more likely that two timestamps would land in one interval.

So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numVals here is 3.

```
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
```

This is the format Graphite expects when we're graphing the output; it's not my invention. It does seem common, though, for timeseries data to be in this format: a start time, an interval, and then an array of values.
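For reference, here is a minimal NumPy sketch of the bucketing approach described in the list above. The function name `to_timeseries` and its structure are illustrative only, not a drop-in from any library; it assumes the input is non-empty and already sorted by timestamp.

```python
import numpy as np

def to_timeseries(in_arr):
    """Bin (timestamp, value) tuples onto a regular grid, roughly as
    described above (illustrative sketch, not a library routine)."""
    ts = np.array([t for t, _ in in_arr], dtype=np.int64)
    vals = np.array([v for _, v in in_arr], dtype=float)

    step = int(np.median(np.diff(ts)))   # median gap between samples (truncated to int)
    start = ts[0] - step // 2            # start half a step before the first sample
    idx = (ts - start) // step           # bucket index for each sample

    n = int(idx.max()) + 1
    sums = np.zeros(n)
    counts = np.zeros(n)
    np.add.at(sums, idx, vals)           # accumulate values per bucket
    np.add.at(counts, idx, 1)            # count samples per bucket

    # average duplicates within a bucket; None where a bucket is empty
    out = [s / c if c else None for s, c in zip(sums, counts)]
    return [(int(start), step, n), out]
```

With pandas, a similar result could presumably be had by building a `pd.Series` indexed by `pd.to_datetime(ts, unit="s")` and calling `.resample("20s").mean()`, though the result would still need to be converted back into the `(startTime, timeStep, numVals)` plus value-list format that Graphite expects.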