How to Convert (timestamp, value) array to timeseries
I have a rather straightforward problem I'd like to solve with more efficiency than I'm currently getting.

I have a bunch of data coming in as a set of monitoring metrics. Input data is structured as an array of tuples. Each tuple is (timestamp, value). Timestamps are integer epoch seconds, and values are normal floating-point numbers. Example:

```
inArr = [ (1388435242, 12.3), (1388435262, 11.1), (1388435281, 12.8), ... ]
```

The timestamps are not always the same number of seconds apart, but they're usually close. Sometimes we get duplicate datapoints submitted, sometimes we miss datapoints, etc.

My current solution takes the timestamps and:

- finds the number of seconds between each successive pair of timestamps;
- finds the median of these gaps;
- creates an array of the correct size;
- presumes the first time period starts at half the median value before the first timestamp (putting the measurement in the middle of the time period);
- averages values that happen to go into the same time bucket;
- adds data to this array at the correct (timestamp - starttime)/median element;
- outputs a None value for any time range that has no data.

Output data has to be in the format:

```
outArr = [ (startTime, timeStep, numVals), [ val1, val2, val3, val4, ... ] ]
```

I suspect this is a solved problem with Python Pandas (http://pandas.pydata.org/) or NumPy/SciPy.

Yes, my solution works, but when I'm operating on 60K datapoints it can take a tenth of a second (or more) to run. This is troublesome when I'm trying to work on **large** numbers of sets of data.

So, I'm looking for a solution that might run faster than my pure-Python version. I'm presuming (based on a couple of previous conversations with an Argonne National Labs guy) that SciPy and NumPy are, ahem, "somewhat faster" at array operations. I've looked briefly (an hour or so) at the Pandas code, but it looks cumbersome for this set of operations. Am I incorrect?

-- **Edit to show expected output** --

The median time between datapoints is 20 seconds, and half of that is 10 seconds. To make sure the bucket boundaries fall well between the timestamps, we make the start time 10 seconds before the first datapoint. If we just made the start time the first timestamp, it would be much more likely that two timestamps would land in one interval.

So, 1388435242 - 10 = 1388435232. The timestep is the median, 20 seconds. The numVals here is 3.

```
outArr = [ (1388435232, 20, 3), [ 12.3, 11.1, 12.8 ] ]
```

This is the format Graphite expects when we're graphing the output; it's not my invention. It does seem common, though, for timeseries data to be in this format: a start time, an interval, and then an array of values.
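For reference, here is a minimal NumPy sketch of the bucketing approach described in the list above. The function name `to_timeseries` and its structure are illustrative only, not a drop-in from any library; it assumes the input is non-empty and already sorted by timestamp.

```python
import numpy as np

def to_timeseries(in_arr):
    """Bin (timestamp, value) tuples onto a regular grid, roughly as
    described above (illustrative sketch, not a library routine)."""
    ts = np.array([t for t, _ in in_arr], dtype=np.int64)
    vals = np.array([v for _, v in in_arr], dtype=float)

    step = int(np.median(np.diff(ts)))   # median gap between samples (truncated to int)
    start = ts[0] - step // 2            # start half a step before the first sample
    idx = (ts - start) // step           # bucket index for each sample

    n = int(idx.max()) + 1
    sums = np.zeros(n)
    counts = np.zeros(n)
    np.add.at(sums, idx, vals)           # accumulate values per bucket
    np.add.at(counts, idx, 1)            # count samples per bucket

    # average duplicates within a bucket; None where a bucket is empty
    out = [s / c if c else None for s, c in zip(sums, counts)]
    return [(int(start), step, n), out]
```

With pandas, a similar result could presumably be had by building a `pd.Series` indexed by `pd.to_datetime(ts, unit="s")` and calling `.resample("20s").mean()`, though the result would still need to be converted back into the `(startTime, timeStep, numVals)` plus value-list format that Graphite expects.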