Time based data analysis with Python
<p>I've got a project where physical sensors send data to the server. Data is sent irregularly: whenever something activates a sensor, but at least once every 20 minutes. On the server, data is stored in a PostgreSQL database.</p>
<p>The data structure looks like this:</p>
<pre><code>Sensor Table
    sensor name - string
    sensor serial no - string
    sensor type - foreign key to types table

Sensor Data Table
    sensor - foreign key
    timestamp
    value 1 - boolean
    value 2 - boolean
    value 3 - integer
    value 4 - float
    ...
</code></pre>
<p>No more than 100 requests/second in total are expected. Data records in the database should be persisted for 90 days, and even longer in some cases (not only 2 weeks as I thought earlier). At 100 requests/second, the total number of records would be no more than 120 960 000 per 14 days. This is a "safe" estimate. In reality it might be 10 times less (10 requests/second, 12 096 000 records per 14 days).</p>
<p>I need to do some analysis on the data, like:</p>
<ol>
<li>Do something when a new record comes and its "value 2" is true</li>
<li>Do something when sensor X's "value 2" is true for longer than some declared time (50 minutes, 1 hour or some other time)</li>
<li>Do something when sensor X's total true time for "value 2" in 24 hours is more than some declared time</li>
<li>Do something when sensor X's "value 3" is true for longer than some declared time and no other sensor of type XYZ was active in this period ...</li>
</ol>
<p>The "declared time" above is greater than or equal to 1 second.</p>
<p>The whole server part is developed in Django (with django-rest-framework to gather data).</p>
<p>The question is how to do such data analysis efficiently, assuming that there should be real-time or close to real-time (1 second) monitoring of the data and of the time periods, so that the desired actions are triggered.</p>
<p>My thoughts:</p>
<ol>
<li><p>Run a process that queries the database every second for records that meet the criteria and calls specific actions (a single pass would probably take more than 1 second)</p></li>
<li><p>Run some separate processes (eventlet?), one for each analysis type, each of which queries the database every second and fires specific actions.</p></li>
<li><p>Run one process per sensor that continuously reports to its subscribers: "I'm true on 'value 2' for longer than x seconds", etc. The process is reset after new data for that sensor arrives. Some publish-subscribe solution like ZeroMQ might be used here?</p></li>
<li><p>Use some other/faster solution</p>
<ul>
<li>MongoDB - the problem might be that MongoDB's files are not compacted after data is removed (2 weeks).</li>
<li>Hadoop - isn't it too big and too complex for this class of problems?</li>
<li>Pandas and some HDF5 storage - the problem might be whether it's capable of doing the analysis I've described above, and probably also the writes into files. But it might work with MongoDB too.</li>
</ul></li>
</ol>
<p>Hints?</p>
<p>Update.</p>
<p>Currently the solution that seems simple and effective to me is:</p>
<ol>
<li>after data arrives for sensor A, run all tests and</li>
<li>store the test results in some "tests" table (or Redis) in a way that says:
<ul>
<li>today at 1:15 pm run action "sensor open longer than"</li>
<li>today at 1:30 pm run action "sensor open longer than in 24h period" ...</li>
</ul></li>
<li>continuously scan the above "tests" table, and when it's 1:15 pm today, run the desired action, etc.</li>
<li>when a new signal arrives for sensor A, run all tests again and also reset its data in the "tests" table.</li>
</ol>
<p>This would require me to run the tests each time a request arrives for a specific sensor, but on the other hand I would only have to scan the "tests" table every second.</p>
<p>Update 2</p>
<p>I've discovered PyTables (<a href="http://www.pytables.org/moin/PyTables" rel="nofollow">http://www.pytables.org/moin/PyTables</a>); it looks quite well suited for my use case as data storage.</p>
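<p>To make the "tests" table idea from the first update more concrete, here is a minimal in-memory sketch of it for the rule "value 2 is true for longer than some declared time". It is only an illustration under assumptions: the names <code>DeadlineScheduler</code>, <code>on_reading</code> and <code>value2_true_too_long</code> are hypothetical, and a Python heap stands in for the database table or Redis. Resetting a sensor bumps a generation counter, so stale deadlines are skipped when they come due instead of being searched for and deleted.</p>

```python
import heapq
from datetime import datetime, timedelta


class DeadlineScheduler:
    """In-memory stand-in for the "tests" table (hypothetical helper)."""

    def __init__(self):
        self._heap = []        # entries: (fire_at, seq, sensor_id, action, generation)
        self._generation = {}  # sensor_id -> generation of its still-valid entries
        self._seq = 0          # unique tie-breaker for entries with equal fire_at

    def reset(self, sensor_id):
        """Invalidate all pending deadlines of a sensor without touching the heap."""
        self._generation[sensor_id] = self._generation.get(sensor_id, 0) + 1

    def schedule(self, sensor_id, action, fire_at):
        """Remember that `action` should run for `sensor_id` at `fire_at`."""
        self._seq += 1
        generation = self._generation.get(sensor_id, 0)
        heapq.heappush(self._heap, (fire_at, self._seq, sensor_id, action, generation))

    def due(self, now):
        """Pop every deadline that has passed; return the ones that are still valid."""
        fired = []
        while self._heap and self._heap[0][0] <= now:
            _, _, sensor_id, action, generation = heapq.heappop(self._heap)
            if generation == self._generation.get(sensor_id, 0):
                fired.append((sensor_id, action))
        return fired


def on_reading(scheduler, true_since, sensor_id, timestamp, value2, max_true):
    """Re-run the "true for longer than max_true" test when new data arrives.

    `true_since` maps sensor_id -> timestamp at which "value 2" became true.
    """
    scheduler.reset(sensor_id)  # new data arrived: drop the old deadlines
    if value2:
        started = true_since.setdefault(sensor_id, timestamp)
        scheduler.schedule(sensor_id, "value2_true_too_long", started + max_true)
    else:
        true_since.pop(sensor_id, None)


if __name__ == "__main__":
    scheduler, true_since = DeadlineScheduler(), {}
    start = datetime(2024, 1, 1, 13, 0, 0)
    on_reading(scheduler, true_since, "A", start, True, timedelta(minutes=15))
    # The once-per-second scan loop would call due() with the current time.
    print(scheduler.due(start + timedelta(minutes=15)))
```

<p>With this layout, the once-per-second loop only looks at the head of the heap, so its cost depends on how many deadlines are due rather than on how many sensor records are stored. A persistent version would need the same three operations (reset, schedule, pop due entries) on top of a table indexed by fire time or a Redis sorted set.</p>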