
Realtime querying/aggregating millions of records - hadoop? hbase? cassandra?
<p>I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution best fits my needs. In theory, if I had unlimited CPUs, my results should come back instantaneously. So, any help would be appreciated. Thanks!</p> <p>Here's what I have:</p> <ul> <li>1000s of datasets</li> <li>dataset keys: <ul> <li>all datasets have the same keys</li> <li>1 million keys (this may later be 10 or 20 million)</li> </ul></li> <li>dataset columns:<br> <ul> <li>each dataset has the same columns</li> <li>10 to 20 columns</li> <li>most columns hold numerical values that we need to aggregate (avg, stddev, and statistics calculated with R)</li> <li>a few columns are "type_id" columns, since a particular query may need to include only certain type_ids</li> </ul></li> <li>web application<br> <ul> <li>the user can choose which datasets they are interested in (anywhere from 15 to 1000)</li> <li>the application needs to present: the key, and the aggregated results (avg, stddev) of each column</li> </ul></li> <li>updates of data: <ul> <li>an entire dataset can be added, dropped, or replaced/updated</li> <li>it would be nice to be able to add columns, but if required we can just replace the entire dataset</li> <li>rows/keys are never added to a dataset, so we don't need a system with lots of fast writes</li> </ul></li> <li>infrastructure:<br> <ul> <li>currently two machines with 24 cores each</li> <li>eventually, we want the ability to also run this on Amazon</li> </ul></li> </ul> <p>I can't precompute my aggregated values, but since each key is independent, this should be easily scalable.
Currently, I have this data in a postgres database, where each dataset is in its own partition.</p> <ul> <li>partitions are nice, since I can easily add/drop/replace them</li> <li>the database is nice for filtering based on type_id</li> <li>databases aren't easy to write parallel queries against</li> <li>databases are good for structured data, and my data is not structured</li> </ul> <p>As a proof of concept I tried out hadoop:</p> <ul> <li>created a tab-separated file per dataset for a particular type_id</li> <li>uploaded the files to hdfs</li> <li>map: retrieved a value/column for each key</li> <li>reduce: computed the average and standard deviation</li> </ul> <p>From my crude proof of concept, I can see this will scale nicely, but hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).</p> <p>Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for it. Should I instead look at Hive? Cassandra? Voldemort?</p> <p>thanks!</p>
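<p>For reference, the map/reduce steps from my proof of concept can be sketched roughly as below (Hadoop Streaming style, but runnable standalone). The column index and the exact tab-separated layout are assumptions; the key point is that the reducer accumulates sufficient statistics (count, sum, sum of squares) per key, so partial results from independent workers or datasets can be merged, which is why each key is independently parallelizable:</p>

```python
import math
from collections import defaultdict

def mapper(lines):
    """Emit (key, value) pairs for one numeric column.

    Assumes tab-separated rows where field 0 is the dataset key and
    COLUMN is the index of the numeric column to aggregate
    (hypothetical layout; adjust to the real file format).
    """
    COLUMN = 1
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[0], float(fields[COLUMN])

def reducer(pairs):
    """Compute count, average, and standard deviation per key.

    Accumulates sufficient statistics (n, sum, sum of squares), so the
    same logic works as a combiner or when merging partial results
    from several datasets/workers.
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])  # key -> [n, sum, sumsq]
    for key, value in pairs:
        s = stats[key]
        s[0] += 1
        s[1] += value
        s[2] += value * value
    for key, (n, total, sumsq) in sorted(stats.items()):
        mean = total / n
        # population variance; use n - 1 in the denominator for the
        # sample estimate if that's what R expects downstream
        var = max(sumsq / n - mean * mean, 0.0)
        yield key, n, mean, math.sqrt(var)

if __name__ == "__main__":
    # Toy run: two values per key, one numeric column.
    rows = ["k1\t10", "k1\t14", "k2\t3", "k2\t5"]
    for key, n, mean, std in reducer(mapper(rows)):
        print(key, n, mean, std)
```

<p>One caveat with the sum-of-squares formula: it can lose precision for large values with small variance, so a streaming variant (e.g. Welford's algorithm) may be safer at 10-20 million keys.</p>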