Note that there are some explanatory texts on larger screens.

plurals
  1. PO
    primarykey
    data
    text
    <p>I cannot stress this enough: <em>Get something that plays nicely with off-the-shelf reporting tools.</em></p> <p>20 Billion rows per month puts you in VLDB territory, so you need partitioning. The low cardinality dimensions would also suggest that bitmap indexes would be a performance win.</p> <ul> <li><p>Forget the cloud systems (<a href="http://hadoop.apache.org/hive/" rel="nofollow noreferrer">Hive</a>, <a href="http://hadoop.apache.org/hbase/" rel="nofollow noreferrer">Hbase</a>) until they have mature SQL support. For a data warehouse application you want something that works with conventional reporting tools. Otherwise, you will find yourself perpetually bogged down writing and maintaining ad-hoc report programs.</p></li> <li><p>The data volumes are manageable with a more conventional DBMS like Oracle - I know of a <a href="http://www.telecomitalia.com/" rel="nofollow noreferrer">major European telco</a> that loads 600GB per day into an <a href="http://www.oracle.com" rel="nofollow noreferrer">Oracle</a> database. All other things being equal, that's two orders of magnitude bigger than your data volumes, so <a href="http://www.databasecolumn.com/2007/10/database-parallelism-choices.html" rel="nofollow noreferrer">shared disk architectures</a> still have headroom for you. A <a href="http://en.wikipedia.org/wiki/Shared_nothing_architecture" rel="nofollow noreferrer">shared-nothing</a> architecture like <a href="http://www.netezza.com" rel="nofollow noreferrer">Netezza</a> or <a href="http://www.teradata.com" rel="nofollow noreferrer">Teradata</a> will probably be faster still but these volumes are not at a level that is beyond a conventional shared-disk system. Bear in mind, though, that these systems are all quite expensive.</p></li> <li><p>Also bear in mind that MapReduce is <a href="http://highscalability.com/database-people-hating-mapreduce" rel="nofollow noreferrer">not an efficient query selection algorithm</a>. It is fundamentally a mechanism for distributing brute-force computations. Greenplum does have a MapReduce back-end, but a purpose-built shared nothing engine will be a lot more efficient and get more work done for less hardware.</p></li> </ul> <p>My take on this is that Teradata or Netezza would probably be the ideal tool for the job but definitely the most expensive. <a href="http://www.oracle.com" rel="nofollow noreferrer">Oracle</a>, <a href="http://www.sybase.com/products/datawarehousing/sybaseiq" rel="nofollow noreferrer">Sybase IQ</a> or even <a href="http://msdn.microsoft.com/en-us/sqlserver/default.aspx" rel="nofollow noreferrer">SQL Server</a> would also handle the data volumes involved but will be slower - they are shared disk architectures but can still manage this sort of data volume. See <a href="https://stackoverflow.com/questions/126188/ms-sql-server-and-oracle-which-one-is-better-in-terms-of-scalability#126267">This posting</a> for a rundown on VLDB related features in Oracle and SQL Server, and bear in mind that Oracle has just introduced the <a href="http://www.oracle.com/solutions/business_intelligence/exadata.html" rel="nofollow noreferrer">Exadata storage platform</a> also.</p> <p>My back-of-a-fag-packet capacity plan suggests maybe 3-5 TB or so per month including indexes for Oracle or SQL Server. Probably less on Oracle with bitmap indexes, although an index leaf has a 16-byte ROWID on oracle vs. a 6 byte page reference on SQL Server.</p> <p>Sybase IQ makes extensive use of bitmap indexes and is optimized for data warehouse queries. Although a shared-disk architecture, it is very efficient for this type of query (IIRC it was the original column-oriented architecture). This would probably be better than Oracle or SQL Server as it is specialized for this type of work.</p> <p>Greenplum might be a cheaper option but I've never actually used it so I can't comment on how well it works in practice. </p> <p>If you have 10 dimensions with just a few hundred rows consider merging them into a single <a href="http://www.dmreview.com/news/1027074-1.html" rel="nofollow noreferrer">junk dimension</a> which will slim down your fact table by merging the ten keys into just one. You can still implement hierarchies on a junk dimension and this would knock 1/2 or more off the size of your fact table and eliminate a lot of disk usage by indexes.</p> <p>I strongly recommend that you go with something that plays nicely with a reasonable cross-section of reporting tools. <em>This means a SQL front end.</em> Commercial systems like <a href="http://www.businessobjects.com/product/catalog/crystalreports/" rel="nofollow noreferrer">Crystal Reports</a> allow reporting and analytics to be done by people with a more readily obtainable set of SQL skills. The open-source world has also generated <a href="http://www.eclipse.org/birt/phoenix/" rel="nofollow noreferrer">BIRT</a>, <a href="http://www.jaspersoft.com/JasperSoft_JasperReports.html?utm_source=google&amp;utm_medium=cpc&amp;utm_content=JasperReports&amp;utm_campaign=JasperReportsSearchjasperreports" rel="nofollow noreferrer">Jasper Reports</a> and <a href="http://www.pentaho.com" rel="nofollow noreferrer">Pentaho.</a>. Hive or HBase put you in the business of building a custom front-end, which you <em>really</em> don't want unless you're happy to spend the next 5 years writing custom report formatters in Python. </p> <p>Finally, host it somewhere you can easily get a fast data feed from your production systems. This probably means your own hardware in your own data centre. This system will be I/O bound; it's doing simple processing on large volumes of data. This means you will need machines with fast disk subsystems. Cloud providers tend not to support this type of hardware as it's an order of magnitude more expensive than the type of disposable 1U box traditionally used by these outfits. Fast Disk I/O is not a strength of cloud architectures. </p>
    singulars
    1. This table or related slice is empty.
    plurals
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. This table or related slice is empty.
    1. VO
      singulars
      1. This table or related slice is empty.
    2. VO
      singulars
      1. This table or related slice is empty.
    3. VO
      singulars
      1. This table or related slice is empty.
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload