Advice on how to scale and improve execution times of a "pivot-based query" on a billion-row table, increasing by one million rows a day
Our company is developing an internal project to parse text files. Those text files are composed of metadata, which is extracted using regular expressions. Ten computers are parsing the text files 24/7 and feeding a high-end Intel Xeon SQL Server 2005 database with the extracted metadata.

The simplified database schema looks like this:

**Items**

| Id | Name   |
|----|--------|
| 1  | Sample |

**Items_Attributes**

| ItemId | AttributeId |
|--------|-------------|
| 1      | 1           |
| 1      | 2           |

**Attributes**

| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1  | 1               | 500mB |
| 2  | 2               | 1.0.0 |

**AttributeTypes**

| Id | Name    |
|----|---------|
| 1  | Size    |
| 2  | Version |

There are many distinct text file types with distinct metadata inside. For every text file we have an `Item`, and for every extracted metadata value we have an `Attribute`.

`Items_Attributes` lets us avoid duplicate `Attribute` values, which keeps the database size from growing roughly tenfold.

This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter what internal structure they have.

Additionally, this allows us to filter the data and to obtain dynamic reports based on the user's criteria. We filter by `Attribute` and then pivot the result set (see http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query

```sql
SELECT FROM Items WHERE Size = @A AND Version = @B
```

would return a pivoted table like this (a concrete T-SQL sketch of this pivot is included at the end of the question):

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |

The application has been running for months and performance has decreased terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the `Items_Attributes` table grows by an average of 10,000,000 rows per week. Everything is properly indexed and we have spent considerable time analyzing and optimizing the query execution plans.

So my question is: how would you scale this in order to decrease report execution times?

We came up with these possible solutions:

- Buy more hardware and set up a SQL Server cluster (we need advice on the proper "clustering" strategy)
- Use a key/value database like HBase (we don't really know if it would solve our problem)
- Use an ODBMS rather than an RDBMS (we have been considering db4o)
- Move our software to the cloud (we have zero experience)
- Statically generate reports (we don't really want to)
- Static indexed views for common reports (performance is almost the same; see the sketch below)
- De-normalize the schema (some of our reports involve up to 50 tables in a single query)
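For completeness, this is roughly what the simplified schema above looks like as DDL. The data types, keys, and constraints shown here are only my best guess from the description; the real tables have more columns and indexes.

```sql
-- Rough DDL for the simplified schema above (types and keys are assumptions,
-- not the real production definitions).
CREATE TABLE Items (
    Id   int           NOT NULL PRIMARY KEY,
    Name nvarchar(255) NOT NULL
);

CREATE TABLE AttributeTypes (
    Id   int           NOT NULL PRIMARY KEY,
    Name nvarchar(100) NOT NULL
);

CREATE TABLE Attributes (
    Id              int           NOT NULL PRIMARY KEY,
    AttributeTypeId int           NOT NULL REFERENCES AttributeTypes (Id),
    Value           nvarchar(255) NOT NULL
);

CREATE TABLE Items_Attributes (
    ItemId      int NOT NULL REFERENCES Items (Id),
    AttributeId int NOT NULL REFERENCES Attributes (Id),
    PRIMARY KEY (ItemId, AttributeId)
);
```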
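And here is a minimal sketch of how the pseudo-query above maps to an actual SQL Server 2005 `PIVOT` over the simplified schema. The aliases and the `MAX(Value)` aggregate are illustrative choices, and `@A` / `@B` stand for the report parameters.

```sql
-- Sketch only: filter items by two attribute types and pivot the values
-- into columns. Names follow the simplified schema above;
-- @A and @B are the report parameters (e.g. '500mB' and '1.0.0').
SELECT ItemName, [Size], [Version]
FROM (
    SELECT i.Name   AS ItemName,
           aty.Name AS AttributeName,
           a.Value
    FROM Items i
    JOIN Items_Attributes ia ON ia.ItemId = i.Id
    JOIN Attributes a        ON a.Id = ia.AttributeId
    JOIN AttributeTypes aty  ON aty.Id = a.AttributeTypeId
    WHERE aty.Name IN ('Size', 'Version')
) AS src
PIVOT (
    MAX(Value) FOR AttributeName IN ([Size], [Version])
) AS pvt
WHERE [Size] = @A AND [Version] = @B;
```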
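Finally, this is a rough sketch of what the "static indexed views" option from the list means in our case: a schema-bound view over the EAV joins, materialized by a unique clustered index so reports read a single pre-joined structure instead of joining four tables per query. The view and index names are illustrative only.

```sql
-- Sketch only: materialize the Items/Attributes joins for common reports.
CREATE VIEW dbo.vItemAttributeValues
WITH SCHEMABINDING
AS
SELECT ia.ItemId,
       ia.AttributeId,
       i.Name   AS ItemName,
       aty.Name AS AttributeName,
       a.Value
FROM dbo.Items_Attributes AS ia
JOIN dbo.Items            AS i   ON i.Id   = ia.ItemId
JOIN dbo.Attributes       AS a   ON a.Id   = ia.AttributeId
JOIN dbo.AttributeTypes   AS aty ON aty.Id = a.AttributeTypeId;
GO

-- The unique clustered index is what makes SQL Server persist the view's rows.
CREATE UNIQUE CLUSTERED INDEX IX_vItemAttributeValues
    ON dbo.vItemAttributeValues (ItemId, AttributeId);
```

Note that an indexed view trades report time for extra write cost on every insert into a table that is already growing by roughly 10,000,000 rows per week, which may be part of why we saw almost no improvement from it.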