Advice on how to scale and improve execution times of a "pivot-based query" on a billion-row table, increasing by one million rows a day
Our company is developing an internal project to parse text files. Those text files are composed of metadata, which is extracted using regular expressions. Ten computers are parsing the text files 24/7 and feeding a high-end Intel Xeon SQL Server 2005 database with the extracted metadata.

The simplified database schema looks like this:

**Items**

| Id | Name   |
|----|--------|
| 1  | Sample |

**Items_Attributes**

| ItemId | AttributeId |
|--------|-------------|
| 1      | 1           |
| 1      | 2           |

**Attributes**

| Id | AttributeTypeId | Value |
|----|-----------------|-------|
| 1  | 1               | 500mB |
| 2  | 2               | 1.0.0 |

**AttributeTypes**

| Id | Name    |
|----|---------|
| 1  | Size    |
| 2  | Version |

There are many distinct text file types with distinct metadata inside. For every text file we have an `Item`, and for every extracted metadata value we have an `Attribute`.

`Items_Attributes` lets us avoid duplicate `Attribute` values, which keeps the database size from growing roughly tenfold.

This particular schema allows us to dynamically add new regular expressions and to obtain new metadata from newly processed files, no matter what internal structure they have.

Additionally, this allows us to filter the data and to obtain dynamic reports based on the user's criteria. We filter by `Attribute` and then pivot the result set (see http://msdn.microsoft.com/en-us/library/ms177410.aspx). So this example pseudo-SQL query

```sql
SELECT FROM Items WHERE Size = @A AND Version = @B
```

would return a pivoted table like this (a concrete T-SQL sketch of this pivot is included at the end of the question):

| ItemName | Size  | Version |
|----------|-------|---------|
| Sample   | 500mB | 1.0.0   |

The application has been running for months and performance has decreased terribly, to the point where it is no longer usable. Reports should take no more than 2 seconds, and the `Items_Attributes` table grows by an average of 10,000,000 rows per week. Everything is properly indexed and we have spent considerable time analyzing and optimizing the query execution plans.

So my question is: how would you scale this in order to decrease report execution times?

We came up with these possible solutions:

- Buy more hardware and set up a SQL Server cluster (we need advice on the proper "clustering" strategy)
- Use a key/value database like HBase (we don't really know if it would solve our problem)
- Use an ODBMS rather than an RDBMS (we have been considering db4o)
- Move our software to the cloud (we have zero experience)
- Statically generate reports (we don't really want to)
- Static indexed views for common reports (performance is almost the same; see the sketch below)
- De-normalize the schema (some of our reports involve up to 50 tables in a single query)
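For completeness, this is roughly what the simplified schema above looks like as DDL. The data types, keys, and constraints shown here are only my best guess from the description; the real tables have more columns and indexes.

```sql
-- Rough DDL for the simplified schema above (types and keys are assumptions,
-- not the real production definitions).
CREATE TABLE Items (
    Id   int           NOT NULL PRIMARY KEY,
    Name nvarchar(255) NOT NULL
);

CREATE TABLE AttributeTypes (
    Id   int           NOT NULL PRIMARY KEY,
    Name nvarchar(100) NOT NULL
);

CREATE TABLE Attributes (
    Id              int           NOT NULL PRIMARY KEY,
    AttributeTypeId int           NOT NULL REFERENCES AttributeTypes (Id),
    Value           nvarchar(255) NOT NULL
);

CREATE TABLE Items_Attributes (
    ItemId      int NOT NULL REFERENCES Items (Id),
    AttributeId int NOT NULL REFERENCES Attributes (Id),
    PRIMARY KEY (ItemId, AttributeId)
);
```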
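And here is a minimal sketch of how the pseudo-query above maps to an actual SQL Server 2005 `PIVOT` over the simplified schema. The aliases and the `MAX(Value)` aggregate are illustrative choices, and `@A` / `@B` stand for the report parameters.

```sql
-- Sketch only: filter items by two attribute types and pivot the values
-- into columns. Names follow the simplified schema above;
-- @A and @B are the report parameters (e.g. '500mB' and '1.0.0').
SELECT ItemName, [Size], [Version]
FROM (
    SELECT i.Name   AS ItemName,
           aty.Name AS AttributeName,
           a.Value
    FROM Items i
    JOIN Items_Attributes ia ON ia.ItemId = i.Id
    JOIN Attributes a        ON a.Id = ia.AttributeId
    JOIN AttributeTypes aty  ON aty.Id = a.AttributeTypeId
    WHERE aty.Name IN ('Size', 'Version')
) AS src
PIVOT (
    MAX(Value) FOR AttributeName IN ([Size], [Version])
) AS pvt
WHERE [Size] = @A AND [Version] = @B;
```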
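Finally, this is a rough sketch of what the "static indexed views" option from the list means in our case: a schema-bound view over the EAV joins, materialized by a unique clustered index so reports read a single pre-joined structure instead of joining four tables per query. The view and index names are illustrative only.

```sql
-- Sketch only: materialize the Items/Attributes joins for common reports.
CREATE VIEW dbo.vItemAttributeValues
WITH SCHEMABINDING
AS
SELECT ia.ItemId,
       ia.AttributeId,
       i.Name   AS ItemName,
       aty.Name AS AttributeName,
       a.Value
FROM dbo.Items_Attributes AS ia
JOIN dbo.Items            AS i   ON i.Id   = ia.ItemId
JOIN dbo.Attributes       AS a   ON a.Id   = ia.AttributeId
JOIN dbo.AttributeTypes   AS aty ON aty.Id = a.AttributeTypeId;
GO

-- The unique clustered index is what makes SQL Server persist the view's rows.
CREATE UNIQUE CLUSTERED INDEX IX_vItemAttributeValues
    ON dbo.vItemAttributeValues (ItemId, AttributeId);
```

Note that an indexed view trades report time for extra write cost on every insert into a table that is already growing by roughly 10,000,000 rows per week, which may be part of why we saw almost no improvement from it.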