StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

PO
text
Body
copied!<p>Scalability is all about pre-computing, spreading out, or paring down the repeated work to the bare essentials. To scale well, you don't do anything you don't need to in volume, and the things you actually do you make sure as done as efficiently as possible.</p> <p>In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live as the point where the user requests it. </p> <p>But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without there being a consequence somewhere, so make sure you understand the trade-off. </p> <p>The good news is modern relational databases are <em>good</em> at joins. You shouldn't really think of joins as slow with a good database used well. The database provides a number of scalability-friendly ways to take raw joins and make them <em>much</em> faster:</p> <ul> <li>Join on a surrogate key (autonumer/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation</li> <li>Indexes</li> <li>Materialized/indexed views (think of this as a pre-computed join or <em>managed</em> de-normalization)</li> <li>Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.</li> <li>Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)</li> <li>OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as <em>generic</em> denormalization)</li> <li>Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.</li> </ul> <p>I would go as far as saying <strong>the main reason relational databases exist at all is to allow you do joins efficiently</strong><sup>*</sup>. It's certainly not just to store structured data (you could do that with flat file constructs like csv or xml). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query — just as if you had denormalized the data (admittedly at the cost of slower write operations). </p> <p><strong>If you have a slow join, you're probably not using your database correctly.</strong> </p> <p>De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.</p> <p><sub>* That is, exist as entities distinct from mere collections of tables. An additional reason for a real rdbms is safe concurrent access.</sub></p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload