Note that there are some explanatory texts on larger screens.

plurals
  1. POMySQL: simple schema, joining in a view and sorting on unrelated attribute causes unbearable performance hit
    text
    copied!<p>I'm creating a database model for use by a diverse amount of applications and different kinds of database servers (though I'm mostly testing on MySQL and SQLite now). It's a really simple model that basically consists of one central <strong>matches</strong> table and many <strong>attribute</strong> tables that have the match_id as their primary key and one other field (the attribute value itself). Said in other words, every match has exactly one of every type of attribute and every attribute is stored in a seperate table. After experiencing some rather bad performance whilst sorting and filtering on these attributes (<code>FROM matches LEFT JOIN attributes_i_want on primary index</code>) I decided to try to improve it. To this end I added an index on every attribute value column. Sorting and filtering performance increased a lot for easy queries.</p> <p>This simple schema is basically a requirement for the application, so it is able to auto-discover and use attributes. Thus, to create more complex attributes that are actually based on other results, I decided to use VIEWs that turn one or more other tables that don't necessarily match up to the attribute-like schema into an attribute-schema. I call these meta-attributes (they aren't directly editable either). However, to the application this is all transparant, and so it happily joins in the VIEW as well when it wants to. The problem: <strong>it kills performance</strong>. When the VIEW is joined in without sorting on any attribute, performance is still acceptable, but combining a retrieval of the VIEW with sorting is unacceptably slow (on the order of 1s). Even after reading quite a bit of tutorials on indexing and some questions here on stack overflow, I can't seem to help it.</p> <p>_Prerequisites for a solution: in one way or another, num_duplicates must exist as a table or view with the columns match_id and num_duplicates to look like an attribute. I can't change the way attributes are discovered and used. So if I want to see num_duplicates appear in the application it'll have to be as some kind of view or materialized table that makes a num_duplicates table._</p> <h3>Relevant parts of the schema</h3> <p>Main table:</p> <pre><code>CREATE TABLE `matches` ( `match_id` int(11) NOT NULL, `source_name` text, `target_name` text, `transformation` text, PRIMARY KEY (`match_id`) ) ENGINE=InnoDB; </code></pre> <p>Example of a normal attribute (indexed):</p> <pre><code>CREATE TABLE `error` ( `match_id` int(11) NOT NULL, `error` double DEFAULT NULL, PRIMARY KEY (`match_id`), KEY `error_index` (`error`) ) ENGINE=InnoDB; </code></pre> <p>(all normal attributes, like <code>error</code>, are basically the same)</p> <p>Meta-attribute / VIEW:</p> <pre><code>CREATE VIEW num_duplicates AS SELECT duplicate AS match_id, COUNT(duplicate) AS num_duplicates FROM duplicate GROUP BY duplicate </code></pre> <p>(this is the only meta-attribute I'm using right now)</p> <h3>Simple query with indexing on the attribute value columns (the part improved by indexes)</h3> <pre><code>SELECT matches.match_id, source_name, target_name, transformation FROM matches INNER JOIN error ON matches.match_id = error.match_id ORDER BY error.error </code></pre> <p>(the performance on this query increased a lot because of the index on error) (the runtime of this query is on the order of 0.0001 sec)</p> <h3>Slightly more complex queries and their runtimes including the meta-attribute (the still bad part)</h3> <pre><code>SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT , num_duplicates FROM matches INNER JOIN STATUS ON matches.match_id = status.match_id INNER JOIN error ON matches.match_id = error.match_id LEFT JOIN num_duplicates ON matches.match_id = num_duplicates.match_id INNER JOIN volume ON matches.match_id = volume.match_id INNER JOIN COMMENT ON matches.match_id = comment.match_id </code></pre> <p>(runtime: 0.0263sec) &lt;--- still acceptable</p> <pre><code>SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT , num_duplicates FROM matches INNER JOIN STATUS ON matches.match_id = status.match_id INNER JOIN error ON matches.match_id = error.match_id LEFT JOIN num_duplicates ON matches.match_id = num_duplicates.match_id INNER JOIN volume ON matches.match_id = volume.match_id INNER JOIN COMMENT ON matches.match_id = comment.match_id ORDER BY error.error LIMIT 20, 20 </code></pre> <p>(runtime: 0.8866 sec) &lt;--- not acceptable (the query speed is exactly the same with the LIMIT as without the LIMIT, <strong>note</strong>: if I could get the version with the LIMIT to be fast that would already be a big win. I presume it has to scan the entire table and so the limit doesn't matter too much)</p> <h3>EXPLAIN of the last query</h3> <p>Of course I tried to solve it myself before coming here, but I must admit I'm not that good at these things and haven't found a way to remove the offending performance killer yet. I know it's most likely the using filesort but I don't know how to get rid of it.</p> <pre><code>id select_type table type possible_keys key key_len ref rows Extra 1 PRIMARY error index PRIMARY,match_id error_index 9 NULL 53909 Using index; Using temporary; Using filesort 1 PRIMARY COMMENT eq_ref PRIMARY PRIMARY 4 tangbig4.error.match_id 1 1 PRIMARY STATUS eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1 Using where 1 PRIMARY matches eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1 Using where 1 PRIMARY &lt;derived2&gt; ALL NULL NULL NULL NULL 2 1 PRIMARY volume eq_ref PRIMARY PRIMARY 4 tangbig4.matches.match_id 1 Using where 2 DERIVED duplicate index NULL duplicate_index 5 NULL 49222 Using index </code></pre> <p>By the way, the query without the sort, which still runs acceptably, is EXPLAIN'ed like this:</p> <pre><code>id select_type table type possible_keys key key_len ref rows Extra 1 PRIMARY COMMENT ALL PRIMARY NULL NULL NULL 49610 1 PRIMARY error eq_ref PRIMARY,match_id PRIMARY 4 tangbig4.COMMENT.match_id 1 1 PRIMARY matches eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1 1 PRIMARY &lt;derived2&gt; ALL NULL NULL NULL NULL 2 1 PRIMARY STATUS eq_ref PRIMARY PRIMARY 4 tangbig4.COMMENT.match_id 1 1 PRIMARY volume eq_ref PRIMARY PRIMARY 4 tangbig4.matches.match_id 1 Using where 2 DERIVED duplicate index NULL duplicate_index 5 NULL 49222 Using index </code></pre> <h3>Question</h3> <p>So, my question is if someone who know more about databases/MySQL is able to find me a way that I can use/research to increase the performance of my last query.</p> <p>I've been thinking quite a lot about materialized views but they are not natively supported in MySQL and since I'm going for as wide a range of SQL servers as possible this might not be idea. I'm hoping maybe a change to the queries or views might help or possible an extra index.</p> <p><strong>EDIT</strong>: Some random thoughts I've been having about the query:</p> <ul> <li>VERY FAST: joining all tables, excluding the VIEW, <strong>sorting</strong></li> <li>ACCEPTABLE: joining all tables, including the VIEW, <strong>no sorting</strong></li> <li>DOG SLOW: joining all tables, including the VIEW, <strong>sorting</strong></li> </ul> <p>But: <strong>the VIEW has no influence at all on the sorting, none of it's attributes or even the attributes in its constituent tables are used to sort. Why does includingg the sort impact performance that much then? Is there any way I can convince the database to sort first and then just join up the VIEW? Or can I convince it that the VIEW is not important for sorting?</strong></p> <p><strong>EDIT2</strong>: Following the suggestion by @ace for creating a VIEW and then joining at first didn't seem to help:</p> <pre><code>DROP VIEW IF EXISTS `matches_joined`; CREATE VIEW `matches_joined` AS ( SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT FROM matches INNER JOIN STATUS ON matches.match_id = status.match_id INNER JOIN error ON matches.match_id = error.match_id INNER JOIN volume ON matches.match_id = volume.match_id INNER JOIN COMMENT ON matches.match_id = comment.match_id ORDER BY error.error ); </code></pre> <p>followed by:</p> <pre><code>SELECT matches_joined.*, num_duplicates FROM matches_joined LEFT JOIN num_duplicates ON matches_joined.match_id = num_duplicates.match_id </code></pre> <p>However, using LIMIT on the view did make a difference:</p> <pre><code>DROP VIEW IF EXISTS `matches_joined`; CREATE VIEW `matches_joined` AS ( SELECT matches.match_id, source_name, target_name, transformation, STATUS , volume, error, COMMENT FROM matches INNER JOIN STATUS ON matches.match_id = status.match_id INNER JOIN error ON matches.match_id = error.match_id INNER JOIN volume ON matches.match_id = volume.match_id INNER JOIN COMMENT ON matches.match_id = comment.match_id ORDER BY error.error LIMIT 0, 20 ); </code></pre> <p>Afterwards, the query ran at an acceptable speed. This is already a nice result. However, I feel that I'm jumping through hoops to force the database to do what I want and the reduction in time is probably only caused by the fact that it now only has to sort 20 rows. What if I have more rows? Is there any other way to force the database to see that joining in the <code>num_duplicates</code> VIEW doesn't influence the sorting in the least? Could I perhaps change the query that makes the VIEW a bit?</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload