  1. POSQL magic - query shouldn't take 15 hours, but it does
<p>Ok, so I have one really monstrous MySQL table (900k records, 180 MB total), and I want to extract from each subgroup the records with the latest <code>date_updated</code> and calculate a weighted average in each group. The calculation runs for ~15 hours, and I have a strong feeling I'm <strong>doing it wrong</strong>.</p>
<p>First, the monstrous table layout:</p>
<ul>
<li><code>category</code></li>
<li><code>element_id</code></li>
<li><code>date_updated</code></li>
<li><code>value</code></li>
<li><code>weight</code></li>
<li><code>source_prefix</code></li>
<li><code>source_name</code></li>
</ul>
<p>The only key here is on <code>element_id</code> (BTREE, ~8k unique elements).</p>
<p>And the calculation process:</p>
<p><em>Make a hash for each group and subgroup.</em></p>
<pre><code>CREATE TEMPORARY TABLE `temp1` (INDEX ( `subcat_hash` ))
SELECT `category`,
       `element_id`,
       `source_prefix`,
       `source_name`,
       `date_updated`,
       `value`,
       `weight`,
       MD5(CONCAT(`category`, `element_id`, `source_prefix`, `source_name`)) AS `subcat_hash`,
       MD5(CONCAT(`category`, `element_id`, `date_updated`)) AS `cat_hash`
FROM `bigbigtable`
WHERE `date_updated` &lt;= '2009-04-28';
</code></pre>
<p>I really don't understand this fuss with hashes, but it worked faster this way. Dark magic, I presume.</p>
<p><em>Find the maximum date for each subgroup.</em></p>
<pre><code>CREATE TEMPORARY TABLE `temp2` (INDEX ( `subcat_hash` ))
SELECT MAX(`date_updated`) AS `maxdate`,
       `subcat_hash`
FROM `temp1`
GROUP BY `subcat_hash`;
</code></pre>
<p><em>Join temp1 with temp2 to find weighted average values for categories.</em></p>
<pre><code>CREATE TEMPORARY TABLE `valuebycats` (INDEX ( `category` ))
SELECT `temp1`.`element_id`,
       `temp1`.`category`,
       `temp1`.`source_prefix`,
       `temp1`.`source_name`,
       `temp1`.`date_updated`,
       AVG(`temp1`.`value`) AS `avg_value`,
       SUM(`temp1`.`value` * `temp1`.`weight`) / SUM(`weight`) AS `rating`
FROM `temp1`
LEFT JOIN `temp2` ON `temp1`.`subcat_hash` = `temp2`.`subcat_hash`
WHERE `temp2`.`subcat_hash` = `temp1`.`subcat_hash`
  AND `temp1`.`date_updated` = `temp2`.`maxdate`
GROUP BY `temp1`.`cat_hash`;
</code></pre>
<p>(Now that I have looked through it and written it all down, it seems to me that I should use INNER JOIN in that last query, to avoid a 900k*900k temp table.)</p>
<p>Still, is there a <strong>normal way</strong> to do this?</p>
<p><strong>UPD</strong>: some picture for reference:</p>
<p><em>removed dead ImageShack link</em></p>
<p><strong>UPD</strong>: EXPLAIN for the proposed solution:</p>
<pre><code>+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key        | key_len | ref                                                                                  | rows   | filtered | Extra                                        |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
|  1 | SIMPLE      | cur   | ALL  | NULL          | NULL       | NULL    | NULL                                                                                 | 893085 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | next  | ref  | prefix        | prefix     | 1074    | bigbigtable.cur.source_prefix,bigbigtable.cur.source_name,bigbigtable.cur.element_id |      1 |   100.00 | Using where                                  |
+----+-------------+-------+------+---------------+------------+---------+--------------------------------------------------------------------------------------+--------+----------+----------------------------------------------+
</code></pre>
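For reference, the computation described above (keep the latest record per subgroup, then take a weighted average per group) can be sketched on a tiny in-memory table. This is only an illustration, not a tuned MySQL solution: it uses Python's `sqlite3` module instead of MySQL, the function name `weighted_ratings` and the sample rows are made up, and it assumes a subgroup is (`category`, `element_id`, `source_prefix`, `source_name`) and that the final grouping is per (`category`, `element_id`), which differs from the `cat_hash` above (that one also includes `date_updated`).

```python
import sqlite3

# Hypothetical sample rows, in the column order of the table layout above:
# (category, element_id, date_updated, value, weight, source_prefix, source_name)
SAMPLE_ROWS = [
    ("a", 1, "2009-04-01", 10.0, 1.0, "p", "s1"),
    ("a", 1, "2009-04-20", 20.0, 1.0, "p", "s1"),  # latest for source s1
    ("a", 1, "2009-04-10", 40.0, 3.0, "p", "s2"),  # latest for s2 before cutoff
    ("a", 1, "2009-05-01", 99.0, 5.0, "p", "s2"),  # after the cutoff, ignored
    ("b", 2, "2009-04-15", 5.0, 2.0, "q", "s1"),
]


def weighted_ratings(rows, cutoff):
    """Return (category, element_id, avg_value, rating) per group.

    Only the latest row of each subgroup on or before `cutoff` is used.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE bigbigtable (
               category TEXT, element_id INTEGER, date_updated TEXT,
               value REAL, weight REAL,
               source_prefix TEXT, source_name TEXT)"""
    )
    conn.executemany("INSERT INTO bigbigtable VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

    # The derived table finds the maximum date per subgroup; the inner join
    # keeps only the rows carrying that date, so no hash columns are needed.
    query = """
        SELECT cur.category,
               cur.element_id,
               AVG(cur.value) AS avg_value,
               SUM(cur.value * cur.weight) / SUM(cur.weight) AS rating
        FROM bigbigtable AS cur
        INNER JOIN (
            SELECT category, element_id, source_prefix, source_name,
                   MAX(date_updated) AS maxdate
            FROM bigbigtable
            WHERE date_updated <= ?
            GROUP BY category, element_id, source_prefix, source_name
        ) AS latest
          ON  latest.category      = cur.category
          AND latest.element_id    = cur.element_id
          AND latest.source_prefix = cur.source_prefix
          AND latest.source_name   = cur.source_name
          AND latest.maxdate       = cur.date_updated
        GROUP BY cur.category, cur.element_id
        ORDER BY cur.category, cur.element_id
    """
    result = conn.execute(query, (cutoff,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in weighted_ratings(SAMPLE_ROWS, "2009-04-28"):
        print(row)
```

On a real MySQL table the same shape of query would most likely want a composite index covering the subgroup columns plus `date_updated`. That is an assumption about this table, not something measured here.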