Note that there are some explanatory texts on larger screens.

plurals
  1. POHow do I increase the speed of this data analysis?
    text
    copied!<p>I need to optimize how I analyze a rather large data set, and I'm unsure of what the next steps are. I have done a fair bit of MySQL configuration tuning.</p> <p>I have this InnoDB table:</p> <pre><code>+----------------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------------+--------------+------+-----+---------+----------------+ | id | int(250) | NO | PRI | NULL | auto_increment | | memory | int(15) | YES | MUL | NULL | | | q | varchar(250) | YES | MUL | NULL | | | created | datetime | YES | | NULL | | | modified | datetime | YES | | NULL | | | dt | datetime | YES | MUL | NULL | | | site_id | int(250) | NO | MUL | NULL | | | execution_time | int(11) | YES | MUL | NULL | | +----------------+--------------+------+-----+---------+----------------+ </code></pre> <p>Here is a sample of 10 rows:</p> <pre><code>+-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+ | id | memory | q | created | modified | dt | site_id | execution_time | +-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+ | 266864867 | 38011080 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:44 | 890 | 1534 | | 266864868 | 46090184 | node/16432 | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:04:46 | 890 | 840 | | 266864869 | 50329248 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:05:16 | 890 | 2500 | | 266864870 | 38011272 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:01 | 890 | 1494 | | 266864871 | 46087732 | node/16432 | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:03 | 890 | 850 | | 266864872 | 30304428 | node/303 | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:12 | 890 | 113 | | 266864873 | 50329412 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 | 890 | 2465 | | 266864874 | 28253112 | front_page | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:07:25 | 890 | 86 | | 266864875 | 28256044 | front_page | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:32 | 890 | 81 | | 266864876 | 38021072 | node/16432/edit | 2011-12-05 23:22:23 | 2011-12-05 23:22:23 | 2011-12-06 00:08:55 | 890 | 1458 | +-----------+----------+-----------------+---------------------+---------------------+---------------------+---------+----------------+ </code></pre> <p>Here are the table indexes:</p> <pre><code>+----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+ | Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | +----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+ | memories | 0 | PRIMARY | 1 | id | A | 8473766 | NULL | NULL | | BTREE | | | memories | 1 | index_dt | 1 | dt | A | 1210538 | NULL | NULL | YES | BTREE | | | memories | 1 | index_execution_time | 1 | execution_time | A | 2344 | NULL | NULL | YES | BTREE | | | memories | 1 | index_memory | 1 | memory | A | 8473766 | NULL | NULL | YES | BTREE | | | memories | 1 | index_site_id | 1 | site_id | A | 16 | NULL | NULL | | BTREE | | | memories | 1 | index_q | 1 | q | A | 338950 | NULL | NULL | YES | BTREE | | +----------+------------+----------------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+ </code></pre> <p>It stores over a million records for many different sites (site_id). For a given site, there might be 20,000 rows. The information stored is performance metrics for individual page requests. If it matters, non-obvious fields: The memory field is how much memory the script used, q is the path, site_id is a reference to a table Sites.</p> <p>I have two slow queries that I run on this data. The first gets the 25 most most memory-hog pages:</p> <pre><code>Select Memory.q, count(*) as count, AVG(Memory.memory) as average_memory, MAX(Memory.memory) as peak_memory, AVG(Memory.execution_time) as average_execution_time, MAX(Memory.execution_time) as peak_execution_time FROM Memory WHERE site_id = $some_site_id ORDER BY average_memory DESC GROUP BY Memory.q LIMIT 25 </code></pre> <p>The second query gets the the slowest average 25 pages for a given site:</p> <pre><code>Select Memory.q, count(*) as count, AVG(Memory.memory) as average_memory, MAX(Memory.memory) as peak_memory, AVG(Memory.execution_time) as average_execution_time, MAX(Memory.execution_time) as peak_execution_time FROM Memory WHERE site_id = $some_site_id ORDER BY average_execution_time DESC GROUP BY Memory.q LIMIT 25 </code></pre> <p>I recently converted the table from MyISAM to InnoDB, so that these reads would not lock the table. This was causing operations that update this table to queue up and lag.</p> <p>Beyond throwing more ram at the problem (to increase the InnoDB cache size), I want to see if there are other options. I've never worked with a NoSQL database, but from what I understand they won't be of much help here because I use aggregate functions and queries.</p> <p>The app is written in PHP, if it matters.</p> <p>Any ideas for a better way to approach the storage and analysis of this data?</p> <p><em>Update:</em></p> <p>Profiling the query shows the slowness is all in the copying to temp table. I will research how to make this step faster.</p> <pre><code>+--------------------------------+----------+ | Status | Duration | +--------------------------------+----------+ | starting | 0.000030 | | checking query cache for query | 0.000065 | | Opening tables | 0.000013 | | System lock | 0.000004 | | Table lock | 0.000014 | | init | 0.000032 | | optimizing | 0.000010 | | statistics | 0.008119 | | preparing | 0.000042 | | Creating tmp table | 0.000317 | | executing | 0.000005 | | Copying to tmp table | 5.349280 | | Sorting result | 0.006511 | | Sending data | 0.000092 | | end | 0.000005 | | removing tmp table | 0.001510 | | end | 0.000007 | | query end | 0.000004 | | freeing items | 0.001163 | | logging slow query | 0.000006 | | cleaning up | 0.000006 | +--------------------------------+----------+ 21 rows in set (0.01 sec) mysql&gt; show profile cpu for query 4; +--------------------------------+----------+----------+------------+ | Status | Duration | CPU_user | CPU_system | +--------------------------------+----------+----------+------------+ | starting | 0.000030 | 0.000000 | 0.000000 | | checking query cache for query | 0.000065 | 0.000000 | 0.000000 | | Opening tables | 0.000013 | 0.000000 | 0.000000 | | System lock | 0.000004 | 0.000000 | 0.000000 | | Table lock | 0.000014 | 0.000000 | 0.000000 | | init | 0.000032 | 0.000000 | 0.000000 | | optimizing | 0.000010 | 0.000000 | 0.000000 | | statistics | 0.008119 | 0.001000 | 0.000000 | | preparing | 0.000042 | 0.000000 | 0.000000 | | Creating tmp table | 0.000317 | 0.000000 | 0.000000 | | executing | 0.000005 | 0.000000 | 0.000000 | | Copying to tmp table | 5.349280 | 0.687896 | 0.412937 | | Sorting result | 0.006511 | 0.004999 | 0.001999 | | Sending data | 0.000092 | 0.000000 | 0.000000 | | end | 0.000005 | 0.000000 | 0.000000 | | removing tmp table | 0.001510 | 0.000000 | 0.001000 | | end | 0.000007 | 0.000000 | 0.000000 | | query end | 0.000004 | 0.000000 | 0.000000 | | freeing items | 0.001163 | 0.000000 | 0.001000 | | logging slow query | 0.000006 | 0.000000 | 0.000000 | | cleaning up | 0.000006 | 0.000000 | 0.000000 | +--------------------------------+----------+----------+------------+ </code></pre>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload