StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POSpeed up PostgreSQL query with timestamp OVERLAPS and "PARTITION BY"
text
Body
copied!<p>I have a fairly large table (500K - 1M rows) in PostgreSQL 9.0 that contains generic "time slice" information, that is, it determines when a row in another table (a "feature") is valid. The definition looks like this (slightly simplified):</p> <pre><code>CREATE TABLE feature_timeslice ( timeslice_id int NOT NULL, feature_id int NOT NULL, valid_time_begin timestamp NOT NULL, valid_time_end timestamp, sequence_number smallint, -- Some other columns CONSTRAINT pk_feature_timeslice PRIMARY KEY (timeslice_id) -- Some other constraints ) CREATE INDEX ix_feature_timeslice_feature_id ON feature_timeslice USING btree (feature_id); </code></pre> <p>Many other tables for specific features are then joined to it on <code>timeslice_id</code>:</p> <pre><code>CREATE TABLE specific_feature_timeslice ( timeslice_id int NOT NULL, -- Other columns CONSTRAINT pk_specific_feature_timeslice PRIMARY KEY (timeslice_id), CONSTRAINT fk_specific_feature_timeslice_feature_timeslice FOREIGN KEY (timeslice_id) REFERENCES feature_timeslice (timeslice_id) ) </code></pre> <p>There may be multiple time slices with overlapping valid periods (begin/end time), but the one with the highest <code>sequence_number</code> takes priority (again, a slight simplification, but close enough). I'd like to efficiently find the currently valid row for each feature_id, so I have a view defined, like this:</p> <pre><code>CREATE VIEW feature_timeslice_id_now AS SELECT timeslice_id FROM ( SELECT timeslice_id, rank() OVER ( PARTITION BY feature_id ORDER BY sequence_number DESC, timeslice_id DESC ) FROM feature_timeslice WHERE (current_timestamp AT TIME ZONE 'UTC', '0'::interval) OVERLAPS (valid_time_begin, COALESCE(valid_time_end, 'infinity'::timestamp)) ) subq WHERE subq.rank = 1 </code></pre> <p>It's typically queried like this:</p> <pre><code>SELECT * FROM specific_feature_timeslice sf JOIN feature_timeslice_id_now n USING (timeslice_id) WHERE sf.name = 'SOMETHING' </code></pre> <p>This works, but it's still a bit too slow - takes 1-2 seconds, even though there may only be 1-5 rows returned, because the <code>specific_feature_timeslice</code> criteria generally narrows it down a lot. (More complex queries that join multiple feature views get very slow very quickly.) I can't figure out how to get PostgreSQL to do this more efficiently. The query plan looks like this:</p> <pre><code> Join Filter: ((r.timeslice_id)::integer = (subq.timeslice_id)::integer) -> Subquery Scan on subq (cost=32034.36..37876.98 rows=835 width=4) (actual time=2086.125..5243.467 rows=250918 loops=1) Filter: (subq.rank = 1) -> WindowAgg (cost=32034.36..35790.33 rows=166932 width=10) (actual time=2086.110..4066.351 rows=250918 loops=1) -> Sort (cost=32034.36..32451.69 rows=166932 width=10) (actual time=2086.065..2654.971 rows=250918 loops=1) Sort Key: feature_timeslice.feature_id, feature_timeslice.sequence_number, feature_timeslice.timeslice_id Sort Method: quicksort Memory: 13898kB -> Seq Scan on feature_timeslice (cost=0.00..17553.93 rows=166932 width=10) (actual time=287.270..1225.595 rows=250918 loops=1) Filter: overlaps(timezone('UTC'::text, now()), (timezone('UTC'::text, now()) + '00:00:00'::interval), (valid_time_begin)::timestamp without time zone, COALESCE((valid_time_end)::timestamp without time zone, 'infinity'::timestamp without time zone)) -> Materialize (cost=0.00..1093.85 rows=2 width=139) (actual time=0.002..0.007 rows=2 loops=250918) -> Seq Scan on specific_feature_timeslice sf (cost=0.00..1093.84 rows=2 width=139) (actual time=1.958..7.674 rows=2 loops=1) Filter: ((name)::text = 'SOMETHING'::text) Total runtime: 10319.875 ms </code></pre> <p>In reality, I'd like to do this query for any given time, not just the current time. I have a function defined for that, which takes the time as an argument, but querying for "now" is the most common scenario, so even if I could only speed that up it would be a great improvement.</p> <p><strong>== Edit ==</strong></p> <p>OK, I've tried normalising the table as suggested by both answers - that is, I moved valid_time_begin and valid_time_end into a separate table, <code>time_period</code>. I also replaced the window function with <code>WHERE NOT EXISTS ([better candidate time slice])</code>. In the process I also upgraded to PostgreSQL 9.1. With all that some queries are now twice as fast. The query plan looks the same as in <strong>wildplasser</strong>'s answer. This is good, but not quite as good as I'd hoped - it still takes over a second to select from a single feature table.</p> <p>Ideally, I'd like to take advantage of the selectivity of the feature WHERE condition, as <strong>Erwin Brandstetter</strong> says. If I hand-craft a query to do this the time I get is 15-30 ms. Now that's more like it! The hand-crafted query looks something like this:</p> <pre><code>WITH filtered_feature AS ( SELECT * FROM specific_feature_timeslice sf JOIN feature_timeslice ft USING (timeslice_id) WHERE sf.name = 'SOMETHING' ) SELECT * FROM filtered_feature ff JOIN ( SELECT timeslice_id FROM filtered_feature candidate JOIN time_period candidate_time ON candidate.valid_time_period_id = candidate_time.id WHERE ('2011-09-26', '0'::interval) OVERLAPS (candidate_time.valid_time_begin, COALESCE(candidate_time.valid_time_end, 'infinity'::timestamp)) AND NOT EXISTS ( SELECT * FROM filtered_feature better JOIN time_period better_time ON better.valid_time_period_id = better_time.id WHERE ('2011-09-26', '0'::interval) OVERLAPS (better_time.valid_time_begin, COALESCE(better_time.valid_time_end, 'infinity'::timestamp)) AND better.feature_id = candidate.feature_id AND better.timeslice_id != candidate.timeslice_id AND better.sequence_number > candidate.sequence_number ) ) AS ft ON ff.timeslice_id = ft.timeslice_id </code></pre> <p>Unfortunately, this is way too big and complex to use in normal queries, which might join many other tables. I need some way to encapsulate this logic in a function (for arbitrary time) or at least a view (for current time), but I cannot figure out how to do this while still getting the query planner to filter on the specific feature first. If only I could pass a rowset into a function - but as far as I know PostgreSQL doesn't allow this. Any ideas?</p> <p><strong>== Conclusion ==</strong></p> <p>I ended up using PostgreSQL inheritance to solve this (see my answer), but I would not have come up with this idea if it wasn't for <strong>Erwin Brandstetter</strong> answer, so the bounty goes to him. <strong>wildplasser</strong>'s answer was also very helpful, because it allowed me to eliminate the unnecessary window function, which sped it up further. Many thanks to both of you!</p>

Querying!

Guidance

An individual column

Larger individual text columns get their own page to allow for proper reading.

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload