StackOverflow2013

Note that there are some explanatory texts on larger screens.

plurals

POUnpredictable query performance in Postgresql
primarykey
Id
19691575
data
AcceptedAnswerId
19703821
AnswerCount
2
ClosedDate
CommentCount
7
CommunityOwnedDate
CreationDate
2013-10-30T19:00:11.567
FavoriteCount
1
LastActivityDate
2013-10-31T11:16:50.213
LastEditDate
2013-10-31T09:35:57.047
LastEditorUserId
276010
OwnerUserId
276010
ParentId
0
PostTypeId
1
Score
3
ViewCount
481
LastEditorDisplayName
text
Body
I have tables like that in a Postgres 9.3 database: <pre><code>A <1---n B n---1> C </code></pre> Table A contains ~10^7 rows, table B is rather big with ~10^9 rows and C contains ~100 rows. I use the following query to find all As (distinct) that match some criteria in B and C (the real query is more complex, joins more tables and checks more attributes within the subquery): Query 1: <pre><code>explain analyze select A.SNr from A where exists (select 1 from B, C where B.AId = A.Id and B.CId = C.Id and B.Timestamp >= '2013-01-01' and B.Timestamp <= '2013-01-12' and C.Name = '00000015') limit 200; </code></pre> That query takes about 500ms (Note that C.Name = '00000015' exists in the table): <pre><code>Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=427.799..465.485 rows=200 loops=1) -> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=427.797..465.460 rows=200 loops=1) Hash Cond: (a.id = b.aid) -> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..15.058 rows=133470 loops=1) -> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=427.233..427.233 rows=190920 loops=1) Buckets: 4096 Batches: 8 Memory Usage: 838kB -> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.176..400.326 rows=190920 loops=1) -> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.015..0.030 rows=1 loops=1) Filter: (name = '00000015'::text) Rows Removed by Filter: 149 -> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (actual time=0.157..382.896 rows=190920 loops=1) Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone)) Heap Fetches: 0 Total runtime: 476.173 ms </code></pre> Query 2: Changing C.Name to something that doesn't exist (C.Name = 'foo') takes 0.1ms: <pre><code>explain analyze select A.SNr from A where exists (select 1 from B, C where B.AId = A.Id and B.CId = C.Id and B.Timestamp >= '2013-01-01' and B.Timestamp <= '2013-01-12' and C.Name = 'foo') limit 200; Limit (cost=119656.37..120234.06 rows=200 width=9) (actual time=0.063..0.063 rows=0 loops=1) -> Hash Semi Join (cost=119656.37..483518.78 rows=125971 width=9) (actual time=0.062..0.062 rows=0 loops=1) Hash Cond: (a.id = b.aid) -> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.010..0.010 rows=1 loops=1) -> Hash (cost=117588.73..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1) Buckets: 4096 Batches: 8 Memory Usage: 0kB -> Nested Loop (cost=0.57..117588.73 rows=125971 width=4) (actual time=0.038..0.038 rows=0 loops=1) -> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.037..0.037 rows=0 loops=1) Filter: (name = 'foo'::text) Rows Removed by Filter: 150 -> Index Only Scan using cid_aid on b (cost=0.57..116291.64 rows=129422 width=8) (never executed) Index Cond: ((cid = c.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-12 00:00:00'::timestamp without time zone)) Heap Fetches: 0 Total runtime: 0.120 ms </code></pre> Query 3: Resetting the C.Name to something that exists (like in the first query) and increasing the timestamp by 3 days uses another query plan than before, but is still fast (200ms): <pre><code>explain analyze select A.SNr from A where exists (select 1 from B, C where B.AId = A.Id and B.CId = C.Id and B.Timestamp >= '2013-01-01' and B.Timestamp <= '2013-01-15' and C.Name = '00000015') limit 200; Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=4.404..227.569 rows=200 loops=1) -> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=4.403..227.544 rows=200 loops=1) -> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..1.046 rows=12250 loops=1) -> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.017..0.017 rows=0 loops=12250) -> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.005..0.015 rows=1 loops=12250) Filter: (name = '00000015'::text) Rows Removed by Filter: 147 -> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (actual time=0.002..0.002 rows=0 loops=12250) Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone)) Heap Fetches: 0 Total runtime: 227.632 ms </code></pre> Query 4: But that new query plan utterly fails when searching for a C.Name that doesn't exist:: <pre><code>explain analyze select A.SNr from A where exists (select 1 from B, C where B.AId = A.Id and B.CId = C.Id and B.Timestamp >= '2013-01-01' and B.Timestamp <= '2013-01-15' and C.Name = 'foo') limit 200; </code></pre> Now it takes 170 seconds (vs. 0.1ms before!) to return the same 0 rows: <pre><code>Limit (cost=0.57..112656.93 rows=200 width=9) (actual time=170184.979..170184.979 rows=0 loops=1) -> Nested Loop Semi Join (cost=0.57..90347016.34 rows=160394 width=9) (actual time=170184.977..170184.977 rows=0 loops=1) -> Seq Scan on a (cost=0.00..196761.34 rows=12020034 width=13) (actual time=0.008..794.626 rows=12020034 loops=1) -> Nested Loop (cost=0.57..7.49 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034) -> Seq Scan on c (cost=0.00..2.88 rows=1 width=4) (actual time=0.013..0.013 rows=0 loops=12020034) Filter: (name = 'foo'::text) Rows Removed by Filter: 150 -> Index Only Scan using cid_aid on b (cost=0.57..4.60 rows=1 width=8) (never executed) Index Cond: ((cid = c.id) AND (aid = a.id) AND ("timestamp" >= '2013-01-01 00:00:00'::timestamp without time zone) AND ("timestamp" <= '2013-01-15 00:00:00'::timestamp without time zone)) Heap Fetches: 0 Total runtime: 170185.033 ms </code></pre> All queries were run after "alter table set statistics" with a value of 10000 on all columns and after running analyze on the whole db. Right now it looks like the slightest change of a parameter (not even of the SQL) can make Postgres choose a bad plan (0.1ms vs. 170s in this case!). I always try to check query plans when changing things, but it's hard to ever be sure that something will work when such small changes on parameters can make such huge differences. I have similar problems with other queries too. What can I do to get more predictable results? (I have tried modifying certain query planning parameters (set enable_... = on/off) and some different SQL statements - joining+distinct/group by instead of "exists" - but nothing seems to make postgres choose "stable" query plans while still providing acceptable performance). Edit #1: Table + index definitions <pre><code>test=# \d a Tabelle äpublic.aô Spalte | Typ | Attribute --------+---------+---------------------------------------------------- id | integer | not null Vorgabewert nextval('a_id_seq'::regclass) anr | integer | snr | text | Indexe: "a_pkey" PRIMARY KEY, btree (id) "anr_snr_index" UNIQUE, btree (anr, snr) "anr_index" btree (anr) Fremdschlnssel-Constraints: "anr_fkey" FOREIGN KEY (anr) REFERENCES pt(id) Fremdschlnsselverweise von: TABLE "b" CONSTRAINT "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id) test=# \d b Tabelle äpublic.bô Spalte | Typ | Attribute -----------+-----------------------------+----------- id | uuid | not null timestamp | timestamp without time zone | cid | integer | aid | integer | prop1 | text | propn | integer | Indexe: "b_pkey" PRIMARY KEY, btree (id) "aid_cid" btree (aid, cid) "cid_aid" btree (cid, aid, "timestamp") "timestamp_index" btree ("timestamp") Fremdschlnssel-Constraints: "aid_fkey" FOREIGN KEY (aid) REFERENCES a(id) "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id) test=# \d c Tabelle äpublic.cô Spalte | Typ | Attribute --------+---------+---------------------------------------------------- id | integer | not null Vorgabewert nextval('c_id_seq'::regclass) name | text | Indexe: "c_pkey" PRIMARY KEY, btree (id) "c_name_index" UNIQUE, btree (name) Fremdschlnsselverweise von: TABLE "b" CONSTRAINT "cid_fkey" FOREIGN KEY (cid) REFERENCES c(id) </code></pre>
Tags
<sql><database><postgresql><query-performance>
Title
Unpredictable query performance in Postgresql
singulars
PostAcceptedAnswerId
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
PostParentId
1. This table or related slice is empty.
PostTypePostTypeId
1. PTQuestion
UserLastEditorUserId
1. USstmax
UserOwnerUserId
1. USstmax
plurals
PostLinksPostIdRelatedPostId
1. This table or related slice is empty.
PostLinksRelatedPostIdPostId
1. This table or related slice is empty.
PostsAcceptedAnswerId
1. This table or related slice is empty.
PostsParentIdCreationDate
1. PO
 singulars
 PostTypePostTypeId
 PTAnswer
2. PO
 singulars
 PostTypePostTypeId
 PTAnswer
VotesPostIdCreationDate
1. VO
 singulars
 PostPostId
 POUnpredictable query performance in Postgresql
 UserUserId
 USFabian
 VoteTypeVoteTypeId
 VTFavorite
2. VO
 singulars
 PostPostId
 POUnpredictable query performance in Postgresql
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
3. VO
 singulars
 PostPostId
 POUnpredictable query performance in Postgresql
 UserUserId
 This table or related slice is empty.
 VoteTypeVoteTypeId
 VTUpMod
CommentsPostId

Querying!

Guidance

A row detail

Detail views are divided into sections. All the information in the data section comes from columns in the selected row. The other sections display data from other, related rows.

Related data can be related in a to-one or a to-many fashion. Captions of data related in a to-many fashion link to a list view showing a filtered view of the table.

Try moving around until you find a non-empty to-many entry and click on the label to get to one. You can move back to the root by clicking on the database name in the header.