Note that there are some explanatory texts on larger screens.

plurals
  1. POPostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)
    text
    copied!<p>I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.</p> <p>Query #1:</p> <pre><code>SELECT id FROM subsource_position WHERE id NOT IN (SELECT position_id FROM subsource) </code></pre> <p>This comes back with the following plan:</p> <pre><code> QUERY PLAN ------------------------------------------------------------------------------- Seq Scan on subsource_position (cost=0.00..362486535.10 rows=128524 width=4) Filter: (NOT (SubPlan 1)) SubPlan 1 -&gt; Materialize (cost=0.00..2566.50 rows=101500 width=4) -&gt; Seq Scan on subsource (cost=0.00..1662.00 rows=101500 width=4) </code></pre> <p>Query #2:</p> <pre><code>SELECT id FROM subsource_position EXCEPT SELECT position_id FROM subsource; </code></pre> <p>Plan:</p> <pre><code> QUERY PLAN ------------------------------------------------------------------------------------------------- SetOp Except (cost=24760.35..25668.66 rows=95997 width=4) -&gt; Sort (cost=24760.35..25214.50 rows=181663 width=4) Sort Key: "*SELECT* 1".id -&gt; Append (cost=0.00..6406.26 rows=181663 width=4) -&gt; Subquery Scan on "*SELECT* 1" (cost=0.00..4146.94 rows=95997 width=4) -&gt; Seq Scan on subsource_position (cost=0.00..3186.97 rows=95997 width=4) -&gt; Subquery Scan on "*SELECT* 2" (cost=0.00..2259.32 rows=85666 width=4) -&gt; Seq Scan on subsource (cost=0.00..1402.66 rows=85666 width=4) (8 rows) </code></pre> <p>I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this <code>NOT IN</code> to optimize well; is <code>NOT IN</code> always a performance problem or is there a reason it does not optimize here?</p> <p>Additional data:</p> <pre><code>=&gt; select count(*) from subsource; count ------- 85158 (1 row) =&gt; select count(*) from subsource_position; count ------- 93261 (1 row) </code></pre> <p><strong>Edit</strong>: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.</p> <p><strong>Edit 2</strong>: I'm using PostgresQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.</p> <p><strong>Edit 3</strong>: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.</p>
 

Querying!

 
Guidance

SQuiL has stopped working due to an internal error.

If you are curious you may find further information in the browser console, which is accessible through the devtools (F12).

Reload