PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

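Query #1 is not an elegant way to do this. (NOT) IN (SELECT ...) is fine for a few entries, but PostgreSQL does not turn it into an anti-join, so it typically ends up as a sequential scan with a subplan (Seq Scan) instead of a join that can use indexes.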

Without EXCEPT, the alternative is a LEFT JOIN that keeps only the unmatched rows (the planner can run this as a hash join):

    SELECT sp.id
    FROM subsource_position AS sp
        LEFT JOIN subsource AS s ON (s.position_id = sp.id)
    WHERE
        s.position_id IS NULL

EXCEPT was added to Postgres a long time ago. In MySQL, though, I believe this is still the only way to achieve this that can use indexes.
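As a rough check that the LEFT JOIN ... IS NULL form really returns the unmatched rows, here is a minimal sketch that runs it next to a NOT EXISTS version. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table contents are made up for illustration. It only compares results, so it says nothing about the plans PostgreSQL would choose.

```python
import sqlite3

# Hypothetical miniature versions of the two tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subsource_position (id INTEGER PRIMARY KEY);
    CREATE TABLE subsource (position_id INTEGER);
    INSERT INTO subsource_position (id) VALUES (1), (2), (3), (4);
    INSERT INTO subsource (position_id) VALUES (2), (4), (4);
""")

# Anti-join: keep positions that have no matching subsource row.
left_join_ids = [row[0] for row in conn.execute("""
    SELECT sp.id
    FROM subsource_position AS sp
        LEFT JOIN subsource AS s ON (s.position_id = sp.id)
    WHERE s.position_id IS NULL
    ORDER BY sp.id
""")]

# The same question phrased with NOT EXISTS.
not_exists_ids = [row[0] for row in conn.execute("""
    SELECT sp.id
    FROM subsource_position AS sp
    WHERE NOT EXISTS (
        SELECT 1 FROM subsource AS s WHERE s.position_id = sp.id
    )
    ORDER BY sp.id
""")]

print(left_join_ids)   # [1, 3]
print(not_exists_ids)  # [1, 3]
```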


Your queries are not functionally equivalent so any comparison of their query plans is meaningless.

Your first query is, in set theory terms, this:

{subsource.position_id} - {subsource_position.id}
          ^        ^                ^        ^

but your second is this:

{subsource_position.id} - {subsource.position_id}
          ^        ^                ^        ^

And A - B is not the same as B - A for arbitrary sets A and B.

Fix your queries to be semantically equivalent and try again.
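To make the asymmetry concrete, here is a tiny sketch using Python sets. The values are invented and merely stand in for the two columns.

```python
# Hypothetical values standing in for the two columns.
position_ids = {1, 2, 3, 4}    # subsource_position.id
referenced_ids = {2, 4, 5}     # subsource.position_id

# Set difference is not symmetric: the two directions answer different questions.
referenced_minus_positions = sorted(referenced_ids - position_ids)
positions_minus_referenced = sorted(position_ids - referenced_ids)

print(referenced_minus_positions)  # [5]
print(positions_minus_referenced)  # [1, 3]
```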


Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1MB of work memory. Try 10MB or 20MB.


If id and position_id are both indexed (either on their own or as the first column of a multi-column index), then two index scans are all that is necessary: it's a trivial set difference computed by merging two sorted inputs.
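The merge idea can be sketched in a few lines of Python. This is only an illustration of the algorithm over two already-sorted sequences (what two index scans would deliver), not a description of what PostgreSQL does internally. The function name sorted_difference is made up for the example.

```python
_END = object()  # sentinel marking the end of the right-hand input


def sorted_difference(left, right):
    """Yield values of sorted `left` that do not appear in sorted `right`."""
    right_iter = iter(right)
    current = next(right_iter, _END)
    for value in left:
        # Advance the right side until it catches up with `value`.
        while current is not _END and current < value:
            current = next(right_iter, _END)
        # No equal value on the right means `value` is unmatched.
        if current is _END or current != value:
            yield value


# Hypothetical sorted inputs, as two index scans would produce them.
unmatched = list(sorted_difference([1, 2, 3, 4], [2, 4, 4, 7]))
print(unmatched)  # [1, 3]
```

Each input is read once from start to finish, so the work is linear in the combined number of rows.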

Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.

(I came to this question after diagnosing a query that had been running for over 24 hours, on data I could process with sort x y y | uniq -u on the command line in seconds. The database was less than 50MB when exported with pg_dump.)

PS: more interesting comment here:

more work has been put into optimizing EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially less useful due to its unintuitive but spec-mandated handling of NULLs. We're not going to apologize for that, and we're not going to regard it as a bug.

What it comes down to is that EXCEPT differs from NOT IN with respect to NULL handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize NOT IN.
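For anyone who wants to see the NULL difference, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in, on the assumption that it applies the same three-valued logic to NOT IN as PostgreSQL does. The tables a and b are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a (x) VALUES (1), (2), (3);
    INSERT INTO b (x) VALUES (2), (NULL);
""")


def column(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# A single NULL in b makes "x NOT IN (...)" evaluate to NULL, never TRUE,
# for every x that is not otherwise matched, so no row survives the filter.
not_in_rows = column(
    "SELECT x FROM a WHERE x NOT IN (SELECT x FROM b) ORDER BY 1")

# EXCEPT compares rows as set members, so the NULL in b removes nothing extra.
except_rows = column("SELECT x FROM a EXCEPT SELECT x FROM b ORDER BY 1")

# NOT EXISTS only asks whether a matching row exists, so NULL never matches.
not_exists_rows = column("""
    SELECT x FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.x = a.x)
    ORDER BY 1
""")

print(not_in_rows)      # []
print(except_rows)      # [1, 3]
print(not_exists_rows)  # [1, 3]
```

Because NOT IN has to give the empty result whenever the subquery contains a NULL, it cannot simply be rewritten as an anti-join unless the column is known to be non-null.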

Tags:

Sql

Postgresql