Improve performance of COUNT/GROUP BY in a large PostgreSQL table?

Main problem is the missing index. But there is more.

SELECT user_id, count(*) AS ct
FROM   treenode
WHERE  project_id = 1
GROUP  BY user_id;
  • You have many bigint columns. Probably overkill. Typically, integer is more than enough for columns like project_id and user_id. This would also help the next item.
    While optimizing the table definition, consider this related answer, with an emphasis on data alignment and padding. But most of the rest applies, too:

    • Configuring PostgreSQL for read performance
  • The elephant in the room: there is no index on project_id. Create one. This is more important than the rest of this answer.
    While you're at it, make it a multicolumn index:

    CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
    

    If you followed my advice, integer would be perfect here:

    • Is a composite index also good for queries on the first field?
  • user_id is defined NOT NULL, so count(user_id) is equivalent to count(*), but the latter is a bit shorter and faster. (In this specific query, this would even apply without user_id being defined NOT NULL.)

  • id is already the primary key, the additional UNIQUE constraint is useless ballast. Drop it:

    "treenode_pkey" PRIMARY KEY, btree (id)
    "treenode_id_key" UNIQUE CONSTRAINT, btree (id)

    Aside: I'd not use id as column name. Use something descriptive like treenode_id.
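
The cleanup steps above, taken together, might look like this (a sketch; the constraint name is taken from the catalog listing above, and I'm assuming project_id and user_id are the bigint columns worth shrinking — adjust to your actual schema):

    -- Drop the redundant UNIQUE constraint; the PK already enforces uniqueness:
    ALTER TABLE treenode DROP CONSTRAINT treenode_id_key;

    -- Shrink bigint columns that will never exceed the integer range.
    -- This rewrites the table, so plan a maintenance window for a big table:
    ALTER TABLE treenode
      ALTER COLUMN project_id TYPE integer,
      ALTER COLUMN user_id    TYPE integer;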

Added information

Q: How many different project_id and user_id?
A: not more than five different project_id.

That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. No more performance to gain here - except by optimizing the table and server settings.
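To verify which plan Postgres actually picks for your query, check with EXPLAIN (the plan shape will vary with your data, version, and settings):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT user_id, count(*) AS ct
    FROM   treenode
    WHERE  project_id = 1
    GROUP  BY user_id;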

As for the index-only scan: To see how effective that can be, run VACUUM ANALYZE on the table. (Plain VACUUM does not block concurrent reads and writes; only VACUUM FULL takes an exclusive lock.) Then try your query again. It should now be moderately faster, using only the index. Read this related answer first:

  • Optimize simple query using ORDER BY date and text

As well as the manual page added with Postgres 9.6 and the Postgres Wiki on index-only scans.
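A minimal sequence to test this (sketch; VACUUM updates the visibility map, which index-only scans depend on):

    VACUUM ANALYZE treenode;  -- refresh visibility map and planner statistics

    EXPLAIN
    SELECT user_id, count(*) AS ct
    FROM   treenode
    WHERE  project_id = 1
    GROUP  BY user_id;

An "Index Only Scan" node in the output (rather than "Seq Scan") indicates the multicolumn index is being used to satisfy the query without heap access.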


I'd first add an index on (project_id, user_id) and then, in version 9.3, try this query:

SELECT u.user_id, c.number_of_nodes 
FROM users AS u
   , LATERAL
     ( SELECT COUNT(*) AS number_of_nodes 
       FROM treenode AS t
       WHERE t.project_id = 1 
         AND t.user_id = u.user_id
     ) c 
-- WHERE c.number_of_nodes > 0 ;   -- you probably want this as well
                                   -- to show only relevant users

In 9.2, try this one:

SELECT u.user_id, 
       ( SELECT COUNT(*) 
         FROM treenode AS t
         WHERE t.project_id = 1 
           AND t.user_id = u.user_id
       ) AS number_of_nodes  
FROM users AS u ;

I assume you have a users table. If not, replace users with:
(SELECT DISTINCT user_id FROM treenode)
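
Spelled out with the derived table in place of users (using the 9.2 variant as the example):

    SELECT u.user_id,
           ( SELECT COUNT(*)
             FROM treenode AS t
             WHERE t.project_id = 1
               AND t.user_id = u.user_id
           ) AS number_of_nodes
    FROM  (SELECT DISTINCT user_id FROM treenode) AS u;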