Using GROUP BY WITH ROLLUP on column that contains NULL

you have a "solution", so by saying

I am wondering if there is a better solution.

I think you want to comment about it. Let me focus on several aspects:

  1. SELECT ID [...] GROUP BY word, independently of having or not a WITH ROLLUP is wrong. You are selecting a field whose value is undetermined. In particular, with MySQL you get exposed of returning a random value; or if you use an strict mode (recommended, and by default in the latest versions of MySQL), of the query to fail: 'db_9_b0ff7.test.ID' isn't in GROUP BY

  2. Be it formally or not, WITH ROLLUP Note- (I refer to MySQL implementation) is something that is nice to have in some cases because it avoids a double rescan of the table, but I have not seen a lot in production code. Nothing wrong with it, but they were an exclusive MSSQL feature until it was implemented on SQL1999, and either it was not available widely back in the day or, like in the case of MySQL it was a limited implementation. Reference: https://www.percona.com/blog/2007/09/17/using-group-by-with-rollup-for-reporting-performance-optimization/

  3. I do not see in principle a problem with your final query, if we run explain on both queries (without ID), we see:

    MariaDB [test]> EXPLAIN SELECT   if(count(word) = 0, "empty", ifnull(word, "total")) as word,   count(*) as occurrences FROM test group by word with ROLLUP\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: test
             type: ALL
    possible_keys: NULL
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 4
            Extra: Using filesort
    1 row in set, 1 warning (0.00 sec)
    
    Warning (Code 1052): Column 'word' in group statement is ambiguous
    MariaDB [test]> EXPLAIN SELECT   ifnull(word, "total") as word,   count(*) as occurrences FROM ( SELECT   ifnull(word, "empty") as word FROM test ) tmp GROUP BY word WITH ROLLUP\G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: test
             type: ALL
    possible_keys: NULL
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 4
            Extra: Using filesort
    1 row in set, 1 warning (0.01 sec)
    

    Normally the hate towards MySQL is because its limitations on early versions of MySQL. I am not going to say that it is now perfect, there will be still issues, as MySQL development, as it has always been, focused on a fast, easy-in, easy-out RDBMS, and not a full-blown relational engine, but a subquery in the right place is ok (after all, some queries will require a subquery, no matter what. Your usage is ok- it will only have an overhead of going over the summarized rows again and I suppose on the real deal, those will not be significant compared to the full set.

    Note- on my instance up here, the subquery is not showed, but you can see it with materialized subqueries in 5.6 here: http://sqlfiddle.com/#!9/b0ff7/7 (It will execute the inner query and then the outer part, with no issue). An index could definitely help here (due to the filesort), but that depends on the final number of records. The subquery may hurt performance if an index (which does not exist now) cannot be used because it. You have to test your specific mysql version and create such an index.

  4. I think your problem is with the data model- in the SQL world (outside of Oracle) NULL is an unknown/invalid value. Seem to be correlating that to 'empty' which is not right- if a word is known to not exist, it should be '' (the empty string, not NULL). I do not know if you can change that, but I do not like your current model. I understand that you may not have something to say about it, so I will not focus on that. Alos you count NULL words? More reasons to use '' instead of NULL.
  5. Now the final question is, is it possible to do it without a subquery on MySQL, and will it be better/less ugly? I do not really see a way- we could do a join, but it will still be to a subquery. I can think of an alternative, which is:

    SELECT * FROM (
        SELECT   word,   
        count(*) as occurrences 
        FROM test 
        GROUP BY word with rollup) 
    tmp 
    ORDER BY occurrences;
    

    That makes sure that the last row contains the summary (and if there is other null, it is the actual null value). The order would happen even on the case of only null values. On the contrary, it may even slower as it requires ordering, and again you may need an index, and that index may or may not be used.


You can avoid subqueries and optimize for characters typed using this query:

SELECT
  ifnull(word, "empty") as word,
  count(*) as occurrences
FROM test
GROUP BY ifnull(word, "empty") WITH ROLLUP
;

This basically replaces null values with your "empty" marker before grouping is performed:

  • null counts can be found in row containing "empty" value in grouping column
  • total counts can be found in row containing null value in grouping column

Similar approach can be used for PostgreSQL, this should not be necessary for SQL Server as it provides a special GROUPING function.