SQL select distinct but "keep first"?

This all has to do with the "logical ordering" of SQL statements. Although a DBMS might actually retrieve the data according to all sorts of clever strategies, it has to behave according to some predictable logic. As such, the different parts of an SQL query can be considered to be processed "before" or "after" one another in terms of how that logic behaves.

As it happens, the ORDER BY clause is the very last step in that logical sequence, so it can't change the behaviour of "earlier" steps.

If you use a GROUP BY, the rows have been bundled up into their groups by the time the SELECT clause is run, let alone the ORDER BY, so you can only look at columns which have been grouped by, or "aggregate" values calculated across all the values in a group. (MySQL implements a controversial extension to GROUP BY where you can mention a column in the SELECT that can't logically be there, and it will pick one from an arbitrary row in that group; since MySQL 5.7.5 this is rejected by default unless the ONLY_FULL_GROUP_BY SQL mode is turned off.)

If you use a DISTINCT, it is logically processed after the SELECT, but the ORDER BY still comes afterwards. So only once the DISTINCT has thrown away the duplicates will the remaining results be put into a particular order - but the rows that have been thrown away can't be used to determine that order.
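To make that sequence concrete, here is a rough sketch of one query with each clause annotated by its logical step. The table definition, the WHERE and the HAVING conditions are only there so that every step has something to point at; they are assumptions for illustration, not part of the original question.

```sql
-- Hypothetical table, matching the column names used in this answer.
CREATE TABLE some_table (foo_number INT, bar_number INT);

-- The clauses are written in one order but logically processed in another.
SELECT DISTINCT foo_number   -- 5. SELECT, then 6. DISTINCT removes duplicates
FROM some_table              -- 1. FROM: pick the source rows
WHERE bar_number > 0         -- 2. WHERE: filter individual rows
GROUP BY foo_number          -- 3. GROUP BY: bundle rows into groups
HAVING COUNT(*) >= 1         -- 4. HAVING: filter whole groups
ORDER BY foo_number;         -- 7. ORDER BY: sort whatever is left
```

By step 7 only one row per foo_number remains, which is why ORDER BY can only refer to grouped columns or aggregates at that point.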


As for how to get the result you need, the key is to find a value to sort by which is valid after the GROUP BY/DISTINCT has (logically) been run. Remember that if you use a GROUP BY, any aggregated values are still valid - an aggregate function can look at all the values in a group. This includes MIN() and MAX(), which are ideal for ordering by, because "the lowest number" (MIN) is the same thing as "the first number if I sort them in ascending order", and vice versa for MAX.

So to order a set of distinct foo_number values based on the lowest applicable bar_number for each, you could use this:

SELECT foo_number
FROM some_table
GROUP BY foo_number
ORDER BY MIN(bar_number) ASC

Here's a live demo with some arbitrary data.
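If you want to try it locally, something along these lines should work; the rows are made up for illustration, and the multi-row INSERT may need rewriting as separate statements on older Oracle versions. The extra first_bar column is only there to show what each group is being sorted by.

```sql
-- Made-up sample data: each foo_number appears twice.
CREATE TABLE some_table (foo_number INT, bar_number INT);
INSERT INTO some_table (foo_number, bar_number) VALUES
    (1, 30), (2, 10), (3, 5), (1, 20), (2, 40), (3, 50);

-- One row per foo_number, ordered by the lowest bar_number in its group.
SELECT foo_number, MIN(bar_number) AS first_bar
FROM some_table
GROUP BY foo_number
ORDER BY MIN(bar_number) ASC;

-- Result:
--   foo_number | first_bar
--            3 |         5
--            2 |        10
--            1 |        20
```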


EDIT: In the comments, it was discussed why, if an ordering is applied before the grouping / de-duplication takes place, that order is not applied to the groups. If that were the case, you would still need a strategy for which row was kept in each group: the first, or the last.

As an analogy, picture the original set of rows as a set of playing cards picked from a deck, and then sorted by their face value, low to high. Now go through the sorted deck and deal them into a separate pile for each suit. Which card should "represent" each pile?

If you deal the cards face up, the cards showing at the end will be the ones with the highest face value (a "keep last" strategy); if you deal them face down and then flip each pile, you will reveal the lowest face value (a "keep first" strategy). Both are obeying the original order of the cards, and the instruction to "deal the cards based on suit" doesn't automatically tell the dealer (who represents the DBMS) which strategy was intended.

If the final piles of cards are the groups from a GROUP BY, then MIN() and MAX() represent picking up each pile and looking for the lowest or highest value, regardless of the order they are in. But because you can look inside the groups, you can do other things too, like adding up the total value of each pile (SUM) or how many cards there are (COUNT) etc, making GROUP BY much more powerful than an "ordered DISTINCT" could be.
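The card analogy translates fairly directly into a query. This is just a sketch with a hypothetical cards table and a handful of made-up cards; MIN plays the "keep first" role, MAX the "keep last" role, and SUM and COUNT are the extra things you can do once you are allowed to look inside each pile.

```sql
-- Hypothetical table: one row per card in the hand.
CREATE TABLE cards (suit VARCHAR(10), face_value INT);
INSERT INTO cards (suit, face_value) VALUES
    ('hearts', 2), ('spades', 9), ('hearts', 7),
    ('clubs', 4), ('spades', 3), ('hearts', 10);

-- One row per pile (suit), ordered by the lowest card in each pile.
SELECT suit,
       MIN(face_value) AS keep_first,  -- lowest card in the pile
       MAX(face_value) AS keep_last,   -- highest card in the pile
       SUM(face_value) AS total,       -- total value of the pile
       COUNT(*)        AS how_many     -- number of cards in the pile
FROM cards
GROUP BY suit
ORDER BY MIN(face_value);

-- Result:
--   suit   | keep_first | keep_last | total | how_many
--   hearts |          2 |        10 |    19 |        3
--   spades |          3 |         9 |    12 |        2
--   clubs  |          4 |         4 |     4 |        1
```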


I would go for something like

select col1
from (
    -- rank every row by where it falls when sorted on col2
    select col1,
           rank() over (order by col2) pos
    from table1
)
group by col1
order by min(pos)

In the subquery I calculate the position, then in the main query I do a group by on col1, using the smallest position to order.
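For anyone who wants to see the two steps with data, here is a small sketch using made-up rows in a table called table1. I have given the subquery an alias (ranked, a name chosen only for this example) because, unlike Oracle, most other databases with window functions expect one; on MySQL this needs version 8.0 or later.

```sql
-- Made-up sample data; col2 decides which occurrence of col1 comes "first".
CREATE TABLE table1 (col1 VARCHAR(10), col2 INT);
INSERT INTO table1 (col1, col2) VALUES
    ('b', 1), ('a', 2), ('b', 3), ('c', 4), ('a', 5);

-- Step 1 (subquery): give each row its position when sorted by col2.
--   ('b', pos 1), ('a', pos 2), ('b', pos 3), ('c', pos 4), ('a', pos 5)
-- Step 2 (outer query): keep one row per col1, ordered by its smallest position.
SELECT col1
FROM (
    SELECT col1,
           RANK() OVER (ORDER BY col2) AS pos
    FROM table1
) ranked
GROUP BY col1
ORDER BY MIN(pos);

-- Result: b, a, c
```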

Here is the demo in SQLFiddle (this was run on Oracle; the MySQL info was added later).

Edit for MySQL:

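select col1
from (
    -- pos numbers the rows in the order MySQL reads them; without an
    -- ORDER BY on table1 that order is not guaranteed
    select col1,
           @curRank := @curRank + 1 AS pos
    from table1, (select @curRank := 0) p
) sub
group by col1
order by min(pos)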

And here is the demo for MySQL.
