What is the best way to get a random ordering?

ORDER BY NEWID() will sort the records randomly. An example:

SELECT *
FROM Northwind..Orders 
ORDER BY NEWID()

This is an old question, but one aspect of the discussion is missing, in my opinion -- PERFORMANCE. ORDER BY NEWID() is the general answer. When someone gets fancy, they add that you should really wrap NEWID() in CHECKSUM(), you know, for performance!
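For reference, that variant just wraps the NEWID() call, using the same Northwind table as the example above:

SELECT *
FROM Northwind..Orders
ORDER BY CHECKSUM(NEWID())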

The problem with this method is that you're still guaranteed a full index scan and then a complete sort of the data. If you've worked with any serious data volume, this can rapidly become expensive. Look at this typical execution plan, and note how the sort takes 96% of your time ...

[Screenshot of a typical execution plan: the Sort operator accounts for 96% of the query cost]

To give you a sense of how this scales, I'll give you two examples from a database I work with.

  • Table A - has 50,000 rows across 2,500 data pages. The random query generates 145 reads in 42 ms.
  • Table B - has 1.2 million rows across 114,000 data pages. Running ORDER BY NEWID() on this table generates 53,700 reads and takes 16 seconds (the snippet after this list shows one way to capture numbers like these).
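
One way to reproduce measurements like these is SQL Server's statistics output. A minimal sketch, where dbo.YourLargeTable is a placeholder for your own table:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT *
FROM dbo.YourLargeTable
ORDER BY NEWID();

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;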

The moral of the story is that if you have large tables (think billions of rows) or need to run this query frequently, the NEWID() method breaks down. So what's a boy to do?

Meet TABLESAMPLE()

In SQL 2005 a new capability called TABLESAMPLE was introduced. I've only seen one article discussing its use...there should be more; the MSDN docs cover it. First, an example:

SELECT Top (20) *
FROM Northwind..Orders TABLESAMPLE(20 PERCENT)
ORDER BY NEWID()

The idea behind TABLESAMPLE is to give you approximately the subset size you ask for. SQL Server numbers each data page and selects X percent of those pages. The actual number of rows you get back can vary based on what exists in the selected pages.

So how do I use it? Select a subset size that more than covers the number of rows you need, then add a TOP(). The idea is that you can make your ginormous table appear smaller before the expensive sort.

Personally, I've been using it to limit the effective size of the table. So on that million-row table, doing TOP(20)...TABLESAMPLE(20 PERCENT) drops the query to 5,600 reads in 1,600 ms. There is also a REPEATABLE() option that takes a "seed" for the page selection; given stable data, this should result in the same sample being selected each time.
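As a sketch of the two combined (the seed value below is arbitrary; any integer will do):

SELECT TOP (20) *
FROM Northwind..Orders TABLESAMPLE (20 PERCENT) REPEATABLE (205)
ORDER BY NEWID()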

Anyway, just thought this should be added to the discussion. Hope it helps someone.


Pradeep Adiga's first suggestion, ORDER BY NEWID(), is fine and something I've used in the past for this reason.

Be careful with RAND() - in many contexts it is only executed once per statement, so ORDER BY RAND() will have no effect (as you are getting the same result out of RAND() for each row).

For instance:

SELECT display_name, RAND() FROM tr_person

returns each name from our person table and a "random" number, which is the same for each row. The number does vary each time you run the query, but is the same for each row each time.

To show that the same is the case with RAND() used in an ORDER BY clause, I try:

SELECT display_name FROM tr_person ORDER BY RAND(), display_name

The results are still ordered by the name, indicating that the earlier sort field (the one expected to be random) has no effect, so presumably it always has the same value.
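
As an aside, if you do want a different RAND() value for each row, one common workaround (not part of the original point) is to seed it with a per-row expression, which forces re-evaluation:

SELECT display_name, RAND(CHECKSUM(NEWID())) FROM tr_person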

Ordering by NEWID() does work, though: if NEWID() were not reassessed for every row, the purpose of UUIDs would be broken when inserting many new rows in one statement with unique identifiers as the key, so:

SELECT display_name FROM tr_person ORDER BY NEWID()

does order the names "randomly".

Other DBMS

The above is true for MSSQL (2005 and 2008 at least, and if I remember rightly 2000 as well). A function returning a new UUID should be evaluated once per row in any DBMS, as NEWID() is under MSSQL, but it is worth verifying this in the documentation and/or with your own tests. The behaviour of other arbitrary-result functions, like RAND(), is more likely to vary between DBMSs, so again check the documentation.
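A quick test along those lines, using the tr_person table from the earlier examples; each row should show a different UUID if the function really is re-evaluated per row:

SELECT TOP (5) display_name, NEWID() FROM tr_person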

Also, I've seen ordering by UUID values being ignored in some contexts, as the DB assumes the type has no meaningful ordering. If you find this to be the case, explicitly cast the UUID to a string type in the ordering clause, or wrap some other function around it, like CHECKSUM() in SQL Server. (There may be a small performance difference from this too, as the ordering will be done on 32-bit values rather than 128-bit ones, though whether that benefit outweighs the cost of running CHECKSUM() per value first I'll leave you to test.)
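For example, sketches of those two workarounds in SQL Server syntax:

SELECT display_name FROM tr_person ORDER BY CAST(NEWID() AS VARCHAR(36)) -- order by the string form of the UUID
SELECT display_name FROM tr_person ORDER BY CHECKSUM(NEWID()) -- order by a 32-bit checksum of the UUID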

Side Note

If you want an arbitrary but somewhat repeatable ordering, order by some relatively uncontrolled subset of the data in the rows themselves. For instance, either of these will return the names in an arbitrary but repeatable order:

SELECT display_name FROM tr_person ORDER BY CHECKSUM(display_name), display_name -- order by the checksum of some of the row's data
SELECT display_name FROM tr_person ORDER BY SUBSTRING(display_name, LEN(display_name)/2, 128) -- order by part of the name field, but not in an obviously recognisable order

Arbitrary but repeatable orderings are not often useful in applications, though they can be useful in testing: for instance, if you want to run some code against results in a variety of orders but be able to repeat each run the same way several times (to get average timing results over several runs, to verify that a fix really removes a problem or inefficiency previously highlighted by a particular input resultset, or just to check that your code is "stable" in that it returns the same result each time when sent the same data in a given order).

This trick can also be used to get more arbitrary results from functions that do not allow non-deterministic calls like NEWID() within their body. Again, this is not something that is likely to be often useful in the real world, but it could come in handy if you want a function to return something random and "random-ish" is good enough (but be careful to remember the rules that determine when user-defined functions are evaluated, i.e. usually only once per row, or your results may not be what you expect/require).
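A minimal sketch of that idea (the function name and parameter are made up for illustration): NEWID() is not allowed inside a user-defined function in SQL Server, but CHECKSUM() over row data is, giving an arbitrary-looking yet deterministic value.

CREATE FUNCTION dbo.fn_ArbitraryKey (@display_name NVARCHAR(256))
RETURNS INT
AS
BEGIN
    -- Deterministic, so legal in a UDF, but effectively arbitrary per input.
    RETURN CHECKSUM(@display_name)
END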

Performance

As EBarr points out, there can be performance issues with any of the above. For more than a few rows you are almost guaranteed to see the output spooled out to tempdb before the requested number of rows is read back in the right order, which means that even if you are looking for the top 10 you might find a full index scan (or worse, a table scan) happening, along with a huge block of writing to tempdb. Therefore it can be vitally important, as with most things, to benchmark with realistic data before using this in production.
