SQL Server splits A <> B into A < B OR A > B, yielding strange results if B is non-deterministic

Is SQL Server allowed to evaluate A <> B as A < B OR A > B, even if one of the expressions is non-deterministic?

This is a somewhat controversial point, and the answer is a qualified "yes".

The best discussion I am aware of was given in answer to Itzik Ben-Gan's Connect bug report "Bug with NEWID and Table Expressions", which was closed as "won't fix". Connect has since been retired, so the link to the report goes to a web archive copy. Sadly, a lot of useful material was lost (or made harder to find) by the demise of Connect. The most useful quotes from Jim Hogg of Microsoft in that thread are:

This hits to the very heart of the issue - is optimization allowed to change a program's semantics? Ie: if a program yields certain answers, but runs slowly, is it legitimate for a Query Optimizer to make that program run faster, yet also change the results given?

Before shouting "NO!" (my own personal inclination too :-), consider: the good news is that, in 99% of cases, the answers ARE the same. So Query Optimization is a clear win. The bad news is that, if the query contains side-effecting code, then different plans CAN indeed yield different results. And NEWID() is one such side-effecting (non-deterministic) 'function' that exposes the difference. [Actually, if you experiment, you can devise others - for example, short-circuit evaluation of AND clauses: make the second clause throw an arithmetic divide-by-zero - different optimizations may execute that second clause BEFORE the first clause] This reflects Craig's explanation, elsewhere in this thread, that SqlServer does not guarantee when scalar operators are executed.

So, we have a choice: if we want to guarantee a certain behavior in the presence of non-deterministic (side-effecting) code - so that results of JOINs, for example, follow the semantics of a nested-loop execution - then we can use appropriate OPTIONs to force that behavior - as UC points out. But the resulting code will run slow - that's the cost of, in effect, hobbling the Query Optimizer.

All that said, we are moving the Query Optimizer in the direction of "as expected" behavior for NEWID() - trading off performance for "results as expected".

One example of behaviour changing over time in this regard is the report "NULLIF works incorrectly with non-deterministic functions such as RAND()". There are other similar cases, for example COALESCE with a subquery, that can produce unexpected results and are also being addressed gradually.
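To see why NULLIF is sensitive to this, recall that SQL Server documents NULLIF(a, b) as equivalent to a searched CASE expression, CASE WHEN a = b THEN NULL ELSE a END, so a is referenced twice. The following is only a toy Python model of that expansion, not SQL Server code; the `make_source` helper and its fixed value sequence are made up for illustration and stand in for a non-deterministic expression.

```python
from itertools import cycle


def make_source(values):
    """Hypothetical stand-in for a non-deterministic expression such as RAND():
    every call returns the next value from a fixed, repeating sequence."""
    it = cycle(values)
    return lambda: next(it)


def nullif_expanded(expr, other):
    """Model of CASE WHEN expr = other THEN NULL ELSE expr END,
    where each reference to expr is evaluated afresh."""
    if expr() == other:   # first evaluation, used for the comparison
        return None
    return expr()         # second evaluation, used for the result


def nullif_once(expr, other):
    """Model of evaluating expr a single time and reusing that value."""
    value = expr()
    return None if value == other else value


# Assumed sequence: the first reference sees 1, the second sees 0.
print(nullif_expanded(make_source([1, 0]), 0))  # 0
print(nullif_once(make_source([1, 0]), 0))      # 1
```

In the expanded form the comparison and the returned value come from different evaluations, so a value that NULLIF was supposed to turn into NULL can slip through.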

Jim continues:

Closing the loop . . . I've discussed this question with the Dev team. And eventually we have decided not to change current behavior, for the following reasons:

1) The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-established tenet. It's the fundamental 'leeway' that allows the optimizer enough freedom to gain significant improvements in query-plan execution.

2) This "once-per-row behavior" is not a new issue, although it's not widely discussed. We started to tweak its behavior back in the Yukon release. But it's quite hard to pin down precisely, in all cases, exactly what it means! For example, does it apply to interim rows calculated 'on the way' to the final result? - in which case it clearly depends on the plan chosen. Or does it apply only to the rows that will eventually appear in the completed result? - there's a nasty recursion going on here, as I'm sure you'll agree!

3) As I mentioned earlier, we default to "optimize performance" - which is good for 99% of cases. The 1% of cases where it might change results are fairly easy to spot - side-effecting 'functions' such as NEWID - and easy to 'fix' (trading perf, as a consequence). This default to "optimize performance" again, is long-established, and accepted. (Yes, it's not the stance chosen by compilers for conventional programming languages, but so be it).

So, our recommendations are:

a) Avoid reliance on non-guaranteed timing and number-of-executions semantics.

b) Avoid using NEWID() deep in table expressions.

c) Use OPTION to force a particular behavior (trading perf).

Hope this explanation helps clarify our reasons for closing this bug as "won't fix".


Interestingly, AND NOT (s_guid = NEWID()) yields the same execution plan

This is a consequence of normalization, which happens very early during query compilation. Both expressions compile to exactly the same normalized form, so the same execution plan is produced.


This is documented (sort of) here:

The number of times that a function specified in a query is actually executed can vary between execution plans built by the optimizer. An example is a function invoked by a subquery in a WHERE clause. The number of times the subquery and its function is executed can vary with different access paths chosen by the optimizer.

(Source: the "User-Defined Functions" documentation.)

This is not the only query form where the query plan will execute NEWID() multiple times and change the result. This is confusing, but is actually critical for NEWID() to be useful for key generation and random sorting.

What's most confusing is that not all non-deterministic functions actually behave like this. For instance, RAND() and GETDATE() will execute only once per query.
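The difference between the two behaviours can be sketched with a toy Python model. This is not SQL Server code: `make_counter`, `select_per_row` and `select_once_per_query` are hypothetical names, and the counter simply makes each evaluation visible.

```python
from itertools import count


def make_counter():
    """Hypothetical non-deterministic function: returns 1, 2, 3, ... on successive calls."""
    c = count(1)
    return lambda: next(c)


def select_per_row(rows, fn):
    """Evaluate fn once for every row, as described above for NEWID()."""
    return [(row, fn()) for row in rows]


def select_once_per_query(rows, fn):
    """Evaluate fn once up front and reuse the value for every row,
    as described above for RAND() and GETDATE()."""
    value = fn()
    return [(row, value) for row in rows]


rows = ["a", "b", "c"]
print(select_per_row(rows, make_counter()))         # [('a', 1), ('b', 2), ('c', 3)]
print(select_once_per_query(rows, make_counter()))  # [('a', 1), ('b', 1), ('c', 1)]
```

Per-row evaluation is what makes a function usable for generating distinct keys or a random ordering; once-per-query evaluation gives every row the same value.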


For what it's worth, if you look at this old SQL 92 standard document, the requirements around inequality are described in section "8.2 <comparison predicate>" as follows:

1) Let X and Y be any two corresponding <row value constructor element>s. Let XV and YV be the values represented by X and Y, respectively.

[...]

ii) "X <> Y" is true if and only if XV and YV are not equal.

[...]

7) Let Rx and Ry be the two <row value constructor>s of the <comparison predicate> and let RXi and RYi be the i-th <row value constructor element>s of Rx and Ry, respectively. "Rx <comp op> Ry" is true, false, or unknown as follows:

[...]

b) "Rx <> Ry" is true if and only if RXi <> RYi for some i.

[...]

h) "Rx <> Ry" is false if and only if "Rx = Ry" is true.

Note: I included 7b and 7h for completeness since they talk about <> comparison. I don't think comparison of row value constructors with multiple values is implemented in T-SQL, unless I'm just massively misunderstanding what this says, which is quite possible.

This is a bunch of confusing garbage. But if you want to keep dumpster diving...

I think that 1.ii is the item that applies in this scenario, since we're comparing the values of "row value constructor elements."

ii) "X <> Y" is true if and only if XV and YV are not equal.

Basically it's saying X <> Y is true if the values represented by X and Y are not equal. Since X < Y OR X > Y is a logically equivalent rewrite of that predicate, it's totally cool for the optimizer to use that.

The standard does not put any constraints on this definition related to the deterministic-ness (or whatever, you get it) of the row value constructor elements on either side of the <> comparison operator. It's the responsibility of user code to deal with the fact that a value expression on one side might be non-deterministic.
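The effect of the rewrite on a non-deterministic operand can be illustrated with a toy Python model. This is a sketch of the logic only, not of how SQL Server executes anything; `make_source` and the value sequence 3, 7 are assumptions chosen to make the discrepancy visible.

```python
from itertools import cycle


def make_source(values):
    """Hypothetical stand-in for a non-deterministic expression such as NEWID():
    every call returns the next value from a fixed, repeating sequence."""
    it = cycle(values)
    return lambda: next(it)


def not_equal_single(a, b):
    """A <> B with B evaluated exactly once."""
    return a != b()


def not_equal_split(a, b):
    """A < B OR A > B, with each reference to B evaluated separately."""
    return a < b() or a > b()


a = 5

# Assumed values: the first reference to B sees 3, the second sees 7.
# Neither value equals A, yet the split form reports "not different".
print(not_equal_single(a, make_source([3, 7])))  # True
print(not_equal_split(a, make_source([3, 7])))   # False: 5 < 3 fails, then 5 > 7 fails

# With a deterministic B the two forms always agree.
print(not_equal_single(a, make_source([9])), not_equal_split(a, make_source([9])))  # True True
```

The two forms are equivalent only when both references to B see the same value, which is exactly the guarantee that neither the standard nor the optimizer provides for a non-deterministic expression.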