Why did SQL Server suddenly decide to use such a terrible execution plan?

This is one of my most hated issues with SQL Server - I've had more than one failure due to it. Once, a query that had been working for months went from ~250ms to beyond the timeout threshold, crashing a manufacturing system - at 3 a.m., of course. It took a while to isolate the query, paste it into SSMS, and start breaking it into pieces - but everything I tried just "worked". In the end I added the phrase " AND 1=1" to the query, which got things working again for a few weeks; the final patch was to "blind" the optimizer - basically copying all passed parameters into local variables. If the query works off the bat, it seems like it will continue to work.
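The "blind the optimizer" patch is the classic workaround for parameter sniffing. A minimal sketch, with a made-up procedure and table: the incoming parameter is copied into a local variable, so the optimizer compiles the plan for an "unknown" value instead of tailoring it to whatever value the first caller happened to pass.

```sql
-- Hypothetical procedure illustrating the local-variable workaround.
CREATE PROCEDURE dbo.GetOrdersByStatus
    @Status INT
AS
BEGIN
    DECLARE @LocalStatus INT = @Status;  -- defeats parameter sniffing

    SELECT OrderId, CustomerId, CreatedAt
    FROM dbo.Orders
    WHERE Status = @LocalStatus;         -- plan is built for an unknown value,
                                         -- not the sniffed first-call value
END;
```

The trade-off is that the plan is now based on average density statistics rather than the actual parameter, which is usually "good enough everywhere" instead of "great for one value, terrible for another".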

To me a reasonably simple fix from MS would be: if a query has already been profiled and ran fine last time, and the relevant statistics haven't changed significantly (come up with some factor combining changes in table sizes, new indexes, etc.), and the optimizer still decides to spice things up with a new execution plan, then whenever that new and improved plan takes more than some multiple of the old plan's time, abort and switch back. I can understand replanning if a table goes from 100 to 100,000,000 rows or a key index is dropped, but when a query in a stable production environment suddenly runs 100x to 1000x slower, it can't be that hard to detect the regression, flag the new plan, and go back to the previous one.
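For what it's worth, newer versions of SQL Server added roughly this mechanism: Query Store (2016+) keeps a history of plans and their runtimes, and automatic plan correction (2017+) reverts to the last known good plan when a new plan causes a measurable regression. A hedged sketch, assuming those features are available on the instance (the query/plan ids in the manual variant are placeholders you'd read out of the Query Store views):

```sql
-- Requires SQL Server 2016+ for Query Store, 2017+ for automatic tuning.
ALTER DATABASE CURRENT SET QUERY_STORE = ON;

-- Automatically fall back to the last known good plan on regression.
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);

-- Or pin a known-good plan by hand; the ids here are illustrative and
-- come from sys.query_store_query / sys.query_store_plan.
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;
```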


The reason is simple: the optimizer changes its mind on what the best plan is. This can be due to subtle changes in the distribution of the data (or other reasons, such as a type incompatibility in a join key). I wish there were a tool that not only gave the execution plan for a query but also showed thresholds for how close you are to another execution plan. Or a tool that would let you stash an execution plan and give an alert if the same query starts using a different plan.
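A rough version of the "stash a plan and alert on change" idea can be built from the plan-cache DMVs. A sketch, assuming a pre-created archive table `dbo.PlanSnapshots` (invented for this example) with columns matching the select list:

```sql
-- Snapshot currently cached plans so tonight's plans can be diffed
-- against yesterday's.
INSERT INTO dbo.PlanSnapshots (query_hash, query_plan_hash, query_plan, captured_at)
SELECT
    qs.query_hash,
    qs.query_plan_hash,   -- changes when the same query gets a new plan
    qp.query_plan,        -- full XML showplan, viewable in SSMS
    SYSUTCDATETIME()
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp;
```

Comparing `query_plan_hash` values for the same `query_hash` across snapshots flags exactly the event in question: the same query text picking up a different plan.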

I've asked myself this exact same question on more than one occasion. You have a system that's running nightly, for months on end. It processes lots of data using really complicated queries. Then, one day, you come in in the morning and the job that normally finishes by 11:00 p.m. is still running. Arrrggg!

The solution we came up with was to use explicit join hints for the problematic joins (OPTION (MERGE JOIN, HASH JOIN)). We also started saving the execution plans for all our complex queries, so we could compare changes from one night to the next. In the end, this was of more academic interest than practical value - by the time the plans changed, we were already suffering from the bad execution plan.
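For anyone unfamiliar with the hint, a minimal sketch with invented table names - the OPTION clause restricts the optimizer to merge or hash joins for every join in the statement, so it can no longer switch to a nested-loops plan that blows up overnight:

```sql
-- Table names are hypothetical; the hint applies to the whole statement.
SELECT o.OrderId, c.Name
FROM dbo.Orders AS o
JOIN dbo.Customers AS c
    ON c.CustomerId = o.CustomerId
OPTION (MERGE JOIN, HASH JOIN);
```

The cost is that you've frozen one planning decision by hand, so if the data distribution later changes in a way that genuinely favors a different join type, the hint has to be revisited.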