SQL Server has encountered occurences of I/O requests taking longer than 15 seconds

This is far less often a disk issue, and far more often a networking issue. You know, the N in SAN?

If you go to your SAN team and start talking about the disks being slow, they're gonna show you a fancy graph with 0 millisecond latency on it and then point a stapler at you.

Instead, ask them about the network path to the SAN. Get speeds, if it's multipathed, etc. Get numbers from them about the speeds you should be seeing. Ask if they have benchmarks from when the servers were set up.

Then you can use Crystal Disk Mark or diskpd to validate those speeds. If they don't line up, again, it's most likely the networking.

You should also search your error log for messages that contain "FlushCache" and "saturation", because those can also be signs of network contention.

One thing you can do to avoid those things as a DBA is make sure that your maintenance and any other data-heavy tasks (like ETL) aren't going on at the same time. That can definitely put a lot of pressure on storage networking.

You may also want to check the answers here for more suggestions: Slow checkpoint and 15 second I/O warnings on flash storage

I blogged about a similar topic here: From The Server To The SAN

We have a similar setup and recently encountered these messages in the logs. We are using a DELL Compellent SAN. Here are some things to check when receiving these messages that helped us find a solution

  • Review your windows performance counters for your disks that the warning messages are pointing to, specifically:
    • Disk avg. read time
    • Disk avg. write time
    • Disk read bytes/sec
    • Disk write bytes/sec
    • Disk Transfers/sec
    • Avg. disk queue length
  • The above are averages. If you have many database files on one drive these averages can skew the result and mask a bottle neck on specific database files. Check out this query from Paul S. Randal which returns average latency for each file from the dmv sys.dm_io_virtual_file_stats. In our case the average latency reported was acceptable, but underneath the covers we had many files with > 200 ms average latency.
  • Check the timings. Is there any pattern? Does it happen more frequently at certain a time in the night? If so check if any maintenance jobs are running at that time or any scheduled activity which may increase disk activity and expose a bottle neck in your IO subsystem.
  • Check the windows event viewer for errors. If your switch or SAN is being overloaded or not setup properly for your application you may find some messages in this log, and it is good to take this information to your SAN admin. In our case we were receiving iSCSI connection errors often throughout the day, hinting at the problem.
  • Review your SQL Server code. When you receive these messages you shouldn't immediately think it is an IO subsystem issue and pass it to your SAN admin. You need to do your part and review the database. Do you have really bad queries being run often churning through tons of data? Bad indexing? Excessive transaction log writes? You can use some open source queries to get a health check on your database, an example for checking how your query plan looks is sp_blitzCache
  • Don't ignore these. Today you may be receiving them a few times a day... then several months later when your workload increases and you forgot to monitor them they start to increase. Receiving lots of these messages can prevent SQL Server from accessing a certain file, and if it is tempdb, that is not good. In our case it got so bad that SQL Server shut itself down.

Our solution was upgrading our switch to a SAN switch. Yes, these are all points to cover within SQL Server. What led us to finding out it was the switch was that we were receiving about 1500 iSCSI pdu disconnect errors in the Windows application event viewer on the SQL Server every day. That prompted the investigation by our SAN admins into the switch.

Immediately after upgrading, the iSCSI errors were gone and average latency came down to around 50 ms for all the files, and that correlated to better performance in the application. With these points in mind hopefully you can find your solution.

Why storing the data on a SAN? What's the point? All database performance is tied to Disk I/O and you are using 3 servers with only one device for the I/O behind them. That makes no sense... and unfortunately so common.

A SQL server software is a very advanced piece of software, it is designed to take advantage of any bits of hardware, CPU cores, CPU cache, TLB, RAM, disk controllers, hard drive cache... They almost include all filesystem logic. They are developed on regular computer and benchmarked on high end systems. Therfore a SQL server must have its own disks. Installing them on a SAN is like "emulating" a computer, you lose all performance optimisations. SANs are for storing backups, immutable files, and files you just append data to (logs).

Datacenter administrators tend to put all they can on SANs because this way they have only one pool of storage to manage, it's more easy than caring for storage on each server. It's a "I don't want to do my job" choice, and a very bad one, because then they have to deal with performance problems and all the company suffer from this. Just install software on the hardware it is designed for. Keep it simple. Care for I/O bandwidth, cache and context switch overhead, ressource jitter (happens when ressource is shared). You'll end up maintaining 1/10th of the devices for the same raw output power, save your ops team lot of headaches, gain performance that makes your end users happy and more productive, make your company a better place to work in, and save lot of energy (the planet will thank you).

You said in comments, you are considering to put SSD in your server. You will not recognize your setup with dedicated SSDs, compared to a SAN you'll get something like 500x improvement even with data and transaction log files on same drive. A state of the art SQL Server would have fast separate SSD for data and transaction log on different hardware controllers channels (most server motherboard have several). But compared to your current setup we are talking of sci-fi there. Just give SSD a try.

I spend my life encountering poorly designed hardware platforms where people just try to design a large scale computer. All CPU power here, all disks there... hopefully there is not such a thing as remote RAM. And the saddest is they compensate the lack of efficiency of this design with huge servers that cost ten time more than they should. I saw $400k infra slower than a $1k laptop.