Windows Server 2008 R2 was released in 2009, but we still have many customers on Windows 2003 (and I’m sure some on Windows 2000). I’ve run into an issue with two customers who upgraded their SQL Server and Windows versions at the same time, and thought I would share the details in case anyone else is on the same path.
The original customer issue occurred a few months ago. The database installation was moved to an entirely new physical server. The SQL Server version was upgraded from SQL Server 2000 to SQL Server 2008 SP1, and the server had Windows 2008 R2 installed (previously they were on Windows 2003). The database storage stayed the same, and the application was not upgraded. After the upgrade, we were called in because performance was slow.
Where to begin?
I first asked what was slow, and was told that activities that used to take 10 seconds were now taking 30 seconds…consistently…and for all users.
I asked if they had any baseline data from Windows PerfMon counters, such as Avg. Disk sec/Read and Avg. Disk sec/Write, % Processor Time, Available MBytes and the infamous Page Life Expectancy. They did not.
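If you want to capture a similar baseline, a counter like Page Life Expectancy can also be read from inside SQL Server. This is just a minimal sketch (the disk and processor counters above live in Windows PerfMon, not in the DMVs):

-- Minimal sketch: read Page Life Expectancy from the Buffer Manager counters.
SELECT [object_name], counter_name, cntr_value AS page_life_expectancy_sec
FROM sys.dm_os_performance_counters
WHERE counter_name = N'Page life expectancy'
  AND [object_name] LIKE N'%Buffer Manager%';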
I opened up my .sql file that contained a set of DMV queries. When I was on the inaugural SQLCruise, Tim Ford ( blog | @sqlagentman ) presented a session on DMVs. From his session, I developed a script that I could step through when troubleshooting customer issues. It’s a hodge-podge of queries from Tim, Glenn Berry ( blog | @glennalanberry ), Paul Randal ( blog | @paulrandal ) and the Troubleshooting Performance Problems guides (2005, 2008) from Microsoft. Nothing struck me as suspicious until I got to the wait stats query (against sys.dm_os_wait_stats). The top wait was CXPACKET. I knew the CXPACKET wait was commonly related to parallelism issues. Our application typically does not benefit from parallelized queries, and we recommend setting Max Degree of Parallelism to 1. I checked the customer’s setting. It was 0. I figured I had found the problem.
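My script isn’t reproduced here, but a stripped-down sketch of the wait stats check and the MAXDOP check looks something like this (the excluded wait types are just a sample of common benign ones, not a complete list):

-- Top waits since the last restart (or wait stats clear), ignoring a few benign waits.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    signal_wait_time_ms,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP',
                        N'SQLTRACE_BUFFER_FLUSH', N'REQUEST_FOR_DEADLOCK_SEARCH')
ORDER BY wait_time_ms DESC;

-- Check the current Max Degree of Parallelism setting.
SELECT name, value_in_use
FROM sys.configurations
WHERE name = N'max degree of parallelism';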
I requested that the customer change Max Degree of Parallelism to 1, and after they did, performance improved. Or so we thought…
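For reference, the change itself is a simple sp_configure call (MAXDOP is an advanced option, so ‘show advanced options’ has to be enabled first):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 1;
RECONFIGURE;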
They called again the next day; performance was slow again with the same symptoms. I opened my .sql file of DMV queries. This time I also downloaded Adam Machanic’s WhoIsActive script, hoping I might be able to catch troublesome queries in the act.
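If you haven’t used it, sp_WhoIsActive is a stored procedure you create (typically in master) and then call directly. It accepts parameters for extra detail, but even the default output includes a wait_info column, which is what I was relying on here:

-- Default call; optional parameters exist, but the default output was enough in this case.
EXEC sp_WhoIsActive;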
I started rolling through the DMV queries, running WhoIsActive now and again. Signal waits were fine, and cross-checking against % Processor Time showed that there was no CPU pressure. There was over 24 GB of memory available to SQL Server, and while it was all in use, available physical memory and Page Life Expectancy were high. Then I started to notice a trend…
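The signal wait check is a quick ratio against sys.dm_os_wait_stats. Roughly speaking, a high percentage of signal waits suggests tasks are queuing for CPU; here it was low. A minimal sketch:

-- Signal waits as a percentage of total waits; a rough indicator of CPU pressure.
SELECT CAST(100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms) AS DECIMAL(5, 2))
       AS pct_signal_waits
FROM sys.dm_os_wait_stats;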
When querying sys.dm_os_wait_stats the highest wait was now PAGEIOLATCH_XX. When running WhoIsActive, the queries that occasionally showed up had a resource wait of PAGEIOLATCH_XX. And when querying sys.dm_io_virtual_file_stats joined to sys.dm_io_pending_io_requests repeatedly, I frequently saw output, indicating pending I/O. I was on to something; it had to be related to the storage. I kept digging, and everything pointed back to the I/O subsystem.
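The join between the two I/O DMVs is straightforward; this is a sketch of the kind of query I was rerunning. Any rows returned are I/Os that have been issued but not yet completed:

-- Pending I/O requests mapped back to their database files.
SELECT DB_NAME(vfs.database_id) AS database_name,
       vfs.file_id,
       pio.io_type,
       pio.io_pending,
       pio.io_pending_ms_ticks
FROM sys.dm_io_pending_io_requests AS pio
JOIN sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    ON pio.io_handle = vfs.file_handle;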
The storage team was called in. They told me everything looked fine on their end. In fact, the number of I/Os was less than it was the prior week, before they upgraded. I explained that I wanted to investigate the entire path, from the database server all the way back to the disks, including the connections between. I was assured that everything was fine, and the network team was called in. Again, there was barely any data flowing across the pipe. Everyone said it had to be SQL Server.
Note: What I didn’t fully realize until after the fact is that less network traffic and/or fewer reads against the SAN don’t necessarily mean that life is good. In this case, because it was taking so much longer to retrieve data, less overall work was being done.
My data told me that the issue was not with SQL Server, or due to a lack of memory or CPU resources. Finally, because performance was affecting business, and no smoking gun had been found, Microsoft was engaged. The Microsoft engineers collected and analyzed data for several hours. Over time, one of the highest waits became LCK_M_X. The application was suddenly under the microscope. Why were there so many waits on locks? What code was inefficient?
I was running WhoIsActive at the same time. I could see queries being blocked, but the resource wait for the blocking SPID was still PAGEIOLATCH_XX. I pointed this out to the engineer from Microsoft. I asked: if an I/O takes a long time to complete, and there are other tasks that cannot get a lock until that I/O completes, then wouldn’t it make sense that lock waits were high?
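You can see the same relationship without WhoIsActive by looking at sys.dm_exec_requests; a minimal sketch of that cross-check:

-- Blocked sessions, their blockers, and what each request is currently waiting on.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       r.wait_resource
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
   OR r.wait_type LIKE N'PAGEIOLATCH%';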
Microsoft wanted the customer to update statistics (which they started to do, but that was taking a while as well), and finally they recommended that the customer apply a hotfix, detailed in KB 976700. Once the hotfix was applied, performance took off and my investigation was done.
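The exact statistics commands weren’t specified; a database-wide update can be as simple as sp_updatestats, or done per table with a full scan (the table name below is just a placeholder):

EXEC sp_updatestats;
-- Or, for a specific table (placeholder name), with a full scan:
-- UPDATE STATISTICS dbo.SomeLargeTable WITH FULLSCAN;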
I admit that in the end, it was not my ability to find the exact problem that resolved the issue. But sometimes, being able to demonstrate what is not the problem is just as important.
Good post Erin. I agree that sometimes we don’t actually end up resolving the issue ourselves, but the legwork done beforehand does help a lot.
Thanks for the KB link.
This is a common scenario, which I’ve experienced many times. The lack of metrics from network/storage teams during troubleshooting is frustrating.