Wednesday, 16 May 2012

How WinDbg Helped Me Find My Stackoverflow

I was faced with the problem recently of trying to work out why the worker process in IIS that was running my website kept crashing. The event logs told me that it was due to a StackOverflow error, but I found this odd. Firstly the fact that something in our code had changed so much that it introduced this error in the first place, and secondly that our default error handler did not catch it.

Basically we use the HttpApplication.Error event to email all uncaught exceptions to us with full details - including the stack trace - but this did not happen with this error. At the very least I would have thought we would get the .Net StackOverflowException emailed to us, but even if it did that does not seem to produce a stack trace. The first instance that we heard about this was that our users were getting kicked out of the web application at random times, and the event logs kept saying this error:
Application: w3wp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to stack overflow.
It turns out that this is one of several critical errors that the operating system simply cannot recover from and the process terminates immediately. But how can we trace where this is happening?

If the .Net framework is not able to help us, then we need to go deeper into the operating system itself. It is possible to get a full (unmanaged) memory dump of the process at the time the error happens, but can this help us delve into managed code?

Enter WinDbg

WinDbg is a powerful tool though with a steep learning curve. It is effectively a debugger that you can use after the event has happened by means of reading memory dump files. These dump files can contain a wealth of information, from stack traces to all the threads running at the time. As a result it can be a bit overwhelming when first introduced to it - I heavily used Google searching for answers to get me through my first debug session with it. But in the end it provided me with exactly what I needed.

I've got my error, I need to know where it is happening, so lets talk about what I did next.

Download Debugging Tools for Windows

WinDbg is one of many applications that is part of the Debugging Tools for Windows kit. Microsoft provides a small installer to which will download further components depending on what you wish to install. There are a lot of things the full toolkit provides, such as driver development tools, but it is perfectly fine to just download and install the debugging tools as standalone components.

You can also install this on a number of Windows operating systems including server operating systems - I will explain why this is useful later on.

Get a Dump File

Once you've installed the debugging tools you'll need to attach a debugger to something. This can be done with the ADPlus, a tool which can troubleshoot applications that hang or crash.

In my case I needed to attach it to the IIS worker process for my website and give me dumps when this process crashed, so I opened up a command prompt, changed directory to the debugger tools folder (in my case this was C:\Program Files\Windows Kits\8.0\Debuggers\x86) and entered this command:

adplus -crash -pn w3wp.exe -o D:\Dumps


The last argument states that all dump files will be saved to the directory given.

Wait...

I then had to wait for my web application to crash, which is ironic since it is not often that a developer wills their code to fail. Since I didn't know when it was going to happen I was prepared for a long wait...

Bang!

Aha! The IIS process crashed again and now I have something to show for it:


Now it is time to analyse them.

Analysing Crash Dumps

Fire up WinDbg and select File > Open Crash Dump, then load the dump file ADPlus produced.

Note: I did this on the same server that the crash occurred on. I originally tried to copy the dump file to my development machine to analyse it there, but WinDbg found that the .Net framework versions on my development machine and the server were slightly different (possibly due to Windows Update) meaning it could not load the CLR data recorded in the dump. Opening it on the same server meant that all framework components were accounted for.

You'll see something like this:


OK, this isn't exactly Visual Studio with pretty colours and icons. Everything you have to do is text based. But we can get through this...

Note: I'm not sure if this was important in my scenario, but I also set the Symbol Search Path in WinDbg to download debugging symbols from Microsoft's servers. This could be used to translate core DLL function calls (such as from nt.dll). The setting I used was srv*c:\symbols*http://msdl.microsoft.com/downloads/symbols which you can add by clicking File > Symbol File Path.

SOS - Save Our Souls?

Step 1 is to load the SOS module. This is a debugging extension module which can delve into the memory dump file (which holds mostly unmanaged code references and symbols) to extract information given by the CLR. This is how .Net developers can use WinDbg and fortunately this module is shipped with every version of the .Net framework in the standard installation folder.

So our first WinDbg command will be to load this extension by typing:

.loadby sos clr

Note: this is the command to load the SOS extension for .Net 4 onwards. For previous versions of the framework you need to run .loadby sos mscorwks instead.


Find the Stack Trace

Now we can finally read the managed stack, so enter the command !clrstack. Viola! A full managed stack trace.

In the case of my dump file I could finally see exactly why a StackOverflow was occurring - far too many iterations of the same set of functions. Thankfully this was easy enough to fix and we've never encountered the problem again.

Conclusion

Sometimes it is best to dig a little deeper into a system (in this case Windows) and this is one of those cases. I could have trawled through all the code changes made since our last update and tried to figure this out for myself. But delving a little into the unmanaged world of processes and core memory dumps really helped. Not that it's not without it's challenges - what I've described above is just the tip of the iceberg with what you can do with WinDbg - but  it does make you appreciate how far you can go to solve a problem.