My client had an issue where their web application (deployed across multiple servers) was randomly making the servers unresponsive with 100% CPU usage.
The first action we took was to configure IIS to automatically recycle the Application Pools when they use too much CPU for more than a few minutes. In the example below we kill the worker processes of any AppPool that uses more than 80% CPU over a 3-minute interval (the limit is expressed in 1/1000ths of a percent, so 80000 means 80%).
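Import-Module WebAdministration  # provides the IIS:\ drive

dir IIS:\AppPools | ForEach-Object {
    Write-Host "Updating $($_.Name) ..."
    $appPoolName = $_.Name
    $appPool = Get-Item "IIS:\AppPools\$appPoolName"
    $appPool.cpu.limit = 80000               # 80%, in 1/1000ths of a percent
    $appPool.cpu.action = "KillW3wp"         # kill the worker process when over the limit
    $appPool.cpu.resetInterval = "00:03:00"  # measurement window of 3 minutes
    $appPool | Set-Item
}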
That solved the immediate problem: the servers stopped becoming unresponsive. But we still had to investigate what was eating all the CPU.
See below how I proceeded with the troubleshooting:
1. Create a Memory Dump
In Task Manager, right-click the IIS Worker Process (w3wp.exe) and choose "Create dump file".
2. Install Debug Diagnostic Tool
Download and install Debug Diagnostic Tool
3. Run Crash & Hang Analysis for ASP.NET / IIS
Add your dump (DMP) file, select “CrashHangAnalysis”, and click “Start Analysis”.
4. Review Analysis for Problems
The first page of the report immediately suggests that a generic Dictionary is involved.
A few pages later we can find the threads that are consuming the most CPU:
If we check those top threads we can see that both are stuck in the same call, which is invoking GetVersion() on an API client wrapper. One thread is trying to Insert into the dictionary (to cache the API version), while the other is trying to find an entry (FindEntry) in the same dictionary.
5. What was the issue?
Long Explanation:
Dictionary<TKey, TValue> is a HashMap implementation, and like most HashMap implementations it internally uses linked lists to store multiple elements when different keys end up in the same bucket position after being hashed and after taking the hash modulo the number of buckets. (In .NET each list is a chain of "next" indexes inside the entries array rather than a separate LinkedList object.) The problem is that since Dictionary<TKey, TValue> is not thread-safe, multiple threads trying to change the dictionary may put it into an invalid state (race condition).
Probably there were different threads trying to add the same element to the dictionary at the same time. Each of them invokes the Insert method, which internally invokes the Resize method when the dictionary is full, and Resize rebuilds the linked lists. Running concurrently, these calls were putting a linked list (and therefore the whole HashMap) into an inconsistent state, for example a chain that loops back onto itself. Since both Insert() and FindEntry() iterate through the linked list until they reach its end, a thread that walks an inconsistent list can spin in an infinite loop, which shows up as 100% CPU.
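To make the race concrete, here is a minimal sketch of the kind of check-then-add caching code that can end up in this state. It is a guess at the pattern, not the client's actual code: the class name UnsafeApiClient, the baseUrl parameter, and FetchVersionFromApi are hypothetical, and only the GetVersion() name comes from the dump.

```csharp
using System.Collections.Generic;

// Hypothetical reconstruction of the problematic pattern (NOT thread-safe).
public class UnsafeApiClient
{
    // One Dictionary instance shared by every request thread.
    private static readonly Dictionary<string, string> VersionCache =
        new Dictionary<string, string>();

    public string GetVersion(string baseUrl)
    {
        string version;

        // TryGetValue walks a bucket chain (FindEntry internally).
        if (!VersionCache.TryGetValue(baseUrl, out version))
        {
            version = FetchVersionFromApi(baseUrl);

            // The indexer setter inserts and may resize the dictionary.
            // If another thread reads or writes at the same moment,
            // the internal chains can be corrupted.
            VersionCache[baseUrl] = version;
        }

        return version;
    }

    // Placeholder for the real remote call (assumption for this sketch).
    private static string FetchVersionFromApi(string baseUrl)
    {
        return "1.0";
    }
}
```

On a single thread this code behaves correctly, which is why the bug tends to appear only under production load.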
Short Explanation:
Since Dictionary<TKey, TValue> is not thread-safe, multiple threads trying to change the dictionary may put it into an invalid state (race condition). So if you want to share a dictionary across multiple threads you should use a ConcurrentDictionary<TKey, TValue>, which as the name implies is a thread-safe class.
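As an illustration, a version cache built on ConcurrentDictionary could look roughly like the sketch below. The class name ApiClient, the baseUrl parameter, the RemoteCalls counter, and FetchVersionFromApi are assumptions made for this example; only the GetVersion() name appears in the dump.

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical API client wrapper with a thread-safe version cache.
public class ApiClient
{
    // Shared by all threads; ConcurrentDictionary handles the synchronization.
    private static readonly ConcurrentDictionary<string, string> VersionCache =
        new ConcurrentDictionary<string, string>();

    // Counts how many times the (placeholder) remote call was made.
    public static int RemoteCalls;

    public string GetVersion(string baseUrl)
    {
        // GetOrAdd is safe to call from many threads at once.
        // Under contention the factory may run more than once,
        // but only one value is stored and returned to every caller.
        return VersionCache.GetOrAdd(baseUrl, key => FetchVersionFromApi(key));
    }

    // Placeholder for the real remote call (assumption for this sketch).
    private static string FetchVersionFromApi(string baseUrl)
    {
        Interlocked.Increment(ref RemoteCalls);
        return "1.0";
    }
}
```

If the remote call is expensive and must run exactly once per key, one common option is to store Lazy<string> values in the dictionary instead, at the cost of slightly more code.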
It’s a known issue that concurrent access to Dictionary can leave threads spinning in an infinite loop.
6. Advanced Troubleshooting using WinDbg
If the Debug Diagnostic Tool hadn’t given us any obvious clue about the root cause, we could have used WinDbg to inspect the memory dump (it also supports .NET/CLR through the SOS extension). See example here.