Software Diagnostics Library » Crash Dump Patterns

Crash Dump Analysis Patterns (Part 73)

Tuesday, July 29th, 2008

Opposite to Overaged System sometimes we can see Young System pattern. This means that the system didn’t have time to initialize and subsequently mature or reach the state when the problem could surface. Usual signs are less than a minute system uptime (or larger, depends on a problem context) and the low number of processes and services running (also, sometimes the problem description mentions a terminal services session but there is only one console session in the dump, or two as in Vista and Windows Server 2008):

System Uptime: 0 days 0:00:18.562

3: kd> !vm [...] 0248 lsass.exe 1503 ( 6012 Kb) 020c winlogon.exe 1468 ( 5872 Kb) 03b8 svchost.exe 655 ( 2620 Kb) 023c services.exe 416 ( 1664 Kb) 01f0 csrss.exe 356 ( 1424 Kb) 0338 svchost.exe 298 ( 1192 Kb) 02dc svchost.exe 259 ( 1036 Kb) 0374 svchost.exe 240 ( 960 Kb) 039c svchost.exe 224 ( 896 Kb) 01bc smss.exe 37 ( 148 Kb) 0004 System 8 ( 32 Kb)

3: kd> !session Sessions on machine: 1 Valid Sessions: 0

In the case of the fully initialized system the manual dump might have been taken after reboot when the bugcheck already happened or any other reason stemming from the usual confusion between crashes and hangs.

Similar considerations apply to a young process as well, where Process Uptime value from user dumps or ElapsedTime value from kernel or complete memory dumps is too small unless we have obvious crash or hang signs inside, for example, exceptions, deadlock, wait chain or blocked thread waiting for another coupled process:

Process Uptime: 0 days 0:00:10.000

3: kd> !process 8a389d88 PROCESS 8a389d88 SessionId: 0 Cid: 020c Peb: 7ffdf000 ParentCid: 01bc DirBase: 7fbe6080 ObjectTable: e1721008 HandleCount: 455. Image: winlogon.exe VadRoot 8a65d070 Vads 194 Clone 0 Private 1166. Modified 45. Locked 0. DeviceMap e10030f8 Token e139bde0 ElapsedTime 00:00:01.062 UserTime 00:00:00.046 KernelTime 00:00:00.015 QuotaPoolUsage[PagedPool] 71228 QuotaPoolUsage[NonPagedPool] 72232 Working Set Sizes (now,min,max) (2265, 50, 345) (9060KB, 200KB, 1380KB) PeakWorkingSetSize 2267 VirtualSize 41 Mb PeakVirtualSize 42 Mb PageFaultCount 2605 MemoryPriority BACKGROUND BasePriority 13 CommitCharge 1468

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns | 1 Comment »

Crash Dump Analysis Patterns (Part 72)

Saturday, July 26th, 2008

It is important to save crash dumps at the right time, for example, when an error message box is shown as discussed in one of my previous posts:

Crash Dumps for Dummies (Part 7)

However, sometimes crash dumps are saved after a visual indicator of the problem disappeared and the opportunity to see the stack trace was gone. This pattern is called Lost Opportunity. Here is one example of a service memory dump saved manually after it became unresponsive. In the dump file there was only one thread left excluding the thread created by a debugger:

0:001> ~*kv

0 Id: a3c.700 Suspend: 1 Teb: 7ff59000 Unfrozen ChildEBP RetAddr Args to Child 1178fd60 7c822124 77e6bad8 00000574 00000000 ntdll!KiFastSystemCallRet 1178fd64 77e6bad8 00000574 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 1178fdd4 77e6ba42 00000574 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 1178fde8 67e223dd 00000574 ffffffff 1178fe10 kernel32!WaitForSingleObject+0x12 1178fdfc 7c82257a 67e20000 00000000 00000001 componentA!DllInitialize+0xed 1178fe1c 7c8118b0 67e222f0 67e20000 00000000 ntdll!LdrpCallInitRoutine+0x14 1178feb8 77e53002 00000000 00000000 00000000 ntdll!LdrShutdownProcess+0x130 1178ffa4 77e53065 c0000005 77e8f3b0 ffffffff kernel32!_ExitProcess+0×43 1178ffb8 77e84277 c0000005 00000000 00000000 kernel32!ExitProcess+0×14 1178ffec 00000000 77c5de6d 0a078138 00000000 kernel32!BaseThreadStart+0×5f

# 1 Id: a3c.18bc Suspend: 1 Teb: 7ffde000 Unfrozen ChildEBP RetAddr Args to Child 0a6cffc8 7c845ea0 00000005 00000004 00000001 ntdll!DbgBreakPoint 0a6cfff4 00000000 00000000 00905a4d 00000003 ntdll!DbgUiRemoteBreakin+0x36

We also see exception code 0xc0000005 as ExitProcess parameter. The raw stack reveals the call to NtRaiseHardError function that definitely resulted in some error message box:

0:001> ~0s [...]

0:000> !teb TEB at 7ff59000 ExceptionList: 1178fdc4 StackBase: 11790000 StackLimit: 11789000 SubSystemTib: 00000000 FiberData: 00001e00 ArbitraryUserPointer: 00000000 Self: 7ff59000 EnvironmentPointer: 00000000 ClientId: 00000a3c . 00000700 RpcHandle: 00000000 Tls Storage: 00000000 PEB Address: 7ffdf000 LastErrorValue: 0 LastStatusValue: c0000008 Count Owned Locks: 0 HardErrorMode: 0

0:000> dds 11789000 11790000 11789000 00000000 11789004 00000000 11789008 00000000 1178900c 00000000 11789010 00000000 [...] 1178f058 0a4016f4 1178f05c 1178efe0 1178f060 695040c4 <Unloaded_faultrep.dll>+0x40c4 1178f064 7ffdf000 1178f068 00000000 1178f06c 0a4016f4 1178f070 0a4016f4 1178f074 00000000 1178f078 1178f06c 1178f07c 0a4016b8 1178f080 0a4016b8 1178f084 1178efa0 1178f088 7c821b74 ntdll!NtRaiseHardError+0xc 1178f08c 77e99af9 kernel32!UnhandledExceptionFilter+0×54b 1178f090 d0000144 1178f094 00000004 1178f098 00000000 1178f09c 1178f164 1178f0a0 00000001 1178f0a4 77e996a7 kernel32!UnhandledExceptionFilter+0×873 1178f0a8 00000000 1178f0ac 00000000 1178f0b0 00000000 1178f0b4 02f049f0 1178f0b8 1178f13c 1178f0bc 00000000 […]

It was that time when the dump should have been saved. See also Process crash - getting the dump manually post for another example and full explanation.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | 1 Comment »

Heap and spike: pattern cooperation

Saturday, July 26th, 2008

This is another case study showing how different patterns interact. It was reported that the service was spiking CPU (see Spiking Thread pattern) and other processes couldn’t communicate with it (see Coupled Processes pattern). The dump was saved using userdump.exe and !runaway command showed that the thread #18 was consuming CPU mostly in user mode:

0:018> !runaway 0y11 User Mode Time Thread Time 18:6080 0 days 20:46:57.156 3:2838 0 days 0:00:00.203 5:53a0 0 days 0:00:00.093 16:44a8 0 days 0:00:00.078 17:2ad8 0 days 0:00:00.046 19:5834 0 days 0:00:00.015 […] 0:57d4 0 days 0:00:00.000 Kernel Mode Time Thread Time 5:53a0 0 days 0:00:03.328 3:2838 0 days 0:00:01.046 16:44a8 0 days 0:00:00.937 18:6080 0 days 0:00:00.765 17:2ad8 0 days 0:00:00.531 19:5834 0 days 0:00:00.171 20:3174 0 days 0:00:00.140 […]

Its thread stack revealed runtime heap manipulation function on top of it:

0:018> ~18kL ChildEBP RetAddr 0207f748 77e673ca ntdll!RtlFreeHeap+0×4a5 0207f7b0 77e67690 kernel32!BasepComputeProcessPath+0×395 0207f7b0 77e67690 kernel32!BaseComputeProcessDllPath+0xe3 0207f7f0 77e41b95 kernel32!BaseComputeProcessDllPath+0xe3 0207f850 710f43d9 kernel32!LoadLibraryExW+0×14e […]

The parameters to RtlFreeHeap call are described in the following MSDN article:

http://msdn.microsoft.com/en-us/library/ms796710.aspx

We see that the code was freeing the heap block with the address 0×02f88278 and it was located in the first heap 0×00090000:

0:018> ~18kv ChildEBP RetAddr Args to Child 0207f748 77e673ca 00090000 00000000 02f88278 ntdll!RtlFreeHeap+0×4a5 […]

0:018> !heap Index Address Name Debugging options enabled 1: 00090000 2: 00190000 3: 003b0000 4: 00340000 5: 00380000 6: 007d0000 7: 008e0000 8: 00a40000 9: 00ae0000 10: 00c60000 11: 00cf0000 12: 00d30000 13: 00d40000 14: 00ed0000 15: 01010000 16: 01090000 17: 00a20000 18: 00c90000 19: 00fd0000 20: 01690000 21: 01b30000 22: 01b50000 23: 01c20000 24: 01e50000 25: 01f20000 26: 01f30000 27: 01f40000

Could it be the case that the heap was corrupt? We can validate it:

0:018> !heap -s 00090000 Walking the heap 00090000 .HEAP 00090000 (Seg 00090640) At 00184998 Error: invalid block size

.List corrupted: (Blink->Flink = 00000000) != (Block = 02f88278) HEAP 00090000 (Seg 02f80000) At 02f88270 Error: block list entry corrupted

[...]

The output shows our block 0×02f88278 too but this could also be the effect of False Positive Corruption, the fact that the heap control structures had not been updated yet: the thread is inside heap manipulation function.

When looking at critical sections and other threads we also see several wait chains (see Wait Chain pattern):

0:000> !locks

CritSec ntdll!LdrpLoaderLock+0 at 7c889d94 WaiterWoken No LockCount 18 RecursionCount 1 OwningThread 2300 EntryCount 0 ContentionCount 141 *** Locked

CritSec ntdll!FastPebLock+0 at 7c889d40 WaiterWoken No LockCount 1 RecursionCount 1 OwningThread 2904 EntryCount 0 ContentionCount 2 *** Locked

CritSec +90608 at 00090608 WaiterWoken No LockCount 7 RecursionCount 1 OwningThread 6080 EntryCount 0 ContentionCount 7 *** Locked

CritSec serviceA!ServiceSection+0 at 7617c340 WaiterWoken No LockCount 5 RecursionCount 1 OwningThread 2838 EntryCount 0 ContentionCount 1e3 *** Locked

0:000> ~*kv

[...]

3 Id: da4.2838 Suspend: 1 Teb: 7ffdc000 Unfrozen ChildEBP RetAddr Args to Child 00c4f164 7c822124 7c83970f 00000688 00000000 ntdll!KiFastSystemCallRet 00c4f168 7c83970f 00000688 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 00c4f1a4 7c839620 00000000 00000004 00090000 ntdll!RtlpWaitOnCriticalSection+0x19c 00c4f1c4 7c82fe72 00090608 000995b8 00000008 ntdll!RtlEnterCriticalSection+0xa8 00c4f3ec 7c81f4d5 00090000 00800000 000000f8 ntdll!RtlAllocateHeap+0×313 00c4f434 7c81f073 000995b8 00000000 00090000 ntdll!RtlpAllocateUserBlock+0×91 00c4f4f8 7c81ff83 00000bbe 00000000 00000001 ntdll!RtlpLowFragHeapAlloc+0×862 00c4f71c 7c820b23 00090000 00000000 0000000c ntdll!RtlAllocateHeap+0×80 00c4f734 77f6599e 00c4f7b8 00000001 00000002 ntdll!RtlAllocateAndInitializeSid+0×35 00c4f768 74062bfd 00c4f7b8 00000001 00000002 advapi32!AllocateAndInitializeSid+0×2c […]

[...]

18 Id: da4.6080 Suspend: 1 Teb: 7ffa8000 Unfrozen [This is our spiking thread] [...]

[...]

32 Id: da4.2904 Suspend: 1 Teb: 7ff9a000 Unfrozen ChildEBP RetAddr Args to Child 0356fbfc 7c822124 7c83970f 00000688 00000000 ntdll!KiFastSystemCallRet 0356fc00 7c83970f 00000688 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 0356fc3c 7c839620 00000000 00000004 00090000 ntdll!RtlpWaitOnCriticalSection+0x19c 0356fc5c 7c82fe72 00090608 7c8329a4 0000008e ntdll!RtlEnterCriticalSection+0xa8 0356fe84 77e67319 00090000 00000000 0000056a ntdll!RtlAllocateHeap+0×313 0356fee8 77e67690 77e676bc 00020a50 00000000 kernel32!BasepComputeProcessPath+0xc2 0356ff28 77e41b95 00000000 00000000 7c82f337 kernel32!BaseComputeProcessDllPath+0xe3 0356ff88 77e67b77 74066908 00000000 00000000 kernel32!LoadLibraryExW+0×14e 0356ff9c 74071b74 74066908 74075320 74071c1c kernel32!LoadLibraryW+0×11 […]

33 Id: da4.2300 Suspend: 1 Teb: 7ff99000 Unfrozen ChildEBP RetAddr Args to Child 035afb40 7c822124 7c83970f 00000164 00000000 ntdll!KiFastSystemCallRet 035afb44 7c83970f 00000164 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 035afb80 7c839620 00000000 00000004 00000000 ntdll!RtlpWaitOnCriticalSection+0x19c 035afba0 7c83288c 7c889d40 77e630b8 00000000 ntdll!RtlEnterCriticalSection+0xa8 035afba8 77e630b8 00000000 007e5860 00000001 ntdll!RtlAcquirePebLock+0×11 […]

[...]

18 threads -> TID#2300 -> TID#2904 -> TID#6080

5 threads -> TID#2838 -> TID#6080

So it looks like our spiking thread was the main problem and preventing other threads from running.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | No Comments »

Exception and deadlock: pattern cooperation

Tuesday, July 22nd, 2008

When a process experienced an unhandled exception what were the possible reasons for a postmortem debugger not saving a crash dump? One of them will be illustrated in this post. The process AppA was hanging and causing another process AppB to hang too (see Coupled Processes pattern). If we look at AppA locked critical sections we would see a loader deadlock:

0:000> !locks

CritSec ntdll!LdrpLoaderLock+0 at 7c889d94 WaiterWoken No LockCount 8 RecursionCount 2 OwningThread 4534 EntryCount 0 ContentionCount a3e *** Locked

CritSec ntdll!FastPebLock+0 at 7c889d40 WaiterWoken No LockCount 2 RecursionCount 1 OwningThread 3eb4 EntryCount 0 ContentionCount 15 *** Locked

3 Id: 30b0.3eb4 Suspend: 1 Teb: 7ffdb000 Unfrozen ChildEBP RetAddr Args to Child 00b3e11c 7c822124 7c83970f 0000004c 00000000 ntdll!KiFastSystemCallRet 00b3e120 7c83970f 0000004c 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 00b3e15c 7c839620 00000000 00000004 00000001 ntdll!RtlpWaitOnCriticalSection+0×19c 00b3e17c 7c832ad0 7c889d94 09150000 7c889e0c ntdll!RtlEnterCriticalSection+0xa8 00b3e1b0 77e68de4 00000001 00000000 00b3e1ec ntdll!LdrLockLoaderLock+0xe4 00b3e210 77e98fae 09150000 00b3e5b0 00000104 kernel32!GetModuleFileNameW+0×77 00b3e24c 77e76d5f 091518b8 00b3e5a4 00000000 kernel32!FillUEFInfo+0×51 00b3e8d4 77e84269 00b3e8fc 77e6b831 00b3e904 kernel32!UnhandledExceptionFilter+0xec 00b3e8dc 77e6b831 00b3e904 00000000 00b3e904 kernel32!BaseThreadStart+0×4a 00b3e904 7c82eeb2 00b3e9e8 00b3ffdc 00b3ea04 kernel32!_except_handler3+0×61 00b3e928 7c82ee84 00b3e9e8 00b3ffdc 00b3ea04 ntdll!ExecuteHandler2+0×26 00b3e9d0 7c82ecc6 00b38000 00b3ea04 00b3e9e8 ntdll!ExecuteHandler+0×24 00b3e9d0 7c832335 00b38000 00b3ea04 00b3e9e8 ntdll!KiUserExceptionDispatcher+0xe (CONTEXT @ 00b3ea04) 00b3eeec 77e67319 00090000 00000000 0000056a ntdll!RtlAllocateHeap+0×9e3 00b3ef50 77e67690 77e676bc 00020a50 00000000 kernel32!BasepComputeProcessPath+0xc2 00b3ef90 77e41b95 00000000 00000000 00b3f05c kernel32!BaseComputeProcessDllPath+0xe3 00b3eff0 77e67b77 74065a30 00000000 00000000 kernel32!LoadLibraryExW+0×14e 00b3f004 7406615c 74065a30 007e1ab0 00000000 kernel32!LoadLibraryW+0×11 00b3f020 74066263 007e1ab0 00000000 534c4354 AppA!ProcessItem+0×2d […]

21 Id: 30b0.4534 Suspend: 1 Teb: 7ffa3000 Unfrozen ChildEBP RetAddr Args to Child 031eeef8 7c822124 7c83970f 00000350 00000000 ntdll!KiFastSystemCallRet 031eeefc 7c83970f 00000350 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 031eef38 7c839620 00000000 00000004 031ef278 ntdll!RtlpWaitOnCriticalSection+0×19c 031eef58 7c830c25 7c889d40 031ef328 00000000 ntdll!RtlEnterCriticalSection+0xa8 031ef200 7c831016 00000001 031ef2ac 031ef288 ntdll!RtlpDosPathNameToRelativeNtPathName_Ustr+0×97 031ef21c 7c830f85 031ef2ac 031ef288 00000000 ntdll!RtlDosPathNameToRelativeNtPathName_Ustr+0×18 031ef290 7c833ce9 031ef2ac 00000001 031ef414 ntdll!RtlDoesFileExists_UstrEx+0×1c 031ef2c0 7c83350e 031ef328 031ef434 031ef33c ntdll!LdrpSearchPath+0×76 031ef3c4 7c833637 0325fc78 031ef414 00000000 ntdll!LdrpCheckForLoadedDll+0xdc 031ef668 7c833ee5 00000000 0325fc78 031ef968 ntdll!LdrpLoadDll+0×1b3 031ef8e4 77e41bcc 0325fc78 031ef968 031ef948 ntdll!LdrLoadDll+0×198 031ef984 5e00b8d7 00115b00 00000000 00000008 kernel32!LoadLibraryExW+0×1b2 031ef9b0 5dff64aa 000cee40 7406275b 0000059c AppA!LoadPluginModule+0×42 […]

Looking at TID#3eb4 we see that the cause for deadlock was an exception while loading a DLL. Applying exception context WinDbg command .cxr reveals heap corruption pattern:

0:003> .cxr 00b3ea04 eax=0325f1f0 ebx=00000051 ecx=00090000 edx=00090400 esi=0008019d edi=0325f1e8 eip=7c832335 esp=00b3ecd0 ebp=00b3eeec iopl=0 nv up ei ng nz na po cy cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010283 ntdll!RtlAllocateHeap+0×9e3: 7c832335 8906 mov dword ptr [esi],eax ds:0023:0008019d={(AppA+0×2) (01000002)}

0:003> kL *** Stack trace for last set context - .thread/.cxr resets it ChildEBP RetAddr 00b3eeec 77e67319 ntdll!RtlAllocateHeap+0×9e3 00b3ef50 77e67690 kernel32!BasepComputeProcessPath+0xc2 00b3ef90 77e41b95 kernel32!BaseComputeProcessDllPath+0xe3 00b3eff0 77e67b77 kernel32!LoadLibraryExW+0×14e 00b3f004 7406615c kernel32!LoadLibraryW+0×11 00b3f020 74066263 AppA!ProcessItem+0×2d […]

Seems the address pointed by ESI is valid: [0008019d] = 01000002. However this is write MOV instruction but the accessed memory page is read-only:

0:003> !address 01000002 01000000 : 01000000 - 00001000 Type 01000000 MEM_IMAGE Protect 00000002 PAGE_READONLY State 00001000 MEM_COMMIT Usage RegionUsageImage FullPath […]AppA.exe

Without heap corruption the normal course of action can be depicted on the following diagram (Wait Chain pattern applied to critical sections):

However the exception changes the picture. The course of execution is deflected to the loader again and the loop is closed. We have a classical deadlock:

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | No Comments »

CDA Pattern Classification (Memory dump type)

Monday, July 21st, 2008

The following table provides correspondence between memory dump types (user, kernel and complete) and pattern names. For ALL, where appropriate, I also list the most common dump types where a pattern can be encountered.

Table: CDA Pattern Classification (Memory dump type)

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in CDA Pattern Classification, Crash Dump Analysis, Crash Dump Patterns, Debugging | No Comments »

CDA Pattern Classification (Space/Mode)

Monday, July 21st, 2008

The following table provides correspondence between kernel / user mode and space and pattern names. ALL means that a pattern is observable in both modes / spaces.

Table: CDA Pattern Classification (Space/Mode)

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in CDA Pattern Classification, Crash Dump Analysis, Crash Dump Patterns, Debugging | No Comments »

CDA Pattern Classification (Overview)

Sunday, July 20th, 2008

Thinking for a while I realized that it is not possible to create one classification scheme because one pattern can belong to several categories. Therefore I came up with the following draft list of them:

- Space / Mode
- Memory dump type
- Dynamical (Activity) / Structural (State)
- Architecture / Design / Implementation
- Entity-dependence
- System / Subsystem / Component / Object
- Cosmogony (How?)
- Management (Why?)
- Error class
- Observable behaviour

Next two posts will present classification based on Space/Mode and Memory dump type.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in CDA Pattern Classification, Crash Dump Analysis, Crash Dump Patterns, Debugging | No Comments »

Multiple patterns in a dump: case study

Thursday, July 17th, 2008

Let’s now apply the knowledge of crash dump analysis patterns to the real memory dump coming from the hanging system. !locks WinDbg command is the most frequently used start command for such symptoms according to my observation which might be biased. It reveals deep Wait Chain pattern for executive resources:

1: kd> !locks [...] Resource @ 0x8b10c168 Exclusively owned Contention Count = 1950624 NumberOfSharedWaiters = 2 NumberOfExclusiveWaiters = 154 Threads: 88256430-01<*> 882aedb0-01 8a348580-01 Threads Waiting On Exclusive Access: 891fd368 894fd240 88382280 8921b9c0 8aa18db0 89790328 882b8818 88a70bf8 884a2780 88999818 8a13b020 8846a7a0 8a0b3020 8812e568 897b6db0 88a16440 8922c5f8 88bfe3b8 88264ac0 89ff2b40 8a9da020 881cf020 8807adb0 89d64598 887811d0 8822a850 88264820 88194738 8801f7a8 88284020 88628db0 8a071db0 884a84b0 88be46c0 89755b18 89700020 89ca4580 881cddb0 882f7020 88bcf9a0 8921b020 8826fdb0 88a73db0 88211020 8868a1c8 89121280 89e01020 895cedb0 88d03790 883941f8 8910f820 891ebc80 89862db0 88154af8 8821e7d0 881cedb0 8822b020 88094818 8a00b020 89e69020 880bbdb0 8945f690 8954c1c0 88d2cb90 881cd020 8921c448 89550540 8a5a5870 8a159228 893976c0 882847b8 89306578 880eb9a8 8978e020 882f72c0 8966c380 8a12f4b8 8815adb0 881a5020 897c7db0 8873ebf8 88674530 8831b468 88e999a0 88287020 8966e600 88541db0 8826d7a0 88119b10 8a226338 882f7810 888ba348 884b89a8 88d03db0 8826db00 8910adb0 881d8368 89288238 8a00adb0 89125db0 88eb50a0 88dbbdb0 880ed020 895cd5b0 881d4b00 88565db0 886e7780 884b86d8 8a603598 89383020 8826f370 886d2248 88cd1360 881d1888 88bef670 88117db0 890d63f0 894d0368 8826f850 89123020 88209020 8826fac0 88f9bdb0 89027478 894b8d18 882a7338 899b9020 897c3db0 8a13fc50 88b33d50 88b54b68 88652360 8a199020 8910dc98 8833a020 8a194a70 8a5af640 89b717a0 89464db0 8a152878 884773d8 88afe020 88debaf8 88bef2a0 88bd6948 89abddb0 8a133db0 88e0ebe8 88287398 889622e0 8836aab8 88daec80 88c5c450 88225718

[...]

Resource @ 0x8a316c98 Exclusively owned Contention Count = 40315 NumberOfExclusiveWaiters = 2 Threads: 893bd498-01<*> Threads Waiting On Exclusive Access: 8846f9a8 88256430

[...]

Resource @ 0x8a2b3800 Exclusively owned Contention Count = 17735 NumberOfExclusiveWaiters = 1 Threads: 8a30ec80-01<*> Threads Waiting On Exclusive Access: 893bd498

[...]

14606 total locks, 14 locks currently held

We have this chain: 154 threads -> 88256430 -> 893bd498 -> 8a30ec80

The same conclusion comes from !analyze -v -hang command:

1: kd> !analyze -v -hang

[...]

Scanning for threads blocked on locks ...

CURRENT_IRQL: 2

BLOCKING_THREAD: 8a30ec80

LOCK_ADDRESS: 8a2b3800 -- (!locks 8a2b3800)

Resource @ 0x8a2b3800 Exclusively owned Contention Count = 17735 NumberOfExclusiveWaiters = 1 Threads: 8a30ec80-01<*> Threads Waiting On Exclusive Access: 893bd498

BUGCHECK_STR: LOCK_HELD

[...]

Let’s examine the thread 8a30ec80 that holds so many others:

1: kd> !thread 8a30ec80 1f THREAD 8a30ec80 Cid 3ca0.20f0 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 2 Not impersonating DeviceMap e1000930 Owning Process 8a254128 Image: processA.exe Wait Start TickCount 2024978 Ticks: 2291 (0:00:00:35.796) Context Switch Count 339739 UserTime 00:00:00.000 KernelTime 01:08:29.484 Start Address driverA!WorkerRoutine (0xbfa4b850) Stack Init b7409000 Current b7407e74 Base b7409000 Limit b7406000 Call 0 Priority 13 BasePriority 13 PriorityDecrement 0 ChildEBP RetAddr f77353a0 bfa4b880 driverA!DoProcessing+0×256 f77353a4 ff180010 driverA!WorkerRoutine+0×30 […]

We see that this thread is running on processor 2 and the time spent in kernel is more than one hour. Seems we have the case of Spiking Thread pattern here. Also the thread is not waiting and seems to be moving some data:

1: kd> .thread 8a30ec80 Implicit thread is now 8a30ec80

1: kd> r Last set context: eax=baec2950 ebx=00000000 ecx=00001b4a edx=00002275 esi=bae8c010 edi=ba01a018 eip=bfa3b68c esp=f77353a4 ebp=b7407f44 iopl=0 nv up ei pl nz na po nc cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000202 driverA!DoProcessing+0×256: bfa3b68c f3a5 rep movs dword ptr es:[edi],dword ptr [esi]

This driver was also known to have pool allocation problems resulting in various strange system behaviour so let’s inspect the output of !vm command to see whether we have an instance of Insufficient Memory pattern:

1: kd> !vm

*** Virtual Memory Usage *** Physical Memory: 1048242 ( 4192968 Kb) Page File: \??\C:\pagefile.sys Current: 4456448 Kb Free Space: 3483860 Kb Minimum: 4456448 Kb Maximum: 4456448 Kb Available Pages: 409879 ( 1639516 Kb) ResAvail Pages: 942398 ( 3769592 Kb) Locked IO Pages: 124 ( 496 Kb) Free System PTEs: 182782 ( 731128 Kb) Free NP PTEs: 32766 ( 131064 Kb) Free Special NP: 0 ( 0 Kb) Modified Pages: 29 ( 116 Kb) Modified PF Pages: 29 ( 116 Kb) NonPagedPool Usage: 13567 ( 54268 Kb) NonPagedPool Max: 65279 ( 261116 Kb) PagedPool 0 Usage: 12350 ( 49400 Kb) PagedPool 1 Usage: 2442 ( 9768 Kb) PagedPool 2 Usage: 2457 ( 9828 Kb) PagedPool 3 Usage: 2395 ( 9580 Kb) PagedPool 4 Usage: 2465 ( 9860 Kb) PagedPool Usage: 22109 ( 88436 Kb) PagedPool Maximum: 64512 ( 258048 Kb)

********** 3 pool allocations have failed **********

Shared Commit: 133470 ( 533880 Kb) Special Pool: 0 ( 0 Kb) Shared Process: 24993 ( 99972 Kb) PagedPool Commit: 22173 ( 88692 Kb) Driver Commit: 2056 ( 8224 Kb) Committed pages: 938909 ( 3755636 Kb) Commit limit: 2119027 ( 8476108 Kb)

[...]

We have signs here but the current size of paged pool and nonpaged pool seems to be very far from their maximum. Perhaps there were failures in session pool allocations? Let’s look at session pool:

1: kd> !vm 4 [...] Session ID 5 @ f79f3000: Paged Pool Usage: 35640K

*** 6 Pool Allocation Failures ***

Commit Usage: 36900K

We see the problem in session 5 and let’s see in which session our processA.exe was running:

1: kd> !process 8a254128 0 PROCESS 8a254128 SessionId: 15 Cid: 3ca0 Peb: 7ffd7000 ParentCid: 01ac DirBase: bff47800 ObjectTable: e779f5c8 HandleCount: 161. Image: processA.exe

It was session 15 and therefore we might conclude that previous problems with driverA are not connected to this new one. The identified problem is CPU spike. Perhaps the code contains a bug that causes this driver to loop indefinitely.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | No Comments »

In Search of Lost CID

Friday, July 11th, 2008

Paraphrasing the title of Marcel Proust’s “In Search of Lost Time” 6-volume classic we can say there is timeless empirical knowledge. One is target CID (Client ID, PID:TID) for RPC calls. We just need to search for even 16-bit numbers and compare them with the list of available PIDs. The example can be found on ntdebugging blog:

Tracking Down a Multi-Process Deadlock

Actually the second dword after PID can contain even 16-bit TID number as can be seen from another example:

1: kd> kv ChildEBP RetAddr Args to Child [...] 00faf828 7778c38b 00faf8f0 00faf9f0 06413b54 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112 00faf908 776c0565 06413b54 00fafa00 00faf9f0 ole32!CRpcChannelBuffer::SendReceive2+0xd3 00faf974 776c04fa 06413b54 00fafa00 00faf9f0 ole32!CAptRpcChnl::SendReceive+0xab 00faf9c8 77ce247f 06413b54 00fafa00 00faf9f0 ole32!CCtxComChnl::SendReceive+0×1a9 00faf9e4 77ce252f 03213bdc 00fafa2c 0600016e RPCRT4!NdrProxySendReceive+0×43 00fafdcc 77ce25a6 763050e8 76306bba 00fafe04 RPCRT4!NdrClientCall2+0×206 00fafdec 77c64f87 0000000c 00000005 00fafe3c RPCRT4!ObjectStublessClient+0×8b 00fafdfc 7784ba75 03213bdc 03235858 03235850 RPCRT4!ObjectStubless+0xf […]

1: kd> dpp 06413b54 l8 06413b54 77672418 7778bcdf ole32!CRpcChannelBuffer::QueryInterface 06413b58 776723e8 777267f5 ole32!CRpcChannelBuffer::QueryInterface 06413b5c 00000003 06413b60 00000002 06413b64 0743fde8 0316f080 06413b68 07676528 031d5ad0 06413b6c 064c9c80 064c9d00 06413b70 078843c0 00000000

1: kd> dd 064c9c80 l4 064c9c80 064c9d00 064c9c00 00003108 00001dac ; PID TID

I have been using this technique for a long time and only now see it documented.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging, WinDbg Tips and Tricks | 3 Comments »

Crash Dump Analysis Patterns (Part 71)

Friday, July 11th, 2008

Dynamic memory corruption patterns in user and kernel spaces are specializations of one big parent pattern called Corrupt Structure because crashes there happen due to corrupt or overwritten heap or pool control structures (for the latter see Double Free pattern). Another frequently seen specialization is called Critical Section Corruption which is the subject of this post. Critical sections are linked together through statically pre-allocated or heap-allocated helper structure (shown in magenta) although themselves they can be stored anywhere from static and stack area to heap:

0:001> dt -r1 ntdll!_RTL_CRITICAL_SECTION 77795240 +0×000 DebugInfo : 0×00175d28 _RTL_CRITICAL_SECTION_DEBUG +0×000 Type : 0 +0×002 CreatorBackTraceIndex : 0 +0×004 CriticalSection : 0×77795240 _RTL_CRITICAL_SECTION +0×008 ProcessLocksList : _LIST_ENTRY [ 0×173a08 - 0×173298 ] +0×010 EntryCount : 0 +0×014 ContentionCount : 0 +0×018 Spare : [2] 0 +0×004 LockCount : -1 +0×008 RecursionCount : 0 +0×00c OwningThread : (null) +0×010 LockSemaphore : (null) +0×014 SpinCount : 0

0:001> !address 77795240 77670000 : 77792000 - 00005000 Type 01000000 MEM_IMAGE Protect 00000004 PAGE_READWRITE State 00001000 MEM_COMMIT Usage RegionUsageImage FullPath C:\WINDOWS\system32\ole32.dll

0:001> !address 0×00175d28 00140000 : 00173000 - 0000d000 Type 00020000 MEM_PRIVATE Protect 00000004 PAGE_READWRITE State 00001000 MEM_COMMIT Usage RegionUsageHeap Handle 00140000

0:000> !locks

CritSec ntdll!LdrpLoaderLock+0 at 7c8877a0 WaiterWoken No LockCount 0 RecursionCount 1 OwningThread 1184 EntryCount 0 ContentionCount b04707 *** Locked

0:000> dt -r1 _RTL_CRITICAL_SECTION 7c8877a0 +0×000 DebugInfo : 0×7c8877c0 _RTL_CRITICAL_SECTION_DEBUG +0×000 Type : 0 +0×002 CreatorBackTraceIndex : 0 +0×004 CriticalSection : 0×7c8877a0 _RTL_CRITICAL_SECTION +0×008 ProcessLocksList : _LIST_ENTRY [ 0×7c887be8 - 0×7c887bc8 ] +0×010 EntryCount : 0 +0×014 ContentionCount : 0xb04707 +0×018 Spare : [2] 0 +0×004 LockCount : -2 +0×008 RecursionCount : 1 +0×00c OwningThread : 0×00001184 +0×010 LockSemaphore : 0×0000013c +0×014 SpinCount : 0

0:000> !address 7c8877a0 7c800000 : 7c887000 - 00003000 Type 01000000 MEM_IMAGE Protect 00000004 PAGE_READWRITE State 00001000 MEM_COMMIT Usage RegionUsageImage FullPath C:\WINDOWS\system32\ntdll.dll

0:000> !address 0×7c8877c0 7c800000 : 7c887000 - 00003000 Type 01000000 MEM_IMAGE Protect 00000004 PAGE_READWRITE State 00001000 MEM_COMMIT Usage RegionUsageImage FullPath C:\WINDOWS\system32\ntdll.dll

Consider the case when CRITICAL_SECTION is defined on a stack and there was Local Buffer Overflow overwriting DebugInfo pointer. Then we have an example of Wild Pointer pattern and traversing the list of critical sections from this point will diverge into completely unrelated memory area or stop there. Consider another example of heap corruption or race condition overwriting ProcessLocksList or CriticalSection pointer. Then we have an instance of Wild Pointer pattern illustrated below:

0:000> !locks

CritSec ntdll!LdrpLoaderLock+0 at 7c8877a0 WaiterWoken No LockCount 0 RecursionCount 1 OwningThread 1184 EntryCount 0 ContentionCount b04707 *** Locked

CritSec +1018de08 at 1018de08 WaiterWoken Yes LockCount -49153 RecursionCount 5046347 OwningThread 460050 EntryCount 0 ContentionCount 0 *** Locked

CritSec +1018ddd8 at 1018ddd8 WaiterWoken Yes LockCount -1 RecursionCount 0 OwningThread 0 *** Locked

CritSec +1018de28 at 1018de28 WaiterWoken Yes LockCount -1 RecursionCount 0 OwningThread 0 *** Locked

CritSec +1018de08 at 1018de08 WaiterWoken Yes LockCount -49153 RecursionCount 5046347 OwningThread 460050 EntryCount 0 ContentionCount 0 *** Locked

CritSec +1018de28 at 1018de28 WaiterWoken Yes LockCount -1 RecursionCount 0 OwningThread 0 *** Locked

CritSec +1018ddd8 at 1018ddd8 WaiterWoken Yes LockCount -1 RecursionCount 0 OwningThread 0 *** Locked

Scanned 841 critical sections

We see the signs of corruption at 1018de08 address which is interpreted as pointing to a locked critical section. To see where the corruption started we need to look at the list of all critical sections either locked or not locked:

0:000> !locks -v

CritSec ntdll!RtlCriticalSectionLock+0 at 7c887780 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 28

CritSec ntdll!LdrpLoaderLock+0 at 7c8877a0 WaiterWoken No LockCount 0 RecursionCount 1 OwningThread 1184 EntryCount 0 ContentionCount b04707 *** Locked

CritSec ntdll!FastPebLock+0 at 7c887740 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 42c9

CritSec ntdll!RtlpCalloutEntryLock+0 at 7c888ea0 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec ntdll!PMCritSect+0 at 7c8883c0 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec ntdll!UMLogCritSect+0 at 7c888400 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec ntdll!RtlpProcessHeapsListLock+0 at 7c887960 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec +80608 at 00080608 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 22

[...]

CritSec cabinet!_adbgmsg+13c at 74fb4658 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec +c6c17c at 00c6c17c LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec +c6c0e4 at 00c6c0e4 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec at 1018de08 does not point back to the debug info at 00136a40 Perhaps the memory that held the critical section has been reused without calling DeleteCriticalSection() ?

CritSec +1018de08 at 1018de08 WaiterWoken Yes LockCount -49153 RecursionCount 5046347 OwningThread 460050 EntryCount 0 ContentionCount 0 *** Locked

CritSec at 1018ddd8 does not point back to the debug info at 00136a68 Perhaps the memory that held the critical section has been reused without calling DeleteCriticalSection() ?

CritSec +1018ddd8 at 1018ddd8 WaiterWoken Yes LockCount -1 RecursionCount 0 OwningThread 0 *** Locked

[...]

We see that the problem appears when the heap-allocated critical section at 00c6c0e4 address is linked to an inconsistent critical section at 0×1018de08 address where memory contains UNICODE string fragment:

0:000> !address 00c6c0e4 00c60000 : 00c60000 - 00010000 Type 00020000 MEM_PRIVATE Protect 00000004 PAGE_READWRITE State 00001000 MEM_COMMIT Usage RegionUsageHeap Handle 00c60000

0:000> dt -r1 _RTL_CRITICAL_SECTION 00c6c0e4 +0x000 DebugInfo : 0x00161140 _RTL_CRITICAL_SECTION_DEBUG +0x000 Type : 0 +0x002 CreatorBackTraceIndex : 0 +0x004 CriticalSection : 0x00c6c0e4 _RTL_CRITICAL_SECTION +0×008 ProcessLocksList : _LIST_ENTRY [ 0×136a48 - 0×119f58 ] +0×010 EntryCount : 0 +0×014 ContentionCount : 0 +0×018 Spare : [2] 0 +0×004 LockCount : -1 +0×008 RecursionCount : 0 +0×00c OwningThread : (null) +0×010 LockSemaphore : (null) +0×014 SpinCount : 0

0:000> dt -r _RTL_CRITICAL_SECTION_DEBUG 0×00136a48-0×008 +0×000 Type : 0 +0×002 CreatorBackTraceIndex : 0 +0×004 CriticalSection : 0×1018de08 _RTL_CRITICAL_SECTION +0×000 DebugInfo : 0×000d001b _RTL_CRITICAL_SECTION_DEBUG +0×000 Type : 0 +0×002 CreatorBackTraceIndex : 0 +0×004 CriticalSection : (null) +0×008 ProcessLocksList : _LIST_ENTRY [ 0×0 - 0×0 ] +0×010 EntryCount : 0 +0×014 ContentionCount : 0×37e3c700 +0×018 Spare : [2] 0×8000025 +0×004 LockCount : 196609 +0×008 RecursionCount : 5046347 +0×00c OwningThread : 0×00460050 +0×010 LockSemaphore : 0×00310033 +0×014 SpinCount : 0×520044 +0×008 ProcessLocksList : _LIST_ENTRY [ 0×136a70 - 0×161148 ] +0×000 Flink : 0×00136a70 _LIST_ENTRY [ 0×136a98 - 0×136a48 ] +0×000 Flink : 0×00136a98 _LIST_ENTRY [ 0×136ae8 - 0×136a70 ] +0×004 Blink : 0×00136a48 _LIST_ENTRY [ 0×136a70 - 0×161148 ] +0×004 Blink : 0×00161148 _LIST_ENTRY [ 0×136a48 - 0×119f58 ] +0×000 Flink : 0×00136a48 _LIST_ENTRY [ 0×136a70 - 0×161148 ] +0×004 Blink : 0×00119f58 _LIST_ENTRY [ 0×161148 - 0×16cc3c0 ] +0×010 EntryCount : 0 +0×014 ContentionCount : 0 +0×018 Spare : [2] 0×2e760000

0:000> !address 0x1018de08 10120000 : 10120000 - 00100000 Type 00020000 MEM_PRIVATE Protect 00000004 PAGE_READWRITE State 00001000 MEM_COMMIT Usage RegionUsageIsVAD

The address points miraculously to some DLL:

0:000> du 1018de08 1018de08 "....componentA.dll"

We might suggest that componentA.dll played some role there.

There are other messages from verbose version of !locks WinDbg command pointing to critical section problems:

CritSec componentB!Section+0 at 74004008 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

The CritSec componentC!Info+c at 72455074 has been RE-INITIALIZED. The critical section points to DebugInfo at 00107cc8 instead of 000f4788

CritSec componentC!Info+c at 72455074 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

CritSec componentD!foo+8ec0 at 0101add0 LockCount NOT LOCKED RecursionCount 0 OwningThread 0 EntryCount 0 ContentionCount 0

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | 3 Comments »

Crash Dump Analysis Patterns (Part 70)

Thursday, July 10th, 2008

Sometimes compilers optimize code by replacing function calls with their bodies. This procedure is called function inlining and functions themselves are called inline. On one platform we can see the real function call on the stack trace but on another platform or product version we only see the same problem instruction. Fortunately the rest of stack trace should be the same. Therefore when comparing Stack Traces we shouldn’t pay attention only to the top function call. This pattern is called Inline Function Optimization.

It is frequently seen when threads crash while copying or moving memory. Consider this stack trace:

0: kd> kL ChildEBP RetAddr f22efaf4 f279ec3d driver!QueueValue+0x26b f22efb30 8081dcdf driver!BufferAppendData+0x35f f22efc7c 808f47b7 nt!IofCallDriver+0x45 f22efc90 808f24ee nt!IopSynchronousServiceTail+0x10b f22efd38 80888c7c nt!NtWriteFile+0x65a f22efd38 7c82ed54 nt!KiFastCallEntry+0xfc

When looking at rep movs instruction we might suspect that QueueValue was copying memory:

0: kd> r eax=00000640 ebx=89b23000 ecx=00000190 edx=89b3c828 esi=02124220 edi=e2108f58 eip=f279c797 esp=f22efadc ebp=f22efaf4 iopl=0 nv up ei pl nz na pe nc cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010206 driver!QueueValue+0x26b: f279c797 f3a5 rep movs dword ptr es:[edi],dword ptr [esi] es:0023:e2108f58=dfefbecf ds:0023:02124220=????????

On x64 bit platform the same driver had the similar stack trace but with memcpy at its top:

fffffadf`8955f4a8 fffffadf`8d1bef46 driver!memcpy+0x1c0 fffffadf`8955f4b0 fffffadf`8d1c15c9 driver!QueueValue+0x2fe fffffadf`8955f550 fffff800`01273ed9 driver!BufferAppendData+0x481 [...]

We also see how QueueValue+0×2fe and QueueValue+0×26b are close. In fact the source code for the driver calls RtlCopyMemory function only once and it is defined as memcpy in wdm.h. The latter function is also exported from nt:

0: kd> x nt! [...] 80881780 nt!memcpy = <no type information> [...]

but usually can be found in any driver that links it from C runtime library, for example, on my x64 Windows:

1: kd> x nt!memcpy fffff800`01c464e0 nt!memcpy = <no type information>

1: kd> x srv!memcpy fffff980`0eafdf20 srv!memcpy = <no type information>

1: kd> x win32k!memcpy fffff960`000c1b40 win32k!memcpy = <no type information>

Therefore we see that when compiling for x86 platform Visual C++ compiler decided to inline memcpy code but AMD compiler on x64 platform didn’t inline it. The overall stack trace without offsets is very similar and we can suppose that the problem was identical.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | 2 Comments »

Crash Dump Analysis Patterns (Part 19b)

Thursday, July 10th, 2008

Previously I wrote about how to calculate thread waiting time in kernel and complete memory dumps (Waiting Thread Time pattern). Now I show how to do it for user dumps. Unfortunately this information is not available in user space and therefore this can be done only approximately by calculating average or maximum waiting time.

The key is to take advantage of !runaway WinDbg command. Modern versions of postmortem debuggers and process dumpers usually save thread time information additionally. For example, /ma switch for .dump command does it. The later versions of userdump.exe also do it. When !runaway is invoked with 0y111 (7) flag it shows each thread user and kernel times and also the time elapsed since each thread was created. Therefore for threads that didn’t do much CPU intensive computation their average waiting time will almost correspond to their elapsed time. Unfortunately we cannot say when that computation took place. It could be the case that the thread in question was waiting 98.9% of the time, then did some computation for 0.01% and was waiting again 1% of the time or it could be the case that it was doing computation 0.01% of the time initially and then waiting the rest of the time (99.9%) until the dump was saved.

Consider this example of a sleeping thread:

0:000> ~32kL 100 ChildEBP RetAddr 0a18f03c 7c826f4b ntdll!KiFastSystemCallRet 0a18f040 77e41ed1 ntdll!NtDelayExecution+0xc 0a18f0a8 77e424ed kernel32!SleepEx+0x68 0a18f0b8 1b0ac343 kernel32!Sleep+0xf 0a18f100 1b00ba49 msjet40!System::AllocatePages+0x15e 0a18f11c 1b00ba08 msjet40!PageDesc::WireCurrentPage+0x2f 0a18f13c 1b00b8dd msjet40!PageDesc::ReadPage+0x119 0a18f164 1b01b567 msjet40!Database::ReadPage+0x7a 0a18f190 1b00e309 msjet40!TableMover::GetNextRow+0xc9 0a18f1a8 1b015de9 msjet40!TableMover::Move+0xc4 0a18f1d8 1b015d9c msjet40!ErrIsamMove+0x6c 0a18f1f0 1b038aa4 msjet40!ErrDispMove+0x43 0a18f43c 1b015d9c msjet40!ErrJPMoveRange+0x350 0a18f454 1b0200b3 msjet40!ErrDispMove+0x43 0a18f6a0 1b021e4d msjet40!ErrMaterializeRows+0x1fd 0a18f6f0 1b021c0d msjet40!ErrJPSetColumnSort+0x191 0a18f718 1b0210a4 msjet40!ErrJPOpenSort+0x105 0a18f750 1b020de5 msjet40!ErrJPOpenRvt+0x171 0a18f9f0 1b039b82 msjet40!ErrExecuteQuery+0x548 0a18fa3c 1b05548b msjet40!ErrExecuteTempQuery+0x13d 0a18fa6c 4c23b01e msjet40!JetExecuteTempQuery+0xc9 0a18fa94 4c23a8d1 odbcjt32!DoJetCloseTable+0x64 0a18fd10 4c23a6b6 odbcjt32!SQLInternalExecute+0x217 0a18fd20 4c23a694 odbcjt32!SQLExecuteCover+0x1f 0a18fd28 488b3fc1 odbcjt32!SQLExecute+0x9 0a18fd44 0c4898b5 odbc32!SQLExecute+0xd3 [...]

The process uptime can be seen from vertarget WinDbg command:

0:000> vertarget [...] Process Uptime: 0 days 1:17:22.000 [...]

Then we dump thread times (only information for our 32nd thread is shown here):

0:000> !runaway 0y111 User Mode Time Thread Time 32:cfc 0 days 0:00:00.109 [...] Kernel Mode Time Thread Time 32:cfc 0 days 0:00:00.062 [...] Elapsed Time Thread Time [...] 32:cfc 0 days 1:17:20.703 [...]

We see that 1 hour, 17 minutes and almost 21 seconds passed since the thread was created. By subtracting this time from process uptime we see that it was created in the first 2 seconds and it was consuming less than one second of CPU time. Therefore, most of the time this thread was waiting. Unfortunately we cannot say when it was waiting most of the time, in the beginning before it started sleeping or after. Fortunately we might continue guessing by looking at Sleep argument:

0:032> kv ChildEBP RetAddr Args to Child 0a18f03c 7c826f4b 77e41ed1 00000000 0a18f080 ntdll!KiFastSystemCallRet 0a18f040 77e41ed1 00000000 0a18f080 00000000 ntdll!NtDelayExecution+0xc 0a18f0a8 77e424ed 00000032 00000000 04390000 kernel32!SleepEx+0x68 0a18f0b8 1b0ac343 00000032 042a34b4 0a18f1c8 kernel32!Sleep+0xf […]

We see that it had been sleeping for at most 32 milliseconds and perhaps this is a retry / sleep loop. We might guess that the thread was recently spinning in that loop and therefore waiting all the time before that. Alternatively we might guess that the retry portion is very small and fast and 32 milliseconds are spread over all elapsed time and the thread was in this loop the significant proportion of time. What we can surely say that the last waiting time was no more than 32 milliseconds. in the case of waiting for an event, for example, I don’t see any reliable way of calculating this time. If anyone knows please post a comment.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, WinDbg Tips and Tricks | 3 Comments »

Crash Dump Analysis AntiPatterns (Part 10)

Wednesday, July 9th, 2008

Often engineers spend 10 minutes pursuing a certain investigation path and then prematurely closing it and switching to another. This is what I call Myopic Troubleshooting and Debugging.

This anti-pattern name was inspired by Daniel Dennett’s discussion of insufficiently patient scientists doing computer simulations:

“mistaking a failure of imagination for an insight into necessity” (Darwin’s Dangerous Idea, page 175).

Paraphrasing we can say that engineers think of impossibility where their imagination fails.

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in AntiPatterns, Crash Dump Analysis, Crash Dump Patterns, Debugging, Software Technical Support, Troubleshooting Methodology | No Comments »

Crash Dump Analysis Patterns (Part 69)

Wednesday, July 9th, 2008

Sometimes patterns like Message Box and / or Stack Trace semantics reveal another pattern that I call Self-Diagnosis which may or may not result in Self-Dump. The diagnostic message may reveal the problem internally detected by runtime environment.

Consider the following stack trace:

0:000> kv ChildEBP RetAddr Args to Child 0012e8c0 77f4bf53 77f4610a 00000000 00000000 ntdll!KiFastSystemCallRet 0012e8f8 77f3965e 000101a2 00000000 00000001 user32!NtUserWaitMessage+0xc 0012e920 77f4f762 77f30000 00151768 00000000 user32!InternalDialogBox+0xd0 0012ebe0 77f4f047 0012ed3c 00000000 ffffffff user32!SoftModalMessageBox+0x94b 0012ed30 77f4eec9 0012ed3c 00000028 00000000 user32!MessageBoxWorker+0x2ba 0012ed88 77f87d0d 00000000 001511a8 0014ef50 user32!MessageBoxTimeoutW+0x7a 0012edbc 77f742c8 00000000 0012ee70 1001d7d4 user32!MessageBoxTimeoutA+0x9c 0012eddc 77f742a4 00000000 0012ee70 1001d7d4 user32!MessageBoxExA+0x1b 0012edf8 10014c9a 00000000 0012ee70 1001d7d4 user32!MessageBoxA+0×45 WARNING: Stack unwind information not available. Following frames may be wrong. 0012ee2c 10010221 0012ee70 1001d7d4 00012010 component!Error+0×7e4a […]

Dumping the message box message and its title shows that Visual C++ runtime detected a buffer overflow condition:

0:000> da 0012ee70 0012ee70 "Buffer overrun detected!..Progra” 0012ee90 “m: E:\W\program.exe..A buffer ov” 0012eeb0 “errun has been detected which ha” 0012eed0 “s corrupted the program’s.intern” 0012eef0 “al state. The program cannot sa” 0012ef10 “fely continue execution and must” 0012ef30 “.now be terminated..”

0:000> da 1001d7d4 1001d7d4 "Microsoft Visual C++ Runtime Lib" 1001d7f4 "rary"

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | 9 Comments »

Crash Dump Analysis Patterns (Part 68)

Friday, June 27th, 2008

Setting a thread affinity mask to a specific processor or core makes that thread running in a single processor environment from that thread point of view. It is always scheduled to run on that processor. This potentially creates a problem found on real single processor environments if the processor runs another higher priority thread (Thread Starvation pattern) or loops at dispatch level IRQL (Dispatch Level Spin pattern). Therefore I call this pattern Affine Thread.

Here is one example. Dual core laptop was hanging and kernel memory dump showed the following Wait Chain pattern:

Resource @ nt!PopPolicyLock (0x80563080) Exclusively owned Contention Count = 32 NumberOfExclusiveWaiters = 9 Threads: 8b3b08b8-01<*> Threads Waiting On Exclusive Access: 872935f0 8744cb30 87535da8 8755a6b0 8588dba8 8a446c10 85891c50 87250020 8a6e7da8

The thread 8b3b08b8 blocked other 9 threads and had the following stack trace:

0: kd> !thread 8b3b08b8 1f THREAD 8b3b08b8 Cid 0004.002c Teb: 00000000 Win32Thread: 00000000 READY Not impersonating DeviceMap e1009248 Owning Process 8b3b2830 Image: System Wait Start TickCount 44419 Ticks: 8744 (0:00:02:16.625) Context Switch Count 4579 UserTime 00:00:00.000 KernelTime 00:00:01.109 Start Address nt!ExpWorkerThread (0x8053867e) Stack Init bad00000 Current bacffcb0 Base bad00000 Limit bacfd000 Call 0 Priority 15 BasePriority 12 PriorityDecrement 3 DecrementCount 16 ChildEBP RetAddr bacffcc8 804fd2c9 nt!KiUnlockDispatcherDatabase+0x9e bacffcdc 8052a16f nt!KeSetSystemAffinityThread+0×5b bacffd04 805caf03 nt!PopCompositeBatteryUpdateThrottleLimit+0×2d bacffd24 805ca767 nt!PopCompositeBatteryDeviceHandler+0×1c5 bacffd3c 80529d3b nt!PopPolicyWorkerMain+0×25 bacffd7c 8053876d nt!PopPolicyWorkerThread+0xbf bacffdac 805cff64 nt!ExpWorkerThread+0xef bacffddc 805460de nt!PspSystemThreadStartup+0×34 00000000 00000000 nt!KiThreadStartup+0×16

Note this function and its first parameter:

0: kd> !thread 8b3b08b8 [...] bacffcdc 8052a16f 00000002 8a5b8cd8 00000030 nt!KeSetSystemAffinityThread+0×5b […]

The first parameter is KAFFINITY mask and 0×2 is 0y10 (binary) which is the second core. This thread had been already set to run on that core only:

0: kd> dt _KTHREAD 8b3b08b8 nt!_KTHREAD +0x000 Header : _DISPATCHER_HEADER [...] +0×124 Affinity : 2 […]

Let’s look at our second core:

0: kd> ~1s

1: kd> kL 100 ChildEBP RetAddr a8f00618 acd21947 hal!KeAcquireInStackQueuedSpinLock+0x43 a8f00618 acd21947 tcpip!IndicateData+0x98 a8f00684 acd173e5 tcpip!IndicateData+0x98 a8f0070c acd14ef5 tcpip!TCPRcv+0xbb0 a8f0076c acd14b19 tcpip!DeliverToUser+0x18e a8f007e8 acd14836 tcpip!DeliverToUserEx+0x95e a8f008a0 acd13928 tcpip!IPRcvPacket+0x6cb a8f008e0 acd13853 tcpip!ARPRcvIndicationNew+0x149 a8f0091c ba56be45 tcpip!ARPRcvPacket+0x68 a8f00970 b635801d NDIS!ethFilterDprIndicateReceivePacket+0x307 a8f00984 b63581b4 psched!PsFlushReceiveQueue+0x15 a8f009a8 b63585f9 psched!PsEnqueueReceivePacket+0xda a8f009c0 ba56c8ed psched!ClReceiveComplete+0x13 a8f009d8 b7defdb5 NDIS!EthFilterDprIndicateReceiveComplete+0x7c a8f00a08 b7df0f78 driverA+0x17db5 a8f00a64 ba55ec2c driverA+0x18f78 a8f00a88 b6b0962c NDIS!ndisMSendCompleteX+0x8d a8f00a9c b6b0a36d driverB+0x62c a8f00ab8 ba55e88f driverB+0x136d a8f00ae0 b7de003c NDIS!NdisReturnPackets+0xe9 a8f00af0 ba55e88f driverA+0x803c a8f00b18 b6358061 NDIS!NdisReturnPackets+0xe9 a8f00b30 ba55e88f psched!MpReturnPacket+0x3b a8f00b58 acc877cc NDIS!NdisReturnPackets+0xe9 87749da0 00000000 afd!AfdReturnBuffer+0xe1

1: kd> r eax=a8f005f8 ebx=a8f00624 ecx=8a9862ed edx=a8f00b94 esi=874e2ed0 edi=8a9862d0 eip=806e6a33 esp=a8f005ec ebp=a8f00618 iopl=0 nv up ei pl nz na po nc cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000202 hal!KeAcquireInStackQueuedSpinLock+0x43: 806e6a33 74ee je hal!KeAcquireInStackQueuedSpinLock+0x33 (806e6a23) [br=0]

1: kd> !running

System Processors 3 (affinity mask) Idle Processors 1

Prcb Current Next 1 bab38120 8a0c8ae8 8b3a7318 …………….

We see the thread 8a0c8ae8 had been spinning on the second core for more than 2 minutes:

1: kd> !thread 8a0c8ae8 1f THREAD 8a0c8ae8 Cid 0660.0124 Teb: 7ffd7000 Win32Thread: e338c498 RUNNING on processor 1 IRP List: 8a960008: (0006,01b4) Flags: 00000900 Mdl: 87535908 Not impersonating DeviceMap e2f155b8 Owning Process 87373020 Image: APPLICATION.EXE Wait Start TickCount 43918 Ticks: 9245 (0:00:02:24.453) Context Switch Count 690 LargeStack UserTime 00:00:00.000 KernelTime 00:02:24.453 […]

Its kernel time looks consistent with the starved thread waiting time:

0: kd> !thread 8b3b08b8 1f THREAD 8b3b08b8 Cid 0004.002c Teb: 00000000 Win32Thread: 00000000 READY Not impersonating DeviceMap e1009248 Owning Process 8b3b2830 Image: System Wait Start TickCount 44419 Ticks: 8744 (0:00:02:16.625) […]

For comparison, the spinning thread has affinity mask 0y11 (0×3) which means that it can be scheduled to run on both cores:

0: kd> dt _KTHREAD 8a0c8ae8 nt!_KTHREAD +0x000 Header : _DISPATCHER_HEADER [...] +0×124 Affinity : 3 […]

- Dmitry Vostokov @ DumpAnalysis.org -

Posted in Crash Dump Analysis, Crash Dump Patterns, Debugging | 1 Comment »

Crash Dump Analysis Patterns (Part 67b)

Wednesday, June 25th, 2008

Nested exception analysis is much simpler in managed code than in unmanaged. Exception object references the inner exception if any and so on:

Exception.InnerException

WinDbg does a good job traversing all nested exceptions when executing !analyze -v command. In the following example of a Windows forms application crash ObjectDisposedException (shown in red) was re-thrown as Exception object with “Critical error” message (shown in magenta) which was re-thrown several times as Exception object with “Critical program error” message (shown in blue) that finally resulted in the process termination request in the top level exception handler:

MANAGED_STACK: (TransitionMU) 001374B0 0B313757 System!System.Diagnostics.Process.Kill()+0x37 001374E4 0B3129C7 Component!Foo.HandleUnhandledException(System.Exception)+0x137 001374F4 07C0A7D3 Component!Foo+FooBarProcessMenuCommand(System.String)+0x33 (TransitionUM) (TransitionMU) [...]

EXCEPTION_OBJECT: !pe 3398614 Exception object: 03398614 Exception type: System.Exception Message: Critical program error InnerException: System.Exception, use !PrintException 03398560 to see more StackTrace (generated): SP IP Function 001371A8 07C0CDD8 Foo.BarUserInteraction.Process(System.String) 00137258 07C0CCA6 Foo.BarUserInteraction.ProcessUserInteraction(Sub, BarStepType) 00137268 07C0A9BA Foo.BarMenu.Process(CMD) 00137544 07C0A8D8 Foo.BarMenu.ProcessCMD(CMD) 0013756C 07C0A7BE Foo+FooBar.ProcessBarMenuCommand(System.String)