Archive for October, 2009

Darwinian Debugging

Monday, October 12th, 2009

To commemorate 200 years since Darwin’s birth and 150 years since the publication of The Origin of Species, OpenTask plans to dedicate December, 2009 issue of Debugged! MZ/PE magazine to taxonomy and evolution of software defects and March, 2010 issue to multithreading (as natural selection).

Debugged! MZ/PE: Systematics and Evolution of Software Defects, December, 2009 (Paperback, ISBN: 978-1906717827)

Debugged! MZ/PE: Multithreading, March, 2010 (Paperback, ISBN: 978-1906717834)

Here are draft front covers designed during last weekend:

- Dmitry Vostokov @ DumpAnalysis.org -

From Image to Imagination

Monday, October 12th, 2009

The best of artistic work commissioned by OpenTask to be published with annotations in the following book scheduled to open 2010, The Year of The Foundation of Debugging (Crash Dump Analysis):

Spikes, Hangs, Crashes, Leaks and Dumps of Imagination: The Art of the Debugging Art (ISBN: 978-1906717841)

Note: This is not a book about natural computer memory visualization.

- Dmitry Vostokov @ DumpAnalysis.org -

Statement current, coupled processes, wait chain, spiking thread, hidden exception, message box and not my version: memory dump and trace analysis pattern cooperation

Monday, October 12th, 2009

It was reported that one important system functionality is not available from time to time but is usually restored to normal operation when one service (ServiceA) is restarted. That service was coupled with ServiceB and their memory dumps were saved and delivered for analysis. Unfortunately, nothing raising a suspicion was found inside. To tackle the problem it was advised to get an ETW trace from the system including modules from ServiceA together with process memory dumps when the problem happens again. The trace revealed the following message with exceptionally high statement current of 72,118 msg/s (and also superdense - no other types of trace statements were found inside):

#      PID    TID    Message
[...]
823296 11300  2484   ServiceB notification failed, error code = 6
[…]

Where the error 6 is invalid handle error:

0:000> !error 6
Error code: (Win32) 0x6 (6) - The handle is invalid.

The thread 2484 (9B4) corresponds to thread #22 in ServiceA and it is blocked waiting for an LPC reply:

  22  Id: 2c24.9b4 Suspend: 1 Teb: 7ffa4000 Unfrozen
ChildEBP RetAddr 
020cfa18 7c827899 ntdll!KiFastSystemCallRet
020cfa1c 77c80a6e ntdll!ZwRequestWaitReplyPort+0xc
020cfa68 77c7fcf0 rpcrt4!LRPC_CCALL::SendReceive+0×230

020cfa74 77c80673 rpcrt4!I_RpcSendReceive+0×24
020cfa88 77ce315a rpcrt4!NdrSendReceive+0×2b
020cfe70 73077ca5 rpcrt4!NdrClientCall2+0×22e
020cfe88 73077c2a ServiceA!RpcNextNotification+0×1c
020cffb8 77e6482f ServiceA!EventWatcherThread+0×107
020cffec 00000000 kernel32!BaseThreadStart+0×34

Suspicious of a loop we confirm that the thread was spiking:

0:000> !runaway f
 User Mode Time
  Thread       Time
  22:9b4       0 days 0:41:27.453
  19:4768      0 days 0:00:00.109
[…] 
Kernel Mode Time
  Thread       Time
  22:9b4       0 days 0:24:27.984
  23:407c      0 days 0:00:00.437
[…]
 Elapsed Time
  Thread       Time
[…]
  22:9b4       0 days 5:26:21.499
[…]

Looking at the raw stack data (using !teb and dds WinDbg commands) we see a hidden processed exception:

020cf6c4  020cf4c0
020cf6c8  020cf6d8
020cf6cc  020cf718
020cf6d0  7c828290 ntdll!_except_handler3
020cf6d4  7c82a120 ntdll!CheckHeapFillPattern+0x54
020cf6d8  020cf6e8
020cf6dc  00140000
020cf6e0  7c82a144 ntdll!RtlpAllocateFromHeapLookaside+0x13
020cf6e4  00140868
020cf6e8  020cf910
020cf6ec  7c82a0d8 ntdll!RtlAllocateHeap+0x1dd
020cf6f0  7c82a11c ntdll!RtlAllocateHeap+0xee7
020cf6f4  73074548
020cf6f8  00000000
020cf6fc  00000000
020cf700  00000000
020cf704  00000000
020cf708  00218ef0
020cf70c  020cf728
020cf710  7c82a791 ntdll!RtlpCoalesceFreeBlocks+0x383
020cf714  020d0000
020cf718  00218ef0
020cf71c  020cf9fc
020cf720  7c82865c ntdll!RtlRaiseException+0×3d
020cf724  020ce000
020cf728  020cf72c
020cf72c  00010007
020cf730  020cf810
020cf734  7c829f5d ntdll!RtlFreeHeap+0×20e
020cf738  001407d8
020cf73c  7c829f79 ntdll!RtlFreeHeap+0×70f
020cf740  00000000

After some time another pair of coupled processes was collected where ServiceA(2) was hanging on an LPC request again but this time ServiceB(2) had one thread blocked by a GUI property sheet processing code (a variant of Message Box pattern):

0:015> kL 100
ChildEBP RetAddr 
017fb9f0 7c827d29 ntdll!KiFastSystemCallRet
017fb9f4 77e61d1e ntdll!ZwWaitForSingleObject+0xc
017fba64 77e61c8d kernel32!WaitForSingleObjectEx+0xac
017fba78 6dfcdac3 kernel32!WaitForSingleObject+0x12
[...]
017fbdac 730801c5 compstui!CommonPropertySheetUIW+0×17
017fbdf4 73080f5d ServiceB!CommonPropertySheetUI+0×43

WARNING: Stack unwind information not available. Following frames may be wrong.
017fc27c 5c3ae4e6 ComponentA!DllGetClassObject+0xbf4e
[…]
017ff8f8 77ce33e1 rpcrt4!Invoke+0×30
017ffcf8 77ce35c4 rpcrt4!NdrStubCall2+0×299
017ffd14 77c7ff7a rpcrt4!NdrServerCall2+0×19
017ffd48 77c8042d rpcrt4!DispatchToStubInCNoAvrf+0×38
017ffd9c 77c80353 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0×11f
017ffdc0 77c811dc rpcrt4!RPC_INTERFACE::DispatchToStub+0xa3
017ffdfc 77c812f0 rpcrt4!LRPC_SCALL::DealWithRequestMessage+0×42c
017ffe20 77c88678 rpcrt4!LRPC_ADDRESS::DealWithLRPCRequest+0×127
017fff84 77c88792 rpcrt4!LRPC_ADDRESS::ReceiveLotsaCalls+0×430
017fff8c 77c8872d rpcrt4!RecvLotsaCallsWrapper+0xd
017fffac 77c7b110 rpcrt4!BaseCachedThreadRoutine+0×9d
017fffb8 77e6482f rpcrt4!ThreadStartRoutine+0×1b
017fffec 00000000 kernel32!BaseThreadStart+0×34

ComponentA was also found loaded in ServiceB(1) user dump and in the ServiceB memory dump from the initial coupled pair where nothing was found before. The timestamp of that component was old enough (lmv command) to warrant more attention to it and contact its ISV.

- Dmitry Vostokov @ DumpAnalysis.org -

Critical section high contention and wait chains, blocked threads, and periodic error: memory dump and trace analysis pattern cooperation

Friday, October 9th, 2009

This is the first case study here that shows an interplay of memory dump analysis (DA) and software trace analysis (TA) patterns, what I call DATA analysis patterns (or DA+TA).  

It was reported that one process was blocking vital server functionality. After the process restart the problem was gone away. A complete memory dump was saved on the next occurrence and it revealed critical section wait chains in that process but no critical section deadlocks:

0: kd> .process /r /p 87f76020
Implicit process is now 87f76020
Loading User Symbols
[...]

0: kd> !cs -l -o -s
-----------------------------------------
DebugInfo          = 0x0016c6d8
Critical section   = 0×0032be30 (+0×32BE30)
LOCKED
LockCount          = 0×34
WaiterWoken        = No
OwningThread       = 0×00001c64
RecursionCount     = 0×1
LockSemaphore      = 0×624
SpinCount          = 0×00000000
OwningThread       = .thread 86396db0
ntdll!RtlpStackTraceDataBase is NULL. Probably the stack traces are not enabled.
[…]

The thread 86396db0 (TID 1c64) that blocked more than 50 threads (0×34) was blocked itself sleeping for more than 6 seconds:

0: kd> .thread 86396db0
Implicit thread is now 86396db0

0: kd> kL 100
  *** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr 
ae7f8c98 8083d5b1 nt!KiSwapContext+0x26
ae7f8cc4 8083cf69 nt!KiSwapThread+0x2e5
ae7f8d0c 8092b03f nt!KeDelayExecutionThread+0x2ab
ae7f8d54 80833bef nt!NtDelayExecution+0x84
ae7f8d54 7c82860c nt!KiFastCallEntry+0xfc
1020e8ac 7c826f69 ntdll!KiFastSystemCallRet
1020e8b0 77e41ed5 ntdll!NtDelayExecution+0xc
1020e918 77e424fd kernel32!SleepEx+0x68
1020e928 67739357 kernel32!Sleep+0xf
1020e944 6773c3a2 ComponentA!DB_Driver_Command+0xa7
[…]
1020ec64 67485393 ComponentB!DatabaseSearch+0×34
[…]
1020ffb8 77e6482f msvcrt!_endthreadex+0xa3
1020ffec 00000000 kernel32!BaseThreadStart+0×34

0: kd> kv
  *** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr  Args to Child             
[...]
1020e918 77e424fd 00001b00 00000000 1020e944 kernel32!SleepEx+0x68 (FPO: [SEH])
1020e928 67739357 00001b00 00000000 0032ac6c kernel32!Sleep+0xf (FPO: [1,0,0])
[…]

0: kd> ? 1b00 / 0n1000
Evaluate expression: 6 = 00000006

Critical section it owns shows high contention count too:

0: kd> dt -r1 _RTL_CRITICAL_SECTION   0x0032be30
ProcessA!_RTL_CRITICAL_SECTION
   +0x000 DebugInfo        : 0x0016c6d8 _RTL_CRITICAL_SECTION_DEBUG
      +0x000 Type             : 0
      +0x002 CreatorBackTraceIndex : 0
      +0x004 CriticalSection  : 0x0032be30 _RTL_CRITICAL_SECTION
      +0x008 ProcessLocksList : _LIST_ENTRY [ 0x16c708 - 0x16c6b8 ]
      +0x010 EntryCount       : 0
      +0×014 ContentionCount  : 0xac352
      +0×018 Spare            : [2] 0×43005c
   +0×004 LockCount        : -210
   +0×008 RecursionCount   : 1
   +0×00c OwningThread     : 0×00001c64
   +0×010 LockSemaphore    : 0×00000624
   +0×014 SpinCount        : 0

Fortunately, that process had ETW tracing capability and its software trace recorded before the complete memory dump was saved the following recurrent periodic errorfrom different threads that confirms our observation about the possible problem with a database and explains thread delays we see (> 6 seconds for Sleep):

#    PID  TID  Time         Message
[...]
1972 2780 5992 10:05:11.005 Error: [DB Driver] Not enough space on temp disk
1973 2780 5992 10:05:11.005 Execute DB command sleeps on error (retry 26)
[...]
4513 2780 3292 10:06:02.942 Error: [DB Driver] Not enough space on temp disk
4514 2780 3292 10:06:02.942 Execute DB command sleeps on error (retry 11)
4515 2780 3292 10:06:09.598 Error: [DB Driver] Not enough space on temp disk
4516 2780 3292 10:06:09.598 Execute DB command sleeps on error (retry 12)
[…]

- Dmitry Vostokov @ DumpAnalysis.org -

Windows Installer logs

Wednesday, October 7th, 2009

Software installation may fail: this is a fact (like Evolution). Therefore this is the domain of troubleshooting and debugging proper. Usually such problems are typically analyzed by reading Windows Installer MSI logs that are examples of software traces. The following book is on my desk now:

The Definitive Guide to Windows Installer

Buy from Amazon

- Dmitry Vostokov @ DumpAnalysis.org -

Dictionary of Debugging: Breakpoint

Tuesday, October 6th, 2009

Breakpoint

A code or a processor state modification to plan for a synchronous diversion to another execution path when some condition is met. Usually implemented by a special processor instruction inserted at the specified address or a special processor register that holds the specified condition to be met. If that condition is met or the special instruction is executed the processor interrupts a computational process (a debuggee) and transfers the execution to another computational process (a debugger) that can inspect the debuggee state.

No breakpoints (normal execution path, the yellow line represents a function call):

2 breakpoints (BP#1 is inside the function and BP#2 is at the entry of another function):

Synonyms:

Antonyms:

Also: virtual memory, software breakpoint, hardware breakpoint, processor breakpoint, data breakpoint, code breakpoint, exception, debugger event.

- Dmitry Vostokov @ DumpAnalysis.org -

Dictionary of Debugging: Virtual Memory

Monday, October 5th, 2009

Virtual Memory

A computational process view of its memory. Memory content is combined from the process host memory (may not be physical) and from a storage memory. Virtual memory usually has linear ordered addresses in the range [0, N] where some regions may be inaccessible:

Synonyms: virtual space

Antonyms:

Also: memory space, memory dump, memory region, user dump, kernel dump, complete dump, kernel spacephysical memory, user space, generalized kernel space.

- Dmitry Vostokov @ DumpAnalysis.org -

Sample Chapter from A.NET.Debugging Book

Friday, October 2nd, 2009

While doing Google search today found the site for the forthcoming Mario Hewardt’s new book Advanced .NET Debugging:

www.advanceddotnetdebugging.com

with 74-page sample chapter. Looking forward to reading this book.

- Dmitry Vostokov @ DumpAnalysis.org -