Archive for the ‘Troubleshooting Methodology’ Category

Forthcoming Memory Dump Analysis Anthology, Volume 4

Thursday, February 11th, 2010

This is a revised, edited, cross-referenced and thematically organized volume of selected DumpAnalysis.org blog posts about crash dump analysis and debugging written in July 2009 - January 2010 for software engineers developing and maintaining products on Windows platforms, quality assurance engineers testing software on Windows platforms and technical support and escalation engineers dealing with complex software issues. The fourth volume features:

- 13 new crash dump analysis patterns
- 13 new pattern interaction case studies
- 10 new trace analysis patterns
- 6 new Debugware patterns and case study
- Workaround patterns
- Updated checklist
- Fully cross-referenced with Volume 1, Volume 2 and Volume 3
- New appendixes

Product information:

  • Title: Memory Dump Analysis Anthology, Volume 4
  • Author: Dmitry Vostokov
  • Language: English
  • Product Dimensions: 22.86 x 15.24
  • Paperback: 410 pages
  • Publisher: Opentask (30 March 2010)
  • ISBN-13: 978-1-906717-86-5
  • Hardcover: 410 pages
  • Publisher: Opentask (30 April 2010)
  • ISBN-13: 978-1-906717-87-2

Back cover features memory space art image: Internal Process Combustion.

- Dmitry Vostokov @ DumpAnalysis.org + TraceAnalysis.org -

Workaround Patterns (Part 3)

Tuesday, January 26th, 2010

What happens when the Hidden Output and Frozen Process patterns don’t help with annoying popup windows? The former can’t prevent windows from reappearing, and the latter could block other coupled processes that exchange window messages with our suspended process or simply use any IPC mechanism. Here the Axed Code pattern can help, as demonstrated below. One process was frequently and briefly showing a network disconnection message box or dialog. The problem was that it also brought its main window to the foreground, disrupting work in other windows because they were losing focus. The next time the dialog appeared, we found its process ID in Task Manager and attached WinDbg to it. We weren’t sure which dialog function to intercept, so we put a general breakpoint on all “Dialog” functions for all threads:

0:000:x86> bm *Dialog*
[...]
  6: 73a8ba81 @!"MFC80!CDialog::~CDialog"
  7: 73ac25e2 @!"MFC80!CPageSetupDialog::~CPageSetupDialog"
  8: 73a94b6b @!"MFC80!CDHtmlDialog::_AfxSimpleScanf"
  9: 73a8fbe9 @!"MFC80!CFileDialog::OnTypeChange"
 10: 73a90b17 @!"MFC80!CColorDialog::GetRuntimeClass"
 11: 73a8bb4a @!"MFC80!CDialog::CreateIndirect"
[...]
360: 73a93750 @!"MFC80!CDHtmlDialog::OnNavigateComplete"
361: 73a8f1f3 @!"MFC80!CCommonDialog::OnOK"
362: 73a95d9f @!"MFC80!CDHtmlDialog::GetDropTarget"
363: 73a90266 @!"MFC80!CPrintDialog::GetDevMode"
364: 73ac1514 @!"MFC80!COleInsertDialog::COleInsertDialog"
365: 73ac27c7 @!"MFC80!COlePropertiesDialog::COlePropertiesDialog"
366: 73a75282 @!"MFC80!CWnd::UpdateDialogControls"
367: 73a7fd86 @!"MFC80!CDialogBar::SetOccDialogInfo"

0:000:x86> g
Breakpoint 314 hit
MFC80!_AfxPostInitDialog:
73a7134e 55              push    ebp

0:000:x86> kL 100
ChildEBP RetAddr  Args to Child             
0027ed2c 73a7180a MFC80!_AfxPostInitDialog
0027ed90 75628817 MFC80!_AfxActivationWndProc+0x90
0027edbc 7562898e USER32!InternalCallWinProc+0x23
0027ee34 7562c306 USER32!UserCallWinProcCheckWow+0x109
0027ee78 756375a2 USER32!SendMessageWorker+0x55b
0027ef4c 7563787a USER32!InternalCreateDialog+0xb64
0027ef70 75649b65 USER32!CreateDialogIndirectParamAorW+0x33
0027ef9c 75225192 USER32!CreateDialogParamA+0x4a
WARNING: Stack unwind information not available. Following frames may be wrong.
0027efc8 010c3bf1 DllA!WarningPopup+0x152
0027effc 73a71812 ProcessA+0x9fa1
00000000 00000000 MFC80!_AfxActivationWndProc+0x98

Now we cleared all breakpoints and put the new breakpoint on WarningPopup function:

0:000:x86> bc *

0:000:x86> bp DllA!WarningPopup

0:000:x86> g
Breakpoint 0 hit
DllA!WarningPopup:
75225040 51              push    ecx

Then we assumed that the function used the default C/C++ calling convention, __cdecl, where the caller cleans up the stack, so returning immediately would leave the stack balanced. We took the bold step of replacing the push ecx instruction with ret:

0:000:x86> a 75225040
75225040 ret
ret
75225041

0:000:x86> g
Breakpoint 0 hit
DllA!WarningPopup:
75225040 c3 ret

0:000:x86> bc *

0:000:x86> g

Result: no warning popups anymore.

I originally intended to name the pattern Patched Code but then realized that code axing can also be done at the source code level as a quick temporary fix.
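At the source level, the same axing could look like the minimal Python sketch below. The names `show_disconnect_warning` and `AXE_WARNING_POPUP` are hypothetical stand-ins for whatever function and switch a real product would use; the case above patched machine code in WinDbg, not Python.

```python
# Hypothetical switch: set to True to "axe" the popup as a temporary workaround.
AXE_WARNING_POPUP = True

shown = []  # records popups that were actually displayed (for illustration only)


def show_disconnect_warning(message):
    """Hypothetical stand-in for a function like DllA!WarningPopup."""
    if AXE_WARNING_POPUP:
        # Axed Code: return immediately, the source-level analogue of
        # overwriting the first instruction of the function with `ret`.
        return None
    shown.append(message)
    return message


show_disconnect_warning("Network connection lost")
print(shown)  # stays empty while the axe is in place
```

Like the binary patch, this removes the whole function body, so it only suits functions whose side effects and return value the callers can live without.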

- Dmitry Vostokov @ DumpAnalysis.org + TraceAnalysis.org -

Workaround Patterns (Part 2)

Monday, January 25th, 2010

Another workaround pattern for some problems is to freeze the process responsible for an annoying or excessive activity, as in the case study Debugger as a Shut Up Application. We can also use other tools for this purpose, like Mark Russinovich’s PsSuspend. A suitable name for this pattern is Frozen Process.

- Dmitry Vostokov @ DumpAnalysis.org + TraceAnalysis.org -

Workaround Patterns (Part 1)

Sunday, January 24th, 2010

After fighting HTML comments in Safari and Chrome (see the case study below), I came up with the idea of naming and cataloging workaround patterns in troubleshooting and debugging. The first one is called Hidden Output. Sometimes we can just remove message boxes that report minor problems and generate unnecessary support calls by hiding their windows, for example, by using CtxHideEx32. A different example is what I did today when troubleshooting the Amazon aStore widget HTML code. It worked well in IE8:

However, in Apple Safari and Google Chrome the widget code was visible at the top of the page:

 

After a few unsuccessful attempts to debug the problem, and faced with other pressing tasks, I had the idea to hide the visible code by making its color the same as its background:

<font color="D3E7F4"><script type="text/javascript"><!--
amazon_ad_tag="crasdumpanala-20";
amazon_ad_width="728";
amazon_ad_height="90";
amazon_color_background="D3E7F4";
amazon_color_border="0000FF";
amazon_color_logo="FFFFFF";
amazon_color_link="0000FF";
amazon_ad_logo="hide";
amazon_ad_link_target="new";
amazon_ad_border="hide";
amazon_ad_title="OpenTask Books, Magazines and Notebooks"; //--></script>
<script type="text/javascript" src="http://www.assoc-amazon.com/s/asw.js"></script></font>

 
After that the picture became nicer:

- Dmitry Vostokov @ DumpAnalysis.org + TraceAnalysis.org -

Memory Dump Analysis Anthology, Volume 3

Sunday, December 20th, 2009

“Memory dumps are facts.”

I’m very excited to announce that Volume 3 is available in paperback, hardcover and digital editions:

Memory Dump Analysis Anthology, Volume 3

Table of Contents

In two weeks the paperback edition should also appear on Amazon and in other bookstores. The Amazon hardcover edition is planned to be available in January 2010.

The amount of information was so voluminous that I had to split the originally planned volume into two. Volume 4 should appear by the middle of February together with Color Supplement for Volumes 1-4. 

- Dmitry Vostokov @ DumpAnalysis.org -

Debugged! MZ/PE September issue is out

Wednesday, December 16th, 2009

Finally, after the long delay, the issue is available in print on Amazon and through other sellers:

Debugged! MZ/PE: Software Tracing

Buy from Amazon

- Dmitry Vostokov @ DumpAnalysis.org -

The Law of Simple Tools

Wednesday, December 9th, 2009

In its simplest form the first law of troubleshooting and debugging states that:

The more frequent a problem is, the simpler the tool needed to resolve and fix it.

- Dmitry Vostokov @ DumpAnalysis.org -

First Fault Software Problem Solving Book

Wednesday, December 9th, 2009

I’m very pleased to announce that Dan Skwire’s unique book has been published by OpenTask:

First Fault Software Problem Solving: A Guide for Engineers, Managers and Users

 

- Dmitry Vostokov @ DumpAnalysis.org -

Crash Dump Analysis Patterns (Part 92)

Tuesday, November 24th, 2009

Sometimes the functionality of a system depends upon a specific application or service process. For example, in a database server environment it might be a database process, in a printing environment it is a print spooler process, and in a terminal services environment it is a terminal services process (termsvc, hosted by svchost.exe). In system failure scenarios we should check these processes for their presence (and also the presence of any coupled processes), hence the name of this pattern: Missing Process. However, if the vital process is present, we should check whether it has exited while references to it still exist, whether there are any missing threads or components inside it, and whether there are any suspended threads or special processes like a postmortem debugger. We also shouldn’t forget about service dependencies and their relevant process startup order. For example, we know that our service is hosted by svchost.exe and we see one such process that has exited but whose object is still referenced somewhere:

0: kd> !vm

*** Virtual Memory Usage ***
[...]
         0ed8 svchost.exe          0 (         0 Kb)
[...]

However, another command shows that it could be a different service hosted by the same image, svchost.exe, if we know that ServiceA depends on our service:

0: kd> !process 0 0
**** NT ACTIVE PROCESS DUMP ****
PROCESS 8b581818  SessionId: none  Cid: 0004    Peb: 00000000  ParentCid: 0000
    DirBase: bff4d020  ObjectTable: e1001e18  HandleCount: 1601.
    Image: System

PROCESS 8b06d778  SessionId: none  Cid: 01a8    Peb: 7ffde000  ParentCid: 0004
    DirBase: bff4d040  ObjectTable: e13eae40  HandleCount:  22.
    Image: smss.exe

[...]

PROCESS 8aabed88  SessionId: 0  Cid: 0854    Peb: 7ffd6000  ParentCid: 0220
    DirBase: bff4d4a0  ObjectTable: e1c867a8  HandleCount: 778.
    Image: ServiceA.exe

[...]

PROCESS 8aaa6510  SessionId: 0  Cid: 0ed8    Peb: 7ffd4000  ParentCid: 0220
    DirBase: bff4d580  ObjectTable: 00000000  HandleCount:   0.
    Image: svchost.exe

[...]

Another alternative is that our service was restarted but then exited. If our process is not visible at all, it may have been stopped or may simply have crashed earlier.
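When the process list is long, the check for exited-but-referenced processes can be scripted over saved debugger output. The following is a minimal sketch, assuming the `!process 0 0` text has the shape quoted above; treating a zero ObjectTable as the sign of an exited process is an assumption drawn from that listing, and the function name `exited_but_referenced` is hypothetical.

```python
import re

# Sample text shaped like the `!process 0 0` output quoted above.
SAMPLE = """\
PROCESS 8aabed88  SessionId: 0  Cid: 0854    Peb: 7ffd6000  ParentCid: 0220
    DirBase: bff4d4a0  ObjectTable: e1c867a8  HandleCount: 778.
    Image: ServiceA.exe

PROCESS 8aaa6510  SessionId: 0  Cid: 0ed8    Peb: 7ffd4000  ParentCid: 0220
    DirBase: bff4d580  ObjectTable: 00000000  HandleCount:   0.
    Image: svchost.exe
"""

# One match per process entry: address, Cid, ObjectTable and image name.
ENTRY = re.compile(
    r"PROCESS\s+(?P<eprocess>[0-9a-f]+).*?Cid:\s+(?P<cid>[0-9a-f]+)"
    r".*?ObjectTable:\s+(?P<table>[0-9a-f]+)"
    r".*?Image:\s+(?P<image>\S+)",
    re.DOTALL,
)


def exited_but_referenced(text):
    """Return (cid, image) pairs whose ObjectTable field is zero."""
    return [
        (m.group("cid"), m.group("image"))
        for m in ENTRY.finditer(text)
        if int(m.group("table"), 16) == 0
    ]


print(exited_but_referenced(SAMPLE))
```

The Cid values returned this way can then be compared with the process IDs in the `!vm` output to confirm which svchost.exe instance is the one of interest.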

- Dmitry Vostokov @ DumpAnalysis.org -

There Ought to be a Planet at that Location!

Thursday, October 22nd, 2009

One ETW trace pointed to a set of intermittent symptoms (messages were simplified for this post):

#        PID        TID        Message 
[...]
31278    2300       7060       RequestXMLData entry
31281    2300       7060       RequestXMLData: XML error     
[...]

Searching for past issues with this error pointed only to a case with a mixed product environment, where some servers had product version X and other servers had product version X+1. However, in the new case the customer claimed that he had only product version X+1 on all production servers. We insisted and, after closer inspection, servers with product version X were found…

- Dmitry Vostokov @ TraceAnalysis.org -

Can Software Tweet?

Monday, September 28th, 2009

Every PID has its twitter account. Processes emit short trace messages (STM) and others subscribe to them. This is the technical support of the future, the concept of SoftWeet (*):

www.SoftWeet.com

(*) to weet

to know; to wit (archaic)

- Dmitry Vostokov @ DumpAnalysis.org -

Forthcoming Memory Dump Analysis Anthology, Volume 3

Saturday, September 26th, 2009

This is a revised, edited, cross-referenced and thematically organized volume of selected DumpAnalysis.org blog posts about crash dump analysis and debugging written in October 2008 - June 2009 for software engineers developing and maintaining products on Windows platforms, quality assurance engineers testing software on Windows platforms and technical support and escalation engineers dealing with complex software issues. The third volume features:

- 15 new crash dump analysis patterns
- 29 new pattern interaction case studies
- Trace analysis patterns
- Updated checklist
- Fully cross-referenced with Volume 1 and Volume 2
- New appendixes

Product information:

  • Title: Memory Dump Analysis Anthology, Volume 3
  • Author: Dmitry Vostokov
  • Language: English
  • Product Dimensions: 22.86 x 15.24
  • Paperback: 404 pages
  • Publisher: Opentask (20 December 2009)
  • ISBN-13: 978-1-906717-43-8
  • Hardcover: 404 pages
  • Publisher: Opentask (30 January 2010)
  • ISBN-13: 978-1-906717-44-5

Back cover features 3D computer memory visualization image.

- Dmitry Vostokov @ DumpAnalysis.org -

DebugWare Patterns (Part 9)

Thursday, September 24th, 2009

Real troubleshooting is usually done by combining several units of work chosen from a manual. The Checklist pattern summarizes this recurrent practice. A Checklist Coordinator orchestrates troubleshooting unit of work (TUW) components from a TUW Repository according to checklists from a Checklist Repository (in the simplest case there can be just one checklist). This is illustrated in the following UML component diagram:
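In code, the same arrangement might look like the minimal Python sketch below. The repository contents, the checklist name and the function `run_checklist` are all hypothetical; the sketch only shows how a coordinator could pull units of work from one repository in the order given by a checklist from another.

```python
# Hypothetical TUW Repository: unit name -> callable returning a finding.
tuw_repository = {
    "check_service_running": lambda: "service: running",
    "check_account_rights": lambda: "account rights: OK",
    "check_disk_space": lambda: "disk space: OK",
}

# Hypothetical Checklist Repository: checklist name -> ordered unit names.
checklist_repository = {
    "printing": ["check_service_running", "check_account_rights"],
}


def run_checklist(name):
    """Checklist Coordinator: run each unit of the named checklist in order."""
    return [tuw_repository[unit]() for unit in checklist_repository[name]]


print(run_checklist("printing"))
```

Keeping the two repositories separate means a new checklist can reuse existing units without any code changes to the units themselves.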

- Dmitry Vostokov @ DumpAnalysis.org -

DebugWare Patterns (Part 8)

Monday, September 21st, 2009

Troubleshooting Unit of Work is another pattern frequently used in manual troubleshooting and debugging. It is usually an independent and self-sufficient list of steps performed to check something from a troubleshooting checklist or a manual, and it can be implemented as a separate loadable module, a reusable class or a function to call. Output from such units of work can be stored in a blackboard system or processed by tools implementing the Checklist DebugWare pattern. A typical example is an implementation of the following document:

Required Permissions and Rights for the Ctx_CpsvcUser Account

as a tool:

CTX_CpsvcUser Re-creation Tool for 32-Bit and 64-Bit Versions of Presentation Server 4.5
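As a rough illustration of the "function to call" form, here is a minimal Python sketch of one unit of work that writes its result to a blackboard. The function `check_account_exists`, the plain-dictionary blackboard and the sample account set are assumptions made for the example; the tool mentioned above is not implemented this way as far as this post says.

```python
from typing import Dict, Set

# The blackboard here is just a shared dictionary that units of work write to.
Blackboard = Dict[str, str]


def check_account_exists(accounts: Set[str], blackboard: Blackboard) -> None:
    """Hypothetical TUW: one self-sufficient check that records its result."""
    found = "Ctx_CpsvcUser" in accounts
    blackboard["account_exists"] = "yes" if found else "no"


blackboard: Blackboard = {}
check_account_exists({"Administrator", "Guest"}, blackboard)
print(blackboard)
```

Because the unit only reads its input and writes to the blackboard, other tools can process the recorded result later without calling the check again.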

- Dmitry Vostokov @ DumpAnalysis.org -

DebugWare Patterns (Part 7)

Thursday, September 10th, 2009

The Trace Expert pattern came to my mind when I was writing about software trace patterns. It is a very lightweight expert system relying on a trace collector and a trace formatter (patterns to be written about soon). It is a module that takes a preformatted software trace message file or buffer and a set of built-in rules and uses simple search (perhaps involving regular expressions) to dig out diagnostic information and provide troubleshooting and debugging directions.

This module is schematically depicted in the following UML component diagram:
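A minimal Python sketch of such a module is shown below. The rule set, the advice strings and the function name `trace_expert` are hypothetical and only illustrate the idea of matching built-in regular-expression rules against already formatted trace lines.

```python
import re

# Hypothetical built-in rules: (regular expression, troubleshooting direction).
RULES = [
    (re.compile(r"XML error"), "Check for mixed product versions across servers."),
    (re.compile(r"LookupAccountSid failed"), "Check account lookup configuration."),
]


def trace_expert(lines):
    """Return (line number, advice) for each formatted trace line a rule matches."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for pattern, advice in RULES:
            if pattern.search(line):
                findings.append((number, advice))
    return findings


trace = [
    "31278 2300 7060 RequestXMLData entry",
    "31281 2300 7060 RequestXMLData: XML error",
]
print(trace_expert(trace))
```

Keeping the rules as data rather than code means the rule set can grow as new known issues are cataloged.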

- Dmitry Vostokov @ DumpAnalysis.org -

Debugging Expert Magazine Online (DEMO)

Wednesday, September 9th, 2009

I’m very pleased to announce the free online version of Debugged! MZ/PE magazine under the code name DEMO launched last night:

Debugging Expert Magazine Online (www.DebuggingExpert.com)

- Dmitry Vostokov @ DumpAnalysis.org -

Metaphorical Bijectionism: A Method of Inquiry

Monday, September 7th, 2009

Consider this example mapping (taken metaphorically from the mathematical notion of an injection) of one domain of knowledge to another:

This mapping between concepts and ideas was once called “bijectivism” but was trivially described either as a one-to-one mapping between two domains (like physical vs. mathematical) or as fusing different concepts together to get another emerging concept. I myself proposed a similar mapping and called it a metaphorical bijection.

Now consider another mapping metaphorically equivalent to a mathematical notion of a surjection where all constituents of the second domain are covered metaphorically by the first domain:

What we strive for is to establish the complete bijective mapping and reorganize our knowledge of both domains to achieve that:

In the diagrams above, small boxes can represent sets of ideas, methods, etc. or individual ideas, methods, etc. The established metaphorical bijection can divide sets or combine them if needed. There can be several such bijections, of course, and we can use other methods of inquiry (for example, the scientific method) to choose between competing metaphorical bijections.
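The underlying mathematical notions can be checked mechanically for a finite mapping. The Python sketch below is only an illustration with hypothetical placeholder items (`a1`, `b1`, and so on); it tests a dictionary for the injection and surjection properties and treats a bijection as having both.

```python
def is_injection(mapping):
    """No two items of the first domain map to the same item of the second."""
    targets = list(mapping.values())
    return len(targets) == len(set(targets))


def is_surjection(mapping, second_domain):
    """Every item of the second domain is covered by the mapping."""
    return set(second_domain) <= set(mapping.values())


def is_bijection(mapping, second_domain):
    """A bijection is both an injection and a surjection."""
    return is_injection(mapping) and is_surjection(mapping, second_domain)


second_domain = {"b1", "b2", "b3"}
partial = {"a1": "b1", "a2": "b2"}                # injection only: b3 is uncovered
complete = {"a1": "b1", "a2": "b2", "a3": "b3"}   # injection and surjection

print(is_bijection(partial, second_domain), is_bijection(complete, second_domain))
```

In these terms, reorganizing knowledge of both domains corresponds to adding or regrouping items until both checks pass.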

Useful mnemonic:

BEIS (B=I+S or to BE IS …)

Bijectionism Equals Injection + Surjection

Another mnemonic:

BET (B=T or to BE Transformation…)

Bijectionism Equals Transformation 

Note also the second letter of Alef-Beis or Alef-Bet, the letter of Light that has interpretation of Creation in Biblical Hebrew.   

More on this later as I need to come back to DebugWare patterns.

- Dmitry Vostokov @ DumpAnalysis.org -

Epistemic Troubleshooting and Debugging (Part 1)

Sunday, July 26th, 2009

Paraphrasing “Knowing about knowing about knowing” (Side-box 0.1, Consciousness, David Rose) as “Knowing about knowing about problem solving”, I would suggest the following references to raise the level of awareness from meta-troubleshooting and meta-debugging, the subject of various general purpose debugging books, to the next epistemic level. I’m currently reading the following books and will let you know about my progress along the journey:

Toward a Unified Theory of Problem Solving: Views From the Content Domains

Buy from Amazon

The Psychology of Problem Solving

Buy from Amazon

The Cambridge Handbook of Expertise and Expert Performance

Buy from Amazon

- Dmitry Vostokov @ DumpAnalysis.org -

Debugged! MZ/PE June issue is out

Thursday, July 23rd, 2009

Finally the issue is available on Amazon and through other sellers:

Debugged! MZ/PE: Modeling Software Defects

Buy from Amazon

I’m now planning the September issue and will post details later.

- Dmitry Vostokov @ DumpAnalysis.org -

Trace Analysis Patterns (Part 5)

Wednesday, July 22nd, 2009

Sometimes we have several disjoint Periodic Errors and possible false positives. We wonder where we should start or how to assign relative priorities to troubleshooting suggestions. Here the Statement Density and Current pattern can help. The statement (or message) density is simply the ratio of the number of occurrences of a specific trace statement (message) in the trace to the total number of all recorded messages, whatever their kind.

Consider this software trace with two frequent messages:

N     PID  TID
21    5928 8092 LookupAccountSid failed. Result = -2146238462
[...]
1013  5928 1340 SQL execution needs a retry. Result = 0

We have approx. 7,500 statements for the former and approx. 1,250 statements for the latter. The total number of trace statements is 185,700, so we have the corresponding approx. trace densities: 0.04 and 0.0067. Their relative ratio 7,500 / 1,250 is 6.

Another trace for the same problem was collected at a different time with the same errors. It has 71,100 statements, of which only 160 and 27 are the messages above. Their ratio, 160 / 27, is approximately the same, 5.93, which suggests that the messages are correlated. However, the statement density is much lower, approx. 0.002 and 0.00038, and this suggests taking a closer look at the second trace to see whether these problems started some time after the start of the recording.
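The arithmetic for both traces can be reproduced with a few lines of Python. This is a minimal sketch using the approximate counts quoted above; the dictionary layout and the function name `density` are just conveniences for the example.

```python
def density(occurrences, total):
    """Statement density: occurrences of one message over all recorded messages."""
    return occurrences / total


# Approximate counts from the two traces discussed above.
trace1 = {"total": 185_700, "first": 7_500, "second": 1_250}
trace2 = {"total": 71_100, "first": 160, "second": 27}

for name, t in (("Trace 1", trace1), ("Trace 2", trace2)):
    d_first = density(t["first"], t["total"])
    d_second = density(t["second"], t["total"])
    relative = t["first"] / t["second"]
    print(f"{name}: {d_first:.5f} / {d_second:.5f}, relative {relative:.2f}")
```

Note that the relative density does not depend on the total, which is why it stays close to 6 in both traces even though the individual densities differ by more than an order of magnitude.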

We can also check the statement current as the number of messages per unit of time. The first trace was recorded over the period of 195 seconds and the second over the period of 650 seconds. Therefore, we have 952 msg/s and 109 msg/s respectively. This suggests that the problem might have started at some time during the second trace or there were more modules selected for the first trace. To make sure, we adjust the total number of messages for these two traces. We find the first occurrence of the error and subtract its message number from the total number of messages. For our first trace we see that messages start from the very beginning, and in our second trace they also almost start from the beginning. So such adjustment shouldn’t give much better results here. Also these statements continue to be recorded till the very end of these traces.
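The current and the adjustment step can be sketched in Python as follows. The function names are hypothetical, and using message number 21 as the first occurrence of the error in the first trace is an assumption based on the excerpt shown earlier.

```python
def current(total_messages, seconds):
    """Statement current: recorded messages per second."""
    return total_messages / seconds


def adjusted_total(total_messages, first_error_number):
    """Messages remaining after subtracting the first error's message number."""
    return total_messages - first_error_number


print(round(current(185_700, 195)))  # first trace, msg/s
print(round(current(71_100, 650)))   # second trace, msg/s

# Assumption: the first error in the first trace is message number 21.
adjusted = adjusted_total(185_700, 21)
print(adjusted, round(7_500 / adjusted, 4))
```

With the error appearing this early, the adjusted density stays at about 0.04, which matches the observation that the adjustment changes little here.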

To avoid being lost in this discussion, I repeat the main results:

           Density             Relative Density   Current,
                                                  all msg/s
Trace 1    0.04 / 0.0067       6                  952
Trace 2    0.002 / 0.00038     5.93               109

The possibility that much more was traced, resulting in lower density for the second trace, should be discarded because we have a much lower current. Perhaps the environment was not quite the same for the second tracing. However, the same relative density for two different errors suggests that they are correlated, and the higher density of the first error suggests that we should start our investigation from it.

I came up with this statistical trace analysis pattern because two different engineers analyzed the same trace and each suggested a different troubleshooting path based on selected error messages from software traces. So I did a statistical analysis to prioritize their suggestions.

- Dmitry Vostokov @ TraceAnalysis.org -