Archive for the ‘Software Trace Analysis’ Category

Trace Analysis Patterns (Part 6)

Wednesday, July 29th, 2009

Analysis of software traces often starts with a search for short textual patterns, such as a failure or exception code, or simply the word “exception”. Indeed, some software components are able to record their own exceptions, or exceptions that were propagated to them, including full stack traces. This is common in .NET and Java environments. Here is a synthetic and beautified example based on real software traces:

N      PID  TID  Message
[...]
265799 8984 4216 ComponentA.Store.GetData threw exception: 'System.Reflection.TargetInvocationException: DCOM connection to server failed with error: 'Exception from HRESULT: 0x842D0001' ---> System.Runtime.InteropServices.COMException (0x842D0001): Exception from HRESULT: 0x842D0001
   at ComponentA.GetData(Byte[] pKey)
   at System.RuntimeMethodHandle._InvokeMethodFast(Object target, Object[] arguments, SignatureStruct& sig, MethodAttributes methodAttributes, RuntimeTypeHandle typeOwner)
   at System.RuntimeMethodHandle.InvokeMethodFast(Object target, Object[] arguments, Signature sig, MethodAttributes methodAttributes, RuntimeTypeHandle typeOwner)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean skipVisibilityChecks)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at ComponentB.Connections.ComInterfaceProxy.Invoke(IMessage message)'
265800 8984 4216 === Begin Exception Dump === 
265801 8984 4216 ComponentB.Exceptions.ConnectionException: DCOM connection to server failed with error: 'Exception from HRESULT: 0x842D0001' ---> System.Runtime.InteropServices.COMException (0x842D0001): Exception from HRESULT: 0x842D0001
265802 8984 4216 at ComponentA.Store.GetData(Byte[] pKey) 
[…]
265808 8984 4216 Exception rethrown at [0]:
265809 8984 4216 at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg) 
265810 8984 4216 at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type) 
265811 8984 4216 at ComponentA.Store.GetData(Byte[] pKey) 
265812 8984 4216 at ComponentA.App.EnumBusinessObjects()

[…]
265816 8984 4216 ===> InnerException: 
265817 8984 4216 ** COM Exception Error Code: 0x842d0001
265818 8984 4216 System.Runtime.InteropServices.COMException (0x842D0001): Exception from HRESULT: 0x842D0001
265819 8984 4216 at ComponentA.Store.GetData(Byte[] pKey) 
265820 8984 4216 === End Exception Dump === 
[…]

From the embedded stack trace we can see that the App object was enumerating business objects and asked the Store object to get some data; the latter was probably communicating with the real data store via DCOM, and that communication attempt failed with HRESULT 0x842D0001.
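Such delimited dumps are easy to harvest mechanically. Here is a minimal Python sketch that collects the text between the Begin / End markers shown above; the trace file name is hypothetical:

def extract_exception_dumps(path):
    # Collect the lines between the Begin/End markers of each exception dump.
    # The marker strings come from the trace fragment above.
    dumps, current = [], None
    with open(path, errors="replace") as trace:
        for line in trace:
            if "=== Begin Exception Dump ===" in line:
                current = [line.rstrip()]
            elif "=== End Exception Dump ===" in line and current is not None:
                current.append(line.rstrip())
                dumps.append("\n".join(current))
                current = None
            elif current is not None:
                current.append(line.rstrip())
    return dumps

for dump in extract_exception_dumps("trace.txt"):
    print(dump, end="\n\n")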

- Dmitry Vostokov @ TraceAnalysis.org -

Trace Analysis Patterns (Part 5)

Wednesday, July 22nd, 2009

Sometimes we have several disjoint Periodic Errors and possible false positives, and we wonder where we should start, or how to assign relative priorities to troubleshooting suggestions. Here the Statement Density and Current pattern can help. The statement (message) density is simply the ratio of the number of occurrences of a specific trace statement (message) to the total number of all recorded messages in the trace.

Consider this software trace with two frequent messages:

N     PID  TID
21    5928 8092 LookupAccountSid failed. Result = -2146238462
[...]
1013  5928 1340 SQL execution needs a retry. Result = 0

We have approx. 7,500 statements for the former and approx. 1,250 statements for the latter. The total number of trace statements is 185,700, so the corresponding approximate statement densities are 0.04 and 0.0067. Their relative ratio, 7,500 / 1,250, is 6.

Another trace for the same problem was collected at a different time with the same errors. It has 71,100 statements, with only 160 and 27 statements counted for the messages above. The ratio 160 / 27 is approximately the same, 5.93, which suggests that the messages are correlated. However, the statement densities are much lower, approximately 0.002 and 0.00038, which suggests taking a closer look at the second trace to see whether these problems started some time after the recording began.

We can also check the statement current: the number of messages per unit of time. The first trace was recorded over a period of 195 seconds and the second over a period of 650 seconds. Therefore, we have 952 msg/s and 109 msg/s respectively. This suggests that the problem might have started some time into the second trace, or that more modules were selected for the first trace. To make sure, we can adjust the total number of messages for these two traces: find the first occurrence of the error and subtract its message number from the total number of messages. In our first trace the messages start from the very beginning, and in our second trace they also start almost from the beginning, so such an adjustment wouldn't change the results much here. These statements also continue to be recorded until the very end of both traces.
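Calculations like these are easy to script. Here is a minimal Python sketch of the arithmetic, assuming a plain-text trace with one message per line; the file name is hypothetical and the substrings mirror the example above:

def density_and_current(path, substrings, duration_seconds):
    # One pass: count all messages and the occurrences of each substring.
    total = 0
    counts = {s: 0 for s in substrings}
    with open(path, errors="replace") as trace:
        for line in trace:
            total += 1
            for s in substrings:
                if s in line:
                    counts[s] += 1
    densities = {s: counts[s] / total for s in substrings}
    current = total / duration_seconds  # all messages per second
    return densities, current

# Trace 1: 185,700 statements recorded over 195 seconds.
densities, current = density_and_current(
    "trace1.txt",
    ["LookupAccountSid failed", "SQL execution needs a retry"],
    duration_seconds=195)
print(densities, current)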

To avoid getting lost in this discussion, I repeat the main results:

           Density             Relative Density   Current,
           (error 1 / error 2)                    all msg/s
Trace 1    0.04 / 0.0067       6                  952
Trace 2    0.002 / 0.00038     5.93               109

The possibility that much more was traced, resulting in the lower density for the second trace, should be discarded because we also have a much lower current. Perhaps the environment was not quite the same for the second tracing session. However, the same relative density for the two different errors suggests that they are correlated, and the higher density of the first error suggests that we should start our investigation with it.

The reason I came up with this statistical trace analysis pattern is that two different engineers analyzed the same trace and suggested different troubleshooting paths based on the error messages they each selected from it. So I did a statistical analysis to prioritize their suggestions.

- Dmitry Vostokov @ TraceAnalysis.org -

Dump and Trace Analysis on Facebook

Friday, June 19th, 2009

The following group has been created on Facebook:

DATA (Dump Analysis + Trace Analysis)

Please don't hesitate to join and spread the news about it :-)

- Dmitry Vostokov @ DumpAnalysis.org -

Trace Analysis Patterns (Part 4)

Tuesday, June 16th, 2009

Sometimes we see a functional activity in a trace and/or see basic facts. Then we might want to find a correlation between that activity or those facts and another part of the trace. If that intra-correlation fits our problem description, we may claim a possible explanation or, if we are lucky, we may have just found an inference to the best explanation, as philosophers of science like to say. Here is an example, this time using the Citrix WindowHistory tracing tool. A third-party application was frequently losing focus, and the suspicion was on a terminal services client process. It was found that the following WindowHistory trace fragment corresponds to that application:

Handle: 00050586 Class: "Application A Class" Title: ""
     Title changed at 15:52:4:3 to "Application A"
     Title changed at 15:52:10:212 to "Application A - File1"
[...]
   Process ID: 89c
   Thread ID: d6c
[...]
   Visible: true
   Window placement command: SW_SHOWNORMAL
        Placement changed at 15:54:57:506 to SW_SHOWMINIMIZED
        Placement changed at 15:55:2:139 to SW_SHOWNORMAL
   Foreground: false
        Foreground changed at 15:52:4:3 to true
        Foreground changed at 15:53:4:625 to false
        Foreground changed at 15:53:42:564 to true
        Foreground changed at 15:53:44:498 to false
        Foreground changed at 15:53:44:498 to true
        Foreground changed at 15:53:44:592 to false
        Foreground changed at 15:53:45:887 to true
        Foreground changed at 15:53:47:244 to false
        Foreground changed at 15:53:47:244 to true
        Foreground changed at 15:53:47:353 to false
        Foreground changed at 15:54:26:416 to true
        Foreground changed at 15:54:27:55 to false
        Foreground changed at 15:54:27:55 to true
        Foreground changed at 15:54:27:180 to false
        Foreground changed at 15:54:28:428 to true
        Foreground changed at 15:54:28:771 to false
        Foreground changed at 15:54:28:865 to true
        Foreground changed at 15:54:29:99 to false
        Foreground changed at 15:54:30:877 to true
        Foreground changed at 15:54:57:521 to false
        Foreground changed at 15:55:2:76 to true
        Foreground changed at 15:57:3:378 to false
        Foreground changed at 15:57:11:396 to true
        Foreground changed at 15:57:29:601 to false
        Foreground changed at 15:57:39:803 to true
        Foreground changed at 15:58:54:41 to false
        Foreground changed at 15:59:8:96 to true
        Foreground changed at 16:1:19:478 to false
        Foreground changed at 16:1:27:527 to true
        Foreground changed at 16:1:39:914 to false
        Foreground changed at 16:2:0:515 to true
        Foreground changed at 16:7:14:628 to false
        Foreground changed at 16:7:24:246 to true
        Foreground changed at 16:9:53:523 to false
        Foreground changed at 16:10:15:919 to true
        Foreground changed at 16:10:31:426 to false
        Foreground changed at 16:11:12:818 to true
        Foreground changed at 16:11:59:538 to false
        Foreground changed at 16:12:39:456 to true
        Foreground changed at 16:13:6:364 to false

The corresponding terminal services client window trace fragment doesn't have any foreground changes, but the main window of another application has lots of them:

Handle: 000D0540 Class: "Application B Class" Title: "Application B"
[...]
   Process ID: 3ac
   Thread ID: bd4
[...]
   Foreground: false
        Foreground changed at 15:50:36:972 to true
        Foreground changed at 15:50:53:732 to false
        Foreground changed at 15:50:53:732 to true
        Foreground changed at 15:50:53:826 to false
        Foreground changed at 15:51:51:352 to true
        Foreground changed at 15:51:53:941 to false
        Foreground changed at 15:53:8:135 to true
        Foreground changed at 15:53:8:182 to false
        Foreground changed at 15:53:10:178 to true
        Foreground changed at 15:53:13:938 to false
        Foreground changed at 15:53:30:443 to true
        Foreground changed at 15:53:31:20 to false
        Foreground changed at 15:53:31:20 to true
        Foreground changed at 15:53:31:129 to false
        Foreground changed at 15:53:34:78 to true
        Foreground changed at 15:53:34:795 to false
        Foreground changed at 15:53:34:795 to true
        Foreground changed at 15:53:34:873 to false
        Foreground changed at 15:53:36:901 to true
        Foreground changed at 15:53:42:502 to false
        Foreground changed at 15:53:42:502 to true
        Foreground changed at 15:53:42:564 to false
        Foreground changed at 15:57:3:425 to true
        Foreground changed at 15:57:4:595 to false
        Foreground changed at 15:57:10:507 to true
        Foreground changed at 15:57:11:318 to false
        Foreground changed at 15:57:29:632 to true
        Foreground changed at 15:57:31:67 to false
        Foreground changed at 15:57:32:721 to true
        Foreground changed at 15:57:33:844 to false
        Foreground changed at 15:58:54:88 to true
        Foreground changed at 15:58:56:178 to false
        Foreground changed at 15:59:6:505 to true
        Foreground changed at 15:59:7:987 to false
        Foreground changed at 16:1:19:525 to true
        Foreground changed at 16:1:19:961 to false
        Foreground changed at 16:1:26:607 to true
        Foreground changed at 16:1:27:434 to false
        Foreground changed at 16:1:39:914 to true
        Foreground changed at 16:1:39:992 to false
        Foreground changed at 16:1:49:798 to true
        Foreground changed at 16:2:0:437 to false
        Foreground changed at 16:7:14:628 to true
        Foreground changed at 16:7:14:847 to false
        Foreground changed at 16:7:18:76 to true
        Foreground changed at 16:7:24:106 to false
        Foreground changed at 16:9:58:790 to true
        Foreground changed at 16:10:4:16 to false
        Foreground changed at 16:10:4:874 to true
        Foreground changed at 16:10:4:890 to false
        Foreground changed at 16:10:8:634 to true
        Foreground changed at 16:10:15:779 to false
        Foreground changed at 16:10:56:766 to true
        Foreground changed at 16:10:59:402 to false
        Foreground changed at 16:10:59:652 to true
        Foreground changed at 16:10:59:667 to false
        Foreground changed at 16:12:9:397 to true
        Foreground changed at 16:12:39:347 to false
        Foreground changed at 16:13:18:375 to true
        Foreground changed at 16:14:33:656 to false

We can see that most of the time when the Application A window loses focus, the Application B window gains it.
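Such intra-correlation can also be checked mechanically: for every moment the Application A window loses the foreground, look for the Application B window gaining it within a small tolerance. A minimal Python sketch, with three timestamp pairs taken from the fragments above:

def to_ms(stamp):
    # "15:57:3:378" -> milliseconds since midnight (h:m:s:ms, as in the tool output)
    h, m, s, ms = (int(part) for part in stamp.split(":"))
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def correlated_switches(losses_a, gains_b, tolerance_ms=100):
    # Count how often window B gained the foreground within
    # tolerance_ms of window A losing it.
    return sum(1 for t in losses_a
               if any(abs(t - g) <= tolerance_ms for g in gains_b))

losses_a = [to_ms(t) for t in ("15:57:3:378", "16:1:19:478", "16:7:14:628")]
gains_b = [to_ms(t) for t in ("15:57:3:425", "16:1:19:525", "16:7:14:628")]
print(correlated_switches(losses_a, gains_b))  # -> 3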

- Dmitry Vostokov @ TraceAnalysis.org -

Software Narratology: A Definition

Friday, June 12th, 2009

Let's define software or computer narratology as the application of the theory and methods of literary narratology to the domain of software execution, where software traces and logs are considered a narrative, the story of a computation (*). As an example, we have the following correspondence between the four tiers of the literary and software narrative models:

Events            Instructions, statements, functions
History           Concrete execution path
Narrative         Software trace
Presentation      View (from a trace viewer)

(*) Please do not confuse software narratology with computational narratology.

- Dmitry Vostokov @ TraceAnalysis.org -

Welcome to TraceAnalysis.org!

Wednesday, June 3rd, 2009

DumpAnalysis.org has acquired TraceAnalysis.org to complete computer DATA artifact analysis. The domain currently points to the Dump Analysis Portal page, but this might change in the future.

- Dmitry Vostokov @ DumpAnalysis.org -

Trace Analysis Patterns (Part 3)

Wednesday, June 3rd, 2009

The next obvious structural trace analysis pattern is called Circular Trace. Sometimes the fact that a trace was recorded into a circular buffer is missing from the problem description, or the trace metadata doesn't reflect it. Circular traces can then be detected by their file size (usually large) and by timestamps, as in this 100 MB CDF trace snippet:

No     Module  PID  TID  Date      Time         Statement
[Begin of trace listing]
1      ModuleA 4280 1736 5/28/2009 08:53:50.496 [... Trace statement 1]
2      ModuleB 6212 6216 5/28/2009 08:53:52.876 [... Trace statement 2] 
3      ModuleA 4280 4776 5/28/2009 08:54:13.537 [... Trace statement 3] 
[... Some traced exceptions helpful for analysis ...]
3799   ModuleA 4280 3776 5/28/2009 09:15:00.853 [… Trace statement 3799]
3800   ModuleA 4280 1736 5/27/2009 09:42:12.029 [… Trace statement 3800]
[… Skipped …]
[… Skipped …]
[… Skipped …]
579210 ModuleA 4280 4776 5/28/2009 08:53:35.989 [… Trace statement 579210]
[End of trace listing]

In such traces the analysis region can usually be found at the beginning of the trace, because as soon as an elusive and hard-to-reproduce problem happens, tracing is stopped.
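Because a wrap-around shows up as a backward jump in otherwise increasing timestamps, a circular trace can be detected even without metadata. A minimal Python sketch, assuming the timestamps have already been parsed from the listing:

from datetime import datetime

def find_wrap_around(timestamps):
    # Return the index of the first backward time jump, or None for a linear trace.
    for i in range(1, len(timestamps)):
        if timestamps[i] < timestamps[i - 1]:
            return i
    return None

# Statements 3799 and 3800 from the snippet above: the jump back by almost
# a day marks where the oldest surviving statement follows the newest one.
stamps = [datetime(2009, 5, 28, 9, 15, 0, 853000),
          datetime(2009, 5, 27, 9, 42, 12, 29000)]
print(find_wrap_around(stamps))  # -> 1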

- Dmitry Vostokov @ TraceAnalysis.org -

Efficient vs. Effective: DATA View

Wednesday, June 3rd, 2009

DATA (Dump Artifact + Trace Artifact) -> DATA (Dump Analysis + Trace Analysis) examples:

1.  Efficient

- My 64 GB server bluescreens. I set a complete memory dump option in Control Panel.

- A user cannot connect. I started tracing yesterday. Stopped today.

- I analyze all these artifacts every day.

2. Effective

- My 64 GB server bluescreens. I set a kernel memory dump option in Control Panel.

- A user cannot connect. I started tracing, tried to connect, stopped tracing.

- I analyze all these artifacts every day and write articles to reduce DATA load.

- Dmitry Vostokov @ DumpAnalysis.org -

Tracing Best Practices

Wednesday, June 3rd, 2009

Good software engineers write good software trace statements. Good software support engineers and responsible customers trace(*) software wisely, enabling it at the right time and in the right quantities. The following preliminary article was written to help trace software effectively, leading to faster problem resolution via trace analysis:

Tracing Best Practices

Although it is currently geared towards CDF tracing in Citrix terminal services environments, these recommendations can be generalized to other traces as well, and the article will be extended over time.

(*) Note the terminological difference here: "tracing" is meant in the "select" / "start" / "stop" sense, not in the sense of writing good software trace statements during the code construction and maintenance phases.

- Dmitry Vostokov @ TraceAnalysis.org -

Pictures from Memory Space (Part 2)

Friday, May 29th, 2009

Now some pictures from Citrix CDF traces.

Deep waters of The Sea of Traces:

Considering software tracing as narrative, it is no wonder that one day I discovered the vast Library of Software Logs in the sea above:

- Dmitry Vostokov @ DumpAnalysis.org -

The Meaning of DATA

Tuesday, May 26th, 2009

I was suddenly enlightened by the unification of software traces with memory dumps, and it came to me that DATA is simply Dump Analysis + Trace Analysis. It is commutative with TADA, the sound of accomplishment (tada.wav in the Windows \Media folder).

This can’t be a coincidence, can it?

- Dmitry Vostokov @ DumpAnalysis.org -

Software Trace - A Mathematical Definition

Tuesday, May 26th, 2009

What is a software trace from a mathematical standpoint? Before any software writes its trace data, it assembles the data in memory. Therefore, generally, a software trace is a linearly ordered sequence of specially prepared memory fragments (trace statements):

(ts_1, ts_2, …, ts_n)

where every ts_i is a sequence of bits, bytes or other discrete units (see the definition of a memory dump):

(s_{11}, s_{12}, …, s_{1k}, s_{21}, s_{22}, …, s_{2l}, …, s_{n1}, s_{n2}, …, s_{nm})

These trace statements can also be minidumps, selected regions of memory space. In the limit, if every ts_i is a full memory snapshot saved at an instant of time t_i, we have a sequence of memory dumps:

(m_{t_1}, m_{t_2}, …, m_{t_n})

As with memory dump analysis, we need symbol files to interpret saved memory fragments unless they were already interpreted during their construction. For example, traces written according to the ETW (Event Tracing for Windows) specification need TMF (Trace Message Format) files for their interpretation and viewing. Usually these files are generated from PDB files, and therefore we have this correspondence:

memory dump file -> software trace file

PDB file -> TMF file 

- Dmitry Vostokov @ TraceAnalysis.org -

Bugtation No.95

Monday, May 25th, 2009

“A trace is a narrative, the story of a computation.”

Dmitry Vostokov, Software Tracing and Logging: Architecture, Design, Implementation and Analysis Patterns

- Dmitry Vostokov @ TraceAnalysis.org -

Trace Analysis Patterns (Part 2)

Thursday, May 7th, 2009

A typical trace is a detailed narrative. It is accompanied by a problem description that lists essential facts. Therefore, the first task of any trace analysis is to check for the presence of these Basic Facts in the trace. If they are not visible or do not correspond, then the trace was possibly not recorded during the problem, or it was taken from a different computer or under different conditions. Here is an example. A user "test01" cannot connect to an application. We look at the trace and find this statement:

No   PID  TID  Date      Time         Statement
[...]
3903 3648 5436 4/29/2009 16:17:36.150 User Name: test01
[...]

At least we can be sure that this trace was taken for the user test01, especially when we expect this or similar trace statements. If we cannot see this trace statement, we can suppose that the trace was taken at the wrong time, for example, after the problem had already happened.
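This check lends itself to a simple script: list the basic facts from the problem description as substrings and report which of them actually appear in the trace. A minimal Python sketch; the file name and the second fact are hypothetical:

def check_basic_facts(path, facts):
    # Report which of the expected facts actually appear in the trace.
    found = {fact: False for fact in facts}
    with open(path, errors="replace") as trace:
        for line in trace:
            for fact in facts:
                if not found[fact] and fact in line:
                    found[fact] = True
    return found

print(check_basic_facts("trace.txt", ["User Name: test01", "Connect failed"]))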

- Dmitry Vostokov @ TraceAnalysis.org -

Trace Analysis Patterns (Part 1)

Tuesday, April 28th, 2009

After coming back to engineering I decided to expand the domain of my research and start a new series of posts called Trace Analysis Patterns. In addition to Citrix CDF / Microsoft ETW traces, I plan to cover other variants based on my extensive software engineering background, where I used tracing in software products ranging from soft multi-platform real-time systems to static code analysis tools. The connection with memory dump analysis will be covered too, because sometimes the combination of static and dynamic data leads to interesting observations and helps to troubleshoot and resolve customer problems, especially when not all of the data can be collected dynamically.

In fact, stack traces and their collections are specializations of more general traces. Another example is historical information in memory dump files, especially when it is somehow timestamped.

In this part I start with an obvious and, to some extent, trivial pattern called Periodic Error. This is an error or a status value that is observed periodically, many times:

No     PID  TID   Date      Time         Statement
[...]
664957 1788 22504 4/23/2009 17:59:14.600 MyClass::Initialize: Cannot open connection "Client ID: 310", status=5
[…]
668834 1788 19868 4/23/2009 19:11:52.979 MyClass::Initialize: Cannot open connection "Client ID: 612", status=5
[…]

or 

No     PID  TID   Date      Time         Statement
[...] 
202314 1788 19128 4/21/2009 16:03:46.861 HandleDataLevel: Error 12005 Getting Mask
[…]
347653 1788 17812 4/22/2009 13:26:00.735 HandleDataLevel: Error 12005 Getting Mask
[…]

Here single trace entries can be isolated from the trace and studied in detail. 
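Counting such repetitions is easy to automate if we first mask the variable parts of each statement. A minimal Python sketch, assuming a plain-text trace like the listings above; the file name and threshold are arbitrary:

import re
from collections import Counter

def normalize(line):
    # Mask digits so that "Client ID: 310" and "Client ID: 612" (and the
    # N / PID / TID / timestamp columns) collapse into one message shape.
    return re.sub(r"\d+", "#", line.strip())

def periodic_messages(path, min_count=100):
    # Count occurrences of each normalized message and keep the frequent ones.
    counter = Counter()
    with open(path, errors="replace") as trace:
        for line in trace:
            counter[normalize(line)] += 1
    return [(msg, n) for msg, n in counter.most_common() if n >= min_count]

for msg, n in periodic_messages("trace.txt"):
    print(n, msg)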

Be aware, though, that some modules might report periodic errors that are false positives in the sense that they are expected as part of the implementation, for example, when a function returns an error to indicate that a bigger buffer is required or to let the caller estimate its size for a subsequent call. This merits its own pattern name, and I will come to it next time with more examples.

I also created a page where I will be adding all tracing patterns:

Trace Analysis Patterns   

- Dmitry Vostokov @ TraceAnalysis.org -

CDF Traces: Analyzing Process Launch Sequence

Monday, March 31st, 2008

Citrix CDF traces are based on ETW (Event Tracing for Windows), and therefore Citrix customers, their support personnel and developers can use the Microsoft TraceView tool for troubleshooting Citrix terminal services environments:

Viewing Common Diagnostics Facility (CDF) Traces Using TraceView

In cases of slow logon or slow process startup we can analyze the process launch sequence to determine delays. In the output trace we can filter tzhook module messages, which also contain the session id (quite handy to differentiate between sessions), for example:

PID      TID      TIME          MESSAGE     
21864    21912    06:34:53.598  tzhook: Attach on process - cmd.exe session=51         
21620    20372    06:34:59.754  tzhook: Attach on process - acregl.exe session=51         
18668    21240    06:35:02.704  tzhook: Attach on process - cmstart.exe session=51         
18560    18832    06:35:02.735  tzhook: Attach on process - wfshell.exe session=51         
18204    20060    06:35:06.575  tzhook: Attach on process - icast.exe session=51         
20640    21104    06:35:07.717  tzhook: Attach on process - LOGON.EXE session=51         
21188    21032    06:35:07.938  tzhook: Attach on process - cscript.exe session=51         
21888    19592    06:35:11.157  tzhook: Attach on process - WScript.exe session=51         
20600    20732    06:35:11.780  tzhook: Attach on process - admin.exe session=51         
17976    20456    06:35:18.752  tzhook: Attach on process - winlogon.exe session=53         
21332    13156    06:35:51.625  tzhook: Attach on process - mpnotify.exe session=53         
10988    10732    06:35:57.043  tzhook: Attach on process - rundll32.exe session=53

Here is another process launch sequence for published Notepad application:

PID     TID     TIME            MESSAGE
15828   18720   15:34:02.637    tzhook: Attach on process - winlogon.exe session=2 
5300    18508   15:34:03.043    tzhook: Attach on process - mpnotify.exe session=2 
17948   19300   15:34:03.356    tzhook: Attach on process - userinit.exe session=2 
17956   19316   15:34:03.415    tzhook: Attach on process - cmd.exe session=2 
5384    5324    15:34:03.443    tzhook: Attach on process - cmd.exe session=2 
19432   19264   15:34:03.461    tzhook: Attach on process - SSONSVR.EXE session=2 
12480   7472    15:34:03.466    tzhook: Attach on process - cmd.exe session=2 
19448   19364   15:34:03.474    tzhook: Attach on process - net.exe session=2 
19416   19656   15:34:03.489    tzhook: Attach on process - acregl.exe session=2 
19480   19596   15:34:03.544    tzhook: Attach on process - cmstart.exe session=2 
664     19512   15:34:03.559    tzhook: Attach on process - wfshell.exe session=2 
19904   13140   15:34:03.610    tzhook: Attach on process - net.exe session=2 
6864    20036   15:34:03.746    tzhook: Attach on process - icast.exe session=2 
19540   20016   15:34:03.749    tzhook: Attach on process - ctfmon.exe session=2 
19944   19032   15:34:03.757    tzhook: Attach on process - net.exe session=2 
10232   18356   15:34:03.787    tzhook: Attach on process - notepad.exe session=2

Such sequences are also useful to determine the process upon which the session initialization or startup sequence hangs. In that case a user dump of that process might be useful.
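The gaps between consecutive attach messages show where a launch sequence stalls. A minimal Python sketch that computes these delays; the two statements are the ones around the 33-second gap in the first sequence above:

import re
from datetime import datetime

PATTERN = re.compile(r"(\d\d:\d\d:\d\d\.\d+)\s+tzhook: Attach on process - (\S+)")

def launch_delays(lines):
    # Pair each attach message with the time elapsed since the previous one.
    events = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            events.append((datetime.strptime(m.group(1), "%H:%M:%S.%f"), m.group(2)))
    return [(name, (t1 - t0).total_seconds())
            for (t0, _), (t1, name) in zip(events, events[1:])]

statements = [
    "17976 20456 06:35:18.752 tzhook: Attach on process - winlogon.exe session=53",
    "21332 13156 06:35:51.625 tzhook: Attach on process - mpnotify.exe session=53",
]
print(launch_delays(statements))  # -> [('mpnotify.exe', 32.873)]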

Of course, we can do all this with Process Monitor and other similar tools, but here we also get other Citrix-related trace messages. All in one.

- Dmitry Vostokov @ DumpAnalysis.org -