Trace Analysis Patterns (Part 5)

Sometimes we have several disjoint Periodic Errors and possible false positives. We wonder where should we start or assign relative priorities for troubleshooting suggestions. Here Statement Density and Current pattern can help. The statement or message density is simply the ratio of the number of occurrences of the specific trace statement (message) in the trace to the total number of all different recorded messages.

Consider this software trace with two frequent messages:

N     PID  TID
21    5928 8092 LookupAccountSid failed. Result = -2146238462
[...]
1013  5928 1340 SQL execution needs a retry. Result = 0

We have approx. 7,500 statements for the former and approx. 1,250 statements for the latter. The total number of trace statements is 185,700, so we have the corresponding approx. trace densities: 0.04 and 0.0067. Their relative ratio 7,500 / 1,250 is 6.

Another trace for the same problem was collected at a different time with the same errors. It has 71,100 statements and only 160 and 27 statements counted for messages above. We have a ratio 160 / 27 approx. the same, 5.93, that suggests that messages are correlated. However statement density is much lower, 0,002 and 0.00038 approx. and this suggests the closer look at the second trace to see whether these problems started at some time later after the start of the recording.

We can also check the statement current as the number of messages per unit of time. The first trace was recorded over the period of 195 seconds and the second over the period of 650 seconds. Therefore, we have 952 msg/s and 109 msg/s respectively. This suggests that the problem might have started at some time during the second trace or there were more modules selected for the first trace. To make sure, we adjust the total number of messages for these two traces. We find the first occurrence of the error and subtract its message number from the total number of messages. For our first trace we see that messages start from the very beginning, and in our second trace they also almost start from the beginning. So such adjustment shouldn’t give much better results here. Also these statements continue to be recorded till the very end of these traces.

To avoid being lost in this discusssion I repeat main results:

           Density             Relative Density   Current,
                                                  all msg/s
Trace 1    0.04 / 0.0067       6                  952
Trace 2    0.002 / 0.00038     5.93               109

The possibility that much more was traced that resulted in lower density for the second trace should be discarded because we have much lower current. Perhaps environment was not quite the same for the second tracing. However the same relative density for two different errors suggest that they are correlated and the higher density of the first error suggests that we should start our investigation from it.

The reason why I came up with this statistical trace analysis pattern is because 2 different engineers analyzed the same trace and both were suggesting different troubleshooting paths based on selected error messages from software traces. So I did a statistical analysis to prioritize their suggestions.

- Dmitry Vostokov @ TraceAnalysis.org -

6 Responses to “Trace Analysis Patterns (Part 5)”

  1. Crash Dump Analysis » Blog Archive » Trace Analysis Patterns (Part 10) Says:

    […] distribution differs greatly in both size and content. Despite the same tracing time, 15 seconds, statement current is 155 msg/s for working and 388 msg/s for non-working […]

  2. Crash Dump Analysis » Blog Archive » Statement current, coupled processes, wait chain, spiking thread, hidden exception, and not my version: memory dump and trace analysis pattern cooperation Says:

    […] when the problem happens again. The trace revealed the following message with exceptionally high statement current of 72,118 msg/s (and also superdense - no other types of trace statements were found […]

  3. Crash Dump Analysis » Blog Archive » Trace Analysis Patterns (Part 12) Says:

    […] with thousands and millions of messages (trace statements) we can see regions of activity where statement current (Jm, msg/s) is much higher than in surrounding temporal regions. Hence the name of this pattern, […]

  4. Crash Dump Analysis » Blog Archive » Trace Analysis Patterns (Part 15) Says:

    […] When looking at software traces and doing either a search for or just scrolling certain messages have our attention immediately. We call them Significant Events and hence the name of this pattern, Significant Event. It could be a recorded exception or an error, a basic fact, a trace message from vocabulary index, or just any trace statement that marks the start of some activity we want to explore in depth, for example, a certain DLL is attached to the process, a coupled process is started or a function is called. The start of a trace and the end of it are trivial significant events and are used in deciding whether the trace is circular, in determining the trace recording interval or its average statement current. […]

  5. Crash Dump Analysis » Blog Archive » Trace Analysis Patterns (Part 23) Says:

    […] Statement Density and Current […]

  6. Crash Dump Analysis » Blog Archive » Ana-Trace-Log-Lyzer Says:

    […] Statement Density and Current […]

Leave a Reply