Crash Dump Analysis Patterns (Part 57)

Another frequently occurring pattern is Hardware Error. The cause can be an internal CPU malfunction due to overheating, faulty RAM, or a hard disk I/O problem. It usually results in an appropriate bugcheck, and the most frequent one is the 6th from the top of the Bug Check Frequency Table:

  • BugCheck 9C: MACHINE_CHECK_EXCEPTION

Other relevant bugchecks include:

  • BugCheck 7B: INACCESSIBLE_BOOT_DEVICE

  • BugCheck 77: KERNEL_STACK_INPAGE_ERROR

  • BugCheck 7A: KERNEL_DATA_INPAGE_ERROR

Another bugcheck from this category can also be triggered on purpose to get a crash dump of a hanging or slow system:

Please also note that other popular bugchecks such as

  • BugCheck 7F: UNEXPECTED_KERNEL_MODE_TRAP

  • BugCheck 50: PAGE_FAULT_IN_NONPAGED_AREA

can result from RAM problems, but we should try to find a software cause first.

Sometimes bugchecks such as

  • BugCheck 7E: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED

report EXCEPTION_DOESNOT_MATCH_CODE, where the read or write address doesn’t correspond to the faulting instruction at EIP:

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: bf802671, The address that the exception occurred at
Arg3: f10b8c74, Exception Record Address
Arg4: f10b88c4, Context Record Address

FAULTING_IP:
driver!AcquireSemaphoreShared+4
bf802671 90 nop

EXCEPTION_RECORD: f10b8c74 -- (.exr fffffffff10b8c74)
ExceptionAddress: bf802671 (driver!AcquireSemaphoreShared+0x00000004)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000001
Parameter[1]: 0000000c
Attempt to write to address 0000000c

CONTEXT: f10b88c4 -- (.cxr fffffffff10b88c4)
eax=884d2d01 ebx=0000000c ecx=00000000 edx=80010031 esi=8851ef60 edi=bc3846d4
eip=bf802671 esp=f10b8d3c ebp=f10b8d70 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010206
driver!AcquireSemaphoreShared+0x4:
bf802671 90 nop
Resetting default scope

WRITE_ADDRESS: 0000000c

EXCEPTION_DOESNOT_MATCH_CODE: This indicates a hardware error.
Instruction at bf802671 does not read/write to 0000000c
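
The mismatch reported above can be reasoned about by hand: the exception record claims a write to 0000000c, but the byte at the faulting IP is 90 (nop), which has no memory operand. Here is a minimal Python sketch of that check. The opcode table is a deliberately tiny, hypothetical stand-in for a real disassembler, and the function names are ours, not WinDbg's.

```python
# Access-violation (c0000005) Parameter[0] values: the kind of access attempted.
AV_ACCESS_KINDS = {0: "read", 1: "write", 8: "execute (DEP)"}

# Hypothetical, deliberately tiny table of one-byte x86 opcodes that have no
# explicit memory operand. A real check would need a full disassembler.
NO_MEMORY_OPERAND_OPCODES = {0x90: "nop", 0xCC: "int 3"}


def describe_access_violation(parameters):
    """Turn the two c0000005 exception parameters into (kind, address)."""
    kind_code, address = parameters
    return AV_ACCESS_KINDS.get(kind_code, "unknown"), address


def exception_matches_code(first_opcode_byte):
    """Return False when the faulting opcode cannot read or write data memory."""
    return first_opcode_byte not in NO_MEMORY_OPERAND_OPCODES


# Values from the listing: Parameter[0]=1, Parameter[1]=0xc, opcode byte 90.
kind, address = describe_access_violation((0x00000001, 0x0000000C))
print(f"Attempt to {kind} address {address:08x}")
print("exception matches code:", exception_matches_code(0x90))
```

The same check flags the user-mode example with the int 3 byte (cc), although there the likely explanation is software corruption rather than hardware.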

Code mismatch can also happen in user mode, but in my experience it usually results from an improper Hooked Function or similar corruption:

EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 7c848768 (ntdll!_LdrpInitialize+0x00000184)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000001
NumberParameters: 0

DEFAULT_BUCKET_ID: CODE_ADDRESS_MISMATCH

WRITE_ADDRESS: f774f120

FAULTING_IP:
ntdll!_LdrpInitialize+184
7c848768 cc int 3

EXCEPTION_DOESNOT_MATCH_CODE: This indicates a hardware error.
Instruction at 7c848768 does not read/write to f774f120

STACK_TEXT:
0012fd14 7c8284c5 0012fd28 7c800000 00000000 ntdll!_LdrpInitialize+0x184
00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x25

In such cases EIP might point to the middle of the expected instruction (Wild Code):

FAULTING_IP:
+59c3659
059c3659 86990508f09b xchg bl,byte ptr [ecx-640FF7FBh]
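
A well-formed disassembly line proves nothing about whether EIP is correct, because almost any byte sequence decodes to some instruction. The following Python sketch decodes the six bytes from the listing by hand to show how they yield the xchg instruction. It handles only this one encoding (opcode 86 with a register plus 32-bit displacement operand), and the function name is ours.

```python
import struct

REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_xchg_rm8_r8(code):
    """Decode opcode 86 /r with a mod=10 (register + disp32) operand only."""
    opcode, modrm = code[0], code[1]
    if opcode != 0x86:
        raise ValueError("sketch only handles opcode 86 (xchg r/m8, r8)")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("sketch only handles [reg32 + disp32] without SIB")
    (disp,) = struct.unpack_from("<i", code, 2)  # signed little-endian disp32
    sign = "-" if disp < 0 else "+"
    return f"xchg {REG8[reg]},byte ptr [{REG32[rm]}{sign}{abs(disp):X}h]"


# Bytes shown at 059c3659 in the listing above.
print(decode_xchg_rm8_r8(bytes.fromhex("86990508f09b")))
```

The ModRM byte 99 selects bl and ecx, and the remaining four bytes are read as a displacement. Bytes that were never meant to start an instruction therefore still produce plausible-looking code.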

Here is an example of a real hardware error (note the concatenated error code for bugcheck 0x9C):

MACHINE_CHECK_EXCEPTION (9c)
A fatal Machine Check Exception has occurred.
KeBugCheckEx parameters;
    x86 Processors
        If the processor has ONLY MCE feature available (For example Intel
        Pentium), the parameters are:
        1 - Low  32 bits of P5_MC_TYPE MSR
        2 - Address of MCA_EXCEPTION structure
        3 - High 32 bits of P5_MC_ADDR MSR
        4 - Low  32 bits of P5_MC_ADDR MSR
        If the processor also has MCA feature available (For example Intel
        Pentium Pro), the parameters are:
        1 - Bank number
        2 - Address of MCA_EXCEPTION structure
        3 - High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
        4 - Low  32 bits of MCi_STATUS MSR for the MCA bank that had the error
    IA64 Processors
        1 - Bugcheck Type
            1 - MCA_ASSERT
            2 - MCA_GET_STATEINFO
                SAL returned an error for SAL_GET_STATEINFO while processing MCA.
            3 - MCA_CLEAR_STATEINFO
                SAL returned an error for SAL_CLEAR_STATEINFO while processing MCA.
            4 - MCA_FATAL
                FW reported a fatal MCA.
            5 - MCA_NONFATAL
                SAL reported a recoverable MCA and we don't support currently
                support recovery or SAL generated an MCA and then couldn't
                produce an error record.
            0xB - INIT_ASSERT
            0xC - INIT_GET_STATEINFO
                  SAL returned an error for SAL_GET_STATEINFO while processing INIT event.
            0xD - INIT_CLEAR_STATEINFO
                  SAL returned an error for SAL_CLEAR_STATEINFO while processing INIT event.
            0xE - INIT_FATAL
                  Not used.
        2 - Address of log
        3 - Size of log
        4 - Error code in the case of x_GET_STATEINFO or x_CLEAR_STATEINFO
    AMD64 Processors
        1 - Bank number
        2 - Address of MCA_EXCEPTION structure
        3 - High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
        4 - Low  32 bits of MCi_STATUS MSR for the MCA bank that had the error
Arguments:
Arg1: 00000000
Arg2: 808a07a0
Arg3: be000300
Arg4: 1008081f

Debugging Details:
------------------

   NOTE:  This is a hardware error.  This error was reported by the CPU
   via Interrupt 18.  This analysis will provide more information about
   the specific error.  Please contact the manufacturer for additional
   information about this error and troubleshooting assistance.

   This error is documented in the following publication:

      - IA-32 Intel(r) Architecture Software Developer's Manual
        Volume 3: System Programming Guide

   Bit Mask:

    MA                           Model Specific       MCA
 O  ID      Other Information      Error Code     Error Code
VV  SDP ___________|____________ _______|_______ _______|______
AEUECRC|                        |               |             
LRCNVVC|                        |               |             
^^^^^^^|                        |               |              
   6         5         4         3         2         1
3210987654321098765432109876543210987654321098765432109876543210
----------------------------------------------------------------
1011111000000000000000110000000000010000000010000000100000011111 

VAL   - MCi_STATUS register is valid
        Indicates that the information contained within the IA32_MCi_STATUS
        register is valid.  When this flag is set, the processor follows the
        rules given for the OVER flag in the IA32_MCi_STATUS register when
        overwriting previously valid entries.  The processor sets the VAL
        flag and software is responsible for clearing it.

UC    - Error Uncorrected
        Indicates that the processor did not or was not able to correct the
        error condition.  When clear, this flag indicates that the processor
        was able to correct the error condition.

EN    - Error Enabled
        Indicates that the error was enabled by the associated EEj bit of the
        IA32_MCi_CTL register.

MISCV - IA32_MCi_MISC Register Valid
        Indicates that the IA32_MCi_MISC register contains additional
        information regarding the error.  When clear, this flag indicates
        that the IA32_MCi_MISC register is either not implemented or does
        not contain additional information regarding the error.

ADDRV - IA32_MCi_ADDR register valid
        Indicates that the IA32_MCi_ADDR register contains the address where
        the error occurred.

PCC   - Processor Context Corrupt
        Indicates that the state of the processor might have been corrupted
        by the error condition detected and that reliable restarting of the
        processor may not be possible.

BUSCONNERR - Bus and Interconnect Error   BUS{LL}_{PP}_{RRRR}_{II}_{T}_err
        These errors match the format 0000 1PPT RRRR IILL

   Concatenated Error Code:
   --------------------------
   _VAL_UC_EN_MISCV_ADDRV_PCC_BUSCONNERR_1F

   This error code can be reported back to the manufacturer.
   They may be able to provide additional information based upon
   this error.  All questions regarding STOP 0x9C should be
   directed to the hardware manufacturer.

BUGCHECK_STR:  0x9C_IA32_GenuineIntel

DEFAULT_BUCKET_ID:  DRIVER_FAULT

PROCESS_NAME:  Idle

CURRENT_IRQL:  2

LAST_CONTROL_TRANSFER:  from 80a7fbd8 to 8087b6be

STACK_TEXT: 
f773d280 80a7fbd8 0000009c 00000000 f773d2b0 nt!KeBugCheckEx+0x1b
f773d3b4 80a7786f f7737fe0 00000000 00000000 hal!HalpMcaExceptionHandler+0x11e
f773d3b4 f75a9ca2 f7737fe0 00000000 00000000 hal!HalpMcaExceptionHandlerWrapper+0x77
f78c6d50 8083abf2 00000000 0000000e 00000000 intelppm!AcpiC1Idle+0x12
f78c6d54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0xa
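
The concatenated error code can be reproduced from Arg3 and Arg4 alone, which is useful when only the bugcheck parameters are available. The following Python sketch assumes the flag positions drawn in the bit mask above and the bus and interconnect layout 0000 1PPT RRRR IILL quoted in the output. Function and dictionary names are ours.

```python
# Flag bit positions in IA32_MCi_STATUS, as drawn in the bit mask above.
STATUS_FLAGS = [
    ("VAL", 63), ("OVER", 62), ("UC", 61), ("EN", 60),
    ("MISCV", 59), ("ADDRV", 58), ("PCC", 57),
]


def decode_mci_status(high32, low32):
    """Combine Arg3 (high) and Arg4 (low) and split out flags and error codes."""
    status = (high32 << 32) | low32
    flags = [name for name, bit in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "status": status,
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": status & 0xFFFF,
    }


def decode_bus_interconnect(mca_code):
    """Split a 0000 1PPT RRRR IILL code into its fields, or return None."""
    if (mca_code & 0xF800) != 0x0800:
        return None
    return {
        "PP": (mca_code >> 9) & 0b11,
        "T": (mca_code >> 8) & 0b1,
        "RRRR": (mca_code >> 4) & 0b1111,
        "II": (mca_code >> 2) & 0b11,
        "LL": mca_code & 0b11,
    }


decoded = decode_mci_status(0xBE000300, 0x1008081F)
print(format(decoded["status"], "064b"))
print("_" + "_".join(decoded["flags"]))
print(decode_bus_interconnect(decoded["mca_code"]))
```

For the values in this dump the flags come out as VAL, UC, EN, MISCV, ADDRV and PCC, and the low 16 bits (081f) match the bus and interconnect pattern.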

- Dmitry Vostokov @ DumpAnalysis.org -

7 Responses to “Crash Dump Analysis Patterns (Part 57)”

  1. Dmitry Vostokov Says:

    Another possible sign of a hardware error: frequent, multiple, unrelated bugchecks and/or bugchecks in memory dumps with valid instructions at the faulting IP. Also beware of a misaligned IP, which can still look like a valid instruction.

  2. Crash Dump Analysis » Blog Archive » Fault context, wild code and hardware error: pattern cooperation Says:

    […] Most fault IPs were showing signs of Wild Code pattern and that most probably implicated Hardware Error (Looks like WinDbg suggests that MISALIGNED_IP implicates hardware). Here is the listing of […]

  3. Crash Dump Analysis » Blog Archive » Crash Dump Analysis AntiPatterns (Part 14) Says:

    […] wouldn’t be so quick. Check Hardware Error pattern post and comments there. So let’s de-analyze the analysis. “c0000005 is Access […]

  4. Dmitry Vostokov Says:

    Another example is this:

    WHEA_UNCORRECTABLE_ERROR (124)
    A fatal hardware error has occurred. Parameter 1 identifies the type of error
    source that reported the error. Parameter 2 holds the address of the
    WHEA_ERROR_RECORD structure that describes the error conditon.
    Arguments:
    Arg1: 0000000000000000, Machine Check Exception
    Arg2: fffffa8004b46748, Address of the WHEA_ERROR_RECORD structure.
    Arg3: 0000000000000000, High order 32-bits of the MCi_STATUS value.
    Arg4: 0000000000000000, Low order 32-bits of the MCi_STATUS value.

    1: kd> dt -r _WHEA_ERROR_RECORD fffffa8004b46748
    hal!_WHEA_ERROR_RECORD
       +0x000 Header           : _WHEA_ERROR_RECORD_HEADER
          +0x000 Signature        : 0x52455043
          +0x004 Revision         : _WHEA_REVISION
             +0x000 MinorRevision    : 0x10 ''
             +0x001 MajorRevision    : 0x2 ''
             +0x000 AsUSHORT         : 0x210
          +0x006 SignatureEnd     : 0xffffffff
          +0x00a SectionCount     : 3
          +0x00c Severity         : 1 ( WheaErrSevFatal )
          +0x010 ValidBits        : _WHEA_ERROR_RECORD_HEADER_VALIDBITS
             +0x000 PlatformId       : 0y0
             +0x000 Timestamp        : 0y1
             +0x000 PartitionId      : 0y0
             +0x000 Reserved         : 0y00000000000000000000000000000 (0)
             +0x000 AsULONG          : 2
          +0x014 Length           : 0x3a0
          +0x018 Timestamp        : _WHEA_TIMESTAMP
             +0x000 Seconds          : 0y00100010 (0x22)
             +0x000 Minutes          : 0y00101011 (0x2b)
             +0x000 Hours            : 0y00001100 (0xc)
             +0x000 Precise          : 0y0
             +0x000 Reserved         : 0y0000000 (0)
             +0x000 Day              : 0y00010110 (0x16)
             +0x000 Month            : 0y00000100 (0x4)
             +0x000 Year             : 0y00001010 (0xa)
             +0x000 Century          : 0y00010100 (0x14)
             +0x000 AsLARGE_INTEGER  : _LARGE_INTEGER 0x140a0416`000c2b22
          +0x020 PlatformId       : _GUID {00000000-0000-0000-0000-000000000000}
             +0x000 Data1            : 0
             +0x004 Data2            : 0
             +0x006 Data3            : 0
             +0x008 Data4            : [8]  ""
          +0x030 PartitionId      : _GUID {00000000-0000-0000-0000-000000000000}
             +0x000 Data1            : 0
             +0x004 Data2            : 0
             +0x006 Data3            : 0
             +0x008 Data4            : [8]  ""
          +0x040 CreatorId        : _GUID {cf07c4bd-b789-4e18-b3c4-1f732cb57131}
             +0x000 Data1            : 0xcf07c4bd
             +0x004 Data2            : 0xb789
             +0x006 Data3            : 0x4e18
             +0x008 Data4            : [8]  "???"
          +0x050 NotifyType       : _GUID {e8f56ffe-919c-4cc5-ba88-65abe14913bb}
             +0x000 Data1            : 0xe8f56ffe
             +0x004 Data2            : 0x919c
             +0x006 Data3            : 0x4cc5
             +0x008 Data4            : [8]  "???"
          +0x060 RecordId         : 0x01cae219`673474d3
          +0x068 Flags            : _WHEA_ERROR_RECORD_HEADER_FLAGS
             +0x000 Recovered        : 0y0
             +0x000 PreviousError    : 0y1
             +0x000 Simulated        : 0y0
             +0x000 Reserved         : 0y00000000000000000000000000000 (0)
             +0x000 AsULONG          : 2
          +0x06c PersistenceInfo  : _WHEA_PERSISTENCE_INFO
             +0x000 Signature        : 0y0000000000000000 (0)
             +0x000 Length           : 0y000000000000000000000000 (0)
             +0x000 Identifier       : 0y0000000000000000 (0)
             +0x000 Attributes       : 0y00
             +0x000 DoNotLog         : 0y0
             +0x000 Reserved         : 0y00000 (0)
             +0x000 AsULONGLONG      : 0
          +0x074 Reserved         : [12]  ""
       +0x080 SectionDescriptor : [1] _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR
          +0x000 SectionOffset    : 0x158
          +0x004 SectionLength    : 0xc0
          +0x008 Revision         : _WHEA_REVISION
             +0x000 MinorRevision    : 0x1 ''
             +0x001 MajorRevision    : 0x2 ''
             +0x000 AsUSHORT         : 0x201
          +0x00a ValidBits        : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_VALIDBITS
             +0x000 FRUId            : 0y0
             +0x000 FRUText          : 0y0
             +0x000 Reserved         : 0y000000 (0)
             +0x000 AsUCHAR          : 0 ''
          +0x00b Reserved         : 0 ''
          +0x00c Flags            : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_FLAGS
             +0x000 Primary          : 0y1
             +0x000 ContainmentWarning : 0y0
             +0x000 Reset            : 0y0
             +0x000 ThresholdExceeded : 0y0
             +0x000 ResourceNotAvailable : 0y0
             +0x000 LatentError      : 0y0
             +0x000 Reserved         : 0y00000000000000000000000000 (0)
             +0x000 AsULONG          : 1
          +0x010 SectionType      : _GUID {9876ccad-47b4-4bdb-b65e-16f193c4f3db}
             +0x000 Data1            : 0x9876ccad
             +0x004 Data2            : 0x47b4
             +0x006 Data3            : 0x4bdb
             +0x008 Data4            : [8]  "???"
          +0x020 FRUId            : _GUID {00000000-0000-0000-0000-000000000000}
             +0x000 Data1            : 0
             +0x004 Data2            : 0
             +0x006 Data3            : 0
             +0x008 Data4            : [8]  ""
          +0x030 SectionSeverity  : 1 ( WheaErrSevFatal )
          +0x034 FRUText          : [20]  ""
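
Two header fields in the record above are easy to sanity-check by hand: the Signature and the packed Timestamp. The following Python sketch decodes both. It assumes the timestamp fields are plain binary bytes in the order shown by dt (the Minutes value 0x2b would not be valid BCD), and the function names are ours.

```python
from datetime import datetime


def decode_whea_timestamp(value):
    """Unpack the byte-wide fields of _WHEA_TIMESTAMP.AsLARGE_INTEGER."""

    def field(shift):
        return (value >> shift) & 0xFF

    seconds, minutes, hours = field(0), field(8), field(16)
    day, month, year, century = field(32), field(40), field(48), field(56)
    # Assumption: fields are plain binary, not BCD.
    return datetime(century * 100 + year, month, day, hours, minutes, seconds)


def decode_signature(signature):
    """Read the 32-bit header signature as little-endian ASCII."""
    return signature.to_bytes(4, "little").decode("ascii")


# Values from the dt output above.
print(decode_whea_timestamp(0x140A0416000C2B22))
print(decode_signature(0x52455043))
```

A signature that does not decode to CPER, or a timestamp far from the crash time, would suggest the address does not point to a valid error record.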

  5. Dmitry Vostokov Says:

    KERNEL_STACK_INPAGE_ERROR (77)
    The requested page of kernel data could not be read in. Caused by
    bad block in paging file or disk controller error.
    In the case when the first arguments is 0 or 1, the stack signature
    in the kernel stack was not found. Again, bad hardware.
    An I/O status of c000009c (STATUS_DEVICE_DATA_ERROR) or
    C000016AL (STATUS_DISK_OPERATION_FAILED) normally indicates
    the data could not be read from the disk due to a bad
    block. Upon reboot autocheck will run and attempt to map out the bad
    sector. If the status is C0000185 (STATUS_IO_DEVICE_ERROR) and the paging
    file is on a SCSI disk device, then the cabling and termination should be
    checked. See the knowledge base article on SCSI termination.
    Arguments:
    Arg1: 0000000000000001, (page was retrieved from disk)
    Arg2: fffffa800818e870, value found in stack where signature should be
    Arg3: 0000000000000000, 0
    Arg4: fffff8800c6e5e80, address of signature on kernel stack

    2: kd> k
    Child-SP RetAddr Call Site
    fffff880`0371da18 fffff800`03110b01 nt!KeBugCheckEx
    fffff880`0371da20 fffff800`030c8c54 nt! ?? ::FNODOBFM::`string'+0x51e31
    fffff880`0371db30 fffff800`030c8bef nt!MmInPageKernelStack+0x40
    fffff880`0371db90 fffff800`030c8928 nt!KiInSwapKernelStacks+0x1f
    fffff880`0371dbc0 fffff800`0332be5a nt!KeSwapProcessOrStack+0x84
    fffff880`0371dc00 fffff800`03085d26 nt!PspSystemThreadStartup+0x5a
    fffff880`0371dc40 00000000`00000000 nt!KiStartSystemThread+0x16
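
The triage rules in the bugcheck description above can be written down as a small lookup. This is a Python sketch only: the status names and hints are copied from the description, while the function name and the io_status parameter are hypothetical, because the description does not say which bugcheck argument carries the I/O status.

```python
# NTSTATUS names and hints copied from the bugcheck description above.
IO_STATUS_HINTS = {
    0xC000009C: ("STATUS_DEVICE_DATA_ERROR", "bad block; autocheck will try to map it out"),
    0xC000016A: ("STATUS_DISK_OPERATION_FAILED", "bad block; autocheck will try to map it out"),
    0xC0000185: ("STATUS_IO_DEVICE_ERROR", "check SCSI cabling and termination"),
}


def triage_stack_inpage_error(arg1, io_status=None):
    """Return a one-line hint for bugcheck 0x77.

    io_status is a hypothetical parameter: pass whichever bugcheck argument
    holds the I/O status in your dump.
    """
    if arg1 in (0, 1):
        return "stack signature not found in kernel stack: suspect hardware"
    name, hint = IO_STATUS_HINTS.get(io_status, ("unknown status", "no hint in this table"))
    return f"{name}: {hint}"


# Arg1 from the dump above.
print(triage_stack_inpage_error(0x0000000000000001))
```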

  6. Dmitry Vostokov Says:

    For WHEA_UNCORRECTABLE_ERROR (124) we have additional WinDbg commands !whea, !errrec, and !errpkt:

    2: kd> !whea
    Error Source Table @ fffff8004bbd4a90
    4 Error Sources
    Error Source 0 @ ffffe00014376bd0
    Notify Type : {14374010-e000-ffff-984a-bd4b00f8ffff}
    Type : 0x0 (MCE)
    Error Count : 1
    Record Count : 4
    Record Length : 728
    Error Records : wrapper @ ffffe000110e0000 record @ ffffe000110e0028
    : wrapper @ ffffe000110e0728 record @ ffffe000110e0750
    : wrapper @ ffffe000110e0e50 record @ ffffe000110e0e78
    : wrapper @ ffffe000110e1578 record @ ffffe000110e15a0
    Descriptor : @ ffffe00014376c29
    Length : 3cc
    Max Raw Data Length : 141
    Num Records To Preallocate : 4
    Max Sections Per Record : 4
    Error Source ID : 0
    Flags : 00000000
    […]

    2: kd> !errrec ffffe000110e0028
    ============================================
    Common Platform Error Record @ ffffe000110e0028
    -------------------------------------------------------------------------------
    Record Id : 01d21a1a7e5fffd1
    Severity : Fatal (1)
    Length : 928
    Creator : Microsoft
    Notify Type : Machine Check Exception
    Timestamp : 9/30/2016 9:05:50 (UTC)
    Flags : 0x00000000

    ============================================
    Section 0 : Processor Generic
    -------------------------------------------------------------------------------
    Descriptor @ ffffe000110e00a8
    Section @ ffffe000110e0180
    Offset : 344
    Length : 192
    Flags : 0x00000001 Primary
    Severity : Fatal

    Proc. Type : x86/x64
    Instr. Set : x64
    Error Type : Micro-Architectural Error
    Flags : 0x00
    CPU Version : 0x00000000000306a9
    Processor ID : 0x0000000000000002

    ============================================
    Section 1 : x86/x64 Processor Specific
    -------------------------------------------------------------------------------
    Descriptor @ ffffe000110e00f0
    Section @ ffffe000110e0240
    Offset : 536
    Length : 128
    Flags : 0x00000000
    Severity : Fatal

    Local APIC Id : 0x0000000000000002
    CPU Id : a9 06 03 00 00 08 10 02 - bf e3 ba 7f ff fb eb bf
    00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

    Proc. Info 0 @ ffffe000110e0240

    ============================================
    Section 2 : x86/x64 MCA
    -------------------------------------------------------------------------------
    Descriptor @ ffffe000110e0138
    Section @ ffffe000110e02c0
    Offset : 664
    Length : 264
    Flags : 0x00000000
    Severity : Fatal

    Error : Internal unclassified (Proc 2 Bank 4)
    Status : 0xb200000000100402
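
The error description printed by !errrec follows from the low 16 bits of the Status value. The following Python sketch classifies that MCA error code using a subset of the bit patterns from the machine-check chapter of the Intel Software Developer's Manual. The function name is ours, and the table is intentionally incomplete.

```python
def classify_mca_error_code(code):
    """Map the low 16 bits of MCi_STATUS to a coarse error class (subset only)."""
    code &= 0xFFFF
    if code == 0x0000:
        return "No error"
    if code == 0x0400:
        return "Internal timer"
    if (code & 0xFC00) == 0x0400:   # 0000 01xx xxxx xxxx
        return "Internal unclassified"
    if (code & 0xF800) == 0x0800:   # 0000 1PPT RRRR IILL
        return "Bus and interconnect"
    if (code & 0xFF00) == 0x0100:   # 0000 0001 RRRR TTLL
        return "Cache hierarchy"
    if (code & 0xFF80) == 0x0080:   # 0000 0000 1MMM CCCC
        return "Memory controller"
    if (code & 0xFFF0) == 0x0010:   # 0000 0000 0001 TTLL
        return "TLB"
    if (code & 0xFFFC) == 0x000C:   # 0000 0000 0000 11LL
        return "Generic cache hierarchy"
    return "Not covered by this sketch"


# Status value from the !errrec output above.
print(classify_mca_error_code(0xB200000000100402))
```

For the status above the low 16 bits are 0402, which falls into the internal unclassified pattern and agrees with the Error line.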

  7. Dmitry Vostokov Says:

    Recently we observed internal errors in the Visual C++ compiler followed by memory management bugchecks a few seconds later.
