Crash Dump Analysis Patterns (Part 57)
Another frequently occurring pattern is Hardware Error. It can be an internal CPU malfunction due to overheating, or a RAM or hard disk I/O problem. It usually results in an appropriate bugcheck, and the most frequent one is the 6th from the top of the Bug Check Frequency Table:
- BugCheck 9C: MACHINE_CHECK_EXCEPTION
Other relevant bugchecks include:
- BugCheck 7B: INACCESSIBLE_BOOT_DEVICE
- BugCheck 77: KERNEL_STACK_INPAGE_ERROR
- BugCheck 7A: KERNEL_DATA_INPAGE_ERROR
Another bugcheck from this category can also be triggered on purpose to get a crash dump of a hanging or slow system:
Please also note that other popular bugchecks like
- BugCheck 7F: UNEXPECTED_KERNEL_MODE_TRAP
- BugCheck 50: PAGE_FAULT_IN_NONPAGED_AREA
can result from RAM problems but we should try to find a software cause first.
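The triage rule above can be sketched as a small lookup. This is only an illustration: the function name `triage` and the wording of the returned hints are hypothetical, and the tables contain only the bugcheck codes named in this post.

```python
# Bugcheck codes named in the post, grouped by how the post treats them.
HARDWARE_BUGCHECKS = {
    0x9C: "MACHINE_CHECK_EXCEPTION",
    0x7B: "INACCESSIBLE_BOOT_DEVICE",
    0x77: "KERNEL_STACK_INPAGE_ERROR",
    0x7A: "KERNEL_DATA_INPAGE_ERROR",
}
SOFTWARE_FIRST_BUGCHECKS = {
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
}


def triage(code: int) -> str:
    """Return a first-guess hint for a bugcheck code (hypothetical helper)."""
    if code in HARDWARE_BUGCHECKS:
        return f"{HARDWARE_BUGCHECKS[code]}: suspect hardware"
    if code in SOFTWARE_FIRST_BUGCHECKS:
        return f"{SOFTWARE_FIRST_BUGCHECKS[code]}: look for a software cause first"
    return "not covered by this sketch"


print(triage(0x9C))
print(triage(0x50))
```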
Sometimes bugchecks like
- BugCheck 7E: SYSTEM_THREAD_EXCEPTION_NOT_HANDLED
report EXCEPTION_DOESNOT_MATCH_CODE, where the read or write address doesn't correspond to the faulting instruction at EIP:
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: bf802671, The address that the exception occurred at
Arg3: f10b8c74, Exception Record Address
Arg4: f10b88c4, Context Record Address
FAULTING_IP:
driver!AcquireSemaphoreShared+4
bf802671 90 nop
EXCEPTION_RECORD: f10b8c74 -- (.exr fffffffff10b8c74)
ExceptionAddress: bf802671 (driver!AcquireSemaphoreShared+0x00000004)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 00000001
Parameter[1]: 0000000c
Attempt to write to address 0000000c
CONTEXT: f10b88c4 -- (.cxr fffffffff10b88c4)
eax=884d2d01 ebx=0000000c ecx=00000000 edx=80010031 esi=8851ef60 edi=bc3846d4
eip=bf802671 esp=f10b8d3c ebp=f10b8d70 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010206
driver!AcquireSemaphoreShared+0x4:
bf802671 90 nop
Resetting default scope
WRITE_ADDRESS: 0000000c
EXCEPTION_DOESNOT_MATCH_CODE: This indicates a hardware error.
Instruction at bf802671 does not read/write to 0000000c
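The reasoning behind this mismatch can be sketched in a few lines: an access violation reports a read or write address, but an instruction such as nop has no memory operand, so the report cannot have come from the code at that address. The opcode table below is deliberately tiny and only covers the two single-byte instructions seen in this post; the function name is hypothetical and this is not how WinDbg implements its check.

```python
ACCESS_VIOLATION = 0xC0000005

# Single-byte x86 opcodes with no explicit memory operand.
# Illustrative only: a real check would need a full disassembler.
NO_MEMORY_OPERAND = {0x90: "nop", 0xCC: "int 3"}


def exception_matches_code(exception_code: int, instruction_bytes: bytes) -> bool:
    """Return False when an access violation is reported at an
    instruction that has no memory operand (hypothetical helper)."""
    if exception_code != ACCESS_VIOLATION:
        return True  # this sketch only reasons about access violations
    return instruction_bytes[0] not in NO_MEMORY_OPERAND


# The faulting instruction above is a single byte 0x90 (nop) at bf802671.
print(exception_matches_code(0xC0000005, bytes([0x90])))
```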
Code mismatch can also happen in user mode, but in my experience it usually results from an improper Hooked Function or similar corruption:
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 7c848768 (ntdll!_LdrpInitialize+0x00000184)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000001
NumberParameters: 0
DEFAULT_BUCKET_ID: CODE_ADDRESS_MISMATCH
WRITE_ADDRESS: f774f120
FAULTING_IP:
ntdll!_LdrpInitialize+184
7c848768 cc int 3
EXCEPTION_DOESNOT_MATCH_CODE: This indicates a hardware error.
Instruction at 7c848768 does not read/write to f774f120
STACK_TEXT:
0012fd14 7c8284c5 0012fd28 7c800000 00000000 ntdll!_LdrpInitialize+0x184
00000000 00000000 00000000 00000000 00000000 ntdll!KiUserApcDispatcher+0x25
In such cases EIP might point to the middle of the expected instruction (Wild Code):
FAULTING_IP:
+59c3659
059c3659 86990508f09b xchg bl,byte ptr [ecx-640FF7FBh]
Here is an example of a real hardware error (note the concatenated error code for bugcheck 0x9C):
MACHINE_CHECK_EXCEPTION (9c)
A fatal Machine Check Exception has occurred.
KeBugCheckEx parameters;
x86 Processors
If the processor has ONLY MCE feature available (For example Intel
Pentium), the parameters are:
1 - Low 32 bits of P5_MC_TYPE MSR
2 - Address of MCA_EXCEPTION structure
3 - High 32 bits of P5_MC_ADDR MSR
4 - Low 32 bits of P5_MC_ADDR MSR
If the processor also has MCA feature available (For example Intel
Pentium Pro), the parameters are:
1 - Bank number
2 - Address of MCA_EXCEPTION structure
3 - High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
4 - Low 32 bits of MCi_STATUS MSR for the MCA bank that had the error
IA64 Processors
1 - Bugcheck Type
1 - MCA_ASSERT
2 - MCA_GET_STATEINFO
SAL returned an error for SAL_GET_STATEINFO while processing MCA.
3 - MCA_CLEAR_STATEINFO
SAL returned an error for SAL_CLEAR_STATEINFO while processing MCA.
4 - MCA_FATAL
FW reported a fatal MCA.
5 - MCA_NONFATAL
SAL reported a recoverable MCA and we don't support currently
support recovery or SAL generated an MCA and then couldn't
produce an error record.
0xB - INIT_ASSERT
0xC - INIT_GET_STATEINFO
SAL returned an error for SAL_GET_STATEINFO while processing INIT event.
0xD - INIT_CLEAR_STATEINFO
SAL returned an error for SAL_CLEAR_STATEINFO while processing INIT event.
0xE - INIT_FATAL
Not used.
2 - Address of log
3 - Size of log
4 - Error code in the case of x_GET_STATEINFO or x_CLEAR_STATEINFO
AMD64 Processors
1 - Bank number
2 - Address of MCA_EXCEPTION structure
3 - High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
4 - Low 32 bits of MCi_STATUS MSR for the MCA bank that had the error
Arguments:
Arg1: 00000000
Arg2: 808a07a0
Arg3: be000300
Arg4: 1008081f
Debugging Details:
------------------
NOTE: This is a hardware error. This error was reported by the CPU
via Interrupt 18. This analysis will provide more information about
the specific error. Please contact the manufacturer for additional
information about this error and troubleshooting assistance.
This error is documented in the following publication:
- IA-32 Intel(r) Architecture Software Developer's Manual
Volume 3: System Programming Guide
Bit Mask:
    MA                           Model Specific       MCA
 O  ID     Other Information       Error Code      Error Code
VV  SDP ___________|____________ _______|_______ _______|______
AEUECRC|                        |               |
LRCNVVC|                        |               |
^^^^^^^|                        |               |
   6         5         4         3         2         1
3210987654321098765432109876543210987654321098765432109876543210
----------------------------------------------------------------
1011111000000000000000110000000000010000000010000000100000011111
VAL - MCi_STATUS register is valid
Indicates that the information contained within the IA32_MCi_STATUS
register is valid. When this flag is set, the processor follows the
rules given for the OVER flag in the IA32_MCi_STATUS register when
overwriting previously valid entries. The processor sets the VAL
flag and software is responsible for clearing it.
UC - Error Uncorrected
Indicates that the processor did not or was not able to correct the
error condition. When clear, this flag indicates that the processor
was able to correct the error condition.
EN - Error Enabled
Indicates that the error was enabled by the associated EEj bit of the
IA32_MCi_CTL register.
MISCV - IA32_MCi_MISC Register Valid
Indicates that the IA32_MCi_MISC register contains additional
information regarding the error. When clear, this flag indicates
that the IA32_MCi_MISC register is either not implemented or does
not contain additional information regarding the error.
ADDRV - IA32_MCi_ADDR register valid
Indicates that the IA32_MCi_ADDR register contains the address where
the error occurred.
PCC - Processor Context Corrupt
Indicates that the state of the processor might have been corrupted
by the error condition detected and that reliable restarting of the
processor may not be possible.
BUSCONNERR - Bus and Interconnect Error BUS{LL}_{PP}_{RRRR}_{II}_{T}_err
These errors match the format 0000 1PPT RRRR IILL
Concatenated Error Code:
--------------------------
_VAL_UC_EN_MISCV_ADDRV_PCC_BUSCONNERR_1F
This error code can be reported back to the manufacturer.
They may be able to provide additional information based upon
this error. All questions regarding STOP 0x9C should be
directed to the hardware manufacturer.
BUGCHECK_STR: 0x9C_IA32_GenuineIntel
DEFAULT_BUCKET_ID: DRIVER_FAULT
PROCESS_NAME: Idle
CURRENT_IRQL: 2
LAST_CONTROL_TRANSFER: from 80a7fbd8 to 8087b6be
STACK_TEXT:
f773d280 80a7fbd8 0000009c 00000000 f773d2b0 nt!KeBugCheckEx+0x1b
f773d3b4 80a7786f f7737fe0 00000000 00000000 hal!HalpMcaExceptionHandler+0x11e
f773d3b4 f75a9ca2 f7737fe0 00000000 00000000 hal!HalpMcaExceptionHandlerWrapper+0x77
f78c6d50 8083abf2 00000000 0000000e 00000000 intelppm!AcpiC1Idle+0x12
f78c6d54 00000000 0000000e 00000000 00000000 nt!KiIdleLoop+0xa
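The bit mask in the output above can be reproduced by hand. The following is a minimal sketch, assuming Arg3 and Arg4 are the high and low halves of MCi_STATUS as the bugcheck description states. The flag positions (bits 63 to 57) follow the layout shown in the bit mask; the function names are hypothetical.

```python
# Flag bits of IA32_MCi_STATUS, from bit 63 down to bit 57.
FLAGS = [("VAL", 63), ("OVER", 62), ("UC", 61), ("EN", 60),
         ("MISCV", 59), ("ADDRV", 58), ("PCC", 57)]


def decode_mci_status(high32: int, low32: int) -> dict:
    """Split a 64-bit MCi_STATUS value into the fields shown in the bit mask."""
    status = (high32 << 32) | low32
    return {
        "status": status,
        "flags": [name for name, bit in FLAGS if (status >> bit) & 1],
        "other_info": (status >> 32) & 0x1FFFFFF,        # bits 56..32
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31..16
        "mca_error_code": status & 0xFFFF,               # bits 15..0
    }


def is_bus_interconnect_error(mca_error_code: int) -> bool:
    """Match the format 0000 1PPT RRRR IILL quoted in the output."""
    return (mca_error_code & 0xF800) == 0x0800


# Arg3 = be000300, Arg4 = 1008081f
decoded = decode_mci_status(0xBE000300, 0x1008081F)
print(format(decoded["status"], "064b"))
print("_" + "_".join(decoded["flags"]))
print(hex(decoded["mca_error_code"]),
      is_bus_interconnect_error(decoded["mca_error_code"]))
```

For these arguments the flag string comes out as _VAL_UC_EN_MISCV_ADDRV_PCC, which agrees with the flag part of the concatenated error code reported by WinDbg.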
- Dmitry Vostokov @ DumpAnalysis.org -
March 15th, 2010 at 12:15 am
Another possible sign of a hardware error: frequent, multiple, unrelated bugchecks and/or bugchecks in memory dumps with valid instructions at the faulting IP. Beware also of a misaligned IP, which can look like a valid instruction.
March 16th, 2010 at 10:38 pm
[…] Most fault IPs were showing signs of Wild Code pattern and that most probably implicated Hardware Error (Looks like WinDbg suggests that MISALIGNED_IP implicates hardware). Here is the listing of […]
June 4th, 2010 at 11:43 pm
[…] wouldn’t be so quick. Check Hardware Error pattern post and comments there. So let’s de-analyze the analysis. “c0000005 is Access […]
January 22nd, 2013 at 11:48 pm
Another example is this:
1: kd> dt -r _WHEA_ERROR_RECORD fffffa8004b46748
hal!_WHEA_ERROR_RECORD
   +0x000 Header : _WHEA_ERROR_RECORD_HEADER
      +0x000 Signature : 0x52455043
      +0x004 Revision : _WHEA_REVISION
         +0x000 MinorRevision : 0x10 ''
         +0x001 MajorRevision : 0x2 ''
         +0x000 AsUSHORT : 0x210
      +0x006 SignatureEnd : 0xffffffff
      +0x00a SectionCount : 3
      +0x00c Severity : 1 ( WheaErrSevFatal )
      +0x010 ValidBits : _WHEA_ERROR_RECORD_HEADER_VALIDBITS
         +0x000 PlatformId : 0y0
         +0x000 Timestamp : 0y1
         +0x000 PartitionId : 0y0
         +0x000 Reserved : 0y00000000000000000000000000000 (0)
         +0x000 AsULONG : 2
      +0x014 Length : 0x3a0
      +0x018 Timestamp : _WHEA_TIMESTAMP
         +0x000 Seconds : 0y00100010 (0x22)
         +0x000 Minutes : 0y00101011 (0x2b)
         +0x000 Hours : 0y00001100 (0xc)
         +0x000 Precise : 0y0
         +0x000 Reserved : 0y0000000 (0)
         +0x000 Day : 0y00010110 (0x16)
         +0x000 Month : 0y00000100 (0x4)
         +0x000 Year : 0y00001010 (0xa)
         +0x000 Century : 0y00010100 (0x14)
         +0x000 AsLARGE_INTEGER : _LARGE_INTEGER 0x140a0416`000c2b22
      +0x020 PlatformId : _GUID {00000000-0000-0000-0000-000000000000}
         +0x000 Data1 : 0
         +0x004 Data2 : 0
         +0x006 Data3 : 0
         +0x008 Data4 : [8] ""
      +0x030 PartitionId : _GUID {00000000-0000-0000-0000-000000000000}
         +0x000 Data1 : 0
         +0x004 Data2 : 0
         +0x006 Data3 : 0
         +0x008 Data4 : [8] ""
      +0x040 CreatorId : _GUID {cf07c4bd-b789-4e18-b3c4-1f732cb57131}
         +0x000 Data1 : 0xcf07c4bd
         +0x004 Data2 : 0xb789
         +0x006 Data3 : 0x4e18
         +0x008 Data4 : [8] "???"
      +0x050 NotifyType : _GUID {e8f56ffe-919c-4cc5-ba88-65abe14913bb}
         +0x000 Data1 : 0xe8f56ffe
         +0x004 Data2 : 0x919c
         +0x006 Data3 : 0x4cc5
         +0x008 Data4 : [8] "???"
      +0x060 RecordId : 0x01cae219`673474d3
      +0x068 Flags : _WHEA_ERROR_RECORD_HEADER_FLAGS
         +0x000 Recovered : 0y0
         +0x000 PreviousError : 0y1
         +0x000 Simulated : 0y0
         +0x000 Reserved : 0y00000000000000000000000000000 (0)
         +0x000 AsULONG : 2
      +0x06c PersistenceInfo : _WHEA_PERSISTENCE_INFO
         +0x000 Signature : 0y0000000000000000 (0)
         +0x000 Length : 0y000000000000000000000000 (0)
         +0x000 Identifier : 0y0000000000000000 (0)
         +0x000 Attributes : 0y00
         +0x000 DoNotLog : 0y0
         +0x000 Reserved : 0y00000 (0)
         +0x000 AsULONGLONG : 0
      +0x074 Reserved : [12] ""
   +0x080 SectionDescriptor : [1] _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR
      +0x000 SectionOffset : 0x158
      +0x004 SectionLength : 0xc0
      +0x008 Revision : _WHEA_REVISION
         +0x000 MinorRevision : 0x1 ''
         +0x001 MajorRevision : 0x2 ''
         +0x000 AsUSHORT : 0x201
      +0x00a ValidBits : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_VALIDBITS
         +0x000 FRUId : 0y0
         +0x000 FRUText : 0y0
         +0x000 Reserved : 0y000000 (0)
         +0x000 AsUCHAR : 0 ''
      +0x00b Reserved : 0 ''
      +0x00c Flags : _WHEA_ERROR_RECORD_SECTION_DESCRIPTOR_FLAGS
         +0x000 Primary : 0y1
         +0x000 ContainmentWarning : 0y0
         +0x000 Reset : 0y0
         +0x000 ThresholdExceeded : 0y0
         +0x000 ResourceNotAvailable : 0y0
         +0x000 LatentError : 0y0
         +0x000 Reserved : 0y00000000000000000000000000 (0)
         +0x000 AsULONG : 1
      +0x010 SectionType : _GUID {9876ccad-47b4-4bdb-b65e-16f193c4f3db}
         +0x000 Data1 : 0x9876ccad
         +0x004 Data2 : 0x47b4
         +0x006 Data3 : 0x4bdb
         +0x008 Data4 : [8] "???"
      +0x020 FRUId : _GUID {00000000-0000-0000-0000-000000000000}
         +0x000 Data1 : 0
         +0x004 Data2 : 0
         +0x006 Data3 : 0
         +0x008 Data4 : [8] ""
      +0x030 SectionSeverity : 1 ( WheaErrSevFatal )
      +0x034 FRUText : [20] ""
February 18th, 2013 at 10:30 pm
KERNEL_STACK_INPAGE_ERROR (77)
The requested page of kernel data could not be read in. Caused by
bad block in paging file or disk controller error.
In the case when the first arguments is 0 or 1, the stack signature
in the kernel stack was not found. Again, bad hardware.
An I/O status of c000009c (STATUS_DEVICE_DATA_ERROR) or
C000016AL (STATUS_DISK_OPERATION_FAILED) normally indicates
the data could not be read from the disk due to a bad
block. Upon reboot autocheck will run and attempt to map out the bad
sector. If the status is C0000185 (STATUS_IO_DEVICE_ERROR) and the paging
file is on a SCSI disk device, then the cabling and termination should be
checked. See the knowledge base article on SCSI termination.
Arguments:
Arg1: 0000000000000001, (page was retrieved from disk)
Arg2: fffffa800818e870, value found in stack where signature should be
Arg3: 0000000000000000, 0
Arg4: fffff8800c6e5e80, address of signature on kernel stack
2: kd> k
Child-SP RetAddr Call Site
fffff880`0371da18 fffff800`03110b01 nt!KeBugCheckEx
fffff880`0371da20 fffff800`030c8c54 nt! ?? ::FNODOBFM::`string'+0x51e31
fffff880`0371db30 fffff800`030c8bef nt!MmInPageKernelStack+0x40
fffff880`0371db90 fffff800`030c8928 nt!KiInSwapKernelStacks+0x1f
fffff880`0371dbc0 fffff800`0332be5a nt!KeSwapProcessOrStack+0x84
fffff880`0371dc00 fffff800`03085d26 nt!PspSystemThreadStartup+0x5a
fffff880`0371dc40 00000000`00000000 nt!KiStartSystemThread+0x16
October 4th, 2016 at 5:28 pm
For WHEA_UNCORRECTABLE_ERROR (124) we have additional WinDbg commands !whea, !errrec, and !errpkt:
2: kd> !whea
Error Source Table @ fffff8004bbd4a90
4 Error Sources
Error Source 0 @ ffffe00014376bd0
Notify Type : {14374010-e000-ffff-984a-bd4b00f8ffff}
Type : 0x0 (MCE)
Error Count : 1
Record Count : 4
Record Length : 728
Error Records : wrapper @ ffffe000110e0000 record @ ffffe000110e0028
: wrapper @ ffffe000110e0728 record @ ffffe000110e0750
: wrapper @ ffffe000110e0e50 record @ ffffe000110e0e78
: wrapper @ ffffe000110e1578 record @ ffffe000110e15a0
Descriptor : @ ffffe00014376c29
Length : 3cc
Max Raw Data Length : 141
Num Records To Preallocate : 4
Max Sections Per Record : 4
Error Source ID : 0
Flags : 00000000
[…]
2: kd> !errrec ffffe000110e0028
============================================
Common Platform Error Record @ ffffe000110e0028
-------------------------------------------------------------------------------
Record Id : 01d21a1a7e5fffd1
Severity : Fatal (1)
Length : 928
Creator : Microsoft
Notify Type : Machine Check Exception
Timestamp : 9/30/2016 9:05:50 (UTC)
Flags : 0x00000000
============================================
Section 0 : Processor Generic
-------------------------------------------------------------------------------
Descriptor @ ffffe000110e00a8
Section @ ffffe000110e0180
Offset : 344
Length : 192
Flags : 0x00000001 Primary
Severity : Fatal
Proc. Type : x86/x64
Instr. Set : x64
Error Type : Micro-Architectural Error
Flags : 0x00
CPU Version : 0x00000000000306a9
Processor ID : 0x0000000000000002
============================================
Section 1 : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor @ ffffe000110e00f0
Section @ ffffe000110e0240
Offset : 536
Length : 128
Flags : 0x00000000
Severity : Fatal
Local APIC Id : 0x0000000000000002
CPU Id : a9 06 03 00 00 08 10 02 - bf e3 ba 7f ff fb eb bf
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
Proc. Info 0 @ ffffe000110e0240
============================================
Section 2 : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor @ ffffe000110e0138
Section @ ffffe000110e02c0
Offset : 664
Length : 264
Flags : 0x00000000
Severity : Fatal
Error : Internal unclassified (Proc 2 Bank 4)
Status : 0xb200000000100402
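The CPU Version value in Section 0 can be unpacked to identify the processor. The sketch below assumes that this field holds the CPUID leaf 1 EAX signature (the first bytes of the CPU Id field, a9 06 03 00, are consistent with that) and uses the standard CPUID field layout with the Intel convention for the displayed family and model. The function name is hypothetical.

```python
def decode_cpuid_signature(eax: int) -> dict:
    """Split a CPUID leaf 1 EAX value into family, model and stepping."""
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended fields only contribute for certain base family values.
    display_family = family + ext_family if family == 0xF else family
    display_model = (ext_model << 4) | model if family in (0x6, 0xF) else model
    return {"family": display_family, "model": display_model, "stepping": stepping}


# CPU Version : 0x00000000000306a9
sig = decode_cpuid_signature(0x306A9)
print(f"family {sig['family']:#x}, model {sig['model']:#x}, stepping {sig['stepping']:#x}")
```

Knowing the exact family, model, and stepping helps when reporting the error to the hardware manufacturer.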
April 23rd, 2018 at 7:49 pm
Recently we observed internal errors in the Visual C++ compiler, followed by memory management bugchecks a few seconds later.