Crash Dump Analysis Patterns (Part 278, Linux)
This is a Linux pattern variant of the Windows Spiking Interrupts memory analysis pattern. The Windows pattern describes high interrupt and DPC activity that causes perceived freezes, response lag, or high kernel CPU time; the original pattern uses per-processor interrupt counts, DPC/interrupt time, and DPC queue data, and stresses comparisons with a normal system because raw counters depend on uptime.
On Linux, the closest mapping (including the associated Paratext) is:
Windows - Linux
Hardware interrupts - hard IRQs, vectors, MSI/MSI-X, per-device IRQ lines
DPC - softirq, tasklet, NAPI poll, threaded IRQ, sometimes workqueue
DPC delegate thread / idle-context DPC execution - ksoftirqd/N, irq/<irq>-<name>, kworker/*
!prcb, DPC counts, interrupt time - /proc/interrupts, /proc/softirqs, /proc/stat, crash> irq -s, stacks, logs
Note: The mapping of DPC delegate thread / idle-context DPC execution to kworker/* is an approximation. Workqueues run in kernel thread context and can execute deferred work similarly to how idle-context DPCs offload work from the interrupt path, but workqueues are a general-purpose deferral mechanism, not specifically an interrupt bottom-half mechanism. Unlike softirqs and tasklets, which exist primarily to defer interrupt handler work, a kworker stall may have nothing to do with interrupt pressure. When investigating spiking interrupt symptoms, kworker threads appearing in stacks should be treated as a secondary signal only, and their work items examined individually (via crash> bt on the kworker thread) to determine whether they originate from IRQ or softirq context before drawing conclusions about interrupt-driven CPU saturation.
As we see, Linux has no exact DPC object model. The strongest Linux analogy is when a CPU is consumed by hard IRQs and/or interrupt-related bottom-half work, such as softirqs, tasklets, NAPI polling, threaded IRQs, or interrupt-originated workqueue processing.
The crash tool irq command and its various options may show when a single CPU is absorbing a device interrupt stream:
crash> irq -s
           CPU0       CPU1
[...]
 77:     513461          0   ITS-MSI   virtio2-request
[...]
Then we check whether that CPU was also the crash CPU, a soft-lockup CPU, an RCU-stall CPU, or the CPU running ksoftirqd/N, using follow-up commands such as:
crash> ps | grep -E "ksoftirqd|irq/|kworker|rcu"
crash> runq
crash> bt -a
crash> bt -E
The last command, bt -E, searches the IRQ stacks (and, on x64, the exception stacks) for possible exception frames on the architectures that support it.
If ksoftirqd/N is running or runnable and the same CPU has also rapidly accumulated NET_RX, NET_TX, BLOCK, TIMER, or RCU softirq work, that is the Linux equivalent of DPC pressure. Softirqs are deferred interrupt work that can run right after an interrupt handler or from ksoftirqd; when the softirq processing limits are reached, the remaining pending softirqs are run from ksoftirqd. According to the kernel documentation, ksoftirqd/N executes softirq handlers when threaded or under heavy load, and irq/<irq>-<name> threads handle threaded interrupts. See: https://docs.kernel.org/admin-guide/kernel-per-CPU-kthreads.html
You can also see interrupt-pressure symptoms in the kernel log:
crash> log
Typical diagnostic messages include these fragments:
watchdog: BUG: soft lockup - CPU#N stuck
NMI watchdog: Watchdog detected hard LOCKUP on cpu N
rcu: INFO: rcu_sched detected stalls on CPUs/tasks
irq XX: nobody cared
Disabling IRQ #XX
NETDEV WATCHDOG: ... transmit queue timed out
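The fragments above can be searched for in one pass over a saved kernel log. The following is a minimal sketch, assuming the log was saved to a text file (for example, from dmesg or from redirected crash log output); the function name scan_klog is illustrative and not part of any tool:

```shell
# scan_klog [FILE...]
# Print kernel-log lines that contain any of the interrupt-pressure
# fragments listed above. Reads standard input when no file is given.
# The exit status is that of grep (1 when nothing matches).
scan_klog() {
    grep -E 'soft lockup|hard LOCKUP|detected stalls on CPUs/tasks|nobody cared|Disabling IRQ|NETDEV WATCHDOG' "$@"
}

# Hypothetical usage:
#   dmesg | scan_klog
#   scan_klog kernel.log
```

A hit only says that a watchdog or IRQ message exists; the CPU number in the message still has to be matched against the per-CPU interrupt counters and stacks.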
RCU stall logs are particularly useful because the kernel documentation explicitly lists CPUs looping with interrupts disabled, preemption disabled, bottom halves disabled, or periodic interrupt handlers taking too long as possible causes. It also says reproducible massive hard/soft interrupt cases can be narrowed using /proc/interrupts. See: https://docs.kernel.org/RCU/stallwarn.html
For a definition of RCU, see https://en.wikipedia.org/wiki/Read-copy-update
Below is a guide for collecting supplemental Paratext information from the live system:
// Hard IRQ distribution
cat /proc/interrupts
watch -n 1 cat /proc/interrupts
/proc/interrupts records the number of interrupts per CPU per I/O device and, on x64, also includes internal interrupts such as NMI, LOC, TLB, RES, and CAL. See: https://man7.org/linux/man-pages/man5/proc_interrupts.5.html
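Because the counters are cumulative since boot, the useful signal is how fast a row grows and on which CPU. The following is a minimal sketch, assuming two snapshots of /proc/interrupts saved a few seconds apart; the function name irq_delta and the snapshot file names are illustrative:

```shell
# irq_delta BEFORE AFTER
# For every interrupt row that grew between two /proc/interrupts
# snapshots, print: label, total delta, busiest CPU, the delta on that
# CPU, and the row description. Largest total delta comes first.
irq_delta() {
    awk '
        # Header row of each file: remember the CPU column names.
        FNR == 1 { ncpu = NF; for (i = 1; i <= NF; i++) cpu[i + 1] = $i; next }
        # First file: store the earlier counters.
        NR == FNR {
            for (i = 2; i <= ncpu + 1 && $i ~ /^[0-9]+$/; i++) before[$1, i] = $i
            next
        }
        # Second file: per-CPU deltas, their sum, and the busiest CPU.
        {
            total = 0; max = -1; maxcpu = ""
            for (i = 2; i <= ncpu + 1 && $i ~ /^[0-9]+$/; i++) {
                d = $i - before[$1, i]
                total += d
                if (d > max) { max = d; maxcpu = cpu[i] }
            }
            if (total <= 0) next
            desc = ""
            for (j = ncpu + 2; j <= NF; j++) desc = desc " " $j
            printf "%s %.0f %s %.0f%s\n", $1, total, maxcpu, max, desc
        }
    ' "$1" "$2" | sort -k2,2nr
}

# Hypothetical usage:
#   cat /proc/interrupts > /tmp/irq.1; sleep 5; cat /proc/interrupts > /tmp/irq.2
#   irq_delta /tmp/irq.1 /tmp/irq.2 | head
```

A row whose total delta is almost entirely on one CPU is the live-system counterpart of the single-CPU interrupt stream seen in the crash irq -s output.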
// SoftIRQ distribution
cat /proc/softirqs
watch -n 1 cat /proc/softirqs
Common rows include:
NET_RX - receive-side network pressure
NET_TX - transmit-side network pressure
BLOCK - block I/O completion pressure
IRQ_POLL - block polling pressure
TIMER - timer callback pressure
HRTIMER - high-resolution timer pressure
SCHED - scheduler/IPI/load-balancing pressure
RCU - RCU callback pressure
TASKLET - legacy driver deferred work
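To see whether one CPU carries most of a given softirq row, the per-CPU columns can be reduced to a share figure. The following is a minimal sketch, assuming the usual /proc/softirqs layout (a CPU header row followed by one row per softirq type); the function name softirq_skew is illustrative. The counters are cumulative since boot, so the result describes long-term skew, not the current rate:

```shell
# softirq_skew [FILE]
# For each softirq row with a non-zero total, print the total, the CPU
# with the highest count, and that CPU's share of the total in percent.
# FILE defaults to /proc/softirqs; a saved snapshot can be passed instead.
softirq_skew() {
    awk '
        # Header row: remember the CPU column names.
        NR == 1 { for (i = 1; i <= NF; i++) cpu[i + 1] = $i; next }
        {
            total = 0; max = -1; maxcpu = ""
            for (i = 2; i <= NF; i++) {
                total += $i
                if ($i + 0 > max) { max = $i + 0; maxcpu = cpu[i] }
            }
            if (total > 0)
                printf "%s total=%.0f busiest=%s share=%d%%\n", $1, total, maxcpu, int(100 * max / total)
        }
    ' "${1:-/proc/softirqs}"
}

# Hypothetical usage:
#   softirq_skew
#   softirq_skew /tmp/softirqs.snapshot
```

A high share on one CPU is only a hint. It becomes significant when the same CPU also shows a growing delta between snapshots and an active ksoftirqd/N thread.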
The /proc/stat softirq line reports the count of softirqs serviced since boot, and /proc/stat also reports CPU time spent servicing irqs and softirqs. See: https://www.kernel.org/doc/html/v6.9/filesystems/proc.html
// IRQ and softirq CPU time
awk '
/^cpu[0-9]/ {
    printf "%s irq_jiffies=%s softirq_jiffies=%s\n", $1, $7, $8
}' /proc/stat

while true; do
    date
    awk '/^cpu[0-9]/ {printf "%s irq=%s softirq=%s\n",$1,$7,$8}' /proc/stat
    sleep 1
done
High irq time points more toward hard interrupt handling. High softirq time points more toward deferred interrupt work such as NAPI, timers, block completions, scheduler softirqs, or RCU.
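The values printed by the commands above are cumulative, so "high" is easier to judge as a share of CPU time over an interval. The following is a minimal sketch, assuming two saved snapshots of /proc/stat and the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_irq_share is illustrative:

```shell
# cpu_irq_share BEFORE AFTER
# For each cpuN line, print the percentage of elapsed CPU time spent in
# hard IRQ and softirq handling between two /proc/stat snapshots.
cpu_irq_share() {
    awk '
        /^cpu[0-9]/ {
            # Total time is user..steal (fields 2-9); guest time is
            # already included in user and nice, so it is not added.
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            if (NR == FNR) { t0[$1] = total; i0[$1] = $7; s0[$1] = $8; next }
            dt = total - t0[$1]
            if (dt > 0)
                printf "%s irq=%.1f%% softirq=%.1f%%\n", $1, 100 * ($7 - i0[$1]) / dt, 100 * ($8 - s0[$1]) / dt
        }
    ' "$1" "$2"
}

# Hypothetical usage:
#   cat /proc/stat > /tmp/stat.1; sleep 5; cat /proc/stat > /tmp/stat.2
#   cpu_irq_share /tmp/stat.1 /tmp/stat.2
```

Since both numerator and denominator are in the same tick units, the percentages do not depend on the tick length.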
// IRQ affinity
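No commands are listed for this step, so the following is only a sketch of one common way to inspect affinity. It assumes the standard /proc/irq/<irq>/smp_affinity_list file and, on kernels that expose it, effective_affinity_list. The function name irq_affinity and its optional base-directory argument are illustrative:

```shell
# irq_affinity IRQ [BASEDIR]
# Print the configured CPU list for an IRQ and, when the kernel provides
# it, the effective CPU list. BASEDIR defaults to /proc/irq and exists
# only so the function can be pointed at a copied directory tree.
irq_affinity() {
    dir="${2:-/proc/irq}/$1"
    for f in smp_affinity_list effective_affinity_list; do
        if [ -r "$dir/$f" ]; then
            printf 'IRQ%s %s=%s\n' "$1" "$f" "$(cat "$dir/$f")"
        fi
    done
    return 0
}

# Hypothetical usage, for the IRQ number taken from /proc/interrupts:
#   irq_affinity 77
```

An IRQ whose affinity list contains a single CPU, and whose counters grow only on that CPU, matches the IRQ-affinity skew described in the summary.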
Summary:
Spiking Interrupt activity is suspected when response latency or apparent freezes coincide with disproportionate hard IRQ or softirq activity on one or more CPUs. In a core dump, the pattern appears as high per-CPU IRQ counters, IRQ-affinity skew, active IRQ/softirq/ksoftirqd/threaded-IRQ stacks, and possible watchdog or RCU-stall messages. In live /proc, the pattern appears as rapidly increasing deltas in /proc/interrupts, /proc/softirqs, and /proc/stat irq/softirq CPU time.
- Dmitry Vostokov @ DumpAnalysis.org + TraceAnalysis.org -