
Re: Help with decoding a NMI Watchdog interrupt on an Octeon

To: Jan Rovins <>
Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon
From: "Kevin D. Kissell" <>
Date: Thu, 17 Jun 2010 14:26:49 -0700
User-agent: Thunderbird (X11/20100317)

NMI is just an input pin, so you'd really need to know what it's connected to in the system you're working on. Typically, it's tied to some kind of memory bus time-out, but it could be other things. Depending on what it's hooked up to, knowing what code was executing when it came in may be completely useless.

*If* it's hooked up to a bus time-out, *and* the instruction that caused the time-out was a load, *and* the time-out and NMI occurred *after* the processor got to the instruction that consumes the load value (pretty likely if the first two conditions are met), *then* it's worth looking at the disassembled kernel code (mips-linux-objdump --disassemble vmlinux) at the ErrorEPC address, *not* the address in EPC, which will have latched the address of the last recoverable exception (which NMI is not, strictly speaking). The instruction at ErrorEPC should be the consumer of the bad load, so one of its input registers should be the target of that load. If it's a two-input instruction, e.g. add r1,r2,r3, then it could be either r2 or r3, and you have to work your way backwards up the code flow to find out where the r2 and r3 values came from, respectively. *Usually* it's possible to identify the load, and thus the register used as a base address, and see that the base address register was trashed, at which point you can start forming hypotheses as to how that could have happened.
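
If objdump isn't handy, the register-hunting step can be roughed out by hand from the raw instruction words that the dump prints. The following is only a minimal sketch in Python: the names REG_NAMES and decode_fields are made up for illustration, it pulls out the standard MIPS field positions and nothing more, and it is no substitute for a real disassembler. The sample word is the one listed at the epc address in the dump below, used purely as an example, since ErrorEPC is not shown.

```python
# Register names in the n32/n64 convention, matching the dump (a4..a7 present).
REG_NAMES = [
    "$0", "at", "v0", "v1", "a0", "a1", "a2", "a3",
    "a4", "a5", "a6", "a7", "t0", "t1", "t2", "t3",
    "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7",
    "t8", "t9", "k0", "k1", "gp", "sp", "s8", "ra",
]


def decode_fields(word):
    """Split a 32-bit MIPS instruction word into its raw fields.

    The dump prints each instruction zero-extended to 64 bits, so only the
    low 32 bits are kept. No attempt is made to name the instruction; the
    caller has to interpret the opcode.
    """
    word &= 0xFFFFFFFF
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {
        "opcode": word >> 26,
        "rs": REG_NAMES[(word >> 21) & 0x1F],  # base register for loads/stores
        "rt": REG_NAMES[(word >> 16) & 0x1F],  # load destination, or 2nd input
        "rd": REG_NAMES[(word >> 11) & 0x1F],  # destination for R-type (add etc.)
        "imm": imm,
    }


# Word shown at 0xffffffff802b10b8 in the dump.
fields = decode_fields(0x0000000080620000)
print(fields)  # opcode 0x20 is lb, so this reads as: lb v0, 0(v1)
```

For a load, rs is the base address register and rt is the register being loaded; for a two-input instruction like add, rs and rt are the two inputs to trace back.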

Of course, in the dump below, we don't see ErrorEPC. I've never been able to figure out why so many kernel register dumps skip that register, especially for NMI reporting. But unless you're able to reproduce this with a kernel that you build yourself, so that you can fix the instrumentation, it's going to be tough. So "Plan B" would be to make sure that any removable memory DIMMs have been properly seated, and double-check that the actual memory capacity corresponds to whatever boot parameters are being passed to the kernel. In other words, if you can't debug the kernel, pray that it's a hardware or operator error. ;o)


         Kevin K.

Jan Rovins wrote:
Hi, I need some tips on how to go about deciphering the following NMI dump.

This is from a kernel that came with the Cavium Networks 1.8.1 toolchain. Is there any way to get some kind of back trace from this, or just find out which function it was in?

I have been playing around with objdump -x vmlinux but I can't zero in on anything this way.

Thanks in advance,

*** NMI Watchdog interrupt on Core 0x6 ***
       $0      0x0000000000000000      at      0x000000001010cce0
       v0      0x000000000000003d      v1      0x000000000000024a
       a0      0xffffffff807d7b70      a1      0x0000000000000000
       a2      0x000000000000024a      a3      0x0000000000000000
       a4      0xffffffff807d7b60      a5      0x0000000000000080
       a6      0x0000000000000001      a7      0xa800000411c62578
       t0      0x0000000000000001      t1      0xa80000048ef3e880
       t2      0xffffffff82d40000      t3      0xa80000041f48c000
       s0      0xc0000000000d9640      s1      0xc000000000088028
       s2      0x0000000000000000      s3      0x0000000000000180
       s4      0x0000000000000000      s5      0x0000000000000000
       s6      0xb7a89c196f513832      s7      0x0000000000000000
       t8      0xffffffff807d0000      t9      0xffffffff807d0000
       k0      0x0000000000000000      k1      0x00000000104dbcbf
       gp      0xa80000041f48c000      sp      0xa80000041f48fcf0
       s8      0x0000000000000000      ra      0xc0000000023c5004
       epc     0xffffffff802b10b8
       status  0x000000001058cce4      cause   0x0000000040008c08
       sum0    0x0000002100000000      en0     0x0000009300008000
Code around epc
       0xffffffff802b10a8      000000002406ffff
       0xffffffff802b10ac      0000000064a5ffff
       0xffffffff802b10b0      0000000010a60005
       0xffffffff802b10b4      0000000000000000
       0xffffffff802b10b8      0000000080620000
       0xffffffff802b10bc      000000001440fffb
       0xffffffff802b10c0      0000000064630001
       0xffffffff802b10c4      000000006463ffff
       0xffffffff802b10c8      0000000003e00008
