
Re: Help with decoding a NMI Watchdog interrupt on an Octeon

To: Jan Rovins <janr@adax.com>
Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon
From: David Daney <ddaney@caviumnetworks.com>
Date: Mon, 21 Jun 2010 09:22:08 -0700
Cc: "'Kevin D. Kissell'" <kevink@paralogos.com>, linux-mips@linux-mips.org
In-reply-to: <438633685C664132B53157C81BE355DD@ZuniBear>
References: <4C1A8D86.60005@adax.com> <4C1A9319.1020202@paralogos.com> <4C1A98EC.1030708@caviumnetworks.com> <4C1D16F0.2090102@adax.com> <438633685C664132B53157C81BE355DD@ZuniBear>
Sender: linux-mips-bounce@linux-mips.org
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100430 Fedora/3.0.4-2.fc12 Thunderbird/3.0.4
On 06/20/2010 10:55 PM, Jan Rovins wrote:
Some additions & corrections to the previous:

-----Original Message-----
From: linux-mips-bounce@linux-mips.org [mailto:linux-mips-bounce@linux-mips.org] On Behalf Of Jan Rovins
Sent: Saturday, June 19, 2010 3:14 PM
To: David Daney
Cc: Kevin D. Kissell; linux-mips@linux-mips.org
Subject: Re: Help with decoding a NMI Watchdog interrupt on an Octeon

David Daney wrote:
On 06/17/2010 02:26 PM, Kevin D. Kissell wrote:
NMI is just an input pin, so you'd really need to know what it's
connected to in the system you're working on.

In this case, the NMI is likely being asserted by the watchdog.  So if
you are stuck in a loop with interrupts disabled, the register dump
might help you figure out where things are stuck.  But as you say
below, knowing the value of the ErrorEPC register is critical.
Thank you David & Kevin for the detailed information.

Yes, in my case it's the watchdog: when I turn the watchdog off, the
machine just hangs, with no NMI dump.

OK, I added code to print out the ErrorEPC, and got:
ErrorEpc        0xc0000000023c5004
This address is not in vmlinux, but is the address of a loaded module.

So, I poked around in /sys/module/ until I found one that had that
address range:
cat /sys/module/linux_bcm_core/sections/.text :0xc000000001c4e000
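The lookup above can also be automated. What follows is only a minimal sketch, assuming /proc/modules is readable and uses the usual "name size refcount deps state address" layout (addresses may read as zero without root on newer kernels). The sample text, the module sizes and the name other_mod are made up for illustration, and the base address reported in /proc/modules is not guaranteed to equal the .text section address.

```python
def find_module(proc_modules_text, addr):
    """Return (name, offset) for the module whose load range holds addr, else None."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name = fields[0]
        size = int(fields[1])        # module size in bytes
        base = int(fields[5], 16)    # load address, printed as 0x...
        if base <= addr < base + size:
            return name, addr - base
    return None


# Hypothetical /proc/modules contents; sizes and other_mod are invented.
SAMPLE = """\
other_mod 65536 2 linux_bcm_core, Live 0xc000000001c30000
linux_bcm_core 8388608 0 - Live 0xc000000001c4e000
"""

if __name__ == "__main__":
    # On a live system, pass open("/proc/modules").read() instead of SAMPLE.
    print(find_module(SAMPLE, 0xc0000000023c5004))
```

Scanning address ranges this way avoids checking each directory under /sys/module/ by hand, at the cost of relying on the size field to bound each module.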

And then did an objdump on this module. Since the module dump did not
contain the actual addresses that it was running from, I doctored up the
offsets by using the .text address from /sys/module/ of where the module
actually loaded.
objdump.cavium -d --adjust-vma 0xc000000001c4e000  linux-bcm-core.ko
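As a cross-check on the --adjust-vma approach, the same instruction can be located by offset instead. This sketch just subtracts the .text load address from the ErrorEPC value; it assumes .text starts at address 0 in the unadjusted dump of the .ko, which is typical for a relocatable object but worth confirming with objdump -h.

```python
ERROR_EPC = 0xc0000000023c5004   # value printed in the NMI dump
TEXT_BASE = 0xc000000001c4e000   # from /sys/module/linux_bcm_core/sections/.text


def text_offset(addr, text_base):
    """Offset of addr from the start of the module's loaded .text section."""
    if addr < text_base:
        raise ValueError("address is below the section base")
    return addr - text_base


if __name__ == "__main__":
    print(hex(text_offset(ERROR_EPC, TEXT_BASE)))
```

If that assumption holds, the resulting offset should be the address of the same instruction in a plain objdump -d listing without --adjust-vma.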


When looking at kernel modules, it can be helpful to show the relocations as well, so add '-r' to your objdump command line...

Does all this sound correct so far? Is my objdump valid with the .text
offset?

I got a hit on the ErrorEPC value in my dump:
c0000000023c5004:       08000000        j       c000000000000000 <sal_dma_alloc-0x1c4e000>


... Once you turn on display of relocations, you can see where the jump is really going. The relocations are applied by the kernel when loading the module.
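To illustrate why the unrelocated jump shows a target of c000000000000000: the word 08000000 is a MIPS j instruction whose 26-bit target field is still zero, and the disassembler combines that field with the upper bits of the address of the delay slot. The sketch below decodes it the same way; decode_j is a name made up for this example and is not part of objdump or the kernel.

```python
def decode_j(word, pc):
    """Compute the destination of a MIPS 'j' instruction located at address pc."""
    opcode = word >> 26
    if opcode != 0x02:
        raise ValueError("not a j instruction")
    index = word & 0x03ffffff          # 26-bit target field
    # The target lies in the 256 MB region containing the delay slot (pc + 4).
    region = (pc + 4) & ~0x0fffffff
    return region | (index << 2)


if __name__ == "__main__":
    print(hex(decode_j(0x08000000, 0xc0000000023c5004)))
```

With the target field still zero, the result is just the base of the 256 MB region, which is why the disassembly shows a jump that is not meaningful until the relocation is applied.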



This line of code was inside a function called _default_assert, which on
assertion failure, did a printk() and went into an intentional infinite
loop, which explains the NMI dump. The only thing that puzzles me now is
that the assert failure printk was rarely displayed. Could that be because it
was called while interrupts were turned off? I suppose that would stop it
from showing up in /var/log/messages.

The assembly still does not make sense to me (first time with MIPS assembly)
but on examining the C code I think I understand what's going on here.


It seems like you may be onto the cause of the watchdog expiring; all that's left is to figure out how you get into this spot in the first place.

David Daney
