On Sat, 7 Feb 2015, Joshua Kinard wrote:
> I've had my Onyx2 running quite a bit lately doing compile runs, and it seems
> that after about ~16 hours, there's a random possibility that the machine just
> completely stops. No errors printed anywhere, serial becomes completely
> unresponsive. I have to issue a 'rst' from the MSC to bring it back up again.
If the uptime before the hang is always similar, then one possibility is a
counter wraparound or suchlike that is not handled correctly (i.e. the carry
out of the topmost bit is not taken into account), causing a kernel deadlock.
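To illustrate the kind of bug I mean, here is a minimal sketch, assuming a
free-running 32-bit tick counter and a deadline compared against it. The
function names are made up for illustration and are not taken from the
kernel. The naive comparison never fires if the deadline sits just below
the wrap point and the check only happens after the counter has wrapped,
so the waiter stalls for almost a full counter period. The second form
compares the difference as a signed quantity, which is the same idea the
kernel's time_after() macros use for jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: wrong once the counter has wrapped past the deadline. */
static bool expired_naive(uint32_t now, uint32_t deadline)
{
    return now >= deadline;
}

/*
 * Wrap-safe check: take the unsigned difference modulo 2^32 and look at
 * its sign.  Valid as long as the two values are less than 2^31 ticks
 * apart.  The conversion to int32_t assumes the usual two's complement
 * behaviour.
 */
static bool expired_safe(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}
```

With deadline = 0xfffffff0 and now = 0x00000005 (21 ticks past the deadline,
after a wrap) the naive check reports "not expired", whereas the wrap-safe
one reports "expired".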
> It's currently got dual IP31 R14000 node boards (500MHz), and for the most
> part, runs great (I'll regret the electric bill later...). Clearly a bug,
> though, but I am not sure where to start debugging on this platform to find
> this bug, since I can't trigger it manually. Even tried an NMI interrupt,
> since this machine has an NMI handler in the kernel, but all that does is
> the machine.
The NMI exception is routed to the same vector as reset, so firmware would
have to tell them apart (using the CP0.Status.NMI bit) and then call a
supplied handler. Perhaps there's a way to register such a handler
with the firmware -- does the kernel do it? You could then use the
handler to examine the kernel state and perhaps dump it somehow.
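As a rough sketch of the check the firmware would have to make at the reset
vector, assuming the usual Status layout with NMI at bit 19 and SR at bit 20
(please verify against the processor manual for your CPU), the decision
could look like this. The enum and function names are hypothetical, and the
Status value is passed in as a plain argument so that the logic can be
tried out on any host; real firmware would read it with mfc0.

```c
#include <stdint.h>

/* CP0 Status bits; positions are an assumption, check the CPU manual. */
#define ST0_NMI (UINT32_C(1) << 19)  /* set on entry due to an NMI */
#define ST0_SR  (UINT32_C(1) << 20)  /* set on entry due to a soft reset */

enum reset_cause { CAUSE_COLD_RESET, CAUSE_SOFT_RESET, CAUSE_NMI };

/* Classify the event from the Status value seen at the reset vector. */
static enum reset_cause classify_reset(uint32_t status)
{
    if (status & ST0_NMI)
        return CAUSE_NMI;         /* a registered handler could be called here */
    if (status & ST0_SR)
        return CAUSE_SOFT_RESET;  /* register state from before the event survives */
    return CAUSE_COLD_RESET;      /* power-up: nothing useful to dump */
}
```

Note that this does not apply to every MIPS implementation: some older
processors have no separate NMI bit and report NMI through SR alone, in
which case the two cannot be told apart this way.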
On MIPS processors an NMI or even a reset event does not clobber any
registers except for the CP0 ErrorEPC register, where the PC at the time
the event happened is stored, some bits in the CP0 Status register (ERL,
BEV, etc.), and of course the PC. So, alternatively, does the firmware have
some way to dump registers on reset or NMI?
For example, R4k DECstations dump registers automatically when the reset
button is pressed while the machine is operating normally (a power-up
reset can be told apart by the state of the CP0.Status.SR bit).