linux-mips

Help with OOPSes, anyone?

To: linux-mips@oss.sgi.com
Subject: Help with OOPSes, anyone?
From: Matthew Dharm <mdharm@momenco.com>
Date: Sun, 27 Jan 2002 00:22:42 -0800
Organization: Momentum Computer, Inc.
Sender: owner-linux-mips@oss.sgi.com
User-agent: Mutt/1.2.5i
So, I'm back to trying to get Linux running on our boards, and I'm having a
problem I'd like some help with...

(See http://www.momenco.com/products/ocelot if you're interested.)

At one point, we had 2.4.5 working on this board.  That was the result of
joint work between myself, RedHat, and MontaVista.  Seemed to work pretty
well, given that we had basically no userspace to work with.  So, while it
seemed functional, it's not much of a datapoint.

So, I did a cvs update, and tried to build 2.4.17.  Lo and behold, it still
compiles.  And it even runs.  And the RedHat 7.1 userspace from oss.sgi.com
even seems to work, mostly.

But, under certain conditions, the kernel OOPSes.  Attached to this message
are a few of those OOPSes (serial console is wonderful!) along with the
ksymoops output.  I think the read_lsmod() warning is bogus, because there
are, actually, no modules loaded.

My instincts are telling me that these are all being caused by the same
problem, but I'll be damned if I can figure out what that is.  Caching is a
good suspect, but that's just because it's always a good suspect.

What does work is a ping -f to the board, so I think the interrupt code
is rock-solid.  It's pretty simple anyway, so I'm not surprised that it
isn't a problem.

While we did have some problems supporting our boards that have 512MB on
them, the board I'm testing with has only 128MB.  Dealing with the 512MB
problems (which, I understand, are caused by the cache-flush routines
trying to flush too much) is on the TODO list, but I don't think those
problems are affecting me right now.

Load does seem to be a factor.  Big compiles, lots of NFS traffic, that
sort of thing seems to trigger it.  It's easy to duplicate, given a few
minutes.  Oddly enough, light loads or an idle system don't trigger this:
I FTPed the SRPM for wget and built it without any problems.  Heck, it
even works!  But when I try to build something bigger (say, ncftp or
glibc), it dies an ugly death.  Heck, I could FTP, build, and use ksymoops
natively on the board without any problems.  Lots of things work fine,
but sometimes....

I know Ralf has one of our boards, but I don't know if he's in the same
country as that board... Ralf?

Of these OOPSes, one is caused by some code in unaligned.c.  I've seen
several (many) like this, though I only captured and decoded one.  The code in
question seems to be one of those "you can never get here" situations,
which makes me really worry.  The other two look like some form of
NULL-pointer dereference.

I've tried several different configurations, stripping down drivers as I
go.  I'm going to continue that tomorrow/Monday, but I don't have high
hopes that will fix the problem.  I'm thinking about trying
CONFIG_MIPS_UNCACHED, but I don't know if that works on an RM7000 processor:
the L1 and L2 are built into the processor, and I don't think the L1
can be deactivated.  Then again, I don't know how CONFIG_MIPS_UNCACHED
works, so if someone would like to clue me in on the subject, I'd really
appreciate it.

Another thing I'm going to try is an ELF image from that RedHat/MontaVista
work, to see if it shows the same problems.  That datapoint will also be
useful.

If anyone has any clues as to what's going on, I'd really appreciate it.
Anyone?

Matt

-- 
Matthew Dharm                              Work: mdharm@momenco.com
Senior Software Designer, Momentum Computer

Attachment: oops.in
Description: Text document

Attachment: oops.out
Description: Text document

Attachment: oops2.in
Description: Text document

Attachment: oops2.out
Description: Text document

Attachment: oops3.in
Description: Text document

Attachment: oops3.out
Description: Text document
