Re: schedule() BUG

To: <jsun@mvista.com>
Subject: Re: schedule() BUG
From: "Steve Scott" <steve.scott@pioneer-pdt.com>
Date: Mon, 6 Oct 2003 19:05:06 -0700
Cc: <linux-mips@linux-mips.org>, <craig.mautner@pioneer-pdt.com>
Original-recipient: rfc822;linux-mips@linux-mips.org
References: <FJEIIOCBFAIOIDNKLPFJCECODAAA.koji.kawachi@pioneer-pdt.com>
Sender: linux-mips-bounce@linux-mips.org
We tried the fault.c patch Jun suggested, but it didn't solve the problem we
were having with the BUG() in schedule(). The workaround for the Vr5432 bug
at the beginning of except_vec3_generic had already been installed.

While chasing the BUG() in schedule(), though, we ran across another BUG(),
this one in alloc_skb() in ...linux/net/core/skbuff.c:

    alloc_skb called nonatomically from interrupt 80117acc
    kernel BUG at skbuff.c:179!

We changed the way sock_init_data initializes the 'allocation' field and
were able to get past this one (see attached sock.c.patch). We're not sure
if this fix needs to be permanent, or if it's just a temporary workaround.
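
For readers following along, the condition behind that message can be sketched in user space. This is only a model, on the assumption that the check rejects an allocation that is allowed to sleep when it is requested from interrupt context. The flag names and values below are hypothetical stand-ins, not the kernel's definitions, and this is not the content of sock.c.patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for allocation flags; values are illustrative only. */
enum {
    MODEL_GFP_WAIT   = 0x1,            /* the allocation is allowed to sleep */
    MODEL_GFP_ATOMIC = 0x0,            /* must not sleep */
    MODEL_GFP_KERNEL = MODEL_GFP_WAIT  /* may sleep */
};

/*
 * Returns true when a request would trip a "called nonatomically from
 * interrupt" style check: the caller is in interrupt context and the
 * allocation mask permits sleeping.
 */
static bool model_alloc_would_bug(bool in_interrupt, unsigned gfp_mask)
{
    return in_interrupt && (gfp_mask & MODEL_GFP_WAIT) != 0;
}
```

Under this model, a socket whose 'allocation' field permits sleeping trips the check only if its allocation path is reached from an interrupt, which is why changing how the field is initialized can hide the symptom without explaining why the path ran in interrupt context.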

For the schedule() BUG(), all evidence that we collected pointed to some
interrupt causing us to reenter schedule() (i.e., somehow schedule() was
called during an interrupt handler). We suspected something being run
from the timer interrupt bottom half, but were never able to prove it. We
also thought a remote possibility might be a pipeline hazard in the MIPS
causing the EPC register not to update on a nested exception, but NEC says
that can't happen on the Vr5432 that we're using...

We finally worked around the schedule BUG() by disabling interrupts
during the context switch in schedule(). This workaround required changes
in linux/kernel/sched.c and linux/arch/mips/kernel/r4k_switch.S (see attached
patches).
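
The suspected race, and why masking interrupts across the switch hides it, can be illustrated with a small user-space model. This is a sketch of the hypothesis described in this thread, not the attached patches: every name below (model_task, model_switch_begin, model_run, and so on) is hypothetical, and the "interrupt" is simply a nested reschedule placed between the two halves of the switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mm   { int mm_count; };
struct model_task { struct model_mm *mm, *active_mm; };
struct model_cpu  { struct model_task *current; int bug_hits; };

/* First half of a switch: a kernel thread (mm == NULL) borrows the
 * previous task's active_mm. Returns false if the BUG() state is seen. */
static bool model_switch_begin(struct model_cpu *cpu, struct model_task *next)
{
    struct model_mm *oldmm = cpu->current->active_mm;
    if (next->mm == NULL) {
        if (next->active_mm != NULL) { cpu->bug_hits++; return false; }
        next->active_mm = oldmm;
        oldmm->mm_count++;
    }
    return true;
}

/* Second half: a kernel thread being switched away from gives back its
 * borrowed mm, and 'next' becomes the current task. */
static void model_switch_finish(struct model_cpu *cpu, struct model_task *next)
{
    struct model_task *prev = cpu->current;
    if (prev->mm == NULL) {
        struct model_mm *oldmm = prev->active_mm;
        prev->active_mm = NULL;
        oldmm->mm_count--;
    }
    cpu->current = next;
}

/* Task a starts switching to a kernel thread, an interrupt reschedules to
 * task b, and b later picks the kernel thread. Returns the number of times
 * the BUG() condition was observed. */
static int model_run(bool irqs_off_during_switch)
{
    struct model_mm mm_a = {1}, mm_b = {1};
    struct model_task a = {&mm_a, &mm_a}, b = {&mm_b, &mm_b};
    struct model_task kthread = {NULL, NULL};
    struct model_cpu cpu = {&a, 0};

    (void)model_switch_begin(&cpu, &kthread);   /* a -> kthread, first half */
    if (irqs_off_during_switch) {
        /* Interrupt is held off until the first switch has completed. */
        model_switch_finish(&cpu, &kthread);
    }
    /* Interrupt handler path re-enters the scheduler and picks b. */
    if (model_switch_begin(&cpu, &b))
        model_switch_finish(&cpu, &b);
    /* Later, b tries to switch to the kernel thread. */
    if (model_switch_begin(&cpu, &kthread))
        model_switch_finish(&cpu, &kthread);
    return cpu.bug_hits;
}
```

In this model, model_run(false) leaves the kernel thread with a borrowed active_mm although it never became the current task, so the later switch to it sees the invalid state; model_run(true) completes the first switch before the nested reschedule and the state stays consistent. It says nothing about which real interrupt path re-enters schedule().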

--steve

> 
> 
> -----Original Message-----
> From: linux-mips-bounce@linux-mips.org
> [mailto:linux-mips-bounce@linux-mips.org]On Behalf Of Jun Sun
> Sent: Wednesday, October 01, 2003 4:50 PM
> To: Craig Mautner
> Cc: linux-mips@linux-mips.org; jsun@mvista.com
> Subject: Re: schedule() BUG
> 
> 
> On Fri, Sep 12, 2003 at 11:04:16AM -0700, Craig Mautner wrote:
> > We are using mips-linux 2.4.17, gcc 3.2.1 (MontaVista) and crashing in
> > schedule():
> >
> > Unable to handle kernel paging request at virtual address 00000000,
> > epc == 800153c0, ra == 800153c0
> > $0 : 00000000 9001f800 0000001b 00000000 0000001a 83f56000 8298f4a0 0000001f
> > $8 : 00000001 ffffe2e0 000022e0 00000000 fffffff9 ffffffff 0000000a 00000002
> > $16: 00000000 00000000 82af0000 8298f4a0 83f56000 00000000 80008000 00000000
> > $24: 82af1dc2 00000002                   82af0000 82af1ef8 82af1ef8 800153c0
> > epc  : 800153c0    Not tainted
> >
> > The code is:
> >
> >     {
> >       struct mm_struct *mm = next->mm;
> >       struct mm_struct *oldmm = prev->active_mm;
> >       if (!mm) {
> >            if (next->active_mm) BUG();   <- this is where we crash
> >            next->active_mm = oldmm;
> >            atomic_inc(&oldmm->mm_count);
> >            enter_lazy_tlb(oldmm, next, this_cpu);
> >       }
> >         .
> >         .
> >         .
> >
> > This seems to happen in our case when 'next' points to 'kswapd' although
> > we think it could happen when switching to any kernel task (i.e. those
> > tasks with mm==NULL).
> >
> > We think the culprit is that we are taking an interrupt and rescheduling
> > while at a vulnerable point in 'schedule()'. Interrupts are enabled in
> > line 743. If we get an interrupt any time after line 785:
> >
> >            next->active_mm = oldmm;
> >
> > but before line 806
> >
> > __schedule_tail()
> >
> > completes the swap, the interrupt can force 'schedule()' to be reentered
> > via 'ret_from_intr()'.
> >
> > If so, 'kswapd's 'active_mm' field will be left non-zero, but 'current'
> > will not have been set to point to 'kswapd'. The next time 'schedule()'
> > tries to switch to 'kswapd', 'next' points to 'kswapd', and
> >
> >         next->mm == NULL
> >         next->active_mm != NULL
> >
> > which is detected as an invalid state, so we hit the BUG.
> >
> > Some questions:
> > Are we looking at this correctly?
> > Has anyone ever seen this before?
> > Is there a published fix?
> >
> > Thanks,
> >
> > -Craig
> >
> 
> This is a known problem.  Please try the attached patch.
> 
> On the R5432 CPU, there is also a hardware bug which can cause the same
> problem.  Please double-check vec3_generic to see if the workaround is
> at the beginning of the handler.
> 
> BTW, 2.4.17 is an old kernel. You really need to upgrade.
> 
> Jun
> 
> 
> 

Attachment: sock.c.patch
Description: Binary data

Attachment: r4k_switch.S.patch
Description: Binary data

Attachment: sched.c.patch
Description: Binary data
