On Sun, 28 May 2006 02:06:03 +0100
Ralf Baechle <ralf@linux-mips.org> wrote:
> On Sat, May 27, 2006 at 05:13:21PM -0400, Kumba wrote:
>
> > Finally managed to track down the git commit causing SGI IP32 (O2) systems
> > to lock up really early in the boot cycle, but I'm at a loss to understand
> > why.
> >
> > Effect:
> > It appears the system silently hangs somewhere in the void between function
> > calls when trying to invoke the memset() call in __alloc_bootmem_core() in
> > mm/bootmem.c. This puts the machine hardware in a state such that a simple
> > soft reset doesn't clear it -- the machine has to be cold booted to get it
> > to boot a working kernel again.
> >
> > Determined Cause:
> > It seems this commit:
> > 78eef01b0fae087c5fadbd85dd4fe2918c3a015f
> > [PATCH] on_each_cpu(): disable local interrupts
> >
> > Is the cause. I've verified this by reversing this one change on a
> > 2.6.17-rc4 tree, and it'll boot to a mini-userland (initramfs-based) and
> > appears to function normally.
> >
> >
> > But this is as far as I can trace this. I'm not sure what this change is
> > doing internally that's triggering this lockup on O2 systems. It doesn't
> > appear to affect Octane (IP30) systems or Origin (IP27). I haven't
> > test-ran it on IP22/IP28 hardware yet, so only IP32 is known to be
> > affected. Unsure about non-SGI MIPS hardware.
>
> on_each_cpu is re-enabling interrupts. This may crash the system if it
> happens before interrupt handlers have been installed.

on_each_cpu() calls smp_call_function(). It is not correct to call
smp_call_function() with local interrupts disabled, because it can lead to
deadlocks.

Therefore on_each_cpu() also must not be called with local interrupts
disabled. Therefore on_each_cpu() may use
local_irq_disable()/local_irq_enable().

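To make the interaction with Ralf's report concrete, here is a small userspace
model, not kernel code. It assumes on_each_cpu() brackets the local call with
local_irq_disable()/local_irq_enable() as described above; the irqs_enabled
flag and every model_* name are hypothetical stand-ins, and the cross-CPU part
is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the CPU's interrupt-enable state. */
static bool irqs_enabled;

static void model_irq_disable(void) { irqs_enabled = false; }
static void model_irq_enable(void)  { irqs_enabled = true; }

/* Counts how often the local callback ran. */
static int calls;
static void count_call(void *info) { (void)info; calls++; }

/*
 * Model of the local half of on_each_cpu(): run func with interrupts
 * off, then switch them back on. The enable is unconditional, so the
 * caller's previous interrupt state is not restored.
 */
static void model_on_each_cpu(void (*func)(void *), void *info)
{
    model_irq_disable();
    func(info);
    model_irq_enable();
}

/* Returns the modelled interrupt state after one call. */
static bool irq_state_after_call(bool enabled_before)
{
    irqs_enabled = enabled_before;
    model_on_each_cpu(count_call, NULL);
    return irqs_enabled;
}
```

In this model irq_state_after_call(false) returns true: a caller that entered
with interrupts off comes back with them on, which is the situation Ralf
describes for early boot.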
> A while ago I
> fixed all such calls but I may have missed some instances.
>
> Andrew, what was the reason for 78eef01b0fae087c5fadbd85dd4fe2918c3a015f ?
>
That change made the various calling environments consistent, as described
in the changelog.

If it's really, really not deadlocky to call smp_call_function() with
interrupts disabled at that time in the MIPS kernel bringup then I'd
suggest you should open-code an smp_call_function() and put a big comment
over it explaining why it's done this way, and why it isn't deadlocky.

<tries to remember what the deadlock is>

If CPU A is running smp_call_function() it's waiting for CPU B to run the
handler.

But if CPU B is presently _also_ running smp_call_function(), it's waiting
for CPU A to run the handler.

If either of those CPUs is waiting for the other with local interrupts
disabled, that CPU will never respond to the other CPU's IPI and they'll
deadlock.
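
The scenario above can be sketched as a deterministic userspace simulation.
This is only an illustration under simplifying assumptions: two CPUs, each
sending one IPI and spinning until the other has run the handler, with the
interrupt state fixed for the whole wait. struct cpu, cross_call_completes()
and MAX_STEPS are hypothetical names, and the real smp_call_function()
locking is not modelled.

```c
#include <stdbool.h>

#define NCPUS     2
#define MAX_STEPS 1000   /* arbitrary bound standing in for "forever" */

struct cpu {
    bool irqs_enabled;   /* can this CPU take an IPI while it spins? */
    bool ipi_pending;    /* the other CPU has sent us an IPI */
    bool acked;          /* the other CPU has run our handler */
};

/*
 * Both CPUs enter the cross-call at the same time. Returns true if both
 * see their request handled within MAX_STEPS, false if at least one is
 * still waiting (treated here as a deadlock).
 */
static bool cross_call_completes(bool irqs_a, bool irqs_b)
{
    struct cpu cpus[NCPUS] = {
        { .irqs_enabled = irqs_a },
        { .irqs_enabled = irqs_b },
    };

    /* Each CPU sends an IPI to the other, then starts waiting. */
    cpus[0].ipi_pending = true;
    cpus[1].ipi_pending = true;

    for (int step = 0; step < MAX_STEPS; step++) {
        for (int i = 0; i < NCPUS; i++) {
            int other = 1 - i;

            /* A spinning CPU only services the IPI if its IRQs are on. */
            if (cpus[i].ipi_pending && cpus[i].irqs_enabled) {
                cpus[i].ipi_pending = false;
                cpus[other].acked = true;  /* the sender may stop waiting */
            }
        }
        if (cpus[0].acked && cpus[1].acked)
            return true;
    }
    return false;
}
```

Under these assumptions cross_call_completes(true, true) returns true and
cross_call_completes(false, false) returns false. In the mixed case the CPU
with interrupts enabled stays stuck for as long as the other one keeps its
interrupts off, which this model never ends.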