On Mon, Oct 20, 2008 at 12:05:43PM -0700, Anirban Sinha wrote:
> Thanks for responding and posting the patch. There is actually a another
> important issue of a more general nature. I have already posted this in
> the general Linux kernel mailing list under the subject "panic() logic".
> The crux of the issue is:
>
> The panic() call does a smp_send_stop() pretty early in the call
> process for SMP systems. smp_send_stop basically marks all the other
> cores as 'down' and
> updates the cpu bitmap. One implication of this is that you cannot do
> an IPI later on to other cores. However, interestingly, mips sibyte
> processor tries to do a cfe_exit() through an IPI as a part of
> emergency_reboot() that is called pretty late in the panic() logic.
>
> As a consequence of this, if a panic happens on a back core, the system
> simply hangs and never actually does a "rebooting in 5 sec" thing.
Interesting. I've observed this effect frequently. But without researching
the issue further I did blame CFE for it.
> I believe the way panic logic is organized is in conflict with the
> requirements of some archs, for example our mips sibyte arch. Currently,
> the arch independent logic defeats the main purpose of the arch
> dependent emergency_restart() function which is to restart the system.
> In a vast majority of the cases, we do have a perfectly sane and
> functional front core and we are just not able to gracefully reboot the
> system because we are limited by the way panic() handles the shutdown
> logic. If there are other archs that does a similar specific operation
> for the front core as a part of 'emergency restart', they are all
> defeated.
SMP systems generally have some sledgehammer mechanism that can be used to
trigger a hardware reset of another or all cores. We probably should
use that instead of relying on firmware - which in many cases becomes
unusable after Linux initialization.
> I believe, the way to solve this problem is that the archs themselves
> take the responsibility of shutting down the core and not the generic
> panic() call. The actual power down mechanism is arch dependent anyway,
> so I guess it can be made to be a part of emergency_shutdown(). The arch
> independent kernel code will then simply do the necessary arch
> independent things to handle panic and simply call emergency_reboot() to
> do the rest of the arch specific stuff, including powering down the
> cores.
It would certainly make some sense in this particular scenario.
Ralf
|