> > > Hmm, I see. The lazy fpu context switch code is not SMP safe.
> > > I see fishy things like "last_task_used_math" etc...
> > What, you mean "last_task_used_math" isn't allocated in a
> > processor-specific page of kseg3??? ;-)
> You must be talking about another OS, right? :-) I don't think
> Linux has processor-specific page, although this sounds like
> a good idea to explore.
It's gotta be done. I mean, the last I heard (which was a long
time ago) mips64 Linux was keeping the CPU node number in
a watchpoint register (or something equally unwholesome) and
using that value as an index into tables. Sticking all the per-CPU
state in a kseg3 VM page which is allocated and locked at boot
time would be much cleaner and on the average probably quite
a bit faster (definitely faster in the kernel but to be fair one has
to factor in the increase in TLB pressure from the locked entry).
But getting back to the original topic, there's another fun bug
waiting for us in MIPS/Linux SMP floating point that can't
be fixed as easily with VM sleight-of-hand. Consider processes
"A" and "B", where A uses FP and B does not: A gets scheduled
on CPU 1, runs for a while, gets preempted, and B gets CPU 1.
CPU 2 gets freed, so A gets scheduled on CPU 2. Unfortunately,
A's FP state is still in the FP register set of CPU 1. The lazy FPU
context switch either needs to be turned off (bleah!) or be fixed
for SMP to handle the case where the "owner" of the FPRs
on one CPU gets scheduled on another.
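To make the scenario concrete, here is a toy user-space model of it. This is
only a sketch, not kernel code: the structures, the four-register FPU, and the
names fp_first_use() and migration_loses_state() are all made up for
illustration, and fpu_owner is just a per-CPU stand-in for
"last_task_used_math".

```c
#include <assert.h>
#include <string.h>

enum { CPU1, CPU2, NCPUS };      /* the two CPUs from the scenario */
#define NFPR 4                   /* toy FP register count (assumption) */

struct task {
    double fp_save[NFPR];        /* FP save area in the thread context */
};

struct cpu {
    double fpr[NFPR];            /* the "hardware" FP registers */
    struct task *fpu_owner;      /* whose state is live in fpr[] */
};

static struct cpu cpus[NCPUS];

/* Lazy switch: called on a task's first FP use after being scheduled. */
static void fp_first_use(int cpu, struct task *t)
{
    struct cpu *c;

    assert(cpu >= 0 && cpu < NCPUS);
    c = &cpus[cpu];
    if (c->fpu_owner == t)
        return;                  /* registers are still ours, nothing to do */
    if (c->fpu_owner)            /* save the previous owner on this CPU only */
        memcpy(c->fpu_owner->fp_save, c->fpr, sizeof c->fpr);
    memcpy(c->fpr, t->fp_save, sizeof c->fpr);
    c->fpu_owner = t;
}

/* Replays the A/B scenario; returns 1 if A's FP state was left behind. */
static int migration_loses_state(void)
{
    struct task a;

    memset(&a, 0, sizeof a);
    memset(cpus, 0, sizeof cpus);

    fp_first_use(CPU1, &a);      /* A runs on CPU 1 and uses FP */
    cpus[CPU1].fpr[0] = 42.0;    /* A computes something into an FPR */

    /* A is preempted and B runs on CPU 1 without touching FP,
     * so nothing triggers a save of A's registers. */

    fp_first_use(CPU2, &a);      /* A resumes on CPU 2: stale save area loaded */

    return cpus[CPU2].fpr[0] != 42.0 && cpus[CPU1].fpr[0] == 42.0;
}
```

In this model migration_loses_state() returns 1: the live value is still
sitting in CPU 1's registers while A runs on CPU 2 with whatever was last
written to its save area.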
The brute-force approach would be to somehow send an interrupt to the CPU
holding the FP state, causing it to cough that state up into the thread context
area. One alternative would be to give strict CPU affinity to the thread
that has its FP state on a particular CPU. That could complicate load
balancing, but might not really be too bad. At most one thread per CPU
will be non-migratable at a given point in time. In the above scenario,
"A" could never migrate off of CPU 1, but "B" could, and would
presumably be picked up by an idle CPU 2 as soon as its time slice
is up on CPU 1. That will be less efficient than doing an "FPU shootdown"
in some cases, but it should also be more portable and easier
to get right.
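Roughly, the two options look like this in a toy user-space model. Again a
sketch only: may_migrate(), pick_cpu(), fpu_flush_owner() and the structures
are hypothetical names, not anything in the current tree, and the "shootdown"
here is just the work the interrupt handler on the owning CPU would do, with
the cross-CPU interrupt itself left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CPU1, CPU2, NCPUS };
#define NFPR 4                   /* toy FP register count (assumption) */

struct task {
    int last_cpu;                /* CPU the task last ran on */
    double fp_save[NFPR];        /* FP save area in the thread context */
};

struct cpu {
    double fpr[NFPR];            /* the "hardware" FP registers */
    struct task *fpu_owner;      /* whose state is live in fpr[] */
};

static struct cpu cpus[NCPUS];

/* Affinity rule: a task is pinned while its FP state is live on its CPU. */
static int may_migrate(const struct task *t)
{
    return cpus[t->last_cpu].fpu_owner != t;
}

/* Scheduler hook: take the idle CPU only if the task is allowed to move. */
static int pick_cpu(const struct task *t, int idle_cpu)
{
    return may_migrate(t) ? idle_cpu : t->last_cpu;
}

/* Shootdown body: dump the owner's registers to its save area and
 * give up ownership, after which the task is free to migrate. */
static void fpu_flush_owner(int cpu)
{
    struct cpu *c;

    assert(cpu >= 0 && cpu < NCPUS);
    c = &cpus[cpu];
    if (c->fpu_owner) {
        memcpy(c->fpu_owner->fp_save, c->fpr, sizeof c->fpr);
        c->fpu_owner = NULL;
    }
}

/* Replays the A/B scenario; returns 1 if both schemes behave as described. */
static int affinity_demo(void)
{
    struct task a = { .last_cpu = CPU1 }, b = { .last_cpu = CPU1 };
    int ok;

    memset(cpus, 0, sizeof cpus);
    cpus[CPU1].fpu_owner = &a;   /* A's FP state is live on CPU 1 */
    cpus[CPU1].fpr[0] = 42.0;

    ok = pick_cpu(&a, CPU2) == CPU1      /* A stays pinned to CPU 1 */
      && pick_cpu(&b, CPU2) == CPU2;     /* B is free to take CPU 2 */

    fpu_flush_owner(CPU1);               /* the shootdown alternative */
    ok = ok && a.fp_save[0] == 42.0      /* state reached the save area */
            && pick_cpu(&a, CPU2) == CPU2;
    return ok;
}
```

The affinity check is a single pointer compare on the scheduling path, which
is most of why it looks easier to get right than the interrupt-driven flush.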
Does this come up in x86-land? The FPU state is much smaller
there, so lazy context switching is presumably less important.