OK, I am fully convinced that this is some kind of hardware
problem. Basically, it appears the CPU is executing from a stale
icache line even though the line's valid flags are cleared!
To see how I drew this conclusion, you will need to
be a little patient.
Here is the relevant user code segment:
400f3c: 45010009 bc1t 400f64 <C_x_co_DIDI_J+0x84>
400f40: 8c510008 lw $s1,8($v0)
....
400f94: 4501000a bc1t 400fc0 <C_x_co_DIDI_J+0xe0>
400f98: 8fa20050 lw $v0,80($sp)
In both cases, the branch is taken. The faulting symptom is that the
"lw" in the second delay slot does not load the correct value into $v0.
I wrote some sizable instrumentation code to capture what happens
before and after the "lw" emulation. A rough patch of my changes is attached
for the diehard hackers.
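
For anyone who wants to check the instruction words in the listing above by hand, here is a minimal sketch in C that splits an I-type word into its fields and computes a branch target. The struct and function names are made up for illustration and are not part of the patch. It only assumes the standard MIPS32 I-type layout (opcode in bits 31..26, rs in 25..21, rt in 20..16, 16-bit signed immediate).

```c
#include <stdint.h>

/* Fields of a MIPS32 I-type instruction word (names are illustrative). */
struct itype {
    unsigned opcode;  /* bits 31..26; 0x23 is "lw" */
    unsigned rs;      /* bits 25..21; base register for a load */
    unsigned rt;      /* bits 20..16; destination register for a load */
    int16_t  imm;     /* bits 15..0, treated as signed */
};

/* Split a 32-bit instruction word into its I-type fields. */
static struct itype decode_itype(uint32_t ir)
{
    struct itype f;
    f.opcode = ir >> 26;
    f.rs     = (ir >> 21) & 0x1f;
    f.rt     = (ir >> 16) & 0x1f;
    f.imm    = (int16_t)(ir & 0xffff);
    return f;
}

/*
 * Target of a PC-relative branch such as bc1t: the address of the
 * delay slot (pc + 4) plus the signed word offset times 4.
 */
static uint32_t branch_target(uint32_t pc, uint32_t ir)
{
    int32_t off = (int16_t)(ir & 0xffff);
    return pc + 4 + (uint32_t)(off * 4);
}
```

With this, 8fa20050 decodes to opcode 0x23, rs=29 ($sp), rt=2 ($v0), imm=80, and 8c510008 decodes to opcode 0x23, rs=2 ($v0), rt=17 ($s1), imm=8. The two bc1t words at 400f3c and 400f94 give targets 400f64 and 400fc0, matching the disassembly.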
Here is the output from my captured data for the faulting "lw" case:
----------
dsemul_insns = 7fff78f0, epc=00400f98, cpc=00400fc0, ir=8fa20050
v0 = 10000028, s1=00000004
ret_epc = 7fff78f4, ret_ir=8fa20050, ret_sp=7fff7900, ret_sp_80val=00000004
ret_v0 = 10000028, ret_s1=00000004, bad_addr=00000000
cache tags before: 0018a8f2, 03b878f0, 000308f2, 03b888f2
cache tags after : 0018a8f2, 03b87800, 000308f2, 03b888f2
mem @ a3b878f0 : 8fa20050, 8c000001, 0000bd36, 00400fc0
----------
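
As a reading aid for the "mem @ a3b878f0" line, here is a rough sketch of how the four dumped words can be checked against the other captured values. The struct and field names are my own labels for this mail, not names taken from the kernel source, and the meaning of the third word (0000bd36) as a marker is a guess. The only things the sketch relies on are visible in the dump: the first word equals ir, the second word is an "lw" whose address cannot be word aligned, and the last word equals cpc.

```c
#include <stdint.h>

/* Hypothetical labels for the four words dumped at the trampoline address. */
struct tramp_frame {
    uint32_t emul;    /* the delay-slot instruction to be executed */
    uint32_t trap;    /* instruction that forces the exception on the way back */
    uint32_t marker;  /* 0x0000bd36 in the dump; purpose assumed, not verified */
    uint32_t resume;  /* address to continue at after the trampoline */
};

/*
 * True if ir is "lw" with base register $zero and an offset that is not
 * a multiple of 4, i.e. a load that must raise an address error.
 */
static int is_unaligned_lw_zero(uint32_t ir)
{
    unsigned opcode = ir >> 26;
    unsigned rs     = (ir >> 21) & 0x1f;
    uint32_t imm    = ir & 0xffff;
    return opcode == 0x23 && rs == 0 && (imm & 3) != 0;
}

/* Check the memory image against the captured ir and cpc values. */
static int frame_matches(const struct tramp_frame *f, uint32_t ir, uint32_t cpc)
{
    return f->emul == ir && is_unaligned_lw_zero(f->trap) && f->resume == cpc;
}
```

Fed with the dumped words 8fa20050, 8c000001, 0000bd36, 00400fc0 and with ir=8fa20050, cpc=00400fc0, the check passes, so the memory image of the trampoline looks as intended.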
Several notes and observations:
. 7fff78f0 is in fact 83b878f0 (or a3b878f0) in kernel space
. ret_v0 is the v0 value after we return from trampoline execution. Voila,
it has the same value as before, meaning "lw $v0,80($sp)" did not happen.
To further confirm that, I printed out the value at 80($sp), which is
"ret_sp_80val=00000004" (that is the value $v0 should have, BTW).
. Apparently the trampoline was executed, because we did come back from it
through the unaligned access exception.
. In order to figure out what the first instruction executed was, I set $s1
to 0x55 right before we start to execute the trampoline code. Guess what, after
it comes back, $s1 has changed to "ret_s1=00000004", which suggests that a stale
instruction, "lw $s1,8($v0)", was executed. This instruction was put into the
icache during the previous bc1t emulation.
. "cache tags before/after" dumps the cache tags of all four ways in the same
set as the trampoline. It is clear that after flush_cache_sigtramp() the valid
bits are correctly cleared.
. The last line shows that memory indeed has the right values. It is the icache
that is to blame.
. No page fault happened during flush_cache_sigtramp(), because bad_addr
would otherwise contain the faulting address.
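
To spell out the address aliasing in the first note: the kernel addresses 83b878f0 and a3b878f0 imply that the user page is backed by physical address 03b878f0. A minimal sketch of the conversion, assuming the usual MIPS32 fixed mappings (kseg0 cached at 0x80000000, kseg1 uncached at 0xa0000000, both covering the low 512MB of physical memory). The helper names are made up for illustration.

```c
#include <stdint.h>

#define KSEG0_BASE 0x80000000u  /* unmapped, cached window */
#define KSEG1_BASE 0xa0000000u  /* unmapped, uncached window */
#define KSEG_MASK  0x1fffffffu  /* low 512MB of physical address space */

/* Cached kernel alias of a physical address below 512MB. */
static uint32_t kseg0_addr(uint32_t phys)
{
    return KSEG0_BASE | (phys & KSEG_MASK);
}

/* Uncached kernel alias of a physical address below 512MB. */
static uint32_t kseg1_addr(uint32_t phys)
{
    return KSEG1_BASE | (phys & KSEG_MASK);
}

/* Physical address behind a kseg0 or kseg1 address. */
static uint32_t kseg_to_phys(uint32_t kaddr)
{
    return kaddr & KSEG_MASK;
}
```

Reading the trampoline through the kseg1 alias (a3b878f0) bypasses the caches, which is why the "mem @" line is a fair picture of what is really in memory.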
It seems safe to conclude that the CPU executed from a stale icache line.
I have no clue why the icache exhibits such a problem.
I modified flush_cache_sigtramp() to flush the whole icache, and things appear
to be working. However, not knowing the root cause I am not 100% sure
if this is a valid workaround.
Here is some info related to the CPU:
CPU revision is: 00018001
Primary instruction cache 16kb, linesize 16 bytes (4 ways)
Primary data cache 16kb, linesize 16 bytes (4 ways)
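
For reference, the geometry above works out as follows. This is only a back-of-the-envelope sketch with made-up names; it assumes the set index is taken from the address bits just above the line offset, which is the usual arrangement for a set-associative cache.

```c
#include <stdint.h>

enum {
    CACHE_BYTES = 16 * 1024,  /* "16kb" from the boot message */
    LINE_BYTES  = 16,         /* "linesize 16 bytes" */
    WAYS        = 4           /* "4 ways" */
};

/* Number of sets: total size divided by (line size * associativity). */
static unsigned num_sets(void)
{
    return CACHE_BYTES / (LINE_BYTES * WAYS);
}

/* Bytes covered by one way. */
static unsigned way_bytes(void)
{
    return CACHE_BYTES / WAYS;
}

/* Set selected by an address, assuming index bits sit above the line offset. */
static unsigned set_index(uint32_t addr)
{
    return (addr / LINE_BYTES) % num_sets();
}
```

That gives 256 sets and 4096 bytes per way. Both the user address 7fff78f0 and the kernel alias a3b878f0 fall into set 0x8f, since the index comes from address bits 11..4. Assuming 4KB pages, those bits lie inside the page offset, so the virtual and physical addresses of the trampoline select the same set.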
Jun
On Tue, Dec 03, 2002 at 10:45:04PM -0800, Jun Sun wrote:
>
> I attached the test case. Untar it. Type 'make' and run 'a.out'.
>
> If the test fails you will see a print-out. Otherwise you see nothing.
>
> It does not always fail. But if it fails, it is usually pretty consistent.
> Try a few times. Moving the source tree to a different directory may cause
> the symptom to appear or disappear.
>
> I spent quite some time to trace this problem, and came to suspect
> there might be a hardware problem.
>
> The problem involves emulating a "lw" instruction in a cp1 branch delay
> slot, which needs to set up a trampoline on the user stack. The net effect
> looks as if the icache line or dcache line is not flushed properly.
>
> Using gdb/kgdb, printf or printk in any useful places would hide the bug.
>
> I did find a smaller part of the problem. flush_cache_sigtramp for
> MIPS32 (4Kc) calls protected_writeback_dcache_line in mips32_cache.h.
> It uses Hit_Writeback_D, and the 4Kc manual says it is not implemented
> and executed as no-op (*ick*).
>
> Even after fixing this, I still see the problem happening.
>
> If you replace flush_cache_sigtramp() with flush_cache_all(), the symptom
> disappears.
>
> Several of my tests seem to suggest it is the icache that did not
> get flushed (or updated) properly.
>
> Not re-producible on other MIPS boards. At least so far.
>
> Does anybody with more knowledge about 4Kc have any clues here?
>
> Thanks.
>
> Jun
trace.patch
Description: Text document