linux-mips
[Top] [All Lists]

Re: possible Malta 4Kc cache problem ...

To: linux-mips@linux-mips.org
Subject: Re: possible Malta 4Kc cache problem ...
From: Jun Sun <jsun@mvista.com>
Date: Wed, 4 Dec 2002 14:53:55 -0800
Cc: jsun@mvista.com
In-reply-to: <20021203224504.B13437@mvista.com>; from jsun@mvista.com on Tue, Dec 03, 2002 at 10:45:04PM -0800
Original-recipient: rfc822;linux-mips@linux-mips.org
References: <20021203224504.B13437@mvista.com>
Sender: linux-mips-bounce@linux-mips.org
User-agent: Mutt/1.2.5i
OK, I am fully convinced that this is some kind of hardware
problem.  Basically, it appears CPU is running from a stale 
icache line even though it has invalid flags!

In order to see how I draw this conclusion, you will need to 
be a little patient.

Here is the relavent user code segment:

  400f3c:       45010009        bc1t    400f64 <C_x_co_DIDI_J+0x84>
  400f40:       8c510008        lw      $s1,8($v0)
  ....
  400f94:       4501000a        bc1t    400fc0 <C_x_co_DIDI_J+0xe0>
  400f98:       8fa20050        lw      $v0,80($sp)

In both cases, branch are taken.  The faulting symptom is that
"lw" in the second delay slot does not load correct value to $v0.

I wrote some sizable instrumentation code to capture what has happened
before and after "lw" emulation.  A rough patch of my change is included for 
the diehard hackers.

Here is the output from my captured data for the faulting "lw" case:

----------
dsemul_insns = 7fff78f0, epc=00400f98, cpc=00400fc0, ir=8fa20050
v0 = 10000028, s1=00000004
ret_epc = 7fff78f4, ret_ir=8fa20050, ret_sp=7fff7900, ret_sp_80val=00000004
ret_v0 = 10000028, ret_s1=00000004, bad_addr=00000000
cache tags before: 0018a8f2, 03b878f0, 000308f2, 03b888f2
cache tags after : 0018a8f2, 03b87800, 000308f2, 03b888f2
mem @ a3b878f0 : 8fa20050, 8c000001, 0000bd36, 00400fc0
----------

Several notes and observations:

. 7fff78f0 is in fact 83b878f0 (or a3b878f0) in kernel space

. ret_v0 is the v0 value after we return from trampoline execution.  Wowla,
  it has the same value as before, meaning "lw $v0,80($sp)" did not happen.  
  To further confirm that, I actually printed out the value at 80($sp) which is
  "ret_sp_80val=00000004" (That is the right value $v0 should have, BTW).

. apparently the trampoline is executed, because we did come back from it
  through unaligned access exception.

. In order to figure out what the first instruction is, I modified $s1 value
  to 0x55 right before we start to execute trampoline code.  Guess what, after 
  it comes back, $s1 is changed to "ret_s1=00000004", which suggests an stale 
instruction
  "lw $s1,8($v0)" was executed.  This instruction was put into icache during 
the previous
  bc1t emulation.

. "cache tags before/after" dumps cache tags of all four ways in the same set 
as the
   trampoline.  It is clear that after flush_cache_sigtramp() the valid bits 
are correctly
   cleared.

. The last line shows memory indeed has the right values.  It is the icache
  to blame.

. No page fault has happened during flush_cache_sigtramp() because bad_addr 
would
  otherwise contains the faulting address.

It seems safe to conclude CPU executed from a stale icache line.  I have no 
clue why 
icache exhibits such a problem.

I modified flush_cache_sigtramp() to flush the whole icache, and things appear
to be working.  However, not knowing the root cause I am not 100% sure
if this is a valid workaround.

Here are some info related to the CPU:

CPU revision is: 00018001
Primary instruction cache 16kb, linesize 16 bytes (4 ways)
Primary data cache 16kb, linesize 16 bytes (4 ways)

Jun


On Tue, Dec 03, 2002 at 10:45:04PM -0800, Jun Sun wrote:
> 
> I attached the test case.  Untar it.  Type 'make' and run 'a.out'.
> 
> If the test fails you will see a print-out.  Otherwise you see nothing.
> 
> It does not always fail.  But if it fails, it is usually pretty consistent.
> Try a few times.  Moving source tree to a different directory may cause
> the symptom appear or disappear.
> 
> I spent quite some time to trace this problem, and came to suspect
> there might be a hardware problem.
> 
> The problem involves emulating a "lw" instruction in cp1 branch delay
> slot, which needs to  set up trampoline in user stack.  The net effect
> looks as if the icache line or dcache line is not flushed properly.
> 
> Using gdb/kgdb, printf or printk in any useful places would hide the bug.
> 
> I did find a smaller part of the problem.  flush_cache_sigtramp for
> MIPS32 (4Kc) calls protected_writeback_dcache_line in mips32_cache.h.
> It uses Hit_Writeback_D, and the 4Kc mannual says it is not implemented
> and executed as no-op (*ick*).
> 
> Even after fixing this, I still see the problem happening.
> 
> If you replace flush_cache_sigtramp() with flush_cache_all(), symptom
> would disppear.
> 
> Several of my tests seem to suggest it is the icache that did not
> get flushed (or updated) properly.
> 
> Not re-producible on other MIPS boards.  At least so far.
> 
> Does anybody with more knowledge about 4Kc have any clues here?
> 
> Thanks.
> 
> Jun


Attachment: trace.patch
Description: Text document

<Prev in Thread] Current Thread [Next in Thread>