Re: RM7k cache_flush_sigtramp

To: Adam Kiepul <Adam_Kiepul@pmc-sierra.com>
Subject: Re: RM7k cache_flush_sigtramp
From: Fuxin Zhang <fxzhang@ict.ac.cn>
Date: Wed, 06 Aug 2003 19:00:07 +0800
Cc: Ralf Baechle <ralf@linux-mips.org>, MAKE FUN PRANK CALLS <linux-mips@linux-mips.org>
In-reply-to: <9DFF23E1E33391449FDC324526D1F259017DF091@SJC1EXM02>
Original-recipient: rfc822;linux-mips@linux-mips.org
References: <9DFF23E1E33391449FDC324526D1F259017DF091@SJC1EXM02>
Sender: linux-mips-bounce@linux-mips.org
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624
Hi,

Over the past few days I have run more experiments on our ev64240 board.

It now seems I have at least two problems: the sigtramp flush and the L3 cache.

Let me describe the phenomena first.
1. fsck on /dev/hda4 (a 10G partition of a 40G IDE disk on a PCI add-on card
   using an Intel PIIX4 chip) frequently fails with an oops in various places:
   __remove_inode_queue, free_buffers, vmscan:359, etc.
2. Occasionally other applications fail with a segmentation fault or bus error.
3. The X Window System is extremely unstable; both the applications and the
   X server may fail with SIGILL/SIGSEGV/SIGBUS, etc.

To investigate, I modified arch/mips/signal.c so that the kernel dumps core unconditionally (even if user handlers are installed) for SIGILL/SIGSEGV/SIGBUS. This way I got many core files for XFree86, and I found that they all look quite similar: all fault around the kernel-generated sigreturn code (two examples are attached). A few days ago I added a 'sync' after the writeback and the situation became much better, but I still saw this kind of failure even with the 'sync'. I had to go further and use 'Writeback_SD'; so far there have been no more such faults. But just as Adam pointed out, it may just mask another error.

I have also tried adding code in r4k_flush_sigtramp and sigreturn. When the X server fails, I do observe a flush for the faulting address, but no sigreturn is executed. So I am at my wit's end :(. Maybe some complex scheduling or reentrancy problem? Or even a potential bug in context management (e.g., we are using someone else's stack)?
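
For readers following the dumps: the kernel-generated sigreturn code shows up in the attached disassembly as the pair `li v0,4119` / `syscall`. The sketch below shows how those two words could be built. It is only an illustration, assuming the usual `addiu v0,zero,imm` encoding of `li` and the syscall number 4119 as it appears in the dumps; the names `encode_li_v0`, `encode_syscall` and `build_sigreturn_tramp` are hypothetical and are not the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

/* Taken from the disassembly in this thread: "li v0,4119". */
#define NR_SIGRETURN_O32 4119u

/* Encode "li v0, imm" as "addiu v0, zero, imm" (opcode 9, rs=$0, rt=$2). */
static uint32_t encode_li_v0(uint32_t imm)
{
    assert(imm <= 0x7fffu);  /* must fit a positive 16-bit signed immediate */
    return (9u << 26) | (0u << 21) | (2u << 16) | imm;
}

/* SYSCALL with a zero code field: SPECIAL opcode 0, function 0x0c. */
static uint32_t encode_syscall(void)
{
    return 0x0000000cu;
}

/* Fill a two-word trampoline. The caller still has to make the words
 * visible to instruction fetch (D-cache writeback, then I-cache
 * invalidate), which is the step under discussion in this thread. */
static void build_sigreturn_tramp(uint32_t tramp[2])
{
    tramp[0] = encode_li_v0(NR_SIGRETURN_O32);
    tramp[1] = encode_syscall();
}
```

The words are written through the data cache, so the instruction fetch path only sees them after the flush has taken effect.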

Using Writeback_SD only helps the X server problem; the other problems look cache related. So I tried running with the L3 cache disabled. That helps greatly: no oops now. With a small tweak to the IDE code, the 'lost interrupt' problem seems to be gone too.
But with only L3 disabled, the X server problem remains.

I am running stress tests now. I hope they won't give me more surprises.

And here I have a question for Adam: the original Linux code uses "Writeback_Inv_D" and "Hit_Invalidate_I", not "Writeback_D" and "Hit_Invalidate_I". Could that lead to the problem?

BTW, a silly question: how can I make my email show up prettier? I find that the mailing list often breaks my lines very badly. I feel guilty about that :) I am using Mozilla Composer, and the original line breaks are inserted manually (I hit Enter when I feel the line is long enough).

Thank you for any help.

Adam Kiepul wrote:

Hi Fuxin,

Could you please provide me with the _original_ Kernel code disassembly snippet 
around the point where your SYNC patch applies?
Also, can you check what RM7000 part revision is on your board? You can find it 
out by reading the PrID register.
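
As a rough illustration of what reading the PrID register yields, here is a minimal decoding sketch. It assumes the legacy layout, with the implementation number in bits 15:8 and the revision in bits 7:0, and it assumes 0x27 is the RM7000 implementation number (the value Linux uses as PRID_IMP_RM7000); please check both against the part's manual. The sample value 0x2733 is made up for the example, and the names `prid_fields` and `decode_prid` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed RM7000 implementation number (bits 15:8 of PrID). */
#define PRID_IMP_RM7000_ASSUMED 0x27u

struct prid_fields {
    unsigned imp;  /* bits 15:8, implementation number */
    unsigned rev;  /* bits 7:0, part revision */
};

/* Split a raw PrID value into its two fields. On the target the raw
 * value comes from CP0 register 15 (mfc0 in kernel mode); here it is
 * simply passed in as a number. */
static struct prid_fields decode_prid(uint32_t prid)
{
    struct prid_fields f;
    f.imp = (prid >> 8) & 0xffu;
    f.rev = prid & 0xffu;
    return f;
}

/* Usage with a made-up value, not one read from any real board. */
static void example_decode_prid(void)
{
    struct prid_fields f = decode_prid(0x00002733u);
    assert(f.imp == PRID_IMP_RM7000_ASSUMED);
    assert(f.rev == 0x33u);
}
```

The revision byte is the value to compare against the errata list.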

I will check if there is an erratum that the code could trigger.

By the way, are you aware of any other ev64240 board that would exhibit the 
same behavior?

I would be quite careful drawing any conclusions at the moment since we can not preclude 
the possibility that it is simply a "bad CPU on the board" case. Please note 
that the SYNC instruction changes a lot in the manner things physically happen in the CPU 
so it can often mask off various problems, such as a bad part.

Thank you,

Adam


-----Original Message-----
From: Fuxin Zhang [mailto:fxzhang@ict.ac.cn]
Sent: Thursday, July 31, 2003 9:59 PM
To: Ralf Baechle
Cc: Adam Kiepul; MAKE FUN PRANK CALLS
Subject: Re: RM7k cache_flush_sigtramp


I am using a slightly modified 2.4.21-pre4,based on cvs of early this month(?).
We have merged with latest cvs, I will run it and report the result tonight.


Ralf Baechle wrote:

Adam,

On Fri, Aug 01, 2003 at 08:40:14AM +0800, Fuxin Zhang wrote:



Current Linux code does exactly this. But I was seeing all kinds of faults occurring around the sigreturn point on the stack without a sync. And a sync does greatly improve the stability.

The ordering does matter however since the Hit_Invalidate_I makes sure the write buffer is flushed.
Could there be an erratum explaining Fuxin's findings?

Fuxin, what version are you running?

Ralf







GNU gdb 2002-04-01-cvs
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "mipsel-linux"...(no debugging symbols found)...
Core was generated by `/usr/bin/X11/X -dpi 100 -nolisten tcp'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/ld.so.1
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...
done.
Loaded symbols for /lib/libnss_files.so.2

    GDB is unable to find the start of the function at 0x7fff7600
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
    This problem is most likely caused by an invalid program counter or
stack pointer.
    However, if you think GDB should simply search farther back
from 0x7fff7600 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
#0  0x7fff7600 in ?? ()
(gdb) where
#0  0x7fff7600 in ?? ()
(gdb) disass 0x7fff7580 0x7fff7680
Dump of assembler code from 0x7fff7580 to 0x7fff7680:
0x7fff7580:     nop
0x7fff7584:     nop
0x7fff7588:     nop
0x7fff758c:     nop
0x7fff7590:     nop
0x7fff7594:     nop
0x7fff7598:     nop
0x7fff759c:     nop
0x7fff75a0:     nop
0x7fff75a4:     nop
0x7fff75a8:     nop
0x7fff75ac:     nop
0x7fff75b0:     nop
0x7fff75b4:     nop
0x7fff75b8:     nop
0x7fff75bc:     nop
0x7fff75c0:     nop
0x7fff75c4:     nop
0x7fff75c8:     nop
0x7fff75cc:     nop
0x7fff75d0:     nop
0x7fff75d4:     beq     at,t3,0x80003ef8
0x7fff75d8:     sllv    zero,zero,zero
0x7fff75dc:     0xe
0x7fff75e0:     beq     zero,gp,0x7ffef244
0x7fff75e4:     beq     zero,t0,0x800012e8
0x7fff75e8:     0x7fff7600
0x7fff75ec:     beq     zero,t2,0x7ffd98b0
0x7fff75f0:     sd      ra,-1(ra)
0x7fff75f4:     sd      ra,-1(ra)
0x7fff75f8:     slti    sp,s6,-25040
0x7fff75fc:     beq     zero,t2,0x7ffd9cc0
0x7fff7600:     li      v0,4119
0x7fff7604:     syscall
0x7fff7608:     0x12c
0x7fff760c:     lb      a0,-19437(zero)
0x7fff7610:     slti    s1,s6,17812
0x7fff7614:     nop
0x7fff7618:     nop
0x7fff761c:     nop
0x7fff7620:     0xcf9210
0x7fff7624:     nop
0x7fff7628:     mfhi    zero
0x7fff762c:     nop
0x7fff7630:     beq     zero,at,0x8000caa4
0x7fff7634:     nop
0x7fff7638:     0xe
0x7fff763c:     nop
0x7fff7640:     slti    v0,k1,28680
0x7fff7644:     nop
0x7fff7648:     0x1228
0x7fff764c:     nop
0x7fff7650:     nop
0x7fff7654:     nop
0x7fff7658:     multu   zero,zero
0x7fff765c:     nop
0x7fff7660:     nop
0x7fff7664:     nop
0x7fff7668:     slti    t9,s7,17756
0x7fff766c:     nop
0x7fff7670:     beq     at,t3,0x7fff6f04
0x7fff7674:     nop
0x7fff7678:     0x12c
0x7fff767c:     nop
End of assembler dump.
(gdb) info regs 
          zero       at       v0       v1       a0       a1       a2       a3
 R0   00000000 b004b400 ffffffff ffffffff 00000001 7fff73f8 00000000 00000000 
            t0       t1       t2       t3       t4       t5       t6       t7
 R8   0000b400 00000000 00000000 00000000 00000000 822b2880 822b2900 00000000 
            s0       s1       s2       s3       s4       s5       s6       s7
 R16  00000000 102b3248 00000004 0000000e 101cdf18 102b1820 00000001 00000000 
            t8       t9       k0       k1       gp       sp       s8       ra
 R24  00000000 006c86ec 00000000 00000000 10082740 7fff75f0 7fff7d08 7fff7600 
            sr       lo       hi      bad    cause       pc
      a004b413 00000002 00000000 8009c6a0 00000028 7fff7600 
           fsr      fir       fp
      00800004 00000000 00000000 
(gdb) quit
GNU gdb 2002-04-01-cvs
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "mipsel-linux"...(no debugging symbols found)...
Core was generated by `/bin/sh /usr/bin/X11/startx'.
Program terminated with signal 4, Illegal instruction.

    GDB is unable to find the start of the function at 0x7fff75b8
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
    This problem is most likely caused by an invalid program counter or
stack pointer.
    However, if you think GDB should simply search farther back
from 0x7fff75b8 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
#0  0x7fff75b8 in ?? ()
(gdb) info reg
          zero       at       v0       v1       a0       a1       a2       a3
 R0   00000000 2ad918f0 2ad918f0 0000000a 00000012 7fff7538 00000001 00000001 
            t0       t1       t2       t3       t4       t5       t6       t7
 R8   0000000a 2aca6394 00000000 00000004 00000000 00000000 00000000 07200720 
            s0       s1       s2       s3       s4       s5       s6       s7
 R16  00000000 00000004 00000080 7fff7878 00000003 ffffffff 1000f0f8 00000001 
            t8       t9       k0       k1       gp       sp       s8       ra
 R24  00000000 00000000 00000000 00000000 1000d880 7fff7590 00000003 7fff75a0 
            sr       lo       hi      bad    cause       pc
      a004f413 000001b0 00000000 8009c6a0 80000028 7fff75b8 
           fsr      fir       fp
      00000000 00000000 00000000 
(gdb) disass 0x7fff7500 0x7fff7600
Dump of assembler code from 0x7fff7500 to 0x7fff7600:
0x7fff7500:     0xc2009d
0x7fff7504:     0x10000e8
0x7fff7508:     0x11a0110
0x7fff750c:     0x990121
0x7fff7510:     slti    t9,s6,32304
0x7fff7514:     tltu    a0,t9,0x2
0x7fff7518:     slti    t9,s6,32304
0x7fff751c:     0x442c88
0x7fff7520:     nop
0x7fff7524:     nop
0x7fff7528:     nop
0x7fff752c:     nop
0x7fff7530:     b       0x7ffed734
0x7fff7534:     nop
0x7fff7538:     nop
0x7fff753c:     slti    t8,s6,-8108
0x7fff7540:     nop
0x7fff7544:     sllv    zero,zero,zero
0x7fff7548:     sll     zero,zero,0x2
0x7fff754c:     0x7fff7878
0x7fff7550:     sra     zero,zero,0x0
0x7fff7554:     sd      ra,-1(ra)
0x7fff7558:     b       0x7fff393c
0x7fff755c:     b       0x7ffed760
0x7fff7560:     teq     v0,a0,0xa9
0x7fff7564:     nop
0x7fff7568:     nop
0x7fff756c:     nop
0x7fff7570:     nop
0x7fff7574:     nop
0x7fff7578:     b       0x7ffed77c
0x7fff757c:     nop
0x7fff7580:     nop
0x7fff7584:     b       0x7ffed788
0x7fff7588:     0x7fff75a0
0x7fff758c:     0x1
0x7fff7590:     b       0x7fff2174
0x7fff7594:     0x7fff7804
0x7fff7598:     slti    t9,s6,32304
0x7fff759c:     0x475718
0x7fff75a0:     li      v0,4119
0x7fff75a4:     syscall
0x7fff75a8:     slti    t8,s6,-8108
0x7fff75ac:     lb      a0,-3053(zero)
0x7fff75b0:     slti    t5,s6,9620
0x7fff75b4:     nop
0x7fff75b8:     nop
0x7fff75bc:     nop
0x7fff75c0:     b       0x7fff8cb4
0x7fff75c4:     nop
0x7fff75c8:     sllv    zero,zero,zero
0x7fff75cc:     nop
0x7fff75d0:     nop
0x7fff75d4:     nop
0x7fff75d8:     sra     zero,zero,0x0
0x7fff75dc:     nop
0x7fff75e0:     0x7fff7878
0x7fff75e4:     nop
0x7fff75e8:     sll     zero,zero,0x2
0x7fff75ec:     nop
0x7fff75f0:     0x1
0x7fff75f4:     nop
0x7fff75f8:     mult    zero,zero
0x7fff75fc:     nop
End of assembler dump.
(gdb) quit
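
A note on reading the two register dumps above: the cause values are 00000028 and 80000028. The sketch below decodes them, assuming the standard MIPS Cause layout (BD in bit 31, ExcCode in bits 6:2). Both come out as ExcCode 10, reserved instruction, which matches the reported signal 4. In the first dump the pc is 0x7fff7600, where the core file holds a valid `li v0,4119`, so the CPU may have executed something other than what memory holds. That is only a reading of the dump, not a confirmed diagnosis. The macro and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cause register fields per the MIPS architecture:
 * BD is bit 31, ExcCode is bits 6:2. */
#define CAUSE_BD(c)      (((c) >> 31) & 1u)
#define CAUSE_EXCCODE(c) (((c) >> 2) & 0x1fu)

/* Names for a few ExcCode values; anything else is reported as "other". */
static const char *exccode_name(unsigned code)
{
    switch (code) {
    case 0:  return "Int";   /* interrupt */
    case 4:  return "AdEL";  /* address error on load or fetch */
    case 5:  return "AdES";  /* address error on store */
    case 6:  return "IBE";   /* bus error on instruction fetch */
    case 7:  return "DBE";   /* bus error on data access */
    case 8:  return "Sys";   /* syscall */
    case 10: return "RI";    /* reserved instruction */
    default: return "other";
    }
}

/* Usage: decode the two cause values copied from the dumps. */
static void example_decode_cause(void)
{
    uint32_t first = 0x00000028u, second = 0x80000028u;
    assert(CAUSE_EXCCODE(first) == 10u && CAUSE_BD(first) == 0u);
    assert(CAUSE_EXCCODE(second) == 10u && CAUSE_BD(second) == 1u);
    assert(strcmp(exccode_name(CAUSE_EXCCODE(first)), "RI") == 0);
}
```

The second value differs from the first only in the BD bit, which the architecture uses to flag an exception taken in a branch delay slot.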