hi,
Over the past few days I have run more experiments on our ev64240 board.
It now seems I have at least two problems: the sigtramp flush and the L3 cache.
Let me describe the symptoms first.
1. fsck of /dev/hda4 (a 10G partition of a 40G IDE disk on a PCI add-on
   card that uses an Intel PIIX4 chip) frequently fails with an oops in
   various places: __remove_inode_queue, free_buffers, vmscan:359, etc.
2. Occasionally other applications fail with a segmentation fault or a
   bus error.
3. The X Window System is extremely unstable; both the applications and
   the X server may fail with SIGILL/SIGSEGV/SIGBUS, etc.
To investigate, I modified arch/mips/signal.c so that the kernel dumps
core unconditionally (even if a user handler is installed) for
SIGILL/SIGSEGV/SIGBUS. This way I got many core files for XFree86, and I
found that they all look quite similar: the faults are all around the
kernel-generated sigreturn code (two examples are attached). A few days
ago I added a 'sync' after the writeback and the situation was much
better, but I still saw this kind of failure even with the 'sync'. I had
to go further and use 'Writeback_SD'; so far there have been no more such
faults. But just as Adam pointed out, it may just mask another error. I
have tried adding code to r4k_flush_sigtramp and sigreturn, and when the
X server fails I do observe that there was a flush for the faulting
address, but no sigreturn was executed. So I am at my wit's end :(. Maybe
it is some complex scheduling or reentrancy problem? Or even a potential
bug in context management (e.g., we are using another process's stack)?
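For reference, the two instructions visible at the faulting addresses in the attached dumps ('li v0,4119' followed by 'syscall') can be reproduced as raw words. This is only a minimal sketch: the function names are hypothetical, and it assumes the o32 numbering (base 4000, sigreturn at offset 119) and that 'li' with a small immediate is encoded as 'addiu rt, zero, imm'. It may help when searching a core file's stack for the trampoline.

```python
NR_LINUX_O32 = 4000                  # assumed base of the o32 syscall numbers
NR_SIGRETURN = NR_LINUX_O32 + 119    # 4119, the value seen in 'li v0,4119'

SYSCALL_WORD = 0x0000000C            # 'syscall' with a zero code field


def encode_addiu(rt, rs, imm):
    """Encode 'addiu rt, rs, imm' (major opcode 9, I-type format)."""
    return (9 << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def sigreturn_trampoline():
    """Return the two words of the 'li v0,4119; syscall' sequence."""
    v0, zero = 2, 0                  # register numbers of v0 and zero
    return [encode_addiu(v0, zero, NR_SIGRETURN), SYSCALL_WORD]


def find_trampoline(words):
    """Return the indices where the two trampoline words appear back to back."""
    tramp = sigreturn_trampoline()
    return [i for i in range(len(words) - 1) if words[i:i + 2] == tramp]


if __name__ == "__main__":
    print([hex(w) for w in sigreturn_trampoline()])
```

Given a list of 32-bit words read from the stack area of a core file, find_trampoline() returns the word offsets at which the pair occurs.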
Using Writeback_SD only helps with the X server problem; the other
problems look cache related. So I tried running with the L3 cache
disabled. That helps greatly: there are no oopses now. With a little
tweak to the IDE code, the 'lost interrupt' problem seems to be gone too.
But with only L3 disabled, the X server problem remains.
I am running stress tests now. I hope they won't give me more surprises.
And here I have a question for Mr. Adam: the original Linux code uses
"Writeback_Inv_D" and "Hit_Invalidate_I", not "Writeback_D" and
"Hit_Invalidate_I". Could that lead to the problem?
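To make the question concrete, here is a deliberately tiny toy model of what the two writeback variants are expected to do before Hit_Invalidate_I. It is an assumption-laden sketch, not a description of the RM7000: the class and method names are hypothetical, each "line" is a single word, and write buffers and the L2/L3 caches are ignored, which is exactly where the real problem may be hiding.

```python
class ToyCaches:
    """One word per line, write-back D-cache, separate I-cache, no write buffer."""

    def __init__(self):
        self.memory = {}
        self.dcache = {}   # addr -> (value, dirty)
        self.icache = {}   # addr -> value

    def store(self, addr, value):
        # A CPU store lands in the D-cache only and marks the line dirty.
        self.dcache[addr] = (value, True)

    def fetch(self, addr):
        # An instruction fetch fills the I-cache from memory on a miss.
        if addr not in self.icache:
            self.icache[addr] = self.memory.get(addr, 0)
        return self.icache[addr]

    def hit_writeback_d(self, addr):
        # Write a dirty line back to memory; the line stays valid.
        if addr in self.dcache:
            value, dirty = self.dcache[addr]
            if dirty:
                self.memory[addr] = value
            self.dcache[addr] = (value, False)

    def hit_writeback_inv_d(self, addr):
        # Same as above, then drop the line from the D-cache.
        self.hit_writeback_d(addr)
        self.dcache.pop(addr, None)

    def hit_invalidate_i(self, addr):
        self.icache.pop(addr, None)


def run(writeback_op, invalidate_i=True):
    """Write a trampoline word, flush it, and report what a fetch then sees."""
    caches = ToyCaches()
    addr = 0x7FFF7600
    caches.memory[addr] = 0            # old stack contents
    caches.fetch(addr)                 # a stale copy gets into the I-cache
    caches.store(addr, 0x24021017)     # kernel writes the trampoline
    getattr(caches, writeback_op)(addr)
    if invalidate_i:
        caches.hit_invalidate_i(addr)
    return caches.fetch(addr), addr in caches.dcache


if __name__ == "__main__":
    for op in ("hit_writeback_d", "hit_writeback_inv_d"):
        print(op, run(op))
```

In this simplified model both variants make the new word visible to the instruction fetch; the only difference is whether the line remains in the D-cache afterwards. Skipping the I-cache invalidate leaves the stale word in place.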
BTW, a silly question: how can I make my email show up prettier? I find
that the mailing list often breaks my lines very badly. I feel guilty
about that :) I am using Mozilla Composer; the original line breaks are
inserted manually (I hit Enter when I feel a line is long enough).
Thank you for any help.
Adam Kiepul wrote:
Hi Fuxin,
Could you please provide me with the _original_ Kernel code disassembly snippet
around the point where your SYNC patch applies?
Also, can you check what RM7000 part revision is on your board? You can find it
out by reading the PrID register.
I will check if there is an erratum that the code could trigger.
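Once the PrID value has been read on the board, splitting it into fields is straightforward. The sketch below assumes the R4000-style layout (implementation number in bits 15:8, revision in bits 7:0, with the revision conventionally read as major.minor nibbles) and that the RM7000 family reports implementation 0x27. The function name and the sample value 0x2721 are hypothetical; the real value has to come from the board.

```python
RM7000_IMP = 0x27   # assumed implementation number for the RM7000 family


def decode_prid(prid):
    """Split a 32-bit PrID value into implementation and revision fields."""
    revision = prid & 0xFF
    return {
        "implementation": (prid >> 8) & 0xFF,
        "revision": revision,
        "rev_major": (revision >> 4) & 0xF,
        "rev_minor": revision & 0xF,
    }


if __name__ == "__main__":
    fields = decode_prid(0x2721)   # hypothetical sample value
    print(fields, fields["implementation"] == RM7000_IMP)
```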
By the way, are you aware of any other ev64240 board that would exhibit the
same behavior?
I would be quite careful drawing any conclusions at the moment, since we cannot preclude
the possibility that it is simply a "bad CPU on the board" case. Please note
that the SYNC instruction changes a lot about how things physically happen in the CPU,
so it can often mask various problems, such as a bad part.
Thank you,
Adam
-----Original Message-----
From: Fuxin Zhang [mailto:fxzhang@ict.ac.cn]
Sent: Thursday, July 31, 2003 9:59 PM
To: Ralf Baechle
Cc: Adam Kiepul
Subject: Re: RM7k cache_flush_sigtramp
I am using a slightly modified 2.4.21-pre4, based on CVS from early this
month(?).
We have merged with the latest CVS; I will run it and report the result tonight.
Ralf Baechle wrote:
Adam,
On Fri, Aug 01, 2003 at 08:40:14AM +0800, Fuxin Zhang wrote:
Current Linux code does exactly this. But without a sync I was seeing all
kinds of faults occurring around the sigreturn point on the stack, and a
sync does greatly improve the stability.
The ordering does matter however since the Hit_Invalidate_I makes sure the
write buffer is flushed.
Could there be an erratum explaining Fuxin's findings?
Fuxin, what version are you running?
Ralf
GNU gdb 2002-04-01-cvs
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "mipsel-linux"...(no debugging symbols found)...
Core was generated by `/usr/bin/X11/X -dpi 100 -nolisten tcp'.
Program terminated with signal 4, Illegal instruction.
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/ld.so.1
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_files.so.2
GDB is unable to find the start of the function at 0x7fff7600
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
This problem is most likely caused by an invalid program counter or
stack pointer.
However, if you think GDB should simply search farther back
from 0x7fff7600 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
#0 0x7fff7600 in ?? ()
(gdb) where
#0 0x7fff7600 in ?? ()
(gdb) disass 0x7fff7580 0x7fff7680
Dump of assembler code from 0x7fff7580 to 0x7fff7680:
0x7fff7580: nop
0x7fff7584: nop
0x7fff7588: nop
0x7fff758c: nop
0x7fff7590: nop
0x7fff7594: nop
0x7fff7598: nop
0x7fff759c: nop
0x7fff75a0: nop
0x7fff75a4: nop
0x7fff75a8: nop
0x7fff75ac: nop
0x7fff75b0: nop
0x7fff75b4: nop
0x7fff75b8: nop
0x7fff75bc: nop
0x7fff75c0: nop
0x7fff75c4: nop
0x7fff75c8: nop
0x7fff75cc: nop
0x7fff75d0: nop
0x7fff75d4: beq at,t3,0x80003ef8
0x7fff75d8: sllv zero,zero,zero
0x7fff75dc: 0xe
0x7fff75e0: beq zero,gp,0x7ffef244
0x7fff75e4: beq zero,t0,0x800012e8
0x7fff75e8: 0x7fff7600
0x7fff75ec: beq zero,t2,0x7ffd98b0
0x7fff75f0: sd ra,-1(ra)
0x7fff75f4: sd ra,-1(ra)
0x7fff75f8: slti sp,s6,-25040
0x7fff75fc: beq zero,t2,0x7ffd9cc0
0x7fff7600: li v0,4119
0x7fff7604: syscall
0x7fff7608: 0x12c
0x7fff760c: lb a0,-19437(zero)
0x7fff7610: slti s1,s6,17812
0x7fff7614: nop
0x7fff7618: nop
0x7fff761c: nop
0x7fff7620: 0xcf9210
0x7fff7624: nop
0x7fff7628: mfhi zero
0x7fff762c: nop
0x7fff7630: beq zero,at,0x8000caa4
0x7fff7634: nop
0x7fff7638: 0xe
0x7fff763c: nop
0x7fff7640: slti v0,k1,28680
0x7fff7644: nop
0x7fff7648: 0x1228
0x7fff764c: nop
0x7fff7650: nop
0x7fff7654: nop
0x7fff7658: multu zero,zero
0x7fff765c: nop
0x7fff7660: nop
0x7fff7664: nop
0x7fff7668: slti t9,s7,17756
0x7fff766c: nop
0x7fff7670: beq at,t3,0x7fff6f04
0x7fff7674: nop
0x7fff7678: 0x12c
0x7fff767c: nop
End of assembler dump.
(gdb) info regs
         zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 b004b400 ffffffff ffffffff 00000001 7fff73f8 00000000 00000000
           t0       t1       t2       t3       t4       t5       t6       t7
 R8  0000b400 00000000 00000000 00000000 00000000 822b2880 822b2900 00000000
           s0       s1       s2       s3       s4       s5       s6       s7
 R16 00000000 102b3248 00000004 0000000e 101cdf18 102b1820 00000001 00000000
           t8       t9       k0       k1       gp       sp       s8       ra
 R24 00000000 006c86ec 00000000 00000000 10082740 7fff75f0 7fff7d08 7fff7600
           sr       lo       hi      bad    cause       pc
     a004b413 00000002 00000000 8009c6a0 00000028 7fff7600
          fsr      fir       fp
     00800004 00000000 00000000
(gdb) quit
GNU gdb 2002-04-01-cvs
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "mipsel-linux"...(no debugging symbols found)...
Core was generated by `/bin/sh /usr/bin/X11/startx'.
Program terminated with signal 4, Illegal instruction.
GDB is unable to find the start of the function at 0x7fff75b8
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
This problem is most likely caused by an invalid program counter or
stack pointer.
However, if you think GDB should simply search farther back
from 0x7fff75b8 for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
#0 0x7fff75b8 in ?? ()
(gdb) info reg
         zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 2ad918f0 2ad918f0 0000000a 00000012 7fff7538 00000001 00000001
           t0       t1       t2       t3       t4       t5       t6       t7
 R8  0000000a 2aca6394 00000000 00000004 00000000 00000000 00000000 07200720
           s0       s1       s2       s3       s4       s5       s6       s7
 R16 00000000 00000004 00000080 7fff7878 00000003 ffffffff 1000f0f8 00000001
           t8       t9       k0       k1       gp       sp       s8       ra
 R24 00000000 00000000 00000000 00000000 1000d880 7fff7590 00000003 7fff75a0
           sr       lo       hi      bad    cause       pc
     a004f413 000001b0 00000000 8009c6a0 80000028 7fff75b8
          fsr      fir       fp
     00000000 00000000 00000000
(gdb) disass 0x7fff7500 0x7fff7600
Dump of assembler code from 0x7fff7500 to 0x7fff7600:
0x7fff7500: 0xc2009d
0x7fff7504: 0x10000e8
0x7fff7508: 0x11a0110
0x7fff750c: 0x990121
0x7fff7510: slti t9,s6,32304
0x7fff7514: tltu a0,t9,0x2
0x7fff7518: slti t9,s6,32304
0x7fff751c: 0x442c88
0x7fff7520: nop
0x7fff7524: nop
0x7fff7528: nop
0x7fff752c: nop
0x7fff7530: b 0x7ffed734
0x7fff7534: nop
0x7fff7538: nop
0x7fff753c: slti t8,s6,-8108
0x7fff7540: nop
0x7fff7544: sllv zero,zero,zero
0x7fff7548: sll zero,zero,0x2
0x7fff754c: 0x7fff7878
0x7fff7550: sra zero,zero,0x0
0x7fff7554: sd ra,-1(ra)
0x7fff7558: b 0x7fff393c
0x7fff755c: b 0x7ffed760
0x7fff7560: teq v0,a0,0xa9
0x7fff7564: nop
0x7fff7568: nop
0x7fff756c: nop
0x7fff7570: nop
0x7fff7574: nop
0x7fff7578: b 0x7ffed77c
0x7fff757c: nop
0x7fff7580: nop
0x7fff7584: b 0x7ffed788
0x7fff7588: 0x7fff75a0
0x7fff758c: 0x1
0x7fff7590: b 0x7fff2174
0x7fff7594: 0x7fff7804
0x7fff7598: slti t9,s6,32304
0x7fff759c: 0x475718
0x7fff75a0: li v0,4119
0x7fff75a4: syscall
0x7fff75a8: slti t8,s6,-8108
0x7fff75ac: lb a0,-3053(zero)
0x7fff75b0: slti t5,s6,9620
0x7fff75b4: nop
0x7fff75b8: nop
0x7fff75bc: nop
0x7fff75c0: b 0x7fff8cb4
0x7fff75c4: nop
0x7fff75c8: sllv zero,zero,zero
0x7fff75cc: nop
0x7fff75d0: nop
0x7fff75d4: nop
0x7fff75d8: sra zero,zero,0x0
0x7fff75dc: nop
0x7fff75e0: 0x7fff7878
0x7fff75e4: nop
0x7fff75e8: sll zero,zero,0x2
0x7fff75ec: nop
0x7fff75f0: 0x1
0x7fff75f4: nop
0x7fff75f8: mult zero,zero
0x7fff75fc: nop
End of assembler dump.
(gdb) quit
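Both register dumps above report a cause value whose low byte is 0x28. Assuming the standard R4000-style Cause layout (ExcCode in bits 6:2, BD in bit 31), this can be decoded as follows. The helper name is hypothetical and the name table is limited to the common exception codes; treat it as a minimal sketch rather than a full decoder.

```python
EXC_NAMES = {
    0: "Int", 1: "Mod", 2: "TLBL", 3: "TLBS", 4: "AdEL", 5: "AdES",
    6: "IBE", 7: "DBE", 8: "Sys", 9: "Bp", 10: "RI", 11: "CpU",
    12: "Ov", 13: "Tr",
}


def decode_cause(cause):
    """Extract the branch-delay bit and exception code from a Cause value."""
    exc_code = (cause >> 2) & 0x1F
    return {
        "bd": bool((cause >> 31) & 1),
        "exc_code": exc_code,
        "name": EXC_NAMES.get(exc_code, "unknown"),
    }


if __name__ == "__main__":
    for value in (0x00000028, 0x80000028):   # the two cause values in the dumps
        print(hex(value), decode_cause(value))
```

Under that layout both dumps decode to exception code 10 (RI, reserved instruction), which is consistent with the SIGILL reported by gdb; the second dump additionally has the BD bit set, which would mean the exception was taken in a branch delay slot.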