I don't have time to go chasing this stuff any further on your behalf,
but it *does* smell to me like an icache management problem. Remember,
MIPS processors almost universally have split I/D caches and no
coherence support between them, so if you either (a) forget to do an
explicit D-cache write-back operation after copying to a page mapped
write-back that's going to be used as instructions/text, or (b) forget
to do an explicit I-cache invalidate when you re-use a page for
instructions that has been previously used for a different instruction
page, you will have problems, even without going into DMA I/O coherence
issues. If your problem were (b), though, you'd be seeing bad answers,
segmentation violations, bus errors, etc., at least as often as you'd
be seeing illegal instruction exceptions. So my money would be on (a).
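To make case (a) concrete, here is a minimal user-space sketch of the "copy, then write back and invalidate" pattern. It is not taken from the Ingenic patch or from c-r4k.c; `publish_code` is a hypothetical helper name, and it leans on the GCC/Clang builtin `__builtin___clear_cache`, which does whatever the target needs to make freshly written bytes visible to instruction fetch (and may compile to nothing on machines with coherent caches). Inside the kernel the analogous hook is `flush_icache_range()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: copy instructions into a
 * buffer and make them visible to instruction fetch.
 */
void publish_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    /*
     * The copy goes through the D-cache. On a write-back mapping the
     * new bytes may still be sitting in dirty D-cache lines, and the
     * I-cache may still hold stale lines for the same addresses.
     */
    memcpy(dst, src, len);

    /*
     * Ask the toolchain to synchronize the caches for [dst, dst + len).
     * Skipping this step is exactly case (a)/(b) above.
     */
    __builtin___clear_cache((char *)dst, (char *)dst + len);
}
```

A caller would invoke `publish_code(buf, code, code_len)` before jumping into `buf`; the buffer also has to be mapped executable, which this sketch does not handle.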
The need for cache management is so fundamental to Linux for MIPS that
all the necessary general hooks have been there for years. If I were
you, I'd focus on the definitions of the primitives that you spotted in
c-r4k.c. Does the stuff in the JZ_RISC section correspond to the
assembly language flush sequence done in the Ingenic patch to head.S?
Are you sure that the JZ_RISC section is in fact the version of those
functions that's being built into your kernel?
Nils Faerber wrote:
Kevin D. Kissell wrote:
The only thing that you've mentioned below that really makes me think
that you're looking at a kernel bug is the comment about things not
failing under GDB. But if *any* of the programs that are failing fail
under gdb, I'd want to know just what instruction is at the place where
they're taking a SIGILL. If gdb heisenbergs things too much, then the
basic brute force thing to do would be to instrument the kernel itself
to report on what happened, and what it sees at the "bad instruction"
address, using printk. If the memory value actually looks like a legit
instruction, it would confirm the hypothesis that you've got an icache
maintenance problem. I note that the Ingenic patch has a "flushcaches"
routine that has hardwired assumptions about the cache organization.
Could those be incorrect on the chip you're using?
Thanks for having a thought about the issue!
Regrettably, I have to admit that my GDB assumption was not quite
correct :( After *a*lot* more tries I found an application that also
fails inside GDB. With some more tries I can now confirm that
applications fail at random points: it is not a single instruction that
causes the fault.
So I think your memory/cache issue theory sounds pretty interesting...
I just had a look at the JZ4730 code (in arch/mips/jz4730/) and the only
mention of a cache flush is in pm.c which will only be executed in case
of going to sleep (i.e. CPU deep sleep aka s2ram).
arch/mips/mm/c-r4k.c also contains a JZ_RISC section for setting up
cache options, and arch/mips/mm/tlbex.c has a special TLB case for the JZ.
Those look promising!
I can well imagine cases where an incorrect cache flush would cause
this or similar problems.
Regards, and happy hunting,
Happy? Maybe once I've found it. The annoying thing about this is that
Ingenic is not very helpful. I have emailed them several times already asking
for the full datasheet of the CPU, with no reply at all yet. The
datasheet they have on their webpage is just the brief, with about 60
pages, and not very helpful when you are looking for details like cache
So I will have to resort to experiments: trial and error.
Thank you very much for your thoughts and ideas!