I'm seeing a rather strange problem on an NXP PNX8550 board we're
using (the PNX8550 is a SoC with a MIPS PR4450 core) and I'm wondering
if anyone here has ever seen anything similar or has any particular
advice as to how it can be investigated. I've searched the list
archives, googled, diffed with 2.6.22 etc. but not found anything.
We're using gcc 4.0.0 (from ELDK-4.0) on a 2.6.19 kernel with
glibc/LinuxThreads 2.3.5. I realise that this version of gcc is quite
old but it's the version used by the chip vendor for this platform.
The problem is easy to reproduce with a particular build of our
software but goes away very easily if code is changed, particularly
changes that affect the GOT or move code around. Even adding a single
"nop" instruction to the offending function "fixes the problem". This
is making it hard to debug.
We have a function that does some string manipulation (not
particularly dangerous manipulation and I've been through it
carefully) and then calls atol. As expected, the prologue of this
function calculates the value of the gp register by applying an offset
to the t9 register, which contains the address of the start of the
function, like this:

  47995c: 3c1c0fba  lui   gp,0xfba
  479960: 279c1fe4  addiu gp,gp,8164
  479964: 0399e021  addu  gp,gp,t9
  479968: 27bdff80  addiu sp,sp,-128
  47996c: afbf007c  sw    ra,124(sp)
  479970: afbe0078  sw    s8,120(sp)
  479974: 03a0f021  move  s8,sp
  479978: afbc0010  sw    gp,16(sp)
  47997c: afc40080  sw    a0,128(s8)
  479980: afc50084  sw    a1,132(s8)

The function then doesn't go near the t9 or gp registers until it
needs to read the address of the atol function from the GOT, and it
does so like this:

  479a98: 8f99a7c4  lw    t9,-22588(gp)
  479a9c: 00402021  move  a0,v0
  479aa0: 0320f809  jalr  t9
  479aa4: 00000000  nop

At this point it segfaults because gp has an invalid value of
0x10497280. t9 still has the correct value of 0x47995c. 16(sp) (see
479978 above) also has the incorrect value of gp of 0x10497280.
The correct value for gp in this binary is 0x1001b940.
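
To make the effect of the bad gp concrete, here is a minimal Python sketch (not part of the original build; the helper names are made up for illustration) that decodes the 16-bit offset from the `lw` word at 479a98 and computes which address the load would touch for each gp value. Whether the resulting address is mapped is not something the report states, so treat this purely as an address calculation.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm):
    """Sign-extend a 16-bit immediate, as MIPS does for lw/addiu offsets."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def got_slot(gp, insn):
    """Effective address of `lw rt, imm(gp)` for a given gp value."""
    return (gp + sign_extend16(insn & 0xFFFF)) & MASK32

LW_ATOL = 0x8F99A7C4   # lw t9,-22588(gp) at 479a98
GOOD_GP = 0x1001B940   # correct gp for this binary
BAD_GP = 0x10497280    # gp observed at the time of the segfault

print(sign_extend16(LW_ATOL & 0xFFFF))   # -22588
print(hex(got_slot(GOOD_GP, LW_ATOL)))   # 0x10016104
print(hex(got_slot(BAD_GP, LW_ATOL)))    # 0x10491a44
```

With the bad gp the load lands about 4.5 MB above the intended GOT slot, so it reads from somewhere that is not the atol entry.
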
Interestingly, the bad and good gp values are related in the following
way:

    0x1001b940  (correct value of gp)
  +   0x47995c  (address of this function == t9)
  +     0x1fe4  (second part of gp fixup (8164 in decimal) from 479960 above)
  = 0x10497280  (bad value of gp)

This implies that when the instruction at 0x479960 executed, gp still
contained the "good" gp value of 0x1001b940 rather than the 0x0fba0000
it should have contained according to the previous instruction.
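
The arithmetic can be checked with a small Python model of the first three prologue instructions. This is only a sketch under the assumption that gp already holds 0x1001b940 on entry; the function name `prologue_gp` and the `run_lui` switch are hypothetical, used here to model "the lui had no effect" without claiming anything about why.

```python
MASK32 = 0xFFFFFFFF

def prologue_gp(t9, gp_on_entry, run_lui=True):
    """Model the gp setup at 47995c..479964 with 32-bit wraparound."""
    gp = gp_on_entry
    if run_lui:
        gp = (0x0FBA << 16) & MASK32   # 47995c: lui   gp,0xfba
    gp = (gp + 8164) & MASK32          # 479960: addiu gp,gp,8164
    gp = (gp + t9) & MASK32            # 479964: addu  gp,gp,t9
    return gp

T9 = 0x47995C             # address of the function, as passed in t9
ENTRY_GP = 0x1001B940     # assumed gp on entry (the binary's correct gp)

print(hex(prologue_gp(T9, ENTRY_GP)))                  # 0x1001b940
print(hex(prologue_gp(T9, ENTRY_GP, run_lui=False)))   # 0x10497280
```

The first result matches the correct gp and the second matches the bad value seen at the segfault, which is consistent with the lui result being lost or skipped while the addiu and addu executed normally.
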
The only user-space reason I can come up with for this happening is if
the caller jumped into this function one instruction late. This seems
unlikely because t9 contains the correct value and the stack looks
fine. By instrumenting the kernel I determined that no signals are
being delivered around this time but instrumenting all context
switches looked rather hard.
TIA. Any advice gratefully received.