
Strange gp corruption problem

Subject: Strange gp corruption problem
From: Mike Crowe
Date: Thu, 12 Jul 2007 18:06:24 +0100
User-agent: Mutt/1.5.13 (2006-08-11)
I'm seeing a rather strange problem on an NXP PNX8550 board we're
using (the PNX8550 is a SoC with a MIPS PR4450 core) and I'm wondering
if anyone here has ever seen anything similar or has any particular
advice as to how it can be investigated. I've searched the list
archives, googled, diffed with 2.6.22 etc. but not found anything
revealing. :(

We're using gcc 4.0.0 (from ELDK-4.0) on a 2.6.19 kernel with
glibc/LinuxThreads 2.3.5. I realise that this version of gcc is quite
old but it's the version used by the chip vendor for this platform.

The problem is easy to reproduce with a particular build of our
software but goes away very easily if code is changed, particularly
changes that affect the GOT or move code around. Even adding a single
"nop" instruction to the offending function "fixes the problem". This
is making it hard to debug.

We have a function that does some string manipulation (not
particularly dangerous manipulation and I've been through it
carefully) and then calls atol. As expected the prologue of this
function calculates the value of the gp register by applying an offset
to the t9 register which contains the address of the start of the
function like this:

 47995c:       3c1c0fba        lui     gp,0xfba
 479960:       279c1fe4        addiu   gp,gp,8164
 479964:       0399e021        addu    gp,gp,t9
 479968:       27bdff80        addiu   sp,sp,-128
 47996c:       afbf007c        sw      ra,124(sp)
 479970:       afbe0078        sw      s8,120(sp)
 479974:       03a0f021        move    s8,sp
 479978:       afbc0010        sw      gp,16(sp)
 47997c:       afc40080        sw      a0,128(s8)
 479980:       afc50084        sw      a1,132(s8)

The function then doesn't go near the t9 or gp registers until it
needs to read the address of the atol function from the GOT, which it
does like this:

 479a98:       8f99a7c4        lw      t9,-22588(gp)
 479a9c:       00402021        move    a0,v0
 479aa0:       0320f809        jalr    t9
 479aa4:       00000000        nop

At this point it segfaults because gp has an invalid value of
0x10497280. t9 still has the correct value of 0x47995c. The gp save
slot at 16(sp) (see 479978 above) also holds the incorrect value
0x10497280.

The correct value for gp in this binary is 0x1001b940.
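
To make the failing load concrete, here is a rough Python sketch of the
address that the lw at 479a98 computes for the atol GOT slot, once with
the correct gp and once with the bad one. The helper names
(sign_extend16, got_slot) are made up for illustration. The only inputs
are the instruction word and the two gp values quoted above, and the
sketch says nothing about whether the resulting addresses are mapped.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed MIPS immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def got_slot(gp, insn_word):
    """Effective address of a gp-relative lw: gp plus the signed 16-bit offset."""
    return (gp + sign_extend16(insn_word)) & 0xFFFFFFFF


LW_T9 = 0x8f99a7c4      # lw t9,-22588(gp) from 479a98
GOOD_GP = 0x1001b940
BAD_GP = 0x10497280

print(sign_extend16(LW_T9))            # -22588
print(hex(got_slot(GOOD_GP, LW_T9)))   # 0x10016104
print(hex(got_slot(BAD_GP, LW_T9)))    # 0x10491a44
```

So with the bad gp the load targets an address about 0x47b940 bytes past
the real GOT slot.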

Interestingly, the bad and good gp values are related as follows:

 0x1001b940 (correct value of gp)
+  0x47995c (address of this function == t9)
+    0x1fe4 (second part of gp fixup (8164 in decimal) from 479960 above)
=0x10497280 (bad value of gp)

This implies that upon execution of the instruction at 0x479960 gp
contained the "good" gp value of 0x1001b940 rather than the 0x0fba0000
it should have contained according to the previous instruction.
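
As a sanity check on that arithmetic, the following Python sketch models
the three gp-setup instructions at 47995c..479964 and compares a normal
run with one where the effect of the lui is lost. It assumes gp held
the correct value on entry (i.e. the caller lives in the same binary);
run_prologue and its skip_lui flag are hypothetical names for
illustration only. The model cannot say why the lui result went
missing, only that losing it reproduces the bad value.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed MIPS immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run_prologue(t9, gp_on_entry, skip_lui=False):
    """Model the gp computation; skip_lui simulates the lui having no effect."""
    gp = gp_on_entry
    if not skip_lui:
        gp = (0x0fba << 16) & MASK32                # lui   gp,0xfba
    gp = (gp + sign_extend16(0x1fe4)) & MASK32      # addiu gp,gp,8164
    gp = (gp + t9) & MASK32                         # addu  gp,gp,t9
    return gp


T9 = 0x47995c           # address of the function
GOOD_GP = 0x1001b940    # assumed value of gp on entry

normal = run_prologue(T9, gp_on_entry=GOOD_GP)
lui_lost = run_prologue(T9, gp_on_entry=GOOD_GP, skip_lui=True)

print(hex(normal))      # 0x1001b940
print(hex(lui_lost))    # 0x10497280
```

The second result matches the bad gp value seen in the register and at
16(sp).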

The only user-space reason I can come up with for this happening is if
the caller jumped into this function one instruction late. This seems
unlikely because t9 contains the correct value and the stack looks
fine. By instrumenting the kernel I determined that no signals are
being delivered around this time but instrumenting all context
switches looked rather hard.

TIA. Any advice gratefully received.

