There is a possible cache line read-after-write pseudo-dependency that, along 
with the code alignment in terms of the instruction pair doublewords, may do 
something weird to the sb1250 pipeline. Just my guess.
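
A quick way to check the alignment half of this guess is to look at where the loop code lands relative to a doubleword and a cache line. The helper below is only a sketch: the name offset_in_block is made up for illustration, and it does nothing more than mask an address.

```c
#include <stdint.h>

/* Hypothetical helper (not from the original post): byte offset of addr
 * inside an aligned block of `size` bytes. size must be a power of two. */
unsigned offset_in_block(uintptr_t addr, unsigned size)
{
    return (unsigned)(addr & (uintptr_t)(size - 1));
}
```

With an address taken from the disassembly of the test program, offset_in_block(addr, 8) gives the position inside the instruction pair doubleword, and offset_in_block(addr, 32) the position inside a cache line, assuming 32-byte lines (that size is my assumption, not something stated above).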

I found some very strange behavior on the sb1250.
The compiler is gcc 3.2.3 with the sibyte mods, and the box runs
Linux 2.4.21 with whatever mods came off sibyte.

The test sends a large amount of traffic.

Given the nature of the processing, say I was getting 100Kpps throughput.
Now I fire up a very basic program that just loops forever, summing
two numbers.

      1 #include <stdlib.h>
      3 int main ()
      4 {
      5         int a = 1;
      6         int b = 2;
      7         int c = 0; 
      8         // int c;
      9         while (1) {
     10                 c = a + b;
     11         }
     12 }

I see very little drop in throughput - probably around 0.01%.

Now comment out line 7 and uncomment line 8. Hallelujah.
Performance drops to about 100pps. That's about a factor of 1000 down!

The interesting thing is that if you add a nop (__asm__ __volatile__("nop");)
to the second version just before the while loop, we get back the same
performance as in the earlier version.
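
For reference, the second version with the nop workaround looks roughly like the sketch below. To make it something that terminates and can be called, I wrapped the loop in a function with an iteration count; the name spin_sum and that count are hypothetical, and the real test program loops forever inside main().

```c
/* Sketch of the second (slow) version plus the nop workaround.
 * Assumption: spin_sum() and its bounded loop are for illustration only;
 * the original program uses while (1) inside main().
 * Call it with iterations >= 1. */
int spin_sum(unsigned long iterations)
{
    int a = 1;
    int b = 2;
    int c;                          /* not initialized, as in the second version */

    /* One extra instruction ahead of the loop; it moves the code
     * that follows without changing what the loop computes. */
    __asm__ __volatile__("nop");

    do {
        c = a + b;
    } while (--iterations != 0);

    return c;                       /* always assigned, the loop runs at least once */
}
```

The nop does not change the result, only where the loop instructions sit, which is why alignment looks like a suspect. Note that an optimizing compiler may collapse the bounded loop, so build without optimization when comparing the generated code.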
