Re: [PATCH] MIPS: Optimize spinlocks.

To: Ralf Baechle <>
Subject: Re: [PATCH] MIPS: Optimize spinlocks.
From: David Daney <>
Date: Wed, 24 Feb 2010 08:55:12 -0800
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20100120 Fedora/3.0.1-1.fc12 Thunderbird/3.0.1

On 02/24/2010 07:53 AM, Ralf Baechle wrote:
On Thu, Feb 04, 2010 at 11:31:49AM -0800, David Daney wrote:

The current locking mechanism uses a ll/sc sequence to release a
spinlock.  This is slower than a wmb() followed by a store to unlock.

The branching forward to .subsection 2 on sc failure slows down the
contended case.  So we get rid of that part too.

Since we are now working on naturally aligned u16 values, we can get
rid of a masking operation as the LHU already does the right thing.
The ANDI are reversed for better scheduling on multi-issue CPUs.
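
As a rough illustration of the unlock change described above, here is a
minimal user-space sketch written with C11 atomics instead of MIPS
assembly.  The struct layout, the ticket_lock_* names and the use of
<stdatomic.h> are assumptions made for illustration, not code from the
patch.  The release store stands in for the wmb() followed by a plain
store; only the lock holder ever writes serving_now, so no ll/sc style
read-modify-write is needed on that path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ticket lock built from two naturally aligned u16 halves. */
struct ticket_lock {
	_Atomic uint16_t serving_now;	/* ticket currently allowed in */
	_Atomic uint16_t next_ticket;	/* next ticket to hand out */
};

static void ticket_lock_init(struct ticket_lock *l)
{
	assert(l);
	atomic_init(&l->serving_now, 0);
	atomic_init(&l->next_ticket, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Taking a ticket still needs an atomic read-modify-write. */
	uint16_t me = atomic_fetch_add_explicit(&l->next_ticket, 1,
						memory_order_relaxed);

	/* Spin until our ticket comes up. */
	while (atomic_load_explicit(&l->serving_now,
				    memory_order_acquire) != me)
		;
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Only the holder writes serving_now: plain load, then release store. */
	uint16_t cur = atomic_load_explicit(&l->serving_now,
					    memory_order_relaxed);

	atomic_store_explicit(&l->serving_now, (uint16_t)(cur + 1),
			      memory_order_release);
}
```

The 16-bit counters wrap naturally, so the comparison in the acquire
loop needs no extra masking in this sketch.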

On a 12 CPU 750MHz Octeon cn5750 this patch improves ipv4 UDP packet
forwarding rates from 3.58*10^6 PPS to 3.99*10^6 PPS, or about 11%.

And in your benchmarking patch you wrote:

                spin_single     spin_multi
base            106885          247941
spinlock_patch  75194           219465

I did some benchmarking on an IP27 (180MHz, 2 CPU, needs LL/SC workaround):

                spin_single     spin_multi
base            229341          3505690
spinlock_patch  177847          3615326

So about 22% speedup for spin_single but 3% slowdown for spin_multi.
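
For reference, the percentages can be reproduced from the table with a
small helper.  This is only a sketch; the function name percent_change
is made up here, and it assumes the table entries are costs where a
lower number is better, which is what "speedup" for the smaller
spin_single figure suggests.

```c
#include <assert.h>

/*
 * Percent change going from base to patched.
 * Negative: the patched number is lower.  Positive: it is higher.
 */
static double percent_change(double base, double patched)
{
	assert(base > 0.0);
	return (patched - base) / base * 100.0;
}
```

Calling percent_change(229341, 177847) gives the spin_single change and
percent_change(3505690, 3615326) gives the spin_multi change for the
IP27 numbers.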

It is possible that choosing a better nudge_writes() implementation for the R10K would erase the 3% degradation. Perhaps:

#define nudge_writes() do { } while (0)

Basically you want something that is fast, but that also forces the write to be globally visible as soon as possible. Some processors have a prefetch instruction that does this. On other processors a NOP is optimal as they don't combine writes in the write back buffer.
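
To make the idea concrete, a per-CPU choice could look roughly like the
sketch below.  It is user-space C11, not kernel code: the
SKETCH_CPU_COMBINES_WRITES switch and sketch_unlock() are hypothetical
names, and the full fence is only a stand-in for whatever cheap "push
the write out" instruction a given CPU provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical build-time switch, not a real kernel config symbol. */
#ifdef SKETCH_CPU_COMBINES_WRITES
/* Stand-in for an instruction that drains the write buffer early. */
#define nudge_writes() atomic_thread_fence(memory_order_seq_cst)
#else
/* Writes are not combined in the write-back buffer: nothing to do. */
#define nudge_writes() do { } while (0)
#endif

static void sketch_unlock(_Atomic uint16_t *serving_now)
{
	uint16_t cur;

	assert(serving_now);
	cur = atomic_load_explicit(serving_now, memory_order_relaxed);

	/* Ordered store that hands the lock to the next ticket. */
	atomic_store_explicit(serving_now, (uint16_t)(cur + 1),
			      memory_order_release);

	/* Then try to make that store visible to other CPUs quickly. */
	nudge_writes();
}
```

The nudge comes after the store because its only job is to shorten the
time the waiters spend looking at a stale serving_now; it is not needed
for correctness in this sketch.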

There is a wbflush() function that could potentially be used, but its implementation is too heavy on Octeon.

Disabling the R10k LL/SC workaround btw. gives another 23% speedup for
spin_single and marginal 0.3% for spin_multi; the latter may well be
statistical noise.