linux-mips

Re: [PATCH] MIPS: Optimize spinlocks.

To: Ralf Baechle <ralf@linux-mips.org>
Subject: Re: [PATCH] MIPS: Optimize spinlocks.
From: David Daney <ddaney@caviumnetworks.com>
Date: Thu, 25 Feb 2010 09:31:38 -0800
Cc: linux-mips@linux-mips.org
In-reply-to: <20100225141548.GB29565@linux-mips.org>
Original-recipient: rfc822;linux-mips@linux-mips.org
References: <1265311909-1679-1-git-send-email-ddaney@caviumnetworks.com> <20100224155336.GA5130@linux-mips.org> <4B8559F0.6080908@caviumnetworks.com> <20100225141548.GB29565@linux-mips.org>
Sender: linux-mips-bounce@linux-mips.org
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.7) Gecko/20100120 Fedora/3.0.1-1.fc12 Thunderbird/3.0.1
On 02/25/2010 06:15 AM, Ralf Baechle wrote:
On Wed, Feb 24, 2010 at 08:55:12AM -0800, David Daney wrote:

It is possible that choosing a better nudge_writes()
implementation for R10K could erase the 3% degradation.
Perhaps:

#define nudge_writes() do { } while (0)

raw_spin_unlock must provide a barrier so this wouldn't be a valid
implementation for nudge_writes().

That barrier is separate (and present). The sole purpose of nudge_writes() is to speed up the global visibility of the releasing write; it does not have anything to do with locking semantics.

Implementing it as barrier(), that is, a pure compiler barrier, is the
most liberal valid implementation.

No, the most liberal would be a true NOP: 'do { } while (0)'.
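
To make the separation concrete, here is a minimal userspace sketch, assuming C11 atomics in place of the kernel primitives. The names toy_lock, toy_lock_acquire and toy_lock_release are hypothetical and this is not the MIPS lock from the patch. It only shows that the release ordering lives in the store, while nudge_writes() is an independent hint that can legitimately expand to nothing.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the kernel macro; here it is the true NOP variant. */
#define nudge_writes() do { } while (0)

/* Hypothetical toy lock, for illustration only. */
struct toy_lock {
	atomic_int locked;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	int expected = 0;

	/* Spin until 0 -> 1 succeeds; acquire ordering on success. */
	while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
						      memory_order_acquire,
						      memory_order_relaxed))
		expected = 0;
}

static void toy_lock_release(struct toy_lock *l)
{
	/* Releasing a lock that is not held is a caller bug. */
	assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);

	/* The ordering required by locking semantics is in this store. */
	atomic_store_explicit(&l->locked, 0, memory_order_release);

	/* Separate hint: ask for the store to become visible sooner. */
	nudge_writes();
}
```

Dropping the nudge_writes() line from toy_lock_release() would not change correctness in this sketch, only (possibly) how quickly a waiter sees the lock become free.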


Basically you want something that is fast, but that also forces the
write to be globally visible as soon as possible.  Some processors
have a prefetch instruction that does this.  On other processors a
NOP is optimal as they don't combine writes in the write back
buffer.

There is a wbflush() function that could potentially be used, but
its implementation is too heavy on Octeon.

For IP27 which is a strongly ordered system nudge_writes() is implemented
as barrier().
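
The two candidates discussed so far could be selected at build time roughly as follows. This is a sketch only: NUDGE_KIND, NUDGE_NOP, NUDGE_BARRIER and publish() are hypothetical names, not anything from the patch, and the compiler barrier uses the GCC/Clang empty-asm idiom with a "memory" clobber, which is how barrier() is spelled for GCC.

```c
#include <assert.h>

/* Hypothetical selectors for this sketch. */
#define NUDGE_NOP     0	/* CPU does not combine writes in its write buffer */
#define NUDGE_BARRIER 1	/* compiler-only barrier, no instruction emitted */

#ifndef NUDGE_KIND
#define NUDGE_KIND NUDGE_BARRIER
#endif

#if NUDGE_KIND == NUDGE_NOP
#define nudge_writes() do { } while (0)
#elif NUDGE_KIND == NUDGE_BARRIER
/* Empty asm with a "memory" clobber: stops compiler reordering only. */
#define nudge_writes() __asm__ __volatile__("" : : : "memory")
#else
#error "a CPU-specific nudge_writes() would go here"
#endif

/* Hypothetical helper: store a value, then hint that it should drain. */
static void publish(volatile int *slot, int value)
{
	assert(slot);
	*slot = value;
	nudge_writes();
}
```

Neither variant emits an instruction; a CPU with a suitable prefetch or write-buffer flush instruction would add a third branch.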

Another experiment I did was alignment.  A branch on an R10000 has a
significant execution time penalty if its delay slot overlaps a
128-byte S-cache boundary.  Suitable alignment, however, did not seem
to make any difference at all on R10000.
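
For reference, the boundary condition can be checked with simple address arithmetic. The sketch below assumes that "overlapping" means the branch is the last instruction before a 128-byte boundary, so its delay slot starts the next 128-byte block; that reading, and the helper names, are assumptions rather than anything stated above. It also assumes fixed 4-byte MIPS instructions.

```c
#include <assert.h>
#include <stdint.h>

#define SCACHE_BOUNDARY 128u	/* boundary size quoted in the text */
#define MIPS_INSN_BYTES   4u	/* fixed-width MIPS instruction */

/* Nonzero if the delay slot at branch_addr + 4 begins a new block. */
static int delay_slot_crosses_boundary(uintptr_t branch_addr)
{
	assert(branch_addr % MIPS_INSN_BYTES == 0);
	return (branch_addr / SCACHE_BOUNDARY) !=
	       ((branch_addr + MIPS_INSN_BYTES) / SCACHE_BOUNDARY);
}

/* Padding (one instruction) needed before the branch to avoid that. */
static unsigned int pad_before_branch(uintptr_t branch_addr)
{
	return delay_slot_crosses_boundary(branch_addr) ? MIPS_INSN_BYTES : 0;
}
```

A branch at offset 0x7c would need one instruction of padding under this reading, while one at 0x78 or 0x80 would not.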

   Ralf


