On Sun, Apr 16, 2006 at 05:34:18PM +0200, Arnd Bergmann wrote:
> On Sunday 16 April 2006 15:40, Steven Rostedt wrote:
> > I'll think more about this, but maybe someone else has some crazy ideas
> > that can find a solution to this that is both fast and robust.
> Ok, you asked for a crazy idea, you're going to get it ;-)
> You could take a fixed range from the vmalloc area (e.g. 1MB per cpu)
> and use that to remap pages on demand when you need per cpu data.
> #define PER_CPU_BASE 0xe000000000000000UL /* arch dependent */
> #define PER_CPU_STRIDE 0x100000UL
> #define __per_cpu_offset(__cpu) (PER_CPU_BASE + PER_CPU_STRIDE * (__cpu))
> #define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset(cpu)))
> #define __get_cpu_var(var) per_cpu(var, smp_processor_id())
> This is a lot like what the current sparc64 implementation already does.
> The tricky part here is the remapping of pages. You'd need to
> alloc_pages_node() new pages whenever the already reserved space is
> not enough for the module you want to load and then map_vm_area()
> them into the space reserved for them.
> Advantages of this solution are:
> - no dependent load access for per_cpu()
> - might be flexible enough to implement a faster per_cpu_ptr()
> - can be combined with ia64-style per-cpu remapping
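To make the addressing scheme quoted above concrete, here is a minimal user-space sketch of it. It is only a model under stated assumptions: a calloc()'d region stands in for the reserved vmalloc range, and MODEL_NR_CPUS, model_init(), model_per_cpu_offset() and model_per_cpu_ptr() are hypothetical names, not kernel APIs. The point it shows is that a CPU's offset is pure arithmetic, with no table load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS  4           /* hypothetical CPU count for the model */
#define PER_CPU_STRIDE 0x100000UL  /* 1MB per cpu, as in the quoted macros */

/* Stands in for PER_CPU_BASE; a real kernel would reserve a fixed range. */
static char *model_base;

/* Reserve one zeroed stride-sized area per CPU. Returns 0 on success. */
static int model_init(void)
{
	model_base = calloc(MODEL_NR_CPUS, PER_CPU_STRIDE);
	return model_base ? 0 : -1;
}

/* Offset of a CPU's area from the base: a multiply, no memory access. */
static inline size_t model_per_cpu_offset(unsigned int cpu)
{
	return PER_CPU_STRIDE * cpu;
}

/* Address of the variable at offset var_off inside the area of 'cpu'. */
static inline void *model_per_cpu_ptr(size_t var_off, unsigned int cpu)
{
	assert(model_base != NULL);
	assert(cpu < MODEL_NR_CPUS && var_off < PER_CPU_STRIDE);
	return model_base + model_per_cpu_offset(cpu) + var_off;
}
```

After a successful model_init(), model_per_cpu_ptr(64, 0) and model_per_cpu_ptr(64, 3) are exactly 3 * PER_CPU_STRIDE bytes apart. The on-demand page remapping is deliberately left out of the model.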
An implementation similar to the one you are mentioning was already proposed.
The design was also meant to not restrict/limit per-cpu memory being
allocated from modules. Maybe it was too early then, and maybe now is the
right time, going by the interest in this thread :). IMHO, a new solution
should fix both static and dynamic per-cpu allocators, and should:
- Avoid the possibility of false sharing for dynamically allocated per-CPU
  data (with the current alloc_percpu)
- Work early enough: if alloc_percpu can work early enough, we can use it
  for counters like the slab cachep stats, which are currently racy (using
  atomic_t for them would be bad for performance)
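For readers unfamiliar with the false sharing point, here is a small user-space sketch of one common way to avoid it: give each CPU's counter its own cache line. This is a generic padding technique, not the allocator design discussed in this thread. The 64-byte line size and the names fs_counter, FS_NR_CPUS and fs_alloc_counters() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FS_CACHE_LINE 64  /* assumed cache line size */
#define FS_NR_CPUS    4   /* hypothetical CPU count */

/* The alignment pads the struct to a full line, so two CPUs never
 * write to the same cache line when each updates its own counter. */
struct fs_counter {
	alignas(FS_CACHE_LINE) long value;
};

/* Allocate one line-aligned, zeroed counter per CPU (C11 aligned_alloc). */
static struct fs_counter *fs_alloc_counters(void)
{
	struct fs_counter *c;
	size_t i;

	c = aligned_alloc(FS_CACHE_LINE, FS_NR_CPUS * sizeof(*c));
	if (!c)
		return NULL;
	for (i = 0; i < FS_NR_CPUS; i++)
		c[i].value = 0;
	return c;
}
```

The cost is memory: each long occupies a whole line. Packing all of one CPU's variables into that CPU's own area avoids that waste.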
The extra dereference in Steven's original proposal is bad (I had done some
measurements earlier). My implementation had one less memory reference than
the static per-cpu allocator, but the two performed the same, since the
__per_cpu_offset table is always cache hot.
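A rough user-space sketch of the two access patterns being compared, to show where the loads come from. It is a simplified model, not kernel code: all deref_* names are hypothetical, and in the indirect variant the slot lookup is reduced to a plain array index.

```c
#include <assert.h>
#include <stddef.h>

#define DEREF_NR_CPUS 2  /* hypothetical CPU count */

/* Static style: one copy of the per-cpu section per CPU, located through
 * an offset table (the role __per_cpu_offset plays in the kernel). */
static long deref_area[DEREF_NR_CPUS][8];
static size_t deref_offset_table[DEREF_NR_CPUS] = {
	0, sizeof(deref_area[0])
};

static long *deref_table_lookup(size_t var_off, unsigned int cpu)
{
	/* One load (the table entry), then address arithmetic. */
	return (long *)((char *)deref_area + deref_offset_table[cpu] + var_off);
}

/* Indirect style: the per-cpu slot holds a pointer to the real data,
 * so every access pays for a second, dependent load. */
static long deref_backing[DEREF_NR_CPUS];
static long *deref_slot[DEREF_NR_CPUS] = {
	&deref_backing[0], &deref_backing[1]
};

static long deref_indirect_read(unsigned int cpu)
{
	long *p = deref_slot[cpu];  /* first load: fetch the pointer */
	return *p;                  /* second load: fetch the data */
}
```

In the table variant the only extra load is the table entry itself, which is why a cache-hot table makes it cost about the same as computing the offset directly.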
> Disadvantages are:
> - you can't use huge tlbs for mapping per cpu data like the
> regular linear mapping -> may be slower on some archs
Yep, we waste a few TLB entries then, which is a bit of a concern, but then we
might be able to use hugetlbs for blocks of per-cpu data and minimize the