(I tried to include all arch maintianers and memory maintainers, if I
missed someone, please let me know).
OK, as promised, I've got a working VM percpu solution.
A little history: please read the thread http://lkml.org/lkml/2006/4/14/137
Preivously I noticed that the per_cpu variables were implemented with
a little hack that Rusty Russell wrote up. I call it a hack because,
for modules an arbitrary number was used to hold all per_cpu variables
that would be used in the kernel or in modules. If this number was too
big, you waste the memory that is allocated for it, and if it were
too small, you wont be able to load all the modules that are required.
My first attempt to fix this introduced another dereference, to allow
for modules to allocate their own memory. This was quickly shot down,
and for good reason, because dereferences kill performance, and don't
play nice with large SMP systems that depend on per_cpu being fast.
But that first attempt had one benefit. It had good responses on how
to actually fix the problem. Without the first try, I would not have
tried this approach.
So what is this solution?
I now place the per_cpu variables into VM, such that the pages are
only allocated when needed. All the architecture needs to do is
supply a VM address range, size for each CPU to use (note this
implementation expects all the VM CPU areas to be together), and
three functions to allow for allocating page tables at bootup.
The bootup page table allocations are needed because the percpu
variables are used before the system initializes the memory. So
bootmem is needed to load the page tables.
But that's it! No more hacks for architectures with lots of CPUS and
NUMA to implement their own percpu algorithms. They just supply the
functions to allocate the variables, and the memory will be loaded
appropriately. So all architectures with VM can use the generic
solution. Does the main-line kernel support any architectures that
doesn't have a VM?
Another benefit is that this approach removes the one redirection that
was already used in the generic solution. Since the virtual memory
of the architecture for the per_cpu is static, the calculation to
find the variable is done with constants. So this is another win
for this solution.
So the three things that this patch gives us are:
1) Robust per_cpu. No more guessing the size of the per_cpu variable
sections. Default VM is 1 Meg per cpu. If you use more than that
I guess you're SOL.
2) Generic solution for all architectures. No more fancy hacks to
get the per cpu offest.
3) Removes the inderection of the current per_cpu generic solution.
The set of patches that I have implement this for the i386 architecture
and lays out the ground work for others. The i386 implementation I did
was mainly a hack. Nick Piggin suggested that before I do too much
work, I should post my stuff and ask for comments. That's what I'm doing
This patch set works for my laptop (P4 HT) and I haven't tested
it with any other configuration. The implementation that I did for
i386 was only to get it to work and is not flexible. It needs much
better implementation, and I ask for help with that.
I basically concentrated to the core stuff. Another nice thing about
this patch set is that it needs both CONFIG_HAS_VM_PERCPU and
__ARCH_HAS_VM_PERCPU to take advantage of the VM percpu implementation.
Without these set, it defaults to the old usage. This way we
can slowly implement each architecture, in an incremental fashion.
OK, let'er rip. Flame me, yell at me, curse me, but please give me
Thanks for your time.
PS. The attached module was used to test this patch on my laptop.
Description: Text Data