linux-mips

system lockup with 2.6.29 on Cavium/Octeon

To: linux-mips@linux-mips.org
Subject: system lockup with 2.6.29 on Cavium/Octeon
From: Greg Ungerer <gerg@snapgear.com>
Date: Wed, 20 May 2009 16:12:32 +1000
Original-recipient: rfc822;linux-mips@linux-mips.org
Sender: linux-mips-bounce@linux-mips.org
User-agent: Thunderbird 2.0.0.19 (X11/20090105)

Hi All,

I have a system lockup problem that I have been looking at on a custom
Cavium/Octeon 5010 based design. I am running on linux-2.6.29 with
David Daney's latest round of PCI and ethernet patches (posted here
on this list).

I have tracked the problem back to local_flush_tlb_kernel_range() in
arch/mips/mm/tlb-r4k.c. At the top of this function is:

    void local_flush_tlb_kernel_range(unsigned long start, unsigned long end)
    {
        unsigned long flags;
        int size;

        ENTER_CRITICAL(flags);
        size = (end - start + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
        size = (size + 1) >> 1;
        if (size <= current_cpu_data.tlbsize / 2) {

The problem is that typical example values I see passed in for start
and end are:

    start = c000000000006000
    end   = ffffffffc01d8000

Now the vmalloc area starts at 0xc000000000000000 and the kernel code
and data are all at 0xffffffff80000000 and above. I don't know if the
start and end are reasonable values, but I can see some logic as to
where they come from. The code path that leads to this is via
__vunmap() and __purge_vmap_area_lazy(). So it is not too difficult
to see how we end up with values like this.

But with values like these the size calculation above still produces
a very large number, larger than the 32-bit "int" that holds "size"
can represent. The result is truncated, and I see large negative values
fall out as size. The following tlbsize check then becomes true, and the
code spins in the loop inside that if statement for a _very_ long time
trying to flush TLB entries.
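
To illustrate the truncation described above, here is a minimal
user-space sketch, not the kernel code itself. It assumes 4 KiB pages
(a PAGE_SHIFT of 12; the real configuration on this board may differ),
and the helper names half_pages_int() and half_pages_ulong() are
hypothetical, introduced only for this example. It also relies on the
compiler wrapping an out-of-range conversion to a 32-bit int and using
an arithmetic right shift on negative values, as GCC does.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the actual kernel configuration may differ. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/*
 * Hypothetical helper: same arithmetic as the original code, with the
 * page count held in a 32-bit signed int. The conversion drops the
 * upper 32 bits (implementation-defined in C, wraps with GCC).
 */
static int32_t half_pages_int(uint64_t start, uint64_t end)
{
    int32_t size;

    assert(end >= start);
    size = (int32_t)((end - start + (PAGE_SIZE - 1)) >> PAGE_SHIFT);
    return (size + 1) >> 1;
}

/*
 * Hypothetical helper: same arithmetic with a 64-bit unsigned count,
 * which is the width of "unsigned long" on a 64-bit MIPS kernel.
 */
static uint64_t half_pages_ulong(uint64_t start, uint64_t end)
{
    uint64_t size;

    assert(end >= start);
    size = (end - start + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
    return (size + 1) >> 1;
}
```

With the start and end values shown earlier, half_pages_int() comes out
negative, so a "size <= tlbsize / 2" test passes for any TLB size, while
half_pages_ulong() gives a count far larger than any TLB and the test
fails as it should.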

This is of course easily fixed, by making that size "unsigned long".
The patch below trivially does this.

But is this analysis correct?

Regards
Greg




The address range size calculation inside local_flush_tlb_kernel_range()
is truncated on 64-bit systems because the variable that holds the size
is too small. The truncated size can make the tlbsize check pass
erroneously, which leaves us spinning inside a loop trying to flush a
huge number of TLB entries. This is for all intents and purposes a
system hang. Fix by using an appropriately sized variable to hold the
size.

Signed-off-by: Greg Ungerer <gerg@snapgear.com>

---

--- ORG.linux-2.6.29/arch/mips/mm/tlb-r4k.c.org 2009-05-20 15:30:28.000000000 +1000
+++ ORG.linux-2.6.29/arch/mips/mm/tlb-r4k.c 2009-05-20 15:30:56.000000000 +1000
@@ -161,7 +161,7 @@ void local_flush_tlb_range(struct vm_are
 void local_flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
        unsigned long flags;
-       int size;
+       unsigned long size;

        ENTER_CRITICAL(flags);
        size = (end - start + (PAGE_SIZE - 1)) >> PAGE_SHIFT;


------------------------------------------------------------------------
Greg Ungerer  --  Principal Engineer        EMAIL:     gerg@snapgear.com
SnapGear Group, McAfee                      PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
