Difference between revisions of "PCI Subsystem"

From LinuxMIPS

Revision as of 21:55, 12 November 2004

PCI subsystem

The PCI subsystem is perhaps the most complex code you have to deal with during the porting process. The key to making the PCI subsystem work properly is a good understanding of the PCI bus itself, the code layout, and the execution flow in Linux. As with many other parts of porting, you will find that in the end the amount of code you actually write is minimal.

Overview of the PCI bus

Here we summarize some facts of the PCI bus:

  • The PCI bus has three separate address spaces: config, I/O, and memory space.
  • Every PCI device responds to config commands, and it can respond to I/O accesses and/or memory accesses.
  • At boot time, the BIOS or the OS sets the base address registers (BARs) through configuration space. BARs determine the address ranges in I/O or memory space that a device responds to. Obviously, those ranges must not overlap any other range in the I/O space or memory space of the same PCI bus.
  • Multiple PCI buses can be connected through PCI-PCI bridges.
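Since every device answers config cycles, software can learn how large a BAR's window is before assigning it: write all ones to the BAR, read it back, and the low address bits the device hard-wires to zero reveal the size. The sketch below only does the arithmetic on the read-back value; the function name is hypothetical, it handles 32-bit BARs only (a 64-bit memory BAR spans two registers), and a real implementation would also restore the BAR's original value afterwards.

```c
#include <stdint.h>

/*
 * Hypothetical helper: 'readback' is what a 32-bit BAR returns after
 * 0xffffffff has been written to it. Returns the size in bytes of the
 * window the device decodes, or 0 if the BAR is not implemented.
 */
static uint32_t bar_size_from_readback(uint32_t readback)
{
    uint32_t mask;

    if (readback & 0x1u)
        mask = readback & ~0x3u;   /* I/O BAR: bits 1:0 are flags */
    else
        mask = readback & ~0xfu;   /* memory BAR: bits 3:0 are flags */

    /* The lowest writable address bit is the size. Isolating it also works
       when upper bits are hard-wired to zero (e.g. 16-bit I/O decoders). */
    return mask & (~mask + 1u);
}
```

For example, a memory BAR that reads back 0xfff00000 decodes a 1 MB window.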

BAR assignment in Linux

On all IBM PC-compatible machines, BARs are assigned by the BIOS. Linux simply scans through the buses and records the BAR values.

Some MIPS boards adopt a similar approach, where BARs are assigned by firmware. However, the quality of BAR assignment by firmware varies quite a bit. Some firmware only assigns BARs to on-board PCI devices and ignores all add-on PCI cards. In that case, Linux cannot rely solely on the firmware's assignment.

Depending on the firmware assignment has another drawback: you are stuck with the address ranges set up by the firmware. In other words, if the firmware assigns PCI memory space from 0x10000000 to 0x14000000, you cannot easily move it to a different address range in Linux.

There are three possible ways to fix this:

  • The first way is to fix the BAR assignment manually in your board setup routine. This only works if your board does not have a PCI slot to plug in an arbitrary PCI card. You need to carefully examine the existing PCI resource assignment done by firmware so that you do not assign overlapping address ranges.
  • The second way is to do a complete PCI resource assignment *before* Linux starts PCI bus scanning. In other words, we discard any PCI resource assignment done by firmware and do a new assignment ourselves. This approach gives us complete control over the address ranges and resource allocation. With the CONFIG_PCI_AUTO option (see 'arch/mips/config-shared.in') and the 'arch/mips/kernel/pci_auto.c' file, it turns out to be quite easy to do. This approach is the focus of this chapter.
  • Another approach is to call the 'pci_assign_unassigned_resources()' function, defined in the 'drivers/pci/setup-bus.c' file in recent 2.4.x kernels, *after* Linux completes the PCI bus scan. With earlier versions of this function, Linux only assigns resources to PCI devices whose BARs have *not* been properly assigned. The recent versions (which do "optimal" resource assignment based on sizes) apparently do a complete resource re-assignment. In other words, they do almost exactly the same as the 'pci_auto.c' file does.
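Whichever approach is used, the core of a fresh assignment is simple: take each BAR's size and carve a range out of the PCI memory (or I/O) window at an address aligned to that size, since PCI BARs are naturally aligned. The sketch below is a plain bump allocator in that spirit, not the actual code of 'pci_auto.c'; struct pci_window and window_alloc() are hypothetical names. Handing out the largest BARs first (the "optimal" strategy mentioned above) reduces the space lost to alignment padding.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical allocator state for one PCI memory or IO window. */
struct pci_window {
    uint64_t next;   /* next free address */
    uint64_t end;    /* last usable address (inclusive) */
};

/*
 * Carve out 'size' bytes aligned to 'size' (BAR sizes are powers of two
 * and BARs must be naturally aligned). Returns false if it does not fit.
 */
static bool window_alloc(struct pci_window *w, uint64_t size, uint64_t *base)
{
    uint64_t start;

    if (size == 0 || (size & (size - 1)) != 0)
        return false;                              /* not a valid BAR size */

    start = (w->next + size - 1) & ~(size - 1);    /* round up to alignment */
    if (start + size - 1 > w->end)
        return false;                              /* window exhausted */

    *base = start;
    w->next = start + size;
    return true;
}
```

With a window of 0x10000000-0x13ffffff, a 4 KB BAR lands at 0x10000000 and a following 1 MB BAR is pushed up to 0x10100000.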

[DEBATE] The 'pci_auto' and 'pci_assign_unassigned_resources()' approaches each have their own advantages and disadvantages. Ideally, the whole PCI subsystem should be rewritten to take into account several things that are not currently addressed:

  • A notion of a host-PCI controller, and supporting multiple host-PCI controllers.
  • A kernel-independent abstraction layer to access configuration space, distinguishing Type 0 and Type 1 configuration accesses. This removes the need for 'pci_dev' in the lowest-level PCI routines.
  • Pass 1 scan to do bus number assignment and record bus topology through the top-level host-PCI controller structure.
  • Pass 2 scan to discover all other PCI devices and assign the resources along the way.
  • If we want to do optimal resource assignment, we need to do the resource assignment in Pass 3 instead of Pass 2.
  • A complete PCI device list is built during the above passes.
  • The address ranges for PCI memory space and I/O space are set when the host-PCI controller structures are initially created.
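Pass 1 bus numbering of this kind is usually done depth-first: each bridge gets the next free number as its secondary bus, the bridges behind it are numbered recursively, and its subordinate number is the highest bus number found below it. The sketch models the topology as an in-memory tree; struct bridge and number_buses() are hypothetical, and a real implementation would write the three numbers into each bridge's config header.

```c
#include <stddef.h>

/* Hypothetical in-memory model of a PCI-PCI bridge and the bridges behind it. */
struct bridge {
    struct bridge *child[4];   /* bridges found on this bridge's secondary bus */
    size_t nchild;
    int primary, secondary, subordinate;
};

/*
 * Number the buses below 'b' depth-first. 'bus' is the number of the bus
 * the bridge sits on; '*next' is the next unused bus number.
 * Returns the subordinate (highest) bus number reachable through 'b'.
 */
static int number_buses(struct bridge *b, int bus, int *next)
{
    b->primary = bus;
    b->secondary = (*next)++;
    b->subordinate = b->secondary;
    for (size_t i = 0; i < b->nchild; i++) {
        int sub = number_buses(b->child[i], b->secondary, next);
        if (sub > b->subordinate)
            b->subordinate = sub;
    }
    return b->subordinate;
}
```

With two bridges on bus 0, the first of which has one more bridge behind it, the first bridge ends up covering buses 1-2 and the second gets bus 3.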

PCI startup sequence

1. do_basic_setup() calls pci_init(), which is defined in 'drivers/pci/pci.c'.
2. pci_init() first calls pcibios_init(). If you enable CONFIG_NEW_PCI,
   pcibios_init() is implemented in the 'arch/mips/kernel/pci.c' file.
   Otherwise you need to provide it in your own board-dependent code.
3. Optionally, pcibios_init() may call pciauto_assign_resources() to do a
   complete PCI resource assignment.
4. Somewhere inside pcibios_init(), pci_scan_bus() is called. If a machine has
   multiple host-PCI controllers, pci_scan_bus() should be called for each of
   the top-level PCI buses. Apparently, bus numbers must already have been
   set up before pci_scan_bus() can run properly.
5. Optionally, after pci_scan_bus() is called, pcibios_init() may choose to
   call pci_assign_unassigned_resources() to do a complete PCI resource
   assignment.
6. pcibios_init() will do some more fixups (resources, IRQs, etc.).
7. Returning from pcibios_init(), pci_init() will do a final round of
   device-based fixups.

In this chapter, we focus on the approach where both CONFIG_NEW_PCI and CONFIG_PCI_AUTO are enabled.

PCI driver interface

The ultimate goal of all this PCI work is to set up an environment in which all PCI device drivers can run happily. Knowing how PCI device drivers access PCI resources can greatly help you understand how you should do PCI initialization and setup.

  • inb()/outb()/inw()/outw()/inl()/outl()

Defined in 'include/asm-mips/io.h'. Drivers use these macros to read from and write to PCI IO space. The address arguments are addresses in PCI IO space and should correspond to one of the BAR values of the device. On MIPS machines, we assume PCI IO space is mapped into a contiguous physical address block. The base address of the block is mips_io_port_base, which you need to set up early in board setup. The proper way to set it up is to call set_io_port_base().

  • readb()/writeb()/readw()/writew()/readl()/writel()

Defined in the 'include/asm-mips/io.h' file. Drivers use them to access PCI memory space. On MIPS machines, we assume PCI memory space is mapped 1:1 into a block of physical addresses. Therefore these macros are equivalent to direct physical memory accesses.

  • pci_read_config_word() and friends

Defined in 'include/linux/pci.h' (through PCI_OP()). Drivers use them to read or write the configuration registers of devices. The low-level routines are abstracted as struct pci_ops, and each board must supply one.

  • pci_map_single() and friends

Defined in 'include/asm-mips/pci.h'. Drivers use these macros to map a virtual address to a bus address (so you can tell the device to do DMA). They don't usually affect PCI porting.
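Conceptually, the inb()/outb() family adds the port number to mips_io_port_base and performs a volatile access at the result. The userspace model below makes that concrete, with an ordinary array standing in for the mapped PCI IO window; the model_* names and fake_io_window are hypothetical, and the real macros in 'include/asm-mips/io.h' may additionally handle byte swapping on some big-endian boards.

```c
#include <stdint.h>

/* Stand-in for the PCI IO window that the host bridge maps into CPU space. */
static uint8_t fake_io_window[0x10000];

/* Plays the role of mips_io_port_base, normally set via set_io_port_base(). */
static volatile uint8_t *model_io_port_base = fake_io_window;

/* 'port' is an offset into PCI IO space, e.g. a value taken from an IO BAR. */
static inline uint8_t model_inb(unsigned long port)
{
    return model_io_port_base[port];
}

/* Same argument order as outb(): value first, then port. */
static inline void model_outb(uint8_t val, unsigned long port)
{
    model_io_port_base[port] = val;
}
```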

Setting up the host-PCI controller

As we can see from the above discussion, we need to set up the host-PCI controller such that:

1. It has a 1:1 mapping between PCI memory space and CPU physical address
2. It maps the beginning part of PCI IO space into an address block in physical
   address space.

The host-PCI controller usually allows you to map PCI memory space into a window in physical address space. Let us say the base address is ba_mem and the size is sz_mem. It usually also lets you translate addresses in that window by a fixed offset, say off (note that off can be either positive or negative). So if a driver accesses the address ba_mem+x (0 <= x < sz_mem), the host-PCI controller intercepts the access and translates it into a PCI memory access at address ba_mem+off+x.

To maintain a 1:1 mapping, we must set up the PCI addressing such that off is 0. Also note that with this setup, we cannot access PCI memory addresses below ba_mem or at or above ba_mem + sz_mem.
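The window translation just described fits in a few lines. The sketch below reuses the text's symbols (ba_mem, sz_mem, off) as struct fields; struct mem_window and cpu_to_pci_mem() are hypothetical and only model what the controller hardware does. With off set to 0 the PCI address equals the CPU physical address, which is the 1:1 mapping that readl()/writel() rely on.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical description of one host-PCI memory window. */
struct mem_window {
    uint32_t ba_mem;   /* base of the window in CPU physical space */
    uint32_t sz_mem;   /* size of the window in bytes */
    int64_t  off;      /* offset applied when forwarding to PCI (may be negative) */
};

/*
 * Translate a CPU physical address into the PCI memory address the
 * controller would emit (ba_mem + off + x). Returns false if the
 * address is outside the window and thus never reaches the PCI bus.
 */
static bool cpu_to_pci_mem(const struct mem_window *w, uint32_t cpu, uint32_t *pci)
{
    if (cpu < w->ba_mem || cpu - w->ba_mem >= w->sz_mem)
        return false;
    *pci = (uint32_t)((int64_t)cpu + w->off);
    return true;
}
```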

Additionally, we must make system RAM visible on the PCI memory bus at address 0x0 (assuming system RAM also starts at 0x0 in physical address space) so that PCI devices can do DMA transfers.

The beginning part of PCI IO space is usually mapped into another window in physical address space, say [ba_io, ba_io + sz_io]. In other words, a range [0, sz_io] in PCI IO space corresponds to the range [ba_io, ba_io + sz_io]. Obviously, mips_io_port_base should be set to ba_io.

The above setup is typically done in the board-specific setup routine (i.e., <board>_setup()). There you typically also set up ioport_resource and iomem_resource:

           ioport_resource.start = 0x0;
           ioport_resource.end   = sz_io - 1;            /* end is inclusive */
           iomem_resource.start  = 0x0;
           iomem_resource.end    = ba_mem + sz_mem - 1;  /* must cover the PCI memory window */

These variables are the roots of all IO and memory resources (roughly corresponding to the ancient ISA IO space and ISA memory space). For simplicity you can also set the end to be 0xffffffff.
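Every PCI window registered later has to fit inside these roots, so the arithmetic is worth checking. Keep in mind that struct resource uses an inclusive end, which is why an IO window of sz_io bytes ends at sz_io - 1. The sketch mirrors only the start/end pair; struct range_res and fits_in() are hypothetical names, not the kernel's request_resource().

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical mirror of the start/end pair in struct resource (end is inclusive). */
struct range_res {
    uint32_t start;
    uint32_t end;
};

/* True if 'child' is a well-formed range lying completely inside 'parent'. */
static bool fits_in(const struct range_res *child, const struct range_res *parent)
{
    return child->start <= child->end &&
           child->start >= parent->start &&
           child->end <= parent->end;
}
```

With sz_io = 0x10000, an IO window of 0x1000-0xffff fits under the root, while one ending at 0x10000 is one byte too long.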

Board-specific functions and variables

Here is a list of board-specific functions you must implement in 2.4. Again, I assume this board has CONFIG_NEW_PCI and CONFIG_PCI_AUTO options enabled.

Accessing the PCI configspace: struct pci_ops

You implement six functions (byte, word and dword variants of read and write) to fill in this structure, which is needed by pci_scan_bus() and pciauto_assign_resources(). Note that you need to distinguish Type 0 and Type 1 configuration accesses in those functions. You can typically do that by checking whether the bus's parent (dev->bus->parent) is NULL.
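For Type 1 cycles, the PCI specification fixes the address layout (bus, device, function and register number, with AD[1:0] = 01); Type 0 cycles instead select the target through its IDSEL line, which each board wires differently. The sketch builds both kinds of address. The IDSEL wiring (device n on AD[11+n]) and the name make_config_addr() are assumptions for illustration only, so check your host controller's manual for the format it actually expects.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Hypothetical helper: build the PCI address for a configuration cycle.
 * 'top_level' is true when the device sits on a bus without a parent
 * (dev->bus->parent == NULL), i.e. directly behind the host bridge.
 */
static bool make_config_addr(bool top_level, unsigned bus, unsigned dev,
                             unsigned fn, unsigned reg, uint32_t *addr)
{
    if (dev > 31 || fn > 7 || reg > 0xff)
        return false;

    if (top_level) {
        /* Type 0: ASSUMED wiring puts device n's IDSEL on AD[11 + n]. */
        if (dev > 20)
            return false;                /* no IDSEL line left for it */
        *addr = (1u << (11 + dev)) | (fn << 8) | (reg & 0xfc);
    } else {
        /* Type 1: AD[23:16] bus, AD[15:11] device, AD[10:8] function,
           AD[7:2] register, AD[1:0] = 01. */
        if (bus > 255)
            return false;
        *addr = (bus << 16) | (dev << 11) | (fn << 8) | (reg & 0xfc) | 0x1u;
    }
    return true;
}
```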

In 2.6, struct pci_ops was changed; it now looks like this:

  struct pci_ops {
          int (*read)(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 *val);
          int (*write)(struct pci_bus *bus, unsigned int devfn, int where, int size, u32 val);
  };

So it now takes a pci_bus instead of a pci_dev as an argument, which makes a little more sense. The three read functions and the three write functions were each merged into one, which made the additional size argument necessary. Size may take the values 1, 2 or 4, and you don't need to deal with any other sizes.
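Many host controllers can only perform 32-bit configuration accesses, so a 2.6-style read() hook typically fetches the aligned dword and extracts the requested bytes. The sketch does that against an array standing in for one device's 256-byte config space; cfg_read(), cfg_space and the CFG_* codes are hypothetical stand-ins for a real read hook and the PCIBIOS_* return codes.

```c
#include <stdint.h>

#define CFG_OK        0    /* hypothetical stand-ins for PCIBIOS_* codes */
#define CFG_BAD_REG  (-1)

/* One device's config space as 64 dwords; a real hook would access hardware. */
static uint32_t cfg_space[64];

static int cfg_read(int where, int size, uint32_t *val)
{
    uint32_t dword;
    int shift;

    if (size != 1 && size != 2 && size != 4)
        return CFG_BAD_REG;
    if (where < 0 || where > 255 || (where & (size - 1)))
        return CFG_BAD_REG;               /* out of range or unaligned */

    dword = cfg_space[where >> 2];        /* 32-bit access at the aligned address */
    shift = (where & 3) * 8;              /* config space is little-endian */

    if (size == 4)
        *val = dword;
    else
        *val = (dword >> shift) & (size == 1 ? 0xffu : 0xffffu);
    return CFG_OK;
}
```

With vendor ID 0x1234 and device ID 0x5678 in the first dword, a 2-byte read at offset 2 returns 0x5678.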


Another 2.4-ism. You need to define this array in order to use pciauto resource assignment. For each top-level PCI bus, you need to supply an element of this array, and the array ends with an all-NULL element. Each element is a structure consisting of a pci_ops, a pci_io_resource, and a pci_mem_resource, and usually represents a top-level PCI bus connected to the CPU. pci_ops defines the functions that access the PCI bus's config space. pci_io_resource and pci_mem_resource specify the address ranges that pciauto will use when assigning the BARs of PCI devices. pci_io_resource starts at 0x0 (or 0x1000 to leave some room for legacy ISA devices) and ends at sz_io. pci_mem_resource starts at ba_mem and ends at ba_mem + sz_mem. Note that these addresses are in PCI IO and PCI memory space respectively. However, since we maintain a 1:1 mapping between PCI memory space and CPU physical address space, pci_mem_resource also represents the PCI memory space window in CPU physical address space.
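The shape of such a table, ended by an all-zero element, can be modelled in plain C. The struct below only mirrors the three members described above, using hypothetical types and names, so it is not the kernel's definition; count_channels() shows how code consuming the table stops at the terminator. The example assumes sz_io = 0x10000 and a PCI memory window at 0x10000000-0x13ffffff.

```c
#include <stddef.h>
#include <stdint.h>

struct model_range { uint32_t start, end; };

/* Hypothetical mirror of one element: config accessors plus two windows. */
struct model_channel {
    const void *pci_ops;                 /* NULL marks the end of the table */
    struct model_range io, mem;
};

static const int board_ops = 0;          /* placeholder for a board's pci_ops */

static const struct model_channel channels[] = {
    /* IO starts at 0x1000 to leave room for legacy ISA devices. */
    { &board_ops, { 0x1000, 0xffff }, { 0x10000000, 0x13ffffff } },
    { NULL, { 0, 0 }, { 0, 0 } }         /* all-NULL terminator */
};

/* Count the top-level buses described by the table. */
static size_t count_channels(const struct model_channel *c)
{
    size_t n = 0;
    while (c[n].pci_ops != NULL)
        n++;
    return n;
}
```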

In 2.6, PCI busses are not configured through a static array but registered through register_pci_controller(), which takes a struct pci_controller argument defined as:

  struct pci_controller {
          struct pci_controller *next;
          struct pci_bus *bus;

          struct pci_ops *pci_ops;
          struct resource *mem_resource;
          unsigned long mem_offset;
          struct resource *io_resource;
          unsigned long io_offset;

          unsigned int index;
          /* For compatibility with current (as of July 2003) pciutils
             and XFree86. Eventually will be removed. */
          unsigned int need_domain_info;

          int iommu;
  };

The next field is used internally; don't touch it and don't look at it. After the PCI bus has been scanned, bus contains a pointer to the root bus of the PCI bus hierarchy behind this PCI controller. The pci_ops field contains a struct pci_ops pointer; we've just discussed that structure above.


This routine is passed to pci_scan_bus() and invoked right after a PCI device is discovered. This is the place where you can do some device-specific fixups (BARs, pci_dev structure, etc). Note that you can do the same fixups in pcibios_fixup(). I recommend leaving this function empty unless you have a specific need that requires an immediate fixup.


This function is invoked after pci_scan_bus() is done, i.e., after all PCI bridges and devices have been discovered by Linux. Here you can enumerate the PCI devices and do device-based fixups, or you can do bus- or controller-related fixups.


This is the place to fix up PCI-related IRQs. It is invoked after pci_scan_bus() is done, right after pcibios_fixup(). A typical strategy is to assign IRQs based on the slot number, and possibly the bus number if there is more than one top-level bus in the system. Note that you can do this fixup in pcibios_fixup() as well and leave this function empty.
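A slot-based IRQ table combined with the standard PCI-PCI bridge swizzle covers many boards: for a device behind a bridge, the interrupt pin is rotated by the device number at each bridge on the way up, until the slot on the top-level bus is reached. The table contents, the 0-3 slot range and the function names below are hypothetical; substitute the wiring from your board's schematic.

```c
/* Interrupt pins as in the config header: 1 = INTA ... 4 = INTD. */

/* Standard bridge swizzle: the pin seen on the parent bus for a device in 'slot'. */
static int swizzle_pin(int slot, int pin)
{
    return ((pin - 1 + slot) % 4) + 1;
}

/* Hypothetical board wiring: CPU IRQ for each top-level slot (0-3) and pin. */
static const int slot_irq[4][4] = {
    { 10, 11, 12, 13 },
    { 11, 12, 13, 10 },
    { 12, 13, 10, 11 },
    { 13, 10, 11, 12 },
};

/*
 * path[] lists device numbers from the device itself up to (and including)
 * its top-level slot; depth is the number of entries. Returns -1 on error.
 */
static int map_irq(const int *path, int depth, int pin)
{
    int i, slot;

    if (pin < 1 || pin > 4 || depth < 1)
        return -1;
    for (i = 0; i < depth - 1; i++)      /* cross each bridge on the way up */
        pin = swizzle_pin(path[i], pin);
    slot = path[depth - 1];
    if (slot < 0 || slot > 3)
        return -1;
    return slot_irq[slot][pin - 1];
}
```

A device 2 using INTA behind a bridge in slot 3 appears as INTC at slot 3, so it gets slot_irq[3][2].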


Return 1 to indicate that all bus numbers have been assigned by pciauto. [TODO: Should this function be included in pciauto by default?] [TODO: the drivers/pci/pci.c file seems to have a typo where it calls this function; the logic is inverted.]

PCI fixups

For PCI devices with less severe problems, Linux offers a mechanism called fixups. Fixups are meant to deal with issues like broken PCI config headers and similar. There are three types of fixups.

Header fixups

Header fixups are called after the PCI code has read the headers from the PCI device's config space. This type of fixup can be used, for example, when headers don't identify the device class or if a device is likely to be badly misconfigured after reset. Be careful not to touch anything other than the config space at this time; the PCI bus is still being configured!

Final fixups

These are called just before device drivers are initialized. So why not simply put these fixups into the driver code, you ask? Remember that drivers in Linux can be modules, and even when they're compiled in they're treated pretty much the same as a module. So a final fixup is, for example, a good method to ensure a device is a good citizen on the bus.

Enable fixups

These are called by the PCI code when a PCI device is enabled by calling pci_enable_device().

drivers/pci/quirks.c contains a large number of examples for the use of PCI fixups. This file only contains fixups for devices that are being used on multiple architectures. If a device is only in use on a single architecture or even on a single board then other places outside drivers/pci might be preferable.

Fixup differences in 2.4 and 2.6

In Linux 2.4, fixups are listed in two arrays. The first is defined by the architecture; it's an array named pcibios_fixups. The MIPS code actually leaves this to individual platforms, so grepping in arch/mips will find many definitions. Each fixup is a structure like

  struct pci_fixup {
          int pass;
          u16 vendor, device;
          void (*hook)(struct pci_dev *dev);
  };

Pass can have two possible values, PCI_FIXUP_HEADER or PCI_FIXUP_FINAL, which also means there are no enable fixups in 2.4. The vendor and device fields contain a PCI vendor ID and device ID respectively, as defined in include/linux/pci_ids.h. Where matching any vendor or device is necessary, PCI_ANY_ID can be used.
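Applying such a table amounts to a linear scan: run every hook whose pass matches and whose vendor and device IDs match either exactly or through the wildcard. The sketch below is a userspace model with its own copies of the structure and constants (the model_/MODEL_ prefix marks them as hypothetical), not the kernel's code; it uses an entry with pass 0 as the table terminator.

```c
#include <stdint.h>
#include <stddef.h>

#define MODEL_FIXUP_HEADER 1
#define MODEL_FIXUP_FINAL  2
#define MODEL_ANY_ID       0xffffu       /* plays the role of PCI_ANY_ID */

struct model_dev { uint16_t vendor, device; int fixed; };

struct model_fixup {
    int pass;                            /* 0 terminates the table */
    uint16_t vendor, device;
    void (*hook)(struct model_dev *dev);
};

static void mark_fixed(struct model_dev *dev) { dev->fixed++; }

static const struct model_fixup board_fixups[] = {
    { MODEL_FIXUP_HEADER, 0x1234, 0x5678,       mark_fixed },
    { MODEL_FIXUP_FINAL,  0x1234, MODEL_ANY_ID, mark_fixed },
    { 0, 0, 0, NULL }
};

/* Run every hook registered for 'pass' whose IDs match the device. */
static void do_fixups(const struct model_fixup *f, int pass, struct model_dev *dev)
{
    for (; f->pass; f++)
        if (f->pass == pass &&
            (f->vendor == dev->vendor || f->vendor == MODEL_ANY_ID) &&
            (f->device == dev->device || f->device == MODEL_ANY_ID))
            f->hook(dev);
}
```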

After that the kernel performs the "generic" fixups, that is, those for devices used on several architectures; these fixups are defined in the pci_fixups array in drivers/pci/quirks.c.

Linux 2.6 removes the restriction to just two arrays. The approach was fine initially, but with the huge number of PCI fixups that had become necessary over the years, having everything in just two arrays, and thus effectively two files, had become too inflexible. The new approach is based on the three macros DECLARE_PCI_FIXUP_HEADER, DECLARE_PCI_FIXUP_FINAL and DECLARE_PCI_FIXUP_ENABLE. Each takes the same three arguments as in 2.4 (vendor, device and hook; the pass is implied by the macro name), so conversion is trivial. The big improvement is that these macros are allowed anywhere in the kernel proper; they still cannot be used in modules, though.

Very broken PCI devices ...

Some ill-behaved PCI devices may spoil the party. The best way to deal with them is to single them out in the PCI configuration routines (in pci_ops). Inside those routines you check for the slot and function numbers corresponding to the bad devices and return values that make the device look absent, such as all ones (0xffffffff) on reads, which is what an empty slot returns. Very few devices are broken to the point where this is needed; the SGI IOC3 is one, because it doesn't fully decode the PCI configspace, in violation of the PCI spec.
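A minimal sketch of that idea: the config read routine consults a list of known-bad bus/slot/function triples first and, for those, returns all ones, so the vendor ID reads as 0xffff and the bus scan treats the slot as empty. The bad_devs list, the raw_config_read() stub and the other names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct bad_dev { unsigned bus, slot, fn; };

/* Hypothetical list of devices that misbehave on config cycles. */
static const struct bad_dev bad_devs[] = {
    { 0, 3, 0 },
};

/* Stub for the real hardware access; pretends every device is 0x1234:0x5678. */
static uint32_t raw_config_read(unsigned bus, unsigned slot, unsigned fn, int where)
{
    (void)bus; (void)slot; (void)fn; (void)where;
    return 0x56781234u;
}

static bool is_bad(unsigned bus, unsigned slot, unsigned fn)
{
    for (size_t i = 0; i < sizeof(bad_devs) / sizeof(bad_devs[0]); i++)
        if (bad_devs[i].bus == bus && bad_devs[i].slot == slot &&
            bad_devs[i].fn == fn)
            return true;
    return false;
}

/* Config read that makes blacklisted devices look like empty slots. */
static uint32_t config_read_dword(unsigned bus, unsigned slot, unsigned fn, int where)
{
    if (is_bad(bus, slot, fn))
        return 0xffffffffu;              /* what a read from an empty slot yields */
    return raw_config_read(bus, slot, fn, where);
}
```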

Legacy devices

If you have multiple primary PCI busses make sure to have the PCI bus with the legacy devices as the first bus. Linux isn't very good at dealing with legacy devices on multiple PCI busses, to say the least.

Multiple PCI busses

If you have multiple top-level PCI buses, it is tricky to do PCI IO assignment. Assume you have two PCI buses and their IO spaces are mapped into [ba_io1, ba_io1 + sz_io1] and [ba_io2, ba_io2+sz_io2] (ba_io1 + sz_io1 <= ba_io2). You need to

  • Set mips_io_port_base to be ba_io1
  • Set pci_io_resource to be [0x0, sz_io1] for the first PCI bus
  • Set pci_io_resource to be [ba_io2 - ba_io1, ba_io2 - ba_io1 + sz_io2] for the 2nd PCI bus. In addition, the legacy ISA devices on the second PCI bus cannot be used without modifying their drivers.
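The arithmetic is easier to follow with numbers. The sketch assumes, purely for illustration, two 64 KB IO windows at physical 0x18000000 and 0x18100000, computes each bus's pci_io_resource range as Linux port numbers (with an inclusive end, as struct resource uses), and checks that a port from the second range lands in the second window once mips_io_port_base (= ba_io1) is added. It also shows the legacy problem: a fixed ISA port such as 0x3f8 always resolves into the first window.

```c
#include <stdint.h>

/* Hypothetical layout: two host bridges, each with a 64 KB IO window. */
#define BA_IO1  0x18000000u
#define SZ_IO1  0x00010000u
#define BA_IO2  0x18100000u
#define SZ_IO2  0x00010000u

struct io_range { uint32_t start, end; };   /* end is inclusive */

/* pci_io_resource for each bus, expressed as Linux port numbers. */
static const struct io_range bus1_io = { 0x0, SZ_IO1 - 1 };
static const struct io_range bus2_io = { BA_IO2 - BA_IO1,
                                         BA_IO2 - BA_IO1 + SZ_IO2 - 1 };

/* mips_io_port_base is ba_io1, so every port is an offset from it. */
static uint32_t port_to_phys(uint32_t port)
{
    return BA_IO1 + port;
}
```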

Next page: Debugging

External links

  • http://www.pcisig.com/ Homepage of the PCI Special Interest Group
  • ISBN 0321168453 HyperTransport System Architecture
  • ISBN 0201726823 PCI-X System Architecture
  • ISBN 0321156307 PCI Express System Architecture
  • Documentation/mips/pci/pci.README has some information about setting up PCI