The PCI Express Advanced Error Reporting Driver Guide HOWTO
        T. Long Nguyen <tom.l.nguyen@intel.com>
        Yanmin Zhang <yanmin.zhang@intel.com>
        07/29/2006


1. Overview

1.1 About this guide

This guide describes the basics of the PCI Express Advanced Error
Reporting (AER) driver and provides information on how to use it, as
well as how to enable the drivers of endpoint devices to conform with
the PCI Express AER driver.

1.2 Copyright (C) Intel Corporation 2006.

1.3 What is the PCI Express AER Driver?

PCI Express error signaling can occur on the PCI Express link itself
or on behalf of transactions initiated on the link. PCI Express
defines two error reporting paradigms: the baseline capability and
the Advanced Error Reporting capability. The baseline capability is
required of all PCI Express components and provides a minimum defined
set of error reporting requirements. The Advanced Error Reporting
capability is implemented with a PCI Express advanced error reporting
extended capability structure and provides more robust error reporting.

The PCI Express AER driver provides the infrastructure to support the
PCI Express Advanced Error Reporting capability. It provides three
basic functions:

- Gathers comprehensive error information when errors occur.
- Reports errors to the user.
- Performs error recovery actions.

The AER driver only attaches to root ports which support the
PCI Express AER capability.


2. User Guide

2.1 Include the PCI Express AER Root Driver into the Linux Kernel

The PCI Express AER Root driver is a Root Port service driver attached
to the PCI Express Port Bus driver. To use it, the driver must be
compiled. Option CONFIG_PCIEAER supports this capability. It depends
on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and
CONFIG_PCIEAER=y.

2.2 Load PCI Express AER Root Driver

Some systems have AER support in firmware.
Enabling Linux AER support at
the same time the firmware handles AER may result in unpredictable
behavior. Therefore, Linux does not handle AER events unless the firmware
grants AER control to the OS via the ACPI _OSC method. See the PCI FW 3.0
Specification for details regarding _OSC usage.

2.3 AER error output

When a PCIe AER error is captured, an error message is output to the
console. If it is a correctable error, it is output as a warning.
Otherwise, it is printed as an error, so users can choose a different
log level to filter out correctable error messages.

Below is an example:
0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
0000:50:00.0: device [8086:0329] error status/mask=00100000/00000000
0000:50:00.0: [20] Unsupported Request (First)
0000:50:00.0: TLP Header: 04000001 00200a03 05010000 00050100

In the example, 'Requester ID' means the ID of the device that sent
the error message to the root port. Please refer to the PCI Express
specifications for the other fields.


3. Developer Guide

Enabling AER-aware support requires a software driver to configure
the AER capability structure within its device and to provide callbacks.

To support AER well, developers first need to understand how AER
works.

PCI Express errors are classified into two types: correctable errors
and uncorrectable errors. This classification is based on the impact
of those errors, which may result in degraded performance or function
failure.

Correctable errors have no impact on the functionality of the
interface. The PCI Express protocol can recover without any software
intervention or any loss of data. These errors are detected and
corrected by hardware. Unlike correctable errors, uncorrectable
errors impact the functionality of the interface. Uncorrectable errors
can cause a particular transaction or a particular PCI Express link
to be unreliable.
Depending on those error conditions, uncorrectable
errors are further classified into non-fatal errors and fatal errors.
Non-fatal errors cause the particular transaction to be unreliable,
but the PCI Express link itself is fully functional. Fatal errors, on
the other hand, cause the link to be unreliable.

When AER is enabled, a PCI Express device automatically sends an
error message to the PCIe root port above it when the device captures
an error. The Root Port, upon receiving an error reporting message,
internally processes and logs the error message in its PCI Express
capability structure. Logging the error includes storing the error
reporting agent's requester ID into the Error Source Identification
Registers and setting the error bits of the Root Error Status Register
accordingly. If AER error reporting is enabled in the Root Error
Command Register, the Root Port generates an interrupt when an error
is detected.

Note that the errors described above are related to the PCI Express
hierarchy and links. These errors do not include any device-specific
errors, because device-specific errors are still sent directly to
the device driver.

3.1 Configure the AER capability structure

AER-aware drivers of PCI Express components need to change the device
control registers to enable AER. They can also change AER registers,
including the mask and severity registers. The helper function
pci_enable_pcie_error_reporting can be used to enable AER. See
section 3.3.

3.2 Provide callbacks

3.2.1 callback reset_link to reset PCI Express link

This callback is used to reset the PCI Express physical link when a
fatal error happens. The root port AER service driver provides a
default reset_link function, but different upstream ports might
have different specifications for resetting the PCI Express link, so
all upstream ports should provide their own reset_link functions.
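
As a rough illustration of the callback described above, the following
userspace sketch models a port service driver that supplies its own
reset_link. It is not kernel code: the types are reduced stand-ins for
the kernel's, and my_upstream_reset_link, my_upstream_aer_service,
call_reset_link and the link_up field are hypothetical names used only
for this example. Treating a missing callback as a failed recovery is
an assumption of the sketch.

```c
#include <stddef.h>

/* Userspace stand-ins for kernel types; numeric values are not meaningful. */
typedef enum {
	PCI_ERS_RESULT_RECOVERED,
	PCI_ERS_RESULT_DISCONNECT
} pci_ers_result_t;

struct pci_dev {
	int link_up;	/* hypothetical flag standing in for the link state */
};

/* Reduced model of the service driver structure: only reset_link is kept. */
struct pcie_port_service_driver {
	const char *name;
	pci_ers_result_t (*reset_link)(struct pci_dev *dev);
};

/* Hypothetical port-specific reset; a real one would program hardware. */
pci_ers_result_t my_upstream_reset_link(struct pci_dev *dev)
{
	dev->link_up = 1;
	return PCI_ERS_RESULT_RECOVERED;
}

struct pcie_port_service_driver my_upstream_aer_service = {
	.name = "my_upstream_aer",
	.reset_link = my_upstream_reset_link,
};

/* Assumption: a missing callback is modeled as a failed recovery. */
pci_ers_result_t call_reset_link(const struct pcie_port_service_driver *drv,
				 struct pci_dev *dev)
{
	if (drv == NULL || drv->reset_link == NULL)
		return PCI_ERS_RESULT_DISCONNECT;
	return drv->reset_link(dev);
}
```

In this model, calling call_reset_link with my_upstream_aer_service
runs the port's own reset routine, while a service driver whose
reset_link pointer is NULL yields PCI_ERS_RESULT_DISCONNECT.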

In struct pcie_port_service_driver, a new pointer, reset_link, is
added.

pci_ers_result_t (*reset_link) (struct pci_dev *dev);

Section 3.2.2.2 provides more detailed information on when reset_link
is called.

3.2.2 PCI error-recovery callbacks

The PCI Express AER Root driver uses error callbacks to coordinate
with the downstream device drivers associated with the hierarchy in
question when performing error recovery actions.

The pci_driver data structure has a pointer, err_handler, which points
to a pci_error_handlers structure consisting of several callback
function pointers. The AER driver follows the rules defined in
pci-error-recovery.txt except for the PCI Express specific parts (e.g.
reset_link). Please refer to pci-error-recovery.txt for detailed
definitions of the callbacks.

The sections below specify when the error callback functions are called.

3.2.2.1 Correctable errors

Correctable errors have no impact on the functionality of
the interface. The PCI Express protocol can recover without any
software intervention or any loss of data. These errors do not
require any recovery actions. The AER driver clears the device's
correctable error status register accordingly and logs these errors.

3.2.2.2 Non-correctable (non-fatal and fatal) errors

If an error message indicates a non-fatal error, performing a link
reset upstream is not required. The AER driver calls error_detected(dev,
pci_channel_io_normal) on all drivers associated with the hierarchy in
question. For example:
EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
If Upstream port A captures an AER error, the hierarchy consists of
Downstream port B and EndPoint.

A driver may return PCI_ERS_RESULT_CAN_RECOVER,
PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
whether it can recover. When the drivers report
PCI_ERS_RESULT_CAN_RECOVER, the AER driver calls mmio_enabled next.
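
The driver side of the non-fatal case above can be sketched as follows.
This is a userspace model, not kernel code: the enum and struct
definitions are simplified stand-ins for the kernel's, and
demo_error_detected, demo_err_handlers, demo_driver and the
hw_responding field are hypothetical. The policy shown (which result
to return in which situation) is only one plausible choice, not a rule
taken from this guide.

```c
#include <stddef.h>

/* Userspace stand-ins for kernel definitions; values are not meaningful. */
enum pci_channel_state {
	pci_channel_io_normal,
	pci_channel_io_frozen
};

typedef enum {
	PCI_ERS_RESULT_CAN_RECOVER,
	PCI_ERS_RESULT_NEED_RESET,
	PCI_ERS_RESULT_DISCONNECT
} pci_ers_result_t;

struct pci_dev {
	int hw_responding;	/* hypothetical: 1 if the device still answers */
};

/* Reduced model: only the error_detected callback is kept. */
struct pci_error_handlers {
	pci_ers_result_t (*error_detected)(struct pci_dev *dev,
					   enum pci_channel_state state);
};

struct pci_driver {
	const char *name;
	const struct pci_error_handlers *err_handler;
};

/* Hypothetical policy for a driver's error_detected callback. */
pci_ers_result_t demo_error_detected(struct pci_dev *dev,
				     enum pci_channel_state state)
{
	if (!dev->hw_responding)
		return PCI_ERS_RESULT_DISCONNECT;	/* give up on the device */
	if (state == pci_channel_io_normal)
		return PCI_ERS_RESULT_CAN_RECOVER;	/* non-fatal: no reset */
	return PCI_ERS_RESULT_NEED_RESET;		/* any other state */
}

const struct pci_error_handlers demo_err_handlers = {
	.error_detected = demo_error_detected,
};

struct pci_driver demo_driver = {
	.name = "demo",
	.err_handler = &demo_err_handlers,
};
```

The caller reaches the callback through
demo_driver.err_handler->error_detected, which mirrors how err_handler
is described in section 3.2.2.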

If an error message indicates a fatal error, the kernel broadcasts
error_detected(dev, pci_channel_io_frozen) to all drivers within
the hierarchy in question. A link reset upstream is then necessary.
As different kinds of devices might use different approaches to reset
the link, the AER port service driver is required to provide the
function to reset the link. First, the kernel checks whether the
upstream component has an AER driver. If it does, the kernel uses the
reset_link callback of that driver. If the upstream component has no
AER driver and the port is a downstream port, the kernel performs a hot
reset by default, setting the Secondary Bus Reset bit of the Bridge
Control register associated with the downstream port. Upstream ports
should provide their own AER service drivers with a reset_link
function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and
reset_link returns PCI_ERS_RESULT_RECOVERED, error handling proceeds
to mmio_enabled.

3.3 helper functions

3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
pci_enable_pcie_error_reporting enables the device to send error
messages to the root port when an error is detected. Note that devices
do not enable error reporting by default, so device drivers need to
call this function to enable it.

3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
pci_disable_pcie_error_reporting stops the device from sending error
messages to the root port when an error is detected.

3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable
error status register.

3.4 Frequently Asked Questions

Q: What happens if a PCI Express device driver does not provide an
error recovery handler (pci_driver->err_handler is equal to NULL)?

A: The devices attached to the driver won't be recovered.
If the
error is fatal, the kernel will print out warning messages. Please
refer to section 3 for more information.

Q: What happens if an upstream port service driver does not provide
the reset_link callback?

A: Fatal error recovery will fail if the errors are reported by the
upstream ports to which the service driver is attached.

Q: How does this infrastructure deal with a driver that is not PCI
Express aware?

A: This infrastructure calls the error callback functions of the
driver when an error happens. But if the driver is not aware of
PCI Express, the device might not report its own errors to the root
port.

Q: What modifications will that driver need to make it compatible
with the PCI Express AER Root driver?

A: It can call the helper functions to enable AER in devices and
clean up the uncorrectable status register. Please refer to section 3.3.


4. Software error injection

Debugging PCIe AER error recovery code is quite difficult because it
is hard to trigger real hardware errors. Software-based error
injection can be used to fake various kinds of PCIe errors.

First, enable PCIe AER software error injection in the kernel
configuration; that is, the following item should be in your .config.

CONFIG_PCIEAER_INJECT=y or CONFIG_PCIEAER_INJECT=m

After rebooting with the new kernel or inserting the module, a device
file named /dev/aer_inject should be created.

Then, you need a user space tool named aer-inject, which can be
obtained from:
    http://www.kernel.org/pub/linux/utils/pci/aer-inject/

More information about aer-inject can be found in the documentation
that comes with its source code.
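
Before running aer-inject, it can help to confirm that the device file
mentioned above is actually present. The following is a minimal
userspace sketch of such a check. The helper names are hypothetical,
and the sketch assumes that /dev/aer_inject appears as a character
device node.

```c
#include <sys/stat.h>

/* Returns 1 if path exists and is a character device, 0 otherwise. */
int is_char_device(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return 0;	/* missing or inaccessible */
	return S_ISCHR(st.st_mode) ? 1 : 0;
}

/* Path taken from the text above; the helper name is hypothetical. */
int aer_inject_node_present(void)
{
	return is_char_device("/dev/aer_inject");
}
```

If aer_inject_node_present() returns 0, the injection module is
probably not loaded or CONFIG_PCIEAER_INJECT was not enabled.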