PCI Power Management

Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.

An overview of concepts and the Linux kernel's interfaces related to PCI power
management. Based on previous work by Patrick Mochel <mochel@transmeta.com>
(and others).

This document only covers the aspects of power management specific to PCI
devices. For a general description of the kernel's interfaces related to device
power management refer to Documentation/power/devices.txt and
Documentation/power/runtime_pm.txt.

---------------------------------------------------------------------------

1. Hardware and Platform Support for PCI Power Management
2. PCI Subsystem and Device Power Management
3. PCI Device Drivers and Power Management
4. Resources


1. Hardware and Platform Support for PCI Power Management
=========================================================

1.1. Native and Platform-Based Power Management
-----------------------------------------------
In general, power management is a feature allowing one to save energy by putting
devices into states in which they draw less power (low-power states) at the
price of reduced functionality or performance.

Usually, a device is put into a low-power state when it is underutilized or
completely inactive. However, when it is necessary to use the device once
again, it has to be put back into the "fully functional" state (full-power
state). This may happen when there are some data for the device to handle or
as a result of an external event requiring the device to be active, which may
be signaled by the device itself.

PCI devices may be put into low-power states in two ways, by using the device
capabilities introduced by the PCI Bus Power Management Interface Specification,
or with the help of platform firmware, such as an ACPI BIOS.
In the first approach, which is referred to as the native PCI power management
(native PCI PM) in what follows, the device power state is changed as a result
of writing a specific value into one of its standard configuration registers.
The second approach requires the platform firmware to provide special methods
that may be used by the kernel to change the device's power state.

Devices supporting the native PCI PM usually can generate wakeup signals called
Power Management Events (PMEs) to let the kernel know about external events
requiring the device to be active. After receiving a PME the kernel is supposed
to put the device that sent it into the full-power state. However, the PCI Bus
Power Management Interface Specification doesn't define any standard method of
delivering the PME from the device to the CPU and the operating system kernel.
It is assumed that the platform firmware will perform this task and therefore,
even though a PCI device is set up to generate PMEs, it also may be necessary to
prepare the platform firmware for notifying the CPU of the PMEs coming from the
device (e.g. by generating interrupts).

In turn, if the methods provided by the platform firmware are used for changing
the power state of a device, usually the platform also provides a method for
preparing the device to generate wakeup signals. In that case, however, it
often also is necessary to prepare the device for generating PMEs using the
native PCI PM mechanism, because the method provided by the platform depends on
that.

Thus in many situations both the native and the platform-based power management
mechanisms have to be used simultaneously to obtain the desired result.

1.2. Native PCI Power Management
--------------------------------
The PCI Bus Power Management Interface Specification (PCI PM Spec) was
introduced between the PCI 2.1 and PCI 2.2 Specifications.
It defined a standard interface for performing various operations related to
power management.

The implementation of the PCI PM Spec is optional for conventional PCI devices,
but it is mandatory for PCI Express devices. If a device supports the PCI PM
Spec, it has an 8-byte power management capability field in its PCI
configuration space. This field is used to describe and control the standard
features related to the native PCI power management.

The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses
(B0-B3). The higher the number, the less power is drawn by the device or bus
in that state. However, the higher the number, the longer the latency for
the device or bus to return to the full-power state (D0 or B0, respectively).

There are two variants of the D3 state defined by the specification. The first
one is D3hot, referred to as the software accessible D3, because devices can be
programmed to go into it. The second one, D3cold, is the state that PCI devices
are in when the supply voltage (Vcc) is removed from them. It is not possible
to program a PCI device to go into D3cold, although there may be a programmable
interface for putting the bus the device is on into a state in which Vcc is
removed from all devices on the bus.

PCI bus power management, however, is not supported by the Linux kernel at the
time of this writing and therefore it is not covered by this document.

Note that every PCI device can be in the full-power state (D0) or in D3cold,
regardless of whether or not it implements the PCI PM Spec. In addition to
that, if the PCI PM Spec is implemented by the device, it must support D3hot
as well as D0. The support for the D1 and D2 power states is optional.

PCI devices supporting the PCI PM Spec can be programmed to go to any of the
supported low-power states (except for D3cold).
While in D1-D3hot the standard configuration registers of the device must be
accessible to software (i.e. the device is required to respond to PCI
configuration accesses), although its I/O and memory spaces are then disabled.
This allows the device to be programmatically put into D0. Thus the kernel can
switch the device back and forth between D0 and the supported low-power states
(except for D3cold) and the possible power state transitions the device can
undergo are the following:

+----------------------------+
| Current State | New State  |
+----------------------------+
| D0            | D1, D2, D3 |
+----------------------------+
| D1            | D2, D3     |
+----------------------------+
| D2            | D3         |
+----------------------------+
| D1, D2, D3    | D0         |
+----------------------------+

The transition from D3cold to D0 occurs when the supply voltage is provided to
the device (i.e. power is restored). In that case the device returns to D0 with
a full power-on reset sequence and the power-on defaults are restored to the
device by hardware just as at initial power up.

PCI devices supporting the PCI PM Spec can be programmed to generate PMEs
while in a low-power state (D1-D3), but they are not required to be capable
of generating PMEs from all supported low-power states. In particular, the
capability of generating PMEs from D3cold is optional and depends on the
presence of additional voltage (3.3Vaux) allowing the device to remain
sufficiently active to generate a wakeup signal.

1.3. ACPI Device Power Management
---------------------------------
The platform firmware support for the power management of PCI devices is
system-specific.
However, if the system in question is compliant with the Advanced Configuration
and Power Interface (ACPI) Specification, like the majority of x86-based
systems, it is supposed to implement device power management interfaces defined
by the ACPI standard.

For this purpose the ACPI BIOS provides special functions called "control
methods" that may be executed by the kernel to perform specific tasks, such as
putting a device into a low-power state. These control methods are encoded
using a special byte-code language called the ACPI Machine Language (AML) and
stored in the machine's BIOS. The kernel loads them from the BIOS and executes
them as needed using an AML interpreter that translates the AML byte code into
computations and memory or I/O space accesses. This way, in theory, a BIOS
writer can provide the kernel with a means to perform actions depending
on the system design in a system-specific fashion.

ACPI control methods may be divided into global control methods, that are not
associated with any particular devices, and device control methods, that have
to be defined separately for each device supposed to be handled with the help of
the platform. This means, in particular, that ACPI device control methods can
only be used to handle devices that the BIOS writer knew about in advance. The
ACPI methods used for device power management fall into that category.

The ACPI specification assumes that devices can be in one of four power states
labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM
D0-D3 states (although the difference between D3hot and D3cold is not taken
into account by ACPI). Moreover, for each power state of a device there is a
set of power resources that have to be enabled for the device to be put into
that state. These power resources are controlled (i.e.
enabled or disabled) with the help of their own control methods, _ON and _OFF,
that have to be defined individually for each of them.

To put a device into the ACPI power state Dx (where x is a number between 0 and
3 inclusive) the kernel is supposed to (1) enable the power resources required
by the device in this state using their _ON control methods and (2) execute the
_PSx control method defined for the device. In addition to that, if the device
is going to be put into a low-power state (D1-D3) and is supposed to generate
wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI
3.0) control method defined for it has to be executed before _PSx. Power
resources that are not required by the device in the target power state and are
not required any more by any other device should be disabled (by executing their
_OFF control methods). If the current power state of the device is D3, it can
only be put into D0 this way.

However, quite often the power states of devices are changed during a
system-wide transition into a sleep state or back into the working state. ACPI
defines four system sleep states, S1, S2, S3, and S4, and denotes the system
working state as S0. In general, the target system sleep (or working) state
determines the highest power (lowest number) state the device can be put
into and the kernel is supposed to obtain this information by executing the
device's _SxD control method (where x is a number between 0 and 4 inclusive).
If the device is required to wake up the system from the target sleep state, the
lowest power (highest number) state it can be put into is also determined by the
target state of the system. The kernel is then supposed to use the device's
_SxW control method to obtain the number of that state.
It also is supposed to use the device's _PRW control method to learn which
power resources need to be enabled for the device to be able to generate wakeup
signals.

1.4. Wakeup Signaling
---------------------
Wakeup signals generated by PCI devices, either as native PCI PMEs, or as
a result of the execution of the _DSW (or _PSW) ACPI control method before
putting the device into a low-power state, have to be caught and handled as
appropriate. If they are sent while the system is in the working state
(ACPI S0), they should be translated into interrupts so that the kernel can
put the devices generating them into the full-power state and take care of the
events that triggered them. In turn, if they are sent while the system is
sleeping, they should cause the system's core logic to trigger wakeup.

On ACPI-based systems wakeup signals sent by conventional PCI devices are
converted into ACPI General-Purpose Events (GPEs) which are hardware signals
from the system core logic generated in response to various events that need to
be acted upon. Every GPE is associated with one or more sources of potentially
interesting events. In particular, a GPE may be associated with a PCI device
capable of signaling wakeup. The information on the connections between GPEs
and event sources is recorded in the system's ACPI BIOS from where it can be
read by the kernel.

If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE
associated with it (if there is one) is triggered. The GPEs associated with PCI
bridges may also be triggered in response to a wakeup signal from one of the
devices below the bridge (this also is the case for root bridges) and, for
example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be
handled this way.

A GPE may be triggered when the system is sleeping (i.e.
when it is in one of the ACPI S1-S4 states), in which case system wakeup is
started by its core logic (the device that was the source of the signal causing
the system wakeup to occur may be identified later). The GPEs used in such
situations are referred to as wakeup GPEs.

Usually, however, GPEs are also triggered when the system is in the working
state (ACPI S0) and in that case the system's core logic generates a System
Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI
handler identifies the GPE that caused the interrupt to be generated which,
in turn, allows the kernel to identify the source of the event (that may be
a PCI device signaling wakeup). The GPEs used for notifying the kernel of
events occurring while the system is in the working state are referred to as
runtime GPEs.

Unfortunately, there is no standard way of handling wakeup signals sent by
conventional PCI devices on systems that are not ACPI-based, but there is one
for PCI Express devices. Namely, the PCI Express Base Specification introduced
a native mechanism for converting native PCI PMEs into interrupts generated by
root ports. For conventional PCI devices native PMEs are out-of-band, so they
are routed separately and they need not pass through bridges (in principle they
may be routed directly to the system's core logic), but for PCI Express devices
they are in-band messages that have to pass through the PCI Express hierarchy,
including the root port on the path from the device to the Root Complex. Thus
it was possible to introduce a mechanism by which a root port generates an
interrupt whenever it receives a PME message from one of the devices below it.
The PCI Express Requester ID of the device that sent the PME message is then
recorded in one of the root port's configuration registers from where it may be
read by the interrupt handler allowing the device to be identified.
[PME messages sent by PCI Express endpoints integrated with the Root Complex
don't pass through root ports, but instead they cause a Root Complex Event
Collector (if there is one) to generate interrupts.]

In principle the native PCI Express PME signaling may also be used on ACPI-based
systems along with the GPEs, but to use it the kernel has to ask the system's
ACPI BIOS to release control of root port configuration registers. The ACPI
BIOS, however, is not required to allow the kernel to control these registers
and if it doesn't do that, the kernel must not modify their contents. Of course
the native PCI Express PME signaling cannot be used by the kernel in that case.


2. PCI Subsystem and Device Power Management
============================================

2.1. Device Power Management Callbacks
--------------------------------------
The PCI Subsystem participates in the power management of PCI devices in a
number of ways. First of all, it provides an intermediate code layer between
the device power management core (PM core) and PCI device drivers.
Specifically, the pm field of the PCI subsystem's struct bus_type object,
pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing
pointers to several device power management callbacks:

const struct dev_pm_ops pci_dev_pm_ops = {
	.prepare = pci_pm_prepare,
	.complete = pci_pm_complete,
	.suspend = pci_pm_suspend,
	.resume = pci_pm_resume,
	.freeze = pci_pm_freeze,
	.thaw = pci_pm_thaw,
	.poweroff = pci_pm_poweroff,
	.restore = pci_pm_restore,
	.suspend_noirq = pci_pm_suspend_noirq,
	.resume_noirq = pci_pm_resume_noirq,
	.freeze_noirq = pci_pm_freeze_noirq,
	.thaw_noirq = pci_pm_thaw_noirq,
	.poweroff_noirq = pci_pm_poweroff_noirq,
	.restore_noirq = pci_pm_restore_noirq,
	.runtime_suspend = pci_pm_runtime_suspend,
	.runtime_resume = pci_pm_runtime_resume,
	.runtime_idle = pci_pm_runtime_idle,
};

These callbacks are executed by the PM core in various situations related to
device power management and they, in turn, execute power management callbacks
provided by PCI device drivers. They also perform power management operations
involving some standard configuration registers of PCI devices that device
drivers need not know or care about.

The structure representing a PCI device, struct pci_dev, contains several fields
that these callbacks operate on:

struct pci_dev {
	...
	pci_power_t     current_state;  /* Current operating state. */
	int             pm_cap;         /* PM capability offset in the
	                                   configuration space */
	unsigned int    pme_support:5;  /* Bitmask of states from which PME#
	                                   can be generated */
	unsigned int    pme_interrupt:1;/* Is native PCIe PME signaling used? */
	unsigned int    d1_support:1;   /* Low power state D1 is supported */
	unsigned int    d2_support:1;   /* Low power state D2 is supported */
	unsigned int    no_d1d2:1;      /* D1 and D2 are forbidden */
	unsigned int    wakeup_prepared:1;  /* Device prepared for wake up */
	unsigned int    d3_delay;       /* D3->D0 transition time in ms */
	...
};

They also indirectly use some fields of the struct device that is embedded in
struct pci_dev.

2.2. Device Initialization
--------------------------
The PCI subsystem's first task related to device power management is to
prepare the device for power management and initialize the fields of struct
pci_dev used for this purpose. This happens in two functions defined in
drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init().

The first of these functions checks if the device supports native PCI PM
and if that's the case the offset of its power management capability structure
in the configuration space is stored in the pm_cap field of the device's struct
pci_dev object. Next, the function checks which PCI low-power states are
supported by the device and from which low-power states the device can generate
native PCI PMEs. The power management fields of the device's struct pci_dev and
the struct device embedded in it are updated accordingly and the generation of
PMEs by the device is disabled.

The second function checks if the device can be prepared to signal wakeup with
the help of the platform firmware, such as the ACPI BIOS. If that is the case,
the function updates the wakeup fields in struct device embedded in the
device's struct pci_dev and uses the firmware-provided method to prevent the
device from signaling wakeup.

At this point the device is ready for power management.
For driverless devices, however, this functionality is limited to a few basic
operations carried out during system-wide transitions to a sleep state and back
to the working state.

2.3. Runtime Device Power Management
------------------------------------
The PCI subsystem plays a vital role in the runtime power management of PCI
devices. For this purpose it uses the general runtime power management
(runtime PM) framework described in Documentation/power/runtime_pm.txt.
Namely, it provides subsystem-level callbacks:

  pci_pm_runtime_suspend()
  pci_pm_runtime_resume()
  pci_pm_runtime_idle()

that are executed by the core runtime PM routines. It also implements the
entire mechanics necessary for handling runtime wakeup signals from PCI devices
in low-power states, which at the time of this writing works for both the native
PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in
Section 1.

First, a PCI device is put into a low-power state, or suspended, with the help
of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call
pci_pm_runtime_suspend() to do the actual job. For this to work, the device's
driver has to provide a pm->runtime_suspend() callback (see below), which is
run by pci_pm_runtime_suspend() as the first action. If the driver's callback
returns successfully, the device's standard configuration registers are saved,
the device is prepared to generate wakeup signals and, finally, it is put into
the target low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup. The exact method of signaling wakeup is
system-dependent and is determined by the PCI subsystem on the basis of the
reported capabilities of the device and the platform firmware.
To prepare the device for signaling wakeup and put it into the selected
low-power state, the PCI subsystem can use the platform firmware as well as the
device's native PCI PM capabilities, if supported.

It is expected that the device driver's pm->runtime_suspend() callback will
not attempt to prepare the device for signaling wakeup or to put it into a
low-power state. The driver ought to leave these tasks to the PCI subsystem
that has all of the information necessary to perform them.

A suspended device is brought back into the "active" state, or resumed,
with the help of pm_request_resume() or pm_runtime_resume() which both call
pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's
driver provides a pm->runtime_resume() callback (see below). However, before
the driver's callback is executed, pci_pm_runtime_resume() brings the device
back into the full-power state, prevents it from signaling wakeup while in that
state and restores its standard configuration registers. Thus the driver's
callback need not worry about the PCI-specific aspects of the device resume.

Note that generally pci_pm_runtime_resume() may be called in two different
situations. First, it may be called at the request of the device's driver, for
example if there are some data for it to process. Second, it may be called
as a result of a wakeup signal from the device itself (this sometimes is
referred to as "remote wakeup"). Of course, for this purpose the wakeup signal
is handled in one of the ways described in Section 1 and finally converted into
a notification for the PCI subsystem after the source device has been
identified.
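
The ordering of the runtime suspend and resume steps described in this section
can be illustrated with a small user-space model. This is only a sketch, not
the kernel's code: every name in it (struct model_dev, model_runtime_suspend()
and so on) is hypothetical, the error value returned for a missing callback is
a placeholder, and the real pci_pm_runtime_suspend() and
pci_pm_runtime_resume() routines do considerably more work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this sketch; none of these names are kernel APIs. */
enum model_state { MODEL_D0, MODEL_LOW_POWER };

struct model_dev {
	enum model_state state;
	bool regs_saved;	/* standard configuration registers saved */
	bool wakeup_armed;	/* device prepared to signal wakeup */
	enum model_state seen_by_driver;	/* state observed by the driver callback */
	int (*drv_runtime_suspend)(struct model_dev *dev);
	int (*drv_runtime_resume)(struct model_dev *dev);
};

/* Suspend: driver callback first, then save registers, arm wakeup, power down. */
int model_runtime_suspend(struct model_dev *dev)
{
	int error;

	assert(dev != NULL);
	if (!dev->drv_runtime_suspend)
		return -1;	/* placeholder: the driver must provide the callback */

	error = dev->drv_runtime_suspend(dev);
	if (error)
		return error;	/* the device is left in the full-power state */

	dev->regs_saved = true;
	dev->wakeup_armed = true;
	dev->state = MODEL_LOW_POWER;
	return 0;
}

/* Resume: power up, disarm wakeup, restore registers, driver callback last. */
int model_runtime_resume(struct model_dev *dev)
{
	assert(dev != NULL);
	if (!dev->drv_runtime_resume)
		return -1;	/* placeholder: the driver must provide the callback */

	dev->state = MODEL_D0;
	dev->wakeup_armed = false;
	dev->regs_saved = false;	/* saved registers have been written back */
	return dev->drv_runtime_resume(dev);
}

/* Example driver callbacks: they only record the state they were called in. */
int example_drv_callback(struct model_dev *dev)
{
	dev->seen_by_driver = dev->state;
	return 0;
}

/* Example driver callback that refuses to suspend (arbitrary nonzero error). */
int example_drv_busy(struct model_dev *dev)
{
	(void)dev;
	return -16;
}
```

In this model the driver's suspend callback always observes the device in the
full-power state, and so does its resume callback, which mirrors the statement
above that the driver need not handle the PCI-specific part of the transition.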

The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle()
and pm_request_idle(), executes the device driver's pm->runtime_idle()
callback, if defined, and if that callback doesn't return an error code (or is
not present at all), suspends the device with the help of pm_runtime_suspend().
Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for
example, it is called right after the device has just been resumed), in which
case it is expected to suspend the device if that makes sense. Usually,
however, the PCI subsystem doesn't really know if the device can be suspended,
so it lets the device's driver decide by running its pm->runtime_idle()
callback.

2.4. System-Wide Power Transitions
----------------------------------
There are a few different types of system-wide power transitions, described in
Documentation/power/devices.txt. Each of them requires devices to be handled
in a specific way and the PM core executes subsystem-level power management
callbacks for this purpose. They are executed in phases such that each phase
involves executing the same subsystem-level callback for every device belonging
to the given subsystem before the next phase begins. These phases always run
after tasks have been frozen.

2.4.1. System Suspend

When the system is going into a sleep state in which the contents of memory will
be preserved, such as one of the ACPI sleep states S1-S3, the phases are:

  prepare, suspend, suspend_noirq.

The following PCI bus type's callbacks, respectively, are used in these phases:

  pci_pm_prepare()
  pci_pm_suspend()
  pci_pm_suspend_noirq()

The pci_pm_prepare() routine first puts the device into the "fully functional"
state with the help of pm_runtime_resume(). Then, it executes the device
driver's pm->prepare() callback if defined (i.e.
if the driver's struct dev_pm_ops object is present and the prepare pointer in
that object is valid).

The pci_pm_suspend() routine first checks if the device's driver implements
legacy PCI suspend routines (see Section 3), in which case the driver's legacy
suspend callback is executed, if present, and its result is returned. Next, if
the device's driver doesn't provide a struct dev_pm_ops object (containing
pointers to the driver's callbacks), pci_pm_default_suspend() is called, which
simply turns off the device's bus master capability and runs
pcibios_disable_device() to disable it, unless the device is a bridge (PCI
bridges are ignored by this routine). Next, the device driver's pm->suspend()
callback is executed, if defined, and its result is returned if it fails.
Finally, pci_fixup_device() is called to apply hardware suspend quirks related
to the device if necessary.

Note that the suspend phase is carried out asynchronously for PCI devices, so
the pci_pm_suspend() callback may be executed in parallel for any pair of PCI
devices that don't depend on each other in a known way (i.e. none of the paths
in the device tree from the root bridge to a leaf device contains both of them).

The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has
been called, which means that the device driver's interrupt handler won't be
invoked while this routine is running. It first checks if the device's driver
implements legacy PCI suspend routines (Section 3), in which case the legacy
late suspend routine is called and its result is returned (the standard
configuration registers of the device are saved if the driver's callback hasn't
done that). Second, if the device driver's struct dev_pm_ops object is not
present, the device's standard configuration registers are saved and the routine
returns success.
Otherwise the device driver's pm->suspend_noirq() callback is executed, if
present, and its result is returned if it fails. Next, if the device's standard
configuration registers haven't been saved yet (one of the device driver's
callbacks executed before might do that), pci_pm_suspend_noirq() saves them,
prepares the device to signal wakeup (if necessary) and puts it into a
low-power state.

The low-power state to put the device into is the lowest-power (highest number)
state from which it can signal wakeup while the system is in the target sleep
state. Just like in the runtime PM case described above, the mechanism of
signaling wakeup is system-dependent and determined by the PCI subsystem, which
is also responsible for preparing the device to signal wakeup from the system's
target sleep state as appropriate.

PCI device drivers (that don't implement legacy power management callbacks) are
generally not expected to prepare devices for signaling wakeup or to put them
into low-power states. However, if one of the driver's suspend callbacks
(pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration
registers, pci_pm_suspend_noirq() will assume that the device has been prepared
to signal wakeup and put into a low-power state by the driver (the driver is
then assumed to have used the helper functions provided by the PCI subsystem for
this purpose). PCI device drivers are not encouraged to do that, but in some
rare cases doing that in the driver may be the optimum approach.

2.4.2. System Resume

When the system is undergoing a transition from a sleep state in which the
contents of memory have been preserved, such as one of the ACPI sleep states
S1-S3, into the working state (ACPI S0), the phases are:

  resume_noirq, resume, complete.

The following PCI bus type's callbacks, respectively, are executed in these
phases:

  pci_pm_resume_noirq()
  pci_pm_resume()
  pci_pm_complete()

The pci_pm_resume_noirq() routine first puts the device into the full-power
state, restores its standard configuration registers and applies early resume
hardware quirks related to the device, if necessary. This is done
unconditionally, regardless of whether or not the device's driver implements
legacy PCI power management callbacks (this way all PCI devices are in the
full-power state and their standard configuration registers have been restored
when their interrupt handlers are invoked for the first time during resume,
which allows the kernel to avoid problems with the handling of shared interrupts
by drivers whose devices are still suspended). If legacy PCI power management
callbacks (see Section 3) are implemented by the device's driver, the legacy
early resume callback is executed and its result is returned. Otherwise, the
device driver's pm->resume_noirq() callback is executed, if defined, and its
result is returned.

The pci_pm_resume() routine first checks if the device's standard configuration
registers have been restored and restores them if that's not the case (this
only is necessary in the error path during a failing suspend). Next, resume
hardware quirks related to the device are applied, if necessary, and if the
device's driver implements legacy PCI power management callbacks (see
Section 3), the driver's legacy resume callback is executed and its result is
returned. Otherwise, the device's wakeup signaling mechanisms are blocked and
its driver's pm->resume() callback is executed, if defined (the callback's
result is then returned).
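
The decision sequence described for pci_pm_resume() can be summarized with the
following user-space sketch. It is a simplified model under stated assumptions
rather than the kernel's implementation: all structure and function names are
hypothetical, quirk handling is reduced to a counter, and returning 0 when a
legacy driver has no resume callback is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev;

/* Hypothetical driver description; not the kernel's struct pci_driver. */
struct model_driver {
	bool uses_legacy_pm;	/* driver implements the legacy callbacks */
	int (*legacy_resume)(struct model_dev *dev);
	int (*pm_resume)(struct model_dev *dev);	/* models pm->resume() */
};

struct model_dev {
	bool regs_restored;
	bool wakeup_enabled;
	int quirks_applied;
	int legacy_calls;
	int pm_calls;
	const struct model_driver *drv;
};

/* Models the order of checks described for pci_pm_resume(). */
int model_resume(struct model_dev *dev)
{
	const struct model_driver *drv;

	assert(dev != NULL);
	drv = dev->drv;

	/* Only needed in the error path of a failing suspend. */
	if (!dev->regs_restored)
		dev->regs_restored = true;

	dev->quirks_applied++;	/* stands in for the resume hardware quirks */

	/* Legacy drivers: run their callback and return its result. */
	if (drv && drv->uses_legacy_pm)
		return drv->legacy_resume ? drv->legacy_resume(dev) : 0;

	/* Otherwise block wakeup signaling, then run pm->resume() if defined. */
	dev->wakeup_enabled = false;
	if (drv && drv->pm_resume)
		return drv->pm_resume(dev);

	return 0;
}

/* Example callbacks that only count how often they were invoked. */
int example_legacy_resume(struct model_dev *dev)
{
	dev->legacy_calls++;
	return 0;
}

int example_pm_resume(struct model_dev *dev)
{
	dev->pm_calls++;
	return 0;
}
```

For a driver marked as legacy only example_legacy_resume() runs in this model,
even if a pm_resume pointer is also set, which reflects the "Otherwise" in the
description above.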

The resume phase is carried out asynchronously for PCI devices, like the
suspend phase described above, which means that if two PCI devices don't depend
on each other in a known way, the pci_pm_resume() routine may be executed for
both of them in parallel.

The pci_pm_complete() routine only executes the device driver's pm->complete()
callback, if defined.

2.4.3. System Hibernation

System hibernation is more complicated than system suspend, because it requires
a system image to be created and written into a persistent storage medium. The
image is created atomically and all devices are quiesced, or frozen, before that
happens.

The freezing of devices is carried out after enough memory has been freed (at
the time of this writing the image creation requires at least 50% of system RAM
to be free) in the following three phases:

  prepare, freeze, freeze_noirq

that correspond to the PCI bus type's callbacks:

  pci_pm_prepare()
  pci_pm_freeze()
  pci_pm_freeze_noirq()

This means that the prepare phase is exactly the same as for system suspend.
The other two phases, however, are different.

The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs
the device driver's pm->freeze() callback, if defined, instead of pm->suspend(),
and it doesn't apply the suspend-related hardware quirks. It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The pci_pm_freeze_noirq() routine, in turn, is similar to
pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq()
routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the
device for signaling wakeup and put it into a low-power state. Still, it saves
the device's standard configuration registers if they haven't been saved by one
of the driver's callbacks.
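
The phase ordering stated at the beginning of Section 2.4 (every device goes
through one phase before the next phase begins), applied to the three freezing
phases listed above, can be sketched as a pair of nested loops. The model below
is hypothetical and deliberately simplified: devices are plain indices, the
phases run strictly serially, and neither the asynchronous execution nor the
device ordering used by the real PM core is represented.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LOG 32

/* Hypothetical log of (phase, device) pairs in the order they were run. */
struct model_log {
	const char *phase[MODEL_MAX_LOG];
	int dev[MODEL_MAX_LOG];
	size_t len;
};

/* Phase names taken from the text above. */
const char *const model_freeze_phases[] = {
	"prepare", "freeze", "freeze_noirq",
};

/*
 * Run each phase for every device before moving on to the next phase.
 * Returns the number of log entries recorded so far.
 */
size_t model_run_phases(const char *const *phases, size_t nr_phases,
			int nr_devs, struct model_log *log)
{
	assert(phases != NULL && log != NULL);

	for (size_t p = 0; p < nr_phases; p++) {
		for (int d = 0; d < nr_devs; d++) {
			if (log->len == MODEL_MAX_LOG)
				return log->len;	/* log full, stop early */
			log->phase[log->len] = phases[p];
			log->dev[log->len] = d;
			log->len++;
		}
	}
	return log->len;
}
```

With two devices, the log produced by this sketch lists "prepare" for both
devices, then "freeze" for both, then "freeze_noirq" for both; no device reaches
a later phase while another one is still in an earlier phase.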

Once the image has been created, it has to be saved. However, at this point all
devices are frozen and they cannot handle I/O, while their ability to handle
I/O is obviously necessary for the image saving. Thus they have to be brought
back to the fully functional state and this is done in the following phases:

	thaw_noirq, thaw, complete

using the following PCI bus type's callbacks:

	pci_pm_thaw_noirq()
	pci_pm_thaw()
	pci_pm_complete()

respectively.

The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(),
but it doesn't put the device into the full-power state and doesn't attempt to
restore its standard configuration registers. It also executes the device
driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq().

The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device
driver's pm->thaw() callback instead of pm->resume(). It is executed
asynchronously for different PCI devices that don't depend on each other in a
known way.

The complete phase is the same as for system resume.

After saving the image, devices need to be powered down before the system can
enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in
three phases:

	prepare, poweroff, poweroff_noirq

where the prepare phase is exactly the same as for system suspend. The other
two phases are analogous to the suspend and suspend_noirq phases, respectively.
The PCI subsystem-level callbacks they correspond to

	pci_pm_poweroff()
	pci_pm_poweroff_noirq()

work in analogy with pci_pm_suspend() and pci_pm_suspend_noirq(), respectively,
although they don't attempt to save the device's standard configuration
registers.

2.4.4. System Restore

System restore requires a hibernation image to be loaded into memory and the
pre-hibernation memory contents to be restored before the pre-hibernation system
activity can be resumed.

As described in Documentation/power/devices.txt, the hibernation image is loaded
into memory by a fresh instance of the kernel, called the boot kernel, which in
turn is loaded and run by a boot loader in the usual way. After the boot kernel
has loaded the image, it needs to replace its own code and data with the code
and data of the "hibernated" kernel stored within the image, called the image
kernel. For this purpose all devices are frozen just like before creating
the image during hibernation, in the

	prepare, freeze, freeze_noirq

phases described above. However, the devices affected by these phases are only
those having drivers in the boot kernel; other devices will still be in whatever
state the boot loader left them.

Should the restoration of the pre-hibernation memory contents fail, the boot
kernel would go through the "thawing" procedure described above, using the
thaw_noirq, thaw, and complete phases (that will only affect the devices having
drivers in the boot kernel), and then continue running normally.

If the pre-hibernation memory contents are restored successfully, which is the
usual situation, control is passed to the image kernel, which then becomes
responsible for bringing the system back to the working state. To achieve this,
it must restore the devices' pre-hibernation functionality, which is done much
like waking up from the memory sleep state, although it involves different
phases:

	restore_noirq, restore, complete

The first two of these are analogous to the resume_noirq and resume phases
described above, respectively, and correspond to the following PCI subsystem
callbacks:

	pci_pm_restore_noirq()
	pci_pm_restore()

These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(),
respectively, but they execute the device driver's pm->restore_noirq() and
pm->restore() callbacks, if available.

The complete phase is carried out in exactly the same way as during system
resume.


3. PCI Device Drivers and Power Management
==========================================

3.1. Power Management Callbacks
-------------------------------
PCI device drivers participate in power management by providing callbacks to be
executed by the PCI subsystem's power management routines described above and by
controlling the runtime power management of their devices.

At the time of this writing there are two ways to define power management
callbacks for a PCI device driver, the recommended one, based on using a
dev_pm_ops structure described in Documentation/power/devices.txt, and the
"legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and
.resume() callbacks from struct pci_driver are used. The legacy approach,
however, doesn't allow one to define runtime power management callbacks and is
not really suitable for any new drivers. Therefore it is not covered by this
document (refer to the source code to learn more about it).

It is recommended that all PCI device drivers define a struct dev_pm_ops object
containing pointers to power management (PM) callbacks that will be executed by
the PCI subsystem's PM routines in various circumstances. A pointer to the
driver's struct dev_pm_ops object has to be assigned to the driver.pm field in
its struct pci_driver object. Once that has happened, the "legacy" PM callbacks
in struct pci_driver are ignored (even if they are not NULL).

The PM callbacks in struct dev_pm_ops are not mandatory and if they are not
defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI
subsystem will handle the device in a simplified default manner. If they are
defined, though, they are expected to behave as described in the following
subsections.

3.1.1. prepare()

The prepare() callback is executed during system suspend, during hibernation
(when a hibernation image is about to be created), during power-off after
saving a hibernation image and during system restore, when a hibernation image
has just been loaded into memory.

This callback is only necessary if the driver's device has children that in
general may be registered at any time. In that case the role of the prepare()
callback is to prevent new children of the device from being registered until
one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run.

In addition to that the prepare() callback may carry out some operations
preparing the device to be suspended, although it should not allocate memory
(if additional memory is required to suspend the device, it has to be
preallocated earlier, for example in a suspend/hibernate notifier as described
in Documentation/power/notifiers.txt).

3.1.2. suspend()

The suspend() callback is only executed during system suspend, after prepare()
callbacks have been executed for all devices in the system.

This callback is expected to quiesce the device and prepare it to be put into a
low-power state by the PCI subsystem. It is not required (in fact, it is not
even recommended) that a PCI driver's suspend() callback save the standard
configuration registers of the device, prepare it for waking up the system, or
put it into a low-power state. All of these operations can very well be taken
care of by the PCI subsystem, without the driver's participation.

However, in some rare cases it is convenient to carry out these operations in
a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and
pci_set_power_state() should be used to save the device's standard configuration
registers, to prepare it for system wakeup (if necessary), and to put it into a
low-power state, respectively. Moreover, if the driver calls pci_save_state(),
the PCI subsystem will not execute either pci_prepare_to_sleep() or
pci_set_power_state() for its device, so the driver is then responsible for
handling the device as appropriate.

While the suspend() callback is being executed, the driver's interrupt handler
can be invoked to handle an interrupt from the device, so all suspend-related
operations relying on the driver's ability to handle interrupts should be
carried out in this callback.

3.1.3. suspend_noirq()

The suspend_noirq() callback is only executed during system suspend, after
suspend() callbacks have been executed for all devices in the system and
after device interrupts have been disabled by the PM core.

The difference between suspend_noirq() and suspend() is that the driver's
interrupt handler will not be invoked while suspend_noirq() is running. Thus
suspend_noirq() can carry out operations that would cause race conditions to
arise if they were performed in suspend().

3.1.4. freeze()

The freeze() callback is hibernation-specific and is executed in two situations:
during hibernation, after prepare() callbacks have been executed for all devices
in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory from persistent storage and the
prepare() callbacks have been executed for all devices.

The role of this callback is analogous to the role of the suspend() callback
described above. In fact, they only need to be different in the rare cases when
the driver takes the responsibility for putting the device into a low-power
state.

In those cases the freeze() callback should not prepare the device for system
wakeup or put it into a low-power state. Still, either it or freeze_noirq()
should save the device's standard configuration registers using
pci_save_state().

3.1.5. freeze_noirq()

The freeze_noirq() callback is hibernation-specific. It is executed during
hibernation, after prepare() and freeze() callbacks have been executed for all
devices in preparation for the creation of a system image, and during restore,
after a system image has been loaded into memory and after prepare() and
freeze() callbacks have been executed for all devices. It is always executed
after device interrupts have been disabled by the PM core.

The role of this callback is analogous to the role of the suspend_noirq()
callback described above and it is very rarely necessary to define
freeze_noirq().

The difference between freeze_noirq() and freeze() is analogous to the
difference between suspend_noirq() and suspend().

3.1.6. poweroff()

The poweroff() callback is hibernation-specific. It is executed when the system
is about to be powered off after saving a hibernation image to persistent
storage. prepare() callbacks are executed for all devices before poweroff() is
called.

The role of this callback is analogous to the role of the suspend() and freeze()
callbacks described above, although it does not need to save the contents of
the device's registers. In particular, if the driver wants to put the device
into a low-power state itself instead of allowing the PCI subsystem to do that,
the poweroff() callback should use pci_prepare_to_sleep() and
pci_set_power_state() to prepare the device for system wakeup and to put it
into a low-power state, respectively, but it need not save the device's standard
configuration registers.

3.1.7. poweroff_noirq()

The poweroff_noirq() callback is hibernation-specific. It is executed after
poweroff() callbacks have been executed for all devices in the system.

The role of this callback is analogous to the role of the suspend_noirq() and
freeze_noirq() callbacks described above, but it does not need to save the
contents of the device's registers.

The difference between poweroff_noirq() and poweroff() is analogous to the
difference between suspend_noirq() and suspend().

3.1.8. resume_noirq()

The resume_noirq() callback is only executed during system resume, after the
PM core has enabled the non-boot CPUs. The driver's interrupt handler will not
be invoked while resume_noirq() is running, so this callback can carry out
operations that might race with the interrupt handler.

Since the PCI subsystem unconditionally puts all devices into the full-power
state in the resume_noirq phase of system resume and restores their standard
configuration registers, resume_noirq() is usually not necessary. In general
it should only be used for performing operations that would lead to race
conditions if carried out by resume().

3.1.9. resume()

The resume() callback is only executed during system resume, after
resume_noirq() callbacks have been executed for all devices in the system and
device interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-suspend configuration of the
device and bringing it back to the fully functional state. The device should be
able to process I/O in the usual way after resume() has returned.

3.1.10. thaw_noirq()

The thaw_noirq() callback is hibernation-specific. It is executed after a
system image has been created and the non-boot CPUs have been enabled by the PM
core, in the thaw_noirq phase of hibernation. It also may be executed if the
loading of a hibernation image fails during system restore (it is then executed
after enabling the non-boot CPUs). The driver's interrupt handler will not be
invoked while thaw_noirq() is running.

The role of this callback is analogous to the role of resume_noirq(). The
difference between these two callbacks is that thaw_noirq() is executed after
freeze() and freeze_noirq(), so in general it does not need to modify the
contents of the device's registers.

3.1.11. thaw()

The thaw() callback is hibernation-specific. It is executed after thaw_noirq()
callbacks have been executed for all devices in the system and after device
interrupts have been enabled by the PM core.

This callback is responsible for restoring the pre-freeze configuration of
the device, so that it will work in the usual way after thaw() has returned.

3.1.12. restore_noirq()

The restore_noirq() callback is hibernation-specific. It is executed in the
restore_noirq phase of hibernation, when the boot kernel has passed control to
the image kernel and the non-boot CPUs have been enabled by the image kernel's
PM core.

This callback is analogous to resume_noirq() with the exception that it cannot
make any assumption about the previous state of the device, even if the BIOS (or
generally the platform firmware) is known to preserve that state over a
suspend-resume cycle.

For the vast majority of PCI device drivers there is no difference between
resume_noirq() and restore_noirq().

3.1.13. restore()

The restore() callback is hibernation-specific. It is executed after
restore_noirq() callbacks have been executed for all devices in the system and
after the PM core has enabled device drivers' interrupt handlers to be invoked.

This callback is analogous to resume(), just like restore_noirq() is analogous
to resume_noirq(). Consequently, the difference between restore_noirq() and
restore() is analogous to the difference between resume_noirq() and resume().

For the vast majority of PCI device drivers there is no difference between
resume() and restore().

3.1.14. complete()

The complete() callback is executed in the following situations:
  - during system resume, after resume() callbacks have been executed for all
    devices,
  - during hibernation, before saving the system image, after thaw() callbacks
    have been executed for all devices,
  - during system restore, when the system is going back to its pre-hibernation
    state, after restore() callbacks have been executed for all devices.
It also may be executed if the loading of a hibernation image into memory fails
(in that case it is run after thaw() callbacks have been executed for all
devices that have drivers in the boot kernel).

This callback is entirely optional, although it may be necessary if the
prepare() callback performs operations that need to be reversed.

3.1.15. runtime_suspend()

The runtime_suspend() callback is specific to device runtime power management
(runtime PM). It is executed by the PM core's runtime PM framework when the
device is about to be suspended (i.e. quiesced and put into a low-power state)
at run time.

This callback is responsible for freezing the device and preparing it to be
put into a low-power state, but it must allow the PCI subsystem to perform all
of the PCI-specific actions necessary for suspending the device.

3.1.16. runtime_resume()

The runtime_resume() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework when the device is about to be resumed
(i.e. put into the full-power state and programmed to process I/O normally) at
run time.

This callback is responsible for restoring the normal functionality of the
device after it has been put into the full-power state by the PCI subsystem.
The device is expected to be able to process I/O in the usual way after
runtime_resume() has returned.

3.1.17. runtime_idle()

The runtime_idle() callback is specific to device runtime PM. It is executed
by the PM core's runtime PM framework whenever it may be desirable to suspend
the device according to the PM core's information. In particular, it is
automatically executed right after runtime_resume() has returned in case the
resume of the device has happened as a result of a spurious event.

This callback is optional, but if it is not implemented or if it returns 0, the
PCI subsystem will call pm_runtime_suspend() for the device, which in turn will
cause the driver's runtime_suspend() callback to be executed.

3.1.18. Pointing Multiple Callback Pointers to One Routine

Although in principle each of the callbacks described in the previous
subsections can be defined as a separate function, it often is convenient to
point two or more members of struct dev_pm_ops to the same routine. There are
a few convenience macros that can be used for this purpose.
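
For reference, this is roughly what such sharing looks like when it is written
out by hand, shown as a userspace sketch rather than kernel code. The structure
struct model_dev_pm_ops is a hypothetical stand-in that carries only a few of
the members of struct dev_pm_ops, and foo_quiesce() and foo_reinit() are
made-up driver routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a few members of struct dev_pm_ops. */
struct model_dev_pm_ops {
	int (*suspend)(void *dev);
	int (*resume)(void *dev);
	int (*freeze)(void *dev);
	int (*thaw)(void *dev);
	int (*poweroff)(void *dev);
	int (*restore)(void *dev);
	int (*runtime_suspend)(void *dev);
	int (*runtime_resume)(void *dev);
};

/* Made-up driver routine: quiesce the device. */
static int foo_quiesce(void *dev)
{
	(void)dev;
	return 0;
}

/* Made-up driver routine: reprogram the device so it can process I/O. */
static int foo_reinit(void *dev)
{
	(void)dev;
	return 0;
}

/* One suspend-side and one resume-side routine shared by six members. */
static const struct model_dev_pm_ops foo_pm_ops = {
	.suspend	= foo_quiesce,
	.freeze		= foo_quiesce,
	.poweroff	= foo_quiesce,
	.resume		= foo_reinit,
	.thaw		= foo_reinit,
	.restore	= foo_reinit,
	/* The runtime PM members are not named, so they stay NULL. */
};
```

Members that are not named in a designated initializer are set to NULL, which
corresponds to the callbacks being "unset" in the sense used in Section 3.1.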

The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one
suspend routine pointed to by the .suspend(), .freeze(), and .poweroff()
members and one resume routine pointed to by the .resume(), .thaw(), and
.restore() members. The other function pointers in this struct dev_pm_ops are
unset.

The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it
additionally sets the .runtime_resume() pointer to the same value as
.resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to
the same value as .suspend() (and .freeze() and .poweroff()).

The SET_SYSTEM_SLEEP_PM_OPS macro can be used inside of a declaration of struct
dev_pm_ops to indicate that one suspend routine is to be pointed to by the
.suspend(), .freeze(), and .poweroff() members and one resume routine is to
be pointed to by the .resume(), .thaw(), and .restore() members.

3.2. Device Runtime Power Management
------------------------------------
In addition to providing device power management callbacks PCI device drivers
are responsible for controlling the runtime power management (runtime PM) of
their devices.

The PCI device runtime PM is optional, but it is recommended that PCI device
drivers implement it at least in the cases where there is a reliable way of
verifying that the device is not used (like when the network cable is detached
from an Ethernet adapter or there are no devices attached to a USB controller).

To support the PCI runtime PM the driver first needs to implement the
runtime_suspend() and runtime_resume() callbacks. It also may need to implement
the runtime_idle() callback to prevent the device from being suspended again
every time right after the runtime_resume() callback has returned
(alternatively, the runtime_suspend() callback will have to check if the
device should really be suspended and return -EAGAIN if that is not the case).
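
The two alternatives mentioned above can be illustrated with a userspace sketch
of a driver for an imaginary Ethernet adapter. Nothing here is kernel code:
struct model_nic and the model_nic_*() routines are hypothetical, the link
state stands in for "a reliable way of verifying that the device is not used",
and the use of -EBUSY in the idle check is an assumption (the text only
requires a nonzero value to keep the PCI subsystem from suspending the device).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-device state of an imaginary Ethernet adapter. */
struct model_nic {
	bool link_up;	/* cable attached, so the device is in use */
	bool quiesced;	/* device has been prepared for low power */
};

/* Models runtime_idle(): returning 0 lets the suspend go ahead. */
static int model_nic_runtime_idle(const struct model_nic *nic)
{
	return nic->link_up ? -EBUSY : 0;	/* -EBUSY is an assumption */
}

/* Models runtime_suspend() with the -EAGAIN check described above. */
static int model_nic_runtime_suspend(struct model_nic *nic)
{
	if (nic->link_up)
		return -EAGAIN;	/* still in use, do not suspend now */

	nic->quiesced = true;
	return 0;
}

/* Models runtime_resume(): make the device able to process I/O again. */
static int model_nic_runtime_resume(struct model_nic *nic)
{
	nic->quiesced = false;
	return 0;
}
```

A driver normally needs only one of the two checks. Doing the check in the idle
routine avoids starting a suspend that is bound to be refused, while doing it in
the suspend routine keeps the driver one callback shorter.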

The runtime PM of PCI devices is disabled by default. It is also blocked by
pci_pm_init() that runs the pm_runtime_forbid() helper function. If a PCI
driver implements the runtime PM callbacks and intends to use the runtime PM
framework provided by the PM core and the PCI subsystem, it should enable this
feature by executing the pm_runtime_enable() helper function. However, the
driver should not call the pm_runtime_allow() helper function unblocking
the runtime PM of the device. Instead, it should allow user space or some
platform-specific code to do that (user space can do it via sysfs), although
once it has called pm_runtime_enable(), it must be prepared to handle the
runtime PM of the device correctly as soon as pm_runtime_allow() is called
(which may happen at any time). [It also is possible that user space causes
pm_runtime_allow() to be called via sysfs before the driver is loaded, so in
fact the driver has to be prepared to handle the runtime PM of the device as
soon as it calls pm_runtime_enable().]

The runtime PM framework works by processing requests to suspend or resume
devices, or to check if they are idle (in which case it is reasonable to
subsequently request that they be suspended). These requests are represented
by work items put into the power management workqueue, pm_wq. Although there
are a few situations in which power management requests are automatically
queued by the PM core (for example, after processing a request to resume a
device the PM core automatically queues a request to check if the device is
idle), device drivers are generally responsible for queuing power management
requests for their devices. For this purpose they should use the runtime PM
helper functions provided by the PM core, discussed in
Documentation/power/runtime_pm.txt.

Devices can also be suspended and resumed synchronously, without placing a
request into pm_wq. In the majority of cases this is also done by their
drivers, which use helper functions provided by the PM core for this purpose.

For more information on the runtime PM of devices refer to
Documentation/power/runtime_pm.txt.


4. Resources
============

PCI Local Bus Specification, Rev. 3.0
PCI Bus Power Management Interface Specification, Rev. 1.2
Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b
PCI Express Base Specification, Rev. 2.0
Documentation/power/devices.txt
Documentation/power/runtime_pm.txt