
Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side

thought of as an informal, high-level specification for RCU. It is

#. `Fundamental Non-Requirements`_

#. `Quality-of-Implementation Requirements`_

#. `Software-Engineering Requirements`_
Fundamental Requirements
------------------------

#. `Grace-Period Guarantee`_

#. `Memory-Barrier Guarantees`_

#. `Guaranteed Read-to-Write Upgrade`_

Grace-Period Guarantee
~~~~~~~~~~~~~~~~~~~~~~

RCU's grace-period guarantee is unusual in being premeditated: Jack

RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections. An RCU read-side

RCU treats a nested set as one big RCU read-side critical section.
Production-quality implementations of ``rcu_read_lock()`` and
Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing

+-----------------------------------------------------------------------+
| progress concurrently with readers, but pre-existing readers will |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| Second, even when using ``synchronize_rcu()``, the other update-side |
| code does run concurrently with readers, whether pre-existing or not. |
+-----------------------------------------------------------------------+
The RCU read-side critical section in ``do_something_dlm()`` works with

In order to avoid fatal problems such as deadlocks, an RCU read-side

Similarly, an RCU read-side critical section must not contain anything

Although RCU's grace-period guarantee is useful in and of itself, with

be good to be able to use RCU to coordinate read-side access to linked
data structures. For this, the grace-period guarantee is not sufficient,

non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.
   5 return -ENOMEM;

   11 p->a = a;
   12 p->b = b;

   5 return -ENOMEM;

   12 p->a = a;
   13 p->b = b;

executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.

which brings us to the publish-subscribe guarantee discussed in the next

RCU's publish-subscribe guarantee allows data to be inserted into a

   5 return -ENOMEM;

   11 p->a = a;
   12 p->b = b;
+-----------------------------------------------------------------------+
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| initialized. So reordering the assignments to ``p->a`` and ``p->b`` |
+-----------------------------------------------------------------------+
control its accesses to the RCU-protected data, as shown in

   6 do_something(p->a, p->b);

   5 do_something(gp->a, gp->b);

the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could

   6 do_something(p->a, p->b);

barriers in the Linux kernel. Should a `high-quality implementation of

outermost RCU read-side critical section containing that

Of course, it is also necessary to remove elements from RCU-protected

#. Wait for all pre-existing RCU read-side critical sections to complete
   (because only pre-existing readers can possibly have a reference to

read-side critical section or in a code segment where the pointer

update-side lock.
+-----------------------------------------------------------------------+
| in a byte-at-a-time manner, resulting in *load tearing*, in turn |
| resulting in a bytewise mash-up of two distinct pointer values. It might |
| even use value-speculation optimizations, where it makes a wrong |
| about any dereferences that returned pre-initialization garbage in |
+-----------------------------------------------------------------------+

In short, RCU's publish-subscribe guarantee is provided by the

guarantee allows data elements to be safely added to RCU-protected

can be used in combination with the grace-period guarantee to also allow
data elements to be removed from RCU-protected linked data structures,

resembling the dependency-ordering barrier that was later subsumed

late-1990s meeting with the DEC Alpha architects, back in the days when
DEC was still a free-standing company. It took the Alpha architects a
Memory-Barrier Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~

The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on

#. Each CPU that has an RCU read-side critical section that begins

   memory barrier between the time that the RCU read-side critical

   this guarantee, a pre-existing RCU read-side critical section might

#. Each CPU that has an RCU read-side critical section that ends after

   time that the RCU read-side critical section begins. Without this
   guarantee, a later RCU read-side critical section running after the
+-----------------------------------------------------------------------+
| Given that multiple CPUs can start RCU read-side critical sections at |
| whether or not a given RCU read-side critical section starts before a |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| If RCU cannot tell whether or not a given RCU read-side critical |
| it must assume that the RCU read-side critical section started first. |
| waiting on a given RCU read-side critical section only if it can |
| within the enclosed RCU read-side critical section to the code |
| then a given RCU read-side critical section begins before a given |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| #. CPU 1: ``do_something_with(q->a);`` |
| end of the RCU read-side critical section and the end of the grace |
| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */`` |
| grace period and the beginning of the RCU read-side critical section, |
| believing that you have adhered to the as-if rule than it is to |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| compiler might arbitrarily rearrange consecutive RCU read-side |
| critical sections. Given such rearrangement, if a given RCU read-side |
| read-side critical sections are done? Won't the compiler |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| ``schedule()`` had better prevent calling-code accesses to shared |
| RCU detects the end of a given RCU read-side critical section, it |
| will necessarily detect the end of all prior RCU read-side critical |
| loop, into user-mode code, and so on. But if your kernel build allows |
+-----------------------------------------------------------------------+
Note that these memory-barrier requirements do not replace the

pre-existing readers. On the contrary, the memory barriers called out in

The common-case RCU primitives are unconditional. They are invoked, they

guarantee was reverse-engineered, not premeditated. The unconditional
Guaranteed Read-to-Write Upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

within an RCU read-side critical section. For example, that RCU
read-side critical section might search for a given data element, and
then might acquire the update-side spinlock in order to update that
element, all while remaining in that RCU read-side critical section. Of
course, it is necessary to exit the RCU read-side critical section

+-----------------------------------------------------------------------+
| But how does the upgrade-to-write operation exclude other readers? |
+-----------------------------------------------------------------------+
This guarantee allows lookup code to be shared between read-side and
update-side code, and was premeditated, appearing in the earliest

Fundamental Non-Requirements
----------------------------

RCU provides extremely lightweight readers, and its read-side

non-guarantees that have caused confusion. Except where otherwise noted,
these non-guarantees were premeditated.

#. `Grace Periods Don't Partition Read-Side Critical Sections`_
#. `Read-Side Critical Sections Don't Partition Grace Periods`_

Reader-side markers such as ``rcu_read_lock()`` and

through their interaction with the grace-period APIs such as
significant ordering constraints would slow down these fast-path APIs.

+-----------------------------------------------------------------------+
| completed instead of waiting only on pre-existing readers. For how |
+-----------------------------------------------------------------------+
Grace Periods Don't Partition Read-Side Critical Sections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is tempting to assume that if any part of one RCU read-side critical

read-side critical section follows that same grace period, then all of
the first RCU read-side critical section must precede all of the second.

partition the set of RCU read-side critical sections. An example of this

that the thread cannot be in the midst of an RCU read-side critical

.. kernel-figure:: GPpartitionReaders1.svg

If it is necessary to partition RCU read-side critical sections in this

the two RCU read-side critical sections cannot overlap, guaranteeing

This non-requirement was also non-premeditated, but became apparent when
Read-Side Critical Sections Don't Partition Grace Periods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also tempting to assume that if an RCU read-side critical section

.. kernel-figure:: ReadersPartitionGP1.svg

Again, an RCU read-side critical section can overlap almost all of a

period. As a result, an RCU read-side critical section cannot partition

+-----------------------------------------------------------------------+
| read-side critical section, would be required to partition the RCU |
| read-side critical sections at the beginning and end of the chain? |
+-----------------------------------------------------------------------+
Parallelism Facts of Life
-------------------------

completely futile. This is most obvious in preemptible user-level

hypervisor), but can also happen in bare-metal environments due to

but where “extremely long” is not long enough to allow wrap-around
when incrementing a 64-bit counter.

matters, RCU must use compiler directives and memory-barrier

writes and more-frequent concurrent writes will result in more

#. Counters are finite, especially on 32-bit systems. RCU's use of

   dyntick-idle nesting counter allows 54 bits for interrupt nesting
   level (this counter is 64 bits even on a 32-bit system). Overflowing
   this counter requires 2\ :sup:`54` half-interrupts on a given CPU
   without that CPU ever going idle. If a half-interrupt happened every

kernel in a single shared-memory environment. RCU must therefore pay
close attention to high-end scalability.
Quality-of-Implementation Requirements
--------------------------------------

These sections list quality-of-implementation requirements. Although an

inappropriate for industrial-strength production use. Classes of
quality-of-implementation requirements are as follows:

RCU is and always has been intended primarily for read-mostly
situations, which means that RCU's read-side primitives are optimized,
often at the expense of its update-side primitives. Experience thus far

#. Read-mostly data, where stale and inconsistent data is not a problem:

#. Read-mostly data, where data must be consistent: RCU works well.
#. Read-write data, where data must be consistent: RCU *might* work OK.

#. Write-mostly data, where data must be consistent: RCU is very

   a. Existence guarantees for update-friendly mechanisms.
   b. Wait-free read-side primitives for real-time use.

This focus on read-mostly situations means that RCU must interoperate
primitives be legal within RCU read-side critical sections, including

+-----------------------------------------------------------------------+
| These are forbidden within Linux-kernel RCU read-side critical |
| case, voluntary context switch) within an RCU read-side critical |
| read-side critical sections, and also within Linux-kernel sleepable |
| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In |
| addition, the -rt patchset turns spinlocks into sleeping locks so |
| locks!) may be acquired within -rt-Linux-kernel RCU read-side critical |
| Note that it *is* legal for a normal RCU read-side critical section |
+-----------------------------------------------------------------------+
inconsistency must be tolerated due to speed-of-light delays if nothing

For example, the translation from a user-level SystemV semaphore ID
to the corresponding in-kernel data structure is protected by RCU, but

by acquiring spinlocks located in the in-kernel data structure from
within the RCU read-side critical section, and this is indicated by the

Linux-kernel RCU implementations must therefore avoid unnecessarily

energy efficiency in battery-powered systems and on specific
energy-efficiency shortcomings of the Linux-kernel RCU implementation.
In my experience, the battery-powered embedded community will consider

mere Linux-kernel-mailing-list posts are insufficient to vent their ire.

`bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory
footprint is critically important on single-CPU systems with
non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny

was born. Josh Triplett has since taken over the small-memory banner

unsurprising. For example, in keeping with RCU's read-side

Similarly, in non-preemptible environments, ``rcu_read_lock()`` and

In preemptible environments, in the case where the RCU read-side

highest-priority real-time process), ``rcu_read_lock()`` and

should not contain atomic read-modify-write operations, memory-barrier

branches. However, in the case where the RCU read-side critical section

interrupts. This is why it is better to nest an RCU read-side critical
section within a preempt-disable region than vice versa, at least in

degrading real-time latencies.
The ``synchronize_rcu()`` grace-period-wait primitive is optimized for

addition to the duration of the longest RCU read-side critical section.

they can be satisfied by a single underlying grace-period-wait

single grace-period-wait operation to serve more than `1,000 separate
… <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-s…
of ``synchronize_rcu()``, thus amortizing the per-invocation overhead
down to nearly zero. However, the grace-period optimization is also
required to avoid measurable degradation of real-time scheduling and

In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are

used instead, reducing the grace-period latency down to a few tens of
microseconds on small systems, at least in cases where the RCU read-side

to impose modest degradation of real-time latency on non-idle online

scheduling-clock interrupt.

``synchronize_rcu_expedited()``'s reduced grace-period latency is
   25 call_rcu(&p->rh, remove_gp_cb);

lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()``

be legal, including within preempt-disable code, ``local_bh_disable()``
code, interrupt-disable code, and interrupt handlers. However, even

takes too long. Long-running operations should be relegated to separate

+-----------------------------------------------------------------------+
| Presumably the ``->gp_lock`` acquired on line 18 excludes any |
| after ``->gp_lock`` is released on line 25, which in turn means that |
+-----------------------------------------------------------------------+
open-coded it.

+-----------------------------------------------------------------------+
| definition would say that updates in garbage-collected languages |
+-----------------------------------------------------------------------+

be carried out in the meantime? The polling-style
In theory, delaying grace-period completion and callback invocation is

example, an infinite loop in an RCU read-side critical section must by

involved example, consider a 64-CPU system built with
``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where

corresponding CPU's next scheduling-clock.

indefinitely in the kernel without scheduling-clock interrupts, which

has been preempted within an RCU read-side critical section is

caution when changing them. Note that these forward-progress measures

invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has

#. Lifts callback-execution batch limits, which speeds up callback

overridden. Again, these forward-progress measures are provided only for

RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward
progress for ``rcu_nocbs`` CPUs is much less well-developed, in part

additional forward-progress work will be required.
part due to the collision of multicore hardware with object-oriented
techniques designed in single-threaded environments for single-threaded
use. And in theory, RCU read-side critical sections may be composed, and

real-world implementations of composable constructs, there are

``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when

the nesting-depth counter. For example, the Linux kernel's preemptible

practical purposes. That said, a consecutive pair of RCU read-side

grace period cannot be enclosed in another RCU read-side critical

within an RCU read-side critical section: To do so would result either
in deadlock or in RCU implicitly splitting the enclosing RCU read-side
critical section, neither of which is conducive to a long-lived and

example, many transactional-memory implementations prohibit composing a

a network receive operation). For another example, lock-based critical

In short, although RCU read-side critical sections are highly
read-side critical sections, perhaps even so intense that there was

read-side critical section in flight. RCU cannot allow this situation to
block grace periods: As long as all the RCU read-side critical sections

RCU read-side critical sections being preempted for long durations,
which has the effect of creating a long-duration RCU read-side critical

systems using real-time priorities are of course more vulnerable.

rates should not delay RCU read-side critical sections, although some
small read-side delays can occur when using

1990s, a simple user-level test consisting of ``close(open(path))`` in a

appreciation of the high-update-rate corner case. This test also

grace-period processing. This evasive action causes the grace period to
Software-Engineering Requirements
---------------------------------

splat if ``rcu_dereference()`` is used outside of an RCU read-side
critical section. Update-side code can use

that it has been invoked within an RCU read-side critical section. I

#. A given function might wish to check for RCU-related preconditions

assignment. To catch this sort of error, a given RCU-protected

complain about simple-assignment accesses to that pointer. Arnd

non-stack ``rcu_head`` structures must be initialized with

#. An infinite loop in an RCU read-side critical section will eventually

waiting on that particular RCU read-side critical section.

stall warnings are counter-productive during sysrq dumps and during

read-side critical sections, there is currently no good way of doing

#. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information

#. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()``

error-prone. Therefore, RCU-protected `linked

more recently, RCU-protected `hash

other special-purpose RCU-protected data structures are available in

This is not a hard-and-fast list: RCU's diagnostic capabilities will

real-world RCU usage.
Linux Kernel Complications
--------------------------

#. `Scheduling-Clock Interrupts and RCU`_

notable Linux-kernel complications. Each of the following sections

`remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.c…

used to do, it would create too many per-CPU kthreads. Although the

``task_struct`` is available and the boot CPU's per-CPU variables are
set up. The read-side primitives (``rcu_read_lock()``,

itself is a quiescent state and thus a grace period, so the early-boot
implementation can be a no-op.
reason is that an RCU read-side critical section might be preempted,

+-----------------------------------------------------------------------+
| grace-period mechanism. At runtime, this expedited mechanism relies |
| drives the desired expedited grace period. Because dead-zone |
+-----------------------------------------------------------------------+

I learned of these boot-time requirements as a result of a series of
The Linux kernel has interrupts, and RCU read-side critical sections are
legal within interrupt handlers and within interrupt-disabled regions of

Some Linux-kernel architectures can enter an interrupt handler from
non-idle process context, and then just never leave it, instead

“half-interrupts” mean that RCU has to be very careful about how it

way during a rewrite of RCU's dyntick-idle code.

The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side

update-side primitives, including ``call_rcu()``, are prohibited within

The name notwithstanding, some Linux-kernel architectures can have

me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__
one of its functions results in a segmentation fault. The module-unload
functions must therefore cancel any delayed calls to loadable-module

module unload request, we need some other way to deal with in-flight RCU

in-flight RCU callbacks have been invoked. If a module uses

the underlying module-unload code could invoke ``rcu_barrier()``

filesystem-unmount situation, and Dipankar Sarma incorporated

+-----------------------------------------------------------------------+
| complete, and ``rcu_barrier()`` must wait for each pre-existing |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| pre-existing callbacks, you will need to invoke both |
+-----------------------------------------------------------------------+
offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side

DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug

The Linux-kernel CPU-hotplug implementation has notifiers that are used

appropriately to a given CPU-hotplug operation. Most RCU operations may
be invoked from CPU-hotplug notifiers, including even synchronous
grace-period operations such as ``synchronize_rcu()`` and

However, all-callback-wait operations such as ``rcu_barrier()`` are also
not supported, due to the fact that there are phases of CPU-hotplug

after the CPU-hotplug operation ends, which could also result in
deadlock. Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations

invoked from a CPU-hotplug notifier.
RCU makes use of kthreads, and it is necessary to avoid excessive CPU-time

RCU's violation of it when running context-switch-heavy workloads when

context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is

scheduler's runqueue or priority-inheritance spinlocks across an

somewhere within the corresponding RCU read-side critical section.

nesting. The fact that interrupt-disabled regions of code act as RCU
read-side critical sections implicitly avoids earlier issues that used
The kernel needs to access user-space memory, for example, to access data
referenced by system-call parameters. The ``get_user()`` macro does this job.

However, user-space memory might well be paged out, which means that
``get_user()`` might well page-fault and thus block while waiting for the

reorder a ``get_user()`` invocation into an RCU read-side critical section.

   3 v = p->value;

   4 v = p->value;

state in the middle of an RCU read-side critical section. This misplaced
quiescent state could result in line 4 being a use-after-free access,

``p->value`` is not volatile, so the compiler would not have any reason to keep

Therefore, the Linux-kernel definitions of ``rcu_read_lock()`` and

of RCU read-side critical sections.
by people with battery-powered embedded systems. RCU therefore conserves

energy-efficiency requirement, so I learned of this via an irate phone

RCU read-side critical section on an idle CPU. (Kernels built with

whether or not it is currently legal to run RCU read-side critical

hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code.

order of a million deep, even on 32-bit systems, this should not be a

These energy-efficiency requirements have proven quite difficult to

clean-sheet rewrites of RCU's energy-efficiency code, the last of which

phone calls: Flaming me on the Linux-kernel mailing list was apparently
not sufficient to fully vent their ire at RCU's energy-efficiency bugs!
Scheduling-Clock Interrupts and RCU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel transitions between in-kernel non-idle execution, userspace

+-----------------+------------------+------------------+-----------------+
| ``HZ`` Kconfig  | In-Kernel        | Usermode         | Idle            |

|                 | scheduling-clock | scheduling-clock | RCU's           |
|                 | interrupt.       | interrupt and    | dyntick-idle    |
+-----------------+------------------+------------------+-----------------+

|                 | scheduling-clock | scheduling-clock | RCU's           |
|                 | interrupt.       | interrupt and    | dyntick-idle    |
+-----------------+------------------+------------------+-----------------+

|                 | on               | dyntick-idle     | dyntick-idle    |
|                 | scheduling-clock | detection.       | detection.      |
+-----------------+------------------+------------------+-----------------+
+-----------------------------------------------------------------------+
| Why can't ``NO_HZ_FULL`` in-kernel execution rely on the |
| scheduling-clock interrupt, just like ``HZ_PERIODIC`` and |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| necessarily re-enable the scheduling-clock interrupt on entry to each |
+-----------------------------------------------------------------------+
2151 scheduling-clock interrupt be enabled when RCU needs it to be:
2154 is non-idle, the scheduling-clock tick had better be running.
2156 (11-second) grace periods, with a pointless IPI waking the CPU from
2158 #. If a CPU is in a portion of the kernel that executes RCU read-side
2164 no-joking guaranteed to never execute any RCU read-side critical
2166 sort of thing is used by some architectures for light-weight
2173 fact joking about not doing RCU read-side critical sections.
2174 #. If a CPU is executing in the kernel with the scheduling-clock
2175 interrupt disabled and RCU believes this CPU to be non-idle, and if
2185 is usually OK) and the scheduling-clock interrupt is enabled, of
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what if my driver has a hardware interrupt handler that can run   |
| for many seconds?                                                     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| One approach is to exit and re-enter the interrupt handler every so   |
| often. But given that long-running interrupt handlers can cause other |
| problems, not least for real-time response, it is far better to avoid |
| them in the first place.                                              |
+-----------------------------------------------------------------------+
But as long as RCU is properly informed of kernel state transitions
between in-kernel execution, usermode execution, and idle, and as long
as the scheduling-clock interrupt is enabled when RCU needs it to be,
you can rest assured that the bugs you encounter will be in some other
part of RCU or some other part of the kernel!
Although small-memory non-realtime systems can simply use Tiny RCU, code
size is only one aspect of memory efficiency. Another aspect is the size
of the ``rcu_head`` structure used by ``call_rcu()`` and
``kfree_rcu()``. Although this structure contains nothing more than a
pair of pointers, it does appear in many RCU-protected data structures,
including some that are large. This need for memory efficiency is one
reason that RCU uses hand-crafted singly linked lists to track the
``rcu_head`` structures that are waiting for a grace period to elapse.
It is also the reason why ``rcu_head`` structures do not contain debug
information, such as fields tracking the file and line of the
``call_rcu()`` or ``kfree_rcu()`` that
posted them. Although this information might appear in debug-only kernel
builds at some point, in the meantime, the ``->func`` field will often
provide the needed debug information.
However, in some cases, the need for memory efficiency leads to even
more extreme measures. Returning to the ``page`` structure, the
``rcu_head`` field shares storage with a great many other structures
that are used at various points in the corresponding page's lifetime. In
order to correctly resolve certain `race
conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.inte…>`__,
the Linux kernel's memory-management subsystem needs a particular bit to
remain zero during all phases of grace-period processing, and that bit
happens to map to the bottom bit of the ``rcu_head`` structure's
``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is
used to post the callback, as opposed to ``kfree_rcu()`` or some future
“lazy” variant of ``call_rcu()`` that might one day be created for
energy-efficiency purposes.

That said, there are limits. RCU requires that the ``rcu_head``
structure be aligned to a two-byte boundary, and passing a misaligned
``rcu_head`` structure to one of the ``call_rcu()`` family of functions
will result in a splat. It is therefore necessary to exercise caution
when packing structures containing fields of type ``rcu_head``. Why not
a four-byte or even eight-byte alignment requirement? Because the m68k
architecture provides only two-byte alignment, and thus acts as
alignment's least common denominator.
The reason for reserving the bottom bit of pointers to ``rcu_head``
structures is to leave open the possibility of “lazy” callbacks whose
invocations can safely be deferred. Deferring invocation could
potentially have energy-efficiency benefits, but only if the rate of
non-lazy callbacks decreases significantly for some important workload.
In the meantime, reserving the bottom bit keeps this option open in case
it one day proves useful.
Expanding upon the earlier discussion, RCU is used heavily by hot code
paths in performance-critical portions of the Linux kernel's networking,
security, virtualization, and scheduling code paths. RCU must therefore
use efficient implementations, especially in its
read-side primitives. To that end, it would be good if preemptible RCU's
implementation of ``rcu_read_lock()`` could be inlined, however, doing
this requires resolving ``#include`` issues with the ``task_struct``
structure.

The Linux kernel supports hardware configurations with up to 4096 CPUs,
which means that RCU must be extremely scalable. Algorithms that involve
frequent acquisitions of global locks or frequent atomic operations on
global variables simply cannot be tolerated within the RCU
implementation. RCU therefore makes heavy use of a combining tree based
on the ``rcu_node`` structure. RCU is required to tolerate all CPUs
continuously invoking any combination of RCU's runtime primitives with
minimal per-operation overhead. In fact, in many cases, increasing load
must *decrease* the per-operation overhead, witness the batching
optimizations for ``synchronize_rcu()``, ``call_rcu()``,
``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general
rule, RCU must cheerfully accept whatever the rest of the Linux kernel
decides to throw at it.
The Linux kernel is used for real-time workloads, especially in
conjunction with the -rt patchset. The
real-time-latency response requirements are such that the traditional
approach of disabling preemption across RCU read-side critical sections
is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use
an RCU implementation that allows RCU read-side critical sections to be
preempted. This requirement made its presence known after users made it
clear that an earlier real-time patch did not meet their needs, in
conjunction with some RCU issues
encountered by a very early version of the -rt patchset.
In addition, RCU must make do with a sub-100-microsecond real-time
latency budget. In fact, on smaller systems with the -rt patchset, the
Linux kernel provides sub-20-microsecond real-time latencies for the
whole kernel, including RCU. RCU's scalability requirements come into
play here as well: Perhaps to no one's
surprise, the sub-100-microsecond real-time latency budget applies to
everything RCU does,
up to and including systems with 4096 CPUs. This real-time requirement
motivated the grace-period kthread, which also simplified handling of a
number of race conditions.

RCU must avoid degrading real-time response for CPU-bound threads,
whether executing in usermode (which is one use case for
``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in
the kernel must invoke ``cond_resched()`` at least once per few tens of
milliseconds in order to avoid receiving an IPI from RCU.
Finally, RCU's status as a synchronization primitive means that any RCU
failure can result in arbitrary memory corruption that can be extremely
difficult to debug. This means that RCU must be extremely reliable,
which in practice also means that RCU must have an aggressive
stress-test suite. This stress-test suite is called ``rcutorture``.
Although the need for ``rcutorture`` was no surprise, the current
immense popularity of the Linux kernel is posing interesting, and
perhaps unprecedented, validation challenges. To see this, keep in mind
that there are well over one billion instances of the Linux kernel
running today, given Android smartphones, Linux-powered televisions, and
servers, and this number can be expected to increase sharply with the
advent of the celebrated Internet of Things.

Suppose that RCU contains a race condition that manifests on average
once per million years of runtime. This bug will be occurring about
three times per *day* across the installed base. And anyone taking too
much comfort from the rarity of such a bug should consider the fact
that in many
jurisdictions, a successful multi-year test of a given mechanism, which
might include a Linux kernel, suffices for a number of types of
safety-critical certifications. In fact, rumor has it that the Linux
kernel is already being used in production for safety-critical
applications.
RCU Flavors
-----------

One of the more surprising things about RCU is that there are now no
fewer than five flavors, or API families. In addition, the primary
flavor that has been the sole focus up to this point has two different
implementations, non-preemptible and preemptible. The other four flavors
are listed below, with requirements for each described in a separate
section.

#. `Bottom-Half Flavor (Historical)`_
#. `Sched Flavor (Historical)`_
#. `Sleepable RCU`_
#. `Tasks RCU`_

Bottom-Half Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The RCU-bh flavor of RCU has since been expressed in terms of the other
RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
softirq and to be accounted for by lockdep. Much of the material in this
section is therefore strictly historical in nature.

The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations)
flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a
flavor of RCU that could withstand the network-based denial-of-service
attacks researched by Robert Olsson. These attacks placed so much
networking load on the system that some of the CPUs never exited softirq
execution, which in turn prevented those CPUs from ever executing a
context switch, which, in the RCU implementation of that time, prevented
grace periods from ever ending. The result was an out-of-memory
condition and a system hang.

The solution was the creation of RCU-bh, which does
``local_bh_disable()`` across its read-side critical sections, and which
uses the transition from one type of softirq processing to another as a
quiescent state in addition to context switch, idle, user mode, and
offline. This means that RCU-bh grace periods can complete even when
some of the CPUs execute in softirq indefinitely, thus allowing
algorithms based on RCU-bh to withstand network-based denial-of-service
attacks.

Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and
re-enable softirq handlers, any attempt to start a softirq handler
during the RCU-bh read-side critical section will be deferred. In this
case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can
take considerable time. One can of course argue that this softirq
overhead should be associated with the code following the RCU-bh
read-side critical section rather than ``rcu_read_unlock_bh()``, but the
fact is that most profiling tools cannot be expected to make this sort
of fine distinction. For example, suppose that a three-millisecond-long
RCU-bh read-side critical section executes during a time of heavy
networking load. There will very likely be an attempt to invoke at least
one softirq handler during that three milliseconds, but any such
invocation will be delayed until the time of the
``rcu_read_unlock_bh()``. This can of course make it appear at first
glance as if ``rcu_read_unlock_bh()`` was executing very slowly.

The `RCU-bh
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``,
``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``,
``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``,
``call_rcu_bh()``, ``rcu_barrier_bh()``, and
``rcu_read_lock_bh_held()``. However, the update-side APIs are now
simple wrappers for other RCU flavors, namely RCU-sched in
CONFIG_PREEMPT=n kernels and RCU-preempt otherwise.
Sched Flavor (Historical)
~~~~~~~~~~~~~~~~~~~~~~~~~

The RCU-sched flavor of RCU has since been expressed in terms of the
other RCU flavors as part of a consolidation of the three flavors into a
single flavor. The read-side API remains, and continues to disable
preemption and to be accounted for by lockdep. Much of the material in
this section is therefore strictly historical in nature.

Before preemptible RCU, waiting for an RCU grace period had the side
effect of also waiting for all pre-existing interrupt and NMI handlers.
However, there are legitimate preemptible-RCU implementations that do
not have this property, given that any point in the code outside of an
RCU read-side critical section can be a quiescent state. Therefore,
*RCU-sched* was created, which follows “classic” RCU in that an
RCU-sched grace period waits for pre-existing interrupt and NMI
handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and
RCU-sched APIs have identical implementations, while kernels built with
``CONFIG_PREEMPT=y`` provide a separate implementation for each.

Note well that in ``CONFIG_PREEMPT=y`` kernels,
``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and
re-enable preemption, respectively. This means that if there was a
preemption attempt during the RCU-sched read-side critical section,
``rcu_read_unlock_sched()`` will enter the scheduler, with all the
latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``,
this can make it look as if ``rcu_read_unlock_sched()`` was executing
very slowly. However, the highest-priority task won't be preempted, so
that task will enjoy low-overhead ``rcu_read_unlock_sched()``
invocations.
The `RCU-sched
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``,
``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``,
``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``,
``synchronize_sched()``, ``call_rcu_sched()``, ``rcu_barrier_sched()``,
and ``rcu_read_lock_sched_held()``. However, anything that disables
preemption also marks an RCU-sched read-side critical section, including
``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and
``local_irq_restore()``, and so on.
Sleepable RCU
~~~~~~~~~~~~~

For well over a decade, someone saying “I need to block within an RCU
read-side critical section” was a reliable indication that this someone
did not understand RCU. After all, if you are always blocking in an RCU
read-side critical section, you can probably afford to use a
higher-overhead synchronization mechanism. However, that changed with
the advent of the Linux kernel's notifiers, whose RCU read-side critical
sections almost never sleep, but sometimes need to. This resulted in the
introduction of sleepable RCU, or *SRCU*.

SRCU allows different domains to be defined, with each such domain
defined by an instance of an ``srcu_struct`` structure. A pointer to
this structure must be passed in to each SRCU function, for example,
``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct``
structure. The key benefit of these domains is that a slow SRCU reader
in one domain does not delay an SRCU grace period in some other domain.
That said, one consequence of these domains is that read-side code must
pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for
example, as follows:

   ::

       1 int idx;
       2
       3 idx = srcu_read_lock(&ss);
       4 do_something();
       5 srcu_read_unlock(&ss, idx);
As noted above, it is legal to block within SRCU read-side critical
sections, however, with great power comes great responsibility. If you
block forever in one of a given domain's SRCU read-side critical
sections, then that domain's grace periods will also be blocked forever.
Of course, one good way to block forever is to deadlock, which can
happen if any operation in a given domain's SRCU read-side critical
section can wait, either directly or indirectly, for that domain's grace
period to elapse. For example, this results in a self-deadlock:

   ::

       1 int idx;
       2
       3 idx = srcu_read_lock(&ss);
       4 do_something();
       5 synchronize_srcu(&ss);
       6 srcu_read_unlock(&ss, idx);

However, if line 5 acquired a mutex that was held across a
``synchronize_srcu()`` for domain ``ss``, deadlock would still be
possible. Furthermore, if line 5 acquired a mutex that was held across a
``synchronize_srcu()`` for some other domain ``ss1``, and if an
``ss1``-domain SRCU read-side critical section acquired another mutex
that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock
would again be possible. Such a deadlock cycle could extend across an
arbitrarily large number of different SRCU domains. Again, with great
power comes great responsibility.
Unlike the other RCU flavors, SRCU read-side critical sections can run
on idle and even offline CPUs. This ability requires that
``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers,
which means that SRCU readers will run a bit slower than would RCU
readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API,
which, in combination with ``srcu_read_unlock()``, guarantees a full
memory barrier.

Also unlike other RCU flavors, ``synchronize_srcu()`` may not be
invoked from CPU-hotplug notifiers, due to the fact that SRCU grace
periods make use of timers and the possibility of timers being
temporarily “stranded” on the outgoing CPU. This stranding of timers
means that timers posted to the outgoing CPU will not fire until late in
the CPU-hotplug process. The problem is that if a notifier is waiting on
an SRCU grace period, that grace period is waiting on a timer, and that
timer is stranded on the outgoing CPU, then the notifier will never be
awakened, in other words, deadlock has occurred. This same situation of
course also prohibits ``srcu_barrier()`` from being invoked from
CPU-hotplug notifiers.
SRCU also differs from other RCU flavors in that SRCU's expedited and
non-expedited grace periods are implemented by the same mechanism. This
means that in the current SRCU implementation, expediting a future grace
period has the side effect of expediting all prior grace periods that
have not yet completed.

As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a
locking bottleneck present in prior kernel versions. Although this will
allow users to put much heavier stress on ``call_srcu()``, it is
important to note that SRCU does not yet take any special steps to deal
with callback flooding.

The `SRCU
API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__
includes ``srcu_read_lock()``, ``srcu_read_unlock()``,
``srcu_dereference()``, ``srcu_dereference_check()``,
``synchronize_srcu()``, ``synchronize_srcu_expedited()``,
``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It
also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and
``init_srcu_struct()`` APIs for defining and initializing
``srcu_struct`` structures.
Tasks RCU
~~~~~~~~~

Some forms of tracing use “trampolines” to handle the binary rewriting
required to install different types of probes. It would be good to be
able to free old trampolines, which sounds like a job for some form of
RCU. However, because it is necessary to be able to install a trace
anywhere in the code, it is not possible to use read-side markers such
as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does
not work to have these markers in the trampoline itself, because there
would need to be instructions following ``rcu_read_unlock()``. Although
``synchronize_rcu()`` would guarantee that execution reached the
``rcu_read_unlock()``, it would not be able to guarantee that execution
had completely left the trampoline. Worse yet, in some situations the
decision to free a given
trampoline would be pre-ordained a surprisingly long time before execution
had finished with that trampoline.

The solution, in the form of `Tasks
RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side
critical sections that are delimited by voluntary context switches, that
is, calls to ``schedule()`` and friends. In addition, transitions to and
from
userspace execution also delimit tasks-RCU read-side critical sections.

The tasks-RCU API is quite compact, consisting only of
``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and
``rcu_barrier_tasks()``.
Possible Future Changes
-----------------------
One of the tricks that RCU uses to attain update-side scalability is to
increase grace-period latency with increasing numbers of CPUs. If this
becomes a serious problem, it will be necessary to rework the
grace-period state machine so as to avoid the need for the additional
latency.

RCU disables CPU hotplug in a few places, perhaps most notably in the
``rcu_barrier()`` operations. If there is a strong reason to use
``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to
avoid disabling CPU hotplug. This would introduce some complexity, so
there had better be a *very* good reason.
The tradeoff between grace-period latency on the one hand and
interruptions of other CPUs on the other hand may need to be
re-examined. The desire is of course for zero grace-period latency as
well as zero interprocessor interrupts undertaken during an expedited
grace-period operation. While this ideal is unlikely to be fully
realized, it is quite possible that further improvements can be made.

The multiprocessor implementations of RCU use a combining tree that
groups CPUs so as to reduce lock contention and increase cache locality.
However, this combining tree does not spread its memory across NUMA
nodes nor does it align the CPU groups with hardware features such as
sockets or cores. Such spreading and alignment is currently believed to
be unnecessary because the hotpath read-side primitives do not access
the combining tree, nor does ``call_rcu()`` in the common case. If you
believe that your architecture needs such spreading and alignment, then
your architecture should also benefit from the
``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the
number of CPUs in a socket, NUMA node, or whatever. More flexible
arrangements might be considered, but only if this boot parameter has
proven inadequate, and only if the inadequacy has been demonstrated by a
carefully run and realistic system-level workload.
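For example (the value here is purely illustrative, not a
recommendation), a system built from 16-CPU sockets might align the leaf
fanout to the socket size by adding the following to the kernel boot
command line:

```
rcutree.rcu_fanout_leaf=16
```

As the text notes, such tuning should only be kept if it shows a
measurable benefit on a realistic workload.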
Additional work may be required to provide reasonable forward-progress
guarantees under heavy load for grace periods and for callback
invocation.
2666 -------
2674 ---------------