1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
5 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
16 <li> <a href="#Data-Structure Relationships">
17 Data-Structure Relationships</a>
28 <li> <a href="#RCU-Specific Fields in the task_struct Structure">
29 RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
34 <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
50 one for each possible CPU.
55 which results in a three-level <tt>rcu_node</tt> tree.
59 </p><p>The purpose of this combining tree is to allow per-CPU events
60 such as quiescent states, dyntick-idle transitions,
61 and CPU hotplug operations to be processed efficiently
63 Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
64 and other events are recorded by the leaf-level <tt>rcu_node</tt>
69 A grace period can be completed at the root once every CPU
70 has reported a quiescent state.
75 </p><p>As can be seen from the diagram, on a 64-bit system
76 a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
87 Because there are more types of events that affect the leaf-level
90 64, the contention on these structures' <tt>->lock</tt> fields
98 that the fanout for the non-leaf <tt>rcu_node</tt> structures
103 contention problems on the non-leaf <tt>rcu_node</tt> structures,
105 parameter to reduce the non-leaf fanout as needed.
118 a 32-bit system), then RCU will automatically add more levels to the
120 For example, if you are crazy enough to build a 64-bit system with 65,536
125 </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
127 32-bit systems.
130 in a 16-CPU test using a 4-level tree.
131 This can be useful for testing large-system capabilities on small test
134 </p><p>This multi-level combining tree allows us to get most of the
136 benefits of partitioning, even though RCU grace-period detection is
138 The trick here is that only the last CPU to report a quiescent state
141 This means that at the leaf-level <tt>rcu_node</tt> structure, only
142 one access out of sixteen will progress up the tree.
143 For the internal <tt>rcu_node</tt> structures, the situation is even
144 more extreme: Only one access out of sixty-four will progress up
148 No matter how many CPUs there are in the system, at most 64 quiescent-state
149 reports per grace period will make it to the root <tt>rcu_node</tt>
150 structure, as illustrated by the sketch below.
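<p>The following is a minimal sketch of this funneling, in the spirit of
the kernel's <tt>rcu_report_qs_rnp()</tt> but with all <tt>->lock</tt>
acquisition, grace-period-number checking, and wakeups omitted:</p>

<blockquote>
<pre>
/*
 * Simplified sketch only: the real rcu_report_qs_rnp() must also
 * acquire each rcu_node's ->lock and validate grace-period numbers.
 */
static void sketch_report_qs(struct rcu_node *rnp, unsigned long mask)
{
	for (;;) {
		rnp->qsmask &= ~mask;	/* Clear the reporter's bit. */
		if (rnp->qsmask != 0)
			return;		/* Reports still pending here. */
		if (rnp->parent == NULL)
			break;		/* Root reached. */
		mask = rnp->grpmask;	/* Our bit in the parent's mask. */
		rnp = rnp->parent;	/* Only the last reporter ascends. */
	}
	/* All CPUs have reported: the grace period can now end. */
}
</pre>
</blockquote>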
180 CPU, although a few can be read and written by other CPUs.
188 read-side performance.
201 tracks grace periods, serves as short-term repository
202 for callbacks orphaned by CPU-hotplug events,
204 tracks expedited grace-period state,
205 and maintains state used to force quiescent states when
208 tree that propagates quiescent-state
209 information from the leaves to the root and also propagates
210 grace-period information from the root to the leaves.
211 It provides local copies of the grace-period state in order
217 RCU read-side critical section.
220 per-<tt>rcu_node</tt> priority-boosting
222 Finally, it records CPU-hotplug state in order to determine
224 <li> <tt>rcu_data</tt>: This per-CPU structure is the
225 focus of quiescent-state detection and RCU callback queuing.
227 <tt>rcu_node</tt> structure to allow more-efficient
228 propagation of quiescent states up the <tt>rcu_node</tt>
231 copy of the grace-period information to allow lock-free
233 access to this information from the corresponding CPU.
234 Finally, this structure records past dyntick-idle state
235 for the corresponding CPU and also tracks statistics.
240 within the RCU-protected data structure.
257 synchronize with CPU-hotplug events,
258 and maintains state used to force quiescent states when
301 <tt>->node[]</tt> array as shown in the following figure:
306 breadth-first traversal of the tree is implemented as a simple
311 </p><p>Each entry of the <tt>->level</tt> array references
322 </p><p>For whatever it is worth, if you draw the tree to be tree-shaped
323 rather than array-shaped, it is easy to draw a planar representation:
327 </p><p>Finally, the <tt>->rda</tt> field references a per-CPU
328 pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
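<p>An abridged sketch of the fields just described (field types and
ordering approximate, following <tt>kernel/rcu/tree.h</tt>; most fields
omitted) may help fix the layout in mind:</p>

<blockquote>
<pre>
struct rcu_state {
	struct rcu_node node[NUM_RCU_NODES];	/* Tree, root first, breadth first. */
	struct rcu_node *level[RCU_NUM_LVLS];	/* First rcu_node of each level. */
	struct rcu_data __percpu *rda;		/* Per-CPU rcu_data structures. */
	/* ... grace-period, callback, and diagnostic fields omitted ... */
};
</pre>
</blockquote>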
333 <h5>Grace-Period Tracking</h5>
343 the <tt>->gp_seq</tt> field contains the current grace-period
344 sequence number.
347 In other words, if the bottom two bits of <tt>->gp_seq</tt> are
348 zero, then RCU is idle.
351 <tt>->lock</tt> field.
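<p>The kernel manipulates this encoding with small helpers along the
following lines (compare <tt>rcu_seq_ctr()</tt> and
<tt>rcu_seq_state()</tt> in <tt>kernel/rcu/rcu.h</tt>; this is a sketch
of the encoding rather than a verbatim copy):</p>

<blockquote>
<pre>
#define RCU_SEQ_CTR_SHIFT	2
#define RCU_SEQ_STATE_MASK	((1 << RCU_SEQ_CTR_SHIFT) - 1)

/* Grace-period count: ->gp_seq with the state bits stripped off. */
static inline unsigned long rcu_seq_ctr(unsigned long s)
{
	return s >> RCU_SEQ_CTR_SHIFT;
}

/* Nonzero if a grace period is in progress, zero if RCU is idle. */
static inline int rcu_seq_state(unsigned long s)
{
	return s & RCU_SEQ_STATE_MASK;
}
</pre>
</blockquote>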
353 </p><p>There are <tt>->gp_seq</tt> fields
374 <p>The <tt>->gp_max</tt> field tracks the duration of the longest
375 grace period in jiffies.
376 It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>.
378 <p>The <tt>->name</tt> and <tt>->abbr</tt> fields distinguish
380 and non-preemptible RCU (&ldquo;rcu_sched&rdquo; and &ldquo;s&rdquo;).
387 tree that propagates quiescent-state
388 information from the leaves to the root and also propagates
389 grace-period information from the root down to the leaves.
390 They provide local copies of the grace-period state in order
396 RCU read-side critical section.
399 per-<tt>rcu_node</tt> priority-boosting
401 Finally, they record CPU-hotplug state in order to determine
421 <p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
422 structure one level up in the tree, and is <tt>NULL</tt>
423 for the root <tt>rcu_node</tt> structure.
424 RCU uses this field to propagate quiescent
425 states up the tree.
426 The <tt>->level</tt> field gives the level in the tree, with
427 the root at level zero, its children at level one, and so on.
428 The <tt>->grpnum</tt> field gives this node's position within
429 the child array of its parent, so this number can range between 0 and 31
430 on 32-bit systems and between 0 and 63 on 64-bit systems.
431 The <tt>->level</tt> and <tt>->grpnum</tt> fields are
432 used only during initialization and for tracing.
433 The <tt>->grpmask</tt> field is the bitmask counterpart of
434 <tt>->grpnum</tt>, and therefore always has exactly one bit set.
437 Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
438 contain the lowest and highest numbered CPU served by this
439 <tt>rcu_node</tt> structure, respectively.
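<p>An abridged sketch of these fields (types approximate, following
<tt>kernel/rcu/tree.h</tt>; most other fields omitted):</p>

<blockquote>
<pre>
struct rcu_node {
	struct rcu_node *parent;	/* NULL at the root. */
	u8 level;			/* Root is at level zero. */
	u8 grpnum;			/* Position in parent's child set. */
	unsigned long grpmask;		/* 1UL << grpnum, for parent's masks. */
	int grplo;			/* Lowest-numbered CPU covered. */
	int grphi;			/* Highest-numbered CPU covered. */
	/* ... */
};
</pre>
</blockquote>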
460 <h5>Grace-Period Tracking</h5>
470 <p>The <tt>rcu_node</tt> structures' <tt>->gp_seq</tt> fields are
471 the counterparts of the field of the same name in the
472 <tt>rcu_state</tt> structure.
473 They each may lag up to one step behind their
474 <tt>rcu_state</tt> counterpart.
475 If the bottom two bits of a given <tt>rcu_node</tt> structure's
476 <tt>->gp_seq</tt> field are zero, then this <tt>rcu_node</tt>
477 structure believes that RCU is idle.
482 <p>The <tt>->gp_seq_needed</tt> fields record the
483 furthest-in-the-future grace period request seen by the corresponding
484 <tt>rcu_node</tt> structure. The request is considered fulfilled when
485 the value of the <tt>->gp_seq</tt> field equals or exceeds that of
486 the <tt>->gp_seq_needed</tt> field.
494 Won't wrapping of the <tt>->gp_seq</tt> field cause
499 No, because if the <tt>->gp_seq_needed</tt> field lags behind the
500 <tt>->gp_seq</tt> field, the <tt>->gp_seq_needed</tt> field
501 will be updated at the end of the grace period.
502 Modulo-arithmetic comparisons therefore will always get the
503 correct answer.
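<p>The wrap-tolerant comparisons are carried out by macros along these
lines (the kernel defines <tt>ULONG_CMP_GE()</tt> and
<tt>ULONG_CMP_LT()</tt> for exactly this purpose):</p>

<blockquote>
<pre>
/*
 * Wrap-tolerant unsigned comparisons: these give the correct answer
 * as long as the two values are within ULONG_MAX / 2 of each other.
 */
#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))
#define ULONG_CMP_LT(a, b)	(ULONG_MAX / 2 < (a) - (b))
</pre>
</blockquote>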
508 <h5>Quiescent-State Tracking</h5>
510 <p>These fields manage the propagation of quiescent states up the
523 <p>The <tt>->qsmask</tt> field tracks which of this
524 <tt>rcu_node</tt> structure's children still need to report
525 quiescent states for the current normal grace period.
530 Similarly, the <tt>->expmask</tt> field tracks which
531 of this <tt>rcu_node</tt> structure's children still need to report
532 quiescent states for the current expedited grace period.
535 expedited implementation accepts extreme CPU overhead to obtain
536 much lower grace-period latency, for example, consuming a few
537 tens of microseconds worth of CPU time to reduce grace-period
539 The <tt>->qsmaskinit</tt> field tracks which of this
540 <tt>rcu_node</tt> structure's children cover for at least
541 one online CPU.
542 This mask is used to initialize <tt>->qsmask</tt>,
543 and <tt>->expmaskinit</tt> is used to initialize
544 <tt>->expmask</tt> at the beginning of the
556 Lockless grace-period computation! Such a tantalizing possibility!
563 <li> <font color="ffffff">CPU 0 has been in dyntick-idle
569 <li> <font color="ffffff">Meanwhile, CPU 1 is running
571 and notices that CPU 0 has been in dyntick-idle mode,
574 <li> <font color="ffffff">CPU 0's scheduling clock
576 middle of an RCU read-side critical section, and notices
581 <li> <font color="ffffff">CPU 0's softirq handler
586 <li> <font color="ffffff">But CPU 1 beats it to the punch,
590 <li> <font color="ffffff">CPU 0 now reports its quiescent
593 That grace period might now end before the RCU read-side
601 grace-period sequence number in <tt>->gp_seq</tt>.
606 <h5>Blocked-Task Management</h5>
609 midst of their RCU read-side critical sections, and these tasks
612 in a separate article on RCU read-side processing.
623 <p>The <tt>->blkd_tasks</tt> field is a list header for
624 the list of blocked and preempted tasks.
625 As tasks undergo context switches within RCU read-side critical
627 (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt>
628 field) onto the head of the <tt>->blkd_tasks</tt> list for the
629 leaf <tt>rcu_node</tt> structure corresponding to the CPU
631 As these tasks later exit their RCU read-side critical sections,
638 That pointer is stored in <tt>->gp_tasks</tt> for normal
639 grace periods and in <tt>->exp_tasks</tt> for expedited
645 removes itself from the <tt>->blkd_tasks</tt> list,
651 all hard-affinitized to the largest-numbered CPU in the system.
652 Then if task T1 blocked in an RCU read-side
654 then task T2 blocked in an RCU read-side critical section,
656 in an RCU read-side critical section, then the state of the
657 last leaf <tt>rcu_node</tt> structure's blocked-task list
668 <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
672 The <tt>->wait_blkd_tasks</tt> field indicates whether or not
673 the current grace period is waiting on a blocked task.
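<p>A heavily simplified sketch of the enqueuing step follows; the
kernel's real <tt>rcu_preempt_ctxt_queue()</tt> must also handle
expedited grace periods, list-position subtleties, and locking, all of
which are omitted here:</p>

<blockquote>
<pre>
/* Sketch: queue a task preempted within an RCU read-side critical
 * section onto its CPU's leaf rcu_node structure (locking omitted). */
static void sketch_queue_blocked_task(struct task_struct *t,
				      struct rcu_node *rnp,
				      unsigned long grpmask)
{
	list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
	t->rcu_blocked_node = rnp;

	/* If this CPU still owes the current grace period a quiescent
	 * state, this task now blocks that grace period. */
	if (!rnp->gp_tasks && (rnp->qsmask & grpmask))
		rnp->gp_tasks = &t->rcu_node_entry;
}
</pre>
</blockquote>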
678 C-preprocessor expressions as follows:
750 is currently limited to four, as specified by lines 21-24
752 For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
754 For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
756 This four-level tree also allows kernels built with
763 tree permits better testing of RCU's combining-tree code.
766 are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
775 the <tt>->qsmask</tt> field on a 64-bit system, results in
777 <tt>->lock</tt> fields.
783 Lines 11-19 perform this computation.
785 </p><p>Lines 21-24 compute the maximum number of CPUs supported by
786 a single-level (which contains a single <tt>rcu_node</tt> structure),
787 two-level, three-level, and four-level <tt>rcu_node</tt> tree,
795 C-preprocessor variables, respectively.
797 </p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
798 statement spanning lines 26-66 that computes the number of
802 C-preprocessor variable by lines 27, 35, 44, and 54.
810 46-47, and 56-58.
811 Lines 31-33, 40-42, 50-52, and 62-63 create initializers
812 for lockdep lock-class names.
813 Finally, lines 64-66 produce an error if the maximum number of
814 CPUs is too large for the specified fanout.
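<p>A skeletal version of this computation, abbreviated to the one- and
two-level cases and following the <tt>RCU_FANOUT_1</tt> through
<tt>RCU_FANOUT_4</tt> pattern used by <tt>kernel/rcu/tree.h</tt>
(details approximate):</p>

<blockquote>
<pre>
/* Maximum CPUs coverable by one- and two-level trees. */
#define RCU_FANOUT_1	(CONFIG_RCU_FANOUT_LEAF)
#define RCU_FANOUT_2	(RCU_FANOUT_1 * CONFIG_RCU_FANOUT)

#if NR_CPUS <= RCU_FANOUT_1
#  define RCU_NUM_LVLS	1
#  define NUM_RCU_NODES	1	/* The root is the only rcu_node. */
#elif NR_CPUS <= RCU_FANOUT_2
#  define RCU_NUM_LVLS	2
#  define NUM_RCU_LVL_1	DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
#  define NUM_RCU_NODES	(1 + NUM_RCU_LVL_1)	/* Root plus leaves. */
#endif /* Three- and four-level cases follow the same pattern. */
</pre>
</blockquote>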
847 grace period is current, hence the <tt>->gp_seq</tt> field.
855 The <tt>->head</tt> pointer references the first callback or
856 is <tt>NULL</tt> if the list contains no callbacks.
858 Each element of the <tt>->tails[]</tt> array references the
859 <tt>->next</tt> pointer of the last callback in the corresponding
860 segment of the list, or the list's <tt>->head</tt> pointer if
866 This relationship between the <tt>->head</tt> pointer, the
867 <tt>->tails[]</tt> array, and the callbacks is shown in this
872 </p><p>In this figure, the <tt>->head</tt> pointer references the
873 first RCU callback in the list.
875 The <tt>->tails[RCU_DONE_TAIL]</tt> array element references
876 the <tt>->head</tt> pointer itself, indicating that none
878 The <tt>->tails[RCU_WAIT_TAIL]</tt> array element references callback
879 CB 2's <tt>->next</tt> pointer, which indicates that
880 CB 1 and CB 2 are both waiting on the current grace period,
881 give or take possible disagreements about exactly which grace
882 period is current.
883 The <tt>->tails[RCU_NEXT_READY_TAIL]</tt> array element
884 references the same RCU callback that <tt>->tails[RCU_WAIT_TAIL]</tt>
885 does, which indicates that there are no callbacks waiting on the
886 next RCU grace period.
887 The <tt>->tails[RCU_NEXT_TAIL]</tt> array element references
888 CB 4's <tt>->next</tt> pointer, indicating that all the
889 remaining RCU callbacks have not yet been assigned to an RCU grace
890 period.
891 Note that the <tt>->tails[RCU_NEXT_TAIL]</tt> array element
892 always references the last RCU callback's <tt>->next</tt> pointer
893 unless the callback list is empty, in which case it references
894 the <tt>->head</tt> pointer.
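<p>For reference, here is a condensed version of the structure being
described (compare <tt>include/linux/rcu_segcblist.h</tt>; comments
added, layout approximate):</p>

<blockquote>
<pre>
/* The four segments, in list order. */
#define RCU_DONE_TAIL		0	/* Ready to invoke. */
#define RCU_WAIT_TAIL		1	/* Waiting for current GP. */
#define RCU_NEXT_READY_TAIL	2	/* Waiting for next GP. */
#define RCU_NEXT_TAIL		3	/* Not yet assigned a GP. */
#define RCU_CBLIST_NSEGS	4

struct rcu_segcblist {
	struct rcu_head *head;			   /* First callback, or NULL. */
	struct rcu_head **tails[RCU_CBLIST_NSEGS]; /* End of each segment. */
	unsigned long gp_seq[RCU_CBLIST_NSEGS];	   /* GP numbers for segments. */
	long len;				   /* Callbacks queued. */
	long len_lazy;				   /* Of which, lazy ones. */
};
</pre>
</blockquote>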
898 <tt>->tails[RCU_NEXT_TAIL]</tt> array element: It can be <tt>NULL</tt>
899 when this list is <i>disabled</i>.
900 Lists are disabled when the corresponding CPU is offline or when
901 the corresponding CPU's callbacks are offloaded to a kthread,
909 </p><p>The <tt>->gp_seq[]</tt> array records grace-period
914 In particular, this allows CPUs that go idle for extended periods
918 </p><p>The <tt>->len</tt> counter contains the number of
919 callbacks in <tt>->head</tt>, and the
920 <tt>->len_lazy</tt> field contains the number of those callbacks that
924 <p><b>Important note</b>: It is the <tt>->len</tt> field that
925 determines whether or not there are callbacks associated with
926 this <tt>rcu_segcblist</tt> structure, <i>not</i> the <tt>->head</tt>
927 pointer.
928 The reason for this is that all the ready-to-invoke callbacks
929 (that is, those in the <tt>RCU_DONE_TAIL</tt> segment) are extracted
930 all at once at callback-invocation time (<tt>rcu_do_batch</tt>), due
931 to which <tt>->head</tt> may be set to NULL if there are no not-done
932 callbacks remaining in the <tt>rcu_segcblist</tt>.
933 If callback invocation must be postponed, for example, because a
934 high-priority process just woke up on this CPU, then the remaining
935 callbacks are placed back on the <tt>RCU_DONE_TAIL</tt> segment and
936 <tt>->head</tt> once again points to the start of the segment.
937 In short, the head field can briefly be <tt>NULL</tt> even though the
938 CPU has callbacks present the entire time.
939 Therefore, it is not appropriate to test the <tt>->head</tt> pointer
940 for <tt>NULL</tt>.
942 <p>In contrast, the <tt>->len</tt> and <tt>->len_lazy</tt> counts
943 are adjusted only after the corresponding callbacks have been invoked.
944 This means that the <tt>->len</tt> count is zero only if
945 the <tt>rcu_segcblist</tt> structure really is devoid of callbacks.
946 Of course, off-CPU sampling of the <tt>->len</tt> count requires
947 careful use of appropriate synchronization, for example, memory barriers.
954 <p>The <tt>rcu_data</tt> structure maintains the per-CPU state for the RCU subsystem.
956 CPU (and from tracing) unless otherwise stated.
958 focus of quiescent-state detection and RCU callback queuing.
960 <tt>rcu_node</tt> structure to allow more-efficient
961 propagation of quiescent states up the <tt>rcu_node</tt>
964 copy of the grace-period information to allow lock-free
966 access to this information from the corresponding CPU.
967 Finally, this structure records past dyntick-idle state
968 for the corresponding CPU and also tracks statistics.
979 1 int cpu;
985 <p>The <tt>->cpu</tt> field contains the number of the
986 corresponding CPU and the <tt>->mynode</tt> field references the
987 corresponding leaf <tt>rcu_node</tt> structure.
988 The <tt>->mynode</tt> field is used to propagate quiescent states
992 <p>The <tt>->grpmask</tt> field indicates the bit in
993 the <tt>->mynode->qsmask</tt> corresponding to this
994 <tt>rcu_data</tt> structure's CPU, and is also used when propagating
995 quiescent states.
996 The <tt>->beenonline</tt> flag is set whenever the corresponding
997 CPU comes online, which means that the debugfs tracing need not dump
1000 <h5>Quiescent-State and Grace-Period Tracking</h5>
1013 <p>The <tt>->gp_seq</tt> field is the counterpart of the field of the same
1015 <tt>->gp_seq_needed</tt> field is the counterpart of the field of the same
1020 arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
1021 will catch up upon exit from dyntick-idle mode).
1022 If the bottom two bits of a given <tt>rcu_data</tt> structure's
1023 <tt>->gp_seq</tt> are zero, then this <tt>rcu_data</tt>
1024 structure believes that RCU is idle.
1040 to carefully manage the numbers on a per-node basis.
1048 <p>The <tt>->cpu_no_qs</tt> flag indicates that the
1049 CPU has not yet passed through a quiescent state,
1050 while the <tt>->core_needs_qs</tt> flag indicates that the
1051 RCU core needs a quiescent state from the corresponding CPU.
1052 The <tt>->gpwrap</tt> field indicates that the corresponding
1053 CPU has remained idle for so long that the
1055 will cause the CPU to disregard the values of its counters on
1056 its next exit from idle.
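<p>A sketch of this overflow check, in the spirit of the kernel's
<tt>rcu_gpnum_ovf()</tt> (simplified):</p>

<blockquote>
<pre>
/* If this CPU slept through roughly ULONG_MAX / 4 grace periods, its
 * ->gp_seq snapshot can no longer be compared safely, so flag it. */
static void sketch_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
{
	if (ULONG_CMP_LT(rdp->gp_seq + ULONG_MAX / 4, rnp->gp_seq))
		WRITE_ONCE(rdp->gpwrap, true);
}
</pre>
</blockquote>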
1060 <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
1061 the same CPU that registered them.
1062 This is strictly a cache-locality optimization: callbacks can and
1064 After all, if the CPU that registered a given callback has gone
1082 <p>The <tt>->cblist</tt> structure is the segmented callback list
1084 The CPU advances the callbacks in its <tt>rcu_data</tt> structure
1086 The CPU detects the completion of an RCU grace period by noticing
1087 that the value of its <tt>rcu_data</tt> structure's
1088 <tt>->gp_seq</tt> field differs from that of its leaf
1089 <tt>rcu_node</tt> structure.
1090 This <tt>rcu_data</tt> structure's
1091 <tt>->gp_seq</tt> field is updated at the beginnings and ends of each
1095 The <tt>->qlen_last_fqs_check</tt> and
1096 <tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent
1097 states from <tt>call_rcu()</tt> and friends when callback
1100 </p><p>The <tt>->n_cbs_invoked</tt>,
1101 <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt>
1102 fields count the number of callbacks invoked,
1103 sent to other CPUs when this CPU goes offline,
1105 The <tt>->n_nocbs_invoked</tt> counter is used when the CPU's callbacks
1109 Finally, the <tt>->blimit</tt> counter is the maximum number of
1112 <h5>Dyntick-Idle Handling</h5>
1122 The <tt>->dynticks_snap</tt> field is used to take a snapshot
1123 of the corresponding CPU's dyntick-idle state when forcing
1124 quiescent states, and is therefore accessed from other CPUs.
1125 Finally, the <tt>->dynticks_fqs</tt> field is used to
1126 count the number of times this CPU is determined to be in
1127 dyntick-idle state, and is used for tracing and debugging purposes.
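<p>A simplified sketch of how such a snapshot is taken and interpreted,
following the kernel's <tt>rcu_dynticks_snap()</tt> and
<tt>rcu_dynticks_in_eqs()</tt> (memory-ordering commentary omitted):</p>

<blockquote>
<pre>
/* Take a fully ordered snapshot of a CPU's ->dynticks counter. */
static int sketch_dynticks_snap(struct rcu_data *rdp)
{
	return atomic_add_return(0, &rdp->dynticks);
}

/* An even value means the CPU is in dyntick-idle or user mode,
 * that is, in an extended quiescent state. */
static bool sketch_in_eqs(int snap)
{
	return !(snap & 0x1);
}
</pre>
</blockquote>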
1140 <p>These fields in the <tt>rcu_data</tt> structure maintain the per-CPU dyntick-idle
1141 state for the corresponding CPU.
1142 The fields may be accessed only from the corresponding CPU (and from tracing)
1145 <p>The <tt>->dynticks_nesting</tt> field counts the
1146 nesting depth of process execution, so that in normal circumstances
1147 this counter has value zero or one.
1148 NMIs, irqs, and tracers are counted by the <tt>->dynticks_nmi_nesting</tt>
1152 The initial transition from idle adds one, and nested transitions
1153 add two, so that a nesting level of five is represented by a
1154 <tt>->dynticks_nmi_nesting</tt> value of nine.
1156 of reasons why this CPU cannot be permitted to enter dyntick-idle
1157 mode, aside from process-level transitions.
1159 <p>However, it turns out that when running in non-idle kernel context,
1162 Therefore, whenever the <tt>->dynticks_nesting</tt> field is
1163 incremented up from zero, the <tt>->dynticks_nmi_nesting</tt> field
1164 is set to a large positive number, and whenever the
1165 <tt>->dynticks_nesting</tt> field is decremented down to zero,
1166 the <tt>->dynticks_nmi_nesting</tt> field is set to zero.
1169 <tt>->dynticks_nmi_nesting</tt> field every time the corresponding
1170 CPU enters the idle loop from process context.
1172 </p><p>The <tt>->dynticks</tt> field counts the corresponding
1173 CPU's transitions to and from either dyntick-idle or user mode, so
1174 that this counter has an even value when the CPU is in dyntick-idle
1176 user mode need to be counted for user mode adaptive-ticks support
1179 </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used
1180 to record the fact that the RCU core code would really like to
1181 see a quiescent state from the corresponding CPU, so much so that
1182 it is willing to call for heavy-weight dyntick-counter operations.
1183 This flag is checked by RCU's context-switch and <tt>cond_resched()</tt>
1184 code, which provide a momentary idle sojourn in response.
1186 </p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record
1187 the fact that the RCU core code needs a quiescent state from
1188 the corresponding CPU, with the various other fields indicating just how badly
1190 This flag is checked by RCU's context-switch path
1197 Why not simply combine the <tt>->dynticks_nesting</tt>
1198 and <tt>->dynticks_nmi_nesting</tt> counters into a
1199 single counter that just counts the number of reasons that
1200 the corresponding CPU is non-idle?
1206 from a made-up interrupt.
1211 <p>Additional fields are present for some special-purpose
1218 These structures are normally embedded within RCU-protected data
1230 <p>The <tt>->next</tt> field is used
1231 to link the <tt>rcu_head</tt> structures together in the lists
1232 within the <tt>rcu_data</tt> structures.
1233 The <tt>->func</tt> field is a pointer to the function
1237 However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt>
1238 field to record the offset of the <tt>rcu_head</tt>
1239 structure within the enclosing RCU-protected data structure.
1249 Given that the callback function <tt>->func</tt>
1250 is passed a pointer to the <tt>rcu_head</tt> structure, how is that
1251 function supposed to find the beginning of the
1252 enclosing RCU-protected data structure?
1256 In actual practice, there is a separate callback function per
1257 type of RCU-protected data structure.
1259 macro in the Linux kernel (or other pointer-manipulation facilities
1260 in other software environments) to find the beginning of the
1261 enclosing data structure.
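<p>For example, here is the classic pattern, using a made-up
<tt>struct foo</tt> for illustration:</p>

<blockquote>
<pre>
struct foo {
	int data;
	struct rcu_head rh;	/* Embedded in the protected structure. */
};

/* Callback: recover the enclosing struct foo and free it. */
static void foo_reclaim(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	kfree(fp);
}

/* After removing fp from readers' view: call_rcu(&fp->rh, foo_reclaim); */
</pre>
</blockquote>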
1266 <h3><a name="RCU-Specific Fields in the task_struct Structure">
1267 RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
1287 <p>The <tt>->rcu_read_lock_nesting</tt> field records the
1288 nesting level for RCU read-side critical sections, and
1289 the <tt>->rcu_read_unlock_special</tt> field is a bitmask
1292 The <tt>->rcu_node_entry</tt> field is used to form lists of
1293 tasks that have blocked within preemptible-RCU read-side critical
1294 sections and the <tt>->rcu_blocked_node</tt> field references
1295 the <tt>rcu_node</tt> structure on which the task blocked,
1296 or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
1297 read-side critical section.
1299 <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of
1300 voluntary context switches that this task has undergone since the
1301 beginning of the current tasks-RCU grace period,
1302 <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU
1303 grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt>
1304 is a list element enqueuing this task on the holdout list,
1305 and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
1306 idle task is running on, but only if the task is currently running,
1307 that is, if the CPU is currently idle.
1317   1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
1318   2 {
1319   3   return &rsp->node[0];
1320   4 }
1322   6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
1323   7   for ((rnp) = &(rsp)->node[0]; \
1324   8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
1326  10 #define rcu_for_each_leaf_node(rsp, rnp) \
1327  11   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
1328  12        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
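<p>As a usage sketch (the <tt>pr_info()</tt> diagnostics are made up
for illustration), callers iterate with a cursor pointer:</p>

<blockquote>
<pre>
struct rcu_node *rnp;

/* Visit every rcu_node, parents before children. */
rcu_for_each_node_breadth_first(rsp, rnp)
	pr_info("level %d: CPUs %d-%d\n",
		rnp->level, rnp->grplo, rnp->grphi);

/* Visit only the leaves, where per-CPU state is reported. */
rcu_for_each_leaf_node(rsp, rnp)
	pr_info("leaf covers CPUs %d-%d\n", rnp->grplo, rnp->grphi);
</pre>
</blockquote>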
1331 <p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
1332 first element of the specified <tt>rcu_state</tt> structure's
1333 <tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
1334 structure.
1338 The <tt>rcu_for_each_node_breadth_first()</tt> macro loops through the
1339 <tt>->node[]</tt> array, performing a breadth-first traversal by
1340 simply traversing the array in order.
1355 In the single-node case,
1367 Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
1368 state is tracked by dynticks-related fields in the <tt>rcu_data</tt> structure.
1378 for helping me get this document into a more human-readable state.