Explanation of the Linux-Kernel Memory Consistency Model

  7. THE PROGRAM ORDER RELATION: po AND po-loc
 10. THE READS-FROM RELATION: rf, rfi, and rfe
 12. THE FROM-READS RELATION: fr, fri, and fre
 14. PROPAGATION ORDER RELATION: cumul-fence
 20. THE HAPPENS-BEFORE RELATION: hb
 21. THE PROPAGATES-BEFORE RELATION: pb
 22. RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure.  This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.
BACKGROUND
----------

On a single CPU, the behavior of a program is easy to predict from its
source.  (Throughout this document we ignore complicating
factors such as DMA and mixed-size accesses.)  But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory, the range of possible behaviors is far less obvious; a memory
consistency model is what pins it down.
A SIMPLE EXAMPLE
----------------

Imagine a miniature device driver with a write() routine that stores
data into a shared buffer and then sets a flag, unless the buffer
is full.  Running concurrently on a different CPU might be a part of
the driver's read() routine, which checks the flag and then loads the
data.  In litmus-test form, P0() represents the write() routine
running on one
CPU and P1() represents the read() routine running on another.
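A minimal sketch of the pattern just described (the names buf and
flag, and the values used, are illustrative):

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);	/* store the data ... */
		WRITE_ONCE(flag, 1);	/* ... then announce it is ready */
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);		/* is the data ready? */
		if (r1)
			r2 = READ_ONCE(buf);	/* if so, fetch it */
	}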
This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" (MP)
pattern.  Whatever outcomes a memory model allows for it, each load
must obtain a value that was actually stored: no model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets written to the shared variables.
A SELECTION OF MEMORY MODELS
----------------------------

The strictest model is sequential consistency (SC).  Under SC, the
system behaves as
if each CPU executed its instructions in order but with unspecified
timing; the instructions from the various CPUs are interleaved in an
order consistent with the
program source for each CPU.  The model says that the value obtained
by each load is the value written by the most recent preceding
store to the same memory location, from any CPU.
Weaker models permit outcomes SC forbids.  The classic demonstration
is the "store buffering" pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location, as sketched below.
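The test, in litmus form (variable names illustrative):

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);	/* store to P0's location ... */
		r0 = READ_ONCE(y);	/* ... then load P1's */
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);	/* store to P1's location ... */
		r1 = READ_ONCE(x);	/* ... then load P0's */
	}

Sequential consistency forbids the outcome r0 = 0 && r1 = 0, but
models with store buffering (such as TSO) allow it: each CPU's store
can sit in its store buffer while the other CPU loads the old value.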
ORDERING AND CYCLES
-------------------
EVENTS
------
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
include both read events and write events.  Other parts of a program,
such as arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers, are not of central interest to the
memory model.  When a computed value gets stored, the model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.
THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The memory model treats each CPU's program as a linear sequence: the
order in which the
instructions are presented to the CPU's execution unit.  Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in this instruction stream.
This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po.

po-loc is a sub-relation of po.  It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).

Aside from this, there is not much about
program order we need to explain.  The LKMM was inspired by low-level
hardware memory models, and some of its terminology reflects that
heritage.

Note that, thanks to the READ_ONCE() and WRITE_ONCE() macros, the
compiler cannot reorder the marked accesses: P0's write event to
buf really is po-before its write event to flag, and similarly for the
other examples.

The model pays little attention to private variables such as r1; in
fact, they need not even be stored in normal memory at all -- in
principle a private variable could be stored in a CPU register (hence
the convention that their names begin with the letter "r").
A WARNING
---------
DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

A dependency always runs from a load to a po-later event, so the load
must come earlier in program order.  Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X.  It wouldn't make sense for a computation to depend
on a value that isn't loaded from shared memory until later in the
code!
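As a concrete illustration (hypothetical variables), here is one
dependency of each kind:

	int a = 0, b = 0;
	int x = 0;
	int *p = &x;

	P0()
	{
		int r1, r2, r3;
		int *q;

		r1 = READ_ONCE(a);
		WRITE_ONCE(b, r1 + 1);	/* data dependency: the loaded
					   value feeds the stored value */

		q = READ_ONCE(p);
		r2 = READ_ONCE(*q);	/* address dependency: the loaded
					   value determines the address */

		r3 = READ_ONCE(a);
		if (r3 != 0)
			WRITE_ONCE(b, 2);	/* control dependency: the
						   loaded value determines
						   whether the store occurs */
	}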
THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value obtained by the load is the value stored by the write.  We
write W ->rf R to indicate that the load R reads from the store W.  We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as
though it were written by an imaginary store which
executes on a separate CPU before the program runs.

The rf relation implicitly assumes that there is no possibility
of load-tearing, where a load obtains some of its bits from one store
and the rest of its bits from another store to the same location.  READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:
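Sketching the forbidden outcome (illustrative values):

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 0x1234);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
	}

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).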
On the other hand, load-tearing is unavoidable when mixed-size
accesses are used.
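The following sketch (reconstructing the sort of test being discussed,
with illustrative values) has P0 write a 32-bit location in two 16-bit
halves while P1 reads it with a single 32-bit load:

	union {
		u32 w;
		u16 h[2];
	} x = { .w = 0 };

	P0()
	{
		WRITE_ONCE(x.h[0], 0x1234);
		WRITE_ONCE(x.h[1], 0x5678);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x.w);
	}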
If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores.  It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt it; the model requires all accesses to be properly aligned and
of the location's full size.
CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Even on a
multi-processor system, the CPUs must share a consistent view of the
memory contents: all the stores to any one location are totally
ordered.  For each location x, this ordering of the stores to x is
called x's coherence order (from the
hardware-centric view, the order in which the stores get written to
x's cache line).  We write W ->co W' if W comes before W' in the
coherence order.  The coherence order must be consistent with program
order, which breaks down into four cases:
	Write-write coherence: If W ->po-loc W' (i.e., W comes before
	W' in program order and they access the same location), where W
	and W' are two stores, then W ->co W'.

	Write-read coherence: If W ->po-loc R, where W is a store and R
	is a load, then R must read from W or from some store which
	comes after W in the coherence order.

	Read-write coherence: If R ->po-loc W, where R is a load and W
	is a store, then the store which R reads from must come before
	W in the coherence order.

	Read-read coherence: If R ->po-loc R', where R and R' are two
	loads, then either they read from the same store or else the
	store read by R comes before the store read by R' in the
	coherence order.
(The LKMM also includes the
requirement that every store eventually becomes visible to every CPU.)
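To see the write-write coherence rule in action, consider this sketch
(illustrative values):

	int x;

	P0()
	{
		WRITE_ONCE(x, 17);
		WRITE_ONCE(x, 23);
	}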
If the final value of x were 17, it would violate the
write-write coherence rule: Since the store of 23 comes later in
program order than the store of 17, it must also come later in x's
coherence order, and so 23 must be x's final value.
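The read-write coherence rule can be illustrated the same way (the
value 666 is again illustrative):

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);
		WRITE_ONCE(x, 666);
	}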
If r1 = 666 at the end, this would violate the read-write coherence
rule: The load comes before the store in program order, so it must not
read from that store.
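And a sketch for the read-read coherence rule:

	int x = 0;

	P0()
	{
		WRITE_ONCE(x, 5);
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		r2 = READ_ONCE(x);
	}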
If r1 = 5 (reading from P0's store) while r2 = 0 (reading from x's
initial value), this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so the r2 load must not read from a
store that is co-earlier than the store the r1 load read from.
(Itanium is a partial exception: two loads from the same location can
behave as if they were reordered when they are
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another reason to use READ_ONCE() when accessing shared memory.)
Just like the po relation, co is inherently an ordering -- it is not
possible for a store to be co-before itself.  And just like with rf,
we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).  On
the other hand, stores to different locations are never linked by co,
even if they are
related by po.  Coherence order is strictly per-location, or if you
prefer, each location has its own coherence order.
THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.  It describes the situation where a load reads a value that gets
overwritten by a store.  In other words, we have R ->fr W when the
value that R reads is overwritten (in the coherence order) by the
store W.

As with rf and co, we subdivide fr into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that fr is not independent; it is determined entirely by the rf
and co relations.  Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W.  In symbols,

	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).
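A tiny sketch makes fr concrete (illustrative values):

	int x = 0;

	P0()
	{
		int r1;

		r1 = READ_ONCE(x);	/* R */
	}

	P1()
	{
		WRITE_ONCE(x, 1);	/* W */
	}

If r1 = 0 then R read x's initial value, which W overwrites in the
coherence order; hence R ->fr W (in fact R ->fre W, since R and W are
on different CPUs).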
AN OPERATIONAL MODEL
--------------------

For the most part, executing an instruction requires a CPU to perform
only a few simple steps beyond the memory access itself.

When CPU C executes a store instruction, it tells the memory subsystem
to store a specified value at a specified location.  The memory
subsystem propagates the store to all the other CPUs as well as to RAM
(as a
special case, we say that the store propagates to its own CPU at the
time it is executed), and it determines where the store falls in the
location's coherence order.  In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order.  If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R.  Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory; the memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.

In reality,
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache.  A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed.  On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model.  But Alpha's local caches
have a partitioned design that results in non-FIFO behavior.  We will
discuss this in more detail later.
Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence.  Still, fences have two effects.

First, a fence forces the CPU to execute various instructions in
program order.  Exactly which instructions are ordered depends on the
type of fence:

	Strong fences, including smp_mb() and synchronize_rcu(), force
	the CPU to execute all po-earlier instructions before any
	po-later instructions;

	smp_rmb() forces the CPU to execute all po-earlier loads
	before any po-later loads;

	smp_wmb() forces the CPU to execute all po-earlier stores
	before any po-later stores;

	Acquire fences, such as smp_load_acquire(), force the CPU to
	execute the load associated with the fence (e.g., the load
	part of an smp_load_acquire()) before any po-later
	instructions;

	Release fences, such as smp_store_release(), force the CPU to
	execute all po-earlier instructions before the store
	associated with the fence (e.g., the store part of an
	smp_store_release()).
Second, some types of fence affect the way the memory subsystem
propagates stores.  When a fence instruction is executed on CPU C:

	For each other CPU C', smp_wmb() forces all po-earlier stores
	on C to propagate to C' before any po-later stores do.

	For each other CPU C', any store which propagates to C before
	a release fence is executed (including all po-earlier
	stores executed on C) is forced to propagate to C' before the
	store associated with the release fence does.

	Any store which propagates to C before a strong fence is
	executed (including all po-earlier stores on C) is forced to
	propagate to all other CPUs before any instructions po-after
	the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence.  We describe this property by saying that release fences and
strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).
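Applied to the message-passing pattern from earlier, these rules show
why the following sketch (illustrative, with the usual buf/flag names)
is guaranteed to work:

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_wmb();		/* buf propagates everywhere
					   before flag does */
		WRITE_ONCE(flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = READ_ONCE(flag);
		smp_rmb();		/* the flag load executes before
					   the buf load */
		if (r1)
			r2 = READ_ONCE(buf);
	}

If r1 = 1 then flag's new value had propagated to P1's CPU, so buf's
new value had already arrived there too (thanks to the smp_wmb()), and
the smp_rmb() guarantees the buf load executes late enough to see it:
r2 must be 1.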
PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
defined to link memory access events E and F whenever:

	E and F are both stores on the same CPU and an smp_wmb() fence
	event occurs between them in program order; or

	F is a release fence and some X comes before F in program
	order, where either X = E or else E ->rf X; or

	A strong fence event occurs between some X and F in program
	order, where either X = E or else E ->rf X.

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' propagates to that CPU.
DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions the operational model
imposes, among them the requirement of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is.  The requirements fall into
several categories:

	Sequential consistency per variable: This requires that the
	system obey the four coherency rules.

	Atomicity: This requires that atomic read-modify-write
	operations really are atomic, that is, no other stores can
	sneak into the middle of such an update.

	Happens-before: This requires that certain instructions are
	executed in a specific order.

	Propagation: This requires that certain stores propagate to
	CPUs and to RAM in a specific order.

	Rcu: This requires that RCU read-side critical sections and
	grace periods obey the rules of RCU, in particular, the
	Grace-Period Guarantee.

The first two categories are common to many other
memory models (such as those for C11/C++11).  The "happens-before" and
"propagation" requirements have analogs in other memory models as
well, while the "rcu" category is specific to the LKMM.
SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

By cache coherence, the stores to any one variable form a global
ordering, and we can imagine inserting each load into this ordering
just after the store it reads from.  Loads that read from the same
store can be inserted in program order,
first for CPU 0, then CPU 1, etc.

The four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering.  The LKMM
expresses this by requiring that the union of these relations have no
cycles: it must not be possible to find events

	X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc.  This has to
hold if the accesses to each variable really can be arranged in a
single global order.
ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?  It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, the final value of x must be the initial
value plus 2.  We should never have the following sequence of events:

	CPU 0 loads x obtaining 13;
	CPU 1 loads x obtaining 13;
	CPU 0 stores 14 to x;
	CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.  To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.

In terms of the LKMM: if R and W are the read and write events making
up an
atomic read-modify-write and W' is the write event which R reads from,
then W must immediately follow W' in the coherence order.  Formally,

	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),

where the rmw relation links the read and write events making up each
atomic update.
THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------

There are many situations where a CPU is obligated to execute two
instructions in program order.  We amalgamate them into the ppo (for
"preserved program order") relation, which links the po-earlier
instruction to the po-later instruction and is thus a sub-relation of
po.

Fences are one source of ppo links.  Suppose X and Y are
memory accesses with X ->po Y; then the CPU must execute X before Y if
any of the following hold:

	A strong (smp_mb() or synchronize_rcu()) fence occurs between
	X and Y;

	X and Y are both stores and an smp_wmb() fence occurs between
	them;

	X and Y are both loads and an smp_rmb() fence occurs between
	them;

	X is also an acquire fence, such as smp_load_acquire();

	Y is also a release fence, such as smp_store_release();

	X and Y are both loads, X ->addr Y (i.e., there is an address
	dependency from X to Y), and X is a READ_ONCE() or an atomic
	access.
Dependencies are another source.  A data, address, or control
dependency from a load R to
a store W will force the CPU to execute R before W.  This is very
simply because the CPU cannot tell the memory subsystem about W's
store before it knows what value should be stored (data dependency),
what location it should be stored into (address dependency), or
whether the store should take place at all (control dependency).

Dependencies to loads are more problematic.  To begin with,
there is no such thing as a data dependency to a load.  Next, a CPU
has no reason to respect a control dependency to a load: it can
always satisfy the second load speculatively before the first and
discard the result if the speculation turns out to be wrong.
To be fair about it, all Linux-supported architectures do execute
loads in program order when there is an address dependency between
them.
After all, a CPU cannot ask the memory subsystem to load a value from
a particular location before it knows what that location is.  However,
the split-cache design used by Alpha can cause it to behave in a way
that looks as if the loads were executed out of order (see the next
section for more details).
The LKMM also has a ppo link for the case where a load R depends on a
store and a second, po-later load reads from that store:

	R ->dep W ->rfi R',

where the dep link is an address or data dependency.  In
this situation we know it is possible for the CPU to execute R' before
W, because it can forward the value that W will store to R'.  But it
cannot execute R' before R, because it cannot forward the value before
it knows what that value is, or that W and R' do access the same
location.  However, if there is merely a control dependency between R
and W then the CPU can speculatively forward W to R' before executing
R; if the speculation turns out to be wrong then the CPU merely has to
restart or discard R'.

(In theory, a CPU might forward a store to a load when it runs across
an address dependency, before the address is actually known; real CPUs
are not known to do this.)
Coherence also gives rise to ppo links.  If we have

	R ->po-loc W

(the po-loc link says that R comes before W in program order and they
access the same location), the CPU is obliged to execute W after R.
If it executed W first, the memory subsystem could respond to R's read
request with the value stored by W (or an even co-later store), in
violation of the read-write coherence rule.  Similarly, if we had

	W ->po-loc W'

and the CPU executed W' before W, then the memory subsystem would put
W' before W in the coherence order.  It would effectively cause W to
overwrite W', in violation of the write-write coherence rule.
(An early version of the model was more lenient,
allowing out-of-order writes like this to occur.  The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)

There is one last example of preserved program order: the case where
a load-acquire reads from an earlier store-release.  For example:
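(A minimal same-CPU sketch; the variable name and value are
illustrative.)

	smp_store_release(&x, 123);
	r1 = smp_load_acquire(&x);

If the smp_load_acquire() obtains the 123 value stored by the
smp_store_release(), the LKMM says the load must execute after the
store: the release/acquire pair may not be reordered, even though both
accesses are on the same CPU and forwarding might otherwise be
possible.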
AND THEN THERE WAS ALPHA
------------------------

The Alpha architecture merits special attention because its local
caches are partitioned, and the partitions can process incoming stores
out of order.  As a result, even when two stores propagate to another
CPU in the order required by an smp_wmb(), the first store's
value may not become available for P1's CPU to read until after the
second store's value is.
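The classic demonstration is pointer publication.  This sketch follows
the usual presentation (names and values illustrative); note the
deliberately plain loads in P1:

	int x = 0;
	int y = -1;
	int *ptr = &y;

	P0()
	{
		WRITE_ONCE(x, 1);
		smp_wmb();
		WRITE_ONCE(ptr, &x);	/* publish x */
	}

	P1()
	{
		int *r1;
		int r2;

		r1 = ptr;	/* plain load: may observe &x ... */
		r2 = *r1;	/* ... and yet obtain x's old value 0! */
	}

On Alpha, if the cache partition holding ptr on P1's CPU runs ahead of
the partition holding x, P1 can end up with r1 = &x but r2 = 0,
despite P0's smp_wmb() and despite the address dependency between
P1's two loads.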
To rule this out, the kernel includes a special fence (which is
nothing at all on non-Alpha builds) after every READ_ONCE() and atomic
load.  The effect of the fence is to cause the CPU not to execute any
po-later instructions until after the local cache has finished
processing all the stores it has already received.  Had P1 used
READ_ONCE() for its loads, the anomalous outcome would be forbidden.

The LKMM gives a similar property to smp_rmb() and to strong fences:
they cause
the CPU not to execute any po-later instructions (or po-later loads in
the case of smp_rmb()) until all outstanding stores have been
processed by
the local cache.  In the case of a strong fence, the CPU first has to
wait for all of its po-earlier stores to propagate to every other CPU
in the system; then it has to wait for the local cache to process all
the stores received as of that time -- not just the stores received
when the strong fence began.
THE HAPPENS-BEFORE RELATION: hb
-------------------------------

The happens-before relation (hb) links memory accesses that have to
execute in a certain order.

W ->rfe R implies that W and R are on different CPUs.  It also means
that W's store must have propagated to R's CPU before R executed;
otherwise R could not have read the value stored by W.  Therefore W
must have executed before R, and so we have W ->hb R.

The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
the same CPU).  As we have already seen, the operational model allows
W's value to be forwarded to R in that case, so R may well execute
before W does.

Nor does W have to execute before W' just because
W ->coe W'.  This means that W and W' are stores to the same location,
that they execute on different CPUs, and that W comes before W' in the
coherence order; none of that determines which store executed first.

Similarly, R ->fre W means that W overwrites the value which R reads, but it
doesn't mean that W has to execute after R.  All that's necessary is
for the memory subsystem not to propagate W to R's CPU until after R
has executed, which is possible if W executes on a different CPU.
The ppo relation readily provides hb links for
events that are on the same CPU.  However it is more difficult to
justify ordering an event
on CPU C in situations where a store from some other CPU comes after
that event in the coherence order; the auxiliary relation that
captures such cases is called prop.
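Here is a sketch of the kind of test involved (details illustrative):

	int x = 0, y = 0;

	P0()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		smp_rmb();
		r2 = READ_ONCE(y);
	}

	P1()
	{
		WRITE_ONCE(y, 1);
		smp_wmb();
		WRITE_ONCE(x, 2);
	}

Suppose r1 = 2 and r2 = 0.  Reading r2 = 0 means P0's load of y reads
a value overwritten by P1's store to y (an fre link); the smp_wmb() is
a cumul-fence from that store to P1's store to x, which P0's load of x
reads from (an rfe link).  This chain forms a prop link from P0's load
of y back to P0's load of x, so the load of y would have to execute
before the load of x.  But the smp_rmb() forces the load of x to
execute first, so the assumed outcome would require the load of y to
execute both before and after the load of x.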
Hence the outcome r1 = 2 && r2 = 0 is impossible -- as it should be.
In general, a prop link consists of a coe or fre link,
followed by an arbitrary number of cumul-fence links, ending with an
optional rfe link.  The cumul-fence steps do real work here; recall in
particular that strong fences and
release fences are A-cumulative: they also order stores that reached
the fence's CPU from elsewhere, which lets prop links pass through
chains of stores on multiple CPUs.

When a prop link starts and ends on the same CPU, it contributes to
hb.  The LKMM requires the hb relation to have no cycles -- an
instruction can never be required to execute before itself.  This
requirement is the content of the LKMM's "happens-before" axiom.
THE PROPAGATES-BEFORE RELATION: pb
----------------------------------

The propagates-before (pb) relation capitalizes on the special
features of strong fences.  It links two events E and F whenever some
store is coherence-later than E and propagates to every CPU and to RAM
before F executes.  The formal definition requires that E be linked to
F via a coe or fre link, an arbitrary number of cumul-fences, an
optional rfe link, a strong fence, and an arbitrary number of hb
links.  Expressed in symbols,

	E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,

where the "?" suffix means the rfe link is optional (Y may
be equal to X).  Because of the cumul-fence links, we know that W will
propagate to Y's CPU before X does, hence before Y executes and hence
before the strong fence executes.  Because the fence is strong, we
know that W will propagate to every CPU and to RAM before Z executes.
And because of the hb links, Z must execute before F.  Thus W, which
comes after E in the coherence order, is guaranteed to
propagate to every CPU and to RAM before F executes.

The existence of a pb link from E to F implies that E must execute
before F.  To see why, suppose F executed first.  Then W would
have propagated to E's CPU before E executed.  If E was a store, the
memory subsystem would then be forced to place W after E in the
coherence order, contradicting the fact that E ->coe W.  If E was a
load, the memory subsystem would then be forced to satisfy E's read
request with the value stored by W or an even later store,
contradicting the fact that E ->fre W.
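A standard store-buffering sketch (illustrative) shows pb at work:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

Suppose r0 = 0 and r1 = 0 at the end.  Then P0's load of y reads a
value that P1's store to y overwrites, giving an fre link, and P1's
smp_mb() is a strong fence po-between that store and P1's load of x;
so P0's load of y ->pb P1's load of x.  By symmetry, P1's load of
x ->pb P0's load of y as well.  Since each pb link implies executing
first, the two links form an impossible cycle: at least one of r0 and
r1 must be 1.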
In this example, the sequences of cumul-fence and hb links are empty.
Note that these pb links are not included in hb as instances of prop,
because a pb link of this sort does not start and end on the same CPU.
RCU RELATIONS: rcu-link, gp, rscs, rcu-fence, and rb
----------------------------------------------------

RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
rests on two concepts: grace periods and read-side critical sections.
A grace period is the span of time occupied by a call to
synchronize_rcu().  A read-side critical section (or just critical
section, for short) is a region of code delimited by rcu_read_lock()
at the start and rcu_read_unlock() at the end.

As far as the memory model is concerned, RCU's main feature is its
Grace-Period Guarantee, which states that a critical section can never
span a full grace period.  In more detail, the Guarantee says:

	If a critical section starts before a grace period then it
	must end before the grace period does, and in addition, every
	store that propagates to the critical section's CPU before the
	end of the critical section must propagate to every CPU before
	the end of the grace period.

	If a critical section ends after a grace period then it must
	start after the grace period does, and in addition, every
	store that propagates to the grace period's CPU before the
	start of the grace period must propagate to every CPU before
	the start of the critical section.
In the operational model, the requirements for stores
to propagate to every CPU are fulfilled by placing strong fences at
suitable places in the RCU-related code.  Thus, if a critical section
starts before a grace period does then the critical section's CPU will
execute a strong fence after the end of the critical section and
before the end of the grace period; the other case is handled
similarly.
The LKMM expresses the Grace-Period Guarantee in terms of the
rcu-link relation.  rcu-link encompasses a very general notion of
"before": Among other things, X ->rcu-link Z includes cases where X
happens-before or is equal to some event Y which is equal to or comes
before Z in the coherence order.  When Y = Z this says that X ->rfe Z
implies X ->rcu-link Z.  In addition, when Y = X it says that X ->fr Z
and X ->co Z each imply X ->rcu-link Z.

The formal definition of the rcu-link relation is more than a little
obscure, and we won't give it here.  All you really need to know
about rcu-link is the information in the preceding paragraph.
The LKMM brings grace
periods and read-side critical sections into the picture in the
following way:

	E ->gp F means there is a synchronize_rcu() fence event S such
	that E ->po S and either S ->po F or S = F.  In simple terms,
	there is a grace period po-between E and F.

	E ->rscs F means there is a critical section delimited by an
	rcu_read_lock() fence L and an rcu_read_unlock() fence U, such
	that E ->po U and either L ->po F or L = F.  You can think of
	this as saying that E and F are in the same critical section
	(in fact, it also allows E to be po-before the start of the
	critical section and F to be po-after the end).
If we think of the rcu-link relation as standing for an extended
"before", then X ->gp Y ->rcu-link Z says that X executes before a
grace period which ends before Z executes.  (In fact it covers more
than this, because it also includes cases where X executes before a
grace period and some store propagates to Z's CPU before Z executes
but doesn't propagate to some other CPU until after the grace period
ends.)  Similarly, X ->rscs Y ->rcu-link Z says that X is part of (or
po-before the start of) a critical section which starts before Z
executes.
The LKMM goes on to define the rcu-fence relation as a sequence of gp
and rscs links separated by rcu-link links, in which the number of gp
links is >= the number of rscs links.  For example:

	X ->gp Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V

would imply that X ->rcu-fence V, because this sequence contains two
gp links and only one rscs link.  (It also implies that X ->rcu-fence T
and Z ->rcu-fence V.)  On the other hand:

	X ->rscs Y ->rcu-link Z ->rscs T ->rcu-link U ->gp V

does not imply X ->rcu-fence V, because the sequence contains only
one gp link but two rscs links.
The rcu-fence relation is important because the Grace Period Guarantee
means that rcu-fence acts kind of like a strong fence.  In particular,
if W is a write and we have W ->rcu-fence Z, the Guarantee says that W
will propagate to every CPU before Z executes.
To prove this in full generality requires some intellectual effort.
We'll consider just a very simple case:

	W ->gp X ->rcu-link Y ->rscs Z.

This formula means that there is a grace period G and a critical
section C such that:

	1. W is po-before G;

	2. X is equal to or po-after G;

	3. X comes "before" Y in the sense of the rcu-link relation;

	4. Y is po-before the end of C;

	5. Z is equal to or po-after the start of C.

From 2 - 4 we deduce that the grace period G ends before the critical
section C does.  Then the second part of the Grace Period Guarantee
says that every store which propagates to G's CPU before G starts
(which includes W, since W executes
on G's CPU before G starts) must propagate to every CPU before C
starts.  In particular, W propagates to every CPU before Z executes
(by 5, Z comes no earlier than the start of C).  This argument
can be expanded to handle all the situations covered by rcu-fence.
Finally, the LKMM defines the RCU-before (rb) relation in terms of
rcu-fence.  This is done in essentially the same way as the pb
relation was defined in terms of strong-fence.  We will omit the
details; the end result is that E ->rb F implies E must execute before
F, just as E ->pb F does (and for much the same reasons).
The LKMM's "rcu" axiom says it is not possible to have events E and
F with E ->rcu-link F ->rcu-fence E.  Or to put it a third way, the
axiom requires that there are no cycles consisting of gp and rscs
links
alternating with rcu-link, where the number of gp links is >= the
number of rscs links.

Justifying the axiom takes some effort, but it is in fact a valid
formalization of the Grace Period Guarantee.  Suppose the first part
of the Guarantee were violated: a critical section starts before a
grace
period, and some store propagates to the critical section's CPU before
the end of the critical section but doesn't propagate to some other
CPU until after the end of the grace period.
Then there
are events E and F where E is po-after L (which marks the start of the
critical section), E is "before" F in the sense of the rcu-link
relation, and F is po-before the grace period S:

	L ->po E ->rcu-link F ->po S.

Let W be the store mentioned above, let Z come before the end of the
critical section (Z ->po U, where U marks the end) and witness that W
propagates to the critical
section's CPU by reading from W, and let Y on some arbitrary CPU be a
witness that W has not propagated to that CPU, where Y happens after
some event X which is po-after S.  Symbolically, this amounts to:

	S ->po X ->hb* Y ->fr W ->rf Z ->po U.

The fr link from Y to W indicates that W has not propagated to Y's CPU
at the time that Y executes.  From this, it can be shown (see the
discussion of the rcu-link relation earlier) that X and Z are related
by rcu-link, yielding:

	S ->po X ->rcu-link Z ->po U.

The formulas say that S is po-between F and X, hence F ->gp X.  They
also say that Z comes before the end of the critical section and E
comes after its start, hence Z ->rscs E.  From all this we obtain:

	F ->gp X ->rcu-link Z ->rscs E ->rcu-link F,

a forbidden cycle.  Thus the "rcu" axiom rules out this violation of
the Grace Period Guarantee.
For something a little more down-to-earth, let's see how the axiom
works out in practice.
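Consider the following sketch (a reconstruction consistent with the
relations discussed below; comments mark the accesses W, X, Y, and Z):

	int x = 0, y = 0;

	P0()
	{
		int r0;

		rcu_read_lock();
		r0 = READ_ONCE(x);	/* W */
		WRITE_ONCE(y, 1);	/* X */
		rcu_read_unlock();
	}

	P1()
	{
		int r1;

		WRITE_ONCE(x, 1);	/* Y */
		synchronize_rcu();
		r1 = READ_ONCE(y);	/* Z */
	}

The question is whether r0 = 0 and r1 = 0 is a possible outcome.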
Suppose r0 = 0 and r1 = 0 at the end.  Then r1 = 0 means that
P1's load at Z reads from y's initial value, the value which P0's
store at X overwrites; so we have Z ->fre X and thus Z ->rcu-link X.
Since Y is po-before the synchronize_rcu() fence and Z is po-after it,
we have Y ->gp Z.
Likewise, r0 = 0 means that P0's load at W reads the value which P1's
store at Y overwrites,
so we have W ->rcu-link Y.  In addition, W and X are in the same
critical
section, so therefore we have X ->rscs W.

Then X ->rscs W ->rcu-link Y ->gp Z ->rcu-link X is a forbidden cycle,
violating the "rcu" axiom; the outcome r0 = 0 and r1 = 0 is not
allowed, just as the Grace-Period Guarantee requires.
The same approach applies to litmus tests with several critical
sections.  In a three-process test where P0 and P2 each contain a
critical section and P1 contains a grace period, labeling the accesses
suitably, one can show
that W ->rscs X ->rcu-link Y ->gp Z ->rcu-link U ->rscs V ->rcu-link W.
This cycle contains two rscs links but only one gp link, so the "rcu"
axiom does not forbid it: the corresponding outcome is allowed.  In
general it takes at least as many grace periods as critical sections
to rule a cycle out.
ODDS AND ENDS
-------------

This section covers material that didn't fit naturally anywhere else.

In one or two places the formal model differs slightly from the
descriptions given here, for instance in exactly which pairs of linked
events must
be on the same CPU.  These differences are very unimportant; indeed,
they do not change the set of outcomes the model permits for any
litmus test on any
CPU.
What about the stores
that are part of a non-value-returning atomic update?  For instance,
atomic_inc(&x) updates x without returning x's value.  On some
architectures it is possible for
non-value-returning atomic operations effectively to be executed off
the CPU.  Basically, the CPU tells the memory subsystem to increment
x, and the increment is carried out by the memory hardware with
no further involvement from the CPU.  Since the CPU doesn't ever read
the value of x, it has no way of knowing when the increment takes
place, so such a store cannot be given the ordering properties of,
say, the store part of an
smp_store_release() -- which is basically how the Linux kernel treats
the matter.  Where ordering for these updates is needed, the kernel
provides special fences.
These fences are not as strong as smp_mb(); rather than ordering
all po-earlier events against all po-later events, as smp_mb() does,
they provide more targeted guarantees:

	smp_mb__before_atomic() orders all po-earlier events against
	po-later atomic updates and the events following them;

	smp_mb__after_atomic() orders po-earlier atomic updates and
	the events preceding them against all po-later events;

	smp_mb_after_spinlock() orders po-earlier lock acquisition
	events and the events preceding them against all po-later
	events.
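For instance (illustrative variables), a counter update can be ordered
against a later flag store like this:

	atomic_inc(&ct);	/* non-value-returning update */
	smp_mb__after_atomic();	/* order the update, and everything
				   po-earlier, against what follows */
	WRITE_ONCE(flag, 1);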
One final point: the LKMM evaluates a litmus test by considering only
its non-deadlocking executions.  For example:
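(A hypothetical sketch with illustrative values.)

	int x = 0, y = 0;
	spinlock_t s;

	P0()
	{
		int r0;

		spin_lock(&s);
		r0 = READ_ONCE(x);
		if (r0 != 0) {
			WRITE_ONCE(y, 36);
			spin_lock(&s);	/* s is already held: deadlock */
		}
		spin_unlock(&s);
	}

	P1()
	{
		WRITE_ONCE(x, 1);
	}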
This code will self-deadlock in the executions where it stores 36 in
y; those executions are excluded, and the LKMM considers only the
remaining ones (here, those with r0 = 0).