resctrl.rst - OpenGrok cross reference for /Documentation/arch/x86/resctrl.rst

1 .. SPDX-License-Identifier: GPL-2.0
9 :Authors: - Fenghua Yu <fenghua.yu@intel.com>
10           - Tony Luck <tony.luck@intel.com>
11           - Vikas Shivappa <vikas.shivappa@intel.com>
22 CAT (Cache Allocation Technology)		"cat_l3", "cat_l2"
24 CQM (Cache QoS Monitoring)			"cqm_llc", "cqm_occup_llc"
36 To use the feature mount the file system::
38  # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
43 	Enable code/data prioritization in L3 cache allocations.
45 	Enable code/data prioritization in L2 cache allocations.
47 	Enable the MBA Software Controller (mba_sc) to specify MBA
55 RDT features are orthogonal. A particular system may support only
56 monitoring, only control, or both monitoring and control.  Cache
57 pseudo-locking is a unique way of using cache control to "pin" or
58 "lock" data in the cache. Details can be found in
59 "Cache Pseudo-Locking".
63 only those files and directories supported by the system will be created.
77 Cache resource (L3/L2) subdirectory contains the following files
94 		setting up exclusive cache partitions. Note that
96 		own settings for cache use which can over-ride
103 			      Corresponding region is unused. When the system's
128 			      Corresponding region is pseudo-locked. No
131 		Indicates if non-contiguous 1s value in CBM is supported.
136 			      Non-contiguous 1s value in CBM is supported.
155 		non-linear. This field is purely informational
166 		"per-thread":
188 		If the system supports Bandwidth Monitoring Event
216 	5       Reads to slow memory in the non-local NUMA domain
218 	3       Non-temporal writes to non-local NUMA domain
219 	2       Non-temporal writes to local NUMA domain
220 	1       Reads to memory in the non-local NUMA domain
262 		counter can be considered for re-use.
266 via the file system (making new directories or writing to any of the
275 	mask f7 has non-consecutive 1-bits
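The "non-consecutive 1-bits" rejection above can be reproduced before writing to a schemata file. The following is a minimal sketch of such a pre-check, assuming hardware that requires contiguous masks; `cbm_is_contiguous` is a hypothetical helper name, not part of resctrl.

```shell
#!/bin/sh
# Hypothetical helper (not part of resctrl): succeed if the hex mask in $1
# consists of exactly one contiguous run of 1-bits.
cbm_is_contiguous() (
    mask=$((0x$1))
    [ "$mask" -ne 0 ] || exit 1
    # Drop trailing zero bits so that a contiguous run starts at bit 0.
    while [ $((mask & 1)) -eq 0 ]; do
        mask=$((mask >> 1))
    done
    # A contiguous run starting at bit 0 has the form 2^n - 1, so it shares
    # no bits with itself plus one.
    [ $((mask & (mask + 1))) -eq 0 ]
)

for m in f7 3 6 c 5 9; do
    if cbm_is_contiguous "$m"; then
        echo "$m: contiguous"
    else
        echo "$m: non-contiguous"
    fi
done
```

On systems where non-contiguous masks are supported, this check is unnecessary and would reject valid masks.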
281 system.  The default group is the root directory which, immediately
282 after mounting, owns all the tasks and cpus in the system and can make
285 On a system with RDT control features additional directories can be
290 On a system with RDT monitoring the root directory and other top level
332 	When the resource group is in pseudo-locked mode this file will
334 	pseudo-locked region.
345 	Each resource has its own line and format - see below for details.
356 	cache pseudo-locked region is created by first writing
357 	"pseudo-locksetup" to the "mode" file before writing the cache
358 	pseudo-locked region's schemata to the resource group's "schemata"
359 	file. On successful pseudo-locked region creation the mode will
360 	automatically change to "pseudo-locked".
370 	RDT event. E.g. on a system with two L3 domains there will
378 	On systems with Sub-NUMA Cluster (SNC) enabled there are extra
380 	for the L3 cache they occupy). These are named "mon_sub_L3_YY"
388 -------------------------
393 1) If the task is a member of a non-default group, then the schemata
403 -------------------------
404 1) If a task is a member of a MON group, or non-default CTRL_MON group
415 Notes on cache occupancy monitoring and control
418 this only affects *new* cache allocations by the task. E.g. you may have
419 a task in a monitor group showing 3 MB of cache occupancy. If you move
422 the new group zero. When the task accesses locations still in cache from
423 before the move, the h/w does not update any counters. On a busy system
424 you will likely see the occupancy in the old group go down as cache lines
425 are evicted and re-used while the occupancy in the new group rises as
426 the task accesses memory and loads into the cache are counted based on
429 The same applies to cache allocation control. Moving a task to a group
430 with a smaller cache partition will not evict any cache lines. The
440 max_threshold_occupancy - generic concepts
441 ------------------------------------------
444 the RMID is still tagged on the cache lines of the previous user of that RMID.
445 Hence such RMIDs are placed on a limbo list and rechecked to see whether the cache
446 occupancy has gone down. If at some point the system has many
447 limbo RMIDs that are not yet ready to be reused, the user may see an -EBUSY
459 Schemata files - general concepts
460 ---------------------------------
463 in each of the instances of that resource on the system.
465 Cache IDs
466 ---------
467 On current generation systems there is one L3 cache per socket and L2
470 caches on a socket, multiple cores could share an L2 cache. So instead
472 a resource we use a "Cache ID". At a given cache level this will be a
473 unique number across the whole system (but it isn't guaranteed to be a
475 CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
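As a rough sketch, the sysfs lookup can be scripted as shown here. The function name `list_cache_ids` and its optional root-directory argument are illustrative assumptions; the `level` file is read from the same cache index directory as `id`.

```shell
#!/bin/sh
# Hypothetical helper: print "cpuN L<level> id=<id>" for every cache index
# directory found under $1 (defaults to the real sysfs location).
list_cache_ids() (
    root=${1:-/sys/devices/system/cpu}
    for idfile in "$root"/cpu*/cache/index*/id; do
        # Skip when the glob matched nothing or the id file is unreadable.
        [ -r "$idfile" ] || continue
        dir=${idfile%/id}
        cpu=${dir%/cache/*}
        cpu=${cpu##*/}
        level=$(cat "$dir/level" 2>/dev/null) || level="?"
        echo "$cpu L$level id=$(cat "$idfile")"
    done
)

list_cache_ids
```

Separate L1 data and instruction caches appear as separate index directories, so a CPU may print more than one line for the same level.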
477 Cache Bit Masks (CBM)
478 ---------------------
479 For cache resources we describe the portion of the cache that is available
481 by each cpu model (and may be different for different cache levels). It
483 the resctrl file system in "info/{resource}/cbm_mask". Some Intel hardware
485 0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
487 if non-contiguous 1s value is supported. On a system with a 20-bit mask
488 each bit represents 5% of the capacity of the cache. You could partition
489 the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
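The four masks quoted above can be derived mechanically. This is a minimal sketch, assuming the mask length divides evenly by the number of partitions; `cbm_split` is a hypothetical helper, and the mask length would normally be taken from the "cbm_mask" file.

```shell
#!/bin/sh
# Hypothetical helper: print $2 contiguous, non-overlapping hex masks that
# evenly split a CBM of $1 bits. Fails if $1 is not divisible by $2.
cbm_split() (
    cbm_len=$1
    parts=$2
    [ "$parts" -gt 0 ] && [ $((cbm_len % parts)) -eq 0 ] || exit 1
    width=$((cbm_len / parts))
    run=$(( (1 << width) - 1 ))       # "width" consecutive 1-bits
    i=0
    while [ "$i" -lt "$parts" ]; do
        printf '0x%x\n' $(( run << (i * width) ))
        i=$((i + 1))
    done
)

# 20-bit mask, four equal parts: prints 0x1f, 0x3e0, 0x7c00, 0xf8000
cbm_split 20 4
```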
491 Notes on Sub-NUMA Cluster mode
493 When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
495 on Sub-NUMA nodes share the same L3 cache and the system may report
496 the NUMA distance between Sub-NUMA nodes with a lower value than used
499 The top-level monitoring files in each "mon_L3_XX" directory provide
500 the sum of data across all SNC nodes sharing an L3 cache instance.
501 Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
505 Memory bandwidth allocation is still performed at the L3 cache
508 L3 cache allocation bitmaps also apply to all SNC nodes. But note that
509 the amount of L3 cache represented by each bit is divided by the number
510 of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
512 with two SNC nodes per L3 cache, each bit only represents 5MB.
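The per-bit arithmetic can be written out as a short calculation. This sketch uses integer megabytes and a hypothetical helper name, `cbm_bit_mb`; the cache size and mask length are the illustrative values from the text, not values read from hardware.

```shell
#!/bin/sh
# Hypothetical helper: MB of L3 cache represented by one CBM bit, given the
# cache size in MB ($1), the CBM length in bits ($2), and the number of SNC
# nodes per L3 cache ($3, default 1). Uses integer division.
cbm_bit_mb() (
    cache_mb=$1
    cbm_len=$2
    snc_nodes=${3:-1}
    echo $(( cache_mb / cbm_len / snc_nodes ))
)

cbm_bit_mb 100 10 1   # SNC disabled: prints 10
cbm_bit_mb 100 10 2   # two SNC nodes per L3 cache: prints 5
```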
564 Controller (mba_sc)" which reads the actual bandwidth using MBM counters
570 whereas the user can switch to the "MBA software controller" mode using
575 ----------------------------------------------------------------
581 ------------------------------------------------------------------
589 ------------------------
602 ------------------------------------------
604 Memory b/w domain is L3 cache.
610 ----------------------------------------------
612 Memory bandwidth domain is L3 cache.
618 ---------------------------------------
623 the system, the throttling logic groups all the slow sources
627 devices presence. If there are no such devices on the system, then
628 configuring SMBA will have no impact on the performance of the system.
630 The bandwidth domain for slow memory is L3 cache. Its schemata file
637 ---------------------------------
652 --------------------------------------------------
655 When writing to the file, you need to specify what cache id you wish to
658 For example, to allocate 2GB/s limit on the first cache id:
672 --------------------------------------------------------------------
676 For example, to allocate 8GB/s limit on the first cache id:
691 Cache Pseudo-Locking
693 CAT enables a user to specify the amount of cache space that an
694 application can fill. Cache pseudo-locking builds on the fact that a
695 CPU can still read and write data pre-allocated outside its current
696 allocated area on a cache hit. With cache pseudo-locking, data can be
697 preloaded into a reserved portion of cache that no application can
698 fill, and from that point on will only serve cache hits. The cache
699 pseudo-locked memory is made accessible to user space where an
703 The creation of a cache pseudo-locked region is triggered by a request
705 to be pseudo-locked. The cache pseudo-locked region is created as follows:
707 - Create a CAT allocation CLOSNEW with a CBM matching the schemata
708   from the user of the cache region that will contain the pseudo-locked
710   on the system and no future overlap with this cache region is allowed
711   while the pseudo-locked region exists.
712 - Create a contiguous region of memory of the same size as the cache
714 - Flush the cache, disable hardware prefetchers, disable preemption.
715 - Make CLOSNEW the active CLOS and touch the allocated memory to load
716   it into the cache.
717 - Set the previous CLOS as active.
718 - At this point the closid CLOSNEW can be released - the cache
719   pseudo-locked region is protected as long as its CBM does not appear in
720   any CAT allocation. Even though the cache pseudo-locked region will from
722   any CLOS will be able to access the memory in the pseudo-locked region since
723   the region continues to serve cache hits.
724 - The contiguous region of memory loaded into the cache is exposed to
725   user-space as a character device.
727 Cache pseudo-locking increases the probability that data will remain
728 in the cache via carefully configuring the CAT feature and controlling
730 cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
731 “locked” data from cache. Power management C-states may shrink or
732 power off cache. Deeper C-states will automatically be restricted on
733 pseudo-locked region creation.
735 It is required that an application using a pseudo-locked region runs
737 with the cache on which the pseudo-locked region resides. A sanity check
738 within the code will not allow an application to map pseudo-locked memory
739 unless it runs with affinity to cores associated with the cache on which the
740 pseudo-locked region resides. The sanity check is only done during the
744 Pseudo-locking is accomplished in two stages:
746 1) During the first stage the system administrator allocates a portion
747    of cache that should be dedicated to pseudo-locking. At this time an
749    cache portion, and exposed as a character device.
750 2) During the second stage a user-space application maps (mmap()) the
751    pseudo-locked memory into its address space.
753 Cache Pseudo-Locking Interface
754 ------------------------------
755 A pseudo-locked region is created using the resctrl interface as follows:
758 2) Change the new resource group's mode to "pseudo-locksetup" by writing
759    "pseudo-locksetup" to the "mode" file.
760 3) Write the schemata of the pseudo-locked region to the "schemata" file. All
764 On successful pseudo-locked region creation the "mode" file will contain
765 "pseudo-locked" and a new character device with the same name as the resource
767 by user space in order to obtain access to the pseudo-locked memory region.
769 An example of cache pseudo-locked region creation and usage can be found below.
771 Cache Pseudo-Locking Debugging Interface
772 ----------------------------------------
773 The pseudo-locking debugging interface is enabled by default (if
777 location is present in the cache. The pseudo-locking debugging interface uses
778 the tracing infrastructure to provide two ways to measure cache residency of
779 the pseudo-locked region:
783    example below). In this test the pseudo-locked region is traversed at
785    are disabled. This also provides a substitute visualization of cache
787 2) Cache hit and miss measurements using model specific precision counters if
788    available. Depending on the levels of cache on the system the pseudo_lock_l2
791 When a pseudo-locked region is created a new debugfs directory is created for
793 write-only file, pseudo_lock_measure, is present in this directory. The
794 measurement of the pseudo-locked region depends on the number written to this
802      writing "2" to the pseudo_lock_measure file will trigger the L2 cache
803      residency (cache hits and misses) measurement captured in the
806      writing "3" to the pseudo_lock_measure file will trigger the L3 cache
807      residency (cache hits and misses) measurement captured in the
815 In this example a pseudo-locked region named "newlock" was created. Here is
847 Example of cache hits/misses debugging
849 In this example a pseudo-locked region named "newlock" was created on the L2
850 cache of a platform. Here is how we can obtain details of the cache hits
862   #                              _-----=> irqs-off
863   #                             / _----=> need-resched
864   #                            | / _---=> hardirq/softirq
865   #                            || / _--=> preempt-depth
867   #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
869   pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0
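Trace lines in the format shown above can be reduced to a hit rate with a small filter. This is a sketch only: `pseudo_lock_hit_rate` is a hypothetical helper, and it assumes the event lines keep the "hits=N miss=N" layout of the sample output.

```shell
#!/bin/sh
# Hypothetical helper: read trace output on stdin and, for each
# pseudo_lock_l2 or pseudo_lock_l3 event, print the counts and the hit rate.
pseudo_lock_hit_rate() {
    sed -n 's/.*\(pseudo_lock_l[23]\): hits=\([0-9]*\) miss=\([0-9]*\).*/\1 \2 \3/p' |
    while read -r event hits miss; do
        total=$((hits + miss))
        [ "$total" -gt 0 ] || continue   # avoid dividing by zero
        echo "$event hits=$hits miss=$miss hit_rate=$((100 * hits / total))%"
    done
}

echo 'pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0' |
    pseudo_lock_hit_rate
```

The percentage is truncated by integer division, so a rate just under 100% prints as 99%.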
877 On a two socket machine (one L3 cache per socket) with just four bits
878 for cache bit masks, minimum b/w of 10% with a memory bandwidth
882   # mount -t resctrl resctrl /sys/fs/resctrl
892 "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
893 Tasks in group "p1" use the "lower" 50% of cache on both sockets.
898 Note that unlike cache masks, memory b/w cannot specify whether these
900 b/w that the group may be able to use and the system admin can configure
903 If resctrl is using the software controller (mba_sc) then the user can enter the
915 Again two sockets, but this time with a more realistic 20-bit mask.
918 processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
919 neighbors, each of the two real-time tasks exclusively occupies one quarter
920 of L3 cache on socket 0.
923   # mount -t resctrl resctrl /sys/fs/resctrl
927 50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
933 it access to the "top" 25% of the cache on socket 0.
946   # taskset -cp 1 1234
948 Ditto for the second real time task (with the remaining 25% of cache)::
953   # taskset -cp 2 5678
955 For the same 2 socket system with memory b/w resource and CAT L3 the
962   # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
968   # echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
972 A single socket system which has real-time tasks running on cores 4-7 and
973 non real-time workload assigned to cores 0-3. The real-time tasks share text
979   # mount -t resctrl resctrl /sys/fs/resctrl
983 50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
989 to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
996 Finally we move cores 4-7 over to the new group and make sure that the
997 kernel and the tasks running there get 50% of the cache. They should
998 also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
999 siblings and only the real time threads are scheduled on the cores 4-7.
1007 mode allowing sharing of their cache allocations. If one resource group
1008 configures a cache allocation then nothing prevents another resource group
1012 system with two L2 cache instances that can be configured with an 8-bit
1014 25% of each cache instance.
1017   # mount -t resctrl resctrl /sys/fs/resctrl/
1021 cache::
1034   -sh: echo: write error: Invalid argument
1061 The bit_usage will reflect how the cache is used::
1069   -sh: echo: write error: Invalid argument
1073 Example of Cache Pseudo-Locking
1075 Lock a portion of the L2 cache from cache id 1 using CBM 0x3. Pseudo-locked
1080   # mount -t resctrl resctrl /sys/fs/resctrl/
1083 Ensure that there are bits available that can be pseudo-locked. Since only
1084 unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
1093 Create a new resource group that will be associated with the pseudo-locked
1094 region, indicate that it will be used for a pseudo-locked region, and
1095 configure the requested pseudo-locked region capacity bitmask::
1098   # echo pseudo-locksetup > newlock/mode
1101 On success the resource group's mode will change to pseudo-locked, the
1102 bit_usage will reflect the pseudo-locked region, and the character device
1103 exposing the pseudo-locked region will exist::
1106   pseudo-locked
1109   # ls -l /dev/pseudo_lock/newlock
1110   crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock
1115   * Example code to access one page of pseudo-locked cache region
1128   * cores associated with the pseudo-locked region. Here the cpu
1165     /* Application interacts with pseudo-locked memory @mapping */
1179 ----------------------------
1184 As an example, the allocation of an exclusive reservation of L3 cache
1187   1. Read the cbmmasks from each directory or the per-resource "bit_usage"
1218   $ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl
1222   $ cat create-dir.sh
1224   mask = function-of(output.txt)
1228   $ flock /sys/fs/resctrl/ ./create-dir.sh
1247       exit(-1);
1259       exit(-1);
1271       exit(-1);
1280     if (fd == -1) {
1282       exit(-1);
1296 ----------------------
1303 ------------------------------------------------------------------------
1304 On a two socket machine (one L3 cache per socket) with just four bits
1305 for cache bit masks::
1307   # mount -t resctrl resctrl /sys/fs/resctrl
1319 "lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
1320 Tasks in group "p1" use the "lower" 50% of cache on both sockets.
1347 --------------------------------------------
1348 On a two socket machine (one L3 cache per socket)::
1350   # mount -t resctrl resctrl /sys/fs/resctrl
1367 ---------------------------------------------------------------------
1369 Assume a system like HSW has only CQM and no CAT support. In this case
1374 This can also be used to profile a job's cache size footprint before being
1378   # mount -t resctrl resctrl /sys/fs/resctrl
1402 -----------------------------------
1404 A single socket system which has real time tasks running on cores 4-7
1405 and non real time tasks on other cpus. We want to monitor the cache
1409   # mount -t resctrl resctrl /sys/fs/resctrl
1413 Move the cpus 4-7 over to p1::
1425 Intel MBM Counters May Report System Memory Bandwidth Incorrectly
1426 -----------------------------------------------------------------
1433 metrics, may report incorrect system bandwidth for certain RMID values.
1435 Implication: Due to the errata, system memory bandwidth may not match
1505 …958/https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html
1507 2. Erratum BDF102 in Intel Xeon E5-2600 v4 Processor Product Family Specification Update:
1508 …w.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-v4-spec-update.pdf
1511 …are.intel.com/content/www/us/en/develop/articles/intel-resource-director-technology-rdt-reference-…