ras.rst - OpenGrok cross reference for /kernel/linux/linux-4.19/Documentation/admin-guide/ras.rst

10 Reliability, Availability and Serviceability (RAS) is a concept used on
14   is the probability that a system will produce correct outputs.
20   is the probability that a system is operational at a given time
27   is the simplicity and speed with which a system can be repaired or
30   * Generally measured on Mean Time Between Repair (MTBR)
33 -------------
35 In order to reduce system downtime, a system should be capable of detecting
38 the system administrator to take the action of replacing a component before
39 it causes data loss or system downtime.
47   Self-Monitoring, Analysis and Reporting Technology (SMART).
50 to identify if the probability of hardware errors is increasing, and, on such
55 ---------------
57 Most mechanisms used on modern systems use technologies like Hamming
58 Codes that allow error correction when the number of errors on a bit packet
63 Also, sometimes an error occurs on a component that is not in use. For
68 * **Correctable Error (CE)** - the error detection mechanism detected and
70   Kernel mechanisms allow the system administrator to consider them as fatal.
72 * **Uncorrected Error (UE)** - the number of errors exceeded the error
73   correction threshold, and the system was unable to auto-correct.
75 * **Fatal Error** - when a UE happens on a critical component of the
76   system (for example, a piece of the Kernel got corrupted by an UE), the
79 * **Non-fatal Error** - when a UE happens on an unused component,
80   like a CPU in power down state or an unused memory bank, the system may
84   Also, when an error happens on a userspace process, it is possible to
87 The mechanism for handling non-fatal errors is usually complex and may
89 policy desired by the system administrator.
92 ------------------------------------
94 Just detecting a hardware flaw is usually not enough, as the system needs
104 DMI BIOS usually has a list of memory module labels, which can be obtained
105 using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
113 		Locator: ChannelA-DIMM0
121 In the above example, a DDR4 SO-DIMM memory module is located at the
122 system's memory labeled as "BANK 0", as given by the *bank locator* field.
123 Please notice that, on such a system, the *total width* is equal to the
124 *data width*. It means that such a memory module doesn't have error
128 bank. In this example, from an older server, ``dmidecode`` shows::
146 There, the DDR3 RDIMM memory module is located at the system's memory labeled
148 memory module has 64 bits of *data width* and 72 bits of *total width*. So,
150 This kind of memory is called Error-correcting code memory (ECC memory).
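
The width comparison described above can be automated. The following is a minimal sketch, assuming Python 3 and ``dmidecode`` text that uses the ``Total Width`` and ``Data Width`` field names quoted in this section. The function names and the abbreviated ``SAMPLE`` text are illustrative only: the sample is assembled from the values mentioned in the text, not copied from real tool output.

```python
import re

# Abbreviated, hypothetical sample built from the values quoted in the text.
SAMPLE = """\
Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1

Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Locator: ChannelA-DIMM0
"""

def parse_memory_devices(dmidecode_text):
    """Return one dict of "key: value" fields per "Memory Device" record."""
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices

def width_bits(value):
    """Extract the integer from a string such as "72 bits", else None."""
    match = re.match(r"(\d+)\s*bits", value or "")
    return int(match.group(1)) if match else None

def has_ecc_bits(device):
    """True if total width exceeds data width, None if either is unknown."""
    total = width_bits(device.get("Total Width"))
    data = width_bits(device.get("Data Width"))
    if total is None or data is None:
        return None
    return total > data

for dev in parse_memory_devices(SAMPLE):
    print(dev.get("Locator"), has_ecc_bits(dev))
```

On a real system the text would come from running ``dmidecode`` with sufficient privileges. The BIOS-provided labels may still be wrong, so the result is a hint rather than proof.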
153 labels on their system's board use exactly the same BIOS, meaning that
157 ----------
159 As mentioned in the previous section, ECC memory has extra bits to be
160 used for error correction. So, on 64 bit systems, a memory module
169 on the memory modules.
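
To make the write and read paths concrete, here is a toy sketch using the textbook Hamming(7,4) code in Python 3. It is only an illustration of the principle: real memory controllers use much wider codes (64 data bits plus 8 check bits on the modules discussed here), and their exact algorithm is not described in this document. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Return (data_bits, syndrome).

    A syndrome of 0 means no error was detected. A non-zero syndrome is
    the 1-based position of the single flipped bit, which is corrected
    before the data bits are extracted.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome

word = [1, 0, 1, 1]
stored = hamming74_encode(word)   # what would be written to memory
stored[4] ^= 1                    # simulate a single-bit fault
recovered, syndrome = hamming74_decode(stored)
print(recovered == word, syndrome)
```

A single flipped bit is corrected, which corresponds to a CE. Plain Hamming(7,4) cannot tell a double-bit error from a single-bit one and would miscorrect it; codes of the SECDED kind add an extra parity bit so that such cases can be flagged as uncorrectable instead.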
172 ECC code used on write, producing a word with *data width* and a *syndrome*.
180 The information about the CE/UE errors is stored on some special registers
182 either by BIOS, by some special CPUs or by Linux EDAC driver. On x86 64
186 .. [#f1] Please notice that several memory controllers allow operation in a
187   mode called "Lock-Step", where they group two memory modules together,
188   doing 128-bit reads/writes. That gives 16 bits for error correction, with
190   that, when an error happens, there's no way to know what memory module is
194   In such a mode, the same data is written to two memory modules. At read,
195   the system checks both memory modules, in order to check if both provide
196   identical data. In such a configuration, when an error happens, there's no
197   way to know what memory module is to blame. So, it has to blame both
198   memory modules (or 4 memory modules, if the system is also on Lock-step
204 EDAC - Error Detection And Correction
210    was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
214    When the subsystem was pushed upstream for the first time, on
218 -------
220 The ``edac`` kernel module's goal is to detect and report hardware errors
221 that occur within the computer system running under Linux.
224 ------
232 CE events only, the system can and will continue to operate as no data
237 and system panics.
240 -----------------------
245 This new device type allows for non-memory type of ECC hardware detectors
257 ----------------
263 There are several add-in adapters that do **not** follow the PCI specification
280 ----------
282 EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
283 Controller (MC) driver modules. On a given system, the CORE is loaded
288 Thus, to "report" on what version a system is running, one must report
293 -------
298 hardware-specific modules and have the dependencies load the necessary
305 loads both the ``amd76x_edac.ko`` memory controller module and the
306 ``edac_mc.ko`` core module.
310 ---------------
313 lives in the /sys/devices/system/edac directory.
318 	mc	memory controller(s) system
319 	pci	PCI control and status system
325 ----------------------------
328 are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
331 .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
332   used to refer to a memory module, although there are other memory
333   packaging alternatives, like SO-DIMM, SIMM, etc. Throughout this document,
334   and inside the EDAC system, the term "dimm" is used for all memory
338 typical value. Yet, the actual number of csrows depends on the layout of
339 a given motherboard, memory controller and memory module characteristics.
341 Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems)
343 for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
346 	+------------+-----------------------+
348 	+------------+-----------+-----------+
352 	+------------+           |           |
354 	+------------+-----------+-----------+
356 	+------------+           |           |
358 	+------------+-----------+-----------+
360 In the above example, there are 4 physical slots on the motherboard
363 	+---------+---------+
365 	+---------+---------+
367 	+---------+---------+
369 Labels for these slots are usually silk-screened on the motherboard.
371 channel 1. Notice that there are two csrows possible on a physical DIMM.
372 These csrows are allocated their csrow assignment based on the slot into
378 will have just one csrow (csrow0). csrow1 will be empty. On the other
385 ``/sys/devices/system/edac/mc``, each memory controller will be
391 		   |->mc0
392 		   |->mc1
393 		   |->mc2
401 		|->csrow0
402 		|->csrow2
403 		|->csrow3
408 order to have dual-channel mode be operational. Since both csrow2 and
416 -------------------
423 	Documentation/ABI/testing/sysfs-devices-edac
427 ----------------------------------
432 A typical EDAC system has the following structure under
433 ``/sys/devices/system/edac/``\ [#f6]_::
435 	/sys/devices/system/edac/
483 this ``X`` memory module:
485 - ``size`` - Total memory managed by this csrow attribute file
490 - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
493 	errors that have occurred on this DIMM. If panic_on_ue is set
495 	will panic the system.
497 - ``dimm_ce_count`` - Correctable Errors count attribute file
500 	errors that have occurred on this DIMM. This count is very
503 	monitored for non-zero values, which should be reported
504 	to the system administrator.
506 - ``dimm_dev_type``  - Device type attribute file
509 	being utilized on this DIMM.
512 		- x1
513 		- x2
514 		- x4
515 		- x8
517 - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
522 - ``dimm_label`` - memory module label control file
525 	to it. With this label in the module, when errors occur
526 	the output can provide the DIMM label in the system log.
536 - ``dimm_location`` - location of the memory module
539 	memory controller identifies the location of a memory module.
540 	Depending on the type of memory and memory controller, it
543 		- *csrow* and *channel* - used when the memory controller
544 		  doesn't identify a single DIMM - e. g. in ``rankX`` dir;
545 		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
547 		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
549 - ``dimm_mem_type`` - Memory Type attribute file
552 	on this csrow. Normally, either buffered or unbuffered memory.
555 		- Registered-DDR
556 		- Unbuffered-DDR
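
A monitoring script could collect the per-DIMM attributes listed above along these lines. This is a minimal sketch, assuming Python 3 and the ``dimmX``/``rankX`` directory naming used in this document. The function names and the ``edac_mc_root`` parameter are illustrative and not part of EDAC; the parameter exists so the code can be pointed at a test directory.

```python
import glob
import os

ATTRS = ("dimm_label", "dimm_ce_count", "dimm_ue_count")

def _read_attr(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None

def read_dimm_counters(edac_mc_root="/sys/devices/system/edac/mc"):
    """Map "mcN/dimmM" (or "mcN/rankM") to its label and error counts."""
    counters = {}
    for path in sorted(glob.glob(os.path.join(edac_mc_root, "mc*", "*"))):
        name = os.path.basename(path)
        if not os.path.isdir(path) or not name.startswith(("dimm", "rank")):
            continue
        entry = {attr: _read_attr(os.path.join(path, attr)) for attr in ATTRS}
        for key in ("dimm_ce_count", "dimm_ue_count"):
            raw = entry[key]
            # Counts become integers; anything unreadable becomes None.
            entry[key] = int(raw) if raw is not None and raw.isdigit() else None
        controller = os.path.basename(os.path.dirname(path))
        counters[f"{controller}/{name}"] = entry
    return counters

def dimms_with_errors(counters):
    """Return the keys whose CE or UE count is greater than zero."""
    return sorted(
        key for key, entry in counters.items()
        if (entry["dimm_ce_count"] or 0) > 0 or (entry["dimm_ue_count"] or 0) > 0
    )

if __name__ == "__main__":
    found = read_dimm_counters()
    for key in dimms_with_errors(found):
        print(key, found[key])
```

Returning ``None`` for unreadable attributes keeps the sketch usable on systems where a driver does not expose every file.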
558 .. [#f5] On some systems, the memory controller doesn't have any logic
559 …to identify the memory module. On such systems, the directory is called ``rankX`` and works on a s…
560   On modern Intel memory controllers, the memory controller identifies the
561   memory modules directly. On such systems, the directory is called ``dimmX``.
568 ----------------------
571 directories. As this API doesn't work properly for Rambus, FB-DIMMs and
579 - ``ue_count`` - Total Uncorrectable Errors count attribute file
582 	errors that have occurred on this csrow. If panic_on_ue is set
584 	will panic the system.
587 - ``ce_count`` - Total Correctable Errors count attribute file
590 	errors that have occurred on this csrow. This count is very
593 	monitored for non-zero values, which should be reported
594 	to the system administrator.
597 - ``size_mb`` - Total memory managed by this csrow attribute file
603 - ``mem_type`` - Memory Type attribute file
606 	on this csrow. Normally, either buffered or unbuffered memory.
609 		- Registered-DDR
610 		- Unbuffered-DDR
613 - ``edac_mode`` - EDAC Mode of operation attribute file
619 - ``dev_type`` - Device type attribute file
622 	being utilized on this DIMM.
625 		- x1
626 		- x2
627 		- x4
628 		- x8
631 - ``ch0_ce_count`` - Channel 0 CE Count attribute file
633 	This attribute file will display the count of CEs on this
637 - ``ch0_ue_count`` - Channel 0 UE Count attribute file
639 	This attribute file will display the count of UEs on this
643 - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
647 	to it. With this label in the module, when errors occur
648 	the output can provide the DIMM label in the system log.
659 - ``ch1_ce_count`` - Channel 1 CE Count attribute file
662 	This attribute file will display the count of CEs on this
666 - ``ch1_ue_count`` - Channel 1 UE Count attribute file
669 	This attribute file will display the count of UEs on this
673 - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
676 	to it. With this label in the module, when errors occur
677 	the output can provide the DIMM label in the system log.
688 System Logging
689 --------------
691 If logging for UEs and CEs is enabled, then system logs will contain
700 	+---------------------------------------+-------------+
704 	+---------------------------------------+-------------+
706 	+---------------------------------------+-------------+
708 	+---------------------------------------+-------------+
710 	+---------------------------------------+-------------+
713 	+---------------------------------------+-------------+
715 	+---------------------------------------+-------------+
717 	+---------------------------------------+-------------+
719 	+---------------------------------------+-------------+
721 	+---------------------------------------+-------------+
722 	| And then an optional, driver-specific |             |
725 	+---------------------------------------+-------------+
728 type, a notice of "no info" and then an optional, driver-specific error
733 ------------------------
735 On Header Type 00 devices, the primary status is looked at for any
736 parity error regardless of whether parity is enabled on the device or
737 not. (The spec indicates parity is generated in some cases). On Header
739 if parity occurred on the bus on the other side of the bridge.
743 -------------------
745 Under ``/sys/devices/system/edac/pci`` are control and attribute files as
749 - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
757 		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
761 		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
764 - ``pci_parity_count`` - Parity Count
770 Module parameters
771 -----------------
773 - ``edac_mc_panic_on_ue`` - Panic on UE control file
777 	occurs - it is indeterminate what was uncorrected and the operating
778 	system context might be so mangled that continuing will lead to further
784 		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
788 		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
791 - ``edac_mc_log_ue`` - Log UE control file
795 	are reported through the system message log.  UE statistics
800 		module/kernel parameter: edac_mc_log_ue=[0|1]
804 		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
807 - ``edac_mc_log_ce`` - Log CE control file
811 	errors are reported through the system message log.
816 		module/kernel parameter: edac_mc_log_ce=[0|1]
820 		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
823 - ``edac_mc_poll_msec`` - Polling period control file
835 		module/kernel parameter: edac_mc_poll_msec=<msec>
839 		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
842 - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
849 	module/kernel parameter::
855 		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
859 		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
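
The ``echo`` commands above can also be wrapped in a small script. This is a minimal sketch, assuming Python 3 and the ``/sys/module/edac_core/parameters`` directory used in the examples. The helper names and the ``params_dir`` argument are illustrative; writing to the real directory requires root privileges.

```python
import os

EDAC_PARAMS_DIR = "/sys/module/edac_core/parameters"

def read_param(name, params_dir=EDAC_PARAMS_DIR):
    """Return the current value of one parameter as a string."""
    with open(os.path.join(params_dir, name)) as handle:
        return handle.read().strip()

def write_param(name, value, params_dir=EDAC_PARAMS_DIR):
    """Write a new value, the equivalent of `echo value > file`."""
    with open(os.path.join(params_dir, name), "w") as handle:
        handle.write(f"{value}\n")

def snapshot_params(params_dir=EDAC_PARAMS_DIR):
    """Return {parameter name: value} for every readable parameter file."""
    values = {}
    for name in sorted(os.listdir(params_dir)):
        try:
            values[name] = read_param(name, params_dir)
        except OSError:
            continue  # skip entries that cannot be read
    return values
```

For example, ``write_param("edac_mc_poll_msec", 1000)`` mirrors the polling-period command shown earlier, and ``snapshot_params()`` gives a quick record of the current policy before changing it.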
864 ----------------
871 At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
878 	/sys/devices/system/edac/test-instance
888 	panic_on_ue	boolean to ``panic`` the system if an UE is encountered
900 			One out-of-tree driver uses controls here to allow
908 ---------
913 	+----------------+
914 	| test-instance0 |
915 	+----------------+
927 ------
932 	+-------------+
933 	| test-block0 |
934 	+-------------+
949 	test-block-bits-0	for every POLL cycle this counter
951 	test-block-bits-1	every 10 cycles, this counter is bumped once,
952 				and test-block-bits-0 is set to 0
953 	test-block-bits-2	every 100 cycles, this counter is bumped once,
954 				and test-block-bits-1 is set to 0
955 	test-block-bits-3	every 1000 cycles, this counter is bumped once,
956 				and test-block-bits-2 is set to 0
961 	reset-counters		writing ANY thing to this control will
973 Usage of EDAC APIs on Nehalem and newer Intel CPUs
974 --------------------------------------------------
976 On older Intel architectures, the memory controller was part of the North
982 found on newer Intel CPUs, such as ``i7core_edac``, ``sb_edac`` and
1028    ``/sys/devices/system/edac/mc/mc?/``:
1030    - ``inject_addrmatch/*``:
1048 		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1049 		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1053 		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1054 		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1056    - ``inject_eccmask``:
1059    - ``inject_section``:
1066    - ``inject_type``:
1069 		bit 0 - repeat
1070 		bit 1 - ecc
1071 		bit 2 - parity
1073    - ``inject_enable``:
1078    The datasheet states that the error will only be generated after a write on an
1083    at socket 0, on any DIMM/address on channel 2::
1085 	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
1086 	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
1087 	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
1088 	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
1089 	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
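
The value written to ``inject_type`` in the commands above is a bitmask built from the three bits listed earlier (bit 0 repeat, bit 1 ecc, bit 2 parity), so ``2`` selects an ECC error only. A minimal sketch in Python 3 of composing and decoding that mask follows; the constant and function names are illustrative and not part of the driver.

```python
INJECT_REPEAT = 1 << 0  # bit 0 - repeat
INJECT_ECC = 1 << 1     # bit 1 - ecc
INJECT_PARITY = 1 << 2  # bit 2 - parity

def inject_type_mask(repeat=False, ecc=False, parity=False):
    """Build the integer to write to the inject_type file."""
    mask = 0
    if repeat:
        mask |= INJECT_REPEAT
    if ecc:
        mask |= INJECT_ECC
    if parity:
        mask |= INJECT_PARITY
    return mask

def describe_inject_type(mask):
    """Return the names of the bits set in an inject_type value."""
    names = []
    if mask & INJECT_REPEAT:
        names.append("repeat")
    if mask & INJECT_ECC:
        names.append("ecc")
    if mask & INJECT_PARITY:
        names.append("parity")
    return names

print(inject_type_mask(ecc=True))
print(describe_inject_type(3))
```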
1097 …EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, …
1102    uses those registers to report Corrected Errors on devices with Registered
1112      $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
1113 	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
1115 	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
1117 	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
1120    What happens here is that errors on different csrows, but at the same
1145 Reference documents used on ``amd64_edac``
1146 ------------------------------------------
1148 The ``amd64_edac`` module is based on the following documents
1149 (available from http://support.amd.com/en-us/search/tech-docs):
1172 	  Models 30h-3Fh Processors
1176    :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
1179 	  Models 60h-6Fh Processors
1183    :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
1186 	  Models 00h-0Fh Processors
1197   - 7 Dec 2005
1198   - 17 Jul 2007	Updated
1202   - 05 Aug 2009	Nehalem interface
1203   - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
1207   - Doug Thompson, Dave Jiang, Dave Peterson et al,
1208   - Mauro Carvalho Chehab
1209   - Borislav Petkov
1210   - original author: Thayne Harbaugh