EDAC - Error Detection And Correction

Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007	Updated


EDAC is maintained and written by:

	Doug Thompson, Dave Jiang, Dave Peterson et al,
	original author: Thayne Harbaugh,

Contact:
	website:	bluesmoke.sourceforge.net
	mailing list:	bluesmoke-devel@lists.sourceforge.net

"bluesmoke" was the name for this device driver when it was "out-of-tree"
and maintained at sourceforge.net.  When it was pushed into 2.6.16 for the
first time, it was renamed to 'EDAC'.

The bluesmoke project at sourceforge.net is now utilized as a 'staging area'
for EDAC development, before it is sent upstream to kernel.org

At the bluesmoke/EDAC project site is a series of quilt patches against
recent kernels, stored in an SVN repository. For easier downloading, there
is also a tarball snapshot available.

============================================================================
EDAC PURPOSE

The goal of the 'edac' kernel module is to detect and report errors that
occur within the computer system running under Linux.

MEMORY

In the initial release, memory Correctable Errors (CE) and Uncorrectable
Errors (UE) are the primary errors being harvested. These types of errors
are harvested by the 'edac_mc' class of device.

Detecting CE events, then harvesting those events and reporting them,
CAN be a predictor of future UE events.  With CE events, the system can
continue to operate, but with less safety. Preventive maintenance and
proactive part replacement of memory DIMMs exhibiting CEs can reduce
the likelihood of the dreaded UE events and system 'panics'.

NON-MEMORY

A new feature for EDAC, the edac_device class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
engines, fabric switches, main data path switches, interconnections,
and various other hardware data paths. If the hardware reports it, then
an edac_device device probably can be constructed to harvest and present
that to userspace.


PCI BUS SCANNING

In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
in order to determine if errors are occurring on data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do NOT follow the PCI specification
with regards to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity.  Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

In the kernel there is a pci device attribute located in sysfs that is
checked by the EDAC PCI scanning code. If that attribute is set,
PCI parity/error scanning is skipped for that device. The attribute
is:

	broken_parity_status

and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
PCI devices.
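
To see which devices EDAC would skip, the attribute can be read for every
PCI device. The following is a minimal sketch, not part of EDAC itself. The
function name is hypothetical, and the default directory /sys/bus/pci/devices
(which links to the per-device directories) is an assumption of this example:

```shell
# List PCI devices whose broken_parity_status attribute is set to 1.
# The optional argument is a directory holding one sub-directory per PCI
# device; /sys/bus/pci/devices is assumed as the default for this sketch.
list_broken_parity() {
    pci_root="${1:-/sys/bus/pci/devices}"
    for attr in "$pci_root"/*/broken_parity_status; do
        [ -r "$attr" ] || continue          # glob matched nothing, or unreadable
        if [ "$(cat "$attr")" = "1" ]; then
            # Print the device address, taken from the directory name
            basename "$(dirname "$attr")"
        fi
    done
}
```

Calling list_broken_parity with no argument prints one device address per
line, or nothing when no device has the attribute set.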

FUTURE HARDWARE SCANNING

Future error detectors will be integrated with EDAC or added to it,
including the following:

	MCE	Machine Check Exception
	MCA	Machine Check Architecture
	NMI	NMI notification of ECC errors
	MSRs	Machine Specific Register error cases
	and other mechanisms.

These errors are usually bus errors, ECC errors, thermal throttling
and the like.


============================================================================
EDAC VERSIONING

EDAC is composed of a "core" module (edac_core.ko) and several Memory
Controller (MC) driver modules. On a given system, the CORE
is loaded and one MC driver will be loaded. Both the CORE and
the MC driver (or edac_device driver) have individual versions that reflect
the current release level of their respective modules.

Thus, to "report" on what version a system is running, one must report both
the CORE's and the MC driver's versions.


LOADING

If 'edac' was statically linked with the kernel then no loading is
necessary.  If 'edac' was built as modules then simply modprobe the
'edac' pieces that you need.  You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary core
modules.

Example:

$> modprobe amd76x_edac

loads both the amd76x_edac.ko memory controller module and the edac_core.ko
core module.


============================================================================
EDAC sysfs INTERFACE

EDAC presents a 'sysfs' interface for control, reporting and attribute
reporting purposes.

EDAC lives in the /sys/devices/system/edac directory.

Within this directory there currently reside 2 'edac' components:

	mc	memory controller(s) system
	pci	PCI control and status system


============================================================================
Memory Controller (mc) Model

First a background on the memory controller's model abstracted in EDAC.
Each 'mc' device controls a set of DIMM memory modules. These modules are
laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
be multiple csrows and multiple channels.

Memory controllers allow for several csrows, with 8 csrows being a typical value.
Yet, the actual number of csrows depends on the electrical "loading"
of a given motherboard, memory controller and DIMM characteristics.

Dual channels allow for 128 bit data transfers to the CPU from memory.
Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
(FB-DIMMs). The following example will assume 2 channels:


		Channel 0	Channel 1
	===================================
	csrow0	| DIMM_A0	| DIMM_B0 |
	csrow1	| DIMM_A0	| DIMM_B0 |
	===================================

	===================================
	csrow2	| DIMM_A1	| DIMM_B1 |
	csrow3	| DIMM_A1	| DIMM_B1 |
	===================================

In the above example table there are 4 physical slots on the motherboard
for memory DIMMs:

	DIMM_A0
	DIMM_B0
	DIMM_A1
	DIMM_B1

Labels for these slots are usually silk screened on the motherboard. Slots
labeled 'A' are channel 0 in this example. Slots labeled 'B'
are channel 1. Notice that there are two csrows possible on a
physical DIMM. These csrows are allocated their csrow assignment
based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM
is placed in each Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
csrow1 will be populated. The pattern repeats itself for csrow2 and
csrow3.

The representation of the above is reflected in the directory tree
in EDAC's sysfs interface. Starting in directory
/sys/devices/system/edac/mc each memory controller will be represented
by its own 'mcX' directory, where 'X' is the index of the MC.


	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each 'mcX' directory each csrow is represented by a
'csrowX' directory, where 'X' is the csrow index:


	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is
composed of single ranked DIMMs. This should also apply in both
Channels, in order to have dual-channel mode be operational. Since
both csrow2 and csrow3 are populated, this indicates a dual ranked
set of DIMMs for channels 0 and 1.
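
The populated csrows can be listed by walking this directory tree. The
following is a minimal sketch. The function name is hypothetical, and the
optional argument exists only so that the function can be pointed at a
different directory:

```shell
# Print one "mcX csrowY" line for every csrow directory found under every
# memory controller directory. A csrow index missing from the output is an
# unpopulated csrow.
list_csrows() {
    mc_root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$mc_root"/mc[0-9]*; do
        [ -d "$mc" ] || continue            # no memory controllers registered
        for row in "$mc"/csrow[0-9]*; do
            [ -d "$row" ] || continue       # controller without csrows
            echo "$(basename "$mc") $(basename "$row")"
        done
    done
}
```

For the mc0 tree shown above, list_csrows would print lines for csrow0,
csrow2 and csrow3 only.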


Within each of the 'mcX' and 'csrowX' directories are several
EDAC control and attribute files.

============================================================================
'mcX' DIRECTORIES


In 'mcX' directories are EDAC control and attribute files for
this 'X' instance of the memory controllers:


Counter reset control file:

	'reset_counters'

	This write-only control file will zero all the statistical counters
	for UE and CE errors.  Zeroing the counters will also reset the timer
	indicating how long since the last counter zero.  This is useful
	for computing errors/time.  Since the counters are always reset at
	driver initialization time, no module/kernel parameter is available.

	RUN TIME: echo "anything" >/sys/devices/system/edac/mc/mc0/reset_counters

		This resets the counters on memory controller 0


Seconds since last counter reset attribute file:

	'seconds_since_reset'

	This attribute file displays how many seconds have elapsed since the
	last counter reset. This can be used with the error counters to
	measure error rates.
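
An error rate can be derived by dividing an error counter by the elapsed
time. The following is a minimal sketch that uses the 'ce_count' attribute
file of the same 'mcX' directory. The function name and the choice of
errors per hour as the unit are assumptions of this example:

```shell
# Print the number of correctable errors per hour (integer arithmetic,
# rounded down) for one memory controller directory.
ce_per_hour() {
    mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    ce=$(cat "$mc_dir/ce_count") || return 1
    secs=$(cat "$mc_dir/seconds_since_reset") || return 1
    # Right after a reset the elapsed time is 0; avoid dividing by zero
    [ "$secs" -gt 0 ] || { echo 0; return 0; }
    echo $(( ce * 3600 / secs ))
}
```

Because the result is rounded down, a low error rate measured over a long
period can print 0 even though 'ce_count' is non-zero.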



Memory Controller name attribute file:

	'mc_name'

	This attribute file displays the type of memory controller
	that is being utilized.


Total memory managed by this memory controller attribute file:

	'size_mb'

	This attribute file displays, in megabytes, the amount of memory
	that this instance of memory controller manages.


Total Uncorrectable Errors count attribute file:

	'ue_count'

	This attribute file displays the total count of uncorrectable
	errors that have occurred on this memory controller. If panic_on_ue
	is set this counter will not have a chance to increment,
	since EDAC will panic the system.


Total UE count that had no information attribute file:

	'ue_noinfo_count'

	This attribute file displays the number of UEs that
	have occurred with no information as to which DIMM
	slot is having errors.


Total Correctable Errors count attribute file:

	'ce_count'

	This attribute file displays the total count of correctable
	errors that have occurred on this memory controller. This
	count is very important to examine. CEs provide early
	indications that a DIMM is beginning to fail. This count
	field should be monitored for non-zero values, and any such
	values should be reported to the system administrator.


Total CE count that had no information attribute file:

	'ce_noinfo_count'

	This attribute file displays the number of CEs that
	have occurred with no information as to which DIMM slot
	is having errors. Memory is handicapped, but operational,
	yet no information is available to indicate which slot
	the failing memory is in. This count field should also
	be monitored for non-zero values.
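
Monitoring these counters for non-zero values can be done with a small
script run periodically. The following is a minimal sketch. The function
name and the wording of the warning line are assumptions of this example,
and how the warning reaches the system administrator is left open:

```shell
# Print a warning line for every non-zero error counter under each mcX
# directory. Returns 0 if all counters are zero, 1 otherwise.
check_mc_counters() {
    mc_root="${1:-/sys/devices/system/edac/mc}"
    status=0
    for mc in "$mc_root"/mc[0-9]*; do
        [ -d "$mc" ] || continue
        for name in ce_count ce_noinfo_count ue_count ue_noinfo_count; do
            [ -r "$mc/$name" ] || continue
            value=$(cat "$mc/$name")
            if [ "$value" -ne 0 ]; then
                echo "WARNING: $(basename "$mc") $name=$value"
                status=1
            fi
        done
    done
    return "$status"
}
```

The return value allows the function to be used directly in a condition,
for example to send a notification only when a counter is non-zero.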

Device Symlink:

	'device'

	Symlink to the memory controller device.

SDRAM memory scrubbing rate:

	'sdram_scrub_rate'

	Read/Write attribute file that controls memory scrubbing. The scrubbing
	rate is set by writing a minimum bandwidth in bytes/sec to the attribute
	file. The rate will be translated to an internal value that gives at
	least the specified rate.

	Reading the file will return the actual scrubbing rate employed.

	If configuration fails or memory scrubbing is not implemented, the value
	of the attribute file will be -1.
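
A script that reads the scrubbing rate has to treat -1 as "not available"
rather than as a rate. The following is a minimal sketch. The function name
and the output wording are assumptions of this example:

```shell
# Print the scrubbing rate of one memory controller, or a notice when the
# attribute reports -1 (configuration failed or scrubbing not implemented).
show_scrub_rate() {
    mc_dir="${1:-/sys/devices/system/edac/mc/mc0}"
    rate=$(cat "$mc_dir/sdram_scrub_rate") || return 1
    if [ "$rate" = "-1" ]; then
        echo "scrubbing not available"
    else
        echo "scrub rate: $rate bytes/sec"
    fi
}
```

Since the rate written is only a minimum, the value read back after a write
may be higher than the value requested.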



============================================================================
'csrowX' DIRECTORIES

In the 'csrowX' directories are EDAC control and attribute files for
this 'X' instance of csrow:


Total Uncorrectable Errors count attribute file:

	'ue_count'

	This attribute file displays the total count of uncorrectable
	errors that have occurred on this csrow. If panic_on_ue is set
	this counter will not have a chance to increment, since EDAC
	will panic the system.


Total Correctable Errors count attribute file:

	'ce_count'

	This attribute file displays the total count of correctable
	errors that have occurred on this csrow. This
	count is very important to examine. CEs provide early
	indications that a DIMM is beginning to fail. This count
	field should be monitored for non-zero values, and any such
	values should be reported to the system administrator.


Total memory managed by this csrow attribute file:

	'size_mb'

	This attribute file displays, in megabytes, the amount of memory
	that this csrow contains.


Memory Type attribute file:

	'mem_type'

	This attribute file will display what type of memory is currently
	on this csrow. Normally, either buffered or unbuffered memory.
	Examples:
		Registered-DDR
		Unbuffered-DDR


EDAC Mode of operation attribute file:

	'edac_mode'

	This attribute file will display what type of Error detection
	and correction is being utilized.


Device type attribute file:

	'dev_type'

	This attribute file will display what type of DRAM device is
	being utilized on this DIMM.
	Examples:
		x1
		x2
		x4
		x8


Channel 0 CE Count attribute file:

	'ch0_ce_count'

	This attribute file will display the count of CEs on this
	DIMM located in channel 0.


Channel 0 UE Count attribute file:

	'ch0_ue_count'

	This attribute file will display the count of UEs on this
	DIMM located in channel 0.


Channel 0 DIMM Label control file:

	'ch0_dimm_label'

	This control file allows this DIMM to have a label assigned
	to it. With this label in the module, when errors occur
	the output can provide the DIMM label in the system log.
	This becomes vital for panic events to isolate the
	cause of the UE event.

	DIMM Labels must be assigned after booting, with information
	that correctly identifies the physical slot with its
	silk screen label. This information is currently very
	motherboard specific and determination of this information
	must occur in userland at this time.


Channel 1 CE Count attribute file:

	'ch1_ce_count'

	This attribute file will display the count of CEs on this
	DIMM located in channel 1.


Channel 1 UE Count attribute file:

	'ch1_ue_count'

	This attribute file will display the count of UEs on this
	DIMM located in channel 1.


Channel 1 DIMM Label control file:

	'ch1_dimm_label'

	This control file allows this DIMM to have a label assigned
	to it. With this label in the module, when errors occur
	the output can provide the DIMM label in the system log.
	This becomes vital for panic events to isolate the
	cause of the UE event.

	DIMM Labels must be assigned after booting, with information
	that correctly identifies the physical slot with its
	silk screen label. This information is currently very
	motherboard specific and determination of this information
	must occur in userland at this time.
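
Since the labels have to be assigned from userland after booting, a startup
script can write them from a per-motherboard table. The following is a
minimal sketch. The function name and the mapping file format (one
"csrow channel label" triple per line) are hypothetical and not defined by
EDAC:

```shell
# Apply DIMM labels from a mapping file to one memory controller.
# Each line of the (hypothetical) mapping file has the form:
#   <csrow index> <channel index> <label>
# Blank lines and lines starting with '#' are ignored.
apply_dimm_labels() {
    map_file="$1"
    mc_dir="${2:-/sys/devices/system/edac/mc/mc0}"
    while read -r row chan label; do
        case "$row" in
            ''|'#'*) continue ;;
        esac
        target="$mc_dir/csrow$row/ch${chan}_dimm_label"
        if [ -e "$target" ]; then
            echo "$label" > "$target"
        else
            # Unpopulated csrows have no directory, so there is nothing to label
            echo "skipping missing $target" >&2
        fi
    done < "$map_file"
}
```

For the example table in the Memory Controller Model section, the mapping
file would hold lines such as "0 0 DIMM_A0" and "0 1 DIMM_B0".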

============================================================================
SYSTEM LOGGING

If logging for UEs and CEs is enabled then system logs will have
error notices indicating errors that have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first message above as the
example, is:
	the memory controller			(MC0)
	Error type				(CE)
	memory page				(0x283)
	offset in the page			(0xce0)
	the byte granularity			(grain 8)
		or resolution of the error
	the error syndrome			(0x6ec3)
	memory row				(row 0)
	memory channel				(channel 1)
	DIMM label, if set prior		("DIMM_B1")
	and then an optional, driver-specific message that may
		have additional information.

Both UEs and CEs with no info will lack all but memory controller,
error type, a notice of "no info" and then an optional,
driver-specific error message.
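
The fields of such a message can be extracted with a regular expression.
The following is a minimal sketch based only on the two sample messages
above. It assumes that each message arrives on a single line (the samples
are wrapped for readability) and that the quoted label is present. The
function name and the "name=value" output format are hypothetical:

```shell
# Read log lines on standard input and print the fields of every EDAC MC
# message that matches the sample format. Lines that do not match, such as
# "no info" messages, are silently dropped.
parse_edac_line() {
    sed -n 's/^.*EDAC MC\([0-9][0-9]*\): \([CU]E\) page \(0x[0-9a-fA-F]*\), offset \(0x[0-9a-fA-F]*\), grain \([0-9]*\), syndrome \(0x[0-9a-fA-F]*\), row \([0-9]*\), channel \([0-9]*\) "\([^"]*\)".*$/mc=\1 type=\2 page=\3 offset=\4 grain=\5 syndrome=\6 row=\7 channel=\8 label=\9/p'
}
```

The leading ".*" lets the expression skip any timestamp or host name that
the system logger places in front of the message.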


============================================================================
PCI Bus Parity Detection


On Header Type 00 devices the primary status is looked at
for any parity error regardless of whether Parity is enabled on the
device.  (The spec indicates parity is generated in some cases).
On Header Type 01 bridges, the secondary status register is also
looked at to see if a parity error occurred on the bus on the other
side of the bridge.


SYSFS CONFIGURATION

Under /sys/devices/system/edac/pci are control and attribute files as follows:


Enable/Disable PCI Parity checking control file:

	'check_pci_parity'


	This control file enables or disables the PCI Bus Parity scanning
	operation. Writing a 1 to this file enables the scanning. Writing
	a 0 to this file disables the scanning.

	Enable:
	echo "1" >/sys/devices/system/edac/pci/check_pci_parity

	Disable:
	echo "0" >/sys/devices/system/edac/pci/check_pci_parity


Parity Count:

	'pci_parity_count'

	This attribute file will display the number of parity errors that
	have been detected.


============================================================================
MODULE PARAMETERS

Panic on UE control file:

	'edac_mc_panic_on_ue'

	An uncorrectable error will cause a machine panic.  This is usually
	desirable.  It is a bad idea to continue when an uncorrectable error
	occurs - it is indeterminate what was uncorrected and the operating
	system context might be so mangled that continuing will lead to further
	corruption. If the kernel has MCE configured, then EDAC will never
	notice the UE.

	LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1]

	RUN TIME:  echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue


Log UE control file:

	'edac_mc_log_ue'

	Generate kernel messages describing uncorrectable errors.  These errors
	are reported through the system message log system.  UE statistics
	will be accumulated even when UE logging is disabled.

	LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1]

	RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue


Log CE control file:

	'edac_mc_log_ce'

	Generate kernel messages describing correctable errors.  These
	errors are reported through the system message log system.
	CE statistics will be accumulated even when CE logging is disabled.

	LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1]

	RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce


Polling period control file:

	'edac_mc_poll_msec'

	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default. Systems which require all the bandwidth they can get may
	increase this.

	LOAD TIME: module/kernel parameter: edac_mc_poll_msec=<milliseconds>

	RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec


Panic on PCI PARITY Error:

	'panic_on_pci_parity'


	This control file enables or disables panicking when a parity
	error has been detected.


	module/kernel parameter: edac_panic_on_pci_pe=[0|1]

	Enable:
	echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

	Disable:
	echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe



=======================================================================


EDAC_DEVICE type of device

In the header file, edac_core.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location /sys/devices/system/edac (sysfs) new edac_device devices will
appear.

There is a three level tree beneath the above 'edac' directory. For example,
the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website)
installs itself as:

	/sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more 'instance'
directories.

The standard default controls are:

	log_ce		boolean to log CE events
	log_ue		boolean to log UE events
	panic_on_ue	boolean to 'panic' the system if a UE is encountered
			(default off, can be set true via startup script)
	poll_msec	time period between POLL cycles for events

The test_device_edac device adds at least one custom control of its own:

	test_bits	which in the current test driver does nothing but
			show how it is installed. A ported driver can
			add one or more such controls and/or attributes
			for specific uses.
			One out-of-tree driver uses controls here to allow
			for ERROR INJECTION operations to hardware
			injection registers

The symlink points to the 'struct dev' that is registered for this edac_device.

INSTANCES

One or more instance directories are present. For the 'test_device_edac' case:

	test-instance0


In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

	ce_count	total of CE events of subdirectories
	ue_count	total of UE events of subdirectories

BLOCKS

At the lowest directory level is the 'block' directory. There can be 0, 1
or more blocks specified in each instance.

	test-block0


In this directory the default attributes are:

	ce_count	which is the counter of CE events for this 'block'
			of hardware being monitored
	ue_count	which is the counter of UE events for this 'block'
			of hardware being monitored
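
The three level tree can be walked from user space to report the counters
of every block. The following is a minimal sketch. The function name and
the output format are hypothetical, and a directory is treated as an
instance or a block only if it holds a readable 'ce_count' file, which is
an assumption of this example:

```shell
# Print "instance block ce=N ue=M" for every block of every instance found
# under one edac_device directory, which is given as the first argument.
walk_edac_device() {
    dev_dir="$1"
    for inst in "$dev_dir"/*/; do
        [ -r "${inst}ce_count" ] || continue        # not an instance directory
        for blk in "$inst"*/; do
            [ -r "${blk}ce_count" ] || continue     # not a block directory
            echo "$(basename "$inst") $(basename "$blk") ce=$(cat "${blk}ce_count") ue=$(cat "${blk}ue_count")"
        done
    done
}
```

For the test driver described above, the function would be called with the
/sys/devices/system/edac/test-instance directory as its argument.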


The 'test_device_edac' device adds 4 attributes and 1 control:

	test-block-bits-0	for every POLL cycle this counter
				is incremented
	test-block-bits-1	every 10 cycles, this counter is bumped once,
				and test-block-bits-0 is set to 0
	test-block-bits-2	every 100 cycles, this counter is bumped once,
				and test-block-bits-1 is set to 0
	test-block-bits-3	every 1000 cycles, this counter is bumped once,
				and test-block-bits-2 is set to 0


	reset-counters		writing anything to this control will
				reset all the above counters.


Use of the 'test_device_edac' driver should enable others to create their own
unique drivers for their hardware systems.

The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.