.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)
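
As a rough illustration of the availability metric described above, the
sketch below converts an observed downtime into an availability percentage.
The function name and the sample figures are illustrative assumptions and
are not part of any kernel interface.

.. code-block:: python

   def availability_percent(downtime_hours: float, period_hours: float) -> float:
       """Return the percentage of the period during which the system was up."""
       if period_hours <= 0:
           raise ValueError("period_hours must be positive")
       if not 0 <= downtime_hours <= period_hours:
           raise ValueError("downtime_hours must be within the period")
       return 100.0 * (period_hours - downtime_hours) / period_hours


   # Hypothetical sample: 8.76 hours of downtime over a 365-day year.
   print(f"{availability_percent(8.76, 365 * 24):.2f}%")

Expressing the result as uptime rather than downtime is only a presentation
choice; the downtime percentage is 100 minus the returned value.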

Improving RAS
-------------

To reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, so that the
system administrator can be warned to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify whether the probability of hardware errors is increasing and, in
that case, do preventive maintenance to replace a degraded component while
those errors are still correctable.
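
One possible way to spot an increasing error probability is to compare the
rate of detected errors across successive sampling intervals. The following
is a minimal sketch under assumed inputs: the sample lists, the function
name and the ``min_growth`` threshold are hypothetical, and a real policy
would be chosen by the system administrator.

.. code-block:: python

   from typing import Sequence


   def ce_rate_increasing(counts: Sequence[int], min_growth: int = 1) -> bool:
       """Tell whether the error rate grew over every sampling interval.

       counts holds cumulative correctable-error totals sampled at a fixed
       interval, oldest first. The per-interval rate is the difference
       between consecutive samples.
       """
       rates = [later - earlier for earlier, later in zip(counts, counts[1:])]
       if len(rates) < 2:
           return False  # not enough data to see a trend
       return all(b - a >= min_growth for a, b in zip(rates, rates[1:]))


   # Hypothetical cumulative counts taken once per day.
   print(ce_rate_increasing([0, 1, 3, 7, 15]))  # per-day rates 1, 2, 4, 8
   print(ce_rate_increasing([0, 2, 4, 6, 8]))   # constant rate of 2 per day

A steady but non-zero rate is reported as not increasing here; whether such
a rate already justifies a replacement is a separate policy decision.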

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors in a bit packet
is below a threshold. If the number of errors is above that threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors was above the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is also possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

	Memory Device
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 16384 MB
		Form Factor: SODIMM
		Set: None
		Locator: ChannelA-DIMM0
		Bank Locator: BANK 0
		Type: DDR4
		Type Detail: Synchronous
		Speed: 2133 MHz
		Rank: 2
		Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such system, the *total width* is equal to the
*data width*. It means that such memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

	Memory Device
		Array Handle: 0x1000
		Error Information Handle: Not Provided
		Total Width: 72 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: 1
		Locator: DIMM_A1
		Bank Locator: Not Specified
		Type: DDR3
		Type Detail: Synchronous Registered (Buffered)
		Speed: 1600 MHz
		Rank: 2
		Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).
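
The comparison between *total width* and *data width* can be automated.
The sketch below parses text in the format shown above, for instance saved
from ``dmidecode`` into a file. The function names are illustrative, and the
parser only assumes the ``Key: value`` layout seen in the two examples.

.. code-block:: python

   import re


   def parse_memory_devices(text: str) -> list[dict[str, str]]:
       """Split dmidecode text into one dict per "Memory Device" record."""
       devices = []
       current = None
       for line in text.splitlines():
           stripped = line.strip()
           if stripped == "Memory Device":
               current = {}
               devices.append(current)
           elif not stripped:
               current = None  # a blank line ends the record
           elif current is not None and ":" in stripped:
               key, _, value = stripped.partition(":")
               current[key.strip()] = value.strip()
       return devices


   def has_ecc_bits(device: dict[str, str]) -> bool:
       """True when the total width is larger than the data width."""
       widths = []
       for key in ("Total Width", "Data Width"):
           match = re.match(r"(\d+) bits", device.get(key, ""))
           if match is None:
               return False  # width unknown, so nothing can be concluded
           widths.append(int(match.group(1)))
       return widths[0] > widths[1]


   # Sample taken from the older server example above.
   sample = (
       "Memory Device\n"
       "\tTotal Width: 72 bits\n"
       "\tData Width: 64 bits\n"
       "\tLocator: DIMM_A1\n"
       "\tBank Locator: Not Specified\n"
   )
   for dev in parse_memory_devices(sample):
       print(dev["Locator"], "ECC" if has_ecc_bits(dev) else "non-ECC")

Which of the *locator* and *bank locator* fields carries the useful label
still has to be decided per system, as discussed above.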

To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits, which are used for the error detection and correction mechanisms,
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
to the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.
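
The write and read paths described above can be illustrated with a toy
single-error-correcting, double-error-detecting (SECDED) code. This is only
a sketch: it protects 4 data bits with 4 extra bits instead of the 64/72
layout of a real module, it is not the code used by any particular memory
controller, and the names ``encode`` and ``decode`` are illustrative. In
the code, the stored extra bits are called check bits, and ``syndrome`` is
the value recomputed at read time.

.. code-block:: python

   DATA_POSITIONS = (3, 5, 6, 7)   # word indexes that carry the data bits
   CHECK_POSITIONS = (1, 2, 4)     # word indexes that carry Hamming check bits


   def encode(nibble: int) -> list[int]:
       """Encode 4 data bits as an 8-bit extended Hamming word.

       Index 0 holds an overall parity bit; indexes 1..7 hold a
       Hamming(7,4) code word.
       """
       word = [0] * 8
       for bit_no, pos in enumerate(DATA_POSITIONS):
           word[pos] = (nibble >> bit_no) & 1
       for check in CHECK_POSITIONS:
           word[check] = sum(word[pos] for pos in range(1, 8) if pos & check) % 2
       word[0] = sum(word) % 2
       return word


   def decode(word: list[int]) -> tuple[int, str]:
       """Return (data, status), where status is "ok", "CE" or "UE"."""
       word = list(word)
       syndrome = 0
       for pos in range(1, 8):
           if word[pos]:
               syndrome ^= pos
       parity_ok = sum(word) % 2 == 0
       if syndrome == 0 and parity_ok:
           status = "ok"
       elif not parity_ok:
           # Odd number of flipped bits: assume a single flip and fix it.
           # A syndrome of 0 means the overall parity bit itself flipped.
           word[syndrome] ^= 1
           status = "CE"
       else:
           # Non-zero syndrome with good overall parity: two bits flipped.
           status = "UE"
       data = sum(word[pos] << bit_no for bit_no, pos in enumerate(DATA_POSITIONS))
       return data, status


   stored = encode(0b1011)
   one_flip = list(stored)
   one_flip[5] ^= 1
   two_flips = list(one_flip)
   two_flips[6] ^= 1
   for name, word in (("clean", stored), ("1 flip", one_flip), ("2 flips", two_flips)):
       data, status = decode(word)
       print(f"{name}: data={data:04b} status={status}")

Note that ``decode`` still returns a data word in the UE case, mirroring the
statement above that the word is sent to the CPU even when errors happen.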

The information about the CE/UE errors is stored in some special registers
at the memory controller and can be accessed by reading such registers,
either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86
64-bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation in a
  mode called "Lock-Step", where it groups two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. At read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/arch/x86/x86_64/machinecheck.rst in the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, in
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

	broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.
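
To see which devices currently have the attribute set, the per-device
directories can be walked from userspace. The sketch below assumes that the
devices are also reachable through ``/sys/bus/pci/devices`` and that the
attribute reads as ``1`` when set; both are assumptions of this example, as
is the function name, and the root directory can be overridden.

.. code-block:: python

   from pathlib import Path


   def devices_with_broken_parity(root: str = "/sys/bus/pci/devices") -> list[str]:
       """List PCI device addresses whose broken_parity_status reads as 1."""
       flagged = []
       for attr in sorted(Path(root).glob("*/broken_parity_status")):
           try:
               value = attr.read_text().strip()
           except OSError:
               continue  # device vanished or attribute unreadable
           if value == "1":
               flagged.append(attr.parent.name)
       return flagged


   print(devices_with_broken_parity())

On a system without PCI devices, or without sysfs mounted, the function
simply returns an empty list.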


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary. If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

	$ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

	======= ==============================
	mc	memory controller(s) system
	pci	PCI control and status system
	======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:

	+------------+-----------------------+
	| CS Rows    |       Channels        |
	+------------+-----------+-----------+
	|            |  ``ch0``  |  ``ch1``  |
	+============+===========+===========+
	|            |**DIMM_A0**|**DIMM_B0**|
	+------------+-----------+-----------+
	| ``csrow0`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow1`` |   rank1   |   rank1   |
	+------------+-----------+-----------+
	|            |**DIMM_A1**|**DIMM_B1**|
	+------------+-----------+-----------+
	| ``csrow2`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow3`` |   rank1   |   rank1   |
	+------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

	+---------+---------+
	| DIMM_A0 | DIMM_B0 |
	+---------+---------+
	| DIMM_A1 | DIMM_B1 |
	+---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above, 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module; see
``rankX`` directories below.
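
For the two-channel example table above, the relation between a
(csrow, channel) pair and a physical slot can be written down directly.
The sketch below encodes only that example layout: two channels labeled
``A`` and ``B`` and two csrows per slot. The function name is illustrative,
and real motherboards may use a different mapping.

.. code-block:: python

   def locate(csrow: int, channel: int) -> tuple[str, int]:
       """Map (csrow, channel) to (slot label, rank) for the example table."""
       if csrow < 0 or channel not in (0, 1):
           raise ValueError("csrow must be >= 0 and channel must be 0 or 1")
       # Two csrows per slot: the quotient selects the slot, the remainder
       # selects the rank inside that slot.
       slot, rank = divmod(csrow, 2)
       return f"DIMM_{'AB'[channel]}{slot}", rank


   for csrow in range(4):
       print(f"csrow{csrow}:", locate(csrow, 0), locate(csrow, 1))

With single ranked DIMMs, the odd csrows of this mapping are simply never
populated.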

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX``, where ``X`` is the csrow index::

	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.
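
The same observation can be made programmatically by listing the
``csrowX`` directories and looking for gaps. This is a sketch only: the
function names are illustrative, and the root directory is a parameter so
that the code can be pointed at a copy of the tree.

.. code-block:: python

   import re
   from pathlib import Path


   def populated_csrows(mc_root: str = "/sys/devices/system/edac/mc") -> dict[str, list[int]]:
       """Map each mcX directory name to the sorted csrow indexes it exposes."""
       result = {}
       for mc_dir in sorted(Path(mc_root).glob("mc[0-9]*")):
           if not mc_dir.is_dir():
               continue
           rows = []
           for entry in mc_dir.iterdir():
               match = re.fullmatch(r"csrow(\d+)", entry.name)
               if match and entry.is_dir():
                   rows.append(int(match.group(1)))
           result[mc_dir.name] = sorted(rows)
       return result


   def missing_csrows(rows: list[int]) -> list[int]:
       """Indexes below the highest populated csrow that have no directory."""
       return [i for i in range(max(rows, default=-1) + 1) if i not in rows]


   for mc, rows in populated_csrows().items():
       print(mc, "csrows:", rows, "missing:", missing_csrows(rows))

For the ``mc0`` listing shown above, ``missing_csrows([0, 2, 3])`` yields
``[1]``.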

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

	Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

	/sys/devices/system/edac/
	├── mc
	│   ├── mc0
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   ├── mc1
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   └── uevent
	└── uevent
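
A userspace monitor could walk the tree above and collect the per-module
error counters. The following is a minimal sketch that relies only on the
attribute names visible in the listing (``dimm_ce_count``,
``dimm_ue_count`` and ``dimm_label``). The function names are illustrative,
``rankX`` directories are not handled, and the root directory is a
parameter so the code can also run against a copy of the tree.

.. code-block:: python

   from pathlib import Path


   def read_attr(path: Path) -> str:
       """Return a sysfs attribute's text, or "" if it cannot be read."""
       try:
           return path.read_text().strip()
       except OSError:
           return ""


   def dimm_error_counts(edac_root: str = "/sys/devices/system/edac"):
       """Yield (mc, dimm, label, ce_count, ue_count) per dimmX directory."""
       for dimm in sorted(Path(edac_root).glob("mc/mc*/dimm*")):
           if not dimm.is_dir():
               continue
           ce = read_attr(dimm / "dimm_ce_count")
           ue = read_attr(dimm / "dimm_ue_count")
           yield (
               dimm.parent.name,
               dimm.name,
               read_attr(dimm / "dimm_label"),
               int(ce) if ce.isdigit() else 0,  # unreadable counts as 0
               int(ue) if ue.isdigit() else 0,
           )


   for mc, dimm, label, ce, ue in dimm_error_counts():
       print(f"{mc}/{dimm} ({label or 'unlabeled'}): CE={ce} UE={ue}")

Treating an unreadable counter as 0 keeps the sketch short; a production
monitor would more likely report such a condition separately.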
49062306a36Sopenharmony_ci
49162306a36Sopenharmony_ciIn the ``dimmX`` directories are EDAC control and attribute files for
49262306a36Sopenharmony_cithis ``X`` memory module:
49362306a36Sopenharmony_ci
49462306a36Sopenharmony_ci- ``size`` - Total memory managed by this csrow attribute file
49562306a36Sopenharmony_ci
49662306a36Sopenharmony_ci	This attribute file displays, in count of megabytes, the memory
49762306a36Sopenharmony_ci	that this csrow contains.
49862306a36Sopenharmony_ci
49962306a36Sopenharmony_ci- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
50062306a36Sopenharmony_ci
50162306a36Sopenharmony_ci	This attribute file displays the total count of uncorrectable
50262306a36Sopenharmony_ci	errors that have occurred on this DIMM. If panic_on_ue is set
50362306a36Sopenharmony_ci	this counter will not have a chance to increment, since EDAC
50462306a36Sopenharmony_ci	will panic the system.
50562306a36Sopenharmony_ci
50662306a36Sopenharmony_ci- ``dimm_ce_count`` - Correctable Errors count attribute file
50762306a36Sopenharmony_ci
50862306a36Sopenharmony_ci	This attribute file displays the total count of correctable
50962306a36Sopenharmony_ci	errors that have occurred on this DIMM. This count is very
51062306a36Sopenharmony_ci	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
51462306a36Sopenharmony_ci
51562306a36Sopenharmony_ci- ``dimm_dev_type``  - Device type attribute file
51662306a36Sopenharmony_ci
51762306a36Sopenharmony_ci	This attribute file will display what type of DRAM device is
51862306a36Sopenharmony_ci	being utilized on this DIMM.
51962306a36Sopenharmony_ci	Examples:
52062306a36Sopenharmony_ci
52162306a36Sopenharmony_ci		- x1
52262306a36Sopenharmony_ci		- x2
52362306a36Sopenharmony_ci		- x4
52462306a36Sopenharmony_ci		- x8
52562306a36Sopenharmony_ci
52662306a36Sopenharmony_ci- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
52762306a36Sopenharmony_ci
52862306a36Sopenharmony_ci	This attribute file will display what type of Error detection
52962306a36Sopenharmony_ci	and correction is being utilized.
53062306a36Sopenharmony_ci
53162306a36Sopenharmony_ci- ``dimm_label`` - memory module label control file
53262306a36Sopenharmony_ci
53362306a36Sopenharmony_ci	This control file allows this DIMM to have a label assigned
53462306a36Sopenharmony_ci	to it. With this label in the module, when errors occur
53562306a36Sopenharmony_ci	the output can provide the DIMM label in the system log.
53662306a36Sopenharmony_ci	This becomes vital for panic events to isolate the
53762306a36Sopenharmony_ci	cause of the UE event.
53862306a36Sopenharmony_ci
53962306a36Sopenharmony_ci	DIMM Labels must be assigned after booting, with information
54062306a36Sopenharmony_ci	that correctly identifies the physical slot with its
54162306a36Sopenharmony_ci	silk screen label. This information is currently very
54262306a36Sopenharmony_ci	motherboard specific and determination of this information
54362306a36Sopenharmony_ci	must occur in userland at this time.
54462306a36Sopenharmony_ci
54562306a36Sopenharmony_ci- ``dimm_location`` - location of the memory module
54662306a36Sopenharmony_ci
	The location can have up to 3 levels, and describes how the
	memory controller identifies the location of a memory module.
	Depending on the type of memory and memory controller, it
	can be:

		- *csrow* and *channel* - used when the memory controller
		  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
		  controllers;
		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
55762306a36Sopenharmony_ci
55862306a36Sopenharmony_ci- ``dimm_mem_type`` - Memory Type attribute file
55962306a36Sopenharmony_ci
	This attribute file will display what type of memory is currently
	on this DIMM. Normally, either buffered or unbuffered memory.
56262306a36Sopenharmony_ci	Examples:
56362306a36Sopenharmony_ci
56462306a36Sopenharmony_ci		- Registered-DDR
56562306a36Sopenharmony_ci		- Unbuffered-DDR
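
The counters above lend themselves to a periodic check. The following is
a minimal sketch, not part of EDAC itself: it walks the ``dimmX``
directories and flags every DIMM whose ``dimm_ce_count`` is non-zero.
The function name ``edac_dimm_report`` and its output format are
illustrative assumptions; the optional argument only exists so that the
function can be pointed at a different sysfs root.

.. code-block:: sh

    # Print one line per DIMM: sysfs location, label and error counters.
    edac_dimm_report() {
        root=${1:-/sys/devices/system/edac/mc}
        for dimm in "$root"/mc*/dimm*; do
            # The pattern stays unexpanded when no dimmX directory exists.
            [ -d "$dimm" ] || continue
            mc=${dimm%/*}
            label=$(cat "$dimm/dimm_label" 2>/dev/null)
            ce=$(cat "$dimm/dimm_ce_count" 2>/dev/null)
            ue=$(cat "$dimm/dimm_ue_count" 2>/dev/null)
            # Any correctable error is worth reporting.
            if [ "${ce:-0}" -gt 0 ]; then status=check; else status=ok; fi
            printf '%s/%s label=%s ce=%s ue=%s %s\n' \
                "${mc##*/}" "${dimm##*/}" "${label:-unset}" \
                "${ce:-0}" "${ue:-0}" "$status"
        done
    }

Called without arguments, the function reads the counters from
``/sys/devices/system/edac/mc``. Lines ending in ``check`` are the ones
to bring to the administrator's attention.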
56662306a36Sopenharmony_ci
.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is
  called ``rankX`` and works in a similar way to the ``csrowX``
  directories. On modern Intel memory controllers, the memory
  controller identifies the memory modules directly. On such systems,
  the directory is called ``dimmX``.
57162306a36Sopenharmony_ci
57262306a36Sopenharmony_ci.. [#f6] There are also some ``power`` directories and ``subsystem``
57362306a36Sopenharmony_ci  symlinks inside the sysfs mapping that are automatically created by
57462306a36Sopenharmony_ci  the sysfs subsystem. Currently, they serve no purpose.
57562306a36Sopenharmony_ci
57662306a36Sopenharmony_ci``csrowX`` directories
57762306a36Sopenharmony_ci----------------------
57862306a36Sopenharmony_ci
57962306a36Sopenharmony_ciWhen CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
58062306a36Sopenharmony_cidirectories. As this API doesn't work properly for Rambus, FB-DIMMs and
58162306a36Sopenharmony_cimodern Intel Memory Controllers, this is being deprecated in favor of
58262306a36Sopenharmony_ci``dimmX`` directories.
58362306a36Sopenharmony_ci
58462306a36Sopenharmony_ciIn the ``csrowX`` directories are EDAC control and attribute files for
58562306a36Sopenharmony_cithis ``X`` instance of csrow:
58662306a36Sopenharmony_ci
58762306a36Sopenharmony_ci
58862306a36Sopenharmony_ci- ``ue_count`` - Total Uncorrectable Errors count attribute file
58962306a36Sopenharmony_ci
59062306a36Sopenharmony_ci	This attribute file displays the total count of uncorrectable
59162306a36Sopenharmony_ci	errors that have occurred on this csrow. If panic_on_ue is set
59262306a36Sopenharmony_ci	this counter will not have a chance to increment, since EDAC
59362306a36Sopenharmony_ci	will panic the system.
59462306a36Sopenharmony_ci
59562306a36Sopenharmony_ci
59662306a36Sopenharmony_ci- ``ce_count`` - Total Correctable Errors count attribute file
59762306a36Sopenharmony_ci
59862306a36Sopenharmony_ci	This attribute file displays the total count of correctable
59962306a36Sopenharmony_ci	errors that have occurred on this csrow. This count is very
60062306a36Sopenharmony_ci	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and any such values should be
	reported to the system administrator.
60462306a36Sopenharmony_ci
60562306a36Sopenharmony_ci
60662306a36Sopenharmony_ci- ``size_mb`` - Total memory managed by this csrow attribute file
60762306a36Sopenharmony_ci
60862306a36Sopenharmony_ci	This attribute file displays, in count of megabytes, the memory
60962306a36Sopenharmony_ci	that this csrow contains.
61062306a36Sopenharmony_ci
61162306a36Sopenharmony_ci
61262306a36Sopenharmony_ci- ``mem_type`` - Memory Type attribute file
61362306a36Sopenharmony_ci
61462306a36Sopenharmony_ci	This attribute file will display what type of memory is currently
61562306a36Sopenharmony_ci	on this csrow. Normally, either buffered or unbuffered memory.
61662306a36Sopenharmony_ci	Examples:
61762306a36Sopenharmony_ci
61862306a36Sopenharmony_ci		- Registered-DDR
61962306a36Sopenharmony_ci		- Unbuffered-DDR
62062306a36Sopenharmony_ci
62162306a36Sopenharmony_ci
62262306a36Sopenharmony_ci- ``edac_mode`` - EDAC Mode of operation attribute file
62362306a36Sopenharmony_ci
62462306a36Sopenharmony_ci	This attribute file will display what type of Error detection
62562306a36Sopenharmony_ci	and correction is being utilized.
62662306a36Sopenharmony_ci
62762306a36Sopenharmony_ci
62862306a36Sopenharmony_ci- ``dev_type`` - Device type attribute file
62962306a36Sopenharmony_ci
63062306a36Sopenharmony_ci	This attribute file will display what type of DRAM device is
63162306a36Sopenharmony_ci	being utilized on this DIMM.
63262306a36Sopenharmony_ci	Examples:
63362306a36Sopenharmony_ci
63462306a36Sopenharmony_ci		- x1
63562306a36Sopenharmony_ci		- x2
63662306a36Sopenharmony_ci		- x4
63762306a36Sopenharmony_ci		- x8
63862306a36Sopenharmony_ci
63962306a36Sopenharmony_ci
64062306a36Sopenharmony_ci- ``ch0_ce_count`` - Channel 0 CE Count attribute file
64162306a36Sopenharmony_ci
64262306a36Sopenharmony_ci	This attribute file will display the count of CEs on this
64362306a36Sopenharmony_ci	DIMM located in channel 0.
64462306a36Sopenharmony_ci
64562306a36Sopenharmony_ci
64662306a36Sopenharmony_ci- ``ch0_ue_count`` - Channel 0 UE Count attribute file
64762306a36Sopenharmony_ci
64862306a36Sopenharmony_ci	This attribute file will display the count of UEs on this
64962306a36Sopenharmony_ci	DIMM located in channel 0.
65062306a36Sopenharmony_ci
65162306a36Sopenharmony_ci
65262306a36Sopenharmony_ci- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
65362306a36Sopenharmony_ci
65462306a36Sopenharmony_ci
65562306a36Sopenharmony_ci	This control file allows this DIMM to have a label assigned
65662306a36Sopenharmony_ci	to it. With this label in the module, when errors occur
65762306a36Sopenharmony_ci	the output can provide the DIMM label in the system log.
65862306a36Sopenharmony_ci	This becomes vital for panic events to isolate the
65962306a36Sopenharmony_ci	cause of the UE event.
66062306a36Sopenharmony_ci
66162306a36Sopenharmony_ci	DIMM Labels must be assigned after booting, with information
66262306a36Sopenharmony_ci	that correctly identifies the physical slot with its
66362306a36Sopenharmony_ci	silk screen label. This information is currently very
66462306a36Sopenharmony_ci	motherboard specific and determination of this information
66562306a36Sopenharmony_ci	must occur in userland at this time.
66662306a36Sopenharmony_ci
66762306a36Sopenharmony_ci
66862306a36Sopenharmony_ci- ``ch1_ce_count`` - Channel 1 CE Count attribute file
66962306a36Sopenharmony_ci
67062306a36Sopenharmony_ci
67162306a36Sopenharmony_ci	This attribute file will display the count of CEs on this
67262306a36Sopenharmony_ci	DIMM located in channel 1.
67362306a36Sopenharmony_ci
67462306a36Sopenharmony_ci
67562306a36Sopenharmony_ci- ``ch1_ue_count`` - Channel 1 UE Count attribute file
67662306a36Sopenharmony_ci
67762306a36Sopenharmony_ci
	This attribute file will display the count of UEs on this
	DIMM located in channel 1.
68062306a36Sopenharmony_ci
68162306a36Sopenharmony_ci
68262306a36Sopenharmony_ci- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
68362306a36Sopenharmony_ci
68462306a36Sopenharmony_ci	This control file allows this DIMM to have a label assigned
68562306a36Sopenharmony_ci	to it. With this label in the module, when errors occur
68662306a36Sopenharmony_ci	the output can provide the DIMM label in the system log.
68762306a36Sopenharmony_ci	This becomes vital for panic events to isolate the
68862306a36Sopenharmony_ci	cause of the UE event.
68962306a36Sopenharmony_ci
69062306a36Sopenharmony_ci	DIMM Labels must be assigned after booting, with information
69162306a36Sopenharmony_ci	that correctly identifies the physical slot with its
69262306a36Sopenharmony_ci	silk screen label. This information is currently very
69362306a36Sopenharmony_ci	motherboard specific and determination of this information
69462306a36Sopenharmony_ci	must occur in userland at this time.
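
Because the labels have to be assigned from userland after each boot, a
small script is a natural fit. The sketch below assumes a hypothetical
input format of one ``<csrow> <channel> <label>`` triple per line, which
is not defined by EDAC, and writes each label to the matching
``chN_dimm_label`` control file. The function name
``edac_set_csrow_labels`` is likewise only illustrative.

.. code-block:: sh

    # Usage: edac_set_csrow_labels <mcX directory> < label-table
    edac_set_csrow_labels() {
        mc=$1
        # Each input line: <csrow index> <channel index> <label>
        while read -r row ch label; do
            # Ignore blank lines.
            [ -n "$row" ] || continue
            f="$mc/csrow$row/ch${ch}_dimm_label"
            if [ -w "$f" ]; then
                printf '%s\n' "$label" > "$f"
            else
                echo "skip: $f is not writable" >&2
            fi
        done
    }

    # Example (the label table is motherboard specific):
    #   printf '0 0 DIMM_A1\n0 1 DIMM_B1\n' |
    #       edac_set_csrow_labels /sys/devices/system/edac/mc/mc0

The mapping between csrow/channel pairs and silk screen labels still has
to come from the motherboard documentation; the script only automates
the writes.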
69562306a36Sopenharmony_ci
69662306a36Sopenharmony_ci
69762306a36Sopenharmony_ciSystem Logging
69862306a36Sopenharmony_ci--------------
69962306a36Sopenharmony_ci
70062306a36Sopenharmony_ciIf logging for UEs and CEs is enabled, then system logs will contain
70162306a36Sopenharmony_ciinformation indicating that errors have been detected::
70262306a36Sopenharmony_ci
70362306a36Sopenharmony_ci  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
70462306a36Sopenharmony_ci  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
70562306a36Sopenharmony_ci
70662306a36Sopenharmony_ci
70762306a36Sopenharmony_ciThe structure of the message is:
70862306a36Sopenharmony_ci
70962306a36Sopenharmony_ci	+---------------------------------------+-------------+
71062306a36Sopenharmony_ci	| Content                               | Example     |
71162306a36Sopenharmony_ci	+=======================================+=============+
71262306a36Sopenharmony_ci	| The memory controller                 | MC0         |
71362306a36Sopenharmony_ci	+---------------------------------------+-------------+
71462306a36Sopenharmony_ci	| Error type                            | CE          |
71562306a36Sopenharmony_ci	+---------------------------------------+-------------+
71662306a36Sopenharmony_ci	| Memory page                           | 0x283       |
71762306a36Sopenharmony_ci	+---------------------------------------+-------------+
71862306a36Sopenharmony_ci	| Offset in the page                    | 0xce0       |
71962306a36Sopenharmony_ci	+---------------------------------------+-------------+
72062306a36Sopenharmony_ci	| The byte granularity                  | grain 8     |
72162306a36Sopenharmony_ci	| or resolution of the error            |             |
72262306a36Sopenharmony_ci	+---------------------------------------+-------------+
	| The error syndrome                    | 0x6ec3      |
72462306a36Sopenharmony_ci	+---------------------------------------+-------------+
72562306a36Sopenharmony_ci	| Memory row                            | row 0       |
72662306a36Sopenharmony_ci	+---------------------------------------+-------------+
72762306a36Sopenharmony_ci	| Memory channel                        | channel 1   |
72862306a36Sopenharmony_ci	+---------------------------------------+-------------+
	| DIMM label, if set prior              | DIMM_B1     |
73062306a36Sopenharmony_ci	+---------------------------------------+-------------+
73162306a36Sopenharmony_ci	| And then an optional, driver-specific |             |
73262306a36Sopenharmony_ci	| message that may have additional      |             |
73362306a36Sopenharmony_ci	| information.                          |             |
73462306a36Sopenharmony_ci	+---------------------------------------+-------------+
73562306a36Sopenharmony_ci
UEs and CEs with no info contain only the memory controller, the error
type, a notice of "no info" and then an optional, driver-specific error
message.
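
The message structure described above can be split into its fields with
a few lines of shell. This is only a sketch under two assumptions: the
line has exactly the layout of the sample messages shown earlier, and
the DIMM label is set and contains no whitespace. The function name
``edac_parse_line`` and the ``key=value`` output are illustrative.

.. code-block:: sh

    # Split one "EDAC MCx: CE page ..." message into key=value pairs.
    edac_parse_line() {
        case $1 in
            *"EDAC MC"*) ;;
            *) return 1 ;;
        esac
        # Drop any prefix (e.g. a timestamp) up to and including "EDAC ".
        line=${1#*EDAC }
        # Remove the separators so the fields split on whitespace:
        # $1=MC0 $2=CE $3=page $4=<page> $5=offset $6=<offset> ...
        set -- $(printf '%s\n' "$line" | tr -d ',:"')
        printf 'mc=%s type=%s page=%s offset=%s grain=%s ' \
            "$1" "$2" "$4" "$6" "$8"
        printf 'syndrome=%s row=%s channel=%s label=%s\n' \
            "${10}" "${12}" "${14}" "${15}"
    }

Feeding it the first sample line yields ``mc=MC0 type=CE page=0x283
offset=0xce0 grain=8 syndrome=0x6ec3 row=0 channel=1 label=DIMM_B1``.
Messages with no info do not carry these fields and are not handled by
this sketch.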
73962306a36Sopenharmony_ci
74062306a36Sopenharmony_ci
74162306a36Sopenharmony_ciPCI Bus Parity Detection
74262306a36Sopenharmony_ci------------------------
74362306a36Sopenharmony_ci
74462306a36Sopenharmony_ciOn Header Type 00 devices, the primary status is looked at for any
74562306a36Sopenharmony_ciparity error regardless of whether parity is enabled on the device or
74662306a36Sopenharmony_cinot. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.
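
To make the status check concrete, the sketch below decodes the two
parity-related bits of a PCI status word as defined by the PCI Local Bus
specification: bit 15 (Detected Parity Error) and bit 8 (Master Data
Parity Error). It illustrates the register layout only; which bits the
EDAC code actually tests is not stated here, and the function name
``pci_status_parity_bits`` is an assumption of this example.

.. code-block:: sh

    # Decode the parity bits of a 16-bit PCI (primary or secondary)
    # status value, given in decimal or as 0x-prefixed hex.
    pci_status_parity_bits() {
        status=$(( $1 ))
        out=
        [ $(( status & 0x8000 )) -ne 0 ] && out="detected-parity-error"
        [ $(( status & 0x0100 )) -ne 0 ] &&
            out="${out:+$out }master-data-parity-error"
        printf '%s\n' "${out:-none}"
    }

    # Example, assuming pciutils is installed and 00:1f.0 is a
    # placeholder for a real device (0x06 is the primary status offset):
    #   pci_status_parity_bits "0x$(setpci -s 00:1f.0 06.w)"

On Header Type 01 bridges the same bit positions apply to the secondary
status register, which sits at offset 0x1e in configuration space.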
74962306a36Sopenharmony_ci
75062306a36Sopenharmony_ci
75162306a36Sopenharmony_ciSysfs configuration
75262306a36Sopenharmony_ci-------------------
75362306a36Sopenharmony_ci
75462306a36Sopenharmony_ciUnder ``/sys/devices/system/edac/pci`` are control and attribute files as
75562306a36Sopenharmony_cifollows:
75662306a36Sopenharmony_ci
75762306a36Sopenharmony_ci
75862306a36Sopenharmony_ci- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
75962306a36Sopenharmony_ci
76062306a36Sopenharmony_ci	This control file enables or disables the PCI Bus Parity scanning
76162306a36Sopenharmony_ci	operation. Writing a 1 to this file enables the scanning. Writing
76262306a36Sopenharmony_ci	a 0 to this file disables the scanning.
76362306a36Sopenharmony_ci
76462306a36Sopenharmony_ci	Enable::
76562306a36Sopenharmony_ci
76662306a36Sopenharmony_ci		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
76762306a36Sopenharmony_ci
76862306a36Sopenharmony_ci	Disable::
76962306a36Sopenharmony_ci
77062306a36Sopenharmony_ci		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
77162306a36Sopenharmony_ci
77262306a36Sopenharmony_ci
77362306a36Sopenharmony_ci- ``pci_parity_count`` - Parity Count
77462306a36Sopenharmony_ci
77562306a36Sopenharmony_ci	This attribute file will display the number of parity errors that
77662306a36Sopenharmony_ci	have been detected.
77762306a36Sopenharmony_ci
77862306a36Sopenharmony_ci
77962306a36Sopenharmony_ciModule parameters
78062306a36Sopenharmony_ci-----------------
78162306a36Sopenharmony_ci
78262306a36Sopenharmony_ci- ``edac_mc_panic_on_ue`` - Panic on UE control file
78362306a36Sopenharmony_ci
78462306a36Sopenharmony_ci	An uncorrectable error will cause a machine panic.  This is usually
78562306a36Sopenharmony_ci	desirable.  It is a bad idea to continue when an uncorrectable error
78662306a36Sopenharmony_ci	occurs - it is indeterminate what was uncorrected and the operating
78762306a36Sopenharmony_ci	system context might be so mangled that continuing will lead to further
78862306a36Sopenharmony_ci	corruption. If the kernel has MCE configured, then EDAC will never
78962306a36Sopenharmony_ci	notice the UE.
79062306a36Sopenharmony_ci
79162306a36Sopenharmony_ci	LOAD TIME::
79262306a36Sopenharmony_ci
79362306a36Sopenharmony_ci		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
79462306a36Sopenharmony_ci
79562306a36Sopenharmony_ci	RUN TIME::
79662306a36Sopenharmony_ci
79762306a36Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
79862306a36Sopenharmony_ci
79962306a36Sopenharmony_ci
80062306a36Sopenharmony_ci- ``edac_mc_log_ue`` - Log UE control file
80162306a36Sopenharmony_ci
80262306a36Sopenharmony_ci
	Generate kernel messages describing uncorrectable errors.  These errors
	are reported through the system message log.  UE statistics
	will be accumulated even when UE logging is disabled.
80662306a36Sopenharmony_ci
80762306a36Sopenharmony_ci	LOAD TIME::
80862306a36Sopenharmony_ci
80962306a36Sopenharmony_ci		module/kernel parameter: edac_mc_log_ue=[0|1]
81062306a36Sopenharmony_ci
81162306a36Sopenharmony_ci	RUN TIME::
81262306a36Sopenharmony_ci
81362306a36Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
81462306a36Sopenharmony_ci
81562306a36Sopenharmony_ci
81662306a36Sopenharmony_ci- ``edac_mc_log_ce`` - Log CE control file
81762306a36Sopenharmony_ci
81862306a36Sopenharmony_ci
	Generate kernel messages describing correctable errors.  These
	errors are reported through the system message log.
	CE statistics will be accumulated even when CE logging is disabled.
82262306a36Sopenharmony_ci
82362306a36Sopenharmony_ci	LOAD TIME::
82462306a36Sopenharmony_ci
82562306a36Sopenharmony_ci		module/kernel parameter: edac_mc_log_ce=[0|1]
82662306a36Sopenharmony_ci
82762306a36Sopenharmony_ci	RUN TIME::
82862306a36Sopenharmony_ci
82962306a36Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
83062306a36Sopenharmony_ci
83162306a36Sopenharmony_ci
83262306a36Sopenharmony_ci- ``edac_mc_poll_msec`` - Polling period control file
83362306a36Sopenharmony_ci
83462306a36Sopenharmony_ci
	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default. Systems which require all the bandwidth they can get may
	increase this.
84162306a36Sopenharmony_ci
84262306a36Sopenharmony_ci	LOAD TIME::
84362306a36Sopenharmony_ci
		module/kernel parameter: edac_mc_poll_msec=<milliseconds>
84562306a36Sopenharmony_ci
84662306a36Sopenharmony_ci	RUN TIME::
84762306a36Sopenharmony_ci
84862306a36Sopenharmony_ci		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
84962306a36Sopenharmony_ci
85062306a36Sopenharmony_ci
85162306a36Sopenharmony_ci- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
85262306a36Sopenharmony_ci
85362306a36Sopenharmony_ci
85462306a36Sopenharmony_ci	This control file enables or disables panicking when a parity
85562306a36Sopenharmony_ci	error has been detected.
85662306a36Sopenharmony_ci
85762306a36Sopenharmony_ci
85862306a36Sopenharmony_ci	module/kernel parameter::
85962306a36Sopenharmony_ci
86062306a36Sopenharmony_ci			edac_panic_on_pci_pe=[0|1]
86162306a36Sopenharmony_ci
86262306a36Sopenharmony_ci	Enable::
86362306a36Sopenharmony_ci
86462306a36Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
86562306a36Sopenharmony_ci
86662306a36Sopenharmony_ci	Disable::
86762306a36Sopenharmony_ci
86862306a36Sopenharmony_ci		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
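
Before changing any of these parameters it helps to see their current
values. The sketch below simply reads the files named in this section
from ``/sys/module/edac_core/parameters``. The function name
``edac_show_params`` is illustrative, and the optional directory
argument is an assumption added so that the function can be exercised
against a copy of the tree.

.. code-block:: sh

    # Print the current value of each EDAC core parameter listed above.
    edac_show_params() {
        dir=${1:-/sys/module/edac_core/parameters}
        for p in edac_mc_panic_on_ue edac_mc_log_ue edac_mc_log_ce \
                 edac_mc_poll_msec edac_panic_on_pci_pe; do
            if [ -r "$dir/$p" ]; then
                printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
            else
                # Not present on this kernel, or not readable.
                printf '%s=<unavailable>\n' "$p"
            fi
        done
    }

A parameter reported as ``<unavailable>`` usually means that the
corresponding file does not exist on the running kernel.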
86962306a36Sopenharmony_ci
87062306a36Sopenharmony_ci
87162306a36Sopenharmony_ci
87262306a36Sopenharmony_ciEDAC device type
87362306a36Sopenharmony_ci----------------
87462306a36Sopenharmony_ci
87562306a36Sopenharmony_ciIn the header file, edac_pci.h, there is a series of edac_device structures
87662306a36Sopenharmony_ciand APIs for the EDAC_DEVICE.
87762306a36Sopenharmony_ci
87862306a36Sopenharmony_ciUser space access to an edac_device is through the sysfs interface.
87962306a36Sopenharmony_ci
88062306a36Sopenharmony_ciAt the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
88162306a36Sopenharmony_ciwill appear.
88262306a36Sopenharmony_ci
There is a three level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::
88662306a36Sopenharmony_ci
88762306a36Sopenharmony_ci	/sys/devices/system/edac/test-instance
88862306a36Sopenharmony_ci
In this directory are various controls, a symlink and one or more ``instance``
directories.
89162306a36Sopenharmony_ci
89262306a36Sopenharmony_ciThe standard default controls are:
89362306a36Sopenharmony_ci
89462306a36Sopenharmony_ci	==============	=======================================================
89562306a36Sopenharmony_ci	log_ce		boolean to log CE events
89662306a36Sopenharmony_ci	log_ue		boolean to log UE events
	panic_on_ue	boolean to ``panic`` the system if a UE is encountered
89862306a36Sopenharmony_ci			(default off, can be set true via startup script)
89962306a36Sopenharmony_ci	poll_msec	time period between POLL cycles for events
90062306a36Sopenharmony_ci	==============	=======================================================
90162306a36Sopenharmony_ci
The test_device_edac device adds at least one custom control of its own:
90362306a36Sopenharmony_ci
90462306a36Sopenharmony_ci	==============	==================================================
90562306a36Sopenharmony_ci	test_bits	which in the current test driver does nothing but
90662306a36Sopenharmony_ci			show how it is installed. A ported driver can
90762306a36Sopenharmony_ci			add one or more such controls and/or attributes
90862306a36Sopenharmony_ci			for specific uses.
90962306a36Sopenharmony_ci			One out-of-tree driver uses controls here to allow
91062306a36Sopenharmony_ci			for ERROR INJECTION operations to hardware
91162306a36Sopenharmony_ci			injection registers
91262306a36Sopenharmony_ci	==============	==================================================
91362306a36Sopenharmony_ci
91462306a36Sopenharmony_ciThe symlink points to the 'struct dev' that is registered for this edac_device.
91562306a36Sopenharmony_ci
91662306a36Sopenharmony_ciInstances
91762306a36Sopenharmony_ci---------
91862306a36Sopenharmony_ci
91962306a36Sopenharmony_ciOne or more instance directories are present. For the ``test_device_edac``
92062306a36Sopenharmony_cicase:
92162306a36Sopenharmony_ci
92262306a36Sopenharmony_ci	+----------------+
92362306a36Sopenharmony_ci	| test-instance0 |
92462306a36Sopenharmony_ci	+----------------+
92562306a36Sopenharmony_ci
92662306a36Sopenharmony_ci
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
92962306a36Sopenharmony_ci
93062306a36Sopenharmony_ci	==============	====================================
93162306a36Sopenharmony_ci	ce_count	total of CE events of subdirectories
93262306a36Sopenharmony_ci	ue_count	total of UE events of subdirectories
93362306a36Sopenharmony_ci	==============	====================================
93462306a36Sopenharmony_ci
93562306a36Sopenharmony_ciBlocks
93662306a36Sopenharmony_ci------
93762306a36Sopenharmony_ci
93862306a36Sopenharmony_ciAt the lowest directory level is the ``block`` directory. There can be 0, 1
93962306a36Sopenharmony_cior more blocks specified in each instance:
94062306a36Sopenharmony_ci
94162306a36Sopenharmony_ci	+-------------+
94262306a36Sopenharmony_ci	| test-block0 |
94362306a36Sopenharmony_ci	+-------------+
94462306a36Sopenharmony_ci
94562306a36Sopenharmony_ciIn this directory the default attributes are:
94662306a36Sopenharmony_ci
94762306a36Sopenharmony_ci	==============	================================================
94862306a36Sopenharmony_ci	ce_count	which is counter of CE events for this ``block``
94962306a36Sopenharmony_ci			of hardware being monitored
95062306a36Sopenharmony_ci	ue_count	which is counter of UE events for this ``block``
95162306a36Sopenharmony_ci			of hardware being monitored
95262306a36Sopenharmony_ci	==============	================================================
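
The device, instance and block levels can be walked with nested loops.
The following sketch prints the ``ce_count`` and ``ue_count`` of every
instance and block below one device directory and sums the block
``ce_count`` values, so the sum can be compared with the instance total.
The function name ``edac_device_counters`` and the output layout are
assumptions made for illustration.

.. code-block:: sh

    # Usage: edac_device_counters <edac_device directory>
    edac_device_counters() {
        dev=$1
        for inst in "$dev"/*/; do
            inst=${inst%/}
            # Skip the device symlink and directories without counters.
            [ -L "$inst" ] && continue
            [ -r "$inst/ce_count" ] || continue
            printf '%s ce=%s ue=%s\n' "${inst##*/}" \
                "$(cat "$inst/ce_count")" "$(cat "$inst/ue_count")"
            sum=0
            for blk in "$inst"/*/; do
                blk=${blk%/}
                [ -r "$blk/ce_count" ] || continue
                n=$(cat "$blk/ce_count")
                sum=$(( sum + n ))
                printf '  %s ce=%s ue=%s\n' "${blk##*/}" \
                    "$n" "$(cat "$blk/ue_count")"
            done
            printf '  sum of block ce_count=%s\n' "$sum"
        done
    }

For the test driver this would be called as ``edac_device_counters
/sys/devices/system/edac/test-instance``.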
95362306a36Sopenharmony_ci
95462306a36Sopenharmony_ci
95562306a36Sopenharmony_ciThe ``test_device_edac`` device adds 4 attributes and 1 control:
95662306a36Sopenharmony_ci
95762306a36Sopenharmony_ci	================== ====================================================
95862306a36Sopenharmony_ci	test-block-bits-0	for every POLL cycle this counter
95962306a36Sopenharmony_ci				is incremented
96062306a36Sopenharmony_ci	test-block-bits-1	every 10 cycles, this counter is bumped once,
96162306a36Sopenharmony_ci				and test-block-bits-0 is set to 0
96262306a36Sopenharmony_ci	test-block-bits-2	every 100 cycles, this counter is bumped once,
96362306a36Sopenharmony_ci				and test-block-bits-1 is set to 0
96462306a36Sopenharmony_ci	test-block-bits-3	every 1000 cycles, this counter is bumped once,
96562306a36Sopenharmony_ci				and test-block-bits-2 is set to 0
96662306a36Sopenharmony_ci	================== ====================================================
96762306a36Sopenharmony_ci
96862306a36Sopenharmony_ci
96962306a36Sopenharmony_ci	================== ====================================================
	reset-counters		writing anything to this control will
				reset all the above counters.
97262306a36Sopenharmony_ci	================== ====================================================
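
The four ``test-block-bits-N`` attributes behave like cascading decimal
counters. The simulation below is one reading of the table above, not
code taken from the test driver: it assumes that a counter is bumped and
its lower neighbour cleared on the same POLL cycle. The function name
``simulate_test_block_bits`` is hypothetical.

.. code-block:: sh

    # Print the values of test-block-bits-0..3 after <cycles> POLL cycles.
    simulate_test_block_bits() {
        cycles=$1
        b0=0 b1=0 b2=0 b3=0
        i=1
        while [ "$i" -le "$cycles" ]; do
            b0=$(( b0 + 1 ))
            if [ $(( i % 10 )) -eq 0 ]; then b1=$(( b1 + 1 )); b0=0; fi
            if [ $(( i % 100 )) -eq 0 ]; then b2=$(( b2 + 1 )); b1=0; fi
            if [ $(( i % 1000 )) -eq 0 ]; then b3=$(( b3 + 1 )); b2=0; fi
            i=$(( i + 1 ))
        done
        printf '%s %s %s %s\n' "$b0" "$b1" "$b2" "$b3"
    }

Under this reading, ``simulate_test_block_bits 1234`` prints ``4 3 2 1``:
the counters hold the decimal digits of the number of POLL cycles seen
since the last reset.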
97362306a36Sopenharmony_ci
97462306a36Sopenharmony_ci
Use of the ``test_device_edac`` driver should enable others to create their own
unique drivers for their hardware systems.
97762306a36Sopenharmony_ci
97862306a36Sopenharmony_ciThe ``test_device_edac`` sample driver is located at the
97962306a36Sopenharmony_cihttp://bluesmoke.sourceforge.net project site for EDAC.
98062306a36Sopenharmony_ci
98162306a36Sopenharmony_ci
98262306a36Sopenharmony_ciUsage of EDAC APIs on Nehalem and newer Intel CPUs
98362306a36Sopenharmony_ci--------------------------------------------------
98462306a36Sopenharmony_ci
98562306a36Sopenharmony_ciOn older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
98762306a36Sopenharmony_cinewer Intel architectures integrated an enhanced version of the memory
98862306a36Sopenharmony_cicontroller (MC) inside the CPUs.
98962306a36Sopenharmony_ci
This chapter will cover the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``skx_edac``.
99362306a36Sopenharmony_ci
99462306a36Sopenharmony_ci.. note::
99562306a36Sopenharmony_ci
99662306a36Sopenharmony_ci   The Xeon E7 processor families use a separate chip for the memory
99762306a36Sopenharmony_ci   controller, called Intel Scalable Memory Buffer. This section doesn't
99862306a36Sopenharmony_ci   apply for such families.
99962306a36Sopenharmony_ci
1) There is one Memory Controller per Quick Path Interconnect
   (QPI). At the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the smallest unit in the EDAC API is the csrow, the driver
   sequentially maps channel/DIMM into different csrows.
101162306a36Sopenharmony_ci
101262306a36Sopenharmony_ci   For example, supposing the following layout::
101362306a36Sopenharmony_ci
101462306a36Sopenharmony_ci	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
101562306a36Sopenharmony_ci	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
101662306a36Sopenharmony_ci	  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
101762306a36Sopenharmony_ci        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
101862306a36Sopenharmony_ci	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
101962306a36Sopenharmony_ci	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
102062306a36Sopenharmony_ci	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
102162306a36Sopenharmony_ci
102262306a36Sopenharmony_ci   The driver will map it as::
102362306a36Sopenharmony_ci
102462306a36Sopenharmony_ci	csrow0: channel 0, dimm0
102562306a36Sopenharmony_ci	csrow1: channel 0, dimm1
102662306a36Sopenharmony_ci	csrow2: channel 1, dimm0
102762306a36Sopenharmony_ci	csrow3: channel 2, dimm0
102862306a36Sopenharmony_ci
   That is, the driver exports one DIMM per csrow.
103062306a36Sopenharmony_ci
103162306a36Sopenharmony_ci   Each QPI is exported as a different memory controller.
103262306a36Sopenharmony_ci
103362306a36Sopenharmony_ci2) The MC has the ability to inject errors to test drivers. The drivers
103462306a36Sopenharmony_ci   implement this functionality via some error injection nodes:
103562306a36Sopenharmony_ci
103662306a36Sopenharmony_ci   For injecting a memory error, there are some sysfs nodes, under
103762306a36Sopenharmony_ci   ``/sys/devices/system/edac/mc/mc?/``:
103862306a36Sopenharmony_ci
103962306a36Sopenharmony_ci   - ``inject_addrmatch/*``:
104062306a36Sopenharmony_ci      Controls the error injection mask register. It is possible to specify
104162306a36Sopenharmony_ci      several characteristics of the address to match an error code::
104262306a36Sopenharmony_ci
104362306a36Sopenharmony_ci         dimm = the affected dimm. Numbers are relative to a channel;
104462306a36Sopenharmony_ci         rank = the memory rank;
104562306a36Sopenharmony_ci         channel = the channel that will generate an error;
104662306a36Sopenharmony_ci         bank = the affected bank;
104762306a36Sopenharmony_ci         page = the page address;
104862306a36Sopenharmony_ci         column (or col) = the address column.
104962306a36Sopenharmony_ci
      Each of the above values can be set to "any" to match any valid value.
105162306a36Sopenharmony_ci
105262306a36Sopenharmony_ci      At driver init, all values are set to any.
105362306a36Sopenharmony_ci
105462306a36Sopenharmony_ci      For example, to generate an error at rank 1 of dimm 2, for any channel,
105562306a36Sopenharmony_ci      any bank, any page, any column::
105662306a36Sopenharmony_ci
105762306a36Sopenharmony_ci		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
105862306a36Sopenharmony_ci		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
105962306a36Sopenharmony_ci
106062306a36Sopenharmony_ci	To return to the default behaviour of matching any, you can do::
106162306a36Sopenharmony_ci
106262306a36Sopenharmony_ci		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
106362306a36Sopenharmony_ci		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
106462306a36Sopenharmony_ci
   - ``inject_eccmask``:
       Specifies which bits will be affected by the injected error.

   - ``inject_section``:
       Specifies which ECC cache section will get the error::

		3 for both
		2 for the highest
		1 for the lowest

   - ``inject_type``:
       Specifies the type of error, as a combination of the following bits::

		bit 0 - repeat
		bit 1 - ecc
		bit 2 - parity

   - ``inject_enable``:
       Starts the error generation when a non-zero value is written.

   All injection variables can be read. Root permission is needed to write
   them.

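   Because ``inject_type`` is a combination of bits, the value to write is
   the bitwise OR of the selected bits. The following is a minimal sketch,
   assuming a hypothetical helper named ``inject_type_value`` (it is not part
   of the driver or of any EDAC tool). It only computes the number and does
   not touch sysfs:

   ```shell
   #!/bin/sh
   # Hypothetical helper: turn flag names into an inject_type value.
   # Bit positions follow the list above:
   #   bit 0 = repeat, bit 1 = ecc, bit 2 = parity.
   inject_type_value() {
       value=0
       for flag in "$@"; do
           case "$flag" in
           repeat) value=$((value | 1)) ;;
           ecc)    value=$((value | 2)) ;;
           parity) value=$((value | 4)) ;;
           *) echo "unknown flag: $flag" >&2; return 1 ;;
           esac
       done
       echo "$value"
   }

   inject_type_value ecc          # prints 2
   inject_type_value repeat ecc   # prints 3
   ```
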
   The datasheet states that the error will only be generated after a write
   to an address that matches ``inject_addrmatch``. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
	dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

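   Since only the ``mcN`` component changes between sockets, the sequence
   can be wrapped in a small function. This is a sketch only:
   ``setup_injection`` and the ``EDAC_ROOT`` variable are hypothetical names
   introduced here, and the written values simply mirror the example above.
   ``EDAC_ROOT`` defaults to the real sysfs location. Pointing it at a
   scratch directory allows a dry run without touching hardware. A memory
   access such as the ``dd`` command above is still needed afterwards:

   ```shell
   #!/bin/sh
   # EDAC_ROOT is a hypothetical override used only by this sketch.
   EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac/mc}"

   # setup_injection <mc index> <channel>
   # Writes the same values as the example above; stops at the first failure.
   setup_injection() {
       mc="$EDAC_ROOT/mc$1"
       # Match only the requested channel.
       echo "$2" >"$mc/inject_addrmatch/channel" &&
       # inject_type 2: bit 1 (ecc) set.
       echo 2 >"$mc/inject_type" &&
       echo 64 >"$mc/inject_eccmask" &&
       # inject_section 3: both sections.
       echo 3 >"$mc/inject_section" &&
       echo 1 >"$mc/inject_enable"
   }

   # Socket 1, channel 2 (needs root on a real system):
   # setup_injection 1 2
   ```
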
   The generated error message will look like::

	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

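   When checking such a message from a script, the ``name=value`` pairs can
   be pulled out with ``sed``. The helper below, ``edac_field``, is a
   hypothetical name used only for this sketch. In this particular example
   the reported syndrome equals the value written to ``inject_eccmask``
   (64 is 0x40). That is an observation about this one message, not a
   documented guarantee:

   ```shell
   #!/bin/sh
   # Hypothetical helper: read a log line on stdin and print the value of
   # the given name=value field (fields written as "name = value" are not
   # handled).
   edac_field() {
       sed -n "s/.*[ (]$1=\([^,) ]*\).*/\1/p"
   }

   line='EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))'

   printf '%s\n' "$line" | edac_field Channel    # prints 2
   printf '%s\n' "$line" | edac_field syndrome   # prints 0x00000040
   printf '0x%08x\n' 64                          # prints 0x00000040
   ```
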
3) Corrected Error memory register counters

   These newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs (UDIMMs). As
   the chipset offers some counters that also work with UDIMMs (but with a
   worse level of granularity than the default ones), the driver exposes
   those registers for UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

	$ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo "$i"; cat "$i"; done
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
	0
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
	0
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
	0

   What happens here is that errors on different csrows, but at the same
   dimm number, will increment the same counter.
   So, in this memory mapping::

	csrow0: channel 0, dimm0
	csrow1: channel 0, dimm1
	csrow2: channel 1, dimm0
	csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first DIMM of any
   channel, which here means csrow0, csrow2 or csrow3.

   The hardware will increment udimm1 for an error at the second DIMM of any
   channel, which here means csrow1.

   The hardware will increment udimm2 for an error at the third DIMM of any
   channel; none is populated in this mapping.

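   The aggregation described above (the same DIMM number on different
   channels shares one counter) can be illustrated with a short filter. It
   is a sketch under the assumption that the mapping is given as text in the
   ``csrowN: channel X, dimmY`` form shown above. ``udimm_for_csrows`` is a
   hypothetical name, and nothing here reads the hardware:

   ```shell
   #!/bin/sh
   # Hypothetical helper: for each "csrowN: channel X, dimmY" line on stdin,
   # print the udimm counter that would be shared by that DIMM number.
   udimm_for_csrows() {
       sed -n 's/^[[:space:]]*\(csrow[0-9]*\):.*dimm\([0-9]*\)[[:space:]]*$/\1 -> udimm\2/p'
   }

   printf '%s\n' \
       'csrow0: channel 0, dimm0' \
       'csrow1: channel 0, dimm1' \
       'csrow2: channel 1, dimm0' \
       'csrow3: channel 2, dimm0' | udimm_for_csrows
   # prints:
   #   csrow0 -> udimm0
   #   csrow1 -> udimm1
   #   csrow2 -> udimm0
   #   csrow3 -> udimm0
   ```
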
4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
	   Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
	   Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
	   Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
	  Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
	  Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
	  Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al.
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh