.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured as Mean Time Between Repair (MTBR)

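
As a worked example of the availability measure above, the sketch below
converts downtime over a period into a percentage. It is only an
illustration: the function name ``availability_percent`` and the sample
figures are assumptions made for this example, and it reports the
complement of the downtime percentage (that is, the share of time the
system was up).

.. code-block:: python

   def availability_percent(downtime_hours, period_hours):
       """Return the share of the period the system was up, in percent."""
       if period_hours <= 0:
           raise ValueError("period_hours must be positive")
       if not 0 <= downtime_hours <= period_hours:
           raise ValueError("downtime_hours must lie within the period")
       return 100.0 * (1.0 - downtime_hours / period_hours)


   # Hypothetical figures: 8.76 hours of downtime in a 365-day year
   # (8760 hours), which is 0.1% of the period.
   print(round(availability_percent(8.76, 8760.0), 3))
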
Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime. It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of detected errors, it is possible
to identify whether the probability of hardware errors is increasing and, in
that case, do preventive maintenance to replace a degraded component while
those errors are still correctable.

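
The idea of watching whether detected errors become more frequent can be
sketched as follows. This is a minimal illustration, not a policy
recommendation: the function names, the ``factor`` threshold and the sample
readings are all assumptions, and the sketch does not handle counters that
are reset between samples.

.. code-block:: python

   def error_rates(samples):
       """Turn (timestamp_seconds, cumulative_count) pairs into rates."""
       rates = []
       for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
           if t1 <= t0:
               raise ValueError("timestamps must be strictly increasing")
           rates.append((c1 - c0) / (t1 - t0))
       return rates


   def rate_is_increasing(samples, factor=2.0):
       """True if the latest rate exceeds `factor` times the earlier mean."""
       rates = error_rates(samples)
       if len(rates) < 2:
           return False  # not enough history to compare against
       baseline = sum(rates[:-1]) / len(rates[:-1])
       if baseline == 0:
           return rates[-1] > 0
       return rates[-1] > factor * baseline


   # Hypothetical hourly readings of a cumulative corrected-error counter.
   readings = [(0, 0), (3600, 1), (7200, 2), (10800, 10)]
   if rate_is_increasing(readings):
       print("corrected errors are accelerating; consider maintenance")
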
Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors in a bit packet
is below a threshold. If the number of errors is above that threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they can't correct it.

Also, sometimes an error occurs on a component that is not in use. For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors was above the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is also possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.

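
The categories above can be summarized as a small decision function. This
is only a sketch of the taxonomy as described in this section; the names
``Severity`` and ``classify`` and the two boolean inputs are assumptions
made for illustration and do not correspond to any kernel interface.

.. code-block:: python

   from enum import Enum


   class Severity(Enum):
       CORRECTABLE = "CE"
       NON_FATAL = "UE, non-fatal"
       FATAL = "UE, fatal"


   def classify(corrected, component_in_use, component_critical):
       """Map an error report to one of the categories described above."""
       if corrected:
           return Severity.CORRECTABLE
       if component_in_use and component_critical:
           # For example, kernel memory hit by an uncorrected error.
           return Severity.FATAL
       # Unused hardware, or a userspace process that can be restarted.
       return Severity.NON_FATAL


   print(classify(corrected=False, component_in_use=True,
                  component_critical=False))
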
Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

	Memory Device
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 16384 MB
		Form Factor: SODIMM
		Set: None
		Locator: ChannelA-DIMM0
		Bank Locator: BANK 0
		Type: DDR4
		Type Detail: Synchronous
		Speed: 2133 MHz
		Rank: 2
		Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Note that, on such a system, the *total width* is equal to the
*data width*. It means that such a memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

	Memory Device
		Array Handle: 0x1000
		Error Information Handle: Not Provided
		Total Width: 72 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: 1
		Locator: DIMM_A1
		Bank Locator: Not Specified
		Type: DDR3
		Type Detail: Synchronous Registered (Buffered)
		Speed: 1600 MHz
		Rank: 2
		Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Note that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).

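
The width comparison described above can be automated. The sketch below
parses text in the format of the two excerpts and reports whether each
module has extra bits beyond its data width. The helper names are
assumptions made for this example, the sample text is trimmed from the
excerpts above, and real ``dmidecode`` output contains more record types
and fields than this simple parser expects.

.. code-block:: python

   import re

   SAMPLE = """
   Memory Device
       Total Width: 64 bits
       Data Width: 64 bits
       Locator: ChannelA-DIMM0
       Bank Locator: BANK 0

   Memory Device
       Total Width: 72 bits
       Data Width: 64 bits
       Locator: DIMM_A1
       Bank Locator: Not Specified
   """


   def parse_memory_devices(text):
       """Return one dict of fields per "Memory Device" block."""
       devices = []
       current = None
       for line in text.splitlines():
           stripped = line.strip()
           if stripped == "Memory Device":
               current = {}
               devices.append(current)
           elif not stripped:
               current = None  # a blank line ends the current block
           elif current is not None and ":" in stripped:
               key, _, value = stripped.partition(":")
               current[key.strip()] = value.strip()
       return devices


   def width_bits(value):
       """Extract the number from a string such as "72 bits"."""
       match = re.fullmatch(r"(\d+) bits", value)
       return int(match.group(1)) if match else None


   def has_ecc_bits(device):
       """True if the total width is larger than the data width."""
       total = width_bits(device.get("Total Width", ""))
       data = width_bits(device.get("Data Width", ""))
       return total is not None and data is not None and total > data


   for device in parse_memory_devices(SAMPLE):
       print(device.get("Locator"), has_ecc_bits(device))
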
To make things even worse, it is not uncommon for systems with different
labels on their system boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits, which are used for the error detection and correction mechanisms,
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
to the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.

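
The write and read paths described above can be illustrated with a toy
code. The sketch below uses an extended Hamming code with 4 data bits and
4 check bits instead of the 64 data bits and 8 extra bits of the example
module, so it is not what any particular memory controller implements.
The names ``encode`` and ``decode`` are assumptions for this illustration,
and ``syndrome`` here is the value computed by the decoder, which tells
which bit position (if any) is inconsistent.

.. code-block:: python

   def encode(nibble):
       """Encode 4 data bits into an 8-bit extended Hamming codeword."""
       code = [0] * 8
       # Data bits go to positions 3, 5, 6 and 7 (least significant first).
       for pos, shift in ((3, 0), (5, 1), (6, 2), (7, 3)):
           code[pos] = (nibble >> shift) & 1
       # Each check bit covers the positions whose index has that bit set.
       code[1] = code[3] ^ code[5] ^ code[7]
       code[2] = code[3] ^ code[6] ^ code[7]
       code[4] = code[5] ^ code[6] ^ code[7]
       # Position 0 holds an overall parity bit over the other seven bits.
       code[0] = sum(code[1:]) % 2
       return code


   def decode(code):
       """Return (nibble, status), with status "ok", "CE" or "UE"."""
       syndrome = 0
       for pos in range(1, 8):
           if code[pos]:
               syndrome ^= pos
       parity_error = sum(code) % 2 == 1
       fixed = list(code)
       if parity_error:
           # Odd number of flips: treat it as a single-bit error and fix it.
           # A zero syndrome here means the overall parity bit itself flipped.
           fixed[syndrome] ^= 1
           status = "CE"
       elif syndrome:
           # Even number of flips with a non-zero syndrome: detected only.
           status = "UE"
       else:
           status = "ok"
       nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
       return nibble, status


   word = encode(0b1011)
   word[6] ^= 1            # one flipped bit: can be corrected
   print(decode(word))
   word[2] ^= 1            # a second flipped bit: can only be detected
   print(decode(word))

As in the description above, ``decode`` always returns a data word, even
when the status says that the error could not be corrected.
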
The information about the CE/UE errors is stored in some special registers
at the memory controller and can be accessed by reading such registers,
either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86
64-bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Note that several memory controllers allow operation in a
  mode called "Lock-Step", where two memory modules are grouped together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. At read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/x86/x86_64/machinecheck.rst in the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, in
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate, as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory types of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting. The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

	broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.


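
To see which devices have the ``broken_parity_status`` attribute set, the
sysfs tree can be walked from userspace. The sketch below is a minimal
example under two assumptions: that the attribute file reads as ``1`` when
set, and that PCI devices appear below ``pci*`` directories under
``/sys/devices`` as described above. The function name
``devices_with_broken_parity`` is made up for this illustration.

.. code-block:: python

   from pathlib import Path


   def devices_with_broken_parity(devices_root="/sys/devices"):
       """List PCI device directory names whose attribute reads as 1."""
       flagged = []
       pattern = "pci*/**/broken_parity_status"
       for attr in sorted(Path(devices_root).glob(pattern)):
           try:
               value = attr.read_text().strip()
           except OSError:
               continue  # unreadable attribute: skip this device
           if value == "1":
               flagged.append(attr.parent.name)
       return flagged


   for name in devices_with_broken_parity():
       print(name)
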
Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect the current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel, then no loading
is necessary. If ``edac`` was built as modules, then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

	$ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

	======= ==============================
	mc	memory controller(s) system
	pci	PCI control and status system
	======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64-bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers. The following example will assume 2 channels:

	+------------+-----------------------+
	| CS Rows    |       Channels        |
	+------------+-----------+-----------+
	|            |  ``ch0``  |  ``ch1``  |
	+============+===========+===========+
	|            |**DIMM_A0**|**DIMM_B0**|
	+------------+-----------+-----------+
	| ``csrow0`` |   rank0   |   rank0   |
	+------------+-----------+-----------+
	| ``csrow1`` |   rank1   |   rank1   |
	+------------+-----------+-----------+
	|            |**DIMM_A1**|**DIMM_B1**|
	+------------+-----------+-----------+
	| ``csrow2`` |    rank0  |  rank0    |
	+------------+-----------+-----------+
	| ``csrow3`` |    rank1  |  rank1    |
	+------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

	+---------+---------+
	| DIMM_A0 | DIMM_B0 |
	+---------+---------+
	| DIMM_A1 | DIMM_B1 |
	+---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above, 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module; see the
``rankX`` directories below.

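
The mapping in the 2-channel example table can be written down as a small
function. This only mirrors the layout of that example: real silk-screen
labels are motherboard-specific, and the function name ``slot_for`` and
the ``ranks_per_dimm`` parameter are assumptions made for this sketch.

.. code-block:: python

   def slot_for(csrow, channel, ranks_per_dimm=2):
       """Map (csrow, channel) to (slot label, rank) as in the example."""
       if channel not in (0, 1):
           raise ValueError("the example layout has channels 0 and 1 only")
       letter = "AB"[channel]      # slots "A" are channel 0, "B" channel 1
       dimm_index, rank = divmod(csrow, ranks_per_dimm)
       return f"DIMM_{letter}{dimm_index}", rank


   for csrow in range(4):
       print(csrow, slot_for(csrow, 0), slot_for(csrow, 1))
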
The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

	..../edac/mc/
		   |
		   |->mc0
		   |->mc1
		   |->mc2
		   ....

Under each ``mcX`` directory, each csrow is represented by a
``csrowX`` directory, where ``X`` is the csrow index::

	.../mc/mc0/
		|
		|->csrow0
		|->csrow2
		|->csrow3
		....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.

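
The reasoning above, going from the ``csrowX`` directories that are
present to single or dual ranked DIMMs, can be sketched as follows. It
assumes the example layout with two csrows per DIMM position; the function
name ``ranks_seen`` is an assumption made for this illustration.

.. code-block:: python

   import re


   def ranks_seen(csrow_names, ranks_per_dimm=2):
       """Count populated csrows per DIMM position from directory names."""
       counts = {}
       for name in csrow_names:
           match = re.fullmatch(r"csrow(\d+)", name)
           if match is None:
               continue  # not a csrowX directory
           position = int(match.group(1)) // ranks_per_dimm
           counts[position] = counts.get(position, 0) + 1
       return counts


   # Directory names taken from the listing above.
   for position, ranks in sorted(ranks_seen(["csrow0", "csrow2", "csrow3"]).items()):
       kind = "dual" if ranks == 2 else "single"
       print(f"DIMM position {position}: {kind} ranked")
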
Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

	Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

	/sys/devices/system/edac/
	├── mc
	│   ├── mc0
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   ├── mc1
	│   │   ├── ce_count
	│   │   ├── ce_noinfo_count
	│   │   ├── dimm0
	│   │   │   ├── dimm_ce_count
	│   │   │   ├── dimm_dev_type
	│   │   │   ├── dimm_edac_mode
	│   │   │   ├── dimm_label
	│   │   │   ├── dimm_location
	│   │   │   ├── dimm_mem_type
	│   │   │   ├── dimm_ue_count
	│   │   │   ├── size
	│   │   │   └── uevent
	│   │   ├── max_location
	│   │   ├── mc_name
	│   │   ├── reset_counters
	│   │   ├── seconds_since_reset
	│   │   ├── size_mb
	│   │   ├── ue_count
	│   │   ├── ue_noinfo_count
	│   │   └── uevent
	│   └── uevent
	└── uevent

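
The tree above can be walked from userspace to collect the per-module
error counters. The following is a minimal sketch, assuming the layout
shown in the listing; the helper names ``read_count`` and
``read_dimm_counters`` are made up for this example, and systems that
expose ``rankX`` directories instead of ``dimmX`` would need a different
glob pattern.

.. code-block:: python

   from pathlib import Path


   def read_count(path):
       """Return the integer stored in a counter file, or None."""
       try:
           return int(path.read_text().strip())
       except (OSError, ValueError):
           return None


   def read_dimm_counters(edac_root="/sys/devices/system/edac"):
       """Collect label and error counters for every dimmX directory."""
       results = []
       for dimm_dir in sorted(Path(edac_root).glob("mc/mc*/dimm*")):
           if not dimm_dir.is_dir():
               continue
           try:
               label = (dimm_dir / "dimm_label").read_text().strip()
           except OSError:
               label = None
           results.append({
               "mc": dimm_dir.parent.name,
               "dimm": dimm_dir.name,
               "label": label,
               "ce": read_count(dimm_dir / "dimm_ce_count"),
               "ue": read_count(dimm_dir / "dimm_ue_count"),
           })
       return results


   # Print only the modules that have logged at least one error.
   for entry in read_dimm_counters():
       if entry["ce"] or entry["ue"]:
           print(entry)
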
4918c2ecf20Sopenharmony_ciIn the ``dimmX`` directories are EDAC control and attribute files for
4928c2ecf20Sopenharmony_cithis ``X`` memory module:
4938c2ecf20Sopenharmony_ci
4948c2ecf20Sopenharmony_ci- ``size`` - Total memory managed by this csrow attribute file
4958c2ecf20Sopenharmony_ci
4968c2ecf20Sopenharmony_ci	This attribute file displays, in count of megabytes, the memory
4978c2ecf20Sopenharmony_ci	that this csrow contains.
4988c2ecf20Sopenharmony_ci
4998c2ecf20Sopenharmony_ci- ``dimm_ue_count`` - Uncorrectable Errors count attribute file
5008c2ecf20Sopenharmony_ci
5018c2ecf20Sopenharmony_ci	This attribute file displays the total count of uncorrectable
5028c2ecf20Sopenharmony_ci	errors that have occurred on this DIMM. If panic_on_ue is set
5038c2ecf20Sopenharmony_ci	this counter will not have a chance to increment, since EDAC
5048c2ecf20Sopenharmony_ci	will panic the system.
5058c2ecf20Sopenharmony_ci
5068c2ecf20Sopenharmony_ci- ``dimm_ce_count`` - Correctable Errors count attribute file
5078c2ecf20Sopenharmony_ci
	This attribute file displays the total count of correctable
	errors that have occurred on this DIMM. This count is very
	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and such values reported
	to the system administrator.
5148c2ecf20Sopenharmony_ci
5158c2ecf20Sopenharmony_ci- ``dimm_dev_type``  - Device type attribute file
5168c2ecf20Sopenharmony_ci
5178c2ecf20Sopenharmony_ci	This attribute file will display what type of DRAM device is
5188c2ecf20Sopenharmony_ci	being utilized on this DIMM.
5198c2ecf20Sopenharmony_ci	Examples:
5208c2ecf20Sopenharmony_ci
5218c2ecf20Sopenharmony_ci		- x1
5228c2ecf20Sopenharmony_ci		- x2
5238c2ecf20Sopenharmony_ci		- x4
5248c2ecf20Sopenharmony_ci		- x8
5258c2ecf20Sopenharmony_ci
5268c2ecf20Sopenharmony_ci- ``dimm_edac_mode`` - EDAC Mode of operation attribute file
5278c2ecf20Sopenharmony_ci
5288c2ecf20Sopenharmony_ci	This attribute file will display what type of Error detection
5298c2ecf20Sopenharmony_ci	and correction is being utilized.
5308c2ecf20Sopenharmony_ci
5318c2ecf20Sopenharmony_ci- ``dimm_label`` - memory module label control file
5328c2ecf20Sopenharmony_ci
5338c2ecf20Sopenharmony_ci	This control file allows this DIMM to have a label assigned
5348c2ecf20Sopenharmony_ci	to it. With this label in the module, when errors occur
5358c2ecf20Sopenharmony_ci	the output can provide the DIMM label in the system log.
5368c2ecf20Sopenharmony_ci	This becomes vital for panic events to isolate the
5378c2ecf20Sopenharmony_ci	cause of the UE event.
5388c2ecf20Sopenharmony_ci
5398c2ecf20Sopenharmony_ci	DIMM Labels must be assigned after booting, with information
5408c2ecf20Sopenharmony_ci	that correctly identifies the physical slot with its
5418c2ecf20Sopenharmony_ci	silk screen label. This information is currently very
5428c2ecf20Sopenharmony_ci	motherboard specific and determination of this information
5438c2ecf20Sopenharmony_ci	must occur in userland at this time.
5448c2ecf20Sopenharmony_ci
5458c2ecf20Sopenharmony_ci- ``dimm_location`` - location of the memory module
5468c2ecf20Sopenharmony_ci
5478c2ecf20Sopenharmony_ci	The location can have up to 3 levels, and describe how the
5488c2ecf20Sopenharmony_ci	memory controller identifies the location of a memory module.
5498c2ecf20Sopenharmony_ci	Depending on the type of memory and memory controller, it
5508c2ecf20Sopenharmony_ci	can be:
5518c2ecf20Sopenharmony_ci
5528c2ecf20Sopenharmony_ci		- *csrow* and *channel* - used when the memory controller
5538c2ecf20Sopenharmony_ci		  doesn't identify a single DIMM - e. g. in ``rankX`` dir;
5548c2ecf20Sopenharmony_ci		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
5558c2ecf20Sopenharmony_ci		  controllers;
5568c2ecf20Sopenharmony_ci		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
5578c2ecf20Sopenharmony_ci
5588c2ecf20Sopenharmony_ci- ``dimm_mem_type`` - Memory Type attribute file
5598c2ecf20Sopenharmony_ci
	This attribute file will display what type of memory is currently
	on this DIMM. Normally, either buffered or unbuffered memory.
5628c2ecf20Sopenharmony_ci	Examples:
5638c2ecf20Sopenharmony_ci
5648c2ecf20Sopenharmony_ci		- Registered-DDR
5658c2ecf20Sopenharmony_ci		- Unbuffered-DDR
5668c2ecf20Sopenharmony_ci
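The monitoring recommended for ``dimm_ce_count`` can be sketched as a small
shell function. This is only an illustration: the function name
``report_dimm_errors`` is made up for this example, and the default root
``/sys/devices/system/edac/mc`` is an assumption about where the ``mcX``
directories live on the running system. The root can be passed as an
argument, for example to try the function against a copy of the tree.

.. code-block:: sh

    # report_dimm_errors [EDAC_MC_ROOT]
    # Print one line per DIMM whose CE or UE counter is non-zero.
    # The default root below is an assumption, not taken from a driver.
    report_dimm_errors() (
        root=${1:-/sys/devices/system/edac/mc}
        for dimm in "$root"/mc*/dimm*; do
            # Skip unmatched globs and directories without counters.
            [ -r "$dimm/dimm_ce_count" ] || continue
            ce=$(cat "$dimm/dimm_ce_count")
            ue=$(cat "$dimm/dimm_ue_count" 2>/dev/null || echo 0)
            label=$(cat "$dimm/dimm_label" 2>/dev/null || echo unknown)
            name=${dimm#"$root"/}
            if [ "$ce" -ne 0 ] || [ "$ue" -ne 0 ]; then
                echo "$name: label=$label ce=$ce ue=$ue"
            fi
        done
    )
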
.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is
  called ``rankX`` and works in a similar way to the ``csrowX``
  directories. On modern Intel memory controllers, the memory controller
  identifies the memory modules directly. On such systems, the directory
  is called ``dimmX``.
5718c2ecf20Sopenharmony_ci
5728c2ecf20Sopenharmony_ci.. [#f6] There are also some ``power`` directories and ``subsystem``
5738c2ecf20Sopenharmony_ci  symlinks inside the sysfs mapping that are automatically created by
5748c2ecf20Sopenharmony_ci  the sysfs subsystem. Currently, they serve no purpose.
5758c2ecf20Sopenharmony_ci
5768c2ecf20Sopenharmony_ci``csrowX`` directories
5778c2ecf20Sopenharmony_ci----------------------
5788c2ecf20Sopenharmony_ci
5798c2ecf20Sopenharmony_ciWhen CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
5808c2ecf20Sopenharmony_cidirectories. As this API doesn't work properly for Rambus, FB-DIMMs and
5818c2ecf20Sopenharmony_cimodern Intel Memory Controllers, this is being deprecated in favor of
5828c2ecf20Sopenharmony_ci``dimmX`` directories.
5838c2ecf20Sopenharmony_ci
5848c2ecf20Sopenharmony_ciIn the ``csrowX`` directories are EDAC control and attribute files for
5858c2ecf20Sopenharmony_cithis ``X`` instance of csrow:
5868c2ecf20Sopenharmony_ci
5878c2ecf20Sopenharmony_ci
5888c2ecf20Sopenharmony_ci- ``ue_count`` - Total Uncorrectable Errors count attribute file
5898c2ecf20Sopenharmony_ci
5908c2ecf20Sopenharmony_ci	This attribute file displays the total count of uncorrectable
5918c2ecf20Sopenharmony_ci	errors that have occurred on this csrow. If panic_on_ue is set
5928c2ecf20Sopenharmony_ci	this counter will not have a chance to increment, since EDAC
5938c2ecf20Sopenharmony_ci	will panic the system.
5948c2ecf20Sopenharmony_ci
5958c2ecf20Sopenharmony_ci
5968c2ecf20Sopenharmony_ci- ``ce_count`` - Total Correctable Errors count attribute file
5978c2ecf20Sopenharmony_ci
	This attribute file displays the total count of correctable
	errors that have occurred on this csrow. This count is very
	important to examine. CEs provide early indications that a
	DIMM is beginning to fail. This count field should be
	monitored for non-zero values, and such values reported
	to the system administrator.
6048c2ecf20Sopenharmony_ci
6058c2ecf20Sopenharmony_ci
6068c2ecf20Sopenharmony_ci- ``size_mb`` - Total memory managed by this csrow attribute file
6078c2ecf20Sopenharmony_ci
6088c2ecf20Sopenharmony_ci	This attribute file displays, in count of megabytes, the memory
6098c2ecf20Sopenharmony_ci	that this csrow contains.
6108c2ecf20Sopenharmony_ci
6118c2ecf20Sopenharmony_ci
6128c2ecf20Sopenharmony_ci- ``mem_type`` - Memory Type attribute file
6138c2ecf20Sopenharmony_ci
6148c2ecf20Sopenharmony_ci	This attribute file will display what type of memory is currently
6158c2ecf20Sopenharmony_ci	on this csrow. Normally, either buffered or unbuffered memory.
6168c2ecf20Sopenharmony_ci	Examples:
6178c2ecf20Sopenharmony_ci
6188c2ecf20Sopenharmony_ci		- Registered-DDR
6198c2ecf20Sopenharmony_ci		- Unbuffered-DDR
6208c2ecf20Sopenharmony_ci
6218c2ecf20Sopenharmony_ci
6228c2ecf20Sopenharmony_ci- ``edac_mode`` - EDAC Mode of operation attribute file
6238c2ecf20Sopenharmony_ci
6248c2ecf20Sopenharmony_ci	This attribute file will display what type of Error detection
6258c2ecf20Sopenharmony_ci	and correction is being utilized.
6268c2ecf20Sopenharmony_ci
6278c2ecf20Sopenharmony_ci
6288c2ecf20Sopenharmony_ci- ``dev_type`` - Device type attribute file
6298c2ecf20Sopenharmony_ci
6308c2ecf20Sopenharmony_ci	This attribute file will display what type of DRAM device is
6318c2ecf20Sopenharmony_ci	being utilized on this DIMM.
6328c2ecf20Sopenharmony_ci	Examples:
6338c2ecf20Sopenharmony_ci
6348c2ecf20Sopenharmony_ci		- x1
6358c2ecf20Sopenharmony_ci		- x2
6368c2ecf20Sopenharmony_ci		- x4
6378c2ecf20Sopenharmony_ci		- x8
6388c2ecf20Sopenharmony_ci
6398c2ecf20Sopenharmony_ci
6408c2ecf20Sopenharmony_ci- ``ch0_ce_count`` - Channel 0 CE Count attribute file
6418c2ecf20Sopenharmony_ci
6428c2ecf20Sopenharmony_ci	This attribute file will display the count of CEs on this
6438c2ecf20Sopenharmony_ci	DIMM located in channel 0.
6448c2ecf20Sopenharmony_ci
6458c2ecf20Sopenharmony_ci
6468c2ecf20Sopenharmony_ci- ``ch0_ue_count`` - Channel 0 UE Count attribute file
6478c2ecf20Sopenharmony_ci
6488c2ecf20Sopenharmony_ci	This attribute file will display the count of UEs on this
6498c2ecf20Sopenharmony_ci	DIMM located in channel 0.
6508c2ecf20Sopenharmony_ci
6518c2ecf20Sopenharmony_ci
6528c2ecf20Sopenharmony_ci- ``ch0_dimm_label`` - Channel 0 DIMM Label control file
6538c2ecf20Sopenharmony_ci
6548c2ecf20Sopenharmony_ci
6558c2ecf20Sopenharmony_ci	This control file allows this DIMM to have a label assigned
6568c2ecf20Sopenharmony_ci	to it. With this label in the module, when errors occur
6578c2ecf20Sopenharmony_ci	the output can provide the DIMM label in the system log.
6588c2ecf20Sopenharmony_ci	This becomes vital for panic events to isolate the
6598c2ecf20Sopenharmony_ci	cause of the UE event.
6608c2ecf20Sopenharmony_ci
6618c2ecf20Sopenharmony_ci	DIMM Labels must be assigned after booting, with information
6628c2ecf20Sopenharmony_ci	that correctly identifies the physical slot with its
6638c2ecf20Sopenharmony_ci	silk screen label. This information is currently very
6648c2ecf20Sopenharmony_ci	motherboard specific and determination of this information
6658c2ecf20Sopenharmony_ci	must occur in userland at this time.
6668c2ecf20Sopenharmony_ci
6678c2ecf20Sopenharmony_ci
6688c2ecf20Sopenharmony_ci- ``ch1_ce_count`` - Channel 1 CE Count attribute file
6698c2ecf20Sopenharmony_ci
6708c2ecf20Sopenharmony_ci
6718c2ecf20Sopenharmony_ci	This attribute file will display the count of CEs on this
6728c2ecf20Sopenharmony_ci	DIMM located in channel 1.
6738c2ecf20Sopenharmony_ci
6748c2ecf20Sopenharmony_ci
6758c2ecf20Sopenharmony_ci- ``ch1_ue_count`` - Channel 1 UE Count attribute file
6768c2ecf20Sopenharmony_ci
6778c2ecf20Sopenharmony_ci
	This attribute file will display the count of UEs on this
	DIMM located in channel 1.
6808c2ecf20Sopenharmony_ci
6818c2ecf20Sopenharmony_ci
6828c2ecf20Sopenharmony_ci- ``ch1_dimm_label`` - Channel 1 DIMM Label control file
6838c2ecf20Sopenharmony_ci
6848c2ecf20Sopenharmony_ci	This control file allows this DIMM to have a label assigned
6858c2ecf20Sopenharmony_ci	to it. With this label in the module, when errors occur
6868c2ecf20Sopenharmony_ci	the output can provide the DIMM label in the system log.
6878c2ecf20Sopenharmony_ci	This becomes vital for panic events to isolate the
6888c2ecf20Sopenharmony_ci	cause of the UE event.
6898c2ecf20Sopenharmony_ci
6908c2ecf20Sopenharmony_ci	DIMM Labels must be assigned after booting, with information
6918c2ecf20Sopenharmony_ci	that correctly identifies the physical slot with its
6928c2ecf20Sopenharmony_ci	silk screen label. This information is currently very
6938c2ecf20Sopenharmony_ci	motherboard specific and determination of this information
6948c2ecf20Sopenharmony_ci	must occur in userland at this time.
6958c2ecf20Sopenharmony_ci
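Since labels have to be assigned from userland after booting, a boot script
could apply a board-specific table. The following is a minimal sketch under
stated assumptions: the function name ``apply_dimm_labels``, the
four-column input format and the default root
``/sys/devices/system/edac/mc`` are all illustrative, and the labels must
come from the silk screen of the actual motherboard.

.. code-block:: sh

    # apply_dimm_labels [EDAC_MC_ROOT]
    # Read "<mc> <csrow> <channel> <label>" lines from standard input and
    # write each label to the matching chN_dimm_label control file.
    apply_dimm_labels() (
        root=${1:-/sys/devices/system/edac/mc}
        while read -r mc csrow channel label; do
            case $mc in ''|'#'*) continue ;; esac   # blank line or comment
            file=$root/mc$mc/csrow$csrow/ch${channel}_dimm_label
            if [ -w "$file" ]; then
                echo "$label" > "$file"
            else
                echo "skipping $file: not writable" >&2
            fi
        done
    )

Each input line names one slot, for example ``0 0 1 DIMM_B1`` for channel 1
of ``csrow0`` on ``mc0``.
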
6968c2ecf20Sopenharmony_ci
6978c2ecf20Sopenharmony_ciSystem Logging
6988c2ecf20Sopenharmony_ci--------------
6998c2ecf20Sopenharmony_ci
7008c2ecf20Sopenharmony_ciIf logging for UEs and CEs is enabled, then system logs will contain
7018c2ecf20Sopenharmony_ciinformation indicating that errors have been detected::
7028c2ecf20Sopenharmony_ci
7038c2ecf20Sopenharmony_ci  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
7048c2ecf20Sopenharmony_ci  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac
7058c2ecf20Sopenharmony_ci
7068c2ecf20Sopenharmony_ci
7078c2ecf20Sopenharmony_ciThe structure of the message is:
7088c2ecf20Sopenharmony_ci
7098c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7108c2ecf20Sopenharmony_ci	| Content                               | Example     |
7118c2ecf20Sopenharmony_ci	+=======================================+=============+
7128c2ecf20Sopenharmony_ci	| The memory controller                 | MC0         |
7138c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7148c2ecf20Sopenharmony_ci	| Error type                            | CE          |
7158c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7168c2ecf20Sopenharmony_ci	| Memory page                           | 0x283       |
7178c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7188c2ecf20Sopenharmony_ci	| Offset in the page                    | 0xce0       |
7198c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7208c2ecf20Sopenharmony_ci	| The byte granularity                  | grain 8     |
7218c2ecf20Sopenharmony_ci	| or resolution of the error            |             |
7228c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
	| The error syndrome                    | 0x6ec3      |
	+---------------------------------------+-------------+
	| Memory row                            | row 0       |
	+---------------------------------------+-------------+
	| Memory channel                        | channel 1   |
	+---------------------------------------+-------------+
	| DIMM label, if set prior              | DIMM_B1     |
7308c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7318c2ecf20Sopenharmony_ci	| And then an optional, driver-specific |             |
7328c2ecf20Sopenharmony_ci	| message that may have additional      |             |
7338c2ecf20Sopenharmony_ci	| information.                          |             |
7348c2ecf20Sopenharmony_ci	+---------------------------------------+-------------+
7358c2ecf20Sopenharmony_ci
7368c2ecf20Sopenharmony_ciBoth UEs and CEs with no info will lack all but memory controller, error
7378c2ecf20Sopenharmony_citype, a notice of "no info" and then an optional, driver-specific error
7388c2ecf20Sopenharmony_cimessage.
7398c2ecf20Sopenharmony_ci
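A log line with the structure described in this section can be split into
its fields with a short ``awk`` filter. This is a sketch only: the function
name ``parse_edac_line`` and the output keys are invented for the example,
and it assumes exactly the message layout of the sample lines in this
section. Messages with "no info", or messages from drivers and kernels that
format the line differently, are rejected with a non-zero exit status.

.. code-block:: sh

    # parse_edac_line LINE
    # Print the fields of one EDAC memory error message as key=value lines.
    parse_edac_line() {
        printf '%s\n' "$1" | awk '
        {
            start = index($0, "EDAC MC")
            if (start == 0) exit 1
            # Splitting on the double quote isolates the DIMM label in q[2].
            n = split(substr($0, start), q, "\"")
            if (n < 3) exit 1
            gsub(/[:,]/, "", q[1])
            # Expect: EDAC MCn CE|UE page P offset O grain G syndrome S
            #         row R channel C
            if (split(q[1], f, " ") != 15) exit 1
            msg = q[3]
            sub(/^: */, "", msg)
            printf "mc=%s\ntype=%s\npage=%s\noffset=%s\n", f[2], f[3], f[5], f[7]
            printf "grain=%s\nsyndrome=%s\nrow=%s\n", f[9], f[11], f[13]
            printf "channel=%s\nlabel=%s\ndriver_msg=%s\n", f[15], q[2], msg
        }'
    }
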
7408c2ecf20Sopenharmony_ci
7418c2ecf20Sopenharmony_ciPCI Bus Parity Detection
7428c2ecf20Sopenharmony_ci------------------------
7438c2ecf20Sopenharmony_ci
On Header Type 00 devices, the primary status register is checked for any
parity error, regardless of whether parity is enabled on the device.
(The spec indicates parity is generated in some cases.) On Header
Type 01 bridges, the secondary status register is also checked to see
whether a parity error occurred on the bus on the other side of the bridge.
7498c2ecf20Sopenharmony_ci
7508c2ecf20Sopenharmony_ci
7518c2ecf20Sopenharmony_ciSysfs configuration
7528c2ecf20Sopenharmony_ci-------------------
7538c2ecf20Sopenharmony_ci
7548c2ecf20Sopenharmony_ciUnder ``/sys/devices/system/edac/pci`` are control and attribute files as
7558c2ecf20Sopenharmony_cifollows:
7568c2ecf20Sopenharmony_ci
7578c2ecf20Sopenharmony_ci
7588c2ecf20Sopenharmony_ci- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
7598c2ecf20Sopenharmony_ci
7608c2ecf20Sopenharmony_ci	This control file enables or disables the PCI Bus Parity scanning
7618c2ecf20Sopenharmony_ci	operation. Writing a 1 to this file enables the scanning. Writing
7628c2ecf20Sopenharmony_ci	a 0 to this file disables the scanning.
7638c2ecf20Sopenharmony_ci
7648c2ecf20Sopenharmony_ci	Enable::
7658c2ecf20Sopenharmony_ci
7668c2ecf20Sopenharmony_ci		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
7678c2ecf20Sopenharmony_ci
7688c2ecf20Sopenharmony_ci	Disable::
7698c2ecf20Sopenharmony_ci
7708c2ecf20Sopenharmony_ci		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
7718c2ecf20Sopenharmony_ci
7728c2ecf20Sopenharmony_ci
7738c2ecf20Sopenharmony_ci- ``pci_parity_count`` - Parity Count
7748c2ecf20Sopenharmony_ci
7758c2ecf20Sopenharmony_ci	This attribute file will display the number of parity errors that
7768c2ecf20Sopenharmony_ci	have been detected.
7778c2ecf20Sopenharmony_ci
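The two files described above can be read together to check whether scanning
is enabled and whether any parity error has been counted. A minimal sketch,
in which the function name ``pci_parity_status`` and the optional directory
argument are illustrative only:

.. code-block:: sh

    # pci_parity_status [EDAC_PCI_DIR]
    # Print the scanning state and the parity error count on one line.
    pci_parity_status() (
        dir=${1:-/sys/devices/system/edac/pci}
        printf 'check_pci_parity=%s pci_parity_count=%s\n' \
            "$(cat "$dir/check_pci_parity")" \
            "$(cat "$dir/pci_parity_count")"
    )
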
7788c2ecf20Sopenharmony_ci
7798c2ecf20Sopenharmony_ciModule parameters
7808c2ecf20Sopenharmony_ci-----------------
7818c2ecf20Sopenharmony_ci
7828c2ecf20Sopenharmony_ci- ``edac_mc_panic_on_ue`` - Panic on UE control file
7838c2ecf20Sopenharmony_ci
7848c2ecf20Sopenharmony_ci	An uncorrectable error will cause a machine panic.  This is usually
7858c2ecf20Sopenharmony_ci	desirable.  It is a bad idea to continue when an uncorrectable error
7868c2ecf20Sopenharmony_ci	occurs - it is indeterminate what was uncorrected and the operating
7878c2ecf20Sopenharmony_ci	system context might be so mangled that continuing will lead to further
7888c2ecf20Sopenharmony_ci	corruption. If the kernel has MCE configured, then EDAC will never
7898c2ecf20Sopenharmony_ci	notice the UE.
7908c2ecf20Sopenharmony_ci
7918c2ecf20Sopenharmony_ci	LOAD TIME::
7928c2ecf20Sopenharmony_ci
7938c2ecf20Sopenharmony_ci		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
7948c2ecf20Sopenharmony_ci
7958c2ecf20Sopenharmony_ci	RUN TIME::
7968c2ecf20Sopenharmony_ci
7978c2ecf20Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
7988c2ecf20Sopenharmony_ci
7998c2ecf20Sopenharmony_ci
8008c2ecf20Sopenharmony_ci- ``edac_mc_log_ue`` - Log UE control file
8018c2ecf20Sopenharmony_ci
8028c2ecf20Sopenharmony_ci
	Generate kernel messages describing uncorrectable errors.  These errors
	are reported through the system message log.  UE statistics
	will be accumulated even when UE logging is disabled.
8068c2ecf20Sopenharmony_ci
8078c2ecf20Sopenharmony_ci	LOAD TIME::
8088c2ecf20Sopenharmony_ci
8098c2ecf20Sopenharmony_ci		module/kernel parameter: edac_mc_log_ue=[0|1]
8108c2ecf20Sopenharmony_ci
8118c2ecf20Sopenharmony_ci	RUN TIME::
8128c2ecf20Sopenharmony_ci
8138c2ecf20Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
8148c2ecf20Sopenharmony_ci
8158c2ecf20Sopenharmony_ci
8168c2ecf20Sopenharmony_ci- ``edac_mc_log_ce`` - Log CE control file
8178c2ecf20Sopenharmony_ci
8188c2ecf20Sopenharmony_ci
	Generate kernel messages describing correctable errors.  These
	errors are reported through the system message log.
	CE statistics will be accumulated even when CE logging is disabled.
8228c2ecf20Sopenharmony_ci
8238c2ecf20Sopenharmony_ci	LOAD TIME::
8248c2ecf20Sopenharmony_ci
8258c2ecf20Sopenharmony_ci		module/kernel parameter: edac_mc_log_ce=[0|1]
8268c2ecf20Sopenharmony_ci
8278c2ecf20Sopenharmony_ci	RUN TIME::
8288c2ecf20Sopenharmony_ci
8298c2ecf20Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
8308c2ecf20Sopenharmony_ci
8318c2ecf20Sopenharmony_ci
8328c2ecf20Sopenharmony_ci- ``edac_mc_poll_msec`` - Polling period control file
8338c2ecf20Sopenharmony_ci
8348c2ecf20Sopenharmony_ci
	The time period, in milliseconds, for polling for error information.
	Too small a value wastes resources.  Too large a value might delay
	necessary handling of errors and might lose valuable information for
	locating the error.  1000 milliseconds (once each second) is the current
	default. Systems that require all the bandwidth they can get may
	increase this.

	LOAD TIME::

		module/kernel parameter: edac_mc_poll_msec=<milliseconds>
8458c2ecf20Sopenharmony_ci
8468c2ecf20Sopenharmony_ci	RUN TIME::
8478c2ecf20Sopenharmony_ci
8488c2ecf20Sopenharmony_ci		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
8498c2ecf20Sopenharmony_ci
8508c2ecf20Sopenharmony_ci
8518c2ecf20Sopenharmony_ci- ``panic_on_pci_parity`` - Panic on PCI PARITY Error
8528c2ecf20Sopenharmony_ci
8538c2ecf20Sopenharmony_ci
8548c2ecf20Sopenharmony_ci	This control file enables or disables panicking when a parity
8558c2ecf20Sopenharmony_ci	error has been detected.
8568c2ecf20Sopenharmony_ci
8578c2ecf20Sopenharmony_ci
8588c2ecf20Sopenharmony_ci	module/kernel parameter::
8598c2ecf20Sopenharmony_ci
8608c2ecf20Sopenharmony_ci			edac_panic_on_pci_pe=[0|1]
8618c2ecf20Sopenharmony_ci
8628c2ecf20Sopenharmony_ci	Enable::
8638c2ecf20Sopenharmony_ci
8648c2ecf20Sopenharmony_ci		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
8658c2ecf20Sopenharmony_ci
8668c2ecf20Sopenharmony_ci	Disable::
8678c2ecf20Sopenharmony_ci
8688c2ecf20Sopenharmony_ci		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
8698c2ecf20Sopenharmony_ci
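Before changing any of these parameters at run time, it can help to print
their current values. The sketch below lists every file in the parameter
directory used by the ``RUN TIME`` examples above. The function name
``show_edac_params`` and the optional directory argument are illustrative
only.

.. code-block:: sh

    # show_edac_params [PARAM_DIR]
    # Print every readable parameter file as name=value.
    show_edac_params() (
        dir=${1:-/sys/module/edac_core/parameters}
        for f in "$dir"/*; do
            [ -f "$f" ] && [ -r "$f" ] || continue
            printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
        done
    )
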
8708c2ecf20Sopenharmony_ci
8718c2ecf20Sopenharmony_ci
8728c2ecf20Sopenharmony_ciEDAC device type
8738c2ecf20Sopenharmony_ci----------------
8748c2ecf20Sopenharmony_ci
8758c2ecf20Sopenharmony_ciIn the header file, edac_pci.h, there is a series of edac_device structures
8768c2ecf20Sopenharmony_ciand APIs for the EDAC_DEVICE.
8778c2ecf20Sopenharmony_ci
8788c2ecf20Sopenharmony_ciUser space access to an edac_device is through the sysfs interface.
8798c2ecf20Sopenharmony_ci
8808c2ecf20Sopenharmony_ciAt the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
8818c2ecf20Sopenharmony_ciwill appear.
8828c2ecf20Sopenharmony_ci
There is a three-level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::
8868c2ecf20Sopenharmony_ci
8878c2ecf20Sopenharmony_ci	/sys/devices/system/edac/test-instance
8888c2ecf20Sopenharmony_ci
In this directory are various controls, a symlink and one or more ``instance``
directories.
8918c2ecf20Sopenharmony_ci
8928c2ecf20Sopenharmony_ciThe standard default controls are:
8938c2ecf20Sopenharmony_ci
8948c2ecf20Sopenharmony_ci	==============	=======================================================
8958c2ecf20Sopenharmony_ci	log_ce		boolean to log CE events
8968c2ecf20Sopenharmony_ci	log_ue		boolean to log UE events
	panic_on_ue	boolean to ``panic`` the system if a UE is encountered
8988c2ecf20Sopenharmony_ci			(default off, can be set true via startup script)
8998c2ecf20Sopenharmony_ci	poll_msec	time period between POLL cycles for events
9008c2ecf20Sopenharmony_ci	==============	=======================================================
9018c2ecf20Sopenharmony_ci
The test_device_edac device adds at least one custom control of its own:
9038c2ecf20Sopenharmony_ci
9048c2ecf20Sopenharmony_ci	==============	==================================================
9058c2ecf20Sopenharmony_ci	test_bits	which in the current test driver does nothing but
9068c2ecf20Sopenharmony_ci			show how it is installed. A ported driver can
9078c2ecf20Sopenharmony_ci			add one or more such controls and/or attributes
9088c2ecf20Sopenharmony_ci			for specific uses.
9098c2ecf20Sopenharmony_ci			One out-of-tree driver uses controls here to allow
9108c2ecf20Sopenharmony_ci			for ERROR INJECTION operations to hardware
9118c2ecf20Sopenharmony_ci			injection registers
9128c2ecf20Sopenharmony_ci	==============	==================================================
9138c2ecf20Sopenharmony_ci
9148c2ecf20Sopenharmony_ciThe symlink points to the 'struct dev' that is registered for this edac_device.
9158c2ecf20Sopenharmony_ci
9168c2ecf20Sopenharmony_ciInstances
9178c2ecf20Sopenharmony_ci---------
9188c2ecf20Sopenharmony_ci
9198c2ecf20Sopenharmony_ciOne or more instance directories are present. For the ``test_device_edac``
9208c2ecf20Sopenharmony_cicase:
9218c2ecf20Sopenharmony_ci
9228c2ecf20Sopenharmony_ci	+----------------+
9238c2ecf20Sopenharmony_ci	| test-instance0 |
9248c2ecf20Sopenharmony_ci	+----------------+
9258c2ecf20Sopenharmony_ci
9268c2ecf20Sopenharmony_ci
In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.
9298c2ecf20Sopenharmony_ci
9308c2ecf20Sopenharmony_ci	==============	====================================
9318c2ecf20Sopenharmony_ci	ce_count	total of CE events of subdirectories
9328c2ecf20Sopenharmony_ci	ue_count	total of UE events of subdirectories
9338c2ecf20Sopenharmony_ci	==============	====================================
9348c2ecf20Sopenharmony_ci
9358c2ecf20Sopenharmony_ciBlocks
9368c2ecf20Sopenharmony_ci------
9378c2ecf20Sopenharmony_ci
9388c2ecf20Sopenharmony_ciAt the lowest directory level is the ``block`` directory. There can be 0, 1
9398c2ecf20Sopenharmony_cior more blocks specified in each instance:
9408c2ecf20Sopenharmony_ci
9418c2ecf20Sopenharmony_ci	+-------------+
9428c2ecf20Sopenharmony_ci	| test-block0 |
9438c2ecf20Sopenharmony_ci	+-------------+
9448c2ecf20Sopenharmony_ci
9458c2ecf20Sopenharmony_ciIn this directory the default attributes are:
9468c2ecf20Sopenharmony_ci
9478c2ecf20Sopenharmony_ci	==============	================================================
	ce_count	counter of CE events for this ``block``
			of hardware being monitored
	ue_count	counter of UE events for this ``block``
			of hardware being monitored
9528c2ecf20Sopenharmony_ci	==============	================================================
9538c2ecf20Sopenharmony_ci
9548c2ecf20Sopenharmony_ci
9558c2ecf20Sopenharmony_ciThe ``test_device_edac`` device adds 4 attributes and 1 control:
9568c2ecf20Sopenharmony_ci
9578c2ecf20Sopenharmony_ci	================== ====================================================
9588c2ecf20Sopenharmony_ci	test-block-bits-0	for every POLL cycle this counter
9598c2ecf20Sopenharmony_ci				is incremented
9608c2ecf20Sopenharmony_ci	test-block-bits-1	every 10 cycles, this counter is bumped once,
9618c2ecf20Sopenharmony_ci				and test-block-bits-0 is set to 0
9628c2ecf20Sopenharmony_ci	test-block-bits-2	every 100 cycles, this counter is bumped once,
9638c2ecf20Sopenharmony_ci				and test-block-bits-1 is set to 0
9648c2ecf20Sopenharmony_ci	test-block-bits-3	every 1000 cycles, this counter is bumped once,
9658c2ecf20Sopenharmony_ci				and test-block-bits-2 is set to 0
9668c2ecf20Sopenharmony_ci	================== ====================================================
9678c2ecf20Sopenharmony_ci
9688c2ecf20Sopenharmony_ci
9698c2ecf20Sopenharmony_ci	================== ====================================================
	reset-counters		writing anything to this control will
				reset all the above counters.
9728c2ecf20Sopenharmony_ci	================== ====================================================
9738c2ecf20Sopenharmony_ci
9748c2ecf20Sopenharmony_ci
Use of the ``test_device_edac`` driver should enable others to create their own
unique drivers for their hardware systems.
9778c2ecf20Sopenharmony_ci
9788c2ecf20Sopenharmony_ciThe ``test_device_edac`` sample driver is located at the
9798c2ecf20Sopenharmony_cihttp://bluesmoke.sourceforge.net project site for EDAC.
9808c2ecf20Sopenharmony_ci
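The three-level layout (device, instance, block) can be walked with a short
shell function that prints the default ``ce_count`` and ``ue_count``
attributes at each level. This is a sketch under stated assumptions: the
function name ``dump_edac_device`` is illustrative, and any subdirectory
without a ``ce_count`` file (such as the symlink) is simply skipped.

.. code-block:: sh

    # dump_edac_device DEVICE_DIR
    # Print the CE/UE counters of every instance, then of each of its blocks.
    dump_edac_device() (
        dev=$1
        for inst in "$dev"/*/; do
            inst=${inst%/}
            [ -r "$inst/ce_count" ] || continue     # not an instance directory
            printf '%s ce=%s ue=%s\n' "${inst##*/}" \
                "$(cat "$inst/ce_count")" "$(cat "$inst/ue_count")"
            for blk in "$inst"/*/; do
                blk=${blk%/}
                [ -r "$blk/ce_count" ] || continue  # not a block directory
                printf '  %s ce=%s ue=%s\n' "${blk##*/}" \
                    "$(cat "$blk/ce_count")" "$(cat "$blk/ue_count")"
            done
        done
    )

For the example device above it would be called as
``dump_edac_device /sys/devices/system/edac/test-instance``.
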
9818c2ecf20Sopenharmony_ci
9828c2ecf20Sopenharmony_ciUsage of EDAC APIs on Nehalem and newer Intel CPUs
9838c2ecf20Sopenharmony_ci--------------------------------------------------
9848c2ecf20Sopenharmony_ci
On older Intel architectures, the memory controller was part of the North
Bridge chipset. Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.
9898c2ecf20Sopenharmony_ci
This chapter covers the differences in the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``skx_edac``.
9938c2ecf20Sopenharmony_ci
9948c2ecf20Sopenharmony_ci.. note::
9958c2ecf20Sopenharmony_ci
   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.
9998c2ecf20Sopenharmony_ci
1) There is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API treats the csrow as the smallest unit, the
   driver sequentially maps channel/DIMM pairs onto different csrows.
10118c2ecf20Sopenharmony_ci
10128c2ecf20Sopenharmony_ci   For example, supposing the following layout::
10138c2ecf20Sopenharmony_ci
10148c2ecf20Sopenharmony_ci	Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
10158c2ecf20Sopenharmony_ci	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
10168c2ecf20Sopenharmony_ci	  dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
10178c2ecf20Sopenharmony_ci        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
10188c2ecf20Sopenharmony_ci	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
10198c2ecf20Sopenharmony_ci	Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
10208c2ecf20Sopenharmony_ci	  dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
10218c2ecf20Sopenharmony_ci
10228c2ecf20Sopenharmony_ci   The driver will map it as::
10238c2ecf20Sopenharmony_ci
10248c2ecf20Sopenharmony_ci	csrow0: channel 0, dimm0
10258c2ecf20Sopenharmony_ci	csrow1: channel 0, dimm1
10268c2ecf20Sopenharmony_ci	csrow2: channel 1, dimm0
10278c2ecf20Sopenharmony_ci	csrow3: channel 2, dimm0
10288c2ecf20Sopenharmony_ci
   That is, it exports one DIMM per csrow.
10308c2ecf20Sopenharmony_ci
10318c2ecf20Sopenharmony_ci   Each QPI is exported as a different memory controller.
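
   The sequential numbering can be illustrated in a few lines of shell. This
   is not the driver's code, only a sketch of the rule: the function name
   ``map_csrows`` is made up, and it assumes the channel/DIMM pairs arrive on
   standard input in the order in which they are found.

   .. code-block:: sh

       # map_csrows
       # Read "<channel> <dimm>" pairs and number them sequentially as csrows.
       map_csrows() (
           csrow=0
           while read -r channel dimm; do
               [ -n "$channel" ] || continue        # ignore blank lines
               printf 'csrow%d: channel %d, dimm%d\n' "$csrow" "$channel" "$dimm"
               csrow=$((csrow + 1))
           done
       )

   Feeding it the pairs ``0 0``, ``0 1``, ``1 0`` and ``2 0`` reproduces the
   four-line mapping shown above.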
10328c2ecf20Sopenharmony_ci
10338c2ecf20Sopenharmony_ci2) The MC has the ability to inject errors to test drivers. The drivers
10348c2ecf20Sopenharmony_ci   implement this functionality via some error injection nodes:
10358c2ecf20Sopenharmony_ci
10368c2ecf20Sopenharmony_ci   For injecting a memory error, there are some sysfs nodes, under
10378c2ecf20Sopenharmony_ci   ``/sys/devices/system/edac/mc/mc?/``:
10388c2ecf20Sopenharmony_ci
10398c2ecf20Sopenharmony_ci   - ``inject_addrmatch/*``:
10408c2ecf20Sopenharmony_ci      Controls the error injection mask register. It is possible to specify
10418c2ecf20Sopenharmony_ci      several characteristics of the address to match an error code::
10428c2ecf20Sopenharmony_ci
10438c2ecf20Sopenharmony_ci         dimm = the affected dimm. Numbers are relative to a channel;
10448c2ecf20Sopenharmony_ci         rank = the memory rank;
10458c2ecf20Sopenharmony_ci         channel = the channel that will generate an error;
10468c2ecf20Sopenharmony_ci         bank = the affected bank;
10478c2ecf20Sopenharmony_ci         page = the page address;
10488c2ecf20Sopenharmony_ci         column (or col) = the address column.
10498c2ecf20Sopenharmony_ci
      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to "any".

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any value, you can do::

		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
      Specifies which bits will be affected by the error.

   - ``inject_section``:
      Specifies which ECC cache section will get the error::

		3 for both
		2 for the highest
		1 for the lowest

   - ``inject_type``:
      Specifies the type of error, as a combination of the following bits::

		bit 0 - repeat
		bit 1 - ecc
		bit 2 - parity

   - ``inject_enable``:
      Starts the error generation when a non-zero value is written.

   All injection nodes can be read. Root permission is needed to write to them.
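   As a minimal sketch of how the ``inject_type`` bits combine, the shell
   function below turns the bit names listed above into the number to write
   to the node. ``inject_type_value`` is a hypothetical helper written for
   this example; it is not part of the driver or of any EDAC tool::

	# Hypothetical helper: combine inject_type bit names into one value.
	# Bit positions follow the list above: repeat=0, ecc=1, parity=2.
	inject_type_value() {
	    value=0
	    for name in "$@"; do
	        case "$name" in
	            repeat) value=$((value | 1)) ;;
	            ecc)    value=$((value | 2)) ;;
	            parity) value=$((value | 4)) ;;
	            *) echo "unknown error type: $name" >&2; return 1 ;;
	        esac
	    done
	    echo "$value"
	}

	inject_type_value ecc          # prints 2
	inject_type_value repeat ecc   # prints 3

   The printed number is what gets written to ``inject_type`` as root; a
   value of 3, for instance, sets both the repeat and the ecc bits.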

   The datasheet states that the error will only be generated after a write
   to an address that matches ``inject_addrmatch``. In practice, however, a
   read also seems to produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
	dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >/dev/null 2>&1

   For socket 1, replace ``mc0`` with ``mc1`` in the above commands.
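   The same sequence can be wrapped in a small function that takes the
   socket number, so that ``mc0`` does not have to be edited by hand. This is
   only a sketch: ``inject_on_channel`` and the ``EDAC_MC_ROOT`` variable are
   names made up for this example, while the sysfs nodes and the values
   written to them are the ones used above::

	# EDAC_MC_ROOT is an assumption of this sketch; it only exists so the
	# function can be pointed at another directory for a dry run.
	EDAC_MC_ROOT=${EDAC_MC_ROOT:-/sys/devices/system/edac/mc}

	# Hypothetical wrapper: $1 is the socket (memory controller) number,
	# $2 is the channel to match.
	inject_on_channel() {
	    mc="$EDAC_MC_ROOT/mc$1"
	    echo "$2" >"$mc/inject_addrmatch/channel" &&
	    echo 2 >"$mc/inject_type" &&      # bit 1: ECC error
	    echo 64 >"$mc/inject_eccmask" &&
	    echo 3 >"$mc/inject_section" &&   # both sections
	    echo 1 >"$mc/inject_enable"
	}

	# As root, for socket 1, channel 2:
	#   inject_on_channel 1 2

   The function only programs the injection nodes. The error itself is still
   generated by the next memory access that matches, such as the ``dd``
   command above.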

   The generated error message will look like::

	EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
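   When checking such messages from a script, individual ``name=value``
   fields can be pulled out with ``sed``. The function below is a sketch
   written against the sample line above; ``edac_field`` is a hypothetical
   name, and the pattern assumes the field is written without spaces around
   the ``=`` sign (so it does not handle ``addr = ...``)::

	# Hypothetical helper: print the value of the field named $1 from
	# EDAC messages read on standard input.
	edac_field() {
	    sed -n "s/.*[ (]$1=\([^,) ]*\).*/\1/p"
	}

	# Example use on the kernel log:
	#   dmesg | grep 'EDAC MC' | edac_field Channel

   For the message above, ``edac_field syndrome`` gives ``0x00000040``, which
   is the value 64 written to ``inject_eccmask`` in the example.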

3) Corrected Error memory register counters

   These newer MCs have registers that count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs (UDIMMs). As the
   chipset offers some counters that also work with UDIMMs (but with a coarser
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

	$ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo "$i"; cat "$i"; done
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
	0
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
	0
	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
	0

   Errors on different csrows, but at the same dimm number, increment the
   same counter. So, in this memory mapping::

	csrow0: channel 0, dimm0
	csrow1: channel 0, dimm1
	csrow2: channel 1, dimm0
	csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm (dimm0)
   of any channel: csrow0, csrow2 or csrow3 in this mapping.

   The hardware will increment udimm1 for an error at the second dimm (dimm1)
   of any channel: csrow1 in this mapping.

   The hardware will increment udimm2 for an error at the third dimm (dimm2)
   of any channel; no csrow uses it in this mapping.
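   To see at a glance which dimm position is accumulating corrected errors,
   it can help to print only the non-zero counters. The function below is a
   sketch; ``show_udimm_counts`` is a made-up name, and its optional
   directory argument exists only so that it can be tried against a copy of
   the files::

	# Hypothetical helper: list the udimm counters that are not zero.
	show_udimm_counts() {
	    dir=${1:-/sys/devices/system/edac/mc/mc0/all_channel_counts}
	    for f in "$dir"/udimm*; do
	        [ -r "$f" ] || continue
	        count=$(cat "$f")
	        [ "$count" -gt 0 ] && echo "${f##*/}: $count"
	    done
	    return 0
	}

	# Example use:
	#   show_udimm_counts

   With the mapping above, a line for ``udimm0`` would still leave three
   candidate csrows, since the counter does not record the channel.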

4) Standard error counters

   The standard error counters are updated when the driver receives an
   mcelog error. With UDIMMs, the errors are counted by software, so some
   errors may be lost. With RDIMMs, the counters display the contents of
   the registers.
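   A short loop can print these counters for every memory controller. The
   sketch below assumes the usual EDAC core nodes ``ce_count`` and
   ``ue_count`` under each ``mc?`` directory, which this section does not
   name explicitly; ``show_mc_counts`` and its optional directory argument
   are made up for the example::

	# Hypothetical helper: print corrected (CE) and uncorrected (UE)
	# error totals for each memory controller.
	show_mc_counts() {
	    root=${1:-/sys/devices/system/edac/mc}
	    for mc in "$root"/mc[0-9]*; do
	        [ -d "$mc" ] || continue
	        echo "${mc##*/}: CE=$(cat "$mc/ce_count") UE=$(cat "$mc/ue_count")"
	    done
	}

	# Example use:
	#   show_mc_counts

   On UDIMM systems, comparing this output with ``all_channel_counts/`` is
   one way to notice errors that the software counters missed.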

Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
	   Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
	   Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
	   Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
	   Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
	   Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
	   Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh