.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers meant to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained

  * Generally measured as Mean Time Between Repair (MTBR)

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime.
It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of detected errors, it is possible
to identify if the probability of hardware errors is increasing and, in such
a case, do preventive maintenance to replace a degraded component while
those errors are still correctable.

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors in a bit packet
is below a threshold. If the number of errors is above it, those mechanisms
can indicate with a high degree of confidence that an error happened, but
they can't correct it.

Also, sometimes an error occurs on a component that is not in use.
For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors was above the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, eventually replacing the affected hardware by a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.
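
The difference between a correctable and an uncorrected error can be
illustrated with a toy code. The sketch below is not what any real memory
controller implements; it is a minimal Hamming(7,4) code extended with one
overall parity bit (a SECDED scheme), and the function names ``encode`` and
``decode`` are hypothetical. A single flipped bit is below the correction
threshold and is reported as ``CE``; two flipped bits are detected but cannot
be repaired, and are reported as ``UE``.

.. code-block:: python

    def encode(nibble):
        """Encode 4 data bits into an 8-bit word (index 0 = overall parity)."""
        data = [(nibble >> i) & 1 for i in range(4)]
        word = [0] * 8
        # Data bits live at the positions that are not powers of two.
        word[3], word[5], word[6], word[7] = data
        for p in (1, 2, 4):
            # Parity bit p covers every position whose index has bit p set.
            word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
        # The extra bit makes the parity of the whole word even.
        word[0] = sum(word[1:]) % 2
        return word


    def decode(word):
        """Return (nibble, status), where status is "ok", "CE" or "UE"."""
        syndrome = 0
        for i in range(1, 8):
            if word[i]:
                syndrome ^= i  # XOR of the positions of all set bits
        odd_parity = sum(word) % 2
        fixed = list(word)
        if syndrome == 0 and not odd_parity:
            status = "ok"
        elif odd_parity:
            # One bit flipped: the syndrome is its position
            # (0 means the overall parity bit itself).
            fixed[syndrome] ^= 1
            status = "CE"
        else:
            # Non-zero syndrome with even parity: two bits flipped.
            status = "UE"
        nibble = fixed[3] | fixed[5] << 1 | fixed[6] << 2 | fixed[7] << 3
        return nibble, status

For instance, flipping any single bit of ``encode(0b1011)`` before calling
``decode`` still yields the value 11 with status ``CE``, while flipping any
two bits yields status ``UE`` and a value that must not be trusted.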

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interleave memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

    Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: SODIMM
        Set: None
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2133 MHz
        Rank: 2
        Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such a system, the *total width* is equal to the
*data width*. It means that such a memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

    Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM_A1
        Bank Locator: Not Specified
        Type: DDR3
        Type Detail: Synchronous Registered (Buffered)
        Speed: 1600 MHz
        Rank: 2
        Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).
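
The width comparison described above can be automated. The following is a
minimal sketch, assuming the ``dmidecode`` output has already been captured
as text in the layout shown in this section; the helper names
``parse_memory_devices`` and ``ecc_bits`` are hypothetical and not part of
any tool.

.. code-block:: python

    import re


    def parse_memory_devices(text):
        """Split dmidecode text into one dict per "Memory Device" record."""
        devices = []
        current = None
        for line in text.splitlines():
            stripped = line.strip()
            if stripped == "Memory Device":
                current = {}
                devices.append(current)
            elif current is not None and ":" in stripped:
                key, _, value = stripped.partition(":")
                current[key.strip()] = value.strip()
        return devices


    def ecc_bits(device):
        """Return total width minus data width in bits, or None if unknown."""
        widths = []
        for key in ("Total Width", "Data Width"):
            match = re.match(r"(\d+) bits$", device.get(key, ""))
            if match is None:
                return None
            widths.append(int(match.group(1)))
        return widths[0] - widths[1]

A result of 0 corresponds to the desktop module above (no extra bits), while
8 corresponds to the server module with 72 bits of total width. A ``None``
result means the BIOS did not report a usable width, so no conclusion should
be drawn.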

To make things even worse, it is not uncommon for systems with different
labels on their boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits which are used for the error detection and correction mechanisms
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened.
If not, an
Uncorrected Error (UE) happened.

The information about the CE/UE errors is stored on some special registers
at the memory controller and can be accessed by reading such registers,
either by the BIOS, by some special CPUs or by the Linux EDAC driver. On x86
64-bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation on a
   mode called "Lock-Step", where it groups two memory modules together,
   doing 128-bit reads/writes. That gives 16 bits for error correction, which
   significantly improves the error correction mechanism, at the expense
   that, when an error happens, there's no way to know what memory module is
   to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
   On such mode, the same data is written to two memory modules. At read,
   the system checks both memory modules, in order to check if both provide
   identical data. On such configuration, when an error happens, there's no
   way to know what memory module is to blame. So, it has to blame both
   memory modules (or 4 memory modules, if the system is also on Lock-step
   mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
   please read Documentation/x86/x86_64/machinecheck.rst at the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, on
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting.
The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float" giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

    broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary.
If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

    $ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

    ======= ==============================
    mc      memory controller(s) system
    pci     PCI control and status system
    ======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
   used to refer to a memory module, although there are other memory
   packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
   specification (Version 2.7) defines a memory module in the Common
   Platform Error Record (CPER) section to be an SMBIOS Memory Device
   (Type 17). Throughout this document, and inside the EDAC subsystem, the
   term "dimm" is used for all memory modules, even when they use a
   different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64 bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers.
The following example will assume 2 channels:

    +------------+-----------------------+
    | CS Rows    |       Channels        |
    +------------+-----------+-----------+
    |            |  ``ch0``  |  ``ch1``  |
    +============+===========+===========+
    |            |**DIMM_A0**|**DIMM_B0**|
    +------------+-----------+-----------+
    | ``csrow0`` |   rank0   |   rank0   |
    +------------+-----------+-----------+
    | ``csrow1`` |   rank1   |   rank1   |
    +------------+-----------+-----------+
    |            |**DIMM_A1**|**DIMM_B1**|
    +------------+-----------+-----------+
    | ``csrow2`` |   rank0   |   rank0   |
    +------------+-----------+-----------+
    | ``csrow3`` |   rank1   |   rank1   |
    +------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

    +---------+---------+
    | DIMM_A0 | DIMM_B0 |
    +---------+---------+
    | DIMM_A1 | DIMM_B1 |
    +---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above 2 dual ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single
ranked DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module, see
``rankX`` directories below.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

    ..../edac/mc/
               |
               |->mc0
               |->mc1
               |->mc2
               ....

Under each ``mcX`` directory each csrow is represented by a
``csrowX`` directory, where ``X`` is the csrow index::

    .../mc/mc0/
            |
            |->csrow0
            |->csrow2
            |->csrow3
            ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual ranked set of DIMMs for
channels 0 and 1.

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

    Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
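
As an illustration of that recommendation, the sketch below walks the
``mcX`` directories and collects the per-module label and error counters.
It is only a sketch: the function name ``read_dimm_counters`` is
hypothetical, the attribute names ``dimm_label``, ``dimm_ce_count`` and
``dimm_ue_count`` are assumed to be present in every ``dimmX`` or ``rankX``
directory, and a missing file is reported as ``None`` rather than treated
as an error.

.. code-block:: python

    from pathlib import Path


    def read_dimm_counters(root="/sys/devices/system/edac/mc"):
        """Collect label and CE/UE counters for each dimmX/rankX directory."""
        results = []
        for mc in sorted(Path(root).glob("mc[0-9]*")):
            modules = list(mc.glob("dimm[0-9]*")) + list(mc.glob("rank[0-9]*"))
            for module in sorted(modules):
                entry = {"mc": mc.name, "module": module.name}
                for attr in ("dimm_label", "dimm_ce_count", "dimm_ue_count"):
                    path = module / attr
                    # Attribute files hold one short text value each.
                    entry[attr] = path.read_text().strip() if path.is_file() else None
                results.append(entry)
        return results

The ``root`` parameter exists so that the function can be pointed at a copy
of the tree for testing; on a system without EDAC support the returned list
is simply empty.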

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

    /sys/devices/system/edac/
    ├── mc
    │   ├── mc0
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   ├── mc1
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   └── uevent
    └── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
this ``X`` memory module:

- ``size`` - Total memory managed by this memory module attribute file

    This attribute file displays, in count of megabytes, the memory
    that this memory module contains.

- ``dimm_ue_count`` - Uncorrectable Errors count attribute file

    This attribute file displays the total count of uncorrectable
    errors that have occurred on this DIMM. If panic_on_ue is set
    this counter will not have a chance to increment, since EDAC
    will panic the system.

- ``dimm_ce_count`` - Correctable Errors count attribute file

    This attribute file displays the total count of correctable
    errors that have occurred on this DIMM. This count is very
    important to examine. CEs provide early indications that a
    DIMM is beginning to fail. This count field should be
    monitored for non-zero values, and such information should be
    reported to the system administrator.

- ``dimm_dev_type`` - Device type attribute file

    This attribute file will display what type of DRAM device is
    being utilized on this DIMM.
    Examples:

        - x1
        - x2
        - x4
        - x8

- ``dimm_edac_mode`` - EDAC Mode of operation attribute file

    This attribute file will display what type of Error detection
    and correction is being utilized.

- ``dimm_label`` - memory module label control file

    This control file allows this DIMM to have a label assigned
    to it. With this label in the module, when errors occur
    the output can provide the DIMM label in the system log.
    This becomes vital for panic events to isolate the
    cause of the UE event.

    DIMM Labels must be assigned after booting, with information
    that correctly identifies the physical slot with its
    silk screen label. This information is currently very
    motherboard specific and determination of this information
    must occur in userland at this time.

- ``dimm_location`` - location of the memory module

        The location can have up to 3 levels, and describes how the
        memory controller identifies the location of a memory module.
        Depending on the type of memory and memory controller, it
        can be:

                - *csrow* and *channel* - used when the memory controller
                  doesn't identify a single DIMM - e.g. in ``rankX`` dir;
                - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
                  controllers;
                - *channel*, *slot* - used on Nehalem and newer Intel drivers.

- ``dimm_mem_type`` - Memory Type attribute file

        This attribute file will display what type of memory is currently
        on this DIMM. Normally, either buffered or unbuffered memory.
        Examples:

                - Registered-DDR
                - Unbuffered-DDR

.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is called
  ``rankX`` and works in a similar way to the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.

.. [#f6] There are also some ``power`` directories and ``subsystem``
  symlinks inside the sysfs mapping that are automatically created by
  the sysfs subsystem. Currently, they serve no purpose.

``csrowX`` directories
----------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.

In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:


- ``ue_count`` - Total Uncorrectable Errors count attribute file

        This attribute file displays the total count of uncorrectable
        errors that have occurred on this csrow. If panic_on_ue is set
        this counter will not have a chance to increment, since EDAC
        will panic the system.


- ``ce_count`` - Total Correctable Errors count attribute file

        This attribute file displays the total count of correctable
        errors that have occurred on this csrow. This count is very
        important to examine. CEs provide early indications that a
        DIMM is beginning to fail.
        This count field should be
        monitored for non-zero values, and any such values should be
        reported to the system administrator.


- ``size_mb`` - Total memory managed by this csrow attribute file

        This attribute file displays, in count of megabytes, the memory
        that this csrow contains.


- ``mem_type`` - Memory Type attribute file

        This attribute file will display what type of memory is currently
        on this csrow. Normally, either buffered or unbuffered memory.
        Examples:

                - Registered-DDR
                - Unbuffered-DDR


- ``edac_mode`` - EDAC Mode of operation attribute file

        This attribute file will display what type of Error detection
        and correction is being utilized.


- ``dev_type`` - Device type attribute file

        This attribute file will display what type of DRAM device is
        being utilized on this DIMM.
        Examples:

                - x1
                - x2
                - x4
                - x8


- ``ch0_ce_count`` - Channel 0 CE Count attribute file

        This attribute file will display the count of CEs on this
        DIMM located in channel 0.


- ``ch0_ue_count`` - Channel 0 UE Count attribute file

        This attribute file will display the count of UEs on this
        DIMM located in channel 0.


- ``ch0_dimm_label`` - Channel 0 DIMM Label control file


        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.


- ``ch1_ce_count`` - Channel 1 CE Count attribute file


        This attribute file will display the count of CEs on this
        DIMM located in channel 1.


- ``ch1_ue_count`` - Channel 1 UE Count attribute file


        This attribute file will display the count of UEs on this
        DIMM located in channel 1.


- ``ch1_dimm_label`` - Channel 1 DIMM Label control file

        This control file allows this DIMM to have a label assigned
        to it. With this label in the module, when errors occur
        the output can provide the DIMM label in the system log.
        This becomes vital for panic events to isolate the
        cause of the UE event.

        DIMM Labels must be assigned after booting, with information
        that correctly identifies the physical slot with its
        silk screen label. This information is currently very
        motherboard specific and determination of this information
        must occur in userland at this time.
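
The error counters in the ``csrowX`` and ``dimmX`` (or ``rankX``)
directories described above can be polled from user space. The following
is a minimal sketch, not part of EDAC itself: the function name
``report_edac_counts`` and its optional root-directory argument are
assumptions made for illustration; only the counter file names come from
this document.

```shell
#!/bin/sh
# Hypothetical helper: print every non-zero CE/UE counter found below an
# EDAC "mc" directory. $1 is the directory to scan; it defaults to the
# standard sysfs location.
report_edac_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ce_count "$root"/mc*/csrow*/ue_count \
             "$root"/mc*/dimm*/dimm_ce_count "$root"/mc*/dimm*/dimm_ue_count \
             "$root"/mc*/rank*/dimm_ce_count "$root"/mc*/rank*/dimm_ue_count; do
        # A pattern that matched nothing is left unexpanded; skip it.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        # Only counters above zero are worth reporting.
        if [ "$count" -gt 0 ] 2>/dev/null; then
            echo "${f#"$root"/}: $count"
        fi
    done
}
```

Calling ``report_edac_counts`` without an argument scans the live sysfs
tree and prints nothing while all counters are zero. Passing a directory
makes it possible to try the function against a copy of the tree.
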


System Logging
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected::

  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac


The structure of the message is:

  +---------------------------------------+-------------+
  | Content                               | Example     |
  +=======================================+=============+
  | The memory controller                 | MC0         |
  +---------------------------------------+-------------+
  | Error type                            | CE          |
  +---------------------------------------+-------------+
  | Memory page                           | 0x283       |
  +---------------------------------------+-------------+
  | Offset in the page                    | 0xce0       |
  +---------------------------------------+-------------+
  | The byte granularity                  | grain 8     |
  | or resolution of the error            |             |
  +---------------------------------------+-------------+
  | The error syndrome                    | 0x6ec3      |
  +---------------------------------------+-------------+
  | Memory row                            | row 0       |
  +---------------------------------------+-------------+
  | Memory channel                        | channel 1   |
  +---------------------------------------+-------------+
  | DIMM label, if set prior              | DIMM_B1     |
  +---------------------------------------+-------------+
  | And then an optional, driver-specific |             |
  | message that may have additional      |             |
  | information.                          |             |
  +---------------------------------------+-------------+

Both UEs and CEs with no info will lack all but memory controller, error
type, a notice of "no info" and then an optional, driver-specific error
message.


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases.) On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.
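
To make the status check above more concrete, the sketch below decodes
the parity-related bits of a 16-bit PCI Status register value in user
space. It is only an illustration and not how EDAC performs its scan in
the kernel: the function name ``pci_parity_flags`` is hypothetical, and
the bit positions (bit 15 for Detected Parity Error, bit 8 for Master
Data Parity Error) are taken from the PCI specification rather than from
this document.

```shell
#!/bin/sh
# Hypothetical helper: decode the parity bits of a PCI Status register.
# $1 is the 16-bit register value in hexadecimal, with or without "0x".
pci_parity_flags() {
    status=$(( 0x${1#0x} ))
    flags=""
    # Bit 15: Detected Parity Error
    if [ $(( status & 0x8000 )) -ne 0 ]; then
        flags="$flags detected_parity_error"
    fi
    # Bit 8: Master Data Parity Error
    if [ $(( status & 0x0100 )) -ne 0 ]; then
        flags="$flags master_data_parity_error"
    fi
    # Drop the leading separator; prints an empty line if no bit is set.
    echo "${flags# }"
}
```

On a live system, the register value could be read with a tool such as
``setpci`` from pciutils (an assumption about the available tooling) and
then passed to the function, for example ``pci_parity_flags 0x8010``.
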


Sysfs configuration
-------------------

Under ``/sys/devices/system/edac/pci`` are control and attribute files as
follows:


- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file

        This control file enables or disables the PCI Bus Parity scanning
        operation. Writing a 1 to this file enables the scanning. Writing
        a 0 to this file disables the scanning.

        Enable::

                echo "1" >/sys/devices/system/edac/pci/check_pci_parity

        Disable::

                echo "0" >/sys/devices/system/edac/pci/check_pci_parity


- ``pci_parity_count`` - Parity Count

        This attribute file will display the number of parity errors that
        have been detected.


Module parameters
-----------------

- ``edac_mc_panic_on_ue`` - Panic on UE control file

        An uncorrectable error will cause a machine panic. This is usually
        desirable.
        It is a bad idea to continue when an uncorrectable error
        occurs - it is indeterminate what was uncorrected and the operating
        system context might be so mangled that continuing will lead to further
        corruption. If the kernel has MCE configured, then EDAC will never
        notice the UE.

        LOAD TIME::

                module/kernel parameter: edac_mc_panic_on_ue=[0|1]

        RUN TIME::

                echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue


- ``edac_mc_log_ue`` - Log UE control file


        Generate kernel messages describing uncorrectable errors. These errors
        are reported through the system message log system. UE statistics
        will be accumulated even when UE logging is disabled.

        LOAD TIME::

                module/kernel parameter: edac_mc_log_ue=[0|1]

        RUN TIME::

                echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue


- ``edac_mc_log_ce`` - Log CE control file


        Generate kernel messages describing correctable errors. These
        errors are reported through the system message log system.
        CE statistics will be accumulated even when CE logging is disabled.

        LOAD TIME::

                module/kernel parameter: edac_mc_log_ce=[0|1]

        RUN TIME::

                echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce


- ``edac_mc_poll_msec`` - Polling period control file


        The time period, in milliseconds, for polling for error information.
        Too small a value wastes resources. Too large a value might delay
        necessary handling of errors and might lose valuable information for
        locating the error. 1000 milliseconds (once each second) is the current
        default. Systems which require all the bandwidth they can get may
        increase this.

        LOAD TIME::

                module/kernel parameter: edac_mc_poll_msec=[msec]

        RUN TIME::

                echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec


- ``panic_on_pci_parity`` - Panic on PCI PARITY Error


        This control file enables or disables panicking when a parity
        error has been detected.


        module/kernel parameter::

                edac_panic_on_pci_pe=[0|1]

        Enable::

                echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

        Disable::

                echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe



EDAC device type
----------------

In the header file, edac_pci.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
will appear.

There is a three level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::

        /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more ``instance``
directories.

The standard default controls are:

        ==============  =======================================================
        log_ce          boolean to log CE events
        log_ue          boolean to log UE events
        panic_on_ue     boolean to ``panic`` the system if a UE is encountered
                        (default off, can be set true via startup script)
        poll_msec       time period between POLL cycles for events
        ==============  =======================================================

The test_device_edac device adds at least one custom control of its own:

        ==============  ==================================================
        test_bits       which in the current test driver does nothing but
                        show how it is installed. A ported driver can
                        add one or more such controls and/or attributes
                        for specific uses.
                        One out-of-tree driver uses controls here to allow
                        for ERROR INJECTION operations to hardware
                        injection registers
        ==============  ==================================================

The symlink points to the 'struct dev' that is registered for this edac_device.

Instances
---------

One or more instance directories are present.
For the ``test_device_edac``
case:

        +----------------+
        | test-instance0 |
        +----------------+


In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

        ==============  ====================================
        ce_count        total of CE events of subdirectories
        ue_count        total of UE events of subdirectories
        ==============  ====================================

Blocks
------

At the lowest directory level is the ``block`` directory. There can be 0, 1
or more blocks specified in each instance:

        +-------------+
        | test-block0 |
        +-------------+

In this directory the default attributes are:

        ==============  ================================================
        ce_count        counter of CE events for this ``block``
                        of hardware being monitored
        ue_count        counter of UE events for this ``block``
                        of hardware being monitored
        ==============  ================================================


The ``test_device_edac`` device adds 4 attributes and 1 control:

        ================== ====================================================
        test-block-bits-0  for every POLL cycle this counter
                           is incremented
        test-block-bits-1  every 10 cycles, this counter is bumped once,
                           and test-block-bits-0 is set to 0
        test-block-bits-2  every 100 cycles, this counter is bumped once,
                           and test-block-bits-1 is set to 0
        test-block-bits-3  every 1000 cycles, this counter is bumped once,
                           and test-block-bits-2 is set to 0
        ================== ====================================================


        ================== ====================================================
        reset-counters     writing anything to this control will
                           reset all the above counters.
        ================== ====================================================


Use of the ``test_device_edac`` driver should enable others to create their own
unique drivers for their hardware systems.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset.
Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This chapter will cover the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``sbx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.

1) There is one Memory Controller per QuickPath Interconnect
   (QPI). At the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about csrows.
   As the EDAC API treats the csrow as the smallest unit, the driver
   sequentially maps channel/DIMM into different csrows.

   For example, supposing the following layout::

        Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
          dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0

   That is, the driver exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The MC has the ability to inject errors to test drivers. The drivers
   implement this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes, under
   ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
     Controls the error injection mask register.
     It is possible to specify
     several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

     Each of the above values can be set to "any" to match any valid value.

     At driver init, all values are set to any.

     For example, to generate an error at rank 1 of dimm 2, for any channel,
     any bank, any page, any column::

                echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

     To return to the default behaviour of matching any, you can do::

                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
                echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
     specifies which bits will be affected;

   - ``inject_section``:
     specifies what ECC cache section will get the error::

                3 for both
                2 for the highest
                1 for the lowest

   - ``inject_type``:
     specifies the type of error, being a combination of the following bits::

                bit 0 - repeat
                bit 1 - ecc
                bit 2 - parity

   - ``inject_enable``:
     starts the error generation when a non-zero value is written.

   All inject vars can be read. Root permission is needed to write them.

   The datasheet states that the error will only be generated after a write to
   an address that matches inject_addrmatch. It seems, however, that reading
   will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

        echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
        echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
        echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
        echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
        echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
        dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like::

      EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Corrected Error memory register counters

   Those newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMM. As the chipset
   offers some counters that also work with UDIMMs (but with a worse level of
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

      $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
      0
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
      0
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
      0

   What happens here is that errors on different csrows, but at the same
   dimm number will increment the same counter.
   So, in this memory mapping::

      csrow0: channel 0, dimm0
      csrow1: channel 0, dimm1
      csrow2: channel 1, dimm0
      csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm1 for an error at the second dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm2 for an error at the third dimm at either
   csrow0, csrow2 or csrow3;

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title:  BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
      Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title:  BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
      Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title:  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
      Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
      Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007	Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009	Nehalem interface
  - 26 Oct 2016	Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh