.. include:: <isonum.txt>

============================================
Reliability, Availability and Serviceability
============================================

RAS concepts
************

Reliability, Availability and Serviceability (RAS) is a concept used on
servers meant to measure their robustness.

Reliability
  is the probability that a system will produce correct outputs.

  * Generally measured as Mean Time Between Failures (MTBF)
  * Enhanced by features that help to avoid, detect and repair hardware faults

Availability
  is the probability that a system is operational at a given time.

  * Generally measured as a percentage of downtime per period of time
  * Often uses mechanisms to detect and correct hardware faults at
    runtime

Serviceability (or maintainability)
  is the simplicity and speed with which a system can be repaired or
  maintained.

  * Generally measured on Mean Time Between Repair (MTBR)

Improving RAS
-------------

In order to reduce system downtime, a system should be capable of detecting
hardware errors and, when possible, correcting them at runtime.
It should
also provide mechanisms to detect hardware degradation, in order to warn
the system administrator to replace a component before
it causes data loss or system downtime.

Among the monitoring measures, the most usual ones include:

* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
* Memory – add error correction logic (ECC) to detect and correct errors;
* I/O – add CRC checksums for transferred data;
* Storage – RAID, journal file systems, checksums,
  Self-Monitoring, Analysis and Reporting Technology (SMART).

By monitoring the number of occurrences of error detections, it is possible
to identify if the probability of hardware errors is increasing and, in such
a case, perform preventive maintenance to replace a degraded component while
those errors are still correctable.

Types of errors
---------------

Most mechanisms used on modern systems use technologies like Hamming
Codes that allow error correction when the number of errors on a bit packet
is below a threshold. If the number of errors is above that threshold, those
mechanisms can indicate with a high degree of confidence that an error
happened, but they cannot correct it.

Also, sometimes an error occurs on a component that is not in use.
For
example, a part of the memory that is not currently allocated.

That defines some categories of errors:

* **Correctable Error (CE)** - the error detection mechanism detected and
  corrected the error. Such errors are usually not fatal, although some
  Kernel mechanisms allow the system administrator to consider them as fatal.

* **Uncorrected Error (UE)** - the number of errors exceeded the error
  correction threshold, and the system was unable to auto-correct.

* **Fatal Error** - when a UE happens on a critical component of the
  system (for example, a piece of the Kernel got corrupted by a UE), the
  only reliable way to avoid data corruption is to hang or reboot the machine.

* **Non-fatal Error** - when a UE happens on an unused component,
  like a CPU in power down state or an unused memory bank, the system may
  still run, possibly replacing the affected hardware with a hot spare,
  if available.

  Also, when an error happens on a userspace process, it is possible to
  kill such process and let userspace restart it.

The mechanism for handling non-fatal errors is usually complex and may
require the help of some userspace application, in order to apply the
policy desired by the system administrator.
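
The difference between a correctable and an uncorrected error can be
illustrated with a toy code. The sketch below is not what any particular
memory controller implements; it is a minimal extended Hamming code
(4 data bits, 4 check bits) chosen only because it is small enough to read.
The function names ``encode`` and ``decode`` and the status strings are
made up for this illustration. One flipped bit is corrected (CE); two
flipped bits are detected but cannot be corrected (UE).

```python
"""Toy SECDED (single-error-correct, double-error-detect) code.

Illustrative only: real memory controllers use wider codes, for example
64 data bits protected by 8 check bits.
"""


def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 holds the overall parity bit
    # Positions 3, 5, 6 and 7 carry data; positions 1, 2 and 4 carry parity.
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status), where status is "OK", "CE" or "UE"."""
    code = list(code)
    # XOR of the positions of all set bits: 0 for a valid codeword,
    # equal to the flipped position when exactly one bit in 1..7 flipped.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "OK"
    elif not overall_ok:
        # Odd number of flips: treat as a single error and correct it.
        # A zero syndrome here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Even number of flips with a non-zero syndrome: detected only.
        return None, "UE"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

With this sketch, ``decode(encode(0b1011))`` reports ``"OK"``, flipping any
single bit of the codeword yields the original value with status ``"CE"``,
and flipping any two bits yields ``(None, "UE")``.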

Identifying a bad hardware component
------------------------------------

Just detecting a hardware flaw is usually not enough, as the system needs
to pinpoint the minimal replaceable unit (MRU) that should be exchanged
to make the hardware reliable again.

So, it requires not only error logging facilities, but also mechanisms that
will translate the error message to the silkscreen or component label for
the MRU.

Typically, it is very complex for memory, as modern CPUs interlace memory
from different memory modules, in order to provide better performance. The
DMI BIOS usually has a list of memory module labels, which can be obtained
using the ``dmidecode`` tool. For example, on a desktop machine, it shows::

    Memory Device
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: SODIMM
        Set: None
        Locator: ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2133 MHz
        Rank: 2
        Configured Clock Speed: 2133 MHz

In the above example, a DDR4 SO-DIMM memory module is located at the
system's memory labeled as "BANK 0", as given by the *bank locator* field.
Please notice that, on such a system, the *total width* is equal to the
*data width*. It means that such memory module doesn't have error
detection/correction mechanisms.

Unfortunately, not all systems use the same field to specify the memory
bank. In this example, from an older server, ``dmidecode`` shows::

    Memory Device
        Array Handle: 0x1000
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: DIMM
        Set: 1
        Locator: DIMM_A1
        Bank Locator: Not Specified
        Type: DDR3
        Type Detail: Synchronous Registered (Buffered)
        Speed: 1600 MHz
        Rank: 2
        Configured Clock Speed: 1600 MHz

There, the DDR3 RDIMM memory module is located at the system's memory labeled
as "DIMM_A1", as given by the *locator* field. Please notice that this
memory module has 64 bits of *data width* and 72 bits of *total width*. So,
it has 8 extra bits to be used by error detection and correction mechanisms.
This kind of memory is called Error-correcting code memory (ECC memory).
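
The width comparison described above can be automated. The following is a
minimal sketch, assuming the ``dmidecode`` output has already been captured
as text and that records are separated by blank lines; the function name
``ecc_bits_per_device`` and the ``SAMPLE`` string (a shortened copy of the
two listings above) are made up for this illustration.

```python
import re

# Shortened copy of the two dmidecode listings shown in this section.
SAMPLE = """\
Memory Device
    Total Width: 64 bits
    Data Width: 64 bits
    Locator: ChannelA-DIMM0
    Bank Locator: BANK 0

Memory Device
    Total Width: 72 bits
    Data Width: 64 bits
    Locator: DIMM_A1
    Bank Locator: Not Specified
"""


def ecc_bits_per_device(dmidecode_text):
    """Return a list of (locator, extra_bits), one per Memory Device record.

    extra_bits is Total Width minus Data Width; 0 means no ECC bits.
    """
    results = []
    for block in re.split(r"\n\s*\n", dmidecode_text):
        if "Memory Device" not in block:
            continue
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        try:
            total = int(fields["Total Width"].split()[0])
            data = int(fields["Data Width"].split()[0])
        except (KeyError, ValueError, IndexError):
            continue  # widths missing or not numeric (e.g. an empty slot)
        results.append((fields.get("Locator", "?"), total - data))
    return results
```

Calling ``ecc_bits_per_device(SAMPLE)`` reports 0 extra bits for
``ChannelA-DIMM0`` and 8 extra bits for ``DIMM_A1``, matching the reading of
the two listings given in the text.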

To make things even worse, it is not uncommon for systems with different
labels on their boards to use exactly the same BIOS, meaning that
the labels provided by the BIOS won't match the real ones.

ECC memory
----------

As mentioned in the previous section, ECC memory has extra bits to be
used for error correction. In the above example, a memory module has
64 bits of *data width*, and 72 bits of *total width*. The extra 8
bits which are used for the error detection and correction mechanisms
are referred to as the *syndrome*\ [#f1]_\ [#f2]_.

So, when the CPU requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

On read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened.
If not, an
Uncorrected Error (UE) happened.

The information about the CE/UE errors is stored on some special registers
at the memory controller and can be accessed by reading such registers,
either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
bit CPUs, such errors can also be retrieved via the Machine Check
Architecture (MCA)\ [#f3]_.

.. [#f1] Please notice that several memory controllers allow operation in a
  mode called "Lock-Step", where it groups two memory modules together,
  doing 128-bit reads/writes. That gives 16 bits for error correction, which
  significantly improves the error correction mechanism, at the expense
  that, when an error happens, there's no way to know which memory module is
  to blame. So, it has to blame both memory modules.

.. [#f2] Some memory controllers also allow using memory in mirror mode.
  In such mode, the same data is written to two memory modules. On read,
  the system checks both memory modules, in order to check if both provide
  identical data. In such configuration, when an error happens, there's no
  way to know which memory module is to blame. So, it has to blame both
  memory modules (or 4 memory modules, if the system is also in Lock-step
  mode).

.. [#f3] For more details about the Machine Check Architecture (MCA),
  please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.

EDAC - Error Detection And Correction
*************************************

.. note::

   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.

   When the subsystem was pushed upstream for the first time, on
   Kernel 2.6.16, it was renamed to ``EDAC``.

Purpose
-------

The ``edac`` kernel module's goal is to detect and report hardware errors
that occur within the computer system running under Linux.

Memory
------

Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
primary errors being harvested. These types of errors are harvested by
the ``edac_mc`` device.

Detecting CE events, then harvesting those events and reporting them,
**can**, but need not necessarily, be a predictor of future UE events. With
CE events only, the system can and will continue to operate as no data
has been damaged yet.

However, preventive maintenance and proactive part replacement of memory
modules exhibiting CEs can reduce the likelihood of the dreaded UE events
and system panics.

Other hardware elements
-----------------------

A new feature for EDAC, the ``edac_device`` class of device, was added in
the 2.6.23 version of the kernel.

This new device type allows for non-memory type of ECC hardware detectors
to have their states harvested and presented to userspace via the sysfs
interface.

Some architectures have ECC detectors for L1, L2 and L3 caches,
along with DMA engines, fabric switches, main data path switches,
interconnections, and various other hardware data paths. If the hardware
reports it, then an ``edac_device`` device probably can be constructed to
harvest and present that to userspace.


PCI bus scanning
----------------

In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
in order to determine if errors are occurring during data transfers.

The presence of PCI Parity errors must be examined with a grain of salt.
There are several add-in adapters that do **not** follow the PCI specification
with regard to Parity generation and reporting.
The specification says
the vendor should tie the parity status bits to 0 if they do not intend
to generate parity. Some vendors do not do this, and thus the parity bit
can "float", giving false positives.

There is a PCI device attribute located in sysfs that is checked by
the EDAC PCI scanning code. If that attribute is set, PCI parity/error
scanning is skipped for that device. The attribute is::

    broken_parity_status

and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for
PCI devices.


Versioning
----------

EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
Controller (MC) driver modules. On a given system, the CORE is loaded
and one MC driver will be loaded. Both the CORE and the MC driver (or
``edac_device`` driver) have individual versions that reflect current
release level of their respective modules.

Thus, to "report" on what version a system is running, one must report
both the CORE's and the MC driver's versions.


Loading
-------

If ``edac`` was statically linked with the kernel then no loading
is necessary.
If ``edac`` was built as modules then simply modprobe
the ``edac`` pieces that you need. You should be able to modprobe
hardware-specific modules and have the dependencies load the necessary
core modules.

Example::

    $ modprobe amd76x_edac

loads both the ``amd76x_edac.ko`` memory controller module and the
``edac_mc.ko`` core module.


Sysfs interface
---------------

EDAC presents a ``sysfs`` interface for control and reporting purposes. It
lives in the ``/sys/devices/system/edac`` directory.

Within this directory there currently reside 2 components:

    ======= ==============================
    mc      memory controller(s) system
    pci     PCI control and status system
    ======= ==============================



Memory Controller (mc) Model
----------------------------

Each ``mc`` device controls a set of memory modules [#f4]_. These modules
are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
There can be multiple csrows and multiple channels.

.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
  used to refer to a memory module, although there are other memory
  packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
  specification (Version 2.7) defines a memory module in the Common
  Platform Error Record (CPER) section to be an SMBIOS Memory Device
  (Type 17). Throughout this document, and inside the EDAC subsystem, the
  term "dimm" is used for all memory modules, even when they use a
  different kind of packaging.

Memory controllers allow for several csrows, with 8 csrows being a
typical value. Yet, the actual number of csrows depends on the layout of
a given motherboard, memory controller and memory module characteristics.

Dual channels allow for dual data length (e.g. 128 bits, on 64 bit systems)
data transfers to/from the CPU from/to memory. Some newer chipsets allow
for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
controllers.
The following example will assume 2 channels:

    +------------+-----------------------+
    | CS Rows    |       Channels        |
    +------------+-----------+-----------+
    |            |  ``ch0``  |  ``ch1``  |
    +============+===========+===========+
    |            |**DIMM_A0**|**DIMM_B0**|
    +------------+-----------+-----------+
    | ``csrow0`` |   rank0   |   rank0   |
    +------------+-----------+-----------+
    | ``csrow1`` |   rank1   |   rank1   |
    +------------+-----------+-----------+
    |            |**DIMM_A1**|**DIMM_B1**|
    +------------+-----------+-----------+
    | ``csrow2`` |   rank0   |   rank0   |
    +------------+-----------+-----------+
    | ``csrow3`` |   rank1   |   rank1   |
    +------------+-----------+-----------+

In the above example, there are 4 physical slots on the motherboard
for memory DIMMs:

    +---------+---------+
    | DIMM_A0 | DIMM_B0 |
    +---------+---------+
    | DIMM_A1 | DIMM_B1 |
    +---------+---------+

Labels for these slots are usually silk-screened on the motherboard.
Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
channel 1. Notice that there are two csrows possible on a physical DIMM.
These csrows are allocated their csrow assignment based on the slot into
which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
Channel, the csrows cross both DIMMs.

Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
In the example above, 2 dual-ranked DIMMs are similarly placed. Thus,
both csrow0 and csrow1 are populated. On the other hand, when 2 single-ranked
DIMMs are placed in slots DIMM_A0 and DIMM_B0, then they will
have just one csrow (csrow0) and csrow1 will be empty. The pattern
repeats itself for csrow2 and csrow3. Also note that some memory
controllers don't have any logic to identify the memory module, see
``rankX`` directories below.

The representation of the above is reflected in the directory
tree in EDAC's sysfs interface. Starting in directory
``/sys/devices/system/edac/mc``, each memory controller will be
represented by its own ``mcX`` directory, where ``X`` is the
index of the MC::

    ..../edac/mc/
           |
           |->mc0
           |->mc1
           |->mc2
           ....
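
The directory layout above can be walked from userspace. The following is
a minimal sketch, not part of any EDAC tool; the function name
``list_memory_controllers`` and the ``root`` parameter (added so the code
can be pointed at a copy of the tree) are made up for this illustration.

```python
import re
from pathlib import Path

# Location given in this document for the memory controller tree.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def list_memory_controllers(root=EDAC_MC_ROOT):
    """Return the mcX directories found under root, ordered by index X."""
    root = Path(root)
    if not root.is_dir():
        # Directory absent: presumably no EDAC driver is loaded.
        return []
    found = []
    for entry in root.iterdir():
        match = re.fullmatch(r"mc(\d+)", entry.name)
        if match and entry.is_dir():
            found.append((int(match.group(1)), entry))
    # Sort numerically so that mc10 comes after mc2.
    return [path for _, path in sorted(found)]


if __name__ == "__main__":
    for mc in list_memory_controllers():
        print(mc)
```

Sorting on the numeric index rather than on the name keeps the order stable
on systems with ten or more memory controllers.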

Under each ``mcX`` directory each ``csrowX`` is again represented by a
``csrowX``, where ``X`` is the csrow index::

    .../mc/mc0/
            |
            |->csrow0
            |->csrow2
            |->csrow3
            ....

Notice that there is no csrow1, which indicates that csrow0 is composed
of single-ranked DIMMs. This should also apply in both Channels, in
order to have dual-channel mode be operational. Since both csrow2 and
csrow3 are populated, this indicates a dual-ranked set of DIMMs for
channels 0 and 1.

Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
control and attribute files.

``mcX`` directories
-------------------

In ``mcX`` directories are EDAC control and attribute files for
this ``X`` instance of the memory controllers.

For a description of the sysfs API, please see:

    Documentation/ABI/testing/sysfs-devices-edac


``dimmX`` or ``rankX`` directories
----------------------------------

The recommended way to use the EDAC subsystem is to look at the information
provided by the ``dimmX`` or ``rankX`` directories [#f5]_.
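
As an illustration of that recommendation, the sketch below collects the
per-module error counters of one ``mcX`` directory. It is a sketch under
stated assumptions, not an existing tool: the helper names ``read_attr``,
``to_int`` and ``dimm_error_counts`` are made up, and it assumes that
``rankX`` directories expose the same file names as ``dimmX`` ones. The
attribute files it reads (``dimm_label``, ``dimm_ce_count`` and
``dimm_ue_count``) are the ones described later in this section.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped content of a sysfs attribute, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def to_int(text):
    """Convert a counter string to int, or None when missing or malformed."""
    return int(text) if text is not None and text.isdigit() else None


def dimm_error_counts(mc_dir):
    """Yield (name, label, ce_count, ue_count) for each dimmX/rankX directory."""
    mc_dir = Path(mc_dir)
    # Assumption: rankX directories carry the same attribute names as dimmX.
    entries = sorted(mc_dir.glob("dimm*")) + sorted(mc_dir.glob("rank*"))
    for entry in entries:
        if not entry.is_dir():
            continue
        yield (
            entry.name,
            read_attr(entry / "dimm_label"),
            to_int(read_attr(entry / "dimm_ce_count")),
            to_int(read_attr(entry / "dimm_ue_count")),
        )


if __name__ == "__main__":
    for row in dimm_error_counts("/sys/devices/system/edac/mc/mc0"):
        print(row)
```

A monitoring script could call ``dimm_error_counts`` periodically and alert
when a correctable error count becomes non-zero or keeps growing.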

A typical EDAC system has the following structure under
``/sys/devices/system/edac/``\ [#f6]_::

    /sys/devices/system/edac/
    ├── mc
    │   ├── mc0
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   ├── mc1
    │   │   ├── ce_count
    │   │   ├── ce_noinfo_count
    │   │   ├── dimm0
    │   │   │   ├── dimm_ce_count
    │   │   │   ├── dimm_dev_type
    │   │   │   ├── dimm_edac_mode
    │   │   │   ├── dimm_label
    │   │   │   ├── dimm_location
    │   │   │   ├── dimm_mem_type
    │   │   │   ├── dimm_ue_count
    │   │   │   ├── size
    │   │   │   └── uevent
    │   │   ├── max_location
    │   │   ├── mc_name
    │   │   ├── reset_counters
    │   │   ├── seconds_since_reset
    │   │   ├── size_mb
    │   │   ├── ue_count
    │   │   ├── ue_noinfo_count
    │   │   └── uevent
    │   └── uevent
    └── uevent

In the ``dimmX`` directories are EDAC control and attribute files for
this ``X`` memory module:

- ``size`` - Total memory managed by this memory module attribute file

  This attribute file displays, in count of megabytes, the memory
  that this memory module contains.

- ``dimm_ue_count`` - Uncorrectable Errors count attribute file

  This attribute file displays the total count of uncorrectable
  errors that have occurred on this DIMM. If panic_on_ue is set
  this counter will not have a chance to increment, since EDAC
  will panic the system.

- ``dimm_ce_count`` - Correctable Errors count attribute file

  This attribute file displays the total count of correctable
  errors that have occurred on this DIMM. This count is very
  important to examine. CEs provide early indications that a
  DIMM is beginning to fail. This count field should be
  monitored for non-zero values, and such information should be
  reported to the system administrator.

- ``dimm_dev_type`` - Device type attribute file

  This attribute file will display what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8

- ``dimm_edac_mode`` - EDAC Mode of operation attribute file

  This attribute file will display what type of Error detection
  and correction is being utilized.

- ``dimm_label`` - memory module label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM Labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.

- ``dimm_location`` - location of the memory module

  The location can have up to 3 levels, and describes how the
  memory controller identifies the location of a memory module.
  Depending on the type of memory and memory controller, it
  can be:

  - *csrow* and *channel* - used when the memory controller
    doesn't identify a single DIMM - e.g. in ``rankX`` dir;
  - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
    controllers;
  - *channel*, *slot* - used on Nehalem and newer Intel drivers.

- ``dimm_mem_type`` - Memory Type attribute file

  This attribute file displays what type of memory is currently
  on this DIMM. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR

.. [#f5] On some systems, the memory controller doesn't have any logic
  to identify the memory module. On such systems, the directory is called
  ``rankX`` and works in a similar way to the ``csrowX`` directories.
  On modern Intel memory controllers, the memory controller identifies the
  memory modules directly. On such systems, the directory is called ``dimmX``.

.. [#f6] There are also some ``power`` directories and ``subsystem``
  symlinks inside the sysfs mapping that are automatically created by
  the sysfs subsystem. Currently, they serve no purpose.

``csrowX`` directories
----------------------

When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
directories. As this API doesn't work properly for Rambus, FB-DIMMs and
modern Intel Memory Controllers, this is being deprecated in favor of
``dimmX`` directories.

In the ``csrowX`` directories are EDAC control and attribute files for
this ``X`` instance of csrow:

- ``ue_count`` - Total Uncorrectable Errors count attribute file

  This attribute file displays the total count of uncorrectable
  errors that have occurred on this csrow. If ``panic_on_ue`` is set,
  this counter will not have a chance to increment, since EDAC
  will panic the system.

- ``ce_count`` - Total Correctable Errors count attribute file

  This attribute file displays the total count of correctable
  errors that have occurred on this csrow. This count is very
  important to examine. CEs provide early indications that a
  DIMM is beginning to fail.
  This count should be monitored for non-zero values, and any
  such values should be reported to the system administrator.

- ``size_mb`` - Total memory managed by this csrow attribute file

  This attribute file displays, in megabytes, the memory
  that this csrow contains.

- ``mem_type`` - Memory Type attribute file

  This attribute file displays what type of memory is currently
  on this csrow. Normally, either buffered or unbuffered memory.
  Examples:

  - Registered-DDR
  - Unbuffered-DDR

- ``edac_mode`` - EDAC Mode of operation attribute file

  This attribute file displays what type of error detection
  and correction is being utilized.

- ``dev_type`` - Device type attribute file

  This attribute file displays what type of DRAM device is
  being utilized on this DIMM.
  Examples:

  - x1
  - x2
  - x4
  - x8

- ``ch0_ce_count`` - Channel 0 CE Count attribute file

  This attribute file displays the count of CEs on this
  DIMM located in channel 0.

- ``ch0_ue_count`` - Channel 0 UE Count attribute file

  This attribute file displays the count of UEs on this
  DIMM located in channel 0.

- ``ch0_dimm_label`` - Channel 0 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.

- ``ch1_ce_count`` - Channel 1 CE Count attribute file

  This attribute file displays the count of CEs on this
  DIMM located in channel 1.

- ``ch1_ue_count`` - Channel 1 UE Count attribute file

  This attribute file displays the count of UEs on this
  DIMM located in channel 1.

- ``ch1_dimm_label`` - Channel 1 DIMM Label control file

  This control file allows this DIMM to have a label assigned
  to it. With this label in the module, when errors occur
  the output can provide the DIMM label in the system log.
  This becomes vital for panic events to isolate the
  cause of the UE event.

  DIMM labels must be assigned after booting, with information
  that correctly identifies the physical slot with its
  silk screen label. This information is currently very
  motherboard specific and determination of this information
  must occur in userland at this time.
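
The advice above to watch the CE and UE counters for non-zero values can be
scripted. The following is a minimal sketch and not part of EDAC itself: the
helper name ``edac_nonzero_counts`` is hypothetical, and it assumes the
counters live under ``/sys/devices/system/edac/mc/mcN/`` in ``csrowX`` or
``dimmX`` directories with the file names listed in this document. The root
directory is a parameter so the function can be tried against a copy of the
tree.

```sh
#!/bin/sh
# Print every csrow/dimm CE or UE counter below an EDAC "mc" directory
# whose value is non-zero. edac_nonzero_counts is a hypothetical helper
# name; the default path is an assumption based on the sysfs tree above.
edac_nonzero_counts() {
    root=${1:-/sys/devices/system/edac/mc}
    for f in "$root"/mc*/csrow*/ce_count "$root"/mc*/csrow*/ue_count \
             "$root"/mc*/dimm*/dimm_ce_count "$root"/mc*/dimm*/dimm_ue_count
    do
        [ -r "$f" ] || continue      # unmatched globs stay literal; skip them
        count=$(cat "$f")
        # Non-numeric content makes the test fail quietly and is skipped.
        [ "$count" -gt 0 ] 2>/dev/null && printf '%s %s\n' "$f" "$count"
    done
    return 0
}
```

Calling ``edac_nonzero_counts`` with no argument scans the default location;
each output line is a counter path followed by its value, which a monitoring
job could forward to the system administrator.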


System Logging
--------------

If logging for UEs and CEs is enabled, then system logs will contain
information indicating that errors have been detected::

  EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac
  EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac


The structure of the message, using the first line above as the example, is:

  +---------------------------------------+-------------+
  | Content                               | Example     |
  +=======================================+=============+
  | The memory controller                 | MC0         |
  +---------------------------------------+-------------+
  | Error type                            | CE          |
  +---------------------------------------+-------------+
  | Memory page                           | 0x283       |
  +---------------------------------------+-------------+
  | Offset in the page                    | 0xce0       |
  +---------------------------------------+-------------+
  | The byte granularity                  | grain 8     |
  | or resolution of the error            |             |
  +---------------------------------------+-------------+
  | The error syndrome                    | 0x6ec3      |
  +---------------------------------------+-------------+
  | Memory row                            | row 0       |
  +---------------------------------------+-------------+
  | Memory channel                        | channel 1   |
  +---------------------------------------+-------------+
  | DIMM label, if set prior              | DIMM_B1     |
  +---------------------------------------+-------------+
  | And then an optional, driver-specific |             |
  | message that may have additional      |             |
  | information.                          |             |
  +---------------------------------------+-------------+

UEs and CEs with no info carry only the memory controller, the error
type, a notice of "no info" and then an optional, driver-specific error
message.


PCI Bus Parity Detection
------------------------

On Header Type 00 devices, the primary status is looked at for any
parity error regardless of whether parity is enabled on the device or
not. (The spec indicates parity is generated in some cases). On Header
Type 01 bridges, the secondary status register is also looked at to see
if a parity error occurred on the bus on the other side of the bridge.
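
To make the status check concrete, the sketch below decodes the
parity-related bits of a 16-bit PCI status value. It is an illustration
only: the helper name ``pci_status_parity`` is hypothetical, the bit
positions (bit 15, Detected Parity Error; bit 8, Master Data Parity Error)
are taken from the PCI Local Bus specification rather than from this
document, and which bits EDAC itself inspects is not stated here. The value
has to be read separately from configuration space (offset 0x06 for the
primary status, or 0x1E for the secondary status of a Type 01 bridge).

```sh
#!/bin/sh
# Decode the parity-related bits of a 16-bit PCI (Secondary) Status value.
# pci_status_parity is a hypothetical helper; bit positions are an
# assumption taken from the PCI Local Bus specification.
pci_status_parity() {
    status=$(( $1 ))                 # accepts decimal or 0x-prefixed hex
    result=""
    # Bit 15: Detected Parity Error
    [ $(( status & 0x8000 )) -ne 0 ] && result="detected_parity_error"
    # Bit 8: Master Data Parity Error
    [ $(( status & 0x0100 )) -ne 0 ] && result="${result:+$result }master_data_parity_error"
    printf '%s\n' "${result:-none}"
}
```

For example, ``pci_status_parity 0x8000`` reports ``detected_parity_error``,
while a value with neither bit set reports ``none``.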


Sysfs configuration
-------------------

Under ``/sys/devices/system/edac/pci`` are control and attribute files as
follows:

- ``check_pci_parity`` - Enable/Disable PCI Parity checking control file

  This control file enables or disables the PCI Bus Parity scanning
  operation. Writing a 1 to this file enables the scanning. Writing
  a 0 to this file disables the scanning.

  Enable::

    echo "1" >/sys/devices/system/edac/pci/check_pci_parity

  Disable::

    echo "0" >/sys/devices/system/edac/pci/check_pci_parity

- ``pci_parity_count`` - Parity Count

  This attribute file displays the number of parity errors that
  have been detected.


Module parameters
-----------------

- ``edac_mc_panic_on_ue`` - Panic on UE control file

  An uncorrectable error will cause a machine panic. This is usually
  desirable.
  It is a bad idea to continue when an uncorrectable error
  occurs - it is indeterminate what was uncorrected and the operating
  system context might be so mangled that continuing will lead to further
  corruption. If the kernel has MCE configured, then EDAC will never
  notice the UE.

  LOAD TIME::

    module/kernel parameter: edac_mc_panic_on_ue=[0|1]

  RUN TIME::

    echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue

- ``edac_mc_log_ue`` - Log UE control file

  Generate kernel messages describing uncorrectable errors. These errors
  are reported through the system message log. UE statistics
  will be accumulated even when UE logging is disabled.

  LOAD TIME::

    module/kernel parameter: edac_mc_log_ue=[0|1]

  RUN TIME::

    echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue

- ``edac_mc_log_ce`` - Log CE control file

  Generate kernel messages describing correctable errors. These
  errors are reported through the system message log.
  CE statistics will be accumulated even when CE logging is disabled.

  LOAD TIME::

    module/kernel parameter: edac_mc_log_ce=[0|1]

  RUN TIME::

    echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce

- ``edac_mc_poll_msec`` - Polling period control file

  The time period, in milliseconds, for polling for error information.
  Too small a value wastes resources. Too large a value might delay
  necessary handling of errors and might lose valuable information for
  locating the error. 1000 milliseconds (once each second) is the current
  default. Systems which require all the bandwidth they can get may
  increase this.

  LOAD TIME::

    module/kernel parameter: edac_mc_poll_msec=<milliseconds>

  RUN TIME::

    echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec

- ``panic_on_pci_parity`` - Panic on PCI PARITY Error

  This control file enables or disables panicking when a parity
  error has been detected.

  module/kernel parameter::

    edac_panic_on_pci_pe=[0|1]

  Enable::

    echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe

  Disable::

    echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe


EDAC device type
----------------

In the header file, edac_pci.h, there is a series of edac_device structures
and APIs for the EDAC_DEVICE.

User space access to an edac_device is through the sysfs interface.

At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
will appear.

There is a three level tree beneath the above ``edac`` directory. For example,
the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net
website) installs itself as::

    /sys/devices/system/edac/test-instance

In this directory are various controls, a symlink and one or more ``instance``
directories.
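
The three-level layout (device, instance, block) can be inspected with a
short directory walk. This is only a sketch under stated assumptions: the
helper name ``edac_device_tree`` is hypothetical, symlinks are skipped so
that the registered-device symlink is not followed, and on a real system the
output will also include directories that are not edac_device entries (for
example the automatically created ``power`` directories).

```sh
#!/bin/sh
# List directories three levels deep below an EDAC sysfs root, indenting
# each level. edac_device_tree is a hypothetical helper name.
edac_device_tree() {
    root=${1:-/sys/devices/system/edac}
    for dev in "$root"/*/; do
        dev=${dev%/}
        # Skip unmatched globs and symlinks to directories.
        [ -d "$dev" ] && [ ! -L "$dev" ] || continue
        printf '%s\n' "${dev##*/}"
        for inst in "$dev"/*/; do
            inst=${inst%/}
            [ -d "$inst" ] && [ ! -L "$inst" ] || continue
            printf '  %s\n' "${inst##*/}"
            for blk in "$inst"/*/; do
                blk=${blk%/}
                [ -d "$blk" ] && [ ! -L "$blk" ] || continue
                printf '    %s\n' "${blk##*/}"
            done
        done
    done
    return 0
}
```

Passing a different root as the first argument lets the function be tried
on a copy of the tree instead of the live sysfs.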

The standard default controls are:

    ==============  =======================================================
    log_ce          boolean to log CE events
    log_ue          boolean to log UE events
    panic_on_ue     boolean to ``panic`` the system if an UE is encountered
                    (default off, can be set true via startup script)
    poll_msec       time period between POLL cycles for events
    ==============  =======================================================

The test_device_edac device adds at least one custom control of its own:

    ==============  ==================================================
    test_bits       which in the current test driver does nothing but
                    show how it is installed. A ported driver can
                    add one or more such controls and/or attributes
                    for specific uses.
                    One out-of-tree driver uses controls here to allow
                    for ERROR INJECTION operations to hardware
                    injection registers
    ==============  ==================================================

The symlink points to the 'struct dev' that is registered for this edac_device.

Instances
---------

One or more instance directories are present.
For the ``test_device_edac``
case:

    +----------------+
    | test-instance0 |
    +----------------+


In this directory there are two default counter attributes, which are totals of
the counters in deeper subdirectories.

    ==============  ====================================
    ce_count        total of CE events of subdirectories
    ue_count        total of UE events of subdirectories
    ==============  ====================================

Blocks
------

At the lowest directory level is the ``block`` directory. There can be 0, 1
or more blocks specified in each instance:

    +-------------+
    | test-block0 |
    +-------------+

In this directory the default attributes are:

    ==============  ================================================
    ce_count        which is counter of CE events for this ``block``
                    of hardware being monitored
    ue_count        which is counter of UE events for this ``block``
                    of hardware being monitored
    ==============  ================================================


The ``test_device_edac`` device adds 4 attributes and 1 control:

    ==================  ====================================================
    test-block-bits-0   for every POLL cycle this counter
                        is incremented
    test-block-bits-1   every 10 cycles, this counter is bumped once,
                        and test-block-bits-0 is set to 0
    test-block-bits-2   every 100 cycles, this counter is bumped once,
                        and test-block-bits-1 is set to 0
    test-block-bits-3   every 1000 cycles, this counter is bumped once,
                        and test-block-bits-2 is set to 0
    ==================  ====================================================


    ==================  ====================================================
    reset-counters      writing ANY thing to this control will
                        reset all the above counters.
    ==================  ====================================================


Studying the ``test_device_edac`` driver should enable others to create their
own unique drivers for their hardware systems.

The ``test_device_edac`` sample driver is located at the
http://bluesmoke.sourceforge.net project site for EDAC.


Usage of EDAC APIs on Nehalem and newer Intel CPUs
--------------------------------------------------

On older Intel architectures, the memory controller was part of the North
Bridge chipset.
Nehalem, Sandy Bridge, Ivy Bridge, Haswell, Skylake and
newer Intel architectures integrated an enhanced version of the memory
controller (MC) inside the CPUs.

This chapter covers the differences of the enhanced memory controllers
found on newer Intel CPUs, as handled by drivers such as ``i7core_edac``,
``sb_edac`` and ``sbx_edac``.

.. note::

   The Xeon E7 processor families use a separate chip for the memory
   controller, called Intel Scalable Memory Buffer. This section doesn't
   apply to such families.

1) There is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. This is
   associated with a physical CPU socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees it as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about
   csrows. As the EDAC API uses the csrow as its smallest unit, the driver
   sequentially maps channel/DIMM into different csrows.

   For example, supposing the following layout::

        Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
          dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
        Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
          dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400

   The driver will map it as::

        csrow0: channel 0, dimm0
        csrow1: channel 0, dimm1
        csrow2: channel 1, dimm0
        csrow3: channel 2, dimm0

   and exports one DIMM per csrow.

   Each QPI is exported as a different memory controller.

2) The MC has the ability to inject errors to test drivers. The drivers
   implement this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes, under
   ``/sys/devices/system/edac/mc/mc?/``:

   - ``inject_addrmatch/*``:
      Controls the error injection mask register.
      It is possible to specify
      several characteristics of the address to match an error code::

         dimm = the affected dimm. Numbers are relative to a channel;
         rank = the memory rank;
         channel = the channel that will generate an error;
         bank = the affected bank;
         page = the page address;
         column (or col) = the address column.

      Each of the above values can be set to "any" to match any valid value.

      At driver init, all values are set to any.

      For example, to generate an error at rank 1 of dimm 2, for any channel,
      any bank, any page, any column::

        echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
        echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

      To return to the default behaviour of matching any, you can do::

        echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
        echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank

   - ``inject_eccmask``:
      specifies which bits will be affected by the error.

   - ``inject_section``:
      specifies what ECC cache section will get the error::

        3 for both
        2 for the highest
        1 for the lowest

   - ``inject_type``:
      specifies the type of error, being a combination of the following bits::

        bit 0 - repeat
        bit 1 - ecc
        bit 2 - parity

   - ``inject_enable``:
      starts the error generation when a non-zero value is written.

   All inject vars can be read. Root permission is needed for write.

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2::

        echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
        echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
        echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
        echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
        echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
        dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   For socket 1, replace "mc0" with "mc1" in the above commands.

   The generated error message will look like::

      EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

3) Corrected Error memory register counters

   Those newer MCs have some registers to count memory errors. The driver
   uses those registers to report Corrected Errors on devices with Registered
   DIMMs.

   However, those counters don't work with Unregistered DIMMs. As the chipset
   offers some counters that also work with UDIMMs (but with a coarser
   granularity than the default ones), the driver exposes those registers for
   UDIMM memories.

   They can be read by looking at the contents of ``all_channel_counts/``::

      $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
      0
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
      0
      /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
      0

   What happens here is that errors on different csrows, but at the same
   dimm number will increment the same counter.
   So, in this memory mapping::

      csrow0: channel 0, dimm0
      csrow1: channel 0, dimm1
      csrow2: channel 1, dimm0
      csrow3: channel 2, dimm0

   The hardware will increment udimm0 for an error at the first dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm1 for an error at the second dimm at either
   csrow0, csrow2 or csrow3;

   The hardware will increment udimm2 for an error at the third dimm at either
   csrow0, csrow2 or csrow3;

4) Standard error counters

   The standard error counters are generated when an mcelog error is received
   by the driver. Since, with UDIMMs, this is counted by software, it is
   possible that some errors could be lost. With RDIMMs, they display the
   contents of the registers.

Reference documents used on ``amd64_edac``
------------------------------------------

The ``amd64_edac`` module is based on the following documents
(available from http://support.amd.com/en-us/search/tech-docs):

1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD
      Opteron Processors
   :AMD publication #: 26094
   :Revision: 3.26
   :Link: http://support.amd.com/TechDocs/26094.PDF

2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh
      Processors
   :AMD publication #: 32559
   :Revision: 3.00
   :Issue Date: May 2006
   :Link: http://support.amd.com/TechDocs/32559.pdf

3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h
      Processors
   :AMD publication #: 31116
   :Revision: 3.00
   :Issue Date: September 07, 2007
   :Link: http://support.amd.com/TechDocs/31116.pdf

4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 30h-3Fh Processors
   :AMD publication #: 49125
   :Revision: 3.06
   :Issue Date: 2/12/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf

5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h
      Models 60h-6Fh Processors
   :AMD publication #: 50742
   :Revision: 3.01
   :Issue Date: 7/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf

6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h
      Models 00h-0Fh Processors
   :AMD publication #: 48751
   :Revision: 3.03
   :Issue Date: 2/23/2015 (latest release)
   :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf

Credits
=======

* Written by Doug Thompson <dougthompson@xmission.com>

  - 7 Dec 2005
  - 17 Jul 2007 Updated

* |copy| Mauro Carvalho Chehab

  - 05 Aug 2009 Nehalem interface
  - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section

* EDAC authors/maintainers:

  - Doug Thompson, Dave Jiang, Dave Peterson et al,
  - Mauro Carvalho Chehab
  - Borislav Petkov
  - original author: Thayne Harbaugh