.. SPDX-License-Identifier: GPL-2.0

===============================================================
Configurable sysfs parameters for the x86-64 machine check code
===============================================================

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
(often with panic), corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log then mcelog should run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

bankNctl
        (N = bank number)

        64-bit hex bitmask enabling/disabling specific subevents for bank N.
        When a bit in the bitmask is zero then the respective
        subevent will not be reported.
        By default all events are enabled.
        Note that the BIOS maintains another mask to disable specific events
        per bank. This is not visible here.

The following entries appear for each CPU, but they are truly shared
between all CPUs.

check_interval
        How often to poll for corrected machine check errors, in seconds
        (Note output is hexadecimal). Default 5 minutes. When the poller
        finds MCEs it triggers an exponential speedup (poll more often) on
        the polling interval. When the poller stops finding MCEs, it
        triggers an exponential backoff (poll less often) on the polling
        interval. The check_interval variable is both the initial and
        maximum polling interval. 0 means no polling for corrected machine
        check errors (but some corrected errors might still be reported
        in other ways).

tolerant
        Tolerance level. When a machine check exception occurs for an
        uncorrected machine check the kernel can take different actions.
        Since machine check exceptions can happen at any time it is sometimes
        risky for the kernel to kill a process because it defies
        normal kernel locking rules. The tolerance level configures
        how hard the kernel tries to recover even at some risk of
        deadlock. Higher tolerant values trade potentially better uptime
        against the risk of a crash or even corruption (for tolerant >= 3).

        0: always panic on uncorrected errors, log corrected errors
        1: panic or SIGBUS on uncorrected errors, log corrected errors
        2: SIGBUS or log uncorrected errors, log corrected errors
        3: never panic or SIGBUS, log all errors (for testing only)

        Default: 1

        Note this only makes a difference if the CPU allows recovery
        from a machine check exception. Current x86 CPUs generally do not.

trigger
        Program to run when a machine check event is detected.
        This is an alternative to running mcelog regularly from cron
        and allows events to be detected faster.

monarch_timeout
        How long to wait for the other CPUs to machine check too on an
        exception. 0 to disable waiting for other CPUs.
        Unit: us

TBD document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture
see the Intel and AMD architecture manuals from their developer websites.

For more details about the architecture
see http://one.firstfloor.org/~andi/mce.pdf
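
As a rough illustration of the bankNctl and check_interval semantics
described above, the following Python sketch interprets values that a
caller has already read from the sysfs entries. It does not access sysfs
itself. All function names are hypothetical, and the poller model assumes
a factor of 2 for speedup and backoff and a 1 second lower bound; this
document only states that the change is exponential and that
check_interval is both the initial and the maximum interval.

```python
"""Illustrative helpers for the machinecheck sysfs values.

Nothing here touches sysfs; the functions only interpret strings or
integers supplied by the caller. All names are hypothetical.
"""


def parse_hex(text):
    """Parse a hexadecimal value such as '12c' or '0x12c'."""
    return int(text.strip(), 16)


def subevent_enabled(bank_ctl, bit):
    """Return True if subevent `bit` is set in a 64-bit bankNctl mask.

    A zero bit means the respective subevent is not reported.
    """
    if not 0 <= bit < 64:
        raise ValueError("bankNctl is a 64-bit mask; bit must be 0..63")
    return bool((bank_ctl >> bit) & 1)


def next_poll_interval(current, check_interval, found_mce, floor=1):
    """Model one step of the adaptive corrected-error poller.

    Assumption: the interval is halved when MCEs were found and doubled
    otherwise, never dropping below `floor` seconds and never exceeding
    check_interval. The factor and the floor are illustrative only.
    """
    if check_interval == 0:
        return 0  # 0 means no polling for corrected errors
    if found_mce:
        return max(floor, current // 2)
    return min(check_interval, current * 2)


if __name__ == "__main__":
    interval = parse_hex("12c\n")         # 0x12c seconds = 5 minutes
    mask = parse_hex("ffffffffffffffff")  # all subevents enabled
    print(interval, subevent_enabled(mask, 63))

    current = interval
    for found in (True, True, False, False, False):
        current = next_poll_interval(current, interval, found)
        print(current)
```

Under these assumptions the modelled interval shrinks while errors keep
arriving and grows back toward check_interval once they stop, without
ever exceeding it.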