.. SPDX-License-Identifier: GPL-2.0

===============================================================
Configurable sysfs parameters for the x86-64 machine check code
===============================================================

Machine checks report internal hardware error conditions detected
by the CPU. Uncorrected errors typically cause a machine check
(often with a panic); corrected ones cause a machine check log entry.

Machine checks are organized in banks (normally associated with
a hardware subsystem) and subevents in a bank. The exact meaning
of the banks and subevents is CPU specific.

mcelog knows how to decode them.

When you see the "Machine check errors logged" message in the system
log, mcelog should be run to collect and decode machine check entries
from /dev/mcelog. Normally mcelog should be run regularly from a cron job.

Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
(N = CPU number).

The directory contains some configurable entries:

bankNctl
	(N = bank number)

	64-bit hex bitmask enabling/disabling specific subevents for bank N.
	When a bit in the bitmask is zero, the respective
	subevent will not be reported.
	By default all events are enabled.
	Note that the BIOS maintains another mask to disable specific events
	per bank. This mask is not visible here.
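
The bitmask handling described for bankNctl can be sketched in a few lines
of Python. This is a minimal illustration, not a supported tool: the helper
names (``bank_path``, ``parse_mask``, ``subevent_enabled``,
``disable_subevent``, ``format_mask``) are hypothetical, and the assumption
that a value written back should carry a ``0x`` prefix should be checked
against the running kernel before use.

```python
from pathlib import Path

MASK64 = (1 << 64) - 1  # bankNctl holds a 64-bit mask


def bank_path(cpu: int, bank: int) -> Path:
    """Build the sysfs path of bankNctl for one CPU (layout as documented above)."""
    return Path(f"/sys/devices/system/machinecheck/machinecheck{cpu}/bank{bank}ctl")


def parse_mask(text: str) -> int:
    """Parse the hex bitmask text read from a bankNctl file."""
    return int(text.strip(), 16) & MASK64


def subevent_enabled(mask: int, bit: int) -> bool:
    """A set bit means the subevent is reported; a zero bit means it is not."""
    return bool((mask >> bit) & 1)


def disable_subevent(mask: int, bit: int) -> int:
    """Return a copy of the mask with one subevent bit cleared."""
    return mask & ~(1 << bit) & MASK64


def format_mask(mask: int) -> str:
    """Format a mask as hex with a 0x prefix (assumed write format)."""
    return f"{mask:#x}"


if __name__ == "__main__":
    path = bank_path(cpu=0, bank=4)
    if path.exists():
        mask = parse_mask(path.read_text())
        print(path, format_mask(mask), "bit 0 enabled:", subevent_enabled(mask, 0))
    else:
        print(f"{path} not present on this system")
```

The sketch only reads the mask. Writing a modified value back requires root
privileges and changes which hardware errors are reported, so it is left out
on purpose.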

The following entries appear for each CPU, but they are actually shared
between all CPUs.

check_interval
	How often to poll for corrected machine check errors, in seconds
	(note that the output is hexadecimal). Default: 5 minutes. When the
	poller finds MCEs, it triggers an exponential speedup (poll more
	often) on the polling interval. When the poller stops finding MCEs,
	it triggers an exponential backoff (poll less often) on the polling
	interval. The check_interval variable is both the initial and
	maximum polling interval. 0 means no polling for corrected machine
	check errors (but some corrected errors might still be reported
	in other ways).
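
The speedup and backoff behaviour of check_interval can be modelled roughly
as follows. This is a sketch under stated assumptions rather than the
kernel's implementation: the factor of two in each direction and the
``min_interval`` lower bound are illustrative choices, and the function
names are hypothetical. The text above only guarantees that the adjustment
is exponential and that check_interval is both the initial and the maximum
interval.

```python
def parse_check_interval(text: str) -> int:
    """Parse check_interval as read from sysfs: seconds, printed in hexadecimal."""
    return int(text.strip(), 16)


def next_interval(current: float, found_mce: bool, check_interval: float,
                  min_interval: float = 0.01) -> float:
    """Return the next polling interval in seconds.

    Assumption: halve the interval when MCEs were found, double it otherwise.
    The result never exceeds check_interval and never drops below
    min_interval (a hypothetical floor).
    """
    if found_mce:
        return max(current / 2, min_interval)
    return min(current * 2, check_interval)


if __name__ == "__main__":
    limit = parse_check_interval("12c\n")  # 0x12c = 300 s, the 5 minute default
    interval = float(limit)                # polling starts at check_interval
    for found in (True, True, False, False, False):
        interval = next_interval(interval, found, limit)
        print(f"found_mce={found!s:5} next poll in {interval:g} s")
```

Starting from the default, two polls that find errors shorten the interval
to 150 s and then 75 s. Quiet polls then double it back up until it is
capped at 300 s again.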

tolerant
	Tolerance level. When a machine check exception occurs for an
	uncorrected machine check, the kernel can take different actions.
	Since machine check exceptions can happen at any time, it is sometimes
	risky for the kernel to kill a process because doing so defies
	normal kernel locking rules. The tolerance level configures
	how hard the kernel tries to recover even at some risk of
	deadlock. Higher tolerant values trade potentially better uptime
	for the risk of a crash or even corruption (for tolerant >= 3).

	0: always panic on uncorrected errors, log corrected errors
	1: panic or SIGBUS on uncorrected errors, log corrected errors
	2: SIGBUS or log uncorrected errors, log corrected errors
	3: never panic or SIGBUS, log all errors (for testing only)

	Default: 1

	Note that this only makes a difference if the CPU allows recovery
	from a machine check exception. Current x86 CPUs generally do not.
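
For scripts that report the current setting, the tolerant levels listed
above can be kept in a small lookup table. The table below only restates
the list in this document. The names ``TOLERANT_LEVELS`` and
``describe_tolerant`` are hypothetical, and the sketch does not read or
write any sysfs file.

```python
# Tolerance level -> (action on uncorrected errors, action on corrected errors),
# transcribed from the list in the tolerant entry above.
TOLERANT_LEVELS = {
    0: ("always panic", "log"),
    1: ("panic or SIGBUS", "log"),
    2: ("SIGBUS or log", "log"),
    3: ("never panic or SIGBUS, log only (for testing only)", "log"),
}

DEFAULT_TOLERANT = 1


def describe_tolerant(level: int) -> str:
    """Return a one-line description of a tolerant level, or raise ValueError."""
    if level not in TOLERANT_LEVELS:
        raise ValueError(f"unknown tolerant level: {level}")
    uncorrected, corrected = TOLERANT_LEVELS[level]
    return f"tolerant={level}: uncorrected -> {uncorrected}; corrected -> {corrected}"


if __name__ == "__main__":
    for level in sorted(TOLERANT_LEVELS):
        marker = " (default)" if level == DEFAULT_TOLERANT else ""
        print(describe_tolerant(level) + marker)
```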

trigger
	Program to run when a machine check event is detected.
	This is an alternative to running mcelog regularly from cron
	and allows events to be detected faster.

monarch_timeout
	How long to wait for the other CPUs to enter the machine check
	handler as well when an exception occurs. 0 disables waiting for
	other CPUs.
	Unit: us (microseconds)

TBD: document entries for AMD threshold interrupt configuration

For more details about the x86 machine check architecture,
see the Intel and AMD architecture manuals on their developer websites.

For more details about the architecture,
see http://one.firstfloor.org/~andi/mce.pdf