.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism is targeted at real-time alerting, in
order to know when something bad happens to a PCI device.

  * Provide alert debug information.
  * Self healing.
  * If a problem needs vendor support, provide a way to gather all needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx error)
or unknown (driver specific).
For each registered health reporter a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out Of Box initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    auto-dump is set and no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery

Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure ``devlink_fmsg`` and sends it to the driver's callback
to fill the data in using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink, in
JSON-like format.
The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

A driver should use this API to fill the fmsg context in a format which
devlink later translates into the netlink message. When devlink needs to send
the data to the netlink layer in SKBs, it fragments the data across
different SKBs. To make this fragmentation possible, it uses virtual nest
attributes instead of actual netlink nesting, which cannot be divided between
different SKBs.

User Interface
==============

The user can access/change each reporter's parameters and driver specific callbacks
via ``devlink``, e.g. per error type (per health reporter):

  * Configure reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke recovery procedure
  * Run diagnostics
  * Object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
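
The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be illustrated with a small
userspace model. This is only a sketch of the semantics stated in the table, not
kernel code: ``struct reporter_model`` and the ``model_*`` functions are
hypothetical names invented for the illustration, and the dump "content" is a
placeholder string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace model of one reporter's dump slot (not kernel code). */
struct reporter_model {
	bool has_dump;                /* true while a dump is stored */
	char dump[64];                /* the single saved dump */
	unsigned int dumps_generated; /* times the "driver" produced a dump */
};

/* Stand-in for the driver's dump callback: writes a new dump into the slot. */
static void model_generate_dump(struct reporter_model *r)
{
	r->dumps_generated++;
	snprintf(r->dump, sizeof(r->dump), "dump #%u", r->dumps_generated);
	r->has_dump = true;
}

/* Models DUMP_GET: return the stored dump, generating one only if none exists. */
static const char *model_dump_get(struct reporter_model *r)
{
	assert(r != NULL);
	if (!r->has_dump)
		model_generate_dump(r);
	return r->dump;
}

/* Models DUMP_CLEAR: drop the stored dump so the next get generates a new one. */
static void model_dump_clear(struct reporter_model *r)
{
	assert(r != NULL);
	r->has_dump = false;
	memset(r->dump, 0, sizeof(r->dump));
}
```

In this model, two consecutive calls to ``model_dump_get()`` return the same
stored dump, and a new one is only produced after ``model_dump_clear()``.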

The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+
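
The "health report" path in the diagram, combined with the steps listed under
Actions, can be modelled roughly as follows. This is a simplified userspace
sketch, not the kernel implementation: ``struct health_model`` and
``model_report()`` are hypothetical names, time is passed in as plain
milliseconds, the trace-event log is omitted, and the grace-period rule is
assumed to mean "skip auto-recovery if the last recovery was less than one
grace period ago".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one reporter's state (not kernel code). */
struct health_model {
	bool auto_dump;            /* auto-dump configuration */
	bool auto_recover;         /* auto-recovery configuration */
	uint64_t grace_period_ms;  /* minimum time between auto-recoveries */
	bool has_dump;             /* a dump is already stored */
	bool recovered_before;     /* at least one recovery has happened */
	uint64_t last_recover_ms;  /* time of the last recovery */
	unsigned int error_count;  /* statistics: reported errors */
	unsigned int recover_count;/* statistics: recoveries performed */
};

/*
 * Models the handling of one error report at time now_ms (assumed to be
 * monotonic, i.e. never earlier than last_recover_ms).
 * Returns true if an auto-recovery attempt was made.
 */
static bool model_report(struct health_model *h, uint64_t now_ms)
{
	assert(h != NULL);

	/* Health status and statistics are updated. */
	h->error_count++;

	/* A dump is taken only if auto-dump is set and none is stored yet. */
	if (h->auto_dump && !h->has_dump)
		h->has_dump = true;

	/* Auto-recovery depends on the configuration ... */
	if (!h->auto_recover)
		return false;

	/* ... and on the grace period vs. time since the last recovery. */
	if (h->recovered_before &&
	    now_ms - h->last_recover_ms < h->grace_period_ms)
		return false;

	h->recovered_before = true;
	h->last_recover_ms = now_ms;
	h->recover_count++;
	return true;
}
```

With a 500 ms grace period in this model, a report at 1000 ms triggers recovery,
a second report at 1200 ms only updates the statistics, and a third report at
1600 ms triggers recovery again.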