.. SPDX-License-Identifier: GPL-2.0

==============
Devlink Health
==============

Background
==========

The ``devlink`` health mechanism targets real-time alerting: it lets the user
know when something bad has happened to a PCI device. Its goals are to:

  * Provide alert debug information.
  * Enable self healing.
  * If a problem needs vendor support, provide a way to gather all the needed
    debugging information.

Overview
========

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and to allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error, rx/tx
error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Recovery procedures
  * Diagnostics procedures
  * Object dump procedures
  * Out Of Box initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.

Actions
=======

Once an error is reported, devlink health performs the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    auto-dump is set and no other dump is already stored)
  * An auto recovery attempt is made, depending on:

    - The auto-recovery configuration
    - The grace period vs. the time passed since the last recovery

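As a rough illustration, the sequence above can be modelled in a few lines of
userspace Python. This is a sketch, not the kernel implementation: the
``Reporter`` class, its field names and ``health_report()`` are hypothetical,
time is passed in explicitly as milliseconds, and the steps simply follow the
order of the list above, which may differ from the order of checks inside
``devlink``.

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Callable, List, Optional


   @dataclass
   class Reporter:
       """Hypothetical stand-in for one health reporter instance."""
       name: str
       recover: Optional[Callable[[], bool]] = None   # driver recovery callback
       dump: Optional[Callable[[], dict]] = None      # driver object dump callback
       auto_recover: bool = True
       auto_dump: bool = True
       grace_period_ms: int = 0
       healthy: bool = True
       error_count: int = 0
       recovery_count: int = 0
       last_recovery_ms: Optional[int] = None
       stored_dump: Optional[dict] = None
       trace: List[str] = field(default_factory=list)  # models the trace buffer


   def health_report(r: Reporter, msg: str, now_ms: int) -> bool:
       """Handle one error report; return True if a recovery succeeded."""
       # 1. Log the event.
       r.trace.append(f"{r.name}: {msg}")
       # 2. Update health status and statistics.
       r.healthy = False
       r.error_count += 1
       # 3. Take a dump only if auto-dump is set and none is stored yet.
       if r.auto_dump and r.dump is not None and r.stored_dump is None:
           r.stored_dump = r.dump()
       # 4. Attempt auto recovery, subject to configuration and grace period.
       if not r.auto_recover or r.recover is None:
           return False
       if (r.last_recovery_ms is not None
               and now_ms - r.last_recovery_ms < r.grace_period_ms):
           return False  # still inside the grace period, skip recovery
       if r.recover():
           r.healthy = True
           r.recovery_count += 1
           r.last_recovery_ms = now_ms
           return True
       return False

In this model, a second error arriving before ``grace_period_ms`` has elapsed
since the last successful recovery leaves the reporter unhealthy until the
user intervenes or a later report falls outside the grace period.
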
Devlink formatted message
=========================

To handle devlink health diagnose and health dump requests, devlink creates a
formatted message structure, ``devlink_fmsg``, and sends it to the driver's
callback, which fills in the data using the devlink fmsg API.

Devlink fmsg is a mechanism to pass descriptors between drivers and devlink in
a JSON-like format. The API allows the driver to add nested attributes such as
object, object pair and value array, in addition to attributes such as name and
value.

The driver should use this API to fill the fmsg context in a format which
devlink later translates into a netlink message. When devlink needs to send
the data to the netlink layer using SKBs, it fragments the data between
different SKBs. To make this fragmentation possible, it uses virtual nest
attributes instead of actual nesting, which cannot be divided between
different SKBs.

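The idea behind virtual nests can be illustrated with a small userspace
Python model. It is only a sketch under stated assumptions: ``flatten()``,
``fragment()`` and ``rebuild()`` are hypothetical helpers, the token names
only loosely mirror the nest start/end markers, and fragments are plain lists
rather than SKBs. The point it shows is that a flat stream of start/end
markers can be cut at any position and still be reassembled by the receiver.

.. code-block:: python

   def flatten(value, out=None):
       """Encode nested dicts/lists as a flat list of (kind, payload) tokens."""
       if out is None:
           out = []
       if isinstance(value, dict):
           out.append(("obj_nest_start", None))
           for name, item in value.items():
               out.append(("pair_nest_start", name))
               flatten(item, out)
               out.append(("nest_end", None))
           out.append(("nest_end", None))
       elif isinstance(value, list):
           out.append(("arr_nest_start", None))
           for item in value:
               flatten(item, out)
           out.append(("nest_end", None))
       else:
           out.append(("value", value))
       return out


   def fragment(tokens, max_per_msg):
       """Split the flat stream at arbitrary points, one list per message."""
       return [tokens[i:i + max_per_msg]
               for i in range(0, len(tokens), max_per_msg)]


   def _attach(frame, value):
       """Store a completed value in an open pair or array frame."""
       if frame[0] == "pair":
           frame[2] = value
       else:
           frame[1].append(value)


   def rebuild(fragments):
       """Reassemble the nested structure from a sequence of fragments."""
       stack = [["arr", []]]  # sentinel frame collecting the top-level value
       for frag in fragments:
           for kind, payload in frag:
               if kind == "obj_nest_start":
                   stack.append(["obj", {}])
               elif kind == "arr_nest_start":
                   stack.append(["arr", []])
               elif kind == "pair_nest_start":
                   stack.append(["pair", payload, None])
               elif kind == "value":
                   _attach(stack[-1], payload)
               elif kind == "nest_end":
                   frame = stack.pop()
                   if frame[0] == "pair":
                       # A pair always closes inside its parent object.
                       stack[-1][1][frame[1]] = frame[2]
                   else:
                       _attach(stack[-1], frame[1])
       return stack[0][1][0]

For example, ``rebuild(fragment(flatten(data), 4))`` returns a structure equal
to ``data`` for any nested dict/list ``data``, even though individual
fragments begin and end in the middle of an object or array.
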
User Interface
==============

The user can access/change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (e.g. disable/enable auto
    recovery)
  * Invoke the recovery procedure
  * Run diagnostics
  * Get an object dump

.. list-table:: List of devlink health interfaces
   :widths: 10 90

   * - Name
     - Description
   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_TEST``
     - Triggers a fake health event on the reporter. The effects of the test
       event in terms of recovery flow should follow closely that of a real
       event.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves current device state related to the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       Dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.

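The single-dump behaviour described for ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
and ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` can be pictured with the
following userspace Python sketch. ``DumpStore`` and its methods are
hypothetical names, and the model assumes that a dump generated on demand is
then kept as the stored dump until it is cleared.

.. code-block:: python

   class DumpStore:
       """Hypothetical model of the one-dump-per-reporter storage."""

       def __init__(self, make_dump):
           self._make_dump = make_dump  # stands in for the reporter's dump op
           self._stored = None

       def dump_get(self):
           # Return the stored dump; generate one only if none is stored.
           if self._stored is None:
               self._stored = self._make_dump()
           return self._stored

       def dump_clear(self):
           # Drop the stored dump so that the next get produces a fresh one.
           self._stored = None

Under these assumptions, repeated gets return the same dump, and a fresh dump
is only produced after a clear.
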
The following diagram provides a general overview of ``devlink-health``::

                                                   netlink
                                          +--------------------------+
                                          |                          |
                                          |            +             |
                                          |            |             |
                                          +--------------------------+
                                                       |request for ops
                                                       |(diagnose,
      driver                               devlink     |recover,
                                                       |dump)
    +--------+                            +--------------------------+
    |        |                            |    reporter|             |
    |        |                            |  +---------v----------+  |
    |        |   ops execution            |  |                    |  |
    |     <----------------------------------+                    |  |
    |        |                            |  |                    |  |
    |        |                            |  + ^------------------+  |
    |        |                            |    | request for ops     |
    |        |                            |    | (recover, dump)     |
    |        |                            |    |                     |
    |        |                            |  +-+------------------+  |
    |        |     health report          |  | health handler     |  |
    |        +------------------------------->                    |  |
    |        |                            |  +--------------------+  |
    |        |     health reporter create |                          |
    |        +---------------------------->                          |
    +--------+                            +--------------------------+