.. SPDX-License-Identifier: GPL-2.0

=============
Devlink DPIPE
=============

Background
==========

While performing the hardware offloading process, many of the hardware
specifics cannot be presented. These details are useful for debugging, and
``devlink-dpipe`` provides a standardized way to gain visibility into the
offloading process.

For example, the routing longest prefix match (LPM) algorithm used by the
Linux kernel may differ from the hardware implementation. The pipeline debug
API (DPIPE) is aimed at providing the user visibility into the ASIC's
pipeline in a generic way.

The hardware offload process is expected to be done in such a way that the
user cannot distinguish between the hardware and software implementations.
In this process, hardware specifics are neglected. In reality, those details
can be highly significant and should be exposed in some standard way.

This problem is made even more complex when one wishes to offload the
control path of the whole networking stack to a switch ASIC. Due to
differences in the hardware and software models, some processes cannot be
represented correctly.

One example is the kernel's LPM algorithm, which in many cases differs
greatly from the hardware implementation. The configuration API is the same,
but one cannot rely on the Forwarding Information Base (FIB) in hardware to
look like the kernel's Level Path Compression trie (LPC-trie).

In many situations, analyzing a system failure based solely on the kernel's
dump may not be enough. Combining this data with complementary information
about the underlying hardware makes debugging easier; the information can
also be useful when debugging performance issues.

Overview
========

The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
modeled as a graph of match/action tables. Each table represents a specific
hardware block. This model is not new, first being used by the P4 language.

Traditionally it has been used as an alternative model for hardware
configuration, but the ``devlink-dpipe`` interface uses it for visibility
purposes as a standard complementary tool. The system's view from
``devlink-dpipe`` should change according to the changes made by the
standard configuration tools.

For example, it is quite common to implement Access Control Lists (ACL)
using Ternary Content Addressable Memory (TCAM). The TCAM can be divided
into TCAM regions. Complex TC filters can have multiple rules with different
priorities and different lookup keys. Hardware TCAM regions, on the other
hand, have a predefined lookup key. Offloading the TC filter rules using the
TCAM engine can result in multiple TCAM regions being interconnected in a
chain (which may affect the data path latency). In response to a new TC
filter, new tables should be created describing those regions.

Model
=====

The ``DPIPE`` model introduces several objects:

  * headers
  * tables
  * entries

A ``header`` describes packet formats and provides names for fields within
the packet. A ``table`` describes hardware blocks. An ``entry`` describes
the actual content of a specific table.

The hardware pipeline is not port specific, but rather describes the whole
ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.

Drivers can register and unregister tables at run time, in order to support
dynamic behavior. This dynamic behavior is mandatory for describing hardware
blocks like TCAM regions which can be allocated and freed dynamically.

``devlink-dpipe`` generally is not intended for configuration. The exception
is hardware counting for a specific table.

The following commands are used to obtain the ``dpipe`` objects from
userspace:

  * ``table_get``: Receive a table's description.
  * ``headers_get``: Receive a device's supported headers.
  * ``entries_get``: Receive a table's current entries.
  * ``counters_set``: Enable or disable counters on a table.

Table
-----

The driver should implement the following operations for each table:

  * ``matches_dump``: Dump the supported matches.
  * ``actions_dump``: Dump the supported actions.
  * ``entries_dump``: Dump the actual content of the table.
  * ``counters_set_update``: Synchronize hardware with counters enabled or
    disabled.

Header/Field
------------

As in P4, headers and fields are used to describe a table's behavior. There
is a slight difference between the standard protocol headers and specific
ASIC metadata. The protocol headers should be declared in the ``devlink``
core API. ASIC metadata, on the other hand, is driver specific and should be
defined in the driver. Additionally, each driver-specific devlink
documentation file should document the driver-specific ``dpipe`` headers it
implements. The headers and fields are identified by enumeration.

In order to provide further visibility, some ASIC metadata fields could be
mapped to kernel objects. For example, internal router interface indexes can
be directly mapped to the net device ifindex. FIB table indexes used by
different Virtual Routing and Forwarding (VRF) tables can be mapped to
internal routing table indexes.

Match
-----

Matches are kept primitive and close to hardware operation. Match types like
LPM are not supported, because this is exactly the process we wish to
describe in full detail. Examples of matches:

  * ``field_exact``: Exact match on a specific field.
  * ``field_exact_mask``: Exact match on a specific field after masking.
  * ``field_range``: Match on a specific range.

The IDs of the header and the field should be specified in order to
identify the specific field. Furthermore, the header index should be
specified in order to distinguish multiple headers of the same type in a
packet (tunneling).
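
The match primitives listed above can be sketched in Python as follows. This
is an illustrative model only, not the devlink API: the function names mirror
the list above, field values are assumed to be plain integers, and treating
the range bounds as inclusive is an assumption of the sketch.

```python
def field_exact(value, key):
    """Exact match: the field must equal the key."""
    return value == key


def field_exact_mask(value, key, mask):
    """Exact match after masking: compare only the bits selected by mask."""
    return (value & mask) == (key & mask)


def field_range(value, low, high):
    """Range match; inclusive bounds are an assumption of this sketch."""
    return low <= value <= high


# 10.1.2.3 as an integer, compared against 10.1.0.0 with a /16 mask.
dst = 0x0A010203
print(field_exact_mask(dst, 0x0A010000, 0xFFFF0000))
```

An LPM lookup is deliberately absent: in this model it would be expressed as
a sequence of ``field_exact_mask`` lookups, one per prefix length.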

Action
------

Similar to matches, actions are kept primitive and close to hardware
operation. For example:

  * ``field_modify``: Modify the field value.
  * ``field_inc``: Increment the field value.
  * ``push_header``: Add a header.
  * ``pop_header``: Remove a header.
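
As a rough illustration of how primitive these actions are, the sketch below
applies them to a toy packet. The packet representation (a list of
``(header_name, fields)`` pairs with the outermost header first), the
``position`` parameter and the choice to push and pop at the outermost
position are all assumptions made for the example, not part of ``dpipe``.

```python
def field_modify(packet, position, field, value):
    """Overwrite one field of the header at the given position."""
    packet[position][1][field] = value


def field_inc(packet, position, field, amount=1):
    """Add `amount` to one field of the header at the given position."""
    packet[position][1][field] += amount


def push_header(packet, name, fields):
    """Add a new outermost header (assumed placement)."""
    packet.insert(0, (name, dict(fields)))


def pop_header(packet):
    """Remove and return the outermost header (assumed placement)."""
    return packet.pop(0)


packet = [("ipv4", {"ttl": 64, "dst_addr": "10.1.2.3"})]
field_inc(packet, 0, "ttl", -1)
push_header(packet, "ethernet", {"daddr": "00:11:22:33:44:55"})
field_modify(packet, 0, "daddr", "66:77:88:99:aa:bb")
```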

Entry
-----

Entries of a specific table can be dumped on demand. Each entry is
identified with an index and its properties are described by a list of
match/action values and a specific counter. By dumping the tables' content,
the interactions between tables can be resolved.
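
One way to picture how dumped entries reveal the interaction between tables
is the sketch below. The ``Entry`` class and the ``linked_entries`` helper
are hypothetical and exist only for this illustration; the idea is that a
value set by an action in one table shows up as a match value in the next.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    index: int       # position of the entry in its table
    match: dict      # "header.field" -> matched value
    action: dict     # "header.field" -> value set by the action
    counter: int = 0


def linked_entries(entry, next_table):
    """Return entries of next_table whose match values agree with every
    field that `entry` sets and that the candidate also matches on."""
    linked = []
    for candidate in next_table:
        shared = entry.action.keys() & candidate.match.keys()
        if shared and all(entry.action[f] == candidate.match[f] for f in shared):
            linked.append(candidate)
    return linked


lpm = Entry(0, {"ipv4.dst_addr": "10.1.0.0/16"},
            {"meta.adj_index": 8, "meta.adj_group_size": 2})
adjacency = [
    Entry(0, {"meta.adj_index": 8, "meta.adj_group_size": 2,
              "meta.packet_hash_index": 0}, {"meta.erif": 1}),
    Entry(1, {"meta.adj_index": 8, "meta.adj_group_size": 2,
              "meta.packet_hash_index": 1}, {"meta.erif": 2}),
    Entry(2, {"meta.adj_index": 12, "meta.adj_group_size": 1,
              "meta.packet_hash_index": 0}, {"meta.erif": 3}),
]
print([e.index for e in linked_entries(lpm, adjacency)])
```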

Abstraction Example
===================

The following is an example of the abstraction model of the L3 part of the
Mellanox Spectrum ASIC. The blocks are described in the order they appear in
the pipeline. The table sizes in the following examples are not real
hardware sizes and are provided for demonstration purposes.

LPM
---

The LPM algorithm can be implemented as a list of hash tables. Each hash
table contains routes with the same prefix length. The root of the list is
/32, and in case of a miss the hardware will continue to the next hash
table. The depth of the search will affect the data path latency.

In case of a hit, the entry contains information about the next stage of the
pipeline, which resolves the MAC address. The next stage can be either the
local host table for directly connected routes, or the adjacency table for
next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.

.. code::

    table lpm_prefix_16 {
      size: 4096,
      counters_enabled: true,
      match: { meta.vr_id: exact,
               ipv4.dst_addr: exact_mask,
               ipv6.dst_addr: exact_mask,
               meta.lpm_prefix: exact },
      action: { meta.adj_index: set,
                meta.adj_group_size: set,
                meta.rif_port: set,
                meta.lpm_prefix: set },
    }
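
The list-of-hash-tables idea can be modeled in a few lines of Python. This
is a software sketch, not the hardware implementation: it handles IPv4 only,
keeps one dictionary per prefix length that actually has routes, and the
names ``build_tables`` and ``lpm_lookup`` are made up for the example. The
returned depth is the number of hash tables probed, which corresponds to the
latency effect mentioned above.

```python
import ipaddress


def build_tables(routes):
    """routes: iterable of (cidr_string, result) pairs.
    Returns a list of (prefix_len, dict), longest prefix first."""
    by_len = {}
    for cidr, result in routes:
        net = ipaddress.ip_network(cidr)
        by_len.setdefault(net.prefixlen, {})[int(net.network_address)] = result
    return sorted(by_len.items(), reverse=True)


def lpm_lookup(tables, addr):
    """Probe each hash table in turn; return (result, tables_probed)."""
    ip = int(ipaddress.ip_address(addr))
    for depth, (plen, table) in enumerate(tables, start=1):
        mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        hit = table.get(ip & mask)   # exact match after masking
        if hit is not None:
            return hit, depth
    return None, len(tables)


tables = build_tables([
    ("10.0.0.0/8", "adjacency group A"),
    ("10.1.0.0/16", "adjacency group B"),
    ("10.1.2.3/32", "local host"),
])
print(lpm_lookup(tables, "10.1.9.9"))
```

Each probe is an exact match on the masked destination address, which is why
the tables above use ``exact_mask`` rather than a dedicated LPM match type.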

Local Host
----------

In the case of local routes, the LPM lookup already resolves the egress
router interface (RIF), yet the exact MAC address is not known. The local
host table is a hash table that uses the output interface ID combined with
the destination IP address as a key. The result is the MAC address.

.. code::

    table local_host {
      size: 4096,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               ipv4.dst_addr: exact },
      action: { ethernet.daddr: set }
    }

Adjacency
---------

In the case of remote routes, this table performs ECMP. The LPM lookup
yields an ECMP group size and an index that serves as a global offset into
this table. Concurrently, a hash of the packet is generated. Based on the
ECMP group size and the packet's hash, a local offset is generated. Multiple
LPM entries can point to the same adjacency group.

.. code::

    table adjacency {
      size: 4096,
      counters_enabled: true,
      match: { meta.adj_index: exact,
               meta.adj_group_size: exact,
               meta.packet_hash_index: exact },
      action: { ethernet.daddr: set,
                meta.erif: set }
    }
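
The offset arithmetic described for this table can be sketched as follows.
The text only says that a local offset is derived from the group size and
the packet hash; using a modulo for that step is an assumption of this
example, and ``adjacency_lookup`` is a hypothetical helper name.

```python
def adjacency_lookup(adjacency, adj_index, adj_group_size, packet_hash):
    """Pick one next hop from the ECMP group that starts at adj_index."""
    local_offset = packet_hash % adj_group_size   # assumed hash-to-offset step
    return adjacency[adj_index + local_offset]    # global + local offset


# A flat adjacency table; the group used below starts at index 1, size 3.
adjacency = [
    {"daddr": "00:00:00:00:00:01", "erif": 1},
    {"daddr": "00:00:00:00:00:02", "erif": 2},
    {"daddr": "00:00:00:00:00:03", "erif": 2},
    {"daddr": "00:00:00:00:00:04", "erif": 3},
]
print(adjacency_lookup(adjacency, adj_index=1, adj_group_size=3, packet_hash=7))
```

Because the LPM entry only stores the group's start index and size, several
LPM entries can share one group by carrying the same pair of values.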

ERIF
----

Once the egress RIF and destination MAC have been resolved by previous
tables, this table performs multiple operations such as TTL decrement and
MTU check. Then the forward/drop decision is taken and the port L3
statistics are updated based on the packet's type (broadcast, unicast,
multicast).

.. code::

    table erif {
      size: 800,
      counters_enabled: true,
      match: { meta.rif_port: exact,
               meta.is_l3_unicast: exact,
               meta.is_l3_broadcast: exact,
               meta.is_l3_multicast: exact },
      action: { meta.l3_drop: set,
                meta.l3_forward: set }
    }
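
A toy model of this stage is shown below. The source does not specify the
exact drop conditions, so dropping on an expired TTL or an oversized packet
is an assumption, as are the packet dictionary layout and the
``erif_process`` name. The counters are keyed by packet type and decision to
mirror the per-type L3 statistics.

```python
from collections import Counter


def erif_process(packet, rif_mtu, counters):
    """Return "forward" or "drop" and update per-type L3 counters."""
    packet["ttl"] -= 1                        # TTL decrease
    expired = packet["ttl"] <= 0              # assumed drop condition
    too_big = packet["length"] > rif_mtu      # MTU check, assumed drop condition
    decision = "drop" if expired or too_big else "forward"
    counters[(packet["l3_type"], decision)] += 1
    return decision


counters = Counter()
pkt = {"ttl": 64, "length": 1500, "l3_type": "unicast"}
print(erif_process(pkt, rif_mtu=1500, counters=counters))
```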