162306a36Sopenharmony_ci=========================
262306a36Sopenharmony_ciBPF Graph Data Structures
362306a36Sopenharmony_ci=========================
462306a36Sopenharmony_ci
562306a36Sopenharmony_ciThis document describes implementation details of new-style "graph" data
662306a36Sopenharmony_cistructures (linked_list, rbtree), with particular focus on the verifier's
762306a36Sopenharmony_ciimplementation of semantics specific to those data structures.
862306a36Sopenharmony_ci
962306a36Sopenharmony_ciAlthough no specific verifier code is referred to in this document, the document
1062306a36Sopenharmony_ciassumes that the reader has general knowledge of BPF verifier internals, BPF
1162306a36Sopenharmony_cimaps, and BPF program writing.
1262306a36Sopenharmony_ci
1362306a36Sopenharmony_ciNote that the intent of this document is to describe the current state of
1462306a36Sopenharmony_cithese graph data structures. **No guarantees** of stability for either
1562306a36Sopenharmony_cisemantics or APIs are made or implied here.
1662306a36Sopenharmony_ci
1762306a36Sopenharmony_ci.. contents::
1862306a36Sopenharmony_ci    :local:
1962306a36Sopenharmony_ci    :depth: 2
2062306a36Sopenharmony_ci
2162306a36Sopenharmony_ciIntroduction
2262306a36Sopenharmony_ci------------
2362306a36Sopenharmony_ci
2462306a36Sopenharmony_ciThe BPF map API has historically been the main way to expose data structures
2562306a36Sopenharmony_ciof various types for use within BPF programs. Some data structures fit naturally
2662306a36Sopenharmony_ciwith the map API (HASH, ARRAY), others less so. Consequently, programs
2762306a36Sopenharmony_ciinteracting with the latter group of data structures can be hard to parse
2862306a36Sopenharmony_cifor kernel programmers without previous BPF experience.
2962306a36Sopenharmony_ci
3062306a36Sopenharmony_ciLuckily, some restrictions which necessitated the use of BPF map semantics are
3162306a36Sopenharmony_cino longer relevant. With the introduction of kfuncs, kptrs, and the any-context
3262306a36Sopenharmony_ciBPF allocator, it is now possible to implement BPF data structures whose API
3362306a36Sopenharmony_ciand semantics more closely match those exposed to the rest of the kernel.
3462306a36Sopenharmony_ci
3562306a36Sopenharmony_ciTwo such data structures - linked_list and rbtree - have many verification
3662306a36Sopenharmony_cidetails in common. Because both have "root"s ("head" for linked_list) and
3762306a36Sopenharmony_ci"node"s, the verifier code and this document refer to common functionality
3862306a36Sopenharmony_cias "graph_api", "graph_root", "graph_node", etc.
3962306a36Sopenharmony_ci
4062306a36Sopenharmony_ciUnless otherwise stated, examples and semantics below apply to both graph data
4162306a36Sopenharmony_cistructures.
4262306a36Sopenharmony_ci
4362306a36Sopenharmony_ciUnstable API
4462306a36Sopenharmony_ci------------
4562306a36Sopenharmony_ci
4662306a36Sopenharmony_ciData structures implemented using the BPF map API have historically used BPF
4762306a36Sopenharmony_cihelper functions - either standard map API helpers like ``bpf_map_update_elem``
4862306a36Sopenharmony_cior map-specific helpers. The new-style graph data structures instead use kfuncs
4962306a36Sopenharmony_cito define their manipulation helpers. Because there are no stability guarantees
5062306a36Sopenharmony_cifor kfuncs, the API and semantics for these data structures can be evolved in
5162306a36Sopenharmony_cia way that breaks backwards compatibility if necessary.
5262306a36Sopenharmony_ci
5362306a36Sopenharmony_ciRoot and node types for the new data structures are opaquely defined in the
5462306a36Sopenharmony_ci``uapi/linux/bpf.h`` header.
5562306a36Sopenharmony_ci
5662306a36Sopenharmony_ciLocking
5762306a36Sopenharmony_ci-------
5862306a36Sopenharmony_ci
5962306a36Sopenharmony_ciThe new-style data structures are intrusive and are defined similarly to their
6062306a36Sopenharmony_civanilla kernel counterparts:
6162306a36Sopenharmony_ci
6262306a36Sopenharmony_ci.. code-block:: c
6362306a36Sopenharmony_ci
6462306a36Sopenharmony_ci        struct node_data {
6562306a36Sopenharmony_ci          long key;
6662306a36Sopenharmony_ci          long data;
6762306a36Sopenharmony_ci          struct bpf_rb_node node;
6862306a36Sopenharmony_ci        };
6962306a36Sopenharmony_ci
7062306a36Sopenharmony_ci        struct bpf_spin_lock glock;
7162306a36Sopenharmony_ci        struct bpf_rb_root groot __contains(node_data, node);
7262306a36Sopenharmony_ci
7362306a36Sopenharmony_ciThe "root" type for both linked_list and rbtree expects to be in a map_value
7462306a36Sopenharmony_ciwhich also contains a ``bpf_spin_lock`` - in the above example both global
7562306a36Sopenharmony_civariables are placed in a single-value arraymap. The verifier considers this
7662306a36Sopenharmony_cispin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
7762306a36Sopenharmony_cithe same map_value and will enforce that the correct lock is held when
7862306a36Sopenharmony_civerifying BPF programs that manipulate the tree. Since this lock checking
7962306a36Sopenharmony_cihappens at verification time, there is no runtime penalty.
8062306a36Sopenharmony_ci
8162306a36Sopenharmony_ciNon-owning references
8262306a36Sopenharmony_ci---------------------
8362306a36Sopenharmony_ci
8462306a36Sopenharmony_ci**Motivation**
8562306a36Sopenharmony_ci
8662306a36Sopenharmony_ciConsider the following BPF code:
8762306a36Sopenharmony_ci
8862306a36Sopenharmony_ci.. code-block:: c
8962306a36Sopenharmony_ci
9062306a36Sopenharmony_ci        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
9162306a36Sopenharmony_ci
9262306a36Sopenharmony_ci        bpf_spin_lock(&lock);
9362306a36Sopenharmony_ci
9462306a36Sopenharmony_ci        bpf_rbtree_add(&tree, n); /* PASSED */
9562306a36Sopenharmony_ci
9662306a36Sopenharmony_ci        bpf_spin_unlock(&lock);
9762306a36Sopenharmony_ci
9862306a36Sopenharmony_ciFrom the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
9962306a36Sopenharmony_cihas type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
10062306a36Sopenharmony_ci``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
10162306a36Sopenharmony_ciprogram has ownership of the pointee's (object pointed to by ``n``) lifetime.
10262306a36Sopenharmony_ciThe BPF program must pass off ownership before exiting - either via
10362306a36Sopenharmony_ci``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
10462306a36Sopenharmony_ci``bpf_rbtree_add``.
10562306a36Sopenharmony_ci
10662306a36Sopenharmony_ci(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
10762306a36Sopenharmony_ci"ownership is acquired" and "ownership is passed", respectively)
10862306a36Sopenharmony_ci
10962306a36Sopenharmony_ciWhat should the verifier do with ``n`` after ownership is passed off? If the
11062306a36Sopenharmony_ciobject was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
11162306a36Sopenharmony_cishould reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
11262306a36Sopenharmony_cithe object is no longer valid. The underlying memory may have been reused for
11362306a36Sopenharmony_cisome other allocation, unmapped, etc.
11462306a36Sopenharmony_ci
11562306a36Sopenharmony_ciWhen ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
11662306a36Sopenharmony_ciobvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
11762306a36Sopenharmony_cibut that would result in programs with useful, common coding patterns being
11862306a36Sopenharmony_cirejected, e.g.:
11962306a36Sopenharmony_ci
12062306a36Sopenharmony_ci.. code-block:: c
12162306a36Sopenharmony_ci
12262306a36Sopenharmony_ci        int x;
12362306a36Sopenharmony_ci        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
12462306a36Sopenharmony_ci
12562306a36Sopenharmony_ci        bpf_spin_lock(&lock);
12662306a36Sopenharmony_ci
12762306a36Sopenharmony_ci        bpf_rbtree_add(&tree, n); /* PASSED */
12862306a36Sopenharmony_ci        x = n->data;
12962306a36Sopenharmony_ci        n->data = 42;
13062306a36Sopenharmony_ci
13162306a36Sopenharmony_ci        bpf_spin_unlock(&lock);
13262306a36Sopenharmony_ci
13362306a36Sopenharmony_ciBoth the read from and write to ``n->data`` would be rejected. The verifier
13462306a36Sopenharmony_cican do better, though, by taking advantage of two details:
13562306a36Sopenharmony_ci
13662306a36Sopenharmony_ci  * Graph data structure APIs can only be used when the ``bpf_spin_lock``
13762306a36Sopenharmony_ci    associated with the graph root is held
13862306a36Sopenharmony_ci
13962306a36Sopenharmony_ci  * Both graph data structures have pointer stability
14062306a36Sopenharmony_ci
14162306a36Sopenharmony_ci     * Because graph nodes are allocated with ``bpf_obj_new`` and
14262306a36Sopenharmony_ci       adding / removing from the root involves fiddling with the
14362306a36Sopenharmony_ci       ``bpf_{list,rb}_node`` field of the node struct, a graph node will
14462306a36Sopenharmony_ci       remain at the same address after either operation.
14562306a36Sopenharmony_ci
14662306a36Sopenharmony_ciBecause the associated ``bpf_spin_lock`` must be held by any program adding
14762306a36Sopenharmony_cior removing, if we're in the critical section bounded by that lock, we know
14862306a36Sopenharmony_cithat no other program can add or remove until the end of the critical section.
14962306a36Sopenharmony_ciThis combined with pointer stability means that, until the critical section
15062306a36Sopenharmony_ciends, we can safely access the graph node through ``n`` even after it was used
15162306a36Sopenharmony_cito pass ownership.
15262306a36Sopenharmony_ci
15362306a36Sopenharmony_ciThe verifier considers such a reference a *non-owning reference*. The ref
15462306a36Sopenharmony_cireturned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
15562306a36Sopenharmony_ciBoth terms currently only have meaning in the context of graph nodes and API.
15662306a36Sopenharmony_ci
15762306a36Sopenharmony_ci**Details**
15862306a36Sopenharmony_ci
15962306a36Sopenharmony_ciLet's enumerate the properties of both types of references.
16062306a36Sopenharmony_ci
16162306a36Sopenharmony_ci*owning reference*
16262306a36Sopenharmony_ci
16362306a36Sopenharmony_ci  * This reference controls the lifetime of the pointee
16462306a36Sopenharmony_ci
16562306a36Sopenharmony_ci  * Ownership of pointee must be 'released' by passing it to some graph API
16662306a36Sopenharmony_ci    kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
16762306a36Sopenharmony_ci
16862306a36Sopenharmony_ci    * If not released before program ends, verifier considers program invalid
16962306a36Sopenharmony_ci
17062306a36Sopenharmony_ci  * Access to the pointee's memory will not page fault
17162306a36Sopenharmony_ci
17262306a36Sopenharmony_ci*non-owning reference*
17362306a36Sopenharmony_ci
17462306a36Sopenharmony_ci  * This reference does not own the pointee
17562306a36Sopenharmony_ci
17662306a36Sopenharmony_ci     * It cannot be used to add the graph node to a graph root, nor ``free``'d via
17762306a36Sopenharmony_ci       ``bpf_obj_drop``
17862306a36Sopenharmony_ci
17962306a36Sopenharmony_ci  * No explicit control of lifetime, but can infer valid lifetime based on
18062306a36Sopenharmony_ci    non-owning ref existence (see explanation below)
18162306a36Sopenharmony_ci
18262306a36Sopenharmony_ci  * Access to the pointee's memory will not page fault
18362306a36Sopenharmony_ci
18462306a36Sopenharmony_ciFrom verifier's perspective non-owning references can only exist
18562306a36Sopenharmony_cibetween spin_lock and spin_unlock. Why? After spin_unlock another program
18662306a36Sopenharmony_cican do arbitrary operations on the data structure like removing and ``free``-ing
18762306a36Sopenharmony_civia bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
18862306a36Sopenharmony_ci``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
18962306a36Sopenharmony_ciOr the memory could go away.
19062306a36Sopenharmony_ci
19162306a36Sopenharmony_ciTo prevent this logic violation all non-owning references are invalidated by the
19262306a36Sopenharmony_civerifier after a critical section ends. This is necessary to ensure the "will
19362306a36Sopenharmony_cinot page fault" property of non-owning references. So if the verifier hasn't
19462306a36Sopenharmony_ciinvalidated a non-owning ref, accessing it will not page fault.
19562306a36Sopenharmony_ci
19662306a36Sopenharmony_ciCurrently ``bpf_obj_drop`` is not allowed in the critical section, so
19762306a36Sopenharmony_ciif there's a valid non-owning ref, we must be in a critical section, and can
19862306a36Sopenharmony_ciconclude that the ref's memory hasn't been dropped-and- ``free``'d or
19962306a36Sopenharmony_cidropped-and-reused.
20062306a36Sopenharmony_ci
20162306a36Sopenharmony_ciAny reference to a node that is in an rbtree _must_ be non-owning, since
20262306a36Sopenharmony_cithe tree has control of the pointee's lifetime. Similarly, any ref to a node
20362306a36Sopenharmony_cithat isn't in rbtree _must_ be owning. This results in a nice property:
20462306a36Sopenharmony_cigraph API add / remove implementations don't need to check if a node
20562306a36Sopenharmony_cihas already been added (or already removed), as the ownership model
20662306a36Sopenharmony_ciallows the verifier to prevent such a state from being valid by simply checking
20762306a36Sopenharmony_citypes.
20862306a36Sopenharmony_ci
20962306a36Sopenharmony_ciHowever, pointer aliasing poses an issue for the above "nice property".
21062306a36Sopenharmony_ciConsider the following example:
21162306a36Sopenharmony_ci
21262306a36Sopenharmony_ci.. code-block:: c
21362306a36Sopenharmony_ci
21462306a36Sopenharmony_ci        struct node_data *n, *m, *o, *p;
21562306a36Sopenharmony_ci        n = bpf_obj_new(typeof(*n));     /* 1 */
21662306a36Sopenharmony_ci
21762306a36Sopenharmony_ci        bpf_spin_lock(&lock);
21862306a36Sopenharmony_ci
21962306a36Sopenharmony_ci        bpf_rbtree_add(&tree, n);        /* 2 */
22062306a36Sopenharmony_ci        m = bpf_rbtree_first(&tree);     /* 3 */
22162306a36Sopenharmony_ci
22262306a36Sopenharmony_ci        o = bpf_rbtree_remove(&tree, n); /* 4 */
22362306a36Sopenharmony_ci        p = bpf_rbtree_remove(&tree, m); /* 5 */
22462306a36Sopenharmony_ci
22562306a36Sopenharmony_ci        bpf_spin_unlock(&lock);
22662306a36Sopenharmony_ci
22762306a36Sopenharmony_ci        bpf_obj_drop(o);
22862306a36Sopenharmony_ci        bpf_obj_drop(p); /* 6 */
22962306a36Sopenharmony_ci
23062306a36Sopenharmony_ciAssume the tree is empty before this program runs. If we track verifier state
23162306a36Sopenharmony_cichanges here using numbers in above comments:
23262306a36Sopenharmony_ci
23362306a36Sopenharmony_ci  1) n is an owning reference
23462306a36Sopenharmony_ci
23562306a36Sopenharmony_ci  2) n is a non-owning reference, it's been added to the tree
23662306a36Sopenharmony_ci
23762306a36Sopenharmony_ci  3) n and m are non-owning references, they both point to the same node
23862306a36Sopenharmony_ci
23962306a36Sopenharmony_ci  4) o is an owning reference, n and m non-owning, all point to same node
24062306a36Sopenharmony_ci
24162306a36Sopenharmony_ci  5) o and p are owning, n and m non-owning, all point to the same node
24262306a36Sopenharmony_ci
24362306a36Sopenharmony_ci  6) a double-free has occurred, since o and p point to same node and o was
24462306a36Sopenharmony_ci     ``free``'d in previous statement
24562306a36Sopenharmony_ci
24662306a36Sopenharmony_ciStates 4 and 5 violate our "nice property", as there are non-owning refs to
24762306a36Sopenharmony_cia node which is not in an rbtree. Statement 5 will try to remove a node which
24862306a36Sopenharmony_cihas already been removed as a result of this violation. State 6 is a dangerous
24962306a36Sopenharmony_cidouble-free.
25062306a36Sopenharmony_ci
25162306a36Sopenharmony_ciAt a minimum we should prevent state 6 from being possible. If we can't also
25262306a36Sopenharmony_ciprevent state 5 then we must abandon our "nice property" and check whether a
25362306a36Sopenharmony_cinode has already been removed at runtime.
25462306a36Sopenharmony_ci
25562306a36Sopenharmony_ciWe prevent both by generalizing the "invalidate non-owning references" behavior
25662306a36Sopenharmony_ciof ``bpf_spin_unlock`` and doing similar invalidation after
25762306a36Sopenharmony_ci``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
25862306a36Sopenharmony_ci
25962306a36Sopenharmony_ci  * takes an arbitrary node argument
26062306a36Sopenharmony_ci
26162306a36Sopenharmony_ci  * removes it from the data structure
26262306a36Sopenharmony_ci
26362306a36Sopenharmony_ci  * returns an owning reference to the removed node
26462306a36Sopenharmony_ci
26562306a36Sopenharmony_ciMay result in a state where some other non-owning reference points to the same
26662306a36Sopenharmony_cinode. So ``remove``-type kfuncs must be considered a non-owning reference
26762306a36Sopenharmony_ciinvalidation point as well.
268