162306a36Sopenharmony_ci========================= 262306a36Sopenharmony_ciBPF Graph Data Structures 362306a36Sopenharmony_ci========================= 462306a36Sopenharmony_ci 562306a36Sopenharmony_ciThis document describes implementation details of new-style "graph" data 662306a36Sopenharmony_cistructures (linked_list, rbtree), with particular focus on the verifier's 762306a36Sopenharmony_ciimplementation of semantics specific to those data structures. 862306a36Sopenharmony_ci 962306a36Sopenharmony_ciAlthough no specific verifier code is referred to in this document, the document 1062306a36Sopenharmony_ciassumes that the reader has general knowledge of BPF verifier internals, BPF 1162306a36Sopenharmony_cimaps, and BPF program writing. 1262306a36Sopenharmony_ci 1362306a36Sopenharmony_ciNote that the intent of this document is to describe the current state of 1462306a36Sopenharmony_cithese graph data structures. **No guarantees** of stability for either 1562306a36Sopenharmony_cisemantics or APIs are made or implied here. 1662306a36Sopenharmony_ci 1762306a36Sopenharmony_ci.. contents:: 1862306a36Sopenharmony_ci :local: 1962306a36Sopenharmony_ci :depth: 2 2062306a36Sopenharmony_ci 2162306a36Sopenharmony_ciIntroduction 2262306a36Sopenharmony_ci------------ 2362306a36Sopenharmony_ci 2462306a36Sopenharmony_ciThe BPF map API has historically been the main way to expose data structures 2562306a36Sopenharmony_ciof various types for use within BPF programs. Some data structures fit naturally 2662306a36Sopenharmony_ciwith the map API (HASH, ARRAY), others less so. Consequently, programs 2762306a36Sopenharmony_ciinteracting with the latter group of data structures can be hard to parse 2862306a36Sopenharmony_cifor kernel programmers without previous BPF experience. 2962306a36Sopenharmony_ci 3062306a36Sopenharmony_ciLuckily, some restrictions which necessitated the use of BPF map semantics are 3162306a36Sopenharmony_cino longer relevant. With the introduction of kfuncs, kptrs, and the any-context 3262306a36Sopenharmony_ciBPF allocator, it is now possible to implement BPF data structures whose API 3362306a36Sopenharmony_ciand semantics more closely match those exposed to the rest of the kernel. 3462306a36Sopenharmony_ci 3562306a36Sopenharmony_ciTwo such data structures - linked_list and rbtree - have many verification 3662306a36Sopenharmony_cidetails in common. Because both have "root"s ("head" for linked_list) and 3762306a36Sopenharmony_ci"node"s, the verifier code and this document refer to common functionality 3862306a36Sopenharmony_cias "graph_api", "graph_root", "graph_node", etc. 3962306a36Sopenharmony_ci 4062306a36Sopenharmony_ciUnless otherwise stated, examples and semantics below apply to both graph data 4162306a36Sopenharmony_cistructures. 4262306a36Sopenharmony_ci 4362306a36Sopenharmony_ciUnstable API 4462306a36Sopenharmony_ci------------ 4562306a36Sopenharmony_ci 4662306a36Sopenharmony_ciData structures implemented using the BPF map API have historically used BPF 4762306a36Sopenharmony_cihelper functions - either standard map API helpers like ``bpf_map_update_elem`` 4862306a36Sopenharmony_cior map-specific helpers. The new-style graph data structures instead use kfuncs 4962306a36Sopenharmony_cito define their manipulation helpers. Because there are no stability guarantees 5062306a36Sopenharmony_cifor kfuncs, the API and semantics for these data structures can be evolved in 5162306a36Sopenharmony_cia way that breaks backwards compatibility if necessary. 5262306a36Sopenharmony_ci 5362306a36Sopenharmony_ciRoot and node types for the new data structures are opaquely defined in the 5462306a36Sopenharmony_ci``uapi/linux/bpf.h`` header. 5562306a36Sopenharmony_ci 5662306a36Sopenharmony_ciLocking 5762306a36Sopenharmony_ci------- 5862306a36Sopenharmony_ci 5962306a36Sopenharmony_ciThe new-style data structures are intrusive and are defined similarly to their 6062306a36Sopenharmony_civanilla kernel counterparts: 6162306a36Sopenharmony_ci 6262306a36Sopenharmony_ci.. code-block:: c 6362306a36Sopenharmony_ci 6462306a36Sopenharmony_ci struct node_data { 6562306a36Sopenharmony_ci long key; 6662306a36Sopenharmony_ci long data; 6762306a36Sopenharmony_ci struct bpf_rb_node node; 6862306a36Sopenharmony_ci }; 6962306a36Sopenharmony_ci 7062306a36Sopenharmony_ci struct bpf_spin_lock glock; 7162306a36Sopenharmony_ci struct bpf_rb_root groot __contains(node_data, node); 7262306a36Sopenharmony_ci 7362306a36Sopenharmony_ciThe "root" type for both linked_list and rbtree expects to be in a map_value 7462306a36Sopenharmony_ciwhich also contains a ``bpf_spin_lock`` - in the above example both global 7562306a36Sopenharmony_civariables are placed in a single-value arraymap. The verifier considers this 7662306a36Sopenharmony_cispin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in 7762306a36Sopenharmony_cithe same map_value and will enforce that the correct lock is held when 7862306a36Sopenharmony_civerifying BPF programs that manipulate the tree. Since this lock checking 7962306a36Sopenharmony_cihappens at verification time, there is no runtime penalty. 8062306a36Sopenharmony_ci 8162306a36Sopenharmony_ciNon-owning references 8262306a36Sopenharmony_ci--------------------- 8362306a36Sopenharmony_ci 8462306a36Sopenharmony_ci**Motivation** 8562306a36Sopenharmony_ci 8662306a36Sopenharmony_ciConsider the following BPF code: 8762306a36Sopenharmony_ci 8862306a36Sopenharmony_ci.. code-block:: c 8962306a36Sopenharmony_ci 9062306a36Sopenharmony_ci struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ 9162306a36Sopenharmony_ci 9262306a36Sopenharmony_ci bpf_spin_lock(&lock); 9362306a36Sopenharmony_ci 9462306a36Sopenharmony_ci bpf_rbtree_add(&tree, n); /* PASSED */ 9562306a36Sopenharmony_ci 9662306a36Sopenharmony_ci bpf_spin_unlock(&lock); 9762306a36Sopenharmony_ci 9862306a36Sopenharmony_ciFrom the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` 9962306a36Sopenharmony_cihas type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of 10062306a36Sopenharmony_ci``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the 10162306a36Sopenharmony_ciprogram has ownership of the pointee's (object pointed to by ``n``) lifetime. 10262306a36Sopenharmony_ciThe BPF program must pass off ownership before exiting - either via 10362306a36Sopenharmony_ci``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with 10462306a36Sopenharmony_ci``bpf_rbtree_add``. 10562306a36Sopenharmony_ci 10662306a36Sopenharmony_ci(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where 10762306a36Sopenharmony_ci"ownership is acquired" and "ownership is passed", respectively) 10862306a36Sopenharmony_ci 10962306a36Sopenharmony_ciWhat should the verifier do with ``n`` after ownership is passed off? If the 11062306a36Sopenharmony_ciobject was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier 11162306a36Sopenharmony_cishould reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as 11262306a36Sopenharmony_cithe object is no longer valid. The underlying memory may have been reused for 11362306a36Sopenharmony_cisome other allocation, unmapped, etc. 11462306a36Sopenharmony_ci 11562306a36Sopenharmony_ciWhen ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less 11662306a36Sopenharmony_ciobvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, 11762306a36Sopenharmony_cibut that would result in programs with useful, common coding patterns being 11862306a36Sopenharmony_cirejected, e.g.: 11962306a36Sopenharmony_ci 12062306a36Sopenharmony_ci.. code-block:: c 12162306a36Sopenharmony_ci 12262306a36Sopenharmony_ci int x; 12362306a36Sopenharmony_ci struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ 12462306a36Sopenharmony_ci 12562306a36Sopenharmony_ci bpf_spin_lock(&lock); 12662306a36Sopenharmony_ci 12762306a36Sopenharmony_ci bpf_rbtree_add(&tree, n); /* PASSED */ 12862306a36Sopenharmony_ci x = n->data; 12962306a36Sopenharmony_ci n->data = 42; 13062306a36Sopenharmony_ci 13162306a36Sopenharmony_ci bpf_spin_unlock(&lock); 13262306a36Sopenharmony_ci 13362306a36Sopenharmony_ciBoth the read from and write to ``n->data`` would be rejected. The verifier 13462306a36Sopenharmony_cican do better, though, by taking advantage of two details: 13562306a36Sopenharmony_ci 13662306a36Sopenharmony_ci * Graph data structure APIs can only be used when the ``bpf_spin_lock`` 13762306a36Sopenharmony_ci associated with the graph root is held 13862306a36Sopenharmony_ci 13962306a36Sopenharmony_ci * Both graph data structures have pointer stability 14062306a36Sopenharmony_ci 14162306a36Sopenharmony_ci * Because graph nodes are allocated with ``bpf_obj_new`` and 14262306a36Sopenharmony_ci adding / removing from the root involves fiddling with the 14362306a36Sopenharmony_ci ``bpf_{list,rb}_node`` field of the node struct, a graph node will 14462306a36Sopenharmony_ci remain at the same address after either operation. 14562306a36Sopenharmony_ci 14662306a36Sopenharmony_ciBecause the associated ``bpf_spin_lock`` must be held by any program adding 14762306a36Sopenharmony_cior removing, if we're in the critical section bounded by that lock, we know 14862306a36Sopenharmony_cithat no other program can add or remove until the end of the critical section. 14962306a36Sopenharmony_ciThis combined with pointer stability means that, until the critical section 15062306a36Sopenharmony_ciends, we can safely access the graph node through ``n`` even after it was used 15162306a36Sopenharmony_cito pass ownership. 15262306a36Sopenharmony_ci 15362306a36Sopenharmony_ciThe verifier considers such a reference a *non-owning reference*. The ref 15462306a36Sopenharmony_cireturned by ``bpf_obj_new`` is accordingly considered an *owning reference*. 15562306a36Sopenharmony_ciBoth terms currently only have meaning in the context of graph nodes and API. 15662306a36Sopenharmony_ci 15762306a36Sopenharmony_ci**Details** 15862306a36Sopenharmony_ci 15962306a36Sopenharmony_ciLet's enumerate the properties of both types of references. 16062306a36Sopenharmony_ci 16162306a36Sopenharmony_ci*owning reference* 16262306a36Sopenharmony_ci 16362306a36Sopenharmony_ci * This reference controls the lifetime of the pointee 16462306a36Sopenharmony_ci 16562306a36Sopenharmony_ci * Ownership of pointee must be 'released' by passing it to some graph API 16662306a36Sopenharmony_ci kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee 16762306a36Sopenharmony_ci 16862306a36Sopenharmony_ci * If not released before program ends, verifier considers program invalid 16962306a36Sopenharmony_ci 17062306a36Sopenharmony_ci * Access to the pointee's memory will not page fault 17162306a36Sopenharmony_ci 17262306a36Sopenharmony_ci*non-owning reference* 17362306a36Sopenharmony_ci 17462306a36Sopenharmony_ci * This reference does not own the pointee 17562306a36Sopenharmony_ci 17662306a36Sopenharmony_ci * It cannot be used to add the graph node to a graph root, nor ``free``'d via 17762306a36Sopenharmony_ci ``bpf_obj_drop`` 17862306a36Sopenharmony_ci 17962306a36Sopenharmony_ci * No explicit control of lifetime, but can infer valid lifetime based on 18062306a36Sopenharmony_ci non-owning ref existence (see explanation below) 18162306a36Sopenharmony_ci 18262306a36Sopenharmony_ci * Access to the pointee's memory will not page fault 18362306a36Sopenharmony_ci 18462306a36Sopenharmony_ciFrom verifier's perspective non-owning references can only exist 18562306a36Sopenharmony_cibetween spin_lock and spin_unlock. Why? After spin_unlock another program 18662306a36Sopenharmony_cican do arbitrary operations on the data structure like removing and ``free``-ing 18762306a36Sopenharmony_civia bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, 18862306a36Sopenharmony_ci``free``'d, and reused via bpf_obj_new would point to an entirely different thing. 18962306a36Sopenharmony_ciOr the memory could go away. 19062306a36Sopenharmony_ci 19162306a36Sopenharmony_ciTo prevent this logic violation all non-owning references are invalidated by the 19262306a36Sopenharmony_civerifier after a critical section ends. This is necessary to ensure the "will 19362306a36Sopenharmony_cinot page fault" property of non-owning references. So if the verifier hasn't 19462306a36Sopenharmony_ciinvalidated a non-owning ref, accessing it will not page fault. 19562306a36Sopenharmony_ci 19662306a36Sopenharmony_ciCurrently ``bpf_obj_drop`` is not allowed in the critical section, so 19762306a36Sopenharmony_ciif there's a valid non-owning ref, we must be in a critical section, and can 19862306a36Sopenharmony_ciconclude that the ref's memory hasn't been dropped-and- ``free``'d or 19962306a36Sopenharmony_cidropped-and-reused. 20062306a36Sopenharmony_ci 20162306a36Sopenharmony_ciAny reference to a node that is in an rbtree _must_ be non-owning, since 20262306a36Sopenharmony_cithe tree has control of the pointee's lifetime. Similarly, any ref to a node 20362306a36Sopenharmony_cithat isn't in rbtree _must_ be owning. This results in a nice property: 20462306a36Sopenharmony_cigraph API add / remove implementations don't need to check if a node 20562306a36Sopenharmony_cihas already been added (or already removed), as the ownership model 20662306a36Sopenharmony_ciallows the verifier to prevent such a state from being valid by simply checking 20762306a36Sopenharmony_citypes. 20862306a36Sopenharmony_ci 20962306a36Sopenharmony_ciHowever, pointer aliasing poses an issue for the above "nice property". 21062306a36Sopenharmony_ciConsider the following example: 21162306a36Sopenharmony_ci 21262306a36Sopenharmony_ci.. code-block:: c 21362306a36Sopenharmony_ci 21462306a36Sopenharmony_ci struct node_data *n, *m, *o, *p; 21562306a36Sopenharmony_ci n = bpf_obj_new(typeof(*n)); /* 1 */ 21662306a36Sopenharmony_ci 21762306a36Sopenharmony_ci bpf_spin_lock(&lock); 21862306a36Sopenharmony_ci 21962306a36Sopenharmony_ci bpf_rbtree_add(&tree, n); /* 2 */ 22062306a36Sopenharmony_ci m = bpf_rbtree_first(&tree); /* 3 */ 22162306a36Sopenharmony_ci 22262306a36Sopenharmony_ci o = bpf_rbtree_remove(&tree, n); /* 4 */ 22362306a36Sopenharmony_ci p = bpf_rbtree_remove(&tree, m); /* 5 */ 22462306a36Sopenharmony_ci 22562306a36Sopenharmony_ci bpf_spin_unlock(&lock); 22662306a36Sopenharmony_ci 22762306a36Sopenharmony_ci bpf_obj_drop(o); 22862306a36Sopenharmony_ci bpf_obj_drop(p); /* 6 */ 22962306a36Sopenharmony_ci 23062306a36Sopenharmony_ciAssume the tree is empty before this program runs. If we track verifier state 23162306a36Sopenharmony_cichanges here using numbers in above comments: 23262306a36Sopenharmony_ci 23362306a36Sopenharmony_ci 1) n is an owning reference 23462306a36Sopenharmony_ci 23562306a36Sopenharmony_ci 2) n is a non-owning reference, it's been added to the tree 23662306a36Sopenharmony_ci 23762306a36Sopenharmony_ci 3) n and m are non-owning references, they both point to the same node 23862306a36Sopenharmony_ci 23962306a36Sopenharmony_ci 4) o is an owning reference, n and m non-owning, all point to same node 24062306a36Sopenharmony_ci 24162306a36Sopenharmony_ci 5) o and p are owning, n and m non-owning, all point to the same node 24262306a36Sopenharmony_ci 24362306a36Sopenharmony_ci 6) a double-free has occurred, since o and p point to same node and o was 24462306a36Sopenharmony_ci ``free``'d in previous statement 24562306a36Sopenharmony_ci 24662306a36Sopenharmony_ciStates 4 and 5 violate our "nice property", as there are non-owning refs to 24762306a36Sopenharmony_cia node which is not in an rbtree. Statement 5 will try to remove a node which 24862306a36Sopenharmony_cihas already been removed as a result of this violation. State 6 is a dangerous 24962306a36Sopenharmony_cidouble-free. 25062306a36Sopenharmony_ci 25162306a36Sopenharmony_ciAt a minimum we should prevent state 6 from being possible. If we can't also 25262306a36Sopenharmony_ciprevent state 5 then we must abandon our "nice property" and check whether a 25362306a36Sopenharmony_cinode has already been removed at runtime. 25462306a36Sopenharmony_ci 25562306a36Sopenharmony_ciWe prevent both by generalizing the "invalidate non-owning references" behavior 25662306a36Sopenharmony_ciof ``bpf_spin_unlock`` and doing similar invalidation after 25762306a36Sopenharmony_ci``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: 25862306a36Sopenharmony_ci 25962306a36Sopenharmony_ci * takes an arbitrary node argument 26062306a36Sopenharmony_ci 26162306a36Sopenharmony_ci * removes it from the data structure 26262306a36Sopenharmony_ci 26362306a36Sopenharmony_ci * returns an owning reference to the removed node 26462306a36Sopenharmony_ci 26562306a36Sopenharmony_ciMay result in a state where some other non-owning reference points to the same 26662306a36Sopenharmony_cinode. So ``remove``-type kfuncs must be considered a non-owning reference 26762306a36Sopenharmony_ciinvalidation point as well. 268