.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.

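The split between co-allocated private data and separately attached data
can be illustrated with a small userspace model. This is only a sketch:
``model_netdev``, ``model_alloc_netdev()``, ``model_netdev_priv()``,
``model_free_netdev()``, ``MODEL_ALIGN`` and ``struct my_priv`` are all
hypothetical names invented for the illustration, and the layout is
assumed rather than copied from the kernel.

.. code-block:: c

  #include <stddef.h>
  #include <stdlib.h>

  /* Assumed alignment of the private area; illustrative value only. */
  #define MODEL_ALIGN 32

  /* Userspace stand-in, NOT the kernel's struct net_device. */
  struct model_netdev {
    char name[16];
    size_t priv_size;
  };

  /* Driver private state living in the space reserved at allocation. */
  struct my_priv {
    int *stats;   /* separately allocated, the driver must free it */
    int irq;
  };

  /* Offset of the private area: the device struct rounded up. */
  static size_t model_priv_offset(void)
  {
    return (sizeof(struct model_netdev) + MODEL_ALIGN - 1) &
           ~(size_t)(MODEL_ALIGN - 1);
  }

  /* One allocation holds both the device and its private area. */
  static struct model_netdev *model_alloc_netdev(size_t priv_size)
  {
    struct model_netdev *dev = calloc(1, model_priv_offset() + priv_size);

    if (dev)
      dev->priv_size = priv_size;
    return dev;
  }

  static void *model_netdev_priv(struct model_netdev *dev)
  {
    return (char *)dev + model_priv_offset();
  }

  /* Frees the device and the private area, but nothing attached to it. */
  static void model_free_netdev(struct model_netdev *dev)
  {
    free(dev);
  }

In this model a caller that stores a ``malloc()``-ed buffer in
``my_priv.stats`` has to ``free()`` it before calling
``model_free_netdev()``, since only the single combined allocation is
released with the device.
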
There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct
net_device in a context where ``rtnl_lock`` is not held (e.g. driver probe
and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */

    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(). It will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the
private netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration
process happen after ``rtnl_lock`` is released, therefore in those cases
free_netdev() will defer some of the processing until ``rtnl_lock`` is
released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system. ``.ndo_uninit``
runs during de-registration, after the device is closed but while other
subsystems may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transfer Unit. The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. For example, on Ethernet, if the standard MTU of 1500
bytes is used, the actual skb will contain up to 1514 bytes because of
the Ethernet header. Devices should allow for the 4 byte VLAN header
as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

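As a rough illustration of how one large buffer maps onto MTU-sized
packets, the sketch below counts the resulting segments. It is a
simplified userspace model, not the kernel's GSO code: the helper names
are hypothetical, and it assumes TCP over IPv4 with no IP or TCP options,
so that 40 bytes of each MTU-sized packet are headers.

.. code-block:: c

  #include <stddef.h>

  /* Assumed header sizes: IPv4 and TCP, both without options. */
  #define MODEL_IPV4_HLEN 20u
  #define MODEL_TCP_HLEN  20u

  /* Payload bytes that fit in one packet for a given MTU. */
  static unsigned int model_mss(unsigned int mtu)
  {
    if (mtu <= MODEL_IPV4_HLEN + MODEL_TCP_HLEN)
      return 0;
    return mtu - MODEL_IPV4_HLEN - MODEL_TCP_HLEN;
  }

  /* Number of packets needed to carry payload_len bytes (rounded up). */
  static unsigned int model_gso_segs(size_t payload_len, unsigned int mtu)
  {
    unsigned int mss = model_mss(mtu);

    if (!mss)
      return 0;
    return (unsigned int)((payload_len + mss - 1) / mss);
  }

Under these assumptions a 1500 byte MTU leaves 1460 bytes of payload per
packet, so a 14600 byte buffer becomes 10 packets and one extra byte of
payload adds an 11th.
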
MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize packets
is preferred.


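The frame size arithmetic in this section can be written down as a small
sketch. The helpers and ``MODEL_*`` constants are hypothetical names
chosen for the illustration; the values 14 and 4 are the Ethernet header
and VLAN tag sizes quoted above.

.. code-block:: c

  #include <stdbool.h>

  #define MODEL_ETH_HLEN  14u  /* Ethernet header */
  #define MODEL_VLAN_HLEN 4u   /* single VLAN tag */

  /* Largest untagged frame handed to the device for transmit. */
  static unsigned int model_max_tx_frame(unsigned int mtu)
  {
    return mtu + MODEL_ETH_HLEN;
  }

  /* Largest frame the device should accept, allowing one VLAN tag. */
  static unsigned int model_max_rx_frame(unsigned int mtu)
  {
    return mtu + MODEL_ETH_HLEN + MODEL_VLAN_HLEN;
  }

  /* Preferred policy for oversize receive frames: drop them. */
  static bool model_rx_should_drop(unsigned int frame_len, unsigned int mtu)
  {
    return frame_len > model_max_rx_frame(mtu);
  }

With an MTU of 1500 this gives 1514 bytes for transmit and 1518 bytes for
receive, and a 1519 byte frame would be dropped.
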
struct net_device synchronization rules
=======================================
ndo_open:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_stop:
	Synchronization: rtnl_lock() semaphore.
	Context: process
	Note: netif_running() is guaranteed false

ndo_do_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is only called by network subsystems internally,
	not by user space calling ioctl as it was before
	linux-5.14.

ndo_siocbond:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the bonding driver for the SIOCBOND family of
	ioctl commands.

ndo_siocwandev:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	Used by the drivers/net/wan framework to handle
	the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
	Synchronization: rtnl_lock() semaphore.
	Context: process

	This is used to implement SIOCDEVPRIVATE ioctl helpers.
	These should not be added to new drivers, so don't use them.

ndo_eth_ioctl:
	Synchronization: rtnl_lock() semaphore.
	Context: process

ndo_get_stats:
	Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
	Context: atomic (can't sleep under rwlock or RCU)

ndo_start_xmit:
	Synchronization: __netif_tx_lock spinlock.

	When the driver sets NETIF_F_LLTX in dev->features this will be
	called without holding netif_tx_lock. In this case the driver
	has to lock by itself when needed.
	The locking there should also properly protect against
	set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
	Don't use it for new drivers.

	Context: Process with BHs disabled or BH (timer),
		 will be called with interrupts disabled by netconsole.

	Return codes:

	* NETDEV_TX_OK: everything ok.
	* NETDEV_TX_BUSY: cannot transmit packet, try later.
	  Usually a bug, means queue start/stop flow control is broken in
	  the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
	Synchronization: netif_tx_lock spinlock; all TX queues frozen.
	Context: BHs disabled
	Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
	Synchronization: netif_addr_lock spinlock.
	Context: BHs disabled

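The note on NETDEV_TX_BUSY under ndo_start_xmit says that returning it
usually points at broken queue start/stop flow control. The userspace
model below sketches what working flow control looks like: the transmit
routine stops the queue as soon as the ring becomes full, so the stack
never calls it again until a completion frees a slot. Everything here
(``model_txq``, ``model_start_xmit()``, ``model_tx_complete()``,
``model_stack_send()``, ``MODEL_RING_SIZE``) is hypothetical and only
mimics the behaviour described above; it does not use kernel APIs.

.. code-block:: c

  #include <stdbool.h>

  #define MODEL_RING_SIZE 4u  /* assumed number of TX descriptors */

  enum model_tx_status { MODEL_TX_OK, MODEL_TX_BUSY };

  struct model_txq {
    unsigned int in_flight;  /* packets currently in the ring */
    bool stopped;            /* queue stopped, stack must not transmit */
  };

  /* Model of a driver transmit routine. */
  static enum model_tx_status model_start_xmit(struct model_txq *q)
  {
    if (q->in_flight == MODEL_RING_SIZE) {
      /* Flow control failed: report busy, packet is NOT queued. */
      q->stopped = true;
      return MODEL_TX_BUSY;
    }

    q->in_flight++;
    /* Ring just became full: stop the queue before the next call. */
    if (q->in_flight == MODEL_RING_SIZE)
      q->stopped = true;
    return MODEL_TX_OK;
  }

  /* Model of TX completion: free one slot and restart the queue. */
  static void model_tx_complete(struct model_txq *q)
  {
    if (q->in_flight > 0)
      q->in_flight--;
    if (q->stopped && q->in_flight < MODEL_RING_SIZE)
      q->stopped = false;
  }

  /* Model of the stack: only transmits while the queue is running.
   * Returns false if the packet was held back by the stopped queue.
   */
  static bool model_stack_send(struct model_txq *q,
                               enum model_tx_status *status)
  {
    if (q->stopped)
      return false;
    *status = model_start_xmit(q);
    return true;
  }

In this model the busy branch is never reached through
``model_stack_send()``, which matches the statement that a busy return
normally indicates a driver bug rather than an expected condition.
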
struct napi_struct synchronization rules
========================================
napi->poll:
	Synchronization:
		NAPI_STATE_SCHED bit in napi->state.  Device
		driver's ndo_stop method will invoke napi_disable() on
		all NAPI instances which will do a sleeping poll on the
		NAPI_STATE_SCHED napi->state bit, waiting for all pending
		NAPI activity to cease.

	Context:
		 softirq
		 will be called with interrupts disabled by netconsole.

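The role of the NAPI_STATE_SCHED bit as an ownership flag can be sketched
in userspace with C11 atomics. This is a loose model under stated
assumptions, not the kernel implementation: all ``model_*`` names are
hypothetical, and the disable path is reduced to a single non-sleeping
attempt, where the text above describes a sleeping poll that retries
until pending activity has ceased.

.. code-block:: c

  #include <stdatomic.h>
  #include <stdbool.h>

  #define MODEL_NAPI_STATE_SCHED 0x1u

  struct model_napi {
    atomic_uint state;
  };

  static void model_napi_init(struct model_napi *n)
  {
    atomic_init(&n->state, 0u);
  }

  /* Try to take ownership; true only if the bit was previously clear. */
  static bool model_napi_schedule_prep(struct model_napi *n)
  {
    unsigned int old = atomic_fetch_or(&n->state, MODEL_NAPI_STATE_SCHED);

    return !(old & MODEL_NAPI_STATE_SCHED);
  }

  /* Poll finished: release ownership so it can be scheduled again. */
  static void model_napi_complete(struct model_napi *n)
  {
    atomic_fetch_and(&n->state, ~MODEL_NAPI_STATE_SCHED);
  }

  /* One disable attempt: succeeds once no poll owns the bit, and then
   * keeps the bit set so that no further poll can be scheduled.
   */
  static bool model_napi_try_disable(struct model_napi *n)
  {
    return model_napi_schedule_prep(n);
  }

While a poll owns the bit, both a second schedule attempt and the disable
attempt fail. After ``model_napi_complete()`` the disable attempt
succeeds and later schedule attempts are refused.
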