.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If the device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdev().
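
The private data layout described earlier in this section can be pictured
with a small userspace model. This is only a sketch, not kernel code:
``toy_netdev``, ``toy_alloc()``, ``toy_priv()`` and ``toy_free()`` are
hypothetical names, and the 32 byte alignment is an assumption of the model.
It shows the private area living in the same allocation as the device
structure, so a single free releases both. Anything the private area merely
points to would still have to be released separately.

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdlib.h>

  #define TOY_ALIGN 32u  /* assumed alignment of the private area */
  #define TOY_ALIGN_UP(x) (((x) + (TOY_ALIGN - 1)) & ~(size_t)(TOY_ALIGN - 1))

  /* Hypothetical stand-in for struct net_device. */
  struct toy_netdev {
    char name[16];
    size_t priv_len;
  };

  /* One allocation holds the device structure followed by the private area. */
  static struct toy_netdev *toy_alloc(size_t sizeof_priv)
  {
    size_t total = TOY_ALIGN_UP(sizeof(struct toy_netdev)) + sizeof_priv;
    struct toy_netdev *dev = calloc(1, total);

    if (dev)
      dev->priv_len = sizeof_priv;
    return dev;
  }

  /* Model of netdev_priv(): the private area starts after the aligned struct. */
  static void *toy_priv(struct toy_netdev *dev)
  {
    assert(dev != NULL);
    return (char *)dev + TOY_ALIGN_UP(sizeof(struct toy_netdev));
  }

  /* Freeing the device releases the embedded private area as well. */
  static void toy_free(struct toy_netdev *dev)
  {
    free(dev);
  }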

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of struct net_device
in contexts where ``rtnl_lock`` is not held (e.g. driver probe and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */

    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    struct net_device *dev;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes the core call free_netdev()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice(),
so if register_netdevice() fails the driver is responsible for calling
free_netdev().

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of the netdev (de)registration process
happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.
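
The ownership rules above can be summarized in a small userspace model.
This is a sketch under stated assumptions, not the core's implementation:
``toy_dev``, ``toy_register()``, ``toy_unregister()`` and the two counters
are hypothetical, and reference counting is reduced to "all references are
already gone". The model only captures who runs the destructor and who
frees the device in the failure and success cases.

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdlib.h>

  /* Hypothetical stand-in for struct net_device. */
  struct toy_dev {
    bool needs_free;                            /* models needs_free_netdev */
    void (*priv_destructor)(struct toy_dev *);  /* models priv_destructor */
    bool registered;
  };

  static int toy_destructor_calls;  /* how often the destructor ran */
  static int toy_core_frees;        /* how often the model core freed a device */

  static void toy_destructor(struct toy_dev *dev)
  {
    (void)dev;
    toy_destructor_calls++;
  }

  /* Model of register_netdevice(): on failure the destructor runs, but the
   * device is NOT freed, even if needs_free is set. The caller must free it.
   */
  static int toy_register(struct toy_dev *dev, bool fail)
  {
    if (fail) {
      if (dev->priv_destructor)
        dev->priv_destructor(dev);
      return -1;
    }
    dev->registered = true;
    return 0;
  }

  /* Model of unregister_netdevice() once all references are gone: the
   * destructor runs, then the model core frees the device if asked to.
   */
  static void toy_unregister(struct toy_dev *dev)
  {
    assert(dev->registered);
    dev->registered = false;
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    if (dev->needs_free) {
      toy_core_frees++;
      free(dev);
    }
  }

In this model a failed ``toy_register()`` leaves the free to the caller,
while a successful registration followed by ``toy_unregister()`` needs no
explicit free from the caller.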

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system, ``.ndo_uninit``
runs during de-registration after the device is closed but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transfer Unit. The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. For example, on Ethernet with the standard MTU of
1500 bytes, the actual skb will contain up to 1514 bytes because of
the Ethernet header. Devices should allow for the 4 byte VLAN header
as well.

Segmentation Offload (GSO, TSO) is an exception to this rule.
The upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize
packets is preferred.


struct net_device synchronization rules
=======================================
ndo_open:
        Synchronization: rtnl_lock() semaphore.
        Context: process

ndo_stop:
        Synchronization: rtnl_lock() semaphore.
        Context: process
        Note: netif_running() is guaranteed false

ndo_do_ioctl:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        This is only called by network subsystems internally,
        not by user space calling ioctl as it was before
        linux-5.14.

ndo_siocbond:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        Used by the bonding driver for the SIOCBOND family of
        ioctl commands.

ndo_siocwandev:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        Used by the drivers/net/wan framework to handle
        the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
        Synchronization: rtnl_lock() semaphore.
        Context: process

        This is used to implement SIOCDEVPRIVATE ioctl helpers.
        These should not be added to new drivers, so don't use.

ndo_eth_ioctl:
        Synchronization: rtnl_lock() semaphore.
        Context: process

ndo_get_stats:
        Synchronization: rtnl_lock() semaphore, dev_base_lock rwlock, or RCU.
        Context: atomic (can't sleep under rwlock or RCU)

ndo_start_xmit:
        Synchronization: __netif_tx_lock spinlock.

        When the driver sets NETIF_F_LLTX in dev->features this will be
        called without holding netif_tx_lock. In this case the driver
        has to lock by itself when needed.
        The locking there should also properly protect against
        set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated.
        Don't use it for new drivers.

        Context: Process with BHs disabled or BH (timer),
                 will be called with interrupts disabled by netconsole.

        Return codes:

        * NETDEV_TX_OK everything ok.
        * NETDEV_TX_BUSY Cannot transmit packet, try later
          Usually a bug, means queue start/stop flow control is broken in
          the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
        Synchronization: netif_tx_lock spinlock; all TX queues frozen.
        Context: BHs disabled
        Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
        Synchronization: netif_addr_lock spinlock.
        Context: BHs disabled

struct napi_struct synchronization rules
========================================
napi->poll:
        Synchronization:
                NAPI_STATE_SCHED bit in napi->state. Device
                driver's ndo_stop method will invoke napi_disable() on
                all NAPI instances which will do a sleeping poll on the
                NAPI_STATE_SCHED napi->state bit, waiting for all pending
                NAPI activity to cease.

        Context:
                 softirq
                 will be called with interrupts disabled by netconsole.
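
The role of the NAPI_STATE_SCHED bit described above can be illustrated with
a userspace model built on C11 atomics. This is a simplified sketch, not the
kernel's NAPI code: ``toy_napi`` and the ``toy_*()`` helpers are hypothetical,
and the disable path is reduced to a single attempt, whereas the real
napi_disable() sleeps and retries until pending activity has ceased. The idea
shown is that whoever flips the bit from 0 to 1 owns the poll, and the
disable path wins by claiming the same bit and never releasing it.

.. code-block:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  #define TOY_STATE_SCHED 1u  /* models the NAPI_STATE_SCHED bit */

  /* Hypothetical stand-in for struct napi_struct. */
  struct toy_napi {
    atomic_uint state;
  };

  static void toy_init(struct toy_napi *n)
  {
    atomic_init(&n->state, 0u);
  }

  /* Claim the SCHED bit; only the caller that flips it 0 -> 1 owns the poll. */
  static bool toy_schedule(struct toy_napi *n)
  {
    unsigned int old = atomic_fetch_or(&n->state, TOY_STATE_SCHED);

    return !(old & TOY_STATE_SCHED);
  }

  /* Poll finished: release the bit so the instance can be scheduled again. */
  static void toy_complete(struct toy_napi *n)
  {
    unsigned int old = atomic_fetch_and(&n->state, ~TOY_STATE_SCHED);

    assert(old & TOY_STATE_SCHED);
    (void)old;
  }

  /* One attempt of the disable path: it succeeds only when no poll owns the
   * bit, and then keeps the bit set so later scheduling attempts fail.
   */
  static bool toy_try_disable(struct toy_napi *n)
  {
    return toy_schedule(n);
  }

While a poll is in flight ``toy_try_disable()`` fails, which corresponds to
the point where the real disable path would keep waiting.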