Linux Network Device Driver - ianchen0119/Introduce-to-5GC GitHub Wiki

1. Introduction to Devices

In the Linux kernel, accessing or operating a device requires that the device's driver be installed first.

As the figure above shows, Linux divides devices into two major classes:

  • Block Devices
  • Character Devices

2. Types of Devices

As mentioned above, Linux divides devices into two major classes: block devices and character devices.

2.1 Block Devices

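The main difference between the two is that block devices transfer data in fixed-size blocks, whereas character devices accept data of variable length. From user space, both are operated in the same way.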

2.2 Character Devices

A character device can be viewed as a stream of data that is accessed much like a file. A character device driver therefore has to implement at least the following operations:

  • open
  • close
  • read
  • write

In some special cases the driver also has to provide an ioctl operation. Common examples are text consoles and serial ports, both of which are stream-structured.
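Seen from user space, these driver operations are reached through ordinary file system calls. The following is a minimal sketch, assuming a Linux system that provides the standard character devices /dev/zero and /dev/null; the helper names read_from_zero() and write_to_null() are hypothetical and exist only for this illustration.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `len` bytes from the character device
 * /dev/zero into `buf`. Returns the number of bytes read, or -1 on error. */
ssize_t read_from_zero(unsigned char *buf, size_t len)
{
    int fd = open("/dev/zero", O_RDONLY);  /* reaches the driver's open */
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len);        /* reaches the driver's read */
    close(fd);                             /* reaches the driver's close */
    return n;
}

/* Hypothetical helper: write a string to the character device /dev/null.
 * Returns the number of bytes accepted, or -1 on error. */
ssize_t write_to_null(const char *msg)
{
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, msg, strlen(msg)); /* reaches the driver's write */
    close(fd);
    return n;
}
```

Each system call ends up in the corresponding operation that the driver registered, which is why a character device can be handled like a file.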

2.3 Check the sample code!

/* In sample.c */
# include <linux/module.h>
# include <linux/kernel.h>
# include <linux/init.h>

static int __init dev_drv_init(void)
{
    pr_info("Initializing the module.\n");
    return 0;
}

static void __exit dev_drv_exit(void)
{
    pr_info("Unloading the module.\n");
    return;
}

module_init(dev_drv_init);
module_exit(dev_drv_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ian Chen");

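The code above is a very simple kernel module. When the module is loaded, it prints the message "Initializing the module." (use the dmesg tool to see it). Conversely, when the module is removed, it prints "Unloading the module."

  • module_init() registers the module's initializer.
  • module_exit() registers the module's cleanup function. When a user removes the module with rmmod, the kernel calls this function (exposed as cleanup_module()) so that the module can release its resources before its code is removed from kernel space.
  • MODULE_LICENSE() is a special macro that declares the module's license.
  • MODULE_AUTHOR() declares the module's developer.
  • pr_info() writes the message to the kernel log buffer at the KERN_INFO level. Depending on the distribution's logging setup, the message may also be stored in a file such as /var/log/messages.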

2.4 Compile the sample code

Make shortens the test cycle after each code change: it compiles the source code automatically according to a Makefile that we define:

KERNEL_DIR := /lib/modules/$(shell uname -r)/build

MODULE_NAME = sample
obj-m := $(MODULE_NAME).o

all:
	$(MAKE) -C $(KERNEL_DIR) M=$(shell pwd) modules
clean:
	$(MAKE) -C $(KERNEL_DIR) M=$(shell pwd) clean

After creating the Makefile, run the following command to build the module's kernel object file, sample.ko:

$ make all

After modifying the module's source code, use the following command to remove the files produced by the previous build:

$ make clean

3. Network Device Driver Development

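The quickest way to develop a network device driver is to call alloc_netdev() or alloc_etherdev() to have the kernel allocate a network device, and to fill in a net_device_ops structure that registers the device's hook functions.

Once the device driver is loaded into the Linux kernel, the kernel calls the registered hook functions at the appropriate times to drive the network device. The basic network device operations include: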

  • ndo_init() is called once when a network device is registered.
  • ndo_open() is called when a network device transitions to the up state.
  • ndo_validate_addr() is called for testing if the Media Access Control address is valid for the device.
  • ndo_stop() is called when a network device transitions to the down state.
  • ndo_start_xmit() is called when a packet needs to be transmitted. It returns one of:
    • NETDEV_TX_OK: the driver took responsibility for the packet.
    • NETDEV_TX_BUSY: the driver could not accept the packet, and the stack will retry it later.
  • ndo_change_mtu() is called when a user wants to change the Maximum Transmission Unit of a device.
  • ndo_set_mac_address() is called when the Media Access Control address needs to be changed. If this hook is not defined, the MAC address cannot be changed.
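The hook mechanism described above is a table of function pointers that the core calls on the driver's behalf. The following user-space sketch models that idea only; it is not kernel code, and every toy_ name (types, fields, and functions) is a hypothetical stand-in invented for this illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for NETDEV_TX_OK / NETDEV_TX_BUSY. */
enum toy_tx { TOY_TX_OK = 0, TOY_TX_BUSY = 1 };

struct toy_dev;

/* Hook table, playing the role of net_device_ops. */
struct toy_dev_ops {
    int (*open)(struct toy_dev *dev);
    int (*stop)(struct toy_dev *dev);
    enum toy_tx (*start_xmit)(struct toy_dev *dev, size_t len);
};

/* Device state, playing the role of net_device. */
struct toy_dev {
    const struct toy_dev_ops *ops;
    int up;
    unsigned int tx_slots_free;   /* models a finite transmit ring */
    unsigned long tx_packets;
    unsigned long tx_bytes;
};

/* "Driver" side: the hook implementations. */
static int toy_open(struct toy_dev *dev) { dev->up = 1; return 0; }
static int toy_stop(struct toy_dev *dev) { dev->up = 0; return 0; }

static enum toy_tx toy_xmit(struct toy_dev *dev, size_t len)
{
    if (dev->tx_slots_free == 0)
        return TOY_TX_BUSY;       /* no room: the caller must retry later */
    dev->tx_slots_free--;
    dev->tx_packets++;
    dev->tx_bytes += len;
    return TOY_TX_OK;
}

const struct toy_dev_ops toy_ops = {
    .open       = toy_open,
    .stop       = toy_stop,
    .start_xmit = toy_xmit,
};

/* "Core" side: it only knows the hook table, not the driver. */
int toy_core_up(struct toy_dev *dev)   { return dev->ops->open(dev); }
int toy_core_down(struct toy_dev *dev) { return dev->ops->stop(dev); }

enum toy_tx toy_core_send(struct toy_dev *dev, size_t len)
{
    return dev->ops->start_xmit(dev, len);
}
```

Because the core reaches the driver only through the table, a different driver can be plugged in by pointing ops at another set of functions.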

3.1 Dummy module

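The dummy module ships with the Linux kernel. It can be used to build a virtual network environment in which developers run end-to-end tests. For example, on a host in an isolated network, only the loopback address 127.0.0.1 is available for analysis and no other IP address is usable. The dummy interface was created to solve this problem.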

3.1.1 Usage

Use the command below to create a dummy interface named nodelocaldns:

# ip link add: add a virtual link. DEVICE specifies the physical device to operate on.
$ sudo ip link add nodelocaldns type dummy

The following commands assign IP addresses to nodelocaldns:

$ sudo ip addr add 168.254.10.10 dev nodelocaldns
$ sudo ip addr add 10.20.0.10 dev nodelocaldns

Once the dummy interface has been created and has an IP address, we can use ping to send ICMP echo requests to it:

$ ping 10.20.0.10

When the dummy interface is no longer needed, remove it with the following command:

$ sudo ip link delete nodelocaldns
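A program can also check whether an interface such as nodelocaldns currently exists, for example before and after the commands above. The sketch below uses the standard if_nametoindex() call, which returns 0 when no interface has the given name; the wrapper name iface_exists() is a hypothetical helper.

```c
#include <net/if.h>

/* Hypothetical helper: returns 1 if a network interface with this name
 * currently exists on the host, 0 otherwise. */
int iface_exists(const char *name)
{
    /* if_nametoindex() yields the interface index, or 0 if not found. */
    return if_nametoindex(name) != 0;
}
```

Calling iface_exists("nodelocaldns") should report 1 after ip link add and 0 again after ip link delete, assuming the commands succeeded.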

3.1.2 Code Analysis

The source code of the dummy module can be found in the Linux kernel tree (drivers/net/dummy.c).

In this subsection we read the dummy module's source code to learn how to develop a network device driver:

module_init(dummy_init_module);
module_exit(dummy_cleanup_module);
MODULE_LICENSE("GPL");
MODULE_ALIAS_RTNL_LINK(DRV_NAME);

The code above declares the module's initializer, clean-up function, license, and rtnetlink link alias.

Let's get into dummy_init_module() and dummy_cleanup_module():

static struct rtnl_link_ops dummy_link_ops __read_mostly = {
	.kind		= DRV_NAME,
	.setup		= dummy_setup,
	.validate	= dummy_validate,
};

// ...

static int __init dummy_init_module(void)
{
	int i, err = 0;

	down_write(&pernet_ops_rwsem);
	rtnl_lock();
	err = __rtnl_link_register(&dummy_link_ops);
	if (err < 0)
		goto out;

	for (i = 0; i < numdummies && !err; i++) {
		err = dummy_init_one();
		cond_resched();
	}
	if (err < 0)
		__rtnl_link_unregister(&dummy_link_ops);

out:
	rtnl_unlock();
	up_write(&pernet_ops_rwsem);

	return err;
}


static void __exit dummy_cleanup_module(void)
{
	rtnl_link_unregister(&dummy_link_ops);
}
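
  • down_write(&pernet_ops_rwsem) and up_write(&pernet_ops_rwsem): take and release a read-write semaphore to avoid race conditions.
  • rtnl_lock() and rtnl_unlock(): the rtnetlink mutex, which protects the list of network devices.
  • __rtnl_link_register(&dummy_link_ops): registers dummy_link_ops with rtnl_link.
    • Rtnetlink allows the kernel's routing tables to be read or altered.
    • Rtnetlink maintains a linked list of link_ops, in which each node is identified by its kind.
    • A netdev acts as a client of rtnl_link: its rtnl_link_ops pointer associates it with dummy_link_ops.
    • If an rtnl_link_ops is removed from the linked list, the Linux kernel has to handle every device associated with it.
  • The for loop that calls dummy_init_one(): the dummy module lets the user create several dummy devices at once, as set by the numdummies parameter.
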
/* Number of dummy devices to be set up by this module. */
module_param(numdummies, int, 0);
MODULE_PARM_DESC(numdummies, "Number of dummy pseudo devices");

static int __init dummy_init_one(void)
{
    struct net_device *dev_dummy;
    int err;

    dev_dummy = alloc_netdev(0, "dummy%d", NET_NAME_ENUM, dummy_setup);
    if (!dev_dummy)
        return -ENOMEM;

    dev_dummy->rtnl_link_ops = &dummy_link_ops;
    err = register_netdevice(dev_dummy);
    if (err < 0)
        goto err;
    return 0;

err:
    free_netdev(dev_dummy);
    return err;
}

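alloc_netdev() allocates the resources that the network device needs. Once the allocation is done, register_netdevice() registers the device with the kernel.

In addition, the function associates dev_dummy with dummy_link_ops. At this point we have a rough idea of how to allocate a network device and register it with the Linux kernel. Let's go on to the dummy module's setup and validate functions: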

static void dummy_setup(struct net_device *dev)
{
	ether_setup(dev);

	/* Initialize the device structure. */
	dev->netdev_ops = &dummy_netdev_ops;
	dev->ethtool_ops = &dummy_ethtool_ops;
	dev->needs_free_netdev = true;

	/* Fill in device structure with ethernet-generic values. */
	dev->flags |= IFF_NOARP;
	dev->flags &= ~IFF_MULTICAST;
	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
	dev->features	|= NETIF_F_SG | NETIF_F_FRAGLIST;
	dev->features	|= NETIF_F_GSO_SOFTWARE;
	dev->features	|= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
	dev->features	|= NETIF_F_GSO_ENCAP_ALL;
	dev->hw_features |= dev->features;
	dev->hw_enc_features |= dev->features;
	eth_hw_addr_random(dev);

	dev->min_mtu = 0;
	dev->max_mtu = 0;
}

static int dummy_validate(struct nlattr *tb[], struct nlattr *data[],
			  struct netlink_ext_ack *extack)
{
	if (tb[IFLA_ADDRESS]) {
		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
			return -EINVAL;
		if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
			return -EADDRNOTAVAIL;
	}
	return 0;
}

dummy_setup() sets netdev_ops and ethtool_ops and fills in the Ethernet-generic values.

eth_hw_addr_random() generates a random Ethernet (MAC) address for the net device and sets addr_assign_type, so that the assignment state can be read through sysfs and used by user space.

IFLA_ADDRESS is the netlink attribute that carries the interface's L2 (hardware) address. dummy_validate() rejects it if its length is not ETH_ALEN or if it is not a valid Ethernet address.
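To make the address rules concrete, the following user-space sketch re-implements the two ideas involved: an Ethernet address is treated as valid when it is neither all-zero nor multicast, and a random address is made usable by clearing the multicast bit and setting the locally administered bit. The toy_ functions are hypothetical re-implementations for illustration, not the kernel's own helpers, and rand() stands in for the kernel's random source.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_ETH_ALEN 6  /* length of an Ethernet address in bytes */

/* Multicast (and broadcast) addresses have the lowest bit of byte 0 set. */
int toy_is_multicast_ether_addr(const uint8_t *addr)
{
    return addr[0] & 0x01;
}

int toy_is_zero_ether_addr(const uint8_t *addr)
{
    uint8_t acc = 0;
    for (int i = 0; i < TOY_ETH_ALEN; i++)
        acc |= addr[i];
    return acc == 0;
}

/* Valid for a device: not multicast and not 00:00:00:00:00:00. */
int toy_is_valid_ether_addr(const uint8_t *addr)
{
    return !toy_is_multicast_ether_addr(addr) &&
           !toy_is_zero_ether_addr(addr);
}

/* Fill addr with a random unicast, locally administered address. */
void toy_random_ether_addr(uint8_t *addr)
{
    for (int i = 0; i < TOY_ETH_ALEN; i++)
        addr[i] = (uint8_t)(rand() & 0xff);
    addr[0] &= 0xfe;  /* clear the multicast bit */
    addr[0] |= 0x02;  /* set the locally administered bit */
}
```

An address produced by toy_random_ether_addr() always passes toy_is_valid_ether_addr(), because the locally administered bit makes it non-zero and the multicast bit is cleared.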

In addition, dummy_setup() installs dummy_netdev_ops as the operations of the net device:

static const struct net_device_ops dummy_netdev_ops = {
	.ndo_init		= dummy_dev_init,
	.ndo_uninit		= dummy_dev_uninit,
	.ndo_start_xmit		= dummy_xmit,
	.ndo_validate_addr	= eth_validate_addr,
	.ndo_set_rx_mode	= set_multicast_list,
	.ndo_set_mac_address	= eth_mac_addr,
	.ndo_get_stats64	= dummy_get_stats64,
	.ndo_change_carrier	= dummy_change_carrier,
};

The net_device_ops structure lists the operations that are available while the network device is running. dummy_xmit deserves a special mention:

static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
{
	dev_lstats_add(dev, skb->len);

	skb_tx_timestamp(skb);
	dev_kfree_skb(skb);
	return NETDEV_TX_OK;
}
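
  • dev_lstats_add() updates the device's per-CPU packet and byte counters with the length of the packet.
  • skb_tx_timestamp() is the driver hook for transmit timestamping. It should be called before the sk_buff is handed to the MAC hardware.
  • dev_kfree_skb() frees the sk_buff. Because the dummy device has no hardware behind it, the packet is simply discarded at this point.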

3.2 What's sk_buff?

sk_buff is the buffer that carries the actual packet data through each network layer.

The structure is defined in include/linux/skbuff.h in the Linux kernel source.

struct sk_buff {
	union {
		struct {
			/* These two members must be first to match sk_buff_head. */
			struct sk_buff		*next;
			struct sk_buff		*prev;

			union {
				struct net_device	*dev;
				/* Some protocols might use this space to store information,
				 * while device pointer would be NULL.
				 * UDP receive path is one user.
				 */
				unsigned long		dev_scratch;
			};
		};
		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
		struct list_head	list;
		struct llist_node	ll_node;
	};

	union {
		struct sock		*sk;
		int			ip_defrag_offset;
	};

	union {
		ktime_t		tstamp;
		u64		skb_mstamp_ns; /* earliest departure time */
	};
	/*
	 * This is the control buffer. It is free to use for every
	 * layer. Please put your private variables there. If you
	 * want to keep them across layers you have to do a skb_clone()
	 * first. This is owned by whoever has the skb queued ATM.
	 */
	char			cb[48] __aligned(8);

	union {
		struct {
			unsigned long	_skb_refdst;
			void		(*destructor)(struct sk_buff *skb);
		};
		struct list_head	tcp_tsorted_anchor;
#ifdef CONFIG_NET_SOCK_MSG
		unsigned long		_sk_redir;
#endif
	};

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	unsigned long		 _nfct;
#endif
	unsigned int		len,
				data_len;
	__u16			mac_len,
				hdr_len;

	/* Following fields are _not_ copied in __copy_skb_header()
	 * Note that queue_mapping is here mostly to fill a hole.
	 */
	__u16			queue_mapping;

/* if you move cloned around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define CLONED_MASK	(1 << 7)
#else
#define CLONED_MASK	1
#endif
#define CLONED_OFFSET		offsetof(struct sk_buff, __cloned_offset)

	/* private: */
	__u8			__cloned_offset[0];
	/* public: */
	__u8			cloned:1,
				nohdr:1,
				fclone:2,
				peeked:1,
				head_frag:1,
				pfmemalloc:1,
				pp_recycle:1; /* page_pool recycle indicator */
#ifdef CONFIG_SKB_EXTENSIONS
	__u8			active_extensions;
#endif

	/* Fields enclosed in headers group are copied
	 * using a single memcpy() in __copy_skb_header()
	 */
	struct_group(headers,

	/* private: */
	__u8			__pkt_type_offset[0];
	/* public: */
	__u8			pkt_type:3; /* see PKT_TYPE_MAX */
	__u8			ignore_df:1;
	__u8			nf_trace:1;
	__u8			ip_summed:2;
	__u8			ooo_okay:1;

	__u8			l4_hash:1;
	__u8			sw_hash:1;
	__u8			wifi_acked_valid:1;
	__u8			wifi_acked:1;
	__u8			no_fcs:1;
	/* Indicates the inner headers are valid in the skbuff. */
	__u8			encapsulation:1;
	__u8			encap_hdr_csum:1;
	__u8			csum_valid:1;

	/* private: */
	__u8			__pkt_vlan_present_offset[0];
	/* public: */
	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
	__u8			csum_complete_sw:1;
	__u8			csum_level:2;
	__u8			dst_pending_confirm:1;
	__u8			mono_delivery_time:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
#ifdef CONFIG_NET_CLS_ACT
	__u8			tc_skip_classify:1;
	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
#endif
#ifdef CONFIG_IPV6_NDISC_NODETYPE
	__u8			ndisc_nodetype:2;
#endif

	__u8			ipvs_property:1;
	__u8			inner_protocol_type:1;
	__u8			remcsum_offload:1;
#ifdef CONFIG_NET_SWITCHDEV
	__u8			offload_fwd_mark:1;
	__u8			offload_l3_fwd_mark:1;
#endif
	__u8			redirected:1;
#ifdef CONFIG_NET_REDIRECT
	__u8			from_ingress:1;
#endif
#ifdef CONFIG_NETFILTER_SKIP_EGRESS
	__u8			nf_skip_egress:1;
#endif
#ifdef CONFIG_TLS_DEVICE
	__u8			decrypted:1;
#endif
	__u8			slow_gro:1;
	__u8			csum_not_inet:1;

#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
#endif

	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	__u32			priority;
	int			skb_iif;
	__u32			hash;
	__be16			vlan_proto;
	__u16			vlan_tci;
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
	union {
		unsigned int	napi_id;
		unsigned int	sender_cpu;
	};
#endif
#ifdef CONFIG_NETWORK_SECMARK
	__u32		secmark;
#endif

	union {
		__u32		mark;
		__u32		reserved_tailroom;
	};

	union {
		__be16		inner_protocol;
		__u8		inner_ipproto;
	};

	__u16			inner_transport_header;
	__u16			inner_network_header;
	__u16			inner_mac_header;

	__be16			protocol;
	__u16			transport_header;
	__u16			network_header;
	__u16			mac_header;

#ifdef CONFIG_KCOV
	u64			kcov_handle;
#endif

	); /* end headers group */

	/* These elements must be at the end, see alloc_skb() for details.  */
	sk_buff_data_t		tail;
	sk_buff_data_t		end;
	unsigned char		*head,
				*data;
	unsigned int		truesize;
	refcount_t		users;

#ifdef CONFIG_SKB_EXTENSIONS
	/* only useable after checking ->active_extensions != 0 */
	struct skb_ext		*extensions;
#endif
};

In brief, the sk_buff structure holds the offsets of the various protocol headers and the sk_buff_data pointers:

  • transport_header
  • network_header
  • mac_header
  • tail & end

3.2.1 The buffer pointers: head, data, tail and end

These pointers record where the data that matters to each layer begins and ends in the buffer. For example:

  • For the transport layer, the valid data consists of its own header and the user data.

  • For the network layer, it consists of its own header, the transport header, and the user data.

  • For the data link layer, it consists of all the data of the network layer plus its own protocol header.
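The layering above can be sketched with a small user-space model of the four pointers: head and end delimit the allocated buffer, while data and tail delimit the bytes that are currently valid. The toy_skb type and its functions are hypothetical and only imitate the reserve, put, push, and pull operations. Two simplifications are assumed: all four fields are plain pointers, and an out-of-room request returns NULL instead of being treated as a fatal error.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: head <= data <= tail <= end always holds. */
struct toy_skb {
    unsigned char *head, *data, *tail, *end;
};

int toy_skb_alloc(struct toy_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;   /* empty: no valid data yet */
    skb->end = skb->head + size;
    return 0;
}

void toy_skb_free(struct toy_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
}

size_t toy_skb_len(const struct toy_skb *s)      { return (size_t)(s->tail - s->data); }
size_t toy_skb_headroom(const struct toy_skb *s) { return (size_t)(s->data - s->head); }
size_t toy_skb_tailroom(const struct toy_skb *s) { return (size_t)(s->end - s->tail); }

/* Leave room in front of an empty buffer for headers added later. */
void toy_skb_reserve(struct toy_skb *skb, size_t n)
{
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes of payload; returns where to write them. */
unsigned char *toy_skb_put(struct toy_skb *skb, size_t n)
{
    if (n > toy_skb_tailroom(skb))
        return NULL;
    unsigned char *p = skb->tail;
    skb->tail += n;
    return p;
}

/* Prepend an n-byte header (a lower layer on transmit). */
unsigned char *toy_skb_push(struct toy_skb *skb, size_t n)
{
    if (n > toy_skb_headroom(skb))
        return NULL;
    skb->data -= n;
    return skb->data;
}

/* Strip an n-byte header (an upper layer on receive). */
unsigned char *toy_skb_pull(struct toy_skb *skb, size_t n)
{
    if (n > toy_skb_len(skb))
        return NULL;
    skb->data += n;
    return skb->data;
}
```

On transmit, each lower layer pushes its header in front of the existing data, so the payload is never copied; on receive, each layer pulls its own header before passing the buffer up.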

3.3 The differences between alloc_netdev() and alloc_etherdev()

The following code can be found in net/ethernet/eth.c:

/**
 * alloc_etherdev_mqs - Allocates and sets up an Ethernet device
 * @sizeof_priv: Size of additional driver-private structure to be allocated
 *	for this Ethernet device
 * @txqs: The number of TX queues this device has.
 * @rxqs: The number of RX queues this device has.
 *
 * Fill in the fields of the device structure with Ethernet-generic
 * values. Basically does everything except registering the device.
 *
 * Constructs a new net device, complete with a private data area of
 * size (sizeof_priv).  A 32-byte (not bit) alignment is enforced for
 * this private data area.
 */

struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs,
				      unsigned int rxqs)
{
	return alloc_netdev_mqs(sizeof_priv, "eth%d", NET_NAME_UNKNOWN,
				ether_setup, txqs, rxqs);
}
EXPORT_SYMBOL(alloc_etherdev_mqs);

// ...

/**
 * ether_setup - setup Ethernet network device
 * @dev: network device
 *
 * Fill in the fields of the device structure with Ethernet-generic values.
 */
void ether_setup(struct net_device *dev)
{
	dev->header_ops		= &eth_header_ops;
	dev->type		= ARPHRD_ETHER;
	dev->hard_header_len 	= ETH_HLEN;
	dev->min_header_len	= ETH_HLEN;
	dev->mtu		= ETH_DATA_LEN;
	dev->min_mtu		= ETH_MIN_MTU;
	dev->max_mtu		= ETH_DATA_LEN;
	dev->addr_len		= ETH_ALEN;
	dev->tx_queue_len	= DEFAULT_TX_QUEUE_LEN;
	dev->flags		= IFF_BROADCAST|IFF_MULTICAST;
	dev->priv_flags		|= IFF_TX_SKB_SHARING;

	eth_broadcast_addr(dev->broadcast);

}
EXPORT_SYMBOL(ether_setup);

alloc_etherdev() is the higher-level function. It is a wrapper that passes the Ethernet-specific arguments, namely the "eth%d" name template and the ether_setup() callback, to alloc_netdev_mqs().
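The relationship between the generic allocator and its Ethernet wrapper can be modelled in user space as a function that takes a name template and a setup callback, plus a thin wrapper that fixes both. This is only a sketch: the toy_ types and functions are hypothetical, and the structure holds just a few fields. The numeric values match the Ethernet constants ETH_DATA_LEN (1500), ETH_HLEN (14), and ETH_ALEN (6).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct toy_netdev {
    char name[16];
    unsigned int mtu;
    unsigned int hard_header_len;
    unsigned int addr_len;
};

typedef void (*toy_setup_fn)(struct toy_netdev *dev);

/* Generic allocator: the caller supplies the name template and the
 * callback that fills in type-specific defaults. */
struct toy_netdev *toy_alloc_netdev(const char *name_tmpl, toy_setup_fn setup)
{
    struct toy_netdev *dev = calloc(1, sizeof(*dev));
    if (!dev)
        return NULL;
    /* The template is stored as-is; in this model it is never expanded. */
    snprintf(dev->name, sizeof(dev->name), "%s", name_tmpl);
    setup(dev);
    return dev;
}

/* Ethernet defaults, in the spirit of ether_setup(). */
static void toy_ether_setup(struct toy_netdev *dev)
{
    dev->mtu = 1500;            /* ETH_DATA_LEN */
    dev->hard_header_len = 14;  /* ETH_HLEN */
    dev->addr_len = 6;          /* ETH_ALEN */
}

/* Ethernet wrapper: same allocator, with both arguments fixed. */
struct toy_netdev *toy_alloc_etherdev(void)
{
    return toy_alloc_netdev("eth%d", toy_ether_setup);
}
```

A driver for another link type would call the generic allocator directly with its own template and setup callback, which is what the dummy module does with "dummy%d" and dummy_setup().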

References