Linux Network Device Driver - ianchen0119/Introduce-to-5GC GitHub Wiki
1. Introduction to Devices
In Linux, a device can only be accessed and operated once its device driver has been installed.
As the figure above shows, Linux divides devices into two major classes:
- Block Devices
- Character Devices
2. Types of Devices
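As mentioned above, Linux divides devices into two major classes: Block Devices and Character Devices. (Network interfaces, the subject of Section 3, are handled separately and have no device file under /dev.)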
2.1 Block Devices
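The main difference between the two is that Block Devices transfer data in fixed-size blocks, whereas Character Devices accept data of arbitrary length.
From user space, both kinds are operated in the same way.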
2.2 Character Devices
A Character Device can be viewed as a stream of data that is accessed much like a file. A character device driver is therefore responsible for at least the following operations:
- open
- close
- read
- write
In some special cases the driver also has to provide an ioctl operation. Common examples are the text console and serial ports, both of which are stream-oriented.
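To make the "accessed like a file" point concrete, here is a minimal user-space sketch. It assumes a Linux system where /dev/zero exists; the helper names is_char_device() and read_from_device() are hypothetical and only serve this illustration. The open/read/close calls are the same ones used for regular files, and the kernel routes them to the driver's handlers.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if `path` names a character device, 0 if not, -1 on error. */
int is_char_device(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return S_ISCHR(st.st_mode) ? 1 : 0;
}

/*
 * Opens `path`, reads up to `len` bytes into `buf`, and closes it again.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t read_from_device(const char *path, void *buf, size_t len)
{
    int fd;
    ssize_t n;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_RDONLY);  /* dispatched to the driver's open handler */
    if (fd < 0)
        return -1;
    n = read(fd, buf, len);     /* dispatched to the driver's read handler */
    close(fd);                  /* dispatched to the driver's close handler */
    return n;
}
```

For example, read_from_device("/dev/zero", buf, 16) should fill buf with 16 zero bytes, because /dev/zero is a character device whose read handler produces zeros.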
2.3 Check the sample code!
/* In sample.c */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

static int __init dev_drv_init(void)
{
    pr_info("Initializing the module.\n");
    return 0;
}

static void __exit dev_drv_exit(void)
{
    pr_info("Unloading the module.\n");
}

module_init(dev_drv_init);
module_exit(dev_drv_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Ian Chen");
The code above is a very simple kernel module.
When the module is loaded, it prints the message "Initializing the module." (use the dmesg tool to see it).
Conversely, when the module is removed, it prints "Unloading the module.".
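- module_init() registers the module's initializer.
- module_exit() registers the module's exit function. When the user removes the module with rmmod, the kernel invokes it (as cleanup_module()) before the module's code is removed from kernel space.
- MODULE_LICENSE() is a special macro that declares the module's license.
- MODULE_AUTHOR() declares the module's author.
- pr_info() writes the message to the kernel log. Depending on the distribution's logging setup, the message may also be saved to a file such as /var/log/messages.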
2.4 Compile the sample code
Make shortens the edit-and-test cycle: it compiles the source code automatically according to a makefile that we define ourselves:
KERNEL_DIR := /lib/modules/$(shell uname -r)/build
MODULE_NAME = sample
obj-m := $(MODULE_NAME).o

all:
	make -C $(KERNEL_DIR) M=$(shell pwd) modules

clean:
	make -C $(KERNEL_DIR) M=$(shell pwd) clean
Once the Makefile is in place, the following command produces the module's kernel object file, sample.ko:
$ make all
After modifying the module's source code, you can remove the previous build artifacts with:
$ make clean
3. Network Device Driver Development
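The quickest way to develop a network device driver is to call alloc_netdev() or alloc_etherdev() so that the kernel allocates a network device, and then to fill in a net_device_ops structure that registers the device's hook functions.
Once the device driver is loaded into the Linux kernel, the kernel calls the registered hook functions at the appropriate moments to drive the network device. The basic network device operations include: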
- ndo_init() is called once when a network device is registered.
- ndo_open() is called when a network device transitions to the up state.
- ndo_validate_addr() is called to test whether the Media Access Control address is valid for the device.
- ndo_stop() is called when a network device transitions to the down state.
- ndo_start_xmit() is called when a packet needs to be transmitted. It returns one of:
  - NETDEV_TX_OK
  - NETDEV_TX_BUSY
- ndo_change_mtu() is called when a user wants to change the Maximum Transfer Unit of a device.
- ndo_set_mac_address() is called when the Media Access Control address needs to be changed. If this hook is not defined, the MAC address cannot be changed.
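The hook mechanism listed above can be modeled in user space with a table of function pointers. The sketch below is a toy model, not kernel code: every toy_* name is hypothetical, and the real struct net_device_ops has many more members with different signatures. It shows the "core" calling a hook only when the "driver" provided one, and refusing the operation otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel's definitions. */
enum toy_tx { TOY_TX_OK = 0, TOY_TX_BUSY = 1 };

struct toy_netdev;

struct toy_netdev_ops {
    int (*open)(struct toy_netdev *dev);
    enum toy_tx (*start_xmit)(struct toy_netdev *dev, size_t len);
    int (*set_mac_address)(struct toy_netdev *dev, const unsigned char *mac);
};

struct toy_netdev {
    const struct toy_netdev_ops *ops;
    int is_up;
    unsigned long tx_packets;
    unsigned long tx_bytes;
};

/* "Core" side: bring the device up, calling the optional open hook. */
int toy_dev_up(struct toy_netdev *dev)
{
    int err = 0;

    assert(dev != NULL && dev->ops != NULL);
    if (dev->ops->open)
        err = dev->ops->open(dev);
    if (!err)
        dev->is_up = 1;
    return err;
}

/* "Core" side: hand one packet of `len` bytes to the driver. */
enum toy_tx toy_dev_xmit(struct toy_netdev *dev, size_t len)
{
    assert(dev != NULL && dev->ops != NULL && dev->ops->start_xmit != NULL);
    return dev->ops->start_xmit(dev, len);
}

/* "Core" side: without a set_mac_address hook the request is refused. */
int toy_dev_set_mac(struct toy_netdev *dev, const unsigned char *mac)
{
    assert(dev != NULL && dev->ops != NULL);
    if (!dev->ops->set_mac_address)
        return -EOPNOTSUPP;
    return dev->ops->set_mac_address(dev, mac);
}

/* "Driver" side: the hooks this toy driver chooses to implement. */
static int toy_open(struct toy_netdev *dev)
{
    (void)dev;
    return 0;
}

static enum toy_tx toy_xmit(struct toy_netdev *dev, size_t len)
{
    dev->tx_packets++;
    dev->tx_bytes += len;
    return TOY_TX_OK;
}

const struct toy_netdev_ops toy_ops = {
    .open       = toy_open,
    .start_xmit = toy_xmit,
    /* .set_mac_address is deliberately left NULL. */
};
```

With a device declared as struct toy_netdev dev = { .ops = &toy_ops };, toy_dev_up() and toy_dev_xmit() succeed, while toy_dev_set_mac() returns -EOPNOTSUPP because the driver left that hook undefined.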
3.1 Dummy module
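The dummy module ships with the Linux kernel. It can be used to build a virtual network environment in which developers run end-to-end tests.
For example, in an isolated network environment the host only has the loopback address 127.0.0.1 available for analysis; no other IP address is usable.
The dummy interface was created to solve exactly this problem.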
3.1.1 Usage
Use the following command to create a dummy interface named nodelocaldns:
# ip link add: add a virtual link of the given type (here: dummy)
$ sudo ip link add nodelocaldns type dummy
The following commands assign IP addresses to nodelocaldns:
$ sudo ip addr add 168.254.10.10 dev nodelocaldns
$ sudo ip addr add 10.20.0.10 dev nodelocaldns
Once the dummy interface has been created and given an IP address, we can use ping to send ICMP echo requests to it:
$ ping 10.20.0.10
When the dummy interface is no longer needed, remove it with:
$ sudo ip link delete nodelocaldns
3.1.2 Code Analysis
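The source code of the dummy module can be found in the Linux kernel tree (drivers/net/dummy.c).
In this subsection, we read the dummy module's source code to learn how a network device driver is developed: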
module_init(dummy_init_module);
module_exit(dummy_cleanup_module);
MODULE_LICENSE("GPL");
MODULE_ALIAS_RTNL_LINK(DRV_NAME);
The code above declares the module's initializer, clean-up function, license, and rtnetlink alias.
Let's look into dummy_init_module() and dummy_cleanup_module():
static struct rtnl_link_ops dummy_link_ops __read_mostly = {
    .kind     = DRV_NAME,
    .setup    = dummy_setup,
    .validate = dummy_validate,
};

// ...

static int __init dummy_init_module(void)
{
    int i, err = 0;

    down_write(&pernet_ops_rwsem);
    rtnl_lock();
    err = __rtnl_link_register(&dummy_link_ops);
    if (err < 0)
        goto out;

    for (i = 0; i < numdummies && !err; i++) {
        err = dummy_init_one();
        cond_resched();
    }
    if (err < 0)
        __rtnl_link_unregister(&dummy_link_ops);

out:
    rtnl_unlock();
    up_write(&pernet_ops_rwsem);

    return err;
}

static void __exit dummy_cleanup_module(void)
{
    rtnl_link_unregister(&dummy_link_ops);
}
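- down_write(&pernet_ops_rwsem) and up_write(&pernet_ops_rwsem): take and release a read-write semaphore to avoid race conditions.
- rtnl_lock() and rtnl_unlock(): the rtnetlink mutex, which protects the list of network devices.
- __rtnl_link_register(&dummy_link_ops): registers dummy_link_ops with rtnl_link.
  - Rtnetlink allows the kernel's routing tables to be read or altered.
  - Rtnetlink maintains a linked list of link_ops in which every node is identified by its kind.
  - A netdev acts like a client of rtnl_link: its rtnl_link_ops pointer associates it with dummy_link_ops.
  - If an rtnl_link_ops is removed from the linked list, the Linux kernel has to deal with every module associated with those operations.
- The for loop over numdummies: the dummy module lets the user create several dummy devices at once.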
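The kind-keyed linked list described in the bullets above can be sketched in user space as follows. This is a simplified model with hypothetical toy_* names and no locking; rejecting a duplicate kind with -EEXIST is an assumption of this sketch rather than a statement about the kernel's exact behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy counterpart of a link_ops entry: only the kind and the list link. */
struct toy_link_ops {
    const char *kind;
    struct toy_link_ops *next;
};

/* Head of the singly linked list of registered link_ops. */
static struct toy_link_ops *toy_link_ops_head;

/* Looks up a registered entry by its kind; returns NULL if none matches. */
struct toy_link_ops *toy_link_ops_get(const char *kind)
{
    struct toy_link_ops *ops;

    assert(kind != NULL);
    for (ops = toy_link_ops_head; ops != NULL; ops = ops->next)
        if (strcmp(ops->kind, kind) == 0)
            return ops;
    return NULL;
}

/* Adds `ops` to the list; a kind may only be registered once. */
int toy_link_register(struct toy_link_ops *ops)
{
    assert(ops != NULL && ops->kind != NULL);
    if (toy_link_ops_get(ops->kind) != NULL)
        return -EEXIST;
    ops->next = toy_link_ops_head;
    toy_link_ops_head = ops;
    return 0;
}

/* Unlinks `ops` from the list if it is present. */
void toy_link_unregister(struct toy_link_ops *ops)
{
    struct toy_link_ops **pp;

    assert(ops != NULL);
    for (pp = &toy_link_ops_head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == ops) {
            *pp = ops->next;
            ops->next = NULL;
            return;
        }
    }
}
```

Registering an entry whose kind is "dummy" makes toy_link_ops_get("dummy") return it, which mirrors how a command such as ip link add ... type dummy can be matched to the dummy driver by its kind string.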
/* Number of dummy devices to be set up by this module. */
module_param(numdummies, int, 0);
MODULE_PARM_DESC(numdummies, "Number of dummy pseudo devices");

static int __init dummy_init_one(void)
{
    struct net_device *dev_dummy;
    int err;

    dev_dummy = alloc_netdev(0, "dummy%d", NET_NAME_ENUM, dummy_setup);
    if (!dev_dummy)
        return -ENOMEM;

    dev_dummy->rtnl_link_ops = &dummy_link_ops;
    err = register_netdevice(dev_dummy);
    if (err < 0)
        goto err;
    return 0;

err:
    free_netdev(dev_dummy);
    return err;
}
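alloc_netdev() allocates the resources that the network device needs.
Once allocation is done, register_netdevice() registers the device with the kernel.
In addition, the function associates dev_dummy with dummy_link_ops. At this point we have a rough idea of how to allocate a network device and register it with the Linux kernel. Let's go on to trace the dummy module's setup and validate functions: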
static void dummy_setup(struct net_device *dev)
{
    ether_setup(dev);

    /* Initialize the device structure. */
    dev->netdev_ops = &dummy_netdev_ops;
    dev->ethtool_ops = &dummy_ethtool_ops;
    dev->needs_free_netdev = true;

    /* Fill in device structure with ethernet-generic values. */
    dev->flags |= IFF_NOARP;
    dev->flags &= ~IFF_MULTICAST;
    dev->priv_flags |= IFF_LIVE_ADDR_CHANGE | IFF_NO_QUEUE;
    dev->features |= NETIF_F_SG | NETIF_F_FRAGLIST;
    dev->features |= NETIF_F_GSO_SOFTWARE;
    dev->features |= NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_LLTX;
    dev->features |= NETIF_F_GSO_ENCAP_ALL;
    dev->hw_features |= dev->features;
    dev->hw_enc_features |= dev->features;
    eth_hw_addr_random(dev);

    dev->min_mtu = 0;
    dev->max_mtu = 0;
}

static int dummy_validate(struct nlattr *tb[], struct nlattr *data[],
                          struct netlink_ext_ack *extack)
{
    if (tb[IFLA_ADDRESS]) {
        if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
            return -EINVAL;
        if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS])))
            return -EADDRNOTAVAIL;
    }
    return 0;
}
dummy_setup() initializes netdev_ops and ethtool_ops and fills in the Ethernet-generic values.
eth_hw_addr_random() generates a random Ethernet (MAC) address for the net device and sets addr_assign_type, so that the state can be read through sysfs and used by userspace.
IFLA_ADDRESS is the netlink attribute that carries the interface's L2 address. dummy_validate() rejects it when its length is not ETH_ALEN or when it is not a valid Ethernet address.
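The two address rules involved here can be sketched in user space. The toy_* helpers below are hypothetical stand-ins written for this page, not the kernel functions themselves: an address counts as valid when it is neither multicast (lowest bit of the first byte set) nor all zeros, and a random address is made usable by clearing the multicast bit and setting the locally administered bit.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_ETH_ALEN 6  /* length of an Ethernet address in bytes */

/* Multicast (and broadcast) addresses have the lowest bit of byte 0 set. */
int toy_is_multicast_ether_addr(const unsigned char *addr)
{
    return addr[0] & 0x01;
}

/* Returns 1 if all six bytes are zero. */
int toy_is_zero_ether_addr(const unsigned char *addr)
{
    int i;

    for (i = 0; i < TOY_ETH_ALEN; i++)
        if (addr[i] != 0)
            return 0;
    return 1;
}

/* A device address must be neither multicast nor all zeros. */
int toy_is_valid_ether_addr(const unsigned char *addr)
{
    assert(addr != NULL);
    return !toy_is_multicast_ether_addr(addr) && !toy_is_zero_ether_addr(addr);
}

/* Fills `addr` with a random, locally administered, unicast address. */
void toy_random_ether_addr(unsigned char *addr)
{
    int i;

    assert(addr != NULL);
    for (i = 0; i < TOY_ETH_ALEN; i++)
        addr[i] = (unsigned char)(rand() & 0xff);
    addr[0] &= 0xfe;  /* clear the multicast bit */
    addr[0] |= 0x02;  /* set the locally administered bit */
}
```

Because the locally administered bit is always set, an address produced by toy_random_ether_addr() can never be all zeros, so it always passes toy_is_valid_ether_addr().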
In addition, dummy_setup() registers dummy_netdev_ops as the operations of the net device:
static const struct net_device_ops dummy_netdev_ops = {
    .ndo_init            = dummy_dev_init,
    .ndo_uninit          = dummy_dev_uninit,
    .ndo_start_xmit      = dummy_xmit,
    .ndo_validate_addr   = eth_validate_addr,
    .ndo_set_rx_mode     = set_multicast_list,
    .ndo_set_mac_address = eth_mac_addr,
    .ndo_get_stats64     = dummy_get_stats64,
    .ndo_change_carrier  = dummy_change_carrier,
};
The net_device_ops structure lists the operations that the kernel can invoke while the network device is running. dummy_xmit deserves a special mention:
static netdev_tx_t dummy_xmit(struct sk_buff *skb, struct net_device *dev)
{
    dev_lstats_add(dev, skb->len);

    skb_tx_timestamp(skb);
    dev_kfree_skb(skb);
    return NETDEV_TX_OK;
}
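- dev_lstats_add() updates the device's lightweight statistics (packet and byte counters) with the length of the packet being transmitted.
- skb_tx_timestamp() is the driver hook for transmit timestamping. It should be called before the sk_buff is given to the MAC hardware.
- dev_kfree_skb() frees the sk_buff. The dummy device therefore counts every packet and then discards it.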
3.2 What's sk_buff?
sk_buff is the buffer used to carry the actual data through each network layer.
The associated structure is defined in include/linux/skbuff.h of the Linux kernel source.
struct sk_buff {
    union {
        struct {
            /* These two members must be first to match sk_buff_head. */
            struct sk_buff *next;
            struct sk_buff *prev;

            union {
                struct net_device *dev;
                /* Some protocols might use this space to store information,
                 * while device pointer would be NULL.
                 * UDP receive path is one user.
                 */
                unsigned long dev_scratch;
            };
        };
        struct rb_node rbnode; /* used in netem, ip4 defrag, and tcp stack */
        struct list_head list;
        struct llist_node ll_node;
    };

    union {
        struct sock *sk;
        int ip_defrag_offset;
    };

    union {
        ktime_t tstamp;
        u64 skb_mstamp_ns; /* earliest departure time */
    };
    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char cb[48] __aligned(8);

    union {
        struct {
            unsigned long _skb_refdst;
            void (*destructor)(struct sk_buff *skb);
        };
        struct list_head tcp_tsorted_anchor;
#ifdef CONFIG_NET_SOCK_MSG
        unsigned long _sk_redir;
#endif
    };

#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
    unsigned long _nfct;
#endif
    unsigned int len,
                 data_len;
    __u16 mac_len,
          hdr_len;

    /* Following fields are _not_ copied in __copy_skb_header()
     * Note that queue_mapping is here mostly to fill a hole.
     */
    __u16 queue_mapping;

/* if you move cloned around you also must adapt those constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define CLONED_MASK (1 << 7)
#else
#define CLONED_MASK 1
#endif
#define CLONED_OFFSET offsetof(struct sk_buff, __cloned_offset)

    /* private: */
    __u8 __cloned_offset[0];
    /* public: */
    __u8 cloned:1,
         nohdr:1,
         fclone:2,
         peeked:1,
         head_frag:1,
         pfmemalloc:1,
         pp_recycle:1; /* page_pool recycle indicator */
#ifdef CONFIG_SKB_EXTENSIONS
    __u8 active_extensions;
#endif

    /* Fields enclosed in headers group are copied
     * using a single memcpy() in __copy_skb_header()
     */
    struct_group(headers,

    /* private: */
    __u8 __pkt_type_offset[0];
    /* public: */
    __u8 pkt_type:3; /* see PKT_TYPE_MAX */
    __u8 ignore_df:1;
    __u8 nf_trace:1;
    __u8 ip_summed:2;
    __u8 ooo_okay:1;

    __u8 l4_hash:1;
    __u8 sw_hash:1;
    __u8 wifi_acked_valid:1;
    __u8 wifi_acked:1;
    __u8 no_fcs:1;
    /* Indicates the inner headers are valid in the skbuff. */
    __u8 encapsulation:1;
    __u8 encap_hdr_csum:1;
    __u8 csum_valid:1;

    /* private: */
    __u8 __pkt_vlan_present_offset[0];
    /* public: */
    __u8 vlan_present:1; /* See PKT_VLAN_PRESENT_BIT */
    __u8 csum_complete_sw:1;
    __u8 csum_level:2;
    __u8 dst_pending_confirm:1;
    __u8 mono_delivery_time:1; /* See SKB_MONO_DELIVERY_TIME_MASK */
#ifdef CONFIG_NET_CLS_ACT
    __u8 tc_skip_classify:1;
    __u8 tc_at_ingress:1; /* See TC_AT_INGRESS_MASK */
#endif
#ifdef CONFIG_IPV6_NDISC_NODETYPE
    __u8 ndisc_nodetype:2;
#endif

    __u8 ipvs_property:1;
    __u8 inner_protocol_type:1;
    __u8 remcsum_offload:1;
#ifdef CONFIG_NET_SWITCHDEV
    __u8 offload_fwd_mark:1;
    __u8 offload_l3_fwd_mark:1;
#endif
    __u8 redirected:1;
#ifdef CONFIG_NET_REDIRECT
    __u8 from_ingress:1;
#endif
#ifdef CONFIG_NETFILTER_SKIP_EGRESS
    __u8 nf_skip_egress:1;
#endif
#ifdef CONFIG_TLS_DEVICE
    __u8 decrypted:1;
#endif
    __u8 slow_gro:1;
    __u8 csum_not_inet:1;

#ifdef CONFIG_NET_SCHED
    __u16 tc_index; /* traffic control index */
#endif

    union {
        __wsum csum;
        struct {
            __u16 csum_start;
            __u16 csum_offset;
        };
    };
    __u32 priority;
    int skb_iif;
    __u32 hash;
    __be16 vlan_proto;
    __u16 vlan_tci;
#if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
    union {
        unsigned int napi_id;
        unsigned int sender_cpu;
    };
#endif
#ifdef CONFIG_NETWORK_SECMARK
    __u32 secmark;
#endif

    union {
        __u32 mark;
        __u32 reserved_tailroom;
    };

    union {
        __be16 inner_protocol;
        __u8 inner_ipproto;
    };

    __u16 inner_transport_header;
    __u16 inner_network_header;
    __u16 inner_mac_header;

    __be16 protocol;
    __u16 transport_header;
    __u16 network_header;
    __u16 mac_header;

#ifdef CONFIG_KCOV
    u64 kcov_handle;
#endif

    ); /* end headers group */

    /* These elements must be at the end, see alloc_skb() for details. */
    sk_buff_data_t tail;
    sk_buff_data_t end;
    unsigned char *head,
                  *data;
    unsigned int truesize;
    refcount_t users;

#ifdef CONFIG_SKB_EXTENSIONS
    /* only useable after checking ->active_extensions != 0 */
    struct skb_ext *extensions;
#endif
};
In brief, the sk_buff structure holds the positions of the various protocol headers together with the sk_buff_data pointers:
- transport_header
- network_header
- mac_header
- tail & end
3.2.1 The buffer pointers: head, data, tail and end
These pointers record where specific data lives in the buffer: head and end delimit the allocated buffer, while data and tail delimit the data that is valid for the current layer. For example:
- For the transport layer, the valid data consists of its own header and the user data.
- For the network layer, it consists of its own header, the transport header, and the user data.
- For the data link layer, it consists of everything in the network layer plus its own protocol header.
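How the four pointers move as each layer adds or strips its header can be modeled in user space. The sketch below is a toy with hypothetical toy_skb_* names; the operations are patterned after the kernel helpers skb_reserve(), skb_put(), skb_push(), and skb_pull(), and it uses plain pointers even though the kernel may store tail and end as offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_skb {
    unsigned char *head;  /* start of the allocated buffer */
    unsigned char *data;  /* start of the valid data */
    unsigned char *tail;  /* end of the valid data */
    unsigned char *end;   /* end of the allocated buffer */
};

/* Allocates an empty buffer of `size` bytes; returns 0 or -1. */
int toy_skb_alloc(struct toy_skb *skb, size_t size)
{
    assert(skb != NULL && size > 0);
    skb->head = malloc(size);
    if (skb->head == NULL)
        return -1;
    skb->data = skb->head;
    skb->tail = skb->head;
    skb->end = skb->head + size;
    return 0;
}

void toy_skb_free(struct toy_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
}

size_t toy_skb_len(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}

size_t toy_skb_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

size_t toy_skb_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Leaves `len` bytes of headroom in an empty buffer for later headers. */
void toy_skb_reserve(struct toy_skb *skb, size_t len)
{
    assert(toy_skb_len(skb) == 0 && toy_skb_tailroom(skb) >= len);
    skb->data += len;
    skb->tail += len;
}

/* Appends `len` bytes at the tail; returns where the new bytes start. */
unsigned char *toy_skb_put(struct toy_skb *skb, size_t len)
{
    unsigned char *old_tail = skb->tail;

    assert(toy_skb_tailroom(skb) >= len);
    skb->tail += len;
    return old_tail;
}

/* Prepends `len` bytes (a header) in front of the current data. */
unsigned char *toy_skb_push(struct toy_skb *skb, size_t len)
{
    assert(toy_skb_headroom(skb) >= len);
    skb->data -= len;
    return skb->data;
}

/* Strips `len` bytes (a header) from the front of the data. */
unsigned char *toy_skb_pull(struct toy_skb *skb, size_t len)
{
    assert(toy_skb_len(skb) >= len);
    skb->data += len;
    return skb->data;
}
```

On the transmit path a buffer would typically reserve headroom first, put the payload, and then push one header per layer as it travels down the stack; on the receive path each layer pulls its own header before handing the buffer upward.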
3.3 The differences between alloc_netdev() and alloc_etherdev()
The following code can be found in net/ethernet/eth.c:
/**
 * alloc_etherdev_mqs - Allocates and sets up an Ethernet device
 * @sizeof_priv: Size of additional driver-private structure to be allocated
 *	for this Ethernet device
 * @txqs: The number of TX queues this device has.
 * @rxqs: The number of RX queues this device has.
 *
 * Fill in the fields of the device structure with Ethernet-generic
 * values. Basically does everything except registering the device.
 *
 * Constructs a new net device, complete with a private data area of
 * size (sizeof_priv). A 32-byte (not bit) alignment is enforced for
 * this private data area.
 */
struct net_device *alloc_etherdev_mqs(int sizeof_priv, unsigned int txqs,
                                      unsigned int rxqs)
{
    return alloc_netdev_mqs(sizeof_priv, "eth%d", NET_NAME_UNKNOWN,
                            ether_setup, txqs, rxqs);
}
EXPORT_SYMBOL(alloc_etherdev_mqs);

// ...

/**
 * ether_setup - setup Ethernet network device
 * @dev: network device
 *
 * Fill in the fields of the device structure with Ethernet-generic values.
 */
void ether_setup(struct net_device *dev)
{
    dev->header_ops = &eth_header_ops;
    dev->type = ARPHRD_ETHER;
    dev->hard_header_len = ETH_HLEN;
    dev->min_header_len = ETH_HLEN;
    dev->mtu = ETH_DATA_LEN;
    dev->min_mtu = ETH_MIN_MTU;
    dev->max_mtu = ETH_DATA_LEN;
    dev->addr_len = ETH_ALEN;
    dev->tx_queue_len = DEFAULT_TX_QUEUE_LEN;
    dev->flags = IFF_BROADCAST|IFF_MULTICAST;
    dev->priv_flags |= IFF_TX_SKB_SHARING;

    eth_broadcast_addr(dev->broadcast);
}
EXPORT_SYMBOL(ether_setup);
As the code shows, alloc_etherdev() is the higher-level function. When it is invoked to set up an Ethernet device, it passes the Ethernet-specific arguments (the "eth%d" name template and the ether_setup() callback) into alloc_netdev_mqs().
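Both "eth%d" here and "dummy%d" in the dummy module are name templates: when the name is resolved, the kernel picks the first index that is not yet in use. The user-space sketch below models that idea only; toy_alloc_name() and its parameters are hypothetical, and the real implementation works on the kernel's own device list.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_IFNAMSIZ 16  /* room for the interface name plus the NUL byte */

/*
 * Expands a template such as "eth%d" to the first name that is not in
 * `taken`. Writes the name to `out` (at least TOY_IFNAMSIZ bytes) and
 * returns the chosen index, or -1 if the template has no "%d", the name
 * would be too long, or no free index is found below 1000.
 */
int toy_alloc_name(const char *tmpl, const char *const *taken,
                   size_t ntaken, char *out)
{
    const char *p;
    int i;

    assert(tmpl != NULL && out != NULL);
    p = strstr(tmpl, "%d");
    if (p == NULL)
        return -1;

    for (i = 0; i < 1000; i++) {
        size_t j;
        int in_use = 0;
        /* prefix before "%d", then the index, then whatever follows "%d" */
        int n = snprintf(out, TOY_IFNAMSIZ, "%.*s%d%s",
                         (int)(p - tmpl), tmpl, i, p + 2);

        if (n < 0 || n >= TOY_IFNAMSIZ)
            return -1;
        for (j = 0; j < ntaken; j++)
            if (strcmp(taken[j], out) == 0)
                in_use = 1;
        if (!in_use)
            return i;
    }
    return -1;
}
```

With existing interfaces eth0, eth1, and dummy0, the template "eth%d" resolves to eth2 and "dummy%d" resolves to dummy1.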