Linux Kernel Networking in LinuxCon

Linux Kernel Networking Walkthrough

  1. Getting packets from/to the NIC
  • NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO
  2. Packet processing
  • RX Handler, IP Processing, TCP Processing, TCP Fast Open
  3. Queuing from/to userspace
  • Socket Buffers, Flow Control, TCP Small Queues
  • Receive & Transmit Process
NIC -> DMA -> Ring Buffer

In Network Stack (Kernel Space)
    # Receive path
    Ring Buffer.read
    Parse L2 & IP
    if Local
        Parse TCP/UDP
        Socket Buffer.write
    else
        Forward
        if Route found
            Ring Buffer.write

    # Transmit path
    Socket Buffer.read
    Construct TCP/UDP
    Construct IP
    if Route found
        Ring Buffer.write

Process (User Space)
    Socket Buffer read/write
  • Three ways packets get from the ring buffer into the network stack (a busy-polling sketch follows below):
  1. Interrupt Driven : the NIC interrupts the CPU and packets are pushed one-way from the ring buffer into the network stack
  2. NAPI based Polling (poll()) : the network stack polls the ring buffer
  3. Busy Polling (busy_poll()) : the task itself busy-polls, draining the ring buffer into the network stack from the socket's receive path
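
A minimal sketch of opting a single socket into busy polling (assuming Linux >= 3.11, where SO_BUSY_POLL exists); the value is how many microseconds the kernel may spin on the device queue per receive:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* Busy-poll the NIC ring for up to 50 us per receive instead of
         * sleeping until an interrupt/NAPI wakeup (assumes SO_BUSY_POLL,
         * Linux >= 3.11). net.core.busy_poll / busy_read are the global knobs. */
        int busy_poll_usec = 50;
        if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                       &busy_poll_usec, sizeof(busy_poll_usec)) < 0)
            perror("setsockopt(SO_BUSY_POLL)");

        close(fd);
        return 0;
    }
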
  • RSS - Receive Side Scaling : for parallel processing the NIC spreads packets across several RX queues; by giving each RX queue its own IRQ, you choose which CPU runs the hardware interrupt handler for it.

  • RPS - Receive Packet Steering : a software filter that picks the CPU used for protocol processing (a sysfs sketch follows below)
  • RX queue -> CPU mappings (RX queue N:M CPU)
  • distributes a single queue across multiple CPUs (RX queue 1:N CPU)
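
A minimal sketch of enabling RPS for one queue, assuming an interface named eth0 with a queue rx-0; it is the same as running `echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus`:

    #include <stdio.h>

    int main(void)
    {
        /* Hex bitmask of CPUs allowed to do protocol processing for this
         * RX queue: "f" = CPUs 0-3. eth0 and rx-0 are example names only. */
        const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return 1; }
        fputs("f\n", f);
        fclose(f);
        return 0;
    }
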
  • Hardware Offload
  • RX/TX Checksumming : the CPU-intensive checksum calculation is done by the NIC hardware
  • Virtual LAN filtering and tag stripping
    • Strip the 802.1Q header and store the VLAN ID in the packet metadata
    • Filter out unsubscribed VLANs.
  • Segmentation Offload
  • Generic Receive Offload (ethtool -K eth0 gro on) : NAPI-based GRO; while poll() runs, GRO merges packets that were segmented at the MTU back into chunks of up to ~64K. Handling 1 x 64KB packet is cheaper than handling 40 x 1500-byte packets.

  • Segmentation Offload

    • ethtool -K eth0 gso on : on the way from the network stack to the ring buffer, large (up to 64K) packets are split into MTU-sized segments
    • ethtool -K eth0 tso on : the same split into MTU-sized segments, but done by the NIC at the DMA stage (an ethtool ioctl sketch follows below)
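
A sketch of querying the same knob programmatically via the legacy SIOCETHTOOL/ETHTOOL_GGSO ioctl; eth0 is an example name, and `ethtool -k eth0` reports the same state:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <linux/ethtool.h>
    #include <linux/sockios.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        /* ETHTOOL_GGSO reads the GSO on/off state; ETHTOOL_SGSO (and the
         * GRO/TSO equivalents) would change it, like "ethtool -K". */
        struct ethtool_value ev = { .cmd = ETHTOOL_GGSO };
        struct ifreq ifr;
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* example interface */
        ifr.ifr_data = (char *)&ev;

        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
            perror("SIOCETHTOOL");
        else
            printf("gso: %s\n", ev.data ? "on" : "off");

        close(fd);
        return 0;
    }
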
  • Packet Processing

 DMA
 -> link layer (as the frame crosses into it, a Packet Socket bound to ETH_P_ALL can tap it -> this is where tcpdump hooks in)
 -> Ingress QoS
 -> RX Handler (Bridge, Open vSwitch, Team, Bonding, macvlan, macvtap) -> Proto Handler (IPv4, IPv6, ARP, IPX, etc.) (except for macvtap, the RX handlers can re-inject the frame at the link-layer step, so it can be seen by tcpdump again)
 -> drop
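
A minimal sketch of that link-layer tap point: an AF_PACKET socket bound to ETH_P_ALL sees every frame, the same hook tcpdump/libpcap uses (needs root / CAP_NET_RAW):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main(void)
    {
        /* Raw packet socket: receives complete L2 frames for all protocols. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) { perror("socket(AF_PACKET)"); return 1; }

        unsigned char frame[2048];
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n < 0)
            perror("recv");
        else
            printf("captured %zd bytes\n", n);

        close(fd);
        return 0;
    }
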
  • IP Processing : on input, the routing decision either delivers locally or forwards; on output, a route lookup happens before queuing to the device (cf. the RX/TX pseudocode above; this input/output split was the least clear part of these notes)

  • TCP Fast Open (net.ipv4.tcp_fastopen) : in the regular case the HTTP GET is simply the first application payload carried over TCP. Normally a full three-way handshake is needed before data flows (SYN, SYN+ACK, ACK+HTTP GET, Data); with Fast Open:

            client                      server
1st req     -> SYN
(2x RTT)    <- SYN+ACK+Cookie
            -> ACK+HTTP GET
            <- Data

2nd req     -> SYN+Cookie+HTTP GET
(1x RTT)    <- SYN+ACK+Data
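
A rough client-side sketch of Fast Open, assuming a kernel with MSG_FASTOPEN (Linux >= 3.7), net.ipv4.tcp_fastopen allowing client use, and a server that enabled TCP_FASTOPEN on its listening socket; the address and payload are placeholders:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(80) };
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);   /* placeholder address */

        /* MSG_FASTOPEN both connects and queues the request, so with a cached
         * cookie the HTTP GET rides in the SYN; without one, the kernel falls
         * back to a normal three-way handshake. */
        const char req[] = "GET / HTTP/1.1\r\nHost: example\r\n\r\n";
        if (sendto(fd, req, sizeof(req) - 1, MSG_FASTOPEN,
                   (struct sockaddr *)&srv, sizeof(srv)) < 0)
            perror("sendto(MSG_FASTOPEN)");

        close(fd);
        return 0;
    }
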
  • Memory Accounting & Flow Control (net.ipv4.tcp_{r|w}mem)
ssh -> write()
if wmem over limit
    back to ssh -> write() : block, or return EWOULDBLOCK if non-blocking
wmem += packet size
write(Socket Buffer)
TCP/IP.read(Socket Buffer)
-> TCP/IP ->
TCP/IP.write(TX Ring Buffer)
wmem -= packet size
-> DMA ->
TCP/IP.read(RX Ring Buffer)
if rmem over limit
    back to TCP/IP.read(RX Ring Buffer) : reduce TCP window
rmem += packet size
write(Socket Buffer)
ssh.read(Socket Buffer)
rmem -= packet size
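
A sketch of the userspace side of that wmem check, assuming a non-blocking TCP socket; send_all() is a hypothetical helper, not a real API:

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* write() returns EAGAIN/EWOULDBLOCK when the send buffer (bounded by
     * net.ipv4.tcp_wmem / SO_SNDBUF) is full; on a blocking socket the same
     * condition simply blocks the caller instead. */
    int send_all(int fd, const char *buf, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            ssize_t n = write(fd, buf + off, len - off);
            if (n > 0) {
                off += (size_t)n;   /* bytes accepted into wmem */
            } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                usleep(1000);       /* wmem over limit: wait (or poll(POLLOUT)) and retry */
            } else {
                return -1;          /* real error */
            }
        }
        return 0;
    }
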
  • TCP Small Queues (net.ipv4.tcp_limit_output_bytes) : e.g. ssh and a torrent both sending; data flows TCP/IP -> Queuing Discipline -> Driver (TX Ring Buffer) -> DMA. TSQ caps what a single socket may have in flight in those queues at max 128KB, so a bulk flow (torrent) cannot fill the device queues at the expense of an interactive flow (ssh).
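
A small sketch that just reads the current TSQ limit from procfs (writing the same file, as root, changes it):

    #include <stdio.h>

    int main(void)
    {
        /* Same value as sysctl net.ipv4.tcp_limit_output_bytes: the per-socket
         * byte limit queued below TCP (qdisc + driver TX ring). */
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_limit_output_bytes", "r");
        if (!f) { perror("tcp_limit_output_bytes"); return 1; }

        char buf[64];
        if (fgets(buf, sizeof(buf), f))
            printf("tcp_limit_output_bytes = %s", buf);

        fclose(f);
        return 0;
    }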