Linux Kernel Networking in LinuxCon - lala7573/record_repo GitHub Wiki
Linux Kernel Networking Walkthrough
- LinuxCon 2015 : http://www.slideshare.net/ThomasGraf5/linuxcon-2015-linux-kernel-networking-walkthrough
- Getting packets from/to the NIC
- NAPI, Busy Polling, RSS, RPS, XPS, GRO, TSO
- Packet processing
- RX Handler, IP Processing, TCP Processing, TCP Fast Open
- Queuing from/to userspace
- Socket Buffers, Flow Control, TCP Small Queues
- Receive & Transmit Process
    NIC -> DMA -> RX Ring Buffer

    In the network stack (kernel space):
        RX path:
            read from the RX ring buffer
            parse L2 & IP
            if destined for a local address:
                parse TCP/UDP
                write into the socket buffer
            else:
                forward: route lookup, then write to the TX ring buffer
        TX path:
            read from the socket buffer
            construct the TCP/UDP header
            construct the IP header
            route lookup, then write to the TX ring buffer

    Process (user space):
        read/write on the socket buffer
- Three ways packets get from the ring buffer into the network stack:
    - Interrupt driven: each packet's IRQ pushes it straight from the ring buffer into the network stack.
    - NAPI-based polling (poll()): the network stack polls the ring buffer, amortizing interrupts across many packets.
    - Busy polling (busy_poll()): the task calls busy_poll() and spins, letting the ring buffer hand packets to the network stack with minimal latency.
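A process can opt into the third mode per socket via the SO_BUSY_POLL socket option. A minimal sketch, assuming a Linux kernel; Python's socket module may not expose the constant, so its Linux value 46 is hard-coded as a fallback:

```python
import socket

# SO_BUSY_POLL is Linux-specific; 46 is its value in <asm-generic/socket.h>.
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Current per-socket busy-poll budget in microseconds
# (0 = disabled, i.e. fall back to the net.core.busy_read sysctl).
busy = s.getsockopt(socket.SOL_SOCKET, SO_BUSY_POLL)
print("busy_poll usec:", busy)

try:
    # Ask the kernel to busy-poll the device queue for up to 50 us
    # on blocking reads of this socket.
    s.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, 50)
except PermissionError:
    pass  # raising the budget above the sysctl default needs CAP_NET_ADMIN
s.close()
```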
- RSS - Receive Side Scaling
    - For parallel processing, the NIC distributes incoming packets across several RX queues.
    - Each RX queue has its own IRQ, so the CPU that runs the hardware interrupt handler can be chosen per queue.
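The queue choice in RSS is typically a Toeplitz hash over the flow tuple; the low bits of the hash index the NIC's indirection table. A sketch of that hash, assuming the well-known Microsoft default 40-byte key (real NICs let you program their own, e.g. via ethtool -X):

```python
import socket
import struct

# The widely used Microsoft default RSS key.
RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])

def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Each set bit of the input selects a 32-bit sliding window of the
    key; the selected windows are XORed together."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                shift = key_bits - 32 - (i * 8 + b)
                result ^= (key_int >> shift) & 0xFFFFFFFF
    return result

# TCP/IPv4 hash input: src addr | dst addr | src port | dst port
flow = (socket.inet_aton("66.9.149.187") + socket.inet_aton("161.142.100.80")
        + struct.pack(">HH", 2794, 1766))
h = toeplitz_hash(RSS_KEY, flow)
print(hex(h))  # the NIC would use e.g. h % num_rx_queues
```

The flow tuple above is a test vector from Microsoft's RSS verification suite.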
- RPS - Receive Packet Steering
    - A software filter that selects the CPU number that will process each packet.
    - RX queue -> CPU mappings (RX queue N:M CPU)
    - Can distribute a single RX queue across multiple CPUs (RX queue 1:N CPU)
- Hardware Offload
    - RX/TX checksumming: the NIC performs the CPU-intensive checksum computation in hardware.
    - Virtual LAN filtering and tag stripping
        - Strip the 802.1Q header and store the VLAN ID in the packet's metadata.
        - Filter out unsubscribed VLANs.
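The checksum the hardware computes here is the 16-bit ones'-complement sum from RFC 1071; a sketch of the per-packet work the CPU would otherwise do:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: 16-bit ones'-complement sum,
    carries folded back in, then complemented."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry
    return ~total & 0xFFFF

# Worked example from RFC 1071 (bytes 00 01 f2 03 f4 f5 f6 f7).
cksum = internet_checksum(b"\x00\x01\xf2\x03\xf4\xf5\xf6\xf7")
print(hex(cksum))  # 0x220d
```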
    - Segmentation offload
- Generic Receive Offload (ethtool -K eth0 gro on)
    - Runs as part of the NAPI poll(): GRO merges the MTU-sized segments of a flow back into packets of 64K and beyond.
    - Processing 1 x 64K-byte packet is cheaper than processing 40 x 1500-byte packets.
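A rough userspace model of what GRO does during a poll() (gro_coalesce is a hypothetical helper, not a kernel API): consecutive segments of the same flow are glued together up to the 64K limit.

```python
def gro_coalesce(segments, max_size=65536):
    """Merge consecutive same-flow segments into super-packets
    of at most max_size bytes, mimicking GRO's per-poll batching."""
    merged = []
    for flow, payload in segments:
        if (merged and merged[-1][0] == flow
                and len(merged[-1][1]) + len(payload) <= max_size):
            merged[-1] = (flow, merged[-1][1] + payload)
        else:
            merged.append((flow, payload))
    return merged

# 40 MTU-sized segments of one flow collapse into a single 60 KB packet,
# so the stack runs its per-packet work once instead of 40 times.
batch = [("10.0.0.1:80->10.0.0.2:5000", b"\x00" * 1500)] * 40
out = gro_coalesce(batch)
print(len(out), len(out[0][1]))
```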
- Segmentation Offload
    - GSO (ethtool -K eth0 gso on): on the way from the network stack to the ring buffer, splits packets of up to 64K into MTU-sized segments in software.
    - TSO (ethtool -K eth0 tso on): the NIC itself performs the MTU-sized split during DMA, so the stack can hand down one large packet.
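The TX side is the mirror image of GRO (gso_segment below is a hypothetical helper): GSO/TSO defer splitting a large send into MTU-sized pieces — GSO in software just above the driver, TSO in the NIC itself.

```python
def gso_segment(payload: bytes, mtu: int = 1500) -> list:
    """Split one oversized buffer into MTU-sized segments, as GSO does
    on the way from the network stack to the TX ring buffer."""
    return [payload[i:i + mtu] for i in range(0, len(payload), mtu)]

segs = gso_segment(b"\x00" * 65536)   # a 64 KB send
print(len(segs), len(segs[-1]))       # 44 segments, last one partial
```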
- Packet Processing
    DMA
    -> link layer (here packet sockets with ETH_P_ALL see the frame -> tcpdump)
    -> ingress QoS
    -> RX handler (Bridge, Open vSwitch, Team, Bonding, macvlan, macvtap)
    -> protocol handler (IPv4, IPv6, ARP, IPX, etc.)
       (every RX handler except macvtap can re-inject the frame into the link layer, where taps such as tcpdump see it again)
    -> drop (possible at any stage)
- IP Processing
    ip_rcv() -> routing decision
        -> local:     ip_local_deliver() -> TCP/UDP processing
        -> non-local: ip_forward() -> ip_output()
    locally generated packets: ip_queue_xmit() -> ip_local_out() -> ip_output() -> driver
- TCP Fast Open (net.ipv4.tcp_fastopen)
    - (The "HTTP GET" in the diagram is simply example application data carried in the first data segment.)
    - A regular open pays a full three-way handshake before data flows; Fast Open reuses a cookie from an earlier connection to carry data on the SYN:

    client                 server
    1st request  -> SYN
    (2x RTT)     <- SYN+ACK+Cookie
                 -> ACK + HTTP GET
                 <- Data
    2nd request  -> SYN + Cookie + HTTP GET
    (1x RTT)     <- SYN+ACK + Data
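On Linux the server side of Fast Open is enabled per listening socket with the TCP_FASTOPEN option (value = length of the pending-TFO queue). A minimal sketch, assuming a Linux kernel; the constant is hard-coded as 23 in case the Python build does not expose it. A client would additionally pass MSG_FASTOPEN to sendto() so its first payload rides on the SYN.

```python
import socket

TCP_FASTOPEN = getattr(socket, "TCP_FASTOPEN", 23)  # 23 on Linux

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
# Allow up to 16 connections whose SYN carried cookie+data to be
# queued before their handshakes complete.
srv.setsockopt(socket.IPPROTO_TCP, TCP_FASTOPEN, 16)
srv.listen(8)
port = srv.getsockname()[1]
print("TFO-enabled listener on port", port)

# Client side (sketch, not executed): data travels in the SYN
#   c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
#   c.sendto(b"GET / HTTP/1.0\r\n\r\n", socket.MSG_FASTOPEN,
#            ("127.0.0.1", port))
srv.close()
```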
- Memory Accounting & Flow Control (net.ipv4.tcp_{r|w}mem)
    TX side:
        ssh -> write()
        if wmem over limit: block (or return EWOULDBLOCK) and retry the write()
        wmem += packet size
        write into the socket buffer
        TCP/IP reads the socket buffer, builds the segments
        TCP/IP writes to the TX ring buffer
        wmem -= packet size
    -> DMA ->
    RX side:
        TCP/IP reads the RX ring buffer
        if rmem over limit: stop reading and reduce the advertised TCP window
        rmem += packet size
        write into the socket buffer
        (application reads) rmem -= packet size
        -> ssh
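The wmem-over-limit branch above is observable from userspace: on a non-blocking socket, write() fails with EWOULDBLOCK once the send buffer is full. A sketch using a Unix socketpair as a stand-in for a TCP socket:

```python
import socket

wr, rd = socket.socketpair()
wr.setblocking(False)
wr.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)

sent = 0
try:
    while True:                 # nobody reads rd, so "wmem" fills up
        sent += wr.send(b"x" * 4096)
except BlockingIOError:         # errno EWOULDBLOCK/EAGAIN
    pass
print("queued", sent, "bytes before EWOULDBLOCK")
wr.close()
rd.close()
```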
- TCP Small Queues (net.ipv4.tcp_limit_output_bytes)
    - Example: ssh competing with a torrent download; TSQ keeps any single socket from monopolizing the queues below the stack.
    - TCP/IP -> Queuing Discipline -> Driver -> TX Ring Buffer -> DMA
    - TSQ: at most ~128 KB in flight per socket.
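The tunables mentioned throughout (tcp_rmem, tcp_wmem, tcp_limit_output_bytes) are all plain files under /proc/sys; a small helper to inspect them, assuming a Linux /proc:

```python
def read_sysctl(name: str) -> str:
    """Return the value of a sysctl by reading its /proc/sys file."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return f.read().strip()

tsq_limit = int(read_sysctl("net.ipv4.tcp_limit_output_bytes"))
print("TSQ limit:", tsq_limit, "bytes")
print("tcp_wmem (min default max):", read_sysctl("net.ipv4.tcp_wmem"))
```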