load balancer

Traefik

General

Maglev Google

  • Maglev: A Fast and Reliable Software Network Load Balancer
  • Availability and reliability are enhanced as the system provides N+1 redundancy.
  • The system as a whole must also provide connection persistence: packets belonging to the same connection should always be directed to the same service endpoint.
  • Maglev associates each Virtual IP address (VIP) with a set of service endpoints and announces it to the router over BGP.
  • When the router receives a VIP packet, it forwards the packet to one of the Maglev machines in the cluster through ECMP, since all Maglev machines announce the VIP with the same cost.
  • When the Maglev machine receives the packet, it selects an endpoint from the set of service endpoints associated with the VIP, and encapsulates the packet using Generic Routing Encapsulation (GRE) with the outer IP header destined to the endpoint.
  • We use Direct Server Return (DSR) to send responses directly to the router so that Maglev does not need to handle returning packets, which are typically larger in size.
  • Each Maglev machine contains a controller and a forwarder.
  • On each Maglev machine, the controller periodically checks the health status of the forwarder. Depending on the results, the controller decides whether to announce or withdraw all the VIPs via BGP (a toy controller loop is sketched after this list).
  • The forwarder receives packets from the NIC (Network Interface Card), rewrites them with proper GRE/IP headers and then sends them back to the NIC. The Linux kernel is not involved in this process.
  • We engineered it to forward packets at line rate – typically 10Gbps in Google’s clusters today. This translates to 813Kpps (packets per second) for 1500-byte IP packets; assuming IP packet size is 100 bytes on average, the forwarder must be able to process packets at 9.06Mpps. (Both figures include roughly 38 bytes of Ethernet framing per packet: 10 Gbps ÷ (1538 B × 8) ≈ 813 Kpps and 10 Gbps ÷ (138 B × 8) ≈ 9.06 Mpps.)
  • When Maglev is started, it pre-allocates a packet pool that is shared between the NIC and the forwarder (a minimal pool sketch follows this list).
  • We pin each packet thread to a dedicated CPU core to ensure best performance.
  • Normally it takes the packet thread about 350ns to process each packet on our standard servers.
  • The maximum number of packets that Maglev can buffer is the size of the packet pool; beyond that the packets will be dropped by the NIC.
  • Assuming the packet pool size is 3000 and the forwarder can process 10Mpps, it takes about 300µs to process all buffered packets. Hence a maximum of 300µs delay may be added to each packet if Maglev is heavily overloaded.
  • For connection-oriented protocols such as TCP, it is critical to send all packets of a connection to the same backend (see the connection-tracking sketch after this list).
  • With these considerations in mind, we developed a new consistent hashing algorithm, which we call Maglev hashing (a sketch of its lookup-table construction follows this list).
  • For example, if a large datagram is split into two fragments, the first fragment will contain both L3 and L4 headers while the second will only contain the L3 header. Thus when Maglev receives a non-first fragment, it cannot make the correct forwarding decision based only on that packet’s headers.
  • To handle fragments, Maglev redirects them to a Maglev machine chosen by hashing the 3-tuple of the L3 header (source address, destination address, protocol). Since all fragments belonging to the same datagram contain the same 3-tuple, they are guaranteed to be redirected to the same Maglev. We use the GRE recursion control field to ensure that fragments are only redirected once.
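
A minimal sketch of the Maglev hashing lookup-table construction, to make the population step concrete. The FNV-based hash functions, the backend names and the tiny table size are stand-ins; the idea from the paper is that each backend walks its own pseudo-random permutation of table slots, defined by an (offset, skip) pair derived from two hashes of its name, and the backends take turns claiming their next free slot.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// seededHash returns a 64-bit FNV-1a hash of s mixed with a seed, standing
// in for the two hash functions used to derive offset and skip.
func seededHash(s string, seed uint64) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%d|%s", seed, s)
	return h.Sum64()
}

// buildLookupTable fills a lookup table of size m (a prime, much larger than
// the number of backends) with backend indices. Backends take turns claiming
// the next free slot in their own permutation, which yields nearly uniform
// load and minimal disruption when the backend set changes.
func buildLookupTable(backends []string, m uint64) []int {
	n := uint64(len(backends))
	offset := make([]uint64, n)
	skip := make([]uint64, n)
	for i, b := range backends {
		offset[i] = seededHash(b, 1) % m
		skip[i] = seededHash(b, 2)%(m-1) + 1
	}

	entry := make([]int, m)
	for j := range entry {
		entry[j] = -1 // -1 means the slot is still unclaimed
	}
	next := make([]uint64, n) // position of each backend in its permutation

	for filled := uint64(0); filled < m; {
		for i := uint64(0); i < n && filled < m; i++ {
			// Walk backend i's permutation until a free slot is found.
			c := (offset[i] + next[i]*skip[i]) % m
			for entry[c] >= 0 {
				next[i]++
				c = (offset[i] + next[i]*skip[i]) % m
			}
			entry[c] = int(i)
			next[i]++
			filled++
		}
	}
	return entry
}

func main() {
	backends := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"} // hypothetical endpoints
	fmt.Println(buildLookupTable(backends, 13))              // tiny prime table size for the demo
}
```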
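
A sketch of how a forwarder could pick a service endpoint per packet under the constraints above: hash the connection 5-tuple into the lookup table, and cache the result in a local connection-tracking table so in-flight connections keep their endpoint even if the backend set (and hence the table) changes. The types and helper names here are assumptions for illustration, not the paper's actual data structures.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// FiveTuple identifies a TCP/UDP connection.
type FiveTuple struct {
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Proto            uint8
}

// Forwarder is a toy model of per-machine forwarder state: the Maglev
// lookup table (indices into backends) plus a local connection-tracking
// table that pins existing connections to their chosen endpoint.
type Forwarder struct {
	backends  []string
	table     []int // e.g. built by the lookup-table sketch above
	connTrack map[FiveTuple]string
}

// hash5Tuple hashes the connection identity; FNV-1a stands in for whatever
// hash the real forwarder uses.
func hash5Tuple(t FiveTuple) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s|%s|%d|%d|%d", t.SrcIP, t.DstIP, t.SrcPort, t.DstPort, t.Proto)
	return h.Sum64()
}

// PickEndpoint prefers the connection-tracking entry, so in-flight
// connections keep their endpoint even when the lookup table changes;
// new connections fall back to Maglev hashing.
func (f *Forwarder) PickEndpoint(t FiveTuple) string {
	if ep, ok := f.connTrack[t]; ok {
		return ep
	}
	ep := f.backends[f.table[hash5Tuple(t)%uint64(len(f.table))]]
	f.connTrack[t] = ep
	return ep
}

func main() {
	f := &Forwarder{
		backends:  []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"},
		table:     []int{0, 1, 2, 0, 1, 2, 0}, // stand-in for a real Maglev lookup table
		connTrack: map[FiveTuple]string{},
	}
	t := FiveTuple{SrcIP: "192.0.2.1", DstIP: "198.51.100.7", SrcPort: 40000, DstPort: 443, Proto: 6}
	fmt.Println(f.PickEndpoint(t)) // the same tuple always maps to the same endpoint
	fmt.Println(f.PickEndpoint(t))
}
```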
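
A toy model of the pre-allocated packet pool, assuming a simple buffered channel of fixed-size buffers; the real forwarder shares pool entries with the NIC through queues of packet pointers, which this sketch does not try to reproduce.

```go
package main

import "fmt"

// PacketPool models the pre-allocated pool of fixed-size packet buffers
// shared between the NIC and the forwarder: memory is allocated once at
// startup, and when the pool is exhausted new packets are dropped, which
// bounds both memory use and the worst-case queueing delay.
type PacketPool struct {
	free chan []byte
}

func NewPacketPool(size, bufSize int) *PacketPool {
	p := &PacketPool{free: make(chan []byte, size)}
	for i := 0; i < size; i++ {
		p.free <- make([]byte, bufSize)
	}
	return p
}

// Get hands out a free buffer; ok is false when the pool is empty,
// which corresponds to the NIC having to drop the packet.
func (p *PacketPool) Get() (buf []byte, ok bool) {
	select {
	case buf = <-p.free:
		return buf, true
	default:
		return nil, false
	}
}

// Put returns a buffer to the pool once the packet has been sent.
func (p *PacketPool) Put(buf []byte) {
	p.free <- buf
}

func main() {
	pool := NewPacketPool(3000, 1514) // pool size and buffer size are illustrative
	buf, ok := pool.Get()
	fmt.Println(len(buf), ok)
	pool.Put(buf)
}
```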
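
A toy version of the controller's announce/withdraw loop, assuming a hypothetical BGPSpeaker interface and a stubbed health check; the real controller speaks BGP to the router and health-checks the actual forwarder.

```go
package main

import (
	"fmt"
	"time"
)

// BGPSpeaker abstracts whatever BGP implementation announces routes to the
// router; the interface and method names are illustrative.
type BGPSpeaker interface {
	Announce(vip string) error
	Withdraw(vip string) error
}

// Controller periodically health-checks the local forwarder and announces
// or withdraws all of the machine's VIPs accordingly, so an unhealthy
// Maglev simply disappears from the router's ECMP set.
type Controller struct {
	vips      []string
	bgp       BGPSpeaker
	healthy   func() bool // health check of the local forwarder
	announced bool
}

func (c *Controller) runOnce() {
	switch ok := c.healthy(); {
	case ok && !c.announced:
		for _, vip := range c.vips {
			c.bgp.Announce(vip)
		}
		c.announced = true
	case !ok && c.announced:
		for _, vip := range c.vips {
			c.bgp.Withdraw(vip)
		}
		c.announced = false
	}
}

// Run checks the forwarder on a fixed interval.
func (c *Controller) Run(interval time.Duration) {
	for range time.Tick(interval) {
		c.runOnce()
	}
}

// logSpeaker is a stand-in BGPSpeaker for the demo.
type logSpeaker struct{}

func (logSpeaker) Announce(vip string) error { fmt.Println("announce", vip); return nil }
func (logSpeaker) Withdraw(vip string) error { fmt.Println("withdraw", vip); return nil }

func main() {
	c := &Controller{
		vips:    []string{"203.0.113.10", "203.0.113.11"},
		bgp:     logSpeaker{},
		healthy: func() bool { return true }, // stubbed forwarder health check
	}
	c.runOnce()
}
```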

Gorb

Seesaw

ECMP

  • Network routers distribute packets evenly to the Maglev machines via Equal Cost Multipath (ECMP).

Kernel bypass

Vulcand