Cilium

General

Open-source, cloud native solution for providing, securing, and observing network connectivity between workloads, built on top of eBPF. It was created to address the networking, security, and observability challenges of cloud native environments. Its main advantage is the ability to implement networking, observability, and security features directly in the kernel.

Cilium implements a simple flat Layer 3 network. By default an overlay networking model is used; it requires minimal deployment effort, only IP connectivity between hosts, and traffic is encapsulated for transport between hosts. Native routing is also supported, where the regular routing tables on the nodes are used to route traffic to Pods; this works well in native IPv6 networks, in conjunction with cloud network routers, or with pre-existing routing daemons.
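
As a rough sketch, the routing mode can be selected via Helm values similar to the following (option names vary between Cilium versions, e.g. older charts use tunnel instead of routingMode; the CIDR below is purely illustrative):

# Overlay (default): encapsulate traffic between nodes
routingMode: tunnel
tunnelProtocol: vxlan

# Native routing: use node routing tables instead
#routingMode: native
#ipv4NativeRoutingCIDR: 10.0.0.0/8
#autoDirectNodeRoutes: true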

Life of a packet.

Terminology

A Cilium Endpoint is a set of application containers that share a common IP address. This matches the Pod concept in Kubernetes, so in a Kubernetes environment a Cilium Endpoint is essentially a Pod.

A Cilium identity is determined by labels and is unique cluster-wide. An endpoint is assigned an identity based on its security relevant labels; endpoints that share the same set of security relevant labels share the same identity. A unique numeric identifier is associated with each identity and is what eBPF programs and Hubble operate on.

Security relevant labels are the meaningful labels, excluding metadata such as the creation timestamp. The user provides a list of string prefixes that mark labels as meaningful; the standard behavior is the id prefix, e.g. id.service1, id.service2.

Each agent is responsible for updating eBPF maps with the numeric identities of endpoints running locally, by watching the relevant Kubernetes resources.
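
Endpoints and identities are exposed as Kubernetes resources (Cilium CRDs) and through the agent CLI; the commands below are a small sketch for inspecting them:

# List Cilium endpoints (roughly one per Pod) and their identities
$ kubectl get ciliumendpoints -A

# List cluster-wide identities derived from security relevant labels
$ kubectl get ciliumidentities

# Inspect identities known to a particular agent
$ kubectl -n kube-system exec -ti ds/cilium -c cilium-agent -- cilium identity list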

Docs.

Architecture

Architecture diagram, component overview.

Cilium operator (a single instance) - manages duties that are handled once for the entire cluster. It is not part of the critical path for forwarding or network policy decisions, and may be briefly unavailable without affecting the cluster.

Cilium agent (daemonset):

  • synchronize cluster state with the Kubernetes API server
  • load eBPF programs and update eBPF maps via the Linux kernel
  • learn about newly scheduled workloads from the CNI plugin executable via a filesystem socket
  • create DNS and Envoy proxies
  • create Hubble gRPC services when Hubble is enabled

Cilium client (CLI) - installed alongside Cilium agent, mostly used to inspect the state of agent (client communicates via agent's REST API).
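
For example, the agent's state can be inspected from inside the agent container using standard client subcommands:

# Overall agent health, controller status, kube-proxy replacement, encryption, etc.
$ kubectl -n kube-system exec -ti ds/cilium -c cilium-agent -- cilium status

# Endpoints managed by this agent, their identities and policy enforcement state
$ kubectl -n kube-system exec -ti ds/cilium -c cilium-agent -- cilium endpoint list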

Cilium CNI plugin - installed by the agent on the node's filesystem and invoked by Kubernetes when a pod is scheduled or terminated on the node; the agent also reconfigures the node's CNI to make use of the newly installed plugin. When required, the plugin communicates with the agent via a filesystem socket.


Hubble server (embedded into Cilium agent) - runs on each node and retrieves eBPF-based visibility data from Cilium. Offers a gRPC service to retrieve flows, and Prometheus metrics.

Hubble relay - standalone component that is aware of all running Hubble servers and offers cluster-wide visibility by connecting to their respective gRPC APIs and exposing an API that represents the entire cluster. It acts as an intermediary between the Hubble gRPC services and Hubble observers. When it is enabled, Cilium agents are restarted to enable their gRPC services, and the Hubble Observer service and Hubble Peer service are added alongside; the Peer service allows the relay to detect new Hubble-enabled Cilium agents. Users interact with the Observer via the UI and CLI.

Cilium Cluster Mesh API server (optional, deployed when Cluster Mesh is enabled) - allows Kubernetes services to be shared amongst multiple clusters. It deploys an etcd key-value store in each cluster to hold information about Cilium identities, and exposes a proxy service for each of these etcd stores. Cilium agents running in any cluster that is a member of the same Cluster Mesh can securely read from each cluster's etcd proxy, gaining knowledge of Cilium identity state globally across the mesh. This makes it possible to create and access global services that span the Cluster Mesh.

IPAM modes.

Network policy

Network policies define which workloads are permitted to communicate with each other. Cilium assigns an identity to a group of applications based on their labels, e.g. Kubernetes labels, and enforces policy on identities rather than on IP-based Linux firewall rules, which scales better. Cilium supports multiple network policy formats; all can be used at the same time, but this might lead to unintended behavior.

  • standard Kubernetes NetworkPolicy (supports layer 3 and 4)
  • CiliumNetworkPolicy (supports layer 3, 4, and 7)
  • CiliumClusterwideNetworkPolicy (applies to entire cluster)
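
As a point of reference, a plain Kubernetes NetworkPolicy can only express L3/L4 rules; the sketch below (names and labels are illustrative, reusing the labels from the examples later in this page) allows Pods labelled class=tiefighter to reach Pods labelled class=deathstar on TCP port 80:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tiefighter-to-deathstar
spec:
  podSelector:
    matchLabels:
      class: deathstar
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              class: tiefighter
      ports:
        - protocol: TCP
          port: 80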

CiliumNetworkPolicy is an extension of NetworkPolicy. It allows defining rules such as "allow HTTP GET to /foo/bar" or "require HTTP header X-Foo in all requests". Additional capabilities are:

  • L7 HTTP policy rules, limiting Ingress and Egress to specific HTTP paths
  • Additional Layer 7 protocols, e.g. DNS, Kafka, gRPC
  • Service name based Egress policy for internal cluster communication
  • L3/L4 Ingress/Egress using entity matching
  • L3 Ingress and Egress policy using DNS FQDN matching

When an L7 policy is applied and active for any endpoint on a node, Cilium starts a node-local HTTP proxy and instructs the eBPF programs to direct matching traffic to this proxy, so that it can interpret and apply the policy rules. The proxy also provides L7 observability. Path, Method, Host, and Headers can be specified to match traffic; fields that are omitted match everything. L7 policy effectively extends L4: start with an L4 rule, then add a rules section to define the L7 logic.
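
A minimal sketch of such a policy, using the same illustrative labels as above and an illustrative path: L4 ingress to the deathstar Pods on TCP port 80 is narrowed down to HTTP POST requests to a single path.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: deathstar-l7-policy
spec:
  endpointSelector:
    matchLabels:
      class: deathstar
  ingress:
    - fromEndpoints:
        - matchLabels:
            class: tiefighter
      toPorts:
        - ports:
            - port: "80"
              protocol: TCP
          rules:
            http:
              - method: POST
                path: /v1/request-landing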

Whether to write Ingress or Egress policy depends on the intent:

  • Ingress - control which Pods can initiate communication with a particular service or endpoint
  • Egress - control which destinations a Pod can send traffic to

Observability

Hubble is the dedicated network observability component of Cilium. The identity concept is used to easily identify and filter traffic. It includes:

  • Visibility into Layer 3/4 (IP address and port) and Layer 7 (API protocol)
  • Event monitoring, e.g. dropped packet includes a lot of metadata such as full label information of sender and receiver
  • Prometheus metrics
  • Graphical UI for network traffic

Provides answers to:

  • Service dependency and communication map - which services communicate with each other, how frequently, what the dependency graph looks like, what HTTP calls were made
  • Network monitoring and alerting - whether and why network communication is failing; is the problem DNS, Layer 4 or Layer 7
  • Application monitoring - rate of 4xx or 5xx codes for a particular service, latency between applications
  • Security observability - which connections have been blocked due to network policy, which services were accessed outside of cluster

Hubble components

Hubble server - runs on each node as part of Cilium agent operations. Implements gRPC observer service, which provides access to network flows on a node; implements gRPC peer service used by Hubble Relay to discover peer Hubble servers.

Hubble peer (service) - used by Hubble Relay to discover Hubble servers.

Hubble relay (deployment) - maintains gRPC API connections to all Hubble servers (discovered via the peer service) and exposes the cluster-wide observability API.

Hubble relay (service) - used by Hubble UI, can be exposed to be used by Hubble CLI as well.

Hubble UI (deployment) - backend for UI.

Hubble UI (service) - endpoint for client.
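
Assuming cilium-cli is installed locally, the UI can be reached through a port-forward:

# Port-forwards the hubble-ui service and opens it in a browser
$ cilium hubble ui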


Hubble CLI

To install Hubble components via CLI run cilium hubble enable --ui. Hubble CLI cheatsheet.

While inside the Cilium agent container, run the following command to see the flow of events on that particular node:

$ hubble observe --follow

# Get help page describing all possible filter options
$ hubble observe --help

# Example with from and to label arguments to filter out communication between
# particular set of pods
$ kubectl -n kube-system exec -ti pod/cilium-<hash> -c cilium-agent -- hubble observe --from-label "class=tiefighter" --to-label "class=deathstar" --verdict DROPPED --last 1 -o json

Hubble CLI can also be installed locally to observe cluster-wide flows.

# First hubble service should be exposed locally
$ cilium hubble port-forward &

# Verify that hubble is accessible
$ hubble status

# Sample command
$ hubble observe --to-label "class=deathstar" --verdict DROPPED --all

Network flow

Similar to a network packet, but designed to help understand how the packet flows through the cluster; it includes context information: where the packet is coming from, where it is going, and whether it was dropped or forwarded. Since IP addresses are ephemeral in a Kubernetes environment, flows provide more durable context information. This context can also be exposed as labels in Prometheus metrics.

Features

Other features that are enabled/provided by Cilium.

Service mesh

TODO:

Cluster mesh

Common use cases:

  • HA - running multiple clusters in different regions or zones covers complete or temporary unavailability of a failure domain, or a misconfiguration.
  • Shared services - a fairly common practice is to build a cluster per tenant or per service, e.g. due to different security requirements. Managing common services such as monitoring or secrets management in a single cluster, which all other services/tenants have access to, reduces maintenance overhead.
  • Splitting stateful from stateless services - since stateless workloads are more agile and simpler to scale and migrate, it is easier to manage them separately, thereby isolating the dependency complexity of stateful applications to a smaller number of clusters.

Cilium supports both Kubernetes Ingress and the Gateway API to provide a fully functional service mesh. With Cluster Mesh, multiple clusters are effectively merged into one large unified network. Provided features:

  • Pod IP routing across clusters at native performance, via tunneling or direct-routing
  • Transparent discovery of globally available Kubernetes services
  • Network policy enforcement (either Kubernetes NetworkPolicy or CiliumNetworkPolicy)
  • Transparent encryption

Requirements:

  • Nodes across clusters have unique IPs and connectivity between each other
  • All clusters must be assigned unique podCIDR ranges to avoid pod IP overlap across the mesh
  • The network between clusters must allow inter-cluster communication so Cilium agents can access all Cluster Mesh API servers in the mesh; in managed Kubernetes services in public clouds, make sure the firewall requirements are fulfilled. The exact requirements depend on whether Cilium is configured to run in direct-routing or tunneling mode.

The architecture is based on the Cluster Mesh API server and a read-only etcd. Each cluster runs its own replica of these components; Cilium agents watch the Cluster Mesh API servers of the other clusters for changes and replicate the multi-cluster state locally (access to the API server is protected via TLS certificates). State from multiple clusters is never mixed; one cluster has read-only access to another cluster. Configuration occurs via Kubernetes secrets, which contain the address information of the remote etcd proxies, the cluster name, and the certificates required for access.

The behavior of global services is controlled by the following annotations on Kubernetes services (see the example after this list):

  • service.cilium.io/global set to true declares a service global (services in other clusters must be defined with an identical name and namespace).
  • service.cilium.io/shared (defaults to true) includes a service in global load-balancing; setting it to false excludes the service from global load-balancing (only meaningful if the global annotation is set to true).
  • service.cilium.io/affinity can be set to local, remote, or none (the default). With local, remote endpoints are used only if local ones are unavailable or unhealthy (effectively a fail-over from local to remote); remote is the opposite, useful during maintenance or other expected disruptions.
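
A small sketch of a global service definition (name and namespace are illustrative and must match in every participating cluster):

apiVersion: v1
kind: Service
metadata:
  name: rebel-base
  namespace: default
  annotations:
    service.cilium.io/global: "true"
    service.cilium.io/shared: "true"
spec:
  selector:
    app: rebel-base
  ports:
    - port: 80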

Network policies can utilize labels that include cluster name information, effectively introducing cross-cluster rules. The user is still responsible for deploying policies in the correct cluster(s) based on intent.

The transparent encryption feature must be either enabled or disabled consistently across all clusters; otherwise, a cluster without encryption won't be able to communicate with an encrypted cluster.

During installation, use the cluster.name and cluster.id Helm chart properties to set a unique cluster name and ID. After installing Cilium in the first cluster, all subsequent clusters must use the same CA that Cilium generated for the first cluster, because Cluster Mesh uses mTLS to secure access between Cluster Mesh API servers. With Helm there is also an option to prepare the CA separately beforehand.

# Install Cilium in the second cluster (remaining flags depend on the environment)
$ cilium install --context=$CLUSTER2 ...

# Copy the CA generated in the first cluster into the second one
$ kubectl --context=$CLUSTER1 get secret -n kube-system cilium-ca -o yaml | kubectl --context $CLUSTER2 create -f -

# Enable Cluster Mesh in both clusters
$ cilium clustermesh enable --service-type NodePort --context $CLUSTER1
$ cilium clustermesh enable --service-type NodePort --context $CLUSTER2
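
Once Cluster Mesh is enabled in both clusters, cilium-cli can be used to connect them and to wait for the mesh to become ready (a sketch; the connection only needs to be initiated from one side):

# Connect the two clusters
$ cilium clustermesh connect --context $CLUSTER1 --destination-context $CLUSTER2

# Wait for the Cluster Mesh to become ready
$ cilium clustermesh status --context $CLUSTER1 --wait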

Encryption

Transparent encryption may be needed when the network that the Kubernetes setup runs on is not trusted. Compliance standards such as PCI and HIPAA have also started to require encryption of data transmitted between networked services. Cilium provides this via WireGuard or IPsec (protocols that provide in-kernel transparent traffic encryption). Only traffic between nodes inside the cluster is encrypted (external traffic and node-local communication are not affected)! WireGuard is a lightweight Virtual Private Network solution built into the Linux kernel; it is a peer-based VPN that works by exchanging public keys, similar to SSH keys. IPsec is a similar, but older, FIPS-compliant solution.

Configuration tutorial.

Helm values to enable encryption:

encryption:
  enabled: true
  type: wireguard

Verification steps:

# Might need to restart daemonset after Helm upgrade to apply changes
$ kubectl rollout restart daemonset/cilium -n kube-system

# Verify that the number of peers matches the number of Cilium-enabled nodes
$ kubectl exec -n kube-system -ti ds/cilium -- cilium status | grep Encryption
# Also new network device is created, cilium_wg0
$ kubectl exec -n kube-system -ti ds/cilium -- ip link | grep cilium

Load balancing

Cilium can fully replace kube-proxy for this purpose, and can even be used as a standalone load balancer; the feature is implemented in eBPF using efficient hash tables.

kube-proxy implements the Kubernetes service model by adjusting the iptables ruleset, usually with multiple iptables rules for each backend a service is serving; with every added service the list of rules that each packet has to traverse sequentially keeps growing, leading to performance issues in large clusters.

By default Cilium only handles per-packet in-cluster load balancing of ClusterIP services, while kube-proxy handles NodePort and LoadBalancer services and ExternalIPs. Cilium can perform all these tasks, including HostPort allocations if containers define them.

Cilium CLI can detect absence of kube-proxy and modify Helm template configuration during installation.

Helm values example:

kubeProxyReplacement: strict
k8sServiceHost: <api-server host>
k8sServicePort: <api-server port>
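
Whether kube-proxy replacement is active can then be verified in the agent status output, for example:

$ kubectl -n kube-system exec -ti ds/cilium -- cilium status | grep KubeProxyReplacement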

Installation

Via CLI:

# Automatically reads current kubectl context and identifies cluster info
# such as kind of cluster, components present, etc
$ cilium install

# Get status on all Cilium components
$ cilium status

# Enable UI, status can be checked with previous command
$ cilium hubble enable --ui

# Test the setup; expect it to take at least 10 minutes (around 50 minutes on a 3-node EKS setup)
$ cilium connectivity test --request-timeout 30s --connect-timeout 10s

CLI cheatsheet.

Metrics

Both Cilium and Hubble can expose Prometheus metrics, independently of each other. Cilium provides information on how its own components are operating, while Hubble provides information on network performance and flows. Configuring metrics collection.

Cilium operator and agent Prometheus metrics are enabled via Helm chart options, which start an embedded Prometheus metrics server and annotate the pods for easy discovery. Additionally, a headless cilium-agent Kubernetes Service is defined. Agent exposed metrics, Operator exposed metrics. Helm values for enabling metrics:

prometheus:
  enabled: true
operator:
  prometheus:
    enabled: true

Hubble exposed metrics. When Hubble metrics are enabled, an annotated headless hubble-metrics Kubernetes Service is also created for Prometheus discovery. Since no Hubble metrics are enabled by default, the desired metrics have to be configured explicitly, including which parts of the flow context are mapped to Prometheus labels. Because Hubble provides very rich context, not all of it should be mapped to labels; this is configurable. Source and destination labels can be filled with whatever context information fits the concrete case best, e.g. sourceContext=ip to represent the source by its IP address. Additional labels can be populated from flow information using the labelContext option. Sample Helm values:

hubble:
  enabled: true
  metrics:
    enabled:
      - dns
      - drop
      # Add context information (as Prometheus labels)
      #- drop:sourceContext=pod;destinationContext=pod
      - tcp
      - flow
      - port-distribution
      - httpV2
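
If the Prometheus Operator is used, the chart can also create ServiceMonitor objects instead of relying on pod annotations; a sketch assuming the serviceMonitor options of recent chart versions:

prometheus:
  serviceMonitor:
    enabled: true
operator:
  prometheus:
    serviceMonitor:
      enabled: true
hubble:
  metrics:
    serviceMonitor:
      enabled: true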

Follow ups
