
Direct Server Return with Cilium

Vegard S. Hagen

Direct Server Return (DSR) is a networking technique that allows a backend server to learn the original request’s source IP address and respond directly, even behind a load balancer. This results in reduced latency and improved throughput, as well as the ability to filter traffic based on client IP.

Compared to Cilium’s default Source Network Address Translation (SNAT) — which rewrites source IPs and routes traffic through the load balancer, DSR is often more complicated to configure, and support relies on the underlying network infrastructure.

Motivation
#

While digging into the Border Gateway Protocol (BGP), I stumbled upon the concept of DSR as a way of preserving the source IP address of client requests.

My main use-case for DSR is getting the correct client IP address for DNS requests, though the promise of reduced latency and overhead, and improved throughput, is also welcome.

Overview
#

This article will assume that you have LoadBalancer IP Address Management (LB-IPAM) configured with Cilium, as well as either L2 announcements with Address Resolution Protocol (ARP) or Border Gateway Protocol (BGP) configured.

We’ll be looking at Cilium in both Generic Network Virtualization Encapsulation (GENEVE) tunnelling and native routing modes.

Digging deeper, we’ll play around with the Service .spec.externalTrafficPolicy field and see how it affects the traffic routing in our different configurations.

To verify that we’ve got DSR working, we will be using Træfik’s whoami webserver to peek at the HTTP requests, as well as tcpdump inside netshoot containers to inspect traffic.

Cilium configuration
#

Before we begin properly, we need to configure Cilium to replace the kube-proxy.

Cilium has a great quick start guide on how to do this with Kubeadm. If you’re instead running K3s or Talos, I’ve explained the process in previous articles.
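The part of the configuration relevant to this article is the kubeProxyReplacement Helm value, which also appears in the complete values file in the Summary section

# cilium/values.yaml (excerpt)
kubeProxyReplacement: true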

Out-of-the-box, Cilium’s eBPF NodePort implementation operates in SNAT mode. Simplified, this means that node external requests are routed through the node for both incoming and outgoing traffic. I.e. the backend reply takes an extra hop before being returned to the client, losing the original client IP in the translation process. In DSR mode, however, the backend replies directly to the external client.

  
---
title: Source Network Address Translation vs Direct Server Return
---
flowchart LR
    subgraph SNAT["SNAT"]
        C1("Client")
        LB1("Load Balancer")
        P1("Backend Pod")
        C1 -->|"src = Client IP<br/>dst = Service IP"| LB1
        LB1 -->|"src = Node IP<br/>dst = Pod IP"| P1
        P1 -->|"src = Pod IP<br/>dst = Node IP"| LB1
        LB1 -->|"src = Service IP<br/>dst = Client IP"| C1
    end

    subgraph DSR["DSR"]
        C2("Client")
        LB2("Load Balancer")
        P2("Backend Pod")
        C2 -->|"src = Client IP<br/>dst = Service IP"| LB2
        LB2 -->|"src = Client IP<br/>dst = Pod IP"| P2
        P2 -->|"src = Service IP<br/>dst = Client IP"| C2
    end

    SNAT ~~~ DSR



There are several ways to configure Cilium to use DSR load balancing.

GENEVE Tunnelling
#

In an effort to maximise compatibility, Cilium uses Virtual eXtensible LAN (VXLAN) tunnelling1 by default, which unfortunately doesn’t support DSR.

To enable support for DSR in tunnelling mode, we have to instead use Generic Network Virtualization Encapsulation (GENEVE) tunnelling — a more extensible network encapsulation protocol than the protocol named for the property. We can change the tunnelling mode at a slight risk of losing compatibility with older hardware.

Assuming a Helm install of Cilium, we enable GENEVE tunnelling by configuring the tunnel protocol to be geneve on line 3

# cilium/values-tunnel.yaml
routingMode: tunnel
tunnelProtocol: geneve

loadBalancer:
  standalone: true
  mode: dsr
  dsrDispatch: geneve

bpf:
  lbModeAnnotation: true

DSR can now be enabled by setting loadBalancer.mode to dsr (line 7) and the loadBalancer.dsrDispatch option to geneve encapsulation (line 8) — which is the only DSR dispatch option supported by GENEVE tunnelling.

Note that there is also a hybrid load balancer mode, which is DSR for TCP and SNAT for UDP.

We also enable bpf.lbModeAnnotation (line 11) to allow us to control the load balancer mode per Service using the service.cilium.io/forwarding-mode annotation. We will use this annotation in the Testing section below to have both DSR and SNAT load balanced Services simultaneously.

You might also consider turning on the eXpress Data Path (XDP) based standalone load balancer (line 6) which reportedly improves performance.
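For reference, here is a rough sketch of what the per-Service override from the annotation above can look like. The Service name and selector are placeholders; the actual Services used for testing are listed in the Summary section.

# Sketch: overriding the load balancer mode for a single Service
apiVersion: v1
kind: Service
metadata:
  name: example
  annotations:
    service.cilium.io/forwarding-mode: snat # or "dsr"
spec:
  type: LoadBalancer
  selector:
    app: example
  ports:
    - name: web
      port: 80
      targetPort: 80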

Native Routing
#

Native routing offers lower overhead, but it relies on the underlying network infrastructure. This means that it might not be available in all environments, especially in a cloud setting.

As for DSR options, native routing in Cilium supports both GENEVE and IP-in-IP encapsulation (ipip), as well as IP options (opt) — listed in decreasing order of overhead.

To switch to native routing, we change the routingMode to native (line 2) and supply the Pod CIDR as the ipv4NativeRoutingCIDR (line 3) — or alternatively ipv6NativeRoutingCIDR if you prefer. Cilium assumes that networking for the given CIDR is preconfigured and directs traffic destined for that IP range to the Linux network stack without applying any SNAT. We then have to either manually configure routes to reach the pods, or set autoDirectNodeRoutes to true (line 4) for it to be done for us.

For the lowest possible overhead, we set the dsrDispatch to opt on line 10 for IP options, but both ipip and geneve are also supported depending on the network environment.

# cilium/values-native.yaml
routingMode: native
ipv4NativeRoutingCIDR: 10.244.0.0/16 # Talos PodCIDR
autoDirectNodeRoutes: true

loadBalancer:
  standalone: true
  algorithm: maglev
  mode: dsr
  dsrDispatch: opt

bpf:
  masquerade: true
  lbModeAnnotation: true

In an attempt to shoehorn more micro-optimisations into this article, I’ve also changed the load balancer algorithm to Maglev Consistent Hashing on line 8. Maglev hashing should improve resiliency in case of failures and offer better load balancing properties. It also sounds cool. The downside is that Maglev hashing uses more memory than the default random algorithm.

We also enable eBPF Host-Routing — which should increase throughput, by setting bpf.masquerade to true on line 13.

Broadcasting
#

Now that we’ve configured how to route traffic — either through tunnelling or using native routing, we need a way to broadcast the different endpoints to our network.

The simplest way to make Cilium do this for us is with ARP announcements. I’ve covered this in more detail in my article on migrating from MetalLB to Cilium.

Briefly summarised, it amounts to the following Cilium Helm values

# cilium/values-arp.yaml
l2announcements:
  enabled: true

k8sClientRateLimit:
  qps: 20
  burst: 100

together with a CiliumL2AnnouncementPolicy

# cilium/arp-announce.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: arp-announcement
  namespace: kube-system
spec:
  loadBalancerIPs: true
  serviceSelector:
    matchLabels:
      arp.cilium.io/announce-service: default

I’ve also prepared a dedicated IP pool which goes together with the ARP announcements.

# cilium/arp-ip-pool.yaml
apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-arp-ip-pool
spec:
  blocks:
    - start: 192.168.1.200
      stop: 192.168.1.255
  serviceSelector:
    matchLabels:
      arp.cilium.io/ip-pool: default

A more involved approach to IP broadcasting is to use BGP advertisements, though this requires that your networking hardware supports it. BGP is most commonly used between ISPs and in big data centres, as well as overengineered homelabs.

ARP is perfectly fine for smaller installations, though if you want to play around with BGP, I’ve written a separate article on how to configure BGP pairing with Cilium and UniFi equipment.

Skipping the implementation details, my BGP configuration boils down to a dedicated IP pool for BGP advertisements based on Service labels

# cilium/bgp-ip-pool.yaml
apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-bgp-ip-pool
spec:
  blocks:
    - cidr: 172.20.10.0/24
  serviceSelector:
    matchLabels:
      bgp.cilium.io/ip-pool: default

and a bgp.cilium.io/advertise-service: default label for advertising them, with the control plane nodes doing the broadcasting.
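For completeness, the Service-advertisement side of the BGP control plane can be expressed with a CiliumBGPAdvertisement resource along these lines. This is a rough sketch based on the cilium.io/v2alpha1 API; the actual peering configuration is covered in the linked BGP article and may differ from your setup.

# cilium/bgp-advertisement.yaml (sketch)
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertise-services
  labels:
    advertise: bgp-services
spec:
  advertisements:
    - advertisementType: Service
      service:
        addresses: [ LoadBalancerIP ]
      selector:
        matchLabels:
          bgp.cilium.io/advertise-service: default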

Testing
#

For deranged reasons (testing purposes, really), I’ve configured both ARP and BGP on my four-node cluster: three control-plane nodes, ctrl-00, ctrl-01, and ctrl-02, and one worker node, work-00. All four Nodes are schedulable for regular workloads.

  
---
title: Cluster
---
flowchart TB
    subgraph ctrl-01["ctrl-01"]
        vm01("IP: 192.168.1.101")
    end
    subgraph ctrl-02["ctrl-02"]
        vm02("IP: 192.168.1.102")
    end
    subgraph work-00["work-00"]
        vm03("IP: 192.168.1.110")
    end
    subgraph ctrl-00["ctrl-00"]
        vm00("IP: 192.168.1.100")
    end

    ctrl-00 ~~~ ctrl-02
    ctrl-01 ~~~ work-00



As a target we will use Traefik’s whoami webserver. This webserver replies with useful information about the request, including the remote IP address of the request as seen by the webserver. Since we also want to monitor the actual traffic, we embed the tiny Go webserver inside netshoot, a Docker + Kubernetes network troubleshooting swiss-army container.

Taking advantage of the image volume feature — introduced in Kubernetes v1.31, we can mount one container’s filesystem inside another. Support for image volumes also depends on the underlying container-runtime, introduced in containerd 2.1.0 and CRI-O 1.31. You might also need to explicitly enable the feature gate ImageVolume in your cluster as I’ve done in this commit.
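As an illustration, on a Talos cluster the feature gate can be enabled with a machine-config patch along these lines. This is a sketch of one possible approach; the linked commit shows the actual change, and other distributions will differ.

# talos/image-volume-patch.yaml (sketch)
cluster:
  apiServer:
    extraArgs:
      feature-gates: ImageVolume=true
machine:
  kubelet:
    extraConfig:
      featureGates:
        ImageVolume: true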

To construct our target container, we start with the netshoot image (line 14) and mount the whoami image (line 25) inside it under /whoami (line 21).

# test/pod-whoami.yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami
  namespace: dsr
  labels:
    app: whoami
spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
  containers:
    - name: netshoot
      image: ghcr.io/nicolaka/netshoot:v0.15
      command: [ /whoami/whoami ]
      ports:
        - name: http
          containerPort: 80
      volumeMounts:
        - name: whoami
          mountPath: /whoami
  volumes:
    - name: whoami
      image:
        reference: ghcr.io/traefik/whoami:latest

We then make sure to pin the Pod to a given Node using the nodeSelector field on line 11.

Next, we construct a DaemonSet with the same netshoot image, and set hostNetwork to true (line 16) so we can get a TCP dump of the traffic on the node itself.

# test/ds-netshoot.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netshoot
  namespace: dsr
spec:
  selector:
    matchLabels:
      app: netshoot
  template:
    metadata:
      labels:
        app: netshoot
    spec:
      hostNetwork: true
      containers:
        - name: netshoot
          image: ghcr.io/nicolaka/netshoot:v0.15
          command: [ tail, -f, /dev/null ]
You might have to add the pod-security.kubernetes.io/enforce: privileged label to the Namespace to allow running containers connected to the host network.

We will be testing the difference between DSR and SNAT using the service.cilium.io/forwarding-mode Service annotation together with the two types of external traffic policies. An external traffic policy of type Cluster — which is the default, means that external traffic should be routed to all ready endpoints, whereas a Local external traffic policy means to route to only node-local endpoints.

The whoami Pod is exposed by eight different Services — four announced with ARP on the 192.168.1.0/24 subnet, and four advertised with BGP on the 172.20.10.0/24 subnet. The last octet indicates which combination of externalTrafficPolicy and forwarding mode is used, as shown in the table below.

externalTrafficPolicy | DSR  | SNAT
----------------------|------|-----
Cluster               | .200 | .201
Local                 | .210 | .211

The full Service definitions can be found in the Summary section.

We test using Cilium in both GENEVE tunnelling and native routing mode.

All requests are sent from a client on subnet 10.144.12.0/24, specifically 10.144.12.11. The PodCIDR is 10.244.0.0/16, and the service CIDR is 10.96.0.0/12. The nodes themselves are on subnet 192.168.1.0/24, specifically 192.168.1.{100,101,102,110}.

Probing the Network
#

We can run tcpdump inside both the host-network-attached netshoot containers and the whoami target Pod to listen in on traffic inside the Nodes and the target Pod. Cilium also has a built-in cilium monitor command that can be run inside the Cilium pods to display “notifications and events emitted by BPF programs attached to endpoints and devices”.

To get the full Pod name of the different Pods spawned by the DaemonSet on all the nodes, we can e.g. run

NETSHOOT_CTRL_00=$(kubectl get pod -n dsr -l app=netshoot \
                    --field-selector spec.nodeName=ctrl-00 -o name)
NETSHOOT_CTRL_01=$(kubectl get pod -n dsr -l app=netshoot \
                    --field-selector spec.nodeName=ctrl-01 -o name)
NETSHOOT_CTRL_02=$(kubectl get pod -n dsr -l app=netshoot \
                    --field-selector spec.nodeName=ctrl-02 -o name)
NETSHOOT_WORK_00=$(kubectl get pod -n dsr -l app=netshoot \
                    --field-selector spec.nodeName=work-00 -o name)

and do the same for the Cilium pods

CILIUM_CTRL_00=$(kubectl get pod -n kube-system -l k8s-app=cilium \
                  --field-selector spec.nodeName=ctrl-00 -o name)
CILIUM_CTRL_01=$(kubectl get pod -n kube-system -l k8s-app=cilium \
                  --field-selector spec.nodeName=ctrl-01 -o name)
CILIUM_CTRL_02=$(kubectl get pod -n kube-system -l k8s-app=cilium \
                  --field-selector spec.nodeName=ctrl-02 -o name)
CILIUM_WORK_00=$(kubectl get pod -n kube-system -l k8s-app=cilium \
                  --field-selector spec.nodeName=work-00 -o name)

With the pod names in hand, we can then execute commands inside them by running

kubectl exec -n dsr ${NETSHOOT_CTRL_00} -- tcpdump -n
kubectl exec -n kube-system ${CILIUM_CTRL_00} -- cilium monitor

Since networks can be very chatty, it might be an idea to grep for interesting IPs, or we can output traffic to a file to be analysed by a tool like Wireshark

kubectl exec -n dsr ${NETSHOOT_CTRL_00} -- tcpdump -n -U -w - > capture.pcap

Here -n tells tcpdump to output raw IPs and port numbers, while -U in tandem with -w - writes each packet to stdout as soon as it is received.
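To narrow the capture we can also filter on the client IP directly in tcpdump, e.g.

kubectl exec -n dsr ${NETSHOOT_CTRL_00} -- tcpdump -n host 10.144.12.11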

Capturing and analysing packets is left as a trivial exercise to the reader.2

Wireshark can be used to analyse the captured traffic.

ARP with GENEVE Tunnelling
#

For ARP, Cilium creates a lease for each Service to announce. This means that the leaseholder will announce the IP address, regardless of which node the target Pod is running on.

To find which node has a given Lease, we can run

❯ kubectl get leases -n kube-system
NAME                                        HOLDER 
cilium-l2announce-dsr-arp-cluster-dsr       ctrl-00
cilium-l2announce-dsr-arp-cluster-snat      ctrl-00
cilium-l2announce-dsr-arp-local-dsr         ctrl-00
cilium-l2announce-dsr-arp-local-snat        ctrl-00

Here we’ve rolled the dice until ctrl-00 holds all the leases as this makes the demonstration easier. The leaseholder node effectively acts as a load balancer for the Virtual IPs (VIPs) assigned to that Service.
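If the leases end up spread across several Nodes, one way to reshuffle them is to delete a Lease and let the Cilium agents race to re-acquire it (the new holder is still arbitrary), e.g.

kubectl delete lease -n kube-system cilium-l2announce-dsr-arp-cluster-dsr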

First, we place the whoami Pod on the same node as the leaseholder by making sure that

spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-00

Firing off cURL requests to the different ARP-announced IP addresses, the replies state that the RemoteAddr is that of the client IP (10.144.12.11), i.e. the Pod experiences the requests as coming from the original source IP address, regardless of external traffic policy and forwarding mode,

❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:53364

❯ curl 192.168.1.201
RemoteAddr: 10.144.12.11:63184

❯ curl 192.168.1.210
RemoteAddr: 10.144.12.11:63387

❯ curl 192.168.1.211
RemoteAddr: 10.144.12.11:63405

meaning that Cilium performs its eBPF magic, and we always experience DSR-like behaviour when the pod is on the same node as the load balancer node.

  
---
title: Traffic flow on same node
---
sequenceDiagram
    participant Client as Client
    participant Node as ctrl-00<br/>Load Balancer & Pod

    Client->>Node: SYN to VIP
    Note over Node: eBPF intercepts request<br/>DNAT: VIP → Pod IP
    Note over Node: Pod processes request
    Note over Node: eBPF intercepts reply<br/>Reverse DNAT: Pod IP → VIP
    Node->>Client: SYN-ACK from VIP

    Client->>Node: HTTP GET
    Note over Node: Local delivery + response
    Node->>Client: HTTP 200 OK



Things start getting more interesting when we move the target whoami Pod to a different Node than the leaseholder/load balancer Node. We move the Pod from ctrl-00 to ctrl-01 by changing the nodeSelector

spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-01

and re-create the Pod.
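Since the nodeSelector field can’t be changed on a running Pod, re-creating it amounts to something like

kubectl delete pod -n dsr whoami
kubectl apply -f test/pod-whoami.yaml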

Firing off the same cURL requests, we now get different replies

❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:60279

❯ curl 192.168.1.201
RemoteAddr: 10.244.1.112:60831

❯ curl 192.168.1.210
curl: (7) Failed to connect to 192.168.1.210 port 80 after 6 ms: Couldn't connect to server

❯ curl 192.168.1.211
curl: (7) Failed to connect to 192.168.1.211 port 80 after 7 ms: Couldn't connect to server

First note that the Services with externalTrafficPolicy: Local (.210 and .211) refuse to connect since the target Pod is no longer running on the same Node as the load balancer, and we pay them no more attention.

The second thing to note is that the SNAT-Service (.201) now replies with an IP in the Pod CIDR range (10.244.0.0/16) instead of the client IP (10.144.12.11). The request experiences SNAT through the load balancer Pod before it is tunnelled to the correct Node containing the target Pod.

  
---
title: SNAT with Tunnelling
---
sequenceDiagram
    participant Client as Client
    participant LB as ctrl-00<br/>Load Balancer
    participant Pod as ctrl-01<br/>Pod

    Client->>LB: SYN to VIP
    Note over LB: DNAT: VIP → Pod IP<br/>SNAT: Client IP → LB IP
    LB->>Pod: GENEVE tunnel<br/>SYN to Pod IP
    Note over Pod: Pod processes request<br/>Sees LB IP

    Pod->>LB: GENEVE tunnel<br/>SYN-ACK from Pod IP
    Note over LB: SNAT: Pod IP → VIP<br/>DNAT: LB IP → Client IP
    LB->>Client: SYN-ACK from VIP

    Client->>LB: HTTP GET
    LB->>Pod: GENEVE tunnel<br/>HTTP GET
    Pod->>LB: GENEVE tunnel<br/>HTTP 200 OK
    LB->>Client: HTTP 200 OK



For the DSR-Service (.200), a GENEVE header with DSR options containing the client IP is added to the tunnelled packet. This header is then unpacked on the target Node, making it possible for the Pod to reply directly to the client.

  
---
title: DSR with Tunnelling
---
sequenceDiagram
    participant Client as Client
    participant LB as ctrl-00<br/>Load Balancer
    participant Pod as ctrl-01<br/>Pod

    Client->>LB: SYN to VIP
    Note over LB: eBPF intercepts<br/>DNAT: VIP → Pod IP<br/>No SNAT (preserve Client IP)
    LB->>Pod: GENEVE tunnel<br/>w/ DSR options<br/>SYN to Pod IP
    Note over Pod: Pod processes request<br/>Sees client IP<br/>Extracts VIP from DSR options<br/>Use VIP as source address
    Pod-->>Client: SYN-ACK from VIP (direct)

    Client->>LB: HTTP GET
    LB->>Pod: GENEVE tunnel<br/>HTTP GET
    Pod-->>Client: HTTP 200 OK (direct)



The DSR options are only needed to establish the connection. Once the connection is established, the Pod tracks it for the rest of the connection.

BGP with GENEVE Tunnelling
#

BGP works differently from ARP in that more than one node can advertise the same IP address, though you need to peer the nodes to a router.

I’ve described my setup in detail in the aforementioned BGP article. Briefly summarised, all three control plane Nodes are configured as BGP speakers, while the worker Node stays mute.

Moving our target test Pod back to the ctrl-00 Node, we can check the routes advertised by our speaker Nodes

❯ cilium bgp routes available ipv4 unicast
Node      VRouter   Prefix             NextHop   Age   Attrs
ctrl-00   65200     172.20.10.200/32   0.0.0.0   27s   [{Origin: i} {Nexthop: 0.0.0.0}]   
          65200     172.20.10.201/32   0.0.0.0   27s   [{Origin: i} {Nexthop: 0.0.0.0}]   
          65200     172.20.10.210/32   0.0.0.0   24s   [{Origin: i} {Nexthop: 0.0.0.0}]   
          65200     172.20.10.211/32   0.0.0.0   24s   [{Origin: i} {Nexthop: 0.0.0.0}]   
ctrl-01   65200     172.20.10.200/32   0.0.0.0   27s   [{Origin: i} {Nexthop: 0.0.0.0}]   
          65200     172.20.10.201/32   0.0.0.0   27s   [{Origin: i} {Nexthop: 0.0.0.0}]   
ctrl-02   65200     172.20.10.200/32   0.0.0.0   28s   [{Origin: i} {Nexthop: 0.0.0.0}]   
          65200     172.20.10.201/32   0.0.0.0   28s   [{Origin: i} {Nexthop: 0.0.0.0}]   

Note that the Services with externalTrafficPolicy: Cluster (.200 and .201) are advertised by all Nodes, while the externalTrafficPolicy: Local (.210 and .211) are only advertised by the ctrl-00 Node where the Pod runs.

Trying to reach all the BGP advertised Services, we see that the Pod receives the client IP for all requests

❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:65117

❯ curl 172.20.10.201
RemoteAddr: 10.144.12.11:65153

❯ curl 172.20.10.210
RemoteAddr: 10.144.12.11:65172

❯ curl 172.20.10.211
RemoteAddr: 10.144.12.11:65189

Looking at the TCP dumps from the ctrl-00 Node and inside the Pod, we can see no meaningful difference in the communication between the client and the target Pod compared to the ARP case with the Pod running on the leaseholder Node. Even though multiple Nodes are advertising the same IP address, only one of them is preferred.3

Taking a peek at the BGP routing table on my UCG Max router, we see the same routes advertised

root@Cloud-Gateway-Max:~# vtysh -c "show ip bgp"
BGP table version is 469, local router ID is 192.168.1.1, vrf id 0
Default local pref 100, local AS 65100
Status codes:  s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  172.20.10.200/32 192.168.1.100                          0 65200 i
 *=                   192.168.1.101                          0 65200 i
 *=                   192.168.1.102                          0 65200 i
 *>  172.20.10.201/32 192.168.1.100                          0 65200 i
 *=                   192.168.1.101                          0 65200 i
 *=                   192.168.1.102                          0 65200 i
 *>  172.20.10.210/32 192.168.1.100                          0 65200 i
 *>  172.20.10.211/32 192.168.1.100                          0 65200 i

Here we also see that ctrl-00 (which has IP 192.168.1.100) is preferred for the externalTrafficPolicy: Cluster Services, as indicated by the > symbol.

Changing the whoami Pod to run on ctrl-01 and trying again, we notice that all but the SNAT-ed externalTrafficPolicy: Cluster Service (.201) see the real client IP

❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:54333

❯ curl 172.20.10.201
RemoteAddr: 10.244.1.112:54670

❯ curl 172.20.10.210
RemoteAddr: 10.144.12.11:54817

❯ curl 172.20.10.211
RemoteAddr: 10.144.12.11:54960

This can be explained by requests to the .200 and .201 Services being routed via the BGP-preferred ctrl-00 Node (even though going to ctrl-01 directly would be better), while the .210 and .211 Services are served from the Node the Pod is running on.

To back up this explanation, we can check the BGP routing table again. Here we see that the externalTrafficPolicy: Cluster routes haven’t changed, while the externalTrafficPolicy: Local routes have been moved to the ctrl-01 Node with IP 192.168.1.101

 *>  172.20.10.200/32 192.168.1.100                          0 65200 i
 *=                   192.168.1.101                          0 65200 i
 *=                   192.168.1.102                          0 65200 i
 *>  172.20.10.201/32 192.168.1.100                          0 65200 i
 *=                   192.168.1.101                          0 65200 i
 *=                   192.168.1.102                          0 65200 i
 *>  172.20.10.210/32 192.168.1.101                          0 65200 i
 *>  172.20.10.211/32 192.168.1.101                          0 65200 i

Deleting and recreating the Services and the Pod running on ctrl-01, the ctrl-00 node is still preferred for the first jump!

Moving the Pod over to the non-BGP peered work-00 Node, we experience a similar behaviour to moving the Pod over to a non-leaseholder Node in the ARP case, though with a much longer timeout for the non-routable paths

❯ curl 172.20.10.200                              
RemoteAddr: 10.144.12.11:58728

❯ curl 172.20.10.201
RemoteAddr: 10.244.1.112:58736

❯ curl 172.20.10.210
curl: (7) Failed to connect to 172.20.10.210 port 80 after 7140 ms: Couldn't connect to server

❯ curl 172.20.10.211
curl: (7) Failed to connect to 172.20.10.211 port 80 after 7157 ms: Couldn't connect to server

The timeouts can again be explained by the Node not having a Local route to the Pod, thus dropping the connection.

Native Routing
#

Restarting our setup with the target Pod on ctrl-00 together with the leases, we unsurprisingly see the same behaviour as the tunnelling case for ARP announced Services

❯ curl 192.168.1.200                                       
RemoteAddr: 10.144.12.11:56464

❯ curl 192.168.1.201
RemoteAddr: 10.144.12.11:56600

❯ curl 192.168.1.210
RemoteAddr: 10.144.12.11:56769

❯ curl 192.168.1.211
RemoteAddr: 10.144.12.11:56978

Moving the Pod over to ctrl-01 without the leases, we also get a similar behaviour, though the RemoteAddr is now reported as the load balancer Node IP instead of the load balancer Pod IP

❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:57423

❯ curl 192.168.1.201
RemoteAddr: 192.168.1.100:52797

❯ curl 192.168.1.210
curl: (7) Failed to connect to 192.168.1.210 port 80 after 7 ms: Couldn't connect to server

❯ curl 192.168.1.211
curl: (7) Failed to connect to 192.168.1.211 port 80 after 6 ms: Couldn't connect to server

We still can’t route to node external VIPs with externalTrafficPolicy: Local for the same reason as before.

BGP also demonstrably behaves almost identically with native routing compared to tunnelling. The same Node IP is reported as the RemoteAddr when the Pod runs on a different Node than the preferred BGP peer

❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:65369

❯ curl 172.20.10.201
RemoteAddr: 192.168.1.100:49246

❯ curl 172.20.10.210
RemoteAddr: 10.144.12.11:49426

❯ curl 172.20.10.211
RemoteAddr: 10.144.12.11:49527

Again, moving the Pod over to the non-BGP-peered work-00 node we get the same results, with the externalTrafficPolicy: Local Services refusing to connect.

❯ curl 172.20.10.200                                       
RemoteAddr: 10.144.12.11:64639

❯ curl 172.20.10.201
RemoteAddr: 192.168.1.100:64710

❯ curl 172.20.10.210
curl: (7) Failed to connect to 172.20.10.210 port 80 after 7096 ms: Couldn't connect to server

❯ curl 172.20.10.211
curl: (7) Failed to connect to 172.20.10.211 port 80 after 7134 ms: Couldn't connect to server

Except for the RemoteAddr change from a Pod IP to a Node IP we see little difference between tunnelling and native routing. To see where the real difference lies, we have to inspect the TCP dumps.

In the SNAT case, the packets are NAT’ed to the target Pod and sent along their way.

  
---
title: SNAT with Native Routing
---
sequenceDiagram
    participant Client as Client
    participant LB as ctrl-00<br/>Load Balancer
    participant Pod as ctrl-01<br/>Pod

    Client->>LB: SYN to VIP
    Note over LB: DNAT: VIP → Pod IP<br/>SNAT: Client IP → LB IP
    LB->>Pod: SYN to Pod IP
    Note over Pod: Pod processes request<br/>Sees LB IP

    Pod->>LB: SYN-ACK from Pod IP
    Note over LB: SNAT: Pod IP → VIP<br/>DNAT: LB IP → Client IP
    LB->>Client: SYN-ACK from VIP

    Client->>LB: HTTP GET
    LB->>Pod: HTTP GET
    Pod->>LB: HTTP 200 OK
    LB->>Client: HTTP 200 OK



On the return path, the load balancer Node reverse-NATs the replies and forwards them back to the client.

With DSR, on the other hand, we see the equivalent packets being rewritten on the load balancer Node so that they appear to be sent from the client directly to the Pod, with the original VIP carried in IP options.

Here, the network router needs to know which Node to send the packets to based on the Pod IP.

  
---
title: DSR with Native Routing
---
sequenceDiagram
participant Client as Client
participant LB as ctrl-00<br/>Load Balancer
participant Pod as ctrl-01<br/>Pod

Client->>LB: SYN to VIP
Note over LB: DNAT: VIP → Pod IP<br/>Add VIP to IP Options header<br/>Forward via native routing
LB->>Pod: IP Options: VIP<br/>SYN to Pod IP

Note over Pod: Pod processes request<br/>Sees client IP<br/>Extract VIP from IP Options<br/>Use VIP as source address
Pod-->>Client: SYN-ACK from VIP (direct)

Client->>LB: HTTP GET
LB->>Pod: IP Options: VIP<br/>HTTP GET
Pod-->>Client: HTTP 200 OK (direct)


The target Node then receives the packet and forwards it to the Pod directly. The Pod receives the request and sees the client IP, the original VIP is then extracted from the IP options, and a reply is sent back to the client using the VIP as the source address.

With DSR routing using GENEVE tunnelling, the packets from the load balancer Node are being sent to the target Pod Node with an encapsulated GENEVE header

20:18:26.776725 IP 10.144.12.11.60280 > 172.20.10.200.80: Flags [S], seq 4182363690, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2202670684 ecr 0,sackOK,eol], length 0
20:18:26.776776 IP 192.168.1.100.65047 > 192.168.1.101.6081: Geneve, Flags [none], vni 0x2, options [12 bytes]: IP 10.144.12.11.60280 > 10.244.0.240.80: Flags [S], seq 4182363690, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2202670684 ecr 0,sackOK,eol], length 0
20:18:26.781830 IP 10.144.12.11.60280 > 172.20.10.200.80: Flags [.], ack 175331444, win 2054, options [nop,nop,TS val 2202670690 ecr 31889143], length 0
20:18:26.781857 IP 192.168.1.100.65047 > 192.168.1.101.6081: Geneve, Flags [none], vni 0x2: IP 10.144.12.11.60280 > 10.244.0.240.80: Flags [.], ack 175331444, win 2054, options [nop,nop,TS val 2202670690 ecr 31889143], length 0

On the receiving Node, the GENEVE header is inspected and stripped off before the packet is sent to the target Pod. Note that GENEVE encapsulation is used both for the regular node-to-node tunnelling and for the DSR dispatch.

While with native routing, the packet is only Destination NAT’ed and a lightweight IP option header is added to the packet with the original VIP as the source address

21:21:11.926423 IP 10.144.12.11.53948 > 172.20.10.200.80: Flags [SEW], seq 1715394499, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1422266817 ecr 0,sackOK,eol], length 0
21:21:11.926439 IP 10.144.12.11.53948 > 10.244.0.241.80: Flags [SEW], seq 1715394499, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1422266817 ecr 0,sackOK,eol], length 0
21:21:11.932260 IP 10.144.12.11.53948 > 172.20.10.200.80: Flags [.], ack 3228734302, win 2059, options [nop,nop,TS val 1422266822 ecr 2244985958], length 0
21:21:11.932280 IP 10.144.12.11.53948 > 10.244.0.241.80: Flags [.], ack 3228734302, win 2059, options [nop,nop,TS val 1422266822 ecr 2244985958], length 0

GENEVE tunnelling adds a few extra bytes of overhead, but it frees up routable IPs. If we had multiple clusters on the same network, we would have to use non-overlapping Pod CIDRs to avoid IP conflicts using native routing, though this IP exhaustion problem can be greatly alleviated by using IPv6.

  
---
title: Native Routing vs Geneve Encapsulation
---
block-beta
    columns 2

    block:native
        columns 1
        A["Native Routing"]
        space
        space
        space
        B["IP Header\nsrc: Client\ndst: Pod IP"]
        C["TCP Header\nport 80"]
        D["HTTP Data"]
    end

    block:geneve
        columns 1
        E["Geneve Tunneling"]
        F["Outer IP Header\nsrc: LB Node\ndst: Pod Node"]
        G["UDP Header\nport 6081"]
        H["Geneve Header"]
        I["Inner IP Header\nsrc: Client\ndst: Pod IP"]
        J["TCP Header\nport 80"]
        K["HTTP Data"]
    end

    classDef title fill:#64748b,stroke:#475569
    classDef overhead fill:#b45309,stroke:#92400e
    classDef packet fill:#047857,stroke:#065f46

    class A,E title
    class F,G,H overhead
    class B,C,D,I,J,K packet



Reverse Proxy
#

Up until now we’ve only done routing in the Transport layer (layer 4), by routing directly to the Pod using VIPs. But what happens if we introduce a reverse proxy in the Application layer (layer 7) in front of the Pod?

We leave Cilium in native routing mode and rely only on BGP advertisements for this part.

There are a lot of different reverse proxies out there, but I will be focusing on Cilium’s implementation of the Gateway API spec, which uses Envoy behind the scenes.

I’ll be using Gateway API configured with cert-manager as mentioned in the article above. Here I’ve created a wildcard Certificate named cert-stonegarden (line 14) for my domain

# gateway/cert-stonegarden.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cert-stonegarden
  namespace: gateway
spec:
  dnsNames:
    - "*.stonegarden.dev"
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: cloudflare-cluster-issuer
  secretName: cert-stonegarden
  usages:
    - digital signature
    - key encipherment

and referenced this in the following Gateway resource on line 24

# gateway/gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: internal
  namespace: gateway
spec:
  gatewayClassName: cilium
  infrastructure:
    labels:
      bgp.cilium.io/advertise-service: default
      bgp.cilium.io/ip-pool: default
  addresses:
    - type: IPAddress
      value: 172.20.10.100
  listeners:
    - protocol: HTTPS
      port: 443
      name: https-gateway
      hostname: "*.stonegarden.dev"
      tls:
        certificateRefs:
          - kind: Secret
            name: cert-stonegarden
      allowedRoutes:
        namespaces:
          from: All
    - protocol: TLS
      port: 443
      name: tls-passthrough
      hostname: "*.stonegarden.dev"
      tls:
        mode: Passthrough
      allowedRoutes:
        namespaces:
          from: All

The Gateway is assigned the IP 172.20.10.100 (line 15) from the previously created BGP CiliumLoadBalancerIPPool (line 12). The Gateway Service is advertised by the BGP peer Nodes (line 11).

Note that the Gateway has both an HTTPS listener (line 17) — for HTTPRoutes, and a TLS listener (line 28) — for TLSRoutes.

We then create what is basically a copy of the BGP advertised DSR Service with externalTrafficPolicy: Local

apiVersion: v1
kind: Service
metadata:
  name: whoami
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.220
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  internalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http

and match it with the following HTTPRoute

# test/http-route.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: whoami
  namespace: dsr
spec:
  parentRefs:
    - { name: internal, namespace: gateway }
  hostnames: [ "whoami.stonegarden.dev" ]
  rules:
    - backendRefs: [ { name: whoami, port: 80 } ]
      matches:
        - path: { type: PathPrefix, value: / }

to be reachable at https://whoami.stonegarden.dev (line 10).

Next, we check that the preferred BGP route is 192.168.1.100, which is the ctrl-00 Node,

root@Cloud-Gateway-Max:~# vtysh -c "show ip bgp"
Status codes:  s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
 *>  172.20.10.100/32 192.168.1.100                          0 65200 i
 *=                   192.168.1.101                          0 65200 i
 *=                   192.168.1.102                          0 65200 i

and move the target Pod once again over to that Node.

Firing off curl towards our reverse proxy at https://whoami.stonegarden.dev we see that the Pod thinks that it’s replying to a RemoteAddr that belongs to the PodCIDR-network and not the client, even though we supposedly requested DSR routing!4

❯ curl https://whoami.stonegarden.dev
RemoteAddr: 10.244.0.104:57436
...
X-Envoy-Internal: true
X-Forwarded-For: 10.144.12.11
X-Forwarded-Proto: https
X-Request-Id: 905974a3-3329-49e3-a43e-16ed056b5d9f

The reply from the whoami webserver also includes some headers, where we can see that Envoy has helpfully included the X-Forwarded-For header with the correct client IP. Cilium counts this header as the [source IP being visible to the Pod](https://docs.cilium.io/en/latest/network/servicemesh/gateway-api/gateway-api/#source-ip-visibility), and the MDN Web Docs agrees. What happened here is that the reverse proxy is potentially sending a DSR reply, but the request is still proxied to the target Pod.

The reverse proxy sees our client IP, but it acts as a middleman between the client and the target Pod.

Remember that we configured externalTrafficPolicy: Local on the Service. If we move the target Pod over to a non-preferred Node, e.g. ctrl-01

spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-01

and try curling again

❯ curl https://whoami.stonegarden.dev                     
upstream connect error or disconnect/reset before headers. reset reason: connection timeout% 

We’re unable to reach the target Pod since the reverse proxy/load balancer on ctrl-00 no longer sees the Service as being local to the Node. This indicates that Envoy is trying to use the “externally-facing” LoadBalancerIP address, instead of the Service’s “internally-facing” ClusterIP address (since we haven’t touched the internalTrafficPolicy which defaults to Cluster).

To test this theory, we can create a regular Service of type ClusterIP (line 10) — which is the default Service type

# test/svc-regular.yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami-regular
  namespace: dsr
  annotations:
    service.cilium.io/forwarding-mode: dsr
spec:
  type: ClusterIP
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http

and update the HTTPRoute to reference it

  rules:
    - backendRefs: [ { name: whoami-regular, port: 80 } ]

Trying to reach the Pod again, we see that we’re now able to reach it

❯ curl https://whoami.stonegarden.dev                      
RemoteAddr: 10.244.2.192:57334
...
X-Envoy-Internal: true
X-Forwarded-For: 10.144.12.11
X-Forwarded-Proto: https
X-Request-Id: c42442c3-d49e-4fe0-ba9d-c3be6d81574e

To probe further we can add internalTrafficPolicy: Local to our ClusterIP Service and move the Pod to our non BGP-peered work-00 Node, and we still get a reply

❯ curl https://whoami.stonegarden.dev
RemoteAddr: 10.244.2.19:41581
...
X-Envoy-Internal: true
X-Forwarded-For: 10.144.12.11
X-Forwarded-Proto: https
X-Request-Id: e3e9ccdb-7b98-49e8-8dc6-bed7c8a366da
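For reference, this tweak is just one extra field on the otherwise unchanged ClusterIP Service

# test/svc-regular.yaml (excerpt)
spec:
  type: ClusterIP
  internalTrafficPolicy: Local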

Taking a look at the TCP dumps, we can reconstruct the following simplified sequence diagram

  
---
title: Cross-node Traffic with TLS Termination at Proxy
---
sequenceDiagram
    participant Client as Client
    participant Proxy as ctrl-00<br/>Reverse Proxy/<br/>Load Balancer
    participant Pod as work-00<br/>Pod

    Note over Client,Proxy: TCP Handshake
    Client->>Proxy: SYN
    Proxy->>Client: SYN-ACK

    Note over Client,Proxy: TLS Termination (at Proxy)
    Client->>Proxy: TLS Client Hello
    Proxy->>Client: TLS Server Hello, Cert, etc.
    
    Client->>Proxy: Encrypted Data
    
    Note over Proxy: Decrypts & parses Host header<br/>Resolves backend from HTTPRoute<br/>Adds X-Forwarded headers

    rect rgba(128, 128, 128, 0.1337)
        Note over Proxy,Pod: New connection (Unencrypted)
        Proxy->>Pod: SYN
        Pod->>Proxy: SYN-ACK
        Proxy->>Pod: Plaintext Data
        Pod->>Proxy: Plaintext Data
    end

    Note over Proxy: Encrypts response
    Proxy->>Client: Encrypted Data



which shows that a separate connection is established between the reverse proxy and the target Pod, making a direct reply from the Pod to the client impossible, at least if we want to terminate TLS at the reverse proxy.

We can also try to use TLS Passthrough where the Pod itself terminates the TLS connection.

For this we first need a Certificate similar to what we used for the Gateway

# tls/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: whoami-tls
  namespace: dsr
spec:
  dnsNames: [ whoami-tls.stonegarden.dev ]
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: cloudflare-cluster-issuer
  secretName: whoami-tls
  usages:
    - digital signature
    - key encipherment

The only difference is that this time we request a single-domain certificate (line 8) instead of a wildcard one.

We mount the Secret created by the Certificate into our Pod (line 23 and 32)

# tls/pod-whoami-tls.yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami-tls
  namespace: dsr
  labels:
    app: whoami-tls
spec:
  hostNetwork: false
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
  containers:
    - name: netshoot
      image: ghcr.io/nicolaka/netshoot:v0.15
      command: [
        /whoami/whoami,
        --cert, /tls/tls.crt,
        --key, /tls/tls.key,
        --port, "443"]
      ports:
        - name: https
          containerPort: 443
      volumeMounts:
        - name: whoami
          mountPath: /whoami
        - name: tls-certs
          mountPath: /tls
          readOnly: true
  volumes:
    - name: whoami
      image:
        reference: ghcr.io/traefik/whoami:latest
    - name: tls-certs
      secret:
        secretName: whoami-tls

and instruct the whoami webserver to use it on lines 16–20.

Since we want to also try to reach the Pod directly, we create an unremarkable Load Balancer Service with IP 172.20.10.230 (line 8)

# tls/svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami-tls
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.230
    service.cilium.io/forwarding-mode: dsr
  labels:
    app: whoami-tls
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  selector:
    app: whoami-tls
  ports:
    - name: https
      port: 443
      targetPort: 443

Lastly, we connect the Service to the Gateway using the following TLSRoute

# tls/tlsroute.yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: whoami-tls
  namespace: dsr
spec:
  parentRefs:
    - { name: internal, namespace: gateway }
  hostnames: [ "whoami-tls.stonegarden.dev" ]
  rules:
    - backendRefs: [ { name: whoami-tls, port: 443 } ]

which advertises the whoami-tls.stonegarden.dev route (line 10).

Trying to reach the Pod through the Gateway, we still get a PodCIDR-IP as the RemoteAddr

❯ curl https://whoami-tls.stonegarden.dev
RemoteAddr: 10.244.2.19:34869

and since the connection is encrypted, Envoy can’t inject the X-Forwarded-For header, making the webserver completely blind to the source IP.

Mapping the request, we get the following simplified sequence diagram

  
---
title: Cross-node Traffic with TLS Passthrough
---
sequenceDiagram
    participant Client as Client
    participant Proxy as ctrl-00<br/>Reverse Proxy/<br/>Load Balancer
    participant Pod as work-00<br/>Pod

    Note over Client,Proxy: TCP Handshake
    Client->>Proxy: SYN
    Proxy->>Client: SYN-ACK

    Note over Client,Proxy: TLS Client Hello (SNI)
    Client->>Proxy: TLS Client Hello
    
    Note over Proxy: Reads SNI<br/>Resolves backend from TLSRoute

    rect rgba(128, 128, 128, 0.1337)
        Note over Proxy,Pod: TCP Handshake
        Proxy->>Pod: SYN
        Pod->>Proxy: SYN-ACK
    end

    Note over Client,Pod: TLS Passthrough (Termination at Pod)
    Proxy->>Pod: TLS Client Hello
    Pod->>Proxy: TLS Server Hello, Cert, etc.
    Proxy->>Client: TLS Server Hello, Cert, etc.
    
    Note over Client,Pod: Encrypted Traffic
    Client->>Proxy: Encrypted Data
    Proxy->>Pod: Encrypted Data
    Pod->>Proxy: Encrypted Data
    Proxy->>Client: Encrypted Data


Here we again see that a separate connection is established between the reverse proxy and the target Pod, making a direct reply from the Pod to the client impossible.

What we can do in this situation is to create a DNS record directly pointing to the Service IP. To simulate this, we can use the --resolve flag for curl

❯ curl --resolve whoami-tls.stonegarden.dev:443:172.20.10.230 \
  https://whoami-tls.stonegarden.dev
RemoteAddr: 10.144.12.11:59917

Conclusion
#

Enabling DSR lets a target Pod see the client IP directly, but it does come with some caveats.

Cilium must be configured to enable DSR, and the underlying network must support it. The target Pod also has to be able to talk to the client directly; no proxies can be involved.

To enable DSR in Cilium, we can either configure native routing if our network supports it, or we can use GENEVE tunnelling with some added overhead — though using far fewer routable IPs.

Behind a well-behaved proxy, we should be able to rely on the X-Forwarded-For header to get the client IP, but backend support for the header might vary.

Unrelated to DSR, we can maybe save a jump using externalTrafficPolicy: Local on Services with BGP advertisements, but it’s risky with ARP as we’re not guaranteed that the load balancing Node will be the same as the target Pod Node.

Although quite the detour, I’m now able to have my AdGuardHome DNS server correctly pick up the source/client IP of DNS queries. This allows for better statistics and more fine-grained control over DNS.

In my case, the best approach is to turn on native routing with BGP peering. I’ve also opted to default to SNAT forwarding as it’s the most reliable option, but I’ve turned on bpf.lbModeAnnotation to allow selectively using DSR with IP options for some Services like the DNS server.

Summary
#

Cilium Configuration
#

Values used for GENEVE tunnelling with GENEVE options for DSR

# cilium/values-tunnel.yaml
routingMode: tunnel
tunnelProtocol: geneve

loadBalancer:
  standalone: true
  mode: dsr
  dsrDispatch: geneve

bpf:
  lbModeAnnotation: true

Values used for native routing with IP Options for DSR

# cilium/values-native.yaml
routingMode: native
ipv4NativeRoutingCIDR: 10.244.0.0/16 # Talos PodCIDR
autoDirectNodeRoutes: true

loadBalancer:
  standalone: true
  algorithm: maglev
  mode: dsr
  dsrDispatch: opt

bpf:
  masquerade: true
  lbModeAnnotation: true

Values used for ARP announcements

# cilium/values-arp.yaml
l2announcements:
  enabled: true

k8sClientRateLimit:
  qps: 20
  burst: 100

ARP announcement policy

# cilium/arp-announce.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: arp-announcement
  namespace: kube-system
spec:
  loadBalancerIPs: true
  serviceSelector:
    matchLabels:
      arp.cilium.io/announce-service: default

ARP announcement IP pool

# cilium/arp-ip-pool.yaml
apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-arp-ip-pool
spec:
  blocks:
    - start: 192.168.1.200
      stop: 192.168.1.255
  serviceSelector:
    matchLabels:
      arp.cilium.io/ip-pool: default

BGP advertisement IP pool

# cilium/bgp-ip-pool.yaml
apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-bgp-ip-pool
spec:
  blocks:
    - cidr: 172.20.10.0/24
  serviceSelector:
    matchLabels:
      bgp.cilium.io/ip-pool: default

Settings used for the Reverse Proxy section.

# cilium/values.yaml
kubeProxyReplacement: true

routingMode: native
ipv4NativeRoutingCIDR: 10.244.0.0/16 # Talos PodCIDR
autoDirectNodeRoutes: true

bgpControlPlane:
  enabled: true

loadBalancer:
  standalone: true
  algorithm: maglev
  mode: dsr
  dsrDispatch: opt
  l7:
    backend: envoy

bpf:
  masquerade: true
  lbModeAnnotation: true

Test
#

# test/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ns.yaml
  - ds-netshoot.yaml
  - pod-whoami.yaml
  - svc-arp-cluster-dsr.yaml
  - svc-arp-cluster-snat.yaml
  - svc-arp-local-dsr.yaml
  - svc-arp-local-snat.yaml
  - svc-bgp-cluster-dsr.yaml
  - svc-bgp-cluster-snat.yaml
  - svc-bgp-local-dsr.yaml
  - svc-bgp-local-snat.yaml
  - svc.yaml
  - http-route.yaml

Namespace with privileged Pod Security Admission Policy

# test/ns.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dsr
  labels:
    pod-security.kubernetes.io/enforce: privileged

Designated target Pod with whoami webserver and netshoot tools

# test/pod-whoami.yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami
  namespace: dsr
  labels:
    app: whoami
spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
  containers:
    - name: netshoot
      image: ghcr.io/nicolaka/netshoot:v0.15
      command: [ /whoami/whoami ]
      ports:
        - name: http
          containerPort: 80
      volumeMounts:
        - name: whoami
          mountPath: /whoami
  volumes:
    - name: whoami
      image:
        reference: ghcr.io/traefik/whoami:latest

DaemonSet with netshoot tools on the host network

# test/ds-netshoot.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netshoot
  namespace: dsr
spec:
  selector:
    matchLabels:
      app: netshoot
  template:
    metadata:
      labels:
        app: netshoot
    spec:
      hostNetwork: true
      containers:
        - name: netshoot
          image: ghcr.io/nicolaka/netshoot:v0.15
          command: [ tail, -f, /dev/null ]

ARP announced Services

# test/svc-arp-cluster-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-cluster-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.200
    service.cilium.io/forwarding-mode: dsr
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-arp-cluster-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-cluster-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.201
    service.cilium.io/forwarding-mode: snat
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-arp-local-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-local-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.210
    service.cilium.io/forwarding-mode: dsr
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-arp-local-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-local-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.211
    service.cilium.io/forwarding-mode: snat
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http

BGP advertised Services

# test/svc-bgp-cluster-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-cluster-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.200
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-bgp-cluster-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-cluster-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.201
    service.cilium.io/forwarding-mode: snat
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-bgp-local-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-local-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.210
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-bgp-local-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-local-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.211
    service.cilium.io/forwarding-mode: snat
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http

Service for the reverse proxy

apiVersion: v1
kind: Service
metadata:
  name: whoami
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.220
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  internalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http

HTTPRoute for the reverse proxy

# test/http-route.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: whoami
  namespace: dsr
spec:
  parentRefs:
    - { name: internal, namespace: gateway }
  hostnames: [ "whoami.stonegarden.dev" ]
  rules:
    - backendRefs: [ { name: whoami-regular, port: 80 } ]
      matches:
        - path: { type: PathPrefix, value: / }

Simpler Service for reverse proxy

# test/svc-regular.yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami-regular
  namespace: dsr
  annotations:
    service.cilium.io/forwarding-mode: dsr
spec:
  type: ClusterIP
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http

Gateway
#

Gateway with TLS and HTTPS listeners

# gateway/gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: internal
  namespace: gateway
spec:
  gatewayClassName: cilium
  infrastructure:
    labels:
      bgp.cilium.io/advertise-service: default
      bgp.cilium.io/ip-pool: default
  addresses:
    - type: IPAddress
      value: 172.20.10.100
  listeners:
    - protocol: HTTPS
      port: 443
      name: https-gateway
      hostname: "*.stonegarden.dev"
      tls:
        certificateRefs:
          - kind: Secret
            name: cert-stonegarden
      allowedRoutes:
        namespaces:
          from: All
    - protocol: TLS
      port: 443
      name: tls-passthrough
      hostname: "*.stonegarden.dev"
      tls:
        mode: Passthrough
      allowedRoutes:
        namespaces:
          from: All

Wildcard Certificate for the Gateway

# gateway/cert-stonegarden.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cert-stonegarden
  namespace: gateway
spec:
  dnsNames:
    - "*.stonegarden.dev"
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: cloudflare-cluster-issuer
  secretName: cert-stonegarden
  usages:
    - digital signature
    - key encipherment

Reverse Proxy
#

# tls/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - cert.yaml
  - svc.yaml
  - tls-route.yaml
  - pod-whoami-tls.yaml

Certificate for the TLS passthrough Pod

# tls/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: whoami-tls
  namespace: dsr
spec:
  dnsNames: [ whoami-tls.stonegarden.dev ]
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: cloudflare-cluster-issuer
  secretName: whoami-tls
  usages:
    - digital signature
    - key encipherment

Pod serving its own TLS certificate

# tls/pod-whoami-tls.yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami-tls
  namespace: dsr
  labels:
    app: whoami-tls
spec:
  hostNetwork: false
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
  containers:
    - name: netshoot
      image: ghcr.io/nicolaka/netshoot:v0.15
      command: [
        /whoami/whoami,
        --cert, /tls/tls.crt,
        --key, /tls/tls.key,
        --port, "443"]
      ports:
        - name: https
          containerPort: 443
      volumeMounts:
        - name: whoami
          mountPath: /whoami
        - name: tls-certs
          mountPath: /tls
          readOnly: true
  volumes:
    - name: whoami
      image:
        reference: ghcr.io/traefik/whoami:latest
    - name: tls-certs
      secret:
        secretName: whoami-tls

Service for the TLS passthrough Pod

# tls/svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami-tls
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.230
    service.cilium.io/forwarding-mode: dsr
  labels:
    app: whoami-tls
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  selector:
    app: whoami-tls
  ports:
    - name: https
      port: 443
      targetPort: 443

TLSRoute for the TLS passthrough Pod

# tls/tlsroute.yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: whoami-tls
  namespace: dsr
spec:
  parentRefs:
    - { name: internal, namespace: gateway }
  hostnames: [ "whoami-tls.stonegarden.dev" ]
  rules:
    - backendRefs: [ { name: whoami-tls, port: 443 } ]

  1. See this LinkedIn post by Nicolas Vibert for an explanation on VXLAN in Cilium. ↩︎

  2. Or you could cheat and look at my TCP dumps. ↩︎

  3. I think it should be technically possible to load balance between the different Nodes, but I haven’t been able to figure out how to do it yet. ↩︎

  4. This is actually expected behaviour as per Gateway API GitHub issue #451. ↩︎