Direct Server Return (DSR) is a networking technique that allows a backend server to see the original client's source IP address and reply to it directly, bypassing the load balancer on the return path. This results in reduced latency and improved throughput, as well as the ability to filter traffic based on the client IP.
Compared to Cilium’s default Source Network Address Translation (SNAT), which rewrites source IPs and routes both request and reply through the load balancer, DSR is often more complicated to configure, and support depends on the underlying network infrastructure.
Motivation#
While digging into the Border Gateway Protocol (BGP), I stumbled upon the concept of DSR as a way of preserving the source IP address of client requests.
My main use-case for DSR is getting the correct client IP address for DNS requests, though the promise of lower latency, less overhead, and higher throughput is also welcome.
Overview#
This article will assume that you have LoadBalancer IP Address Management (LB-IPAM) configured with Cilium, as well as either L2 announcements with Address Resolution Protocol (ARP) or Border Gateway Protocol (BGP) configured.
We’ll be looking at Cilium in both Generic Network Virtualization Encapsulation (GENEVE) tunnelling and native routing modes.
Digging deeper,
we’ll play around with the Service
.spec.externalTrafficPolicy field
and see how it affects the traffic routing in our different configurations.
To verify that we’ve got DSR working, we will be using Træfik’s whoami webserver to peek at the HTTP requests, as well as tcpdump inside netshoot containers to inspect traffic.
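If you haven't used whoami before, a quick local smoke test (a minimal sketch, assuming Docker is available) shows the RemoteAddr field we'll be inspecting throughout:

```bash
# Run the whoami webserver locally and inspect its reply;
# RemoteAddr is the field we'll use to judge whether DSR works.
docker run -d --rm --name whoami -p 8080:80 ghcr.io/traefik/whoami:latest
curl -s http://localhost:8080
docker stop whoami
```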
Cilium configuration#
Before we begin properly, we need to configure Cilium to replace the kube-proxy.
Cilium has a great quick start guide on how to do this with Kubeadm. If you’re instead running K3s or Talos, I’ve explained the process in previous articles.
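As a rough sketch, a Helm-based install with the kube-proxy replacement enabled could look like the following; the API server host and port are placeholders you'll need to adapt to your cluster.

```bash
# Hypothetical Helm install enabling the eBPF kube-proxy replacement.
# k8sServiceHost/k8sServicePort must point at your API server when kube-proxy is absent.
helm repo add cilium https://helm.cilium.io/
helm upgrade --install cilium cilium/cilium \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=192.168.1.100 \
  --set k8sServicePort=6443
```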


Out-of-the-box, Cilium’s eBPF NodePort implementation operates in SNAT mode. Simplified, this means that node-external requests are routed through the node for both incoming and outgoing traffic, i.e. the backend reply takes an extra hop before being returned to the client, losing the original client IP in the translation process. In DSR mode, however, the backend replies directly to the external client.
---
title: Source Network Address Translation vs Direct Server Return
---
flowchart LR
subgraph SNAT["SNAT"]
C1("Client")
LB1("Load Balancer")
P1("Backend Pod")
C1 -->|"src = Client IP<br/>dst = Service IP"| LB1
LB1 -->|"src = Node IP<br/>dst = Pod IP"| P1
P1 -->|"src = Pod IP<br/>dst = Node IP"| LB1
LB1 -->|"src = Service IP<br/>dst = Client IP"| C1
end
subgraph DSR["DSR"]
C2("Client")
LB2("Load Balancer")
P2("Backend Pod")
C2 -->|"src = Client IP<br/>dst = Service IP"| LB2
LB2 -->|"src = Client IP<br/>dst = Pod IP"| P2
P2 -->|"src = Service IP<br/>dst = Client IP"| C2
end
SNAT ~~~ DSR
There are several ways to configure Cilium to use DSR load balancing.
GENEVE Tunnelling#
In an effort to maximise compatibility, Cilium uses Virtual eXtensible LAN (VXLAN) tunnelling1 by default, which unfortunately doesn’t support DSR.
To enable support for DSR in tunnelling mode, we instead have to use Generic Network Virtualization Encapsulation (GENEVE) tunnelling, a more extensible encapsulation protocol than VXLAN, despite extensibility being the very property VXLAN is named for. We can change the tunnelling mode at a slight risk of losing compatibility with older hardware.
Assuming a Helm install of Cilium,
we enable GENEVE tunnelling by configuring the tunnel protocol to be geneve on line 3
| |
DSR can now be enabled by setting the loadBalancer.mode to dsr (line 7),
and the loadBalancer.dsrDispatch option to geneve encapsulation (line 8)
— which is the only DSR dispatch option supported by GENEVE tunnelling.
Note that there is also a hybrid load balancer mode,
which is DSR for TCP and SNAT for UDP.
We also enable bpf.lbModeAnnotation (line 11) to allow us to control the load balancer mode per Service using the
service.cilium.io/forwarding-mode annotation.
We will use this annotation in the Testing section below to have both DSR and SNAT load balanced Services
simultaneously.
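With the annotation enabled, switching an existing Service between modes is just a matter of (re)annotating it, e.g. for the whoami Service in the dsr namespace used later in this article:

```bash
# Toggle the per-Service load balancing mode (requires bpf.lbModeAnnotation=true).
kubectl annotate service whoami -n dsr \
  service.cilium.io/forwarding-mode=dsr --overwrite
```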
You might also consider turning on the eXpress Data Path (XDP) based standalone load balancer (line 6), which reportedly improves performance.
Native Routing#
Native routing offers lower overhead, but it relies on the underlying network infrastructure. This means that it might not be available in all environments, especially in a cloud setting.
As for DSR options,
native routing in Cilium supports
both GENEVE and
IP-in-IP encapsulation (ipip),
as well as IP options (opt)
— listed in order of decreasing overhead.
To switch to native routing,
we change the routingMode to native (line 2) and supply the
Pod CIDR as the ipv4NativeRoutingCIDR (line 3)
— or alternatively ipv6NativeRoutingCIDR if you prefer.
Cilium will assume that networking for the given CIDR is preconfigured and directs traffic destined for the given
IP range to the Linux network stack without applying any SNAT.
We then have to either manually configure routes to reach the pods,
or set autoDirectNodeRoutes to true (line 4) for it to be done for us.
For the lowest possible overhead,
we set the dsrDispatch to opt on line 10 for IP options,
but both ipip and geneve are also supported depending on the network environment.
| |
In an attempt to shoehorn more micro-optimisations into this article,
I’ve also
changed the load balancer algorithm
to Maglev Consistent Hashing
on line 8.
Maglev hashing should improve resiliency in case of failures, and it also offers better load balancing properties.
It also sounds cool.
The downside is that Maglev hashing uses more memory than the default random algorithm.
We also enable eBPF Host-Routing,
which should increase throughput,
by setting bpf.masquerade to true on line 13.
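A rough way to verify the resulting datapath configuration is to query the agent's status from inside one of the Cilium Pods; the exact wording of the output varies between Cilium versions, so treat the grep pattern below as a loose filter.

```bash
# Check routing mode, DSR dispatch, host routing and masquerading from a Cilium agent Pod.
CILIUM_POD=$(kubectl get pod -n kube-system -l k8s-app=cilium -o name | head -n 1)
kubectl exec -n kube-system "${CILIUM_POD}" -c cilium-agent -- \
  cilium status --verbose | grep -iE 'routing|dsr|masquerading'
```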
Broadcasting#
Now that we’ve configured how traffic is routed, either through tunnelling or using native routing, we need a way to announce the Service IPs to the rest of our network.
The simplest way to make Cilium do this for us is with ARP announcements. I’ve covered this in more detail in my article on migrating from MetalLB to Cilium.

Briefly summarised, it amounts to the following Cilium Helm values
| |
together with a CiliumL2AnnouncementPolicy
| |
I’ve also prepared a dedicated IP pool which goes together with the ARP announcements.
| |
A more involved approach to IP broadcasting is to use BGP advertisements, though this requires that your networking hardware supports it. BGP is most commonly used between ISPs and in big data centres, as well as overengineered homelabs.
ARP is perfectly fine for smaller installations, though if you want to play around with BGP, I’ve written a separate article on how to configure BGP pairing with Cilium and UniFi equipment.

Skipping the implementation details, my BGP configuration boils down to a dedicated IP pool for BGP advertisements based on Service labels
| |
and a bgp.cilium.io/advertise-service: default label for advertising them,
and the control plane nodes doing the broadcasting.
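A quick way to sanity-check the BGP side of things is the Cilium CLI (assuming it is installed locally and the BGP control plane is enabled):

```bash
# List established BGP sessions and the routes currently advertised by each Node.
cilium bgp peers
cilium bgp routes available ipv4 unicast
```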
Testing#
For ~~deranged reasons~~ testing purposes,
I’ve configured both ARP and BGP on my four-node cluster.
Three control-plane nodes, ctrl-00, ctrl-01, and ctrl-02,
and one worker node work-00.
All four Nodes are schedulable for regular workloads.
---
title: Cluster
---
flowchart TB
subgraph ctrl-01["ctrl-01"]
vm01("IP: 192.168.1.101")
end
subgraph ctrl-02["ctrl-02"]
vm02("IP: 192.168.1.102")
end
subgraph work-00["work-00"]
vm03("IP: 192.168.1.110")
end
subgraph ctrl-00["ctrl-00"]
vm00("IP: 192.168.1.100")
end
ctrl-00 ~~~ ctrl-02
ctrl-01 ~~~ work-00
As a target we will use Traefik’s whoami webserver. This webserver replies with useful information about the request, including the remote IP address of the request as seen by the webserver. Since we also want to monitor the actual traffic, we embed the tiny Go webserver inside netshoot — a Docker + Kubernetes network trouble-shooting swiss-army container.
Taking advantage of the image volume feature, introduced in Kubernetes v1.31, we can mount one container image's filesystem inside another container. Support for image volumes also depends on the underlying container runtime; it was introduced in containerd 2.1.0 and CRI-O 1.31. You might also need to explicitly enable the ImageVolume feature gate in your cluster, as I’ve done in this commit.
To construct our target container,
we start with the netshoot image (line 14) and mount the whoami image (line 25) inside it under /whoami (line 21).
| |
We then make sure to pin the Pod to a given Node using the nodeSelector field on line 11.
Next, we construct a DaemonSet with the same
netshoot image,
and set hostNetwork to true (line 16) so we can get a TCP dump of the traffic on the node itself.
| |
We also add the pod-security.kubernetes.io/enforce: privileged label to the Namespace to allow running
containers connected to the host network.
We will be testing the difference between DSR and SNAT using the service.cilium.io/forwarding-mode Service
annotation together with the two types
of external traffic policies.
An external traffic policy of type Cluster
— which is the default,
means that external traffic should be routed to all ready endpoints,
whereas a Local external traffic policy routes only to node-local endpoints.
The whoami Pod is exposed by eight different Services
— four announced with ARP on the 192.168.1.0/24 subnet,
and four advertised with BGP on the 172.20.10.0/24 subnet,
with the last octet of the address indicating which combination of externalTrafficPolicy and forwarding mode is used,
as shown in the table below.
| externalTrafficPolicy | DSR | SNAT |
|---|---|---|
| Cluster | .200 | .201 |
| Local | .210 | .211 |
The full Service definitions can be found in the Summary section.
We test using Cilium in both GENEVE tunnelling and native routing mode.
All requests are sent from a client on subnet 10.144.12.0/24, specifically 10.144.12.11.
The PodCIDR is 10.244.0.0/16, and the service CIDR is 10.96.0.0/12.
The nodes themselves are on subnet 192.168.1.0/24, specifically 192.168.1.{100,101,102,110}.
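To avoid typing eight curl commands per scenario, a small loop over all the test VIPs comes in handy:

```bash
# Hit every test Service and print only the RemoteAddr line reported by whoami;
# IPs follow the table above (ARP on 192.168.1.0/24, BGP on 172.20.10.0/24).
for ip in 192.168.1.{200,201,210,211} 172.20.10.{200,201,210,211}; do
  echo -n "${ip}: "
  curl -s --connect-timeout 5 "http://${ip}" | grep RemoteAddr || echo "unreachable"
done
```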
Probing the Network#
We can run tcpdump inside both the host network attached netshoot containers and the whoami target
Pod to listen in on traffic inside the Nodes and the target Pod.
Cilium also has a built-in cilium monitor command that can be run inside the Cilium pods to display
“notifications and events emitted by BPF programs attached to endpoints and devices”.
To get the full Pod name of the different Pods spawned by the DaemonSet on all the nodes, we can e.g. run
NETSHOOT_CTRL_00=$(kubectl get pod -n dsr -l app=netshoot \
--field-selector spec.nodeName=ctrl-00 -o name)
NETSHOOT_CTRL_01=$(kubectl get pod -n dsr -l app=netshoot \
--field-selector spec.nodeName=ctrl-01 -o name)
NETSHOOT_CTRL_02=$(kubectl get pod -n dsr -l app=netshoot \
--field-selector spec.nodeName=ctrl-02 -o name)
NETSHOOT_WORK_00=$(kubectl get pod -n dsr -l app=netshoot \
--field-selector spec.nodeName=work-00 -o name)
and do the same for the Cilium pods
CILIUM_CTRL_00=$(kubectl get pod -n kube-system -l k8s-app=cilium \
--field-selector spec.nodeName=ctrl-00 -o name)
CILIUM_CTRL_01=$(kubectl get pod -n kube-system -l k8s-app=cilium \
--field-selector spec.nodeName=ctrl-01 -o name)
CILIUM_CTRL_02=$(kubectl get pod -n kube-system -l k8s-app=cilium \
--field-selector spec.nodeName=ctrl-02 -o name)
CILIUM_WORK_00=$(kubectl get pod -n kube-system -l k8s-app=cilium \
--field-selector spec.nodeName=work-00 -o name)
With the pod names in hand, we can then execute commands inside them by running
kubectl exec -n dsr ${NETSHOOT_CTRL_00} -- tcpdump -n
kubectl exec -n kube-system ${CILIUM_CTRL_00} -- cilium monitor
Since networks can be very chatty, it might be an idea to grep for interesting IPs, or we can output traffic to a file to be analysed by a tool like Wireshark
kubectl exec -n dsr ${NETSHOOT_CTRL_00} -- tcpdump -n -U -w - > capture.pcap
Here -n tells tcpdump to output raw IPs and port numbers, while -U in tandem with -w writes each packet to
stdout as soon as it is received.
Capturing and analysing packets is left as a trivial exercise to the reader.2
Wireshark can be used to analyse the captured traffic.
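To cut down on the noise, the capture can be narrowed to the traffic we actually care about, here the client IP, the Geneve UDP port (6081), and HTTP:

```bash
# Capture only the relevant traffic on ctrl-00 and save it for later analysis in Wireshark.
kubectl exec -n dsr ${NETSHOOT_CTRL_00} -- \
  tcpdump -n -U -w - 'host 10.144.12.11 or udp port 6081 or tcp port 80' > ctrl-00.pcap
```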
ARP with GENEVE Tunnelling#
For ARP, Cilium creates a lease for each Service to announce. This means that the leaseholder will announce the IP address, regardless of which node the target Pod is running on.
To find which node has a given Lease, we can run
❯ kubectl get leases -n kube-system
NAME HOLDER
cilium-l2announce-dsr-arp-cluster-dsr ctrl-00
cilium-l2announce-dsr-arp-cluster-snat ctrl-00
cilium-l2announce-dsr-arp-local-dsr ctrl-00
cilium-l2announce-dsr-arp-local-snat ctrl-00
Here we’ve rolled the dice until ctrl-00 holds all the leases as this makes the demonstration easier.
The leaseholder node effectively acts as a load balancer for
the Virtual IPs (VIPs) assigned to that Service.
First, we place the whoami Pod on the same node as the leaseholder by making sure that
spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
Firing off cURL requests to the different ARP-announced IP addresses,
the replies state that the RemoteAddr is that of the client IP (10.144.12.11),
i.e. the Pod experiences the requests as coming from the original source IP address,
regardless of external traffic policy and forwarding mode,
❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:53364
❯ curl 192.168.1.201
RemoteAddr: 10.144.12.11:63184
❯ curl 192.168.1.210
RemoteAddr: 10.144.12.11:63387
❯ curl 192.168.1.211
RemoteAddr: 10.144.12.11:63405
meaning that Cilium performs its eBPF magic, and we always experience DSR-like behaviour when the Pod is on the same Node as the load balancer Node.
---
title: Traffic flow on same node
---
sequenceDiagram
participant Client as Client
participant Node as ctrl-00<br/>Load Balancer & Pod
Client->>Node: SYN to VIP
Note over Node: eBPF intercepts request<br/>DNAT: VIP → Pod IP
Note over Node: Pod processes request
Note over Node: eBPF intercepts reply<br/>Reverse DNAT: Pod IP → VIP
Node->>Client: SYN-ACK from VIP
Client->>Node: HTTP GET
Note over Node: Local delivery + response
Node->>Client: HTTP 200 OK
Things start getting more interesting when we move the target whoami Pod to a different Node than the
leaseholder/load balancer Node.
We move the Pod from ctrl-00 to ctrl-01 by changing the nodeSelector
spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-01
and re-create the Pod.
Firing off the same cURL requests, we now get different replies
❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:60279
❯ curl 192.168.1.201
RemoteAddr: 10.244.1.112:60831
❯ curl 192.168.1.210
curl: (7) Failed to connect to 192.168.1.210 port 80 after 6 ms: Couldn't connect to server
❯ curl 192.168.1.211
curl: (7) Failed to connect to 192.168.1.211 port 80 after 7 ms: Couldn't connect to server
First note that the Services with externalTrafficPolicy: Local (.210 and .211) refuse to connect since the
target Pod is no longer running on the same Node as the load balancer,
and we pay them no more attention.
The second thing to note is that the SNAT-Service (.201) now replies with an IP in the Pod CIDR range
(10.244.0.0/16) instead of the client IP (10.144.12.11).
The request is SNAT-ed (to an IP in the Pod CIDR) on the load balancer Node before it is tunnelled to the Node running
the target Pod.
---
title: SNAT with Tunnelling
---
sequenceDiagram
participant Client as Client
participant LB as ctrl-00<br/>Load Balancer
participant Pod as ctrl-01<br/>Pod
Client->>LB: SYN to VIP
Note over LB: DNAT: VIP → Pod IP<br/>SNAT: Client IP → LB IP
LB->>Pod: GENEVE tunnel<br/>SYN to Pod IP
Note over Pod: Pod processes request<br/>Sees LB IP
Pod->>LB: GENEVE tunnel<br/>SYN-ACK from Pod IP
Note over LB: SNAT: Pod IP → VIP<br/>DNAT: LB IP → Client IP
LB->>Client: SYN-ACK from VIP
Client->>LB: HTTP GET
LB->>Pod: GENEVE tunnel<br/>HTTP GET
Pod->>LB: GENEVE tunnel<br/>HTTP 200 OK
LB->>Client: HTTP 200 OK
For the DSR-Service (.200),
a GENEVE header carrying DSR options with the original Service IP (the VIP) is added to the tunnelled packet, while the client IP is preserved as the inner source address.
This header is then unpacked on the target Node,
making it possible for the Pod to reply directly to the client.
---
title: DSR with Tunnelling
---
sequenceDiagram
participant Client as Client
participant LB as ctrl-00<br/>Load Balancer
participant Pod as ctrl-01<br/>Pod
Client->>LB: SYN to VIP
Note over LB: eBPF intercepts<br/>DNAT: VIP → Pod IP<br/>No SNAT (preserve Client IP)
LB->>Pod: GENEVE tunnel<br/>w/ DSR options<br/>SYN to Pod IP
Note over Pod: Pod processes request<br/>Sees client IP<br/>Extracts VIP from DSR options<br/>Uses VIP as source address
Pod-->>Client: SYN-ACK from VIP (direct)
Client->>LB: HTTP GET
LB->>Pod: GENEVE tunnel<br/>HTTP GET
Pod-->>Client: HTTP 200 OK (direct)
The DSR options are only needed to establish the connection. Once the connection is established, the backend tracks it for its entire lifetime.
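If you want to see that tracking for yourself, one way is to peek at Cilium's connection-tracking table on the backend Node; the output format differs between versions, so the grep is just a rough filter.

```bash
# List connection-tracking entries involving the client IP from the Cilium agent on ctrl-01.
kubectl exec -n kube-system ${CILIUM_CTRL_01} -c cilium-agent -- \
  cilium bpf ct list global | grep 10.144.12.11
```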
BGP with GENEVE Tunnelling#
BGP works differently from ARP in that more than one node can advertise the same IP address, though you need to peer the nodes to a router.
I’ve described my setup in detail in the aforementioned BGP article. Briefly summarised, all three control plane Nodes are configured as BGP speakers, while the worker Node stays mute.
Moving our target test Pod back to the ctrl-00 Node,
we can check the routes advertised by our speaker Nodes
❯ cilium bgp routes available ipv4 unicast
Node VRouter Prefix NextHop Age Attrs
ctrl-00 65200 172.20.10.200/32 0.0.0.0 27s [{Origin: i} {Nexthop: 0.0.0.0}]
65200 172.20.10.201/32 0.0.0.0 27s [{Origin: i} {Nexthop: 0.0.0.0}]
65200 172.20.10.210/32 0.0.0.0 24s [{Origin: i} {Nexthop: 0.0.0.0}]
65200 172.20.10.211/32 0.0.0.0 24s [{Origin: i} {Nexthop: 0.0.0.0}]
ctrl-01 65200 172.20.10.200/32 0.0.0.0 27s [{Origin: i} {Nexthop: 0.0.0.0}]
65200 172.20.10.201/32 0.0.0.0 27s [{Origin: i} {Nexthop: 0.0.0.0}]
ctrl-02 65200 172.20.10.200/32 0.0.0.0 28s [{Origin: i} {Nexthop: 0.0.0.0}]
65200 172.20.10.201/32 0.0.0.0 28s [{Origin: i} {Nexthop: 0.0.0.0}]
Note that the Services with externalTrafficPolicy: Cluster (.200 and .201) are advertised by all the speaker Nodes,
while the externalTrafficPolicy: Local Services (.210 and .211) are only advertised by the ctrl-00 Node where the Pod
runs.
Trying to reach all the BGP advertised Services, we see that the Pod receives the client IP for all requests
❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:65117
❯ curl 172.20.10.201
RemoteAddr: 10.144.12.11:65153
❯ curl 172.20.10.210
RemoteAddr: 10.144.12.11:65172
❯ curl 172.20.10.211
RemoteAddr: 10.144.12.11:65189
Looking at the TCP dumps from the ctrl-00 Node and inside the Pod,
we can see no meaningful difference in the communication between the client and the target Pod compared to the ARP
case with the Pod running on the leaseholder Node.
Even though multiple Nodes are advertising the same IP address,
only one of them is preferred.3
Taking a peek at the BGP routing table on my UCG Max router, we see the same routes advertised
root@Cloud-Gateway-Max:~# vtysh -c "show ip bgp"
BGP table version is 469, local router ID is 192.168.1.1, vrf id 0
Default local pref 100, local AS 65100
Status codes: s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Network Next Hop Metric LocPrf Weight Path
*> 172.20.10.200/32 192.168.1.100 0 65200 i
*= 192.168.1.101 0 65200 i
*= 192.168.1.102 0 65200 i
*> 172.20.10.201/32 192.168.1.100 0 65200 i
*= 192.168.1.101 0 65200 i
*= 192.168.1.102 0 65200 i
*> 172.20.10.210/32 192.168.1.100 0 65200 i
*> 172.20.10.211/32 192.168.1.100 0 65200 i
Here we also see that ctrl-00 (which has IP 192.168.1.100) is preferred for the externalTrafficPolicy: Cluster
Services, as indicated by the > symbol.
Changing the whoami Pod to run on ctrl-01 and trying again,
we notice that all Services but the SNAT-ed externalTrafficPolicy: Cluster one (.201) see the real client IP
❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:54333
❯ curl 172.20.10.201
RemoteAddr: 10.244.1.112:54670
❯ curl 172.20.10.210
RemoteAddr: 10.144.12.11:54817
❯ curl 172.20.10.211
RemoteAddr: 10.144.12.11:54960
This can be explained by requests to the .200 and .201 Services being routed via the BGP-preferred ctrl-00 Node,
even though going directly to ctrl-01 would be better,
while requests to the .210 and .211 Services go straight to the Node the Pod is running on.
To back up this explanation,
we can check the BGP routing table again.
Here we see that the externalTrafficPolicy: Cluster routes haven’t changed,
while the externalTrafficPolicy: Local routes have been moved to the ctrl-01 Node with IP 192.168.1.101
*> 172.20.10.200/32 192.168.1.100 0 65200 i
*= 192.168.1.101 0 65200 i
*= 192.168.1.102 0 65200 i
*> 172.20.10.201/32 192.168.1.100 0 65200 i
*= 192.168.1.101 0 65200 i
*= 192.168.1.102 0 65200 i
*> 172.20.10.210/32 192.168.1.101 0 65200 i
*> 172.20.10.211/32 192.168.1.101 0 65200 i
Deleting and recreating the Services and the Pod running on ctrl-01,
the ctrl-00 Node is still preferred for the first jump!
Moving the Pod over to the non-BGP peered work-00 Node,
we experience a similar behaviour to moving the Pod over to a non-leaseholder Node in the ARP case,
though with a much longer timeout for the non-routable paths
❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:58728
❯ curl 172.20.10.201
RemoteAddr: 10.244.1.112:58736
❯ curl 172.20.10.210
curl: (7) Failed to connect to 172.20.10.210 port 80 after 7140 ms: Couldn't connect to server
❯ curl 172.20.10.211
curl: (7) Failed to connect to 172.20.10.211 port 80 after 7157 ms: Couldn't connect to server
The timeouts can again be explained by the Node not having a Local route to the Pod,
thus dropping the connection.
Native Routing#
Restarting our setup with the target Pod on ctrl-00 together with the leases,
we unsurprisingly see the same behaviour as the tunnelling case for ARP announced Services
❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:56464
❯ curl 192.168.1.201
RemoteAddr: 10.144.12.11:56600
❯ curl 192.168.1.210
RemoteAddr: 10.144.12.11:56769
❯ curl 192.168.1.211
RemoteAddr: 10.144.12.11:56978
Moving the Pod over to ctrl-01 without the leases,
we also get a similar behaviour,
though the RemoteAddr is now reported as the load balancer Node IP instead of the load balancer Pod IP
❯ curl 192.168.1.200
RemoteAddr: 10.144.12.11:57423
❯ curl 192.168.1.201
RemoteAddr: 192.168.1.100:52797
❯ curl 192.168.1.210
curl: (7) Failed to connect to 192.168.1.210 port 80 after 7 ms: Couldn't connect to server
❯ curl 192.168.1.211
curl: (7) Failed to connect to 192.168.1.211 port 80 after 6 ms: Couldn't connect to server
We still can’t route to node-external VIPs with externalTrafficPolicy: Local for the same reason as before.
BGP also demonstrably behaves almost identically with native routing compared to tunnelling.
The same Node IP is reported as the RemoteAddr when the Pod runs on a different Node than the preferred
BGP peer
❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:65369
❯ curl 172.20.10.201
RemoteAddr: 192.168.1.100:49246
❯ curl 172.20.10.210
RemoteAddr: 10.144.12.11:49426
❯ curl 172.20.10.211
RemoteAddr: 10.144.12.11:49527
Again, moving the Pod over to the non-BGP peered work-00 Node, we get the same results with the
externalTrafficPolicy: Local Services refusing to connect.
❯ curl 172.20.10.200
RemoteAddr: 10.144.12.11:64639
❯ curl 172.20.10.201
RemoteAddr: 192.168.1.100:64710
❯ curl 172.20.10.210
curl: (7) Failed to connect to 172.20.10.210 port 80 after 7096 ms: Couldn't connect to server
❯ curl 172.20.10.211
curl: (7) Failed to connect to 172.20.10.211 port 80 after 7134 ms: Couldn't connect to server
Except for the RemoteAddr changing from a Pod IP to a Node IP, we see little difference between tunnelling and native
routing.
To see where the real difference lies, we have to inspect the TCP dumps.
In the SNAT case, the packets are NAT’ed to the target Pod and sent along their way.
---
title: SNAT with Native Routing
---
sequenceDiagram
participant Client as Client
participant LB as ctrl-00<br/>Load Balancer
participant Pod as ctrl-01<br/>Pod
Client->>LB: SYN to VIP
Note over LB: DNAT: VIP → Pod IP<br/>SNAT: Client IP → LB IP
LB->>Pod: SYN to Pod IP
Note over Pod: Pod processes request<br/>Sees LB IP
Pod->>LB: SYN-ACK from Pod IP
Note over LB: SNAT: Pod IP → VIP<br/>DNAT: LB IP → Client IP
LB->>Client: SYN-ACK from VIP
Client->>LB: HTTP GET
LB->>Pod: HTTP GET
Pod->>LB: HTTP 200 OK
LB->>Client: HTTP 200 OK
On the return path, the load balancer Node reverse-NATs the replies and forwards them back to the client.
With DSR, on the other hand, we see the equivalent packets being rewritten on the load balancer Node so that they appear to be sent from the client directly to the Pod, with the VIP carried in IP options.
Here, the network router needs to know which Node to send the packets to based on the Pod IP.
---
title: DSR with Native Routing
---
sequenceDiagram
participant Client as Client
participant LB as ctrl-00<br/>Load Balancer
participant Pod as ctrl-01<br/>Pod
Client->>LB: SYN to VIP
Note over LB: DNAT: VIP → Pod IP<br/>Add VIP to IP Options header<br/>Forward via native routing
LB->>Pod: IP Options: VIP<br/>SYN to Pod IP
Note over Pod: Pod processes request<br/>Sees client IP<br/>Extract VIP from IP Options<br/>Use VIP as source address
Pod-->>Client: SYN-ACK from VIP (direct)
Client->>LB: HTTP GET
LB->>Pod: IP Options: VIP<br/>HTTP GET
Pod-->>Client: HTTP 200 OK (direct)
The target Node then receives the packet and forwards it to the Pod directly. The Pod receives the request and sees the client IP, the original VIP is then extracted from the IP options, and a reply is sent back to the client using the VIP as the source address.
With DSR routing using GENEVE tunnelling, the packets from the load balancer Node are being sent to the target Pod Node with an encapsulated GENEVE header
20:18:26.776725 IP 10.144.12.11.60280 > 172.20.10.200.80: Flags [S], seq 4182363690, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2202670684 ecr 0,sackOK,eol], length 0
20:18:26.776776 IP 192.168.1.100.65047 > 192.168.1.101.6081: Geneve, Flags [none], vni 0x2, options [12 bytes]: IP 10.144.12.11.60280 > 10.244.0.240.80: Flags [S], seq 4182363690, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 2202670684 ecr 0,sackOK,eol], length 0
20:18:26.781830 IP 10.144.12.11.60280 > 172.20.10.200.80: Flags [.], ack 175331444, win 2054, options [nop,nop,TS val 2202670690 ecr 31889143], length 0
20:18:26.781857 IP 192.168.1.100.65047 > 192.168.1.101.6081: Geneve, Flags [none], vni 0x2: IP 10.144.12.11.60280 > 10.244.0.240.80: Flags [.], ack 175331444, win 2054, options [nop,nop,TS val 2202670690 ecr 31889143], length 0
On the receiving Node, the GENEVE header is inspected and stripped off before the packet is forwarded to the target Pod. Note that Geneve encapsulation is used both for the Node-to-Node tunnelling and for the DSR dispatch.
With native routing, on the other hand, the packet is only Destination NAT’ed, and a lightweight IP options header carrying the original VIP (later used as the source address of the reply) is added to the packet
21:21:11.926423 IP 10.144.12.11.53948 > 172.20.10.200.80: Flags [SEW], seq 1715394499, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1422266817 ecr 0,sackOK,eol], length 0
21:21:11.926439 IP 10.144.12.11.53948 > 10.244.0.241.80: Flags [SEW], seq 1715394499, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 1422266817 ecr 0,sackOK,eol], length 0
21:21:11.932260 IP 10.144.12.11.53948 > 172.20.10.200.80: Flags [.], ack 3228734302, win 2059, options [nop,nop,TS val 1422266822 ecr 2244985958], length 0
21:21:11.932280 IP 10.144.12.11.53948 > 10.244.0.241.80: Flags [.], ack 3228734302, win 2059, options [nop,nop,TS val 1422266822 ecr 2244985958], length 0
GENEVE tunnelling adds a few extra bytes of overhead per packet, but it means the Pod CIDR doesn’t have to be routable on the underlying network. If we had multiple clusters on the same network using native routing, we would have to use non-overlapping Pod CIDRs to avoid IP conflicts, though this IP exhaustion problem can be greatly alleviated by using IPv6.
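As a back-of-the-envelope estimate of that overhead (ignoring any additional DSR option bytes):

```bash
# Geneve encapsulation cost per packet over IPv4:
#   outer IPv4 header: 20 bytes
#   outer UDP header:   8 bytes
#   Geneve base header: 8 bytes (plus variable-length options)
# A 1500-byte underlay MTU therefore leaves roughly this much for the inner packet:
echo $((1500 - 20 - 8 - 8))   # => 1464
```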
---
title: Native Routing vs Geneve Encapsulation
---
block-beta
columns 2
block:native
columns 1
A["Native Routing"]
space
space
space
B["IP Header\nsrc: Client\ndst: Pod IP"]
C["TCP Header\nport 80"]
D["HTTP Data"]
end
block:geneve
columns 1
E["Geneve Tunneling"]
F["Outer IP Header\nsrc: LB Node\ndst: Pod Node"]
G["UDP Header\nport 6081"]
H["Geneve Header"]
I["Inner IP Header\nsrc: Client\ndst: Pod IP"]
J["TCP Header\nport 80"]
K["HTTP Data"]
end
classDef title fill:#64748b,stroke:#475569
classDef overhead fill:#b45309,stroke:#92400e
classDef packet fill:#047857,stroke:#065f46
class A,E title
class F,G,H overhead
class B,C,D,I,J,K packet
Reverse Proxy#
Up until now we’ve only done routing in the Transport layer (layer 4), by routing directly to the Pod using VIPs. But what happens if we introduce a reverse proxy in the Application layer (layer 7) in front of the Pod?
We leave Cilium in native routing mode and rely only on BGP advertisements for this part.
There are a lot of different reverse proxies out there, but I will be focusing on Cilium’s implementation of the Gateway API spec, which uses Envoy behind the scenes.

I’ll be using Gateway API configured with cert-manager as mentioned in the article above.
Here I’ve created a wildcard Certificate named cert-stonegarden (line 14) for my domain
| |
and referenced this in the following Gateway resource on line 24
| |
The Gateway is assigned the IP 172.20.10.100 (line 15) from the previously created BGP
CiliumLoadBalancerIPPool (line 12).
The Gateway Service is advertised by the BGP peer Nodes (line 11).
Note that the Gateway has both an HTTPS listener (line 17) — for HTTPRoutes, and a TLS listener (line 28) — for TLSRoutes.
We then create what is basically a copy of the BGP advertised DSR Service with externalTrafficPolicy: Local
| |
and match it with the following HTTPRoute
| |
to be reachable at https://whoami.stonegarden.dev (line 10).
Next, we check that the preferred BGP route is 192.168.1.100,
which is the ctrl-00 Node,
root@Cloud-Gateway-Max:~# vtysh -c "show ip bgp"
Status codes: s suppressed, d damped, h history, u unsorted, * valid, > best, = multipath,
*> 172.20.10.100/32 192.168.1.100 0 65200 i
*= 192.168.1.101 0 65200 i
*= 192.168.1.102 0 65200 i
and move the target Pod once again over to that Node.
Firing off curl towards our reverse proxy at https://whoami.stonegarden.dev we see that the Pod thinks that it’s
replying to a RemoteAddr that belongs to the PodCIDR-network and not the client,
even though we supposedly requested DSR routing!4
❯ curl https://whoami.stonegarden.dev
RemoteAddr: 10.244.0.104:57436
...
X-Envoy-Internal: true
X-Forwarded-For: 10.144.12.11
X-Forwarded-Proto: https
X-Request-Id: 905974a3-3329-49e3-a43e-16ed056b5d9f
The reply from the whoami-webserver also includes some headers.
From the whoami webserver we can see that Envoy has helpfully included the X-Forwarded-For header with the correct
client IP.
Cilium counts this header as the [source IP being visible to the
Pod](https://docs.cilium.io/en/latest/network/servicemesh/gateway-api/gateway-api/#source-ip-visibility),
and the MDN Web Docs agrees.
What happened here is that the reverse proxy is potentially sending a DSR reply,
but the request is still proxied to the target Pod.
The reverse proxy sees our client IP, but it acts as a middleman between the client and the target Pod.
Remember that we configured externalTrafficPolicy: Local on the Service.
If we move the target Pod over to a non-preffered Node,
e.g. ctrl-01
spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-01
and try curling again
❯ curl https://whoami.stonegarden.dev
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
We’re unable to reach the target Pod since the reverse proxy/load balancer on ctrl-00 no longer sees
the Service as being local to the Node.
This indicates that Envoy is trying to use the “externally-facing” LoadBalancerIP address,
instead of the Service’s “internally-facing” ClusterIP address
(since we haven’t touched the internalTrafficPolicy which defaults to Cluster).
To test this theory, we can create a regular Service of type ClusterIP (line 10) — which is the default Service type
| |
and update the HTTPRoute to reference it
rules:
- backendRefs: [ { name: whoami-regular, port: 80 } ]
Trying to reach the Pod again, we see that we’re now able to reach it
❯ curl https://whoami.stonegarden.dev
RemoteAddr: 10.244.2.192:57334
...
X-Envoy-Internal: true
X-Forwarded-For: 10.144.12.11
X-Forwarded-Proto: https
X-Request-Id: c42442c3-d49e-4fe0-ba9d-c3be6d81574e
To probe further we can add internalTrafficPolicy: Local to our ClusterIP Service and move the Pod to our
non-BGP-peered work-00 Node, and we still get a reply
❯ curl https://whoami.stonegarden.dev
RemoteAddr: 10.244.2.19:41581
...
X-Envoy-Internal: true
X-Forwarded-For: 10.144.12.11
X-Forwarded-Proto: https
X-Request-Id: e3e9ccdb-7b98-49e8-8dc6-bed7c8a366da
Taking a look at the TCP dumps, we can reconstruct the following simplified sequence diagram
---
title: Cross-node Traffic with TLS Termination at Proxy
---
sequenceDiagram
participant Client as Client
participant Proxy as ctrl-00<br/>Reverse Proxy/<br/>Load Balancer
participant Pod as work-00<br/>Pod
Note over Client,Proxy: TCP Handshake
Client->>Proxy: SYN
Proxy->>Client: SYN-ACK
Note over Client,Proxy: TLS Termination (at Proxy)
Client->>Proxy: TLS Client Hello
Proxy->>Client: TLS Server Hello, Cert, etc.
Client->>Proxy: Encrypted Data
Note over Proxy: Decrypts & parses Host header<br/>Resolves backend from HTTPRoute<br/>Adds X-Forwarded headers
rect rgba(128, 128, 128, 0.1337)
Note over Proxy,Pod: New connection (Unencrypted)
Proxy->>Pod: SYN
Pod->>Proxy: SYN-ACK
Proxy->>Pod: Plaintext Data
Pod->>Proxy: Plaintext Data
end
Note over Proxy: Encrypts response
Proxy->>Client: Encrypted Data
which shows that a separate connection is established between the reverse proxy and the target Pod, making a direct reply from the Pod to the client impossible, at least if we want to terminate TLS at the reverse proxy.
We can also try to use TLS Passthrough where the Pod itself terminates the TLS connection.
For this we first need a Certificate similar to what we used for the Gateway
| |
The only difference is that this time we request a single-domain certificate (line 8) instead of a wildcard one.
We mount the Secret created by the Certificate into our Pod (lines 23 and 32)
| |
and instruct the whoami webserver to use it on lines 16–20.
Since we want to also try to reach the Pod directly,
we create an unremarkable Load Balancer Service with IP 172.20.10.230 (line 8)
| |
Lastly, we connect the Service to the Gateway using the following TLSRoute
| |
which advertises the whoami-tls.stonegarden.dev route (line 10).
Trying to reach the Pod through the Gateway,
we still get a PodCIDR-IP as the RemoteAddr
❯ curl https://whoami-tls.stonegarden.dev
RemoteAddr: 10.244.2.19:34869
and since the connection is encrypted, Envoy can’t inject the X-Forwarded-For header,
making the webserver completely blind to the source IP.
Mapping the request, we get the following simplified sequence diagram
---
title: Cross-node Traffic with TLS Passthrough
---
sequenceDiagram
participant Client as Client
participant Proxy as ctrl-00<br/>Reverse Proxy/<br/>Load Balancer
participant Pod as work-00<br/>Pod
Note over Client,Proxy: TCP Handshake
Client->>Proxy: SYN
Proxy->>Client: SYN-ACK
Note over Client,Proxy: TLS Client Hello (SNI)
Client->>Proxy: TLS Client Hello
Note over Proxy: Reads SNI<br/>Resolves backend from TLSRoute
rect rgba(128, 128, 128, 0.1337)
Note over Proxy,Pod: TCP Handshake
Proxy->>Pod: SYN
Pod->>Proxy: SYN-ACK
end
Note over Client,Pod: TLS Passthrough (Termination at Pod)
Proxy->>Pod: TLS Client Hello
Pod->>Proxy: TLS Server Hello, Cert, etc.
Proxy->>Client: TLS Server Hello, Cert, etc.
Note over Client,Pod: Encrypted Traffic
Client->>Proxy: Encrypted Data
Proxy->>Pod: Encrypted Data
Pod->>Proxy: Encrypted Data
Proxy->>Client: Encrypted Data
Here we again see that a separate connection is established between the reverse proxy and the target Pod, making a direct reply from the Pod to the client impossible.
What we can do in this situation is to create a DNS record directly pointing to the Service IP.
To simulate this,
we can use the --resolve flag for curl
❯ curl --resolve whoami-tls.stonegarden.dev:443:172.20.10.230 \
https://whoami-tls.stonegarden.dev
RemoteAddr: 10.144.12.11:59917
Conclusion#
Enabling DSR lets a target Pod see the client IP directly, but it does come with some caveats.
Cilium must be configured to enable DSR, and the underlying network must support it. The target Pod also has to be able to talk to the client directly; no proxies can be involved.
To enable DSR in Cilium, we can either configure native routing if our network supports it, or we can use GENEVE tunnelling with some added overhead, though using far fewer routable IPs.
Behind a well-behaved proxy,
we should be able to rely on the X-Forwarded-For header to get the client IP,
but backend support for the header might vary.
Unrelated to DSR,
we can maybe save a jump using externalTrafficPolicy: Local on Services with BGP advertisements,
but it’s risky with ARP as we’re not guaranteed that the load balancing Node will be the same as the target Pod
Node.
Although quite the detour, I’m now able to have my AdGuardHome DNS server correctly pick up the source/client IP of DNS queries. This allows for better statistics and more fine-grained control over DNS.
In my case, the best approach is to turn on native routing with BGP peering.
I’ve also opted to default to SNAT forwarding as it’s the most reliable option,
but I’ve turned on bpf.lbModeAnnotation to allow selectively using DSR with IP options for some Services like the
DNS server.
Summary#
Cilium Configuration#
Values used for GENEVE tunnelling with GENEVE options for DSR
# cilium/values-tunnel.yaml
routingMode: tunnel
tunnelProtocol: geneve
loadBalancer:
  standalone: true
  mode: dsr
  dsrDispatch: geneve
bpf:
  lbModeAnnotation: true
Values used for native routing with IP Options for DSR
# cilium/values-native.yaml
routingMode: native
ipv4NativeRoutingCIDR: 10.244.0.0/16 # Talos PodCIDR
autoDirectNodeRoutes: true
loadBalancer:
  standalone: true
  algorithm: maglev
  mode: dsr
  dsrDispatch: opt
bpf:
  masquerade: true
  lbModeAnnotation: true
Values used for ARP announcements
# cilium/values-arp.yaml
l2announcements:
  enabled: true
k8sClientRateLimit:
  qps: 20
  burst: 100
ARP announcement policy
# cilium/arp-announce.yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: arp-announcement
  namespace: kube-system
spec:
  loadBalancerIPs: true
  serviceSelector:
    matchLabels:
      arp.cilium.io/announce-service: default
ARP announcement IP pool
# cilium/arp-ip-pool.yaml
apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-arp-ip-pool
spec:
  blocks:
    - start: 192.168.1.200
      stop: 192.168.1.255
  serviceSelector:
    matchLabels:
      arp.cilium.io/ip-pool: default
BGP advertisement IP pool
# cilium/bgp-ip-pool.yaml
apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: default-bgp-ip-pool
spec:
  blocks:
    - cidr: 172.20.10.0/24
  serviceSelector:
    matchLabels:
      bgp.cilium.io/ip-pool: default
Settings used for the Reverse Proxy section.
# cilium/values.yaml
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: 10.244.0.0/16 # Talos PodCIDR
autoDirectNodeRoutes: true
bgpControlPlane:
  enabled: true
loadBalancer:
  standalone: true
  algorithm: maglev
  mode: dsr
  dsrDispatch: opt
  l7:
    backend: envoy
bpf:
  masquerade: true
  lbModeAnnotation: true
Test#
# test/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ns.yaml
- ds-netshoot.yaml
- pod-whoami.yaml
- svc-arp-cluster-dsr.yaml
- svc-arp-cluster-snat.yaml
- svc-arp-local-dsr.yaml
- svc-arp-local-snat.yaml
- svc-bgp-cluster-dsr.yaml
- svc-bgp-cluster-snat.yaml
- svc-bgp-local-dsr.yaml
- svc-bgp-local-snat.yaml
- svc.yaml
- http-route.yaml
Namespace with
privileged Pod Security Admission Policy
# test/ns.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dsr
  labels:
    pod-security.kubernetes.io/enforce: privileged
Designated target Pod with whoami webserver and netshoot tools
# test/pod-whoami.yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami
  namespace: dsr
  labels:
    app: whoami
spec:
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
  containers:
    - name: netshoot
      image: ghcr.io/nicolaka/netshoot:v0.15
      command: [ /whoami/whoami ]
      ports:
        - name: http
          containerPort: 80
      volumeMounts:
        - name: whoami
          mountPath: /whoami
  volumes:
    - name: whoami
      image:
        reference: ghcr.io/traefik/whoami:latest
DaemonSet with netshoot tools on the host network
# test/ds-netshoot.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: netshoot
  namespace: dsr
spec:
  selector:
    matchLabels:
      app: netshoot
  template:
    metadata:
      labels:
        app: netshoot
    spec:
      hostNetwork: true
      containers:
        - name: netshoot
          image: ghcr.io/nicolaka/netshoot:v0.15
          command: [ tail, -f, /dev/null ]
ARP announced Services
# test/svc-arp-cluster-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-cluster-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.200
    service.cilium.io/forwarding-mode: dsr
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-arp-cluster-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-cluster-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.201
    service.cilium.io/forwarding-mode: snat
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-arp-local-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-local-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.210
    service.cilium.io/forwarding-mode: dsr
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-arp-local-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: arp-local-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 192.168.1.211
    service.cilium.io/forwarding-mode: snat
  labels:
    arp.cilium.io/announce-service: default
    arp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
BGP advertised Services
# test/svc-bgp-cluster-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-cluster-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.200
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-bgp-cluster-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-cluster-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.201
    service.cilium.io/forwarding-mode: snat
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-bgp-local-dsr.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-local-dsr
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.210
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
# test/svc-bgp-local-snat.yaml
apiVersion: v1
kind: Service
metadata:
  name: bgp-local-snat
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.211
    service.cilium.io/forwarding-mode: snat
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
Service for the reverse proxy
apiVersion: v1
kind: Service
metadata:
  name: whoami
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.220
    service.cilium.io/forwarding-mode: dsr
  labels:
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  internalTrafficPolicy: Cluster
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
HTTPRoute for the reverse proxy
# test/http-route.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: whoami
  namespace: dsr
spec:
  parentRefs:
    - { name: internal, namespace: gateway }
  hostnames: [ "whoami.stonegarden.dev" ]
  rules:
    - backendRefs: [ { name: whoami-regular, port: 80 } ]
      matches:
        - path: { type: PathPrefix, value: / }
Simpler Service for reverse proxy
# test/svc-regular.yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami-regular
  namespace: dsr
  annotations:
    service.cilium.io/forwarding-mode: dsr
spec:
  type: ClusterIP
  selector:
    app: whoami
  ports:
    - name: web
      port: 80
      targetPort: http
Gateway#
Gateway with TLS and HTTPS listeners
# gateway/gateway.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: internal
  namespace: gateway
spec:
  gatewayClassName: cilium
  infrastructure:
    labels:
      bgp.cilium.io/advertise-service: default
      bgp.cilium.io/ip-pool: default
  addresses:
    - type: IPAddress
      value: 172.20.10.100
  listeners:
    - protocol: HTTPS
      port: 443
      name: https-gateway
      hostname: "*.stonegarden.dev"
      tls:
        certificateRefs:
          - kind: Secret
            name: cert-stonegarden
      allowedRoutes:
        namespaces:
          from: All
    - protocol: TLS
      port: 443
      name: tls-passthrough
      hostname: "*.stonegarden.dev"
      tls:
        mode: Passthrough
      allowedRoutes:
        namespaces:
          from: All
Wildcard Certificate for the Gateway
# gateway/cert-stonegarden.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: cert-stonegarden
  namespace: gateway
spec:
  dnsNames:
    - "*.stonegarden.dev"
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: cloudflare-cluster-issuer
  secretName: cert-stonegarden
  usages:
    - digital signature
    - key encipherment
Reverse Proxy#
# tls/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- cert.yaml
- svc.yaml
- tls-route.yaml
- pod-whoami-tls.yaml
Certificate for the TLS passthrough Pod
# tls/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: whoami-tls
  namespace: dsr
spec:
  dnsNames: [ whoami-tls.stonegarden.dev ]
  issuerRef:
    group: cert-manager.io
    kind: ClusterIssuer
    name: cloudflare-cluster-issuer
  secretName: whoami-tls
  usages:
    - digital signature
    - key encipherment
Pod serving its own TLS certificate
# tls/pod-whoami-tls.yaml
apiVersion: v1
kind: Pod
metadata:
  name: whoami-tls
  namespace: dsr
  labels:
    app: whoami-tls
spec:
  hostNetwork: false
  nodeSelector:
    kubernetes.io/hostname: ctrl-00
  containers:
    - name: netshoot
      image: ghcr.io/nicolaka/netshoot:v0.15
      command: [
        /whoami/whoami,
        --cert, /tls/tls.crt,
        --key, /tls/tls.key,
        --port, "443" ]
      ports:
        - name: https
          containerPort: 443
      volumeMounts:
        - name: whoami
          mountPath: /whoami
        - name: tls-certs
          mountPath: /tls
          readOnly: true
  volumes:
    - name: whoami
      image:
        reference: ghcr.io/traefik/whoami:latest
    - name: tls-certs
      secret:
        secretName: whoami-tls
Service for the TLS passthrough Pod
# tls/svc.yaml
apiVersion: v1
kind: Service
metadata:
  name: whoami-tls
  namespace: dsr
  annotations:
    io.cilium/lb-ipam-ips: 172.20.10.230
    service.cilium.io/forwarding-mode: dsr
  labels:
    app: whoami-tls
    bgp.cilium.io/advertise-service: default
    bgp.cilium.io/ip-pool: default
spec:
  type: LoadBalancer
  selector:
    app: whoami-tls
  ports:
    - name: https
      port: 443
      targetPort: 443
TLSRoute for the TLS passthrough Pod
# tls/tlsroute.yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: whoami-tls
  namespace: dsr
spec:
  parentRefs:
    - { name: internal, namespace: gateway }
  hostnames: [ "whoami-tls.stonegarden.dev" ]
  rules:
    - backendRefs: [ { name: whoami-tls, port: 443 } ]
- backendRefs: [ { name: whoami-tls, port: 443 } ]See this LinkedIn post by Nicolas Vibert for an explanation on VXLAN in Cilium. ↩︎
Or you could cheat and look at my TCP dumps. ↩︎
I think it should be technically possible to load balance between the different Nodes, but I haven’t been able to figure out how to do it yet. ↩︎
This is actually expected behaviour as per Gateway API GitHub issue #451. ↩︎


