Expose loadbalanced Kubernetes services with BGP (Cilium)
This blog post shows you how your Kubernetes Service can be exposed to the outside world, using Cilium and BGP.
Cilium
Cilium is an open source project to provide networking, security and observability for cloud native environments such as Kubernetes clusters and other container orchestration platforms.
BGP
Border Gateway Protocol (BGP) is a standardized exterior gateway protocol designed to exchange routing and reachability information among autonomous systems (AS) on the Internet. BGP is classified as a path-vector routing protocol. It makes routing decisions based on paths, network policies or rule sets configured by a network administrator. Source: Wikipedia
You can compare BGP somewhat with the postal service, where a distribution center can be seen as an AS which is responsible for certain postal codes. When you send a letter to your friend on the other side of the country, your local postal office does not deliver it directly to your friend, but they know based on the postal code, to which distribution center they should send it. And from that distribution center it probably will be send to another distribution center before it is handed over for local delivery in the city of your friend.
Cilium and BGP
In release 1.10, Cilium integrated BGP support using MetalLB, which enables it to announce Kubernetes Service ip addresses of the type LoadBalancer using BGP. The result is that services are reachable from outside the Kubernetes network without extra components, such as an Ingress Router. Especially the ‘without extra components’ part is fantastic news, since every component adds latency - so without those less latency.
Lab environment
First let me explain how the lab is set up and what the final result will be.
The lab consists of a client network (192.168.10.0/24) and a Kubernetes network (192.168.1.1/24). When a Service gets a LoadBalancer ip address, that address will be served from the pool 172.16.10.0/24. In our lab, the following nodes are present:
Name | IP addresses | Description |
---|---|---|
bgp-router1 | 192.168.1.1/24 (k8s network), 192.168.10.1/24 (client network) | The BGP router |
k8s-control | 192.168.1.5/24 (k8s network), 192.168.10.233/24 (client network) | Management node |
k8s-master1 | 192.168.1.10/24 (k8s network) | k8s master |
k8s-worker1 | 192.168.1.21/24 (k8s network) | k8s worker |
k8s-worker2 | 192.168.1.22/24 (k8s network) | k8s worker |
k8s-worker3 | 192.168.1.23/24 (k8s network) | k8s worker |
k8s-worker4 | 192.168.1.24/24 (k8s network) | k8s worker |
k8s-worker5 | 192.168.1.25/24 (k8s network) | k8s worker |
After all the parts are configured, it will be possible to reach a Service in the Kubernetes network, from the client network, using the announced LoadBalancer IP address.
Lab configuration
BGP router
The router is a Red Hat 8 system with three network interfaces (external-, kubernetes- and client network) with FRRouting (FRR) responsible for handling the BGP traffic. FRR is a free and open source Internet routing protocol suite for Linux and Unix platforms. It implements many routing protocols like BGP, OSPF and RIP. In our lab only BGP will be enabled.
dnf install frr
systemctl enable frr
firewall-cmd --permanent --add-port=179/tcp
echo 'net.ipv4.ip_forward=1' > /etc/sysctl.d/01-sysctl.conf
sysctl -w net.ipv4.ip_forward=1
After installing FRR, the BGP daemon is configured to start by changing bgpd=no
to bgpd=yes
in the configuration file /etc/frr/daemons
, using the following BGP configuration in /etc/frr/bgpd.conf
log syslog notifications
frr defaults traditional
!
router bgp 64512
no bgp ebgp-requires-policy
bgp router-id 192.168.1.1
neighbor 192.168.1.10 remote-as 64512
neighbor 192.168.1.10 update-source enp7s0
neighbor 192.168.1.21 remote-as 64512
neighbor 192.168.1.21 update-source enp7s0
neighbor 192.168.1.22 remote-as 64512
neighbor 192.168.1.22 update-source enp7s0
neighbor 192.168.1.23 remote-as 64512
neighbor 192.168.1.23 update-source enp7s0
neighbor 192.168.1.24 remote-as 64512
neighbor 192.168.1.24 update-source enp7s0
neighbor 192.168.1.25 remote-as 64512
neighbor 192.168.1.25 update-source enp7s0
address-family ipv4 unicast
neighbor 192.168.1.10 next-hop-self
neighbor 192.168.1.21 next-hop-self
neighbor 192.168.1.22 next-hop-self
neighbor 192.168.1.23 next-hop-self
neighbor 192.168.1.24 next-hop-self
neighbor 192.168.1.25 next-hop-self
exit-address-family
!
address-family ipv6 unicast
exit-address-family
!
line vty
In the config file above the AS number 64512 is used, which is reserved for private use. The Kubernetes master node and worker nodes are configured as neighbor. The ip address of the router’s interface in the Kubernetes network (192.168.1.1) is used as router id.
After the configuration above is applied and the FRR daemon is started using the systemctl start frr
command, the command vtysh -c 'show bgp summary'
shows the following output.
IPv4 Unicast Summary:
BGP router identifier 192.168.1.1, local AS number 64512 vrf-id 0
BGP table version 0
RIB entries 0, using 0 bytes of memory
Peers 6, using 86 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
192.168.1.10 4 64512 0 0 0 0 0 never Active 0
192.168.1.21 4 64512 0 0 0 0 0 never Active 0
192.168.1.22 4 64512 0 0 0 0 0 never Active 0
192.168.1.23 4 64512 0 0 0 0 0 never Active 0
192.168.1.24 4 64512 0 0 0 0 0 never Active 0
192.168.1.25 4 64512 0 0 0 0 0 never Active 0
Total number of neighbors 6
Kubernetes and Cilium
It goes beyond the scope of this blog to explain how to install the Kubernetes nodes and the Kubernetes cluster. For your information: in this lab Red Hat 8 (minimal installation) is used as the operating system for all the nodes and Kubeadm was subsequently used to set up the cluster.
The Cilium Helm chart version v1.10.5 is used to install and configure Cilium on the cluster, using these values:
debug:
enabled: true
externalIPs:
enabled: true
hostPort:
enabled: true
hostServices:
enabled: true
kubeProxyReplacement: strict
k8sServiceHost: 192.168.1.10
k8sServicePort: 6443
nodePort:
enabled: true
operator:
replicas: 1
bgp:
enabled: true
announce:
loadbalancerIP: true
hubble:
enabled: true
metrics:
enabled:
- dns
- drop
- tcp
- flow
- port-distribution
- icmp
- http
listenAddress: ":4244"
relay:
enabled: true
ui:
enabled: true
To get Cilium up and running with BGP, only the bgp
key and subkeys are needed from the settings above. The other settings are used to get a fully working Cilium environment with, for example the Hubble user interface.
Expose a service
When Kubernetes is running and Cilium is configured, it is time to create a deployment and expose it to the network using BGP. The following YAML file creates a Deployment web1, which is just a simple NGINX web server serving the default web page The file also creates a Service web1-lb with a Service type LoadBalancer. This results in an external ip address that is announced to our router using BGP.
---
apiVersion: v1
kind: Service
metadata:
name: web1-lb
spec:
type: LoadBalancer
ports:
- port: 80
targetPort: 80
protocol: TCP
name: http
selector:
svc: web1-lb
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: web1
spec:
selector:
matchLabels:
svc: web1-lb
template:
metadata:
labels:
svc: web1-lb
spec:
containers:
- name: web1
image: nginx
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
readinessProbe:
httpGet:
path: /
port: 80
After applying the YAML file above, the command kubectl get svc
shows that Service web1-lb has an external ip address:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 7d3h
web1-lb LoadBalancer 10.106.236.120 172.16.10.0 80:30256/TCP 7d2h
The address 172.16.10.0 seems strange, but it is fine. Often the .0 address is skipped and the .1 address is used as the first address. One of the reasons is that in the early days the .0 address was used for broadcast, which was later changed to .255. Since .0 is still a valid address MetalLB, which is responsible for the address pool, hands it out as the first address. The command vtysh -c 'show bgp summary'
on router bgp-router1 shows that it has received one prefix:
IPv4 Unicast Summary:
BGP router identifier 192.168.1.1, local AS number 64512 vrf-id 0
BGP table version 17
RIB entries 1, using 192 bytes of memory
Peers 6, using 128 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
192.168.1.10 4 64512 445 435 0 0 0 03:36:56 1 0
192.168.1.21 4 64512 446 435 0 0 0 03:36:54 1 0
192.168.1.22 4 64512 445 435 0 0 0 03:36:56 1 0
192.168.1.23 4 64512 445 435 0 0 0 03:36:56 1 0
192.168.1.24 4 64512 446 435 0 0 0 03:36:56 1 0
192.168.1.25 4 64512 445 435 0 0 0 03:36:56 1 0
Total number of neighbors 6
The following snippet of the routing table (ip route
) tells us that for that specific ip address 172.16.10.0, 6 possible routes/destinations are present. In other words, all Kubernetes nodes announced that they can handle traffic for that address. Cool!!
172.16.10.0 proto bgp metric 20
nexthop via 192.168.1.10 dev enp7s0 weight 1
nexthop via 192.168.1.21 dev enp7s0 weight 1
nexthop via 192.168.1.22 dev enp7s0 weight 1
nexthop via 192.168.1.23 dev enp7s0 weight 1
nexthop via 192.168.1.24 dev enp7s0 weight 1
nexthop via 192.168.1.25 dev enp7s0 weight 1
Indeed, the web page is now visible from our router.
$ curl -s -v http://172.16.10.0/ -o /dev/null
* Trying 172.16.10.0...
* TCP_NODELAY set
* Connected to 172.16.10.0 (172.16.10.0) port 80 (#0)
> GET / HTTP/1.1
> Host: 172.16.10.0
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Server: nginx/1.21.3
< Date: Sun, 31 Oct 2021 14:19:17 GMT
< Content-Type: text/html
< Content-Length: 615
< Last-Modified: Tue, 07 Sep 2021 15:21:03 GMT
< Connection: keep-alive
< ETag: "6137835f-267"
< Accept-Ranges: bytes
<
{ [615 bytes data]
* Connection #0 to host 172.16.10.0 left intact
And a client in our client network can also reach that same page, since it uses bgp-router1 as default route.
More details
Now it is all working, most engineers want to see more details, so I will not let you down :)
ping
One of the first things you will notice is that the LoadBalanced ip address is not reachable via ping. Diving a bit deeper reveals why, but before that, let’s create the Cilium aliases to make it easier running cilium
, which is present in each Cilium agent pod.
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium --output=jsonpath='{.items[*].metadata.name}' --field-selector=spec.nodeName=k8s-master1)
alias cilium="kubectl -n kube-system exec -ti ${CILIUM_POD} -c cilium-agent -- cilium"
First see the output of this snippet of cilium bpf lb list
, that shows the configured load balancing configuration inside Cilium for our Service *web1-lb`:
172.16.10.0:80 0.0.0.0:0 (5) [LoadBalancer]
10.0.3.150:80 (5)
Here you can see that a mapping is created between source port 80 and destination port 80. This mapping is executed using eBPF logic at the interface and is present on all nodes. This mapping shows that only(!) traffic for port 80 is balanced. All other traffic, including the ping, is not picked up. That is why you can see the icmp packet reaching the node, but a response is never send.
Observe traffic
Hubble is the networking and security observability platform which is built on top of eBPF and Cilium. Via the command line and via a graphical web GUI, it is possible to see current and historical traffic. In this lab, Hubble is placed on the k8s-control node, which has direct access to the API of Hubble Relay. Hubble Relay is the component that obtains the needed information from the Cilium nodes. Be aware that the hubble
command is also present in each Cilium agent pod, but that one will only show information for that specific agent!
export HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check hubble-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
rm hubble-linux-amd64.tar.gz{,.sha256sum}
The following outputs show the observer information which is a result of the curl http://172.16.10.0/
command on the router.
$ hubble observe --namespace default --follow
Oct 31 15:43:41.382: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: SYN)
Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.384: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:43:41.385: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.385: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:43:41.386: 192.168.1.1:36946 <> default/web1-696bfbbbc4-jnxbc:80 to-overlay FORWARDED (TCP Flags: ACK)
Before, I warned about not using the hubble
command inside the Cilium agent pod, but it can also be very informative seeing the specific node traffic. In this case a hubble observe --namespace default --follow
is executed within each Cilium agent pod and the curl from the router is once executed. On the node where the pod is ‘living’ (k8s-worker2), we see the same output as the one above, However on another pod (k8s-worker1) we see the following output:
Oct 31 15:56:05.220: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: SYN)
Oct 31 15:56:05.220: 10.0.3.103:48278 <- default/web1-696bfbbbc4-jnxbc:80 to-stack FORWARDED (TCP Flags: SYN, ACK)
Oct 31 15:56:05.220: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK)
Oct 31 15:56:05.221: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:05.221: 10.0.3.103:48278 <- default/web1-696bfbbbc4-jnxbc:80 to-stack FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:05.222: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:56:05.222: 10.0.3.103:48278 <- default/web1-696bfbbbc4-jnxbc:80 to-stack FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:56:05.222: 10.0.3.103:48278 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK)
Oct 31 15:56:12.739: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: SYN)
Oct 31 15:56:12.739: default/web1-696bfbbbc4-jnxbc:80 <> 10.0.4.105:36956 to-overlay FORWARDED (TCP Flags: SYN, ACK)
Oct 31 15:56:12.742: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK)
Oct 31 15:56:12.742: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:12.745: default/web1-696bfbbbc4-jnxbc:80 <> 10.0.4.105:36956 to-overlay FORWARDED (TCP Flags: ACK, PSH)
Oct 31 15:56:12.749: 10.0.4.105:36956 -> default/web1-696bfbbbc4-jnxbc:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Oct 31 15:56:12.749: default/web1-696bfbbbc4-jnxbc:80 <> 10.0.4.105:36956 to-overlay FORWARDED (TCP Flags: ACK, FIN)
What we see here is that our router is sending the traffic for ip address 172.16.10.0 to k8s-worker1, but that worker does not host our web1 container, so it forwards the traffic to k8s-worker2 which handles the traffic. All the forwarding logic is handled using eBPF – a small BPF program attached to the interface will send the traffic and routes to another worker if needed. That is also the reason that running tcpdump on k8s-worker1, where the packages initially are received, does not show any traffic. It is already redirected to k8s-worker2 before it could land in the ip stack of k8s-worker1.
Cilium.io has a lot of information about eBPF and the internals. If you have not heard about eBPF and you are into Linux and/or networking, please do yourself a favor and learn at least the basics. In my humble opinion eBPF will change networking in Linux drastically in the near future and especially for cloud native environments!
Hubble Web GUI
With a working BGP set-up, it is quite simple to make the Hubble Web GUI available to the outside world as well.
kubectl -n kube-system expose deployment hubble-ui --type=LoadBalancer --port=80 --target-port=8081 --name hubble-ui-lb
Final words
Due to the integrated MetalLB, it is very easy to set up Cilium with BGP. Plus, you don’t need expensive network hardware. Cilium/BGP, combined with the disabling of kube-proxy, lowers the latency to your cloud based Services and gives a clear view of what is exposed to the outside world by only announcing the LoadBalancers ip addresses. Although an Ingress Controller is not required with this set-up, I still would recommend one for most HTTP Services. They have great value at the protocol level for rewriting URLs or rate limiting requests. Examples are NGINX or Traefik (exposed by BGP of course).
All in all, it is very exciting to see that cloud native networking, but also networking within Linux is still improving!