Kubernetes is a system for automating deployment, scaling, and management of containerized applications. Networking is a central part of Kubernetes, and in this article we will explore how Kubernetes configures the cluster to handle east-west traffic. We’ll reserve discussion on north-south traffic for a later article.
This article is long and a bit heavy-handed on annotations, command-line instructions, and pointers to implementations in Kubernetes and associated components. There are dozens of footnotes. We are diving into fairly deep waters here. I tried my best to keep a coherent flow going. Feel free to drop a comment below if you notice a mistake somewhere or if you’d like to offer editorial advice.
Concepts
By default, all pods in a K8s cluster can communicate with each other without NAT (source)1, so each pod is assigned a cluster-wide IP address. Containers within each pod share the pod’s network namespace, allowing them to communicate with each other on localhost via the loopback interface. From the point of view of the workloads running inside the containers, this IP network looks like any other, and no changes are necessary.
Conceptual view of inter-Pod and intra-Pod network communication.
Recall from a previous article that as far as K8s components go, the kubelet and the kube-proxy are responsible for creating pods and applying network configurations on the cluster’s nodes.
When a pod is being created or terminated, part of the kubelet’s job is to set up or clean up the pod’s sandbox on the node it is running on. The kubelet relies on the Container Runtime Interface (CRI) implementation to handle the details of creating and destroying sandboxes. The CRI is composed of several interfaces; the interesting ones for us are the RuntimeService interface (client-side API; integration point kubelet -> CRI) and the RuntimeServiceServer interface (server-side API; integration point RuntimeService -> CRI implementation). These APIs are large, but for this article we are only interested in the *PodSandbox set of methods (e.g. RunPodSandbox). Underneath the CRI’s hood, however, is the Container Network Interface, which creates and configures the pod’s network namespace2.
The kube-proxy configures routing rules to proxy traffic directed at Services and performs simple load-balancing between the corresponding Endpoints3.
Finally, a third component, coreDNS, resolves network names by looking them up in etcd.
Components involved in the network configuration for a pod. Blue circles are pods and orange rectangles are daemons. Note that etcd is shown here as a database service, but it is also deployed as a pod.
In the next section we will see how pod networking works by manually creating our own pods and having a client in one pod invoke an API in a different pod.
I will be using a simple K8s cluster I set up with kind in the walkthrough below. kind creates a docker container per K8s node. You may choose a similar sandbox, machine instances in the cloud, or any other setup that simulates at least two host machines connected to the same network. Also note that Linux hosts are used for this walkthrough.
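If you want a similar sandbox, the sketch below shows one way to create a two-node kind cluster and get a shell on a node. It assumes kind and docker are installed; the container names shown are kind’s defaults and may differ with your version.

# sketch: a minimal two-node kind cluster
cat > kind-config.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
EOF
kind create cluster --config kind-config.yaml

# each "node" is just a docker container; open a root shell on one to follow along
docker exec -it kind-control-plane bash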
Create your own Pod Network
We will manually create pods on different hosts to gain an understanding of how Kubernetes’ networking is configured under the hood.
Network namespaces
Linux has a concept called namespaces. Namespaces isolate the resources that one process sees from those seen by other processes. For example, a process may see MySQL running with PID 123, while a different process running in a different namespace (but on the same host) will see a different process assigned to PID 123, or none at all.
There are different kinds of namespaces; we are interested in the Network (net) namespace.
Each namespace has a virtual loopback interface and may have additional virtual network devices attached. Each of these virtual devices may be assigned exclusive or overlapping IP address ranges.
localhost
Processes running inside the same net namespace can send messages to each other over localhost.
Hands On
Create a net namespace with a client and a server:
On a host we’ll call “client”
# create network namespace
root@kind-control-plane:/# ip netns add client
root@kind-control-plane:/# ip netns list
client

# `loopback` is DOWN by default
root@kind-control-plane:/# ip netns exec client ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

# initialize `loopback` (`lo` is shorthand for "loopback")
root@kind-control-plane:/# ip netns exec client ip link set lo up

# start the server
root@kind-control-plane:/# ip netns exec client nohup python3 -m http.server 8080 &
[1] 29509
root@kind-control-plane:/# nohup: ignoring input and appending output to 'nohup.out'

# invoke the server
root@kind-control-plane:/# ip netns exec client curl -m 2 localhost:8080
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
...
</ul>
<hr>
</body>
</html>
Traffic from a client to a server inside a network namespace. Blue is traffic on localhost. Notice the host’s interface (eth0) is bypassed entirely for this traffic.
With this we have one or more processes that can communicate over localhost. This is exactly how K8s Pods work; these “processes” are K8s containers.
Connecting network namespaces on the same host
Remember that all pods in a K8s cluster can communicate with each other without NAT. So, how would two pods on the same host communicate with each other? Let’s give it a shot by creating a “server” namespace and attempting to communicate with it.
Hands On
On the same “client” host
# create the other pod's network namespace
root@kind-control-plane:/# ip netns add server
root@kind-control-plane:/# ip netns list
server
client

# stop the server you had running before and restart it in the new `server` namespace
root@kind-control-plane:/# ip netns exec server nohup python3 -m http.server 8080 &
[1] 29538
root@kind-control-plane:/# nohup: ignoring input and appending output to 'nohup.out'

# attempt to call this server from the client namespace
root@kind-control-plane:/# ip netns exec client curl localhost:8080
curl: (7) Failed to connect to localhost port 8080 after 0 ms: Connection refused
We don’t have an address for server from within the client namespace yet. These two network namespaces are completely disconnected from each other. All client and server have is localhost (dev lo), which is always assigned 127.0.0.1. We need another interface between these two namespaces for communication to happen.
Linux has the concept of Virtual Ethernet Devices (veth), which act like “pipes” through which network packets flow; either end of a veth can be attached to a namespace or a device. The “ends” of these “pipes” act as virtual devices to which IP addresses can be assigned. It is perfectly possible to create a veth device and connect our two namespaces like this:
However, veth devices are point-to-point links with just two ends. Remembering our requirement that all Pods must communicate with each other without NAT, we would need \(n(n-1)/2\) veth pairs, where \(n\) is the number of namespaces. This becomes unwieldy pretty quickly. We will use a bridge instead to solve this problem. A bridge lets us connect any number of devices to it and will happily forward traffic between them, turning our architecture into a hub-and-spoke and reducing the number of veth pairs to just \(n\).
Hands On
On the “client” host
# create a bridge
root@kind-control-plane:/# ip link add bridge type bridge

# create veth pairs
root@kind-control-plane:/# ip link add veth-client type veth peer name veth-clientbr
root@kind-control-plane:/# ip link add veth-server type veth peer name veth-serverbr

# connect one end of the veth devices to the bridge
root@kind-control-plane:/# ip link set veth-clientbr master bridge
root@kind-control-plane:/# ip link set veth-serverbr master bridge

# attach the other end of the veth devices to their respective namespaces
root@kind-control-plane:/# ip link set veth-client netns client
root@kind-control-plane:/# ip link set veth-server netns server

# assign IP addresses to the bridge and our new interfaces inside the client and server namespaces
root@kind-control-plane:/# ip netns exec client ip addr add 10.0.0.1/24 dev veth-client
root@kind-control-plane:/# ip netns exec server ip addr add 10.0.0.2/24 dev veth-server
root@kind-control-plane:/# ip addr add 10.0.0.0/24 dev bridge

# bring our devices up
root@kind-control-plane:/# ip netns exec client ip link set veth-client up
root@kind-control-plane:/# ip netns exec server ip link set veth-server up
root@kind-control-plane:/# ip link set veth-clientbr up
root@kind-control-plane:/# ip link set veth-serverbr up
root@kind-control-plane:/# ip link set bridge up

# confirm state of our interfaces:

# state of client interfaces
root@kind-control-plane:/# ip netns exec client ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
16: veth-client@if15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 5e:0e:50:4b:f5:32 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.1/24 scope global veth-client
       valid_lft forever preferred_lft forever
    inet6 fe80::5c0e:50ff:fe4b:f532/64 scope link
       valid_lft forever preferred_lft forever

# state of server interfaces
root@kind-control-plane:/# ip netns exec server ip addr
...
18: veth-server@if17: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 46:d0:61:5d:7c:9a brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.2/24 scope global veth-server
       valid_lft forever preferred_lft forever
    inet6 fe80::44d0:61ff:fe5d:7c9a/64 scope link
       valid_lft forever preferred_lft forever

# state of host interfaces
root@kind-control-plane:/# ip addr
...
11: eth0@if12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fc00:f853:ccd:e793::2/64 scope global nodad
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe12:2/64 scope link
       valid_lft forever preferred_lft forever
14: bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ba:21:cf:c1:62:52 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.0/24 scope global bridge
       valid_lft forever preferred_lft forever
15: veth-clientbr@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge state UP group default qlen 1000
    link/ether ba:21:cf:c1:62:52 brd ff:ff:ff:ff:ff:ff link-netns client
17: veth-serverbr@if18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge state UP group default qlen 1000
    link/ether c2:52:97:04:03:2c brd ff:ff:ff:ff:ff:ff link-netns server

# test connectivity
root@kind-control-plane:/# ip netns exec client curl -v 10.0.0.2:8080
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
...
</ul>
<hr>
</body>
</html>
At this point the whole setup looks like this:
Two Linux net namespaces connected to each other via a bridge. Note that although the bridge is connected to the host’s interface (eth0), traffic between the namespaces bypasses it entirely.
We have just connected two network namespaces on the same host.
Connecting network namespaces on different hosts
The only way in and out of our hosts in our example above is via their eth0
interface. For outbound traffic, the packets first need to reach eth0
before being forwarded to the physical network. For inbound packets, eth0
needs to forward those to the bridge where they will be routed to the respective namespace interfaces. Let’s first separate our two namespaces before going further.
Moving our network namespaces onto different hosts
Let’s first clean up everything we’ve done so far4:
Hands On
Steps
# delete the namespaces
root@kind-control-plane:/# ip netns del client
root@kind-control-plane:/# ip netns del server

# delete the veth and bridge devices
root@kind-control-plane:/# ip link del veth-client
root@kind-control-plane:/# ip link del veth-server
root@kind-control-plane:/# ip link del bridge
Let’s now set up our namespaces in different hosts.
Hands On
Same steps as before except on different hosts with some minor differences:
On the “client” host
root@kind-control-plane:/# ip netns add client
root@kind-control-plane:/# ip link add bridge type bridge
root@kind-control-plane:/# ip link add veth-client type veth peer name veth-clientbr
root@kind-control-plane:/# ip link set veth-client netns client
root@kind-control-plane:/# ip link set veth-clientbr master bridge
root@kind-control-plane:/# ip addr add 10.0.0.0/24 dev bridge
root@kind-control-plane:/# ip netns exec client ip addr add 10.0.0.1/24 dev veth-client
root@kind-control-plane:/# ip netns exec client ip link set lo up
root@kind-control-plane:/# ip netns exec client ip link set veth-client up
root@kind-control-plane:/# ip link set bridge up
root@kind-control-plane:/# ip link set veth-clientbr up
On the “server” host
root@kind-worker:/# ip netns add server
root@kind-worker:/# ip link add bridge type bridge
root@kind-worker:/# ip link add veth-server type veth peer name veth-serverbr
root@kind-worker:/# ip link set veth-server netns server
root@kind-worker:/# ip link set veth-serverbr master bridge
root@kind-worker:/# ip addr add 10.0.0.0/24 dev bridge
root@kind-worker:/# ip netns exec server ip addr add 10.0.0.2/24 dev veth-server
root@kind-worker:/# ip netns exec server ip link set lo up
root@kind-worker:/# ip netns exec server ip link set veth-server up
root@kind-worker:/# ip link set bridge up
root@kind-worker:/# ip link set veth-serverbr up

# run the server
root@kind-worker:/# ip netns exec server nohup python3 -m http.server 8080 &
[1] 1314
nohup: ignoring input and appending output to 'nohup.out'
Namespaces on different hosts. The host interfaces (eth0) are on the same network.
Now that everything is set up, let’s first tackle outbound traffic.
From our network namespaces to the physical network
First let’s see if we can reach eth0 on each host:
# on the client host
root@kind-control-plane:/# ip netns exec client ping 172.18.0.2
ping: connect: Network is unreachable
# on the server host
root@kind-worker:/# ip netns exec server ping 172.18.0.4
ping: connect: Network is unreachable
The host isn’t reachable from the namespaces yet: we haven’t configured an IP route5 to forward packets destined for eth0 on either host. Let’s set up a default route via the bridge in both namespaces and test:
# on client host
root@kind-control-plane:/# ip netns exec client ip route add default via 10.0.0.0
root@kind-control-plane:/# ip netns exec client ping 172.18.0.2 -c 2
PING 172.18.0.2 (172.18.0.2) 56(84) bytes of data.
64 bytes from 172.18.0.2: icmp_seq=1 ttl=64 time=0.076 ms
64 bytes from 172.18.0.2: icmp_seq=2 ttl=64 time=0.039 ms
--- 172.18.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1031ms
rtt min/avg/max/mdev = 0.039/0.057/0.076/0.018 ms
# on server host
root@kind-worker:/# ip netns exec server ip route add default via 10.0.0.0
root@kind-worker:/# ip netns exec server ping 172.18.0.4 -c 2
PING 172.18.0.4 (172.18.0.4) 56(84) bytes of data.
64 bytes from 172.18.0.4: icmp_seq=1 ttl=64 time=0.036 ms
64 bytes from 172.18.0.4: icmp_seq=2 ttl=64 time=0.035 ms
--- 172.18.0.4 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1031ms
rtt min/avg/max/mdev = 0.035/0.035/0.036/0.000 ms
Great, we can now reach our host interfaces. By extension, we can also reach any destination reachable from eth0:
# on client host
root@kind-control-plane:/# ip netns exec client curl https://google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
# on server host
root@kind-worker:/# ip netns exec server curl https://google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="https://www.google.com/">here</A>.
</BODY></HTML>
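If this step does not work in your own environment, one host-level setting worth checking (an aside, not part of the original walkthrough) is whether the kernel is allowed to forward packets between interfaces at all; kind nodes normally have this enabled already:

# verify that the host will forward packets between the bridge and eth0
root@kind-control-plane:/# sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 1

# if it prints 0, enable forwarding
root@kind-control-plane:/# sysctl -w net.ipv4.ip_forward=1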
This flow looks similar to the following when viewed from the client namespace (Google’s infrastructure has been vastly simplified):
Next up, let’s try to communicate with our server from the client namespace.
From the physical network to our network namespaces
If we try to reach server from client we can see that it doesn’t work:
root@kind-control-plane:/# ip netns exec client curl -m 2 10.0.0.2:8080
curl: (28) Connection timed out after 2001 milliseconds
Let’s dig in with tcpdump.
Open a terminal window and, since we aren’t sure what path the packets are flowing through, run tcpdump -nn -e -l -i any on host 172.18.0.2. Friendly warning: the output will be very verbose because tcpdump will listen on all interfaces.
On the same host 172.18.0.2, try to curl the server from the client namespace again with ip netns exec client curl -m 2 10.0.0.2:8080. After it times out again, stop tcpdump by pressing Ctrl+C and review the output. Search for 10.0.0.2, our destination address. You should spot some lines like the following:
15:05:35.754605 bridge Out ifindex 5 a6:93:c7:0c:96:b2 ethertype ARP (0x0806), length 48: Request who-has 10.0.0.2 tell 10.0.0.0, length 28
15:05:35.754608 veth-clientbr Out ifindex 6 a6:93:c7:0c:96:b2 ethertype ARP (0x0806), length 48: Request who-has 10.0.0.2 tell 10.0.0.0, length 28
You may see several of these requests with no corresponding reply6.
These are ARP requests, and the reason they’re being fired off is that there is no IP (layer 3) route between the client and server namespaces. It is possible to manually configure ARP entries and implement “proxy ARP” to connect client and server at layer 2, but we are not doing that today. Kubernetes’ networking model is built on layer 3 and up, and so our solution must be too.
We will configure IP routing5 rules to route client traffic to server. Let’s first configure a manual route for 10.0.0.2 on the client host:
# on client host
root@kind-control-plane:/# ip route add 10.0.0.2 via 172.18.0.4
# validate
root@kind-control-plane:/# curl 10.0.0.2:8080
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
...
</ul>
<hr>
</body>
</html>
As you can see, curl-ing our server API in the server namespace from the client host now works7.
Let’s try curl-ing the server from the client namespace again:
root@kind-control-plane:/# ip netns exec client curl -m 2 10.0.0.2:8080
curl: (28) Connection timed out after 2001 milliseconds
Another dump with tcpdump reveals the same unanswered ARP requests as before. Why are there no responses to these, considering we’ve successfully established a connection from the client host to the server namespace? One reason is that the connection was made at layer 3 (an IP route), while ARP is a layer 2 protocol, and as per the OSI model’s semantics, lower-level protocols cannot depend on higher-level ones. Another reason is that ARP messages only reach devices directly connected to our network interface, in this case eth0: the host’s ARP table does not contain an entry for 10.0.0.2 even though its IP routing table now does.
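You can see the mismatch for yourself by comparing the host’s neighbour (ARP) cache with its routing table; a quick check along these lines (the exact entries will differ on your machine):

# the ARP cache only knows about neighbours directly attached to eth0; no 10.0.0.2 here
root@kind-control-plane:/# ip neigh show
# ...while the routing table we just edited does know how to reach 10.0.0.2
root@kind-control-plane:/# ip route list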
The layer 3 solution for us is simple: establish another IP route for 10.0.0.2 inside the client namespace8:
root@kind-control-plane:/# ip netns exec client ip route add 10.0.0.2 via 10.0.0.0
You can now verify that calling server from client works:
root@kind-control-plane:/# ip netns exec client curl -m 2 10.0.0.2:8080
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /</title>
</head>
<body>
<h1>Directory listing for /</h1>
<hr>
<ul>
...
</ul>
<hr>
</body>
</html>
Congratulations 🎉 🎉 - we have just manually created two Pods (net namespaces) on different hosts, with one “container” (curl) in one Pod invoking an API in a container in the other Pod, without NAT.
A process inside a client namespace connecting to an open socket in a server namespace on another host. The client process does not perform any NAT.
How Kubernetes creates Pods
We now know how pods are implemented under the hood. We have learned that Kubernetes “pods” are namespaces and that Kubernetes “containers” are processes running within those namespaces. These pods are connected to each other within each host with virtual networking devices (veth, bridge), and simple IP routing rules let traffic cross from one pod to another over the physical network.
Where and how does Kubernetes do all this?
The Container Runtime Interface (CRI)
Back in the Concepts section we said the kubelet uses the Container Runtime Interface to create the pod “sandboxes”.
The kubelet creates pod sandboxes here. Note that runtimeService is of type RuntimeService, belonging to the CRI API. It embeds the PodSandboxManager type, which is responsible for actually creating the sandboxes (the RunPodSandbox method). Kubernetes has an internal implementation of RuntimeService in remoteRuntimeService, but this is just a thin wrapper around the CRI API’s RuntimeServiceClient (GitHub won’t automatically open the file due to its size). Look closely and you’ll notice that RuntimeServiceClient is implemented by runtimeServiceClient, which uses a gRPC connection to invoke the container runtime service. gRPC is (normally) transported over TCP sockets (layer 4).
But the kubelet runs on each node and only ever creates pods on that same node, so why would it need to communicate with the CRI service over TCP?
Go, the lingua franca of cloud-native development (including Kubernetes), has a builtin plugin system, but it has some serious drawbacks in terms of maintainability. Eli Bendersky gives a good outline of how Go plugins work, with their pros and cons, here; it is worth a read. Towards the end of the article you’ll notice a bias towards RPC-based plugins; this is exactly the architecture the CRI’s designers chose. So although the kubelet and the CRI service run on the same node, the gRPC messages can be transported locally via localhost (for TCP), Unix domain sockets, or some other channel available on the host.
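You can poke at this yourself on a kind node: containerd exposes its CRI endpoint on a Unix socket, and a generic CRI client such as crictl speaks the same gRPC API over it. A sketch, assuming crictl is present on the node and containerd uses its default socket path:

# talk to the CRI service over containerd's default Unix socket
root@kind-control-plane:/# crictl --runtime-endpoint unix:///run/containerd/containerd.sock version

# list the pod sandboxes the CRI implementation has created on this node
root@kind-control-plane:/# crictl --runtime-endpoint unix:///run/containerd/containerd.sock pods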
So we now have Kubernetes invoking the standard CRI API that in turn invokes a “remote”, CRI-compliant gRPC service. This service is the CRI implementation that can be swapped out. Kubernetes’ docs list a few common ones:
The details of what happens next vary by implementation and are all abstracted away from the Kubernetes runtime. Take containerd as an example (it’s the CRI implementation used in kind, the K8s distribution I chose for the walkthrough above). containerd has a plugin architecture that is resolved at compile time9. containerd’s implementation of RuntimeServiceServer (see Concepts) has its RunPodSandbox method (also see Concepts) rely on a “CNI” plugin to set up the pod’s network namespace.
What is the CNI?
The Container Network Interface (CNI)
The CNI is used by the CRI to create and configure the network namespaces used by the pods10. CNI implementations are invoked by executing their respective binaries and providing network configuration via stdin
(see the spec’s execution protocol)11. On unix hosts, containerd
by default looks for a standard CNI config file inside the /etc/cni/net.d
directory and for the plugin binaries it looks in /opt/cni/bin
(see code). Each node in my kind
cluster has only one config file: /etc/cni/net.d/10-kindnet.conflist
. Here are the contents of this file in my control-plane
node:
{
  "cniVersion": "0.3.1",
  "name": "kindnet",
  "plugins": [
    {
      "type": "ptp",
      "ipMasq": false,
      "ipam": {
        "type": "host-local",
        "dataDir": "/run/cni-ipam-state",
        "routes": [
          {
            "dst": "0.0.0.0/0"
          }
        ],
        "ranges": [
          [
            {
              "subnet": "10.244.0.0/24"
            }
          ]
        ]
      },
      "mtu": 1500
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
The same config file on the worker nodes has identical content except for subnet, which varies from host to host. I won’t go in depth about how the CNI spec and plugins work (that deserves its own article); you can read version 0.3.1 of the spec here. What’s conceptually important for us is that three plugins are executed (two of them chained) with this configuration. These plugins are:
- ptp: creates a point-to-point link between a container and the host by using a veth device.
- host-local: allocates IPv4 and IPv6 addresses out of a specified address range.
- portmap: will forward traffic from one or more ports on the host to the container.
Do any of these concepts sound familiar to you? They should!12 These are the things we painstakingly configured step-by-step in our walkthrough above. With this information in mind, go back to the component diagram in Concepts and map each of these concepts to the boxes in the diagram.
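To make the execution protocol concrete, here is a hedged sketch of invoking a plugin by hand, roughly the way containerd would: the binary is executed with CNI_* environment variables describing the sandbox and a single-plugin JSON config on stdin. The namespace name, container ID, and subnet below are made up for illustration.

# create a throwaway namespace for the plugin to configure
root@kind-control-plane:/# ip netns add cni-demo

# run the ptp plugin directly; it delegates IP allocation to host-local via CNI_PATH
root@kind-control-plane:/# CNI_COMMAND=ADD \
    CNI_CONTAINERID=demo123 \
    CNI_NETNS=/var/run/netns/cni-demo \
    CNI_IFNAME=eth0 \
    CNI_PATH=/opt/cni/bin \
    /opt/cni/bin/ptp <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "cni-demo-net",
  "type": "ptp",
  "ipam": { "type": "host-local", "subnet": "10.99.0.0/24" }
}
EOF
# on success the plugin prints a JSON result describing the interfaces it created and the assigned IP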
Services
No discussion of Kubernetes’ cluster network can conclude without mentioning Services.
Conceptually, a Kubernetes Service is merely a virtual IP assigned to a set of pods, to which a stable DNS name is also assigned. Kubernetes also provides simple load balancing out of the box for some types of services (ClusterIP, NodePort).
Each service is mapped to the set of IPs belonging to the pods exposed by the service. This set of IPs is called an EndpointSlice and is constantly updated to reflect the IPs currently in use by the backend pods13. Which pods? The ones matching the service’s selector.
Example Service selecting pods whose label ‘myLabel’ is set to ‘MyApp’
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    myLabel: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
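As an aside (not in the original), once such a Service exists and has ready pods behind it, you can list the EndpointSlice that Kubernetes generates for it; EndpointSlices carry a label pointing back at their owning Service:

# each slice lists the ready pod IPs and ports backing my-service
$ kubectl get endpointslices -l kubernetes.io/service-name=my-service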
When a user creates a new Service:

1. kube-apiserver assigns it the next free IP by incrementing a counter stored in etcd14.
2. kube-apiserver stores the service in etcd15.
3. This event is pushed to all watches16.
4. coreDNS: the event is caught17 and the service_name -> cluster_ip mapping is cached so that DNS queries for the service can be answered18.
5. EndpointSlice Controller: the event is caught and a new EndpointSlice is assigned to the service19.
6. kube-proxy: the event is caught and iptables is configured on worker nodes20.
All steps from 4 onwards are executed concurrently by independent processes. The final state is depicted in the diagram in the Concepts section.
Note that we have incidentally glossed over Kubernetes’ distributed and event-driven architecture. We’ll expand on that topic in a future article.
We snuck in a new concept in step 6: iptables. Let’s expand on that next.
iptables
Iptables is used to set up, maintain, and inspect the tables of IP packet filter rules in the Linux kernel. Several different tables may be defined. Each table contains a number of built-in chains and may also contain user-defined chains.
Each chain is a list of rules which can match a set of packets. Each rule specifies what to do with a packet that matches. This is called a `target’, which may be a jump to a user-defined chain in the same table.
– iptables manpage
System and network administrators use iptables to configure IP routing rules on Linux hosts, and so does kube-proxy21. On Windows hosts, kube-proxy uses an analogous API called the Host Compute Network service API, internally represented by the HostNetworkService interface. It is because of this difference in OS-dependent implementations of the network stack that we simply labelled them as “OS IP rules” in the Concepts section’s diagram.
kube-proxy uses iptables to configure Linux hosts to distribute traffic directed at a Service’s clusterIP (i.e. a virtual IP) to the backend pods selected by the service, using NAT. So yes, there is definitely network address translation in a Kubernetes cluster, but it’s hidden from your workloads.
kube-proxy adds a rule to the PREROUTING chain that targets a custom chain called KUBE-SERVICES22. The end result looks like this:
root@kind-control-plane:/# iptables -t nat -L PREROUTING -n -v
Chain PREROUTING (policy ACCEPT 18999 packets, 3902K bytes)
pkts bytes target prot opt in out source destination
18955 3898K KUBE-SERVICES all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
Initially the KUBE-SERVICES chain contains rules just for the NodePort custom chain and several built-in services:
root@kind-control-plane:/# iptables -t nat -L KUBE-SERVICES -n -v
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SVC-TCOU7JCQXEZGVUNU udp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
0 0 KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
0 0 KUBE-SVC-JD5MR3NA4I4DYORP tcp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
0 0 KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- * * 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
417 25020 KUBE-NODEPORTS all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
New rules are appended for each service by the Proxier’s syncProxyRules method and are written here. For example, the following shows a rule targeting a custom chain KUBE-SVC-BM6F4AVTDKG47F3K for a service named mysvc:
root@kind-control-plane:/# iptables -t nat -L KUBE-SERVICES -n -v
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SVC-BM6F4AVTDKG47F3K tcp -- * * 0.0.0.0/0 10.96.62.22 /* default/mysvc cluster IP */ tcp dpt:8080
0 0 KUBE-SVC-TCOU7JCQXEZGVUNU udp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
0 0 KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
0 0 KUBE-SVC-JD5MR3NA4I4DYORP tcp -- * * 0.0.0.0/0 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
0 0 KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- * * 0.0.0.0/0 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:443
417 25020 KUBE-NODEPORTS all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
If we inspect KUBE-SVC-BM6F4AVTDKG47F3K we see something interesting:
root@kind-control-plane:/# iptables -t nat -L KUBE-SVC-BM6F4AVTDKG47F3K -n -v
Chain KUBE-SVC-BM6F4AVTDKG47F3K (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ tcp -- * * !10.244.0.0/16 10.96.62.22 /* default/mysvc cluster IP */ tcp dpt:8080
0 0 KUBE-SEP-CMSFOBEB7HHZOTBZ all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/mysvc -> 10.244.1.2:8080 */ statistic mode random probability 0.33333333349
0 0 KUBE-SEP-VVWLMARALSB3FCZF all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/mysvc -> 10.244.2.2:8080 */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-XGAC3VXZG7B73WCD all -- * * 0.0.0.0/0 0.0.0.0/0 /* default/mysvc -> 10.244.2.3:8080 */
Ignoring the masquerade rule for now, we see three rules targeting chains for service endpoints. kube-proxy adds these entries as it handles incoming events for EndpointSlices (see NewProxier()). Each rule has a helpful comment indicating the target service endpoint.
Note how these rules have a probability assigned to them. Rules in iptables chains are processed sequentially. In this example there are three service endpoint rules, and the first is assigned a probability of 0.33. If that dice roll fails, we roll again for the second rule, this time with a probability of 0.5. If that also fails, we fall back to the third rule, which matches unconditionally. In this way we get an even distribution of traffic amongst the three endpoints. The probabilities are set here. Note how the probability curve is fixed as a flat distribution, and also note that kube-proxy is not balancing this traffic itself; as noted in Concepts, kube-proxy is not in the data plane.
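To convince yourself that this cascade of probabilities really does produce a uniform split, here is a tiny simulation (purely illustrative; it has nothing to do with how kube-proxy or iptables are implemented):

# push 300k simulated "packets" through the 1/3 -> 1/2 -> 1 cascade; each bucket ends up with roughly 100k
root@kind-control-plane:/# awk 'BEGIN {
  srand();
  for (i = 0; i < 300000; i++) {
    if      (rand() < 1.0/3) a++;   # first rule matches with probability 1/3
    else if (rand() < 1.0/2) b++;   # second rule: half of the remaining 2/3
    else                     c++;   # third rule: everything left over
  }
  printf "endpoint1=%d endpoint2=%d endpoint3=%d\n", a, b, c
}'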
In our example above, mysvc is selecting three pods with endpoints 10.244.1.2:8080, 10.244.2.2:8080, and 10.244.2.3:8080.
This is the service definition:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: test
  name: mysvc
  namespace: default
spec:
  type: ClusterIP
  ports:
    - port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    app: test
And these are the IPs assigned to the selected pods (take note of the nodes as well):
$ k get po -l app=test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-75d6d47c7f-jcdzz 1/1 Running 0 4d7h 10.244.2.2 kind-worker2 <none> <none>
test-75d6d47c7f-lgqcq 1/1 Running 0 4d7h 10.244.1.2 kind-worker <none> <none>
test-75d6d47c7f-pjrjp 1/1 Running 0 4d7h 10.244.2.3 kind-worker2 <none> <none>
If we inspect one of the service endpoint chains we see something else interesting:
root@kind-control-plane:/# iptables -t nat -L KUBE-SEP-CMSFOBEB7HHZOTBZ -n -v
Chain KUBE-SEP-CMSFOBEB7HHZOTBZ (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- * * 10.244.1.2 0.0.0.0/0 /* default/mysvc */
0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 /* default/mysvc */ tcp to:10.244.1.2:8080
We see a DNAT (destination NAT) rule that translates the destination address to 10.244.1.2:8080. We already know that this destination is hosted on node kind-worker, so investigating on that node we see:
# list devices and their assigned IP ranges
root@kind-worker:/# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: veth4e573577@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 5a:b9:16:0d:a6:18 brd ff:ff:ff:ff:ff:ff link-netns cni-b5e04919-09af-0a9f-6945-a9929d71d789
inet 10.244.1.1/32 scope global veth4e573577 <------ 10.244.1.2 IS IN THIS RANGE
valid_lft forever preferred_lft forever
13: eth0@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fc00:f853:ccd:e793::3/64 scope global nodad
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:3/64 scope link
valid_lft forever preferred_lft forever
# show device
root@kind-worker:/# ip link list veth4e573577
2: veth4e573577@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 5a:b9:16:0d:a6:18 brd ff:ff:ff:ff:ff:ff link-netns cni-b5e04919-09af-0a9f-6945-a9929d71d789 <------ NETWORK NAMESPACE
# list network namespaces
root@kind-worker:/# ip netns list
cni-b5e04919-09af-0a9f-6945-a9929d71d789
# list all processes running in the target namespace
root@kind-worker:/# ps $(ip netns pids cni-b5e04919-09af-0a9f-6945-a9929d71d789)
PID TTY STAT TIME COMMAND
505179 ? Ss 0:00 /pause
505237 ? Ss 0:00 nginx: master process nginx -g daemon off;
505278 ? S 0:00 nginx: worker process
505279 ? S 0:00 nginx: worker process
505280 ? S 0:00 nginx: worker process
505281 ? S 0:00 nginx: worker process
505282 ? S 0:00 nginx: worker process
505283 ? S 0:00 nginx: worker process
505284 ? S 0:00 nginx: worker process
505285 ? S 0:00 nginx: worker process
505286 ? S 0:00 nginx: worker process
505287 ? S 0:00 nginx: worker process
505288 ? S 0:00 nginx: worker process
505289 ? S 0:00 nginx: worker process
505290 ? S 0:00 nginx: worker process
505291 ? S 0:00 nginx: worker process
505292 ? S 0:00 nginx: worker process
505293 ? S 0:00 nginx: worker process
We are back in net namespace land!
In our case, we are running nginx on a simple deployment:
Spec
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - image: nginx
          name: nginx
Tying it all together
Kubernetes is an event-driven, distributed platform that automates the deployment and networking aspects of your workloads. kube-apiserver is the platform’s “event hub”.
Blue arrows show where configuration data for Deployments flow. Red arrows show where configuration data for Services flow. Note that this is just a subset of all the machinery activated when a user creates either of these two resources.
kubelet runs on each node and listens for events from kube-apiserver about pods added to the node it’s running on. When a pod is created, be it via a controller or just as an orphaned pod, kubelet uses the Container Runtime Interface (CRI) to create the pod’s sandbox. The CRI in turn uses the Container Network Interface (CNI) to configure the pod’s network namespace on the node. The pod gets an IP that is reachable by any other pod on any other node.
When a ClusterIP Service is created, kube-apiserver assigns a free virtual IP to it and persists the Service object to etcd. The event is caught by coreDNS, which proceeds to cache the service_name -> cluster_ip mapping and respond to DNS requests accordingly. The event is also caught by the EndpointSlice controller, which then creates an EndpointSlice with the IPs of the selected Pods, attaches it to the Service, and saves the update to etcd.
kube-proxy runs on each node and listens for events from kube-apiserver about added Services and EndpointSlices; it configures the local node’s IP routing rules so that the Service’s virtual IP points to the backend Pods with an even distribution.
At runtime, a client container queries coreDNS for the Service’s address and directs its request to the Service’s virtual IP. The local routing rules (iptables on Linux hosts, the Host Compute Network service API on Windows) randomly select one of the backend Pod IP addresses and forward traffic to that Pod.
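If you want to watch that end-to-end path yourself, here is a hedged example using the mysvc Service from earlier (assuming it exists in the default namespace and that a busybox image is acceptable in your cluster):

# resolve the Service name via coreDNS, then call the virtual IP; kube-proxy's rules pick a backend pod
$ kubectl run tmp --rm -it --restart=Never --image=busybox -- \
    sh -c 'nslookup mysvc.default.svc.cluster.local && wget -qO- http://mysvc.default:8080'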
Footnotes
You can use the NetworkPolicy resource (+ a suitable CNI plugin) to block traffic to/from Pods. ↩
Despite the CNI featuring prominently in K8S docs, Kubernetes does not actually interface with the CNI directly as others have pointed out here. Kubernetes’ source code does not depend on the CNI API. ↩
Note that kube-proxy is itself not actually in the request path (data plane). ↩
Don’t worry too much: the changes done so far are not persistent across system restarts. ↩
Wikipedia has a very nice description of the IP routing algorithm here. ↩ ↩2
A reply would look like this:
14:47:51.365200 bridge In ifindex 5 06:82:91:69:f0:36 ethertype ARP (0x0806), length 48: Reply 10.0.0.1 is-at 06:82:91:69:f0:36, length 28 ↩
If you capture another dump with tcpdump you’ll notice an absence of ARP requests for 10.0.0.2. This is because the route forwards the traffic to 172.18.0.4, and the MAC address for the latter is already cached in the host’s ARP table. ↩
In reality, Kubernetes does this in a more efficient way by configuring IP routes for IP ranges (segments) instead of specific addresses. You can verify IP routes on a host with ip route list. In my case, I could see that Kubernetes has routed 10.244.1.0/24 via 172.18.0.4 (our “server” host) and 10.244.2.0/24 via 172.18.0.3 (a third node not relevant to our discussion). ↩
As described by Eli’s article, and the opposite of the kubelet -> CRI integration. containerd’s CRI service is a plugin that is registered here. ↩
At the moment the CNI’s scope is limited to network-related configurations during creation and deletion of a pod. The README notes that future extensions could be possible to enable dynamic scenarios such as NetworkPolicies (cilium already supports network policies). ↩
Yet another way to implement a plugin architecture. ↩
Assuming I’ve done a decent job in this article :). ↩
Update is done by the EndpointSlice Controller. We’ll talk about this and other controllers in a future article. ↩
Breadcrumbs: (Service REST storage -> allocator -> Range allocator -> etcd storage) ↩
See (Store.Create). ↩
We will cover watches in more detail in a future article. ↩
Breadcrumbs: InitKubeCache -> dnsController.Run -> controller.Run -> Reflector.Run -> Reflector.ListAndWatch -> watchHandler. ↩
Breadcrumbs: ServeDNS -> A() -> checkForApex -> Services() -> Records() -> findServices -> SvcIndex -> ByIndex (client-go). ↩
See Controller.syncService. ↩
Breadcrumbs: ProxyServer.Run -> NewServiceConfig -> ServiceConfig.handleAddService -> Proxier.OnServiceAdd -> Proxier.OnServiceUpdate -> Proxier.Sync -> Proxier.syncProxyRules. ↩
iptables is the default. There is a newer alternative using IPVS that one can use by setting the proxy-mode appropriately (see proxy-mode in options for kube-proxy). There used to be an older third mode called userspace, but support for that was removed. ↩
Breadcrumbs: kubeServicesChain -> iptablesJumpChains -> syncProxyRules. ↩