How Kubernetes Routes IP Packets to Services’ Cluster IPs

I recently observed DNS resolution errors on a large Kubernetes (K8s) cluster. The errors occurred on only about 0.1% of the cluster's nodes, but the fact that they weren't self-healing and crippled tenant workloads, combined with my penchant for chasing rabbits down holes, meant I wasn't going to let it go. I emerged having learned how the Cluster IP feature of K8s Services actually works. Explaining that feature, my particular problem, and my speculative fix is the goal of this post.

The Problem

The large K8s cluster is actually a Google Kubernetes Engine (GKE) cluster with master version 1.17.14-gke.400 and node version 1.17.13-gke.2600. This is a multi-tenant cluster with hundreds of nodes. Each node runs dozens of user workloads. Some users reported that DNS resolution within their Pods wasn't working on certain nodes. I was able to reproduce this behavior with the following steps.

Kubernetes runs kube-dns Pods and a Service on the cluster to provide DNS, and it configures kubelets to tell individual containers to use the DNS Service's Cluster IP to resolve DNS names. See the K8s docs here. First, I get the kube-dns Service's Cluster IP. This is the IP address to which DNS queries from Pods are sent.

kubectl --context my-gke-cluster -n kube-system get services kube-dns
NAME       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.178.64.10   <none>        53/UDP,53/TCP   666d
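For context, the kubelet writes this same IP into each container's /etc/resolv.conf, which is why Pods send their DNS queries there. A quick way to see it (output illustrative, from a hypothetical Pod in the default namespace):

# The kubelet points containers at the kube-dns Cluster IP
root@some-pod:/# cat /etc/resolv.conf
nameserver 10.178.64.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5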

Then I make DNS queries against the Cluster IP from a Pod running on a broken node.

# Log into the GKE node
gcloud --project my-project compute ssh my-gke-node --zone us-central1-b --internal-ip

# Need to run toolbox container which has iptables command. Google's Container-Optimized OS doesn't
# have it.
dxia@my-gke-node ~ $ toolbox
20200603-00: Pulling from google-containers/toolbox
Digest: sha256:36e2f6b8aa40328453aed7917860a8dee746c101dfde4464ce173ed402c1ec57
Status: Image is up to date for gcr.io/google-containers/toolbox:20200603-00
gcr.io/google-containers/toolbox:20200603-00
e6b1ee70f91ac405623cbf1d2afa9a532a022dc644bddddd754d2cd786f58273

dxia-gcr.io_google-containers_toolbox-20200603-00
Please do not use --share-system anymore, use $SYSTEMD_NSPAWN_SHARE_* instead.
Spawning container dxia-gcr.io_google-containers_toolbox-20200603-00 on /var/lib/toolbox/dxia-gcr.io_google-containers_toolbox-20200603-00.
Press ^] three times within 1s to kill container.

# Install dig
root@toolbox:~# apt-get update && apt-get install dnsutils

# Ask the kube-dns Cluster IP to resolve www.google.com
# dig will hang when it's waiting on a DNS reply. So ^C's show DNS resolution failures
root@toolbox:~# for x in $(seq 1 20); do echo ${x}; dig @10.178.64.10 www.google.com > /dev/null; done
1
^C2
^C3
4
5
6
7
8
^C9
10
11
12
13
14
15
^C16
17
18
^C19
20

I cordoned and drained the node and added the annotation cluster-autoscaler.kubernetes.io/scale-down-disabled=true to prevent the cluster autoscaler from deleting it.
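For reference, these are roughly the commands involved (a sketch; the drain flags match kubectl for this K8s version, and the node name is illustrative):

# Keep the node around for debugging without letting new workloads land on it
kubectl --context my-gke-cluster cordon my-gke-node
kubectl --context my-gke-cluster drain my-gke-node --ignore-daemonsets --delete-local-data
kubectl --context my-gke-cluster annotate node my-gke-node cluster-autoscaler.kubernetes.io/scale-down-disabled=true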

Then I performed a more basic test: whether I could even make a TCP connection to the Cluster IP on port 53 (the default DNS port).

# Run nc 1000 times without reverse DNS lookup, in verbose and scan mode
# Count only failed connections
root@toolbox:~# for x in $(seq 1 1000); do nc 10.178.64.10 53 -nvz 2>&1 | grep -v open; done | wc -l
257

About a quarter of the TCP connections failed. This means the error is below the DNS (application) layer, at the TCP layer or lower.

Finding the Root Cause: Down the Rabbit Hole

Some background for those unfamiliar: K8s nodes (via the kube-proxy DaemonSet) route IP packets originating from a Pod and destined for a K8s Service's Cluster IP to a backing Pod IP, using one of three proxy modes: user space, iptables, or IPVS. I'm assuming GKE runs kube-proxy in iptables proxy mode since their docs here mention iptables rather than IPVS.
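Rather than assuming, you can also ask kube-proxy directly: it reports its active mode over HTTP on its metrics port (a sketch, assuming the default port 10249 hasn't been changed):

dxia@my-gke-node ~ $ curl -s http://localhost:10249/proxyMode
iptables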

kube-proxy should keep the node's iptables rules up to date with the kube-dns Service's actual endpoints. The following console output shows how I figured out the IP packet flow by tracing the matching iptables rules.

# List rules in FORWARD chain's filter table
root@toolbox:~# iptables -L FORWARD -t filter
Chain FORWARD (policy DROP)
target     prot opt source               destination
cali-FORWARD  all  --  anywhere             anywhere             /* cali:wUHhoiAYhphO9Mso */
KUBE-FORWARD  all  --  anywhere             anywhere             /* kubernetes forwarding rules */
KUBE-SERVICES  all  --  anywhere             anywhere             ctstate NEW /* kubernetes service portals */
DOCKER-USER  all  --  anywhere             anywhere
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere
ACCEPT     udp  --  anywhere             anywhere
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     sctp --  anywhere             anywhere

# List rules in KUBE-SERVICES chain's nat table and look for rules that forward IP packets destined
# for the K8s Service kube-system/kube-dns' Cluster IP
root@toolbox:~# iptables -L KUBE-SERVICES -t nat | grep kube-system/kube-dns | grep SVC
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere             10.178.64.10         /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere             10.178.64.10         /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain

# List rules in KUBE-SVC-ERIFXISQEP7F7OF4 chain's nat table
root@toolbox:~# iptables -L KUBE-SVC-ERIFXISQEP7F7OF4 -t nat
Chain KUBE-SVC-ERIFXISQEP7F7OF4 (1 references)
target     prot opt source               destination
KUBE-SEP-BMNCBK7ROA3MA6UU  all  --  anywhere             anywhere             statistic mode random probability 0.01538461540
KUBE-SEP-GYUBQUCI6VR6AER2  all  --  anywhere             anywhere             statistic mode random probability 0.01562500000
KUBE-SEP-IF56RUVXN2P4ORZZ  all  --  anywhere             anywhere             statistic mode random probability 0.01587301586
KUBE-SEP-WUD7OE7TYMWFJJYX  all  --  anywhere             anywhere             statistic mode random probability 0.01612903224
KUBE-SEP-B7IYZJB6QVUX246S  all  --  anywhere             anywhere             statistic mode random probability 0.01639344264
KUBE-SEP-T6B7SPNOX3DH33BE  all  --  anywhere             anywhere             statistic mode random probability 0.01666666660
KUBE-SEP-REJSUT2VC76HMIRQ  all  --  anywhere             anywhere             statistic mode random probability 0.01694915257
KUBE-SEP-B4N4VXNUSBNXHV73  all  --  anywhere             anywhere             statistic mode random probability 0.01724137925
KUBE-SEP-XUJIW6IGZX4X5BBG  all  --  anywhere             anywhere             statistic mode random probability 0.01754385978
KUBE-SEP-MMBQBWR6AYIPMUZL  all  --  anywhere             anywhere             statistic mode random probability 0.01785714272
KUBE-SEP-6O5U6FAKQVEXGTP7  all  --  anywhere             anywhere             statistic mode random probability 0.01818181807
KUBE-SEP-DMN3RJWMPAEHNOGE  all  --  anywhere             anywhere             statistic mode random probability 0.01851851866
KUBE-SEP-FHJKZIH3JDZSXJUD  all  --  anywhere             anywhere             statistic mode random probability 0.01886792434
KUBE-SEP-YRPM7BEQS2YESSJL  all  --  anywhere             anywhere             statistic mode random probability 0.01923076902
KUBE-SEP-BSHQZGGNYIILL3V7  all  --  anywhere             anywhere             statistic mode random probability 0.01960784290
KUBE-SEP-XTW5FCAH2423EWAV  all  --  anywhere             anywhere             statistic mode random probability 0.02000000002
KUBE-SEP-2ETTGYCM3KLKL54Q  all  --  anywhere             anywhere             statistic mode random probability 0.02040816331
KUBE-SEP-ZUFFQWVT2EY73YVF  all  --  anywhere             anywhere             statistic mode random probability 0.02083333349
KUBE-SEP-VUNSBD5OILT2BGUX  all  --  anywhere             anywhere             statistic mode random probability 0.02127659554
KUBE-SEP-3XVS5OF4SBBHATZW  all  --  anywhere             anywhere             statistic mode random probability 0.02173913037
KUBE-SEP-IRW2YX5BEMBR3OGF  all  --  anywhere             anywhere             statistic mode random probability 0.02222222229
KUBE-SEP-6J6T3TOCBEQ5NUQ5  all  --  anywhere             anywhere             statistic mode random probability 0.02272727247
KUBE-SEP-E3FOMPW5DQK5FDIA  all  --  anywhere             anywhere             statistic mode random probability 0.02325581387
KUBE-SEP-EO4O2TBNDPU377YQ  all  --  anywhere             anywhere             statistic mode random probability 0.02380952379
KUBE-SEP-ZGRZOBXXZ2KPGNZD  all  --  anywhere             anywhere             statistic mode random probability 0.02439024393
KUBE-SEP-XLRCUOCE6XAL3TYE  all  --  anywhere             anywhere             statistic mode random probability 0.02499999991
KUBE-SEP-477YCBVB2RZ4WKUD  all  --  anywhere             anywhere             statistic mode random probability 0.02564102551
KUBE-SEP-FGVS22Q3OCM6S5VS  all  --  anywhere             anywhere             statistic mode random probability 0.02631578967
KUBE-SEP-FBHD55TKQKCEKSUO  all  --  anywhere             anywhere             statistic mode random probability 0.02702702722
KUBE-SEP-ULRGL5A7XKWV3HB6  all  --  anywhere             anywhere             statistic mode random probability 0.02777777798
KUBE-SEP-HO6T2NOJNNMVWDPW  all  --  anywhere             anywhere             statistic mode random probability 0.02857142873
KUBE-SEP-PV23DIU55F5LDJIX  all  --  anywhere             anywhere             statistic mode random probability 0.02941176482
KUBE-SEP-6PL2LOTBN64MN2IF  all  --  anywhere             anywhere             statistic mode random probability 0.03030303027
KUBE-SEP-3G3LTNLLVZWE57GZ  all  --  anywhere             anywhere             statistic mode random probability 0.03125000000
KUBE-SEP-SNHFF6VK2KP44I7Q  all  --  anywhere             anywhere             statistic mode random probability 0.03225806449
KUBE-SEP-KNOCRXE7JOQ4FBTI  all  --  anywhere             anywhere             statistic mode random probability 0.03333333321
KUBE-SEP-M5NXUS47V77SM3HZ  all  --  anywhere             anywhere             statistic mode random probability 0.03448275849
KUBE-SEP-VEMFKB2E3QRFFRSG  all  --  anywhere             anywhere             statistic mode random probability 0.03571428591
KUBE-SEP-RRYDQV524YXA4GDR  all  --  anywhere             anywhere             statistic mode random probability 0.03703703685
KUBE-SEP-G65AAYF5LWFW4YBM  all  --  anywhere             anywhere             statistic mode random probability 0.03846153850
KUBE-SEP-K4HN6ANXSPKA7JGZ  all  --  anywhere             anywhere             statistic mode random probability 0.04000000004
KUBE-SEP-72YXYSKWHCML6KJJ  all  --  anywhere             anywhere             statistic mode random probability 0.04166666651
KUBE-SEP-YCD5TFDQM4ELQ5WX  all  --  anywhere             anywhere             statistic mode random probability 0.04347826075
KUBE-SEP-U7N4W7N5OKDP5PNC  all  --  anywhere             anywhere             statistic mode random probability 0.04545454541
KUBE-SEP-ACPRKJJSJ73NAQNV  all  --  anywhere             anywhere             statistic mode random probability 0.04761904757
KUBE-SEP-HPAV4MFMKCM43BC2  all  --  anywhere             anywhere             statistic mode random probability 0.04999999981
KUBE-SEP-VXO5CPBPAES2GS3A  all  --  anywhere             anywhere             statistic mode random probability 0.05263157887
KUBE-SEP-LJ3HM5QDYEB4ICUB  all  --  anywhere             anywhere             statistic mode random probability 0.05555555550
KUBE-SEP-W6VORIPTN7FDPIMU  all  --  anywhere             anywhere             statistic mode random probability 0.05882352963
KUBE-SEP-A5SGQE4VKXUT2NEC  all  --  anywhere             anywhere             statistic mode random probability 0.06250000000
KUBE-SEP-4LCLRUWZUF2DDGKK  all  --  anywhere             anywhere             statistic mode random probability 0.06666666688
KUBE-SEP-K7NZ33CKVQDPMIET  all  --  anywhere             anywhere             statistic mode random probability 0.07142857136
KUBE-SEP-76ISGBIKEK2QPYDL  all  --  anywhere             anywhere             statistic mode random probability 0.07692307699
KUBE-SEP-3S5ELV7JJCII2KNO  all  --  anywhere             anywhere             statistic mode random probability 0.08333333349
KUBE-SEP-THLYLIADKU5Z5I32  all  --  anywhere             anywhere             statistic mode random probability 0.09090909082
KUBE-SEP-T7P5MBD5MAWH2XB5  all  --  anywhere             anywhere             statistic mode random probability 0.10000000009
KUBE-SEP-WQ6DVZHCVUTU5QJS  all  --  anywhere             anywhere             statistic mode random probability 0.11111111101
KUBE-SEP-5RVGOA4UDKOKKI7O  all  --  anywhere             anywhere             statistic mode random probability 0.12500000000
KUBE-SEP-VSXQV2AZ43RZQSL7  all  --  anywhere             anywhere             statistic mode random probability 0.14285714272
KUBE-SEP-RVDWX7YLRKCSUDII  all  --  anywhere             anywhere             statistic mode random probability 0.16666666651
KUBE-SEP-OECSAM56W6JQA562  all  --  anywhere             anywhere             statistic mode random probability 0.20000000019
KUBE-SEP-HY76TWODHVCVLG5Y  all  --  anywhere             anywhere             statistic mode random probability 0.25000000000
KUBE-SEP-3UNVKH34LEKZ2P5K  all  --  anywhere             anywhere             statistic mode random probability 0.33333333349
KUBE-SEP-TDCXKWGVKJJ22VHB  all  --  anywhere             anywhere             statistic mode random probability 0.50000000000
KUBE-SEP-Z7ZOTGJIY44EKMWW  all  --  anywhere             anywhere

# List the rules of two random chains above to see the DNAT'ed Pod IP
root@toolbox:~# iptables -L KUBE-SEP-RVDWX7YLRKCSUDII -t nat
Chain KUBE-SEP-RVDWX7YLRKCSUDII (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.179.94.16         anywhere
DNAT       tcp  --  anywhere             anywhere             tcp to:10.179.94.16:53

root@toolbox:~# iptables -L KUBE-SEP-6PL2LOTBN64MN2IF -t nat
Chain KUBE-SEP-6PL2LOTBN64MN2IF (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.179.45.66         anywhere
DNAT       tcp  --  anywhere             anywhere             tcp to:10.179.45.66:53

These final rules are the ones that actually replace the destination Cluster IP of 10.178.64.10 with a randomly chosen kube-dns Pod IP. The random selection is implemented by the rules in the KUBE-SVC-ERIFXISQEP7F7OF4 chain via statistic mode random probability p. Rules are matched top down. The first rule, with target KUBE-SEP-BMNCBK7ROA3MA6UU, has a probability of 0.01538461540 of being picked. The second rule, with target KUBE-SEP-GYUBQUCI6VR6AER2, has a probability of 0.01562500000, but that only applies when the first rule didn't match, so its overall probability is (1 - 0.01538461540) * 0.01562500000 ≈ 0.01538461540. Applying this calculation down the chain, every rule ends up with the same overall probability of 0.01538461540, or 1/n, of being selected, where n = 65 is the number of KUBE-SEP rules in this chain. This algorithm is a variation of reservoir sampling.
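To double-check that arithmetic, here's a small awk sketch (mine, not from the debugging session) that reproduces the per-rule probabilities kube-proxy generates for n = 65 backends as 1/(n - i + 1), and shows the overall probability of each rule being selected is always 1/n ≈ 0.0153846 (iptables rounds its printed values slightly differently):

root@toolbox:~# awk 'BEGIN {
  n = 65; remaining = 1.0
  for (i = 1; i <= n; i++) {
    p = 1.0 / (n - i + 1)        # the probability kube-proxy writes into rule i
    overall = remaining * p      # chance we reach rule i and it matches
    remaining -= overall
    if (i <= 3 || i == n) printf "rule %2d: rule p = %.11f, overall p = %.11f\n", i, p, overall
  }
}'
rule  1: rule p = 0.01538461538, overall p = 0.01538461538
rule  2: rule p = 0.01562500000, overall p = 0.01538461538
rule  3: rule p = 0.01587301587, overall p = 0.01538461538
rule 65: rule p = 1.00000000000, overall p = 0.01538461538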

Confirming the Root Cause

At this point I strongly suspected the iptables rules were stale and routing packets to kube-dns Pod IPs that no longer existed. To confirm this, I wanted to find a DNAT'ed IP that didn't correspond to any running kube-dns Pod. There were 65 rules in the KUBE-SVC-ERIFXISQEP7F7OF4 chain, but I expected 77 because that was the number of kube-dns endpoints.

kubectl --context my-gke-cluster -n kube-system get endpoints kube-dns -o json | jq -r .subsets[0].addresses | jq length
77

On nodes without DNS issues, I saw the correct number of rules.

root@healthy-gke-node:~# iptables -L KUBE-SVC-ERIFXISQEP7F7OF4 -t nat | wc -l
79 [two extra lines of headers]

I saw this Pod IP when inspecting a randomly chosen rule on my-gke-node.

root@toolbox:~# iptables -L KUBE-SEP-RVDWX7YLRKCSUDII -t nat
Chain KUBE-SEP-RVDWX7YLRKCSUDII (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  10.179.94.16         anywhere
DNAT       tcp  --  anywhere             anywhere             tcp to:10.179.94.16:53

No kube-dns Pod existed with this IP.

kubectl --context my-gke-cluster -n kube-system get pods --selector k8s-app=kube-dns -o wide | grep 10.179.94.16
[no output]
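Instead of spot-checking chains one at a time, a loop like this (my sketch; the sed pattern assumes the DNAT output format shown above) would list every DNAT'ed IP in the chain and flag the ones with no matching endpoint:

# Collect all DNAT target IPs from the KUBE-SEP chains...
root@toolbox:~# for chain in $(iptables -t nat -L KUBE-SVC-ERIFXISQEP7F7OF4 | awk '/^KUBE-SEP/ {print $1}'); do
  iptables -t nat -L "$chain" | awk '/DNAT/ {print $NF}' | sed 's/to://; s/:53$//'
done | sort > /tmp/rule-ips.txt

# ...and compare against the Service's current endpoint IPs (run wherever kubectl works)
kubectl --context my-gke-cluster -n kube-system get endpoints kube-dns \
  -o jsonpath='{.subsets[0].addresses[*].ip}' | tr ' ' '\n' | sort > /tmp/endpoint-ips.txt

# IPs present in iptables but absent from the endpoints list are stale
comm -23 /tmp/rule-ips.txt /tmp/endpoint-ips.txt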

This confirmed kube-proxy wasn't updating the iptables rules for kube-dns. Why? The kube-proxy logs on the node showed these recurring errors.

dxia@my-gke-node ~ $ tail -f /var/log/kube-proxy.log
E0126 20:40:24.739255       1 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Service: an error on the server ("") has prevented the request from succeeding (get services)
E0126 20:40:24.739611       1 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Endpoints: an error on the server ("") has prevented the request from succeeding (get endpoints)
E0126 20:40:34.742869       1 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Service: an error on the server ("") has prevented the request from succeeding (get services)

The Speculative Fix

I think these kube-proxy errors are caused by this underlying K8s bug, but I’m not sure.

we found that after the problem occurred all subsequent requests were still send on the same connection. It seems that although the client will resend the request to apiserver, but the underlay http2 library still maintains the old connection so all subsequent requests are still send on this connection and received the same error use of closed connection.

So the question is why http2 still maintains an already closed connection? Maybe the connection it maintained is indeed alive but some intermediate connections are closed unexpectedly?

https://github.com/kubernetes/kubernetes/issues/87615#issuecomment-596312532

The bug in that issue is fixed in K8s 1.19 and 1.20.

If you’re using GKE and Google Cloud Monitoring, this log query will show which nodes’ kube-proxy Pods can’t get updated Service and Endpoint data from the K8s API.

resource.type="k8s_node"
resource.labels.project_id="[YOUR-PROJECT]"
logName="projects/[YOUR-PROJECT]/logs/kube-proxy"
jsonPayload.message:"Failed to list "
severity=ERROR
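The same query also works from the command line via gcloud (a sketch; substitute your project and adjust the freshness window and limit as needed):

gcloud logging read '
  resource.type="k8s_node" AND
  logName="projects/my-project/logs/kube-proxy" AND
  jsonPayload.message:"Failed to list " AND
  severity=ERROR
' --project my-project --freshness 1d --limit 20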
