How Kubernetes Routes IP Packets to Services’ Cluster IPs
By David Xia
I recently observed DNS resolution errors on a large Kubernetes (K8s) cluster. The errors were only
happening on 0.1% of the cluster's nodes. But the fact that the problem wasn't self-healing and
crippled tenant workloads, combined with my penchant for chasing rabbits down holes, meant I
wasn't going to let it go. I emerged having learned how the Cluster IP feature of K8s Services actually works.
Explaining that feature, my particular problem, and my speculative fix is the goal of this post.
The Problem
The large K8s cluster is actually a Google Kubernetes Engine (GKE) cluster with master version
1.17.14-gke.400 and node version 1.17.13-gke.2600. This is a multi-tenant cluster with hundreds of
nodes. Each node runs dozens of user workloads. Some users reported that DNS resolution within their
Pods on certain nodes wasn't working. I was able to reproduce this behavior with the following steps.
Kubernetes runs kube-dns Pods and a Service on the cluster to provide DNS, and it configures
kubelets to point individual containers at the DNS Service's IP for name resolution. See the K8s
docs here. First I get the kube-dns Service's Cluster IP. This is the IP address to
which DNS queries from Pods are sent.
kubectl --context my-gke-cluster -n kube-system get services kube-dns
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.178.64.10 <none> 53/UDP,53/TCP 666d
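As a quick sanity check that Pods are actually pointed at this address, you can read resolv.conf inside
any Pod that uses the default ClusterFirst DNS policy. The namespace and Pod name below are placeholders,
not from my cluster.
# Any Pod with dnsPolicy ClusterFirst (the default) should list the kube-dns Cluster IP as its nameserver
kubectl --context my-gke-cluster -n my-namespace exec my-tenant-pod -- cat /etc/resolv.conf
# Expect something like:
# nameserver 10.178.64.10
# search my-namespace.svc.cluster.local svc.cluster.local cluster.local ...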
Then I make DNS queries against the Cluster IP from a broken node (via the toolbox container below).
# Log into the GKE node
gcloud --project my-project compute ssh my-gke-node --zone us-central1-b --internal-ip
# Need to run toolbox container which has iptables command. Google's Container-Optimized OS doesn't
# have it.
dxia@my-gke-node ~ $ toolbox
20200603-00: Pulling from google-containers/toolbox
Digest: sha256:36e2f6b8aa40328453aed7917860a8dee746c101dfde4464ce173ed402c1ec57
Status: Image is up to date for gcr.io/google-containers/toolbox:20200603-00
gcr.io/google-containers/toolbox:20200603-00
e6b1ee70f91ac405623cbf1d2afa9a532a022dc644bddddd754d2cd786f58273
dxia-gcr.io_google-containers_toolbox-20200603-00
Please do not use --share-system anymore, use $SYSTEMD_NSPAWN_SHARE_* instead.
Spawning container dxia-gcr.io_google-containers_toolbox-20200603-00 on /var/lib/toolbox/dxia-gcr.io_google-containers_toolbox-20200603-00.
Press ^] three times within 1s to kill container.
# Install dig
root@toolbox:~# apt-get update && apt-get install dnsutils
# Ask the kube-dns Cluster IP to resolve www.google.com
# dig hangs while it waits for a DNS reply, so each ^C below marks a failed resolution
root@toolbox:~# for x in $(seq 1 20); do echo ${x}; dig @10.178.64.10 www.google.com > /dev/null; done
1
^C2
^C3
4
5
6
7
8
^C9
10
11
12
13
14
15
^C16
17
18
^C19
20
Then I performed a more basic test: whether I could even make a TCP connection to the Cluster IP on
port 53 (the default DNS port).
# Run nc 1000 times without reverse DNS lookup, in verbose and scan mode
# Count only failed connections
root@toolbox:~# for x in $(seq 1 1000); do nc 10.178.64.10 53 -nvz 2>&1 | grep -v open; done | wc -l
257
About a quarter of the TCP connections fail. This means the problem is below the DNS (application) layer, at the transport layer or lower.
Finding the Root Cause: Down the Rabbit Hole
Some background for those unfamiliar: K8s nodes (via the kube-proxy DaemonSet) route IP packets that
originate from a Pod and are destined for a K8s Service's Cluster IP to one of the Service's backing
Pod IPs. kube-proxy does this in one of three proxy modes: userspace, iptables, or IPVS. I'm assuming
GKE runs kube-proxy in iptables mode since their docs here mention iptables rather than IPVS.
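Rather than relying on the docs alone, you can ask kube-proxy itself which mode it's in: it reports the
active mode at /proxyMode on its metrics port, 10249 by default. This is a sketch and assumes that port
is reachable from the node, e.g. from the toolbox container used earlier, which shares the host's
network namespace (curl may need to be installed there first).
# Ask kube-proxy which proxy mode it's running in
root@toolbox:~# apt-get update && apt-get install -y curl
root@toolbox:~# curl -s http://localhost:10249/proxyMode
# Expected on this cluster: iptables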
kube-proxy should keep the node's iptables rules up to date with the actual kube-dns
Service’s endpoints. The following console output shows how I figured out the IP packet flow by
tracing matching iptables rules.
# List rules in the filter table's FORWARD chain
root@toolbox:~# iptables -L FORWARD -t filter
Chain FORWARD (policy DROP)
target prot opt source destination
cali-FORWARD all -- anywhere anywhere /* cali:wUHhoiAYhphO9Mso */
KUBE-FORWARD all -- anywhere anywhere /* kubernetes forwarding rules */
KUBE-SERVICES all -- anywhere anywhere ctstate NEW /* kubernetes service portals */
DOCKER-USER all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT tcp -- anywhere anywhere
ACCEPT udp -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere
ACCEPT sctp -- anywhere anywhere
# List rules in the nat table's KUBE-SERVICES chain and look for rules that match IP packets destined
# for the kube-system/kube-dns Service's Cluster IP
root@toolbox:~# iptables -L KUBE-SERVICES -t nat | grep kube-system/kube-dns | grep SVC
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.178.64.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 10.178.64.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
# List rules in the nat table's KUBE-SVC-ERIFXISQEP7F7OF4 chain
root@toolbox:~# iptables -L KUBE-SVC-ERIFXISQEP7F7OF4 -t nat
Chain KUBE-SVC-ERIFXISQEP7F7OF4 (1 references)
target prot opt source destination
KUBE-SEP-BMNCBK7ROA3MA6UU all -- anywhere anywhere statistic mode random probability 0.01538461540
KUBE-SEP-GYUBQUCI6VR6AER2 all -- anywhere anywhere statistic mode random probability 0.01562500000
KUBE-SEP-IF56RUVXN2P4ORZZ all -- anywhere anywhere statistic mode random probability 0.01587301586
KUBE-SEP-WUD7OE7TYMWFJJYX all -- anywhere anywhere statistic mode random probability 0.01612903224
KUBE-SEP-B7IYZJB6QVUX246S all -- anywhere anywhere statistic mode random probability 0.01639344264
KUBE-SEP-T6B7SPNOX3DH33BE all -- anywhere anywhere statistic mode random probability 0.01666666660
KUBE-SEP-REJSUT2VC76HMIRQ all -- anywhere anywhere statistic mode random probability 0.01694915257
KUBE-SEP-B4N4VXNUSBNXHV73 all -- anywhere anywhere statistic mode random probability 0.01724137925
KUBE-SEP-XUJIW6IGZX4X5BBG all -- anywhere anywhere statistic mode random probability 0.01754385978
KUBE-SEP-MMBQBWR6AYIPMUZL all -- anywhere anywhere statistic mode random probability 0.01785714272
KUBE-SEP-6O5U6FAKQVEXGTP7 all -- anywhere anywhere statistic mode random probability 0.01818181807
KUBE-SEP-DMN3RJWMPAEHNOGE all -- anywhere anywhere statistic mode random probability 0.01851851866
KUBE-SEP-FHJKZIH3JDZSXJUD all -- anywhere anywhere statistic mode random probability 0.01886792434
KUBE-SEP-YRPM7BEQS2YESSJL all -- anywhere anywhere statistic mode random probability 0.01923076902
KUBE-SEP-BSHQZGGNYIILL3V7 all -- anywhere anywhere statistic mode random probability 0.01960784290
KUBE-SEP-XTW5FCAH2423EWAV all -- anywhere anywhere statistic mode random probability 0.02000000002
KUBE-SEP-2ETTGYCM3KLKL54Q all -- anywhere anywhere statistic mode random probability 0.02040816331
KUBE-SEP-ZUFFQWVT2EY73YVF all -- anywhere anywhere statistic mode random probability 0.02083333349
KUBE-SEP-VUNSBD5OILT2BGUX all -- anywhere anywhere statistic mode random probability 0.02127659554
KUBE-SEP-3XVS5OF4SBBHATZW all -- anywhere anywhere statistic mode random probability 0.02173913037
KUBE-SEP-IRW2YX5BEMBR3OGF all -- anywhere anywhere statistic mode random probability 0.02222222229
KUBE-SEP-6J6T3TOCBEQ5NUQ5 all -- anywhere anywhere statistic mode random probability 0.02272727247
KUBE-SEP-E3FOMPW5DQK5FDIA all -- anywhere anywhere statistic mode random probability 0.02325581387
KUBE-SEP-EO4O2TBNDPU377YQ all -- anywhere anywhere statistic mode random probability 0.02380952379
KUBE-SEP-ZGRZOBXXZ2KPGNZD all -- anywhere anywhere statistic mode random probability 0.02439024393
KUBE-SEP-XLRCUOCE6XAL3TYE all -- anywhere anywhere statistic mode random probability 0.02499999991
KUBE-SEP-477YCBVB2RZ4WKUD all -- anywhere anywhere statistic mode random probability 0.02564102551
KUBE-SEP-FGVS22Q3OCM6S5VS all -- anywhere anywhere statistic mode random probability 0.02631578967
KUBE-SEP-FBHD55TKQKCEKSUO all -- anywhere anywhere statistic mode random probability 0.02702702722
KUBE-SEP-ULRGL5A7XKWV3HB6 all -- anywhere anywhere statistic mode random probability 0.02777777798
KUBE-SEP-HO6T2NOJNNMVWDPW all -- anywhere anywhere statistic mode random probability 0.02857142873
KUBE-SEP-PV23DIU55F5LDJIX all -- anywhere anywhere statistic mode random probability 0.02941176482
KUBE-SEP-6PL2LOTBN64MN2IF all -- anywhere anywhere statistic mode random probability 0.03030303027
KUBE-SEP-3G3LTNLLVZWE57GZ all -- anywhere anywhere statistic mode random probability 0.03125000000
KUBE-SEP-SNHFF6VK2KP44I7Q all -- anywhere anywhere statistic mode random probability 0.03225806449
KUBE-SEP-KNOCRXE7JOQ4FBTI all -- anywhere anywhere statistic mode random probability 0.03333333321
KUBE-SEP-M5NXUS47V77SM3HZ all -- anywhere anywhere statistic mode random probability 0.03448275849
KUBE-SEP-VEMFKB2E3QRFFRSG all -- anywhere anywhere statistic mode random probability 0.03571428591
KUBE-SEP-RRYDQV524YXA4GDR all -- anywhere anywhere statistic mode random probability 0.03703703685
KUBE-SEP-G65AAYF5LWFW4YBM all -- anywhere anywhere statistic mode random probability 0.03846153850
KUBE-SEP-K4HN6ANXSPKA7JGZ all -- anywhere anywhere statistic mode random probability 0.04000000004
KUBE-SEP-72YXYSKWHCML6KJJ all -- anywhere anywhere statistic mode random probability 0.04166666651
KUBE-SEP-YCD5TFDQM4ELQ5WX all -- anywhere anywhere statistic mode random probability 0.04347826075
KUBE-SEP-U7N4W7N5OKDP5PNC all -- anywhere anywhere statistic mode random probability 0.04545454541
KUBE-SEP-ACPRKJJSJ73NAQNV all -- anywhere anywhere statistic mode random probability 0.04761904757
KUBE-SEP-HPAV4MFMKCM43BC2 all -- anywhere anywhere statistic mode random probability 0.04999999981
KUBE-SEP-VXO5CPBPAES2GS3A all -- anywhere anywhere statistic mode random probability 0.05263157887
KUBE-SEP-LJ3HM5QDYEB4ICUB all -- anywhere anywhere statistic mode random probability 0.05555555550
KUBE-SEP-W6VORIPTN7FDPIMU all -- anywhere anywhere statistic mode random probability 0.05882352963
KUBE-SEP-A5SGQE4VKXUT2NEC all -- anywhere anywhere statistic mode random probability 0.06250000000
KUBE-SEP-4LCLRUWZUF2DDGKK all -- anywhere anywhere statistic mode random probability 0.06666666688
KUBE-SEP-K7NZ33CKVQDPMIET all -- anywhere anywhere statistic mode random probability 0.07142857136
KUBE-SEP-76ISGBIKEK2QPYDL all -- anywhere anywhere statistic mode random probability 0.07692307699
KUBE-SEP-3S5ELV7JJCII2KNO all -- anywhere anywhere statistic mode random probability 0.08333333349
KUBE-SEP-THLYLIADKU5Z5I32 all -- anywhere anywhere statistic mode random probability 0.09090909082
KUBE-SEP-T7P5MBD5MAWH2XB5 all -- anywhere anywhere statistic mode random probability 0.10000000009
KUBE-SEP-WQ6DVZHCVUTU5QJS all -- anywhere anywhere statistic mode random probability 0.11111111101
KUBE-SEP-5RVGOA4UDKOKKI7O all -- anywhere anywhere statistic mode random probability 0.12500000000
KUBE-SEP-VSXQV2AZ43RZQSL7 all -- anywhere anywhere statistic mode random probability 0.14285714272
KUBE-SEP-RVDWX7YLRKCSUDII all -- anywhere anywhere statistic mode random probability 0.16666666651
KUBE-SEP-OECSAM56W6JQA562 all -- anywhere anywhere statistic mode random probability 0.20000000019
KUBE-SEP-HY76TWODHVCVLG5Y all -- anywhere anywhere statistic mode random probability 0.25000000000
KUBE-SEP-3UNVKH34LEKZ2P5K all -- anywhere anywhere statistic mode random probability 0.33333333349
KUBE-SEP-TDCXKWGVKJJ22VHB all -- anywhere anywhere statistic mode random probability 0.50000000000
KUBE-SEP-Z7ZOTGJIY44EKMWW all -- anywhere anywhere
# List the rules of two random chains above to see the DNAT'ed Pod IP
root@toolbox:~# iptables -L KUBE-SEP-RVDWX7YLRKCSUDII -t nat
Chain KUBE-SEP-RVDWX7YLRKCSUDII (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.179.94.16 anywhere
DNAT tcp -- anywhere anywhere tcp to:10.179.94.16:53
root@toolbox:~# iptables -L KUBE-SEP-6PL2LOTBN64MN2IF -t nat
Chain KUBE-SEP-6PL2LOTBN64MN2IF (1 references)
target prot opt source destination
KUBE-MARK-MASQ all -- 10.179.45.66 anywhere
DNAT tcp -- anywhere anywhere tcp to:10.179.45.66:53
These final rules are the ones that actually replace the destination Cluster IP of 10.178.64.10 with
a randomly chosen kube-dns Pod IP. The random selection is implemented by the rules in the
KUBE-SVC-ERIFXISQEP7F7OF4 chain, which use statistic mode random probability p. Rules are matched
top down, so the first rule, with target KUBE-SEP-BMNCBK7ROA3MA6UU, has a probability of
0.01538461540 of being picked. The second rule, with target KUBE-SEP-GYUBQUCI6VR6AER2, has a
probability of 0.01562500000 of being picked, but that probability only applies when the first rule
didn't match. So its overall probability is (1 - 0.01538461540) * 0.01562500000 ~= 0.01538461540.
Applying this calculation down the chain, you can see every rule has an overall probability of
0.01538461540, or 1/n, of being selected, where n = 65 is the number of kube-dns endpoints these
rules cover (more on that count below). This algorithm is a variation of reservoir sampling.
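To convince myself the arithmetic works out, here's a small awk sketch that needs no cluster access: it
chains the conditional probabilities 1/(n - i + 1) for n = 65 and shows that every rule ends up with the
same overall 1/65 chance (the tiny differences from the probabilities iptables prints are just float rounding).
# Verify that conditional probabilities of 1/(n-i+1) give each endpoint an overall 1/n chance
awk 'BEGIN {
  n = 65; remaining = 1.0
  for (i = 1; i <= n; i++) {
    p = 1 / (n - i + 1)       # probability printed on rule i; the last rule has no statistic match, i.e. p = 1
    overall = remaining * p   # chance that rule i is the one that fires
    printf "rule %2d: conditional %.11f  overall %.11f\n", i, p, overall
    remaining *= (1 - p)      # chance we fall through to rule i+1
  }
}'
# Every "overall" value prints as 0.01538461538, i.e. 1/65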
Confirming the Root Cause
At this point I strongly suspected the iptables rules were stale and routing packets to kube-dns
Pod IPs that no longer exist. To confirm this, I wanted to find a DNAT'ed IP that didn't correspond
to any current kube-dns Pod. I also compared counts: there were 65 KUBE-SEP rules in the
KUBE-SVC-ERIFXISQEP7F7OF4 chain, but I expected 77 because that was the number of kube-dns Pods
actually running. The sketch below shows roughly how to run both checks.
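Roughly, these are the checks, with the results noted as comments. The k8s-app=kube-dns label selector
and the jsonpath expression are illustrative rather than the exact commands I ran.
# On the broken node: count endpoint (KUBE-SEP) rules in the kube-dns TCP service chain
root@toolbox:~# iptables -L KUBE-SVC-ERIFXISQEP7F7OF4 -t nat | grep -c KUBE-SEP
# -> 65 on this node
# From a machine with cluster access: count the addresses the kube-dns Endpoints object actually has,
# which should match the number of ready kube-dns Pods (77 at the time)
kubectl --context my-gke-cluster -n kube-system get endpoints kube-dns \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' | wc -l
# And check whether a particular DNAT'ed IP from the chain above (e.g. 10.179.94.16) still belongs
# to a running kube-dns Pod
kubectl --context my-gke-cluster -n kube-system get pods -l k8s-app=kube-dns -o wide | grep 10.179.94.16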
This confirmed kube-proxy wasn't updating the iptables rules for kube-dns. Why not? The kube-proxy
logs on the node showed these recurring errors.
dxia@my-gke-node ~ $ tail -f /var/log/kube-proxy.log
E0126 20:40:24.739255 1 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Service: an error on the server ("") has prevented the request from succeeding (get services)
E0126 20:40:24.739611 1 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Endpoints: an error on the server ("") has prevented the request from succeeding (get endpoints)
E0126 20:40:34.742869 1 reflector.go:153] k8s.io/client-go/informers/factory.go:135: Failed to list *v1.Service: an error on the server ("") has prevented the request from succeeding (get services)
The Speculative Fix
I think these kube-proxy errors are caused by this underlying K8s bug, but I'm not sure. The issue
describes the same symptom:
we found that after the problem occurred all subsequent requests were still send on the same
connection. It seems that although the client will resend the request to apiserver, but the
underlay http2 library still maintains the old connection so all subsequent requests are still
send on this connection and received the same error use of closed connection.
So the question is why http2 still maintains an already closed connection? Maybe the connection it
maintained is indeed alive but some intermediate connections are closed unexpectedly?
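If you want to look at the connection in question, kube-proxy holds a long-lived HTTP/2 connection to
the apiserver for its watches. A rough way to see it from the toolbox, assuming the toolbox shares the
host's PID and network namespaces (the iptables output above suggests it does) and after installing iproute2:
# Show established TCP sockets owned by the kube-proxy process
root@toolbox:~# apt-get install -y iproute2
root@toolbox:~# ss -tnp | grep kube-proxy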
If you’re using GKE and Google Cloud Monitoring, this log query will show which nodes’ kube-proxy
Pods can’t get updated Service and Endpoint data from the K8s API.
resource.type="k8s_node"
resource.labels.project_id="[YOUR-PROJECT]"
logName="projects/[YOUR-PROJECT]/logs/kube-proxy"
jsonPayload.message:"Failed to list "
severity=ERROR
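The same query can be run from the command line with gcloud. This is a sketch; adjust the project ID
and the freshness window as needed.
# List affected nodes from the CLI; replace [YOUR-PROJECT] with your project ID
gcloud logging read '
  resource.type="k8s_node" AND
  logName="projects/[YOUR-PROJECT]/logs/kube-proxy" AND
  jsonPayload.message:"Failed to list " AND
  severity=ERROR
' --project [YOUR-PROJECT] --freshness 1d --format 'value(resource.labels.node_name)' | sort -u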