I manage multitenant Google Kubernetes Engine (GKE) clusters for stateless backend services at work. Google recently graduated GKE’s Workload Identity (WI) feature to generally available (GA). When my team used WI during its beta stage, it seemed to fail once a single GKE node made more than 16 requests per second (RPS) for Google access tokens.
Before we knew about this low RPS failure threshold, we told many internal engineering teams to go ahead and use the feature. In hindsight, we should’ve load-tested the feature before making it generally available internally, especially since it wasn’t even GA publicly.
My efforts to load test WI have grown more sophisticated over time. This post describes the progression. It’s like the “4 Levels of …” Epicurious YouTube videos. The goal here is to find out at what RPS WI starts to fail and to try to learn some generalizable lessons about load testing vendor-managed services.
tl;dr lessons learned
- always load test new features above and beyond what you expect your production load will be
- use proper load testing tools and not bash for loops
My specific GKE cluster configuration
- GKE masters and nodes running version 1.15.9-gke.22
- regional cluster in Google Cloud Platform (GCP) (not on-premises)
- 4 GKE nodes that are n1-standard-32 GCE instances in one node pool
- each node is configured to have a maximum of 32 Pods
- cluster and node pool have WI enabled (the enabling commands are sketched just below)
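For reference, enabling WI on an existing regional cluster and node pool looks roughly like this (a sketch: the cluster, node pool, project, and region names are placeholders, and the flag names have changed across gcloud releases):

```bash
# Sketch: names and region are placeholders; flags are the ones current gcloud documents.
gcloud container clusters update my-cluster \
  --region us-central1 \
  --workload-pool=my-project.svc.id.goog

gcloud container node-pools update my-node-pool \
  --cluster my-cluster \
  --region us-central1 \
  --workload-metadata=GKE_METADATA
```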
High level of what Workload Identity is and how it works
Workloads on GKE often need to access GCP resources like PubSub or CloudSQL. In order to do so, your workload needs to use a Google Service Account (GSA) key that is authorized to access those resources. So you end up creating keys for all your GSAs and copy-pasting these keys into Kubernetes Secrets for your workloads. This is insecure and not maintainable for a company with dozens of engineering teams and hundreds of workloads.
So GCP offered WI, which allows a Kubernetes Service Account (KSA) to be associated with a GSA. If a workload runs as a certain KSA, it’ll transparently get the Google access token for the associated GSA. No manual copy-pasting of GSA keys!
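Setting up that association boils down to an IAM binding plus an annotation on the KSA. A minimal sketch, with project, namespace, and account names as placeholders:

```bash
# Sketch: my-project, my-namespace, my-ksa, and my-gsa are placeholders.
# Let the KSA impersonate the GSA ...
gcloud iam service-accounts add-iam-policy-binding \
  my-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"

# ... and tell GKE which GSA the KSA maps to.
kubectl annotate serviceaccount my-ksa \
  --namespace my-namespace \
  iam.gke.io/gcp-service-account=my-gsa@my-project.iam.gserviceaccount.com
```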
How does this work? You have to enable WI on your cluster and node pool. This creates a `gke-metadata-server` DaemonSet in the `kube-system` namespace. `gke-metadata-server` is the entrypoint to the whole WI system. Here’s a nice Google Cloud Next conference talk with more details.

`gke-metadata-server` is the only part of WI that is exposed to GKE users, i.e. runs on machines you control. It’s like the Verizon FiOS box in your basement. You control your house, but there’s a little box that Verizon owns and operates in there. All other parts of WI run on GCP infrastructure that you can’t see. When I saw failures with WI, it all seemed to happen in `gke-metadata-server`. So that’s what I’ll load test.
The `gke-metadata-server` DaemonSet YAML is worth skimming for reference. As of the time of this writing, the image is `gke.gcr.io/gke-metadata-server:20200218_1145_RC0`. You might see different behavior with different images.
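You can dump the full manifest from your own cluster with:

```bash
kubectl --namespace kube-system get daemonset gke-metadata-server -o yaml
```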
Level 1
What kind of load am I putting on `gke-metadata-server`? Since this DaemonSet exists to give out Google access tokens, I’ll send it HTTP requests asking for such tokens. I built a Docker image from a two-line Dockerfile.
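Roughly, it did something like this (a sketch, not the original file; the `google/cloud-sdk` base image and the exact loop are assumptions). The container just asks for a token and checks it against the tokeninfo endpoint, forever:

```dockerfile
# Sketch: loop forever fetching an access token via gcloud and validating it against tokeninfo.
FROM google/cloud-sdk:slim
CMD while true; do curl -s "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=$(gcloud auth print-access-token)" > /dev/null; done
```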
Then I created a K8s Deployment to run it.
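Something along these lines (a sketch: the names, labels, image, and node hostname are placeholders I’ve made up):

```yaml
# Sketch: all names below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wi-load-level1
spec:
  replicas: 7
  selector:
    matchLabels:
      app: wi-load-level1
  template:
    metadata:
      labels:
        app: wi-load-level1
    spec:
      serviceAccountName: my-ksa                        # KSA bound to the GSA via WI
      nodeSelector:
        kubernetes.io/hostname: gke-my-cluster-node-1   # pin every replica to one node
      containers:
      - name: load
        image: gcr.io/my-project/wi-load:latest         # image built from the Dockerfile above
```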
I ran seven of these Pods on a single node (see the `nodeSelector` above) to target a single instance of `gke-metadata-server`.
This isn’t a great test because the Container does a lot of extra work: running `gcloud` to print a Google access token (there may be bottlenecks in the `gcloud` command itself, which is Python code) and curling the `googleapis.com` endpoint to get the token info (originally done to verify the token was valid). And there are probably bottlenecks in using a shell to do all of this. All in all, this implementation doesn’t really let you specify a fixed RPS; you’re at the mercy of how fast your Container, shell, `gcloud`, and the network will let you execute it. I also wasn’t able to run more Pods on a single node because I was hitting the 32-Pods-per-node limit. There were already a bunch of other GKE system-level workloads like Calico taking up node capacity.
Level 2
Apply this one Pod.
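Roughly, the Pod looked something like this (a sketch: the image, names, and node pinning are assumptions):

```yaml
# Sketch: an idle cloud-sdk Pod we can exec into; all names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: wi-load-level2
spec:
  serviceAccountName: my-ksa                        # KSA bound to the GSA via WI
  nodeSelector:
    kubernetes.io/hostname: gke-my-cluster-node-1   # same node, same gke-metadata-server
  containers:
  - name: shell
    image: google/cloud-sdk:slim
    command: ["sleep", "infinity"]                  # just sit there so we can kubectl exec in
```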
Then `kubectl exec` in and run this command.
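Something like this, with N as the concurrency knob (a sketch of the one-liner, not the original):

```bash
# Sketch: fire N concurrent gcloud token requests and wait for them all.
N=100
for i in $(seq 1 "$N"); do gcloud auth print-access-token > /dev/null & done; wait
```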
Everything seemed to work fine when N was 100. When N was 200, I got a few errors from `gcloud` (output omitted here). They looked like client-side errors and not server-side ones, though.
`gcloud` does not synchronize between processes when invoked concurrently, and it sometimes writes files to disk. So this is also not a great load test: it still doesn’t let you achieve a specific RPS, and it has client-side bottlenecks.
Level 3
Use a proper HTTP load testing tool. A colleague told me about `vegeta`. It’s a seemingly good tool, but, more importantly, its commands are amazing: `vegeta attack ...`.

I first start a golang Pod that just busy-waits.
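Roughly like this (a sketch: the image tag, names, and the idle command are assumptions):

```yaml
# Sketch: an idle golang Pod to run vegeta from; all names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: vegeta-attacker
spec:
  serviceAccountName: my-ksa                        # KSA bound to the GSA via WI
  nodeSelector:
    kubernetes.io/hostname: gke-my-cluster-node-1   # target one gke-metadata-server instance
  containers:
  - name: golang
    image: golang:1.13
    command: ["sleep", "infinity"]                  # do nothing; we only need Go tooling and a shell
```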
Then I get a shell in it.
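Something like this (a sketch; installing vegeta via `go get` is an assumption, a release binary works too):

```bash
# Sketch: exec into the Pod, then build vegeta from source inside it.
kubectl exec -it vegeta-attacker -- bash

# inside the Pod:
go get -u github.com/tsenart/vegeta
vegeta -version
```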
Let’s throw some load on WI! `my-gsa@my-project.iam.gserviceaccount.com` is the GSA associated with the KSA your workload runs as.
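A representative run looks something like this (a sketch: the target URL format, rate, and duration are assumptions; the idea is to bisect `-rate` up and down until requests start failing):

```bash
# Sketch: hammer the token endpoint that gke-metadata-server serves inside the Pod.
echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" \
  | vegeta attack -header "Metadata-Flavor: Google" -rate 100 -duration 60s \
  | vegeta report
```

`vegeta report` prints the success ratio and latency percentiles, including p99.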
After more bisection, I found that this specific instance of `gke-metadata-server` starts to fail around 150 RPS. When it does fail, p99 latency skyrockets from less than 1 second to 30 seconds. This is usually a sign of a rate limiter or quota.
How have you tried load testing WI or other GKE features? What’re your favorite load testing tools for these cases, and what interesting behavior have you found?