I’m helping a recently acquired team at work figure out if they can migrate from Kafka to Google Cloud Pub/Sub. Part of the exploration was figuring out the change in latencies, if any, from switching.
The team’s production setup looks like this:
- They paid an external company called Confluent to run a managed Kafka cluster in AWS Oregon.
- This is the same region where the team ran all their backend services. Part of their migration also involves moving their workloads from AWS Oregon to GCP us-central1. If they choose to migrate to Pub/Sub, their services will publish and subscribe to messages across cloud providers and regions, so my latency benchmarks took that into account.
- All their services are written in Golang.
- Services run as containers in AWS Elastic Container Service.
I defined latency as the time elapsed between when a message is published and when it’s received by a subscriber. I didn’t count the extra time it takes for a subscriber to acknowledge the message. I used Golang and the same upstream libraries for Kafka and Pub/Sub that the team used or would use, respectively, in production.
I published messages of various sizes at various rates from AWS EC2 instances in Oregon for five minutes. At the same time, five Google Compute Engine instances in us-central1 subscribed to these messages (pull-based) as fast as possible, with an initial burn-in period of one minute. I didn’t record latencies until the burn-in period had elapsed. This avoided any effects that might come from a brand-new topic or subscription, or from too few messages flowing through the messaging service, so the measurements more closely mimicked message latency in production.
I always took the percentile summary of the subscriber with the second highest p99 latency. I created new Pub/Sub or Kafka topics for each series in the graphs below. Kafka topics always had eight partitions.
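To make the measurement concrete, here is a minimal sketch of how publish-to-receive latency, the burn-in cutoff, and a percentile summary could be computed. It is not the code from the benchmarking repository. The `sample` type, the function names, and the demo values are hypothetical, and the nearest-rank percentile method is an assumption, since the post doesn't say how percentiles were calculated. It also assumes the publisher and subscriber clocks are closely synchronized, because the two timestamps come from different machines.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// sample is a hypothetical record of one received message.
type sample struct {
	publishedAt time.Time // timestamp set by the publisher
	receivedAt  time.Time // timestamp taken by the subscriber on receipt
}

// latenciesAfterBurnIn returns publish-to-receive latencies, skipping any
// message received before start+burnIn. Acknowledgement time is not included.
func latenciesAfterBurnIn(samples []sample, start time.Time, burnIn time.Duration) []time.Duration {
	cutoff := start.Add(burnIn)
	var out []time.Duration
	for _, s := range samples {
		if s.receivedAt.Before(cutoff) {
			continue // still in the burn-in period
		}
		out = append(out, s.receivedAt.Sub(s.publishedAt))
	}
	return out
}

// percentile returns the p-th percentile (0 < p <= 100) using the
// nearest-rank method. It returns 0 for an empty input.
func percentile(latencies []time.Duration, p float64) time.Duration {
	if len(latencies) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), latencies...) // don't mutate the input
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// demoSamples builds made-up data: one message inside the one-minute
// burn-in and four after it, with latencies of 100ms to 400ms.
func demoSamples() ([]sample, time.Time, time.Duration) {
	start := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	mk := func(receivedOffset, latency time.Duration) sample {
		received := start.Add(receivedOffset)
		return sample{publishedAt: received.Add(-latency), receivedAt: received}
	}
	samples := []sample{
		mk(30*time.Second, time.Second), // received during burn-in, dropped
		mk(61*time.Second, 100*time.Millisecond),
		mk(62*time.Second, 200*time.Millisecond),
		mk(63*time.Second, 300*time.Millisecond),
		mk(64*time.Second, 400*time.Millisecond),
	}
	return samples, start, time.Minute
}

func main() {
	samples, start, burnIn := demoSamples()
	lat := latenciesAfterBurnIn(samples, start, burnIn)
	fmt.Println("samples kept:", len(lat))
	fmt.Println("p50:", percentile(lat, 50))
	fmt.Println("p99:", percentile(lat, 99))
}
```

In a real run, each subscriber would collect its own samples and produce its own summary. Comparing those summaries across the five subscribers is what allows picking the one with the second highest p99.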
I took some inspiration from a blog post titled “Benchmarking Message Queue Latency” and also found the GCP post “Testing Cloud Pub/Sub clients to maximize streaming performance.” The latter linked to the code used to benchmark Pub/Sub. Unfortunately, after trying that tool many times and finding it wasn’t documented well and had various issues like this, I gave up and wrote my own simple latency benchmarker in Golang. This was probably better anyway, since it ensured I was using the same language and client libraries as the team I was helping.
My full results are in this Google Sheet. The benchmarking code is at github.com/davidxia/cloud-message-latency.