What I Recently Learned About Docker Networking and Debugging Networking Issues in General
By David Xia
This is a story about how I debugged a confounding local development environment issue, what I
learned about Docker in the process, and the generally applicable debugging strategies and
techniques that ultimately helped me solve it. Skip to the end if you only want to read the
debugging strategies and techniques. The overall story, however, illustrates how they applied
in this specific case.
Problem Statement and Use Case
A data infrastructure team at work provides a tool for starting a data pipeline job from a local
development environment. Let’s call this tool foo. This tool depends on gcloud and docker.
It creates a user-defined Docker network, runs a utility container called bar connected to that
network, and then runs another container called qux that talks to bar to retrieve OAuth tokens
from Google Cloud Platform (GCP).
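To make the setup concrete, here is a rough sketch of what foo does under the hood, pieced
together from the commands that appear later in this post. The qux image name and job arguments
are placeholders:

# Rough sketch of foo's Docker setup; everything besides foo-network
# and foo/bar is a placeholder
docker network create foo-network
docker run -d --name bar --net foo-network -p 80:80 foo/bar:latest
docker run --net foo-network <qux-image> <job-args>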
Most developers run foo on their local workstations, e.g. MacBooks. But I have the newer
MacBook with the Apple M1 ARM-based chip, and Docker Desktop on Mac only recently added support
for M1s. I didn’t want to deal with Docker weirdness. I also didn’t have a lot of free
disk space on my 256GB MacBook and thus didn’t feel like clogging up my drive with lots of Java,
Scala, and Docker gunk.
So I tried running foo on a GCE VM configured by our Puppet configuration files. When I ran foo,
I got this error.
Error #1: inter-container networking failed between containers attached to user-defined Docker networks
dxia@my-host$ foo --verbose run -f data-info.yaml -w DumpKubernetesContainerImagesJob -p 2021-04-26 -r my-project/target/image-name
DEBUG:verify: Docker network `foo-network` already exists
DEBUG:verify: bar container found.
INFO:run: starting workflow DumpKubernetesContainerImagesJob ...
ERROR: (gcloud.auth.activate-service-account) [Errno 110] Connection timed out
This may be due to network connectivity issues. Please check your network settings, and the status of the service you are trying to reach.
Traceback (most recent call last):
File "/usr/local/bin/activate-google-application-credentials", line 19, in <module>
'auth', 'activate-service-account', '--key-file', json_path])
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'auth', 'activate-service-account', '--key-file', '/etc/_foo/gcp-sa-key.json']' returned non-zero exit status 1.
ERROR:foo:
RAN: /usr/bin/docker run -it -v /home/dxia/my-project/_foo:/etc/_foo:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python2.7/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.6/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.7/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.8/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.9/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python2.7/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.6/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.7/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.8/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.9/site-packages/oauth2client/__init__.py:ro --net foo-network -e FOO_COMPONENT_ID=my-project -e FOO_WORKFLOW_ID=DumpKubernetesContainerImagesJob -e FOO_PARAMETER=2021-04-26 -e FOO_DOCKER_IMAGE=my-project:20210426T211411-2b5452d -e 'FOO_DOCKER_ARGS=wrap-luigi --module luigi_tasks DumpKubernetesContainerImagesJob --when 2021-04-26' -e FOO_EXECUTION_ID=foorun-2e30c385-2f89-494c-bc0e-97b3eff316d5 -e FOO_TRIGGER_ID=foo-942f155b-49eb-4af8-a6e4-3adf6f72577b -e FOO_TRIGGER_TYPE=foo -e FOO_ENVIRONMENT=foo -e FOO_LOGGING=text -e GOOGLE_APPLICATION_CREDENTIALS=/etc/_foo/gcp-sa-key.json -e FOO_SERVICE_ACCOUNT=dump-k8-deployment-info-pipeli@my-project.iam.gserviceaccount.com gcr.io/xpn-1/my-project:20210426T211411-2b5452d wrap-luigi --module luigi_tasks DumpKubernetesContainerImagesJob --when 2021-04-26
STDOUT:
STDERR:
Traceback (most recent call last):
File "/home/dxia/my-project//lib/python3.6/site-packages/foo/foo.py", line 304, in main
args.func(args)
File "/home/dxia/my-project//lib/python3.6/site-packages/foo/foo.py", line 269, in _run
args.declarative_infra,
File "/home/dxia/my-project//lib/python3.6/site-packages/foo/run.py", line 641, in run_workflow
declarative_infra,
File "/home/dxia/my-project//lib/python3.6/site-packages/foo/run.py", line 161, in _run_workflow
p.wait()
File "/home/dxia/my-project//lib/python3.6/site-packages/sh.py", line 841, in wait
self.handle_command_exit_code(exit_code)
File "/home/dxia/my-project//lib/python3.6/site-packages/sh.py", line 865, in handle_command_exit_code
raise exc
sh.ErrorReturnCode_1:
RAN: /usr/bin/docker run -it -v /home/dxia/my-project/_foo:/etc/_foo:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python2.7/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.6/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.7/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.8/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.9/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python2.7/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.6/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.7/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.8/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.9/site-packages/oauth2client/__init__.py:ro --net foo-network -e FOO_COMPONENT_ID=my-project -e FOO_WORKFLOW_ID=DumpKubernetesContainerImagesJob -e FOO_PARAMETER=2021-04-26 -e FOO_DOCKER_IMAGE=my-project:20210426T211411-2b5452d -e 'FOO_DOCKER_ARGS=wrap-luigi --module luigi_tasks DumpKubernetesContainerImagesJob --when 2021-04-26' -e FOO_EXECUTION_ID=foorun-2e30c385-2f89-494c-bc0e-97b3eff316d5 -e FOO_TRIGGER_ID=foo-942f155b-49eb-4af8-a6e4-3adf6f72577b -e FOO_TRIGGER_TYPE=foo -e FOO_ENVIRONMENT=foo -e FOO_LOGGING=text -e GOOGLE_APPLICATION_CREDENTIALS=/etc/_foo/gcp-sa-key.json -e FOO_SERVICE_ACCOUNT=dump-k8-deployment-info-pipeli@my-project.iam.gserviceaccount.com gcr.io/xpn-1/my-project:20210426T211411-2b5452d wrap-luigi --module luigi_tasks DumpKubernetesContainerImagesJob --when 2021-04-26
STDOUT:
STDERR:
The HTTP connection timed out. First I checked whether the container started by foo could make a TCP
connection to the bar container. I ran foo --verbose run -f data-info.yaml -w
DumpKubernetesContainerImagesJob -p 2021-04-26 -r my-project/target/image-name
again and did the following in another terminal window.
nsenter is a cool tool that allows you to run programs in different Linux namespaces.
It’s very useful when you can’t get a shell into a container with commands like
docker exec -it ... bash. This can happen when the container doesn’t include any shell
and only has a binary executable, for instance.
dxia@my-host:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a0e872188831 my-project:20210426T211411-2b5452d "/usr/local/bin/edge…" 2 seconds ago Up 1 second relaxed_pike
4dda670a2ee1 foo/bar:latest "./bar" 2 hours ago Up 2 hours 0.0.0.0:80->80/tcp bar
dxia@my-host:~$ sudo nsenter -n -t $(docker inspect --format '{{.State.Pid}}' a0e872188831) nc 172.20.0.127 80 -nvz -w 5
(UNKNOWN) [172.20.0.127] 80 (http) : Connection timed out
So the HTTP connection timeout was caused by an error lower down on the networking stack: an
inability to establish a TCP connection. A TCP connection from the host to bar worked though.
When I see a networking issue like this, I know there might be some misconfigured firewall rule
blocking IP packets. I listed all the firewall rules. The ones in the filter table’s FORWARD
chain caught my attention.
dxia@my-host:~$ sudo iptables --list FORWARD --verbose --numeric --line-numbers --table filter
Chain FORWARD (policy DROP 38 packets, 2280 bytes)
num pkts bytes target prot opt in out source destination
1 6204 492K DOCKER-USER all -- * * 0.0.0.0/0 0.0.0.0/0
2 6204 492K DOCKER-ISOLATION-STAGE-1 all -- * * 0.0.0.0/0 0.0.0.0/0
3 3080 323K ACCEPT all -- * corp0 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED
4 1 60 DOCKER all -- * corp0 0.0.0.0/0 0.0.0.0/0
5 3085 167K ACCEPT all -- corp0 !corp0 0.0.0.0/0 0.0.0.0/0
6 0 0 ACCEPT all -- corp0 corp0 0.0.0.0/0 0.0.0.0/0
7 264 17722 DOCKER all -- * br-8ce7e363e4f9 0.0.0.0/0 0.0.0.0/0
8 7382 17M ACCEPT all -- br-8ce7e363e4f9 !br-8ce7e363e4f9 0.0.0.0/0 0.0.0.0/0
I disabled the GCE VM’s cronned Puppet run and then ran sudo systemctl restart docker so that
the Docker daemon would recreate its own iptables rules. Then I ran bar and a test nginx1
container connected to foo-network.
dxia@my-host:~$ docker run --rm -d -v ~/.config/gcloud/:/.config/gcloud --name bar --net foo-network --ip 172.20.0.127 -p 80:80 foo/bar:latest
3f9cc17b3f71e7056fd8072449afa78eb9a6a166ac091d751b69545ead0438b1
dxia@my-host:~$ docker run --net foo-network --name nginx1 -d -p 8080:80 nginx:latest
1b0b2b981f9389a989aa8f60a141b5e9a18ba5582141b6668c9078b6312dcfaf
dxia@my-host:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1b0b2b981f93 nginx:latest "/docker-entrypoint.…" 5 seconds ago Up 3 seconds 0.0.0.0:8080->80/tcp nginx1
3f9cc17b3f71 foo/bar:latest "./bar" About a minute ago Up 59 seconds 0.0.0.0:80->80/tcp bar
Now a TCP connection from the nginx container to bar succeeded.
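A quick way to verify that, reusing the nsenter trick from above with this session’s container
name and bar’s IP:

# Verify TCP connectivity to bar from inside nginx1's network namespace
NGINX_PID=$(docker inspect --format '{{.State.Pid}}' nginx1)
sudo nsenter -n -t "$NGINX_PID" nc -nvz -w 5 172.20.0.127 80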
I checked iptables rules again and saw two additional rules (7 and 8) in the filter table’s
FORWARD chain. Rule 8 allowed IP packets coming in from the br-8ce7e363e4f9 network interface
(in this case a Linux bridge) and leaving through the same interface.
dxia@my-host:~$ sudo iptables --list FORWARD --verbose --numeric --line-numbers --table filter
Chain FORWARD (policy DROP 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
1 0 0 DOCKER-USER all -- * * 0.0.0.0/0 0.0.0.0/0
2 0 0 DOCKER-ISOLATION-STAGE-1 all -- * * 0.0.0.0/0 0.0.0.0/0
3 0 0 ACCEPT all -- * corp0 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED
4 0 0 DOCKER all -- * corp0 0.0.0.0/0 0.0.0.0/0
5 0 0 ACCEPT all -- corp0 !corp0 0.0.0.0/0 0.0.0.0/0
6 0 0 ACCEPT all -- corp0 corp0 0.0.0.0/0 0.0.0.0/0
7 0 0 ACCEPT all -- * br-8ce7e363e4f9 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED
8 0 0 ACCEPT all -- br-8ce7e363e4f9 br-8ce7e363e4f9 0.0.0.0/0 0.0.0.0/0
9 264 17722 DOCKER all -- * br-8ce7e363e4f9 0.0.0.0/0 0.0.0.0/0
10 7382 17M ACCEPT all -- br-8ce7e363e4f9 !br-8ce7e363e4f9 0.0.0.0/0 0.0.0.0/0
When I re-ran Puppet, rules 7 and 8 were deleted, and containers on the foo-network were again
unable to establish a TCP connection. I added rule 8 back manually and confirmed that its absence
was what caused my error above.
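For reference, re-adding rule 8 by hand looks something like the following. The bridge name is
specific to this machine and changes whenever the Docker network is recreated:

# Re-add the ACCEPT rule for intra-bridge traffic
# (-I inserts at the top of the FORWARD chain, which is enough here
# because it's the chain's DROP policy that discards the packets)
sudo iptables -I FORWARD -i br-8ce7e363e4f9 -o br-8ce7e363e4f9 -j ACCEPT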
Error #2: DNS queries for external records from bar failed
dxia@my-host$ foo run -f data-info.yaml -w DumpKubernetesContainerImagesJob -p 2021-04-26 -r my-project/target/image-name
INFO:run: starting workflow DumpKubernetesContainerImagesJob ...
ERROR: (gcloud.auth.activate-service-account) There was a problem refreshing your current auth tokens: Invalid response 500.
Please run:
$ gcloud auth login
to obtain new credentials.
If you have already logged in with a different account:
$ gcloud config set account ACCOUNT
to select an already authenticated account to use.
Traceback (most recent call last):
File "/usr/local/bin/activate-google-application-credentials", line 19, in <module>
'auth', 'activate-service-account', '--key-file', json_path])
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'auth', 'activate-service-account', '--key-file', '/etc/_foo/gcp-sa-key.json']' returned non-zero exit status 1.
ERROR:foo: non-zero exit code (1) from `/usr/bin/docker run -it -v /home/dxia/my-project/_foo:/etc/_foo:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python2.7/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.6/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.7/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.8/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.9/dist-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python2.7/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.6/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.7/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.8/site-packages/oauth2client/__init__.py:ro -v /home/dxia/my-project/_foo/__init__.py:/usr/local/lib/python3.9/site-packages/oauth2client/__init__.py:ro --net foo-network -e FOO_COMPONENT_ID=my-project -e FOO_WORKFLOW_ID=DumpKubernetesContainerImagesJob -e FOO_PARAMETER=2021-04-02 -e FOO_DOCKER_IMAGE=my-project:20210422T065801-2b5452d -e 'FOO_DOCKER_ARGS=wrap-luigi --module luigi_tasks DumpKubernetesContainerImagesJob --when 2021-04-02' -e FOO_EXECUTION_ID=foorun-3feaee45-35e4-4c01-9430-86de52eb2db1 -e FOO_TRIGGER_ID=foo-f721de7f-edf9-4bb3-8cdf-1e9bbcec5035 -e FOO_TRIGGER_TYPE=foo -e FOO_ENVIRONMENT=foo -e FOO_LOGGING=text -e GOOGLE_APPLICATION_CREDENTIALS=/etc/_foo/gcp-sa-key.json -e FOO_SERVICE_ACCOUNT=dump-k8-deployment-info-pipeli@my-project.iam.gserviceaccount.com gcr.io/xpn-1/my-project:20210422T065801-2b5452d wrap-luigi --module luigi_tasks DumpKubernetesContainerImagesJob --when 2021-04-02`
The only background knowledge we need here is that the qux container sends a Google
Service Account (GSA) JSON credential with "token_uri": "http://172.20.0.127:80/token". Bar
then uses that token for further GCP API requests. So bar needs to query DNS for
accounts.google.com. Bar’s container logs show that it cannot look up the DNS A record for
accounts.google.com by querying 127.0.0.11:53.
dxia@my-host:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1b0b2b981f93 nginx:latest "/docker-entrypoint.…" 9 hours ago Up 9 hours 0.0.0.0:8080->80/tcp nginx1
3f9cc17b3f71 foo/bar:latest "./bar" 9 hours ago Up 9 hours 0.0.0.0:80->80/tcp bar
dxia@my-host:~$ docker logs --follow bar
2021/04/22 05:54:19 bar started
2021/04/22 06:59:13 Received JWT assertion: [REDACTED base-64 string]
2021/04/22 06:59:13 Servive account name: projects/-/serviceAccounts/dump-k8-deployment-info-pipeli@my-project.iam.gserviceaccount.com
2021/04/22 06:59:28 Post https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/dump-k8-deployment-info-pipeli@my-project.iam.gserviceaccount.com:generateAccessToken?alt=json&prettyPrint=false: Post https://accounts.google.com/o/oauth2/token: dial tcp: lookup accounts.google.com on 127.0.0.11:53: read udp 127.0.0.1:46920->127.0.0.11:53: i/o timeout
2021/04/22 06:59:28 Failed to create new token for dump-k8-deployment-info-pipeli@my-project.iam.gserviceaccount.com
I wondered why bar was querying 127.0.0.11 for DNS. It turns out this is another loopback
address. In fact, all of 127.0.0.0/8 is loopback according to RFC-6890. My guess was that Docker
containers attached to user-defined Docker networks are configured by default to use
127.0.0.11 in their /etc/resolv.conf.
dxia@my-host$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1b0b2b981f93 nginx:latest "/docker-entrypoint.…" 9 hours ago Up 9 hours 0.0.0.0:8080->80/tcp nginx1
3f9cc17b3f71 foo/bar:latest "./bar" 9 hours ago Up 9 hours 0.0.0.0:80->80/tcp bar
dxia@my-host$ docker exec -it nginx1 /bin/sh -c "cat /etc/resolv.conf"
search corp.net
nameserver 127.0.0.11
options attempts:1 timeout:5 ndots:0
Why were these Docker containers configured to query 127.0.0.11 for DNS records? After some
Googling, I found this in the Docker documentation:
By default, a container inherits the DNS settings of the host, as defined in the /etc/resolv.conf
configuration file. Containers that use the default bridge network get a copy of this file,
whereas containers that use a custom network use Docker’s embedded DNS server, which forwards
external DNS lookups to the DNS servers configured on the host.
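A useful way to check the embedded DNS server directly is to query it from inside the container’s
network namespace. A sketch, assuming dig is installed on the host:

# Query Docker's embedded DNS server from bar's network namespace
BAR_PID=$(docker inspect --format '{{.State.Pid}}' bar)
sudo nsenter -n -t "$BAR_PID" dig +time=5 @127.0.0.11 accounts.google.com A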
Now I wondered if Docker’s embedded DNS server was actually running. After some more Googling, I
realized that each container also has its own set of firewall rules. So I listed the rules of the
nat table’s DOCKER_OUTPUT chain inside bar’s network namespace. Two rules there showed that TCP
packets bound for 127.0.0.11:53 have their destination port rewritten to 37619, and UDP packets
to 58552.
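Listing those rules looks something like this (the rewritten ports are random and differ for
every container):

# Show the per-container DNAT rules that redirect 127.0.0.11:53
# to the embedded DNS server's real ports
BAR_PID=$(docker inspect --format '{{.State.Pid}}' bar)
sudo nsenter -n -t "$BAR_PID" iptables --table nat --list DOCKER_OUTPUT --numeric --verbose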
After enabling "log-level": "debug" in /etc/docker/daemon.json and reloading the configuration
file, I saw that the daemon was trying to forward the DNS query to 10.99.0.1. This was the IP of
the corp0 bridge network interface, which we create instead of the default docker0 bridge. I saw
there was an I/O timeout while the daemon was waiting for the DNS reply.
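Enabling debug logs looks roughly like this, assuming daemon.json has no other settings that
would be clobbered by overwriting the file:

# Turn on dockerd debug logging and reload its config
echo '{ "log-level": "debug" }' | sudo tee /etc/docker/daemon.json
sudo systemctl reload docker  # sends SIGHUP to dockerd; restart the
                              # daemon if the new level doesn't take effect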
dxia@my-host$ sudo journalctl --follow -u docker
-- Logs begin at Tue 2019-11-05 18:17:27 UTC. --
Apr 22 15:43:12 my-host.corp.net dockerd[10984]: time="2021-04-22T15:43:12.496979903Z" level=debug msg="[resolver] read from DNS server failed, read udp 172.20.0.127:37928->10.99.0.1:53: i/o timeout"
Apr 22 15:43:13 my-host.corp.net dockerd[10984]: time="2021-04-22T15:43:13.496539033Z" level=debug msg="Name To resolve: accounts.google.com."
Apr 22 15:43:13 my-host.corp.net dockerd[10984]: time="2021-04-22T15:43:13.496958664Z" level=debug msg="[resolver] query accounts.google.com. (A) from 172.20.0.127:51642, forwarding to udp:10.99.0.1"
We set dockerd’s upstream DNS server to 10.99.0.1 because we have unbound running as a DNS
proxy/cache on the host. We configured it to bind on the bridge interface so Docker containers
can hit the host-local unbound instance by routing DNS requests to corp0.
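One way to pin that upstream is the dns key in daemon.json; whether it’s set there or inherited
from the host’s /etc/resolv.conf, the embedded resolver forwards to it the same way. A
hypothetical snippet:

# /etc/docker/daemon.json entry pointing containers at the
# host-local unbound instance (merged with any existing keys)
{
  "dns": ["10.99.0.1"]
}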
So why couldn’t the daemon forward IP packets from 172.20.0.127:37928 to 10.99.0.1:53? It seemed
like UDP packets sent from bar were able to reach 10.99.0.1:53, yet DNS requests failed. I also
knew DNS requests from the host to 10.99.0.1:53 worked.
My hypothesis at this point was that Docker’s embedded DNS server wasn’t working in some way.
After exploring this for a while with no luck, I questioned my assumption that UDP packets from
172.20.0.127:37928 were able to reach 10.99.0.1:53. I realized that not even TCP packets from
172.20.0.127 could reach 10.99.0.1:53.
So it had to be another networking issue, which meant there had to be another firewall rule
blocking packets from the bar container to 10.99.0.1. After a while of looking, I realized the
filter table’s INPUT chain’s default policy was DROP and that there was no rule matching
packets coming in from the br-8ce7e363e4f9 interface.
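The fix is analogous to rule 8 above but on the INPUT chain, since these packets are addressed to
the host itself rather than forwarded. A sketch with this machine’s bridge name:

# Accept traffic arriving from the Docker bridge so it can reach
# host-local services like unbound on 10.99.0.1 (bridge name is dynamic)
sudo iptables -I INPUT -i br-8ce7e363e4f9 -j ACCEPT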
But… still no A records? The Docker daemon logs stated that the upstream local unbound DNS server
did not return any A records.
Error #3: unbound refused to reply to DNS queries from the private IP range used by our Docker network
Apr 27 00:07:13 my-host.corp.net dockerd[32659]: time="2021-04-27T00:07:13.864160829Z" level=debug msg="Name To resolve: accounts.google.com."
Apr 27 00:07:13 my-host.corp.net dockerd[32659]: time="2021-04-27T00:07:13.864325564Z" level=debug msg="[resolver] query accounts.google.com. (A) from 172.20.0.127:57576, forwarding to udp:10.99.0.1"
Apr 27 00:07:13 my-host.corp.net dockerd[32659]: time="2021-04-27T00:07:13.864537556Z" level=debug msg="[resolver] external DNS udp:10.99.0.1 did not return any A records for \"accounts.google.com.\""
Bingo! There was no access-control entry that allowed DNS queries from 172.20.0.127. I added
access-control: 172.16.0.0/12 allow (since all of 172.16.0.0/12 is private IPv4 address space
according to RFC-1918) and reloaded unbound. Now it worked!
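In unbound.conf terms, the change looks like this (the exact file path and reload mechanism vary
by distro; ours is managed by Puppet):

# In the server: section of unbound.conf
server:
    access-control: 172.16.0.0/12 allow

# Then reload, e.g. with
sudo unbound-control reload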
Apr 27 00:11:41 my-host.corp.net dockerd[32659]: time="2021-04-27T00:11:41.957934764Z" level=debug msg="Name To resolve: accounts.google.com."
Apr 27 00:11:41 my-host.corp.net dockerd[32659]: time="2021-04-27T00:11:41.958087666Z" level=debug msg="[resolver] query accounts.google.com. (A) from 172.20.0.127:33346, forwarding to udp:10.99.0.1"
Apr 27 00:11:41 my-host.corp.net dockerd[32659]: time="2021-04-27T00:11:41.960007990Z" level=debug msg="[resolver] received A record \"74.125.202.84\" for \"accounts.google.com.\" from udp:10.99.0.1"
General Debugging Strategies and Techniques I Used
Here are the general debugging strategies I used and reinforced for myself.
When network requests fail, go down one layer of the stack to identify exactly which layer it
fails on, i.e. find out which protocol is responsible for the failure: HTTP, TCP, or IP?
Try to reproduce the error by running the most direct and minimal command that simulates the actual
failure. In this case, an HTTP request made by gcloud was failing, so I translated that into an nc
command that simulated the establishment of the TCP connection between containers. Likewise, a DNS
query from bar was failing, so I translated that into a dig command. And in all these cases, the
origin of the IP packets mattered, so knowing how to use nsenter to enter a network namespace and
create IP packets that originate from the same container was useful. nsenter is essential when
debugging containers that don’t have any tools installed in them. The bar image only contains one
Go-compiled executable; there are no other tools I can use in there.
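As a recap, the pattern looks like this, with placeholder names:

# Reproduce a failing request from inside a container's network namespace
PID=$(docker inspect --format '{{.State.Pid}}' <container>)
sudo nsenter -n -t "$PID" nc -nvz -w 5 <ip> <port>             # TCP-level check
sudo nsenter -n -t "$PID" dig @127.0.0.11 accounts.google.com  # DNS-level check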
I encountered three errors in this case. Tackle one error at a time, and don’t give up.
Be scientific. Have a working hypothesis at each step for which you collect evidence that either
supports or refutes it.
If you get stuck, go back and question your previous assumptions or conclusions. Are there other
tests you can run that can confirm or disprove what you thought was true?
Patches
Error #1: I created a patch that makes our Puppet installation ignore rules created by Docker
networks in the filter table’s FORWARD chain.
Error #2: Unfortunately, I don’t think there’s a good solution to this other than disabling our
GCE VM’s periodic Puppet runs and manually adding a rule to allow packets from the new interface.
The chain’s default policy is DROP, and interface names are dynamic.
Error #3: I made a patch that makes unbound reply to DNS queries with source IPs in the range
172.16.0.0/12.