A story about debugging an opaque network issue
So, our development work has us operating across a number of different infrastructure environments, each with its own particular requirements. In this case the requirements were fairly normal: there was us, a server, and docker running on that server.
However, there was one small twist: access to this server was only granted over a VPN. So, like any practical nerds, we set up a permanent site → site VPN that we could reach from within the office. Great! Everything works.
Usually.
Unfortunately, we ended up in a situation something like the following:
$ ssh ${SERVER}
$ sudo docker-compose up -d
(unresponsive shell)
This was a truly odd bug, and it only happened sometimes. At first it wasn’t clear that it was a docker bug at all: occasionally SSH would simply hang, and it took a little while before we determined the culprit was ${SOMETHING_WITH_DOCKER}.
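If you find yourself staring at a hang like this, a verbose connection attempt (standard OpenSSH flags, nothing specific to this setup) at least shows whether things stall before or after the TCP connection is established:
$ ssh -vvv ${SERVER}   # the debug output stops right where the connection does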
Things that could go wrong #1: Port conflict
It’s possible, in principle, to drop SSH access by setting up an application that also requires SSH (GitLab, say), and forwarding that port with docker to 0.0.0.0:22. iptables will chomp all your rules and you’ll suddenly be completely unable to SSH into the machine.
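Had that theory been right, it would have shown up quickly in two places: whatever is bound to port 22, and the NAT rules docker manages. These are generic checks rather than anything lifted from our actual session:
$ sudo ss -tlnp 'sport = :22'         # what is actually listening on the host's port 22?
$ sudo iptables -t nat -L DOCKER -n   # which ports has docker published via DNAT?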
At least, that’s what we thought. It turns out this wasn’t the case. Once we regained access to the machine, thanks to our gracious hosts, we determined:
We didn’t have a service forwarding port 22 at all
All traffic was dropped: ICMP was what we noticed first, but the other protocols we tested fared no better
ICMP was available from outside our network
So it was clearly “Something something VPN”
Things that could go wrong #2: Firewall conflicts
In this environment there are restrictive egress firewalls in place to prevent data being stolen off the machine.
These firewalls were presumably reloaded after each invocation of docker-compose ${SOMETHING}, as docker created and deleted virtual adapters.
However, for this to be true the problem would need to be consistent. Unfortunately, the problem only happened sporadically.
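If you want to verify the adapter-churn half of that theory, plain iproute2 is enough to watch bridges come and go as compose networks are created and destroyed (a generic command, not from our notes):
$ watch -n1 'ip -br link show type bridge'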
Things that actually went wrong #3: Docker allocates a subnet in use
We left it alone for some time. In the meantime, we’d figured out that, somehow, we were able to gain access to the machine through another machine, which allowed us to keep working even in the disaster case.
Notably, this also allowed us to snapshot machine state when it was broken. So, we waited.
Aaaand waited.
Aaaaaaaaaaand waited.
A day later: It breaks again.
So, we figured it was network-related. We ran ip addr, saved the output to a file at /tmp/foo, and rebooted docker:
$ docker-compose stop
$ docker-compose up -d
It came up again. Odd. Okay, let’s save the output of ip addr into /tmp/bar and diff them. We fairly quickly noted the oddity:
660: br-8221f7b9a761: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:b4:7a:15:82 brd ff:ff:ff:ff:ff:ff
inet 192.168.0.1/16 brd 192.168.255.255 scope global br-8221f7b9a761
valid_lft forever preferred_lft forever
inet6 fe80::42:b4ff:fe7a:1582/64 scope link
valid_lft forever preferred_lft forever
In one case, there was an uncomfortably familiar IP range — the office network range. Could it be … an IP address conflict?
We then found a command that would reliably reproduce the breakage:
$ sudo docker network create --ip-range=192.168.0.1/32 --subnet=192.168.0.0/16 testing
Boom! Network dead.
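If you reproduce this yourself, the out-of-band access path is what saves you: assuming nothing is still attached to the conflicting network, removing it is all it takes for connectivity to come back.
$ sudo docker network rm testing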
Theorizing
Broadly, the problem is that docker rotates through a set of IP ranges that it’s allowed to allocate from. These ranges are all private:
10.0.0.0–10.255.255.255
172.16.0.0–172.31.255.255
192.168.0.0–192.168.255.255
However, the site → site VPN meant that we were not connecting from one public network to another, as is usual, but between two private networks.
Once docker allocated a range that conflicted with our gateway, traffic would reach the machine successfully, but replies would be inadvertently routed into the docker-compose network rather than back to the office as intended.
This issue happened sporadically because docker iterates through a limited set of the private IP ranges by default, so it was only a matter of time before the conflicting range came around again.
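If you suspect you are already sitting on a collision like this, docker will tell you which subnets it has handed out. The one-liner below is just a convenience; the format string walks each network’s IPAM config:
$ docker network ls -q | xargs docker network inspect -f '{{.Name}}: {{range .IPAM.Config}}{{.Subnet}} {{end}}'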
Resolution
The resolution at this point is straightforward: keep docker away from the 192.168.0.0/24 subnet. However, it does highlight some of the fragility of coordinating site → site VPNs, particularly when we don’t control both sides of the connection.
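One way to make “keep off that subnet” stick, rather than relying on everyone remembering it, is to pin docker’s allocation pool. This assumes a docker daemon recent enough to support default-address-pools in its config, and systemd for the restart; the 10.200.0.0/16 base is just an example range that overlaps neither the office nor the VPN side:
$ cat /etc/docker/daemon.json
{
  "default-address-pools": [
    { "base": "10.200.0.0/16", "size": 24 }
  ]
}
$ sudo systemctl restart docker
A per-project alternative is to pin the subnet explicitly in each compose file’s ipam section, but that only protects the projects that remember to do it.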
In Conclusion
This was a satisfying bug to chase down: it was not easy to debug, it stopped work in a dramatic way, and it wasn’t easily reproducible. However, we learned a little about how this infrastructure is set up, and rebuilt some of our network knowledge as a team.
Nerd success. Happy Friday, y’all.
Thanks
Behrouz Abbasi helped me debug this one.
Anton Boritskiy also helped, and had faith in us to deal with it … eventually.