That’s a nice service. Be a shame if it… went missing.
Debugging perfectly good pages returning a 404
So, as part of ongoing learning I am developing what would have to be the most complex wedding website in the history of man. It is a golang powered, multi-service, JSON → gRPC transcoded istio mediated REST API consumed by an Angular frontend service, and I had a bug with CORS.
It actually started out as a CORS bug. The API is transcoded from REST to gRPC by the excellent gateway written in Golang and forwarded on to the service. This worked fine with curl, but a browser is stricter: before a page may call an API on another origin, the browser expects the API to send Cross Origin Resource Sharing (CORS) headers, as a protection against malicious cross-site requests. I hadn’t implemented these. I did, pushed it to prod and the problem went away.
Sometimes.
The problem: the page would sometimes return a 404 on the OPTIONS request.
Debuggery
Theory 1: I can’t develop well
The simplest theory is that when implementing the CORS component of the application I missed some request property in the routing stack that the browser fired off.
However, after looking for this for a while, I used the “copy as curl” feature of Chrome to reproduce the issue outside the browser so I could manipulate the request more easily.
And it worked:
curl 'https://api.tld.com/v1alpha2/check-in' \
-X OPTIONS \
-H 'Host: api.tld.com' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
-H 'Accept-Language: en-US,en;q=0.5' --compressed \
-H 'Access-Control-Request-Method: GET' \
-H 'Access-Control-Request-Headers: authorization' \
-H 'Origin: http://tld.local' \
-H 'Connection: keep-alive' \
-H 'Cache-Control: max-age=0' \
-I
HTTP/2 200
access-control-allow-origin: http://tld.local
access-control-allow-credentials: true
access-control-allow-methods: GET,POST,PATCH,PUT,DELETE,OPTIONS
access-control-allow-headers: Content-Type,Accept,Authorization
date: Fri, 01 Feb 2019 09:40:38 GMT
server: envoy
That was the bizarre part.
Theory 2: Envoy is not propagating CORS requests
Another theory while debugging was that Envoy was somehow not propagating the CORS requests. To address this, I introduced the CORS configuration to Istio such that Istio should return CORS headers appropriately.
While this worked (the response above is generated by Istio and not the app), the 404 persisted.
Theory 3: There’s a bug in Chrome
The next theory was that perhaps this was a Chrome-specific bug. Indeed, after switching to Firefox the issue didn’t initially present.
However, after a few minutes it presented differently.
The OPTIONS request now worked! Hooray! However, the authentication server started to 404. Reloading the page caused the authentication server to respond correctly, but the API server to 404.
Theory 4: Envoy is not delimiting load balancing requests
The last theory was that the connection was being reused across multiple domains, and that sharing the connection was causing ${SOMETHING} to go wrong with the Istio/Envoy combination.
The necessary background is that api.tld.com and login.tld.com both resolve to the same IP:
$ dig api.tld.com +short
35.205.247.239
$ dig login.tld.com +short
35.205.247.239
They are additionally sharing an SSL certificate with SAN extensions.
$ openssl s_client -connect andrewhowden.com:443 -showcerts | openssl x509 -noout -text
X509v3 Subject Alternative Name:
DNS:andrewhowden.com, DNS:api.tld.com, DNS:login.tld.com, tld.com, DNS:pgp.andrewhowden.com, DNS:www.tl.com, DNS:www.tld.com
That means in principle, the connection can be reused. It took a little while to reach this conclusion, and when I did reach it I looked for ways to verify it. It turns out it’s surprisingly difficult to verify whether this connection is being shared: chrome://net-internals shows two connections, one prefixed with a pm/.
The “fix”
I had previously done some research on how TLS works, and both hosts require TLS. Each connection negotiates an ephemeral symmetric key, but because TLS sits at the connection (TCP) layer rather than the HTTP layer, a connection negotiated for one host can carry requests for the other when the certificate covers both.
So, to force different connections I gave them different certificates, without the SAN extension. Different certificates mean different public keys, and a symmetric key that cannot be shared across the two connections.
No more connection reuse!
The issue immediately went away. It was ~11:30PM at the time, and I kind of could not believe it.
Learnings
Like many bugs, this one came down to discovering that my conceptual model of how things worked (that is, separate connections per domain) did not match reality.
Having done the theoretical reading about TCP and TLS beforehand, I knew in principle what the problem could be. Despite not being able to “prove” that my theory was correct, pushing the “fix” was harmless if the theory was wrong, and resolved the bug if it was right.
So, the process was just:
Investigate
Hypothesize
Alter
Repeat
Additionally, I was debugging this late at night and on a personal project. Late at night meant I was tired, but a personal project meant that the worst person I could disappoint is me. The tiredness contributed to an inability to think creatively about the problem, but there was also no stress of getting it fixed — I could only disappoint myself (and wifey).
Rediscovery
While writing this post, I found not only that I was not the first to discover this, but that there is a “better” fix. I will implement this … at some point.
404 NR when using browser on multiple ingress gateways · Issue #9429 · istio/istio (github.com)
Additionally, it turns out that the sharing of connections is part of the specification of HTTP/2.
In Conclusion
Bugs are hard. Debugging them is interesting, and requires some combination of previous research, creative thinking and stubbornness.
❤