WebSocket is not working behind an AWS load balancer?
Recently I set up an internal tool for engineers inside our company to run one-off jobs without going to the terminal. It is a simple FastAPI service that uses WebSocket to send live logs to the UI. The traffic path is below:
cloudflare -> AWS classic LB -> Nginx ingress -> buzzfeed sso -> istio -> uvicorn -> fastAPI app
Here both Nginx and buzzfeed sso are reverse proxies, and buzzfeed sso also acts as the authentication layer.
The problem is that I can see the live logs locally, but they do not show up after I deploy to production.
First, how do I know there is a problem, and what is it?
When I ran a one-off job, the server log showed this entry:
INFO: 127.0.0.6:60295 - "GET /ws/abcdefghij HTTP/1.1" 404 Not Found
so I knew the request came in, but somehow the router did not recognize it. I also wondered why it was logged as HTTP instead of WebSocket.
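For context, the endpoint is registered as a WebSocket route, roughly like the sketch below (a minimal illustration with made-up names, not the actual app). A route declared with @app.websocket only matches WebSocket connections, so a request that arrives as plain HTTP falls through the router and gets a 404.
# Minimal sketch of a FastAPI WebSocket endpoint; path and names are made up.
# A plain HTTP GET to /ws/{job_id} does not match this route and returns 404.
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/{job_id}")
async def stream_logs(websocket: WebSocket, job_id: str) -> None:
    await websocket.accept()
    # The real service would stream live job logs here; this just sends one line.
    await websocket.send_text(f"streaming logs for job {job_id}")
    await websocket.close()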
The ASGI standard, async def application(scope, receive, send):, says the first argument scope has a type key that indicates the protocol name. I read the starlette source code and confirmed it should be websocket for a WebSocket connection, so it made no sense to see an HTTP record. Meanwhile, I suspected this HTTP request was probably the WebSocket handshake message, but I did not see any log line about upgrading to WebSocket.
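To make the scope type concrete, here is a minimal raw ASGI application, purely for illustration (the module name in the comment is made up), that only logs scope["type"]: it should print websocket for a handshake and http for a normal request.
# Minimal raw ASGI app that logs the protocol type of every incoming scope.
# Sketch for illustration only; run it with e.g. `uvicorn asgi_probe:application`.
async def application(scope, receive, send):
    if scope["type"] == "lifespan":          # startup/shutdown events, not requests
        return
    print(f"scope type: {scope['type']}")    # expect "websocket" for a handshake
    if scope["type"] == "websocket":
        await receive()                                # consume "websocket.connect"
        await send({"type": "websocket.close"})        # reject the handshake
    else:
        await send({"type": "http.response.start", "status": 204, "headers": []})
        await send({"type": "http.response.body", "body": b""})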
It took me a while to find a way to see what was going on inside uvicorn + FastAPI: a monkey patch.
import logging

import httptools
import uvicorn.protocols.http.httptools_impl

logger = logging.getLogger("uvicorn.error")


def new_data_received(self, data: bytes) -> None:
    self._unset_keepalive_if_required()
    try:
        # Log the raw bytes of every request before uvicorn parses them.
        logger.info(f"xxx data {data.decode('utf-8')}")
        self.parser.feed_data(data)
    except httptools.HttpParserError:
        msg = "Invalid HTTP request received."
        self.logger.warning(msg)
        self.send_400_response(msg)
        return
    except httptools.HttpParserUpgrade:
        self.handle_upgrade()


uvicorn.protocols.http.httptools_impl.HttpToolsProtocol.data_received = (
    new_data_received
)
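The patch has to run in the same process as the server, so one option (a sketch with placeholder names, not the real project layout) is to apply it at the top of the entry module and start uvicorn programmatically:
# Sketch: start uvicorn after the patch above has been applied in this module.
# "main:app" and the port are placeholders.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=8004)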
What I saw is below:
GET /ws/b06df69c-2341-476f-b857-58c3f690fd57 HTTP/1.1
host: one-off-master.staging-oneoff.svc.cluster.local:8004
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36
accept-encoding: gzip
accept-language: en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,es;q=0.6,ru;q=0.5,fr;q=0.4
cache-control: no-cache
cdn-loop: cloudflare
cf-connecting-ip: 98.227.7.99
cf-ipcountry: US
cf-ray: 7212afd23c762bff-ORD
cf-visitor: {"scheme":"https"}
cookie: session=<...>
origin: https://staging-oneoff.tryevergreen.com
pragma: no-cache
sec-websocket-extensions: permessage-deflate; client_max_window_bits
sec-websocket-key: bIOl9pXidSou2h391fMsQg==
sec-websocket-version: 13
x-forwarded-email: [email protected]
x-forwarded-for: 172.31.93.191, 127.0.0.6
x-forwarded-groups:
x-forwarded-host: staging-oneoff.tryevergreen.com
x-forwarded-host: staging-oneoff.tryevergreen.com
x-forwarded-port: 80
x-forwarded-proto: http
x-forwarded-user: xiong
x-original-forwarded-for: 98.227.7.99, 172.70.130.172
x-real-ip: 172.31.93.191
x-request-id: c2b936fa4e66077ab0573418c9b86b2a
x-scheme: http
x-envoy-attempt-count: 1
x-forwarded-client-cert: <...>
x-b3-traceid: 40e4f85b4d4a7e6d345892d4ba4eb999
x-b3-spanid: 553262c905b63a40
x-b3-parentspanid: ebde97693103cd06
x-b3-sampled: 0
Ah! The message is indeed the handshake request, but the Upgrade and Connection headers are missing. Where were they dropped?
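This also explains why the log line shows a plain HTTP request: uvicorn only switches protocols when httptools raises HttpParserUpgrade during parsing (the handle_upgrade() branch in the patched method above), and that only happens when the Connection and Upgrade headers are present. A small sketch to illustrate, with a made-up host and path:
# Sketch: feed two hand-written requests into httptools (the parser uvicorn uses)
# to show when an upgrade is signalled. Host and path are made up.
import httptools

class Probe:
    # httptools calls back into this object; no callbacks are needed here.
    pass

with_upgrade = (
    b"GET /ws/abcdefghij HTTP/1.1\r\n"
    b"Host: example.internal\r\n"
    b"Connection: Upgrade\r\n"
    b"Upgrade: websocket\r\n"
    b"\r\n"
)
without_upgrade = (
    b"GET /ws/abcdefghij HTTP/1.1\r\n"
    b"Host: example.internal\r\n"
    b"\r\n"
)

for raw in (with_upgrade, without_upgrade):
    parser = httptools.HttpRequestParser(Probe())
    try:
        parser.feed_data(raw)
        print("plain HTTP request")   # routed by the HTTP router, hence the 404
    except httptools.HttpParserUpgrade:
        print("upgrade detected")     # uvicorn would call handle_upgrade() here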
How do I know Istio is not the problem?
I first suspected Istio dropped the WebSocket headers, because the sso services run in a different EKS cluster than our FastAPI server, and I had seen a few posts online complaining about WebSocket support in Istio. So I quickly tested the hypothesis.
The first thing I tried was a Python library called websockets. I created a new pod running a Python image in the sso cluster and then ran the command below.
python -m websockets ws://one-off-master.staging-oneoff.svc.cluster.local:8004/ws/9adef6d1-0a43-4a9a-928c-7cc582f90051
It works. Later on, I found an even simpler way: just go into the sso-proxy pod and run the curl command below.
curl -i -X GET \
-H "Connection: Upgrade" \
-H "Upgrade: websocket" \
-H "sec-websocket-extensions: permessage-deflate; client_max_window_bits" \
-H "sec-websocket-key: bIOl9pXidSou2h391fMsQg==" -H "sec-websocket-version: 13" \
-H "accept-encoding: gzip" \
-H "cdn-loop: cloudflare" \
one-off-master.staging-oneoff.svc.cluster.local:8004/ws/abcdefghijk
A 101 response was received and the terminal hung there waiting for input/output, which means the WebSocket handshake succeeded and a TCP connection was hijacked for me. Hijack is a word I learned from golang; see http.Hijacker.
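For reference, the earlier websockets check can also be scripted instead of run through the CLI entry point (a sketch; the URL and job ID are examples):
# Sketch: open the WebSocket from inside the cluster and print whatever the
# server streams back. Requires the `websockets` package; the URL is an example.
import asyncio
import websockets

async def main() -> None:
    url = "ws://one-off-master.staging-oneoff.svc.cluster.local:8004/ws/abcdefghij"
    async with websockets.connect(url) as ws:    # performs the HTTP 101 handshake
        async for message in ws:
            print(message)

asyncio.run(main())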
How do I know sso is not the problem?
Now I suspected sso dropped the WebSocket headers. First, I spent a lot of time reading the sso source code. It is not overly complicated: the proxy part is basically a wrapper around the Go standard library's reverseproxy.go, and Go's ReverseProxy supports WebSocket natively. It clearly adds the headers back after dropping them:
// After stripping all the hop-by-hop connection headers above, add back any
// necessary for protocol upgrades, such as for websockets.
if reqUpType != "" {
	outreq.Header.Set("Connection", "Upgrade")
	outreq.Header.Set("Upgrade", reqUpType)
}
So sso should not be the culprit. If you wonder why a reverse proxy drops these headers in the first place, you can google hop-by-hop headers.
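To make the hop-by-hop idea concrete, below is a simplified sketch of what a reverse proxy does (not the sso code): the RFC-defined hop-by-hop headers, plus anything named in the Connection header, apply only to a single hop and are stripped before forwarding, which is exactly why Upgrade and Connection have to be added back for an upgrade request.
# Simplified sketch of hop-by-hop header stripping, as a reverse proxy does it.
# Not the actual sso code; it just illustrates why Upgrade/Connection disappear.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailer", "transfer-encoding", "upgrade",
}

def strip_hop_by_hop(headers: dict[str, str]) -> dict[str, str]:
    # Headers named in the Connection header are also treated as hop-by-hop.
    named = {
        token.strip().lower()
        for token in headers.get("connection", "").split(",")
        if token.strip()
    }
    drop = HOP_BY_HOP | named
    return {k: v for k, v in headers.items() if k.lower() not in drop}

incoming = {
    "Host": "staging-oneoff.tryevergreen.com",
    "Connection": "Upgrade",
    "Upgrade": "websocket",
    "Sec-WebSocket-Key": "bIOl9pXidSou2h391fMsQg==",
}
print(strip_hop_by_hop(incoming))   # Upgrade and Connection are gone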
The second piece of evidence comes from the Nginx investigation below.
How do I know Nginx is not the problem?
I went into the Nginx pod and found this section in /etc/nginx/nginx.conf:
# Allow websocket connections
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
Therefore, Nginx should not be the problem. Meanwhile, running the request below inside the Nginx pod returned a 101 response.
curl -i -X GET \
-H "Connection: Upgrade" \
-H "Upgrade: websocket" \
-H "sec-websocket-extensions: permessage-deflate; client_max_window_bits" \
-H "sec-websocket-key: bIOl9pXidSou2h391fMsQg==" -H "sec-websocket-version: 13" \
-H "accept-encoding: gzip" \
-H "cdn-loop: cloudflare" \
-H 'Host: staging-oneoff.tryevergreen.com' \
localhost:80/ws/abcdefghij
This means Nginx correctly sent the headers to sso, and sso correctly forwarded them to Istio. There is no problem in the chain Nginx -> sso -> Istio -> uvicorn -> fastAPI.
One hack worth mentioning: I added skip_auth_regex to the sso-proxy config to bypass the authentication step. Without it, the request above would be redirected to the sso-oauth service and ask me to log in.
Now I know the LB or Cloudflare is the problem.
LB or Cloudflare? Luckily, I found this post: https://github.com/kubernetes/ingress-nginx/issues/3746
How do I verify it is an LB problem? Same as above, we can curl the load balancer directly. First, we need to find the LB DNS name or A record. See below:
% k get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
nginx-internal-ingress-nginx-controller LoadBalancer 10.100.37.123 ab4689a0641b74457a48a8575ddf7cf5-1999473190.us-east-2.elb.amazonaws.com 80:30931/TCP,443:32259/TCP 579d
Then we can replace localhost in the curl command above with ab4689a0641b74457a48a8575ddf7cf5-1999473190.us-east-2.elb.amazonaws.com.
OK. I reproduced the error message in the service log.
INFO: 127.0.0.6:60295 - "GET /ws/abcdefghij HTTP/1.1" 404 Not Found
After changing the load balancer from Classic to NLB, WebSocket started working.