Skip to content

[Serve] Fix 499 misclassification when client disconnects after successful response#61396

Merged
abrarsheikh merged 1 commit into
masterfrom
di-499
Mar 2, 2026
Merged

[Serve] Fix 499 misclassification when client disconnects after successful response#61396
abrarsheikh merged 1 commit into
masterfrom
di-499

Conversation

@abrarsheikh

@abrarsheikh abrarsheikh commented Feb 28, 2026

Copy link
Copy Markdown
Contributor

When the client closed the connection right after receiving the full response, Serve sometimes logged 499 (Client Closed Request) even though the response was complete. This showed up in metrics (e.g. ray_serve_num_http_error_requests_total) and logs.

Root cause

In direct ingress ASGI handling, response_finished was set with:

if msg.get("more_body") is False:

Per the ASGI spec, more_body is optional and defaults to False when omitted. Starlette/FastAPI and Serve’s convert_object_to_asgi_messages omit it on the final body chunk. In that case msg.get("more_body") is None, so None is False is False and response_finished was never set.

When the client disconnected right after the response, receive_task finished first. Because response_finished stayed False, the code treated it as an abort and logged 499.

Before

[2026-03-01 20:16:42,874] ip-10-0-77-90/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
GET      /                                                                              15152     0(0.00%) |   1907      61    2126   2000 |   97.73        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                     15152     0(0.00%) |   1907      61    2126   2000 |   97.73        0.00
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,607 default_Model 15macvxy 9b27b091-30aa-492d-b5c1-6cd334608da7 -- GET / 499 2041.4ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,618 default_Model 15macvxy 3d9c22ca-4569-456f-a79e-b8a14a757911 -- GET / 499 2041.4ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,628 default_Model 15macvxy 10b3012e-41ef-4cc0-87b2-f169a0622ff7 -- GET / 499 2041.1ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,638 default_Model 15macvxy a6cbb63e-6f7d-4209-88cd-d0456567b4df -- GET / 200 2041.5ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,648 default_Model 15macvxy 3c2b9346-c94c-4fc7-942a-723d9ee4c040 -- GET / 499 2041.5ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,658 default_Model 15macvxy af31dfaf-7d5d-4692-80ca-5e5b18304312 -- GET / 499 2041.4ms
image

After

[2026-03-01 20:27:22,287] ip-10-0-77-90/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
GET      /                                                                              33164     0(0.00%) |   1979      16    2070   2000 |   97.88        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                     33164     0(0.00%) |   1979      16    2070   2000 |   97.88        0.00
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,875 default_Model 26ok3pcv a1b1fa79-7326-4671-ac47-faa82b50c694 -- GET / 200 2061.5ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,875 default_Model 26ok3pcv e77b1e4f-bbb9-4cb2-86df-43fd3e077a25 -- GET / 200 2051.3ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,885 default_Model 26ok3pcv 9be3b7ab-9ad0-4b86-b76d-aa79e6d893a4 -- GET / 200 2051.0ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,925 default_Model 26ok3pcv 6c024662-f5f0-42ea-9bc7-74602c456917 -- GET / 200 2081.7ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,926 default_Model 26ok3pcv cd208d85-5293-496b-bf4c-7c8e169f0be3 -- GET / 200 2071.5ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,926 default_Model 26ok3pcv 3c2335fa-586b-437e-8d1d-cb3223a71c35 -- GET / 200 2061.8ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,926 default_Model 26ok3pcv e03e34c2-c794-44e8-a9c9-e0e6c437db65 -- GET / 200 2051.6ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,935 default_Model 26ok3pcv 3815b56b-9603-44e1-8c82-648b444688d7 -- GET / 200 2049.7ms
image
…ssful response

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 28, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a race condition where successful requests were misclassified as 499 errors when the client disconnected immediately after receiving the response. The root cause, related to the handling of the optional more_body field in the ASGI spec, is fixed by changing the condition to correctly default to False when the field is omitted. The addition of a check for http.response.body message type makes the logic more robust. A new test case has been added that effectively reproduces the race condition using concurrent requests, ensuring the fix is well-tested and preventing future regressions. The changes are clear, correct, and well-implemented.

@abrarsheikh abrarsheikh changed the title [Serve] Fix 499 misclassification when client disconnects after succe… Feb 28, 2026
@abrarsheikh abrarsheikh marked this pull request as ready for review March 2, 2026 13:46
@abrarsheikh abrarsheikh requested a review from a team as a code owner March 2, 2026 13:46

@akyang-anyscale akyang-anyscale left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice find!

@abrarsheikh abrarsheikh merged commit 0ef911f into master Mar 2, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the di-499 branch March 2, 2026 19:15
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Mar 2, 2026
kamil-kaczmarek pushed a commit that referenced this pull request Mar 3, 2026
…ssful response (#61396)

When the client closed the connection right after receiving the full
response, Serve sometimes logged **499 (Client Closed Request)** even
though the response was complete. This showed up in metrics (e.g.
`ray_serve_num_http_error_requests_total`) and logs.

### Root cause

In direct ingress ASGI handling, `response_finished` was set with:

```python
if msg.get("more_body") is False:
```

Per the ASGI spec, `more_body` is optional and defaults to `False` when
omitted. Starlette/FastAPI and Serve’s `convert_object_to_asgi_messages`
omit it on the final body chunk. In that case `msg.get("more_body")` is
`None`, so `None is False` is `False` and `response_finished` was never
set.

When the client disconnected right after the response, `receive_task`
finished first. Because `response_finished` stayed `False`, the code
treated it as an abort and logged 499.


## Before
```bash
[2026-03-01 20:16:42,874] ip-10-0-77-90/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
GET      /                                                                              15152     0(0.00%) |   1907      61    2126   2000 |   97.73        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                     15152     0(0.00%) |   1907      61    2126   2000 |   97.73        0.00
```

```bash
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,607 default_Model 15macvxy 9b27b091-30aa-492d-b5c1-6cd334608da7 -- GET / 499 2041.4ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,618 default_Model 15macvxy 3d9c22ca-4569-456f-a79e-b8a14a757911 -- GET / 499 2041.4ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,628 default_Model 15macvxy 10b3012e-41ef-4cc0-87b2-f169a0622ff7 -- GET / 499 2041.1ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,638 default_Model 15macvxy a6cbb63e-6f7d-4209-88cd-d0456567b4df -- GET / 200 2041.5ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,648 default_Model 15macvxy 3c2b9346-c94c-4fc7-942a-723d9ee4c040 -- GET / 499 2041.5ms
(ServeReplica:default:Model pid=14567) INFO 2026-03-01 20:16:42,658 default_Model 15macvxy af31dfaf-7d5d-4692-80ca-5e5b18304312 -- GET / 499 2041.4ms
```

<img width="2688" height="302" alt="image"
src="https://github.com/user-attachments/assets/58875091-80df-4b98-b0e6-184e878f9a66"
/>


## After
```bash
[2026-03-01 20:27:22,287] ip-10-0-77-90/INFO/locust.main: Shutting down (exit code 0)
Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
GET      /                                                                              33164     0(0.00%) |   1979      16    2070   2000 |   97.88        0.00
--------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|-----------
         Aggregated                                                                     33164     0(0.00%) |   1979      16    2070   2000 |   97.88        0.00
```

```bash
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,875 default_Model 26ok3pcv a1b1fa79-7326-4671-ac47-faa82b50c694 -- GET / 200 2061.5ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,875 default_Model 26ok3pcv e77b1e4f-bbb9-4cb2-86df-43fd3e077a25 -- GET / 200 2051.3ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,885 default_Model 26ok3pcv 9be3b7ab-9ad0-4b86-b76d-aa79e6d893a4 -- GET / 200 2051.0ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,925 default_Model 26ok3pcv 6c024662-f5f0-42ea-9bc7-74602c456917 -- GET / 200 2081.7ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,926 default_Model 26ok3pcv cd208d85-5293-496b-bf4c-7c8e169f0be3 -- GET / 200 2071.5ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,926 default_Model 26ok3pcv 3c2335fa-586b-437e-8d1d-cb3223a71c35 -- GET / 200 2061.8ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,926 default_Model 26ok3pcv e03e34c2-c794-44e8-a9c9-e0e6c437db65 -- GET / 200 2051.6ms
(ServeReplica:default:Model pid=29389) INFO 2026-03-01 20:27:19,935 default_Model 26ok3pcv 3815b56b-9603-44e1-8c82-648b444688d7 -- GET / 200 2049.7ms
```

<img width="2691" height="312" alt="image"
src="https://github.com/user-attachments/assets/f3c07d01-db5a-4080-8559-032063ac85d4"
/>

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling serve Ray Serve Related Issue

2 participants