sdk: support dynamic timeout propagation in sandbox router#857
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
Hi @HasonoCell. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Pull request overview
This PR adds request-level timeout propagation from the SDKs to sandbox-router via an X-Sandbox-Timeout header so long-running sandbox operations aren’t cut off by the router’s default proxy timeout.
Changes:
sandbox-routernow parsesX-Sandbox-Timeoutand applies it as a per-requesthttpxtimeout override.- Python sync/async connectors inject
X-Sandbox-Timeoutfor router-based connections. - Go SDK propagates the remaining context deadline through
X-Sandbox-Timeout, with tests added across router + SDKs.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py | Adds per-request timeout parsing and passes it to httpx.AsyncClient.send(...). |
| clients/python/agentic-sandbox-client/sandbox-router/test_sandbox_router.py | Adds tests validating header override and fallback behavior. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py | Injects X-Sandbox-Timeout into router headers for sync connector. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py | Injects X-Sandbox-Timeout into router headers for async connector. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_connector.py | Adds unit tests for sync connector timeout header injection behavior. |
| clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_async_sandboxclient.py | Adds unit tests for async connector timeout header injection behavior. |
| clients/go/sandbox/types.go | Introduces headerSandboxTimeout constant. |
| clients/go/sandbox/connector.go | Propagates remaining context deadline into X-Sandbox-Timeout for router requests. |
| clients/go/sandbox/files_test.go | Extends header coverage tests to assert timeout header presence and validity. |
| clients/go/sandbox/commands_test.go | Adds test ensuring Run propagates context deadline via X-Sandbox-Timeout. |
|
/ok-to-test |
|
The approach looks good to me. /assign @SHRUTI6991 |
|
/retest |
|
I pushed one follow-up fix after the previous review because the first implementation used Could you take a look at this again?thanks! @aditya-shantanu |
|
By the way I force-pushed because I had accidentally mixed the PR work with local The latest functional change in that force-push was the router follow-up fix to apply request-level timeout through the correct HTTPX request path. |
c25fd35 to
03a2a2a
Compare
|
/retest |
|
|
||
| def test_request_header_below_proxy_timeout_overrides_default(self, client): | ||
| async def capture_send(req, **kwargs): | ||
| capture_send.timeout = req.extensions.get("timeout") |
There was a problem hiding this comment.
It looks like capture_send.timeout = req.ext was left in by mistake before the req.extensions.get call. If req.ext is not available on the httpx.Request object in the version being used, this will raise an AttributeError and cause the test to fail with a 500 instead of the expected 502.
Reminder: Please do not click "Commit suggestion" in the GitHub UI as it adds Copilot as a co-author, which breaks the CNCF CLA check. Apply the change locally instead.
There was a problem hiding this comment.
Thanks for taking a look. I checked the current PR head and there is no leftover capture_send.timeout = req.ext in the test; the code there already uses req.extensions.get("timeout").
I also reran the sandbox-router test suite on the current head, and it passes cleanly (39 passed). I'm wondering if this might be a stale comment.
igooch
left a comment
There was a problem hiding this comment.
One non-blocking suggestion from AI code assist inline below (connect-timeout bounding) — safe to defer.
Looks good to me.
| headers=headers, | ||
| content=request.stream() | ||
| content=request.stream(), | ||
| timeout=timeout, |
There was a problem hiding this comment.
Non-blocking suggestion as this is pre-existing, not a regression — the client default already pinned connect to proxy_timeout, so this is just cheap to fix on the line being edited. Adopting it means updating the connect == read assertions in test_sandbox_router.py.
build_request(timeout=<float>) wraps the value in httpx.Timeout(<float>), which fans that single number out to all four components — connect, read, write, pool (the test confirms connect == read).
But this value is really a read budget for slow commands; tying connect to it means the router waits the full budget (up to 180s) just to open a socket. If a backend is blackholing packets during pod churn, that pins a worker for the whole budget before failing — a refused connection still returns immediately.
| timeout=timeout, | |
| timeout=httpx.Timeout(timeout, connect=min(timeout, 5.0)), |
min(timeout, 5.0) caps connect at 5s without exceeding the overall budget; read/write/pool stay at the full value.
There was a problem hiding this comment.
Thanks, that makes sense. I applied this locally and capped the backend connect timeout with httpx.Timeout(timeout, connect=min(timeout, 5.0)), so the read/write/pool budgets still use the propagated request timeout while connect is bounded to at most 5s.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: HasonoCell, igooch, janetkuo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Propagates client-side request timeouts to the sandbox-router so long-running sandbox operations are not prematurely terminated by the router's default proxy timeout.
Today, SDK calls can accept a longer timeout (for example
sandbox.commands.run(..., timeout=600)), but that timeout is only enforced client-side. When requests are routed throughsandbox-router, the router still applies its own default proxy timeout, which can terminate long-running requests early with504 Gateway Timeout.This PR adds support for request-level timeout propagation by:
sandbox-routerto acceptX-Sandbox-Timeoutand use it as a per-request proxy timeout overrideX-Sandbox-TimeoutWhich issue(s) this PR is related to:
Fixes #820
Summary by CodeRabbit
New Features
Tests