Skip to content

sdk: support dynamic timeout propagation in sandbox router#857

Merged
k8s-ci-robot merged 6 commits into
kubernetes-sigs:mainfrom
HasonoCell:main
Jun 13, 2026
Merged

sdk: support dynamic timeout propagation in sandbox router#857
k8s-ci-robot merged 6 commits into
kubernetes-sigs:mainfrom
HasonoCell:main

Conversation

@HasonoCell

@HasonoCell HasonoCell commented May 23, 2026

Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

Propagates client-side request timeouts to the sandbox-router so long-running sandbox operations are not prematurely terminated by the router's default proxy timeout.

Today, SDK calls can accept a longer timeout (for example sandbox.commands.run(..., timeout=600)), but that timeout is only enforced client-side. When requests are routed through sandbox-router, the router still applies its own default proxy timeout, which can terminate long-running requests early with 504 Gateway Timeout.

This PR adds support for request-level timeout propagation by:

  • teaching sandbox-router to accept X-Sandbox-Timeout and use it as a per-request proxy timeout override
  • updating the Python sync and async SDK connectors to inject X-Sandbox-Timeout
  • updating the Go SDK connector to propagate the remaining request deadline through the same header
  • adding tests for router behavior and SDK header propagation

Which issue(s) this PR is related to:

Fixes #820

Summary by CodeRabbit

  • New Features

    • Go and Python clients now propagate request timeouts via an HTTP header so downstream services and the router can honor client deadlines and enforce per-attempt caps.
    • Router applies per-request timeout selection, falls back to configured proxy timeout for invalid/missing values, and caps excessive values; backend timeouts are surfaced as 504 responses.
  • Tests

    • Added and expanded tests validating timeout header propagation, router timeout selection/capping, and timeout-related response behavior across clients and proxy paths.
@netlify

netlify Bot commented May 23, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 4a7a2d6
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a2775d681c9280008c8e1ac
@k8s-ci-robot k8s-ci-robot requested review from barney-s and janetkuo May 23, 2026 08:03
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 23, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Hi @HasonoCell. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 23, 2026
@janetkuo janetkuo requested a review from Copilot May 23, 2026 09:00

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds request-level timeout propagation from the SDKs to sandbox-router via an X-Sandbox-Timeout header so long-running sandbox operations aren’t cut off by the router’s default proxy timeout.

Changes:

  • sandbox-router now parses X-Sandbox-Timeout and applies it as a per-request httpx timeout override.
  • Python sync/async connectors inject X-Sandbox-Timeout for router-based connections.
  • Go SDK propagates the remaining context deadline through X-Sandbox-Timeout, with tests added across router + SDKs.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py Adds per-request timeout parsing and passes it to httpx.AsyncClient.send(...).
clients/python/agentic-sandbox-client/sandbox-router/test_sandbox_router.py Adds tests validating header override and fallback behavior.
clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py Injects X-Sandbox-Timeout into router headers for sync connector.
clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py Injects X-Sandbox-Timeout into router headers for async connector.
clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_connector.py Adds unit tests for sync connector timeout header injection behavior.
clients/python/agentic-sandbox-client/k8s_agent_sandbox/test/unit/test_async_sandboxclient.py Adds unit tests for async connector timeout header injection behavior.
clients/go/sandbox/types.go Introduces headerSandboxTimeout constant.
clients/go/sandbox/connector.go Propagates remaining context deadline into X-Sandbox-Timeout for router requests.
clients/go/sandbox/files_test.go Extends header coverage tests to assert timeout header presence and validity.
clients/go/sandbox/commands_test.go Adds test ensuring Run propagates context deadline via X-Sandbox-Timeout.
Comment thread clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py Outdated
Comment thread clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py Outdated
@vicentefb

Copy link
Copy Markdown
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 26, 2026
@aditya-shantanu

Copy link
Copy Markdown
Collaborator

The approach looks good to me.
/lgtm

/assign @SHRUTI6991

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 26, 2026
@HasonoCell

Copy link
Copy Markdown
Contributor Author

/retest

@HasonoCell

Copy link
Copy Markdown
Contributor Author

I pushed one follow-up fix after the previous review because the first implementation used httpx.AsyncClient.send(..., timeout=...), which does not support request-level timeout overrides in the httpx version used here and caused the Python SDK router/gateway E2E tests to fail. The follow-up commit switches the router to apply the request timeout at request-build time, and the presubmits are now green again.

Could you take a look at this again?thanks! @aditya-shantanu

@HasonoCell

Copy link
Copy Markdown
Contributor Author

By the way I force-pushed because I had accidentally mixed the PR work with local main / fork-sync history while rebasing onto the latest upstream/main.To clean that up, I moved the in-progress changes onto a temporary branch, fast-forwarded the final fix back onto the PR branch, and force-pushed so the PR would end up with a clean linear history instead of a mixed branch state. That is why the commit history in this PR appears to have been refreshed :)

The latest functional change in that force-push was the router follow-up fix to apply request-level timeout through the correct HTTPX request path.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2026
@HasonoCell HasonoCell force-pushed the main branch 2 times, most recently from c25fd35 to 03a2a2a Compare May 30, 2026 08:59
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2026
@HasonoCell

Copy link
Copy Markdown
Contributor Author

/retest


def test_request_header_below_proxy_timeout_overrides_default(self, client):
async def capture_send(req, **kwargs):
capture_send.timeout = req.extensions.get("timeout")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like capture_send.timeout = req.ext was left in by mistake before the req.extensions.get call. If req.ext is not available on the httpx.Request object in the version being used, this will raise an AttributeError and cause the test to fail with a 500 instead of the expected 502.

Reminder: Please do not click "Commit suggestion" in the GitHub UI as it adds Copilot as a co-author, which breaks the CNCF CLA check. Apply the change locally instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. I checked the current PR head and there is no leftover capture_send.timeout = req.ext in the test; the code there already uses req.extensions.get("timeout").

I also reran the sandbox-router test suite on the current head, and it passes cleanly (39 passed). I'm wondering if this might be a stale comment.

@igooch igooch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One non-blocking suggestion from AI code assist inline below (connect-timeout bounding) — safe to defer.

Looks good to me.

headers=headers,
content=request.stream()
content=request.stream(),
timeout=timeout,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking suggestion as this is pre-existing, not a regression — the client default already pinned connect to proxy_timeout, so this is just cheap to fix on the line being edited. Adopting it means updating the connect == read assertions in test_sandbox_router.py.

build_request(timeout=<float>) wraps the value in httpx.Timeout(<float>), which fans that single number out to all four components — connect, read, write, pool (the test confirms connect == read).

But this value is really a read budget for slow commands; tying connect to it means the router waits the full budget (up to 180s) just to open a socket. If a backend is blackholing packets during pod churn, that pins a worker for the whole budget before failing — a refused connection still returns immediately.

Suggested change
timeout=timeout,
timeout=httpx.Timeout(timeout, connect=min(timeout, 5.0)),

min(timeout, 5.0) caps connect at 5s without exceeding the overall budget; read/write/pool stay at the full value.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes sense. I applied this locally and capped the backend connect timeout with httpx.Timeout(timeout, connect=min(timeout, 5.0)), so the read/write/pool budgets still use the propagated request timeout while connect is bounded to at most 5s.

@igooch igooch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@janetkuo janetkuo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HasonoCell, igooch, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. ready-for-review size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

8 participants