python-runtime-sandbox: long-running /execute can block HTTP readiness and make pods NotReady #1015

Description

@Ryotess

What happened?

The official Python runtime sandbox example appears to be vulnerable to event-loop blocking during long-running /execute calls.

The current upstream runtime defines /execute as an async FastAPI handler, but calls blocking subprocess.run() inside it:

https://github.com/kubernetes-sigs/agent-sandbox/blob/main/examples/python-runtime-sandbox/main.py

At the same time, the official quickstart points users to the ready-made SandboxTemplate:

https://github.com/kubernetes-sigs/agent-sandbox/blob/main/examples/quickstart/README.md

which uses the python-runtime-sandbox:latest-main image and configures HTTP readiness against / every 1 second:

https://github.com/kubernetes-sigs/agent-sandbox/blob/main/clients/python/agentic-sandbox-client/python-sandbox-template.yaml

Because /execute and / are served by the same FastAPI/Uvicorn process, a long-running /execute command can block the event loop and delay the readiness handler. Kubernetes may then mark the sandbox pod as NotReady, even though the user command is still running successfully and the container is not actually unhealthy.

This surfaced for us during pandas / CSV data processing workloads, but the issue seems to come from the upstream example runtime design itself: blocking subprocess execution inside an async endpoint, combined with HTTP readiness served by the same event loop.
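The blocking effect described above can be reproduced without FastAPI or Kubernetes. The following is a minimal sketch, not the upstream runtime: `heartbeat` is a hypothetical stand-in for the readiness handler, `blocking_execute` mimics the shape of an async handler that calls `subprocess.run()`, and the 0.5 second sleep is an arbitrary short substitute for a long-running user command.

```python
import asyncio
import subprocess
import sys
import time

# Hypothetical stand-in for a long-running user command (kept short here).
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(0.5)"]


async def heartbeat(gaps: list, stop: asyncio.Event) -> None:
    """Stand-in for the readiness handler: record how long each 50 ms tick takes."""
    while not stop.is_set():
        start = time.monotonic()
        await asyncio.sleep(0.05)
        gaps.append(time.monotonic() - start)


async def blocking_execute() -> None:
    """Blocking call inside a coroutine: nothing else on the loop can run meanwhile."""
    subprocess.run(SLOW_CMD, capture_output=True, check=True)


async def measure_max_gap() -> float:
    """Return the longest heartbeat tick observed while the command runs."""
    gaps: list = []
    stop = asyncio.Event()
    task = asyncio.create_task(heartbeat(gaps, stop))
    await asyncio.sleep(0.1)   # let the heartbeat start ticking
    await blocking_execute()   # the event loop is stalled for the whole command
    await asyncio.sleep(0.1)   # let the delayed tick complete
    stop.set()
    await task
    return max(gaps)


if __name__ == "__main__":
    print(f"max heartbeat gap: {asyncio.run(measure_max_gap()):.2f}s")
```

In this sketch, one heartbeat tick stretches to roughly the full duration of the command. With a 1 second probe timeout, any command longer than about a second would plausibly produce the same kind of timeout in the real runtime.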

How can we reproduce it (as minimally and precisely as possible)?

Using the upstream quickstart/runtime path:

  1. Apply the official Python SandboxTemplate from:
    clients/python/agentic-sandbox-client/python-sandbox-template.yaml

    This template uses:

    • python-runtime-sandbox:latest-main
    • HTTP readiness probe: GET /
    • periodSeconds: 1
    • no explicit timeoutSeconds, so Kubernetes defaults to 1 second
  2. Create a sandbox using that template.

  3. Run a long-running command through /execute, for example:

python3 -c "import time; time.sleep(30); print('done')"

or any data-processing command (for example, pandas CSV processing) that runs long enough.

  4. Watch pod events while /execute is still running:

kubectl describe pod <sandbox-pod>
kubectl get events --field-selector involvedObject.name=<sandbox-pod>

Observed event pattern:

Readiness probe failed: context deadline exceeded
Client.Timeout exceeded while awaiting headers

In runtime logs, readiness GET / requests are delayed until the /execute request completes.

Version

main @ 262aa41

Anything else we need to know?

I may be missing intended usage guidance here. If long-running /execute commands are expected, should the example runtime avoid blocking the event loop, or should the sample readiness probe be documented differently?

Possible fixes:

  1. Use asyncio.create_subprocess_exec() in /execute instead of blocking subprocess.run().

    Example shape:

    process = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        cwd="/app",
    )
    stdout, stderr = await process.communicate()

  2. Alternatively, make /execute a sync FastAPI endpoint (def, not async def) so FastAPI runs it in a threadpool rather than blocking the event loop.

  3. Optionally keep a single-command lock if the intended runtime semantics are one command at a time per sandbox.

  4. Add a regression test where GET / remains responsive while /execute is running.
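A rough sketch of how the asyncio-based fix, the optional single-command lock, and the responsiveness check could fit together, using only the standard library. `run_command`, `max_tick_delay_during`, and `check_responsive` are hypothetical names, not part of the upstream runtime. The `cwd` parameter defaults to `None` here so the sketch runs outside the container, where the example shape above uses `cwd="/app"`. The 50 ms tick stands in for GET / requests.

```python
import asyncio
import sys
import time
from typing import Optional


async def run_command(args: list, lock: asyncio.Lock, cwd: Optional[str] = None) -> tuple:
    """Run one command without blocking the event loop; the lock serializes commands."""
    async with lock:
        process = await asyncio.create_subprocess_exec(
            *args,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
            cwd=cwd,
        )
        stdout, stderr = await process.communicate()
    return process.returncode, stdout.decode(), stderr.decode()


async def max_tick_delay_during(coro) -> tuple:
    """Run coro while ticking every 50 ms; return the slowest tick and coro's result."""
    worst = 0.0
    task = asyncio.ensure_future(coro)
    while not task.done():
        start = time.monotonic()
        await asyncio.sleep(0.05)
        worst = max(worst, time.monotonic() - start)
    return worst, task.result()


async def check_responsive() -> tuple:
    """Regression-style check: the loop keeps ticking while a command is running."""
    lock = asyncio.Lock()  # created inside the running loop
    cmd = [sys.executable, "-c", "import time; time.sleep(0.5); print('done')"]
    return await max_tick_delay_during(run_command(cmd, lock))


if __name__ == "__main__":
    worst, (returncode, out, err) = asyncio.run(check_responsive())
    print(f"returncode={returncode} stdout={out.strip()!r} worst_tick={worst:.3f}s")
```

A real regression test would presumably issue HTTP requests against the running app instead of measuring ticks, but the property being checked is the same: other work on the loop is not delayed by the duration of the command.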

Our local workaround was to replace the runtime server with an API-compatible version that uses asyncio.create_subprocess_exec() for /execute. With that change, a pandas workload completed successfully while HTTP readiness remained healthy:

not_ready_samples: 0
restart_count: 0
unhealthy_events_since_workload: 0
