python-runtime-sandbox: long-running /execute can block HTTP readiness and make pods NotReady #1015

What happened?

The official Python runtime sandbox example appears to be vulnerable to event-loop blocking during long-running /execute calls.

The current upstream runtime defines /execute as an async FastAPI handler, but calls blocking subprocess.run() inside it:

https://github.com/kubernetes-sigs/agent-sandbox/blob/main/examples/python-runtime-sandbox/main.py

At the same time, the official quickstart points users to the ready-made SandboxTemplate:

https://github.com/kubernetes-sigs/agent-sandbox/blob/main/examples/quickstart/README.md

which uses the python-runtime-sandbox:latest-main image and configures HTTP readiness against / every 1 second:

https://github.com/kubernetes-sigs/agent-sandbox/blob/main/clients/python/agentic-sandbox-client/python-sandbox-template.yaml

Because /execute and / are served by the same FastAPI/Uvicorn process, a long-running /execute command can block the event loop and delay the readiness handler. Kubernetes may then mark the sandbox pod as NotReady, even though the user command is still running successfully and the container is not actually unhealthy.
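The blocking effect can be shown without FastAPI or Kubernetes. The following is a minimal sketch using only the standard library; `blocking_handler` and `measure_probe_delay` are hypothetical names, with `time.sleep()` standing in for the blocking `subprocess.run()` call and a trivial coroutine standing in for the readiness handler.

```python
import asyncio
import time


async def blocking_handler(seconds: float) -> None:
    # Stand-in for an async /execute that makes a blocking call.
    # time.sleep() never yields to the event loop, just like subprocess.run().
    time.sleep(seconds)


async def record_time() -> float:
    # Stand-in for the readiness handler: it only needs one turn of the loop.
    return time.monotonic()


async def measure_probe_delay(block_seconds: float) -> float:
    """Return how long a trivial coroutine waits while the handler blocks."""
    start = time.monotonic()
    execute = asyncio.create_task(blocking_handler(block_seconds))
    probe = asyncio.create_task(record_time())
    ran_at = await probe  # cannot run until blocking_handler returns
    await execute
    return ran_at - start


if __name__ == "__main__":
    delay = asyncio.run(measure_probe_delay(0.5))
    print(f"probe coroutine was delayed by {delay:.2f}s")
```

The probe coroutine does no work of its own, yet it waits for roughly the full duration of the blocking call, which is the same pattern seen with GET / during /execute.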

This surfaced for us during pandas / CSV data processing workloads, but the issue seems to come from the upstream example runtime design itself: blocking subprocess execution inside an async endpoint, combined with HTTP readiness served by the same event loop.

How can we reproduce it (as minimally and precisely as possible)?

Using the upstream quickstart/runtime path:

Apply the official Python SandboxTemplate from:
clients/python/agentic-sandbox-client/python-sandbox-template.yaml

This template uses:
- python-runtime-sandbox:latest-main
- HTTP readiness probe: GET /
- periodSeconds: 1
- no explicit timeoutSeconds, so Kubernetes defaults to 1 second

Create a sandbox using that template.

Run a long-running command through /execute, for example:

python3 -c "import time; time.sleep(30); print('done')"

or any data-processing command (for example, pandas CSV processing) that runs long enough to span several probe periods.

Watch pod events while /execute is still running:

kubectl describe pod <sandbox-pod>
kubectl get events --field-selector involvedObject.name=<sandbox-pod>

Observed event pattern:

Readiness probe failed: context deadline exceeded
Client.Timeout exceeded while awaiting headers

In runtime logs, readiness GET / requests are delayed until the /execute request completes.
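To make the timing concrete, here is a rough model of how many probes time out while the server is unresponsive. It is a sketch under stated assumptions, not kubelet's actual scheduling: probes are sent exactly every `period` seconds, the server answers as soon as the blocking call returns, and `failureThreshold` is left at the Kubernetes default of 3 (the template's value is not quoted in this issue). `count_failed_probes` is a hypothetical helper.

```python
def count_failed_probes(
    block_seconds: float, period: float = 1.0, timeout: float = 1.0
) -> int:
    """Count probes that time out while the server is blocked for block_seconds.

    Simplified model: probes go out at t = 0, period, 2 * period, ... and every
    pending request is answered the moment the blocking call returns.
    """
    failures = 0
    sent_at = 0.0
    while sent_at < block_seconds:
        wait = block_seconds - sent_at  # time until the server can respond
        if wait > timeout:
            failures += 1
        sent_at += period
    return failures


# Assumption: the template does not override the Kubernetes default of 3.
FAILURE_THRESHOLD = 3

if __name__ == "__main__":
    for seconds in (2, 5, 30):
        failed = count_failed_probes(seconds)
        print(seconds, failed, failed >= FAILURE_THRESHOLD)
```

Under this model the 30-second sleep from the reproduction accumulates 29 consecutive timeouts, far beyond a threshold of 3, while a 2-second command produces only one.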

Version

main @ 262aa41

Anything else we need to know?

I may be missing intended usage guidance here. If long-running /execute commands are expected, should the example runtime avoid blocking the event loop, or should the sample readiness probe be documented differently?

Possible fixes:

Use asyncio.create_subprocess_exec() in /execute instead of blocking subprocess.run().

Example shape:

process = await asyncio.create_subprocess_exec(
    *args,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
    cwd="/app",
)
stdout, stderr = await process.communicate()

Alternatively, make /execute a sync FastAPI endpoint (def, not async def) so FastAPI runs it in a threadpool rather than blocking the event loop.
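The threadpool idea can be sketched with the standard library alone. This is not FastAPI's own mechanism, only an analogous approach: `asyncio.to_thread()` (Python 3.9+) moves the blocking `subprocess.run()` call onto a worker thread so the event loop stays free. `execute_in_thread` and `main` are hypothetical names.

```python
import asyncio
import subprocess
import sys
import time


async def execute_in_thread(args: list[str]) -> subprocess.CompletedProcess:
    # subprocess.run() still blocks, but only the worker thread, not the loop.
    return await asyncio.to_thread(
        subprocess.run, args, capture_output=True, text=True
    )


async def main() -> tuple[str, float]:
    command = [sys.executable, "-c", "import time; time.sleep(1); print('done')"]
    task = asyncio.create_task(execute_in_thread(command))
    await asyncio.sleep(0.1)  # give the command time to start

    # Stand-in for a readiness request arriving while the command runs.
    start = time.monotonic()
    await asyncio.sleep(0)
    probe_delay = time.monotonic() - start

    result = await task
    return result.stdout.strip(), probe_delay


if __name__ == "__main__":
    output, probe_delay = asyncio.run(main())
    print(f"output={output!r} probe_delay={probe_delay:.3f}s")
```

The trade-off is that each in-flight command occupies one worker thread, so the pool size bounds how many commands can run at once.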

Optionally keep a single-command lock if the intended runtime semantics are one command at a time per sandbox.
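A single-command lock could look roughly like the following sketch. `CommandRunner` is a hypothetical helper, not part of the upstream runtime; it combines `asyncio.Lock` with `asyncio.create_subprocess_exec()` so that a second caller waits its turn without blocking the event loop.

```python
import asyncio
import sys
import time


class CommandRunner:
    """Runs one command at a time without blocking the event loop."""

    def __init__(self) -> None:
        # Create the lock inside a running loop so it binds to that loop
        # on older Python versions as well.
        self._lock = asyncio.Lock()

    async def run(self, *args: str) -> tuple[int, str]:
        async with self._lock:  # later callers wait here cooperatively
            process = await asyncio.create_subprocess_exec(
                *args,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, _stderr = await process.communicate()
            return process.returncode, stdout.decode().strip()


async def main() -> tuple[list, float]:
    runner = CommandRunner()
    script = "import time; time.sleep(0.3); print('done')"
    start = time.monotonic()
    results = await asyncio.gather(
        runner.run(sys.executable, "-c", script),
        runner.run(sys.executable, "-c", script),
    )
    return list(results), time.monotonic() - start


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

Because the two commands are serialized, the total elapsed time is at least the sum of both sleeps, while other coroutines such as a readiness handler can still run in between.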

Add a regression test where GET / remains responsive while /execute is running.
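One possible shape for such a test, sketched with the standard library only. The `handle` server below is a hypothetical stand-in for the runtime (it accepts a plain GET on /execute and runs a fixed one-second child process); a real regression test would target the actual FastAPI app instead.

```python
import asyncio
import sys
import time


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # Tiny stand-in server: "/" answers at once, "/execute" runs a
    # one-second child process without blocking the event loop.
    request_line = await reader.readline()
    path = request_line.split()[1].decode()
    while (await reader.readline()) not in (b"\r\n", b""):
        pass  # skip request headers
    if path == "/execute":
        process = await asyncio.create_subprocess_exec(
            sys.executable, "-c", "import time; time.sleep(1)"
        )
        await process.wait()
    writer.write(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def timed_get(port: int, path: str) -> float:
    """Send a bare HTTP/1.0 GET and return the seconds until the response ends."""
    start = time.monotonic()
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(f"GET {path} HTTP/1.0\r\n\r\n".encode())
    await writer.drain()
    await reader.read()  # the server closes the connection after responding
    writer.close()
    await writer.wait_closed()
    return time.monotonic() - start


async def readiness_latency_during_execute() -> float:
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    async with server:
        execute = asyncio.create_task(timed_get(port, "/execute"))
        await asyncio.sleep(0.2)  # let /execute start its child process
        latency = await timed_get(port, "/")
        await execute
    return latency


def test_readiness_stays_responsive() -> None:
    latency = asyncio.run(readiness_latency_during_execute())
    # 1.0 matches the probe's default timeoutSeconds described above.
    assert latency < 1.0, f"GET / took {latency:.2f}s while /execute was running"
```

Swapping the child-process call in `handle` for a blocking `subprocess.run()` should make this test fail, which is the regression it is meant to catch.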

Our local workaround was to replace the runtime server with an API-compatible version that uses asyncio.create_subprocess_exec() for /execute. With that change, a pandas workload completed successfully while HTTP readiness remained healthy:

not_ready_samples: 0
restart_count: 0
unhealthy_events_since_workload: 0
