What happened?
The official Python runtime sandbox example appears to be vulnerable to event-loop blocking during long-running /execute calls.
The current upstream runtime defines /execute as an async FastAPI handler, but calls blocking subprocess.run() inside it:
https://github.com/kubernetes-sigs/agent-sandbox/blob/main/examples/python-runtime-sandbox/main.py
At the same time, the official quickstart points users to the ready-made SandboxTemplate:
https://github.com/kubernetes-sigs/agent-sandbox/blob/main/examples/quickstart/README.md
which uses the python-runtime-sandbox:latest-main image and configures HTTP readiness against / every 1 second:
https://github.com/kubernetes-sigs/agent-sandbox/blob/main/clients/python/agentic-sandbox-client/python-sandbox-template.yaml
Because /execute and / are served by the same FastAPI/Uvicorn process, a long-running /execute command can block the event loop and delay the readiness handler. Kubernetes may then mark the sandbox pod as NotReady, even though the user command is still running successfully and the container is not actually unhealthy.
This surfaced for us during pandas / CSV data processing workloads, but the issue seems to come from the upstream example runtime design itself: blocking subprocess execution inside an async endpoint, combined with HTTP readiness served by the same event loop.
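The mechanism can be illustrated without FastAPI or Kubernetes. The following is a minimal sketch, not the upstream runtime code: a heartbeat coroutine stands in for the readiness handler, and the same blocking subprocess.run() call is either run directly inside a coroutine or handed to a worker thread with asyncio.to_thread() (Python 3.9+). The names SLEEP_CMD, blocking_execute, threaded_execute and max_heartbeat_gap are hypothetical and exist only for this illustration.

```python
import asyncio
import subprocess
import sys
import time

# Hypothetical stand-in for a long-running user command.
SLEEP_CMD = [sys.executable, "-c", "import time; time.sleep(1.0)"]


async def blocking_execute():
    # Mirrors the reported pattern: a blocking call made directly in a coroutine.
    subprocess.run(SLEEP_CMD, check=True)


async def threaded_execute():
    # Same blocking call, but run in a worker thread so the loop stays free.
    await asyncio.to_thread(subprocess.run, SLEEP_CMD, check=True)


async def max_heartbeat_gap(execute, interval=0.05):
    """Run `execute` while a heartbeat ticks; return the longest gap between ticks."""
    gaps = []
    done = asyncio.Event()

    async def heartbeat():
        # Stand-in for the readiness handler: it can only run when the loop is free.
        last = time.monotonic()
        while not done.is_set():
            await asyncio.sleep(interval)
            now = time.monotonic()
            gaps.append(now - last)
            last = now

    ticker = asyncio.create_task(heartbeat())
    await asyncio.sleep(interval * 2)  # let the heartbeat start ticking
    await execute()
    done.set()
    await ticker
    return max(gaps)


if __name__ == "__main__":
    blocked = asyncio.run(max_heartbeat_gap(blocking_execute))
    threaded = asyncio.run(max_heartbeat_gap(threaded_execute))
    print(f"max gap with blocking call in coroutine: {blocked:.2f}s")
    print(f"max gap with call moved to a thread:     {threaded:.2f}s")
```

In the blocking case the longest heartbeat gap should be roughly the duration of the command, which is the same stall a 1-second readiness probe would hit.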
How can we reproduce it (as minimally and precisely as possible)?
Using the upstream quickstart/runtime path:
- Apply the official Python SandboxTemplate from:
  clients/python/agentic-sandbox-client/python-sandbox-template.yaml
  This template uses:
  - python-runtime-sandbox:latest-main
  - HTTP readiness probe: GET / with periodSeconds: 1
  - no explicit timeoutSeconds, so Kubernetes defaults to 1 second
- Create a sandbox using that template.
- Run a long-running command through /execute, for example:
  python3 -c "import time; time.sleep(30); print('done')"
  or any data-processing command (for example, pandas CSV processing) that runs for long enough.
- Watch pod events while /execute is still running:
  kubectl describe pod <sandbox-pod>
  kubectl get events --field-selector involvedObject.name=<sandbox-pod>
Observed event pattern:
Readiness probe failed: context deadline exceeded
Client.Timeout exceeded while awaiting headers
In runtime logs, readiness GET / requests are delayed until the /execute request completes.
Version
main @ 262aa41
Anything else we need to know?
I may be missing intended usage guidance here. If long-running /execute commands are expected, should the example runtime avoid blocking the event loop, or should the sample readiness probe be documented differently?
Possible fixes:
- Use asyncio.create_subprocess_exec() in /execute instead of blocking subprocess.run().
  Example shape:
  ```python
  process = await asyncio.create_subprocess_exec(
      *args,
      stdout=asyncio.subprocess.PIPE,
      stderr=asyncio.subprocess.PIPE,
      cwd="/app",
  )
  stdout, stderr = await process.communicate()
  ```
- Alternatively, make /execute a sync FastAPI endpoint (def, not async def) so FastAPI runs it in a threadpool rather than blocking the event loop.
- Optionally keep a single-command lock if the intended runtime semantics are one command at a time per sandbox.
- Add a regression test where GET / remains responsive while /execute is running.
Our local workaround was to replace the runtime server with an API-compatible version that uses asyncio.create_subprocess_exec() for /execute. With that change, a pandas workload completed successfully while HTTP readiness remained healthy:
not_ready_samples: 0
restart_count: 0
unhealthy_events_since_workload: 0
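For discussion, here is a rough sketch of how the async subprocess, the optional single-command lock, and the responsiveness check could fit together. It is not our workaround code and not the upstream runtime: CommandRunner and check_responsive_during_execute are hypothetical names, and a short asyncio.sleep() stands in for serving GET /. A real regression test would presumably drive the FastAPI app through an HTTP client instead.

```python
import asyncio
import sys
import time


class CommandRunner:
    """Hypothetical helper: one command at a time, without blocking the event loop."""

    def __init__(self):
        # Create the runner inside the running loop so the lock belongs to it.
        self._lock = asyncio.Lock()

    async def run(self, args, cwd=None):
        async with self._lock:  # serialize commands per sandbox
            process = await asyncio.create_subprocess_exec(
                *args,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
                cwd=cwd,
            )
            stdout, stderr = await process.communicate()
            return process.returncode, stdout.decode(), stderr.decode()


async def check_responsive_during_execute(command_seconds=1.0, probe_interval=0.05):
    """Run a slow command and record how late each stand-in 'probe' wakes up."""
    runner = CommandRunner()
    cmd = [
        sys.executable,
        "-c",
        f"import time; time.sleep({command_seconds}); print('done')",
    ]
    task = asyncio.create_task(runner.run(cmd))

    probes = 0
    worst_delay = 0.0
    while not task.done():
        start = time.monotonic()
        await asyncio.sleep(probe_interval)  # stand-in for handling GET /
        worst_delay = max(worst_delay, time.monotonic() - start)
        probes += 1

    returncode, stdout, _ = await task
    return probes, worst_delay, returncode, stdout.strip()


if __name__ == "__main__":
    probes, worst, rc, out = asyncio.run(check_responsive_during_execute())
    print(f"probes served: {probes}, worst delay: {worst:.3f}s, rc={rc}, stdout={out!r}")
```

If the subprocess call were blocking, the loop above would serve at most one stand-in probe before the command finished, so a low probe count or a large worst delay is the signal such a test would assert on.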