Motivation.
#47451 contains an initial implementation of the proposal described in this RFC.
Summary
This RFC proposes a standard, extensible warmup infrastructure for JIT kernels in vLLM.
The goal is to let kernels from different JIT backends, including Triton, CuTeDSL, TileLang, and potential future DSLs, expose the set of specializations that should be compiled during engine startup.
This is not intended to be a one-off warmup path for a specific kernel. Instead, it defines a kernel-owned contract where each warmable kernel describes its own compile keys, dispatch logic, warmup search space, and compile-only entry point.
Motivation
vLLM increasingly relies on JIT-generated kernels from multiple DSLs. Today, warming these kernels is difficult to standardize because each backend exposes different runtime and compilation APIs. This RFC proposes a common infrastructure with several goals:
- Create shared JIT warmup infrastructure for JIT backends in vLLM.
- Keep warmup definitions close to the kernels that own the specialization logic, making the system easier to review and maintain.
- Warm up actual compile keys instead of running representative non-key inputs, such as token sizes, and hoping they map to all required specializations. That mapping is not always obvious or guaranteed, so warmup should target the compile-key space directly.
- Use Python AST dispatch tracing to derive compile-key search spaces from normal Python dispatch(...) methods, avoiding duplicated hand-written warmup logic.
- Support compile-only warmup APIs, avoiding dummy runtime launches and dummy tensor allocation. Dummy runs can be expensive and may have side effects; each DSL should expose fake tensor/spec descriptors suitable for compilation only.
- Define a standard interface for new contributors and potentially third-party libraries to expose warmup metadata: compile keys, representative warmup keys, and a compile-only API.
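To make the AST tracing goal concrete, here is a minimal sketch using Python's standard ast module. It is not the tracer from the PR: it assumes a toy module-level dispatch function whose return statements build CompileKey from literal keyword arguments, and it simply collects those literals. The helper name collect_literal_keys and the sample source are hypothetical; a real tracer would also have to handle computed values, loops, and helper calls.

```python
import ast
import textwrap

# Toy dispatch function used only for illustration (not taken from vLLM).
DISPATCH_SOURCE = textwrap.dedent(
    """
    def dispatch(num_tokens):
        if num_tokens <= 16:
            return CompileKey(BLOCK_SIZE=16)
        elif num_tokens <= 128:
            return CompileKey(BLOCK_SIZE=64)
        return CompileKey(BLOCK_SIZE=128)
    """
)


def collect_literal_keys(source: str, key_name: str = "CompileKey") -> list[dict]:
    """Collect keyword arguments of every `return <key_name>(...)` built from literals."""
    tree = ast.parse(source)
    returns = [node for node in ast.walk(tree) if isinstance(node, ast.Return)]
    keys = []
    # ast.walk is breadth-first, so sort by line number to get source order.
    for node in sorted(returns, key=lambda n: n.lineno):
        call = node.value
        if not isinstance(call, ast.Call):
            continue
        if not (isinstance(call.func, ast.Name) and call.func.id == key_name):
            continue
        # Only literal keyword values are handled; anything else is skipped.
        if call.keywords and all(
            isinstance(kw.value, ast.Constant) for kw in call.keywords
        ):
            keys.append({kw.arg: kw.value.value for kw in call.keywords})
    return keys


traced = collect_literal_keys(DISPATCH_SOURCE)
print(traced)
```

The point of the sketch is that the search space comes from the dispatch code itself, so a new branch in dispatch shows up in warmup without a second hand-maintained list.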
Proposed Change.
Each warmable kernel should expose a small wrapper object near the kernel's normal runtime entry point.
The wrapper owns:
- A frozen CompileKey dataclass with the fields that identify one compiled specialization.
- A dispatch(...) method that maps normal dispatch arguments to CompileKey.
- A get_warmup_keys(...) method that returns representative keys to compile.
- A compile(compile_key) method that compiles one key.
Proposed shape:
    from dataclasses import dataclass

    # VllmJitKernel is the base class proposed by this RFC; VllmConfig is vLLM's config object.
    class MyKernel(VllmJitKernel["MyKernel.CompileKey"]):
        @dataclass(frozen=True)
        class CompileKey:
            BLOCK_SIZE: int

        def dispatch(self, *, num_tokens: int) -> CompileKey:
            return self.CompileKey(BLOCK_SIZE=...)

        def get_warmup_keys(self, vllm_config: VllmConfig) -> list[CompileKey]:
            ...

        def compile(self, compile_key: CompileKey) -> None:
            ...

    MY_KERNEL = MyKernel()
CompileKey must be hashable. get_warmup_keys(...) should deduplicate keys before returning them if multiple representative inputs map to the same compiled specialization.
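The contract above can be exercised without any JIT backend. The sketch below is a self-contained toy, not the PR's code: WarmableKernel is a hypothetical stand-in for VllmJitKernel, get_warmup_keys takes a plain max_num_tokens integer instead of a VllmConfig, the block-size rule is invented for illustration, and compile only records the key rather than invoking a compiler.

```python
from abc import ABC, abstractmethod
from collections.abc import Hashable
from dataclasses import dataclass
from typing import Generic, TypeVar

K = TypeVar("K", bound=Hashable)


class WarmableKernel(ABC, Generic[K]):
    """Hypothetical stand-in for the VllmJitKernel base class named in this RFC."""

    @abstractmethod
    def get_warmup_keys(self, max_num_tokens: int) -> list[K]: ...

    @abstractmethod
    def compile(self, compile_key: K) -> None: ...


@dataclass(frozen=True)  # frozen + default eq makes instances hashable
class CompileKey:
    BLOCK_SIZE: int


class ToyKernel(WarmableKernel[CompileKey]):
    def __init__(self) -> None:
        self.compiled: list[CompileKey] = []

    def dispatch(self, *, num_tokens: int) -> CompileKey:
        # Invented rule: smallest power of two >= num_tokens, clamped to [16, 128].
        block = 16
        while block < num_tokens and block < 128:
            block *= 2
        return CompileKey(BLOCK_SIZE=block)

    def get_warmup_keys(self, max_num_tokens: int) -> list[CompileKey]:
        keys = (self.dispatch(num_tokens=n) for n in range(1, max_num_tokens + 1))
        # dict.fromkeys drops duplicates while preserving first-seen order.
        return list(dict.fromkeys(keys))

    def compile(self, compile_key: CompileKey) -> None:
        # Stand-in for a backend compile-only call.
        self.compiled.append(compile_key)


kernel = ToyKernel()
warmup_keys = kernel.get_warmup_keys(max_num_tokens=512)
for key in warmup_keys:
    kernel.compile(key)
print([k.BLOCK_SIZE for k in warmup_keys])
```

Here 512 representative token counts collapse to four compile keys, which is why deduplication belongs inside get_warmup_keys: the number of compilations should track the key space, not the input range.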
Scope
This RFC covers the warmup contract and the minimal shared infrastructure needed to make JIT warmup kernel-owned and backend-extensible.
The initial scope is limited to:
- A generic wrapper interface for warmable kernels.
- AST-assisted expansion of representative dispatch inputs into compile keys.
- Backend-specific compile-only adapters where needed.
- Initial Triton and CuTeDSL examples that exercise the contract.
It does not attempt to migrate every existing warmup path or define a complete backend-neutral fake tensor API for all DSLs in this first step.
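On the shared-infrastructure side, the startup path could be as small as a loop over registered kernels. The following is a sketch under assumptions, not the PR's implementation: the Warmable protocol, warm_up_all, and RecordingKernel are hypothetical names, the config argument is an opaque placeholder for VllmConfig, and tuples stand in for CompileKey instances.

```python
from collections.abc import Hashable, Iterable
from typing import Any, Protocol


class Warmable(Protocol):
    """Structural type for anything this hypothetical driver can warm up."""

    def get_warmup_keys(self, config: Any) -> list[Hashable]: ...

    def compile(self, compile_key: Hashable) -> None: ...


def warm_up_all(kernels: Iterable[Warmable], config: Any) -> dict[str, int]:
    """Compile every warmup key of every kernel; return keys compiled per kernel."""
    counts: dict[str, int] = {}
    for kernel in kernels:
        # Defensive dedup; the contract already asks kernels to deduplicate.
        keys = list(dict.fromkeys(kernel.get_warmup_keys(config)))
        for key in keys:
            kernel.compile(key)  # compile-only: no tensor allocation, no launch
        counts[type(kernel).__name__] = len(keys)
    return counts


class RecordingKernel:
    """Toy kernel that records compile requests instead of calling a backend."""

    def __init__(self) -> None:
        self.compiled: list[Hashable] = []

    def get_warmup_keys(self, config: Any) -> list[Hashable]:
        # Deliberately contains a duplicate to show the driver's dedup.
        return [("BLOCK_SIZE", 64), ("BLOCK_SIZE", 128), ("BLOCK_SIZE", 64)]

    def compile(self, compile_key: Hashable) -> None:
        self.compiled.append(compile_key)


kernel = RecordingKernel()
summary = warm_up_all([kernel], config=None)
print(summary)
```

Because the driver only depends on get_warmup_keys and compile, a Triton, CuTeDSL, or third-party kernel can plug in without the driver knowing anything about the backend's compilation API.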
Current And Prior Work
To my knowledge, PR #47451 is the first implementation of this proposal. It adds the shared wrapper contract and demonstrates it with one Triton path and one CuTeDSL path.
Before this proposal, warmup logic was mostly backend-specific and ad hoc. In some cases it used representative runtime-like inputs rather than directly expressing the compile-key space.
Risks
- Some specialization fields (compile keys) may be hidden in backend internals and hard to model.
- Over-warming can increase startup time.
- Third-party libraries may need small API additions to expose compile-only entry points, and potentially methods to get warmup compile keys.
Feedback Period.
No response
CC List.
No response
Any Other Things.
No response