[Security] Fix info disclosure via int32 truncation in GGUF dequantize kernels#44971
Merged
Isotr0py merged 2 commits on Jun 11, 2026
Widen the element-count parameter `k` in `to_cuda_ggml_t` and all dequantize functions from `int` to `int64_t`.

When a tensor's element count exceeds INT_MAX (e.g. 65536x65536), the 32-bit `k` silently truncates, so the kernel processes only a fraction of the output tensor. Because output tensors are allocated with `torch::empty` (uninitialized), the unfilled portion retains stale GPU memory, which may contain data from other users' inference requests in multi-tenant deployments.

Also widen the `col`, `batch`, `vecs`, and `padded` variables in gguf_kernel.cu from `int` to `int64_t` to prevent the same class of truncation in the matmul and MoE paths.

As defense-in-depth, zero-initialize all output tensors via `torch::stable::fill_(Y, 0.0)` so that stale GPU memory is never exposed even if a future truncation bug occurs.

Signed-off-by: Juan Pérez de Algaba <jperezde@redhat.com>
Signed-off-by: jperezde <jperezde@redhat.com>
Isotr0py approved these changes on Jun 11, 2026
Saddss pushed a commit to Saddss/vllm that referenced this pull request on Jun 14, 2026: …e kernels (vllm-project#44971) Signed-off-by: jperezde <jperezde@redhat.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request on Jun 18, 2026: …e kernels (vllm-project#44971) Signed-off-by: jperezde <jperezde@redhat.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request on Jun 19, 2026: …e kernels (vllm-project#44971) Signed-off-by: jperezde <jperezde@redhat.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Jun 22, 2026: …e kernels (vllm-project#44971) Signed-off-by: jperezde <jperezde@redhat.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request on Jun 24, 2026: …e kernels (vllm-project#44971) Signed-off-by: jperezde <jperezde@redhat.com>
Purpose
Fix an information disclosure vulnerability caused by integer truncation in GGUF dequantize kernels (csrc/libtorch_stable/quantization/gguf/).
The to_cuda_ggml_t function pointer typedef declares its element count parameter k as int (32-bit). When a GGUF model has weight tensor dimensions whose product exceeds INT_MAX (e.g. a 65536x65536 matrix), the int64_t product m * n is silently truncated to int at the call site. The CUDA kernel then processes only the truncated number of elements. Since output tensors are allocated with torch::empty (uninitialized memory), the unfilled portion retains stale GPU memory which — in multi-tenant inference deployments — may contain tensor data from other users' requests.
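To make the truncation concrete, here is a small host-side sketch of the arithmetic. The helper name `count_seen_by_kernel` is hypothetical and only mimics passing an int64_t product through an `int` parameter; it is not code from the vLLM tree.

```cpp
#include <cstdint>

// Hypothetical helper (not from the vLLM source): mimics what happens when
// the int64_t product m * n is passed through a parameter declared as `int`.
int count_seen_by_kernel(int64_t m, int64_t n) {
    int64_t total = m * n;  // exact element count, fits comfortably in 64 bits
    // Narrowing conversion keeps only the low 32 bits. This is modular since
    // C++20 and is what mainstream compilers did before that as well.
    return static_cast<int>(total);
}

// The same count with the widened parameter type: nothing is lost.
int64_t count_seen_after_fix(int64_t m, int64_t n) {
    return m * n;
}
```

Under these assumptions, a 65536x65536 tensor has 2^32 elements, which narrows to 0, while a 70000x70000 tensor has 4,900,000,000 elements, of which the narrowed count covers only 605,032,704. Everything beyond the narrowed count is left as whatever the allocation previously contained.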
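Changes:
- Widen the element-count parameter k in to_cuda_ggml_t and all dequantize functions from int to int64_t.
- Widen the col, batch, vecs, and padded variables in gguf_kernel.cu from int to int64_t to prevent the same class of truncation in the matmul and MoE paths.
- As defense-in-depth, zero-initialize all output tensors via torch::stable::fill_(Y, 0.0).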
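The interaction between a narrowed count and an uninitialized output buffer, and why zero-filling helps, can be illustrated with a scaled-down CPU analog. This is only a sketch: `uint8_t` stands in for the old 32-bit count and `std::size_t` for `int64_t` so the demo runs without multi-gigabyte buffers, and every name here (`fake_dequantize`, `leaked_elements`, `kStale`) is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel standing in for leftover data from a previous GPU allocation.
constexpr float kStale = -1.0f;

// CPU stand-in for a dequantize kernel: writes 1.0f to the first k outputs.
template <typename CountT>
void fake_dequantize(std::vector<float>& y, CountT k) {
    for (std::size_t i = 0; i < static_cast<std::size_t>(k); ++i) {
        y[i] = 1.0f;
    }
}

// Returns how many outputs still hold stale data after the "kernel" ran.
std::size_t leaked_elements(std::size_t n, bool narrow_count, bool zero_init) {
    // Like an uninitialized allocation: contents are whatever was there before.
    std::vector<float> y(n, kStale);
    if (zero_init) {
        std::fill(y.begin(), y.end(), 0.0f);  // defense-in-depth zero fill
    }
    if (narrow_count) {
        // Narrow count type: for n = 300 this becomes 300 mod 256 = 44.
        fake_dequantize(y, static_cast<std::uint8_t>(n));
    } else {
        fake_dequantize(y, n);  // wide count type: all n outputs are written
    }
    return static_cast<std::size_t>(std::count(y.begin(), y.end(), kStale));
}
```

In this analog, `leaked_elements(300, true, false)` leaves 256 stale values, while either widening the count or zero-filling the buffer brings that to 0. The zero fill does not make a truncated result correct; it only ensures the unwritten tail holds zeros rather than prior contents.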
Test Plan
The type change from int to int64_t is source-compatible for existing callers: all downstream call sites already pass int64_t values (e.g. m * n in ggml_dequantize), so no call site needs to change. The fix eliminates the silent truncation at the function pointer boundary.
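One way to guard this property against regressions is a compile-time check on the function pointer type, plus a runtime check that a large count survives the call. The sketch below uses a simplified, hypothetical typedef `to_ggml_fn_t` (the stream argument and qualifiers of the real `to_cuda_ggml_t` are omitted); it illustrates the idea and is not the project's test code.

```cpp
#include <cstdint>
#include <type_traits>

// Simplified, hypothetical stand-in for the widened function pointer typedef.
template <typename dst_t>
using to_ggml_fn_t = void (*)(const void* x, dst_t* y, int64_t k);

// Trait that extracts the type of the third (element-count) parameter.
template <typename F>
struct count_param;
template <typename R, typename A, typename B, typename K>
struct count_param<R (*)(A, B, K)> {
    using type = K;
};

// Fails to compile if the count parameter is ever narrowed back to int.
static_assert(
    std::is_same<count_param<to_ggml_fn_t<float>>::type, int64_t>::value,
    "element count must be 64-bit");

// Records the count that actually arrives on the callee side.
static int64_t g_last_k = 0;
void record_count(const void* /*x*/, float* /*y*/, int64_t k) { g_last_k = k; }

// Passes m * n through the function pointer and reports what was received.
int64_t count_received(int64_t m, int64_t n) {
    to_ggml_fn_t<float> fn = record_count;
    fn(nullptr, nullptr, m * n);
    return g_last_k;
}
```

With the widened signature, `count_received(65536, 65536)` returns the full 4294967296 rather than a truncated value.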
Test Result
MR was created with the assistance of: opus-4.6-high