Skip to content

[Misc] usage_stats: report more engine, spec-decode, and EP config#44595

Merged
simon-mo merged 1 commit into
vllm-project:mainfrom
zlxi02:usage-stats-report-more-engine-config
Jun 8, 2026
Merged

[Misc] usage_stats: report more engine, spec-decode, and EP config#44595
simon-mo merged 1 commit into
vllm-project:mainfrom
zlxi02:usage-stats-report-more-engine-config

Conversation

@zlxi02

@zlxi02 zlxi02 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Added fields to usage_stats so we can better understand how people use vLLM in production and which fields to focus efforts on.

Fields added:

  • Batching limits: max_model_len, max_num_seqs, max_num_batched_tokens
  • Spec decoding spec_decode_method, num_speculative_tokens
  • Wide expert parallel: enable_eplb, num_redundant_experts, num_experts
  • Backend: attention_backend
  • Torch details: compilation_mode, torch_version

On privacy, all information is in aggregate and high level enough to protect user privacy.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label Jun 5, 2026
@zlxi02 zlxi02 force-pushed the usage-stats-report-more-engine-config branch from 5ce55da to ec46de7 Compare June 5, 2026 01:30
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@zlxi02 zlxi02 force-pushed the usage-stats-report-more-engine-config branch from ec46de7 to 4b1d759 Compare June 5, 2026 18:36

@simon-mo simon-mo left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please post the local usage stats json to show what these fields look like when enabled?

Comment thread vllm/usage/usage_lib.py Outdated
# vLLM information
self.context = usage_context.value
self.vllm_version = VLLM_VERSION
self.torch_version = torch.__version__

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically vLLM version is tied to a given torch version so we do not need to capture this, however, as we move towards Torch stable ABI, this needs to be understood well!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh should be get the CUDA version? or is cuda_runtime already capturing this, i have vague memory now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. sounds good, torch version dropped; agreed on the redundancy
  2. cuda_runtime already captures it (set to torch.version.cuda)
Comment thread vllm/v1/utils.py
Comment on lines +653 to +656
# Attention backend is None when set to "auto" (resolved at runtime per platform).
attention_backend = (
attention_config.backend.name if attention_config.backend is not None else None
)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to clarify this is only the value if user explicitly opt-in?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes: value is set to attention backend name if opt in, and then None if user doesn't specify flag

@simon-mo simon-mo added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 5, 2026
@zlxi02 zlxi02 force-pushed the usage-stats-report-more-engine-config branch from 4b1d759 to 3dd2a4e Compare June 5, 2026 19:25
@zlxi02

zlxi02 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

can you please post the local usage stats json to show what these fields look like when enabled?

Ran on a single-GPU GB200 node with LLM(model="facebook/opt-125m", max_model_len=512). Resulting ~/.config/vllm/usage_stats.json:

{
  "uuid": "<redacted>",
  "provider": "GCP",
  "num_cpu": 140,
  "cpu_type": "Neoverse-V2",
  "cpu_family_model_stepping": ",,",
  "total_memory": 947973390336,
  "architecture": "aarch64",
  "platform": "Linux-...aarch64",
  "xpu_runtime": null,
  "cuda_runtime": "13.0",
  "gpu_count": 4,
  "gpu_type": "NVIDIA GB200",
  "gpu_memory_per_device": 197897617408,
  "env_var_json": "{\"VLLM_USE_MODELSCOPE\": false, \"VLLM_USE_FLASHINFER_SAMPLER\": true, \"VLLM_PP_LAYER_PARTITION\": null, \"VLLM_USE_TRITON_AWQ\": false, \"VLLM_ENABLE_V1_MULTIPROCESSING\": true}",
  "model_architecture": "OPTForCausalLM",
  "vllm_version": "0.20.2rc1.dev242+gd7af6b34d",
  "context": "ENGINE_CONTEXT",
  "log_time": 1780688065386652928,
  "source": "production",
  "dtype": "torch.float16",
  "block_size": 16,
  "gpu_memory_utilization": 0.92,
  "kv_cache_memory_bytes": null,
  "quantization": null,
  "kv_cache_dtype": "auto",
  "enable_lora": false,
  "enable_prefix_caching": true,
  "enforce_eager": false,
  "disable_custom_all_reduce": false,
  "tensor_parallel_size": 1,
  "data_parallel_size": 1,
  "pipeline_parallel_size": 1,
  "enable_expert_parallel": false,
  "all2all_backend": "allgather_reducescatter",
  "kv_connector": null,
  "max_model_len": 512,
  "max_num_seqs": 1024,
  "max_num_batched_tokens": 16384,
  "attention_backend": null,
  "compilation_mode": "VLLM_COMPILE",
  "spec_decode_method": null,
  "num_speculative_tokens": null,
  "enable_eplb": false,
  "num_redundant_experts": 0,
  "num_experts": 0
}

New fields at the bottom^

simon-mo commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

can you please post a json with spec decode fields and other fields available?

@zlxi02

zlxi02 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

can you please post a json with spec decode fields and other fields available?

yes, new run on Qwen/Qwen1.5-MoE-A2.7B + EPLB + n-gram spec decode + FlashInfer attention (compilation_mode is None because we had to avoid a regression on main)

~/.config/vllm/usage_stats.json:

{
  "uuid": "<redacted>",
  "provider": "GCP",
  "num_cpu": 140,
  "cpu_type": "Neoverse-V2",
  "cpu_family_model_stepping": ",,",
  "total_memory": 947973390336,
  "architecture": "aarch64",
  "platform": "Linux-...aarch64",
  "xpu_runtime": null,
  "cuda_runtime": "13.0",
  "gpu_count": 4,
  "gpu_type": "NVIDIA GB200",
  "gpu_memory_per_device": 197897617408,
  "env_var_json": "{\"VLLM_USE_MODELSCOPE\": false, \"VLLM_USE_FLASHINFER_SAMPLER\": true, \"VLLM_PP_LAYER_PARTITION\": null, \"VLLM_USE_TRITON_AWQ\": false, \"VLLM_ENABLE_V1_MULTIPROCESSING\": true}",
  "model_architecture": "Qwen2MoeForCausalLM",
  "vllm_version": "0.20.2rc1.dev242+gd7af6b34d",
  "context": "ENGINE_CONTEXT",
  "log_time": 1780703520195269120,
  "source": "production",
  "dtype": "torch.bfloat16",
  "block_size": 16,
  "gpu_memory_utilization": 0.92,
  "kv_cache_memory_bytes": null,
  "quantization": null,
  "kv_cache_dtype": "auto",
  "enable_lora": false,
  "enable_prefix_caching": true,
  "enforce_eager": true,
  "disable_custom_all_reduce": false,
  "tensor_parallel_size": 2,
  "data_parallel_size": 1,
  "pipeline_parallel_size": 1,
  "enable_expert_parallel": true,
  "all2all_backend": "allgather_reducescatter",
  "kv_connector": null,
  "max_model_len": 4096,
  "max_num_seqs": 1024,
  "max_num_batched_tokens": 16384,
  "attention_backend": "FLASHINFER",
  "compilation_mode": "NONE",
  "spec_decode_method": "ngram",
  "num_speculative_tokens": 3,
  "enable_eplb": true,
  "num_redundant_experts": 8,
  "num_experts": 60
}
@zlxi02 zlxi02 force-pushed the usage-stats-report-more-engine-config branch 3 times, most recently from a2f92c2 to f0d49d1 Compare June 8, 2026 05:46
Adds the following fields to the usage stats payload:

- max_model_len, max_num_seqs, max_num_batched_tokens (batching knobs
  operators commonly override)
- attention_backend (user-requested; None = auto-selected at runtime)
- compilation_mode (NONE / STOCK_TORCH_COMPILE / DYNAMO_TRACE_ONCE /
  VLLM_COMPILE)
- spec_decode_method, num_speculative_tokens (None when spec decode is
  disabled)
- enable_eplb, num_redundant_experts, num_experts (wide expert parallel
  shape; num_experts is needed to interpret num_redundant_experts)

All fields are operational config or build metadata. Nothing model-
identifying: num_experts is a public property of the architecture (e.g.
a Mixtral fine-tune still has 8 experts), so it doesn't add fingerprint
beyond the existing model_architecture field.

Motivation: today we have no visibility into spec decoding adoption or
wide-EP shape, and the batching limits are the most commonly tuned
knobs but aren't reported.

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@zlxi02 zlxi02 force-pushed the usage-stats-report-more-engine-config branch from f0d49d1 to 37020f6 Compare June 8, 2026 06:59
@simon-mo simon-mo merged commit 3f627eb into vllm-project:main Jun 8, 2026
46 of 48 checks passed
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Jun 9, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
waqahmed-amd-fi pushed a commit to waqahmed-amd-fi/vllm that referenced this pull request Jun 10, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
ohsono pushed a commit to ohsono/vllm that referenced this pull request Jul 3, 2026
…llm-project#44595)

Signed-off-by: Zach Xi <zachary.xi@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

2 participants