Feature/offloading manager stats by Srinivasoo7 · Pull Request #35669 · vllm-project/vllm

Srinivasoo7 · 2026-03-01T17:30:18Z

Purpose

This PR adds support for emitting telemetry from the OffloadingManager interface inside the KV connector.

As part of isolating these changes from the block-reuse frequency tracking PR, the OffloadingManager interface needs a standardized way to emit statistics up to the scheduler-side KVConnector interfaces. This change:

Adds get_stats() -> dict[str, Any] to the base OffloadingManager abstract class.
Implements stats aggregation inside OffloadingConnectorStats.reduce to gracefully support flat scalar statistics (ints, floats) alongside its standard list metrics.
Aggregates metrics polled from the currently active self.connector_scheduler.manager within OffloadingConnector.get_kv_connector_stats().
Exposes stores_skipped directly from the FilteredOffloadingManager (formerly StoreReusedOffloadingManager), tracking exactly how many block hashes failed the LRU reuse threshold.
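The decorator arrangement described above can be sketched as follows. This is a trimmed-down, hypothetical model rather than the vLLM code: `prepare_store` returning a plain list and the `record_lookup` helper are assumptions made for illustration, and the real classes carry more methods.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterable
from typing import Any


class OffloadingManager(ABC):
    """Hypothetical, trimmed-down stand-in for the abstract base class."""

    @abstractmethod
    def prepare_store(self, block_hashes: Iterable[int]) -> list[int]:
        """Return the block hashes that will actually be stored."""

    def get_stats(self) -> dict[str, Any]:
        # Default: managers that track nothing report an empty dict.
        return {}


class PassThroughManager(OffloadingManager):
    def prepare_store(self, block_hashes: Iterable[int]) -> list[int]:
        return list(block_hashes)


class FilteredOffloadingManager(OffloadingManager):
    """Decorator that only stores blocks seen at least `store_threshold` times."""

    def __init__(self, backing: OffloadingManager, store_threshold: int):
        self._backing = backing
        self.store_threshold = store_threshold
        self.counts: dict[int, int] = {}
        self.stores_skipped = 0

    def record_lookup(self, block_hash: int) -> None:  # hypothetical helper
        self.counts[block_hash] = self.counts.get(block_hash, 0) + 1

    def prepare_store(self, block_hashes: Iterable[int]) -> list[int]:
        hashes = list(block_hashes)
        eligible = [h for h in hashes if self.counts.get(h, 0) >= self.store_threshold]
        self.stores_skipped += len(hashes) - len(eligible)
        return self._backing.prepare_store(eligible)

    def get_stats(self) -> dict[str, Any]:
        # Build a new dict instead of mutating the backing manager's result.
        return {**self._backing.get_stats(), "stores_skipped": self.stores_skipped}
```

Wrapping keeps the backing manager unaware of filtering, while `get_stats()` layers the filter's own counter on top of whatever the backing manager reports.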

Test Plan

Ran pytest tests/v1/kv_offload to ensure manager abstraction and logic remain structurally sound.
Ran ruff format and ruff check on the modified files to ensure compliance with the repository's formatting and lint rules.

Test Result

Pytest and Ruff linters pass cleanly across the modified abstract and connector files.
Emitted OffloadingConnectorStats successfully aggregate scalar dict values without validation errors.

gemini-code-assist

Code Review

This pull request effectively adds telemetry support for the OffloadingManager and introduces a FilteredOffloadingManager to gate offloading based on block reuse frequency. The implementation is well-structured, particularly the use of a decorator pattern for the filtering logic and the inclusion of comprehensive unit tests. I have identified a couple of areas for improvement to enhance configurability and robustness.

gemini-code-assist · 2026-03-01T17:31:59Z

+            store_threshold = int(self.extra_config.get("store_threshold", 0))
+            if store_threshold > 1:
+                self._manager = FilteredOffloadingManager(
+                    backing=self._manager,
+                    store_threshold=store_threshold,
+                )


The FilteredOffloadingManager is initialized without a configurable max_tracker_size. This means the LRU tracker for block reuse frequency will always use the default size. This parameter can have a significant impact on memory usage and filtering effectiveness, and should be configurable for performance tuning. I suggest reading max_tracker_size from self.extra_config, similar to how store_threshold is handled.

store_threshold = int(self.extra_config.get("store_threshold", 0))
if store_threshold > 1:
    max_tracker_size = int(
        self.extra_config.get("max_tracker_size", 64_000)
    )
    self._manager = FilteredOffloadingManager(
        backing=self._manager,
        store_threshold=store_threshold,
        max_tracker_size=max_tracker_size,
    )
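To see why `max_tracker_size` affects both memory and filtering, here is a minimal sketch of an LRU-bounded reuse counter. The class name `BlockReuseTracker` and its methods are hypothetical here; the sketch only assumes the tracker evicts the least recently seen hash once it reaches its cap.

```python
from collections import OrderedDict


class BlockReuseTracker:
    """Hypothetical LRU-bounded counter of how often each block hash is seen.

    Memory is capped at `max_tracker_size` entries: when the cap is exceeded,
    the least recently seen hash is evicted and its count is forgotten.
    """

    def __init__(self, max_tracker_size: int = 64_000):
        if max_tracker_size <= 0:
            raise ValueError("max_tracker_size must be positive")
        self.max_tracker_size = max_tracker_size
        self._counts: OrderedDict[int, int] = OrderedDict()

    def record(self, block_hash: int) -> int:
        count = self._counts.pop(block_hash, 0) + 1
        self._counts[block_hash] = count  # re-insert as most recently used
        if len(self._counts) > self.max_tracker_size:
            self._counts.popitem(last=False)  # evict least recently used
        return count

    def count(self, block_hash: int) -> int:
        return self._counts.get(block_hash, 0)


def tracker_from_config(extra_config: dict) -> BlockReuseTracker:
    # Mirrors the review suggestion: read the size from extra_config with a default.
    return BlockReuseTracker(int(extra_config.get("max_tracker_size", 64_000)))
```

A tracker that is too small forgets hashes before they reach `store_threshold`, so more stores get skipped; a larger one costs memory proportional to the cap.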

gemini-code-assist · 2026-03-01T17:31:59Z

+        stats = self._backing.get_stats()
+        stats["stores_skipped"] = self.stores_skipped
+        return stats


The get_stats method modifies the dictionary returned by self._backing.get_stats() in-place. While this is currently safe because the existing backing managers return a new empty dictionary, this pattern is fragile. If a backing manager's implementation changes in the future to return a shared or cached dictionary, this could lead to unintended side effects. It's safer to create a new dictionary to avoid mutating the object returned by the backing manager.

Suggested change, replacing:

stats = self._backing.get_stats()
stats["stores_skipped"] = self.stores_skipped
return stats

with:

return {**self._backing.get_stats(), "stores_skipped": self.stores_skipped}
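The fragility the review describes can be reproduced with a hypothetical backing manager that caches its stats dict. Nothing here is vLLM code; `CachingBackingManager` and the two helper functions are illustrative only.

```python
from typing import Any


class CachingBackingManager:
    """Hypothetical backing manager that returns the same cached dict every call."""

    def __init__(self) -> None:
        self._cached: dict[str, Any] = {"hits": 3}

    def get_stats(self) -> dict[str, Any]:
        return self._cached


def stats_in_place(backing: CachingBackingManager, skipped: int) -> dict[str, Any]:
    stats = backing.get_stats()
    stats["stores_skipped"] = skipped  # mutates the backing manager's cache
    return stats


def stats_copied(backing: CachingBackingManager, skipped: int) -> dict[str, Any]:
    # Suggested fix: build a fresh top-level dict and leave the cache untouched.
    return {**backing.get_stats(), "stores_skipped": skipped}
```

The unpacking form only copies the top level, which is sufficient while the stats dict is flat.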

mergify · 2026-03-03T01:33:12Z

Hi @Srinivasoo7, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Srinivasoo7 · 2026-03-15T17:12:50Z

Hi @orozery
Now that PR #35342 is merged, can you take a look at the supporting stats for this functionality?

Thanks

orozery · 2026-03-16T06:13:04Z

@Srinivasoo7 can you please rebase?

orozery · 2026-03-17T09:48:53Z

The stats dictionary returned by the connector is later used to feed OffloadingConnectorStats.
OffloadingConnectorStats currently assumes a hard-coded structure of transfer_type -> ops_list.
The change proposed here will break this assumption.
Specifically, I expect OffloadingConnectorStats.aggregate and OffloadingConnectorStats.reduce to fail.

Srinivasoo7 · 2026-03-17T13:07:19Z

Hi @orozery
You are right that the original OffloadingConnectorStats assumes a hard-coded transfer_type -> ops_list structure, but the branch already updates both aggregate() and reduce() to handle scalar values alongside list values.
Hence, stores_skipped flows through both methods correctly without breaking the existing transfer-type logic.

mergify · 2026-03-18T00:14:03Z

Hi @Srinivasoo7, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-03-18T17:10:04Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Srinivasoo7.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

orozery · 2026-03-19T10:14:59Z

Please rebase :)

Srinivasoo7 · 2026-03-19T13:29:49Z

Done @orozery

Srinivasoo7 · 2026-03-19T15:40:37Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces telemetry for the offloading manager, which is a valuable addition. The changes to support a new statistics structure with both transfers and gauges are well-implemented in metrics.py and offloading_connector.py. However, I've identified a critical bug in reuse_manager.py that could lead to a runtime error, and another high-risk issue related to in-place data mutation. I have provided suggestions to address both of these concerns.

gemini-code-assist · 2026-03-19T15:46:22Z

            bh for bh in block_hashes if self.counts.get(bh, 0) >= self.store_threshold
        ]

+        self.stores_skipped += len(block_hashes) - len(eligible)
+


The block_hashes parameter is an Iterable. Calling len(block_hashes) after it has been potentially consumed by the list comprehension on the preceding lines can lead to incorrect behavior or a TypeError if block_hashes is a generator. To ensure correctness, you should first convert the iterable to a list and use that list for both operations.

block_hashes_list = list(block_hashes)
eligible = [
    bh for bh in block_hashes_list if self.counts.get(bh, 0) >= self.store_threshold
]
self.stores_skipped += len(block_hashes_list) - len(eligible)
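The failure mode is easy to reproduce with standalone functions (hypothetical names, not vLLM code): a generator has no `len()`, and even if it did, the list comprehension would already have exhausted it.

```python
from collections.abc import Iterable


def count_skipped_buggy(block_hashes: Iterable[int], counts: dict[int, int], threshold: int) -> int:
    eligible = [bh for bh in block_hashes if counts.get(bh, 0) >= threshold]
    # Raises TypeError when block_hashes is a generator (which is also exhausted here).
    return len(block_hashes) - len(eligible)  # type: ignore[arg-type]


def count_skipped_fixed(block_hashes: Iterable[int], counts: dict[int, int], threshold: int) -> int:
    block_hashes_list = list(block_hashes)  # materialize once, then reuse twice
    eligible = [bh for bh in block_hashes_list if counts.get(bh, 0) >= threshold]
    return len(block_hashes_list) - len(eligible)
```

The buggy version happens to work when callers pass a list, which is why this kind of bug can slip past tests that never pass a generator.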

gemini-code-assist · 2026-03-19T15:46:22Z

+    def get_stats(self) -> dict[str, Any]:
+        stats = self._backing.get_stats()
+        stats.setdefault("gauges", {})["stores_skipped"] = self.stores_skipped
+        return stats


The current implementation modifies the dictionary returned by self._backing.get_stats() in-place. This can cause unexpected side effects if the backing manager's returned dictionary is not meant to be mutated. To prevent this, you should operate on a copy. The safest approach is to use copy.deepcopy.

You will need to add import copy to the file.

def get_stats(self) -> dict[str, Any]:
    stats = copy.deepcopy(self._backing.get_stats())
    stats.setdefault("gauges", {})["stores_skipped"] = self.stores_skipped
    return stats
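Once the stats are nested under a "gauges" key, a shallow `{**...}` copy is no longer enough, because `setdefault("gauges", {})` returns the backing manager's own inner dict. A small sketch with hypothetical helper names shows the difference:

```python
import copy
from typing import Any


def with_shallow_copy(backing: dict[str, Any], skipped: int) -> dict[str, Any]:
    stats = {**backing}  # new top-level dict, but "gauges" is still shared
    stats.setdefault("gauges", {})["stores_skipped"] = skipped
    return stats


def with_deep_copy(backing: dict[str, Any], skipped: int) -> dict[str, Any]:
    stats = copy.deepcopy(backing)  # nested dicts are copied too
    stats.setdefault("gauges", {})["stores_skipped"] = skipped
    return stats
```

A cheaper alternative, if only "gauges" is modified, is to copy just that inner dict; `deepcopy` is simpler but copies every nested transfer list as well.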

Srinivasoo7 · 2026-03-20T17:06:43Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces telemetry for the OffloadingManager, allowing statistics to be collected and emitted. The changes include adding a get_stats method to the OffloadingManager interface, implementing it in FilterReusedOffloadingManager to report stores_skipped, and updating the metrics aggregation logic to handle these new scalar statistics (gauges) alongside existing transfer metrics. The OffloadingConnector is also modified to collect and aggregate these stats from the scheduler's manager.

My main concern is with the new test file tests/v1/kv_offload/test_reuse_manager.py. It appears to be written for a different version of the code, referencing classes (FilteredOffloadingManager, BlockReuseTracker) that do not exist in vllm/v1/kv_offload/reuse_manager.py. This is a critical issue as it means the new functionality is not being tested correctly. Please see my detailed comment on this file.

gemini-code-assist · 2026-03-20T17:11:15Z

+    BlockReuseTracker = _mod.BlockReuseTracker  # type: ignore[assignment,misc]
+    FilteredOffloadingManager = (  # type: ignore[assignment,misc]
+        _mod.FilteredOffloadingManager
+    )


This test file attempts to import BlockReuseTracker and FilteredOffloadingManager from the reuse_manager module. However, the implementation in vllm/v1/kv_offload/reuse_manager.py defines a class named FilterReusedOffloadingManager and does not contain a BlockReuseTracker class. This will cause an AttributeError at runtime, and these tests will fail to run.

Please ensure the test file is updated to use the correct class names from the module under test. FilteredOffloadingManager should likely be FilterReusedOffloadingManager. The logic for BlockReuseTracker seems to be part of FilterReusedOffloadingManager now, so the tests might need significant refactoring to target the public API of FilterReusedOffloadingManager.

orozery · 2026-03-23T05:02:07Z

+            mgr_stats_data = self.connector_scheduler.manager.get_stats()
+            if mgr_stats_data:
+                mgr_stats = self.build_kv_connector_stats(mgr_stats_data)
+                if mgr_stats is not None:
+                    if stats is not None:
+                        stats = stats.aggregate(mgr_stats)
+                    else:
+                        stats = mgr_stats


I don't think we need aggregation, since self.connector_scheduler and self.connector_worker can never both be non-None.

orozery · 2026-03-23T05:06:13Z


    def is_empty(self) -> bool:
-        return not self.data
+        return not self.data.get("transfers") and not self.data.get("gauges")


Why do we need this change?

orozery · 2026-03-23T05:07:06Z


    def reset(self):
-        self.data: dict[str, list[OffloadingOperationMetrics]] = {}
+        self.data: dict[str, Any] = {"transfers": {}, "gauges": {}}


Let's try and make it clearer by:

Defining "transfers" and "gauges" as string constants, e.g. TRANSFERS_KEY = "transfers", GAUGES_KEY = "gauges".

Adding a docstring to OffloadingConnectorStats describing its expected structure.
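A minimal sketch of both suggestions, using a hypothetical class name rather than the real OffloadingConnectorStats. It also answers the earlier `is_empty` question: once `reset()` always creates the two top-level keys, `not self.data` can never be true, so emptiness has to be checked per section.

```python
from typing import Any

TRANSFERS_KEY = "transfers"
GAUGES_KEY = "gauges"


class OffloadingConnectorStatsSketch:
    """Hypothetical sketch of the stats container.

    Expected structure of ``self.data``::

        {
            "transfers": {<transfer_type>: [<operation metrics>, ...]},
            "gauges": {<gauge_name>: <int | float>},
        }
    """

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self.data: dict[str, Any] = {TRANSFERS_KEY: {}, GAUGES_KEY: {}}

    def is_empty(self) -> bool:
        # The top-level keys always exist, so look inside both sections.
        return not self.data[TRANSFERS_KEY] and not self.data[GAUGES_KEY]
```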

orozery · 2026-03-23T05:11:39Z


+    def get_stats(self) -> dict[str, Any]:
+        stats = copy.deepcopy(self._backing.get_stats())
+        stats.setdefault("gauges", {})["stores_skipped"] = self.stores_skipped


Can we create and use a KVConnectorStats.set_gauge(gauge_name, gauge_value) method?

orozery · 2026-03-23T05:12:52Z

        """
        return ()
+
+    def get_stats(self) -> dict[str, Any]:


Let's define it as:

Suggested change, replacing:

def get_stats(self) -> dict[str, Any]:

with:

def get_stats(self) -> dict[str, Any] | None:

to save the dictionary allocation if stats are unused.

orozery · 2026-03-23T05:15:23Z

@@ -0,0 +1,360 @@
+# SPDX-License-Identifier: Apache-2.0


Looks like this is an old file?
We already have a unit test in test_cpu_manager.py.
You should instead add a test that verifies the stores_skipped gauge.
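A pytest-style sketch of such a test. To stay self-contained it defines a hypothetical minimal manager and mocks the backing manager with `unittest.mock.MagicMock`; a real test would import the vLLM class and use its actual lookup and store methods, whose signatures differ from the `lookup`/`prepare_store` assumed here.

```python
from typing import Any
from unittest.mock import MagicMock


class FilteredManagerUnderTest:
    """Hypothetical minimal manager; a real test would import the vLLM class."""

    def __init__(self, backing: Any, store_threshold: int) -> None:
        self._backing = backing
        self.store_threshold = store_threshold
        self.counts: dict[int, int] = {}
        self.stores_skipped = 0

    def lookup(self, block_hashes: list[int]) -> None:
        for bh in block_hashes:
            self.counts[bh] = self.counts.get(bh, 0) + 1

    def prepare_store(self, block_hashes: list[int]) -> list[int]:
        eligible = [
            bh for bh in block_hashes if self.counts.get(bh, 0) >= self.store_threshold
        ]
        self.stores_skipped += len(block_hashes) - len(eligible)
        return self._backing.prepare_store(eligible)

    def get_stats(self) -> dict[str, Any]:
        stats = dict(self._backing.get_stats() or {})
        stats["gauges"] = {**stats.get("gauges", {}), "stores_skipped": self.stores_skipped}
        return stats


def test_stores_skipped_gauge() -> None:
    backing = MagicMock()
    backing.get_stats.return_value = None
    backing.prepare_store.side_effect = lambda hashes: hashes
    manager = FilteredManagerUnderTest(backing, store_threshold=2)

    manager.lookup([1, 2])
    manager.lookup([1])  # block 1 reaches the threshold, block 2 does not
    assert manager.prepare_store([1, 2, 3]) == [1]
    assert manager.get_stats()["gauges"]["stores_skipped"] == 2
    backing.prepare_store.assert_called_once_with([1])
```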

Srinivasoo7 · 2026-03-30T15:29:38Z

Hi @orozery
Addressed the changes in the latest commit.

Thanks

mergify · 2026-06-09T07:57:30Z

Hi @Srinivasoo7, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

orozery · 2026-06-09T09:40:20Z

@Srinivasoo7 There are also 2 more things to fix:

pre-commit
assert self._connector_stats is None, which fails some of the existing tests (see CI logs).

Srinivasoo7 · 2026-06-09T14:26:29Z

Yes boss @orozery, we'll fix it asap!

Signed-off-by: srinivas_oo7 <sklinkedin0120@gmail.com>

Signed-off-by: Or Ozeri <oro@il.ibm.com>

orozery

Thanks @Srinivasoo7 for the hard work! (and sorry for all of the nitpicking :) )
I added an e2e test and noticed an issue: metric metadata was not serialized.
I changed the stats structs to nest the metadata under the self.data dictionary, which is serialized.
Another issue was that we were missing empty worker-side stats (until #43877 lands).
I pushed the fixes to your branch.
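Why nesting matters can be sketched with two hypothetical classes, assuming only `self.data` is serialized when stats cross a process boundary (JSON stands in for whatever encoder is actually used). Metadata held on a separate attribute is silently dropped on the round trip; metadata nested inside `self.data` survives.

```python
import json
from typing import Any


class StatsWithSideMetadata:
    """Hypothetical: metadata kept on an attribute that is NOT serialized."""

    def __init__(self, data: dict[str, Any], metadata: dict[str, Any]) -> None:
        self.data = data
        self.metadata = metadata

    def serialize(self) -> str:
        return json.dumps(self.data)  # only self.data crosses the boundary

    @classmethod
    def deserialize(cls, payload: str) -> "StatsWithSideMetadata":
        return cls(json.loads(payload), metadata={})  # metadata is lost


class StatsWithNestedMetadata:
    """Hypothetical: metadata nested under self.data, so it survives the round trip."""

    def __init__(self, data: dict[str, Any]) -> None:
        self.data = data

    def serialize(self) -> str:
        return json.dumps(self.data)

    @classmethod
    def deserialize(cls, payload: str) -> "StatsWithNestedMetadata":
        return cls(json.loads(payload))
```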

Srinivasoo7 · 2026-06-10T10:30:28Z

Gotcha @orozery.
This was good work; although it took significant time, it was a great learning experience altogether.
Yup, I was trying to land both offloading metrics and worker-side stats at once.

Also, this PR spun off the metrics redesign request (#44008); I'd appreciate your views there so we can start the redesign PR against the RFC.

Looking forward to more such contributions!

Signed-off-by: Sriusa4414@gmail.com Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com> Signed-off-by: srinivas_oo7 <sklinkedin0120@gmail.com> Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com> Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Or Ozeri <oro@il.ibm.com>

Signed-off-by: Sriusa4414@gmail.com Signed-off-by: srinivas_oo7 <Sriusa4414@gmail.com> Signed-off-by: srinivas_oo7 <sklinkedin0120@gmail.com> Signed-off-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com> Co-authored-by: Srinivasoo7 <158864704+Srinivasoo7@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: divineearthly <divineearthly@gmail.com>


…HPU scheduler, ngram proposer and offloading connector tests to upstream API drift (#1556)

## Bug 1: Forward throttle_prefills in HPUAsyncScheduler.schedule
- **State machine id**: hpu_async_scheduler_schedule_positional_arg
- **Commit**: 957ba4d
### Root cause
vLLM PR #44558 added a throttle_prefills positional arg to Scheduler.schedule(); EngineCore calls it positionally but the HPU override only accepted self.
### Upstream PR
vllm-project/vllm#44558
### Fix
Accept throttle_prefills (default False) on the HPUAsyncScheduler.schedule override and forward it to super().schedule().

## Bug 2: Pass num_speculative_tokens to NgramProposer.propose
- **State machine id**: ngram_proposer_propose_missing_positional_arg
- **Commit**: 82155ea
### Root cause
vLLM PR #32374 (Dynamic SD) added a leading num_speculative_tokens positional arg to NgramProposer.propose().
### Upstream PR
vllm-project/vllm#32374
### Fix
Prepend self.speculative_config.num_speculative_tokens in propose_ngram_draft_token_ids to match the new upstream signature.

## Bug 3: Align OffloadingConnector stats tests with upstream flat-metrics API
- **State machine id**: offloading_connector_cpu_to_gpu_metrics_missing
- **Commit**: c1eb9e3
### Root cause
vLLM PR #35669 rewrote OffloadingConnectorStats to a self-describing {types, data} flat-metric payload, dropping the per-direction CPU_to_GPU/GPU_to_CPU list shape the tests still asserted.
### Upstream PR
vllm-project/vllm#35669
### Fix
Rewrite test_metrics.py to exercise increase_counter/observe_histogram/aggregate/reduce/reset against the new self-describing stats contract.

## Bug 4: Align OffloadingConnector scheduler flush assertions with upstream defer-on-finish
- **State machine id**: offloading_connector_flush_on_finish_deferred
- **Commit**: 575a178
### Root cause
vLLM commit f428718ffe (PR #45823, "Defer on_request_finished until in-flight transfers drain") changed OffloadingConnectorScheduler: a finishing request with in-flight store jobs no longer flushes those stores immediately; finalization is deferred until transfers drain, and flush now fires only on preemption or block reuse. test_concurrent_lookups_of_the_same_prefix and test_abort_loading_requests still asserted flush-on-finish, so they failed once the target vLLM SHA picked up #45823.
### Upstream PR
vllm-project/vllm#45823
### Fix
Drop the stale expected_flushed_gpu_block_indexes assertions in the two affected tests (matching upstream's own equivalents, which assert no flush in these scenarios). test_request_preemption keeps its flush-on-preemption assertion, which upstream still honors.

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
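The Bug 1 fix follows a common pattern for keeping a subclass override compatible when the parent gains a positional argument. The classes below are hypothetical stand-ins, not the vLLM or HPU schedulers, and the default value on the parent is an assumption.

```python
class Scheduler:
    """Hypothetical stand-in for the upstream scheduler after the signature change."""

    def schedule(self, throttle_prefills: bool = False) -> str:
        return f"scheduled(throttle_prefills={throttle_prefills})"


class HPUAsyncSchedulerBefore(Scheduler):
    def schedule(self):  # old override: only accepts self
        return super().schedule()


class HPUAsyncSchedulerAfter(Scheduler):
    def schedule(self, throttle_prefills: bool = False):
        # Accept the new positional argument and forward it unchanged.
        return super().schedule(throttle_prefills)
```

A caller that passes the argument positionally, as EngineCore does, gets a `TypeError` from the old override and the intended behavior from the new one.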


Srinivasoo7 requested review from ApostaC, NickLucche and orozery as code owners March 1, 2026 17:30

mergify Bot added v1 kv-connector labels Mar 1, 2026

gemini-code-assist Bot reviewed Mar 1, 2026


Srinivasoo7 force-pushed the feature/offloading-manager-stats branch from c13d578 to 944208b March 3, 2026 01:40

This was referenced Mar 4, 2026

feat(kv-offload): Strategy A — StoreReusedOffloadingManager gates CPU stores on reuse frequency #35342

Merged

feat(kv-offload): Strategy B — AdaptiveOffloadingPolicy + non-blocking loads #35343

Closed

Srinivasoo7 force-pushed the feature/offloading-manager-stats branch from 944208b to 6c0ce3a March 17, 2026 00:15

orozery reviewed Mar 17, 2026


Comment thread vllm/distributed/kv_transfer/kv_connector/v1/offloading_connector.py Outdated

Srinivasoo7 force-pushed the feature/offloading-manager-stats branch from e53557c to 6c0ce3a March 18, 2026 00:21

mergify Bot added the needs-rebase label Mar 18, 2026

Srinivasoo7 force-pushed the feature/offloading-manager-stats branch from 48c4e97 to 5d75841 March 19, 2026 13:27

mergify Bot removed the needs-rebase label Mar 19, 2026

gemini-code-assist Bot reviewed Mar 19, 2026


gemini-code-assist Bot reviewed Mar 20, 2026


orozery requested changes Mar 23, 2026


orozery added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2026

srinivas_oo7 and others added 7 commits June 9, 2026 20:54

fix offloading metrics checks

bda7f85

Signed-off-by: srinivas_oo7 <sklinkedin0120@gmail.com>

Merge branch 'main' into feature/offloading-manager-stats

9be0ff1

Emit empty stats from worker-side to

d798b1b

Signed-off-by: Or Ozeri <oro@il.ibm.com>

Remove cast

672ec9b

Signed-off-by: Or Ozeri <oro@il.ibm.com>

Serialize metric metadata

36ef276

Signed-off-by: Or Ozeri <oro@il.ibm.com>

Group constants

7ac500a

Signed-off-by: Or Ozeri <oro@il.ibm.com>

Add tests

68bd1cb

Signed-off-by: Or Ozeri <oro@il.ibm.com>

orozery added ready ONLY add when PR is ready to merge/full CI is needed and removed ready ONLY add when PR is ready to merge/full CI is needed labels Jun 10, 2026

Trigger CI

7435534

Signed-off-by: Or Ozeri <oro@il.ibm.com>

orozery approved these changes Jun 10, 2026


Merge branch 'main' into feature/offloading-manager-stats

212a431

Merge branch 'main' into feature/offloading-manager-stats

71dfce6

orozery enabled auto-merge (squash) June 10, 2026 10:55

orozery merged commit 9dfc313 into vllm-project:main Jun 10, 2026
77 checks passed

github-project-automation Bot moved this from Backlog to Done in Prometheus Metrics Jun 10, 2026

pawel-olejniczak mentioned this pull request Jun 19, 2026

[FIX_FOR_VLLM_CUSTOM=ecf9d83520eb217401b47d8a5451a27c5231b8c2] Adapt HPU scheduler, ngram proposer and offloading connector tests to upstream API drift vllm-project/vllm-gaudi#1556

Merged

Change72 mentioned this pull request Jul 1, 2026

[KV Offload] Expose SimpleCPU offload metrics #41790

Open

3 participants

Srinivasoo7 commented Mar 1, 2026 •

edited by github-actions Bot

Loading