Skip to content

Add KV cache CPU offload support#35

Open
superxf wants to merge 1 commit into
hw-native-sys:mainfrom
superxf:moon_cpu
Open

Add KV cache CPU offload support#35
superxf wants to merge 1 commit into
hw-native-sys:mainfrom
superxf:moon_cpu

Conversation

@superxf

@superxf superxf commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add KV cache CPU offload metadata, transfer jobs, CPU transfer backend, and LRU CPU slot eviction.
  • Wire CPU load/store jobs through scheduler, async engine, serving worker, and Qwen3 NPU runner page views.
  • Add --max-cpu-offload-blocks CLI control; 0 disables CPU offload.
  • Add CLI coverage and an NPU smoke script for CPU load/offload validation.

Validation

  • ruff check --cache-dir /tmp/pypto-ruff-cache --config /data/xufeng/pypto-serving/ruff.toml <changed files> passed.
  • python -m pytest /data/xufeng/pypto-serving/tests/test_cli.py -q passed: 7 passed.
  • python -m pytest /data/xufeng/pypto-serving/tests/test_batching.py -q passed: 9 passed.
  • NPU smoke previously passed for store/load observation with --long-prefill-token-threshold 64.

AI assistance was used for this change.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b66d8a7b-7337-41f6-a3ff-630534a48961

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR implements Phase 0 KV cache CPU offload for the NPU serving engine. It introduces a new kv_offload.py module with transfer data types and copy primitives, extends KvCacheManager with CPU slot tracking and LRU eviction, wires store/load transfer job scheduling into the Scheduler, executes transfers synchronously in WorkerProcess, and exposes a WorkerKVPageView materialization API through the runner/executor hierarchy. A CLI flag and integration test script are also added.

Changes

KV Cache CPU Offload

Layer / File(s) Summary
KV offload data model and copy primitives
python/core/kv_offload.py
Defines KVBlockLocation enum, NPULoadStoreSpec/CPULoadStoreSpec endpoint descriptors, TransferJob/TransferResult dataclasses, WorkerKVPageView for byte-level per-page NPU↔CPU tensor copies, and CPUKvOffloadBackend managing a shared-memory CPU slot pool with synchronous submit/wait.
IPC payload extensions
python/core/types.py
Adds kv_transfer_jobs to WorkerCommand and kv_transfer_results to StepOutput to carry transfer payloads across the engine↔worker boundary.
KvCacheManager CPU offload APIs and LRU eviction
python/core/kv_cache.py
Extends KVCacheBlock with residency fields, adds max_cpu_offload_blocks and CPU slot tracking to KvCacheManager, refactors block allocation to prefer consecutive IDs, and introduces build_cpu_store_job, build_cpu_load_job, complete_transfer_result, and LRU eviction helpers.
Scheduler store/load job scheduling
python/core/scheduler.py
Adds enable_kv_cpu_offload flag to SchedulerConfig, kv_transfer_jobs to SchedulerOutput, emits CPU load jobs on prefix-cache hits for CPU-resident blocks, queues CPU store jobs on block release, and adds complete_transfer_results.
AsyncEngine config fields and step wiring
python/core/async_engine.py
Adds enable_kv_cpu_offload/max_cpu_offload_blocks to EngineConfig, propagates them into KvCacheManager and SchedulerConfig, forwards kv_transfer_jobs in the step command, and calls scheduler.complete_transfer_results after each step.
Worker-side transfer execution
python/core/serving_worker.py
Preallocates shared CPU KV slots on model init, forwards kv_transfer_jobs into _execute_step with early-failure semantics, and implements submit/wait helpers plus lazy CPUKvOffloadBackend construction with capacity validation.
WorkerKVPageView materialization
python/core/model_runner.py, python/core/pypto_executor.py, examples/model/qwen3_14b/runner/npu_runner.py
Extends _KvCachePool with KV dimension metadata, adds materialize_kv_page_view (raises NotImplementedError) to ModelRunner, delegates through PyptoExecutor, and overrides in Qwen314BModelRunner with lazy KV cache init and full WorkerKVPageView construction.
CLI flag, config defaults, and tests
python/cli/main.py, tests/test_cli.py, tests/run_npu_kv_cpu_offload.py
Adds --max-cpu-offload-blocks CLI flag (lowers --long-prefill-token-threshold default to 64), wires it into EngineConfig, adds CLI unit tests for enabled/disabled defaults, and introduces an NPU integration test script that monkey-patches the scheduler to assert NPU→CPU store and CPU→NPU load transfer counts with deterministic output verification.

Sequence Diagram(s)

sequenceDiagram
  participant AsyncEngine
  participant Scheduler
  participant KvCacheManager
  participant WorkerProcess
  participant CPUKvOffloadBackend

  rect rgba(70, 130, 180, 0.5)
    note over AsyncEngine,Scheduler: Per-step schedule phase
    AsyncEngine->>Scheduler: schedule()
    Scheduler->>KvCacheManager: build_cpu_load_job(cpu_block_ids) [prefix-cache hit on CPU block]
    Scheduler->>KvCacheManager: build_cpu_store_job(freed_block_ids) [on request completion]
    Scheduler-->>AsyncEngine: SchedulerOutput(kv_transfer_jobs=[...])
  end

  rect rgba(60, 179, 113, 0.5)
    note over AsyncEngine,CPUKvOffloadBackend: Worker execution phase
    AsyncEngine->>WorkerProcess: WorkerCommand(kv_transfer_jobs=[...])
    WorkerProcess->>CPUKvOffloadBackend: submit(job) for each job
    CPUKvOffloadBackend->>CPUKvOffloadBackend: _copy_npu_to_cpu / _copy_cpu_to_npu
    WorkerProcess->>CPUKvOffloadBackend: wait(job_ids)
    CPUKvOffloadBackend-->>WorkerProcess: list[TransferResult]
    WorkerProcess-->>AsyncEngine: StepOutput(kv_transfer_results=[...])
  end

  rect rgba(210, 105, 30, 0.5)
    note over AsyncEngine,KvCacheManager: Transfer completion phase
    AsyncEngine->>Scheduler: complete_transfer_results(kv_transfer_results)
    Scheduler->>KvCacheManager: complete_transfer_result(result) for each result
    KvCacheManager->>KvCacheManager: update block location/slot/page_id, free CPU slot if NPU load
  end
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

  • [Feature] KV Cache NPU-to-SSD Offload Plan #27: This PR directly implements the Phase 0 KV cache CPU offload feature described in that issue, including the transfer job infrastructure (TransferJob/TransferResult), scheduler store/load job integration, worker-side synchronous execution, and the --max-cpu-offload-blocks CLI flag.

Possibly related PRs

  • hw-native-sys/pypto-serving#24: Both PRs modify examples/model/qwen3_14b/runner/npu_runner.py; the prior PR allocates the key_pages/value_pages tensors on NPU that this PR's materialize_kv_page_view wraps into a WorkerKVPageView.
  • hw-native-sys/pypto-serving#21: Both PRs extend the same CLI argument parsing and EngineConfig construction path in python/cli/main.py and tests/test_cli.py.

Poem

🐇 Hop, hop, the pages fly,
From NPU to CPU they scurry by.
LRU evicts the old with grace,
Store jobs queue, load jobs race.
A TransferResult seals the deal—
My KV cache stays freshly real! 🌿

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.19% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Add KV cache CPU offload support' directly and clearly summarizes the main change—introducing CPU offload functionality for KV cache across the codebase.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly describes the changeset, covering the addition of KV cache CPU offload features, integration points, CLI configuration, and validation steps performed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces KV cache CPU offload support to PyPTO, allowing logical KV blocks to be transferred between NPU pages and CPU slots. It adds a CLI option --max-cpu-offload-blocks, implements the transfer logic and backend in python/core/kv_offload.py, and integrates the offloading mechanism into the scheduler, executor, and serving worker. A critical issue was identified in python/core/kv_cache.py where a transfer failure leaves block residency metadata and CPU slots in a corrupted or leaked state; a robust reversion of block states is suggested to handle failures properly.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/core/kv_cache.py
Comment on lines +401 to +404
if not result.success:
for block in blocks:
block.location = KVBlockLocation.NPU if block.physical_page_id is not None else KVBlockLocation.CPU
raise RuntimeError(result.error or "KV offload transfer failed")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

On transfer failure, the block residency metadata and CPU slots are left in a corrupted state or leaked. Specifically:\n- For Store Jobs (NPU -> CPU): if the transfer fails, the block's cpu_slot_id is not cleared, the CPU slot is leaked, and if ref_cnt == 0, the block is leaked from the free_queue.\n- For Load Jobs (CPU -> NPU): if the transfer fails, the block's location is incorrectly set to KVBlockLocation.NPU and physical_page_id remains set, which will cause the engine to read uninitialized/stale NPU pages.\n\nWe should check the transfer direction and correctly revert the state of the blocks and CPU slots on failure.

        if not result.success:\n            if isinstance(result.src, NPULoadStoreSpec) and isinstance(result.dst, CPULoadStoreSpec):\n                for block in blocks:\n                    block.location = KVBlockLocation.NPU\n                    if block.cpu_slot_id is not None:\n                        self._free_cpu_slots.append(block.cpu_slot_id)\n                        block.cpu_slot_id = None\n                    if block.ref_cnt == 0:\n                        self.free_queue.append(block)\n            elif isinstance(result.src, CPULoadStoreSpec) and isinstance(result.dst, NPULoadStoreSpec):\n                for block in blocks:\n                    block.location = KVBlockLocation.CPU\n                    block.physical_page_id = None\n            else:\n                for block in blocks:\n                    block.location = KVBlockLocation.NPU if block.physical_page_id is not None else KVBlockLocation.CPU\n            raise RuntimeError(result.error or "KV offload transfer failed")

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/core/async_engine.py (1)

275-282: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Transfer failure exceptions propagate uncaught and can crash the engine loop.

complete_transfer_results can raise RuntimeError if a transfer fails (from kv_cache.complete_transfer_result). This exception propagates out of _process_step_output and crashes the engine loop. Consider wrapping this in error handling similar to how step_output.error is handled.

Proposed fix: add error handling
     def _process_step_output(
         self, scheduler_output: SchedulerOutput, step_output: StepOutput
     ) -> None:
         """Process worker results: update scheduler state, push tokens to request queues."""
-        self.scheduler.complete_transfer_results(step_output.kv_transfer_results)
+        try:
+            self.scheduler.complete_transfer_results(step_output.kv_transfer_results)
+        except RuntimeError as e:
+            logger.error(f"KV transfer completion failed: {e}")
+            self._handle_step_error(scheduler_output)
+            return
         request_outputs = self.scheduler.update_from_output(
             scheduler_output, step_output.new_tokens
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/async_engine.py` around lines 275 - 282, The
`complete_transfer_results` call in the `_process_step_output` method can raise
an uncaught `RuntimeError` that crashes the engine loop. Wrap the call to
`self.scheduler.complete_transfer_results(step_output.kv_transfer_results)` in a
try-except block to catch `RuntimeError`, and handle the error gracefully (log
it and continue, or handle it similarly to how `step_output.error` is already
being handled elsewhere in the method) instead of allowing the exception to
propagate and crash the engine.
🧹 Nitpick comments (3)
python/core/kv_offload.py (1)

250-253: 💤 Low value

wait silently ignores unknown job IDs.

If wait is called with a job_id that was never submitted or already retrieved, it's silently skipped. While this may be intentional for a synchronous backend, it could mask bugs where callers pass incorrect job IDs.

Consider logging a warning or raising for unknown IDs during development/debug builds.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/kv_offload.py` around lines 250 - 253, The wait method silently
skips job IDs that are not found in self._completed, which could mask bugs where
callers pass incorrect job IDs. Modify the wait method to log a warning (or
raise an exception for debug builds) when a job_id is not present in
self._completed, rather than silently ignoring it. This will help catch issues
during development where incorrect job IDs are being passed to the method.
tests/test_cli.py (1)

87-95: ⚡ Quick win

Add a regression test for negative --max-cpu-offload-blocks.

Current tests cover enabled/default paths, but not invalid negative input. Add a parser rejection test to lock in the CLI contract once non-negative validation is enforced.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_cli.py` around lines 87 - 95, Add a new test function in
test_cli.py to validate that the argument parser rejects negative values for the
--max-cpu-offload-blocks flag. The test should call _parse_args with a negative
value (e.g., --max-cpu-offload-blocks -1) and assert that an appropriate error
is raised (such as SystemExit or argparse.ArgumentTypeError), ensuring the CLI
contract properly validates non-negative input for this parameter.
python/core/kv_cache.py (1)

340-370: ⚡ Quick win

Consider validating that blocks have ref_cnt == 0 before storing to CPU.

The method allows storing blocks with ref_cnt > 0 to CPU, which could corrupt active request state. While the scheduler currently guards this (line 459 in scheduler.py), adding validation here would prevent misuse from other callers.

Proposed defensive check
 def build_cpu_store_job(self, block_ids: list[int], request_id: str | None = None) -> TransferJob:
     """Build a transfer job that stores resident NPU blocks into CPU slots."""
     blocks = [self.blocks[block_id] for block_id in block_ids]
+    for block in blocks:
+        if block.ref_cnt > 0:
+            raise RuntimeError(f"Cannot offload block {block.block_id} with active references")
     if len(blocks) > self.max_cpu_offload_blocks:
         raise RuntimeError("Insufficient CPU KV offload slots.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/kv_cache.py` around lines 340 - 370, The build_cpu_store_job
method does not validate that blocks have ref_cnt equal to zero before allowing
them to be stored to CPU, which could corrupt active request state. Add a
validation check early in the method after retrieving the blocks to ensure all
blocks in the blocks list have ref_cnt == 0. If any block has a non-zero
ref_cnt, raise a RuntimeError with a descriptive message indicating that blocks
with active references cannot be offloaded to CPU. This defensive check should
be placed before the loop that processes the blocks and pops CPU slots.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cli/main.py`:
- Around line 82-87: The `--max-cpu-offload-blocks` argument parser currently
accepts any integer via `type=int`, including negative values, and then silently
disables offload when negative values are encountered elsewhere in the code.
Instead of allowing invalid input to pass through, add validation to the
argument parser itself to reject negative values at parse time. Use a custom
type function or the choices parameter in the parser.add_argument call for the
max-cpu-offload-blocks argument to ensure only non-negative integers (0 or
greater) are accepted, failing the CLI validation immediately with an
appropriate error message if a negative value is provided.

In `@python/core/kv_cache.py`:
- Around line 395-423: The failure recovery logic in the
complete_transfer_result method incorrectly restores block location when a
transfer fails. Currently it checks only whether physical_page_id is not None,
but for CPU→NPU transfers that fail, the physical_page_id is already set from
build_cpu_load_job yet the data remains on CPU. Replace the simple
physical_page_id check at line 403 with logic that examines the transfer
direction (using result.src and result.dst type checks) to determine whether the
block should be restored to NPU or CPU, ensuring you correctly restore blocks to
their original location before the failed transfer attempt.

In `@python/core/kv_offload.py`:
- Around line 202-214: The ensure_num_cpu_slots method allocates a new tensor
and replaces self.cpu_slots, which breaks shared-memory references held by other
processes (such as forked worker children) that still point to the old tensor.
Fix this by either preventing resizes after initialization by raising an
exception if ensure_num_cpu_slots is called post-fork, or by preallocating
sufficient capacity upfront during initialization based on
max_cpu_offload_blocks so that ensure_num_cpu_slots is never called after the
CPU slots have been shared with child processes. This ensures all processes
reference the same underlying shared-memory pool.

In `@tests/run_npu_kv_cpu_offload.py`:
- Line 40: Add validation to ensure the --max-cpu-offload-blocks argument is a
positive integer. After parsing arguments, check that the parsed
max_cpu_offload_blocks value is greater than 0, and raise an error with a clear
message if it is not. This validation should fail fast before reaching the code
at lines 129-130 where enable_kv_cpu_offload is set to True, preventing
timeout-driven failures when non-positive block counts are provided.

---

Outside diff comments:
In `@python/core/async_engine.py`:
- Around line 275-282: The `complete_transfer_results` call in the
`_process_step_output` method can raise an uncaught `RuntimeError` that crashes
the engine loop. Wrap the call to
`self.scheduler.complete_transfer_results(step_output.kv_transfer_results)` in a
try-except block to catch `RuntimeError`, and handle the error gracefully (log
it and continue, or handle it similarly to how `step_output.error` is already
being handled elsewhere in the method) instead of allowing the exception to
propagate and crash the engine.

---

Nitpick comments:
In `@python/core/kv_cache.py`:
- Around line 340-370: The build_cpu_store_job method does not validate that
blocks have ref_cnt equal to zero before allowing them to be stored to CPU,
which could corrupt active request state. Add a validation check early in the
method after retrieving the blocks to ensure all blocks in the blocks list have
ref_cnt == 0. If any block has a non-zero ref_cnt, raise a RuntimeError with a
descriptive message indicating that blocks with active references cannot be
offloaded to CPU. This defensive check should be placed before the loop that
processes the blocks and pops CPU slots.

In `@python/core/kv_offload.py`:
- Around line 250-253: The wait method silently skips job IDs that are not found
in self._completed, which could mask bugs where callers pass incorrect job IDs.
Modify the wait method to log a warning (or raise an exception for debug builds)
when a job_id is not present in self._completed, rather than silently ignoring
it. This will help catch issues during development where incorrect job IDs are
being passed to the method.

In `@tests/test_cli.py`:
- Around line 87-95: Add a new test function in test_cli.py to validate that the
argument parser rejects negative values for the --max-cpu-offload-blocks flag.
The test should call _parse_args with a negative value (e.g.,
--max-cpu-offload-blocks -1) and assert that an appropriate error is raised
(such as SystemExit or argparse.ArgumentTypeError), ensuring the CLI contract
properly validates non-negative input for this parameter.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4bf0603c-5d38-4dd6-9a5a-f04980eccbcb

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0d8a0 and 81b450c.

📒 Files selected for processing (12)
  • examples/model/qwen3_14b/runner/npu_runner.py
  • python/cli/main.py
  • python/core/async_engine.py
  • python/core/kv_cache.py
  • python/core/kv_offload.py
  • python/core/model_runner.py
  • python/core/pypto_executor.py
  • python/core/scheduler.py
  • python/core/serving_worker.py
  • python/core/types.py
  • tests/run_npu_kv_cpu_offload.py
  • tests/test_cli.py

Comment thread python/cli/main.py
Comment on lines +82 to +87
parser.add_argument(
"--max-cpu-offload-blocks",
type=int,
default=0,
help="Maximum number of KV blocks to keep in CPU offload storage. 0 disables CPU offload (default: 0).",
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject negative --max-cpu-offload-blocks values at parse time.

Line 83 currently accepts any integer, and Lines 123-124 propagate negative values into EngineConfig while silently disabling offload via > 0. This accepts invalid input instead of failing fast on CLI validation.

Proposed fix
+def _non_negative_int(value: str) -> int:
+    v = int(value)
+    if v < 0:
+        raise argparse.ArgumentTypeError("--max-cpu-offload-blocks must be >= 0")
+    return v
+
 def build_parser() -> argparse.ArgumentParser:
@@
     parser.add_argument(
         "--max-cpu-offload-blocks",
-        type=int,
+        type=_non_negative_int,
         default=0,
         help="Maximum number of KV blocks to keep in CPU offload storage. 0 disables CPU offload (default: 0).",
     )

Also applies to: 123-124

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cli/main.py` around lines 82 - 87, The `--max-cpu-offload-blocks`
argument parser currently accepts any integer via `type=int`, including negative
values, and then silently disables offload when negative values are encountered
elsewhere in the code. Instead of allowing invalid input to pass through, add
validation to the argument parser itself to reject negative values at parse
time. Use a custom type function or the choices parameter in the
parser.add_argument call for the max-cpu-offload-blocks argument to ensure only
non-negative integers (0 or greater) are accepted, failing the CLI validation
immediately with an appropriate error message if a negative value is provided.

Comment thread python/core/kv_cache.py
Comment on lines +395 to +423
def complete_transfer_result(self, result: TransferResult) -> None:
"""Apply a completed CPU offload transfer to block residency metadata."""
block_ids = self._pending_transfer_blocks.pop(result.job_id, [])
if not block_ids:
return
blocks = [self.blocks[block_id] for block_id in block_ids]
if not result.success:
for block in blocks:
block.location = KVBlockLocation.NPU if block.physical_page_id is not None else KVBlockLocation.CPU
raise RuntimeError(result.error or "KV offload transfer failed")

if isinstance(result.src, NPULoadStoreSpec) and isinstance(result.dst, CPULoadStoreSpec):
for block, slot_id in zip(blocks, result.dst.slot_ids, strict=True):
block.location = KVBlockLocation.CPU
block.physical_page_id = None
block.cpu_slot_id = slot_id
self._touch_cpu_block(block)
if block.ref_cnt == 0:
self.free_queue.append(block)
elif isinstance(result.src, CPULoadStoreSpec) and isinstance(result.dst, NPULoadStoreSpec):
for block, page_id in zip(blocks, result.dst.page_ids, strict=True):
block.location = KVBlockLocation.NPU
block.physical_page_id = page_id
if block.cpu_slot_id is not None:
self._free_cpu_slots.append(block.cpu_slot_id)
block.cpu_slot_id = None
block.cpu_last_access = 0
else:
raise TypeError("Unsupported KV offload transfer result")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Failure recovery logic incorrectly restores CPU→NPU load jobs to NPU location.

When a CPU→NPU load transfer fails, the block's physical_page_id was already set to block_id in build_cpu_load_job (line 381). The recovery at line 403 checks physical_page_id is not None and incorrectly restores the location to NPU, but the actual data is still on CPU.

Proposed fix: track original location or check transfer direction
         if not result.success:
             for block in blocks:
-                block.location = KVBlockLocation.NPU if block.physical_page_id is not None else KVBlockLocation.CPU
+                # Restore based on transfer direction, not physical_page_id
+                if isinstance(result.src, NPULoadStoreSpec):
+                    # Store job failed: block was on NPU
+                    block.location = KVBlockLocation.NPU
+                else:
+                    # Load job failed: block was on CPU
+                    block.location = KVBlockLocation.CPU
+                    block.physical_page_id = None  # Clear the prematurely set page_id
             raise RuntimeError(result.error or "KV offload transfer failed")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/kv_cache.py` around lines 395 - 423, The failure recovery logic
in the complete_transfer_result method incorrectly restores block location when
a transfer fails. Currently it checks only whether physical_page_id is not None,
but for CPU→NPU transfers that fail, the physical_page_id is already set from
build_cpu_load_job yet the data remains on CPU. Replace the simple
physical_page_id check at line 403 with logic that examines the transfer
direction (using result.src and result.dst type checks) to determine whether the
block should be restored to NPU or CPU, ensuring you correctly restore blocks to
their original location before the failed transfer attempt.

Comment thread python/core/kv_offload.py Outdated
Comment on lines +202 to +214
def ensure_num_cpu_slots(self, num_cpu_slots: int) -> None:
"""Grow the CPU slot pool while preserving existing contents."""

num_cpu_slots = int(num_cpu_slots)
if num_cpu_slots <= self.cpu_slots.shape[0]:
return
new_slots = torch.empty(
(num_cpu_slots, self.page_view.page_size_bytes),
dtype=torch.uint8,
device="cpu",
).share_memory_()
new_slots[: self.cpu_slots.shape[0]].copy_(self.cpu_slots)
self.cpu_slots = new_slots

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Resizing shared-memory pool breaks cross-process references.

ensure_num_cpu_slots allocates a new tensor and copies data, but the old shared-memory tensor reference is discarded. If other processes (e.g., forked chip children) hold references to the original cpu_slots, they will not see the resized pool and will continue using stale memory.

From serving_worker.py, the CPU slots are preallocated before forking (_preallocate_kv_offload_cpu_slots), so any post-fork resize would violate the shared-memory contract. Consider either:

  1. Preventing resizes after initialization (raise if called post-init)
  2. Preallocating sufficient capacity upfront based on max_cpu_offload_blocks
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/core/kv_offload.py` around lines 202 - 214, The ensure_num_cpu_slots
method allocates a new tensor and replaces self.cpu_slots, which breaks
shared-memory references held by other processes (such as forked worker
children) that still point to the old tensor. Fix this by either preventing
resizes after initialization by raising an exception if ensure_num_cpu_slots is
called post-fork, or by preallocating sufficient capacity upfront during
initialization based on max_cpu_offload_blocks so that ensure_num_cpu_slots is
never called after the CPU slots have been shared with child processes. This
ensures all processes reference the same underlying shared-memory pool.

Comment thread tests/run_npu_kv_cpu_offload.py Outdated
parser.add_argument("--max-seq-len", type=int, default=512)
parser.add_argument("--max-new-tokens", type=int, default=2)
parser.add_argument("--long-prefill-token-threshold", type=int, default=512)
parser.add_argument("--max-cpu-offload-blocks", type=int, default=2)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast when --max-cpu-offload-blocks is non-positive in this verifier script.

Line 129 forces enable_kv_cpu_offload=True regardless of the value from Line 40. Passing 0 or a negative value can lead to timeout-driven false failures instead of immediate argument validation.

Proposed fix
 async def main() -> None:
     args = _parse_args()
+    if args.max_cpu_offload_blocks <= 0:
+        raise ValueError("--max-cpu-offload-blocks must be > 0 for this offload verification script.")
@@
     engine_config = EngineConfig(
@@
-        enable_kv_cpu_offload=True,
+        enable_kv_cpu_offload=True,
         max_cpu_offload_blocks=args.max_cpu_offload_blocks,
     )

Also applies to: 129-130

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/run_npu_kv_cpu_offload.py` at line 40, Add validation to ensure the
--max-cpu-offload-blocks argument is a positive integer. After parsing
arguments, check that the parsed max_cpu_offload_blocks value is greater than 0,
and raise an error with a clear message if it is not. This validation should
fail fast before reaching the code at lines 129-130 where enable_kv_cpu_offload
is set to True, preventing timeout-driven failures when non-positive block
counts are provided.

@superxf superxf force-pushed the moon_cpu branch 2 times, most recently from c484d5f to 4c9f5d7 Compare June 16, 2026 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant