Add TurboQuant KV cache compression to serving pipeline by sunghajung6688 · Pull Request #33 · hw-native-sys/pypto-serving

sunghajung6688 · 2026-06-11T15:39:52Z

Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.

coderabbitai · 2026-06-11T15:40:05Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fdef9aa5-ec15-4352-b745-0dddbe7e27f7

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR implements comprehensive TurboQuant KV-cache quantization across the PyPTO serving stack. It introduces quantization types and compression algorithms, extends the KV cache manager with quantization support, adds TQ variants of CPU and NPU executors with fused kernels, integrates TQ into the CLI and serving pipelines, and provides analysis tools for validation and debugging.

Changes

TurboQuant KV-Cache Quantization Feature

Layer / File(s)	Summary
TurboQuant types and core compression `python/core/types.py`, `python/core/turboquant/__init__.py`, `python/core/turboquant/compressor.py`, `python/core/turboquant/lloyd_max.py`	Introduces `KvQuantConfig` dataclass and TurboQuant compression module with rotation matrix generation using QR decomposition, bit-packing quantizer for sub-8-bit indices, and Lloyd-Max optimal codebook solver for Gaussian distributions.
KV Cache Manager quantization support `python/core/kv_cache.py`	Extends `KvCacheManager` with lazy `KVCompressor` loading, quantized tensor allocation/materialization, per-request quantization state tracking, and helpers for compressed block metadata and effective bit-width computation.
Serving CLI TurboQuant integration `python/cli/main.py`	Adds `--kv-quant`, `--kv-quant-key-bits`, `--kv-quant-value-bits`, `--kv-quant-residual-window`, and protected-layer flags; conditionally constructs `KvQuantConfig` and passes it into `RuntimeConfig`.
CPU executor quantization `examples/model/qwen3_14b/runner/cpu_executor.py`	Adds tensor dumping infrastructure to `CpuModelExecutor` and introduces `CpuTqModelExecutor` with lazy codebook/rotation-matrix initialization, quantize/dequantize primitives (UINT8 indices + FP32 norms), and dequantized-KV attention.
CPU generation with TurboQuant and modes `examples/model/qwen3_14b/cpu_generate.py`	Refactors generation script with optional prompts, `--tq` flag for TurboQuant, `--compare` for FP-vs-TQ comparison, interactive REPL mode, and helper functions (`run_single`, `run_interactive`, `run_compare`, `create_engine`).
NPU executor TQ kernel compilation `examples/model/qwen3_14b/runner/npu_executor.py`	Extends executor to compile `prefill_tq` and `decode_tq` kernels when `kv_quant_config.enabled`, allocating rotation matrices, decode codebook, QJL projection matrices, and quantized K/V cache side-caches.
NPU runner TQ execution `examples/model/qwen3_14b/runner/npu_runner.py`	Updates `run_prefill` and `run_decode` to branch between TQ fused kernels (materializing quantized KV) and standard BF16 paged-KV paths; refactors L2 argument orchestration and KV page syncing.
NPU generation and serving config `examples/model/qwen3_14b/npu_generate.py`, `examples/model/qwen3_14b/npu_serving.json`	Adds `--kv-quant*` CLI flags and `--dump-dir` for NPU generation; conditionally constructs `KvQuantConfig`; updates serving config weight_dtype to bfloat16.
Comparison and verification tools `examples/model/qwen3_14b/compare_dump.py`, `examples/model/qwen3_14b/compare_npu_dump.py`, `examples/model/qwen3_14b/verify_tq_prefill.py`	Implements utilities for analyzing TurboQuant behavior: PT tensor comparison (cosine/MSE/MAE/error metrics), NPU KV-cache dequantization and per-layer comparison, and CPU vs. NPU quantization index verification.
Lloyd-Max solver tests `tests/test_lloyd_max.py`	Validates Lloyd-Max codebook computation by comparing numerical-integration and analytical Gaussian CDF solvers across configurations, checking MSE within tolerance and codebook properties.
Documentation `README.md`	Adds TurboQuant usage sections for offline mode (`--kv-quant` with flag table) and online serving mode (`kv_quant` JSON config); includes benchmark and common CLI flags references.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related issues

[Feature] TurboQuant NPU Kernel Integration #20: The PR directly implements end-to-end TurboQuant KV-quantization support for both CPU and NPU executors with fused kernels, addressing the NPU TurboQuant kernel integration requirements described in the issue.

Possibly related PRs

hw-native-sys/pypto-serving#24: Overlaps in refactoring python/core/kv_cache.py KV cache materialization paths and modifying NPU runner KV allocation logic.
hw-native-sys/pypto-serving#21: Both modify python/cli/main.py serving CLI and config construction to wire KV-quantization settings into runtime configuration.
hw-native-sys/pypto-serving#14: Both update NPU executor/runner kernel compilation and prefill/decode execution dispatch, with the main PR adding TQ kernel paths alongside existing fused kernel wiring.

Poem

🐰 Quantize and compress, the caches now glow,
Eight bits or four bits—watch the KV flow,
Rotation and codebook, a dance in the code,
TurboQuant's magic lighter the load!
From CPU to NPU, the path shines so bright,
Compare and verify—the rabbits got it right! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 74.31% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: integrating TurboQuant KV cache compression into the serving pipeline, which is the core purpose of this PR.
Description check	✅ Passed	The description is directly related to the changeset, outlining the integration of TurboQuant module, extensions to kv_cache.py, NPU runner hooks, preemption handling, and CLI configuration updates.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces TurboQuant (TQ) KV cache compression for the Qwen3-14B model, implementing both CPU and NPU execution paths, along with offline/online verification tools and Lloyd-Max quantization codebooks. The code review identified two critical bugs: an out-of-bounds indexing issue in multi-batch quantization due to incorrect tensor allocation size in KvCacheManager.register_model, and a missing padding issue in the NPU decode path (run_decode) that could cause out-of-bounds memory accesses on the NPU. Additionally, performance optimizations were suggested to simplify redundant casting chains in the CPU executor's quantization and dequantization functions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-11T15:42:30Z

+            quant_key_indices = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, config.head_dim,
+                dtype=torch.uint8, device=runtime.device,
+            )
+            quant_key_norms = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, 1,
+                dtype=torch.float32, device=runtime.device,
+            )
+            quant_val_indices = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, config.head_dim,
+                dtype=torch.uint8, device=runtime.device,
+            )
+            quant_val_norms = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, 1,
+                dtype=torch.float32, device=runtime.device,
+            )


Critical Bug: Out-of-Bounds Indexing in Multi-Batch Quantization

In KvCacheManager.register_model, the quantization tensors (quant_key_indices, quant_key_norms, quant_val_indices, quant_val_norms) are allocated with max_blocks_per_seq in the second dimension instead of num_pages.

Since num_pages is max_batch_size * max_blocks_per_seq, whenever max_batch_size > 1, num_pages is strictly greater than max_blocks_per_seq. Any physical page index allocated to a request that is >= max_blocks_per_seq will cause an IndexError (out of bounds) when accessed, crashing the serving worker.

To fix this, allocate these tensors with num_pages in the second dimension, matching the allocation of key_pages.

Suggested change

quant_key_indices = torch.zeros(

config.num_hidden_layers, max_blocks_per_seq,

config.num_key_value_heads, runtime.page_size, config.head_dim,

dtype=torch.uint8, device=runtime.device,

)

quant_key_norms = torch.zeros(

config.num_hidden_layers, max_blocks_per_seq,

config.num_key_value_heads, runtime.page_size, 1,

dtype=torch.float32, device=runtime.device,

)

quant_val_indices = torch.zeros(

config.num_hidden_layers, max_blocks_per_seq,

config.num_key_value_heads, runtime.page_size, config.head_dim,

dtype=torch.uint8, device=runtime.device,

)

quant_val_norms = torch.zeros(

config.num_hidden_layers, max_blocks_per_seq,

config.num_key_value_heads, runtime.page_size, 1,

dtype=torch.float32, device=runtime.device,

)

quant_key_indices = torch.zeros(

config.num_hidden_layers, num_pages,

config.num_key_value_heads, runtime.page_size, config.head_dim,

dtype=torch.uint8, device=runtime.device,

)

quant_key_norms = torch.zeros(

config.num_hidden_layers, num_pages,

config.num_key_value_heads, runtime.page_size, 1,

dtype=torch.float32, device=runtime.device,

)

quant_val_indices = torch.zeros(

config.num_hidden_layers, num_pages,

config.num_key_value_heads, runtime.page_size, config.head_dim,

dtype=torch.uint8, device=runtime.device,

)

quant_val_norms = torch.zeros(

config.num_hidden_layers, num_pages,

config.num_key_value_heads, runtime.page_size, 1,

dtype=torch.float32, device=runtime.device,

)

gemini-code-assist · 2026-06-11T15:42:31Z

+        if use_tq:
+            # TQ decode path: hybrid resident (BF16) + compressed (UINT8) attention.
+            hidden = decode_inputs.hidden
+            quant_k, quant_v, quant_k_norms, quant_v_norms = (
+                self._kv_cache_manager.materialize_full_layer_quant_cache(model_id)
+            )
+            # Build compressed block tracking tensors.
+            cmp_block_table, cmp_num_blocks = self._kv_cache_manager.get_compressed_block_info(
+                model_id, batch,
+            )

-        # Pad active inputs up to the fixed kernel batch by replicating row 0.
-        def _pad_rows(active: torch.Tensor, rows_each: int) -> torch.Tensor:
-            view = active.reshape(actual_batch, rows_each)
-            padded = view[0:1].expand(kernel_batch - actual_batch, rows_each)
-            return torch.cat([view, padded], dim=0).reshape(-1).contiguous()
-
-        hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16)
-        hidden[:actual_batch] = decode_inputs.hidden
-        hidden[actual_batch:] = decode_inputs.hidden[0:1]
-        hidden = hidden.share_memory_()
-        seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_()
-        block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_()
-        slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_()
-
-        # Padded block_table / slot_mapping only ever reference row 0's
-        # already-valid pages, so bound-check exactly what the kernel will read.
-        self._validate_kv_cache_bounds(model, block_table, slot_mapping, k_cache)
-
-        logits_padded = compiled.decode_logits_buffer  # full [kernel_batch, vocab]; trimmed below
-        # Argument order MUST match decode_layer.decode_fwd (PAGED):
-        #   hidden_states, input_rms_weight, wq, wk, wv, q_norm_weight,
-        #   k_norm_weight, seq_lens, block_table, slot_mapping, rope_cos, rope_sin,
-        #   k_cache, v_cache, wo, w_gate, w_up, w_down, post_rms_weight,
-        #   final_norm_weight, lm_head_weight, out.
-        self._run_l2_program(
-            compiled.decode,
-            hidden,
-            self._l2_child_tensor(rt, dw["decode_input_rms_weight"]),
-            self._l2_child_tensor(rt, dw["decode_wq"]),
-            self._l2_child_tensor(rt, dw["decode_wk"]),
-            self._l2_child_tensor(rt, dw["decode_wv"]),
-            self._l2_child_tensor(rt, dw["decode_q_norm_weight"]),
-            self._l2_child_tensor(rt, dw["decode_k_norm_weight"]),
-            seq_lens,
-            block_table,
-            slot_mapping,
-            self._l2_child_tensor(rt, compiled.rope_cos),
-            self._l2_child_tensor(rt, compiled.rope_sin),
-            k_cache,
-            v_cache,
-            self._l2_child_tensor(rt, dw["decode_wo"]),
-            self._l2_child_tensor(rt, dw["decode_w_gate"]),
-            self._l2_child_tensor(rt, dw["decode_w_up"]),
-            self._l2_child_tensor(rt, dw["decode_w_down"]),
-            self._l2_child_tensor(rt, dw["decode_post_rms_weight"]),
-            self._l2_child_tensor(rt, compiled.final_norm_weight),
-            self._l2_child_tensor(rt, compiled.padded_lm_head_weight),
-            logits_padded,
-        )
+            # decode_tq outputs logits directly (includes rms_lm_head).
+            decode_logits_out = compiled.decode_logits_buffer
+
+            self._run_l2_program(
+                compiled.decode_tq,
+                hidden,
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_input_rms_weight"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wq"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wk"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wv"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_q_norm_weight"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_k_norm_weight"]),
+                decode_inputs.seq_lens,
+                decode_inputs.block_table,
+                decode_inputs.slot_mapping,
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_cos),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_sin),
+                quant_k,
+                quant_v,
+                quant_k_norms,
+                quant_v_norms,
+                cmp_block_table,
+                cmp_num_blocks,
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rot_matrices),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.tq_codebook),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.qjl_matrices),
+                compiled.quant_k_signs_cache,
+                compiled.qjl_norms_cache,
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wo"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_post_rms_weight"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_gate"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_up"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_down"]),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.final_norm_weight),
+                self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.padded_lm_head_weight),
+                decode_logits_out,
+            )
+            logits_padded = decode_logits_out
+        else:
+            # Standard BF16 decode path (device-resident, fixed-batch padded).
+            kv_cache = self._kv_caches.get(model_id)
+            if kv_cache is None:
+                raise RuntimeError(f"KV cache for model {model_id!r} is not initialized")
+            k_cache = kv_cache.key_pages
+            v_cache = kv_cache.value_pages
+
+            # Pad active inputs up to the fixed kernel batch by replicating row 0.
+            def _pad_rows(active: torch.Tensor, rows_each: int) -> torch.Tensor:
+                view = active.reshape(actual_batch, rows_each)
+                padded = view[0:1].expand(kernel_batch - actual_batch, rows_each)
+                return torch.cat([view, padded], dim=0).reshape(-1).contiguous()
+
+            hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16)
+            hidden[:actual_batch] = decode_inputs.hidden
+            hidden[actual_batch:] = decode_inputs.hidden[0:1]
+            hidden = hidden.share_memory_()
+            seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_()
+            block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_()
+            slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_()


Critical Bug: Missing Padding in TurboQuant Decode Path

In run_decode, the use_tq path does not pad hidden, seq_lens, block_table, slot_mapping, cmp_block_table, and cmp_num_blocks to kernel_batch (which is max_batch_size).

Since the compiled decode_tq kernel expects fixed-batch inputs of size kernel_batch, passing tensors of size actual_batch (when actual_batch < kernel_batch) will cause out-of-bounds memory accesses on the NPU, leading to silent data corruption, incorrect logits, or device hangs.

To fix this, define _pad_rows once before the conditional blocks, and use it to pad all inputs (including cmp_block_table and cmp_num_blocks) to kernel_batch in the use_tq path.

# Pad active inputs up to the fixed kernel batch by replicating row 0. def _pad_rows(active: torch.Tensor, rows_each: int) -> torch.Tensor: view = active.reshape(actual_batch, rows_each) padded = view[0:1].expand(kernel_batch - actual_batch, rows_each) return torch.cat([view, padded], dim=0).reshape(-1).contiguous() if use_tq: # TQ decode path: hybrid resident (BF16) + compressed (UINT8) attention. hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16) hidden[:actual_batch] = decode_inputs.hidden hidden[actual_batch:] = decode_inputs.hidden[0:1] hidden = hidden.share_memory_() seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_() block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_() slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_() quant_k, quant_v, quant_k_norms, quant_v_norms = ( self._kv_cache_manager.materialize_full_layer_quant_cache(model_id) ) # Build compressed block tracking tensors. cmp_block_table, cmp_num_blocks = self._kv_cache_manager.get_compressed_block_info( model_id, batch, ) cmp_block_table = _pad_rows(cmp_block_table, max_blocks).to(torch.int32).share_memory_() cmp_num_blocks = _pad_rows(cmp_num_blocks, 1).to(torch.int32).share_memory_() # decode_tq outputs logits directly (includes rms_lm_head). decode_logits_out = compiled.decode_logits_buffer self._run_l2_program( compiled.decode_tq, hidden, self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_input_rms_weight"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wq"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wk"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wv"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_q_norm_weight"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_k_norm_weight"]), seq_lens, block_table, slot_mapping, self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_cos), self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_sin), quant_k, quant_v, quant_k_norms, quant_v_norms, cmp_block_table, cmp_num_blocks, self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rot_matrices), self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.tq_codebook), self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.qjl_matrices), compiled.quant_k_signs_cache, compiled.qjl_norms_cache, self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wo"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_post_rms_weight"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_gate"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_up"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_down"]), self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.final_norm_weight), self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.padded_lm_head_weight), decode_logits_out, ) logits_padded = decode_logits_out else: # Standard BF16 decode path (device-resident, fixed-batch padded). kv_cache = self._kv_caches.get(model_id) if kv_cache is None: raise RuntimeError(f"KV cache for model {model_id!r} is not initialized") k_cache = kv_cache.key_pages v_cache = kv_cache.value_pages hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16) hidden[:actual_batch] = decode_inputs.hidden hidden[actual_batch:] = decode_inputs.hidden[0:1] hidden = hidden.share_memory_() seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_() block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_() slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_()

gemini-code-assist · 2026-06-11T15:42:31Z

+        indices = torch.searchsorted(boundaries, x_rot)
+        indices_u8 = indices.to(torch.int32).to(torch.float16).to(torch.uint8)


Performance: Redundant and Inefficient Casting Chain

In _tq_quantize, the casting chain indices.to(torch.int32).to(torch.float16).to(torch.uint8) is redundant and inefficient on CPU. It allocates multiple intermediate tensors and performs unnecessary conversions.

Since indices is an integer tensor returned by torch.searchsorted, you can cast it directly to torch.uint8.

Suggested change

indices = torch.searchsorted(boundaries, x_rot)

indices_u8 = indices.to(torch.int32).to(torch.float16).to(torch.uint8)

indices = torch.searchsorted(boundaries, x_rot)

indices_u8 = indices.to(torch.uint8)

gemini-code-assist · 2026-06-11T15:42:31Z

+        idx_int32 = quant_indices.to(torch.float16).to(torch.int32)
+        y_hat = centroids[idx_int32.long()]  # [N, HEAD_DIM] in rotated space


Performance: Redundant and Inefficient Casting Chain

In _tq_dequantize, the casting chain quant_indices.to(torch.float16).to(torch.int32) followed by .long() is redundant and inefficient on CPU.

Since quant_indices is a uint8 tensor, you can cast it directly to torch.long (or .long()) to index centroids.

Suggested change

idx_int32 = quant_indices.to(torch.float16).to(torch.int32)

y_hat = centroids[idx_int32.long()] # [N, HEAD_DIM] in rotated space

y_hat = centroids[quant_indices.long()] # [N, HEAD_DIM] in rotated space

coderabbitai

Actionable comments posted: 20

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

🟡 Minor comments (4)

README.md-152-171 (1)

152-171: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

JSON config example is missing protected_layers and protected_bits fields.

The offline TurboQuant documentation (lines 104-105) describes --kv-quant-protected-layers and --kv-quant-protected-bits parameters, but the online JSON example (lines 160-165) omits these fields. If these parameters are supported in online mode, include them in the JSON example for completeness. If they're offline-only, clarify that limitation in line 154.
📝 Suggested addition to JSON example
   "kv_quant": {
     "enabled": true,
     "key_bits": 4,
     "value_bits": 2,
-    "residual_window": 128
+    "residual_window": 128,
+    "protected_layers": 4,
+    "protected_bits": 8
   }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 152 - 171, The README's online JSON example under the
kv_quant section is missing the protected_layers and protected_bits fields
referenced for offline mode; update the kv_quant example to include
"protected_layers" (array of layer indices) and "protected_bits" (integer or
array) with brief default/example values, or explicitly state in the kv_quant
section that --kv-quant-protected-layers and --kv-quant-protected-bits are
offline-only if they are not supported online, so readers know whether to add
those fields to the JSON or not.

examples/model/qwen3_14b/verify_tq_prefill.py-36-40 (1)

36-40: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard zero-norm vectors in cpu_quantize to prevent NaN-based index artifacts.

If norm == 0, normalization yields NaNs and the resulting indices become meaningless for verification output.

Proposed fix

 def cpu_quantize(fp_vec, rot, codebook):
     norm = np.linalg.norm(fp_vec)
+    if norm < 1e-12:
+        return np.zeros(fp_vec.shape[0], dtype=np.uint8), 0.0
     rotated = fp_vec @ rot
     unit = rotated / norm
     indices = np.argmin(np.abs(unit[:, None] - codebook[None, :]), axis=-1).astype(np.uint8)
     return indices, norm

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/verify_tq_prefill.py` around lines 36 - 40, In
cpu_quantize, guard against division by zero when computing unit = rotated /
norm: detect zero norms from fp_vec (norm == 0) and handle them before dividing
(e.g., set unit for those vectors to a safe value such as zeros or use a tiny
epsilon to avoid NaN), then compute indices = np.argmin(...) on the safe unit
and return the original norm (0) for those entries; update the logic around
norm, rotated, unit, and indices to ensure no NaNs propagate into the
verification output.

examples/model/qwen3_14b/cpu_generate.py-166-168 (1)

166-168: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard the overlap percentage for empty generations.

If both runs return zero tokens, total becomes 0 and Line 168 raises ZeroDivisionError, which makes compare mode crash on an otherwise valid edge case.

Proposed fix

     common = len(set(result_tq.token_ids) & set(result_fp.token_ids))
     total = max(len(result_tq.token_ids), len(result_fp.token_ids))
-    print(f"\nToken overlap: {common}/{total} ({common / total * 100:.1f}%)")
+    if total == 0:
+        print("\nToken overlap: n/a (both generations were empty)")
+    else:
+        print(f"\nToken overlap: {common}/{total} ({common / total * 100:.1f}%)")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/cpu_generate.py` around lines 166 - 168, Guard the
overlap percentage calculation against division by zero: compute common as
set(result_tq.token_ids) & set(result_fp.token_ids) and total as
max(len(result_tq.token_ids), len(result_fp.token_ids)), then if total == 0
print a suitable message (e.g. "Token overlap: 0/0 (0.0%)" or "N/A") instead of
performing common / total; modify the block around variables common, total and
the print call in cpu_generate.py to handle the empty-generation case.

examples/model/qwen3_14b/runner/npu_executor.py-1044-1049 (1)

1044-1049: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Populate _L2Callable.name in this helper.

This constructor call omits the required name field, so the first caller of _compile_l2_callable_from_jit() will fail with a TypeError instead of returning a usable callable.

🐛 Proposed fix

         return _L2Callable(
             chip_callable=chip_callable,
+            name=name,
             runtime_name=runtime_name,
             block_dim=int(runtime_config.get("block_dim", 24)),
             aicpu_thread_num=int(runtime_config.get("aicpu_thread_num", 4)),
             param_infos=tuple(param_infos),
         )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/model/qwen3_14b/runner/npu_executor.py` around lines 1044 - 1049,
The _L2Callable constructor call is missing the required name parameter causing
_compile_l2_callable_from_jit() to raise a TypeError; update the call that
builds _L2Callable (the block creating chip_callable, runtime_name, block_dim,
aicpu_thread_num, param_infos) to pass a meaningful name (e.g.,
name=runtime_name or fallback to chip_callable.__name__/str(runtime_name)) so
the created object has its name set and callers of
_compile_l2_callable_from_jit() receive a usable callable.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/model/qwen3_14b/compare_dump.py`:
- Line 164: The print statement uses an unnecessary f-string which triggers Ruff
F541; locate the line containing print(f"\n  Top-10 lowest cosine similarity:")
in examples/model/qwen3_14b/compare_dump.py and remove the f prefix so it
becomes a normal string (print("\n  Top-10 lowest cosine similarity:")),
ensuring no interpolation is used.

In `@examples/model/qwen3_14b/compare_npu_dump.py`:
- Around line 160-162: The code builds layers from num_layers and layer_filter
but never validates layer_filter, so an out-of-range value will trigger
IndexError later when indexing fp_k; add an explicit bounds check after
computing num_layers (using fp_k.shape[0]) that if layer_filter is not None it
must be an int within 0 <= layer_filter < num_layers, otherwise raise a clear
ValueError (or print an error and exit) before forming layers = range(...) or
[layer_filter]; apply the same validation wherever layers are used later (the
block around the code referenced at lines 175-177) to ensure no downstream
indexing occurs with an invalid layer index.
- Around line 141-158: Check existence of every file before any np.load calls:
verify fp_k_path and fp_v_path exist before loading fp_k/fp_v, and verify
tq_qk_path, tq_qv_path, tq_kn_path, and tq_vn_path exist before loading
tq_qk/tq_qv/tq_kn/tq_vn; if any required file is missing, print a single skip
message (e.g., f"  [skip] {tag}: missing KV dump") and return instead of
attempting to np.load, using the existing variables fp_k_path, fp_v_path,
tq_qk_path, tq_qv_path, tq_kn_path, tq_vn_path to locate checks around the
current load logic.
- Around line 271-276: The fallback to a hardcoded num_layers=40 is unsafe;
instead, derive num_layers from the available cache files in fp_dir by globbing
all per-layer k/v cache files (e.g. "*_k_cache.npy" and/or "*_v_cache.npy") and
computing the appropriate layer count from their shapes (for example using the
max .shape[0] across matches). Update the logic around fp_sample and num_layers
(referencing fp_dir, fp_sample, and the "prefill_step0000_k_cache.npy" pattern)
to collect all matching cache files, compute num_layers from their loaded
shapes, and only fall back to a safer default after confirming no cache files
exist (and log a clear warning if you must).

In `@examples/model/qwen3_14b/cpu_generate.py`:
- Around line 183-186: The num_layers_override flag is only passed to
CpuTqModelExecutor, so CpuModelExecutor (and other FP-path constructors
referenced by CpuModelExecutor usage at the other similar blocks) ignores the
layer-limit; update the FP constructors to accept and forward
num_layers_override as CpuModelExecutor(...,
num_layers_override=num_layers_override) (and ensure any other instantiations in
the same file that construct FP executors in compare/single-FP paths also
receive the flag) so both TQ and FP executors honor the CLI layer-limit
consistently.
- Around line 141-157: The compare routine uses the same model_id for both
engines causing the FP engine to be queried with the TQ model name; update
run_compare to accept two model ids (e.g., model_id_tq, model_id_fp) or
otherwise pass distinct ids when calling engine_tq.generate_result and
engine_fp.generate_result so engine_tq uses the TQ model id and engine_fp uses
the FP model id; update all call sites of run_compare accordingly and ensure the
registered FP model name ("qwen3-14b-fp-cpu") is passed to
engine_fp.generate_result while the TQ name is passed to
engine_tq.generate_result.

In `@examples/model/qwen3_14b/npu_generate.py`:
- Around line 494-503: The CLI currently allows arbitrary kv-quant bit sizes but
the NPU kernels in runner/npu_executor.py are hard-coded for 4-bit (int4)
quantization; add a fail-fast validation before constructing KvQuantConfig: when
args.kv_quant is true, check args.kv_quant_key_bits and args.kv_quant_value_bits
are exactly 4 and if not raise/exit with a clear error message (or
argparse.ArgumentType/ArgumentParser error) explaining that only 4-bit KV
quantization is supported by the NPU executor; keep constructing KvQuantConfig
only after this validation so runner/npu_executor.py's contract is preserved.

In `@examples/model/qwen3_14b/runner/cpu_executor.py`:
- Around line 446-458: run_decode fails when called before any prefill because
it does not initialize TurboQuant state; before iterating layers (and before
calling _layer_decode_tq) call the same initializers used in run_prefill —
specifically invoke _ensure_codebook() and _ensure_rot_matrices() (so _codebook
and _rot_matrices[layer_idx] are populated) on the executor instance (or
model/runtime as appropriate) so decode-first flows honor the base executor
contract and _layer_decode_tq can safely access those members.
- Around line 784-799: The code currently sets alloc.tokens_used =
max(alloc.tokens_used, start_token + seq_len) even when the inner loop breaks
early due to running out of pages, causing unwritten slots to be treated as
valid; fix by tracking how many tokens were actually written (e.g., introduce a
local counter like written_count and increment it for each tok_offset that
completes the per-head writes) and then set alloc.tokens_used =
max(alloc.tokens_used, start_token + written_count) instead of using seq_len;
reference alloc.tokens_used, start_token, seq_len, tok_offset and alloc.page_ids
and ensure this change prevents _read_full_quant_cache() from reading unwritten
slots.

In `@examples/model/qwen3_14b/runner/npu_executor.py`:
- Around line 395-399: In _compile_model the variable prefill_block_table_stride
is undefined and causes a crash; update the call that builds prefill_tq (the
call to self._compile_prefill_tq_callable using qwen3_prefill_tq.prefill_fwd_tq)
to pass the already-computed max_blocks_per_seq instead of
prefill_block_table_stride so the block-table stride uses max_blocks_per_seq
(locate usages around prefill_tq and where max_blocks_per_seq is computed and
substitute the undefined symbol accordingly).

In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 261-265: The TQ branch (when use_tq is true) must reject
chunked/resumed prefills because prefill_tq only supports
single-chunk/full-prefix prefill; add an explicit guard before the quantized KV
write (the block assigning quant_k, quant_v, quant_k_norms, quant_v_norms) that
checks chunk_offsets and chunk_lens and raises a clear exception (e.g.,
ValueError) if they indicate a resumed/chunked prefill; reference the use_tq
conditional and the prefill_tq call to ensure this guard runs only for the TQ
path and prevents silent wrong-cache writes.
- Around line 772-776: The annotation KvAllocation used in the method
_sync_all_kv_pages_from_npu is undefined in this module; add an import for
KvAllocation from its defining module (or wrap the annotation in quotes to make
it a forward reference) so Ruff F821 stops failing—locate the KvAllocation
symbol referenced in _sync_all_kv_pages_from_npu and either add "from <module>
import KvAllocation" at the top of the file or change the signature to def
_sync_all_kv_pages_from_npu(self, model_id: str, alloc: "KvAllocation") -> None.

In `@examples/model/qwen3_14b/verify_tq_prefill.py`:
- Line 63: Remove the unnecessary f-string prefixes on the non-interpolated
print statements (e.g., the print call that currently reads print(f"\n--- SCALES
---") and the similar one at line 72) to satisfy Ruff F541; locate these print
statements in verify_tq_prefill.py and change them to plain string prints
(print("\n--- SCALES ---") and the other analogous print) so they are not
treated as formatted strings.

In `@python/cli/main.py`:
- Around line 84-89: Validate the TurboQuant CLI flags before constructing
KvQuantConfig: ensure kv-quant-key-bits and kv-quant-value-bits are in the range
1..8 and that kv-quant-residual-window, kv-quant-protected-layers, and
kv-quant-protected-bits are >= 0 (or adjust protected-bits minimum to 1 if
TurboQuantCompressor requires it); perform this validation after parsing args
and call parser.error(...) with a clear message on invalid values (or use
argparse choices/type constraints) so invalid inputs are rejected before
KvQuantConfig/TurboQuantCompressor are created; apply the same checks for the
duplicate flag block referenced at lines 133-143 as well.

In `@python/core/kv_cache.py`:
- Around line 133-140: The type name KvQuantConfig used by the _CachePool
attribute kv_quant_config is not imported, causing F821 and runtime failures;
fix by adding an explicit import for KvQuantConfig at the top-level imports of
this module (import the actual module where KvQuantConfig is defined, e.g., add
a from <defining_module> import KvQuantConfig) so that _CachePool and the
kv_quant_config: KvQuantConfig | None annotation resolve at runtime; ensure the
import is not only inside TYPE_CHECKING so runtime evaluation of the annotation
succeeds.
- Around line 246-268: The quantized-page tensors quant_key_indices,
quant_key_norms, quant_val_indices, and quant_val_norms are currently sized with
max_blocks_per_seq but must be sized with the total num_pages so they can be
indexed by physical alloc.page_ids (used by get_compressed_block_info); change
their second dimension from max_blocks_per_seq to num_pages (keeping other dims,
dtypes, and device the same) so accesses using alloc.page_ids in multi-request
batches cannot index out of range or overlap other requests.

In `@tests/test_lloyd_max.py`:
- Around line 48-79: The test defines a duplicate solver solve_lloyd_max_new
instead of using the production implementation; delete the local
solve_lloyd_max_new and import and call the canonical solver
(python.core.turboquant.lloyd_max.solve_lloyd_max) in the test, replacing all
references to solve_lloyd_max_new with that imported function so tests exercise
the shipped implementation rather than a re-implementation in tests.
- Line 127: The failing Ruff F541 errors are caused by unnecessary f-string
prefixes on literal-only print statements (e.g., the print call showing "  PASS:
new MSE <= old MSE (within 0.5%)"); remove the leading f from these string
literals so they are plain string literals instead of f-strings; update all
occurrences in tests/test_lloyd_max.py (the prints around the PASS messages at
the referenced lines) to use plain quotes to satisfy Ruff and keep existing text
unchanged.
- Around line 107-120: quantization_mse currently samples extreme [-1e10,1e10]
which makes pdf_vals underflow; update quantization_mse to integrate over
finite, sigma-dependent supports per interval (use boundaries already computed)
and sample each interval [a,b] with xs = torch.linspace(a, b, n_pts,
dtype=torch.float64) where a,b come from boundaries (or clamp a,b to e.g. mean ±
K*sigma), compute dx = (b - a) / (n_pts - 1) and accumulate mse_i = ((xs -
cs[i])**2 * pdf_vals).sum() * dx (not divide by n_pts), or replace numerical
sampling by analytical Gaussian interval integrals if preferred; ensure you
reference function quantization_mse, variables cs, boundaries, sigma, n_pts when
making the change.
- Around line 82-185: The file defines test_case(d, bits) which pytest will
collect as a test with undefined fixtures and currently only prints PASS/FAIL;
convert it to a proper parametrized pytest test by: rename or keep test_case but
add `@pytest.mark.parametrize`("d,bits", [...]) using the test_configs list
instead of the main loop; replace all printed PASS/FAIL checks (MSE comparison,
monotonicity, symmetry) with assertions (use solve_lloyd_max_old,
solve_lloyd_max_new, quantization_mse, and the diffs_signed/sym_diff
computations) so failures raise AssertionError; remove or disable the standalone
main() invocation so pytest controls execution. Ensure imports include pytest
and that timing/printing is optional (can remain but not required for
assertions).

---

Minor comments:
In `@examples/model/qwen3_14b/cpu_generate.py`:
- Around line 166-168: Guard the overlap percentage calculation against division
by zero: compute common as set(result_tq.token_ids) & set(result_fp.token_ids)
and total as max(len(result_tq.token_ids), len(result_fp.token_ids)), then if
total == 0 print a suitable message (e.g. "Token overlap: 0/0 (0.0%)" or "N/A")
instead of performing common / total; modify the block around variables common,
total and the print call in cpu_generate.py to handle the empty-generation case.

In `@examples/model/qwen3_14b/runner/npu_executor.py`:
- Around line 1044-1049: The _L2Callable constructor call is missing the
required name parameter causing _compile_l2_callable_from_jit() to raise a
TypeError; update the call that builds _L2Callable (the block creating
chip_callable, runtime_name, block_dim, aicpu_thread_num, param_infos) to pass a
meaningful name (e.g., name=runtime_name or fallback to
chip_callable.__name__/str(runtime_name)) so the created object has its name set
and callers of _compile_l2_callable_from_jit() receive a usable callable.

In `@examples/model/qwen3_14b/verify_tq_prefill.py`:
- Around line 36-40: In cpu_quantize, guard against division by zero when
computing unit = rotated / norm: detect zero norms from fp_vec (norm == 0) and
handle them before dividing (e.g., set unit for those vectors to a safe value
such as zeros or use a tiny epsilon to avoid NaN), then compute indices =
np.argmin(...) on the safe unit and return the original norm (0) for those
entries; update the logic around norm, rotated, unit, and indices to ensure no
NaNs propagate into the verification output.

In `@README.md`:
- Around line 152-171: The README's online JSON example under the kv_quant
section is missing the protected_layers and protected_bits fields referenced for
offline mode; update the kv_quant example to include "protected_layers" (array
of layer indices) and "protected_bits" (integer or array) with brief
default/example values, or explicitly state in the kv_quant section that
--kv-quant-protected-layers and --kv-quant-protected-bits are offline-only if
they are not supported online, so readers know whether to add those fields to
the JSON or not.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9159bd11-16d1-478a-98ea-853d56892e56

📥 Commits

Reviewing files that changed from the base of the PR and between 36fbdb8 and 99cc8c2.

📒 Files selected for processing (19)

.claude/skills
.claude/skills
README.md
examples/model/qwen3_14b/compare_dump.py
examples/model/qwen3_14b/compare_npu_dump.py
examples/model/qwen3_14b/cpu_generate.py
examples/model/qwen3_14b/npu_generate.py
examples/model/qwen3_14b/npu_serving.json
examples/model/qwen3_14b/runner/cpu_executor.py
examples/model/qwen3_14b/runner/npu_executor.py
examples/model/qwen3_14b/runner/npu_runner.py
examples/model/qwen3_14b/verify_tq_prefill.py
python/cli/main.py
python/core/kv_cache.py
python/core/turboquant/__init__.py
python/core/turboquant/compressor.py
python/core/turboquant/lloyd_max.py
python/core/types.py
tests/test_lloyd_max.py

coderabbitai · 2026-06-11T15:56:09Z

+    # Find worst entries
+    if items:
+        worst = sorted(items, key=lambda x: x[1])[:10]
+        print(f"\n  Top-10 lowest cosine similarity:")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the unnecessary f-string prefix to satisfy Ruff F541.

This string has no interpolation, so Ruff flags F541 and the lint gate fails.

Proposed fix

- print(f"\n Top-10 lowest cosine similarity:") + print("\n Top-10 lowest cosine similarity:")

As per coding guidelines, Ruff is configured with line length 110, target-version = "py310", F checks enabled, and F841 ignored.

🧰 Tools

🪛 Ruff (0.15.15)

[error] 164-164: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/compare_dump.py` at line 164, The print statement uses an unnecessary f-string which triggers Ruff F541; locate the line containing print(f"\n Top-10 lowest cosine similarity:") in examples/model/qwen3_14b/compare_dump.py and remove the f prefix so it becomes a normal string (print("\n Top-10 lowest cosine similarity:")), ensuring no interpolation is used.

Sources: Coding guidelines, Linters/SAST tools

coderabbitai · 2026-06-11T15:56:09Z

+    if not fp_k_path.exists():
+        print(f"  [skip] {tag}: FP k_cache not found")
+        return
+    fp_k = np.load(fp_k_path)   # [layers, kv_heads, page_size, head_dim]
+    fp_v = np.load(fp_v_path)
+
+    # ---- Load TQ KV cache ----
+    tq_qk_path = tq_dir / f"{tag}_quant_k.npy"
+    tq_qv_path = tq_dir / f"{tag}_quant_v.npy"
+    tq_kn_path = tq_dir / f"{tag}_quant_k_norm.npy"
+    tq_vn_path = tq_dir / f"{tag}_quant_v_norm.npy"
+    if not tq_qk_path.exists():
+        print(f"  [skip] {tag}: TQ quant_k not found")
+        return
+    tq_qk = np.load(tq_qk_path)
+    tq_qv = np.load(tq_qv_path)
+    tq_kn = np.load(tq_kn_path)
+    tq_vn = np.load(tq_vn_path)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate all required dump files before loading.

Only k_cache/quant_k presence is checked, but v_cache, quant_v, and norm files are loaded unguarded. Missing any of them causes a hard crash.

Proposed fix

- if not fp_k_path.exists(): - print(f" [skip] {tag}: FP k_cache not found") + required_fp = [fp_k_path, fp_v_path] + missing_fp = [p.name for p in required_fp if not p.exists()] + if missing_fp: + print(f" [skip] {tag}: missing FP files: {missing_fp}") return @@ - if not tq_qk_path.exists(): - print(f" [skip] {tag}: TQ quant_k not found") + required_tq = [tq_qk_path, tq_qv_path, tq_kn_path, tq_vn_path] + missing_tq = [p.name for p in required_tq if not p.exists()] + if missing_tq: + print(f" [skip] {tag}: missing TQ files: {missing_tq}") return

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/compare_npu_dump.py` around lines 141 - 158, Check existence of every file before any np.load calls: verify fp_k_path and fp_v_path exist before loading fp_k/fp_v, and verify tq_qk_path, tq_qv_path, tq_kn_path, and tq_vn_path exist before loading tq_qk/tq_qv/tq_kn/tq_vn; if any required file is missing, print a single skip message (e.g., f" [skip] {tag}: missing KV dump") and return instead of attempting to np.load, using the existing variables fp_k_path, fp_v_path, tq_qk_path, tq_qv_path, tq_kn_path, tq_vn_path to locate checks around the current load logic.

coderabbitai · 2026-06-11T15:56:09Z

+    num_layers = fp_k.shape[0]
+    layers = range(num_layers) if layer_filter is None else [layer_filter]
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate --layer bounds before indexing layer tensors.

A user-provided out-of-range layer currently raises IndexError during dequantization/comparison.

Proposed fix

num_layers = fp_k.shape[0] - layers = range(num_layers) if layer_filter is None else [layer_filter] + if layer_filter is not None and not (0 <= layer_filter < num_layers): + print(f" [skip] {tag}: layer {layer_filter} out of range (0..{num_layers - 1})") + return + layers = range(num_layers) if layer_filter is None else [layer_filter]

Also applies to: 175-177

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/compare_npu_dump.py` around lines 160 - 162, The code builds layers from num_layers and layer_filter but never validates layer_filter, so an out-of-range value will trigger IndexError later when indexing fp_k; add an explicit bounds check after computing num_layers (using fp_k.shape[0]) that if layer_filter is not None it must be an int within 0 <= layer_filter < num_layers, otherwise raise a clear ValueError (or print an error and exit) before forming layers = range(...) or [layer_filter]; apply the same validation wherever layers are used later (the block around the code referenced at lines 175-177) to ensure no downstream indexing occurs with an invalid layer index.

coderabbitai · 2026-06-11T15:56:09Z

+    fp_sample = sorted(fp_dir.glob("prefill_step0000_k_cache.npy"))
+    if fp_sample:
+        num_layers = np.load(fp_sample[0]).shape[0]
+    else:
+        num_layers = 40
+        print(f"Warning: could not detect num_layers, assuming {num_layers}")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid hardcoded num_layers=40; infer from available cache files.

Falling back to 40 can produce wrong rotation-matrix sizing and downstream indexing failures for other model configs.

Proposed fix

- fp_sample = sorted(fp_dir.glob("prefill_step0000_k_cache.npy")) + fp_sample = sorted(fp_dir.glob(f"{args.phase}_step*_k_cache.npy")) or sorted(fp_dir.glob("*_k_cache.npy")) if fp_sample: num_layers = np.load(fp_sample[0]).shape[0] else: - num_layers = 40 - print(f"Warning: could not detect num_layers, assuming {num_layers}") + print("Error: could not detect num_layers from FP cache dumps.") + sys.exit(1)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/compare_npu_dump.py` around lines 271 - 276, The fallback to a hardcoded num_layers=40 is unsafe; instead, derive num_layers from the available cache files in fp_dir by globbing all per-layer k/v cache files (e.g. "*_k_cache.npy" and/or "*_v_cache.npy") and computing the appropriate layer count from their shapes (for example using the max .shape[0] across matches). Update the logic around fp_sample and num_layers (referencing fp_dir, fp_sample, and the "prefill_step0000_k_cache.npy" pattern) to collect all matching cache files, compute num_layers from their loaded shapes, and only fall back to a safer default after confirming no cache files exist (and log a clear warning if you must).

coderabbitai · 2026-06-11T15:56:09Z

+def run_compare(
+    engine_tq: LLMEngine,
+    engine_fp: LLMEngine,
+    model_id: str,
+    prompt: str,
+    config: GenerateConfig,
+) -> None:
+    """Run same prompt on both TQ and FP, print comparison."""
+    print(f"\nPrompt: {prompt}")
+    print("-" * 60)
+
+    t0 = time.perf_counter()
+    result_tq = engine_tq.generate_result(model_id, prompt, config)
+    dt_tq = time.perf_counter() - t0
+
+    t0 = time.perf_counter()
+    result_fp = engine_fp.generate_result(model_id, prompt, config)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pass separate model IDs to the two engines in compare mode.

Lines 246-249 register the FP engine under "qwen3-14b-fp-cpu", but Lines 153 and 157 query both engines with the single model_id argument. In compare mode that means the FP engine is asked for the TQ model ID, which will either fail lookup or hit the wrong model.

Proposed fix

def run_compare( engine_tq: LLMEngine, + model_id_tq: str, engine_fp: LLMEngine, - model_id: str, + model_id_fp: str, prompt: str, config: GenerateConfig, ) -> None: @@ - result_tq = engine_tq.generate_result(model_id, prompt, config) + result_tq = engine_tq.generate_result(model_id_tq, prompt, config) @@ - result_fp = engine_fp.generate_result(model_id, prompt, config) + result_fp = engine_fp.generate_result(model_id_fp, prompt, config)

- run_compare(engine_tq, engine_fp, args.model_id, prompt, config) + run_compare(engine_tq, args.model_id, engine_fp, "qwen3-14b-fp-cpu", prompt, config)

Also applies to: 241-252

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/cpu_generate.py` around lines 141 - 157, The compare routine uses the same model_id for both engines causing the FP engine to be queried with the TQ model name; update run_compare to accept two model ids (e.g., model_id_tq, model_id_fp) or otherwise pass distinct ids when calling engine_tq.generate_result and engine_fp.generate_result so engine_tq uses the TQ model id and engine_fp uses the FP model id; update all call sites of run_compare accordingly and ensure the registered FP model name ("qwen3-14b-fp-cpu") is passed to engine_fp.generate_result while the TQ name is passed to engine_tq.generate_result.

coderabbitai · 2026-06-11T15:56:10Z

+            # Compressed storage: non-packed uint8, one byte per element.
+            # For int4 (n_levels=16), each element stores values 0-15 in uint8.
+            # Shape: [num_layers, max_seq_pages, num_kv_heads, page_size, head_dim]
+            quant_key_indices = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, config.head_dim,
+                dtype=torch.uint8, device=runtime.device,
+            )
+            quant_key_norms = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, 1,
+                dtype=torch.float32, device=runtime.device,
+            )
+            quant_val_indices = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, config.head_dim,
+                dtype=torch.uint8, device=runtime.device,
+            )
+            quant_val_norms = torch.zeros(
+                config.num_hidden_layers, max_blocks_per_seq,
+                config.num_key_value_heads, runtime.page_size, 1,
+                dtype=torch.float32, device=runtime.device,
+            )


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Size the quantized page tensors with num_pages, not max_blocks_per_seq.

These buffers are later addressed with physical alloc.page_ids via get_compressed_block_info(), but alloc.page_ids come from the global block pool in [0, num_pages). As soon as max_batch_size > 1, page IDs beyond max_blocks_per_seq - 1 can point past the quant cache or into another request's data.

Suggested fix

- quant_key_indices = torch.zeros( - config.num_hidden_layers, max_blocks_per_seq, + quant_key_indices = torch.zeros( + config.num_hidden_layers, num_pages, config.num_key_value_heads, runtime.page_size, config.head_dim, dtype=torch.uint8, device=runtime.device, ) quant_key_norms = torch.zeros( - config.num_hidden_layers, max_blocks_per_seq, + config.num_hidden_layers, num_pages, config.num_key_value_heads, runtime.page_size, 1, dtype=torch.float32, device=runtime.device, ) quant_val_indices = torch.zeros( - config.num_hidden_layers, max_blocks_per_seq, + config.num_hidden_layers, num_pages, config.num_key_value_heads, runtime.page_size, config.head_dim, dtype=torch.uint8, device=runtime.device, ) quant_val_norms = torch.zeros( - config.num_hidden_layers, max_blocks_per_seq, + config.num_hidden_layers, num_pages, config.num_key_value_heads, runtime.page_size, 1, dtype=torch.float32, device=runtime.device, )

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/core/kv_cache.py` around lines 246 - 268, The quantized-page tensors quant_key_indices, quant_key_norms, quant_val_indices, and quant_val_norms are currently sized with max_blocks_per_seq but must be sized with the total num_pages so they can be indexed by physical alloc.page_ids (used by get_compressed_block_info); change their second dimension from max_blocks_per_seq to num_pages (keeping other dims, dtypes, and device the same) so accesses using alloc.page_ids in multi-request batches cannot index out of range or overlap other requests.

coderabbitai · 2026-06-11T15:56:10Z

+def solve_lloyd_max_new(d: int, bits: int, max_iter: int = 200, tol: float = 1e-10):
+    """Optimized implementation: analytical Gaussian conditional mean."""
+    n_levels = 2 ** bits
+    sigma = 1.0 / math.sqrt(d)
+    lo, hi = -3.5 * sigma, 3.5 * sigma
+    centroids = torch.linspace(lo, hi, n_levels + 2)[1:-1].double()
+
+    for _ in range(max_iter):
+        boundaries = (centroids[:-1] + centroids[1:]) / 2.0
+        edges = torch.zeros(n_levels + 1, dtype=torch.float64)
+        edges[0] = lo * 3
+        edges[1:-1] = boundaries
+        edges[-1] = hi * 3
+
+        z = edges / sigma
+        phi_z = (-0.5 * z * z).exp() * (1.0 / math.sqrt(2.0 * math.pi))
+        cdf_z = 0.5 * (1.0 + torch.erf(z * (1.0 / math.sqrt(2.0))))
+
+        phi_diff = phi_z[:-1] - phi_z[1:]
+        cdf_diff = cdf_z[1:] - cdf_z[:-1]
+        valid = cdf_diff > 1e-15
+        new_centroids = torch.where(
+            valid,
+            sigma * phi_diff / cdf_diff.clamp(min=1e-15),
+            centroids,
+        )
+        max_shift = (new_centroids - centroids).abs().max().item()
+        centroids = new_centroids
+        if max_shift < tol:
+            break
+
+    return centroids


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Import the production solver instead of re-implementing it in the test.

solve_lloyd_max_new() is a second copy of the algorithm, not a call to python.core.turboquant.lloyd_max.solve_lloyd_max(). That means this file can stay green while the shipped solver regresses or simply behaves differently.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_lloyd_max.py` around lines 48 - 79, The test defines a duplicate solver solve_lloyd_max_new instead of using the production implementation; delete the local solve_lloyd_max_new and import and call the canonical solver (python.core.turboquant.lloyd_max.solve_lloyd_max) in the test, replacing all references to solve_lloyd_max_new with that imported function so tests exercise the shipped implementation rather than a re-implementation in tests.

coderabbitai · 2026-06-11T15:56:10Z

+def test_case(d: int, bits: int):
+    n_levels = 2 ** bits
+    print(f"\n{'='*60}")
+    print(f"d={d}, bits={bits}, n_levels={n_levels}")
+    print(f"{'='*60}")
+
+    t0 = time.perf_counter()
+    c_old = solve_lloyd_max_old(d, bits)
+    dt_old = (time.perf_counter() - t0) * 1000
+
+    t0 = time.perf_counter()
+    c_new = solve_lloyd_max_new(d, bits)
+    dt_new = (time.perf_counter() - t0) * 1000
+
+    diff = (c_new - c_old).abs()
+    sigma = 1.0 / math.sqrt(d)
+
+    print(f"  old centroids ({dt_old:.1f} ms): {c_old.tolist()}")
+    print(f"  new centroids ({dt_new:.1f} ms): {c_new.tolist()}")
+    print(f"  max  abs diff: {diff.max().item():.2e}  (sigma={sigma:.4f})")
+    print(f"  mean abs diff: {diff.mean().item():.2e}")
+    print(f"  speedup: {dt_old / max(dt_new, 0.01):.1f}x")
+
+    # Verify: new centroids are the better quantizer (lower MSE for Gaussian)
+    # MSE = sum_{i} integral_{b_i}^{b_{i+1}} (x - c_i)^2 * f(x) dx
+    def quantization_mse(centroids_in):
+        """Compute MSE of a scalar quantizer for N(0, sigma^2)."""
+        cs = sorted(centroids_in.tolist())
+        boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10]
+        total_mse = 0.0
+        n_pts = 10000
+        for i in range(len(cs)):
+            a, b = boundaries[i], boundaries[i+1]
+            xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
+            pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
+            mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
+            total_mse += mse_i.item()
+        dx = (boundaries[-1] - boundaries[0]) / n_pts  # approximate
+        return total_mse / n_pts  # normalized
+
+    mse_old = quantization_mse(c_old)
+    mse_new = quantization_mse(c_new)
+    print(f"  MSE old: {mse_old:.6e}")
+    print(f"  MSE new: {mse_new:.6e}")
+    if mse_new <= mse_old * 1.005:
+        print(f"  PASS: new MSE <= old MSE (within 0.5%)")
+    else:
+        print(f"  FAIL: new MSE > old MSE by {(mse_new/mse_old-1)*100:.2f}%")
+
+    # Verify centroids are monotonically increasing
+    diffs_signed = c_new[1:] - c_new[:-1]
+    if (diffs_signed > 0).all():
+        print(f"  PASS: centroids monotonically increasing")
+    else:
+        print(f"  FAIL: centroids not monotonically increasing")
+
+    # Verify symmetry: centroids should be symmetric around 0
+    mid = n_levels // 2
+    if n_levels % 2 == 0:
+        sym_diff = (c_new[:mid] + c_new[mid:].flip(0)).abs().max().item()
+    else:
+        sym_diff = (c_new[:mid] + c_new[mid+1:].flip(0)).abs().max().item()
+    print(f"  symmetry around 0: max deviation = {sym_diff:.2e}")
+    if sym_diff < 1e-8:
+        print(f"  PASS: symmetric around 0")
+    else:
+        print(f"  WARN: not perfectly symmetric (expected for finite iterations)")
+
+
+def main():
+    all_pass = True
+
+    # Typical Qwen3-14B configurations
+    test_configs = [
+        # (head_dim, bits)
+        (128, 2),   # value bits
+        (128, 4),   # key bits
+        (128, 3),   # alternative
+        (64, 2),    # smaller model
+        (64, 4),
+        (128, 6),   # higher precision
+    ]
+
+    for d, bits in test_configs:
+        test_case(d, bits)
+
+    print(f"\n{'='*60}")
+    print("Summary: compare speed across configs")
+    print(f"{'='*60}")
+    print(f"{'d':>6} {'bits':>5} {'old(ms)':>10} {'new(ms)':>10} {'speedup':>10}")
+    print("-" * 45)
+    for d, bits in test_configs:
+        t0 = time.perf_counter()
+        solve_lloyd_max_old(d, bits)
+        dt_old = (time.perf_counter() - t0) * 1000
+
+        t0 = time.perf_counter()
+        solve_lloyd_max_new(d, bits)
+        dt_new = (time.perf_counter() - t0) * 1000
+        print(f"{d:>6} {bits:>5} {dt_old:>10.1f} {dt_new:>10.1f} {dt_old/max(dt_new,0.01):>9.1f}x")
+
+
+if __name__ == "__main__":
+    main()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash fd -HI 'pytest.ini|pyproject.toml|tox.ini|conftest.py' . rg -n --hidden 'pytest|python -m pytest|tool.pytest|addopts|testpaths' .

Repository: hw-native-sys/pypto-serving

Length of output: 2006

Convert tests/test_lloyd_max.py demo into a real parametrized pytest test (with asserts)

tests/test_lloyd_max.py defines def test_case(d: int, bits: int): at module scope; pytest will treat it as a test (name starts with test_) and interpret d/bits as fixtures unless you parametrize (or otherwise prevent collection).

The current PASS/FAIL logic only prints and never uses assert, so CI won’t fail even if MSE/ordering/symmetry checks are wrong—replace the prints with assert statements and drive d,bits via @pytest.mark.parametrize over test_configs.

🧰 Tools

🪛 Ruff (0.15.15)

[error] 127-127: f-string without any placeholders

Remove extraneous f prefix

(F541)

[error] 134-134: f-string without any placeholders

Remove extraneous f prefix

(F541)

[error] 136-136: f-string without any placeholders

Remove extraneous f prefix

(F541)

[error] 146-146: f-string without any placeholders

Remove extraneous f prefix

(F541)

[error] 148-148: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_lloyd_max.py` around lines 82 - 185, The file defines test_case(d, bits) which pytest will collect as a test with undefined fixtures and currently only prints PASS/FAIL; convert it to a proper parametrized pytest test by: rename or keep test_case but add `@pytest.mark.parametrize`("d,bits", [...]) using the test_configs list instead of the main loop; replace all printed PASS/FAIL checks (MSE comparison, monotonicity, symmetry) with assertions (use solve_lloyd_max_old, solve_lloyd_max_new, quantization_mse, and the diffs_signed/sym_diff computations) so failures raise AssertionError; remove or disable the standalone main() invocation so pytest controls execution. Ensure imports include pytest and that timing/printing is optional (can remain but not required for assertions).

coderabbitai · 2026-06-11T15:56:10Z

+    def quantization_mse(centroids_in):
+        """Compute MSE of a scalar quantizer for N(0, sigma^2)."""
+        cs = sorted(centroids_in.tolist())
+        boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10]
+        total_mse = 0.0
+        n_pts = 10000
+        for i in range(len(cs)):
+            a, b = boundaries[i], boundaries[i+1]
+            xs = torch.linspace(a, b, n_pts, dtype=torch.float64)
+            pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
+            mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum()
+            total_mse += mse_i.item()
+        dx = (boundaries[-1] - boundaries[0]) / n_pts  # approximate
+        return total_mse / n_pts  # normalized


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The MSE check is sampling essentially zero probability mass.

torch.linspace(-1e10, 1e10, 10000) is far too coarse for these Gaussian scales, so pdf_vals underflows to ~0 almost everywhere and mse_old/mse_new stop being meaningful. Use a finite support tied to sigma/the solver edges, or compute the interval integrals analytically.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_lloyd_max.py` around lines 107 - 120, quantization_mse currently samples extreme [-1e10,1e10] which makes pdf_vals underflow; update quantization_mse to integrate over finite, sigma-dependent supports per interval (use boundaries already computed) and sample each interval [a,b] with xs = torch.linspace(a, b, n_pts, dtype=torch.float64) where a,b come from boundaries (or clamp a,b to e.g. mean ± K*sigma), compute dx = (b - a) / (n_pts - 1) and accumulate mse_i = ((xs - cs[i])**2 * pdf_vals).sum() * dx (not divide by n_pts), or replace numerical sampling by analytical Gaussian interval integrals if preferred; ensure you reference function quantization_mse, variables cs, boundaries, sigma, n_pts when making the change.

coderabbitai · 2026-06-11T15:56:10Z

+    print(f"  MSE old: {mse_old:.6e}")
+    print(f"  MSE new: {mse_new:.6e}")
+    if mse_new <= mse_old * 1.005:
+        print(f"  PASS: new MSE <= old MSE (within 0.5%)")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the literal-only f prefixes or Ruff will fail this file.

These strings have no placeholders, so they trip F541 on Lines 127, 134, 136, 146, and 148. Plain string literals are enough here.

As per coding guidelines, Ruff is configured with line length 110, target-version = "py310", with F checks enabled.

Also applies to: 134-136, 146-148

🧰 Tools

🪛 Ruff (0.15.15)

[error] 127-127: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_lloyd_max.py` at line 127, The failing Ruff F541 errors are caused by unnecessary f-string prefixes on literal-only print statements (e.g., the print call showing " PASS: new MSE <= old MSE (within 0.5%)"); remove the leading f from these string literals so they are plain string literals instead of f-strings; update all occurrences in tests/test_lloyd_max.py (the prints around the PASS messages at the referenced lines) to use plain quotes to satisfy Ruff and keep existing text unchanged.

Sources: Coding guidelines, Linters/SAST tools

Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.

gemini-code-assist Bot reviewed Jun 11, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

sunghajung6688 force-pushed the turboquant branch from 99cc8c2 to c1d87d6 Compare June 14, 2026 14:43

		indices = torch.searchsorted(boundaries, x_rot)
		indices_u8 = indices.to(torch.int32).to(torch.float16).to(torch.uint8)

		idx_int32 = quant_indices.to(torch.float16).to(torch.int32)
		y_hat = centroids[idx_int32.long()] # [N, HEAD_DIM] in rotated space

	idx_int32 = quant_indices.to(torch.float16).to(torch.int32)
	y_hat = centroids[idx_int32.long()] # [N, HEAD_DIM] in rotated space
	y_hat = centroids[quant_indices.long()] # [N, HEAD_DIM] in rotated space

		num_layers = fp_k.shape[0]
		layers = range(num_layers) if layer_filter is None else [layer_filter]

Conversation

sunghajung6688 commented Jun 11, 2026

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Critical Bug: Out-of-Bounds Indexing in Multi-Batch Quantization

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Critical Bug: Missing Padding in TurboQuant Decode Path

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Performance: Redundant and Inefficient Casting Chain

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Performance: Redundant and Inefficient Casting Chain

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading