Add TurboQuant KV cache compression to serving pipeline#33
Add TurboQuant KV cache compression to serving pipeline#33sunghajung6688 wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR implements comprehensive TurboQuant KV-cache quantization across the PyPTO serving stack. It introduces quantization types and compression algorithms, extends the KV cache manager with quantization support, adds TQ variants of CPU and NPU executors with fused kernels, integrates TQ into the CLI and serving pipelines, and provides analysis tools for validation and debugging. ChangesTurboQuant KV-Cache Quantization Feature
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces TurboQuant (TQ) KV cache compression for the Qwen3-14B model, implementing both CPU and NPU execution paths, along with offline/online verification tools and Lloyd-Max quantization codebooks. The code review identified two critical bugs: an out-of-bounds indexing issue in multi-batch quantization due to incorrect tensor allocation size in KvCacheManager.register_model, and a missing padding issue in the NPU decode path (run_decode) that could cause out-of-bounds memory accesses on the NPU. Additionally, performance optimizations were suggested to simplify redundant casting chains in the CPU executor's quantization and dequantization functions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| quant_key_indices = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, config.head_dim, | ||
| dtype=torch.uint8, device=runtime.device, | ||
| ) | ||
| quant_key_norms = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, 1, | ||
| dtype=torch.float32, device=runtime.device, | ||
| ) | ||
| quant_val_indices = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, config.head_dim, | ||
| dtype=torch.uint8, device=runtime.device, | ||
| ) | ||
| quant_val_norms = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, 1, | ||
| dtype=torch.float32, device=runtime.device, | ||
| ) |
There was a problem hiding this comment.
Critical Bug: Out-of-Bounds Indexing in Multi-Batch Quantization
In KvCacheManager.register_model, the quantization tensors (quant_key_indices, quant_key_norms, quant_val_indices, quant_val_norms) are allocated with max_blocks_per_seq in the second dimension instead of num_pages.
Since num_pages is max_batch_size * max_blocks_per_seq, whenever max_batch_size > 1, num_pages is strictly greater than max_blocks_per_seq. Any physical page index allocated to a request that is >= max_blocks_per_seq will cause an IndexError (out of bounds) when accessed, crashing the serving worker.
To fix this, allocate these tensors with num_pages in the second dimension, matching the allocation of key_pages.
| quant_key_indices = torch.zeros( | |
| config.num_hidden_layers, max_blocks_per_seq, | |
| config.num_key_value_heads, runtime.page_size, config.head_dim, | |
| dtype=torch.uint8, device=runtime.device, | |
| ) | |
| quant_key_norms = torch.zeros( | |
| config.num_hidden_layers, max_blocks_per_seq, | |
| config.num_key_value_heads, runtime.page_size, 1, | |
| dtype=torch.float32, device=runtime.device, | |
| ) | |
| quant_val_indices = torch.zeros( | |
| config.num_hidden_layers, max_blocks_per_seq, | |
| config.num_key_value_heads, runtime.page_size, config.head_dim, | |
| dtype=torch.uint8, device=runtime.device, | |
| ) | |
| quant_val_norms = torch.zeros( | |
| config.num_hidden_layers, max_blocks_per_seq, | |
| config.num_key_value_heads, runtime.page_size, 1, | |
| dtype=torch.float32, device=runtime.device, | |
| ) | |
| quant_key_indices = torch.zeros( | |
| config.num_hidden_layers, num_pages, | |
| config.num_key_value_heads, runtime.page_size, config.head_dim, | |
| dtype=torch.uint8, device=runtime.device, | |
| ) | |
| quant_key_norms = torch.zeros( | |
| config.num_hidden_layers, num_pages, | |
| config.num_key_value_heads, runtime.page_size, 1, | |
| dtype=torch.float32, device=runtime.device, | |
| ) | |
| quant_val_indices = torch.zeros( | |
| config.num_hidden_layers, num_pages, | |
| config.num_key_value_heads, runtime.page_size, config.head_dim, | |
| dtype=torch.uint8, device=runtime.device, | |
| ) | |
| quant_val_norms = torch.zeros( | |
| config.num_hidden_layers, num_pages, | |
| config.num_key_value_heads, runtime.page_size, 1, | |
| dtype=torch.float32, device=runtime.device, | |
| ) |
| if use_tq: | ||
| # TQ decode path: hybrid resident (BF16) + compressed (UINT8) attention. | ||
| hidden = decode_inputs.hidden | ||
| quant_k, quant_v, quant_k_norms, quant_v_norms = ( | ||
| self._kv_cache_manager.materialize_full_layer_quant_cache(model_id) | ||
| ) | ||
| # Build compressed block tracking tensors. | ||
| cmp_block_table, cmp_num_blocks = self._kv_cache_manager.get_compressed_block_info( | ||
| model_id, batch, | ||
| ) | ||
|
|
||
| # Pad active inputs up to the fixed kernel batch by replicating row 0. | ||
| def _pad_rows(active: torch.Tensor, rows_each: int) -> torch.Tensor: | ||
| view = active.reshape(actual_batch, rows_each) | ||
| padded = view[0:1].expand(kernel_batch - actual_batch, rows_each) | ||
| return torch.cat([view, padded], dim=0).reshape(-1).contiguous() | ||
|
|
||
| hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16) | ||
| hidden[:actual_batch] = decode_inputs.hidden | ||
| hidden[actual_batch:] = decode_inputs.hidden[0:1] | ||
| hidden = hidden.share_memory_() | ||
| seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_() | ||
| block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_() | ||
| slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_() | ||
|
|
||
| # Padded block_table / slot_mapping only ever reference row 0's | ||
| # already-valid pages, so bound-check exactly what the kernel will read. | ||
| self._validate_kv_cache_bounds(model, block_table, slot_mapping, k_cache) | ||
|
|
||
| logits_padded = compiled.decode_logits_buffer # full [kernel_batch, vocab]; trimmed below | ||
| # Argument order MUST match decode_layer.decode_fwd (PAGED): | ||
| # hidden_states, input_rms_weight, wq, wk, wv, q_norm_weight, | ||
| # k_norm_weight, seq_lens, block_table, slot_mapping, rope_cos, rope_sin, | ||
| # k_cache, v_cache, wo, w_gate, w_up, w_down, post_rms_weight, | ||
| # final_norm_weight, lm_head_weight, out. | ||
| self._run_l2_program( | ||
| compiled.decode, | ||
| hidden, | ||
| self._l2_child_tensor(rt, dw["decode_input_rms_weight"]), | ||
| self._l2_child_tensor(rt, dw["decode_wq"]), | ||
| self._l2_child_tensor(rt, dw["decode_wk"]), | ||
| self._l2_child_tensor(rt, dw["decode_wv"]), | ||
| self._l2_child_tensor(rt, dw["decode_q_norm_weight"]), | ||
| self._l2_child_tensor(rt, dw["decode_k_norm_weight"]), | ||
| seq_lens, | ||
| block_table, | ||
| slot_mapping, | ||
| self._l2_child_tensor(rt, compiled.rope_cos), | ||
| self._l2_child_tensor(rt, compiled.rope_sin), | ||
| k_cache, | ||
| v_cache, | ||
| self._l2_child_tensor(rt, dw["decode_wo"]), | ||
| self._l2_child_tensor(rt, dw["decode_w_gate"]), | ||
| self._l2_child_tensor(rt, dw["decode_w_up"]), | ||
| self._l2_child_tensor(rt, dw["decode_w_down"]), | ||
| self._l2_child_tensor(rt, dw["decode_post_rms_weight"]), | ||
| self._l2_child_tensor(rt, compiled.final_norm_weight), | ||
| self._l2_child_tensor(rt, compiled.padded_lm_head_weight), | ||
| logits_padded, | ||
| ) | ||
| # decode_tq outputs logits directly (includes rms_lm_head). | ||
| decode_logits_out = compiled.decode_logits_buffer | ||
|
|
||
| self._run_l2_program( | ||
| compiled.decode_tq, | ||
| hidden, | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_input_rms_weight"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wq"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wk"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wv"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_q_norm_weight"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_k_norm_weight"]), | ||
| decode_inputs.seq_lens, | ||
| decode_inputs.block_table, | ||
| decode_inputs.slot_mapping, | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_cos), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_sin), | ||
| quant_k, | ||
| quant_v, | ||
| quant_k_norms, | ||
| quant_v_norms, | ||
| cmp_block_table, | ||
| cmp_num_blocks, | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rot_matrices), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.tq_codebook), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.qjl_matrices), | ||
| compiled.quant_k_signs_cache, | ||
| compiled.qjl_norms_cache, | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wo"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_post_rms_weight"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_gate"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_up"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_down"]), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.final_norm_weight), | ||
| self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.padded_lm_head_weight), | ||
| decode_logits_out, | ||
| ) | ||
| logits_padded = decode_logits_out | ||
| else: | ||
| # Standard BF16 decode path (device-resident, fixed-batch padded). | ||
| kv_cache = self._kv_caches.get(model_id) | ||
| if kv_cache is None: | ||
| raise RuntimeError(f"KV cache for model {model_id!r} is not initialized") | ||
| k_cache = kv_cache.key_pages | ||
| v_cache = kv_cache.value_pages | ||
|
|
||
| # Pad active inputs up to the fixed kernel batch by replicating row 0. | ||
| def _pad_rows(active: torch.Tensor, rows_each: int) -> torch.Tensor: | ||
| view = active.reshape(actual_batch, rows_each) | ||
| padded = view[0:1].expand(kernel_batch - actual_batch, rows_each) | ||
| return torch.cat([view, padded], dim=0).reshape(-1).contiguous() | ||
|
|
||
| hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16) | ||
| hidden[:actual_batch] = decode_inputs.hidden | ||
| hidden[actual_batch:] = decode_inputs.hidden[0:1] | ||
| hidden = hidden.share_memory_() | ||
| seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_() | ||
| block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_() | ||
| slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_() |
There was a problem hiding this comment.
Critical Bug: Missing Padding in TurboQuant Decode Path
In run_decode, the use_tq path does not pad hidden, seq_lens, block_table, slot_mapping, cmp_block_table, and cmp_num_blocks to kernel_batch (which is max_batch_size).
Since the compiled decode_tq kernel expects fixed-batch inputs of size kernel_batch, passing tensors of size actual_batch (when actual_batch < kernel_batch) will cause out-of-bounds memory accesses on the NPU, leading to silent data corruption, incorrect logits, or device hangs.
To fix this, define _pad_rows once before the conditional blocks, and use it to pad all inputs (including cmp_block_table and cmp_num_blocks) to kernel_batch in the use_tq path.
# Pad active inputs up to the fixed kernel batch by replicating row 0.
def _pad_rows(active: torch.Tensor, rows_each: int) -> torch.Tensor:
view = active.reshape(actual_batch, rows_each)
padded = view[0:1].expand(kernel_batch - actual_batch, rows_each)
return torch.cat([view, padded], dim=0).reshape(-1).contiguous()
if use_tq:
# TQ decode path: hybrid resident (BF16) + compressed (UINT8) attention.
hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16)
hidden[:actual_batch] = decode_inputs.hidden
hidden[actual_batch:] = decode_inputs.hidden[0:1]
hidden = hidden.share_memory_()
seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_()
block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_()
slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_()
quant_k, quant_v, quant_k_norms, quant_v_norms = (
self._kv_cache_manager.materialize_full_layer_quant_cache(model_id)
)
# Build compressed block tracking tensors.
cmp_block_table, cmp_num_blocks = self._kv_cache_manager.get_compressed_block_info(
model_id, batch,
)
cmp_block_table = _pad_rows(cmp_block_table, max_blocks).to(torch.int32).share_memory_()
cmp_num_blocks = _pad_rows(cmp_num_blocks, 1).to(torch.int32).share_memory_()
# decode_tq outputs logits directly (includes rms_lm_head).
decode_logits_out = compiled.decode_logits_buffer
self._run_l2_program(
compiled.decode_tq,
hidden,
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_input_rms_weight"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wq"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wk"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wv"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_q_norm_weight"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_k_norm_weight"]),
seq_lens,
block_table,
slot_mapping,
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_cos),
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rope_sin),
quant_k,
quant_v,
quant_k_norms,
quant_v_norms,
cmp_block_table,
cmp_num_blocks,
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.rot_matrices),
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.tq_codebook),
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.qjl_matrices),
compiled.quant_k_signs_cache,
compiled.qjl_norms_cache,
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_wo"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_post_rms_weight"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_gate"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_up"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, dw["decode_w_down"]),
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.final_norm_weight),
self._l2_child_tensor(compiled.decode_tq.runtime_name, compiled.padded_lm_head_weight),
decode_logits_out,
)
logits_padded = decode_logits_out
else:
# Standard BF16 decode path (device-resident, fixed-batch padded).
kv_cache = self._kv_caches.get(model_id)
if kv_cache is None:
raise RuntimeError(f"KV cache for model {model_id!r} is not initialized")
k_cache = kv_cache.key_pages
v_cache = kv_cache.value_pages
hidden = torch.zeros((kernel_batch, model.config.hidden_size), dtype=torch.bfloat16)
hidden[:actual_batch] = decode_inputs.hidden
hidden[actual_batch:] = decode_inputs.hidden[0:1]
hidden = hidden.share_memory_()
seq_lens = _pad_rows(decode_inputs.seq_lens, 1).to(torch.int32).share_memory_()
block_table = _pad_rows(decode_inputs.block_table, max_blocks).to(torch.int32).share_memory_()
slot_mapping = _pad_rows(decode_inputs.slot_mapping, 1).to(torch.int32).share_memory_()| indices = torch.searchsorted(boundaries, x_rot) | ||
| indices_u8 = indices.to(torch.int32).to(torch.float16).to(torch.uint8) |
There was a problem hiding this comment.
Performance: Redundant and Inefficient Casting Chain
In _tq_quantize, the casting chain indices.to(torch.int32).to(torch.float16).to(torch.uint8) is redundant and inefficient on CPU. It allocates multiple intermediate tensors and performs unnecessary conversions.
Since indices is an integer tensor returned by torch.searchsorted, you can cast it directly to torch.uint8.
| indices = torch.searchsorted(boundaries, x_rot) | |
| indices_u8 = indices.to(torch.int32).to(torch.float16).to(torch.uint8) | |
| indices = torch.searchsorted(boundaries, x_rot) | |
| indices_u8 = indices.to(torch.uint8) |
| idx_int32 = quant_indices.to(torch.float16).to(torch.int32) | ||
| y_hat = centroids[idx_int32.long()] # [N, HEAD_DIM] in rotated space |
There was a problem hiding this comment.
Performance: Redundant and Inefficient Casting Chain
In _tq_dequantize, the casting chain quant_indices.to(torch.float16).to(torch.int32) followed by .long() is redundant and inefficient on CPU.
Since quant_indices is a uint8 tensor, you can cast it directly to torch.long (or .long()) to index centroids.
| idx_int32 = quant_indices.to(torch.float16).to(torch.int32) | |
| y_hat = centroids[idx_int32.long()] # [N, HEAD_DIM] in rotated space | |
| y_hat = centroids[quant_indices.long()] # [N, HEAD_DIM] in rotated space |
There was a problem hiding this comment.
Actionable comments posted: 20
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (4)
README.md-152-171 (1)
152-171:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winJSON config example is missing
protected_layersandprotected_bitsfields.The offline TurboQuant documentation (lines 104-105) describes
--kv-quant-protected-layersand--kv-quant-protected-bitsparameters, but the online JSON example (lines 160-165) omits these fields. If these parameters are supported in online mode, include them in the JSON example for completeness. If they're offline-only, clarify that limitation in line 154.📝 Suggested addition to JSON example
"kv_quant": { "enabled": true, "key_bits": 4, "value_bits": 2, - "residual_window": 128 + "residual_window": 128, + "protected_layers": 4, + "protected_bits": 8 } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@README.md` around lines 152 - 171, The README's online JSON example under the kv_quant section is missing the protected_layers and protected_bits fields referenced for offline mode; update the kv_quant example to include "protected_layers" (array of layer indices) and "protected_bits" (integer or array) with brief default/example values, or explicitly state in the kv_quant section that --kv-quant-protected-layers and --kv-quant-protected-bits are offline-only if they are not supported online, so readers know whether to add those fields to the JSON or not.examples/model/qwen3_14b/verify_tq_prefill.py-36-40 (1)
36-40:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winGuard zero-norm vectors in
cpu_quantizeto prevent NaN-based index artifacts.If
norm == 0, normalization yields NaNs and the resulting indices become meaningless for verification output.Proposed fix
def cpu_quantize(fp_vec, rot, codebook): norm = np.linalg.norm(fp_vec) + if norm < 1e-12: + return np.zeros(fp_vec.shape[0], dtype=np.uint8), 0.0 rotated = fp_vec @ rot unit = rotated / norm indices = np.argmin(np.abs(unit[:, None] - codebook[None, :]), axis=-1).astype(np.uint8) return indices, norm🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/verify_tq_prefill.py` around lines 36 - 40, In cpu_quantize, guard against division by zero when computing unit = rotated / norm: detect zero norms from fp_vec (norm == 0) and handle them before dividing (e.g., set unit for those vectors to a safe value such as zeros or use a tiny epsilon to avoid NaN), then compute indices = np.argmin(...) on the safe unit and return the original norm (0) for those entries; update the logic around norm, rotated, unit, and indices to ensure no NaNs propagate into the verification output.examples/model/qwen3_14b/cpu_generate.py-166-168 (1)
166-168:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winGuard the overlap percentage for empty generations.
If both runs return zero tokens,
totalbecomes0and Line 168 raisesZeroDivisionError, which makes compare mode crash on an otherwise valid edge case.Proposed fix
common = len(set(result_tq.token_ids) & set(result_fp.token_ids)) total = max(len(result_tq.token_ids), len(result_fp.token_ids)) - print(f"\nToken overlap: {common}/{total} ({common / total * 100:.1f}%)") + if total == 0: + print("\nToken overlap: n/a (both generations were empty)") + else: + print(f"\nToken overlap: {common}/{total} ({common / total * 100:.1f}%)")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/cpu_generate.py` around lines 166 - 168, Guard the overlap percentage calculation against division by zero: compute common as set(result_tq.token_ids) & set(result_fp.token_ids) and total as max(len(result_tq.token_ids), len(result_fp.token_ids)), then if total == 0 print a suitable message (e.g. "Token overlap: 0/0 (0.0%)" or "N/A") instead of performing common / total; modify the block around variables common, total and the print call in cpu_generate.py to handle the empty-generation case.examples/model/qwen3_14b/runner/npu_executor.py-1044-1049 (1)
1044-1049:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPopulate
_L2Callable.namein this helper.This constructor call omits the required
namefield, so the first caller of_compile_l2_callable_from_jit()will fail with aTypeErrorinstead of returning a usable callable.🐛 Proposed fix
return _L2Callable( chip_callable=chip_callable, + name=name, runtime_name=runtime_name, block_dim=int(runtime_config.get("block_dim", 24)), aicpu_thread_num=int(runtime_config.get("aicpu_thread_num", 4)), param_infos=tuple(param_infos), )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/model/qwen3_14b/runner/npu_executor.py` around lines 1044 - 1049, The _L2Callable constructor call is missing the required name parameter causing _compile_l2_callable_from_jit() to raise a TypeError; update the call that builds _L2Callable (the block creating chip_callable, runtime_name, block_dim, aicpu_thread_num, param_infos) to pass a meaningful name (e.g., name=runtime_name or fallback to chip_callable.__name__/str(runtime_name)) so the created object has its name set and callers of _compile_l2_callable_from_jit() receive a usable callable.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/model/qwen3_14b/compare_dump.py`:
- Line 164: The print statement uses an unnecessary f-string which triggers Ruff
F541; locate the line containing print(f"\n Top-10 lowest cosine similarity:")
in examples/model/qwen3_14b/compare_dump.py and remove the f prefix so it
becomes a normal string (print("\n Top-10 lowest cosine similarity:")),
ensuring no interpolation is used.
In `@examples/model/qwen3_14b/compare_npu_dump.py`:
- Around line 160-162: The code builds layers from num_layers and layer_filter
but never validates layer_filter, so an out-of-range value will trigger
IndexError later when indexing fp_k; add an explicit bounds check after
computing num_layers (using fp_k.shape[0]) that if layer_filter is not None it
must be an int within 0 <= layer_filter < num_layers, otherwise raise a clear
ValueError (or print an error and exit) before forming layers = range(...) or
[layer_filter]; apply the same validation wherever layers are used later (the
block around the code referenced at lines 175-177) to ensure no downstream
indexing occurs with an invalid layer index.
- Around line 141-158: Check existence of every file before any np.load calls:
verify fp_k_path and fp_v_path exist before loading fp_k/fp_v, and verify
tq_qk_path, tq_qv_path, tq_kn_path, and tq_vn_path exist before loading
tq_qk/tq_qv/tq_kn/tq_vn; if any required file is missing, print a single skip
message (e.g., f" [skip] {tag}: missing KV dump") and return instead of
attempting to np.load, using the existing variables fp_k_path, fp_v_path,
tq_qk_path, tq_qv_path, tq_kn_path, tq_vn_path to locate checks around the
current load logic.
- Around line 271-276: The fallback to a hardcoded num_layers=40 is unsafe;
instead, derive num_layers from the available cache files in fp_dir by globbing
all per-layer k/v cache files (e.g. "*_k_cache.npy" and/or "*_v_cache.npy") and
computing the appropriate layer count from their shapes (for example using the
max .shape[0] across matches). Update the logic around fp_sample and num_layers
(referencing fp_dir, fp_sample, and the "prefill_step0000_k_cache.npy" pattern)
to collect all matching cache files, compute num_layers from their loaded
shapes, and only fall back to a safer default after confirming no cache files
exist (and log a clear warning if you must).
In `@examples/model/qwen3_14b/cpu_generate.py`:
- Around line 183-186: The num_layers_override flag is only passed to
CpuTqModelExecutor, so CpuModelExecutor (and other FP-path constructors
referenced by CpuModelExecutor usage at the other similar blocks) ignores the
layer-limit; update the FP constructors to accept and forward
num_layers_override as CpuModelExecutor(...,
num_layers_override=num_layers_override) (and ensure any other instantiations in
the same file that construct FP executors in compare/single-FP paths also
receive the flag) so both TQ and FP executors honor the CLI layer-limit
consistently.
- Around line 141-157: The compare routine uses the same model_id for both
engines causing the FP engine to be queried with the TQ model name; update
run_compare to accept two model ids (e.g., model_id_tq, model_id_fp) or
otherwise pass distinct ids when calling engine_tq.generate_result and
engine_fp.generate_result so engine_tq uses the TQ model id and engine_fp uses
the FP model id; update all call sites of run_compare accordingly and ensure the
registered FP model name ("qwen3-14b-fp-cpu") is passed to
engine_fp.generate_result while the TQ name is passed to
engine_tq.generate_result.
In `@examples/model/qwen3_14b/npu_generate.py`:
- Around line 494-503: The CLI currently allows arbitrary kv-quant bit sizes but
the NPU kernels in runner/npu_executor.py are hard-coded for 4-bit (int4)
quantization; add a fail-fast validation before constructing KvQuantConfig: when
args.kv_quant is true, check args.kv_quant_key_bits and args.kv_quant_value_bits
are exactly 4 and if not raise/exit with a clear error message (or
argparse.ArgumentType/ArgumentParser error) explaining that only 4-bit KV
quantization is supported by the NPU executor; keep constructing KvQuantConfig
only after this validation so runner/npu_executor.py's contract is preserved.
In `@examples/model/qwen3_14b/runner/cpu_executor.py`:
- Around line 446-458: run_decode fails when called before any prefill because
it does not initialize TurboQuant state; before iterating layers (and before
calling _layer_decode_tq) call the same initializers used in run_prefill —
specifically invoke _ensure_codebook() and _ensure_rot_matrices() (so _codebook
and _rot_matrices[layer_idx] are populated) on the executor instance (or
model/runtime as appropriate) so decode-first flows honor the base executor
contract and _layer_decode_tq can safely access those members.
- Around line 784-799: The code currently sets alloc.tokens_used =
max(alloc.tokens_used, start_token + seq_len) even when the inner loop breaks
early due to running out of pages, causing unwritten slots to be treated as
valid; fix by tracking how many tokens were actually written (e.g., introduce a
local counter like written_count and increment it for each tok_offset that
completes the per-head writes) and then set alloc.tokens_used =
max(alloc.tokens_used, start_token + written_count) instead of using seq_len;
reference alloc.tokens_used, start_token, seq_len, tok_offset and alloc.page_ids
and ensure this change prevents _read_full_quant_cache() from reading unwritten
slots.
In `@examples/model/qwen3_14b/runner/npu_executor.py`:
- Around line 395-399: In _compile_model the variable prefill_block_table_stride
is undefined and causes a crash; update the call that builds prefill_tq (the
call to self._compile_prefill_tq_callable using qwen3_prefill_tq.prefill_fwd_tq)
to pass the already-computed max_blocks_per_seq instead of
prefill_block_table_stride so the block-table stride uses max_blocks_per_seq
(locate usages around prefill_tq and where max_blocks_per_seq is computed and
substitute the undefined symbol accordingly).
In `@examples/model/qwen3_14b/runner/npu_runner.py`:
- Around line 261-265: The TQ branch (when use_tq is true) must reject
chunked/resumed prefills because prefill_tq only supports
single-chunk/full-prefix prefill; add an explicit guard before the quantized KV
write (the block assigning quant_k, quant_v, quant_k_norms, quant_v_norms) that
checks chunk_offsets and chunk_lens and raises a clear exception (e.g.,
ValueError) if they indicate a resumed/chunked prefill; reference the use_tq
conditional and the prefill_tq call to ensure this guard runs only for the TQ
path and prevents silent wrong-cache writes.
- Around line 772-776: The annotation KvAllocation used in the method
_sync_all_kv_pages_from_npu is undefined in this module; add an import for
KvAllocation from its defining module (or wrap the annotation in quotes to make
it a forward reference) so Ruff F821 stops failing—locate the KvAllocation
symbol referenced in _sync_all_kv_pages_from_npu and either add "from <module>
import KvAllocation" at the top of the file or change the signature to def
_sync_all_kv_pages_from_npu(self, model_id: str, alloc: "KvAllocation") -> None.
In `@examples/model/qwen3_14b/verify_tq_prefill.py`:
- Line 63: Remove the unnecessary f-string prefixes on the non-interpolated
print statements (e.g., the print call that currently reads print(f"\n--- SCALES
---") and the similar one at line 72) to satisfy Ruff F541; locate these print
statements in verify_tq_prefill.py and change them to plain string prints
(print("\n--- SCALES ---") and the other analogous print) so they are not
treated as formatted strings.
In `@python/cli/main.py`:
- Around line 84-89: Validate the TurboQuant CLI flags before constructing
KvQuantConfig: ensure kv-quant-key-bits and kv-quant-value-bits are in the range
1..8 and that kv-quant-residual-window, kv-quant-protected-layers, and
kv-quant-protected-bits are >= 0 (or adjust protected-bits minimum to 1 if
TurboQuantCompressor requires it); perform this validation after parsing args
and call parser.error(...) with a clear message on invalid values (or use
argparse choices/type constraints) so invalid inputs are rejected before
KvQuantConfig/TurboQuantCompressor are created; apply the same checks for the
duplicate flag block referenced at lines 133-143 as well.
In `@python/core/kv_cache.py`:
- Around line 133-140: The type name KvQuantConfig used by the _CachePool
attribute kv_quant_config is not imported, causing F821 and runtime failures;
fix by adding an explicit import for KvQuantConfig at the top-level imports of
this module (import the actual module where KvQuantConfig is defined, e.g., add
a from <defining_module> import KvQuantConfig) so that _CachePool and the
kv_quant_config: KvQuantConfig | None annotation resolve at runtime; ensure the
import is not only inside TYPE_CHECKING so runtime evaluation of the annotation
succeeds.
- Around line 246-268: The quantized-page tensors quant_key_indices,
quant_key_norms, quant_val_indices, and quant_val_norms are currently sized with
max_blocks_per_seq but must be sized with the total num_pages so they can be
indexed by physical alloc.page_ids (used by get_compressed_block_info); change
their second dimension from max_blocks_per_seq to num_pages (keeping other dims,
dtypes, and device the same) so accesses using alloc.page_ids in multi-request
batches cannot index out of range or overlap other requests.
In `@tests/test_lloyd_max.py`:
- Around line 48-79: The test defines a duplicate solver solve_lloyd_max_new
instead of using the production implementation; delete the local
solve_lloyd_max_new and import and call the canonical solver
(python.core.turboquant.lloyd_max.solve_lloyd_max) in the test, replacing all
references to solve_lloyd_max_new with that imported function so tests exercise
the shipped implementation rather than a re-implementation in tests.
- Line 127: The failing Ruff F541 errors are caused by unnecessary f-string
prefixes on literal-only print statements (e.g., the print call showing " PASS:
new MSE <= old MSE (within 0.5%)"); remove the leading f from these string
literals so they are plain string literals instead of f-strings; update all
occurrences in tests/test_lloyd_max.py (the prints around the PASS messages at
the referenced lines) to use plain quotes to satisfy Ruff and keep existing text
unchanged.
- Around line 107-120: quantization_mse currently samples extreme [-1e10,1e10]
which makes pdf_vals underflow; update quantization_mse to integrate over
finite, sigma-dependent supports per interval (use boundaries already computed)
and sample each interval [a,b] with xs = torch.linspace(a, b, n_pts,
dtype=torch.float64) where a,b come from boundaries (or clamp a,b to e.g. mean ±
K*sigma), compute dx = (b - a) / (n_pts - 1) and accumulate mse_i = ((xs -
cs[i])**2 * pdf_vals).sum() * dx (not divide by n_pts), or replace numerical
sampling by analytical Gaussian interval integrals if preferred; ensure you
reference function quantization_mse, variables cs, boundaries, sigma, n_pts when
making the change.
- Around line 82-185: The file defines test_case(d, bits) which pytest will
collect as a test with undefined fixtures and currently only prints PASS/FAIL;
convert it to a proper parametrized pytest test by: rename or keep test_case but
add `@pytest.mark.parametrize`("d,bits", [...]) using the test_configs list
instead of the main loop; replace all printed PASS/FAIL checks (MSE comparison,
monotonicity, symmetry) with assertions (use solve_lloyd_max_old,
solve_lloyd_max_new, quantization_mse, and the diffs_signed/sym_diff
computations) so failures raise AssertionError; remove or disable the standalone
main() invocation so pytest controls execution. Ensure imports include pytest
and that timing/printing is optional (can remain but not required for
assertions).
---
Minor comments:
In `@examples/model/qwen3_14b/cpu_generate.py`:
- Around line 166-168: Guard the overlap percentage calculation against division
by zero: compute common as set(result_tq.token_ids) & set(result_fp.token_ids)
and total as max(len(result_tq.token_ids), len(result_fp.token_ids)), then if
total == 0 print a suitable message (e.g. "Token overlap: 0/0 (0.0%)" or "N/A")
instead of performing common / total; modify the block around variables common,
total and the print call in cpu_generate.py to handle the empty-generation case.
In `@examples/model/qwen3_14b/runner/npu_executor.py`:
- Around line 1044-1049: The _L2Callable constructor call is missing the
required name parameter causing _compile_l2_callable_from_jit() to raise a
TypeError; update the call that builds _L2Callable (the block creating
chip_callable, runtime_name, block_dim, aicpu_thread_num, param_infos) to pass a
meaningful name (e.g., name=runtime_name or fallback to
chip_callable.__name__/str(runtime_name)) so the created object has its name set
and callers of _compile_l2_callable_from_jit() receive a usable callable.
In `@examples/model/qwen3_14b/verify_tq_prefill.py`:
- Around line 36-40: In cpu_quantize, guard against division by zero when
computing unit = rotated / norm: detect zero norms from fp_vec (norm == 0) and
handle them before dividing (e.g., set unit for those vectors to a safe value
such as zeros or use a tiny epsilon to avoid NaN), then compute indices =
np.argmin(...) on the safe unit and return the original norm (0) for those
entries; update the logic around norm, rotated, unit, and indices to ensure no
NaNs propagate into the verification output.
In `@README.md`:
- Around line 152-171: The README's online JSON example under the kv_quant
section is missing the protected_layers and protected_bits fields referenced for
offline mode; update the kv_quant example to include "protected_layers" (array
of layer indices) and "protected_bits" (integer or array) with brief
default/example values, or explicitly state in the kv_quant section that
--kv-quant-protected-layers and --kv-quant-protected-bits are offline-only if
they are not supported online, so readers know whether to add those fields to
the JSON or not.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9159bd11-16d1-478a-98ea-853d56892e56
📒 Files selected for processing (19)
.claude/skills.claude/skillsREADME.mdexamples/model/qwen3_14b/compare_dump.pyexamples/model/qwen3_14b/compare_npu_dump.pyexamples/model/qwen3_14b/cpu_generate.pyexamples/model/qwen3_14b/npu_generate.pyexamples/model/qwen3_14b/npu_serving.jsonexamples/model/qwen3_14b/runner/cpu_executor.pyexamples/model/qwen3_14b/runner/npu_executor.pyexamples/model/qwen3_14b/runner/npu_runner.pyexamples/model/qwen3_14b/verify_tq_prefill.pypython/cli/main.pypython/core/kv_cache.pypython/core/turboquant/__init__.pypython/core/turboquant/compressor.pypython/core/turboquant/lloyd_max.pypython/core/types.pytests/test_lloyd_max.py
| # Find worst entries | ||
| if items: | ||
| worst = sorted(items, key=lambda x: x[1])[:10] | ||
| print(f"\n Top-10 lowest cosine similarity:") |
There was a problem hiding this comment.
Remove the unnecessary f-string prefix to satisfy Ruff F541.
This string has no interpolation, so Ruff flags F541 and the lint gate fails.
Proposed fix
- print(f"\n Top-10 lowest cosine similarity:")
+ print("\n Top-10 lowest cosine similarity:")As per coding guidelines, Ruff is configured with line length 110, target-version = "py310", F checks enabled, and F841 ignored.
🧰 Tools
🪛 Ruff (0.15.15)
[error] 164-164: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/compare_dump.py` at line 164, The print statement
uses an unnecessary f-string which triggers Ruff F541; locate the line
containing print(f"\n Top-10 lowest cosine similarity:") in
examples/model/qwen3_14b/compare_dump.py and remove the f prefix so it becomes a
normal string (print("\n Top-10 lowest cosine similarity:")), ensuring no
interpolation is used.
Sources: Coding guidelines, Linters/SAST tools
| if not fp_k_path.exists(): | ||
| print(f" [skip] {tag}: FP k_cache not found") | ||
| return | ||
| fp_k = np.load(fp_k_path) # [layers, kv_heads, page_size, head_dim] | ||
| fp_v = np.load(fp_v_path) | ||
|
|
||
| # ---- Load TQ KV cache ---- | ||
| tq_qk_path = tq_dir / f"{tag}_quant_k.npy" | ||
| tq_qv_path = tq_dir / f"{tag}_quant_v.npy" | ||
| tq_kn_path = tq_dir / f"{tag}_quant_k_norm.npy" | ||
| tq_vn_path = tq_dir / f"{tag}_quant_v_norm.npy" | ||
| if not tq_qk_path.exists(): | ||
| print(f" [skip] {tag}: TQ quant_k not found") | ||
| return | ||
| tq_qk = np.load(tq_qk_path) | ||
| tq_qv = np.load(tq_qv_path) | ||
| tq_kn = np.load(tq_kn_path) | ||
| tq_vn = np.load(tq_vn_path) |
There was a problem hiding this comment.
Validate all required dump files before loading.
Only k_cache/quant_k presence is checked, but v_cache, quant_v, and norm files are loaded unguarded. Missing any of them causes a hard crash.
Proposed fix
- if not fp_k_path.exists():
- print(f" [skip] {tag}: FP k_cache not found")
+ required_fp = [fp_k_path, fp_v_path]
+ missing_fp = [p.name for p in required_fp if not p.exists()]
+ if missing_fp:
+ print(f" [skip] {tag}: missing FP files: {missing_fp}")
return
@@
- if not tq_qk_path.exists():
- print(f" [skip] {tag}: TQ quant_k not found")
+ required_tq = [tq_qk_path, tq_qv_path, tq_kn_path, tq_vn_path]
+ missing_tq = [p.name for p in required_tq if not p.exists()]
+ if missing_tq:
+ print(f" [skip] {tag}: missing TQ files: {missing_tq}")
return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/compare_npu_dump.py` around lines 141 - 158, Check
existence of every file before any np.load calls: verify fp_k_path and fp_v_path
exist before loading fp_k/fp_v, and verify tq_qk_path, tq_qv_path, tq_kn_path,
and tq_vn_path exist before loading tq_qk/tq_qv/tq_kn/tq_vn; if any required
file is missing, print a single skip message (e.g., f" [skip] {tag}: missing KV
dump") and return instead of attempting to np.load, using the existing variables
fp_k_path, fp_v_path, tq_qk_path, tq_qv_path, tq_kn_path, tq_vn_path to locate
checks around the current load logic.
| num_layers = fp_k.shape[0] | ||
| layers = range(num_layers) if layer_filter is None else [layer_filter] | ||
|
|
There was a problem hiding this comment.
Validate --layer bounds before indexing layer tensors.
A user-provided out-of-range layer currently raises IndexError during dequantization/comparison.
Proposed fix
num_layers = fp_k.shape[0]
- layers = range(num_layers) if layer_filter is None else [layer_filter]
+ if layer_filter is not None and not (0 <= layer_filter < num_layers):
+ print(f" [skip] {tag}: layer {layer_filter} out of range (0..{num_layers - 1})")
+ return
+ layers = range(num_layers) if layer_filter is None else [layer_filter]Also applies to: 175-177
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/compare_npu_dump.py` around lines 160 - 162, The
code builds layers from num_layers and layer_filter but never validates
layer_filter, so an out-of-range value will trigger IndexError later when
indexing fp_k; add an explicit bounds check after computing num_layers (using
fp_k.shape[0]) that if layer_filter is not None it must be an int within 0 <=
layer_filter < num_layers, otherwise raise a clear ValueError (or print an error
and exit) before forming layers = range(...) or [layer_filter]; apply the same
validation wherever layers are used later (the block around the code referenced
at lines 175-177) to ensure no downstream indexing occurs with an invalid layer
index.
| fp_sample = sorted(fp_dir.glob("prefill_step0000_k_cache.npy")) | ||
| if fp_sample: | ||
| num_layers = np.load(fp_sample[0]).shape[0] | ||
| else: | ||
| num_layers = 40 | ||
| print(f"Warning: could not detect num_layers, assuming {num_layers}") |
There was a problem hiding this comment.
Avoid hardcoded num_layers=40; infer from available cache files.
Falling back to 40 can produce wrong rotation-matrix sizing and downstream indexing failures for other model configs.
Proposed fix
- fp_sample = sorted(fp_dir.glob("prefill_step0000_k_cache.npy"))
+ fp_sample = sorted(fp_dir.glob(f"{args.phase}_step*_k_cache.npy")) or sorted(fp_dir.glob("*_k_cache.npy"))
if fp_sample:
num_layers = np.load(fp_sample[0]).shape[0]
else:
- num_layers = 40
- print(f"Warning: could not detect num_layers, assuming {num_layers}")
+ print("Error: could not detect num_layers from FP cache dumps.")
+ sys.exit(1)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/compare_npu_dump.py` around lines 271 - 276, The
fallback to a hardcoded num_layers=40 is unsafe; instead, derive num_layers from
the available cache files in fp_dir by globbing all per-layer k/v cache files
(e.g. "*_k_cache.npy" and/or "*_v_cache.npy") and computing the appropriate
layer count from their shapes (for example using the max .shape[0] across
matches). Update the logic around fp_sample and num_layers (referencing fp_dir,
fp_sample, and the "prefill_step0000_k_cache.npy" pattern) to collect all
matching cache files, compute num_layers from their loaded shapes, and only fall
back to a safer default after confirming no cache files exist (and log a clear
warning if you must).
| def run_compare( | ||
| engine_tq: LLMEngine, | ||
| engine_fp: LLMEngine, | ||
| model_id: str, | ||
| prompt: str, | ||
| config: GenerateConfig, | ||
| ) -> None: | ||
| """Run same prompt on both TQ and FP, print comparison.""" | ||
| print(f"\nPrompt: {prompt}") | ||
| print("-" * 60) | ||
|
|
||
| t0 = time.perf_counter() | ||
| result_tq = engine_tq.generate_result(model_id, prompt, config) | ||
| dt_tq = time.perf_counter() - t0 | ||
|
|
||
| t0 = time.perf_counter() | ||
| result_fp = engine_fp.generate_result(model_id, prompt, config) |
There was a problem hiding this comment.
Pass separate model IDs to the two engines in compare mode.
Lines 246-249 register the FP engine under "qwen3-14b-fp-cpu", but Lines 153 and 157 query both engines with the single model_id argument. In compare mode that means the FP engine is asked for the TQ model ID, which will either fail lookup or hit the wrong model.
Proposed fix
def run_compare(
engine_tq: LLMEngine,
+ model_id_tq: str,
engine_fp: LLMEngine,
- model_id: str,
+ model_id_fp: str,
prompt: str,
config: GenerateConfig,
) -> None:
@@
- result_tq = engine_tq.generate_result(model_id, prompt, config)
+ result_tq = engine_tq.generate_result(model_id_tq, prompt, config)
@@
- result_fp = engine_fp.generate_result(model_id, prompt, config)
+ result_fp = engine_fp.generate_result(model_id_fp, prompt, config)- run_compare(engine_tq, engine_fp, args.model_id, prompt, config)
+ run_compare(engine_tq, args.model_id, engine_fp, "qwen3-14b-fp-cpu", prompt, config)Also applies to: 241-252
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/model/qwen3_14b/cpu_generate.py` around lines 141 - 157, The compare
routine uses the same model_id for both engines causing the FP engine to be
queried with the TQ model name; update run_compare to accept two model ids
(e.g., model_id_tq, model_id_fp) or otherwise pass distinct ids when calling
engine_tq.generate_result and engine_fp.generate_result so engine_tq uses the TQ
model id and engine_fp uses the FP model id; update all call sites of
run_compare accordingly and ensure the registered FP model name
("qwen3-14b-fp-cpu") is passed to engine_fp.generate_result while the TQ name is
passed to engine_tq.generate_result.
| # Compressed storage: non-packed uint8, one byte per element. | ||
| # For int4 (n_levels=16), each element stores values 0-15 in uint8. | ||
| # Shape: [num_layers, max_seq_pages, num_kv_heads, page_size, head_dim] | ||
| quant_key_indices = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, config.head_dim, | ||
| dtype=torch.uint8, device=runtime.device, | ||
| ) | ||
| quant_key_norms = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, 1, | ||
| dtype=torch.float32, device=runtime.device, | ||
| ) | ||
| quant_val_indices = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, config.head_dim, | ||
| dtype=torch.uint8, device=runtime.device, | ||
| ) | ||
| quant_val_norms = torch.zeros( | ||
| config.num_hidden_layers, max_blocks_per_seq, | ||
| config.num_key_value_heads, runtime.page_size, 1, | ||
| dtype=torch.float32, device=runtime.device, | ||
| ) |
There was a problem hiding this comment.
Size the quantized page tensors with num_pages, not max_blocks_per_seq.
These buffers are later addressed with physical alloc.page_ids via get_compressed_block_info(), but alloc.page_ids come from the global block pool in [0, num_pages). As soon as max_batch_size > 1, page IDs beyond max_blocks_per_seq - 1 can point past the quant cache or into another request's data.
Suggested fix
- quant_key_indices = torch.zeros(
- config.num_hidden_layers, max_blocks_per_seq,
+ quant_key_indices = torch.zeros(
+ config.num_hidden_layers, num_pages,
config.num_key_value_heads, runtime.page_size, config.head_dim,
dtype=torch.uint8, device=runtime.device,
)
quant_key_norms = torch.zeros(
- config.num_hidden_layers, max_blocks_per_seq,
+ config.num_hidden_layers, num_pages,
config.num_key_value_heads, runtime.page_size, 1,
dtype=torch.float32, device=runtime.device,
)
quant_val_indices = torch.zeros(
- config.num_hidden_layers, max_blocks_per_seq,
+ config.num_hidden_layers, num_pages,
config.num_key_value_heads, runtime.page_size, config.head_dim,
dtype=torch.uint8, device=runtime.device,
)
quant_val_norms = torch.zeros(
- config.num_hidden_layers, max_blocks_per_seq,
+ config.num_hidden_layers, num_pages,
config.num_key_value_heads, runtime.page_size, 1,
dtype=torch.float32, device=runtime.device,
)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/core/kv_cache.py` around lines 246 - 268, The quantized-page tensors
quant_key_indices, quant_key_norms, quant_val_indices, and quant_val_norms are
currently sized with max_blocks_per_seq but must be sized with the total
num_pages so they can be indexed by physical alloc.page_ids (used by
get_compressed_block_info); change their second dimension from
max_blocks_per_seq to num_pages (keeping other dims, dtypes, and device the
same) so accesses using alloc.page_ids in multi-request batches cannot index out
of range or overlap other requests.
| def solve_lloyd_max_new(d: int, bits: int, max_iter: int = 200, tol: float = 1e-10): | ||
| """Optimized implementation: analytical Gaussian conditional mean.""" | ||
| n_levels = 2 ** bits | ||
| sigma = 1.0 / math.sqrt(d) | ||
| lo, hi = -3.5 * sigma, 3.5 * sigma | ||
| centroids = torch.linspace(lo, hi, n_levels + 2)[1:-1].double() | ||
|
|
||
| for _ in range(max_iter): | ||
| boundaries = (centroids[:-1] + centroids[1:]) / 2.0 | ||
| edges = torch.zeros(n_levels + 1, dtype=torch.float64) | ||
| edges[0] = lo * 3 | ||
| edges[1:-1] = boundaries | ||
| edges[-1] = hi * 3 | ||
|
|
||
| z = edges / sigma | ||
| phi_z = (-0.5 * z * z).exp() * (1.0 / math.sqrt(2.0 * math.pi)) | ||
| cdf_z = 0.5 * (1.0 + torch.erf(z * (1.0 / math.sqrt(2.0)))) | ||
|
|
||
| phi_diff = phi_z[:-1] - phi_z[1:] | ||
| cdf_diff = cdf_z[1:] - cdf_z[:-1] | ||
| valid = cdf_diff > 1e-15 | ||
| new_centroids = torch.where( | ||
| valid, | ||
| sigma * phi_diff / cdf_diff.clamp(min=1e-15), | ||
| centroids, | ||
| ) | ||
| max_shift = (new_centroids - centroids).abs().max().item() | ||
| centroids = new_centroids | ||
| if max_shift < tol: | ||
| break | ||
|
|
||
| return centroids |
There was a problem hiding this comment.
Import the production solver instead of re-implementing it in the test.
solve_lloyd_max_new() is a second copy of the algorithm, not a call to python.core.turboquant.lloyd_max.solve_lloyd_max(). That means this file can stay green while the shipped solver regresses or simply behaves differently.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_lloyd_max.py` around lines 48 - 79, The test defines a duplicate
solver solve_lloyd_max_new instead of using the production implementation;
delete the local solve_lloyd_max_new and import and call the canonical solver
(python.core.turboquant.lloyd_max.solve_lloyd_max) in the test, replacing all
references to solve_lloyd_max_new with that imported function so tests exercise
the shipped implementation rather than a re-implementation in tests.
| def test_case(d: int, bits: int): | ||
| n_levels = 2 ** bits | ||
| print(f"\n{'='*60}") | ||
| print(f"d={d}, bits={bits}, n_levels={n_levels}") | ||
| print(f"{'='*60}") | ||
|
|
||
| t0 = time.perf_counter() | ||
| c_old = solve_lloyd_max_old(d, bits) | ||
| dt_old = (time.perf_counter() - t0) * 1000 | ||
|
|
||
| t0 = time.perf_counter() | ||
| c_new = solve_lloyd_max_new(d, bits) | ||
| dt_new = (time.perf_counter() - t0) * 1000 | ||
|
|
||
| diff = (c_new - c_old).abs() | ||
| sigma = 1.0 / math.sqrt(d) | ||
|
|
||
| print(f" old centroids ({dt_old:.1f} ms): {c_old.tolist()}") | ||
| print(f" new centroids ({dt_new:.1f} ms): {c_new.tolist()}") | ||
| print(f" max abs diff: {diff.max().item():.2e} (sigma={sigma:.4f})") | ||
| print(f" mean abs diff: {diff.mean().item():.2e}") | ||
| print(f" speedup: {dt_old / max(dt_new, 0.01):.1f}x") | ||
|
|
||
| # Verify: new centroids are the better quantizer (lower MSE for Gaussian) | ||
| # MSE = sum_{i} integral_{b_i}^{b_{i+1}} (x - c_i)^2 * f(x) dx | ||
| def quantization_mse(centroids_in): | ||
| """Compute MSE of a scalar quantizer for N(0, sigma^2).""" | ||
| cs = sorted(centroids_in.tolist()) | ||
| boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10] | ||
| total_mse = 0.0 | ||
| n_pts = 10000 | ||
| for i in range(len(cs)): | ||
| a, b = boundaries[i], boundaries[i+1] | ||
| xs = torch.linspace(a, b, n_pts, dtype=torch.float64) | ||
| pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi)) | ||
| mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum() | ||
| total_mse += mse_i.item() | ||
| dx = (boundaries[-1] - boundaries[0]) / n_pts # approximate | ||
| return total_mse / n_pts # normalized | ||
|
|
||
| mse_old = quantization_mse(c_old) | ||
| mse_new = quantization_mse(c_new) | ||
| print(f" MSE old: {mse_old:.6e}") | ||
| print(f" MSE new: {mse_new:.6e}") | ||
| if mse_new <= mse_old * 1.005: | ||
| print(f" PASS: new MSE <= old MSE (within 0.5%)") | ||
| else: | ||
| print(f" FAIL: new MSE > old MSE by {(mse_new/mse_old-1)*100:.2f}%") | ||
|
|
||
| # Verify centroids are monotonically increasing | ||
| diffs_signed = c_new[1:] - c_new[:-1] | ||
| if (diffs_signed > 0).all(): | ||
| print(f" PASS: centroids monotonically increasing") | ||
| else: | ||
| print(f" FAIL: centroids not monotonically increasing") | ||
|
|
||
| # Verify symmetry: centroids should be symmetric around 0 | ||
| mid = n_levels // 2 | ||
| if n_levels % 2 == 0: | ||
| sym_diff = (c_new[:mid] + c_new[mid:].flip(0)).abs().max().item() | ||
| else: | ||
| sym_diff = (c_new[:mid] + c_new[mid+1:].flip(0)).abs().max().item() | ||
| print(f" symmetry around 0: max deviation = {sym_diff:.2e}") | ||
| if sym_diff < 1e-8: | ||
| print(f" PASS: symmetric around 0") | ||
| else: | ||
| print(f" WARN: not perfectly symmetric (expected for finite iterations)") | ||
|
|
||
|
|
||
| def main(): | ||
| all_pass = True | ||
|
|
||
| # Typical Qwen3-14B configurations | ||
| test_configs = [ | ||
| # (head_dim, bits) | ||
| (128, 2), # value bits | ||
| (128, 4), # key bits | ||
| (128, 3), # alternative | ||
| (64, 2), # smaller model | ||
| (64, 4), | ||
| (128, 6), # higher precision | ||
| ] | ||
|
|
||
| for d, bits in test_configs: | ||
| test_case(d, bits) | ||
|
|
||
| print(f"\n{'='*60}") | ||
| print("Summary: compare speed across configs") | ||
| print(f"{'='*60}") | ||
| print(f"{'d':>6} {'bits':>5} {'old(ms)':>10} {'new(ms)':>10} {'speedup':>10}") | ||
| print("-" * 45) | ||
| for d, bits in test_configs: | ||
| t0 = time.perf_counter() | ||
| solve_lloyd_max_old(d, bits) | ||
| dt_old = (time.perf_counter() - t0) * 1000 | ||
|
|
||
| t0 = time.perf_counter() | ||
| solve_lloyd_max_new(d, bits) | ||
| dt_new = (time.perf_counter() - t0) * 1000 | ||
| print(f"{d:>6} {bits:>5} {dt_old:>10.1f} {dt_new:>10.1f} {dt_old/max(dt_new,0.01):>9.1f}x") | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -HI 'pytest.ini|pyproject.toml|tox.ini|conftest.py' .
rg -n --hidden 'pytest|python -m pytest|tool.pytest|addopts|testpaths' .Repository: hw-native-sys/pypto-serving
Length of output: 2006
Convert tests/test_lloyd_max.py demo into a real parametrized pytest test (with asserts)
tests/test_lloyd_max.pydefinesdef test_case(d: int, bits: int):at module scope; pytest will treat it as a test (name starts withtest_) and interpretd/bitsas fixtures unless you parametrize (or otherwise prevent collection).- The current PASS/FAIL logic only prints and never uses
assert, so CI won’t fail even if MSE/ordering/symmetry checks are wrong—replace the prints withassertstatements and drived,bitsvia@pytest.mark.parametrizeovertest_configs.
🧰 Tools
🪛 Ruff (0.15.15)
[error] 127-127: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 134-134: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 136-136: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 146-146: f-string without any placeholders
Remove extraneous f prefix
(F541)
[error] 148-148: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_lloyd_max.py` around lines 82 - 185, The file defines test_case(d,
bits) which pytest will collect as a test with undefined fixtures and currently
only prints PASS/FAIL; convert it to a proper parametrized pytest test by:
rename or keep test_case but add `@pytest.mark.parametrize`("d,bits", [...]) using
the test_configs list instead of the main loop; replace all printed PASS/FAIL
checks (MSE comparison, monotonicity, symmetry) with assertions (use
solve_lloyd_max_old, solve_lloyd_max_new, quantization_mse, and the
diffs_signed/sym_diff computations) so failures raise AssertionError; remove or
disable the standalone main() invocation so pytest controls execution. Ensure
imports include pytest and that timing/printing is optional (can remain but not
required for assertions).
| def quantization_mse(centroids_in): | ||
| """Compute MSE of a scalar quantizer for N(0, sigma^2).""" | ||
| cs = sorted(centroids_in.tolist()) | ||
| boundaries = [-1e10] + [(cs[i] + cs[i+1]) / 2 for i in range(len(cs)-1)] + [1e10] | ||
| total_mse = 0.0 | ||
| n_pts = 10000 | ||
| for i in range(len(cs)): | ||
| a, b = boundaries[i], boundaries[i+1] | ||
| xs = torch.linspace(a, b, n_pts, dtype=torch.float64) | ||
| pdf_vals = torch.exp(-0.5 * (xs / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi)) | ||
| mse_i = ((xs - cs[i]) ** 2 * pdf_vals).sum() | ||
| total_mse += mse_i.item() | ||
| dx = (boundaries[-1] - boundaries[0]) / n_pts # approximate | ||
| return total_mse / n_pts # normalized |
There was a problem hiding this comment.
The MSE check is sampling essentially zero probability mass.
torch.linspace(-1e10, 1e10, 10000) is far too coarse for these Gaussian scales, so pdf_vals underflows to ~0 almost everywhere and mse_old/mse_new stop being meaningful. Use a finite support tied to sigma/the solver edges, or compute the interval integrals analytically.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_lloyd_max.py` around lines 107 - 120, quantization_mse currently
samples extreme [-1e10,1e10] which makes pdf_vals underflow; update
quantization_mse to integrate over finite, sigma-dependent supports per interval
(use boundaries already computed) and sample each interval [a,b] with xs =
torch.linspace(a, b, n_pts, dtype=torch.float64) where a,b come from boundaries
(or clamp a,b to e.g. mean ± K*sigma), compute dx = (b - a) / (n_pts - 1) and
accumulate mse_i = ((xs - cs[i])**2 * pdf_vals).sum() * dx (not divide by
n_pts), or replace numerical sampling by analytical Gaussian interval integrals
if preferred; ensure you reference function quantization_mse, variables cs,
boundaries, sigma, n_pts when making the change.
| print(f" MSE old: {mse_old:.6e}") | ||
| print(f" MSE new: {mse_new:.6e}") | ||
| if mse_new <= mse_old * 1.005: | ||
| print(f" PASS: new MSE <= old MSE (within 0.5%)") |
There was a problem hiding this comment.
Remove the literal-only f prefixes or Ruff will fail this file.
These strings have no placeholders, so they trip F541 on Lines 127, 134, 136, 146, and 148. Plain string literals are enough here.
As per coding guidelines, Ruff is configured with line length 110, target-version = "py310", with F checks enabled.
Also applies to: 134-136, 146-148
🧰 Tools
🪛 Ruff (0.15.15)
[error] 127-127: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_lloyd_max.py` at line 127, The failing Ruff F541 errors are caused
by unnecessary f-string prefixes on literal-only print statements (e.g., the
print call showing " PASS: new MSE <= old MSE (within 0.5%)"); remove the
leading f from these string literals so they are plain string literals instead
of f-strings; update all occurrences in tests/test_lloyd_max.py (the prints
around the PASS messages at the referenced lines) to use plain quotes to satisfy
Ruff and keep existing text unchanged.
Sources: Coding guidelines, Linters/SAST tools
Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.
99cc8c2 to
c1d87d6
Compare
Integrate TurboQuant module (compressor, lloyd_max) with serving-v3 stack: extend kv_cache.py with compress/restore/stats methods, add TurboQuant hooks in npu_runner.py (compress after prefill/decode, restore before decode), clean compressed segments on preemption in serving_worker.py, and add kv_quant JSON config parsing in CLI.