Add MergingPress: scorer-agnostic merge-on-evict for KV cache compression 🤖🤖🤖 #219
jg-codes wants to merge 24 commits into NVIDIA:main
ExpectedAttentionPress benchmark results
Setup: RULER-4096, Qwen3-8B, fraction=0.1 (~650 samples), seed=42
Three configurations compared:
Average scores
MergingPress consistently matches or beats bare ExpectedAttentionPress (EA) hard eviction. The gain is largest at CR=0.75 (+4.6 pp), matching the pattern seen with KnormPress (+6.0 pp).
Flagship per-task result: niah_single_3 at CR=0.75
Merge-on-evict recovers nearly all lost accuracy on this retrieval task, reaching the same quality as AdaKV's per-head budget allocation.
Per-task breakdown (CR=0.75)
Observations
🤖🤖🤖
@jg-codes run it with KVzap too
@jg-codes we are currently investigating the best way to interact with AI agents in this repository. To help us, could you share some information about yourself (e.g. which agent harness you are using, which model, your config, who is running you)?
Development: GitHub Copilot (VS Code "Autopilot" mode) in combination with Agentic Cowork features, e.g. for research tasks. All under my supervision. Unfortunately, guardrails don't yet stop the agent from publishing local drafts.
In the base setup, we lose against KVzap. Hence, we merge only 75% of the evicted tokens and require a minimum similarity_threshold.
Setup: RULER-4096, Qwen3-8B, fraction=0.1 (~650 samples), seed=42
M(KVzap) = MergingPress(KVzapPress(model_type="mlp"), merge_fraction=0.75, similarity_threshold=0.5), i.e. selective merge-on-evict
On QA we lose significantly. On niah-mv it still looks fine. Not sure about significance here.
Average Scores
Task Breakdown
Wall-clock time (average seconds per task)
@jg-codes could you give me more information about yourself?
Nice results. Could you run with DMSPress(press=KVzapPress(model_type="mlp"))? It's the SOTA press for now. Use thresholds of −4 and −3.
The experiments stem from a setup of multiple non-autonomous AI assistants: one for research, one for thinking, one for challenging the results, one for interdisciplinary perspectives, etc. Funnily, I asked 'what is the KV press SOTA' to assess an optimization angle. First the AI named H2O; after I objected that this SOTA would be three years old, it named SnapKV, later AdaKV; only then did I stumble on the KV Press Leaderboard.
DMSPress required adding a
MergingPress(DMSPress(KVzapPress)) results
Setup: RULER-4096, Qwen3-8B, f=0.1 (~650 samples), seed=42, A100
Per-task at threshold −4
At t=−4 DMSPress barely evicts: 9/13 tasks are already perfect. Only qa_1 shows movement (+2.1 with mf=0.75).
Per-task at threshold −3
Key takeaways
I suppose MergingPress would benefit from more aggressive thresholds; I'd need more time to ponder. What would you recommend as the next step? Are the extensions and modifications that let MergingPress wrap any press the right approach?
Force-pushed from bbdcfb9 to f42738b
Signed-off-by: Johannes <[email protected]>
Signed-off-by: Johannes <[email protected]>
Signed-off-by: Johannes <[email protected]>
…ge 🤖🤖🤖
Implement a vectorized merge-on-evict kernel as a standalone function. Partitions tokens by score into keep/evict sets, computes batched cosine similarity between each evicted and surviving key, then folds evicted values into their nearest survivor via similarity-weighted scatter-add with float32 accumulation.
This is the core building block for MergingPress; threshold gating, value-norm weighting, and merge caps are added in follow-up commits.
Signed-off-by: Johannes <[email protected]>
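The partition, similarity, and fold steps described in this commit message can be illustrated with a small loop-based, single-head sketch in plain Python. This is not the PR's vectorized kernel: the function names, the weighted-average form of the blend (survivor weight 1, evicted weight equal to the cosine similarity), and the skip for non-positive similarities are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_on_evict(keys, values, scores, n_keep):
    """Toy single-head merge-on-evict; returns (kept_indices, merged_values)."""
    # Partition by score: the n_keep highest-scoring tokens survive.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:n_keep])
    evict = sorted(order[n_keep:])

    # Accumulators start from each survivor's own value with weight 1.
    acc = {j: list(values[j]) for j in keep}
    weight = {j: 1.0 for j in keep}

    for i in evict:
        # Route the evicted token to its most cosine-similar surviving key.
        w, j = max((cosine(keys[i], keys[j]), j) for j in keep)
        if w <= 0.0:
            continue  # assumption: non-positive similarity means hard eviction
        acc[j] = [a + w * v for a, v in zip(acc[j], values[i])]
        weight[j] += w

    # Normalise so each survivor holds a weighted average (assumed blend form).
    merged = [[a / weight[j] for a in acc[j]] for j in keep]
    return keep, merged


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
values = [[2.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
scores = [0.9, 0.8, 0.1]
keep, merged = merge_on_evict(keys, values, scores, n_keep=2)
```

In this toy input, token 2 is evicted and its key points almost the same way as token 0's key, so its value is blended into survivor 0 while survivor 1 is left untouched.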
…cap 🤖🤖🤖
Add four configurable features to _merge_on_evict:
- similarity_threshold: gate merges by minimum cosine similarity
- value_norm_weighting: scale merge budget by relative value L2 norm
- max_merge_per_token: cap merges per survivor to prevent dilution
- merge_keys: optionally merge evicted info into survivor keys
Signed-off-by: Johannes <[email protected]>
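As a rough illustration of how the threshold gate and the per-survivor cap could interact, here is a minimal sketch over a list of candidate merges. The parameter names similarity_threshold and max_merge_per_token come from the commit; the helper name select_merges, the inclusive comparison, and the choice to keep the most similar candidates under the cap are assumptions, not the PR's implementation.

```python
from collections import defaultdict


def select_merges(candidates, similarity_threshold=0.0, max_merge_per_token=0):
    """Filter candidate merges given as (evicted_idx, survivor_idx, cosine_sim)."""
    # Gate: drop pairs whose similarity is below the threshold (inclusive, assumed).
    gated = [c for c in candidates if c[2] >= similarity_threshold]
    if max_merge_per_token <= 0:
        return sorted(gated)  # 0 means "no cap"

    # Cap: visit candidates from most to least similar and stop filling a
    # survivor once it has max_merge_per_token merges (assumed tie-breaking).
    per_survivor = defaultdict(list)
    for c in sorted(gated, key=lambda c: c[2], reverse=True):
        if len(per_survivor[c[1]]) < max_merge_per_token:
            per_survivor[c[1]].append(c)
    return sorted(m for group in per_survivor.values() for m in group)


candidates = [(5, 0, 0.9), (6, 0, 0.8), (7, 0, 0.2), (8, 1, -0.3)]
default_merges = select_merges(candidates)
strict_merges = select_merges(candidates, similarity_threshold=0.5)
capped_merges = select_merges(candidates, max_merge_per_token=1)
```

Candidates that fail the gate or exceed the cap would simply be hard-evicted, which is what keeps a single survivor from absorbing too many tokens.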
…ng 🤖🤖🤖
Document the merge-on-evict algorithm with:
- Perturbation bound derivation: merge error ≤ 1/(1+w) of hard-eviction error, where w is the cosine similarity between evicted and survivor keys
- Full parameter descriptions (NumPy-style docstring)
- References: Token Merging (Bolya 2023), D2O (Wan 2024), KeepKV (Huang 2025)
- Debug-level logging of merge statistics (count, mean similarity, max merges per survivor) behind isEnabledFor guard
Signed-off-by: Johannes <[email protected]>
Introduce MergingPress as a BasePress dataclass that wraps any ScorerPress and delegates scoring entirely, replacing only the eviction step with merge-on-evict.
API design choices:
- similarity_threshold (default 0.0): gate merges by minimum cosine similarity; 0.0 blocks opposite-direction merges while permitting all reasonable ones
- merge_keys=False by default: preserves RoPE positional encoding
- value_norm_weighting=True: scales merge budget by relative value L2 norm
- max_merge_per_token=0: optional dilution cap for high-compression
Differs from CAMPress in routing merges to the most similar token (position-agnostic) rather than sequential neighbors.
Signed-off-by: Johannes <[email protected]>
…erences
Document each parameter with its default, rationale, and empirical impact on RULER-4096 benchmarks with Qwen3-8B:
- merge_keys=True hurts quality (−2.5 pp at CR=0.75)
- value_norm_weighting=True improves accuracy (~1.9 pp)
- similarity_threshold=0.0 blocks only opposite-direction merges
Align kernel code comments with upstream style (explicit section headers, float32 accumulation note, partition order comment).
Signed-off-by: Johannes <[email protected]>
- kvpress/__init__.py: import + __all__ entry
- evaluation/evaluate_registry.py: merging_knorm and merging_snapkv press configs
- tests/default_presses.py: MergingPress(KnormPress) at CR 0.2/0.8
- README.md: one-line description in wrapper presses list
Signed-off-by: Johannes <[email protected]>
7 tests covering constructor validation and core merge behaviour:
- test_requires_scorer_press: rejects non-ScorerPress
- test_threshold_bounds: rejects out-of-range thresholds
- test_compression_ratio_delegation: property delegates to wrapped press
- test_zero_compression_is_identity: CR=0 → no eviction
- test_runs_with_model: smoke test with KnormPress and SnapKVPress
- test_merge_differs_from_hard_eviction: merged values ≠ hard-evicted
- test_threshold_gates_merges: high threshold → closer to hard evict
Signed-off-by: Johannes <[email protected]>
…g) 🤖🤖🤖
4 tests verifying each configurable feature:
- test_default_preserves_keys: merge_keys=False keeps keys unchanged
- test_half_precision_no_nan: fp16/bf16 produce finite results
- test_repeated_compression_stable: multi-turn recompression stays finite
- test_value_norm_weighting_differs: vnorm=True changes merge output
Signed-off-by: Johannes <[email protected]>
6 tests covering edge cases and integration:
- test_merge_preserves_more_info_than_hard_eviction: reconstruction error is lower with merging than with hard eviction
- test_batch_size_greater_than_one: partition works for B>1
- test_max_merge_per_token_validation: rejects negative cap
- test_max_merge_per_token_changes_output: cap=1 differs from uncapped
- test_high_compression_short_sequence: high CR on short seq doesn't crash
- test_quantized_cache_compatibility: QuantizedCache + quanto (skip if N/A)
Signed-off-by: Johannes <[email protected]>
Extends DecodingPress with cosine-similarity merge-on-evict (position-agnostic alternative to CAMPress). Shares _merge_on_evict kernel with MergingPress.
- MergingDecodingPress class in merging_press.py
- Exported in __init__.py
- Registry entries: merging_decoding_knorm, merging_decoding_adakv_snapkv
- 3 tests (instantiation, compress override, parameter forwarding)
Co-authored-by: GitHub Copilot <[email protected]>
Signed-off-by: Johannes <[email protected]>
Widen press field from ScorerPress to BasePress and add dispatch in compress() for future mask-based and hook-based press composition. ScorerPress path moved to _compress_scorer() with no behavior change.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
New kernel that handles variable per-head eviction counts from adaptive budget allocation (AdaKV, DMSPress). Iterates per (batch, head) pair, merges evicted tokens in-place into full-length tensors. Evicted positions are left unchanged for attention masking.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
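To make the per-head behaviour concrete, the following is a minimal plain-Python sketch of a merge that tolerates a different number of evicted tokens per head and leaves evicted positions in place. It is only an illustration of the idea in this commit: the function name merge_adaptive, the nested-list layout [head][token][dim], the weighted-average blend, and the skip for non-positive similarities are assumptions, and the batch dimension is omitted.

```python
import math


def _cos(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_adaptive(keys, values, evicted):
    """Merge evicted values into survivors, head by head, mutating values in place.

    keys, values: nested lists shaped [head][token][dim].
    evicted: one set of evicted token indices per head (sizes may differ).
    """
    for h, masked in enumerate(evicted):
        survivors = [t for t in range(len(keys[h])) if t not in masked]
        if not survivors or not masked:
            continue
        acc = {j: list(values[h][j]) for j in survivors}
        wsum = {j: 1.0 for j in survivors}
        for i in sorted(masked):
            # Nearest survivor by cosine similarity of keys.
            w, j = max((_cos(keys[h][i], keys[h][j]), j) for j in survivors)
            if w <= 0.0:
                continue  # assumption: skip non-positive similarities
            acc[j] = [a + w * v for a, v in zip(acc[j], values[h][i])]
            wsum[j] += w
        # Only survivor rows are rewritten; evicted rows keep their old values
        # so a separate attention mask can hide them.
        for j in survivors:
            values[h][j] = [a / wsum[j] for a in acc[j]]
    return values


keys = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],  # head 0
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # head 1
]
values = [[[2.0], [4.0], [8.0]], [[1.0], [3.0], [5.0]]]
out = merge_adaptive(keys, values, evicted=[{1}, {1, 2}])
```

Head 0 evicts one token and head 1 evicts two, yet both heads keep their full sequence length, which is the property that lets this style of kernel work with per-head budgets.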
Wire _merge_on_evict_adaptive into compress() for mask-based presses (AdaKV, CriticalAdaKV). Delegates to inner press.compress(), reads masked_key_indices, merges evicted tokens into survivors in-place.
6 tests: construction, delegation, model smoke, merge differs, identity.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
Detect hook-based presses (DMSPress) via _is_hook_based_press() and delegate to their forward_hook, then merge evicted tokens using the adaptive kernel. Adds threshold/compression_ratio property passthrough.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
TestMergingPressWithDMS: 7 tests covering hook detection, threshold passthrough, model execution, and merge-vs-plain comparison. Uses RandomPress to avoid HF model dependency.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
Allows keeping only top fraction of evicted tokens (by similarity) for merging, hard-evicting the rest. Default 1.0 (merge all) preserves backward compatibility. Threaded through both kernels and both classes.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
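A small sketch of the selection this commit describes: rank evicted tokens by their best similarity and merge only the top fraction. The name merge_fraction appears elsewhere in this thread; the helper name split_by_merge_fraction and the ceil rounding are assumptions, since the commit does not say how fractional counts are handled.

```python
import math


def split_by_merge_fraction(candidates, merge_fraction=1.0):
    """Split (evicted_idx, survivor_idx, cosine_sim) candidates into merge / hard-evict."""
    if not 0.0 <= merge_fraction <= 1.0:
        raise ValueError("merge_fraction must be in [0, 1]")
    # Most similar evicted tokens are merged first.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    n_merge = math.ceil(merge_fraction * len(ranked))  # rounding rule is an assumption
    return ranked[:n_merge], ranked[n_merge:]


candidates = [(4, 0, 0.1), (5, 0, 0.9), (6, 1, 0.6), (7, 1, 0.4)]
to_merge, to_drop = split_by_merge_fraction(candidates, merge_fraction=0.75)
merge_all, drop_none = split_by_merge_fraction(candidates)
```

With merge_fraction=0.75 the least similar candidate is hard-evicted, and the default of 1.0 reproduces merge-everything behaviour.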
Skip merges where estimated error ‖v_i‖*(1-w)/(1+w) exceeds gate. Prevents high-norm evicted tokens from corrupting survivors; fixes the qa_1 regression where QA answer tokens get blended into context. Default 0.0 (disabled) preserves backward compatibility.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
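The gate condition quoted in this commit, ‖v_i‖*(1-w)/(1+w), is easy to check by hand. A minimal sketch follows, assuming a hypothetical parameter name error_gate and treating w ≤ −1 as always blocked to avoid a division by zero; neither detail is confirmed by the PR.

```python
import math


def passes_error_gate(value, w, error_gate=0.0):
    """Return True if merging an evicted value with similarity w is allowed."""
    if error_gate <= 0.0:
        return True  # 0.0 disables the gate, matching the stated default
    if w <= -1.0:
        return False  # assumption: degenerate similarity is never merged
    norm = math.sqrt(sum(x * x for x in value))
    estimated_error = norm * (1.0 - w) / (1.0 + w)
    return estimated_error <= error_gate


v = [3.0, 4.0]  # L2 norm 5, so the estimate at w=0.5 is 5 * 0.5 / 1.5
blocked = passes_error_gate(v, w=0.5, error_gate=1.0)
allowed = passes_error_gate(v, w=0.5, error_gate=2.0)
disabled = passes_error_gate(v, w=0.5)
```

Because the estimate scales with the value norm, a high-norm token needs a much higher similarity to pass the same gate than a low-norm one.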
Verify tight gate blocks high-error merges (output closer to hard eviction) and gate=0.0 produces identical output to default.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
Registry: merging_knorm, merging_snapkv, merging_adakv_snapkv, merging_dms_kvzap_mlp, merging_decoding_knorm.
Evaluate: handle MergingPress(DMSPress) threshold delegation in _setup_press().
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
transformers 4.48+ calls KVzapConfig() with no args in to_diff_dict(). Adding defaults (0) for required params prevents the TypeError while from_pretrained still passes the real values from the saved config.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
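The failure mode described here can be reproduced without transformers. The sketch below uses two hypothetical config classes and a stand-in diff_against_defaults helper that, like the behaviour the commit attributes to to_diff_dict(), builds a default instance with no arguments; the field names hidden_dim and n_layers are invented for illustration and are not KVzapConfig's real fields.

```python
class StrictConfig:
    """Hypothetical config whose constructor requires every argument."""

    def __init__(self, hidden_dim, n_layers):
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers


class LenientConfig:
    """Same fields, but with placeholder defaults of 0."""

    def __init__(self, hidden_dim=0, n_layers=0):
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers


def diff_against_defaults(cfg):
    """Return only the fields that differ from a zero-argument instance."""
    defaults = type(cfg)()  # raises TypeError if any argument is required
    return {k: v for k, v in vars(cfg).items() if vars(defaults).get(k) != v}


lenient_diff = diff_against_defaults(LenientConfig(512, 2))

try:
    diff_against_defaults(StrictConfig(512, 2))
    strict_raised = False
except TypeError:
    strict_raised = True
```

The placeholder defaults never leak into a loaded model in this pattern, because real values passed to the constructor override them.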
Force-pushed from 6a9a3d7 to 3989b4e
MergingPress(diagnostics=True) collects per-layer, per-head merge stats: eviction positions, merge targets, cosine similarities, value norms, and DMS importance scores for evicted vs surviving tokens.
DMSPress.full_scores captures per-token KVzap scores before the scores_buffer is trimmed, enabling post-hoc analysis of eviction decisions.
Zero overhead when diagnostics are disabled.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Johannes <[email protected]>
Closes #214
What
MergingPress is a prefill-time wrapper that replaces hard eviction with merge-on-evict: each evicted token is folded into its most cosine-similar survivor via weighted value blending, instead of being discarded.
It wraps any ScorerPress: scoring is delegated entirely; only the eviction step changes. This makes it composable with all existing scorers (KnormPress, SnapKVPress, etc.).
How it works
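1. Score tokens with the wrapped ScorerPress
2. Partition tokens by score into keep and evict sets
3. Compute the cosine similarity between each evicted key and every surviving key
4. Fold each evicted value into its most similar survivor via a similarity-weighted blend (gated by similarity_threshold)
Perturbation bound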
For evicted token i routed to survivor j with cosine similarity w:
$$\text{merge error} \le \frac{1}{1+w} \cdot \text{hard-eviction error}$$
At w ≥ 0.7 the merge error is at most 59% of hard-eviction error; at w = 1 it halves exactly.
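The 59% and one-half figures follow from the 1/(1+w) factor stated in the commit history, and they can be checked numerically. The sketch below assumes the merged value is the weighted average (v_j + w·v_i)/(1+w) and measures error as the distance between the evicted value v_i and whatever value stands in for it; both are illustrative assumptions rather than the PR's exact derivation.

```python
import math


def merge_error_ratio(v_i, v_j, w):
    """Ratio of merge error to hard-eviction error for one evicted token."""
    # Assumed blend: survivor weight 1, evicted weight w.
    merged = [(b + w * a) / (1.0 + w) for a, b in zip(v_i, v_j)]
    hard_error = math.dist(v_i, v_j)      # v_i is replaced by the untouched survivor
    merge_error = math.dist(v_i, merged)  # v_i is replaced by the blended survivor
    return merge_error / hard_error


ratio_07 = merge_error_ratio([1.0, 2.0], [3.0, -1.0], w=0.7)
ratio_10 = merge_error_ratio([1.0, 2.0], [3.0, -1.0], w=1.0)
```

Under this assumption v_i − merged equals (v_i − v_j)/(1+w) exactly, so the ratio is 1/(1+w) regardless of the vectors chosen: about 0.588 at w = 0.7 and 0.5 at w = 1.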
Parameters
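| Parameter | Default | Description |
|---|---|---|
| press | | ScorerPress whose scores determine which tokens survive |
| similarity_threshold | 0.0 | Gate merges by minimum cosine similarity |
| merge_keys | False | Optionally merge evicted info into survivor keys (False preserves Rotary Positional Encoding info) |
| value_norm_weighting | True | Scale merge budget by relative value L2 norm |
| max_merge_per_token | 0 | Cap merges per survivor to prevent dilution (0 = unlimited) |

Empirical defaults (RULER-4096, Qwen3-8B)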
- merge_keys=True hurts quality (−2.5 pp at CR=0.75), possibly due to RoPE corruption (?)
- value_norm_weighting=True improves accuracy (~1.9 pp)
- similarity_threshold=0.0 is sufficient: nearly no tokens have negative max similarity; the empirically best threshold is unclear, and a threshold may not be required in general
- max_merge_per_token=0 (unlimited) works well up to CR=0.75; at CR=0.88 the AdaKV results below show a broad −0.8 pp regression (1 win / 7 losses), suggesting too many evicted tokens pile onto the same survivors. Capping at 3–5 may help at extreme compression, but generalisation is unclear.

Benchmark results
RULER-4096, Qwen3-8B, fraction=1.0 (all 13 subtasks), seed=42:
Average scores
MergingPress consistently outperforms hard eviction across all compression ratios, with the largest gains at high compression where merge-on-evict recovers the most discarded information.
Per-task breakdown
M+K = MergingPress(KnormPress), K = KnormPress. Knorm and no_press baselines from the kvpress leaderboard.
Key observations:
Scorer generality: AdaKVPress (f=0.1, ~650 samples)
Exploratory runs on AdaKV(SnapKVPress) confirm that MergingPress generalises beyond KnormPress. These used fraction=0.1 (~650 of ~6500 RULER samples), so treat as directional:
Pattern matches KnormPress: positive gains at CR 0.25–0.75, with an inversion at CR=0.88 where the merge overhead may dilute the few surviving tokens. Per-task win/loss breakdown: CR=0.25 has 5 wins / 0 losses, CR=0.50 has 7/2, CR=0.75 has 5/6 (net positive due to larger wins on niah_s1 +10.6, vt +12.2), CR=0.88 has 1/7. The CR=0.88 regression (−0.8 pp) is small but broad, suggesting that max_merge_per_token capping or a higher similarity_threshold could help at extreme compression.
Computational overhead
The merge kernel adds one batched cosine-similarity matmul per layer: O(B · H · CR · (1−CR) · L² · D) — same complexity class as attention but over KV heads only (8 vs 32 query heads for Qwen3-8B) and bounded by CR·(1−CR) ≤ 0.25. Runs once at prefill; decoding is unaffected.
Theoretical peak: ~6% of attention FLOPs at CR=0.50, i.e. ~2–3% of total prefill FLOPs. No extra forward passes, no learned parameters.
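The ~6% figure can be reproduced with a one-line ratio. The sketch assumes attention cost is counted in the same units as the merge expression, i.e. B · H_q · L² · D for the query-key matmul, so that B, L² and D cancel; the function name and this accounting are assumptions. Counting the attention-value matmul as well would halve the ratio.

```python
def merge_to_attention_ratio(cr, kv_heads=8, query_heads=32):
    """Merge-kernel FLOPs relative to attention FLOPs at compression ratio cr.

    Merge cost     ~ B * kv_heads    * cr * (1 - cr) * L^2 * D
    Attention cost ~ B * query_heads *                 L^2 * D   (assumed accounting)
    """
    return (kv_heads / query_heads) * cr * (1.0 - cr)


peak = merge_to_attention_ratio(0.5)       # cr * (1 - cr) is maximal at 0.5
at_075 = merge_to_attention_ratio(0.75)
```

With the Qwen3-8B head counts quoted above (8 KV heads, 32 query heads) the peak is 0.0625, which matches the "~6% of attention FLOPs at CR=0.50" estimate.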
Changes
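| File | Change |
|---|---|
| kvpress/presses/merging_press.py | _merge_on_evict kernel + MergingPress dataclass |
| tests/presses/test_merging_press.py | 17 test methods |
| kvpress/__init__.py | import + __all__ entry |
| evaluation/evaluate_registry.py | merging_knorm and merging_snapkv configs |
| tests/default_presses.py | MergingPress(KnormPress) at CR 0.2/0.8 |
| README.md | one-line description in wrapper presses list |

Total: 6 files, +618 lines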
Design choices vs. related work
References:
Usage
Tests
17 test methods in tests/presses/test_merging_press.py: 18 passed, 1 skipped (test_quantized_cache_compatibility requires optimum-quanto).
Coverage: parameter validation, compression-ratio delegation, identity at zero compression, model forward pass (KnormPress + SnapKVPress), merge-vs-hard-eviction difference, threshold gating, key preservation, fp16/bf16 numerical stability, repeated compression, value-norm weighting, information preservation, batching, max_merge_per_token validation + effect, short-sequence edge case, quantized cache compatibility.
CI
Awaiting /ok to test from a collaborator. Local results:
- ruff check ✅: no issues on all changed files
- pytest tests/presses/test_merging_press.py ✅: 18 passed, 1 skipped (no GPU needed for unit tests)
- make style / make test: not run locally (full suite requires GPU for default_presses integration tests)

AI disclosure
This PR was developed with AI assistance. Commits authored by AI are marked with 🤖🤖🤖. In fact, AI did most of the work. The API design, parameter selection, empirical tuning (...), and docstring proofreading are human contributions.
Checklist
- [...] AGENTS.md guidelines (dataclass, BasePress, SPDX headers)
- ruff check passes on all changed files
- [...] optimum-quanto)
- [...] kvpress/__init__.py, tests/default_presses.py, evaluation/evaluate_registry.py, README.md
- make style / make test on CI (awaiting /ok to test)