Core-Aware Selective KV Compression for Reasoning Traces
CASK treats reasoning-time KV compression as a behavior-preserving selective consolidation problem rather than a pure scoring problem. The paper-facing method is cask: protect a small core of reasoning states, selectively consolidate redundant scratch states, and use a two-stage policy for prompt-heavy runs.
| Item | Current answer |
|---|---|
| Main method | cask |
| Main baseline | triattention |
| Main metric family | teacher-forced reference fidelity |
| Headline claim | CASK improves the minimum usable budget frontier rather than trying to be the most aggressive compressor at every setting |
| Prompt-heavy policy | Stage 1 prefix eviction, then Stage 2 decode consolidation |
| Primary runner | python scripts/cli.py run-one ... --method cask |
| Main replay harness | scripts/replay_reference_fidelity.py |
| Command provenance | artifacts/COMMAND_MAP.md |
| Figure assets | docs/assets/ |
| Paper source | paper/ with the vendored cask_arxiv.sty preprint style |
| Stage | What happens | Why it exists |
|---|---|---|
| Stage 1: prefix compression | TriAttention-style eviction plus a small coverage reserve | Prevent prompt-heavy runs from exhausting the budget before decode starts |
| Stage 2: decode compression | Split decode states into protected core and mergeable scratch, then consolidate scratch only | Preserve answer-critical states while compressing redundant reasoning work |
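The two stages in the table above can be sketched as a toy policy. This is a minimal illustration under stated assumptions, not the code in cask/methods/cask.py: the `State` fields, the score-based eviction rule, the evenly spaced coverage reserve, and the score-weighted merge are all hypothetical stand-ins chosen to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class State:
    pos: int                # token position
    score: float            # hypothetical importance score
    is_core: bool = False   # hypothetical flag: protected core state
    group: int = -1         # hypothetical scratch-group id (Stage 2 only)
    value: float = 0.0      # scalar stand-in for a cached KV vector


def stage1_prefix_evict(prefix: List[State], budget: int,
                        coverage_reserve: int) -> List[State]:
    """Keep an evenly spaced coverage reserve, then fill the rest by score."""
    if len(prefix) <= budget:
        return list(prefix)
    reserve = min(coverage_reserve, budget)
    step = max(1, len(prefix) // max(1, reserve))
    kept = {s.pos for s in prefix[::step][:reserve]}
    for s in sorted(prefix, key=lambda s: s.score, reverse=True):
        if len(kept) >= budget:
            break
        kept.add(s.pos)
    return [s for s in prefix if s.pos in kept]


def stage2_decode_consolidate(decode: List[State]) -> List[State]:
    """Leave core states untouched; merge each scratch group into one state."""
    core = [s for s in decode if s.is_core]
    groups: Dict[int, List[State]] = {}
    for s in decode:
        if not s.is_core:
            groups.setdefault(s.group, []).append(s)
    merged = []
    for gid, members in groups.items():
        total = sum(m.score for m in members) or 1.0
        rep_value = sum(m.score * m.value for m in members) / total
        rep = max(members, key=lambda m: m.score)  # position of the representative
        merged.append(State(rep.pos, total, False, gid, rep_value))
    return sorted(core + merged, key=lambda s: s.pos)
```

The point of the split is that Stage 1 only ever drops prefix states, while Stage 2 never touches a state flagged as core; only scratch groups shrink.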
CASK is built around two paper-facing diagnostics: lost representative mass and kappa-dispersion.
If a scratch group preserves most of the horizon-weighted mass and its members stay close to the representative under the calibrated kappa geometry, then replacing that group with one representative changes the next-token score only by a small additive amount.
This is the main reason the repo reports fidelity and saved ratio together rather than treating compression as a pure ranking problem.
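A toy numerical check can make the additive-error intuition concrete. The sketch below is an assumption-laden illustration, not the paper's definition of either diagnostic: it uses plain dot-product logits instead of the calibrated kappa geometry, takes the unweighted mean key as the representative, and lets the representative carry the group's count. Under those assumptions, if every member logit is within `eps` of the representative logit, the group's log contribution to the softmax denominator moves by at most `eps` after merging.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_key(keys: List[List[float]]) -> List[float]:
    n = len(keys)
    return [sum(col) / n for col in zip(*keys)]


def group_logmass(query: List[float], keys: List[List[float]]) -> float:
    """log sum_i exp(q . k_i), computed stably."""
    logits = [dot(query, k) for k in keys]
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def merged_logmass(query: List[float], keys: List[List[float]]) -> float:
    """Same quantity after replacing the group by its mean key, counted n times."""
    return math.log(len(keys)) + dot(query, mean_key(keys))


def logit_dispersion(query: List[float], keys: List[List[float]]) -> float:
    """Largest gap between a member logit and the representative logit."""
    center = dot(query, mean_key(keys))
    return max(abs(dot(query, k) - center) for k in keys)


# Hypothetical query and scratch group, chosen only for illustration.
query = [0.5, -1.0, 0.25]
keys = [[1.0, 0.2, 0.0], [1.1, 0.25, -0.1], [0.9, 0.15, 0.1]]

eps = logit_dispersion(query, keys)
gap = abs(group_logmass(query, keys) - merged_logmass(query, keys))
print(f"dispersion={eps:.4f} merge_gap={gap:.6f}")
```

The bound holds because each `exp(q . k_i)` lies between `exp(c - eps)` and `exp(c + eps)`, where `c` is the representative logit, so the sum lies between `n * exp(c - eps)` and `n * exp(c + eps)`.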
| Mode | Purpose | Status |
|---|---|---|
| fullkv | Uncompressed reference run | primary reference |
| triattention | Original eviction baseline | primary baseline |
| cask | CASK mainline implementation | primary paper method |
| expectedattention | Closest prior-style comparison kept in-tree | optional baseline |
If you are running the current paper candidate, use --method cask.
The tables below report both fidelity and terminal saved ratio so the quality-memory tradeoff is visible at a glance.
| Slice | Budget | Tri Top-1 | CASK Top-1 | Tri Mean NLL | CASK Mean NLL | Tri Saved Ratio | CASK Saved Ratio |
|---|---|---|---|---|---|---|---|
| AIME24 ref6 | 256 | 86.10 | 88.43 | 0.463 | 0.359 | 65.31% | 65.28% |
| AIME24 ref6 | 384 | 88.25 | 90.69 | 0.383 | 0.268 | 61.55% | 61.55% |
| AIME24 ref6 | 512 | 89.42 | 91.68 | 0.333 | 0.233 | 43.57% | 43.52% |
| AIME25 ref6 | 256 | 85.71 | 86.77 | 0.500 | 0.504 | 63.37% | 55.87% |
| AIME25 ref6 | 384 | 89.13 | 90.32 | 0.357 | 0.313 | 59.52% | 63.26% |
| AIME25 ref6 | 512 | 89.94 | 91.68 | 0.321 | 0.254 | 44.76% | 37.25% |
Detail: H100 reasoning replay package
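For readers who want the same-budget gaps rather than the raw columns, the reasoning rows can be differenced directly. The snippet below is a small convenience sketch: the numbers are copied from the AIME24 / AIME25 table above, while the `ROWS` layout and the `deltas` helper are hypothetical names and not part of the repository.

```python
# (slice, budget, tri_top1, cask_top1, tri_mean_nll, cask_mean_nll)
ROWS = [
    ("AIME24 ref6", 256, 86.10, 88.43, 0.463, 0.359),
    ("AIME24 ref6", 384, 88.25, 90.69, 0.383, 0.268),
    ("AIME24 ref6", 512, 89.42, 91.68, 0.333, 0.233),
    ("AIME25 ref6", 256, 85.71, 86.77, 0.500, 0.504),
    ("AIME25 ref6", 384, 89.13, 90.32, 0.357, 0.313),
    ("AIME25 ref6", 512, 89.94, 91.68, 0.321, 0.254),
]


def deltas(rows):
    """Return (slice, budget, Top-1 gain in points, Mean NLL change) per row."""
    return [
        (name, budget, round(cask_top1 - tri_top1, 2), round(cask_nll - tri_nll, 3))
        for name, budget, tri_top1, cask_top1, tri_nll, cask_nll in rows
    ]


for name, budget, top1_gain, nll_change in deltas(ROWS):
    print(f"{name} @ {budget}: Top-1 {top1_gain:+.2f}, Mean NLL {nll_change:+.3f}")
```

A positive Top-1 gain and a negative Mean NLL change both favor CASK; rows where the two disagree are worth reading alongside the saved-ratio columns.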
| Dataset | Budget | Tri Top-1 | CASK Top-1 | Tri Mean NLL | CASK Mean NLL | CASK Saved Ratio | Read |
|---|---|---|---|---|---|---|---|
| qasper | 256 | 67.19 | 71.09 | 1.315 | 1.247 | 90.90% | same-budget replay win |
| multi_news | 384 | 53.71 | 61.33 | 2.052 | 1.540 | 84.07% | decode-active replay win |
| hotpotqa | 384 | 81.25 | 96.88 | 1.344 | 0.110 | 96.49% | strongest same-budget witness |
| 2wikimqa | 384 | 59.38 | 56.25 | 3.415 | 2.397 | 94.41% | retained boundary |
Detail: Prompt-heavy replay readout
| Task | Comparison | Lexical / Sequence | Semantic Sim. | Official Metric | CASK Terminal Saved Ratio | Read |
|---|---|---|---|---|---|---|
| qasper | CASK @ 256 vs TriAttention @ 512 | 0.238 > 0.173 | 0.791 > 0.678 | 12.77 > 11.94 | 90.90% | clean budget crossing on all three axes |
| multi_news | CASK @ 384 vs TriAttention @ 384 | 0.169 > 0.000 | 0.952 > 0.452 | 15.16 > 0.00 | 84.07% | strongest decode-active output bridge |
| hotpotqa | CASK @ 256 vs TriAttention @ 256 | 1.000 = 1.000 | 1.000 = 1.000 | 27.27 = 27.27 | 97.57% | non-regression parity |
Detail: H100 actual-output bridge package
| Axis | Current read |
|---|---|
| Reasoning replay | CASK beats TriAttention at the same budget on the tracked H100 gate and shows partial crossing |
| Prompt-heavy replay | CASK has a strong same-budget replay package, with multi_news and vcsum as the current replay-level decode-active witnesses |
| Output-level bridge | multi_news remains the strongest decode-active output bridge; vcsum is a lexical-vs-semantic boundary rather than a clean headline win |
| Savings interpretation | The claim is not "always compress more"; it is "keep full-KV behavior alive at a lower usable budget" |
| Claim boundary | Active decode regime and prefix_budget_exhausted regime must be separated explicitly |
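The "lower usable budget" reading can be made mechanical: pick a fidelity floor and ask for the smallest budget at which each method clears it. The sketch below does that with the AIME24 / AIME25 Top-1 values reported earlier in this README. The threshold values and the `min_usable_budget` helper are illustrative assumptions; the source does not fix a specific floor.

```python
from typing import Dict, Optional

# Top-1 values by budget, copied from the reasoning replay table.
TOP1: Dict[tuple, Dict[int, float]] = {
    ("AIME24 ref6", "triattention"): {256: 86.10, 384: 88.25, 512: 89.42},
    ("AIME24 ref6", "cask"): {256: 88.43, 384: 90.69, 512: 91.68},
    ("AIME25 ref6", "triattention"): {256: 85.71, 384: 89.13, 512: 89.94},
    ("AIME25 ref6", "cask"): {256: 86.77, 384: 90.32, 512: 91.68},
}


def min_usable_budget(by_budget: Dict[int, float], floor: float) -> Optional[int]:
    """Smallest tested budget whose Top-1 meets the floor, or None if none does."""
    passing = [budget for budget, top1 in by_budget.items() if top1 >= floor]
    return min(passing) if passing else None


for floor in (88.0, 90.0):  # hypothetical fidelity floors
    for (slice_name, method), by_budget in TOP1.items():
        budget = min_usable_budget(by_budget, floor)
        print(f"floor={floor} {slice_name} {method}: {budget}")
```

A crossing shows up whenever CASK clears the floor at a smaller budget than TriAttention, or clears it where TriAttention does not within the tested budgets. Whether a crossing appears depends on the chosen floor, which is why the repo describes the current evidence as partial.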
Start here: artifacts/README.md
Command trace: artifacts/COMMAND_MAP.md
| If you want to know... | Open this | Why this is the right package |
|---|---|---|
| whether CASK wins the main reasoning replay gate | H100 reasoning replay gate | contains the AIME24 / AIME25 synchronized replay tables and crossing read |
| whether replay gains show up in actual generation | H100 actual-output bridge | contains the tracked qasper, multi_news, and hotpotqa output bridge rows |
| how to read the full prompt-heavy story | H100 prompt-heavy follow-up | separates replay-level decode-active wins from output-level bridge rows and prefix_budget_exhausted boundaries |
| where compact raw output provenance lives | Raw output snapshot | preserves LongBench actual-output and math witness logs without committing obsolete exploratory output trees |
| which figures are current and ready to paste into the draft | Figure asset pack | bundles the synchronized PNG/PDF versions of the reasoning gate, prompt-heavy, bridge, and method-overview figures |
| Package | Use it for | Do not use it for |
|---|---|---|
| H100 reasoning replay gate | main reasoning replay headline | final benchmark-accuracy headline |
| H100 actual-output bridge | showing replay-to-output linkage | broad decode-stage generalization claims by itself |
| H100 prompt-heavy follow-up | regime separation and prompt-heavy narrative | pretending every prompt-heavy task is decode-active |
| Raw output snapshot | audit trail for compact generation logs | paper headline interpretation without the packaged summaries |
Python 3.10+ is required.
git clone https://github.com/Skyline-23/CASK.git
cd CASK
pip install -e .

For Linux benchmarking and vLLM runtime work, install the matching CUDA, FlashAttention, and vLLM stack separately. The HuggingFace replay harnesses also run on a single 16 GB consumer GPU with sdpa.
| Task | Command |
|---|---|
| FullKV reference | python scripts/cli.py run-one --model Qwen3-8B --dataset math500 --method fullkv |
| TriAttention baseline | python scripts/cli.py run-one --model Qwen3-8B --dataset math500 --method triattention --budget 104 --stats-path cask/calibration/for_aime24_experiment/qwen3_8b.pt |
| CASK mainline | python scripts/cli.py run-one --model Qwen3-8B --dataset math500 --method cask --budget 104 --stats-path cask/calibration/for_aime24_experiment/qwen3_8b.pt |
| Teacher-forced replay | python scripts/replay_reference_fidelity.py --reference ... --model-path ... --method cask --budget 104 --triattention-stats-file ... |
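When comparing the three run-one rows side by side, it can help to generate the commands from one place so that model, dataset, budget, and stats path stay in sync. The helper below is a hypothetical convenience wrapper, not a script shipped in the repo; it only assembles the flags that appear in the table above and prints the resulting command lines.

```python
import shlex
from typing import List, Optional

# Calibration file path as given in the command table.
STATS_PATH = "cask/calibration/for_aime24_experiment/qwen3_8b.pt"


def run_one_argv(method: str, model: str = "Qwen3-8B", dataset: str = "math500",
                 budget: Optional[int] = None,
                 stats_path: Optional[str] = None) -> List[str]:
    """Build the argv list for one `scripts/cli.py run-one` invocation."""
    argv = ["python", "scripts/cli.py", "run-one",
            "--model", model, "--dataset", dataset, "--method", method]
    if budget is not None:
        argv += ["--budget", str(budget)]
    if stats_path is not None:
        argv += ["--stats-path", stats_path]
    return argv


commands = [
    run_one_argv("fullkv"),
    run_one_argv("triattention", budget=104, stats_path=STATS_PATH),
    run_one_argv("cask", budget=104, stats_path=STATS_PATH),
]

for argv in commands:
    print(shlex.join(argv))
```

To actually launch a run, pass one of these lists to `subprocess.run(argv, check=True)` from the repository root.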
Replay example:
python scripts/replay_reference_fidelity.py \
--reference experiments/outputs/math500/Qwen3-8B/sample1/fullkv/fullkv_selection_geometry_reference \
--model-path experiments/models/Qwen3-8B \
--method cask \
--budget 104 \
--triattention-stats-file cask/calibration/for_aime24_experiment/qwen3_8b.pt \
--attn-implementation sdpa \
--count-prompt-tokens true \
--slack-budget-trigger true \
--json-output experiments/reports/geometry_teacher_forced_fidelity_vs_fullkv_sm104_v2.json

| Path | Purpose |
|---|---|
| scripts/cli.py | high-level experiment wrapper |
| scripts/worker.py | HuggingFace execution path |
| scripts/replay_reference_fidelity.py | teacher-forced replay harness |
| scripts/run_replay_fidelity_frontier.py | batch replay frontier launcher |
| scripts/run_promptheavy_pack.py | generic prompt-heavy fullkv + replay package planner/launcher |
| scripts/run_replay_queue.ps1 | PowerShell queue wrapper for replay configs |
| scripts/run_longbench_suite.py | LongBench generation harness |
| scripts/compare_kv_fidelity.py | output-level comparison helper |
| scripts/build_promptheavy_saved_ratio_audit.py | package prompt-heavy replay summaries |
| scripts/build_actual_bridge_artifacts.py | package actual-output bridge summaries |
| scripts/sync_reasoning_gate.py | sync replay reports into the tracked reasoning-gate summaries |
| scripts/build_paper_figures.py | generate the current paper-facing figure pack under docs/assets/ |
| scripts/refresh_paper_figures.ps1 | optional sync + figure refresh wrapper |
| paper/ | venue-neutral paper source, bibliography, and build entrypoints |
| paper/make_arxiv_package.ps1 | generate and optionally verify the arXiv source zip from the canonical paper source |
| cask/methods/triattention.py | TriAttention baseline implementation |
| cask/methods/cask.py | CASK implementation |
| artifacts/ | tracked paper-facing summaries |
| docs/assets/ | paper-facing rendered figures |

paper/content.tex is the canonical manuscript body. There is no tracked
arxiv_submit source copy; submission sources are generated from the canonical
files. To build the local PDFs:
Push-Location paper
latexmk -g -pdf -interaction=nonstopmode -halt-on-error main_author.tex
latexmk -g -pdf -interaction=nonstopmode -halt-on-error main_anonymous.tex
Pop-Location

To create the arXiv source package from the canonical source:

paper\make_arxiv_package.ps1 -Verify

The script stages a clean temporary submission tree, rewrites the figure path for
local figures/, copies the current PDFs from docs/assets/, creates
paper/cask_arxiv_source.zip, and verifies that the zip compiles independently.
This codebase started from the TriAttention implementation and now serves as the active research repository for CASK. Some internal names remain for compatibility with existing tooling, but the paper-facing method is CASK.
Apache 2.0. See LICENSE.