Fix hardcoded 2048 max_tokens cap silently mangling thinking models#18

Open
asvarnon wants to merge 3 commits into
outsourc-e:mainfrom
asvarnon:fix/thinking-model-token-budget

Conversation

@asvarnon

Fixes #17

Changes

  • New --max-tokens N flag on benchloop run, threaded through run_benchmark → BenchmarkSuite.run_task (covers all five non-agent suites in one place) plus AgentSuite.run_task and _run_speed_task separately. No-op if you don't pass it.
  • dataextract: replaced strict json.loads with extract_json() — direct parse → fenced-block match → bracket-scan recovery, so stray prose around the JSON doesn't score it 0. Same tolerance coding.py already has for code fences.
  • Reordered cli.py to save the run before printing the console report — a rendering crash (cp1252 choking on the emoji header) was silently discarding completed runs.
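For readers who want to see the shape of the dataextract fallback chain (direct parse → fenced-block match → bracket-scan recovery), here is a minimal sketch. It is not the code from this PR: the `(value, path)` return shape, the `_FENCE_RE` pattern, and the `_scan_balanced` helper are assumptions made for illustration.

```python
import json
import re

# Assumed fence pattern: an optional "json" tag after the opening backticks.
_FENCE_RE = re.compile(r"```(?:json)?\s*\n(.*?)```", re.DOTALL)


def _scan_balanced(text):
    """Yield balanced {...} / [...] spans, ignoring brackets inside strings."""
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        depth = 0
        in_string = False
        escaped = False
        for end in range(start, len(text)):
            c = text[end]
            if in_string:
                # Inside a JSON string: only track escapes and the closing quote.
                if escaped:
                    escaped = False
                elif c == "\\":
                    escaped = True
                elif c == '"':
                    in_string = False
            elif c == '"':
                in_string = True
            elif c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    yield text[start:end + 1]
                    break


def extract_json(text):
    """Return (parsed_value, path) or raise ValueError if nothing parses."""
    # 1. Direct parse: the whole response is already valid JSON.
    try:
        return json.loads(text), "direct"
    except json.JSONDecodeError:
        pass
    # 2. Fenced block: JSON wrapped in a Markdown code fence.
    for match in _FENCE_RE.finditer(text):
        try:
            return json.loads(match.group(1)), "fenced"
        except json.JSONDecodeError:
            continue
    # 3. Bracket scan: JSON embedded in stray prose.
    for candidate in _scan_balanced(text):
        try:
            return json.loads(candidate), "scan"
        except json.JSONDecodeError:
            continue
    raise ValueError("no JSON value found in response")
```

Returning the path alongside the value is one way to let the caller record which extraction route succeeded; the actual metadata format is not shown here.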

Why

An always-on-reasoning model can blow through 2048 tokens of <think> before answering and get cut off. openai_compat.py's content = reasoning fallback then dumps the raw CoT into content, which fails strict JSON/code-fence parsing outright, regardless of whether the model could do the task.

Validation

Re-ran 5 models on the same hardware/quant/harness, with the flag off → on:

  • Qwen3.6-35B-A3B (APEX-MTP): 65.2 → 84.9
  • Qwen3.6-27B, thinking on: 61.1 → 81.2 (flips which mode wins)
  • Gemma 4 12B Q4_K_M: 58.8 → 69.3
  • Gemma 4 12B Q8_0: 61.1 → 71.2
  • Gemma-4-12B-coder (concise CoT, control case): 81.1 → 83.2

asvarnon added 3 commits June 18, 2026 00:24
Thinking models can exceed the fixtures' hardcoded max_tokens: 2048
before reaching an answer, get cut off mid-thought, and have the raw
reasoning dumped into content (see openai_compat.py's reasoning->content
fallback). That contaminated content then fails dataextract's strict
json.loads and coding's code-block regex outright, scoring 0 regardless
of whether the model was actually capable of the task.

- Add a --max-tokens CLI flag that overrides every task's max_tokens
  (threaded through run_benchmark, BenchmarkSuite.run_task,
  AgentSuite.run_task, and _run_speed_task).
- dataextract: replace the all-or-nothing json.loads with extract_json,
  which falls back to a fenced-block match and then a string-aware
  bracket-matching scan to recover JSON surrounded by stray prose,
  mirroring the tolerance coding.py already has for code fences.
  Records which extraction path succeeded in task metadata.

print_run_report ran before save_run, so a crash while rendering the
summary (e.g. UnicodeEncodeError on legacy Windows cp1252 terminals
choking on the emoji in the header) silently discarded the entire
run's results -- happened on a coding-only rerun that scored 93.8 but
never got written to disk. Save first, then best-effort print.
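
The save-first ordering described above can be sketched as follows. This is an illustration only: the bodies of `save_run` and `print_run_report` are hypothetical stand-ins, and `finish_run`, its `stream` parameter, and the result dict shape are assumptions rather than the project's real API.

```python
import json
import sys
from pathlib import Path


def save_run(results, path):
    # Hypothetical stand-in: persist results as UTF-8 JSON.
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def print_run_report(results, stream=None):
    # Hypothetical stand-in: a header containing an emoji, which a
    # cp1252-encoded stream cannot represent.
    stream = stream or sys.stdout
    stream.write(f"\U0001F3C1 score: {results['score']}\n")


def finish_run(results, path, stream=None):
    """Save first, then print on a best-effort basis."""
    save_run(results, path)  # results are on disk before any rendering
    try:
        print_run_report(results, stream)
    except Exception as exc:
        # A rendering failure is reported but can no longer lose the run.
        print(
            f"warning: report rendering failed ({type(exc).__name__}); "
            f"results were saved to {path}",
            file=sys.stderr,
        )
        return False
    return True
```

With this ordering, a UnicodeEncodeError raised while rendering only costs the console summary, not the run itself.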

Speed fixtures carry deliberately tiny per-task caps (32/48/64/160/384)
to measure raw decode throughput on a known-short generation length.
The thinking-model --max-tokens override was applying globally,
letting trivial tasks run all the way to the override cap and
corrupting the latency and tok/s measurements.
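
A minimal sketch of how such an override could be scoped so that speed fixtures keep their own caps. This is one possible approach under stated assumptions, not the code in this commit: `effective_max_tokens`, the `SPEED_SUITE` name, and the task dict shape are all hypothetical.

```python
from typing import Optional

# Hypothetical suite name for the speed fixtures.
SPEED_SUITE = "speed"


def effective_max_tokens(task: dict, override: Optional[int], suite: str) -> int:
    """Pick the token cap for one task.

    - No override passed: keep the fixture's own cap (the flag is a no-op).
    - Speed suite: always keep the fixture's cap, so throughput is still
      measured on the intended short generation length.
    - Otherwise: the override replaces the fixture's cap.
    """
    fixture_cap = task["max_tokens"]
    if override is None or suite == SPEED_SUITE:
        return fixture_cap
    return override
```

For example, with an override of 16384 a dataextract task would run with 16384 tokens, while a speed task capped at 32 would stay at 32.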

Successfully merging this pull request may close these issues.

Thinking models get silently mangled by the hardcoded 2048 max_tokens cap