Fix hardcoded 2048 max_tokens cap silently mangling thinking models by asvarnon · Pull Request #18 · outsourc-e/bench-loop

asvarnon · 2026-06-18T23:19:34Z

Fixes #17

Changes

New --max-tokens N flag on benchloop run, threaded through run_benchmark → BenchmarkSuite.run_task (covers all five non-agent suites in one place) plus AgentSuite.run_task and _run_speed_task separately. No-op if you don't pass it.
dataextract: replaced strict json.loads with extract_json() — direct parse → fenced-block match → bracket-scan recovery, so stray prose around the JSON doesn't score it 0. Same tolerance coding.py already has for code fences.
Reordered cli.py to save the run before printing the console report — a rendering crash (cp1252 choking on the emoji header) was silently discarding completed runs.

Why

An always-on-reasoning model can spend more than 2048 tokens inside <think> before it starts answering. The response gets cut off, and the content = reasoning fallback in openai_compat.py dumps the raw CoT into content, which then fails strict JSON/code-fence parsing outright, regardless of whether the model could do the task.

Validation

Re-ran 5 models on the same hardware/quant/harness, flag on vs off:

Qwen3.6-35B-A3B (APEX-MTP): 65.2 → 84.9
Qwen3.6-27B, thinking on: 61.1 → 81.2 (flips which mode wins)
Gemma 4 12B Q4_K_M: 58.8 → 69.3
Gemma 4 12B Q8_0: 61.1 → 71.2
Gemma-4-12B-coder (concise CoT, control case): 81.1 → 83.2

Thinking models can exceed the fixtures' hardcoded max_tokens: 2048 before reaching an answer, get cut off mid-thought, and have the raw reasoning dumped into content (see openai_compat.py's reasoning->content fallback). That contaminated content then fails dataextract's strict json.loads and coding's code-block regex outright, scoring 0 regardless of whether the model was actually capable of the task.

- Add a --max-tokens CLI flag that overrides every task's max_tokens (threaded through run_benchmark, BenchmarkSuite.run_task, AgentSuite.run_task, and _run_speed_task).
- dataextract: replace the all-or-nothing json.loads with extract_json, which falls back to a fenced-block match and then a string-aware bracket-matching scan to recover JSON surrounded by stray prose, mirroring the tolerance coding.py already has for code fences. Records which extraction path succeeded in task metadata.
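The recovery order described in this commit (direct parse, then fenced-block match, then string-aware bracket scan) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the (value, path) return shape, the path labels, and the helper _balanced_spans are assumptions made for the example.

```python
import json
import re

# Build the fence marker programmatically so this example can sit inside a fenced block.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)
_CLOSERS = {"{": "}", "[": "]"}


def _balanced_spans(text):
    """Yield substrings running from an opening bracket to its matching closer.

    Brackets inside double-quoted strings are ignored (string-aware scan).
    Hypothetical helper, not taken from the PR.
    """
    for start, first in enumerate(text):
        if first not in _CLOSERS:
            continue
        stack, in_string, escaped = [], False, False
        for pos in range(start, len(text)):
            ch = text[pos]
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch in _CLOSERS:
                stack.append(_CLOSERS[ch])
            elif ch in "}]":
                if not stack or stack.pop() != ch:
                    break  # mismatched closer: give up on this starting bracket
                if not stack:
                    yield text[start:pos + 1]
                    break


def extract_json(text):
    """Return (value, path), where path names the step that succeeded.

    path is "direct", "fenced" or "scan"; (None, None) means nothing parsed.
    """
    try:
        return json.loads(text), "direct"
    except ValueError:
        pass
    for match in _FENCE_RE.finditer(text):
        try:
            return json.loads(match.group(1)), "fenced"
        except ValueError:
            continue
    for candidate in _balanced_spans(text):
        try:
            return json.loads(candidate), "scan"
        except ValueError:
            continue
    return None, None
```

Returning the path alongside the value is one way to support the "records which extraction path succeeded" behaviour: the scorer can store it in task metadata and later tell clean outputs apart from recovered ones.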

print_run_report ran before save_run, so a crash while rendering the summary (e.g. UnicodeEncodeError on legacy Windows cp1252 terminals choking on the emoji in the header) silently discarded the entire run's results -- happened on a coding-only rerun that scored 93.8 but never got written to disk. Save first, then best-effort print.
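The save-first ordering can be sketched like this. The function names save_run and print_run_report come from the commit message, but their bodies, signatures, and the finish_run wrapper here are hypothetical stand-ins for illustration only.

```python
import json
import sys
from pathlib import Path


def save_run(result, path):
    """Hypothetical stand-in: persist the run as UTF-8 JSON."""
    Path(path).write_text(json.dumps(result), encoding="utf-8")


def print_run_report(result, stream=None):
    """Hypothetical stand-in: a header containing an emoji, like the real report."""
    if stream is None:
        stream = sys.stdout
    stream.write(f"\U0001F3C1 score: {result['score']}\n")


def finish_run(result, path, stream=None):
    """Save first, then print on a best-effort basis.

    Returns True if the report rendered, False if rendering failed.
    Either way the results are already on disk.
    """
    save_run(result, path)
    try:
        print_run_report(result, stream)
    except Exception as exc:  # e.g. UnicodeEncodeError on a cp1252 console
        print(
            f"report rendering failed ({type(exc).__name__}); results saved to {path}",
            file=sys.stderr,
        )
        return False
    return True
```

With this ordering, a terminal that cannot encode the header costs only the console summary, not the run itself.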

Speed fixtures carry deliberately tiny per-task caps (32/48/64/160/384) to measure raw decode throughput on a known-short generation length. The thinking-model --max-tokens override was applying globally, letting trivial tasks run all the way to the override cap and corrupting the latency/tok-s measurement.
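The commit message above states the problem rather than the exact fix. One plausible way to scope the override, assuming speed tasks should always keep their fixture caps, is sketched below; the function name effective_max_tokens and the suite label "speed" are assumptions, not identifiers from the PR.

```python
from typing import Optional


def effective_max_tokens(task_cap: int, override: Optional[int], suite: str) -> int:
    """Pick the max_tokens value to send for one task.

    Hypothetical helper: speed tasks keep their deliberately small caps so
    throughput is measured on a known-short generation; every other suite
    uses the --max-tokens override when one was passed.
    """
    if override is None:
        return task_cap  # flag not passed: behave exactly as before
    if suite == "speed":
        return task_cap  # never widen a speed fixture's cap
    return override
```

For example, under these assumptions a speed task capped at 64 stays at 64 with an override of 16384, while a dataextract task capped at 2048 would run with 16384.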

asvarnon added 3 commits June 18, 2026 00:24

asvarnon wants to merge 3 commits into outsourc-e:main from asvarnon:fix/thinking-model-token-budget
