Fix hardcoded 2048 max_tokens cap silently mangling thinking models #18
Open
asvarnon wants to merge 3 commits into
Thinking models can exceed the fixtures' hardcoded max_tokens: 2048 before reaching an answer, get cut off mid-thought, and have the raw reasoning dumped into content (see openai_compat.py's reasoning->content fallback). That contaminated content then fails dataextract's strict json.loads and coding's code-block regex outright, scoring 0 regardless of whether the model was actually capable of the task.

- Add a --max-tokens CLI flag that overrides every task's max_tokens (threaded through run_benchmark, BenchmarkSuite.run_task, AgentSuite.run_task, and _run_speed_task).
- dataextract: replace the all-or-nothing json.loads with extract_json, which falls back to a fenced-block match and then a string-aware bracket-matching scan to recover JSON surrounded by stray prose, mirroring the tolerance coding.py already has for code fences. Records which extraction path succeeded in task metadata.
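The three-stage fallback described above (direct parse, fenced-block match, string-aware bracket scan) could look roughly like the sketch below. This is not the PR's actual `extract_json`: the return shape, the path labels (`"direct"`, `"fenced"`, `"bracket_scan"`, `"failed"`), and the helper `_bracket_candidates` are assumptions made for illustration.

```python
import json
import re

# Build the fence marker programmatically so this example can itself sit in a fenced block.
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*\n(.*?)" + _FENCE, re.DOTALL)


def _bracket_candidates(text):
    """Yield substrings whose {} / [] nesting balances, skipping brackets inside strings.

    Hypothetical helper: bracket types are not matched against each other here;
    json.loads rejects any malformed candidate afterwards.
    """
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        depth, in_string, escaped = 0, False, False
        for end in range(start, len(text)):
            c = text[end]
            if in_string:
                if escaped:
                    escaped = False
                elif c == "\\":
                    escaped = True
                elif c == '"':
                    in_string = False
            elif c == '"':
                in_string = True
            elif c in "{[":
                depth += 1
            elif c in "}]":
                depth -= 1
                if depth == 0:
                    yield text[start:end + 1]
                    break


def extract_json(text):
    """Return (value, path), where path names the stage that succeeded (assumed labels)."""
    try:
        return json.loads(text), "direct"
    except json.JSONDecodeError:
        pass
    for match in _FENCE_RE.finditer(text):
        try:
            return json.loads(match.group(1)), "fenced"
        except json.JSONDecodeError:
            continue
    for candidate in _bracket_candidates(text):
        try:
            return json.loads(candidate), "bracket_scan"
        except json.JSONDecodeError:
            continue
    return None, "failed"
```

Returning the path alongside the value is one way to record which extraction stage succeeded, so a scorer can store it in task metadata and later tell clean answers from recovered ones.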
print_run_report ran before save_run, so a crash while rendering the summary (e.g. UnicodeEncodeError on legacy Windows cp1252 terminals choking on the emoji in the header) silently discarded the entire run's results -- happened on a coding-only rerun that scored 93.8 but never got written to disk. Save first, then best-effort print.
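A minimal sketch of the save-first, best-effort-print ordering, assuming simplified stand-ins for `save_run` and `print_run_report` (the real signatures are not shown in this PR text). The wrapper `finish_run` and the report format are hypothetical.

```python
import json
import sys
from pathlib import Path


def save_run(results, path):
    """Stand-in for the real save_run: persist results as UTF-8 JSON."""
    Path(path).write_text(json.dumps(results, indent=2), encoding="utf-8")


def print_run_report(results, stream=sys.stdout):
    """Stand-in for the real report: the emoji header is what cp1252 cannot encode."""
    stream.write(f"\U0001F3C1 Score: {results['score']}\n")


def finish_run(results, path, stream=sys.stdout):
    """Save first, then print; a rendering failure must never cost the saved run."""
    save_run(results, path)
    try:
        print_run_report(results, stream)
    except Exception as exc:  # e.g. UnicodeEncodeError on a legacy terminal
        print(
            f"report rendering failed ({type(exc).__name__}); results saved to {path}",
            file=sys.stderr,
        )
        return False
    return True
```

With this ordering the results file exists before any terminal output is attempted, so an encoding error in the summary only loses the summary.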
Speed fixtures carry deliberately tiny per-task caps (32/48/64/160/384) to measure raw decode throughput on a known-short generation length. The thinking-model --max-tokens override was applying globally, letting trivial tasks run all the way to the override cap and corrupting the latency/tok-s measurement.
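One way to keep the override away from speed fixtures is to resolve the effective cap per task, as in the sketch below. The function `resolve_max_tokens` and its parameters are hypothetical; the commit text does not show how the exemption is actually implemented.

```python
from typing import Optional


def resolve_max_tokens(task_cap: int, override: Optional[int], is_speed_task: bool) -> int:
    """Pick the token cap for one task.

    Assumption: speed tasks always keep their fixture cap so throughput is
    measured on a known-short generation; other tasks take the CLI override
    when one was passed.
    """
    if is_speed_task or override is None:
        return task_cap
    return override
```

For example, with an override of 8192, a speed fixture capped at 64 still runs with 64, while a dataextract task capped at 2048 runs with 8192.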
Fixes #17
Changes
- `--max-tokens N` flag on `benchloop run`, threaded through `run_benchmark` → `BenchmarkSuite.run_task` (covers all five non-agent suites in one place) plus `AgentSuite.run_task` and `_run_speed_task` separately. No-op if you don't pass it.
- `dataextract`: replaced strict `json.loads` with `extract_json()`, which tries direct parse → fenced-block match → bracket-scan recovery, so stray prose around the JSON doesn't score it 0. Same tolerance `coding.py` already has for code fences.
- `cli.py` now saves the run before printing the console report; a rendering crash (cp1252 choking on the emoji header) was silently discarding completed runs.

Why
An always-on-reasoning model can blow through 2048 tokens of `<think>` before answering and get cut off; `openai_compat.py`'s `content = reasoning` fallback then dumps the raw CoT into `content`, which fails strict JSON/code-fence parsing outright, regardless of whether the model could do the task.

Validation
Re-ran 5 models on the same hardware/quant/harness, flag on vs off: