1. Add native baselines and judge controls #43

Open
ErlisLushtaku wants to merge 4 commits into main from pr32-split/01-judge-limits-baselines

@ErlisLushtaku
Collaborator

Summary

  • Add dataset-native baseline resolution for AlpacaEval, Arena-Hard, m-Arena-Hard, and MT-Bench.
  • Add judge-side limit controls and split judge/battle engine kwargs needed by benchmark runs.
  • Update benchmark docs and focused baseline/limit tests.
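
As a rough illustration of what dataset-native baseline resolution could look like, here is a minimal sketch. It is not the code in this PR: `NATIVE_BASELINES`, `resolve_baseline`, the dataset keys, and the baseline names are all hypothetical placeholders chosen for the example.

```python
from typing import Mapping, Optional

# Hypothetical registry: dataset key -> placeholder name of its native baseline.
# These values are placeholders, not the baselines configured in this PR.
NATIVE_BASELINES: Mapping[str, str] = {
    "alpaca-eval": "alpaca-eval-native-baseline",
    "arena-hard": "arena-hard-native-baseline",
    "m-arena-hard": "m-arena-hard-native-baseline",
    "mt-bench": "mt-bench-native-baseline",
}


def resolve_baseline(dataset: str, override: Optional[str] = None) -> str:
    """Return the explicit override if given, else the dataset's native baseline."""
    if override is not None:
        # A user-supplied baseline always wins over the dataset default.
        return override
    try:
        return NATIVE_BASELINES[dataset]
    except KeyError:
        raise ValueError(
            f"No native baseline registered for dataset {dataset!r}"
        ) from None
```

In this sketch an explicit override takes precedence, and an unknown dataset fails loudly instead of silently falling back to a default.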

Stack

  1. This PR targets main.
  2. Next: pr32-split/02-inference-metadata-thinking.

Test plan

  • uv run pytest tests/test_cli.py tests/test_generate_and_evaluate.py tests/test_instruction_dataset.py tests/test_mt_bench_downloads.py
  • uv run ruff check ... on touched files

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true
@ErlisLushtaku ErlisLushtaku requested review from geoalgo and kargibora and removed request for geoalgo May 19, 2026 13:54
@ErlisLushtaku ErlisLushtaku changed the title Add native baselines and judge controls 1. Add native baselines and judge controls May 19, 2026
Collaborator

@kargibora kargibora left a comment

I see no apparent issues, and I trust that this PR is already well tested. I really like the refactoring done on the dataset files. Thank you for your hard work!

My only concern is that, as we expand the main branch, we should be careful about how we add functionality. It already feels quite crowded, with the different entry points generate_and_evaluate, evaluate, and estimate_elo_ratings gaining more and more features. I will create an issue for a possible refactoring (I proposed one before, but some of the structures have changed since then, so I will propose it again).

use_tqdm: bool = False


@dataclass(frozen=True)
Collaborator

I feel that these class definitions do not belong in generate_and_evaluate.py. I still think we should keep these entry points clean and move the functions and classes into a more common folder hierarchy, so that generate_and_evaluate becomes a single comprehensive function that builds everything it needs through builders. The same applies to evaluate. Otherwise the file becomes quite hard to read (so many if statements and name resolutions for cache paths and filenames).

On the other hand, let's keep it this way for now and introduce this refactoring in later PRs, after we have merged the significant ones.
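
A minimal sketch of the layout suggested in this comment, under stated assumptions: the frozen dataclass would live in a shared module, a builder would own the defaulting and name resolution, and the entry point would only wire things together. `JudgeConfig`, `build_judge_config`, and the argument keys are hypothetical names for illustration, not the classes in this PR, and the `generate_and_evaluate` shown here is a stand-in, not the real function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


# Would live in a shared module (for example a hypothetical config.py),
# not in generate_and_evaluate.py.
@dataclass(frozen=True)
class JudgeConfig:
    model: str
    engine_kwargs: Dict[str, Any] = field(default_factory=dict)
    use_tqdm: bool = False


def build_judge_config(args: Dict[str, Any]) -> JudgeConfig:
    """Hypothetical builder: defaulting and name resolution happen here."""
    return JudgeConfig(
        model=args["judge_model"],
        engine_kwargs=dict(args.get("judge_engine_kwargs", {})),
        use_tqdm=bool(args.get("use_tqdm", False)),
    )


def generate_and_evaluate(args: Dict[str, Any]) -> JudgeConfig:
    """Stand-in entry point: delegates all construction to builders."""
    judge = build_judge_config(args)
    # Generation and evaluation would run here using `judge`.
    return judge
```

With this split, the if statements and cache or filename resolution stay inside the builders, and the entry point reads as a short sequence of build and run steps.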
