1. Add native baselines and judge controls by ErlisLushtaku · Pull Request #43 · OpenEuroLLM/JudgeArena

ErlisLushtaku · 2026-05-19T12:02:22Z

Summary

Add dataset-native baseline resolution for AlpacaEval, Arena-Hard, m-Arena-Hard, and MT-Bench.
Add judge-side limit controls and split judge/battle engine kwargs needed by benchmark runs.
Update benchmark docs and focused baseline/limit tests.
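
As a rough illustration of the first two bullets, here is a minimal sketch of what baseline resolution and a judge-side limit could look like. It is not taken from the PR's diff: the registry `NATIVE_BASELINES`, its placeholder values, the dataset keys, and the helpers `resolve_baseline` and `apply_judge_limit` are all hypothetical names chosen for this example.

```python
from typing import Optional, Sequence

# Hypothetical registry. The values are placeholders, not the baselines
# actually used by the PR.
NATIVE_BASELINES = {
    "alpaca-eval": "<alpaca-eval-baseline>",
    "arena-hard": "<arena-hard-baseline>",
    "m-arena-hard": "<m-arena-hard-baseline>",
    "mt-bench": "<mt-bench-baseline>",
}


def resolve_baseline(dataset: str, override: Optional[str] = None) -> str:
    """Return the explicit override if given, else the dataset's native baseline."""
    if override is not None:
        return override
    if dataset not in NATIVE_BASELINES:
        raise ValueError(f"No native baseline registered for dataset {dataset!r}")
    return NATIVE_BASELINES[dataset]


def apply_judge_limit(battles: Sequence, limit: Optional[int] = None) -> list:
    """Keep at most `limit` battles for the judge; None means no limit."""
    if limit is None:
        return list(battles)
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return list(battles[:limit])
```

In this sketch an explicit override always wins, so existing runs that pass their own baseline would keep working unchanged.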

Stack

This PR targets main.
Next: pr32-split/02-inference-metadata-thinking.

Test plan

uv run pytest tests/test_cli.py tests/test_generate_and_evaluate.py tests/test_instruction_dataset.py tests/test_mt_bench_downloads.py
uv run ruff check ... on touched files


kargibora

I see no apparent issues, and I trust that this is already a well-tested PR. I quite like the refactoring done on the dataset files. Thank you for your hard work!

My only concern is that, as we expand the main branch, we should be careful about how we add functionality. It already feels quite crowded, with the different entry points generate_and_evaluate, evaluate, and estimate_elo_ratings gaining more and more features. I will create an issue for a possible refactoring (I did so once, but some of the structures have changed since then, so I will propose it again).

kargibora · 2026-05-19T14:20:37Z

    use_tqdm: bool = False


+@dataclass(frozen=True)


I feel like these class definitions do not belong in generate_and_evaluate.py. I still insist on keeping these entry points clean and splitting the functions/classes into a more common folder hierarchy, so that generate_and_evaluate becomes a single comprehensive function that builds everything it needs with the builders. This also applies to evaluate. Otherwise it becomes quite hard to read (so many if statements and name resolutions for caches/filenames).

On the other hand, let's keep it this way for now and introduce this refactoring in the next PRs, after we have merged the significant ones.
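
A minimal sketch of the layout suggested in this comment, under stated assumptions: `JudgeSettings`, `build_judge_settings`, `run_entry_point`, and the argument keys are hypothetical names for illustration, not classes or functions from the PR. The frozen dataclass and its builder would live in a shared module, and the entry point would only wire builders together.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


# --- would live in a shared module (hypothetical), not in the entry point ---
@dataclass(frozen=True)
class JudgeSettings:
    """Hypothetical immutable settings object (not the PR's actual class)."""

    model: str
    limit: Optional[int] = None
    engine_kwargs: Dict[str, Any] = field(default_factory=dict)


def build_judge_settings(args: Dict[str, Any]) -> JudgeSettings:
    """Builder that keeps argument handling and defaults out of the entry point."""
    return JudgeSettings(
        model=args["judge_model"],
        limit=args.get("judge_limit"),
        engine_kwargs=dict(args.get("judge_engine_kwargs", {})),
    )


# --- the entry point then stays short and only calls builders ---
def run_entry_point(args: Dict[str, Any]) -> str:
    judge = build_judge_settings(args)
    return f"judge={judge.model}, limit={judge.limit}"
```

With this split, the if statements and name resolution would sit next to the data they configure, and the entry point reads as a short sequence of builder calls.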

ErlisLushtaku added 4 commits May 19, 2026 00:31

Add native baselines and judge controls

d0f4604

Fix versioned m-Arena-Hard download test

092c0e8

Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split.

Includes-AI-Code: true

remove backward compatibility code

28c21b8

revert

2213b68

ErlisLushtaku mentioned this pull request May 19, 2026

2. Add prompt presets and localized judge prompts #48

Open

ErlisLushtaku requested review from geoalgo and kargibora and removed request for geoalgo May 19, 2026 13:54

ErlisLushtaku changed the title ~~Add native baselines and judge controls~~ 1. Add native baselines and judge controls May 19, 2026

kargibora reviewed May 19, 2026

ErlisLushtaku wants to merge 4 commits into main from pr32-split/01-judge-limits-baselines
