1. Add native baselines and judge controls #43
Update the download_all regression test to expect both versioned m-Arena-Hard datasets introduced by the native-baseline split. Includes-AI-Code: true
kargibora left a comment
I see no apparent issue, and I trust that this is already a well-tested PR. I quite like the refactoring done on the dataset files. Thank you for your hard work!
My only concern is that, as we expand the main branch, we should be careful about how we add functionality. It already feels quite crowded, with the different entry points generate_and_evaluate, evaluate, and estimate_elo_ratings gaining more and more features. I will create an issue for a possible refactoring (I did so once, but some of the structures have changed since then, so I will propose it again).
use_tqdm: bool = False

@dataclass(frozen=True)
I feel these class definitions do not belong in generate_and_evaluate.py. I still insist on keeping these entry points clean and splitting the functions/classes into a more common folder hierarchy, so that generate_and_evaluate becomes a single comprehensive function that builds everything it needs through the builders. The same applies to evaluate. Otherwise it becomes quite hard to read (so many if statements and name resolutions for caches/filenames).
On the other hand, let's keep it this way for now and introduce this refactoring in the next PRs, after we have merged the significant ones.
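To make the proposed split concrete, here is a minimal sketch of what it could look like. Every name in it (RunConfig, build_cache_path, the dict returned by the entry point) is hypothetical and chosen only for illustration; none of them are taken from this PR or the repository. The idea is that the frozen dataclass and the cache-name resolution live in a shared module, and the entry point only wires them together.

```python
from dataclasses import dataclass
from pathlib import Path


# Hypothetical shared module: the config class lives outside the entry point.
# None of these names come from the PR; they are assumptions for illustration.
@dataclass(frozen=True)
class RunConfig:
    dataset: str
    model: str
    use_tqdm: bool = False


def build_cache_path(config: RunConfig, root: Path) -> Path:
    """Resolve the cache file name in one place instead of inline branches."""
    safe_model = config.model.replace("/", "_")
    return root / config.dataset / f"{safe_model}.json"


def generate_and_evaluate(config: RunConfig, root: Path) -> dict:
    """Thin entry point: it only calls the builders and returns their results."""
    cache_path = build_cache_path(config, root)
    return {"config": config, "cache_path": cache_path}
```

With this shape, a call such as generate_and_evaluate(RunConfig(dataset="demo", model="org/model"), Path("cache")) keeps the name-resolution logic testable on its own, and the same builder could be reused by evaluate.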
Summary
Stack
main.pr32-split/02-inference-metadata-thinking.
Test plan
uv run pytest tests/test_cli.py tests/test_generate_and_evaluate.py tests/test_instruction_dataset.py tests/test_mt_bench_downloads.py
uv run ruff check ... on touched files