WIP leaderboard integration with experiments by Samoed · Pull Request #4779 · embeddings-benchmark/mteb

Samoed · 2026-06-07T08:08:57Z

Close #1211
Frontend embeddings-benchmark/leaderboard-frontend#4

ModelMeta.loader/loader_kwargs are now Field(exclude=True) with safe defaults so the meta JSON-roundtrips without the callable; on-disk files written by to_dict() store loader as a string, so a before-mode validator coerces strings back to None.

ResultCache.load_results now parses the per-experiment model_meta.json into a full ModelMeta and attaches it to each ModelResult; variant attributes (e.g. model_type flipping to late-interaction) flow through unchanged.

_build_pre_agg_df emits two new columns:
- `experiments`: the variant kwargs dict (None on base rows)
- `model_meta`: the per-experiment ModelMeta.to_dict() (None on base rows so the aggregator can fall back to MODEL_REGISTRY cheaply)

Aggregator/frontend consumption of these columns is deferred; the parquet now has everything they'll need.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
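The exclude-plus-validator pattern described in this commit can be sketched with Pydantic v2 roughly as follows. This is a minimal illustration, not the actual `ModelMeta`: the class `SketchModelMeta`, its `name` field, and `fake_loader` are hypothetical stand-ins, and the real model has many more fields.

```python
from typing import Any, Callable, Optional

from pydantic import BaseModel, Field, field_validator


class SketchModelMeta(BaseModel):
    """Hypothetical stand-in for ModelMeta; only the loader fields are modelled."""

    name: str
    # exclude=True keeps the callable (and its kwargs) out of model_dump()/JSON.
    loader: Optional[Callable[..., Any]] = Field(default=None, exclude=True)
    loader_kwargs: dict[str, Any] = Field(default_factory=dict, exclude=True)

    @field_validator("loader", mode="before")
    @classmethod
    def _string_loader_to_none(cls, value: Any) -> Any:
        # Files on disk may store the loader as a plain string (its name);
        # a string cannot be called, so treat it as "no loader".
        if isinstance(value, str):
            return None
        return value


def fake_loader(**kwargs: Any) -> dict[str, Any]:
    """Placeholder loader used only for this sketch."""
    return kwargs


meta = SketchModelMeta(
    name="org/model", loader=fake_loader, loader_kwargs={"revision": "main"}
)

# Round-trip through JSON: the callable is dropped and the defaults come back.
restored = SketchModelMeta.model_validate_json(meta.model_dump_json())

# A file that stored the loader as a string still validates.
legacy = SketchModelMeta.model_validate({"name": "org/model", "loader": "fake_loader"})
```

After the round trip, `restored.loader` is `None` and `restored.loader_kwargs` is empty, so anything that needs the real callable would have to recover it from elsewhere (the commit mentions falling back to MODEL_REGISTRY).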

Minimal hand-port of the experiments-branch variant-row feature against api's refactored ``_create_table.py``. Each experiment variant of a model now gets its own (model_name, variant_id) row in summary and per-task tables instead of being silently averaged into the base model.

Mechanics:
- ``_ensure_variant_id`` derives a stable ``_variant_id`` Utf8 column from the parquet's ``experiments`` dict (empty string for base rows, serialized kwargs for variants, matching the on-disk experiment folder name).
- ``_build_per_task_pivot`` and ``_create_summary_table`` group/pivot on ``(model_name, _variant_id)``; ``_per_task_rows_and_cols`` keys the resulting map the same way.
- ``_attach_model_metadata``'s inner join on ``model_name`` keeps ``_variant_id`` as a left-side passthrough; ``_order_summary_cols`` slots it in right after ``Model``.
- API aggregator (``build_benchmark_summary``) reads ``_variant_id`` from the summary row, looks up the per-variant experiments dict from the long frame, and populates ``SummaryRowSchema.experiments`` (``None`` on base rows).
- Borda ranks stay keyed by ``model_name``, so variants of the same model share the base model's rank. Acceptable for the minimal scope; a richer port can rank over (model_name, variant_id).
- Gradio styler drops ``_variant_id`` at the display boundary so the bookkeeping column doesn't leak into the leaderboard grid.

Per-variant ModelMeta overlays (embed_dim flipping, model_type flipping to late-interaction, etc.) are intentionally out of scope here; the parquet still carries ``model_meta`` per variant for a later richer port.

249 benchmark tests pass; ``ViDoRe(v1&v2)`` failure in test_leaderboard_app_does_not_crash predates this commit (verified on api HEAD).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
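To illustrate why grouping on ``(model_name, _variant_id)`` keeps variants from being averaged into the base model, here is a small plain-Python sketch. It is not the Polars code in ``_create_table.py``: the helper names ``variant_id`` and ``summarize``, the JSON serialization, and the sample scores are all assumptions made for the example; the real serialized form is whatever matches the on-disk experiment folder name.

```python
import json
from collections import defaultdict
from statistics import mean


def variant_id(experiments):
    """Hypothetical serializer: empty string for base rows, sorted JSON otherwise."""
    if not experiments:
        return ""
    return json.dumps(experiments, sort_keys=True)


# Made-up long-format rows: one base run and one variant of the same model.
rows = [
    {"model_name": "org/model", "experiments": None, "task": "T1", "score": 0.60},
    {"model_name": "org/model", "experiments": None, "task": "T2", "score": 0.70},
    {"model_name": "org/model", "experiments": {"colbert": True}, "task": "T1", "score": 0.80},
    {"model_name": "org/model", "experiments": {"colbert": True}, "task": "T2", "score": 0.90},
]


def summarize(rows):
    """Mean score per (model_name, variant id) instead of per model_name."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["model_name"], variant_id(row["experiments"]))
        grouped[key].append(row["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


summary = summarize(rows)
for (model, vid), score in summary.items():
    label = vid or "<base>"
    print(f"{model} [{label}]: {score:.2f}")
```

Grouping by ``model_name`` alone would collapse all four rows into a single mean; with the composite key the base row and the variant row keep separate summaries.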

Three small follow-ups to the variant-rows work:
- ResultCache._rebuild_from_full_repository now passes load_experiments=MATCH_NAME so the rebuilt parquet carries every ablation row. The default (MATCH_KWARGS) only loads variants when the caller supplied experiment_kwargs, which silently stripped every experiment from the leaderboard parquet.
- _ensure_variant_id's inline serializer and the API aggregator's variants_by_model loop both drop null entries from the Struct. Polars unions every variant's keys into one Struct schema and pads absent keys with null, so an experiment that ran with only `{colbert: True}` previously surfaced as `{colbert: True, use_image_modality: None}` over the wire. The cleaned dict matches what actually drove the run, and the variant id stays stable.

Tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
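The null-dropping step in the second follow-up can be sketched in plain Python as below. The function names ``clean_experiments`` and ``stable_variant_id`` are hypothetical, the JSON serialization is an assumption, and treating an all-null struct as a base row is a guess about the intended behaviour rather than something the commit states.

```python
import json


def clean_experiments(padded):
    """Drop keys whose value is None (padding added when struct schemas are unioned)."""
    if padded is None:
        return None
    cleaned = {key: value for key, value in padded.items() if value is not None}
    # Assumption: a struct with nothing left after cleaning is treated as a base row.
    return cleaned or None


def stable_variant_id(experiments):
    """Hypothetical id: empty string for base rows, sorted JSON of the cleaned dict."""
    cleaned = clean_experiments(experiments)
    if cleaned is None:
        return ""
    return json.dumps(cleaned, sort_keys=True)


# What the run actually used versus what comes back after schema padding.
as_run = {"colbert": True}
as_padded = {"colbert": True, "use_image_modality": None}

print(clean_experiments(as_padded))
print(stable_variant_id(as_run) == stable_variant_id(as_padded))
```

Without the cleaning step, the id of an existing variant would change whenever another variant introduced a new key, because the padded dict would gain an extra null entry.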

Samoed and others added 3 commits June 14, 2026 13:32

Samoed force-pushed the experiments branch from f03bb04 to 4d122d3 Compare June 14, 2026 11:00

Samoed wants to merge 3 commits into api from experiments
