WIP leaderboard integration with experiments #4779

Draft
Samoed wants to merge 3 commits into api from experiments

Conversation

Samoed (Member) commented Jun 7, 2026

Samoed and others added 3 commits June 14, 2026 13:32
ModelMeta.loader/loader_kwargs are now Field(exclude=True) with safe
defaults so the meta JSON-roundtrips without the callable; on-disk
files written by to_dict() store loader as a string, so a before-mode
validator coerces strings back to None. ResultCache.load_results now
parses the per-experiment model_meta.json into a full ModelMeta and
attaches it to each ModelResult — variant attributes (e.g. model_type
flipping to late-interaction) flow through unchanged.
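
A minimal sketch of the excluded-loader pattern described above, assuming pydantic v2. `ModelMetaSketch` is a hypothetical stand-in with only the fields needed for illustration; it is not the real `ModelMeta` definition.

```python
from typing import Any, Callable, Optional

from pydantic import BaseModel, Field, field_validator


class ModelMetaSketch(BaseModel):
    """Hypothetical stand-in for ModelMeta (illustration only)."""

    name: str
    model_type: str = "dense"
    # exclude=True keeps these out of model_dump()/model_dump_json(),
    # so the callable never has to be serialized.
    loader: Optional[Callable[..., Any]] = Field(default=None, exclude=True)
    loader_kwargs: dict[str, Any] = Field(default_factory=dict, exclude=True)

    @field_validator("loader", mode="before")
    @classmethod
    def _coerce_string_loader(cls, value: Any) -> Any:
        # On-disk files store the loader as a string, which cannot be
        # called, so it is coerced to "no loader".
        return None if isinstance(value, str) else value


# Assumed shape of an on-disk model_meta.json payload.
on_disk = {
    "name": "org/model",
    "model_type": "late-interaction",
    "loader": "some_loader_fn",
}
meta = ModelMetaSketch.model_validate(on_disk)
roundtrip = ModelMetaSketch.model_validate_json(meta.model_dump_json())
```

Here `meta.loader` ends up as `None`, the dumped JSON carries no `loader` key, and the variant's `model_type` survives the roundtrip.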

_build_pre_agg_df emits two new columns:
- `experiments`: the variant kwargs dict (None on base rows)
- `model_meta`: the per-experiment ModelMeta.to_dict() (None on base
  rows so the aggregator can fall back to MODEL_REGISTRY cheaply)

Aggregator/frontend consumption of these columns is deferred — the
parquet now has everything they'll need.
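
For orientation, the row shape and the registry fallback could look roughly like the sketch below. `REGISTRY`, `make_row`, and `resolve_meta` are hypothetical names, the rows are plain dicts rather than the real frame, and the scores are made up.

```python
from typing import Any, Optional

# Hypothetical stand-in for MODEL_REGISTRY, keyed by model name.
REGISTRY: dict[str, dict[str, Any]] = {
    "org/model": {"name": "org/model", "model_type": "dense"},
}


def make_row(
    model_name: str,
    score: float,
    experiments: Optional[dict[str, Any]] = None,
    model_meta: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Build one pre-aggregation row; base rows leave both new columns None."""
    return {
        "model_name": model_name,
        "score": score,
        "experiments": experiments,
        "model_meta": model_meta,
    }


def resolve_meta(row: dict[str, Any]) -> dict[str, Any]:
    """Prefer the per-experiment meta; fall back to the registry on base rows."""
    if row["model_meta"] is not None:
        return row["model_meta"]
    return REGISTRY[row["model_name"]]


base = make_row("org/model", 0.5)
variant = make_row(
    "org/model",
    0.75,
    experiments={"colbert": True},
    model_meta={"name": "org/model", "model_type": "late-interaction"},
)
```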

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Minimal hand-port of the experiments-branch variant-row feature against
api's refactored ``_create_table.py``. Each experiment variant of a
model now gets its own (model_name, variant_id) row in summary and
per-task tables instead of being silently averaged into the base model.

Mechanics:

- ``_ensure_variant_id`` derives a stable ``_variant_id`` Utf8 column
  from the parquet's ``experiments`` dict (empty string for base rows,
  serialized kwargs for variants — matches the on-disk experiment
  folder name).
- ``_build_per_task_pivot`` and ``_create_summary_table`` group/pivot
  on ``(model_name, _variant_id)``; ``_per_task_rows_and_cols`` keys
  the resulting map the same way.
- ``_attach_model_metadata``'s inner join on ``model_name`` keeps
  ``_variant_id`` as a left-side passthrough; ``_order_summary_cols``
  slots it in right after ``Model``.
- API aggregator (``build_benchmark_summary``) reads ``_variant_id``
  from the summary row, looks up the per-variant experiments dict from
  the long frame, and populates ``SummaryRowSchema.experiments``
  (``None`` on base rows).
- Borda ranks stay keyed by ``model_name`` — variants of the same
  model share the base model's rank. Acceptable for the minimal scope;
  a richer port can rank over (model_name, variant_id).
- Gradio styler drops ``_variant_id`` at the display boundary so the
  bookkeeping column doesn't leak into the leaderboard grid.
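
The mechanics above can be sketched with plain Python dicts. This is an illustration under stated assumptions, not the code in `_create_table.py`: the sorted-key JSON serialization is a placeholder for whatever format matches the experiment folder name, and the scores and ranks are made up.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Any, Optional


def variant_id(experiments: Optional[dict[str, Any]]) -> str:
    """Empty string for base rows, a deterministic string for variants."""
    if not experiments:
        return ""
    # Assumption: sorted-key JSON stands in for the real folder-name format.
    return json.dumps(experiments, sort_keys=True, separators=(",", ":"))


rows = [
    {"model_name": "org/model", "experiments": None, "score": 0.5},
    {"model_name": "org/model", "experiments": None, "score": 0.75},
    {"model_name": "org/model", "experiments": {"colbert": True}, "score": 0.625},
    {"model_name": "org/model", "experiments": {"colbert": True}, "score": 0.875},
]

# Group on (model_name, variant id) so variants are not averaged into the base.
scores: dict[tuple[str, str], list[float]] = defaultdict(list)
for row in rows:
    scores[(row["model_name"], variant_id(row["experiments"]))].append(row["score"])
summary = {key: mean(values) for key, values in scores.items()}

# Ranks stay keyed by model_name, so every variant inherits the base rank.
ranks_by_model = {"org/model": 1}  # hypothetical Borda ranks
table = [
    {"Model": model, "_variant_id": vid, "Rank": ranks_by_model[model], "Mean": avg}
    for (model, vid), avg in sorted(summary.items())
]

# The bookkeeping column is dropped at the display boundary.
display = [{k: v for k, v in r.items() if k != "_variant_id"} for r in table]
```

The base model and its `colbert` variant end up as two separate summary rows that share one rank.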

Per-variant ModelMeta overlays (embed_dim flipping, model_type
flipping to late-interaction, etc.) are intentionally out of scope
here — the parquet still carries ``model_meta`` per variant for a
later richer port.

249 benchmark tests pass; ``ViDoRe(v1&v2)`` failure in
test_leaderboard_app_does_not_crash predates this commit (verified on
api HEAD).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three small follow-ups to the variant-rows work:

- ResultCache._rebuild_from_full_repository now passes
  load_experiments=MATCH_NAME so the rebuilt parquet carries every
  ablation row. The default (MATCH_KWARGS) only loads variants when
  the caller supplied experiment_kwargs, which silently stripped every
  experiment from the leaderboard parquet.
- _ensure_variant_id's inline serializer and the API aggregator's
  variants_by_model loop both drop null entries from the Struct.
  Polars unions every variant's keys into one Struct schema and pads
  absent keys with null, so an experiment that ran with only
  `{colbert: True}` previously surfaced as
  `{colbert: True, use_image_modality: None}` over the wire — the
  cleaned dict matches what actually drove the run and the variant id
  stays stable.
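
A small sketch of the null-dropping step, using plain dicts to imitate the padded Struct values. `drop_nulls` and `serialize` are hypothetical helpers; mapping an all-null dict back to `None` (a base row) is an assumption, as is the JSON serialization.

```python
import json
from typing import Any, Optional


def drop_nulls(experiments: Optional[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Remove keys that were only padded in by the unioned Struct schema."""
    if experiments is None:
        return None
    cleaned = {key: value for key, value in experiments.items() if value is not None}
    # Assumption: a dict with nothing left is treated as a base row.
    return cleaned or None


def serialize(experiments: Optional[dict[str, Any]]) -> str:
    """Hypothetical variant-id serializer: sorted-key JSON, empty for base rows."""
    cleaned = drop_nulls(experiments)
    if cleaned is None:
        return ""
    return json.dumps(cleaned, sort_keys=True, separators=(",", ":"))


# What a run with only {colbert: True} looks like after schema padding.
padded = {"colbert": True, "use_image_modality": None}
original = {"colbert": True}
```

With the nulls dropped, `serialize(padded)` and `serialize(original)` give the same string, so the variant id does not change when another variant adds a new key to the shared schema.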

Tests still pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>