WIP leaderboard integration with experiments #4779
Draft
Samoed wants to merge 3 commits into
ModelMeta.loader/loader_kwargs are now Field(exclude=True) with safe defaults so the meta JSON-roundtrips without the callable; on-disk files written by to_dict() store loader as a string, so a before-mode validator coerces strings back to None.

ResultCache.load_results now parses the per-experiment model_meta.json into a full ModelMeta and attaches it to each ModelResult, so variant attributes (e.g. model_type flipping to late-interaction) flow through unchanged.

_build_pre_agg_df emits two new columns:
- `experiments`: the variant kwargs dict (None on base rows)
- `model_meta`: the per-experiment ModelMeta.to_dict() (None on base rows so the aggregator can fall back to MODEL_REGISTRY cheaply)

Aggregator/frontend consumption of these columns is deferred; the parquet now has everything they'll need.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
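For readers skimming the commit, the two new columns and the fallback they enable can be sketched roughly as below. This is a plain-Python illustration, not the code in this PR: `build_row`, `resolve_meta`, the sample scores, and the contents of the stand-in `MODEL_REGISTRY` are all hypothetical, and the real frame is a parquet-backed dataframe rather than a list of dicts.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical stand-in for the registry the aggregator falls back to.
MODEL_REGISTRY: Dict[str, Dict[str, Any]] = {
    "org/model": {"name": "org/model", "model_type": "dense", "embed_dim": 768},
}


def build_row(
    model_name: str,
    score: float,
    experiments: Optional[Dict[str, Any]] = None,
    model_meta: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one pre-aggregation row; both new columns stay None on base rows."""
    return {
        "model_name": model_name,
        "score": score,
        "experiments": experiments,
        "model_meta": model_meta,
    }


def resolve_meta(row: Dict[str, Any]) -> Dict[str, Any]:
    """Prefer the per-experiment meta; otherwise fall back to the registry."""
    if row["model_meta"] is not None:
        return row["model_meta"]
    return MODEL_REGISTRY[row["model_name"]]


rows = [
    # Base row: no experiment kwargs, no per-experiment meta.
    build_row("org/model", 0.61),
    # Variant row: carries its kwargs and its own meta dict.
    build_row(
        "org/model",
        0.66,
        experiments={"colbert": True},
        model_meta={"name": "org/model", "model_type": "late-interaction", "embed_dim": 128},
    ),
]

# The rows contain only JSON-friendly values, so they survive a round trip.
restored = json.loads(json.dumps(rows))
```

Keeping `model_meta` as None on base rows means only variant rows pay the cost of carrying a full meta dict; everything else is a single registry lookup.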
Minimal hand-port of the experiments-branch variant-row feature against api's refactored ``_create_table.py``. Each experiment variant of a model now gets its own (model_name, variant_id) row in summary and per-task tables instead of being silently averaged into the base model.

Mechanics:
- ``_ensure_variant_id`` derives a stable ``_variant_id`` Utf8 column from the parquet's ``experiments`` dict (empty string for base rows, serialized kwargs for variants, matching the on-disk experiment folder name).
- ``_build_per_task_pivot`` and ``_create_summary_table`` group/pivot on ``(model_name, _variant_id)``; ``_per_task_rows_and_cols`` keys the resulting map the same way.
- ``_attach_model_metadata``'s inner join on ``model_name`` keeps ``_variant_id`` as a left-side passthrough; ``_order_summary_cols`` slots it in right after ``Model``.
- API aggregator (``build_benchmark_summary``) reads ``_variant_id`` from the summary row, looks up the per-variant experiments dict from the long frame, and populates ``SummaryRowSchema.experiments`` (``None`` on base rows).
- Borda ranks stay keyed by ``model_name``, so variants of the same model share the base model's rank. Acceptable for the minimal scope; a richer port can rank over (model_name, variant_id).
- Gradio styler drops ``_variant_id`` at the display boundary so the bookkeeping column doesn't leak into the leaderboard grid.

Per-variant ModelMeta overlays (embed_dim flipping, model_type flipping to late-interaction, etc.) are intentionally out of scope here; the parquet still carries ``model_meta`` per variant for a later richer port.

249 benchmark tests pass; the ``ViDoRe(v1&v2)`` failure in test_leaderboard_app_does_not_crash predates this commit (verified on api HEAD).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
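The effect of grouping on ``(model_name, _variant_id)`` instead of ``model_name`` alone can be illustrated with a small stdlib sketch. The names ``variant_id`` and ``summarize``, the sample rows, and the use of sorted-key JSON as the serialization are assumptions for illustration; the actual ``_ensure_variant_id`` operates on a Polars frame and its serialization follows the on-disk folder naming, which is not reproduced here.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List, Optional, Tuple


def variant_id(experiments: Optional[Dict[str, Any]]) -> str:
    """Empty string for base rows, serialized kwargs for variants.

    Sorted-key JSON is an assumed format, chosen only so the id is stable.
    """
    if not experiments:
        return ""
    return json.dumps(experiments, sort_keys=True)


def summarize(rows: List[Dict[str, Any]]) -> Dict[Tuple[str, str], float]:
    """Mean score per (model_name, variant id) group."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for row in rows:
        key = (row["model_name"], variant_id(row["experiments"]))
        groups[key].append(row["score"])
    return {key: mean(scores) for key, scores in groups.items()}


# Hypothetical long-format rows: one base run and one variant of the same model.
long_rows = [
    {"model_name": "org/model", "experiments": None, "task": "T1", "score": 0.60},
    {"model_name": "org/model", "experiments": None, "task": "T2", "score": 0.70},
    {"model_name": "org/model", "experiments": {"colbert": True}, "task": "T1", "score": 0.80},
    {"model_name": "org/model", "experiments": {"colbert": True}, "task": "T2", "score": 0.90},
]

summary = summarize(long_rows)
```

Grouping by ``model_name`` alone would fold all four scores into one mean, which is the silent averaging this commit removes.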
Three small follow-ups to the variant-rows work:
- ResultCache._rebuild_from_full_repository now passes
load_experiments=MATCH_NAME so the rebuilt parquet carries every
ablation row. The default (MATCH_KWARGS) only loads variants when
the caller supplied experiment_kwargs, which silently stripped every
experiment from the leaderboard parquet.
- _ensure_variant_id's inline serializer and the API aggregator's
variants_by_model loop both drop null entries from the Struct.
Polars unions every variant's keys into one Struct schema and pads
absent keys with null, so an experiment that ran with only
`{colbert: True}` previously surfaced as
`{colbert: True, use_image_modality: None}` over the wire — the
cleaned dict matches what actually drove the run and the variant id
stays stable.
Tests still pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
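A rough sketch of the null-dropping step described above, in plain Python. ``clean_experiments`` and ``stable_id`` are hypothetical names, the sorted-key JSON id is an assumed serialization, and treating an all-null struct as a base row is likewise an assumption rather than something this commit states.

```python
import json
from typing import Any, Dict, Optional


def clean_experiments(struct: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Drop keys that were only padded in with null by the unioned Struct schema."""
    if struct is None:
        return None
    cleaned = {key: value for key, value in struct.items() if value is not None}
    # Assumption: a struct with nothing left is treated like a base row.
    return cleaned or None


def stable_id(struct: Optional[Dict[str, Any]]) -> str:
    """Serialize the cleaned dict so padding never changes the id."""
    cleaned = clean_experiments(struct)
    return "" if cleaned is None else json.dumps(cleaned, sort_keys=True)


# What the run was actually configured with.
original = {"colbert": True}
# What comes back after another variant added `use_image_modality` to the schema.
padded = {"colbert": True, "use_image_modality": None}
```

Without the cleaning step, the id of an existing variant would change whenever a new variant introduced a previously unseen key.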
Close #1211
Frontend embeddings-benchmark/leaderboard-frontend#4