New mteb/api subpackage exposes the leaderboard data as a FastAPI service backed by ResultCache + the existing polars summary builders. Routes mirror the SvelteKit frontend's data needs: benchmark menu, benchmark detail, and prerendered summary tables. CORS origins, preload, and cache locations come from settings. Dockerfile clones mteb@api, installs .[api], and serves uvicorn on :7860 as UID 1000 — drop-in for a Hugging Face Space. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
pydantic-settings' EnvSettingsSource tries to json.loads any field it considers complex *before* invoking field_validators, which made the documented comma-separated MTEB_API_CORS_ORIGINS format crash with JSONDecodeError at app startup inside the HF Space. NoDecode skips that pre-parse step and lets the existing field_validator split on commas as advertised. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
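The failure mode described in this commit can be reproduced without pydantic: a comma-separated origin list is not valid JSON, so an eager `json.loads` fails before any splitting validator gets a chance to run. A minimal sketch using only the standard library follows; `split_origins` and the example origins are hypothetical stand-ins, not the validator in `mteb/api/settings.py`.

```python
import json


def split_origins(raw: str) -> list[str]:
    """Hypothetical stand-in for the comma-splitting field validator."""
    return [part.strip() for part in raw.split(",") if part.strip()]


# Example value in the documented comma-separated format (made-up origins).
raw = "https://example.org, http://localhost:5173"

# An eager JSON pre-parse rejects the value outright.
try:
    json.loads(raw)
    pre_parse_ok = True
except json.JSONDecodeError:
    pre_parse_ok = False

# Splitting on commas, as the validator is meant to do, works fine.
origins = split_origins(raw)
```

Skipping the pre-parse (which is what `NoDecode` is described as doing) leaves the raw string for the validator to split.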
`RUN git clone` always produces the same layer hash because the command string never changes, so HF Spaces was rebuilding the image on top of a stale checkout — the cors_origins NoDecode fix never made it into the running container. Pull the latest commit SHA from GitHub via ADD just before the clone; ADD invalidates the layer whenever the response body changes, which forces a fresh clone per push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Conflicts: # mteb/benchmarks/benchmark.py
The api module needed only this one-line helper from mteb.leaderboard.app, but importing it pulled in gradio, pandas, and cachetools — none of which belong in the [api] extra. Promoting it to a property on ResultCache lets every consumer (api, leaderboard, bench script) reach the path without dragging the Gradio stack into the API container. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the cold-start cost of cloning the GitHub results repo on first request by pulling the same data from huggingface.co/datasets/mteb/results during image build. Goes into the default huggingface_hub cache under HF_HOME so callers reach it via the standard hub APIs. The download is guarded with `|| true` so it stays non-fatal while the dataset is still being populated upstream — the API just falls back to the GitHub clone on first request. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The results-repo sync now pushes one HF dataset config per benchmark (plus a ``default`` config holding every result, deduped). Rewires the API consumer to match:

* ``_load_from_hub`` enumerates configs and ``load_dataset(name=cfg, split='train')`` each. A failure on one config no longer poisons the whole load.
* ``_load_per_benchmark_frames`` collapses to two paths (hub or cold rebuild) and returns a ``(per_benchmark, all_results)`` tuple instead of the ``_LoadedFrames`` dataclass. The two named wrappers (``get_all_benchmark_frames`` / ``get_all_results_df``) go away; callers destructure inline.
* Hub-supplied ``default`` config short-circuits the per-benchmark concat for the unified view.

Other follow-ups:

* ``BenchmarkResults`` gains ``load_leaderboard_frame`` and ``split_leaderboard_frame`` so loading the raw combined frame can be decoupled from splitting it. The new ``_split_by_benchmark_tasks`` filters via an inner join on ``(task_name, split, subset)`` tuples, so off-spec subsets/splits no longer leak through to ``_create_summary_table``'s ``group_by(model_name, task_name).mean()``.
* ``MTEB_API_CACHE_REPO`` moves to ``Settings`` alongside ``cors_origins`` / ``preload``; consumers go through ``settings.cache_repo()``.
* /robots.txt added to silence Space probes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
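To illustrate why filtering on `(task_name, split, subset)` matters before the per-model, per-task mean, here is a small pure-Python sketch. It is not the polars implementation in the PR; the rows, the `spec` set, and both helper functions are made up for illustration.

```python
from collections import defaultdict

# Rows of a long results frame: (model_name, task_name, split, subset, score).
rows = [
    ("model-a", "TaskX", "test", "en", 0.80),
    ("model-a", "TaskX", "test", "de", 0.60),
    ("model-a", "TaskX", "validation", "en", 0.10),  # split not in the spec
    ("model-a", "TaskX", "test", "fr", 0.20),  # subset not in the spec
]

# Keys the (hypothetical) benchmark specifies.
spec = {("TaskX", "test", "en"), ("TaskX", "test", "de")}


def filter_to_spec(rows, spec):
    """Keep rows whose (task, split, subset) key is in the spec (inner-join semantics)."""
    return [r for r in rows if (r[1], r[2], r[3]) in spec]


def mean_by_model_task(rows):
    """Average scores per (model_name, task_name)."""
    buckets = defaultdict(list)
    for model, task, _split, _subset, score in rows:
        buckets[(model, task)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


unfiltered = mean_by_model_task(rows)
filtered = mean_by_model_task(filter_to_spec(rows, spec))
```

Without the filter, the two off-spec rows pull the mean for `("model-a", "TaskX")` down; with it, only the two on-spec rows contribute.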
Adds the MVEB (Massive Video Embedding Benchmark) benchmark objects to main so the leaderboard and get_benchmark() can resolve them. The underlying tasks are already on main; this adds only the curated benchmark groupings and their registration.

- benchmarks.py: MVEB (23 tasks), MVEB(text, video) (19), MVEB(video) (9), MVEB(beta, extended) (184, alias MVEB(extended)).
- benchmarks/__init__.py: import + __all__ registration.
- _leaderboard_menu.py: new "Video" group under General Purpose.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Correctness:
- frames.py: _rebuild_from_full_repository() was called with no args,
triggering TypeError on the no-disk-cache fallback path.
- aggregators.py: _pick_leader had a duplicated `r.total_params_b is None`.
- benchmark.py: RtebBenchmark renamed Retrieval -> Mean (Task) then
immediately dropped Mean (Task) — collapsed to one drop+rename chain.
- schemas.py: BenchmarkSchema.language_view lost dedupe; restored.
- routes.py: robots.txt docstring sync'd to the new Allow body.
- aggregators.py: build_benchmark_per_language now offloads polars work
to asyncio.to_thread so cold misses don't pin the event loop.
- aggregators.py: row[col] -> row.get(col) for optional mean cols.
Cache + concurrency:
- routes.py: _leader_bytes bounded with LRU eviction (was unbounded).
- frames.py: atomic disk-cache write — .tmp + Path.replace per shard,
manifest atomic-swapped, stale sweep happens AFTER swap.
- warmup.py + app.py: preload runs as asyncio.create_task on the
serving loop instead of a daemon thread with its own asyncio.run().
- icons.py: cache_clear() no longer wipes _fetch_locks.
Table builders:
- new _STANDARD_META_COLS + _order_summary_cols replace 5 copies of
the final column-ordering boilerplate.
- _build_joint_with_type_means_and_borda shared by mean_task +
mean_task_type builders.
- _PublicPrivateBuild dataclass + _build_public_private_joint shared
by mean_public_private and Vidore; Vidore's wrapper inlined.
- per-task table + mean_subset migrated to _borda_rank_from_long;
only Benchmark.to_dataframe still uses the wide-form _get_borda_rank.
- leaderboard/table.py: deleted dead pandas Borda helpers.
API helpers:
- aggregators.py: _per_task_rows_and_cols, _filter_long_df_by_languages,
_read_row_metrics slice the 140-line build_benchmark_summary.
- aggregators.py: _extract_trained_on_map is one polars groupby instead
of a per-row setdefault loop; build_task_scores derives all_subsets
while filling seen.
- cache.py: _cache_or_build generic single-flight helper; _cached_bytes
and get_summary are thin wrappers. summary-schema cache now emits
hit/miss metrics.
- routes.py: _serialize_schemas + _safe_load_frames fold the per-list
and per-map boilerplate; _require_task / _require_model helpers +
dropped dead try/except KeyError in model_scores.
- routes.py: deleted deprecated /benchmarks/{name}/summary alias.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
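The atomic disk-cache write listed under "Cache + concurrency" (`.tmp` + `Path.replace` per shard, manifest swapped last, stale sweep after the swap) can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `frames.py`: the function names, the manifest layout, and the shard file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def atomic_write_bytes(path: Path, data: bytes) -> None:
    """Write to a sibling .tmp file, then rename it over the target."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(data)
    # On POSIX the rename is atomic when both paths share a filesystem.
    tmp.replace(path)


def write_cache(cache_dir: Path, shards: dict[str, bytes]) -> None:
    """Write every shard, swap the manifest, then sweep stale files."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for name, data in shards.items():
        atomic_write_bytes(cache_dir / name, data)
    manifest = json.dumps({"shards": sorted(shards)}).encode()
    atomic_write_bytes(cache_dir / "manifest.json", manifest)
    # Sweep only after the new manifest is in place, so a crash before
    # this point leaves the previous shards readable.
    keep = set(shards) | {"manifest.json"}
    for entry in list(cache_dir.iterdir()):
        if entry.name not in keep:
            entry.unlink()


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_dir = Path(tmp_dir) / "leaderboard"
    write_cache(cache_dir, {"old.parquet": b"v1"})
    write_cache(cache_dir, {"new.parquet": b"v2"})
    remaining = sorted(p.name for p in cache_dir.iterdir())
```

After the second write, only the new shard and the manifest remain; the old shard is removed by the sweep that follows the manifest swap.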
Bundle each single-flight cache (store + locks + LRU cap + metric label) into a CacheLayer dataclass so the 11 module globals collapse to 6 instances and helper signatures lose 3 args. Share the ResultCache root for the leaderboard disk cache so MTEB_CACHE overrides apply uniformly, and promote the JSON Cache-Control max-age to a settings knob (HTTP_MAX_AGE) so dev hard refreshes can opt out of browser caching. Aggregators get smaller too: inlined _read_row_metrics, dropped redundant float() coercions, and renamed lenient_means to language_filtered to match its definition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
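As a rough picture of what bundling "store + locks + LRU cap" into one object looks like, here is a minimal asyncio sketch. `CacheLayerSketch`, its field names, and `get_or_build` are hypothetical; the real `CacheLayer` in `mteb/api/cache.py` also carries a metric label and may differ in every detail.

```python
import asyncio
from collections import OrderedDict
from collections.abc import Awaitable, Callable
from dataclasses import dataclass, field


@dataclass
class CacheLayerSketch:
    """Hypothetical illustration of a single-flight cache with an LRU cap."""

    max_entries: int
    store: OrderedDict[str, bytes] = field(default_factory=OrderedDict)
    locks: dict[str, asyncio.Lock] = field(default_factory=dict)

    async def get_or_build(
        self, key: str, build: Callable[[], Awaitable[bytes]]
    ) -> bytes:
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return self.store[key]
        lock = self.locks.setdefault(key, asyncio.Lock())
        async with lock:
            if key in self.store:  # another waiter built it meanwhile
                self.store.move_to_end(key)
                return self.store[key]
            value = await build()
            self.store[key] = value
            while len(self.store) > self.max_entries:
                self.store.popitem(last=False)  # evict least recently used
            return value


async def demo() -> tuple[int, list[str]]:
    layer = CacheLayerSketch(max_entries=2)
    calls = 0

    async def build() -> bytes:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return b"payload"

    # Five concurrent requests for one key share a single build.
    await asyncio.gather(*(layer.get_or_build("a", build) for _ in range(5)))
    await layer.get_or_build("b", build)
    await layer.get_or_build("c", build)  # pushes "a" out of the store
    return calls, list(layer.store)


calls, keys = asyncio.run(demo())
```

The sketch never prunes `locks`, which a production version would likely want to handle.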
# Conflicts: # pyproject.toml # uv.lock
Pull request overview
This PR introduces a new mteb.api FastAPI service to expose leaderboard data over HTTP, plus tooling and packaging changes to support deployment (including an HF Space–oriented Dockerfile) and pre-rendered Open Graph (OG) image generation/serving.
Changes:
- Added `mteb/api` FastAPI app with cached JSON endpoints, warmup/preload, Prometheus metrics, icon proxying, and optional OpenTelemetry tracing.
- Added OG hero-card HTML template + Playwright-based generator, and mounted generated PNGs under `/og` in the API service.
- Refactored leaderboard/summary table building and related display logic to support the API’s canonical column naming and metadata handling; added language label helpers and moved leaderboard parquet path onto `ResultCache`.
Reviewed changes
Copilot reviewed 36 out of 39 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_benchmarks/test_get_benchmarks.py | Updates the “display_on_leaderboard” test fixture benchmark name. |
| scripts/og-template/template.html | Adds a parameterized HTML template used for OG card rendering. |
| scripts/generate_og_images.py | Adds Playwright-based OG image renderer with incremental hashing and manifest output. |
| scripts/bench_leaderboard.py | Switches parquet path lookup to ResultCache.leaderboard_parquet_path. |
| scripts/bench_api.py | Adds a stdlib-only benchmark script for measuring API endpoint latency/size/gzip/ETag behavior. |
| scripts/bench_api_inproc.py | Adds an in-process benchmark using httpx ASGITransport for stable perf comparisons. |
| pyproject.toml | Adds api and og extras, packages mteb.api static assets, and enables Ruff FAST rules. |
| mteb/results/benchmark_results.py | Extends result frame handling (incl. trained-on flag) and adds frame-splitting helpers. |
| mteb/leaderboard/table.py | Adjusts summary/per-task/per-language styling; adds model short-name/link wrapping and type header humanization. |
| mteb/leaderboard/figures.py | Updates model-name parsing and task-type column detection consistent with canonical column names. |
| mteb/leaderboard/app.py | Removes unused typing import and uses cache.leaderboard_parquet_path. |
| mteb/languages/iso_mappings.py | Adds language_label() helper with script display aliases and caching. |
| mteb/languages/__init__.py | Exposes language_label from the languages package. |
| mteb/cache/result_cache.py | Adds leaderboard_parquet_path property to decouple consumers from the Gradio leaderboard module. |
| mteb/benchmarks/benchmarks/benchmarks.py | Adds new_version metadata and sets specific benchmark aggregation behavior. |
| mteb/benchmarks/benchmark.py | Adds BenchmarkAggregation, summary-table refactor (pivot reuse), and benchmark-specific aggregation config. |
| mteb/benchmarks/_leaderboard_menu.py | Introduces HOME_BENCHMARK_ENTRIES for the new API/menu surface. |
| mteb/benchmarks/_create_table.py | Introduces SummaryTable wrapper and refactors summary/per-task/per-language builders and metadata attachment. |
| mteb/api/warmup.py | Adds startup warmup orchestration and optional background preload for summary/per-language caches. |
| mteb/api/settings.py | Adds pydantic-settings based environment configuration for the API service. |
| mteb/api/serialization.py | Adds shared JSON+gzip+ETag serialization primitives used by caches/routes. |
| mteb/api/schemas.py | Adds pydantic response models matching frontend types, including language labeling and icon proxy behavior. |
| mteb/api/routes.py | Adds FastAPI route handlers (cached bytes responses, icons, favicon, metrics, menu/bench/task/model endpoints). |
| mteb/api/README.md | Documents API install/run, endpoints, CORS, and observability setup. |
| mteb/api/otel.py | Adds optional OTEL tracing setup/instrumentation for FastAPI. |
| mteb/api/metrics.py | Adds Prometheus middleware and registry-scoped metrics for requests/caches/entities. |
| mteb/api/icons.py | Adds cached icon proxying with timeouts and negative caching. |
| mteb/api/frames.py | Adds hub/disk-cache loading and split/unified polars-frame management for API aggregators. |
| mteb/api/cache.py | Adds single-flight caches for schemas and pre-serialized bytes per endpoint. |
| mteb/api/app.py | Adds FastAPI application factory, middleware, route mounting, and /og static mount. |
| mteb/api/aggregators.py | Adds builders for summary/task/model/per-language/leaders payloads. |
| mteb/api/adapters.py | Adds cached adapters around schema construction and a threaded prewarm routine. |
| mteb/api/__init__.py | Exposes create_app() entrypoint for the API package. |
| Makefile | Adds serve-api target and includes the api extra in test installs. |
| Dockerfile | Replaces the previous single-stage image with a multi-stage build for API runtime + OG builder. |
generate_og_images_mpl.py generates the same benchmark "preview" as generate_og_images, but using only Matplotlib. Docker currently uses generate_og_images, which renders in a browser via Playwright. I'm using it because the result looks a bit nicer than Matplotlib's.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Replace the Dockerfile's git-clone-then-install path with a COPY from
the build context (filtered by a new .dockerignore) so CI tests the
checkout under review instead of whatever's already on the upstream
branch. Add api_docker.yml — builds the image, polls /health for up to
2 min, and publishes ghcr.io/<repo>/api:{sha,latest} on main. Drop the
two old docker-test workflows (leaderboard_docker.yml,
hf_space_docker.yml) and strip leaderboard_refresh.yaml down to the
HF Space rebuild curl (publishing now lives in api_docker.yml).
Healthcheck points at the real backend /health endpoint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
KennethEnevoldsen left a comment
Let me start with some general comments before I go deeper.
Packaging: do we want the API in the mteb package?
While I agree that we want the API in this repo (so we can test the two jointly), I would consider making it a second package within the repo called mteb-api.
This made me think that we could also factor out models into a package (mteb-models) to isolate our dependency hell outside of mteb. Not for this PR though.
Leaderboard CLI:
How do we want the current leaderboard CLI to behave? Should it run the current leaderboard? If so, we probably want to clean up gradio.
Do we want to use this as the favicon?
I'm not sure. I used this during development to differentiate between the API and the leaderboard.
How big is this once built?
~6 GB.
- 1 GB source image
- 2 GB pip dependencies
- 250 MB results
- 750 MB Open Graph images
    def test_benchmark_on_leaderboard():
    -    on_leaderboard = "MTEB(Multilingual, v2)"
    +    on_leaderboard = "RTEB(eng, beta)"
Why is this changed?
I changed the list of "visible" benchmarks with the new menu. MTEB Multilingual is the "primary" benchmark, so it's not listed in the new object. We can change this, but I'm not sure how to handle primary benchmarks.
It would be nice to combine all of these scripts into one folder.
Delete this file?
I think this could help keep future API modifications from degrading performance.
    display_name: str | None = None
    language_view: list[str] | Literal["all"] = field(default_factory=list)
    benchmark_hf_repo: str | None = None
    new_version: Sequence[str] | None = None
superseded_by? (for consistency)
    benchmark_hf_repo: str | None = None
    new_version: Sequence[str] | None = None
    # Api aggregation functions
    aggregations: Sequence[BenchmarkAggregation] = (
If we have this here, we might even be able to deprecate the RTEBBenchmark object.
Maybe some of the other ones as well.
Updated to use aggregations directly.
    # Short, reader-friendly overrides for ISO 15924 names whose canonical labels
    # would read awkwardly when appended in parentheses (e.g. the official name
    # for "Hant" is "Han (Traditional variant)").
    _SCRIPT_DISPLAY_ALIASES = {
Do some of these belong in the API rather than in mteb core?
I think we can reuse this somewhere in the future. I don't think that mapping from language codes to language names is API-specific.
        mteb.get_benchmarks(display_on_leaderboard=True), key=lambda x: x.name
    )

    seen: set[str] = set()
Why is this function changed in this PR?
I changed the list of visible benchmarks, from GP_BENCHMARKS + RTEB_BENCHMARK to
mteb/mteb/benchmarks/benchmark.py
Line 60 in 30757af
I think we can keep it as is for now, like the previous leaderboard. We can refactor it in the future.
I think we should keep the old leaderboard. Some people may fork mteb to create their own proprietary benchmarks, and they could use the Gradio implementation to see scores. I don't think it's possible to run the new leaderboard from our package because it's a JS app.
# Conflicts: # .github/workflows/leaderboard_healthcheck.yml # pyproject.toml
The unified results frame previously collapsed (model, task, subset, split) → (model, task, subset) via `max(score)` before serialising, so the leaderboard could never show per-split scores even though tasks like MassiveIntentClassification evaluate on multiple splits. Plumbs `split` through:

- `_UNIFIED_SCHEMA` carries `split`; `_dedupe_unified` groups by `(model, task, split, subset)`, deduping only across rerun rows.
- New `_CACHE_SCHEMA_VERSION = 2`, written into and validated against `manifest.json`. Stale disk caches from before the bump are rebuilt on next boot.
- `TaskScoreRowSchema.subset_scores` becomes `dict[str, dict[str, float]]` (outer subset, inner split) so clients can pivot either axis off one payload.
- `TaskScoresSchema` adds a top-level `splits: list[str]` listing every split observed across models for the task.
- `build_task_scores` walks the deduped unified frame directly and populates the nested map.

The per-row `score` rollup keeps the prior semantics (per-subset value is the max across splits the model ran, then mean across subsets when the model covers every subset), so existing leaderboard ranks don't shift just from surfacing the extra axis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
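The rollup rule in this commit (max across splits per subset, then mean across subsets when every subset is covered) can be sketched over the nested `subset_scores` shape. `rollup_score` is a hypothetical helper, and returning `None` for partial coverage is an assumption made here; the commit message does not say what `build_task_scores` does in that case.

```python
from typing import Optional


def rollup_score(
    subset_scores: dict[str, dict[str, float]],
    all_subsets: list[str],
) -> Optional[float]:
    """Max over splits per subset, then mean over subsets.

    Returns None when the model misses a subset (an assumption of this
    sketch, not confirmed behaviour of the API).
    """
    if not all_subsets or not all(subset_scores.get(s) for s in all_subsets):
        return None
    per_subset = [max(subset_scores[s].values()) for s in all_subsets]
    return sum(per_subset) / len(per_subset)


# Outer key: subset, inner key: split (made-up numbers).
scores = {
    "en": {"test": 0.8, "validation": 0.9},
    "de": {"test": 0.5},
}

full = rollup_score(scores, ["en", "de"])
partial = rollup_score({"en": scores["en"]}, ["en", "de"])
```

Here `en` contributes 0.9 (its best split) and `de` contributes 0.5, so the rollup is their mean; the second call has no `de` entry and yields no rollup under the stated assumption.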
A FastAPI service that powers the leaderboard frontend. New package under `mteb/api/`.

Module layout (`mteb/api/`)

- `app.py`: `router` under `/v1` and `infra_router` at root, lifespan-driven warmup, `/og` static mount with 1-day `Cache-Control`.
- `routes.py`: … `_cached_json` (handles 304, gzip negotiation, `Cache-Control`); uncached ones return pydantic schemas.
- `schemas.py`: … `snake_case` in Python, `camelCase` over the wire (matches `leaderboardv2/src/lib/types.ts`).
- `adapters.py`: … `Schema.from_*` constructors so each benchmark/task/model pays construction cost once.
- `aggregators.py`: … (`build_benchmark_summary`, `build_benchmark_per_language`, `build_benchmark_leaders`, `build_model_scores`, `build_task_scores`), which turn long polars frames into schema objects.
- `frames.py`: … `cache` so `aggregators` can depend on it without dragging in the bytes cache.
- `cache.py`: `CacheLayer` generic: single-flight per-key locks + LRU store + Prometheus labels. Holds the warm serialised bytes routes hand out.
- `serialization.py`
- `warmup.py`
- `metrics.py`: … `/metrics` renderer.
- `otel.py`: … `traceparent` propagation. No-op unless `OTEL_EXPORTER_OTLP_ENDPOINT` is set.
- `icons.py`
- `settings.py`: `pydantic-settings` knobs: `CORS_ORIGINS`, `PRELOAD`, `CACHE_REPO`, `OG_DIR`, `PREWARM_MAX_WORKERS`, `PRELOAD_CONCURRENCY`, `HTTP_MAX_AGE`, `DISK_CACHE`, log level, OTEL vars.
- `static/favicon.png`

Endpoint map

Data routes under `/v1`, infra at root.

Request flow

Long-frame source of truth lives in `frames.py` (loaded once at startup or first request; persisted to `~/.cache/mteb/leaderboard/`, invalidated by HF dataset commit SHA).