feat: add code-execution routing mode (GCORE_MCP_ROUTING=code_exec) #13

Open
algis-dumbris wants to merge 8 commits into main from feat/code-execution-mode-impl

@algis-dumbris

Contributor

Summary

Adds a new routing mode that exposes 3 meta-tools instead of registering all ~700 SDK methods individually. This addresses the tool-list overflow that prevents most LLM clients from connecting to the server with all tools enabled.

  • New env var GCORE_MCP_ROUTING accepts code_exec (new default) or direct (legacy).
  • code_exec mode exposes search_tools(query), get_tool_schema(name), and execute_code(code). The LLM-generated Python runs in a Pydantic Monty sandbox and reaches the SDK only through a host-injected call_tool().
  • direct mode preserves today's behavior byte-for-byte.

Branched off the design spec PR; the spec commit (docs/superpowers/specs/2026-05-15-code-execution-mode-design.md) is included in this branch's history.

Files

  • gcore_mcp_server/code_exec/ — new package: catalog.py, dispatch.py, runner.py, meta_tools.py
  • gcore_mcp_server/core/serialize.py — extracted from server.py so the dispatcher can reuse it
  • gcore_mcp_server/config/settings.py — adds GCORE_MCP_ROUTING parser
  • gcore_mcp_server/server.py — branches registration on routing mode
  • pyproject.toml — adds pydantic-monty>=0.0.17,<0.1 and rank-bm25>=0.2,<1 (~8 MB combined wheel size, pre-built wheels on macOS/Linux/Windows × CPython 3.10-3.14)
  • tests/code_exec/ — 41 unit tests + 6 e2e tests against the real Gcore API
  • README.md — adds Routing modes section
  • docs/superpowers/specs/2026-05-15-code-execution-mode-design.md — design doc

Test plan

  • uv run ruff format . — clean
  • uv run ruff check . — passes
  • uv run pyright gcore_mcp_server/code_exec/ gcore_mcp_server/core/serialize.py gcore_mcp_server/config/settings.py — 0 errors in new code (pre-existing pyright errors in legacy make_wrapper are untouched)
  • uv run pytest tests/code_exec/ → 47 passed (41 unit + 6 e2e)
  • uv run pytest tests/test_schema.py tests/test_inspection.py tests/test_pattern_filtering.py → 38 passed (no regression in pre-existing tests)
  • Manual smoke test: server boots in both modes
    • GCORE_MCP_ROUTING=code_exec → "Registered 3 meta-tools over 668 SDK methods"
    • GCORE_MCP_ROUTING=direct GCORE_TOOLS=management → "Registered 17 tools" (unchanged from main)
  • E2E test on real Gcore API (../gcore-terraform/.env credentials):
    • Builds catalog from real client (>100 entries)
    • search_tools("list regions") returns cloud.regions.list in top 5
    • Sandbox script await call_tool('cloud.regions.list') succeeds and returns real region IDs
    • Sandbox call_tool('nope.does.not.exist') surfaces a clean KeyError
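
The search check above ("list regions" ranking cloud.regions.list near the top) can be illustrated with a small BM25 ranker. This is a sketch only: it implements Okapi BM25 in pure Python rather than calling rank-bm25, the three-entry catalog and its descriptions are made up, and the relevance boosts the real Catalog applies are not modelled.

```python
import math
import re
from collections import Counter

# Hypothetical three-entry catalog; the real one is built by SDK introspection.
CATALOG = {
    "cloud.regions.list": "List regions available to the account",
    "cloud.ssh_keys.create": "Create an SSH key",
    "cloud.projects.list": "List projects",
}


def tokenize(text):
    # Split on non-alphanumerics so 'cloud.regions.list' yields three tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query, docs, k1=1.5, b=0.75, top_n=5):
    """Rank docs (name -> description) against query with Okapi BM25."""
    tokens = {name: tokenize(name + " " + text) for name, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter(term for t in tokens.values() for term in set(t))

    scores = {}
    for name, doc in tokens.items():
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative idf variant, so common terms never score below zero.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Keep only entries that matched at least one query term.
    return [name for name in ranked if scores[name] > 0][:top_n]


hits = bm25_search("list regions", CATALOG)
```

Indexing the dotted tool name together with its description is a design choice of this sketch; it lets a query such as "list regions" match on the name even when the docstring is terse.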

Migration / compatibility

  • code_exec is the new default. Existing clients with hard-coded tool names will see a different surface. The one-line opt-out is GCORE_MCP_ROUTING=direct. Documented in the README "Routing modes" section.
  • In code_exec mode, GCORE_TOOLS is logged as ignored — catalog filtering doesn't apply when only 3 meta-tools are registered.
  • Server logs the active mode prominently at startup.
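
As a rough illustration of the default and the opt-out, the env parsing could look like the sketch below. The function name get_routing_mode comes from this PR; the `env` parameter, the constant names, the logger name, and the exact fallback behaviour on unknown values are assumptions made for the example.

```python
import logging
import os

logger = logging.getLogger("routing_sketch")  # hypothetical logger name

VALID_MODES = ("code_exec", "direct")
DEFAULT_MODE = "code_exec"


def get_routing_mode(env=None):
    """Return the routing mode named by GCORE_MCP_ROUTING, or the default."""
    # `env` is an assumption of this sketch so the parser can be tested
    # without touching the real process environment.
    env = os.environ if env is None else env
    raw = env.get("GCORE_MCP_ROUTING", "")
    mode = raw.strip().lower()  # normalize whitespace and case
    if not mode:
        return DEFAULT_MODE
    if mode not in VALID_MODES:
        # Assumed behaviour: warn and fall back rather than refuse to start.
        logger.warning(
            "Unknown GCORE_MCP_ROUTING=%r; falling back to %s", raw, DEFAULT_MODE
        )
        return DEFAULT_MODE
    return mode
```

With this shape, an unset variable and a mistyped value both land on code_exec, and only an explicit `direct` restores the legacy surface.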

Sandbox notes (for clients that will use code_exec)

  • Supported: async/await, comprehensions, exceptions, stdlib json/re/datetime.
  • Not supported (Pydantic Monty v0.0.17): class, with, import, match, generators.
  • Defaults: 30 s wall-clock timeout, 200 MB memory cap, 40 KB result and stdout truncation budget (with _truncated marker so the model can re-query with narrower filters).
  • The Gcore API key lives on the host-side client and is never exposed as a sandbox input or external function — the sandbox can authenticate calls via call_tool but cannot read the key.
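
To give a feel for a script that stays within these limits, here is a sketch that runs under plain asyncio instead of Monty. The call_tool below is a hypothetical stand-in with an invented response shape, and the wrapping `script()` function exists only so the example runs outside the sandbox. Whether a KeyError can be caught inside the sandbox exactly like this is an assumption.

```python
import asyncio


# Hypothetical stand-in for the host-injected call_tool(); the real one
# forwards to the Gcore SDK. The response shape here is invented.
async def call_tool(tool_name, /, **kwargs):
    if tool_name != "cloud.regions.list":
        raise KeyError(tool_name)
    return [{"id": 1, "display_name": "Region A"}, {"id": 2, "display_name": "Region B"}]


async def script():
    # Uses only features listed as supported: async/await, comprehensions,
    # exceptions. No class, with, import, match, or generators in the body.
    regions = await call_tool("cloud.regions.list")
    ids = [r["id"] for r in regions]
    try:
        await call_tool("nope.does.not.exist")
        missing = None
    except KeyError as exc:
        missing = exc.args[0]
    return {"region_ids": ids, "missing": missing}


result = asyncio.run(script())
```

Returning a small dict of extracted fields rather than the raw SDK response also helps keep the result under the truncation budget.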

Add design doc for a new GCORE_MCP_ROUTING=code_exec mode that exposes
three meta-tools (search_tools, get_tool_schema, execute_code) backed by
a Pydantic Monty sandbox, replacing the ~700-tool registration. Direct
mode remains available as an opt-out.

Move the SDK-result serializer out of server.py into a small shared
module so the upcoming code_exec dispatcher can reuse it. Renamed
_serialize_result → serialize_result since it's now a public helper.
Behavior unchanged.

- pydantic-monty (>=0.0.17,<0.1) — embedded secure Python interpreter
  used by the new code_exec mode to safely run LLM-generated scripts.
- rank-bm25 (>=0.2,<1) — BM25 search index over the SDK catalog so the
  search_tools meta-tool can rank ~700 SDK methods by relevance.

Both ship pre-built wheels for macOS, Linux, and Windows × CPython
3.10-3.14; combined install adds ~8 MB.

Introduce a new routing mode that exposes three meta-tools
(search_tools, get_tool_schema, execute_code) instead of registering
each SDK method individually. The LLM-generated Python runs in a
Pydantic Monty sandbox and calls SDK methods via host-injected
call_tool().

Package layout:
- code_exec/catalog.py — ToolEntry + BM25-indexed Catalog
- code_exec/dispatch.py — make_call_tool() with auto-injection of
  project_id and region_id
- code_exec/runner.py — execute_code() driving Pydantic Monty with
  result/stream truncation and typed ExecResult
- code_exec/meta_tools.py — registers the three meta-tools on FastMCP
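
The dispatch step can be sketched as follows. Only the name make_call_tool, the positional-only selector, and the idea of injecting project_id and region_id come from this PR; the `methods` and `defaults` arguments, the stub SDK functions, and the tool name cloud.instances.list are hypothetical, and result serialization is left out.

```python
import asyncio
import inspect


def make_call_tool(methods, defaults):
    """Build call_tool() over a name -> callable mapping.

    `methods` and `defaults` are hypothetical inputs for this sketch.
    """

    async def call_tool(tool_name, /, **kwargs):
        method = methods[tool_name]  # unknown names raise KeyError
        accepted = inspect.signature(method).parameters
        for key, value in defaults.items():
            # Inject only when the method takes the parameter and the
            # caller did not supply it explicitly.
            if key in accepted and key not in kwargs and value is not None:
                kwargs[key] = value
        result = method(**kwargs)
        if inspect.isawaitable(result):  # support sync and async SDK methods
            result = await result
        return result

    return call_tool


# Stub SDK methods, one sync and one async.
def list_regions():
    return ["region-a"]


async def list_instances(project_id, region_id, limit=10):
    return {"project_id": project_id, "region_id": region_id, "limit": limit}


call_tool = make_call_tool(
    {"cloud.regions.list": list_regions, "cloud.instances.list": list_instances},
    {"project_id": 1, "region_id": 7},
)
injected = asyncio.run(call_tool("cloud.instances.list", limit=2))
explicit = asyncio.run(call_tool("cloud.instances.list", project_id=9))
plain = asyncio.run(call_tool("cloud.regions.list"))
```

Checking the method signature before injecting keeps methods that take neither parameter, such as the region listing stub, from receiving unexpected keywords.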

Selected via GCORE_MCP_ROUTING={code_exec|direct}, parsed in
config/settings.py:get_routing_mode(). Default is code_exec; set
GCORE_MCP_ROUTING=direct to restore the legacy ~700-tool surface.

- 41 unit tests in tests/code_exec/ covering Catalog (build + BM25
  search + boosts + get_schema), dispatch (call_tool injection +
  awaitable handling + error paths), result/stream truncation,
  Pydantic Monty runner integration with stub catalogs, register
  meta_tools, and get_routing_mode env parsing.
- 6 e2e tests against the real Gcore API. They auto-load credentials
  from ../gcore-terraform/.env so local devs can run them without
  shell config; CI skips when no real key is available. The e2e
  real_client fixture clears the SDK introspection cache so methods
  rebind to the real client (other test files run earlier may have
  poisoned the cache with a dummy-keyed client).

Add a Routing modes section near the top of the README so users
discover code_exec as the new default and know how to opt out via
GCORE_MCP_ROUTING=direct. Lists the sandbox's supported / unsupported
Python features and default resource limits.

Copilot AI review requested due to automatic review settings May 15, 2026 07:37

Copilot AI left a comment

Pull request overview

Adds a new default “code execution” routing mode to avoid MCP tool-list overflow by exposing only three meta-tools (search/schema/code execution) while preserving the existing per-SDK-method tool registration in a legacy direct mode.

Changes:

  • Introduces GCORE_MCP_ROUTING with code_exec (new default) vs direct (legacy) and updates server startup/registration accordingly.
  • Adds a code_exec package implementing SDK catalog + search, host-side dispatch (call_tool), and Monty sandbox execution with truncation.
  • Adds dependencies (pydantic-monty, rank-bm25) plus a dedicated tests/code_exec/ suite and README/docs updates.

Reviewed changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `uv.lock` | Locks new dependencies (Monty, BM25 + transitive deps like numpy) and updates resolution metadata. |
| `pyproject.toml` | Adds pydantic-monty and rank-bm25 runtime dependencies. |
| `README.md` | Documents routing modes, default behavior, and sandbox capabilities/limitations. |
| `gcore_mcp_server/server.py` | Branches tool registration by routing mode; registers 3 meta-tools in code_exec and preserves legacy direct behavior. |
| `gcore_mcp_server/core/serialize.py` | Extracts shared result serialization for reuse by the sandbox dispatcher. |
| `gcore_mcp_server/config/settings.py` | Adds routing-mode parsing (GCORE_MCP_ROUTING) with normalization + warning fallback. |
| `gcore_mcp_server/code_exec/__init__.py` | Exposes the code-exec public surface (catalog/runner/dispatch/meta-tools). |
| `gcore_mcp_server/code_exec/catalog.py` | Builds SDK catalog via introspection; BM25 search + schema export. |
| `gcore_mcp_server/code_exec/dispatch.py` | Implements host-side call_tool with project/region auto-injection + serialization. |
| `gcore_mcp_server/code_exec/meta_tools.py` | Registers search_tools, get_tool_schema, execute_code on the FastMCP server. |
| `gcore_mcp_server/code_exec/runner.py` | Runs Monty sandbox, captures stdout/stderr, enforces limits, truncates outputs/results. |
| `docs/superpowers/specs/2026-05-15-code-execution-mode-design.md` | Design spec describing the new routing mode, tool surface, architecture, and risks. |
| `tests/code_exec/__init__.py` | Marks the code_exec test package. |
| `tests/code_exec/conftest.py` | Sets anyio backend for async tests in this suite. |
| `tests/code_exec/test_catalog.py` | Verifies catalog build from real SDK + search/schema behavior. |
| `tests/code_exec/test_dispatch.py` | Tests call_tool dispatch, async handling, injection rules, and serialization. |
| `tests/code_exec/test_e2e_real_api.py` | Optional real-API e2e coverage (skipped unless real creds are available). |
| `tests/code_exec/test_meta_tools.py` | Ensures exactly three meta-tools are registered and callable. |
| `tests/code_exec/test_runner.py` | Validates sandbox execution contract (stdout, errors, timeout, truncation, async). |
| `tests/code_exec/test_settings_routing.py` | Tests routing-mode env parsing and warning fallback. |
| `tests/code_exec/test_truncation.py` | Tests truncation helpers for unicode safety and marker behavior. |

Comments suppressed due to low confidence (1)

gcore_mcp_server/code_exec/runner.py:131

  • In dict truncation, the budget check only triggers when d is non-empty (and d). If the first key/value pair exceeds the remaining budget, it will still be included, the budget can go negative, and no _truncated marker / hit=True will be produced. Handle the empty-dict case explicitly so oversized first entries still yield a truncated result within budget.
            entry_size = _json_size({key: trimmed})
            if state.budget - entry_size < 0 and d:
                state.hit = True
                dropped = total_keys - idx
                d["_truncated"] = True
                d["_dropped_items"] = dropped

short_name: str
doc_short: str
doc_full: str
params: list[ParamInfo] = field(default_factory=list[ParamInfo])
Comment thread gcore_mcp_server/code_exec/runner.py Outdated
Comment on lines +108 to +113
            size = _json_size(trimmed)
            if state.budget - size < 0 and out:
                state.hit = True
                dropped = total - idx
                out.append({"_truncated": True, "_dropped_items": dropped})
                return out
Comment thread gcore_mcp_server/code_exec/runner.py Outdated
Comment on lines +137 to +139
# Scalars (or anything else) – just account for size; do not trim mid-string.
state.budget -= _json_size(value)
return value
Comment on lines +150 to +152
state = _TruncationState(budget=int(max_bytes))
trimmed = _truncate_value(value, state)
return trimmed, state.hit

SDK methods that take their own `name` argument (e.g.
cloud.ssh_keys.create) collided with the dispatcher's first parameter,
so `call_tool('cloud.ssh_keys.create', name='k')` raised
"Multiple values provided for parameter name" from Monty's type
checker. Made the tool selector positional-only (`tool_name`, /) in
both the dispatcher and the TYPE_STUBS so a `name` keyword is
forwarded to the SDK method. Found via real-API create+delete testing
through mcpproxy. Adds a regression test.
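
The collision and the fix can be reproduced in plain CPython with a stub in place of the SDK method. The function names call_tool_before and call_tool_after and the METHODS mapping are made up for this sketch, and CPython's error text differs from the Monty message quoted above.

```python
import asyncio


def create_ssh_key(name):
    # Stand-in for an SDK method that has its own `name` parameter.
    return {"created": name}


METHODS = {"cloud.ssh_keys.create": create_ssh_key}


async def call_tool_before(name, **kwargs):
    # Selector is an ordinary parameter called `name`, so a `name=` keyword
    # from the caller binds to it a second time.
    return METHODS[name](**kwargs)


async def call_tool_after(tool_name, /, **kwargs):
    # Positional-only selector: a `name=` keyword lands in **kwargs and is
    # forwarded to the SDK method.
    return METHODS[tool_name](**kwargs)


try:
    asyncio.run(call_tool_before("cloud.ssh_keys.create", name="k"))
    collided = False
except TypeError:
    collided = True

fixed = asyncio.run(call_tool_after("cloud.ssh_keys.create", name="k"))
```

Making the selector positional-only, rather than only renaming it, also covers any SDK method that happens to take a `tool_name` keyword.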
Valid findings fixed in runner.py:
- _truncate_value: drop the `and out`/`and d` guard that let an
  oversized FIRST list/dict element through with the budget going
  negative and `hit` left False. Containers now check the budget
  before descending into each child and emit a _truncated marker
  once exhausted.
- Oversized scalar strings are now truncated in place via
  _truncate_bytes instead of being returned whole; every
  over-budget path sets state.hit so ExecResult.truncated can no
  longer report False while the byte cap was breached.
- Budget is decremented exactly once per scalar leaf (containers no
  longer double-account children).
- _truncate_for_return annotated -> tuple[Any, bool] with corrected
  docstring.

Added regression tests for the oversized-first-element and
oversized-scalar-string cases.
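
A minimal sketch of the budget logic described in the list above. It is not the code in runner.py: the helper names mirror the ones mentioned in this PR without the leading underscores, dict keys and JSON punctuation are not charged against the budget, and the marker entries themselves add a few bytes on top of the cap.

```python
import json
from dataclasses import dataclass


@dataclass
class State:
    budget: int
    hit: bool = False


def json_size(value):
    return len(json.dumps(value, ensure_ascii=False).encode("utf-8"))


def truncate_bytes(text, max_bytes):
    # Cut on a byte boundary, then drop any partial trailing character.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def truncate_value(value, state):
    if isinstance(value, list):
        out = []
        for idx, item in enumerate(value):
            if state.budget <= 0:  # check before descending into the child
                state.hit = True
                out.append({"_truncated": True, "_dropped_items": len(value) - idx})
                break
            out.append(truncate_value(item, state))
        return out
    if isinstance(value, dict):
        out = {}
        for idx, (key, item) in enumerate(value.items()):
            if state.budget <= 0:
                state.hit = True
                out["_truncated"] = True
                out["_dropped_items"] = len(value) - idx
                break
            out[key] = truncate_value(item, state)
        return out
    if isinstance(value, str) and json_size(value) > state.budget:
        # Oversized string: cut in place and mark the result as truncated.
        state.hit = True
        value = truncate_bytes(value, max(state.budget, 0))
        state.budget = 0
        return value
    state.budget -= json_size(value)  # charged once, at the scalar leaf
    if state.budget < 0:
        state.hit = True
    return value


def truncate_for_return(value, max_bytes):
    state = State(budget=int(max_bytes))
    return truncate_value(value, state), state.hit


first_oversized = truncate_for_return(["x" * 100, "y"], 10)
big_string = truncate_for_return("é" * 10, 5)
small = truncate_for_return({"a": 1}, 100)
```

Because containers never subtract their own size, each byte is counted once at the leaf that carries it, which is the double-accounting fix described above.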

catalog.py: kept `field(default_factory=list[ParamInfo])` (added a
clarifying comment). The Copilot suggestion to use bare `list` is
incorrect here — `list[ParamInfo]()` does not raise (GenericAlias is
callable and returns []), and bare `list` makes pyright strict mode
report `params` as partially-unknown.

@algis-dumbris

Contributor Author

Addressed the Copilot review in bfe6bc2.

runner.py — all three valid, fixed:

  • _truncate_value oversized-first-element (lines 113/126): removed the and out / and d guard. Containers now check the budget before descending into each child and emit a {"_truncated": True, "_dropped_items": N} marker once exhausted, so a huge first element can no longer slip through with a negative budget and hit=False.
  • Scalars never truncated (line 139): oversized strings are now cut in place via _truncate_bytes; every over-budget path sets state.hit, so ExecResult.truncated can't report False while the cap was breached. Budget is now decremented exactly once per scalar leaf (containers no longer double-account children).
  • Return annotation (line 152): _truncate_for_return is now -> tuple[Any, bool] with a corrected docstring.
  • Added regression tests for the oversized-first-element and oversized-scalar-string cases.

catalog.py:36 — not changed (respectfully disagree): the comment states list[ParamInfo] "is not callable and will raise at runtime." It is callable — list[ParamInfo]() returns [] (types.GenericAlias.__call__ delegates to the origin), verified on CPython 3.11; instantiating ToolEntry() without params= works fine. Switching to bare default_factory=list additionally regresses pyright strict mode (reportUnknownVariableType: params becomes partially-unknown), and this project sets typeCheckingMode = "strict". The parameterized form is intentional; I added a comment documenting why.
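
The runtime claim is easy to check in isolation on Python 3.9 or later. In this sketch ParamInfo and ToolEntry are cut-down stand-ins: the single `name` field and the default on short_name are simplifications, not the real definitions in catalog.py.

```python
import types
from dataclasses import dataclass, field


@dataclass
class ParamInfo:
    name: str  # hypothetical field; the real ParamInfo is not shown in this PR


@dataclass
class ToolEntry:
    short_name: str = ""  # default added here so ToolEntry() can be called bare
    # Parameterized alias as the factory: calling it delegates to list().
    params: list[ParamInfo] = field(default_factory=list[ParamInfo])


alias = list[ParamInfo]
first = ToolEntry()
second = ToolEntry()
first.params.append(ParamInfo("region_id"))
```

Each instance gets its own fresh list, so the parameterized factory behaves like bare `list` at runtime while keeping the element type visible to the type checker.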

All 50 tests/code_exec tests pass; ruff clean; pyright 0 errors.
