feat: add code-execution routing mode (GCORE_MCP_ROUTING=code_exec) by algis-dumbris · Pull Request #13 · G-Core/gcore-mcp-server

algis-dumbris · 2026-05-15T07:37:13Z

Summary

Adds a new routing mode that exposes 3 meta-tools instead of registering all ~700 SDK methods individually, addressing the tool-list overflow that prevents most LLM clients from connecting to the server with all tools enabled.

New env var GCORE_MCP_ROUTING accepts code_exec (new default) or direct (legacy).
code_exec mode exposes search_tools(query), get_tool_schema(name), and execute_code(code). The LLM-generated Python runs in a Pydantic Monty sandbox and reaches the SDK only through a host-injected call_tool().
direct mode preserves today's behavior byte-for-byte.

Branched off the design spec PR; the spec commit (docs/superpowers/specs/2026-05-15-code-execution-mode-design.md) is included in this branch's history.

Files

gcore_mcp_server/code_exec/ — new package: catalog.py, dispatch.py, runner.py, meta_tools.py
gcore_mcp_server/core/serialize.py — extracted from server.py so the dispatcher can reuse it
gcore_mcp_server/config/settings.py — adds GCORE_MCP_ROUTING parser
gcore_mcp_server/server.py — branches registration on routing mode
pyproject.toml — adds pydantic-monty>=0.0.17,<0.1 and rank-bm25>=0.2,<1 (~8 MB combined wheel size, pre-built wheels on macOS/Linux/Windows × CPython 3.10-3.14)
tests/code_exec/ — 41 unit tests + 6 e2e tests against the real Gcore API
README.md — adds Routing modes section
docs/superpowers/specs/2026-05-15-code-execution-mode-design.md — design doc

Test plan

uv run ruff format . — clean
uv run ruff check . — passes
uv run pyright gcore_mcp_server/code_exec/ gcore_mcp_server/core/serialize.py gcore_mcp_server/config/settings.py — 0 errors in new code (pre-existing pyright errors in legacy make_wrapper are untouched)
uv run pytest tests/code_exec/ — 47 passed (41 unit + 6 e2e)
uv run pytest tests/test_schema.py tests/test_inspection.py tests/test_pattern_filtering.py — 38 passed (no regression in pre-existing tests)
Manual smoke test: server boots in both modes
- GCORE_MCP_ROUTING=code_exec → "Registered 3 meta-tools over 668 SDK methods"
- GCORE_MCP_ROUTING=direct GCORE_TOOLS=management → "Registered 17 tools" (unchanged from main)
E2E test on real Gcore API (../gcore-terraform/.env credentials):
- Builds catalog from real client (>100 entries)
- search_tools("list regions") returns cloud.regions.list in top 5
- Sandbox script await call_tool('cloud.regions.list') succeeds and returns real region IDs
- Sandbox call_tool('nope.does.not.exist') surfaces a clean KeyError

Migration / compatibility

code_exec is the new default. Existing clients with hard-coded tool names will see a different surface. The one-line opt-out is GCORE_MCP_ROUTING=direct. Documented in the README "Routing modes" section.
In code_exec mode, GCORE_TOOLS is logged as ignored — catalog filtering doesn't apply when only 3 meta-tools are registered.
Server logs the active mode prominently at startup.
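
For reviewers who want a picture of the env parsing, here is a minimal sketch of how a parser like get_routing_mode() could behave. The normalization (strip and lowercase), the warning text, and the `environ` parameter are assumptions for illustration. This PR only states that the parser normalizes the value and falls back with a warning.

```python
import logging
import os

logger = logging.getLogger(__name__)

VALID_MODES = ("code_exec", "direct")
DEFAULT_MODE = "code_exec"  # new default per this PR


def get_routing_mode(environ=None) -> str:
    """Return the routing mode; `environ` is a hypothetical hook for testing."""
    env = os.environ if environ is None else environ
    raw = env.get("GCORE_MCP_ROUTING", "")
    mode = raw.strip().lower()  # assumed normalization
    if not mode:
        return DEFAULT_MODE
    if mode not in VALID_MODES:
        # Assumed fallback: warn and use the default instead of failing.
        logger.warning(
            "Unknown GCORE_MCP_ROUTING=%r, falling back to %s", raw, DEFAULT_MODE
        )
        return DEFAULT_MODE
    return mode
```

With this sketch, `GCORE_MCP_ROUTING=" Direct "` resolves to `direct`, while an unset or unrecognized value resolves to `code_exec`.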

Sandbox notes (for clients that will use `code_exec`)

Supported: async/await, comprehensions, exceptions, stdlib json/re/datetime.
Not supported (Pydantic Monty v0.0.17): class, with, import, match, generators.
Defaults: 30 s wall-clock timeout, 200 MB memory cap, 40 KB result and stdout truncation budget (with _truncated marker so the model can re-query with narrower filters).
The Gcore API key lives on the host-side client and is never exposed as a sandbox input or external function — the sandbox can authenticate calls via call_tool but cannot read the key.
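
To illustrate the kind of script a client might send to execute_code, here is a sketch that sticks to the supported features listed above (async/await, comprehensions, exceptions). The `call_tool` below is a hypothetical in-process stand-in with made-up data so the example runs outside the sandbox. The real function is injected by the host, and how Monty wraps the script body is not shown in this PR.

```python
import asyncio  # harness only; the sandbox itself does not support import

# Hypothetical canned data standing in for real API responses.
_FAKE_RESULTS = {
    "cloud.regions.list": [
        {"id": 1, "display_name": "Region A"},
        {"id": 2, "display_name": "Region B"},
    ],
}


async def call_tool(tool_name, /, **kwargs):
    """Stand-in for the host-injected call_tool(); unknown names raise KeyError."""
    if tool_name not in _FAKE_RESULTS:
        raise KeyError(tool_name)
    return _FAKE_RESULTS[tool_name]


async def script():
    # The body of this function is what an LLM-generated script might contain.
    regions = await call_tool("cloud.regions.list")
    region_ids = [r["id"] for r in regions]
    try:
        await call_tool("nope.does.not.exist")
        missing = None
    except KeyError as exc:
        missing = exc.args[0]
    return {"region_ids": region_ids, "missing": missing}


result = asyncio.run(script())
```

The script never sees an API key: it only names a tool and passes arguments, which matches the isolation described above.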

Add design doc for a new GCORE_MCP_ROUTING=code_exec mode that exposes three meta-tools (search_tools, get_tool_schema, execute_code) backed by a Pydantic Monty sandbox, replacing the ~700-tool registration. Direct mode remains available as an opt-out.

Move the SDK-result serializer out of server.py into a small shared module so the upcoming code_exec dispatcher can reuse it. Renamed _serialize_result → serialize_result since it's now a public helper. Behavior unchanged.

- pydantic-monty (>=0.0.17,<0.1): embedded secure Python interpreter used by the new code_exec mode to safely run LLM-generated scripts.
- rank-bm25 (>=0.2,<1): BM25 search index over the SDK catalog so the search_tools meta-tool can rank ~700 SDK methods by relevance.

Both ship pre-built wheels for macOS, Linux, and Windows × CPython 3.10-3.14; combined install adds ~8 MB.
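
As background on what BM25 ranking does for search_tools, here is a small pure-Python sketch of the scoring idea. It is not the rank-bm25 implementation and not the Catalog class from this PR. The `TinyBM25` name, the tokenizer, the sample docstrings, and the k1/b values are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyBM25:
    """Illustrative BM25 index over {tool_name: doc} pairs."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.names = list(docs)
        # Index the dotted tool name together with its docstring.
        self.tokens = [tokenize(f"{name} {doc}") for name, doc in docs.items()]
        self.k1, self.b = k1, b
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        n = len(self.tokens)
        doc_freq = Counter(term for toks in self.tokens for term in set(toks))
        # Non-negative IDF variant: ln((N - df + 0.5) / (df + 0.5) + 1).
        self.idf = {
            t: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            for t, df in doc_freq.items()
        }

    def search(self, query, top_k=5):
        scored = []
        for name, toks in zip(self.names, self.tokens):
            tf = Counter(toks)
            norm = self.k1 * (1 - self.b + self.b * len(toks) / self.avg_len)
            score = sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in tokenize(query)
                if t in tf
            )
            if score > 0:
                scored.append((score, name))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [name for _, name in scored[:top_k]]


index = TinyBM25(
    {
        "cloud.regions.list": "list regions available in cloud",
        "cloud.ssh_keys.create": "create ssh key",
        "cloud.instances.list": "list instances in a region",
    }
)
hits = index.search("list regions")
```

In this toy corpus the query "list regions" ranks cloud.regions.list first because "regions" is rarer than "list" and therefore carries a higher IDF weight.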

Introduce a new routing mode that exposes three meta-tools (search_tools, get_tool_schema, execute_code) instead of registering each SDK method individually. The LLM-generated Python runs in a Pydantic Monty sandbox and calls SDK methods via host-injected call_tool().

Package layout:
- code_exec/catalog.py: ToolEntry + BM25-indexed Catalog
- code_exec/dispatch.py: make_call_tool() with auto-injection of project_id and region_id
- code_exec/runner.py: execute_code() driving Pydantic Monty with result/stream truncation and typed ExecResult
- code_exec/meta_tools.py: registers the three meta-tools on FastMCP

Selected via GCORE_MCP_ROUTING={code_exec|direct}, parsed in config/settings.py:get_routing_mode(). Default is code_exec; set GCORE_MCP_ROUTING=direct to restore the legacy ~700-tool surface.
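
The auto-injection of project_id and region_id in the dispatcher can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in dispatch.py: the `registry` and `defaults` arguments and the two sample SDK functions are hypothetical, and the real dispatcher also serializes results.

```python
import asyncio
import inspect


def make_call_tool(registry, defaults):
    """Build a call_tool() closure.

    registry: hypothetical mapping of dotted tool name -> callable.
    defaults: hypothetical mapping such as {"project_id": ..., "region_id": ...}.
    """

    async def call_tool(tool_name, /, **kwargs):
        method = registry[tool_name]  # unknown names raise KeyError
        accepted = inspect.signature(method).parameters
        for key, value in defaults.items():
            # Inject only when the method takes the parameter
            # and the caller did not pass it explicitly.
            if key in accepted and key not in kwargs:
                kwargs[key] = value
        result = method(**kwargs)
        if inspect.isawaitable(result):  # support sync and async SDK methods
            result = await result
        return result

    return call_tool


# Hypothetical SDK methods used only for this demonstration.
def list_instances(*, project_id, region_id, limit=10):
    return {"project_id": project_id, "region_id": region_id, "limit": limit}


async def list_regions():
    return ["r1"]


call_tool = make_call_tool(
    {"cloud.instances.list": list_instances, "cloud.regions.list": list_regions},
    {"project_id": 1, "region_id": 2},
)
injected = asyncio.run(call_tool("cloud.instances.list", limit=5))
explicit = asyncio.run(call_tool("cloud.instances.list", project_id=9))
regions = asyncio.run(call_tool("cloud.regions.list"))
```

Checking the signature before injecting keeps methods such as the region listing, which take neither parameter, from receiving unexpected keyword arguments.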

- 41 unit tests in tests/code_exec/ covering Catalog (build + BM25 search + boosts + get_schema), dispatch (call_tool injection + awaitable handling + error paths), result/stream truncation, Pydantic Monty runner integration with stub catalogs, register meta_tools, and get_routing_mode env parsing.
- 6 e2e tests against the real Gcore API. They auto-load credentials from ../gcore-terraform/.env so local devs can run them without shell config; CI skips when no real key is available. The e2e real_client fixture clears the SDK introspection cache so methods rebind to the real client (other test files run earlier may have poisoned the cache with a dummy-keyed client).

Add a Routing modes section near the top of the README so users discover code_exec as the new default and know how to opt out via GCORE_MCP_ROUTING=direct. Lists the sandbox's supported / unsupported Python features and default resource limits.

Copilot

Pull request overview

Adds a new default “code execution” routing mode to avoid MCP tool-list overflow by exposing only three meta-tools (search/schema/code execution) while preserving the existing per-SDK-method tool registration in a legacy direct mode.

Changes:

Introduces GCORE_MCP_ROUTING with code_exec (new default) vs direct (legacy) and updates server startup/registration accordingly.
Adds a code_exec package implementing SDK catalog + search, host-side dispatch (call_tool), and Monty sandbox execution with truncation.
Adds dependencies (pydantic-monty, rank-bm25) plus a dedicated tests/code_exec/ suite and README/docs updates.

Reviewed changes

Copilot reviewed 19 out of 21 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| `uv.lock` | Locks new dependencies (Monty, BM25 + transitive deps like numpy) and updates resolution metadata. |
| `pyproject.toml` | Adds `pydantic-monty` and `rank-bm25` runtime dependencies. |
| `README.md` | Documents routing modes, default behavior, and sandbox capabilities/limitations. |
| `gcore_mcp_server/server.py` | Branches tool registration by routing mode; registers 3 meta-tools in `code_exec` and preserves legacy `direct` behavior. |
| `gcore_mcp_server/core/serialize.py` | Extracts shared result serialization for reuse by the sandbox dispatcher. |
| `gcore_mcp_server/config/settings.py` | Adds routing-mode parsing (`GCORE_MCP_ROUTING`) with normalization + warning fallback. |
| `gcore_mcp_server/code_exec/__init__.py` | Exposes the code-exec public surface (catalog/runner/dispatch/meta-tools). |
| `gcore_mcp_server/code_exec/catalog.py` | Builds SDK catalog via introspection; BM25 search + schema export. |
| `gcore_mcp_server/code_exec/dispatch.py` | Implements host-side `call_tool` with project/region auto-injection + serialization. |
| `gcore_mcp_server/code_exec/meta_tools.py` | Registers `search_tools`, `get_tool_schema`, `execute_code` on the FastMCP server. |
| `gcore_mcp_server/code_exec/runner.py` | Runs Monty sandbox, captures stdout/stderr, enforces limits, truncates outputs/results. |
| `docs/superpowers/specs/2026-05-15-code-execution-mode-design.md` | Design spec describing the new routing mode, tool surface, architecture, and risks. |
| `tests/code_exec/__init__.py` | Marks the `code_exec` test package. |
| `tests/code_exec/conftest.py` | Sets anyio backend for async tests in this suite. |
| `tests/code_exec/test_catalog.py` | Verifies catalog build from real SDK + search/schema behavior. |
| `tests/code_exec/test_dispatch.py` | Tests `call_tool` dispatch, async handling, injection rules, and serialization. |
| `tests/code_exec/test_e2e_real_api.py` | Optional real-API e2e coverage (skipped unless real creds are available). |
| `tests/code_exec/test_meta_tools.py` | Ensures exactly three meta-tools are registered and callable. |
| `tests/code_exec/test_runner.py` | Validates sandbox execution contract (stdout, errors, timeout, truncation, async). |
| `tests/code_exec/test_settings_routing.py` | Tests routing-mode env parsing and warning fallback. |
| `tests/code_exec/test_truncation.py` | Tests truncation helpers for unicode safety and marker behavior. |

Comments suppressed due to low confidence (1)

gcore_mcp_server/code_exec/runner.py:131

In dict truncation, the budget check only triggers when d is non-empty (and d). If the first key/value pair exceeds the remaining budget, it will still be included, the budget can go negative, and no _truncated marker / hit=True will be produced. Handle the empty-dict case explicitly so oversized first entries still yield a truncated result within budget.

            entry_size = _json_size({key: trimmed})
            if state.budget - entry_size < 0 and d:
                state.hit = True
                dropped = total_keys - idx
                d["_truncated"] = True
                d["_dropped_items"] = dropped

+    short_name: str
+    doc_short: str
+    doc_full: str
+    params: list[ParamInfo] = field(default_factory=list[ParamInfo])


+            size = _json_size(trimmed)
+            if state.budget - size < 0 and out:
+                state.hit = True
+                dropped = total - idx
+                out.append({"_truncated": True, "_dropped_items": dropped})
+                return out


+    # Scalars (or anything else) – just account for size; do not trim mid-string.
+    state.budget -= _json_size(value)
+    return value


+    state = _TruncationState(budget=int(max_bytes))
+    trimmed = _truncate_value(value, state)
+    return trimmed, state.hit


SDK methods that take their own `name` argument (e.g. cloud.ssh_keys.create) collided with the dispatcher's first parameter, so `call_tool('cloud.ssh_keys.create', name='k')` raised "Multiple values provided for parameter name" from Monty's type checker. Made the tool selector positional-only (`tool_name`, /) in both the dispatcher and the TYPE_STUBS so a `name` keyword is forwarded to the SDK method. Found via real-API create+delete testing through mcpproxy. Adds a regression test.
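
The collision and the fix can be reproduced in plain CPython without the sandbox. The two functions below are simplified, synchronous stand-ins (hypothetical names) for the dispatcher before and after this change. In CPython the failure surfaces as a TypeError, whereas the message quoted above came from Monty's type checker.

```python
def call_tool_before(name, **kwargs):
    """Old shape: the selector is an ordinary parameter called `name`."""
    return name, kwargs


def call_tool_after(tool_name, /, **kwargs):
    """New shape: the selector is positional-only, so any keyword is forwarded."""
    return tool_name, kwargs


try:
    call_tool_before("cloud.ssh_keys.create", name="k")
    collided = False
except TypeError:
    # `name` was supplied both positionally and as a keyword.
    collided = True

selected, forwarded = call_tool_after("cloud.ssh_keys.create", name="k")
```

Because `tool_name` sits before the `/`, a keyword called `tool_name` would also land in `kwargs` instead of clashing with the selector.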

Valid findings fixed in runner.py:
- _truncate_value: drop the `and out`/`and d` guard that let an oversized FIRST list/dict element through with the budget going negative and `hit` left False. Containers now check the budget before descending into each child and emit a _truncated marker once exhausted.
- Oversized scalar strings are now truncated in place via _truncate_bytes instead of being returned whole; every over-budget path sets state.hit so ExecResult.truncated can no longer report False while the byte cap was breached.
- Budget is decremented exactly once per scalar leaf (containers no longer double-account children).
- _truncate_for_return annotated -> tuple[Any, bool] with corrected docstring.

Added regression tests for the oversized-first-element and oversized-scalar-string cases.

catalog.py: kept `field(default_factory=list[ParamInfo])` (added a clarifying comment). The Copilot suggestion to use bare `list` is incorrect here: `list[ParamInfo]()` does not raise (GenericAlias is callable and returns []), and bare `list` makes pyright strict mode report `params` as partially-unknown.

algis-dumbris · 2026-05-18T08:23:21Z

Addressed the Copilot review in bfe6bc2.

runner.py — all three valid, fixed:

_truncate_value oversized-first-element (lines 113/126): removed the and out / and d guard. Containers now check the budget before descending into each child and emit a {"_truncated": True, "_dropped_items": N} marker once exhausted, so a huge first element can no longer slip through with a negative budget and hit=False.
Scalars never truncated (line 139): oversized strings are now cut in place via _truncate_bytes; every over-budget path sets state.hit, so ExecResult.truncated can't report False while the cap was breached. Budget is now decremented exactly once per scalar leaf (containers no longer double-account children).
Return annotation (line 152): _truncate_for_return is now -> tuple[Any, bool] with a corrected docstring.
Added regression tests for the oversized-first-element and oversized-scalar-string cases.
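
For readers without the diff open, the behavior described in the runner.py items above can be approximated with the sketch below. It is an illustration under assumptions, not the code in bfe6bc2: the names drop the leading underscore, the marker itself is not charged against the budget, and strings are sized by their UTF-8 bytes instead of their JSON encoding.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class TruncationState:
    budget: int
    hit: bool = False


def json_size(value: Any) -> int:
    return len(json.dumps(value, ensure_ascii=False, default=str).encode("utf-8"))


def truncate_bytes(text: str, max_bytes: int) -> str:
    """Cut to at most max_bytes of UTF-8 without splitting a character."""
    return text.encode("utf-8")[: max(max_bytes, 0)].decode("utf-8", errors="ignore")


def truncate_value(value: Any, state: TruncationState) -> Any:
    if isinstance(value, list):
        out_list: list[Any] = []
        for idx, item in enumerate(value):
            if state.budget <= 0:  # check BEFORE descending into the child
                state.hit = True
                out_list.append({"_truncated": True, "_dropped_items": len(value) - idx})
                break
            out_list.append(truncate_value(item, state))
        return out_list
    if isinstance(value, dict):
        out_dict: dict[Any, Any] = {}
        for idx, (key, item) in enumerate(value.items()):
            if state.budget <= 0:
                state.hit = True
                out_dict["_truncated"] = True
                out_dict["_dropped_items"] = len(value) - idx
                break
            out_dict[key] = truncate_value(item, state)
        return out_dict
    if isinstance(value, str):
        size = len(value.encode("utf-8"))
        if size > state.budget:  # oversized scalar: cut in place and flag it
            state.hit = True
            value = truncate_bytes(value, state.budget)
            state.budget = 0
            return value
        state.budget -= size
        return value
    state.budget -= json_size(value)  # other scalars: account once per leaf
    return value


def truncate_for_return(value: Any, max_bytes: int) -> tuple[Any, bool]:
    state = TruncationState(budget=int(max_bytes))
    return truncate_value(value, state), state.hit
```

With a 10-byte budget, `truncate_for_return(["x" * 100, "y"], 10)` cuts the first string to 10 characters, replaces the remaining element with a marker, and reports `hit` as True, which is the oversized-first-element case the regression tests target.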

catalog.py:36 — not changed (respectfully disagree): the comment states list[ParamInfo] "is not callable and will raise at runtime." It is callable — list[ParamInfo]() returns [] (types.GenericAlias.__call__ delegates to the origin), verified on CPython 3.11; instantiating ToolEntry() without params= works fine. Switching to bare default_factory=list additionally regresses pyright strict mode (reportUnknownVariableType: params becomes partially-unknown), and this project sets typeCheckingMode = "strict". The parameterized form is intentional; I added a comment documenting why.
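
The runtime claim is easy to check in isolation. The snippet below uses cut-down, hypothetical versions of ParamInfo and ToolEntry (the real classes have more fields) and needs Python 3.9 or later for `list[...]` subscripting. It does not demonstrate the pyright strict-mode difference, which only shows up under the type checker.

```python
from dataclasses import dataclass, field


@dataclass
class ParamInfo:
    name: str


@dataclass
class ToolEntry:
    short_name: str = ""
    # A parameterized alias is callable: list[ParamInfo]() delegates to list().
    params: list[ParamInfo] = field(default_factory=list[ParamInfo])


empty = list[ParamInfo]()
first = ToolEntry()
second = ToolEntry()
first.params.append(ParamInfo(name="region_id"))
```

Each instance gets its own fresh list, so appending to `first.params` leaves `second.params` empty.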

All 50 tests/code_exec tests pass; ruff clean; pyright 0 errors.

algis-dumbris added 6 commits May 15, 2026 09:52

refactor: extract _serialize_result to core/serialize.py

5ad18bd

Copilot AI review requested due to automatic review settings May 15, 2026 07:37

Copilot started reviewing on behalf of algis-dumbris May 15, 2026 07:37

Copilot AI reviewed May 15, 2026

algis-dumbris added 2 commits May 15, 2026 13:24

algis-dumbris wants to merge 8 commits into main from feat/code-execution-mode-impl