Skip to content

Add Zipkin tracing parity#2217

Merged
charlesbluca merged 66 commits into
NVIDIA:mainfrom
charlesbluca:codex/tracing-zipkin-parity-spec
Jun 23, 2026
Merged

Add Zipkin tracing parity#2217
charlesbluca merged 66 commits into
NVIDIA:mainfrom
charlesbluca:codex/tracing-zipkin-parity-spec

Conversation

@charlesbluca

@charlesbluca charlesbluca commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Description

This adds Zipkin parity and broader trace-context propagation across the Helm deployment and service execution paths.

  • Adds Helm support for Zipkin collector deployment, service wiring, endpoint overrides, and chart-wide tracing env generation guards.
  • Keeps tracing observability-only while carrying trace context through gateway proxy requests, service work, pipeline pools/executors, and NIM HTTP/internal inference calls.
  • Surfaces trace IDs through job/service responses and dashboard views so users can correlate service work with collected traces.
  • Documents the Zipkin tracing workflow and adds focused tests for Helm rendering, tracing helpers, NIM tracing, proxy/pool/executor propagation, ingest routing, job tracking, and dashboard trace UI.

Compatibility note: topology.zipkin.enabled intentionally remains true by default to preserve the managed trace-backend behavior shipped by the legacy 26.1.2 Helm chart. Upgrades with default values may create a chart-owned Zipkin Deployment and Service; set topology.zipkin.enabled=false before upgrading if your deployment uses an external backend or should not run chart-owned Zipkin. Context: #2217 (comment)

Validation: uv run pytest tests/test_helm_tracing_zipkin.py tests/test_nim_tracing.py tests/test_service_tracing.py tests/test_service_proxy_tracing.py tests/test_service_pool_tracing.py tests/test_service_pipeline_tracing.py tests/test_service_ingest_router.py tests/test_service_ingest_async.py tests/test_service_job_tracker.py tests/test_dashboard_static_trace_ui.py (152 passed).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@charlesbluca charlesbluca changed the title [codex] Add Zipkin tracing parity Add Zipkin tracing parity Jun 8, 2026
@charlesbluca charlesbluca marked this pull request as ready for review June 8, 2026 17:09
@charlesbluca charlesbluca requested review from a team as code owners June 8, 2026 17:09
@charlesbluca charlesbluca requested a review from edknv June 8, 2026 17:09
@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds Zipkin tracing parity to the Helm chart and propagates W3C trace context through the full service execution path — gateway proxy, pool workers, pipeline executor, NIM HTTP/gRPC inference, ingest router, and job tracker — while keeping tracing observability-only (all failures are logged and swallowed). trace_id is surfaced in job-creation responses, SSE events, and the dashboard.

  • Adds a Zipkin Deployment + Service to Helm (enabled by default for 26.1.2 parity), Zipkin-exporter injection into the OTel config, and per-pod OTel env generation for both the retriever service and NIMService containers.
  • Introduces nemo_retriever/src/nemo_retriever/service/tracing.py as the central OTel helper (span creation, context inject/extract, attribute sanitization), wires it into the ingest router, pool workers, pipeline executor, NIM clients, and dashboard; adds trace_id/trace_context fields to JobAggregate and response schemas.

Confidence Score: 5/5

Safe to merge. The tracing layer is observability-only — every span creation and context injection is wrapped in try/except with warnings, so no tracing failure can break request handling. The Helm changes are additive and the Zipkin-on-upgrade behavior is documented.

All new trace context paths gracefully degrade to no-ops on failure. The two findings are about a private-OTel-internals mutation in test-reset code and non-deterministic env var ordering in Helm output — neither affects runtime correctness of the service.

nemo_retriever/src/nemo_retriever/service/tracing.py (private OTel internals in _reset_tracing_for_tests) and nemo_retriever/helm/templates/_helpers.tpl (map-iteration ordering in the OTel env helpers).

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/service/tracing.py New central OTel helper; well-structured with graceful degradation. _reset_tracing_for_tests mutates private OTel SDK internals (_TRACER_PROVIDER, _TRACER_PROVIDER_SET_ONCE._done) directly, which is fragile against SDK version changes.
nemo_retriever/src/nemo_retriever/service/routers/ingest.py Adds _start_accept_span, _trace_context_from_request_or_job, and trace-context plumbing to job creation and document upload endpoints. Logic is correct; tracing imports are lazy and exception-guarded.
nemo_retriever/src/nemo_retriever/models/nim/primitives/nim_client.py Adds trace context capture/injection for HTTP and gRPC NIM inference. _same_trace_context check in _batcher_loop prevents cross-trace batching, which can serialize dynamic batching when concurrent callers each carry unique spans.
nemo_retriever/helm/templates/_helpers.tpl Adds 280+ lines of tracing Helm helpers. The range over Go maps in the OTel env helpers produces non-deterministic env var ordering, which can cause spurious GitOps diffs on re-renders.
nemo_retriever/helm/templates/deployment-zipkin.yaml New Zipkin Deployment template with probes, resources, and security context support. All Helm Chart Standards satisfied: resource requests/limits, liveness/readiness/startup probes, configurable image.
nemo_retriever/src/nemo_retriever/service/services/pipeline_pool.py Adds trace_context field to WorkItem and propagates W3C context into pool worker spans; tracing is correctly scoped and exception-guarded.
nemo_retriever/src/nemo_retriever/service/services/job_tracker.py Adds trace_id and trace_context fields to JobAggregate and propagates them through register_job and the SSE event bus.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Client
    participant Gateway as GatewayProxy
    participant Ingest as IngestRouter
    participant JobTracker
    participant Pool as PipelinePool
    participant Executor as PipelineExecutor
    participant NIM

    Client->>+Gateway: POST /v1/ingest/job (traceparent header)
    Gateway->>+Ingest: forward + inject_trace_context()
    Ingest->>Ingest: start_span(ingest.job, SpanKind.SERVER)
    Ingest->>+JobTracker: register_job(trace_id, trace_context)
    JobTracker-->>-Ingest: JobAggregate
    Ingest-->>-Gateway: 201 job_id + trace_id
    Gateway-->>-Client: 201 job_id + trace_id

    Client->>+Gateway: POST /v1/ingest/job/id/document
    Gateway->>+Ingest: forward + inject_trace_context()
    Ingest->>Ingest: start_span(ingest.document.accept)
    Ingest->>+Pool: submit(WorkItem with trace_context)
    Pool-->>-Ingest: accepted
    Ingest-->>-Gateway: 202
    Gateway-->>-Client: 202

    Pool->>Pool: extract_trace_context(item.trace_context)
    Pool->>Pool: "start_span(pool.name.process, context=parent)"
    Pool->>+Executor: work_fn(item)
    Executor->>Executor: capture_trace_context()
    Executor->>+NIM: HTTP/gRPC infer (traceparent header)
    NIM-->>-Executor: result
    Executor-->>-Pool: rows + data
    Pool->>JobTracker: mark_completed(item.id)

    Client->>Gateway: GET /v1/ingest/job/id/events SSE
    Gateway->>Ingest: forward
    Ingest->>JobTracker: get_job(id)
    JobTracker-->>Ingest: SSE event with trace_id
    Ingest-->>Client: job_created with trace_id
    Ingest-->>Client: job_finalized
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Client
    participant Gateway as GatewayProxy
    participant Ingest as IngestRouter
    participant JobTracker
    participant Pool as PipelinePool
    participant Executor as PipelineExecutor
    participant NIM

    Client->>+Gateway: POST /v1/ingest/job (traceparent header)
    Gateway->>+Ingest: forward + inject_trace_context()
    Ingest->>Ingest: start_span(ingest.job, SpanKind.SERVER)
    Ingest->>+JobTracker: register_job(trace_id, trace_context)
    JobTracker-->>-Ingest: JobAggregate
    Ingest-->>-Gateway: 201 job_id + trace_id
    Gateway-->>-Client: 201 job_id + trace_id

    Client->>+Gateway: POST /v1/ingest/job/id/document
    Gateway->>+Ingest: forward + inject_trace_context()
    Ingest->>Ingest: start_span(ingest.document.accept)
    Ingest->>+Pool: submit(WorkItem with trace_context)
    Pool-->>-Ingest: accepted
    Ingest-->>-Gateway: 202
    Gateway-->>-Client: 202

    Pool->>Pool: extract_trace_context(item.trace_context)
    Pool->>Pool: "start_span(pool.name.process, context=parent)"
    Pool->>+Executor: work_fn(item)
    Executor->>Executor: capture_trace_context()
    Executor->>+NIM: HTTP/gRPC infer (traceparent header)
    NIM-->>-Executor: result
    Executor-->>-Pool: rows + data
    Pool->>JobTracker: mark_completed(item.id)

    Client->>Gateway: GET /v1/ingest/job/id/events SSE
    Gateway->>Ingest: forward
    Ingest->>JobTracker: get_job(id)
    JobTracker-->>Ingest: SSE event with trace_id
    Ingest-->>Client: job_created with trace_id
    Ingest-->>Client: job_finalized
Loading

Reviews (7): Last reviewed commit: "Enable OTel tracing by default" | Re-trigger Greptile

Comment thread nemo_retriever/tests/test_dashboard_static_trace_ui.py
Comment thread nemo_retriever/helm/values.yaml
Comment thread nemo_retriever/helm/values.yaml

@jdye64 jdye64 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good overall. Had single comment about defaults.

Comment thread nemo_retriever/helm/values.yaml Outdated
@charlesbluca charlesbluca merged commit 9969e63 into NVIDIA:main Jun 23, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants