Add Substrate routing-latency metric and Cloud Monitoring dashboard#157

Merged
Max Smythe (maxsmythe) merged 1 commit into
agent-substrate:mainfrom
HavenXia:metric-e2e
Jun 5, 2026
Conversation

@HavenXia
Collaborator

@HavenXia Haven Xia (HavenXia) commented Jun 3, 2026

This PR introduces a new OpenTelemetry histogram, "atenet.router.route.duration", emitted by atenet-router over the existing
OTLP path. It measures Substrate overhead: the time from Envoy receiving a request to Envoy sending that request to the resolved worker endpoint, excluding actor compute and the response. This is the part of the Substrate E2E latency that is under our control.
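
As a rough illustration of that measurement boundary, the sketch below times only the routing step and records it into an explicit-bucket histogram. It is a stdlib-only stand-in, not the OpenTelemetry API and not the code in this PR: `RouteDurationHistogram`, `route`, `resolve`, `forward`, and the bucket bounds in `BOUNDS_MS` are all hypothetical names and values chosen for the example.

```python
import time
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds; the PR does not state the real ones.
BOUNDS_MS = [1, 5, 10, 25, 50, 100, 250, 500, 1000]


class RouteDurationHistogram:
    """Minimal explicit-bucket histogram; a stand-in for a real metrics instrument."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        # One count per bound plus a final overflow (+Inf) bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total_ms = 0.0

    def record(self, value_ms):
        # bisect_left places a value equal to a bound into that bound's bucket ("le" semantics).
        self.counts[bisect_left(self.bounds, value_ms)] += 1
        self.total_ms += value_ms


def route(request, resolve, forward, hist, clock=time.monotonic):
    """Time only the routing overhead, then hand the request to the worker."""
    start = clock()                # request received
    endpoint = resolve(request)    # routing work that counts as overhead
    hist.record((clock() - start) * 1000.0)  # stop before the worker sees the request
    return forward(endpoint, request)        # actor compute and response are not timed
```

Stopping the clock before `forward` is what keeps actor compute and the response out of the recorded duration.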

It also adds a dashboard for E2E metrics that contains six charts:

  • Substrate routing latency — P50/P95/P99
  • Substrate routing latency P99 - by stages
  • Substrate routing latency P99 — by ActorTemplate
  • Substrate routing QPS - by Status
  • Substrate E2E (full round-trip) Latency — P50/P95/P99
  • Substrate E2E (full round-trip) QPS — by Response Code

Fixes b/508613998

It's a good idea to open an issue first for discussion.

  • Tests pass
  • Appropriate changes to documentation are included in the PR

@HavenXia Haven Xia (HavenXia) force-pushed the metric-e2e branch 3 times, most recently from 4f2b5a3 to 094148b Compare June 3, 2026 04:25
@HavenXia Haven Xia (HavenXia) marked this pull request as ready for review June 3, 2026 04:31
@HavenXia Haven Xia (HavenXia) force-pushed the metric-e2e branch 3 times, most recently from 307bc29 to 76e1e9c Compare June 3, 2026 21:07
@HavenXia Haven Xia (HavenXia) changed the title Add Substrate E2E metric and cloud logging dashboard Add Substrate E2E metric and cloud monitoring dashboard Jun 4, 2026
@HavenXia
Collaborator Author

Haven Xia (HavenXia) commented Jun 4, 2026

I added a Prometheus endpoint for the router E2E metric so that the local Prometheus server added in #145 can scrape it.

Collaborator

@maxsmythe Max Smythe (maxsmythe) left a comment

Minor nit and some questions about the dashboard queries.

It would be interesting to think about consolidating this dashboard and the benchmarking dashboards if possible. Also curious about Grafana or similar for Kind usage/etc., but that's out of scope for this PR.

Comment thread cmd/atenet/internal/app/router/metrics.go Outdated
Comment thread monitoring/dashboards/ate-e2e-latency-dashboard.json Outdated
"dataSets": [
{
"timeSeriesQuery": {
"prometheusQuery": "histogram_quantile(0.99, sum by (le) (rate({\"atenet.router.route.duration_bucket\", top_level_controller_name=\"atenet-router\"}[5m])))",
Collaborator

This query seems to duplicate the query from the E2E widget above?

Collaborator Author

This is actually intentional: the duplicated series is the overall P99 reused as a baseline, and the two panels answer different questions:

  • Panel 1 — P50/P95/P99: the percentile spread of the overall substrate latency (median vs. tail).
  • Panel 2 — stages, P99: the same overall P99 kept as a baseline, with the nested sub-stages (substrate routing > ate apiserver > atelet).
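
For readers unfamiliar with how these percentile panels derive a value from bucket counts, here is a simplified sketch of the linear interpolation that `histogram_quantile` performs over cumulative `le` buckets. It is an approximation for illustration only: it assumes the lowest bucket starts at 0, skips the `rate()` step the dashboard queries apply first, and the sample bucket data is made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with a +Inf bucket,
    mirroring Prometheus-style `le` buckets.
    """
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf upper bound")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Made-up cumulative latency buckets in milliseconds.
latency_ms = [(10, 50), (50, 90), (100, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, latency_ms)
p95 = histogram_quantile(0.95, latency_ms)
p99 = histogram_quantile(0.99, latency_ms)
```

Because the estimate is interpolated within a bucket, its precision depends on how finely the bucket bounds cover the tail, which matters most for the P99 series shared by both panels.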

"dataSets": [
{
"timeSeriesQuery": {
"prometheusQuery": "histogram_quantile(0.99, sum by (le) (rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix=~\"ingress_.*\"}[5m])))",
Collaborator

Is it possible to break these down by status?

Collaborator Author

Envoy's downstream_rq_time histogram has no response-code dimension, so the round-trip latency can't be split by status.

According to https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats, what Envoy does expose is downstream_rq_xx (counts by response-code class), so I added a sixth graph, "Substrate E2E (full round-trip) QPS — by Response Code", which breaks the round-trip throughput into 2xx/4xx/5xx.
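
A minimal sketch of how per-class QPS falls out of those counters: take two samples of the cumulative counts and divide the increase by the sampling interval. The function name `qps_by_class`, the dictionary layout, and the sample numbers are assumptions for illustration; the dashboard itself would do this with a rate query rather than in application code.

```python
def qps_by_class(prev, curr, interval_s):
    """Per-second request rate for each response-code class.

    `prev` and `curr` map a class label ("2xx", "4xx", "5xx") to a
    cumulative counter value sampled `interval_s` seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for code_class, value in curr.items():
        delta = value - prev.get(code_class, 0)
        if delta < 0:
            # The counter was reset (for example, a restart): count from zero.
            delta = value
        rates[code_class] = delta / interval_s
    return rates


# Made-up samples taken 60 seconds apart.
before = {"2xx": 1000, "4xx": 10, "5xx": 5}
after = {"2xx": 1600, "4xx": 40, "5xx": 5}
rates = qps_by_class(before, after, 60)
```

This gives throughput per class but says nothing about latency per class, which is the limitation described above.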

@HavenXia Haven Xia (HavenXia) changed the title Add Substrate E2E metric and cloud monitoring dashboard Add Substrate routing-latency metric and Cloud Monitoring dashboard Jun 5, 2026
@HavenXia Haven Xia (HavenXia) force-pushed the metric-e2e branch 2 times, most recently from cda1ce3 to 07ee338 Compare June 5, 2026 05:20
@HavenXia
Collaborator Author

The newest push adds the Downward API block from #155 to atenet-router so that it gets a stable pod-UID identity too. That way, multiple instances of one service won't collide onto a single backend time series.
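
To make the collision concrete, the sketch below models a backend that identifies a time series by metric name plus attribute set. It assumes the pod UID arrives in an environment variable named `POD_UID` (a hypothetical name; the actual variable from #155 is not shown here) and that it is attached as the `k8s.pod.uid` attribute. `resource_attributes` and `series_key` are illustrative helpers, not functions from this repository.

```python
import os


def resource_attributes(environ=os.environ):
    """Build identifying attributes, adding the pod UID when it is available."""
    attrs = {"service.name": "atenet-router"}
    uid = environ.get("POD_UID")  # hypothetical variable set via the Downward API
    if uid:
        attrs["k8s.pod.uid"] = uid
    return attrs


def series_key(metric, attrs):
    """Model a backend that identifies a series by metric name and attributes."""
    return (metric, tuple(sorted(attrs.items())))


metric = "atenet.router.route.duration"

# Two replicas without a pod UID produce the same key and would overwrite each other.
collide_a = series_key(metric, resource_attributes({}))
collide_b = series_key(metric, resource_attributes({}))

# With distinct pod UIDs, each replica writes to its own series.
distinct_a = series_key(metric, resource_attributes({"POD_UID": "uid-a"}))
distinct_b = series_key(metric, resource_attributes({"POD_UID": "uid-b"}))
```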

Collaborator

@maxsmythe Max Smythe (maxsmythe) left a comment

LGTM

@maxsmythe Max Smythe (maxsmythe) merged commit a325d41 into agent-substrate:main Jun 5, 2026
8 checks passed
Julian Gutierrez Oschmann (juli4n) pushed a commit that referenced this pull request Jun 5, 2026
Submit after #157 #160 

- [x] Tests pass
- [x] Appropriate changes to documentation are included in the PR
@HavenXia Haven Xia (HavenXia) deleted the metric-e2e branch June 5, 2026 18:41