Add Substrate routing-latency metric and Cloud Monitoring dashboard #157
I added a Prometheus endpoint for the router E2E metric so that the local Prometheus server added in #145 can scrape it.
Max Smythe (maxsmythe)
left a comment
Minor nit and some questions about the dashboard queries.
It would be interesting to think about consolidating this dashboard and the benchmarking dashboards if possible. Also curious about Grafana or similar for Kind usage/etc., but that's out of scope for this PR.
| "dataSets": [ | ||
| { | ||
| "timeSeriesQuery": { | ||
| "prometheusQuery": "histogram_quantile(0.99, sum by (le) (rate({\"atenet.router.route.duration_bucket\", top_level_controller_name=\"atenet-router\"}[5m])))", |
This query seems to duplicate the query from the E2E widget above?
This is actually intentional. The duplicated series is the overall P99, reused as a baseline, and the two panels answer different questions:
- Panel 1 — P50/P95/P99: the percentile spread of the overall substrate latency (median vs. tail).
- Panel 2 — stages, P99: the same overall P99 kept as a baseline, with the nested sub-stages (substrate routing > ate apiserver > atelet).
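
To make the relationship between the two panels concrete, here is a minimal sketch of how the Panel 1 queries could be generated from the one query shape shown in the diff. The helper name `percentile_query` and the `PANEL_1` mapping are hypothetical and only for illustration; the dashboard JSON in this PR spells the queries out by hand.

```python
# Hypothetical helper, not part of this PR: builds the PromQL string for one
# percentile of the router histogram, using the query shape from the dashboard JSON.
METRIC = "atenet.router.route.duration_bucket"
CONTROLLER = "atenet-router"


def percentile_query(quantile: float, window: str = "5m") -> str:
    # The metric name contains dots, so it is quoted inside the braces.
    selector = '{"%s", top_level_controller_name="%s"}' % (METRIC, CONTROLLER)
    return f"histogram_quantile({quantile}, sum by (le) (rate({selector}[{window}])))"


# Panel 1 plots the percentile spread of the overall substrate latency.
PANEL_1 = {
    name: percentile_query(q)
    for name, q in (("P50", 0.5), ("P95", 0.95), ("P99", 0.99))
}

# Panel 2 reuses only the P99 entry as its baseline series.
PANEL_2_BASELINE = PANEL_1["P99"]
```

The P99 string is identical in both panels by construction, which is why the query shows up twice in the dashboard.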
| "dataSets": [ | ||
| { | ||
| "timeSeriesQuery": { | ||
| "prometheusQuery": "histogram_quantile(0.99, sum by (le) (rate(envoy_http_downstream_rq_time_bucket{envoy_http_conn_manager_prefix=~\"ingress_.*\"}[5m])))", |
Is it possible to break these down by status?
Envoy's downstream_rq_time histogram has no response-code dimension, so the round-trip latency can't be split by status.
According to https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats, what Envoy does expose is downstream_rq_xx (request counts by response-code class), so I added a sixth graph, "Substrate E2E (full round-trip) QPS — by Response Code", which breaks the round-trip throughput down into 2xx/4xx/5xx.
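
As a rough sketch of what a per-class QPS query can look like: the snippet below assumes the counter is scraped as `envoy_http_downstream_rq_xx` with the class digit in an `envoy_response_code_class` label, which is how Envoy's Prometheus stats output usually names it. Both names, and the helper `qps_by_class_query`, are assumptions for illustration; the query actually used in the dashboard may differ.

```python
# Hypothetical helper, not part of this PR. Assumes Envoy's Prometheus naming:
# counter envoy_http_downstream_rq_xx, class digit in envoy_response_code_class.
def qps_by_class_query(prefix_regex: str = "ingress_.*", window: str = "5m") -> str:
    selector = (
        'envoy_http_downstream_rq_xx{envoy_http_conn_manager_prefix=~"%s"}'
        % prefix_regex
    )
    # One series per response-code class (2, 4, 5, ...), in requests per second.
    return f"sum by (envoy_response_code_class) (rate({selector}[{window}]))"


QPS_QUERY = qps_by_class_query()
```

The listener filter matches the `ingress_.*` prefix used by the latency query in the diff, so the latency and QPS charts cover the same traffic.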
The newest push adds the Downward API block from #155 to atenet-router so that it gets a stable pod-UID identity too. This keeps multiple instances of one service from colliding onto a single backend time series.
This PR introduces a new OpenTelemetry histogram, "atenet.router.route.duration", emitted from atenet-router over the existing OTLP path. It measures substrate overhead: the time from Envoy receiving a request to Envoy sending the request to the resolved worker endpoint, excluding actor compute and the response. This is the Substrate E2E latency that is under our control.

Add a well-defined dashboard for E2E metrics that contains 6 charts.
Fixes b/508613998