Skip to content

Feature/report stale if one stale#506

Open
Timple wants to merge 2 commits into
ros:ros2from
nobleo:feature/report-stale-if-one-stale
Open

Feature/report stale if one stale#506
Timple wants to merge 2 commits into
ros:ros2from
nobleo:feature/report-stale-if-one-stale

Conversation

@Timple
Copy link
Copy Markdown
Contributor

@Timple Timple commented May 23, 2025

Going from:

Report stale as errors unless all stale

to:

If one STALE and no ERROR, report STALE

As a bonus, the message that caused the degradation of the diagnostics is added to the toplevel state. This is useful for reporting mechanisms such as error logs.

@mergify mergify Bot added the ros2 PR tackling a ROS2 branch label May 23, 2025
@Timple Timple force-pushed the feature/report-stale-if-one-stale branch from f9b78c0 to 8cd4d07 Compare May 26, 2025 07:42
@Timple
Copy link
Copy Markdown
Contributor Author

Timple commented May 26, 2025

Rebased as build failed on #507

@ct2034 ct2034 added the enhancement This tackles a new feature of the code (and not a bug) label May 26, 2025
@Timple
Copy link
Copy Markdown
Contributor Author

Timple commented Jun 30, 2025

Any objections here?

We think it makes more sense to report STALE as summary. Because now you can have an ERROR without any actual ERROR states.

@Timple
Copy link
Copy Markdown
Contributor Author

Timple commented Aug 15, 2025

If there are any reasons not to do this, we'd also like to hear. Or should this pass an MPC or something?

@Timple Timple force-pushed the feature/report-stale-if-one-stale branch from 8cd4d07 to 601b4a6 Compare August 15, 2025 07:44
@Timple
Copy link
Copy Markdown
Contributor Author

Timple commented Jan 30, 2026

Almost a year...

Can someone point us at someone who can shine a light if this is desired or undesired?

@ct2034 ct2034 self-assigned this Apr 20, 2026
@peci1
Copy link
Copy Markdown

peci1 commented May 15, 2026

I haven't gone through the logic of the implementation, but from a high-level point, I agree with this direction (I'm not a maintainer here, though).

(max_level > DiagnosticStatus::ERROR && min_level <= DiagnosticStatus::ERROR))
if (
max_level == diagnostic_msgs::msg::DiagnosticStatus::STALE &&
max_level_without_stale < diagnostic_msgs::msg::DiagnosticStatus::ERROR)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when no node has reported anything yet, max_level stays at -1, the new condition max_level == STALE && ... is false, and the else assigns max_level_without_stale = 0 (OK), so a freshly-started aggregator with a dead pipeline publishes a green toplevel. The old code defended against this with if (max_level < 0) level = ERROR; - that guard is gone here.

}
// When a non-ok item was found, copy the complete status message once
if (max_level > DiagnosticStatus::OK) {
diag_toplevel_state.name = msg_to_report->name;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO this silently breaks diagnostic_remote_logging. Its influxdb_connector.cpp https://github.com/ros/diagnostics/blob/ros2/diagnostic_remote_logging/src/influxdb_connector.cpp#L100 subscribes to /diagnostics_toplevel_state and uses status.name (via splitName) as the InfluxDB measurement name. Before this PR it was always toplevel_state. After this PR, on any degradation, the measurement renames to whatever analyzer is deepest (Cell3, GPS, etc.) and any dashboard keyed on toplevel_state loses its data.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential clean fix: keep name = "toplevel_state" stable and surface the offender via message = msg_to_report->name + ": " + msg_to_report->message (plus hardware_id and values). No consumer breaks.

Comment on lines +293 to +296
max_level == diagnostic_msgs::msg::DiagnosticStatus::STALE &&
max_level_without_stale < diagnostic_msgs::msg::DiagnosticStatus::ERROR)
{
// Top level is error if we got no diagnostic level or
// have stale items but not all are stale
diag_toplevel_state.level = DiagnosticStatus::ERROR;
diag_toplevel_state.level = diagnostic_msgs::msg::DiagnosticStatus::STALE;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 corner case worth a sanity check. Concrete scenario: battery node reports WARN (low voltage), a USB camera drops at the same time and its node goes STALE after the watchdog timeout. Today the operator sees a single rollup on /diagnostics_toplevel_state and acts on it.

with this PR: max_level = 3 (STALE), max_level_without_stale = 1 (WARN), predicate 3 == STALE && 1 < ERROR is true -> toplevel reports STALE. The battery WARN disappears. STALE reads like "camera reconnect, will resolve itself" and the live warning gets dismissed.

the old code reported ERROR here (max_level > ERROR && min_level <= ERROR). Also lossy, but at least it forced the operator to look deeper. Swapping it for STALE is a UX regression for that combination.

Potential fix:

if (max_level_without_stale > DiagnosticStatus::OK) {
  level = max_level_without_stale;          // WARN/ERROR always beats STALE
} else if (max_level == DiagnosticStatus::STALE) {
  level = DiagnosticStatus::STALE;
} else if (max_level < 0) {
  level = DiagnosticStatus::STALE;          // empty input
} else {
  level = DiagnosticStatus::OK;
}

diag_toplevel_state.values = msg_to_report->values;
}

non_ok_status_depth = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reset zeroes non_ok_status_depth but msg_to_report is kept from loop 1, so any non-OK message in processed_other at depth >= 1 will overwrite a deeper pick from processed (because depth > 0 is trivially true after the reset), right?

diag_toplevel_state.level = DiagnosticStatus::STALE;
int max_level = -1;
int min_level = 255;
int8_t max_level_without_stale = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_level_without_stale is int8_t here and unsigned char in analyzer_group.cpp:295. DiagnosticStatus::level is int8 in the IDL, so int8_t is the canonical choice. Worth unifying :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement This tackles a new feature of the code (and not a bug) ros2 PR tackling a ROS2 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants