Feature/report stale if one stale#506
Conversation
f9b78c0 to
8cd4d07
Compare
|
Rebased as build failed on #507 |
|
Any objections here? We think it makes more sense to report STALE as summary. Because now you can have an ERROR without any actual ERROR states. |
|
If there are any reasons not to do this, we'd also like to hear. Or should this pass an MPC or something? |
8cd4d07 to
601b4a6
Compare
|
Almost a year... Can someone point us at someone who can shine a light if this is desired or undesired? |
|
I haven't gone through the logic of the implementation, but from a high-level point, I agree with this direction (I'm not a maintainer here, though). |
| (max_level > DiagnosticStatus::ERROR && min_level <= DiagnosticStatus::ERROR)) | ||
| if ( | ||
| max_level == diagnostic_msgs::msg::DiagnosticStatus::STALE && | ||
| max_level_without_stale < diagnostic_msgs::msg::DiagnosticStatus::ERROR) |
There was a problem hiding this comment.
when no node has reported anything yet, max_level stays at -1, the new condition max_level == STALE && ... is false, and the else assigns max_level_without_stale = 0 (OK), so a freshly-started aggregator with a dead pipeline publishes a green toplevel. The old code defended against this with if (max_level < 0) level = ERROR; - that guard is gone here.
| } | ||
| // When a non-ok item was found, copy the complete status message once | ||
| if (max_level > DiagnosticStatus::OK) { | ||
| diag_toplevel_state.name = msg_to_report->name; |
There was a problem hiding this comment.
IMHO this silently breaks diagnostic_remote_logging. Its influxdb_connector.cpp https://github.com/ros/diagnostics/blob/ros2/diagnostic_remote_logging/src/influxdb_connector.cpp#L100 subscribes to /diagnostics_toplevel_state and uses status.name (via splitName) as the InfluxDB measurement name. Before this PR it was always toplevel_state. After this PR, on any degradation, the measurement renames to whatever analyzer is deepest (Cell3, GPS, etc.) and any dashboard keyed on toplevel_state loses its data.
There was a problem hiding this comment.
potential clean fix: keep name = "toplevel_state" stable and surface the offender via message = msg_to_report->name + ": " + msg_to_report->message (plus hardware_id and values). No consumer breaks.
| max_level == diagnostic_msgs::msg::DiagnosticStatus::STALE && | ||
| max_level_without_stale < diagnostic_msgs::msg::DiagnosticStatus::ERROR) | ||
| { | ||
| // Top level is error if we got no diagnostic level or | ||
| // have stale items but not all are stale | ||
| diag_toplevel_state.level = DiagnosticStatus::ERROR; | ||
| diag_toplevel_state.level = diagnostic_msgs::msg::DiagnosticStatus::STALE; |
There was a problem hiding this comment.
1 corner case worth a sanity check. Concrete scenario: battery node reports WARN (low voltage), a USB camera drops at the same time and its node goes STALE after the watchdog timeout. Today the operator sees a single rollup on /diagnostics_toplevel_state and acts on it.
with this PR: max_level = 3 (STALE), max_level_without_stale = 1 (WARN), predicate 3 == STALE && 1 < ERROR is true -> toplevel reports STALE. The battery WARN disappears. STALE reads like "camera reconnect, will resolve itself" and the live warning gets dismissed.
the old code reported ERROR here (max_level > ERROR && min_level <= ERROR). Also lossy, but at least it forced the operator to look deeper. Swapping it for STALE is a UX regression for that combination.
Potential fix:
if (max_level_without_stale > DiagnosticStatus::OK) {
level = max_level_without_stale; // WARN/ERROR always beats STALE
} else if (max_level == DiagnosticStatus::STALE) {
level = DiagnosticStatus::STALE;
} else if (max_level < 0) {
level = DiagnosticStatus::STALE; // empty input
} else {
level = DiagnosticStatus::OK;
}| diag_toplevel_state.values = msg_to_report->values; | ||
| } | ||
|
|
||
| non_ok_status_depth = 0; |
There was a problem hiding this comment.
The reset zeroes non_ok_status_depth but msg_to_report is kept from loop 1, so any non-OK message in processed_other at depth >= 1 will overwrite a deeper pick from processed (because depth > 0 is trivially true after the reset), right?
| diag_toplevel_state.level = DiagnosticStatus::STALE; | ||
| int max_level = -1; | ||
| int min_level = 255; | ||
| int8_t max_level_without_stale = 0; |
There was a problem hiding this comment.
max_level_without_stale is int8_t here and unsigned char in analyzer_group.cpp:295. DiagnosticStatus::level is int8 in the IDL, so int8_t is the canonical choice. Worth unifying :)
Going from:
to:
As a bonus, the message that caused the degradation of the diagnostics is added to the toplevel state. This is useful for reporting mechanisms such as error logs.