fix: clear stuck DNS node conditions by toggling them both ways#21
fix: clear stuck DNS node conditions by toggling them both ways#21mattmattox wants to merge 1 commit into
Conversation
The DNS monitor (pkg/monitors/network/dns.go) emitted node conditions
additively — only ConditionTrue, never ConditionFalse — so every problem
condition latched True forever. Recovery wrote a separate *Healthy mirror
condition instead of clearing the corresponding *Down/*Degraded one, and on
restart initialSync reloaded the stale True verbatim (observed: a node carried
NodeDoctorClusterDNSDown=True for days while DNS was healthy). Every other
monitor (gateway, memory, cpu, disk) toggles one condition both ways.
Refactor the DNS monitor to the single-toggle pattern:
- Each problem condition is written with its current True/False every cycle
(mutually-exclusive groups via a new setMutuallyExclusiveCondition helper;
independent ones get an explicit False arm).
- Drop the parallel mirror conditions ClusterDNSHealthy, NetworkReachable,
DNSResolutionConsistent, CustomDNSHealthy (healthy = problem conditions all False).
- Rename the DNS external condition NetworkUnreachable -> ExternalDNSDown to stop
colliding with the gateway monitor, which owns NetworkUnreachable.
- Remove the four retired mirror conditions once on startup (initialSync) so
values left on existing nodes don't linger; NetworkUnreachable is excluded.
- Update detector.go's disable-cleanup map and docs/monitors.md to match.
Tests assert each condition goes True on failure then False on recovery, closing
the gap where the old recovery test only checked the healthy mirror appeared.
BREAKING (node conditions): removes NodeDoctor{ClusterDNSHealthy,NetworkReachable,
DNSResolutionConsistent,CustomDNSHealthy}; the DNS side of NodeDoctorNetworkUnreachable
becomes NodeDoctorExternalDNSDown. External alert rules keyed on those names must be
updated; NodeDoctorNetworkUnreachable now reflects gateway state only.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c1c64f2d7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| active = "DNSResolutionInconsistent" | ||
| } | ||
|
|
||
| setMutuallyExclusiveCondition(status, []exclusiveCond{ |
There was a problem hiding this comment.
Aggregate consistency conditions across all custom queries
When more than one custom query has ConsistencyCheck enabled, checkCustomQueries calls checkDomainConsistency once per domain; if an earlier domain sets DNSResolutionDown/Intermittent/Inconsistent=True and a later domain is healthy, this call emits the same condition types as False, which overwrites the earlier True in the Kubernetes condition manager's pendingUpdates map keyed by condition type. In that configuration the node condition can show clear even though a configured consistency-check domain failed, so the clearing needs to happen after aggregating all consistency-query results or use domain-scoped condition types.
Useful? React with 👍 / 👎.
| active = "DNSResolutionDegraded" | ||
| } | ||
|
|
||
| setMutuallyExclusiveCondition(status, []exclusiveCond{ |
There was a problem hiding this comment.
Aggregate nameserver conditions across all custom queries
For multiple custom queries with TestEachNameserver, this per-domain toggle lets a later all-healthy domain emit CustomDNSDown=False and DNSResolutionDegraded=False, overwriting an earlier domain's failed/degraded result because the exporter ultimately stores one pending update per condition type. This means a node can lose the aggregate custom-nameserver alert while one configured domain is still failing; clear these global conditions only after aggregating all tested domains or make the condition domain-specific.
Useful? React with 👍 / 👎.
Fixes a latent bug where the DNS monitor latches node conditions
Trueforever.The bug
pkg/monitors/network/dns.goemits conditions additively — 20types.ConditionTruewrites, zerotypes.ConditionFalse. Every problem condition (ClusterDNSDown,DNSResolutionDown, etc.) sticksTrueforever, because the recovery branch writes a separate mirror condition (ClusterDNSHealthy) instead of clearing the down condition. On restart,condition_manager.goinitialSyncreloads the staleTrueverbatim.Observed in production: a node carried
NodeDoctorClusterDNSDown=Truefor 4+ days while DNS was healthy and the siblingNodeDoctorClusterDNSHealthy=True— the contradiction. The stale alarm masked a real (separate) DNS-datapath blip during triage.Every other monitor (gateway
gateway.go:551-568, memory, cpu, disk) toggles one condition both ways. The DNS monitor was the lone outlier. Bonus: its external check wroteNetworkUnreachable— the same condition the gateway monitor owns (a cross-monitor collision).The fix (single-toggle refactor)
True/Falseevery cycle. Mutually-exclusive groups (Down/Degraded/Intermittent) go through a newsetMutuallyExclusiveConditionhelper (active memberTrue, siblingsFalse); independent ones get an explicitFalsearm.ClusterDNSHealthy,NetworkReachable,DNSResolutionConsistent,CustomDNSHealthy(healthy == the problem conditions allFalse).NetworkUnreachable→ExternalDNSDown(gateway keepsNetworkUnreachable).initialSync) removes the 4 retired mirror conditions left on existing nodes;NetworkUnreachableis deliberately excluded.detector.godisable-cleanup map anddocs/monitors.mdto match.Tests
New tests assert each condition goes
Trueon failure →Falseon recovery (the gap in the old recovery test, which only checked the healthy mirror appeared), plus a startup-cleanup regression test.go build/vet/test ./...green:Removed:
NodeDoctorClusterDNSHealthy,NodeDoctorNetworkReachable,NodeDoctorDNSResolutionConsistent,NodeDoctorCustomDNSHealthy. Renamed: DNS-sideNodeDoctorNetworkUnreachable→NodeDoctorExternalDNSDown(the remainingNodeDoctorNetworkUnreachablereflects gateway state only). Grafana/Alertmanager rules keyed on those names need updating; rules usingkube_node_status_condition{status="true"}==1are fine and now auto-resolve.Follow-ups (out of scope)
AddTrendConditions) are still positive-only — same class of latch, left for a separate PR.