fix: clear stuck DNS node conditions by toggling them both ways by mattmattox · Pull Request #21 · SupportTools/node-doctor

mattmattox · 2026-06-08T09:18:50Z

Fixes a latent bug where the DNS monitor latches node conditions True forever.

The bug

pkg/monitors/network/dns.go emits conditions additively — 20 types.ConditionTrue writes, zero types.ConditionFalse. Every problem condition (ClusterDNSDown, DNSResolutionDown, etc.) sticks True forever, because the recovery branch writes a separate mirror condition (ClusterDNSHealthy) instead of clearing the down condition. On restart, condition_manager.go initialSync reloads the stale True verbatim.

Observed in production: a node carried NodeDoctorClusterDNSDown=True for 4+ days while DNS was healthy and the sibling NodeDoctorClusterDNSHealthy=True — the contradiction. The stale alarm masked a real (separate) DNS-datapath blip during triage.

Every other monitor (gateway gateway.go:551-568, memory, cpu, disk) toggles one condition both ways. The DNS monitor was the lone outlier. Bonus: its external check wrote NetworkUnreachable — the same condition the gateway monitor owns (a cross-monitor collision).

The fix (single-toggle refactor)

Each problem condition is now written with its current True/False every cycle. Mutually-exclusive groups (Down/Degraded/Intermittent) go through a new setMutuallyExclusiveCondition helper (active member True, siblings False); independent ones get an explicit False arm.
Dropped the parallel mirror conditions: ClusterDNSHealthy, NetworkReachable, DNSResolutionConsistent, CustomDNSHealthy (healthy == the problem conditions all False).
Renamed the DNS external condition NetworkUnreachable → ExternalDNSDown (gateway keeps NetworkUnreachable).
One-shot startup cleanup (initialSync) removes the 4 retired mirror conditions left on existing nodes; NetworkUnreachable is deliberately excluded.
Updated detector.go disable-cleanup map and docs/monitors.md to match.

Tests

New tests assert each condition goes True on failure → False on recovery (the gap in the old recovery test, which only checked the healthy mirror appeared), plus a startup-cleanup regression test. go build/vet/test ./... green:

ok  pkg/monitors/network        (220 pass)
ok  pkg/exporters/kubernetes    (incl. TestRemoveRetiredDNSConditionsOnInitialSync)
ok  pkg/detector

⚠️ Breaking (node-condition names) — audit external alerts

Removed: NodeDoctorClusterDNSHealthy, NodeDoctorNetworkReachable, NodeDoctorDNSResolutionConsistent, NodeDoctorCustomDNSHealthy. Renamed: DNS-side NodeDoctorNetworkUnreachable → NodeDoctorExternalDNSDown (the remaining NodeDoctorNetworkUnreachable reflects gateway state only). Grafana/Alertmanager rules keyed on those names need updating; rules using kube_node_status_condition{status="true"}==1 are fine and now auto-resolve.

Follow-ups (out of scope)

DNS correlation/trend conditions (AddTrendConditions) are still positive-only — same class of latch, left for a separate PR.
Per-domain checks still share domain-agnostic condition names (last-write-wins); not changed here.

The DNS monitor (pkg/monitors/network/dns.go) emitted node conditions additively — only ConditionTrue, never ConditionFalse — so every problem condition latched True forever. Recovery wrote a separate *Healthy mirror condition instead of clearing the corresponding *Down/*Degraded one, and on restart initialSync reloaded the stale True verbatim (observed: a node carried NodeDoctorClusterDNSDown=True for days while DNS was healthy). Every other monitor (gateway, memory, cpu, disk) toggles one condition both ways. Refactor the DNS monitor to the single-toggle pattern: - Each problem condition is written with its current True/False every cycle (mutually-exclusive groups via a new setMutuallyExclusiveCondition helper; independent ones get an explicit False arm). - Drop the parallel mirror conditions ClusterDNSHealthy, NetworkReachable, DNSResolutionConsistent, CustomDNSHealthy (healthy = problem conditions all False). - Rename the DNS external condition NetworkUnreachable -> ExternalDNSDown to stop colliding with the gateway monitor, which owns NetworkUnreachable. - Remove the four retired mirror conditions once on startup (initialSync) so values left on existing nodes don't linger; NetworkUnreachable is excluded. - Update detector.go's disable-cleanup map and docs/monitors.md to match. Tests assert each condition goes True on failure then False on recovery, closing the gap where the old recovery test only checked the healthy mirror appeared. BREAKING (node conditions): removes NodeDoctor{ClusterDNSHealthy,NetworkReachable, DNSResolutionConsistent,CustomDNSHealthy}; the DNS side of NodeDoctorNetworkUnreachable becomes NodeDoctorExternalDNSDown. External alert rules keyed on those names must be updated; NodeDoctorNetworkUnreachable now reflects gateway state only.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c1c64f2d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-08T09:22:23Z

+		active = "DNSResolutionInconsistent"
 	}

+	setMutuallyExclusiveCondition(status, []exclusiveCond{


Aggregate consistency conditions across all custom queries

When more than one custom query has ConsistencyCheck enabled, checkCustomQueries calls checkDomainConsistency once per domain; if an earlier domain sets DNSResolutionDown/Intermittent/Inconsistent=True and a later domain is healthy, this call emits the same condition types as False, which overwrites the earlier True in the Kubernetes condition manager's pendingUpdates map keyed by condition type. In that configuration the node condition can show clear even though a configured consistency-check domain failed, so the clearing needs to happen after aggregating all consistency-query results or use domain-scoped condition types.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-08T09:22:23Z

+		active = "DNSResolutionDegraded"
 	}
+
+	setMutuallyExclusiveCondition(status, []exclusiveCond{


Aggregate nameserver conditions across all custom queries

For multiple custom queries with TestEachNameserver, this per-domain toggle lets a later all-healthy domain emit CustomDNSDown=False and DNSResolutionDegraded=False, overwriting an earlier domain's failed/degraded result because the exporter ultimately stores one pending update per condition type. This means a node can lose the aggregate custom-nameserver alert while one configured domain is still failing; clear these global conditions only after aggregating all tested domains or make the condition domain-specific.

Useful? React with 👍 / 👎.

chatgpt-codex-connector Bot reviewed Jun 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: clear stuck DNS node conditions by toggling them both ways#21

fix: clear stuck DNS node conditions by toggling them both ways#21
mattmattox wants to merge 1 commit into
mainfrom
fix/dns-condition-latch-toggle

mattmattox commented Jun 8, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 8, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mattmattox commented Jun 8, 2026

The bug

The fix (single-toggle refactor)

Tests

⚠️ Breaking (node-condition names) — audit external alerts

Follow-ups (out of scope)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant