Skip to content

fix: clear stuck DNS node conditions by toggling them both ways#21

Open
mattmattox wants to merge 1 commit into
mainfrom
fix/dns-condition-latch-toggle
Open

fix: clear stuck DNS node conditions by toggling them both ways#21
mattmattox wants to merge 1 commit into
mainfrom
fix/dns-condition-latch-toggle

Conversation

@mattmattox

Copy link
Copy Markdown
Contributor

Fixes a latent bug where the DNS monitor latches node conditions True forever.

The bug

pkg/monitors/network/dns.go emits conditions additively — 20 types.ConditionTrue writes, zero types.ConditionFalse. Every problem condition (ClusterDNSDown, DNSResolutionDown, etc.) sticks True forever, because the recovery branch writes a separate mirror condition (ClusterDNSHealthy) instead of clearing the down condition. On restart, condition_manager.go initialSync reloads the stale True verbatim.

Observed in production: a node carried NodeDoctorClusterDNSDown=True for 4+ days while DNS was healthy and the sibling NodeDoctorClusterDNSHealthy=True — the contradiction. The stale alarm masked a real (separate) DNS-datapath blip during triage.

Every other monitor (gateway gateway.go:551-568, memory, cpu, disk) toggles one condition both ways. The DNS monitor was the lone outlier. Bonus: its external check wrote NetworkUnreachable — the same condition the gateway monitor owns (a cross-monitor collision).

The fix (single-toggle refactor)

  • Each problem condition is now written with its current True/False every cycle. Mutually-exclusive groups (Down/Degraded/Intermittent) go through a new setMutuallyExclusiveCondition helper (active member True, siblings False); independent ones get an explicit False arm.
  • Dropped the parallel mirror conditions: ClusterDNSHealthy, NetworkReachable, DNSResolutionConsistent, CustomDNSHealthy (healthy == the problem conditions all False).
  • Renamed the DNS external condition NetworkUnreachableExternalDNSDown (gateway keeps NetworkUnreachable).
  • One-shot startup cleanup (initialSync) removes the 4 retired mirror conditions left on existing nodes; NetworkUnreachable is deliberately excluded.
  • Updated detector.go disable-cleanup map and docs/monitors.md to match.

Tests

New tests assert each condition goes True on failure → False on recovery (the gap in the old recovery test, which only checked the healthy mirror appeared), plus a startup-cleanup regression test. go build/vet/test ./... green:

ok  pkg/monitors/network        (220 pass)
ok  pkg/exporters/kubernetes    (incl. TestRemoveRetiredDNSConditionsOnInitialSync)
ok  pkg/detector

⚠️ Breaking (node-condition names) — audit external alerts

Removed: NodeDoctorClusterDNSHealthy, NodeDoctorNetworkReachable, NodeDoctorDNSResolutionConsistent, NodeDoctorCustomDNSHealthy. Renamed: DNS-side NodeDoctorNetworkUnreachableNodeDoctorExternalDNSDown (the remaining NodeDoctorNetworkUnreachable reflects gateway state only). Grafana/Alertmanager rules keyed on those names need updating; rules using kube_node_status_condition{status="true"}==1 are fine and now auto-resolve.

Follow-ups (out of scope)

  • DNS correlation/trend conditions (AddTrendConditions) are still positive-only — same class of latch, left for a separate PR.
  • Per-domain checks still share domain-agnostic condition names (last-write-wins); not changed here.

The DNS monitor (pkg/monitors/network/dns.go) emitted node conditions
additively — only ConditionTrue, never ConditionFalse — so every problem
condition latched True forever. Recovery wrote a separate *Healthy mirror
condition instead of clearing the corresponding *Down/*Degraded one, and on
restart initialSync reloaded the stale True verbatim (observed: a node carried
NodeDoctorClusterDNSDown=True for days while DNS was healthy). Every other
monitor (gateway, memory, cpu, disk) toggles one condition both ways.

Refactor the DNS monitor to the single-toggle pattern:
- Each problem condition is written with its current True/False every cycle
  (mutually-exclusive groups via a new setMutuallyExclusiveCondition helper;
  independent ones get an explicit False arm).
- Drop the parallel mirror conditions ClusterDNSHealthy, NetworkReachable,
  DNSResolutionConsistent, CustomDNSHealthy (healthy = problem conditions all False).
- Rename the DNS external condition NetworkUnreachable -> ExternalDNSDown to stop
  colliding with the gateway monitor, which owns NetworkUnreachable.
- Remove the four retired mirror conditions once on startup (initialSync) so
  values left on existing nodes don't linger; NetworkUnreachable is excluded.
- Update detector.go's disable-cleanup map and docs/monitors.md to match.

Tests assert each condition goes True on failure then False on recovery, closing
the gap where the old recovery test only checked the healthy mirror appeared.

BREAKING (node conditions): removes NodeDoctor{ClusterDNSHealthy,NetworkReachable,
DNSResolutionConsistent,CustomDNSHealthy}; the DNS side of NodeDoctorNetworkUnreachable
becomes NodeDoctorExternalDNSDown. External alert rules keyed on those names must be
updated; NodeDoctorNetworkUnreachable now reflects gateway state only.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c1c64f2d7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

active = "DNSResolutionInconsistent"
}

setMutuallyExclusiveCondition(status, []exclusiveCond{

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Aggregate consistency conditions across all custom queries

When more than one custom query has ConsistencyCheck enabled, checkCustomQueries calls checkDomainConsistency once per domain; if an earlier domain sets DNSResolutionDown/Intermittent/Inconsistent=True and a later domain is healthy, this call emits the same condition types as False, which overwrites the earlier True in the Kubernetes condition manager's pendingUpdates map keyed by condition type. In that configuration the node condition can show clear even though a configured consistency-check domain failed, so the clearing needs to happen after aggregating all consistency-query results or use domain-scoped condition types.

Useful? React with 👍 / 👎.

active = "DNSResolutionDegraded"
}

setMutuallyExclusiveCondition(status, []exclusiveCond{

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Aggregate nameserver conditions across all custom queries

For multiple custom queries with TestEachNameserver, this per-domain toggle lets a later all-healthy domain emit CustomDNSDown=False and DNSResolutionDegraded=False, overwriting an earlier domain's failed/degraded result because the exporter ultimately stores one pending update per condition type. This means a node can lose the aggregate custom-nameserver alert while one configured domain is still failing; clear these global conditions only after aggregating all tested domains or make the condition domain-specific.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant