fix(ha): clear stale ha_failed when refreshing HA active state#622
Open
krystian-elisity wants to merge 1 commit into
Open
fix(ha): clear stale ha_failed when refreshing HA active state#622krystian-elisity wants to merge 1 commit into
krystian-elisity wants to merge 1 commit into
Conversation
refresh_ha_active() updated each device's _ha_active flag from the live HA state but left ha_failed untouched. A device marked failed by a prior transient connection error therefore stayed "failed" even after it was reached again, so XapiWrapper kept routing API calls to the HA peer (the passive device) indefinitely. In a Panorama HA pair this sends User-ID / config calls to the passive member, which never propagates them. The else-branch is only reached after successfully reading each device's live HA state, proving it is reachable, so ha_failed can be cleared there. Non-authoritative states (disabled/initial) return early and are untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
PanDevice.refresh_ha_active()refreshes each device's_ha_activeflag fromthe live HA state, but it never clears
ha_failed. A device that was markedfailed by a prior transient connection error (via
set_failed()) thereforestays "failed" even after it has recovered and is reachable again.
Because
XapiWrapperroutes API calls away from a device whoseha_failedisTrue(to the HA peer), calls keep going to the peer indefinitely. In aPanorama HA pair this means User-ID / config calls are sent to the passive
member, which silently never propagates them to the firewalls.
Root cause
refresh_ha_active()only reaches itselsebranch after successfully readingthe live HA state from both devices (
show_highavailability_state()). Asuccessful read proves the device is reachable, so
ha_failedshould be clearedthere — but the method only updated
_ha_active, leaving the staleha_failedbehind. There was no path (other than manual
dev.ha_failed = Falseorset_failed()on the peer) to recover routing.Fix
Clear
ha_failedfor each device whose live HA state was successfully read,alongside the existing
_ha_activeupdate. Non-authoritative states(
disabled/initial) return early, so the flag is left untouched when thestate is not trustworthy.
Tests
Adds
TestFirewallHacovering both paths:test_refresh_ha_active_clears_stale_ha_failed— a live refresh clearsha_failedon the reachable device.test_refresh_ha_active_keeps_ha_failed_when_state_not_authoritative— anon-authoritative (
initial/disabled) state leavesha_faileduntouched.Compatibility
No API change.
refresh_ha_active()keeps its signature and return value; itjust additionally clears the stale flag it was always meant to reconcile.