Skip to content

fix(ha): clear stale ha_failed when refreshing HA active state#622

Open
krystian-elisity wants to merge 1 commit into
PaloAltoNetworks:developfrom
krystian-elisity:fix/ha-failed-refresh
Open

fix(ha): clear stale ha_failed when refreshing HA active state#622
krystian-elisity wants to merge 1 commit into
PaloAltoNetworks:developfrom
krystian-elisity:fix/ha-failed-refresh

Conversation

@krystian-elisity

Copy link
Copy Markdown

Problem

PanDevice.refresh_ha_active() refreshes each device's _ha_active flag from
the live HA state, but it never clears ha_failed. A device that was marked
failed by a prior transient connection error (via set_failed()) therefore
stays "failed" even after it has recovered and is reachable again.

Because XapiWrapper routes API calls away from a device whose ha_failed is
True (to the HA peer), calls keep going to the peer indefinitely. In a
Panorama HA pair this means User-ID / config calls are sent to the passive
member, which silently never propagates them to the firewalls.

Root cause

refresh_ha_active() only reaches its else branch after successfully reading
the live HA state from both devices (show_highavailability_state()). A
successful read proves the device is reachable, so ha_failed should be cleared
there — but the method only updated _ha_active, leaving the stale ha_failed
behind. There was no path (other than manual dev.ha_failed = False or
set_failed() on the peer) to recover routing.

Fix

Clear ha_failed for each device whose live HA state was successfully read,
alongside the existing _ha_active update. Non-authoritative states
(disabled / initial) return early, so the flag is left untouched when the
state is not trustworthy.

else:
    for fw, state in ((self, self_state), (self.ha_peer, peer_state)):
        fw._ha_active = state == "active"
        # Reached this device for its state, so it is no longer failed.
        fw.ha_failed = False
    return self_state

Tests

Adds TestFirewallHa covering both paths:

  • test_refresh_ha_active_clears_stale_ha_failed — a live refresh clears
    ha_failed on the reachable device.
  • test_refresh_ha_active_keeps_ha_failed_when_state_not_authoritative — a
    non-authoritative (initial / disabled) state leaves ha_failed untouched.

Compatibility

No API change. refresh_ha_active() keeps its signature and return value; it
just additionally clears the stale flag it was always meant to reconcile.

refresh_ha_active() updated each device's _ha_active flag from the live
HA state but left ha_failed untouched. A device marked failed by a prior
transient connection error therefore stayed "failed" even after it was
reached again, so XapiWrapper kept routing API calls to the HA peer (the
passive device) indefinitely. In a Panorama HA pair this sends User-ID /
config calls to the passive member, which never propagates them.

The else-branch is only reached after successfully reading each device's
live HA state, proving it is reachable, so ha_failed can be cleared there.
Non-authoritative states (disabled/initial) return early and are untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant