BGP PIC local - backup path hld #2292
venkit-nexthop wants to merge 2 commits into sonic-net:master
Conversation
Signed-off-by: Venkit Kasiviswanathan <[email protected]>
/azp run

No pipelines are associated with this pull request.
Hi,
Could you please describe why you are doing this and the exact use case?
This is meaningless in equidistant topologies (e.g. leaf-spine): each destination is available over ECMP, every ECMP member is as good as any other and is by definition loop free. There is no need for an explicit backup; every ECMP path is both active and a backup. On local link down, there is no need for BGP convergence; the failed next-hop just needs to be removed from the FW. BGP convergence is only meaningful for remote leafs, not local.
Moreover, you are stating: "The backup path MUST NOT include any nexthop that is already in the primary ECMP set" - this makes it of no use in DC; however, all examples show ECMP topologies, with MED as tiebreaker - quite "unusual"?
Reference:
10.0.1.101 from 10.0.1.101
Origin IGP, metric 1, valid, external, multipath, best (MED) <========
10.0.2.106 from 10.0.2.106
Origin IGP, metric 6, valid, external, backup
The orchagent/SAI integration is completely TBD. The actual hardware programming, the only part that matters for fast failover, isn't designed yet. Open items literally say: "Determine SAI API requirements for primary/backup nexthop group programming" and "Orchagent backup programming: TBD." So right now this feature just pushes backup info through the control plane all the way to APP_DB... and then does nothing with it in hardware.
Thanks and looking forward to your answers,
Jeff
Hi Jeff,
Thanks a lot for the detailed review and questions.
**Why / exact use cases**
You’re right that in a perfectly equidistant leaf–spine fabric, for
prefixes that truly have a full ECMP set, the fabric itself already gives
you local protection: every ECMP member is usable and link failure handling
essentially involves removing the failed member from the group. The intent
of this work is not to address that case, but to cover a set of non‑ideal
and non‑ECMP scenarios we see in deployments.
Section 4 of
https://github.com/sonic-net/SONiC/blob/master/doc/pic/bgp_pic_arch_doc.md
describes the PIC local (aka fast reroute) scenario. It provides fast
restoration at the egress side, until the ingress reconverges, by rerouting
traffic to a peering PE that has another path to reach the destination.
Additionally,
- **Single dominant next‑hop with a less preferred backup.**
For many important prefixes (e.g., default route towards a border, DCI
prefixes, service VRFs), operators deliberately steer traffic to a
*single* primary border/exit, with a different border or path as a
backup. Today, when the primary border/link fails, data-plane failover
waits on BGP convergence. With PIC local, we pre‑compute an explicitly
less-preferred backup and program it alongside the primary so we can switch
locally without waiting for control-plane reconvergence.
- **Non‑equidistant or non‑Clos topologies.**
SONiC is also used in collapsed core, WAN edge, and mixed DC/WAN
environments where paths are not strictly equidistant, ECMP is not always
available, and you still want sub‑second protection. Here, an explicit
backup next‑hop that is not in the primary ECMP set is required.
- **Node/rack protection rather than just a single local link.**
Even in DC, there are cases where the primary path is constrained to a
particular device or rack (e.g., primary border leaf, services leaf). The
goal is to have a pre‑installed backup via a different device/failure
domain, not just “some other member of the same ECMP group,” so that a
device/rack failure is locally protected.
I’ll make sure the HLD text explains these concrete use cases more
explicitly, and also clearly calls out that for “pure ECMP everywhere”
leaf–spine fabrics there is no additional benefit over existing ECMP
behaviour.
**On the “backup MUST NOT include any primary ECMP nexthop” rule / MED
example**
The intent of that rule is precisely to encode the requirement above: *the
backup path must be nexthop‑disjoint from the primary ECMP set*, so that it
represents a different failure domain.
The examples were quickly generated only to show the mnemonics used to
display the backup paths and not the use-case topology. I apologize for the
confusion there.
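To make the disjointness rule concrete, here is a minimal sketch of the selection logic. This is illustrative only and not the bgpd implementation: `pick_backup`, the `(preference, nexthops)` path model, and the addresses are assumptions chosen to mirror the quoted `show` output.

```python
def pick_backup(primary_paths, candidate_paths):
    """Pick a backup path whose nexthops are disjoint from the primary
    ECMP set, so the backup represents a different failure domain.

    Paths are modelled as (preference, frozenset_of_nexthops) tuples,
    lower preference value meaning more preferred. Illustrative sketch,
    not the HLD's actual data structures."""
    primary_nexthops = set()
    for _, nexthops in primary_paths:
        primary_nexthops |= nexthops

    # Walk the remaining paths from most to least preferred and keep
    # the first one that shares no nexthop with the primary set.
    for pref, nexthops in sorted(candidate_paths, key=lambda p: p[0]):
        if nexthops.isdisjoint(primary_nexthops):
            return (pref, nexthops)
    return None  # no nexthop-disjoint backup exists


# Example mirroring the MED-tiebreaker output quoted above:
primary = [(1, frozenset({"10.0.1.101"}))]
candidates = [(6, frozenset({"10.0.2.106"})),
              (3, frozenset({"10.0.1.101"}))]  # shares a primary nexthop
print(pick_backup(primary, candidates))  # -> (6, frozenset({'10.0.2.106'}))
```

Note that the more-preferred candidate is skipped because it reuses a primary nexthop; the rule deliberately trades preference for failure-domain disjointness.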
**Orchagent / SAI integration and hardware programming**
You’re absolutely right that for fast reroute the *only thing that really
matters is what ends up in hardware.* The current HLD intentionally focuses
on the end‑to‑end control‑plane path and data model:
- bgpd: backup path computation and signalling
- zebra: encoding backup nexthops in FPM (RTNH_F_BACKUP)
- fpmsyncd / APP_DB: modelling primary and backup nexthops
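Purely as an illustration of the data model (the exact APP_DB schema is defined in the HLD itself; the `backup_nexthop`/`backup_ifname` field names below are my shorthand, not the final schema), a `ROUTE_TABLE` entry carrying both sets might look like:

```
ROUTE_TABLE:192.0.2.0/24
    "nexthop"        : "10.0.1.101"
    "ifname"         : "Ethernet0"
    "backup_nexthop" : "10.0.2.106"     <- assumed field name, not final
    "backup_ifname"  : "Ethernet4"      <- assumed field name, not final
```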
The orchagent/SAI parts are marked TBD because I wanted to first get
agreement on:
1. **The semantics and selection rules** for primary vs backup paths, and
2. **The configuration/YANG model** and how backup information flows through
APP_DB.
The next step is:
- Define how orchagent will consume the primary+backup info from APP_DB
and map it into ASIC nexthop groups, and
- Either **reuse existing SAI next‑hop group constructs** or, if that’s
insufficient, propose concrete SAI API requirements for primary/backup
groups to the SAI community.
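On the reuse question: SAI's protection next-hop group (`SAI_NEXT_HOP_GROUP_TYPE_PROTECTION`, with PRIMARY/STANDBY member roles) looks like the closest existing construct, and evaluating its fit is exactly the open item. The intended failover semantics can be sketched as a toy model; this is not orchagent or SAI code, and the class and method names are illustrative only:

```python
class ProtectionGroup:
    """Toy model of a primary/backup nexthop group: traffic follows the
    primary members until the whole primary set fails locally, at which
    point forwarding switches to the pre-installed backup with no
    control-plane involvement (the PIC local step)."""

    def __init__(self, primary, backup):
        self.primary = list(primary)   # e.g. the ECMP member set
        self.backup = list(backup)     # nexthop-disjoint backup set
        self.switched_over = False

    def active_nexthops(self):
        return self.backup if self.switched_over else self.primary

    def on_link_down(self, nexthop):
        # Remove the failed member; if no primary member remains,
        # fail over to the backup locally.
        if nexthop in self.primary:
            self.primary.remove(nexthop)
        if not self.primary:
            self.switched_over = True


g = ProtectionGroup(primary=["10.0.1.101"], backup=["10.0.2.106"])
g.on_link_down("10.0.1.101")
print(g.active_nexthops())  # -> ['10.0.2.106']
```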
Until that is implemented, I will explicitly state in the HLD that the
feature does not deliver sub‑second hardware failover for data‑plane
traffic yet; it only prepares the control‑plane and data‑model side.
Thanks again for taking the time to dig into both the topology assumptions
and the open items. I’ll update the HLD to reflect the points above and
call out the phased control-plane vs hardware scope more explicitly.
Best regards,
Venkit
On Tue, Apr 14, 2026 at 11:48 AM, *jefftant* left a comment (sonic-net/SONiC#2292):
BGP PIC Local (Fast Reroute) HLD
Adds a High-Level Design document for BGP PIC Local (Fast Reroute) in SONiC. This feature enables sub-second failover on local link failures by pre-computing a backup BGP path alongside
the primary path and installing both in the FIB. When the local link fails, the data plane switches to the backup immediately without waiting for BGP reconvergence.
The HLD covers the end-to-end design across bgpd (backup path computation and signalling), zebra (encoding backup nexthops in FPM), and fpmsyncd/APP_DB (modelling primary and backup nexthops).
Orchagent and nexthop group hardware programming are TBD and out of scope for this revision.