- Stretched fabric
- Multipod
- Multisite
- Intersite L3Out
- Remote leaf
- vPod
- Remote leaf direct forwarding
- Remote leaf + Multipod without direct forwarding
Stretched fabric
- max 3 sites
- single fabric, single control plane
- 10ms RTT on links between transit leaf and spine
- after recovery from a failure the spines resynchronize COOP
Multipod
- 12 pods max
- max 400 leafs overall
- single fabric, control plane isolated per pod
- IPN requirements: PIM BiDir, OSPF, DHCP relay, jumbo frames, 50 ms RTT between spines, LLDP on the IPN PE (see the IPN config sketch after this list)
- unique address pool for VTEP in each pod
- spines advertise their loopbacks as /32 towards the IPN: proxy VTEP, anycast EVPN next-hop, and the HW-proxy address from the COOP DB
- spines do not participate in PIM, authoritative spine sends IGMP Join for group address (election via IS-IS)
- only spines with active uplinks to IPN can become authoritative
- spine back-to-back is not supported (no PIM): the authoritative spine in each pod may not be the directly connected one
- 40/100G between pods
- spine/leaf serial number is carried in DHCP as a TLV
- APICs in every pod get addresses from the seed pod's pool ⇒ announced as /32
- each pod has its own pool, the range is advertised to IPN via OSPF, spines then redistribute to IS-IS (metric = 64)
- spines in pod trigger DHCP only after leafs are detected
- GIPo for ARP gleaning: 239.255.255.240 (common to all BDs); per-BD GIPo addresses come from 225.0.0.0/8
- a stretched L3Out implies a stretched BD ⇒ routing adjacency with leafs in both pods; the remote pod installs a bounce entry because the MAC is learned in the other pod
- L3Out prefixes are announced via MP-BGP → prefer local because of IGP tiebreaker
- loopback:
- control plane TEP (CP-TEP): RID, BGP source intf
  - dataplane TEP (E-TEP): EVPN next-hop, anycast, placeholder in COOP (does not participate in proxy)
- external MAC proxy: anycast
- external IPv4 proxy: anycast
- external IPv6 proxy: anycast
- leaf creates dynamic tunnels to other leaf:
- on remote EP learning
- on MP-BGP route
- hardware proxy forwards traffic to proxy from another pod, not directly to other pod’s leaf
- location-based PBR to different FW pairs for east-west traffic is not supported: it would blackhole traffic because there is no state sync between the pairs
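A minimal sketch of one IPN-facing port covering the requirements above (OSPF, PIM BiDir, DHCP relay, jumbo MTU, VLAN-4 sub-interface), assuming an NX-OS IPN node; the interface, OSPF process name, phantom-RP address 192.168.100.1 and APIC address 10.0.0.1 are placeholders:

feature ospf
feature pim
feature dhcp
! phantom RP, PIM BiDir, covering the BD GIPo range and the ARP-gleaning group
ip pim rp-address 192.168.100.1 group-list 225.0.0.0/8 bidir
ip pim rp-address 192.168.100.1 group-list 239.255.255.240/28 bidir
!
interface Ethernet1/1.4
  description to pod1 spine, VLAN-4 sub-interface
  encapsulation dot1q 4
  ! jumbo frames for VXLAN
  mtu 9150
  ip address 172.16.1.2/30
  ip router ospf IPN area 0.0.0.0
  ip pim sparse-mode
  ! relay spine/leaf DHCP discovery to an APIC in the seed pod
  ip dhcp relay address 10.0.0.1
!
router ospf IPN
  router-id 192.168.100.2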
Multisite
- isolated fabrics, connection to MSO via OOB (MSO is outside ACI fabric)
- vSphere 6 minimum
- no stretched L4-L7
- RTT between sites up to 1s
- MSO:
- 3 VMs with Docker Swarm
- Active-active
- 1 Gbps between VMs, RTT 150ms
  - ESXi 5.5, 4 vCPU, 8 GB RAM, 5 GB HDD
- ports:
- TCP 2377: Docker Swarm mgmt
- TCP/UDP 7946: Docker Swarm container network discovery
- UDP 4789: VXLAN
- TCP 443
- ESP
- max 12 sites
- spines hold the inter-site VNID translation table, which is filled by MSO via APIC; the receiving spine performs the translation
- shadow EPGs are created locally for remote EPGs to apply policy ⇒ stretched vzAny and unenforced VRFs are not supported
- translation entries are created for:
- stretched EPG
- contract between EPGs on different sites
- sites are connected via the IPN over VLAN 4 (config sketch after this list)
- overlay unicast TEP on spine (O-UTEP) – src for VXLAN traffic, anycast dst for unicast
- overlay multicast TEP (O-MTEP) – anycast dst for BUM
- EVPN type-2 routes
- BUM:
- headend replication ⇒ spine back-to-back is supported (2 sites only, spines cannot forward transit traffic)
- not affected by contracts
- optimize WAN bandwidth:
- allocate mcast group for stretched BD (other BDs do not send traffic under this mcast group)
  - GIPo for stretched BDs comes from a separate pool ⇒ no unnecessary flooding to and from site-local BDs
  - when a leaf receives an IGMP Join ⇒ it updates COOP on the spine ⇒ the spine updates the border leaf ⇒ the border leaf sends a PIM Join
- EVPN speaker – peers with the other sites; EVPN forwarder – peers with a speaker in its own site
- no support for rogue EP control between sites
- spine replaces the VXLAN src IP with the O-UTEP ⇒ leafs in other sites learn the EP as connected to the spine
- with Multipod, each pod is assigned a separate O-UTEP; the O-MTEP is common
- PBR is implemented by provider leaf for east-west traffic ⇒ all traffic goes through 1 PBR node on 1 site
- PBR is implemented by compute leaf for north-south traffic ⇒ PBR node on the site where EPG is located; however, L3Out for return traffic might be different (asymmetric routing)
- PBR to node on another site is not supported, redir_override action:
- if node local to site – redirect
- if node remote to site – permit
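A minimal sketch of an ISN-facing port for the VLAN-4 connectivity above, again assuming an NX-OS node with placeholder names and addressing; unlike the Multipod IPN, no PIM BiDir or DHCP relay is needed because BUM is head-end replicated and pods are not auto-provisioned across sites:

feature ospf
!
interface Ethernet1/1.4
  description to site1 spine, VLAN-4 sub-interface
  encapsulation dot1q 4
  ! jumbo frames for VXLAN
  mtu 9150
  ip address 172.16.11.2/30
  ip router ospf ISN area 0.0.0.0
!
router ospf ISN
  router-id 192.168.101.1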
# show dcimgr repo vnid-maps
# show dcimgr repo sclass-maps
Ingress spine performs VNID and class ID translation
Initially policy is enforced on the egress leaf; the remote EP is then learned via the dataplane, so policy for return traffic can be enforced directly on the ingress leaf.
Intersite L3Out
- allows using remote L3Out for egress traffic
- incompatible with GOLF, remote leaf, CloudSec
- border leafs receive TEP addresses from external pool, spines announce /32
- prefixes from L3Out are announced between spines via VPNv4
- return traffic to the border leaf goes directly from the compute leaf with no multisite translation ⇒ the border leaf does not learn the class ID via the dataplane ⇒ policy is always enforced on the compute leaf (if the first packet originates on the compute side, policy is not enforced for it); the src IP is still translated to the O-UTEP for compute → L3Out traffic
- transit routing: policy is enforced on the first leaf, no src IP translation, no pcTag assigned
Remote leaf
- no EVPN over the IPN; COOP runs through the IPN to the spines
- no need for PIM BiDir because BUM traffic is sent as unicast to the spine
- requires VLAN 4 on the uplink towards the IPN (upstream-router sketch after this list)
- TEP:
- RL-DP-TEP: per remote leaf, src address for VXLAN for orphan
- RL-vPC-TEP: anycast per vPC pair, src address for VXLAN for vPC
- RL-Ucast-TEP: anycast on spine; dst for VXLAN unicast from remote leaf
  - RL-Mcast-TEP: anycast on spine; dst for BUM flooding from the remote leaf; it does not matter which spine receives it because all spines are part of the FTAG trees
- spine rewrites the src address to the RL-Ucast-TEP for traffic sent towards a remote leaf, whether it comes from its own pod or another; no rewrite for traffic from the remote leaf
- external TEP pool: only spine, APIC and border leaf addresses are announced towards the IPN instead of the whole pod pool
- uses DHCP to receive addresses for uplink and TEP
- fabric ports can be used to connect vPC pair back-to-back, preferred over connection via uplink
- a vPC pair synchronizes EPs directly, not via ZeroMQ, because the IPN may fail ⇒ better to deploy two remote leafs as a vPC pair even if they only have single-homed downlinks
- in Multisite, no support for stretching a BD across sites to a remote leaf (neither remote leaf – remote leaf nor remote leaf – local leaf)
- for a vPC pair, inter-VRF traffic goes over the WAN from its own address to its own address with a VNID rewrite – policy enforcement in the egress VRF; for orphan ports traffic passes through the spine as usual; this depends on COOP, i.e. on the spines
- local PBR between EPs via a service node stays local if vPC is used
- if all uplinks fail, downlinks are shut down
- MP-BGP sessions with the external RRs of every pod; COOP only with the spines of its own pod
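A minimal sketch of the upstream WAN/IPN router port facing a remote leaf uplink, assuming an NX-OS device; the VLAN-4 subnet and the APIC address 10.0.0.1 are placeholders; no PIM is needed because BUM from the remote leaf is unicast to the RL-Mcast-TEP:

feature ospf
feature dhcp
!
interface Ethernet1/10.4
  description to remote leaf uplink, VLAN-4 sub-interface
  encapsulation dot1q 4
  mtu 9150
  ip address 172.16.20.1/30
  ip router ospf RL area 0.0.0.0
  ! relay remote leaf discovery/TEP DHCP to an APIC in the main fabric
  ip dhcp relay address 10.0.0.1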
vPod
- EVPN in IPN
- vSpine + vLeaf + AVE
- vSphere only
Remote leaf direct forwarding
- direct communication between remote leafs in own site (including RL in other pods)
- Multisite: communication through spine from own pod, spine rewrites src address to O-UTEP
- L2/L3 proxy – in software
- download relevant part of COOP DB locally, entry timeout – 15 mins
- spines do not forward BUM received from an RL (the RLs replicate it towards each other themselves)
- if spines in own pod fail, RL establishes COOP session to spines in other pod (knob)
Remote leaf + Multipod without direct forwarding
S1 translates the src address of multicast arriving from the IPN when doing HREP towards the remote leafs of its own pod (here S1/S2 are the spines in pod 1/pod 2, LL2 a local leaf in pod 2, RL a remote leaf associated with pod 1).
VTEP src from the remote leaf's standpoint for the same remote EP:
- LL2 for unicast traffic
- RL-Ucast-TEP (pod 1) for flooded (HREP) traffic
- results in MAC flapping
VLAN 5 – connection between pods (VRF2): announces the RL TEP pool and the pod pool.
VLAN 4 – connection between a pod and the remote leafs of that pod (VRF1): announces the RL TEP pool and the pod pool; receiving an RL TEP pool from any pod other than its own is denied via route-map (sketch at the end of this section).
After adding VLAN 5, S2 sees the RL via S1; S1 translates the src address for all dataplane traffic towards the RL ⇒ no MAC flapping, every EP is known via the RL-Ucast-TEP.
With RL direct forwarding there is a direct tunnel between the RL and S2 for both unicast and multicast ⇒ no MAC flapping.
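A partial sketch of the VLAN 4 / VLAN 5 split on an NX-OS IPN router facing a spine, assuming the VRF1/VRF2 roles above; interface numbers, VRF names and the 172.16.40.0/24 RL TEP pool are placeholders, and the inter-VRF leak policy where the deny route-map would be applied is omitted:

! VRF1: pod <-> own remote leafs (VLAN 4); VRF2: pod <-> other pods (VLAN 5)
vrf context VRF1
vrf context VRF2
!
interface Ethernet1/1.4
  encapsulation dot1q 4
  vrf member VRF1
  mtu 9150
  ip address 172.16.30.1/30
  ip router ospf IPN area 0.0.0.0
!
interface Ethernet1/1.5
  encapsulation dot1q 5
  vrf member VRF2
  mtu 9150
  ip address 172.16.31.1/30
  ip router ospf IPN area 0.0.0.0
!
router ospf IPN
  vrf VRF1
  vrf VRF2
!
! match the RL TEP pools of other pods; referenced by the leak/filter policy towards VRF1 (not shown)
ip prefix-list RL-TEP-OTHER-PODS seq 5 permit 172.16.40.0/24 le 32
route-map DENY-OTHER-RL-TEP deny 10
  match ip address prefix-list RL-TEP-OTHER-PODS
route-map DENY-OTHER-RL-TEP permit 20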