Multisite

  1. Stretched fabric
  2. Multipod
  3. Multisite
  4. Intersite L3Out
  5. Remote leaf
  6. vPod
  7. Remote leaf direct forwarding
  8. Remote leaf + Multipod without direct forwarding

Stretched fabric

  • max 3 sites
  • single fabric, single control plane
  • 10ms RTT on links between transit leaf and spine
  • after recovery from a failure, the spines resynchronize COOP

Multipod

  • 12 pods max
  • max 400 leafs overall
  • single fabric, isolated control plane
  • PIM BiDir, OSPF, DHCP relay, jumbo frames, 50ms RTT between spines, LLDP on IPN PE
  • unique address pool for VTEP in each pod
  • spines send their loopbacks as /32 towards IPN, proxy VTEP, anycast next-hop in EVPN + address for HW proxy in COOP DB
  • spines do not participate in PIM, authoritative spine sends IGMP Join for group address (election via IS-IS)
  • only spines with active uplinks to IPN can become authoritative
  • spine back-to-back not supported (no PIM) because directly connected spines may not become authoritative simultaneously
  • 40/100G between pods
  • spine/leaf serial numbers are carried in DHCP as a TLV
  • APICs in every pod get addresses from the seed pod's pool ⇒ announced as /32
  • each pod has its own pool, the range is advertised to IPN via OSPF, spines then redistribute to IS-IS (metric = 64)
  • spines in pod trigger DHCP only after leafs are detected
  • GIPo for ARP gleaning: 239.255.255.240, shared by all BDs; per-BD GIPo allocated from 225.0.0.0/8
  • stretched L3Out = stretched BD ⇒ routing adjacency with leafs in both pods; remote pod has bounce entry because MAC is known in other pod
  • L3Out prefixes are announced via MP-BGP → prefer local because of IGP tiebreaker
  • loopback:
    1. control plane TEP (CP-TEP): RID, BGP source intf
    2. dataplane TEP (E-TEP): EVPN next-hop, anycast, placeholder in COOP (does not participate in proxy)
    3. external MAC proxy: anycast
    4. external IPv4 proxy: anycast
    5. external IPv6 proxy: anycast
  • leaf creates dynamic tunnels to other leaf:
    1. on remote EP learning
    2. on MP-BGP route
  • the hardware proxy forwards traffic for endpoints in another pod to that pod's proxy, not directly to the other pod's leaf (see the sketch after this list)
  • location-based PBR to different FW pairs for east-west traffic is not supported: there is no state sync between the pairs, so asymmetric flows would be blackholed
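
The two-stage proxy behaviour above can be pictured with a minimal Python sketch. It only models the forwarding decision (local EP → leaf VTEP, remote EP → the other pod's proxy TEP); pod names, MAC addresses and TEP values are invented for the example.

# A minimal sketch (plain Python, not ACI code) of the Multipod hardware proxy:
# traffic for an endpoint in another pod is sent to that pod's anycast proxy TEP,
# never straight to the remote pod's leaf. All values below are hypothetical.

POD_PROXY_TEP = {
    "pod1": "10.1.0.1",   # anycast HW-proxy address of pod1's spines
    "pod2": "10.2.0.1",   # anycast HW-proxy address of pod2's spines
}

# COOP view of pod1's spines: local EPs point at their leaf VTEP,
# EPs learned via MP-BGP EVPN from pod2 point at pod2's proxy TEP.
COOP_POD1 = {
    "00:00:00:00:00:0a": "10.1.0.64",            # local EP behind a pod1 leaf
    "00:00:00:00:00:0b": POD_PROXY_TEP["pod2"],  # remote EP: next hop is pod2's proxy
}

def hw_proxy_forward(mac: str) -> str:
    """Destination TEP a pod1 spine proxy uses for an unknown-unicast frame."""
    return COOP_POD1[mac]

print(hw_proxy_forward("00:00:00:00:00:0a"))  # 10.1.0.64 -> directly to the local leaf
print(hw_proxy_forward("00:00:00:00:00:0b"))  # 10.2.0.1  -> via pod2's proxy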

Multisite

  • isolated fabrics, connection to MSO via OOB (MSO is outside ACI fabric)
  • vSphere 6 minimum
  • no stretched L4-L7
  • RTT between sites up to 1s
  • MSO:
    • 3 VMs with Docker Swarm
    • Active-active
    • 1 Gbps between VMs, RTT 150ms
    • ESXi 5.5, 4 vCPU, 8 GB RAM, 5 GB HDD
  • ports:
    1. TCP 2377: Docker Swarm mgmt
    2. TCP/UDP 7946: Docker Swarm container network discovery
    3. UDP 4789: VXLAN
    4. TCP 443
    5. ESP
  • 12 sites
  • spines host the VNID translation table between sites; it is filled by MSO via the APIC; the receiving spine performs translation
  • shadow EPGs are created locally for remote EPGs so that policy can be applied ⇒ stretched vzAny and unenforced VRF are not supported
  • translation creation:
    1. stretched EPG
    2. contract between EPGs on different sites
  • sites connected via IPN over VLAN 4
  • overlay unicast TEP on spine (O-UTEP) – src for VXLAN traffic, anycast dst for unicast
  • overlay multicast TEP (O-MTEP) – anycast dst for BUM
  • EVPN type-2 routes
  • BUM:
    1. headend replication ⇒ spine back-to-back is supported (2 sites only, spines cannot forward transit traffic); see the sketch after this list
    2. not affected by contracts
    3. optimize WAN bandwidth:
      • allocate mcast group for stretched BD (other BDs do not send traffic under this mcast group)
      • GIPo for stretched BD from separate pool, no unnecessary flood to and from local BD
    4. on receiving IGMP Join on leaf ⇒ update spine COOP ⇒ spine updates border leaf ⇒ border leaf sends PIM Join
  • EVPN speaker – communicates with other site, EVPN forwarder – communicates with speaker on its own site
  • no support for rogue EP control between sites
  • spine replaces src IP of VXLAN for O-UTEP ⇒ leafs on other sites learn EP as connected to spine
  • in Multipod each pod is assigned separate O-UTEP; O-MTEP is common
  • PBR is implemented by provider leaf for east-west traffic ⇒ all traffic goes through 1 PBR node on 1 site
  • PBR is implemented by compute leaf for north-south traffic ⇒ PBR node on the site where EPG is located; however, L3Out for return traffic might be different (asymmetric routing)
  • PBR to node on another site is not supported, redir_override action:
    1. if node local to site – redirect
    2. if node remote to site – permit
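
A short Python sketch of the headend replication and WAN-bandwidth optimization mentioned in the BUM list above: the local spine sends one unicast VXLAN copy per remote site where the BD is stretched, addressed to that site's O-MTEP. Site names, addresses and the BD-to-site map are assumptions made up for the example.

O_MTEP = {"site1": "172.16.1.1", "site2": "172.16.2.1", "site3": "172.16.3.1"}

# which sites each BD is stretched to ("optimize WAN bandwidth": only these get a copy)
BD_STRETCHED_TO = {"bd-web": {"site1", "site2"}, "bd-db": {"site1"}}

def headend_replicate(local_site: str, bd: str) -> list:
    """Unicast destinations a spine in local_site replicates one BUM frame to."""
    return [O_MTEP[s] for s in sorted(BD_STRETCHED_TO[bd]) if s != local_site]

print(headend_replicate("site1", "bd-web"))  # ['172.16.2.1'] - one copy to site2's O-MTEP
print(headend_replicate("site1", "bd-db"))   # [] - BD not stretched, nothing leaves the site
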
# show dcimgr repo vnid-maps
# show dcimgr repo sclass-maps

Ingress spine performs VNID and class ID translation
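
As an illustration of what those translation tables do, here is a tiny Python sketch (not ACI code) of the lookup the receiving spine performs; the numeric VNIDs and class IDs are invented.

# remote (site, VNID) -> local VNID and remote (site, class ID) -> local class ID;
# conceptually what `show dcimgr repo vnid-maps` / `sclass-maps` display
VNID_MAP = {("site1", 15007706): 16383902}
SCLASS_MAP = {("site1", 49155): 32771}

def translate(src_site: str, vnid: int, sclass: int) -> tuple:
    """Rewrite VNID and class ID into the local site's namespace."""
    return VNID_MAP[(src_site, vnid)], SCLASS_MAP[(src_site, sclass)]

print(translate("site1", 15007706, 49155))  # -> (16383902, 32771)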

Initially, policy is enforced on the egress leaf ⇒ the remote EP is learned via the dataplane ⇒ policy for return traffic can then be enforced immediately on the ingress leaf
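
A sketch (plain Python, assumptions only) of that enforcement decision: until the ingress leaf has learned the remote EP, and hence its class ID, via the dataplane, it forwards unenforced and the egress leaf applies the contract.

def enforcement_point(dst_ep: str, known_remote_eps: set) -> str:
    """Where the contract is applied for a given destination EP."""
    return "ingress leaf" if dst_ep in known_remote_eps else "egress leaf"

learned = set()
print(enforcement_point("ep-b", learned))  # 'egress leaf'  (first packets of the flow)
learned.add("ep-b")                        # return traffic teaches the leaf ep-b's class
print(enforcement_point("ep-b", learned))  # 'ingress leaf' (subsequent traffic)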

Intersite L3Out

  • allows using remote L3Out for egress traffic
  • incompatible with GOLF, remote leaf, CloudSec
  • border leafs receive TEP addresses from external pool, spines announce /32
  • prefixes from L3Out are announced between spines via VPNv4
  • return traffic to the border leaf goes directly from the compute leaf, with no multisite translation ⇒ the border leaf does not learn the class ID via the dataplane ⇒ policy is always enforced on the compute leaf (if the first packet is originated on the compute side, policy is not enforced for it); src IP is still translated to the O-UTEP for compute → L3Out traffic (see the sketch after this list)
  • transit routing: policy is enforced on the first leaf, no src IP translation, no pcTag assigned
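
A minimal sketch (plain Python, my own flow labels) summarizing where policy lands in the intersite L3Out cases above.

ENFORCEMENT = {
    "endpoint <-> remote L3Out": "compute leaf (border leaf never learns the class ID via dataplane)",
    "transit L3Out <-> L3Out":   "first (ingress) leaf, no src IP translation, no pcTag assigned",
}

for flow, where in ENFORCEMENT.items():
    print(f"{flow}: policy enforced on {where}")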

Remote leaf

  • no EVPN over the IPN; the remote leaf runs COOP with the spines across the IPN
  • no need for PIM BiDir because traffic is sent unicast to the spine
  • requires VLAN 4 on uplink towards IPN
  • TEP:
    1. RL-DP-TEP: per remote leaf, src address for VXLAN for orphan
    2. RL-vPC-TEP: anycast per vPC pair, src address for VXLAN for vPC
    3. RL-Ucast-TEP: anycast on spine; dst for VXLAN unicast from remote leaf
    4. RL-Mcast-TEP: anycast on spine; dst for BUM flooding from remote leaf; it does not matter which spine receives it because all of them are connected to the FTAG trees
    • the spine rewrites the src address to the RL-Ucast-TEP for traffic sent towards the remote leaf, whether it originates in its own pod or another; no rewrite for traffic from the remote leaf (see the sketch after this list)
  • external TEP pool: instead of announcing the whole pod pool, only the spine, APIC and border-leaf addresses are announced
  • uses DHCP to receive addresses for uplink and TEP
  • fabric ports can be used to connect vPC pair back-to-back, preferred over connection via uplink
  • vPC peers synchronize EPs directly, not via ZeroMQ, because the IPN may fail ⇒ better to deploy two remote leafs as a vPC pair even if they only have single-homed downlinks
  • no support for stretching a BD to a remote leaf across different sites in Multisite (neither remote leaf – remote leaf nor remote leaf – local leaf)
  • inter-VRF traffic for a vPC pair goes through the WAN from its own address to its own address with a VNID rewrite – policy is enforced in the egress VRF; for orphan ports traffic passes through the spine as usual; this depends on COOP, i.e. on the spines
  • local PBR between EP via service node if vPC is used
  • if all uplinks fail, downlinks are shut down
  • MP-BGP session with external RR of every pod, COOP – with own pod only
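
The TEP roles above determine the VXLAN source and destination a remote leaf uses; see this small Python sketch (not ACI code, addresses invented).

RL_DP_TEP    = "10.10.0.11"  # per remote leaf
RL_VPC_TEP   = "10.10.0.12"  # anycast per vPC pair
RL_UCAST_TEP = "10.0.0.1"    # anycast on the spines, unicast destination
RL_MCAST_TEP = "10.0.0.2"    # anycast on the spines, BUM destination

def rl_vxlan_headers(vpc_attached: bool, bum: bool) -> tuple:
    """(src TEP, dst TEP) for a frame the remote leaf sends into the fabric."""
    src = RL_VPC_TEP if vpc_attached else RL_DP_TEP
    dst = RL_MCAST_TEP if bum else RL_UCAST_TEP
    return src, dst

print(rl_vxlan_headers(vpc_attached=True,  bum=False))  # vPC-attached EP, unicast
print(rl_vxlan_headers(vpc_attached=False, bum=True))   # orphan EP, BUM flood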

vPod

  • EVPN in IPN
  • vSpine + vLeaf + AVE
  • vSphere only

Remote leaf direct forwarding

  • direct communication between remote leafs in own site (including RL in other pods)
  • Multisite: communication through spine from own pod, spine rewrites src address to O-UTEP
  • L2/L3 proxy – in software
  • the relevant part of the COOP DB is downloaded locally; entry timeout is 15 minutes (see the cache sketch after this list)
  • spines do not forward BUM received from remote leafs because the RLs replicate it themselves
  • if spines in own pod fail, RL establishes COOP session to spines in other pod (knob)
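
The locally downloaded COOP entries can be pictured as a small cache with a 15-minute age-out; the sketch below is plain Python with invented names, not the actual implementation.

import time

ENTRY_TIMEOUT = 15 * 60  # seconds

class LocalCoopCache:
    """Remote leaf's local copy of the relevant COOP entries (simplified)."""

    def __init__(self):
        self._entries = {}  # EP MAC -> (VTEP, time learned)

    def learn(self, ep, vtep):
        self._entries[ep] = (vtep, time.monotonic())

    def lookup(self, ep):
        """Return the cached VTEP, or None if unknown/stale (then ask the spines)."""
        hit = self._entries.get(ep)
        if hit is None:
            return None
        vtep, learned_at = hit
        if time.monotonic() - learned_at > ENTRY_TIMEOUT:
            del self._entries[ep]  # aged out after 15 minutes
            return None
        return vtep

cache = LocalCoopCache()
cache.learn("00:00:00:00:00:0c", "10.10.0.21")
print(cache.lookup("00:00:00:00:00:0c"))  # '10.10.0.21' -> forward directly, no spine proxy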

Remote leaf + Multipod without direct forwarding

S1 translates the src address of multicast traffic received from the IPN when head-end replicating (HREP) it towards the remote leafs in its own pod.

VTEP src from the remote leaf's standpoint:

  • LL2 for unicast traffic
  • RL-Ucast-TEP (pod 1) for the HREP multicast translated by S1
  • the same EP is seen behind two sources ⇒ MAC flapping

VLAN 5 – connection between pods (VRF2), announce RL TEP pool, announce pod pool

VLAN 4 – connection between a pod and the remote leafs of that pod (VRF1); announce RL TEP pool and pod pool; RL TEP pools received from any pod other than its own are denied via a route-map

After VLAN 5 is added, S2 sees the RL via S1; S1 translates the src address of any dataplane traffic sent towards the RL ⇒ no MAC flapping, every EP is known behind the RL-Ucast-TEP

With RL direct forwarding there is a direct tunnel between the RL and S2 for both unicast and multicast ⇒ no MAC flapping
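
To make the flapping argument concrete, here is a small Python sketch that just counts how often a MAC moves between source VTEPs; the frame sequences mirror the two situations above (labels LL2 / RL-Ucast-TEP as used in these notes, everything else invented).

def count_flaps(frames):
    """frames = [(mac, src_vtep), ...]; count moves of a MAC between VTEPs."""
    seen = {}
    flaps = 0
    for mac, vtep in frames:
        if mac in seen and seen[mac] != vtep:
            flaps += 1
        seen[mac] = vtep
    return flaps

# without translation on S1: unicast arrives from LL2, spine-relayed traffic from
# RL-Ucast-TEP (pod 1) -> the same MAC keeps moving between the two tunnels
print(count_flaps([("ep-a", "LL2"), ("ep-a", "RL-Ucast-TEP"), ("ep-a", "LL2")]))  # 2

# with VLAN 5 + src translation on S1 (or RL direct forwarding): one stable source
print(count_flaps([("ep-a", "RL-Ucast-TEP"), ("ep-a", "RL-Ucast-TEP")]))          # 0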