Multisite

  1. Stretched fabric
  2. Multipod
  3. Multisite
  4. Intersite L3Out
  5. Remote leaf
  6. vPod
  7. Remote leaf direct forwarding
  8. Remote leaf + Multipod without direct forwarding

Stretched fabric

  • max 3 sites
  • single fabric, single control plane
  • 10ms RTT on links between transit leaf and spine
  • after recovery from a failure, the spines resynchronize COOP

Multipod

  • 12 pods max
  • max 400 leafs overall
  • single fabric, isolated control plane
  • PIM BiDir, OSPF, DHCP relay, jumbo frames, 50ms RTT between spines, LLDP on IPN PE
  • unique address pool for VTEP in each pod
  • spines send their loopbacks as /32 towards IPN, proxy VTEP, anycast next-hop in EVPN + address for HW proxy in COOP DB
  • spines do not participate in PIM, authoritative spine sends IGMP Join for group address (election via IS-IS)
  • only spines with active uplinks to IPN can become authoritative
  • spine back-to-back not supported (no PIM) because directly connected spines may not become authoritative simultaneously
  • 40/100G between pods
  • spine/leaf serial numbers are carried in DHCP as a TLV
  • APICs in every pod get addresses from the seed pod's pool ⇒ announced as /32
  • each pod has its own pool, the range is advertised to IPN via OSPF, spines then redistribute to IS-IS (metric = 64)
  • spines in pod trigger DHCP only after leafs are detected
  • GIPo for ARP gleaning: 239.255.255.240, shared by all BDs; per-BD GIPo allocated from 225.0.0.0/8
  • stretched L3Out = stretched BD ⇒ routing adjacency with leafs in both pods; remote pod has bounce entry because MAC is known in other pod
  • L3Out prefixes are announced via MP-BGP → prefer local because of IGP tiebreaker
  • loopback:
    1. control plane TEP (CP-TEP): RID, BGP source intf
    2. dataplane TEP (E-TEP): EVPN next-hop, anycast, placeholder in COOP (does not participate in proxy)
    3. external MAC proxy: anycast
    4. external IPv4 proxy: anycast
    5. external IPv6 proxy: anycast
  • leaf creates dynamic tunnels to other leaf:
    1. on remote EP learning
    2. on MP-BGP route
  • the hardware proxy forwards traffic for endpoints in another pod to that pod's proxy, not directly to the other pod's leaf (see the sketch after this list)
  • location-based PBR to different FW pairs for east-west traffic is not supported: there is no state sync between the pairs, so asymmetric flows would be blackholed
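
The two-stage proxy behaviour above can be pictured with a minimal Python sketch. It only models the forwarding decision (local EP → leaf VTEP, remote EP → the other pod's proxy TEP); pod names, MAC addresses and TEP values are invented for the example.

# A minimal sketch (plain Python, not ACI code) of the Multipod hardware proxy:
# traffic for an endpoint in another pod is sent to that pod's anycast proxy TEP,
# never straight to the remote pod's leaf. All values below are hypothetical.

POD_PROXY_TEP = {
    "pod1": "10.1.0.1",   # anycast HW-proxy address of pod1's spines
    "pod2": "10.2.0.1",   # anycast HW-proxy address of pod2's spines
}

# COOP view of pod1's spines: local EPs point at their leaf VTEP,
# EPs learned via MP-BGP EVPN from pod2 point at pod2's proxy TEP.
COOP_POD1 = {
    "00:00:00:00:00:0a": "10.1.0.64",            # local EP behind a pod1 leaf
    "00:00:00:00:00:0b": POD_PROXY_TEP["pod2"],  # remote EP: next hop is pod2's proxy
}

def hw_proxy_forward(mac: str) -> str:
    """Destination TEP a pod1 spine proxy uses for an unknown-unicast frame."""
    return COOP_POD1[mac]

print(hw_proxy_forward("00:00:00:00:00:0a"))  # 10.1.0.64 -> directly to the local leaf
print(hw_proxy_forward("00:00:00:00:00:0b"))  # 10.2.0.1  -> via pod2's proxy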

Multisite

  • isolated fabrics, connection to MSO via OOB (MSO is outside ACI fabric)
  • vSphere 6 minimum
  • no stretched L4-L7
  • RTT between sites up to 1s
  • MSO:
    • 3 VMs with Docker Swarm
    • Active-active
    • 1 Gbps between VMs, RTT 150ms
    • ESXi 5.5, 4 vCPU, 8 GB RAM, 5 GB HDD
  • ports:
    1. TCP 2377: Docker Swarm mgmt
    2. TCP/UDP 7946: Docker Swarm container network discovery
    3. UDP 4789: VXLAN
    4. TCP 443
    5. ESP
  • 12 sites
  • spines host the VNID translation table between sites; it is filled by MSO via the APIC; the receiving spine performs translation
  • shadow EPGs are created locally for remote EPGs so that policy can be applied ⇒ stretched vzAny and unenforced VRF are not supported
  • translation creation:
    1. stretched EPG
    2. contract between EPGs on different sites
  • sites connected via IPN over VLAN 4
  • overlay unicast TEP on spine (O-UTEP) – src for VXLAN traffic, anycast dst for unicast
  • overlay multicast TEP (O-MTEP) – anycast dst for BUM
  • EVPN type-2 routes
  • BUM:
    1. headend replication ⇒ spine back-to-back is supported (2 sites only, spines cannot forward transit traffic); see the sketch after this list
    2. not affected by contracts
    3. optimize WAN bandwidth:
      • allocate mcast group for stretched BD (other BDs do not send traffic under this mcast group)
      • GIPo for stretched BD from separate pool, no unnecessary flood to and from local BD
    4. on receiving IGMP Join on leaf ⇒ update spine COOP ⇒ spine updates border leaf ⇒ border leaf sends PIM Join
  • EVPN speaker – communicates with other site, EVPN forwarder – communicates with speaker on its own site
  • no support for rogue EP control between sites
  • spine replaces src IP of VXLAN for O-UTEP ⇒ leafs on other sites learn EP as connected to spine
  • in Multipod each pod is assigned separate O-UTEP; O-MTEP is common
  • PBR is implemented by provider leaf for east-west traffic ⇒ all traffic goes through 1 PBR node on 1 site
  • PBR is implemented by compute leaf for north-south traffic ⇒ PBR node on the site where EPG is located; however, L3Out for return traffic might be different (asymmetric routing)
  • PBR to node on another site is not supported, redir_override action:
    1. if node local to site – redirect
    2. if node remote to site – permit
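
A short Python sketch of the headend replication and WAN-bandwidth optimization mentioned in the BUM list above: the local spine sends one unicast VXLAN copy per remote site where the BD is stretched, addressed to that site's O-MTEP. Site names, addresses and the BD-to-site map are assumptions made up for the example.

O_MTEP = {"site1": "172.16.1.1", "site2": "172.16.2.1", "site3": "172.16.3.1"}

# which sites each BD is stretched to ("optimize WAN bandwidth": only these get a copy)
BD_STRETCHED_TO = {"bd-web": {"site1", "site2"}, "bd-db": {"site1"}}

def headend_replicate(local_site: str, bd: str) -> list:
    """Unicast destinations a spine in local_site replicates one BUM frame to."""
    return [O_MTEP[s] for s in sorted(BD_STRETCHED_TO[bd]) if s != local_site]

print(headend_replicate("site1", "bd-web"))  # ['172.16.2.1'] - one copy to site2's O-MTEP
print(headend_replicate("site1", "bd-db"))   # [] - BD not stretched, nothing leaves the site
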
# show dcimgr repo vnid-maps
# show dcimgr repo sclass-maps

Ingress spine performs VNID and class ID translation
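
As an illustration of what those translation tables do, here is a tiny Python sketch (not ACI code) of the lookup the receiving spine performs; the numeric VNIDs and class IDs are invented.

# remote (site, VNID) -> local VNID and remote (site, class ID) -> local class ID;
# conceptually what `show dcimgr repo vnid-maps` / `sclass-maps` display
VNID_MAP = {("site1", 15007706): 16383902}
SCLASS_MAP = {("site1", 49155): 32771}

def translate(src_site: str, vnid: int, sclass: int) -> tuple:
    """Rewrite VNID and class ID into the local site's namespace."""
    return VNID_MAP[(src_site, vnid)], SCLASS_MAP[(src_site, sclass)]

print(translate("site1", 15007706, 49155))  # -> (16383902, 32771)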

Initially, policy is enforced on the egress leaf ⇒ the remote EP is learned via the dataplane ⇒ policy for return traffic can then be enforced immediately on the ingress leaf
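
A sketch (plain Python, assumptions only) of that enforcement decision: until the ingress leaf has learned the remote EP, and hence its class ID, via the dataplane, it forwards unenforced and the egress leaf applies the contract.

def enforcement_point(dst_ep: str, known_remote_eps: set) -> str:
    """Where the contract is applied for a given destination EP."""
    return "ingress leaf" if dst_ep in known_remote_eps else "egress leaf"

learned = set()
print(enforcement_point("ep-b", learned))  # 'egress leaf'  (first packets of the flow)
learned.add("ep-b")                        # return traffic teaches the leaf ep-b's class
print(enforcement_point("ep-b", learned))  # 'ingress leaf' (subsequent traffic)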

Intersite L3Out

  • allows using remote L3Out for egress traffic
  • incompatible with GOLF, remote leaf, CloudSec
  • border leafs receive TEP addresses from external pool, spines announce /32
  • prefixes from L3Out are announced between spines via VPNv4
  • return traffic to the border leaf goes directly from the compute leaf, with no multisite translation ⇒ the border leaf does not learn the class ID via the dataplane ⇒ policy is always enforced on the compute leaf (if the first packet is originated on the compute side, policy is not enforced for it); src IP is still translated to the O-UTEP for compute → L3Out traffic (see the sketch after this list)
  • transit routing: policy is enforced on the first leaf, no src IP translation, no pcTag assigned
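
A minimal sketch (plain Python, my own flow labels) summarizing where policy lands in the intersite L3Out cases above.

ENFORCEMENT = {
    "endpoint <-> remote L3Out": "compute leaf (border leaf never learns the class ID via dataplane)",
    "transit L3Out <-> L3Out":   "first (ingress) leaf, no src IP translation, no pcTag assigned",
}

for flow, where in ENFORCEMENT.items():
    print(f"{flow}: policy enforced on {where}")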

Remote leaf

  • no EVPN over the IPN; the remote leaf runs COOP with the spines across the IPN
  • no need for PIM BiDir because traffic is sent unicast to the spine
  • requires VLAN 4 on uplink towards IPN
  • TEP:
    1. RL-DP-TEP: per remote leaf, src address for VXLAN for orphan
    2. RL-vPC-TEP: anycast per vPC pair, src address for VXLAN for vPC
    3. RL-Ucast-TEP: anycast on spine; dst for VXLAN unicast from remote leaf
    4. RL-Mcast-TEP: anycast on spine; dst for BUM flooding from remote leaf; it does not matter which spine receives it because all of them are connected to the FTAG trees
    • the spine rewrites the src address to the RL-Ucast-TEP for traffic sent towards the remote leaf, whether it originates in its own pod or another; no rewrite for traffic from the remote leaf (see the sketch after this list)
  • external TEP pool: instead of announcing the whole pod pool, only the spine, APIC and border-leaf addresses are announced
  • uses DHCP to receive addresses for uplink and TEP
  • fabric ports can be used to connect vPC pair back-to-back, preferred over connection via uplink
  • vPC peers synchronize EPs directly, not via ZeroMQ, because the IPN may fail ⇒ better to deploy two remote leafs as a vPC pair even if they only have single-homed downlinks
  • no support for stretching a BD to a remote leaf across different sites in Multisite (neither remote leaf – remote leaf nor remote leaf – local leaf)
  • inter-VRF traffic for a vPC pair goes through the WAN from its own address to its own address with a VNID rewrite – policy is enforced in the egress VRF; for orphan ports traffic passes through the spine as usual; this depends on COOP, i.e. on the spines
  • local PBR between EP via service node if vPC is used
  • if all uplinks fail, downlinks are shut down
  • MP-BGP session with external RR of every pod, COOP – with own pod only
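
The TEP roles above determine the VXLAN source and destination a remote leaf uses; see this small Python sketch (not ACI code, addresses invented).

RL_DP_TEP    = "10.10.0.11"  # per remote leaf
RL_VPC_TEP   = "10.10.0.12"  # anycast per vPC pair
RL_UCAST_TEP = "10.0.0.1"    # anycast on the spines, unicast destination
RL_MCAST_TEP = "10.0.0.2"    # anycast on the spines, BUM destination

def rl_vxlan_headers(vpc_attached: bool, bum: bool) -> tuple:
    """(src TEP, dst TEP) for a frame the remote leaf sends into the fabric."""
    src = RL_VPC_TEP if vpc_attached else RL_DP_TEP
    dst = RL_MCAST_TEP if bum else RL_UCAST_TEP
    return src, dst

print(rl_vxlan_headers(vpc_attached=True,  bum=False))  # vPC-attached EP, unicast
print(rl_vxlan_headers(vpc_attached=False, bum=True))   # orphan EP, BUM flood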

vPod

  • EVPN in IPN
  • vSpine + vLeaf + AVE
  • vSphere only

Remote leaf direct forwarding

  • direct communication between remote leafs in own site (including RL in other pods)
  • Multisite: communication through spine from own pod, spine rewrites src address to O-UTEP
  • L2/L3 proxy – in software
  • the relevant part of the COOP DB is downloaded locally; entry timeout is 15 minutes (see the cache sketch after this list)
  • spines do not forward BUM received from remote leafs because the RLs replicate it themselves
  • if spines in own pod fail, RL establishes COOP session to spines in other pod (knob)
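
The locally downloaded COOP entries can be pictured as a small cache with a 15-minute age-out; the sketch below is plain Python with invented names, not the actual implementation.

import time

ENTRY_TIMEOUT = 15 * 60  # seconds

class LocalCoopCache:
    """Remote leaf's local copy of the relevant COOP entries (simplified)."""

    def __init__(self):
        self._entries = {}  # EP MAC -> (VTEP, time learned)

    def learn(self, ep, vtep):
        self._entries[ep] = (vtep, time.monotonic())

    def lookup(self, ep):
        """Return the cached VTEP, or None if unknown/stale (then ask the spines)."""
        hit = self._entries.get(ep)
        if hit is None:
            return None
        vtep, learned_at = hit
        if time.monotonic() - learned_at > ENTRY_TIMEOUT:
            del self._entries[ep]  # aged out after 15 minutes
            return None
        return vtep

cache = LocalCoopCache()
cache.learn("00:00:00:00:00:0c", "10.10.0.21")
print(cache.lookup("00:00:00:00:00:0c"))  # '10.10.0.21' -> forward directly, no spine proxy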

Remote leaf + Multipod without direct forwarding

S1 translates the src address of multicast traffic received from the IPN when head-end replicating (HREP) it towards the remote leafs in its own pod.

VTEP src from the remote leaf's standpoint:

  • LL2 for unicast traffic
  • RL-Ucast-TEP (pod 1) for the HREP multicast translated by S1
  • the same EP is seen behind two sources ⇒ MAC flapping

VLAN 5 – connection between pods (VRF2), announce RL TEP pool, announce pod pool

VLAN 4 – connection between a pod and the remote leafs of that pod (VRF1); announce RL TEP pool and pod pool; RL TEP pools received from any pod other than its own are denied via a route-map

After VLAN 5 is added, S2 sees the RL via S1; S1 translates the src address of any dataplane traffic sent towards the RL ⇒ no MAC flapping, every EP is known behind the RL-Ucast-TEP

With RL direct forwarding there is a direct tunnel between the RL and S2 for both unicast and multicast ⇒ no MAC flapping
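
To make the flapping argument concrete, here is a small Python sketch that just counts how often a MAC moves between source VTEPs; the frame sequences mirror the two situations above (labels LL2 / RL-Ucast-TEP as used in these notes, everything else invented).

def count_flaps(frames):
    """frames = [(mac, src_vtep), ...]; count moves of a MAC between VTEPs."""
    seen = {}
    flaps = 0
    for mac, vtep in frames:
        if mac in seen and seen[mac] != vtep:
            flaps += 1
        seen[mac] = vtep
    return flaps

# without translation on S1: unicast arrives from LL2, spine-relayed traffic from
# RL-Ucast-TEP (pod 1) -> the same MAC keeps moving between the two tunnels
print(count_flaps([("ep-a", "LL2"), ("ep-a", "RL-Ucast-TEP"), ("ep-a", "LL2")]))  # 2

# with VLAN 5 + src translation on S1 (or RL direct forwarding): one stable source
print(count_flaps([("ep-a", "RL-Ucast-TEP"), ("ep-a", "RL-Ucast-TEP")]))          # 0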