ACI Common

  1. Deployment
  2. ACI
  3. DCNM
  4. NIR
  5. NIA
  6. NAE
  7. APIC Cloud
  8. Design
  9. Discovery
  10. APIC
  11. Spine
  12. Leaf
  13. Tier-2 leaf
  14. Endpoint
  15. Host tracking
  16. GARP-based EP move detection
  17. EPG
  18. vzAny
  19. µEPG
  20. ESG
  21. Stale endpoints (policy enforcement = ingress)
  22. Underlay
  23. Overlay
  24. iVXLAN
  25. BD
  26. VRF
  27. Tenants
  28. AAEP
  29. Policy
  30. Contracts
  31. Taboo contract
  32. pcTag
  33. Shared services
  34. Contract HW
  35. Service chaining
  36. Multicast
  37. Load-balancing
  38. vPC
  39. STP
  40. FC
  41. MCP
  42. L3Out
  43. External subnets
  44. OSPF
  45. BGP
  46. EIGRP
  47. Floating L3Out
  48. Encap mode
  49. Transit routing
  50. WAN integration
  51. IP SLA
  52. Track
  53. Route-map
  54. QoS
  55. Firmware image
  56. Faults
  57. Config rollback
  58. AAA
  59. Roles
  60. CoPP
  61. Atomic counters

Deployment

  • multi-pod: stretched APIC cluster
  • multi-site: separate APIC clusters + MSO
  • vPOD: vSpine + vLeaf + AVE (APIC + 2 * vAPIC on main site)
  • stretched fabric: transit leafs connect the spines of different sites

ACI

  • supports FCoE (native FC from 4.0) locally on a leaf as NPV (uplink can also be FCoE); traffic does not go into the fabric
  • protocols:
    1. IS-IS: underlay
    2. COOP: discover endpoints, EVPN twin
    3. BGP: external connectivity, EVPN from spine towards IPN, VPNv4 within fabric
    4. PIM BD
    5. OpFlex: managing AVS
  • MTU > 1600, RTT < 50ms between spines
  • consistent policy
  • no NAT support
  • spine: X9700, N9500, N9300
  • leaf: X9500, N9500, N9300
  • NX-OS: X9400/X9600, N9200
  • consistent VMM policy for various systems
; Switch from iBash into NX-OS, software level cmds
# vsh

; Hardware level, pizza boxes
# vsh_lc

; Hardware level, modular; from vsh
# attach module <module>

; PTEP addresses
# acidiag fnvread

DCNM

  • supports various Nexus models, not only N9k
  • automated underlay/overlay
  • programmable fabric (programmable network ≡ API)
  • license mode: 1) switch-based: by SN, only SAN 2) server-based: by MAC of mgmt interface, SAN + LAN
    • cannot be mixed

NIR

  • network insights resources
  • detect anomalies
  • tshoot (TCAM, CPU, RAM, temperature)
  • EP statistics, resource utilization, flows

NIA

  • network insights advisor
  • proactive
  • notifies about vulnerabilities, bugs
  • checks limits of HW per version

NAE

  • network assurance engine
  • check config compliance to the policy
  • network modelling

APIC Cloud

  • uses provider API for policy implementation

Design

  • for 10/40/100G oversubscription – 3:1 or 4:1
  • 80 leafs (3 APIC), 200 leafs (5 APIC) max
  • 6 spine per pod, 24 spine per fabric max
  • 20 FEX per leaf, 650 per fabric, 576 ports per switch
  • tenants: 1000 (3 APIC), 3000 (5 APIC)
  • VRF: 1000 (3 APIC), 3000 (5 APIC)
  • contracts: 10000
  • filters: 10000
  • EPG: 4000 (single tenant), 500 per tenant up to 21000, 400 isolation-enabled EPG
  • BD: 15000
  • service chain: 1000

Discovery

  • LLDP TLV exchange, then the node receives its VTEP address via DHCP together with an option carrying the bootfile address
  • CDP/LLDP can discover non-ACI switches and put them into unmanaged nodes group (e.g. blade switch)
  • LLDP for discovering infra VLAN
  • manual admission to fabric
  • in LLDP TLV there is a vector (APIC ID, APIC IP, APIC UUID) – APIC discovery (mismatch – report to APIC)
  • between APIC and node – IFM (intra-fabric messaging) heartbeat over TLS
; delete fabric config; adding to fabric requires reload
# acidiag touch clean
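A quick wiring check for the discovery above is the standard LLDP view on the node (output fields vary by release):
; LLDP neighbors seen on the node (inter-switch and APIC links)
# show lldp neighbors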

APIC

  • image management
  • DHCP/TFTP server during discovery
  • stores 3 copies of each DB shard: with 3 APICs every APIC holds a copy, with more APICs shards are placed randomly ⇒ adding APICs only raises the number of allowed leafs, it does not add fault tolerance
  • APIC-L for > 1000 physical ports, APIC-M for less
  • 3-APIC quorum ≠ data consistency (shards can be RO or lost)
  • if shard is lost (all 3 APICs holding it are down) – restore from snapshot
  • active-standby uplink
  • in-band preferred over out-of-band
  • can host containers, subnet 172.16.0.0/16
  • shard leader – RW, others in RO
  • after split-brain restore according to timestamps
  • if fabric starts with 1 or 2 APICs – RW
  • after all APICs failed, new APICs can read config and state from fabric (VTEP, SGT, VNID,.. – fabric ID recovery)
  • APIC cluster has same version (exception – during upgrade)
  • bond0.<infra VLAN> = TEP pool + APIC ID (e.g. 10.0.0.3)
; bond settings on APIC in /proc/net/bonding/bond0
# acidiag run lldptool out|in eth2-1
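For the cluster/shard behaviour above, two APIC-level checks are commonly used (output format may differ per release):
; APIC cluster members and health (appliance vector)
# acidiag avread
; replica (shard) state per service
# acidiag rvread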

Spine

  • holds COOP DB for all endpoints (global proxy table)
  • synchronize COOP DB between each other
  • BGP RR
  • anycast proxy VTEP for unknown unicast
  • IP connection towards IP network in multipod/multisite
  • pushes COOP updates down to a leaf only to install bounce entries
  • COOP = ZeroMQ + DHT
  • always MAC-VTEP mapping; IP-VTEP – if IP routing is enabled on BD (subnet might be empty)
  • when proxying, changes the outer dst IP but does not change the outer src IP, so data plane learning still works
  • if there is no entry in the COOP DB for the EP, runs ARP gleaning on all leafs where the BD is deployed instead of dropping the packet; only for ARP and IP unknown unicast, src MAC for gleaning = BD MAC; for unknown MAC – drop
  • COOP dampening:
    1. enabled by default, can be disabled only via API
    2. penalty: per IP, every COOP event brings weighted penalty; after 10 mins decreases by 75%
    3. state:
    – normal
    – critical
    – freeze (> 10000 penalty, or longer than 5 mins in critical)
    – non-dampening (< 2500 ⇒ freeze → normal)
# show coop internal info ep <BD VNID> <MAC>
# show coop internal info ep <VRF VNID> <IP> 

Leaf

  • policy enforcement as soon as dst EPG is known (ingress or egress leaf)
  • registers remote endpoint location via COOP by spine
  • learns MAC/IP by ARP snooping (local station table) or via data plane (for VTEP)
  • randomly chooses spine for COOP registration (anycast)
  • includes src EPG into VXLAN header
  • passes ARP through instead of proxying it, in order not to break servers that rely on GARP
  • normalization: translates other encapsulations into VXLAN
  • on VM move:
    1. new leaf forwards gratuitous ARP to old leaf, updates COOP
    2. old leaf forwards traffic towards VM to new leaf (bouncing) after receiving push via COOP
    3. bounce-to-proxy (D flag): usually for IP that has its MAC moved (bounce exists) but IP is not learned yet
    4. bounce does not change src IP (sIPo) of VXLAN to its own VTEP
  • anycast gateway
  • only straight-through FEX, only on downlinks (for 40G – breakout)
  • can pass ARP unicast to VTEP (only when routing is enabled, otherwise always flood) or spine (if dst is unknown)
  • enable unicast routing ≡ default GW + EP learning (IP-VTEP mapping)
  • limit IP learning to subnet ≡ uRPF for learning in BD
  • generations:
    1. ALE: application leaf engine
    2. LSE: leaf-spine engine
  • native VLAN (= untagged) – per leaf, not per port
  • processes:
    • supervisor:
      1. endpoint manager (EPM): learns EP, passes to APIC and vPC peer
      2. ethernet Lif table manager (ELTM): OpFlex → ASIC (intf, VLAN, VRF)
      3. unicast RIB
      4. policy manager: OpFlex → ASIC (contract)
    • linecard:
      1. HAL
      2. endpoint manager client (EPMC)
      3. ELTMC
      4. uFIB
      5. ACLQOS
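A minimal way to inspect the EP learning described above on a leaf (flags L/VL/O, bounce, VTEP), assuming the standard leaf CLI:
; detailed EPM view of an endpoint
# show system internal epm endpoint ip <ip>
# show system internal epm endpoint mac <mac>
; summarized endpoint table
# show endpoint ip <ip>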

Tier-2 leaf

  • from 4.1(1)
  • change of downlink → fabric port on leaf only after discovering tier-1 leaf by APIC ⇒ if APIC is connected to tier-2, leaf must be connected by fabric ports
  • incompatible with remote leaf

Endpoint

  • types:
    1. local: flag L or VL (virtual local ≡ behind AVS/AVE)
    2. remote: no flag
    3. on-peer: flag O, orphan port on vPC peer
  • learning:
    • remote: conversational learning ≡ cache
    • learning IP+MAC via data plane or via *ARP, DHCP; IP packets only, routed, not switched
  • hardware proxy: sends packets for an unknown EP to the spine proxy in case of ARP, unicast L2, unicast L3 + dst BD is local (cannot route)
  • if ARP flooding is disabled, leaf itself sends ARP request for unknown IP (ARP gleaning) after receiving message from spine that it doesn’t know about EP; sends locally + GIPo
  • if dst IP for L3 packet is not in BD on ingress leaf, then routing for the prefix (not COOP) via spine proxy (pervasive route); no remote endpoint learning → contract applied on egress leaf, learning happens on return traffic
  • endpoint loop protection: BD learn disable or port shutdown, when EP moves too frequently; can distinguish single EP flap when the problem is only with single EP (NIC teaming) and there is no loop
  • rogue EP control: freezes the COOP entry (VTEP + port) if the EP moves too frequently (static EP + DL bit)
  • setting DL flag in VXLAN → do not learn src IP – VTEP binding (needed for L4-L7 PBR)
  • EP announce: when bounce times out, announce removal of corresponding entry on border leaf and compute leaf
  • by default aging timer is renewed by any IP/MAC for EP → IP can get stuck; IP aging policy enables timers per IP (System Settings → EP Controls → IP aging)
  • when EP moves, on the old leaf appears bounce (10 mins by default); when a packet hits it, leaf sets E bit and sends it towards new leaf
  • hardware proxy sets E bit because COOP has up-to-date info
  • if MAC moves but IP location is not clear, old leaf sets bounce: MAC → new leaf, IP → spine-proxy
  • if EP sends traffic from IP that falls under L3Out (except 0.0.0.0/0) but does not fall under BD subnet (spoofing), then 2nd generation does not learn this remote EP via dataplane
; FD VLAN – BD VLAN mapping
# vsh_lc -c "show system internal eltmc info vlan access_encap_vlan "
; BD, VLANs, flags
# vsh_lc -c "show system internal epmc endpoint mac "
# clear system internal epm endpoint key vrf  ip  

Host tracking

  • only for IP, cannot be disabled
  • when 75% of the aging timeout has passed, the leaf sends 3 ARP requests
  • if a packet is received from the IP within the first 75% – a flag is set, but the timer is not reset
  • the timer is reset at the 75% mark only if the flag is set ⇒ an EP can get stuck for 2 intervals

GARP-based EP move detection

  • by default cannot detect IP move to a new MAC within same interface and EPG
  • usecase: different VMs on one host, one IP (another VM is turned off)
  • learns IP-MAC mapping via GARP on the same port and EPG
  • requires unicast routing and ARP flooding

EPG

  • endpoint group, mapped to VLAN internally, distinguished in data plane by port + encap
  • resolution:
    1. physical port (leaf or FEX)
    2. port group (≡ VLAN/VXLAN)
    3. VXLAN VNID, NVGRE
    4. VLAN ID (associated in AAEP config)
    5. subnet (µEPG)
    6. mcast group
    7. attributes (VM tags, guest OS, MAC), µEPG
  • intra-EPG contract = PVLAN + proxy ARP ⇒ traffic is routed
  • 3960 EPG+BD per leaf (594 VLAN IDs reserved)
  • preferred group membership:
    • EPGs that do not need contracts between each other; cannot be a provider
    • increases TCAM consumption:
      • changes implicit deny to permit
      • adds deny rules between EPG
  • if an EPG is a provider for a shared service, the subnet has to be configured under the EPG for proper route leaking
  • static leaf binding (to VLAN) makes all ports L2 ⇒ L3Out cannot be configured (solution – bind to port + VLAN)
  • labels can dictate which consumer EPG can communicate with provider EPG (exact match)
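To see how encap VLANs map to EPGs/BDs and internal (PI) VLANs on a leaf, one commonly used check (format varies by release):
; encap VLAN ←→ internal VLAN ←→ EPG/BD mapping
# show vlan extended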

vzAny

  • saves TCAM
  • cannot be provider for shared resource
  • permit all return TCP traffic ≡ match established, consumer + provider (stateful in ACI ≡ match ACK)
  • includes all EPGs in VRF, including L3Out
  • if vzAny is a consumer for EPG from another VRF, L3Out in own VRF announces EPG subnet from another VRF
  • does not trigger contract on ingress leaf if EPG falls under vzAny (0 → 0 or 0 → EPG) regardless of remote EP
  • incompatible with policy compression

µEPG

  • resolution immediate: does not need to classify traffic as EPG after vMotion with DVS (no OpFlex ⇒ APIC required, leaf cannot pass policy via OpFlex)
  • deployment immediate
  • PVLAN-based: different VLAN → no local switching, everything goes through the leaf
  • precedence (DVS):
    1. best: IP addr (MAC if AVS)
    2. MAC (IP if AVS)
    3. vNIC domain name
    4. VM ID
    5. VM name
    6. hypervisor ID
    7. VMM domain
    8. vCenter DC
    9. custom attributes
    10. OS
    11. tag
    • MAC for switched, IP – for routed (requires proxy ARP for intra-subnet traffic ⇒ intra-EPG contract/isolation)
  • EPG match precedence: which precedence to start comparing from if there is attribute tie
    • operator precedence: equals > contains > starts with > ends with
  • does not belong to EPG, assigned automatically (no port group in VMM); master EPG required for BD and QoS class
  • segmentation at VM level, does not depend on VLAN or subnet
  • classification:
    1. OpFlex supported (AVE): VLAN/VXLAN ID
    2. no OpFlex (vDS): MAC VM

ESG

  • endpoint security group
  • not tied to BD, has no network config
  • ESG-ESG, ESG-L3Out contracts
  • classified by IP, EPG, tag
  • only for routed

Stale endpoints (policy enforcement = ingress)

DL bit for IP2→L3Out→IP1 refreshes aging on leaf1 but not the entry itself (VTEP to leaf3) ⇒ IP2 stale

There is no bounce on leaf2, because L3Out does not learn EP (IP2) ⇒ formally there is no EP move

Solution:

  • manual clear
  • enforce subnet check + remove BD subnet (removes EP by EP announce)

If IP1 sends traffic only to L3Out, then DL bit is set so no remote EP entries are created. IP1 → IP2 creates remote EP on leaf3

IP1 → L3Out refreshes aging on leaf3 without refreshing VTEP (DL bit). After bounce on leaf1 expires, L3Out → IP1 is blackholed

Solution:

  • disable remote EP learn
  • EP Announce (after bounce expires)

Underlay

  • unnumbered intf, only loopbacks are numbered (/32 to VTEP, vPC VIP, APIC, spine proxy IP)
  • L1 IS-IS, system ID from reserved PTEP (10.0.128.70 → 46:80:00:0A:00:00)
  • common RIB with OOB mgmt in Linux kernel
  • at least /22 subnet, /16 recommended; change only via fabric reset
  • port tracking: when a leaf loses its uplinks, its downlinks are disabled ⇒ hosts are forced to switch NIC
  • supports BFD
  • leafs have anycast fabric VTEPs for connection with vSwitch
  • all intf routed, connection via subints in VLAN2 (subint ID not fixed)
  • 000C.0C0C.0C0C – MAC for every fabric intf on spine, 0D0D.0D0D.0D0D – on leafs
  • loopback 0 = PTEP
; VTEP roles
# show isis dteps vrf overlay-1
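Two common underlay checks on a leaf or spine (standard CLI):
; IS-IS adjacencies on fabric links
# show isis adjacency vrf overlay-1
; underlay routes towards other VTEPs
# show ip route vrf overlay-1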

Overlay

  • VXLAN identifies VRF, BD ID (for mcast and IPv6), EPG (VXLAN-GPO)
  • flood & learn:
    • only 2 hosts in BD
    • silent host even for ARP request
    • cannot lose initial packet from host
  • ARP gleaning: ARP request about unknown IP
  • if HW proxy does not know the EP location – drop ≡ silent host issue (switching a BD from ARP flooding to HW proxy requires a refresh from the host)
  • endpoint loop protection: endpoint flap (default: 4 times in 60s) → port disable or BD EP learning disable
  • VRF leaking on ingress leaf (has necessary routes + VNID from target VRF)
  • class ID ≡ pcTag ≡ SGT
  • global station table (GST): IP/MAC to remote VTEP mapping, conversation-based, COOP DB subset
  • local station table (LST): IP/MAC local
  • UDP port 48879

iVXLAN

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  VXLAN flags  |R|D|E|S|D| Rsv |            Source             |
+               | |L| |P|P|     |            group              +
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     VNID                      |   Reserved    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

DL (don’t learn) – 1 = disable remote learning for this packet
E (exception) – 1 = leaf cannot send the packet back to fabric (e.g. bounce)
SP (source policy applied)
DP (dst policy applied)
DL disables learning EP but not refreshing aging timers ⇒ stale EP

BD

  • BUM spreading zone
  • legacy mode: BD ≡ EPG ≡ VLAN in HW
  • flood in encapsulation:
    1. BUM and control plane – only within VLAN (not BD), exactly FD_VLAN
    2. proxy ARP for traffic between VLAN
    3. does not support IPv6 and mcast routing
  • if unicast routing disabled, dataplane EP learning via ARP remains (subnets have no effect)
  • disabling unicast routing or removing subnet – trigger EP announce for IP EP to remove it from leafs where BD is not deployed
  • disabling IP data plane learning disables only IP learning + sets DL bit in iVXLAN
  • limit local IP learning to subnets: does not learn IP outside the subnet; does not impact MAC learning or remote EP
  • subnet in BD ≡ subnet in EPG, except for leaking
  • platform-independent VLANs were needed between Broadcom ASIC (front ports) and Cisco ASIC (fabric port, VXLAN)
# show system internal eltm info vlan brief
# show system internal epm vlan all

VRF

  • policy enforcement:
    1. ingress: applied on compute leaf for ingress and egress, border does not filter traffic ≡ distributed, less TCAM utilization on border
    2. egress: L3Out → EPG filtering on border
    • disable remote EP learn:
      • disables learning on border leaf via data plane (no need since no policy is enforced)
      • fixes bug with 1st gen leaf → L3Out
      • compute sets DL bit for all traffic towards border, disables only IP learning (MAC remains)
  • disabling IP data plane learning disables learning only for IP; MAC is learned in dataplane, IP learned via control plane
  • enforce subnet check (global):
    • for routed traffic, does not learn MAC or IP if the IP is not from the BD subnet
    • for a remote EP, the IP has to be from a subnet in the VRF
    • bug: disables MAC learning from ARP traffic in L2 BD (routing disabled)
  • disabling IP dataplane learning for subnet BD or EPG:
    1. does not learn IP via dataplane
    2. learns MAC if bridged
    3. does not learn IP or MAC from ARP request
    4. local EP learns MAC and IP from ARP reply, remote EP learns only MAC
    • requires L2 unknown unicast = flood because of point 3.

Tenants

  • default:
    1. infra:
      • intra fabric communication (SW-SW, SW-APIC)
      • infrastructure VLAN 3967 – supported by all platforms
    2. mgmt:
      • same VRF as infra, same RIB in the APIC Linux kernel
      • to apply contracts on APIC, APIC addresses have to be added to EPG (Node Mgmt Addresses)
    3. common:
      • can be used for shared services
      • resources are visible to other tenants
      • object names have to be unique system-wide (on a name conflict, the tenant's own object takes priority over the one from common)
  • Operational → Flows shows flows for log action in contracts via CoPP (e.g. 0 → 0 = deny, log)
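Tenant-level objects can be listed straight from the object model on the APIC (moquery is the standard APIC CLI tool):
; all tenants / all VRFs
# moquery -c fvTenant
# moquery -c fvCtx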

AAEP

  • attachable access entity profile
  • connects domain and intf policy
  • deploy EPG as VLAN
  • VLAN pool receives a list of VNIDs ⇒ different VLAN pools = different FD_VLAN and different VNID for the same VLAN ⇒ confines BPDU flooding
  • enforce EPG VLAN validation: if EPG is associated with domains that have overlapping VLAN pools, then overlapping VLAN assignment is prohibited (non-deterministic because different VNID could be assigned ⇒ sync fail for vPC EPM)
  • local scope VLAN: allows different EPGs on the same leaf to reuse the same port VLAN; the EPGs must be in different BD ≡ different FD_VLAN

Policy

  • interface policy → LLDP “default” is used for ZTP
  • not applied on SW in config zone in locked deployment mode
  • config zone – only for infra tenant + fabric policies
  • interface policy group ≡ vPC port-channel, different policy groups ≡ different vPC port-channels

Contracts

  • scope:
    1. global
    2. tenant
    3. application profile: not compatible with L2Out or L3Out
    4. VRF
    • defines which EPGs can use the contract ≡ which EPGs can potentially communicate (pervasive route for the BD subnet via spine proxy)
  • labels can refine further which subjects can refer to which EPG (exact match)
  • dynamic inheritance (depth = 1), can inherit from different master EPGs
  • for intra-EPG contract – provider, ≡ isolation + contract
  • subject tie-breakers:
    1. contract between EPGs > vzAny > implicit deny
    2. deny priority:
      • lowest: vzAny–vzAny (17-20)
      • medium: EPG–vzAny (13-15) or vzAny–EPG (14-16)
      • highest: EPG-EPG
      • default: contract level
    3. if tie on certain level:
      • deny > redirect
      • redirect > permit if the filter is equal; otherwise – the more specific filter wins
    4. at TCAM level – priority value: the lower the number, the higher the priority of the entry
    5. a more specific L4 filter has higher priority
    6. specific dst > specific src
  • compression: filter is not reused between contracts; filter is reused within a contract for different EPGs (reference instead of filter); permit only
  • permitted by default (no contract needed): DHCP, NDP (NS, NA), OSPF, EIGRP, PIM, IGMP, ARP
  • applied on ingress leaf, exception – no remote EP in COOP (EPG routing, L3Out)
  • provider exports contract to other tenants
  • by default, for IP fragments – drop
  • labels:
    1. EPG level: contract is applied to EPGs with same label; EPG config
    2. subject level: subject in contract specifies allowed provider label and consumer label; EPG and subject config
    3. contract level: sets EPG label or subject label only for this contract
    • inherited labels apply only to inherited contracts; configured labels or contracts do not interact with inherited ones
    • do not affect service graphs with copy/redirect; not implemented in TCAM, only in software, but they impact TCAM programming
    • incompatible with policy compression
# show zoning-rule
# show zoning-filter
; TCAM utilization
# vsh_lc -c "show platform internal hal health-stats"

Taboo contract

  • allows overriding part of a contract, e.g. denying otherwise permitted traffic
  • has higher priority than regular contracts
  • always provider

pcTag

  • range:
    1. local: 16386 – 65535, unique within VRF; 0x4000 – 0xFFFF
    2. global: 16 – 16385; assigned to provider EPG for shared service
    3. reserved:
      • 0: vzAny
      • 1:
        • traffic to CPU, implicit permit: spine-proxy, ARP, mcast
        • pervasive route: BD subnet, L3Out SVI subnet
      • 10: µEPG in private VLAN without mapping = classification on VLAN instead of MAC
      • 13:
        • for 0.0.0.0/0 L3Out from different VRF or all routes from L3Out that do not fall under shared security import
        • dst pcTag; if src, then pcTag = VRF pcTag
      • 14:
        • consumer VRF inside provider VRF
        • implicit permit because policy is enforced in consumer VRF for both traffic directions
      • 15:
        • 0.0.0.0/0 from all L3Out, does not relate to specific L3Out ⇒ contract for 0.0.0.0/0 from L3Out-1 are also applied for 0.0.0.0/0 from L3Out-2
        • dst pcTag; if src, then pcTag = VRF ID
  • change pcTag – trigger EP Announce

Shared services

  • provider exports contract, inside consumer tenant – consumed contract interface
  • subnet in provider VRF configured under EPG – allows to determine sclass and apply contract in consumer VRF
  • flag on subnet – shared between VRF
  • way to leak BD subnet from provider: assign provider as service consumer in consumer VRF (creates pervasive route backwards as well)
  • routing between VRFs – always through spine (pervasive route) even if both VRFs are deployed on the same leaf; reason – no remote EP learning (ingress leaf sets DL bit): needed to always use pervasive route because of VNID rewrite on ingress leaf
  • all contracts enforced in consumer VRF
  • if L3Out EPG is consumer, contract is applied on ingress leaf (not only in consumer VRF but in provider VRF as well ⇒ no rule with pcTag = 14 in HW)
  • if ingress enforcement, then contract enforced on compute (even if provider VRF is located on compute) ⇒ L3Out has global pcTag
  • EPG is in preferred group, L3Out in another VRF – consumer:
    1. contract is applied on the ingress leaf
    2. in the provider VRF there is an entry “EPG → 0 = deny” to deny traffic not falling under the contract
    3. as a result, EPG cannot communicate with preferred group

Contract HW

  • if packet hits implicit entry, SP and DP bit are not set; same with vzAny as src EPG (no src EPG in zoning)
  • implicit rules are added if VRF is in enforced mode
  • implicit rules:
    1. 1 → 0
      • priority = 0
      • for traffic from CPU (BD SVI, L3Out)
    2. 0 ←→ 10
      • priority = 2
      • denies µEPG traffic if it’s classified by VLAN instead of MAC
    3. EPG ←→ EPG
      • priority = 3
    4. 0 → 13
      • priority = 5
      • denies inter-VRF L3Out EPG
      • shared route control without shared security import
    5. EPG → 14
      • priority = 9
      • provider EPG with global pcTag in provider VRF
    6. EPG → 0
      • priority = 12
      • provider EPG with global pcTag in consumer VRF
      • deny traffic without contract
    7. 0 → BD ID
      • priority = 16
      • permits unknown unicast for flood
    8. 0 → 0
      • priority = 17
      • filter – implarp
      • permits unicast ARP (bcast is permitted as mcast)
    9. 0 → 0
      • priority = 21
      • implicit deny
    10. 0 → 15
      • priority = 22
      • used only when preferred group is enabled
    11. VRF ID → 0
      • priority = 18
      • when preferred group is enabled
      • denies traffic from L3Out
    12. EPG → 0
      • priority = 18
      • when preferred group is enabled
      • denies traffic from EPG not within preferred group
    13. EPG → EPG
      • priority = 2
      • for intra-EPG isolation or contract
      • for intra-EPG isolation for external EPG
      • protects from vzAny or preferred group contract with other filter
    14. 0 ←→ 10
      • priority = 2
      • drop traffic from µEPG that was wrongly classified
# show system internal policy-mgr stats
; hitcount per rule
# contract_parser.py

Service chaining

  • inserts extra header after VXLAN

Multicast

  • FTAG root on spine (up to 16, martian address 0.0.0.), root is connected to all leafs, other spines – via a single leaf
  • PIM SM, SSM, Auto-RP, BSR in overlay
  • GIPo: group IP outer address for EPG (AVS ←→ leaf) and BD (leaf ←→ leaf); 225.0.0.0/15 by default
  • only L2 mcast in multipod
  • RP outside ACI
  • not affected by contracts, group admission via IGMP
  • connecting the other spines to the tree allows rerouting traffic after an uplink towards the root fails
  • FTAG-tree number – last bits of the mcast group; 16 available, only 12 are used
  • root for FTAG0 is not preempted, only after reload; others can start reelection
  • does not fall under disable remote EP learn, needed for correct forwarding of (S,G)
  • does not fall under disable IP dataplane learning
  • always route when mcast routing enabled (even L2 mcast) ⇒ leaf decreases TTL (both ingress and egress)
; tree
# show isis internal mcast routes ftag
; OIF per BD, VRF
# show isis internal mcast routes gipo

Load-balancing

  • dynamic, per flow (flowlet)
  • based on link and ASIC congestion
  • info on ingress port on leaf is included into packets
  • DRE:
    1. data redundancy elimination bits
    2. every hop updates DRE if own value is bigger than received
    3. max DRE is included into return packets to the same leaf (feedback)
    4. both leafs dynamically adjust weight ECMP based on DRE
  • mouse flow (short-lived, delay sensitive) has priority over elephant flow (long-lived, BW intensive) within QoS level
  • short flow < 15 packets
  • flowlet:
    • TCP burst, certain pause between them
    • can be balanced over different links without harming the flow if the gap between flowlets is bigger than the max delay across all paths – no reordering is possible then

vPC

  • peers use anycast vPC VTEP without peer-link (replaced with ZMQ in fabric; IGMP, EPM sync)
  • if downlink fails:
    1. available VTEP updates COOP with its physical VTEP
    2. on failed peer – bounce entry (not visible in CLI or GUI), once it’s removed (e.g. port operational) – EP Announce
    3. the available peer sends traffic from the endpoint using its physical VTEP ⇒ updates other leafs via the dataplane
  • if preferred group is used for L3Out on the peer, corresponding subnet might be absent; solution → recreate subnet on L3Out or reenable preferred group
  • keepalive ≡ route to VTEP via IS-IS
  • peers sync all EPs, including orphan
  • EPGs on both peers must have same encap VLAN, otherwise different FD_VLAN ⇒ different VNID ⇒ EPM sync error
# show coop internal info ip-db
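Basic vPC state check on a leaf (standard command; since ACI has no peer-link, peer status reflects reachability through the fabric/ZMQ):
; vPC domain, peer and per-vPC status
# show vpc brief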

STP

  • ACI does not participate in STP
  • BPDU flood within EPG within VLAN ⇒ L2Out VLAN ≠ EPG VLAN in order not to flap EP DB
  • MST:
    1. EPG in native VLAN for BPDU flood
    2. MST VLAN mapping for correct CAM flush on TCN (including spine proxy DB)
  • on TCN – CAM flush within the EPG ⇒ empties the HW proxy DB ⇒ silent host problem
  • if EPGs are different but access VLANs are the same, BPDUs do not pass through (BD traffic can flood through) ⇒ L2 loop

FC

  • only FCoE to servers and FCF
  • NPV only
  • does not forward traffic through fabric

MCP

  • miscabling protocol
  • leafs cabled directly to each other are detected via LLDP, not MCP
  • sends a frame in all VLANs on all ports; if such a frame is received back (same key) → err-disable the whole link
  • 256 VLANs per link (the first ones; frames are not sent on the rest)

L3Out

  • routes from IGP are imported into BGP to be passed to other leafs ⇒ RR required on spine
  • routes are installed on leafs if necessary (VRF is present)
  • needs to be associated with BD
  • for internal prefixes to be announced:
    1. subnet externally advertised: removes tag 0xFFFFFFFE (prohibits redistribution)
    2. BD association: creates prefix-list in route-map for MP-BGP → L3Out and static → L3Out
    3. contract between BD EPG and external EPG: creates a route on border if the route is not connected
  • supports BFDv1 async, except BGP multihop
  • can integrate via OpFlex with N7k and ASR9k for VRF-lite
  • can announce /32 for BD to avoid asymmetric routing
  • always redistributes static into MP-BGP, including BD subnets + connected
  • does not support FEX
  • does not configure RID automatically
  • no EP mapping DB: VXLAN F&L + RIB + ARP tables; BD_EXT_VLAN – per L3Out + encap VLAN
  • cannot use 0.0.0.0/0 as preferred group member ⇒ can bypass by 0.0.0.0/1 and 128.0.0.0/1
  • if a 1st gen leaf sends traffic to L3Out, it sets the DL bit (HW issue) ⇒ packets refresh aging but not VTEP info ⇒ stale EP
  • for traffic L3Out → compute DL bit is set ⇒ routed via MP-BGP without EP learning for VRF with ingress enforcement
  • RD = PTEP:VRFID ⇒ leaf gets VPNv4 prefixes from all border leafs
  • remote EP learning does not happen for traffic towards L3Out (DL bit); however, refreshes aging if towards local EP; same for traffic from L3Out
  • classification into EPG via longest prefix match – per VRF, not per L3Out
  • classification into EPG in SVI subnet – via ARP, pcTag=1, permit all; workaround – SVI subnet configured as EPG itself
  • intra-EPG contract does not enable isolation by default in contrast to common EPG
  • single IGP or BGP per L3Out (exception: OSPF + BGP with loopback exchange)
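To verify that external routes reach a tenant VRF and are redistributed into fabric MP-BGP, two standard leaf checks (VRF name format tenant:VRF):
; routes in the tenant VRF on a (border) leaf
# show ip route vrf <tenant>:<VRF>
; VPNv4 sessions towards the spine RRs
# show bgp vpnv4 unicast summary vrf overlay-1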

External subnets

  1. export route control:
    • exact prefix/length match
    • which transit routes (from different L3Out) to announce
    • route-map + prefix-list
    • alternative to BD association with L3Out
  2. import route control:
    • exact prefix/length match
    • which external prefixes to announce within fabric via BGP
    • route-map (BGP, inbound; OSPF, table-map for filtering RIB)
    • only OSPF and BGP
  3. external subnet for external EPG:
    • ACL semantic, per VRF (not per L3Out)
    • which external addresses to consider as external EPG
    • assigns pcTag, removes tag (by default all pervasive routes have tag=0xFFFFFFFE)
  4. shared route control:
    • controls if route can be leaked to other VRFs (when they define contract with external EPG)
    • creates route-map for leaking: common for all VRFs, checks only prefix (no RT), that’s why routes from VRF1 can fall under aggregate in VRF2
  5. shared security import:
    • allows applying contract in another VRF (dataplane permission, not control plane)
    • contract filtering on border leaf
    • assigns global pcTag
  6. aggregate export/import:
    • adds “le 32” to 0.0.0.0/0, permitting all prefixes
    • 0.0.0.0/0 only
    • does not relate to summarization
    • not applied to static route (only exact match) on another leaf
  7. aggregate shared routes:
    • adds “le 32” to prefix-list
    • non 0.0.0.0/0 are allowed
# vsh -c "show system internal policy-mgr prefix"

OSPF

  • always ASBR, may be ABR
  • all external routes (E) are redistributed into BGP
  • BGP is redistributed as E2
  • areas on different leafs = different areas, even if the number is the same
  • one OSPF process in VRF per leaf ⇒ different L3Out on one leaf ≡ different areas
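Quick OSPF checks for an L3Out on the border leaf (standard CLI):
; process/area overview and neighbors for the L3Out VRF
# show ip ospf vrf <tenant>:<VRF>
# show ip ospf neighbors vrf <tenant>:<VRF>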

BGP

  • static route or OSPF for next-hop resolution (OSPF announces only loopback and intf subnet)
  • route-map – whole L3Out, all peers
  • announce 0.0.0.0/0: transit routing or default route leak policy

EIGRP

  • 1 L3Out per border per VRF because of single process and single AS ⇒ 2 L3Out with EIGRP on same border are not allowed

Floating L3Out

  • does not specify logical intf: supports routing to migrating VMs
  • anchor node: IGP peering via L3Out to port-group in VMM
  • floating IP – on SVI on the leaf (VNF is located behind the leaf), participates in dataplane; anchor node has its own IP
  • next-hop propagation: anchor announces routes with next-hop = VTEP with VNF instead of its own

Encap mode

  1. local: default
  2. VRF:
    • several L3Out on same external BD ≡ same encap VLAN
    • L3Out per BGP peer
    • several IGP on one SVI

Transit routing

  • not all combinations of IGP/BGP
  • IGP routes receive tag=0xFFFFFFFF, ACI ignores such routes as loop prevention
  • applies the contract on the ingress leaf regardless of the VRF enforcement setting
  • 0.0.0.0/0 cannot be src or dst EPG within same L3Out – traffic is dropped because of policy (no contract between VRF → 15)

WAN integration

  • GOLF: giant overlay forwarding
  • L3Out in infra tenant, connected to spine (WAN ≡ border leaf)
  • shared L3Out
  • BGP EVPN route-type 5, host route leak via route-type 2
  • N7k F3, ASR1k, ASR9k
  • does not support mcast

IP SLA

  • next-hop monitoring
  • incompatible with track policy

Track

  • IP monitoring, not necessarily next-hop
  • route policy has more priority than next-hop policy
  • incompatible with IP SLA

Route-map

  • for BGP → IGP route-map is common for all IGP on border leaf within VRF
  • OSPF and EIGRP redistribute summarization and discard route from each other because of common route-map (IGP1 → BGP → IGP2)
  • BGP → IGP = redistribution, BGP → BGP = outbound route-map per session
  • types:
    1. match prefix and routing policy: adds prefix-list with export route control subnets to all matches; if route-map has prefix-list already – merge
    2. match routing policy only: ignores prefixes from L3Out
  • where to apply:
    1. L3Out subnet: subnet scope
    2. L3Out EPG: all subnets in defined direction (including BD subnet with export route control)
    3. default-export: all subnets in defined direction + BDs with L3Out association (pervasive static route)
    • match priority: subnet → EPG → default-export/import
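The route-maps and prefix-lists that APIC generates from these policies can be inspected on the border leaf with the usual NX-OS commands:
; generated route-maps and prefix-lists
# show route-map
# show ip prefix-list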

QoS

  • levels:
    1. 20% BW; Level 1; CoS = 2
    2. 20% BW; Level 2; CoS = 1
    3. 20% BW; Level 3; CoS = 0; default
    4. 0% BW; APIC traffic; CoS = 3
    5. 0% BW; SPAN; CoS = 4
    6. 0% BW; control plane, iTraceroute; CoS = 6; punted to CPU; copy (spine) or redirect (leaf) in TCAM on origin leaf
    • tail drop, DWRR
  • reserved:
    1. IFC (insieme fabric controller): APIC originated/destined; strict priority
    2. Supervisor class: supervisor, control plane; strict priority
    3. SPAN: BE, DWRR; can be starved
  • classified by DSCP, CoS, EPG, contracts
  • by default markings are ignored and rewritten
  • QoS precedence:
    1. zone rule (contract): subject, then contract wide
    2. DSCP EPG
    3. CoS EPG
    4. default
  • FEX resets CoS on egress
  • by default leaf inserts CoS into VXLAN DSCP (CS) and backwards ⇒ changes client CoS but not DSCP (CS)
  • applied on ingress, egress parameters do not matter (e.g. EPG QoS)
  • contract QoS is applied on contract enforcement ⇒ EP has to be known (on spine-proxy/flood egress does not apply QoS)
  • dot1p preserve writes client CoS into VXLAN IPP
  • if client traffic is marked with CoS = 6 (dot1p preserve or IPN DSCP-CoS mapping), then the traffic is punted to CPU on egress leaf ⇒ drop
  • DSCP class-to-CoS translation (infra → policies → protocol) is not compatible with dot1p preserve
  • for L3Out contract requires VRF egress enforcement
; buffer drops statistics
# vsh_lc -c "show platfowm internal counters port "
; per QoS class drops
# show queueing interface 
; database policer drops
# show dpp policy

Firmware image

  1. APIC image
  2. switch image: one for all leaf/spine models
  3. catalog image: compatibility description, HW tests
  • APICs are updated one by one in random sequence (~ 10 mins)
  • n-group upgrade: split switches into groups for the update, lowers downtime risks
  • can restrict max number of devices being updated at once
  • only 1 peer out of vPC pair is upgraded at a time
  • update is possible only if ACI is healthy (APIC cluster fully fit, no ongoing convergence)

Faults

  • phases:
    1. soaking: found, delay before raise to ignore intermittent issues
    2. soaking-cleared: false alarm; cleared while in soaking
    3. raised
    4. raised-clearing
    5. retaining: archive entry about fault
  • if fault is acked by user, it is removed
  • timers:
    1. soaking: 120s default
    2. clearing: 120s default
    3. retaining: 3600s default
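Fault instances can also be pulled from the object model on the APIC (moquery is standard; filter further as needed):
; all fault instances
# moquery -c faultInst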

Config rollback

  • a snapshot does not save passwords if a security passphrase is not set when it is created

AAA

  • by default prefers in-band to out-of-band
  • fallback (login domain) – APIC local profiles

Roles

  1. aaa: config AAA + import/export policy
  2. admin
  3. access-admin: access policies
  4. fabric-admin: fabric policy + external connectivity
  5. nw-svc-admin: L4-L7 insertion + orchestration
  6. nw-svc-params: parameters of L4-L7 devices
  7. ops: monitoring + tshoot
  8. read-all
  9. tenant-admin
  10. tenant-ext-admin: tenant external connectivity
  11. vmm-admin
  12. nw-svc-devpkg: import a package for infra admin, RO for tenant admin
  13. nw-svc-policy: creating service graph
  14. nw-svc-device: creating device cluster
  15. nw-svc-devshare: export device cluster to another tenant
  • RO or RW access is determined on user level

CoPP

  • ASIC interface – between ASIC and CPU:
    1. knet0: 1st gen leaf; receive
    2. knet1: 1st gen leaf; transmit
    3. tahoe0: EX+, baby spine; Rx and Tx
    4. psdev1.1: modular spine; Rx and Tx
  • restrictions on contract log:
    1. permit: 300 pps, PERMIT LOG
    2. deny: 500 pps, ACLLOG
; parse internal headers, debug control plane
# tcpdump -xxxi tahoe0 | knet_parser.py --decoder tahoe
; CoPP drops
# show system internal aclqos brcm copp entries unit 0

Atomic counters

  • mark bit – iVXLAN bit 56
  • process:
    1. mark bit = 0 in all packets
    2. all SW clear statistics for mark bit = 1
    3. all SW start sending traffic with mark bit = 1, clear statistics for mark bit = 0
    4. after 30s all SW send traffic with mark bit = 0 but continue to send and count mark bit = 1
    5. APIC collects statistics