- Deployment
- ACI
- DCNM
- NIR
- NIA
- NAE
- APIC Cloud
- Design
- Discovery
- APIC
- Spine
- Leaf
- Tier-2 leaf
- Endpoint
- Host tracking
- GARP-based EP move detection
- EPG
- vzAny
- µEPG
- ESG
- Stale endpoints (policy enforcement = ingress)
- Underlay
- Overlay
- iVXLAN
- BD
- VRF
- Tenants
- AAEP
- Policy
- Contracts
- Taboo contract
- pcTag
- Shared services
- Contract HW
- Service chaining
- Multicast
- Load-balancing
- vPC
- STP
- FC
- MCP
- L3Out
- External subnets
- OSPF
- BGP
- EIGRP
- Floating L3Out
- Encap mode
- Transit routing
- WAN integration
- IP SLA
- Track
- Route-map
- QoS
- Firmware image
- Faults
- Config rollback
- AAA
- Roles
- CoPP
- Atomic counters
Deployment
- multi-pod: stretched APIC cluster
- multi-site: separate APIC clusters + MSO
- vPOD: vSpine + vLeaf + AVE (APIC + 2 * vAPIC on main site)
- stretched fabric: transit leafs connect spines of different sites
ACI
- supports FCoE (4.0 supports native FC) locally on a leaf (does not go into fabric) as NPV (also as FCoE)
- protocols:
1. IS-IS: underlay
2. COOP: discover endpoints, EVPN twin
3. BGP: external connectivity, EVPN from spine towards IPN, VPN within fabric
4. PIM BD
5. OpFlex: managing AVS - MTU > 1600, RTT < 50ms between spines
- consistent policy
- no NAT support
- spine: X9700, N9500, N9300
- leaf: X9500, N9500, N9300
- NX-OS: X9400/X9600, N9200
- consistent VMM policy for various systems
; Switch from iBash into NX-OS, software level cmds
# vsh
; Hardware level, pizza boxes
# vsh_lc
; Hardware level, modular; from vsh
# attach module
; PTEP addresses
# acidiag fnvread
DCNM
- supports various Nexus models, not only N9k
- automated underlay/overlay
- programmable fabric (programable network ≡ API)
- license mode: 1) switch-based: by SN, only SAN 2) server-based: by MAC of mgmt interface, SAN + LAN
- cannot be mixed
NIR
- network insights resources
- detect anomalies
- tshoot (TCAM, CPU, RAM, temperature)
- EP statistics, resource utilization, flows
NIA
- network insights advisor
- proactive
- notifies about vulnerabilities, bugs
- checks limits of HW per version
NAE
- network assurance engine
- check config compliance to the policy
- network modelling
APIC Cloud
- uses provider API for policy implementation
Design
- for 10/40/100G oversubscription – 3:1 or 4:1
- 80 leafs (3 APIC), 200 leaf (5 APIC) max
- 6 spine per pod, 24 spine per fabric max
- 20 FEX per leaf, 650 per fabric, 576 ports per switch
- tenants: 1000 (3 APIC), 3000 (5 APIC)
- VRF: 1000 (3 APIC), 3000 (5 APIC)
- contracts: 10000
- filters: 10000
- EPG: 4000 (single tenant), 500 per tenant up to 21000, 400 isolation-enabled EPG
- BD: 15000
- service chain: 1000
Discovery
- LLDP TLV, then receiving address via DHCP for VTEP and option with bootfile address
- CDP/LLDP can discover non-ACI switches and put them into unmanaged nodes group (e.g. blade switch)
- LLDP for discovering infra VLAN
- manual admission to fabric
- in LLDP TLV there is a vector (APIC ID, APIC IP, APIC UUID) – APIC discovery (mismatch – report to APIC)
- between APIC and node – IFM (intra-fabric messaging) heartbeat over TLS
; delete fabric config; adding to fabric requires reload
# acidiag touch clean
APIC
- image management
- DHCP/TFTP server during discovery
- stores 3 copies of DB (shards): 3 APIC = copy on each, more – random ⇒ increase number of APICs ≡ just increase of allowed leafs, not extra fault-tolerance
- APIC-L: > 1000 physical ports, APIC-M: otherwise
- 3 APIC quorum ≠ data consistency (shards can be RO or lost)
- if shard is lost (all 3 APICs holding it are down) – restore from snapshot
- active-standby uplink
- in-band preferred over out-of-band
- can host containers, subnet 172.16.0.0/16
- shard leader – RW, others in RO
- after split-brain restore according to timestamps
- if fabric starts with 1 or 2 APICs – RW
- after all APICs failed, new APICs can read config and state from fabric (VTEP, SGT, VNID,.. – fabric ID recovery)
- APIC cluster has same version (exception – during upgrade)
- bond0. = TEP pool + APIC ID (e.g. 10.0.0.3)
; bond settings on APIC in /proc/net/bonding/bond0
# acidiag run lldptool out|in eth2-1
Spine
- holds COOP DB for all endpoints (global proxy table)
- synchronize COOP DB between each other
- BGP RR
- anycast proxy VTEP for unknown unicast
- IP connection towards IP network in multipod/multisite
- push update COOP to leaf only because of bounce
- COOP = ZeroMQ + DHT
- always MAC-VTEP mapping; IP-VTEP – if IP routing is enabled on BD (subnet might be empty)
- when proxying, does not change src IP, changes dst IP, for the purpose of data plane learning
- if there is no entry in COOP DB regarding the EP, runs ARP gleaning on all leaves with BD configured instead of dropping packet; only ARP and IP unknown unicast, src MAC for gleaning – BD MAC; for unknown MAC – drop
- COOP dampening:
1. enabled by default, can be disabled only via API
2. penalty: per IP, every COOP event brings weighted penalty; after 10 mins decreases by 75%
3. state:
– normal
– freeze (> 10000, or longer than 5 mins in critical)
– non-dampening (< 2500, freeze → normal)
# show coop internal info ep <BD VNID> <MAC>
# show coop internal info ep <VRF VNID> <IP>
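The dampening behavior above can be modeled as a toy state machine. Only the 10000/2500 thresholds and the 75%-per-10-minutes decay come from the notes; the per-event penalty weight is an illustrative assumption:

```python
class CoopDampening:
    """Toy model of per-IP COOP dampening: each COOP event adds a weighted
    penalty, the penalty decays 75% every 10 minutes, updates freeze above
    10000 and unfreeze below 2500. Event weight is an assumed value."""
    ENTER, EXIT, DECAY = 10_000, 2_500, 0.25   # keep 25% per 10-min tick

    def __init__(self):
        self.penalty, self.frozen = 0.0, False

    def event(self, weight=1000):              # e.g. one EP move announcement
        self.penalty += weight
        if self.penalty > self.ENTER:
            self.frozen = True                 # spine ignores updates for this IP

    def tick_10min(self):
        self.penalty *= self.DECAY             # "decreases by 75%"
        if self.penalty < self.EXIT:
            self.frozen = False
```

Eleven rapid moves would freeze the entry; roughly two decay intervals later it recovers.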
Leaf
- policy enforcement as soon as dst EPG is known (ingress or egress leaf)
- registers remote endpoint location via COOP by spine
- learns MAC/IP by ARP snooping (local station table) or via data plane (for VTEP)
- randomly chooses spine for COOP registration (anycast)
- includes src EPG into VXLAN header
- passes ARP instead of proxying, in order not to break GARP-based servers
- normalization: translates other encapsulations into VXLAN
- on VM move:
1. new leaf forwards gratuitous ARP to old leaf, updates COOP
2. old leaf forwards traffic towards VM to new leaf (bouncing) after receiving push via COOP
3. bounce-to-proxy (D flag): usually for IP that has its MAC moved (bounce exists) but IP is not learned yet
4. bounce does not change src IP (sIPo) of VXLAN to its own VTEP - anycast gateway
- only straight-through FEX, only on downlinks (for 40G – breakout)
- can pass ARP unicast to VTEP (only when routing is enabled, otherwise always flood) or spine (if dst is unknown)
- enable unicast routing ≡ default GW + EP learning (IP-VTEP mapping)
- limit IP learning to subnet ≡ uRPF for learning in BD
- generations:
1. ALE: application leaf engine
2. LSE: leaf-spine engine - native VLAN (= untagged) – per leaf, not per port
- processes:
- supervisor:
1. endpoint manager (EPM): learns EP, passes to APIC and vPC peer
2. ethernet Lif table manager (ELTM): OpFlex → ASIC (intf, VLAN, VRF)
3. unicast RIB
4. policy manager: OpFlex → ASIC (contract)
- linecard:
1. HAL
2. endpoint manager client (EPMC)
3. ELTMC
4. uFIB
5. ACLQOS
Tier-2 leaf
- from 4.1(1)
- change of downlink → fabric port on leaf only after discovering tier-1 leaf by APIC ⇒ if APIC is connected to tier-2, leaf must be connected by fabric ports
- incompatible with remote leaf
Endpoint
- types:
1. local: flag L or VL (virtual local ≡ behind AVS/AVE)
2. remote: no flag
3. on-peer: flag O, orphan port on vPC peer
- learning:
- remote: conversational learning ≡ cache
- learning IP+MAC via data plane or via *ARP, DHCP; IP packets only, routed, not switched
- hardware proxy: send packet for unknown EP in case of ARP, unicast L2, unicast L3 + BD dst is local (cannot route)
- if ARP flooding is disabled, leaf itself sends ARP request for unknown IP (ARP gleaning) after receiving message from spine that it doesn’t know about EP; sends locally + GIPo
- if dst IP for L3 packet is not in BD on ingress leaf, then routing for the prefix (not COOP) via spine proxy (pervasive route); no remote endpoint learning → contract applied on egress leaf, learning happens on return traffic
- endpoint loop protection: BD learn disable or port shutdown, when EP moves too frequently; can distinguish single EP flap when the problem is only with single EP (NIC teaming) and there is no loop
- rogue EP control: freezes COOP entry (VTEP + port) if EP moves frequently (static EP + DL bit)
- setting DL flag in VXLAN → do not learn src IP – VTEP binding (needed for L4-L7 PBR)
- EP announce: when bounce times out, announce removal of corresponding entry on border leaf and compute leaf
- by default aging timer is renewed by any IP/MAC for EP → IP can get stuck; IP aging policy enables timers per IP (System Settings → EP Controls → IP aging)
- when EP moves, on the old leaf appears bounce (10 mins by default); when a packet hits it, leaf sets E bit and sends it towards new leaf
- hardware proxy sets E bit because COOP has up-to-date info
- if MAC moves but IP location is not clear, old leaf sets bounce: MAC → new leaf, IP → spine-proxy
- if EP sends traffic from IP that falls under L3Out (except 0.0.0.0/0) but does not fall under BD subnet (spoofing), then 2nd generation does not learn this remote EP via dataplane
; FD VLAN – BD VLAN mapping
# vsh_lc -c "show system internal eltmc info vlan access_encap_vlan "
; BD, VLANs, flags
# vsh_lc -c "show system internal epmc endpoint mac "
# clear system internal epm endpoint key vrf ip
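The bounce mechanics above (old leaf rewrites toward the new VTEP and sets the E bit so the packet is not bounced again) can be sketched as a lookup order; the table shapes and names are hypothetical:

```python
import time

def forward(local_table, bounce_table, dst_mac, pkt):
    """Minimal sketch of leaf lookup order after an EP move: a bounce entry
    rewrites the outer destination to the new VTEP and sets the E (exception)
    bit; unknown destinations fall back to the spine proxy."""
    if dst_mac in local_table:
        return {**pkt, "out": local_table[dst_mac]}
    if dst_mac in bounce_table:
        vtep, expires = bounce_table[dst_mac]
        if time.time() < expires:                # bounce lives ~10 min by default
            return {**pkt, "out": vtep, "E": 1}  # E bit: do not re-bounce
    return {**pkt, "out": "spine-proxy"}         # COOP anycast proxy VTEP
```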
Host tracking
- only for IP, cannot be disabled
- if more than 75% of timeout has elapsed, sends 3 ARP requests
- if packet received from IP within first 75% – set flag but do not flush the timer
- timer is flushed at 75% only if flag is set ⇒ EP can get stuck for 2 intervals
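The 75%-checkpoint aging described above can be sketched as follows (timeout value and return strings are illustrative assumptions):

```python
class HostTracker:
    """Sketch of 75%-checkpoint aging: traffic before the checkpoint only
    sets a flag; at 75% of the timeout the timer is restarted if the flag
    is set, otherwise 3 ARP probes are sent."""
    def __init__(self, timeout=300):
        self.timeout, self.elapsed, self.seen = timeout, 0, False

    def packet_from_ip(self):
        self.seen = True                          # flag only; timer keeps running

    def tick(self, seconds):
        self.elapsed += seconds
        if self.elapsed >= 0.75 * self.timeout:
            if self.seen:
                self.elapsed, self.seen = 0, False    # flush at checkpoint
                return "refreshed"
            return "send 3 ARP probes"
        return "aging"
```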
GARP-based EP move detection
- by default cannot detect IP move to a new MAC within same interface and EPG
- usecase: different VMs on one host, one IP (another VM is turned off)
- learns IP-MAC mapping via GARP on the same port and EPG
- requires unicast routing and ARP flooding
EPG
- endpoint group, mapped to VLAN internally, distinguished in data plane by port + encap
- resolution:
- physical port (leaf or FEX)
- port group (≡ VLAN/VXLAN)
- VXLAN VNID, NVGRE
- VLAN ID (associated in AAEP config)
- subnet (µEPG)
- mcast group
- attributes (VM tags, guest OS, MAC), µEPG
- intra-EPG contract = PVLAN + proxy ARP ⇒ traffic is routed
- 3960 EPG+BD per leaf (594 VLAN IDs reserved)
- preferred group membership:
- EPGs that do not need contracts between each other; cannot be provider
- increase TCAM consumption:
- changes implicit deny to permit
- adds deny rules between EPG
- if EPG – provider for shared service, subnet has to be configured under EPG for proper route leaking
- static leaf binding (to VLAN) makes all ports L2 ⇒ L3Out cannot be configured (solution – bind to port + VLAN)
- labels can dictate which consumer EPG can communicate with provider EPG (exact match)
vzAny
- saves TCAM
- cannot be provider for shared resource
- permit all return TCP traffic ≡ match established, consumer + provider (stateful in ACI ≡ match ACK)
- includes all EPGs in VRF, including L3Out
- if vzAny is a consumer for EPG from another VRF, L3Out in own VRF announces EPG subnet from another VRF
- does not trigger contract on ingress leaf if EPG falls under vzAny (0 → 0 or 0 → EPG) regardless of remote EP
- incompatible with policy compression
µEPG
- resolution immediate: does not need to classify traffic as EPG after vMotion with DVS (no OpFlex ⇒ APIC required, leaf cannot pass policy via OpFlex)
- deployment immediate
- PVLAN-based: different VLAN → no local switching, everything goes through leaf
- precedence (DVS):
- best: IP addr (MAC if AVS)
- MAC (IP if AVS)
- vNIC domain name
- VM ID
- VM name
- hypervisor ID
- VMM domain
- vCenter DC
- custom attributes
- OS
- tag
- MAC for switched, IP – for routed (requires proxy ARP for intra-subnet traffic ⇒ intra-EPG contract/isolation)
- EPG match precedence: which precedence to start comparing from if there is attribute tie
- operator precedence: equals > contains > start with > end with
- does not belong to EPG, assigned automatically (no port group in VMM); master EPG required for BD and QoS class
- segmentation on VM level, does not depend on VLAN or subnet
- classification:
- OpFlex supported (AVE): VLAN/VXLAN ID
- no OpFlex (vDS): VM MAC
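The attribute precedence list above can be sketched as a first-match classifier. The `rules` structure (attribute → value → µEPG) is a hypothetical shape, not an APIC API:

```python
# Order from the notes (DVS case): lower index = higher precedence.
PRECEDENCE = ["ip", "mac", "vnic_dom", "vm_id", "vm_name",
              "hypervisor_id", "vmm_domain", "vcenter_dc",
              "custom_attr", "os", "tag"]

def classify(vm_attrs, rules):
    """Return the µEPG whose matching attribute has the best precedence;
    None means the VM stays in the base EPG."""
    for attr in PRECEDENCE:
        value = vm_attrs.get(attr)
        epg = rules.get(attr, {}).get(value)
        if epg:
            return epg
    return None
```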
ESG
- endpoint security group
- not tied to BD, has no network config
- ESG-ESG, ESG-L3Out contracts
- classified by IP, EPG, tag
- only for routed
Stale endpoints (policy enforcement = ingress)
DL bit for IP2→L3Out→IP1 refreshes aging on leaf1 but not the entry itself (VTEP to leaf3) ⇒ IP2 stale
There is no bounce on leaf2, because L3Out does not learn EP (IP2) ⇒ formally there is no EP move
Solution:
- manual clear
- enforce subnet check + remove BD subnet (removes EP by EP announce)
If IP1 sends traffic only to L3Out, then DL bit is set so no remote EP entries are created. IP1 → IP2 creates remote EP on leaf3
IP1 → L3Out refreshes aging on leaf3 without refreshing VTEP (DL bit). After bounce on leaf1 expires, L3Out → IP1 is blackholed
Solution:
- disable remote EP learn
- EP Announce (after bounce expires)
Underlay
- unnumbered intf, only loopbacks are numbered (/32 to VTEP, vPC VIP, APIC, spine proxy IP)
- L1 IS-IS, system ID from reserved PTEP (10.0.128.70 → 46:80:00:0A:00:00)
- common RIB with OOB mgmt in Linux kernel
- at least /22 subnet, /16 recommended; change only via fabric reset
- port tracking: on uplink loss on leaf downlink is disabled ⇒ hosts are made to switch NIC
- supports BFD
- leafs have anycast fabric VTEPs for connection with vSwitch
- all intf routed, connection via subints in VLAN2 (subint ID not fixed)
- 000C.0C0C.0C0C – MAC for every fabric intf on spine, 0D0D.0D0D.0D0D – on leafs
- loopback 0 = PTEP
; VTEP roles
# show isis dteps vrf overlay-1
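The system-ID derivation in the example above (10.0.128.70 → 46:80:00:0A:00:00) is just the four PTEP octets reversed and zero-padded, which can be reproduced directly:

```python
def isis_system_id(ptep):
    """Derive the IS-IS system ID the way the example above shows:
    reverse the four PTEP octets, pad with two zero bytes."""
    octets = [int(o) for o in ptep.split(".")]
    sysid = list(reversed(octets)) + [0, 0]
    return ":".join(f"{b:02X}" for b in sysid)
```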
Overlay
- VXLAN identifies VRF, BD ID (for mcast and IPv6), EPG (VXLAN-GPO)
- flood & learn:
- only 2 hosts in BD
- silent host even for ARP request
- cannot lose initial packet from host
- ARP gleaning: ARP request about unknown IP
- if HW proxy does not know EP location – drop ≡ silent host issue (ARP flooding → HW proxy switch requires refresh on host)
- endpoint loop protection: endpoint flap (default: 4 times in 60s) → port disable or BD EP learning disable
- VRF leaking on ingress leaf (has necessary routes + VNID from target VRF)
- class ID ≡ pcTag ≡ SGT
- global station table (GST): IP/MAC to remote VTEP mapping, conversation-based, COOP DB subset
- local station table (LST): IP/MAC local
- UDP port 48879
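The endpoint-loop-protection default mentioned above (4 moves in 60 s) is a sliding-window counter; a sketch:

```python
from collections import deque

class EpLoopProtection:
    """Sliding-window move counter per the defaults above: 4 or more moves
    within 60 s triggers the configured action (port disable or BD EP
    learning disable)."""
    def __init__(self, moves=4, window=60):
        self.moves, self.window, self.times = moves, window, deque()

    def record_move(self, now):
        self.times.append(now)
        while self.times and now - self.times[0] > self.window:
            self.times.popleft()                 # drop moves outside the window
        return "trigger-action" if len(self.times) >= self.moves else "ok"
```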
iVXLAN
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| VXLAN flags |R|D|E|S|D| Rsv | Source |
+ | |L| |P|P| | group +
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| VNID | Reserved |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
DL (don’t learn) – 1 = disable remote learning for this packet
E (exception) – 1 = leaf cannot send the packet back to fabric (e.g. bounce)
SP (source policy applied)
DP (dst policy applied)
DL disables learning EP but not refreshing aging timers ⇒ stale EP
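The header sketched above can be packed/parsed with `struct`; the bit positions of DL/E/SP/DP and the VXLAN flags byte are read off the diagram and may not match hardware exactly:

```python
import struct

# Assumed bit positions in the second header byte, per the diagram:
# bit7 reserved, then DL, E, SP, DP, 3 reserved bits.
DL, E, SP, DP = 0x40, 0x20, 0x10, 0x08

def pack_ivxlan(vnid, sclass, policy_flags=0, vxlan_flags=0x08):
    """Pack the 8-byte iVXLAN header: VXLAN flags byte, policy-flag byte
    (DL/E/SP/DP), 16-bit source group (sclass/pcTag), 24-bit VNID and one
    reserved byte folded into the final 32-bit word."""
    return struct.pack("!BBHI", vxlan_flags, policy_flags, sclass, vnid << 8)

def unpack_ivxlan(hdr):
    _, fl, sclass, tail = struct.unpack("!BBHI", hdr)
    return {"dl": bool(fl & DL), "e": bool(fl & E),
            "sp": bool(fl & SP), "dp": bool(fl & DP),
            "sclass": sclass, "vnid": tail >> 8}
```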
BD
- BUM spreading zone
- legacy mode: BD ≡ EPG ≡ VLAN in HW
- flood in encapsulation:
- BUM and control plane – only within VLAN (not BD), exactly FD_VLAN
- proxy ARP for traffic between VLAN
- does not support IPv6 and mcast routing
- if unicast routing disabled, dataplane EP learning via ARP remains (subnets have no effect)
- disabling unicast routing or removing subnet – trigger EP announce for IP EP to remove it from leafs where BD is not deployed
- disabling IP data plane learning disables only IP learning + sets DL bit in iVXLAN
- limit local IP learning to subnets: does not learn IP outside the subnet; does not impact MAC learning or remote EP
- subnet in BD ≡ subnet in EPG, except for leaking
- platform-independent VLANs were needed between Broadcom ASIC (front ports) and Cisco ASIC (fabric port, VXLAN)
# show system internal eltm info vlan brief
# show system internal epm vlan all
VRF
- policy enforcement:
- ingress: applied on compute leaf for ingress and egress, border does not filter traffic ≡ distributed, less TCAM utilization on border
- egress: L3Out → EPG filtering on border
- disable remote EP learn:
- disables learning on border leaf via data plane (no need since no policy is enforced)
- fixes bug with 1st gen leaf → L3Out
- compute sets DL bit for all traffic towards border, disables only IP learning (MAC remains)
- disabling IP data plane learning disables learning only for IP; MAC is learned in dataplane, IP learned via control plane
- enforce subnet check (global):
- for routed traffic, does not learn MAC or IP if IP is not from the BD subnet
- for remote EP subnet has to be from VRF
- bug: disables MAC learning from ARP traffic in L2 BD (routing disabled)
- disabling IP dataplane learning for subnet BD or EPG:
- does not learn IP via dataplane
- learns MAC if bridged
- does not learn IP or MAC from ARP request
- local EP learns MAC and IP from ARP reply, remote EP learns only MAC
- requires L2 unknown unicast = flood, since ARP requests no longer populate the EP DB (previous point)
Tenants
- default:
- infra:
- intra fabric communication (SW-SW, SW-APIC)
- infrastructure VLAN 3967 – supported by all platforms
- mgmt:
- same VRF as infra, same RIB in APIC Linux core
- to apply contracts on APIC, APIC addresses have to be added to EPG (Node Mgmt Addresses)
- common:
- can be used for shared services
- resources are visible to other tenants
- object names have to be unique system-wide (if name conflict is found, priority is given to tenant object, not from common)
- infra:
- Operational → Flows shows flows for log action in contracts via CoPP (e.g. 0 → 0 = deny, log)
AAEP
- attachable access entity profile
- connects domain and intf policy
- deploy EPG as VLAN
- VLAN pool receives list of VNID ⇒ different VLAN pools = different FD_VLAN and different VNID for same VLAN ⇒ confines flood for BPDU
- enforce EPG VLAN validation: if EPG is associated with domains that have overlapping VLAN pools, then overlapping VLAN assignment is prohibited (non-deterministic because different VNID could be assigned ⇒ sync fail for vPC EPM)
- local scope VLAN: different EPGs on the same leaf can reuse the same VLAN ID on different ports; EPGs must be in different BD ≡ different FD_VLAN
Policy
- interface policy → LLDP “default” is used for ZTP
- not applied on SW in config zone in locked deployment mode
- config zone – only for infra tenant + fabric policies
- interface policy group ≡ vPC port-channel, different policy groups ≡ different vPC port-channels
Contracts
- scope:
- global
- tenant
- application profile: not compatible with L2Out or L3Out
- VRF
- defines which EPG can use ≡ which EPG can potentially communicate (pervasive route for BD subnet via spine proxy)
- labels can refine further which subjects can refer to which EPG (exact match)
- dynamic inheritance (depth = 1), can inherit from different master EPGs
- for intra-EPG contract – provider, ≡ isolation + contract
- subject tie-breakers:
- contract between EPGs > vzAny > implicit deny
- deny priority:
- lowest: vzAny–vzAny (17-20)
- medium: EPG–vzAny (13-15) or vzAny–EPG (14-16)
- highest: EPG-EPG
- default: contract level
- if tie on certain level:
- deny > redirect
- redirect > permit if the filter is equal; otherwise – the more specific filter wins
- TCAM level – priority: the lower the more priority entry has
- more precise L4 filter has more priority;
- specific dst > specific src
- compression: filter is not reused between contracts; filter is reused within a contract for different EPGs (reference instead of filter); permit only
- permitted by default (no contract needed): DHCP, NDP (NS, NA), OSPF, EIGRP, PIM, IGMP, ARP
- applied on ingress leaf, exception – no remote EP in COOP (EPG routing, L3Out)
- provider exports contract to other tenants
- by default, IP fragments – drop
- labels:
- EPG level: contract is applied to EPGs with same label; EPG config
- subject level: subject in contract specifies allowed provider label and consumer label; EPG and subject config
- contract level: sets EPG label or subject label only for this contract
- inherited labels are applied only to inherited contracts; configured labels or contracts do not interact with inherited ones
- do not affect service graph with copy/redirect; not implemented in TCAM, only software, impact TCAM programming
- incompatible with policy compression
# show zoning-rule
# show zoning-filter
; TCAM utilization
# vsh_lc -c "show platform internal hal health-stats"
Taboo contract
- allows overriding part of a contract, e.g. denying otherwise permitted traffic
- has higher priority than regular contracts
- always provider
pcTag
- range:
- local: 16386 – 65535, unique within VRF; 0x4000 – 0xFFFF
- global: 16 – 16385; assigned to provider EPG for shared service
- reserved:
- 0: vzAny
- 1:
- traffic for CPU, implicit permit: spine-proxy, ARP, mcast
- pervasive route: BD subnet, L3Out SVI subnet
- 10: µEPG in private VLAN without mapping = classification on VLAN instead of MAC
- 13:
- for 0.0.0.0/0 L3Out from different VRF or all routes from L3Out that do not fall under shared security import
- dst pcTag; if src, then pcTag = VRF pcTag
- 14:
- consumer VRF inside provider VRF
- implicit permit because policy is enforced in consumer VRF for both traffic directions
- 15:
- 0.0.0.0/0 from all L3Out, does not relate to specific L3Out ⇒ contract for 0.0.0.0/0 from L3Out-1 are also applied for 0.0.0.0/0 from L3Out-2
- dst pcTag; if src, then pcTag = VRF ID
- change pcTag – trigger EP Announce
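The ranges above can be condensed into a small classifier; the reserved-value descriptions are shortened from the notes:

```python
RESERVED = {0: "vzAny", 1: "CPU traffic / pervasive routes",
            10: "µEPG in PVLAN without mapping",
            13: "inter-VRF 0/0 or L3Out routes without shared security import",
            14: "consumer VRF inside provider VRF",
            15: "0.0.0.0/0 from any L3Out"}

def pctag_scope(tag):
    """Classify a pcTag: reserved values, global (16–16385, shared-service
    providers), local (16386–65535, unique per VRF)."""
    if tag in RESERVED:
        return f"reserved: {RESERVED[tag]}"
    if 16 <= tag <= 16385:
        return "global"
    if 16386 <= tag <= 65535:
        return "local (unique per VRF)"
    return "invalid"
```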
Shared services
- provider exports contract, inside consumer tenant – consumed contract interface
- subnet in provider VRF configured under EPG – allows to determine sclass and apply contract in consumer VRF
- flag on subnet – shared between VRF
- way to leak BD subnet from provider: assign provider as service consumer in consumer VRF (creates pervasive route backwards as well)
- routing between VRFs – always through spine (pervasive route) even if both VRFs are deployed on the same leaf; reason – no remote EP learning (ingress leaf sets DL bit): needed to always use pervasive route because of VNID rewrite on ingress leaf
- all contracts enforced in consumer VRF
- if L3Out EPG is consumer, contract is applied on ingress leaf (not only in consumer VRF but in provider VRF as well ⇒ no rule with pcTag = 14 in HW)
- if ingress enforcement, then contract enforced on compute (even if provider VRF is located on compute) ⇒ L3Out has global pcTag
- EPG is in preferred group, L3Out in another VRF – consumer:
- contract is applied on ingress leaf
- in provider VRF there is an entry “EPG → 0 = deny” to deny traffic not falling under contract
- as a result, EPG cannot communicate with preferred group
Contract HW
- if packet hits implicit entry, SP and DP bit are not set; same with vzAny as src EPG (no src EPG in zoning)
- implicit rules are added if VRF is in enforced mode
- implicit rules:
- 1 → 0
- priority = 0
- for traffic from CPU (BD SVI, L3Out)
- 0 ←→ 10
- priority = 2
- denies µEPG traffic if it’s classified by VLAN instead of MAC
- EPG ←→ EPG
- priority = 3
- 0 → 13
- priority = 5
- denies inter-VRF L3Out EPG
- shared route control without shared security import
- EPG → 14
- priority = 9
- provider EPG with global pcTag in provider VRF
- EPG → 0
- priority = 12
- provider EPG with global pcTag in consumer VRF
- deny traffic without contract
- 0 → BD ID
- priority = 16
- permits unknown unicast for flood
- 0 → 0
- priority = 17
- filter – implarp
- permits unicast ARP (bcast is permitted as mcast)
- 0 → 0
- priority = 21
- implicit deny
- 0 → 15
- priority = 22
- used only when preferred group is enabled
- VRF ID → 0
- priority = 18
- when preferred group is enabled
- denies traffic from L3Out
- EPG → 0
- priority = 18
- when preferred group is enabled
- denies traffic from EPG not within preferred group
- EPG → EPG
- priority = 2
- for intra-EPG isolation or contract
- for intra-EPG isolation for external EPG
- protects from vzAny or preferred group contract with other filter
- 0 ←→ 10
- priority = 2
- drop traffic from µEPG that was wrongly classified
# show system internal policy-mgr stats
; hitcount per rule
# contract_parser.py
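The TCAM tie-breaking described above (lowest priority number wins, then deny > redirect > permit) can be sketched on simplified rule tuples; real zoning rules also match filters and direction, which this sketch omits:

```python
ACTION_RANK = {"deny": 0, "redirect": 1, "permit": 2}  # tie-breaker per the notes

def resolve(rules, src, dst):
    """Pick the winning zoning action for (src, dst) pcTags.
    Rules are (src, dst, priority, action) tuples, 0 = 'any' pcTag."""
    matches = [r for r in rules
               if r[0] in (src, 0) and r[1] in (dst, 0)]
    if not matches:
        return "deny"                                  # implicit deny
    return min(matches, key=lambda r: (r[2], ACTION_RANK[r[3]]))[3]
```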
Service chaining
- inserts extra header after VXLAN
Multicast
- FTAG root on spine (up to 16, martian addresses from 0.0.0.0/8), root is connected to all leafs, other spines – via a single leaf
- PIM SM, SSM, Auto-RP, BSR in overlay
- GIPo: group IP outer address for EPG (AVS ←→ leaf) and BD (leaf ←→ leaf); 225.0.0.0/15 by default
- only L2 mcast in multipod
- RP outside ACI
- not affected by contracts, group admission via IGMP
- connecting spine to the tree allows rerouting the traffic after an uplink to the root fails
- FTAG-tree number – last bits of mcast group; 16 available, only 12 are used
- root for FTAG0 is not preempted, only after reload; others can start reelection
- does not fall under disable remote EP learn; needed for correct passing of (S,G)
- does not fall under disable IP dataplane learning
- always route when mcast routing enabled (even L2 mcast) ⇒ leaf decreases TTL (both ingress and egress)
; tree
# show isis internal mcast routes ftag
; OIF per BD, VRF
# show isis internal mcast routes gipo
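The tree-selection rule above (FTAG = last bits of the outer multicast group) can be reproduced directly; using the last 4 bits for 16 trees is the assumption here:

```python
import ipaddress

def ftag_for_gipo(gipo):
    """FTAG tree number per the note above: the last 4 bits of the outer
    multicast group address select one of the 16 trees (12 in use)."""
    return int(ipaddress.IPv4Address(gipo)) & 0xF
```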
Load-balancing
- dynamic, per flow (flowlet)
- based on link and ASIC congestion
- info on ingress port on leaf is included into packets
- DRE:
- data redundancy elimination bits
- every hop updates DRE if own value is bigger than received
- max DRE is included into return packets to the same leaf (feedback)
- both leafs dynamically adjust weight ECMP based on DRE
- mouse flow (short-lived, delay sensitive) has priority over elephant flow (long-lived, BW intensive) within QoS level
- short flow < 15 packets
- flowlet:
- TCP burst, certain pause between them
- can be balanced onto different links without damage to the flow if the gap between flowlets is bigger than the max delay difference across paths – then no reordering is possible
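The flowlet boundary rule above (new flowlet when the inter-packet gap exceeds the worst-case path skew, so reordering cannot occur) can be sketched as:

```python
def split_flowlets(arrival_times, max_path_skew):
    """Group packet arrival times into flowlets: a new flowlet starts
    whenever the gap to the previous packet exceeds max_path_skew."""
    flowlets, current = [], [arrival_times[0]]
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev > max_path_skew:
            flowlets.append(current)       # safe to move next burst elsewhere
            current = []
        current.append(cur)
    flowlets.append(current)
    return flowlets
```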
vPC
- peers use anycast vPC VTEP without peer-link (replaced with ZMQ in fabric; IGMP, EPM sync)
- if downlink fails:
- available VTEP updates COOP with its physical VTEP
- on failed peer – bounce entry (not visible in CLI or GUI), once it’s removed (e.g. port operational) – EP Announce
- available VTEP sends traffic from the endpoint using physical VTEP ⇒ updates other leafs via dataplane
- if preferred group is used for L3Out on the peer, corresponding subnet might be absent; solution → recreate subnet on L3Out or reenable preferred group
- keepalive ≡ route to VTEP via IS-IS
- peers sync all EPs, including orphan
- EPGs on both peers must have same encap VLAN, otherwise different FD_VLAN ⇒ different VNID ⇒ EPM sync error
# show coop internal info ip-db
STP
- ACI does not participate in STP
- BPDU flood within EPG within VLAN ⇒ L2Out VLAN ≠ EPG VLAN in order not to flap EP DB
- MST:
- EPG in native VLAN for BPDU flood
- MST VLAN mapping for correct CAM flush on TCN (including spine proxy DB)
- on TCN – CAM flush within EPG ⇒ empties HW proxy DB ⇒ silent host problem
- if EPGs are different but access VLANs are the same, BPDUs do not pass through (BD traffic can flood through) ⇒ L2 loop
FC
- only FCoE to servers and FCF
- NPV only
- does not forward traffic through fabric
MCP
- miscabling protocol
- discovers leafs connected to each other – through LLDP
- sends frame in all VLANs on all ports, if such a frame received (same key) → err-disable whole link
- 256 VLAN per link (first ones, does not send frames on others)
L3Out
- routes from IGP are imported into BGP to be passed to other leafs ⇒ RR required on spine
- routes are installed on leafs if necessary (VRF is present)
- needs to be associated with BD
- for internal prefixes to be announced:
- subnet externally advertised: removes tag 0xFFFFFFFE (prohibits redistribution)
- BD association: creates prefix-list in route-map for MP-BGP → L3Out and static → L3Out
- contract between BD EPG and external EPG: creates a route on border if the route is not connected
- supports BFDv1 async, except BGP multihop
- can integrate via OpFlex with N7k and ASR9k for VRF-lite
- can announce /32 for BD to avoid asymmetric routing
- always redistributes static into MP-BGP, including BD subnets + connected
- does not support FEX
- does not configure RID automatically
- no EP mapping DB: VXLAN F&L + RIB + ARP tables; BD_EXT_VLAN – per L3Out + encap VLAN
- cannot use 0.0.0.0/0 as preferred group member ⇒ can bypass by 0.0.0.0/1 and 128.0.0.0/1
- if 1st gen leaf sends traffic to L3Out, it sets DL bit (HW issue) ⇒ packets refresh aging but not VTEP info ⇒ stale EP
- for traffic L3Out → compute DL bit is set ⇒ routed via MP-BGP without EP learning for VRF with ingress enforcement
- RD = PTEP:VRFID ⇒ leaf gets VPNv4 prefixes from all border leafs
- remote EP learning does not happen for traffic towards L3Out (DL bit); however, refreshes aging if towards local EP; same for traffic from L3Out
- classification into EPG via longest prefix match – per VRF, not per L3Out
- classification into EPG in SVI subnet – via ARP, pcTag=1, permit all; workaround – SVI subnet configured as EPG itself
- intra-EPG contract does not enable isolation by default in contrast to common EPG
- single IGP or BGP per L3Out (exception: OSPF + BGP with loopback exchange)
External subnets
- export route control:
- exact prefix/length match
- which transit routes (from different L3Out) to announce
- route-map + prefix-list
- alternative to BD association with L3Out
- import route control:
- exact prefix/length match
- which external prefixes to announce within fabric via BGP
- route-map (BGP, inbound; OSPF, table-map for filtering RIB)
- only OSPF and BGP
- external subnet for external EPG:
- ACL semantic, per VRF (not per L3Out)
- which external addresses to consider as external EPG
- assigns pcTag, removes tag (by default all pervasive routes have tag=0xFFFFFFFE)
- shared route control:
- controls if route can be leaked to other VRFs (when they define contract with external EPG)
- creates route-map for leaking: common for all VRFs, checks only prefix (no RT), that’s why routes from VRF1 can fall under aggregate in VRF2
- shared security import:
- allows applying contract in another VRF (dataplane permission, not control plane)
- contract filtering on border leaf
- assigns global pcTag
- aggregate export/import:
- adds “le 32” to 0.0.0.0/0, permitting all prefixes
- 0.0.0.0/0 only
- does not relate to summarization
- not applied to static route (only exact match) on another leaf
- aggregate shared routes:
- adds “le 32” to prefix-list
- non 0.0.0.0/0 are allowed
# vsh -c "show system internal policy-mgr prefix"
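The aggregate flags above effectively expand subnets into prefix-list entries; a sketch of that expansion (the scope flag names are hypothetical labels, not APIC object names):

```python
def prefix_list_entry(subnet, scope):
    """Expand a subnet into a prefix-list entry per the aggregate flags:
    aggregate export/import applies only to 0.0.0.0/0, aggregate shared
    routes may widen any prefix with 'le 32'; otherwise exact match."""
    if "aggregate-shared" in scope:
        return f"{subnet} le 32"
    if "aggregate-export" in scope or "aggregate-import" in scope:
        if subnet != "0.0.0.0/0":
            raise ValueError("aggregate export/import applies to 0.0.0.0/0 only")
        return "0.0.0.0/0 le 32"
    return subnet                          # exact prefix/length match
```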
OSPF
- always ASBR, may be ABR
- all external routes (E) are redistributed into BGP
- BGP is redistributed as E2
- areas on different leafs = different areas, even if the number is the same
- one OSPF process in VRF per leaf ⇒ different L3Out on one leaf ≡ different areas
BGP
- static route or OSPF for next-hop resolution (OSPF announces only loopback and intf subnet)
- route-map – whole L3Out, all peers
- announce 0.0.0.0/0: transit routing or default route leak policy
EIGRP
- 1 L3Out per border per VRF because of single process and single AS ⇒ 2 L3Out with EIGRP on same border are not allowed
Floating L3Out
- does not specify logical intf: supports routing to migrating VMs
- anchor node: IGP peering via L3Out to port-group in VMM
- floating IP – on SVI on the leaf (VNF is located behind the leaf), participates in dataplane; anchor node has its own IP
- next-hop propagation: anchor announces routes with next-hop = VTEP with VNF instead of its own
Encap mode
- local: default
- VRF:
- several L3Out on same external BD ≡ same encap VLAN
- L3Out per BGP peer
- several IGP on one SVI
Transit routing
- not all combinations of IGP/BGP
- IGP routes receive tag=0xFFFFFFFF, ACI ignores such routes as loop prevention
- applies contract on ingress leaf regardless of VRF inforcement
- 0.0.0.0/0 cannot be src or dst EPG within same L3Out – traffic is dropped because of policy (no contract between VRF → 15)
WAN integration
- GOLF: giant overlay forwarding
- L3Out in infra tenant, connected to spine (WAN ≡ border leaf)
- shared L3Out
- BGP EVPN route-type 5, host route leak via route-type 2
- N7k F3, ASR1k, ASR9k
- does not support mcast
IP SLA
- next-hop monitoring
- incompatible with track policy
Track
- IP monitoring, not necessarily next-hop
- route policy takes priority over next-hop policy
- incompatible with IP SLA
Route-map
- for BGP → IGP route-map is common for all IGP on border leaf within VRF
- OSPF and EIGRP redistribute each other's summary and discard routes because of the common route-map (IGP1 → BGP → IGP2)
- BGP → IGP = redistribution, BGP → BGP = outbound route-map per session
- types:
- match prefix and routing policy: adds prefix-list with export route control subnets to all matches; if route-map has prefix-list already – merge
- match routing policy only: ignores prefixes from L3Out
- where to apply:
- L3Out subnet: subnet scope
- L3Out EPG: all subnets in defined direction (including BD subnet with export route control)
- default-export: all subnets in defined direction + BDs with L3Out association (pervasive static route)
- match priority: subnet → EPG → default-export/import
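The match priority above can be sketched as a simple lookup (profile names like RM-SUBNET are made up):

```python
# Sketch of route-profile match priority: the most specific attachment
# point wins per prefix (L3Out subnet → L3Out EPG → default-export).
def profile_for(prefix, subnet_profiles, epg_profile=None, default_export=None):
    """subnet_profiles: {prefix: profile} set on individual L3Out subnets."""
    if prefix in subnet_profiles:
        return subnet_profiles[prefix]
    if epg_profile is not None:
        return epg_profile
    return default_export

# Subnet-level profile beats EPG-level and default-export:
print(profile_for("10.0.0.0/8", {"10.0.0.0/8": "RM-SUBNET"},
                  "RM-EPG", "RM-DEFAULT"))          # RM-SUBNET
# No more specific match -> default-export applies:
print(profile_for("192.168.0.0/16", {}, None, "RM-DEFAULT"))  # RM-DEFAULT
```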
QoS
- levels:
- 20% BW; Level 1; CoS = 2
- 20% BW; Level 2; CoS = 1
- 20% BW; Level 3; CoS = 0; default
- 0% BW; APIC traffic; CoS = 3
- 0% BW; SPAN; CoS = 4
- 0% BW; control plane, iTraceroute; CoS = 6; punted to CPU; copy (spine) or redirect (leaf) in TCAM on origin leaf
- tail drop, DWRR
- reserved:
- IFC (Insieme Fabric Controller): APIC originated/destined; strict priority
- Supervisor class: supervisor, control plane; strict priority
- SPAN: BE, DWRR; can be starved
- classified by DSCP, CoS, EPG, contracts
- by default, markings are ignored and rewritten
- QoS precedence:
- zone rule (contract): subject, then contract wide
- DSCP EPG
- CoS EPG
- default
- FEX resets CoS on egress
- by default leaf inserts CoS into VXLAN DSCP (CS) and backwards ⇒ changes client CoS but not DSCP (CS)
- applied on ingress, egress parameters do not matter (e.g. EPG QoS)
- contract QoS is applied on contract enforcement ⇒ EP has to be known (on spine-proxy/flood egress does not apply QoS)
- dot1p preserve writes client CoS into VXLAN IPP
- if client traffic is marked with CoS = 6 (dot1p preserve or IPN DSCP-CoS mapping), then the traffic is punted to CPU on egress leaf ⇒ drop
- DSCP class-to-CoS translation (infra → policies → protocol) is not compatible with dot1p preserve
- for L3Out contract requires VRF egress enforcement
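The default CoS-to-outer-DSCP behavior above can be sketched via the standard class-selector mapping (CSn ⇔ DSCP = CoS × 8, per RFC 2474; a sketch, not the actual ASIC logic):

```python
# Sketch of the default behavior: client 802.1p CoS is carried across the
# fabric in the outer VXLAN header as the matching class selector,
# DSCP CSn = CoS * 8; the egress leaf derives CoS back from it, which is
# why client CoS can change while the inner DSCP is left untouched.
def cos_to_outer_dscp(cos):
    assert 0 <= cos <= 7
    return cos * 8          # e.g. CoS 5 -> CS5 = DSCP 40

def outer_dscp_to_cos(dscp):
    return dscp >> 3        # class-selector bits = top 3 bits of DSCP

# Round trip is lossless for the 8 CoS values:
for cos in range(8):
    assert outer_dscp_to_cos(cos_to_outer_dscp(cos)) == cos
print(cos_to_outer_dscp(5))  # 40
```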
; buffer drops statistics
# vsh_lc -c "show platform internal counters port "
; per QoS class drops
# show queueing interface
; database policer drops
# show dpp policy
Firmware image
- APIC image
- switch image: one for all leaf/spine models
- catalog image: compatibility description, HW tests
- APICs are updated one at a time in random sequence (≈10 min each)
- n-group upgrade: split switches into groups for the update, lowers downtime risks
- can restrict max number of devices being updated at once
- only 1 peer out of vPC pair is upgraded at a time
- update is possible only if ACI is healthy (no convergence in APIC cluster)
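The per-group constraint (only one member of a vPC pair upgrades at a time) can be sketched as a split of switches into waves (switch names and the function are illustrative):

```python
# Sketch: split leaves into n upgrade groups so the two members of a vPC
# pair never land in the same group (i.e. never upgrade simultaneously).
def upgrade_groups(switches, vpc_pairs, n=2):
    peer = {}
    for a, b in vpc_pairs:
        peer[a], peer[b] = b, a
    groups = [[] for _ in range(n)]
    for sw in switches:
        # pick the least-loaded group that does not hold this switch's peer
        candidates = [g for g in groups if peer.get(sw) not in g]
        min(candidates, key=len).append(sw)
    return groups

print(upgrade_groups(["leaf101", "leaf102", "leaf103", "leaf104"],
                     [("leaf101", "leaf102"), ("leaf103", "leaf104")]))
```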
Faults
- phases:
- soaking: found, delay before raise to ignore intermittent issues
- soaking-cleared: false alarm; cleared while in soaking
- raised
- raised-clearing
- retaining: archive entry about fault
- if fault is acked by user, it is removed
- timers:
- soaking: 120s default
- clearing: 120s default
- retaining: 3600s default
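The fault lifecycle above can be sketched as a small state machine (a sketch with the default timers; the real transitions carry more detail):

```python
# Sketch of the fault lifecycle: soaking 120 s, clearing 120 s,
# retaining 3600 s by default; an ack removes the fault immediately.
SOAKING, SOAKING_CLEARED, RAISED, RAISED_CLEARING, RETAINING, DELETED = (
    "soaking", "soaking-cleared", "raised", "raised-clearing",
    "retaining", "deleted")

def next_state(state, condition_present, timer_expired, acked=False):
    """One transition: 'condition_present' = the underlying problem is still
    observed, 'timer_expired' = the current phase's timer ran out."""
    if acked:
        return DELETED
    if state == SOAKING:
        if not condition_present:
            return SOAKING_CLEARED              # false alarm during soaking
        return RAISED if timer_expired else SOAKING
    if state == RAISED:
        return RAISED if condition_present else RAISED_CLEARING
    if state == RAISED_CLEARING:
        if condition_present:
            return RAISED                       # condition came back
        return RETAINING if timer_expired else RAISED_CLEARING
    if state in (SOAKING_CLEARED, RETAINING):
        return DELETED if timer_expired else state
    return state
```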
Config rollback
- a snapshot does not save passwords unless a secure passphrase is set when the snapshot is created
AAA
- by default prefers in-band to out-of-band
- fallback (login domain) – APIC local profiles
Roles
- aaa: config AAA + import/export policy
- admin
- access-admin: access policies
- fabric-admin: fabric policy + external connectivity
- nw-svc-admin: L4-L7 insertion + orchestration
- nw-svc-params: parameters of L4-L7 devices
- ops: monitoring + tshoot
- read-all
- tenant-admin
- tenant-ext-admin: tenant external connectivity
- vmm-admin
- nw-svc-devpkg: device-package import for infra admin, RO for tenant admin
- nw-svc-policy: creating service graph
- nw-svc-device: creating device cluster
- nw-svc-devshare: export device cluster to another tenant
- RO or RW access is determined on user level
CoPP
- ASIC interface – between ASIC and CPU:
- knet0: 1st gen leaf; receive
- knet1: 1st gen leaf; transmit
- tahoe0: EX+, baby spine; Rx and Tx
- psdev1.1: modular spine; Rx and Tx
- restrictions on contract log:
- permit: 300 pps, PERMIT LOG
- deny: 500 pps, ACLLOG
; parse internal headers, debug control plane
# tcpdump -xxxi tahoe0 | knet_parser.py --decoder tahoe
; CoPP drops
# show system internal aclqos brcm copp entries unit 0
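The per-class log rate limits above (300/500 pps) can be sketched as a one-second-window policer (illustrative, not the actual CoPP implementation):

```python
# Sketch of the contract-log punt limits: 300 pps for PERMIT LOG,
# 500 pps for ACLLOG, modeled as a fixed one-second-window policer.
class LogPolicer:
    def __init__(self, pps):
        self.pps, self.window, self.sent = pps, 0, 0

    def allow(self, now):
        """Return True if a log punt at time 'now' (seconds) still fits the
        per-second budget; excess punts within the window are dropped."""
        win = int(now)
        if win != self.window:
            self.window, self.sent = win, 0
        if self.sent < self.pps:
            self.sent += 1
            return True
        return False

permit_log = LogPolicer(300)
hits = sum(permit_log.allow(0.001 * i) for i in range(1000))
print(hits)  # 300 -- only 300 of 1000 punts in the first second get through
```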
Atomic counters
- mark bit – iVXLAN bit 56
- process:
- mark bit = 0 in all packets
- all SW clear statistics for mark bit = 1
- all SW start sending traffic with mark bit = 1, clear statistics for mark bit = 0
- after 30s all SW send traffic with mark bit = 0 but keep counting mark bit = 1 (packets still in flight)
- APIC collects statistics
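The mark-bit mechanics above can be sketched as follows (bit position from the notes; a toy model of the per-bit-value counters):

```python
# Sketch of atomic counters: senders toggle iVXLAN header bit 56 while
# receivers keep a counter per bit value, so the two generations of
# in-flight packets are counted separately and can be compared by APIC.
MARK_BIT = 1 << 56

def set_mark(header, bit):
    return header | MARK_BIT if bit else header & ~MARK_BIT

def get_mark(header):
    return (header >> 56) & 1

counters = {0: 0, 1: 0}
for hdr in (set_mark(0, 1) for _ in range(5)):
    counters[get_mark(hdr)] += 1
print(counters)  # {0: 0, 1: 5}
```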