EIGRP SIA – why?

You very likely already know what the EIGRP stuck-in-active (SIA) feature is. Just a quick recap: if a router does not get a Reply to a previously sent Query within the Active timer (3 minutes by default), it tears down the adjacency with the “stuck” neighbour; in the meantime the router probes that neighbour with SIA-Query, resetting the Active timer if the neighbour answers with SIA-Reply. Sounds simple, right? Just another failsafe to protect the network from a router that might go haywire. Let me ask you a long multi-part question though:

Why is SIA required, to the point that there is no way to disable it? Isn’t it enough to let the Holddown timer expire for the stuck neighbour and consider its Reply unnecessary?

Well, the answer really depends on the viewpoint (Cisco’s “it depends”, uh-huh). Let’s look at an example:

In such a setup there is absolutely no way SIA would be needed. Let’s imagine that R3 stops sending EIGRP packets for some reason and 1.1.1.1/32 on R1 goes down:

  1. R1 would send a Query for 1.1.1.1/32 to R2;
  2. R2 would send a Query for 1.1.1.1/32 to R3 but would never get a Reply;
  3. There would be a few unsuccessful EIGRP retransmits from R2 towards R3;
  4. Either the Holddown timer expires (15s by default) or the number of retransmits reaches 16 (only Cisco knows how long that takes);
  5. R2 tears down the neighbourship with R3 and sends a Reply back to R1;
  6. The Active timer on R1 (3 minutes) never comes even close to expiring, so 1.1.1.1/32 in Active state is removed.
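
The timing in the steps above can be sketched with a few lines of arithmetic. This is a toy model rather than IOS behaviour: the function name and the per-hop delay are assumptions made up for illustration, and the retransmit limit from step 4 is ignored.

```python
ACTIVE_TIMER = 180.0  # seconds, default Active timer on R1 (3 minutes)
HOLD_TIMER = 15.0     # seconds, default value quoted in step 4


def reply_arrival_at_r1(hop_delay: float, hold_timer: float = HOLD_TIMER) -> float:
    """Seconds after R1 sends its Query at which R2's Reply reaches R1.

    Toy assumption: R3's last Hello reached R2 just as the Query arrived,
    so R2 waits a full hold interval before dropping R3 (worst case).
    """
    query_reaches_r2 = hop_delay
    r2_drops_r3 = query_reaches_r2 + hold_timer
    return r2_drops_r3 + hop_delay  # the Reply travels back to R1


arrival = reply_arrival_at_r1(hop_delay=0.001)  # hypothetical LAN-like delay
print(f"Reply reaches R1 after {arrival:.3f}s; Active timer is {ACTIVE_TIMER:.0f}s")
print("R1 goes stuck-in-active:", arrival > ACTIVE_TIMER)
```

With a 15-second hold time and negligible delay, the Reply arrives more than two and a half minutes before the Active timer would fire.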

Remember, however, that EIGRP was designed a really long time ago, when serial links were ubiquitous. The most important property of these links for this discussion is their relatively long distance and, as a result, high delay. Although serial links are actively being upgraded, there is still a similar kind of connection: radio links. Consider the following setup:

The only non-default thing is the serial link using Frame-Relay for encapsulation.

R1#sho run | s interface|router
interface Loopback0
 ip address 1.1.1.1 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.1 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.14.1 255.255.255.0
router eigrp 1
 network 0.0.0.0
R2#show run | section interface|router
interface Loopback0
 ip address 2.2.2.2 255.255.255.255
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface Serial4/0
 ip address 192.168.23.2 255.255.255.0
 encapsulation frame-relay
 no keepalive
 frame-relay interface-dlci 100
router eigrp 1
 network 0.0.0.0
R3#show run | section interface|router
interface Loopback0
 ip address 3.3.3.3 255.255.255.255
interface Serial4/0
 ip address 192.168.23.3 255.255.255.0
 encapsulation frame-relay
 no keepalive
 frame-relay interface-dlci 100
router eigrp 1
 network 0.0.0.0

Let’s try to run the scenario without SIA involved. The feature was introduced in the 12.1(5) release, so any 12.0 software should do. Although we cannot drop Queries specifically, we can discard all unicast packets on R3, which drops the Queries while still accepting the multicast Hellos. As a result, R2 would consider R3 to have failed based on the Active timer (180 seconds by default) and not on the Holddown timer (also 180 seconds by default on this Frame-Relay link). Although it looks like a rigged setup at first glance, I suggest bearing with it for a while.

R3#show ip access-lists
Extended IP access list NOUNICAST
    10 permit ip any 224.0.0.0 15.255.255.255
    20 deny ip any any
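
To double-check what this ACL lets through, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `acl_permits` is made up for illustration, and only the destination is modelled because both entries match any source.

```python
import ipaddress

# Entry 10: permit ip any 224.0.0.0 15.255.255.255
# A wildcard mask is the bitwise inverse of a netmask: 15.255.255.255 <-> 240.0.0.0 (/4).
WILDCARD = ipaddress.IPv4Address("15.255.255.255")
NETMASK = ipaddress.IPv4Address(int(WILDCARD) ^ 0xFFFFFFFF)
PERMITTED = ipaddress.IPv4Network(f"224.0.0.0/{NETMASK}")


def acl_permits(destination: str) -> bool:
    """Return True if NOUNICAST permits a packet sent to this destination."""
    # Anything outside the multicast range falls through to entry 20 (deny).
    return ipaddress.IPv4Address(destination) in PERMITTED


print(PERMITTED)                    # 224.0.0.0/4
print(acl_permits("224.0.0.10"))    # EIGRP multicast group: True
print(acl_permits("192.168.23.3"))  # unicast to R3's Serial4/0: False
```

Hellos go to 224.0.0.10 and therefore pass, while anything addressed directly to 192.168.23.3 is dropped.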

Now, let’s bring down 1.1.1.1/32 and activate the ACL on R3:

R3(config)#interface s4/0
R3(config-if)#ip access-group NOUNICAST in
R1(config)#interface lo0
R1(config-if)#sh

Now R1 considers the route to be in Active state.

R1# show ip eigrp topology active
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)

Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
       r - Reply status

A 1.1.1.1/32, 1 successors, FD is Infinity
    1 replies, active 00:00:07, query-origin: Local origin
      Remaining replies:
         via 192.168.12.2, r, FastEthernet0/0

After 3 minutes R1 should flush the route because by that moment it has received no Reply from R2 as there was no response from R3. However, this is not the case:

R1#show ip eigrp topology active 
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)

Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
       r - Reply status

A 1.1.1.1/32, 1 successors, FD is Inaccessible
    1 replies, active 00:03:05, query-origin: Local origin
         via Connected (Infinity/Infinity), Loopback0
    Remaining replies:
         via 192.168.12.2, r, FastEthernet0/0
R1#show ip eigrp topology active 
IP-EIGRP Topology Table for AS(1)/ID(1.1.1.1)

Is there anything wrong with the configuration? I don’t think so. However, let’s get back to the failure condition based on the Active timer instead of the Holddown timer. Imagine that there are a bunch of other routers between R1 and R2, all using serial links and thus contributing to the overall delay. Could the gap between 1.1.1.1/32 going down (starting the Active timer) and the last Hello from R3 arriving (refreshing the Holddown timer) be small enough to be covered completely by that delay? Definitely so:

  1. Although R2 might terminate the neighbourship with R3 after 180 seconds, there is still a propagation delay before that event reaches R1.
  2. With a bit of “luck”, the last Hello and the disappearance of 1.1.1.1/32 would line up.
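
The race described in these two points can be written down as a small inequality check. This is a simplified sketch: the function and parameter names are hypothetical, both timers are taken as 180 seconds as in the Frame-Relay setup, and query retransmissions are ignored.

```python
ACTIVE_TIMER = 180.0  # seconds, R1's Active timer
HOLD_TIMER = 180.0    # seconds, R2's Holddown timer for R3


def r1_gets_stuck(last_hello_before_failure: float, path_delay: float) -> bool:
    """True if R2's Reply reaches R1 only after R1's Active timer has expired.

    last_hello_before_failure: seconds between R3's last Hello reaching R2
        and 1.1.1.1/32 going down on R1 (which happens at t = 0).
    path_delay: one-way delay from R2 back to R1 (hypothetical value).
    """
    hold_expires_on_r2 = HOLD_TIMER - last_hello_before_failure
    reply_reaches_r1 = hold_expires_on_r2 + path_delay
    return reply_reaches_r1 > ACTIVE_TIMER


# Last Hello 0.5s before the failure, 2s of accumulated serial-link delay.
print(r1_gets_stuck(last_hello_before_failure=0.5, path_delay=2.0))
# Last Hello 5s before the failure: the Reply makes it back in time.
print(r1_gets_stuck(last_hello_before_failure=5.0, path_delay=2.0))
```

In this model R1 gets stuck exactly when the gap between the last Hello and the failure is smaller than the delay on the path back to R1.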

As soon as R2 prepares the Reply to be sent back to R1, the Active timer on R1 expires and R1 resets the neighbourship with R2, at least according to the description of DUAL. As you can imagine, such behaviour causes chain flapping of EIGRP neighbourships all around the network, just because there are high-delay links and one malfunctioning router.

So why did we filter only unicast packets instead of dropping all EIGRP datagrams? Well, that would have required me to trigger the events right after the last Hello from R3 was received. Although that is possible with some automation, using the Active timer instead removed the delay between my brain and the keyboard from the equation completely while still giving us the same result.

However, that’s not what we observed during the test. I’ll have to speculate a little here as I don’t have a strict explanation for it, only a guess.

  1. It’s possible to alleviate the problem by increasing the gap between the default values of the Active and Holddown timers. However, the feasibility of such a method really depends on the total delay between the routers, so I’d consider it a workaround. It seems that IOS 12.0 implements exactly this behaviour; version 11 could have provided different results but I could not find the image.
  2. The proper solution to the problem at hand is SIA. The idea is simple: separate the prefix availability check (Query) from the neighbour availability check (SIA-Query). Unlike timer tuning, such an approach has no tangible dependency on the total delay. Besides, it is generally a good idea to separate functions and not to overload them extensively.
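
The separation in the second point can be illustrated with a rough sketch. The numbers here are assumptions not taken from the lab above: an SIA-Query is sent at half the Active timer (90 seconds) and at most three times, which is how RFC 7868 is commonly summarised. The function name and return strings are made up for illustration.

```python
ACTIVE_TIMER = 180.0
SIA_INTERVAL = ACTIVE_TIMER / 2  # assumption: SIA-Query goes out at half the Active timer
MAX_SIA_QUERIES = 3              # assumption: retry limit, as commonly summarised from RFC 7868


def outcome(reply_at: float, neighbour_answers_sia: bool) -> str:
    """Decide what the querying router does about one outstanding Reply.

    reply_at: seconds after the Query at which the real Reply arrives.
    neighbour_answers_sia: whether the neighbour responds to SIA-Query.
    """
    now = 0.0
    for _ in range(MAX_SIA_QUERIES):
        now += SIA_INTERVAL
        if reply_at <= now:
            return f"Reply received at {reply_at:.0f}s, route leaves Active state"
        # Still Active: send SIA-Query to check the neighbour itself.
        if not neighbour_answers_sia:
            return f"neighbour reset at {now + SIA_INTERVAL:.0f}s (no SIA-Reply)"
    now += SIA_INTERVAL
    if reply_at <= now:
        return f"Reply received at {reply_at:.0f}s, route leaves Active state"
    return f"neighbour reset at {now:.0f}s (SIA retries exhausted)"


print(outcome(reply_at=181, neighbour_answers_sia=True))
print(outcome(reply_at=181, neighbour_answers_sia=False))
print(outcome(reply_at=float("inf"), neighbour_answers_sia=True))
```

In this sketch a Reply that arrives one second "too late" no longer costs a healthy neighbour its adjacency, while a neighbour that answers nothing at all is still dropped.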

Does it really matter in the modern world, especially since SIA cannot be disabled? Most likely not, to be honest, unless you run a very outdated IOS version (SIA would be the least of your concerns in such a case though). Understanding the reason why a feature was implemented makes me feel good, so maybe such knowledge will make someone else feel good as well.

Kudos for review: Anastasiia Kuraleva
