MPLS L3VPN over DMVPN revisited

One of the previous articles discussed a way to implement L3VPN over DMVPN that allows direct spoke-to-spoke communication without traversing the hub. One of the tricks of that setup is internal BGP labeled unicast (iBGP LU) for label distribution. At first, I thought that switching to external BGP (eBGP) would be just copy-paste with a few well-known adjustments; however, the lab proved me wrong.

Since most of the technology around DMVPN and L3VPN remains the same, we could reduce the task to implementing eBGP LU over a single L2-segment:

[Figure: topology]

Switch1 merely emulates the DMVPN cloud, since we are free to abstract from its internals. CSR3 plays the same role as the DMVPN hub did: distributing prefixes between the spokes, R1 and R2. I usually use Cisco 7200 images for labs; however, this one needs a newer IOS, at least IOS XE Denali 16.3.1. The initial configuration is simple enough:

CSR3(config)# interface Loopback0
CSR3(config-if)# ip address 3.3.3.3 255.255.255.255
CSR3(config)# interface GigabitEthernet1
CSR3(config-if)# ip address 192.168.0.3 255.255.255.0
CSR3(config-if)# mpls bgp forwarding
R1(config)# interface Loopback0
R1(config-if)# ip address 1.1.1.1 255.255.255.255
R1(config)# interface GigabitEthernet1
R1(config-if)# ip address 192.168.0.1 255.255.255.0
R1(config-if)# mpls bgp forwarding
R2(config)# interface Loopback0
R2(config-if)# ip address 2.2.2.2 255.255.255.255
R2(config)# interface GigabitEthernet1
R2(config-if)# ip address 192.168.0.2 255.255.255.0
R2(config-if)# mpls bgp forwarding

We will be using eBGP, which is why CSR3 cannot act as a route reflector for R1 and R2. However, since every router is connected to the same L2 segment, CSR3 does not change the next-hop of prefixes exchanged between neighbors in that segment. Note that this behavior is optional according to the BGP RFC and might differ across software, devices and vendors:

When sending a message to an external peer, X, and the peer is one IP hop away from the speaker:
<irrelevant part omitted>
Otherwise, if the route being announced was learned from an external peer, the speaker can use an IP address of any adjacent router (known from the received NEXT_HOP attribute) that the speaker itself uses for local route calculation in the NEXT_HOP attribute, provided that peer X shares a common subnet with this address. This is a second form of “third party” NEXT_HOP attribute.

BGP RFC4271, page 27
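
To make the quoted rule more tangible, here is a minimal Python sketch of the decision as I read it. The function name and its parameters are my own illustration, not part of any BGP implementation, and the sketch ignores every other consideration a real speaker applies:

```python
import ipaddress


def advertised_next_hop(received_next_hop, speaker_interface, peer_address):
    """Choose NEXT_HOP for a route learned from one eBGP peer and sent to another.

    speaker_interface is the speaker's address with prefix length,
    e.g. "192.168.0.3/24". All names here are illustrative assumptions.
    """
    iface = ipaddress.ip_interface(speaker_interface)
    next_hop = ipaddress.ip_address(received_next_hop)
    peer = ipaddress.ip_address(peer_address)

    # "Third party" next-hop: allowed only if the original next-hop and the
    # receiving peer sit in the subnet the speaker itself is attached to.
    if next_hop in iface.network and peer in iface.network:
        return str(next_hop)
    # Otherwise fall back to the speaker's own interface address.
    return str(iface.ip)


# CSR3 relays R2's prefix to R1: everyone shares 192.168.0.0/24.
print(advertised_next_hop("192.168.0.2", "192.168.0.3/24", "192.168.0.1"))
# A next-hop outside the shared segment cannot be kept.
print(advertised_next_hop("10.0.0.2", "192.168.0.3/24", "192.168.0.1"))
```

With the lab addressing, the first call keeps 192.168.0.2 as the next-hop, while the second falls back to CSR3's own address, 192.168.0.3.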

One last thing before BGP configuration: R1 and R2 use the same ASN for the sake of scalability so we have to account for BGP loop prevention logic as well. Let’s enable BGP IPv4 address family without LU first:

CSR3(config)#router bgp 3
CSR3(config-router)# bgp router-id 3.3.3.3
CSR3(config-router)# bgp listen range 192.168.0.0/24 peer-group DMVPN
CSR3(config-router)# no bgp default ipv4-unicast
CSR3(config-router)# neighbor DMVPN peer-group
CSR3(config-router)# neighbor DMVPN remote-as 12
CSR3(config-router)# address-family ipv4
CSR3(config-router-af)# network 3.3.3.3 mask 255.255.255.255
CSR3(config-router-af)# neighbor DMVPN activate
R1(config)#router bgp 12
R1(config-router)# bgp router-id 1.1.1.1
R1(config-router)# no bgp default ipv4-unicast
R1(config-router)# neighbor 192.168.0.3 remote-as 3
R1(config-router)# address-family ipv4
R1(config-router-af)# network 1.1.1.1 mask 255.255.255.255
R1(config-router-af)# neighbor 192.168.0.3 activate
R1(config-router-af)# neighbor 192.168.0.3 allowas-in 
R2(config)#router bgp 12
R2(config-router)# bgp router-id 2.2.2.2
R2(config-router)# no bgp default ipv4-unicast
R2(config-router)# neighbor 192.168.0.3 remote-as 3
R2(config-router)# address-family ipv4
R2(config-router-af)# network 2.2.2.2 mask 255.255.255.255
R2(config-router-af)# neighbor 192.168.0.3 activate
R2(config-router-af)# neighbor 192.168.0.3 allowas-in 1
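
The reason for allowas-in can be sketched in a few lines of Python. This is a simplified model with hypothetical names of my own (passes_loop_check, allowas_in); it only counts occurrences of the local ASN in the received AS_PATH and is not how IOS is actually implemented:

```python
def passes_loop_check(as_path, local_asn, allowas_in=0):
    """True if a received AS_PATH survives eBGP loop prevention.

    allowas_in is how many occurrences of local_asn are tolerated;
    0 models the behaviour when the knob is not configured.
    """
    return as_path.count(local_asn) <= allowas_in


path_from_csr3 = [3, 12]  # R2's loopback as R1 would receive it: "3 12"
print(passes_loop_check(path_from_csr3, local_asn=12))                # False
print(passes_loop_check(path_from_csr3, local_asn=12, allowas_in=1))  # True
```

Without the knob, R1 would discard R2's prefix because its own ASN 12 is already in the path; tolerating one occurrence is enough for this topology.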

At this point we would expect the BGP prefixes on R1 and R2 to have direct next-hops instead of the CSR3 address:

R1#show ip bgp         
BGP table version is 4, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  1.1.1.1/32       0.0.0.0                  0         32768 i
 *>  2.2.2.2/32       192.168.0.2                            0 3 12 i
 *>  3.3.3.3/32       192.168.0.3              0             0 3 i
R1#
R1#traceroute 2.2.2.2 source lo0
Type escape sequence to abort.
Tracing the route to 2.2.2.2
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.0.2 12 msec 8 msec 8 msec

Now it’s time to enable BGP LU in IPv4 AF:

CSR3(config-router-af)#neighbor DMVPN send-label
R1(config-router-af)#neighbor 192.168.0.3 send-label
R2(config-router-af)#neighbor 192.168.0.3 send-label

Sanity check whether BGP converged to something we would expect:

R1#sho ip bgp
BGP table version is 12, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *   1.1.1.1/32       192.168.0.3                            0 3 12 i
 *>                   0.0.0.0                  0         32768 i
 *>  2.2.2.2/32       192.168.0.3                            0 3 12 i
 *>  3.3.3.3/32       192.168.0.3              0             0 3 i
R1#traceroute 2.2.2.2 source lo0
Type escape sequence to abort.
Tracing the route to 2.2.2.2
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.0.3 [MPLS: Label 16 Exp 0] 8 msec 8 msec 12 msec
  2 192.168.0.2 12 msec 20 msec 12 msec

We’ve got a problem suitable for a Friday evening debug: not only have we increased the number of prefixes in the BGP RIB (take a look at 1.1.1.1/32), but the path from R1 to R2 has also become suboptimal.

As you might remember from the first article, a BGP speaker assigns a label to a prefix only if it considers itself to be the next-hop for that prefix; this way the speaker ensures that it is part of the LSP. That is consistent with CSR3 now advertising itself as the next-hop for the labeled prefixes. Let’s explicitly specify that the next-hop should not be changed:

CSR3(config-router-af)#neighbor DMVPN next-hop-unchanged 
%BGP: Can propagate the nexthop only to multi-hop EBGP neighbor or iBGP VRF CE lite

No luck, just as the corresponding section of the Configuration Guide describes. However, there is a second option, a route-map, although it should be subject to the same restrictions as the neighbor-level command:

CSR3(config)#route-map NH_UNCHANGED
CSR3(config-route-map)#set ip next-hop unchanged 
CSR3(config)#router bgp 3
CSR3(config-router)#address-family ipv4
CSR3(config-router-af)#neighbor DMVPN route-map NH_UNCHANGED out
R1#show ip bgp
BGP table version is 14, local router ID is 1.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal, 
              r RIB-failure, S Stale, m multipath, b backup-path, f RT-Filter, 
              x best-external, a additional-path, c RIB-compressed, 
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

     Network          Next Hop            Metric LocPrf Weight Path
 *>  1.1.1.1/32       0.0.0.0                  0         32768 i
 *>  2.2.2.2/32       192.168.0.2                            0 3 12 i
 *>  3.3.3.3/32       192.168.0.3              0             0 3 i
R1#
R1#traceroute 2.2.2.2 source lo0
Type escape sequence to abort.
Tracing the route to 2.2.2.2
VRF info: (vrf in name/id, vrf out name/id)
  1 192.168.0.2 12 msec 16 msec 8 msec

For some reason the route-map variant works even though, according to the documentation, it should not. I remember finding a note, specific to BGP LU, saying that only a route-map can be used to change the next-hop in such a case; however, I cannot find it now, so if anyone knows the source, please share.


UPD. A more appropriate way would be to use a route-server configuration instead of a cryptic route-map. Why wasn’t it used in the first place? I simply didn’t think about it 🙂


OK, we have restored the direct path. What about that extra route in the BGP RIB? Why was it present, and why is it gone now? The first part is pretty easy to answer: the mechanism that sends a prefix back to the BGP speaker that originated it is called an “update group” in Cisco IOS:

CSR3#sho ip bgp update-group 
BGP version 4 update-group 3, external, Address Family: IPv4 Unicast
  BGP Update version : 15/0, messages 0, active RGs: 1
  Route map for outgoing advertisements is NH_UNCHANGED
  Sending Prefix & Label
  Topology: global, highest version: 15, tail marker: 15
  Format state: Current working (OK, last minimum advertisement interval)
                Refresh blocked (not in list, last not in list)
  Update messages formatted 9, replicated 12, current 0, refresh 0, limit 1000
  Number of NLRIs in the update sent: max 1, min 0
  Minimum time between advertisement runs is 30 seconds
  Has 2 members:
   *192.168.0.1     *192.168.0.2 

It basically means that IOS assembles one BGP update per group instead of per peer in order to conserve resources (imagine processing a full view). We used to have an option to control this behavior via peer-groups, but it was optimized away long ago. R1 and R2 are in the same group, so they get the same set of prefixes, including their own. CSR3 rewrote the next-hop for BGP LU updates, so such a looped-back update was perfectly valid from the BGP point of view on R1 and R2. However, why don’t we see these prefixes when the next-hop is not rewritten? Shall we ask the actors?

R1#debug ip bgp updates 
BGP updates debugging is on for address family: IPv4 Unicast
R1# 
R1#clear ip bgp * soft
<debug omitted>
*Dec  5 17:16:11.831: BGP(0): 192.168.0.3 rcv UPDATE about 1.1.1.1/32 -- DENIED due to: NEXTHOP is our own address;
<debug omitted>

It turns out to be rather simple: if a prefix arrives with one of the BGP speaker’s own addresses as the next-hop, the router drops it. As you might imagine, the same process takes place with BGP RR updates.
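
Both effects, the shared update and the inbound next-hop check, can be modelled with a toy Python sketch. The names (build_group_update, receive) are made up for illustration, and the model assumes the AS_PATH loop check has already been relaxed with allowas-in:

```python
CSR3 = "192.168.0.3"
# Prefix -> address of the router that originated it (lab addressing).
ORIGINATORS = {
    "1.1.1.1/32": "192.168.0.1",
    "2.2.2.2/32": "192.168.0.2",
    "3.3.3.3/32": CSR3,
}
MEMBERS = ["192.168.0.1", "192.168.0.2"]  # R1 and R2 share one update group


def build_group_update(next_hop_unchanged):
    """Format a single update for the whole group, as an update group would."""
    return [
        (prefix, nh if next_hop_unchanged else CSR3)
        for prefix, nh in ORIGINATORS.items()
    ]


def receive(update, own_address):
    """Drop routes whose NEXT_HOP is the receiver's own address."""
    return [prefix for prefix, nh in update if nh != own_address]


for unchanged in (False, True):
    update = build_group_update(unchanged)
    for member in MEMBERS:  # the same update is replicated to every member
        print(unchanged, member, receive(update, member))
```

With the next-hop rewritten to CSR3, each spoke accepts all three prefixes, its own included; with the next-hop preserved, each spoke silently discards its own prefix and keeps the other two.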

Let’s get back to the MPLS part:

R1#ping 2.2.2.2 source lo0
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 2.2.2.2, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 8/11/20 ms
R1#
R1#ping mpls ipv4 2.2.2.2/32 source 1.1.1.1
Sending 5, 100-byte MPLS Echos to 2.2.2.2/32, 
     timeout is 2 seconds, send interval is 0 msec:

Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
  'L' - labeled output interface, 'B' - unlabeled output interface, 
  'D' - DS Map mismatch, 'F' - no FEC mapping, 'f' - FEC mismatch,
  'M' - malformed request, 'm' - unsupported tlvs, 'N' - no label entry, 
  'P' - no rx intf label prot, 'p' - premature termination of LSP, 
  'R' - transit router, 'I' - unknown upstream index,
  'X' - unknown return code, 'x' - return code 0

Type escape sequence to abort.
QQQQQ
Success rate is 0 percent (0/5)
R1#
R1#show ip cef 2.2.2.2/32 detail 
2.2.2.2/32, epoch 0, flags rib only nolabel, rib defined all labels
  recursive via 192.168.0.2
    attached to FastEthernet0/0
R1#
R1#sho mpls forwarding-table 
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface

As you can see, there is no MPLS connectivity between R1 and R2. The reason: no labels are present. Since we use BGP for label distribution, there must be something wrong with the update. Let’s check R1’s point of view:

R1#show ip bgp labels 
   Network          Next Hop      In label/Out label
   1.1.1.1/32       0.0.0.0         imp-null/nolabel
   2.2.2.2/32       192.168.0.2     nolabel/nolabel
   3.3.3.3/32       192.168.0.3     nolabel/nolabel

R1 receives no labels at all: neither the one originated by CSR3 itself nor the one originated by R2 and relayed by CSR3. However, CSR3 has correct labels in BGP:

CSR3#show ip bgp labels 
   Network          Next Hop      In label/Out label
   1.1.1.1/32       192.168.0.1     nolabel/imp-null
   2.2.2.2/32       192.168.0.2     nolabel/imp-null
   3.3.3.3/32       0.0.0.0         nolabel/nolabel

It seems that CSR3 stopped sending labels to its peers once the route-map was applied, so let’s try to fix this with the route-map once again, as we did for the next-hop:

CSR3(config)#route-map NH_UNCHANGED permit 10 
CSR3(config-route-map)#set mpls-label 
CSR3#clear ip bgp * soft

Peeking at R1 BGP table should reveal whether anything has changed:

R1#show ip bgp labels 
   Network          Next Hop      In label/Out label
   1.1.1.1/32       0.0.0.0         imp-null/nolabel
   2.2.2.2/32       192.168.0.2     nolabel/imp-null
   3.3.3.3/32       192.168.0.3     nolabel/nolabel

As a matter of fact, R1 does receive the correct label for 2.2.2.2/32; however, the label for 3.3.3.3/32 is still missing. For some reason CSR3 ignores the label for the prefix that it injected into BGP itself:

CSR3#show ip bgp labels
   Network          Next Hop      In label/Out label
   1.1.1.1/32       192.168.0.1     nolabel/imp-null
   2.2.2.2/32       192.168.0.2     nolabel/imp-null
   3.3.3.3/32       0.0.0.0         nolabel/nolabel
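
Comparing these tables by eye gets old quickly. Below is a minimal Python sketch that parses `show ip bgp labels` text and lists learned prefixes that lack an outgoing label. The helper names are my own, the parser assumes exactly the three-column layout shown in this article, and the sample is R1’s output from above:

```python
SAMPLE = """\
   Network          Next Hop      In label/Out label
   1.1.1.1/32       0.0.0.0         imp-null/nolabel
   2.2.2.2/32       192.168.0.2     nolabel/imp-null
   3.3.3.3/32       192.168.0.3     nolabel/nolabel
"""


def parse_bgp_labels(text):
    """Turn 'show ip bgp labels' rows into dictionaries; skip the header."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have three columns, the last being "in/out".
        if len(fields) != 3 or fields[2].count("/") != 1:
            continue
        in_label, out_label = fields[2].split("/")
        rows.append(
            {"prefix": fields[0], "next_hop": fields[1], "in": in_label, "out": out_label}
        )
    return rows


def missing_out_labels(rows):
    """Prefixes learned from a peer (next-hop not 0.0.0.0) with no out label."""
    return [
        r["prefix"] for r in rows
        if r["next_hop"] != "0.0.0.0" and r["out"] == "nolabel"
    ]


print(missing_out_labels(parse_bgp_labels(SAMPLE)))
```

For the sample above it reports only 3.3.3.3/32, which matches what we see in the table: the locally originated 1.1.1.1/32 is skipped on purpose, since it never has an outgoing label.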

Disclaimer: the rest of the article is based purely on trial and error, with no official documentation to back it up.

Once again, let’s try some route-map voodoo. 3.3.3.3/32 falls under the single entry in NH_UNCHANGED; however, it has no “other” next-hop except CSR3 itself, so “next-hop unchanged” might be triggering some exception within the code that results in the MPLS label not being added to the BGP update. I will use a prefix-list to identify the prefixes that are originated by the DMVPN hub:

CSR3(config)#ip prefix-list LOCAL permit 3.3.3.3/32          
CSR3(config)#route-map NH_UNCHANGED permit 5                 
CSR3(config-route-map)#match ip address prefix-list LOCAL              
CSR3(config-route-map)#set mpls-label 
CSR3#clear ip bgp * soft
R1#show ip bgp labels 
   Network          Next Hop      In label/Out label
   1.1.1.1/32       0.0.0.0         imp-null/nolabel
   2.2.2.2/32       192.168.0.2     nolabel/imp-null
   3.3.3.3/32       192.168.0.3     nolabel/imp-null
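
To recap what NH_UNCHANGED looks like now, here is a small Python model of first-match route-map evaluation. It is only a sketch: the data structure and the evaluate function are my own illustration, and it assumes the entry created without an explicit sequence number is "permit 10", which is the IOS default:

```python
ROUTE_MAP = [
    # (sequence, matched prefixes or None for "match everything", set actions)
    (10, None, ["ip next-hop unchanged", "mpls-label"]),
    (5, {"3.3.3.3/32"}, ["mpls-label"]),
]


def evaluate(route_map, prefix):
    """Return (sequence, actions) of the first permit entry matching prefix."""
    # Entries are evaluated in ascending sequence order; the first match wins.
    for seq, matched, actions in sorted(route_map, key=lambda entry: entry[0]):
        if matched is None or prefix in matched:
            return seq, actions
    return None  # implicit deny at the end of a route-map


for prefix in ("1.1.1.1/32", "2.2.2.2/32", "3.3.3.3/32"):
    print(prefix, evaluate(ROUTE_MAP, prefix))
```

The hub's own prefix is caught by sequence 5 and only gets a label, while the spoke prefixes fall through to sequence 10 and get both the preserved next-hop and a label.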

Voila! All labels are finally in place. Is R1 able to build LSP towards 2.2.2.2/32?

R1#ping mpls ipv4 2.2.2.2/32 source 1.1.1.1
Sending 5, 100-byte MPLS Echos to 2.2.2.2/32, 
     timeout is 2 seconds, send interval is 0 msec:

Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
  'L' - labeled output interface, 'B' - unlabeled output interface, 
  'D' - DS Map mismatch, 'F' - no FEC mapping, 'f' - FEC mismatch,
  'M' - malformed request, 'm' - unsupported tlvs, 'N' - no label entry, 
  'P' - no rx intf label prot, 'p' - premature termination of LSP, 
  'R' - transit router, 'I' - unknown upstream index,
  'X' - unknown return code, 'x' - return code 0

Type escape sequence to abort.
QQQQQ
Success rate is 0 percent (0/5)
R1#
R1#show ip cef 2.2.2.2/32 det
2.2.2.2/32, epoch 0, flags rib defined all labels
  recursive via 192.168.0.2
    attached to FastEthernet0/0
R1#
R1#show mpls forwarding-table 2.2.2.2   
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
None       No Label   2.2.2.2/32       0             Fa0/0      192.168.0.2 

The LSP is not built: R1 does not install the labels into the LFIB although they are received via BGP. All the information needed for the LSP is present, except, probably, for the use case: R1 is expected to pass the traffic on using LDP, as there should be no P router that runs BGP only. Please note that this is just a guess deduced from the configuration that happened to work. Time to unveil the voodoo doll:

R1(config)#int f0/0.1
R1(config-subif)#mpls ip
R1(config)#router ospf 1
R1(config-router)#redistribute bgp 12 subnets

This way the labels show up in the LFIB, and the OSPF database remains empty because no interfaces are enabled for the OSPF process (thus there is no need for filtering on redistribution):

R1#show mpls forwarding
Local      Outgoing   Prefix           Bytes Label   Outgoing   Next Hop    
Label      Label      or Tunnel Id     Switched      interface              
16         Pop Label  2.2.2.2/32       0             Fa0/0      192.168.0.2 
17         Pop Label  192.168.0.3/32   0             Fa0/0      192.168.0.3 
18         Pop Label  3.3.3.3/32       0             Fa0/0      192.168.0.3 
R1#
R1#show ip ospf database

            OSPF Router with ID (1.1.1.1) (Process ID 1)

Removing the OSPF process after the labels are installed does not purge them from the LFIB; however, labels for new prefixes are ignored without the OSPF process: a perfect opportunity for EEM scripting, a.k.a. DevOps integration 🙂 The same behavior can be observed for iBGP as well: labels do not make it into the LFIB unless the “donor” OSPF and LDP processes are initialized.

Enough meddling with IOS internals, let’s see whether LSP is finally up:

R1#ping mpls ipv4 2.2.2.2/32 source 1.1.1.1
Sending 5, 100-byte MPLS Echos to 2.2.2.2/32, 
     timeout is 2 seconds, send interval is 0 msec:

Codes: '!' - success, 'Q' - request not sent, '.' - timeout,
  'L' - labeled output interface, 'B' - unlabeled output interface, 
  'D' - DS Map mismatch, 'F' - no FEC mapping, 'f' - FEC mismatch,
  'M' - malformed request, 'm' - unsupported tlvs, 'N' - no label entry, 
  'P' - no rx intf label prot, 'p' - premature termination of LSP, 
  'R' - transit router, 'I' - unknown upstream index,
  'X' - unknown return code, 'x' - return code 0

Type escape sequence to abort.
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 4/16/40 ms

At last the LSP between R1 and R2 is up and operational, so the rest of the L3VPN setup can be implemented the same way as we did for iBGP. The biggest challenge of using eBGP for L3VPN over DMVPN is the need for the “next-hop unchanged” knob, which seems to be unsupported for labeled unicast. Obviously, relying on it would void vendor technical support if an issue arises. Besides, although I managed to make the knob work for the article, this feature might change in some future software release; I would rather not expose any production environment to this potentially volatile behavior.

All in all, eBGP is not yet a valid choice for building direct spoke-to-spoke LSPs for L3VPN in DMVPN environments. And one more thing: curious reader, the homework is finished!

Kudos for review: Anastasiia Kuraleva
