IP MTU: how to stop living and start learning headers

Good old IPv4… It is as ubiquitous in the networking world as air is on Earth. Although folks around the world use it on a daily basis, IPv4 still has a few surprises up its sleeve. Today we’re going to peek at one of them.

The topology is four routers, R1 through R4, lined up in a row.

Each router has basic addressing set up as well as EIGRP on all of the interfaces:
R2#show run | section router|interface FastEthernet
interface FastEthernet0/0
 ip address 192.168.12.2 255.255.255.0
interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
router eigrp 1
 network 0.0.0.0

By default, the IP MTU on each link is 1500 bytes. This means that an IP packet, including headers and payload, can be up to 1500 bytes long; if a packet is bigger than that, it has to be fragmented. Suppose the IP MTU of the R2-R3 link is set to 1400 bytes on both ends:

R2#show run interface fastEthernet0/1

interface FastEthernet0/1
 ip address 192.168.23.2 255.255.255.0
 ip mtu 1400

How many fragments would a 1500-byte packet produce?

R1#ping 4.4.4.4 source 1.1.1.1 size 1500 repeat 1
Type escape sequence to abort.
Sending 1, 1500-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 68/68/68 ms

It’s pretty easy to work out that a 1500-byte packet should result in 2 fragments for an MTU of 1400 bytes. However, why are there also 2 fragments for the ICMP echo reply? Well, it turns out that ICMP is in fact supposed to work this way:

The data received in the echo message must be returned in the echo reply message.

RFC 792

Let’s reduce the R2-R3 MTU to 700 bytes on both ends and check whether we can squeeze exactly two 700-byte fragments through it. The IP header is 20 bytes long, so the original packet should be 680*2 + 20 = 1380 bytes (IP MTU includes the header, remember?).

R1#ping 4.4.4.4 source 1.1.1.1 size 1380 repeat 1
Type escape sequence to abort.
Sending 1, 1380-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms

Now for the highlight of the test: the magical MTU value of 725.

R1#ping 4.4.4.4 source 1.1.1.1 size 1430 repeat 1    
Type escape sequence to abort.
Sending 1, 1430-byte ICMP Echos to 4.4.4.4, timeout is 2 seconds:
Packet sent with a source address of 1.1.1.1 
!
Success rate is 100 percent (1/1), round-trip min/avg/max = 48/48/48 ms

Let’s break down the initial packet: 20 bytes of header and 1410 bytes of payload that result… in 3 fragments? Shouldn’t 1410 bytes fit perfectly into two fragments carrying 705 bytes of payload each?

Frankly speaking, 725 is as special as any other MTU that leaves a payload size (MTU minus the 20-byte header) not divisible by 8. The catch is the Fragment Offset field, which records the position of a fragment’s payload within the original data.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

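Fragment offset is measured in units of 8 bytes, so every fragment except the last must carry a payload that is a multiple of 8 bytes. Since 705 is not divisible by 8, the router rounds it down to the nearest multiple of 8, which is 704.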
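
To make the encoding concrete, here is a minimal Python sketch that packs and unpacks the 16-bit word shared by Flags and Fragment Offset in the diagram above. The helper names (`pack_flags_offset`, `unpack_flags_offset`) are made up for illustration and are not part of any library.

```python
import struct

MF_FLAG = 0x1  # "More Fragments" bit inside the 3-bit Flags field
DF_FLAG = 0x2  # "Don't Fragment" bit inside the 3-bit Flags field


def pack_flags_offset(flags, payload_offset_bytes):
    """Encode Flags (3 bits) and Fragment Offset (13 bits) as one 16-bit word."""
    if payload_offset_bytes % 8:
        # The field counts 8-byte units, so other byte offsets cannot be expressed.
        raise ValueError("fragment payload must start on an 8-byte boundary")
    units = payload_offset_bytes // 8
    if not 0 <= units < 2 ** 13:
        raise ValueError("offset does not fit into 13 bits")
    return struct.pack("!H", (flags << 13) | units)


def unpack_flags_offset(word):
    """Return (flags, payload offset in bytes) from the 16-bit word."""
    (value,) = struct.unpack("!H", word)
    return value >> 13, (value & 0x1FFF) * 8


# A fragment that starts 704 bytes into the original payload, more to follow:
print(pack_flags_offset(MF_FLAG, 704).hex())  # 2058
print(unpack_flags_offset(b"\x20\x58"))       # (1, 704)
```

Calling `pack_flags_offset(0, 705)` raises `ValueError`: there is simply no way to write 705 into this field.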

The RFC does not dictate exactly how the payload is split between fragments. In this particular case R2 puts 704 bytes in the first fragment, only 8 bytes in the second, and the remaining 698 bytes in the last one.
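
The three-fragment result can be reproduced with a small model. The sketch below assumes a 20-byte header without IP options; `greedy_split` fills every fragment as far as the alignment allows, which is just one possible strategy and not what R2 did here. `is_valid_split` checks any proposed split, including the one R2 chose. All function names are hypothetical.

```python
IP_HEADER_LEN = 20  # assumption: no IP options


def greedy_split(packet_len, mtu, header_len=IP_HEADER_LEN):
    """Payload sizes when each non-final fragment is filled as far as alignment allows."""
    room = mtu - header_len   # payload room in one fragment
    step = room // 8 * 8      # non-final fragments need a multiple of 8
    if step <= 0:
        raise ValueError("MTU too small to fragment")
    remaining = packet_len - header_len
    sizes = []
    while remaining > room:   # the last fragment may use the whole room
        sizes.append(step)
        remaining -= step
    sizes.append(remaining)
    return sizes


def is_valid_split(sizes, packet_len, mtu, header_len=IP_HEADER_LEN):
    """Check that fragments fit the MTU, are 8-byte aligned, and cover the payload."""
    fits = all(size + header_len <= mtu for size in sizes)
    aligned = all(size % 8 == 0 for size in sizes[:-1])
    complete = sum(sizes) == packet_len - header_len
    return fits and aligned and complete


def offsets(sizes):
    """Fragment Offset field values (in 8-byte units) for consecutive fragments."""
    result, position = [], 0
    for size in sizes:
        result.append(position // 8)
        position += size
    return result


print(greedy_split(1380, 700))                     # [680, 680]
print(greedy_split(1430, 725))                     # [704, 704, 2]
print(is_valid_split([704, 8, 698], 1430, 725))    # True
print(offsets([704, 8, 698]))                      # [0, 88, 89]
```

Under the same assumptions, a 1429-byte packet would still fit into two fragments with an MTU of 725 (`greedy_split(1429, 725)` gives `[704, 705]`), because only the last fragment is free of the alignment rule.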

In the end, hosts try to avoid fragmentation altogether by various means: PMTUD and TCP MSS, to name a few. This makes issues with unpredictable fragment sizes even less likely to occur in real life. However, sometimes one needs a reason to justify learning IP headers, eh?
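
As a rough illustration of the TCP MSS approach, the sketch below computes the largest TCP segment payload that fits into a single unfragmented packet. It assumes 20-byte IP and TCP headers without options, and the function name `tcp_mss_for_mtu` is made up for this example.

```python
IP_HEADER_LEN = 20   # assumption: no IP options
TCP_HEADER_LEN = 20  # assumption: no TCP options


def tcp_mss_for_mtu(mtu):
    """Largest TCP payload that fits into one packet of the given IP MTU."""
    return mtu - IP_HEADER_LEN - TCP_HEADER_LEN


print(tcp_mss_for_mtu(1500))  # 1460
print(tcp_mss_for_mtu(725))   # 685
```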

Kudos for review: Anastasiia Kuraleva
