TCP

  1. TCP
    1. Header
      1. Window size
      2. Urgent pointer
      3. Flags
    2. Options
      1. Maximum segment size (MSS)
      2. TCP window scale (scale factor)
      3. SACK
        1. SACK-Permitted
        2. SACK
      4. TCP timestamp
      5. MD5
    3. Keepalive
  2. Flow control
    1. Session creation
    2. Session tear down
    3. Fast Open
    4. Sliding window
    5. Slow start
    6. Delayed ack
    7. Nagle algorithm
    8. Acceleration
    9. TCP ECN
    10. Maximum segment lifetime (MSL)
    11. CUBIC
    12. HyStart
  3. QoS
    1. TCP starvation
    2. Global synchronisation
  4. Ports
  5. DCTCP

TCP

  • upper-level protocol data might not be sent immediately: several chunks are aggregated
  • segmentation and reassembly of byte stream input
  • uses the same IP pseudoheader for checksum calculation as UDP
  • checksum is mandatory
  • retransmit – per packet (especially with SACK)
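The mandatory check value is the 16-bit ones' complement Internet checksum, computed over the IP pseudoheader plus the whole segment. A minimal sketch for IPv4 follows; the function names are illustrative and the checksum field of the segment is assumed to be zeroed before the call.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over IPv4 pseudoheader (src, dst, zero, protocol, length) + segment."""
    pseudo = struct.pack("!4s4sBBH",
                         socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
                         0, socket.IPPROTO_TCP, len(segment))
    return internet_checksum(pseudo + segment)
```

A receiver can verify a segment by running the same sum with the checksum field left in place: the result is 0 for an undamaged segment.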
; open TCP/UDP ports
# show ip sockets
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Source port         |        Destination port       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Sequence number (bytes)                   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                  Acknowledgement number (bytes)               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Hdr len|  Rsvd |     Flags     |      Window size (bytes)      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Checksum            |        Urgent pointer         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\                                                               \
/                            Options                            /
\                                                               \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\                                                               \
/                              Data                             /
\                                                               \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
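The fixed 20-byte part of the header drawn above can be unpacked field by field. This is a sketch only: the `TcpHeader` type and `parse_tcp_header` name are made up for illustration, and options are not parsed.

```python
import struct
from typing import NamedTuple

class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    header_len: int  # in bytes: the 4-bit field counts 32-bit words
    flags: int
    window: int
    checksum: int
    urgent: int

def parse_tcp_header(raw: bytes) -> TcpHeader:
    """Unpack the first 20 bytes of a TCP segment (network byte order)."""
    (src, dst, seq, ack, off_rsvd, flags,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", raw[:20])
    return TcpHeader(src, dst, seq, ack, (off_rsvd >> 4) * 4,
                     flags, window, checksum, urgent)
```

A header length above 20 bytes means that `header_len - 20` bytes of options follow before the data.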

Sequence number – own transmission enumeration

Acknowledgement number – which byte is expected next, peer’s transmission enumeration

Window size

  • byte count that the receiver is ready to accept without acknowledgement
  • max 65535 bytes
  • 0 ≡ receiver cannot accept data; only segments without data (e.g., ACK) can be sent

Urgent pointer

  • priority marker
  • bytes after current Sequence number – urgent
  • considered only along with URG flag
  • offset of sequence number after urgent data piece
  • can span several segments

Flags

  • SYN (0x02)
    • synchronize sequence numbers
    • start of transmission
  • ACK (0x10)
    • during transmission
  • URG (0x20)
    • account for urgent pointer
  • FIN (0x01)
    • closing connection
  • RST (0x04)
    • abnormal connection close
  • PSH (0x08)
    • bypass delayed transmission with aggregation
    • passed with last segment of priority data on transmission
    • if received – pass to ULP immediately
  • CWR (0x80)
    • congestion window reduced (≈ FR BECN)
    • sender received ECE in received packet and reduced congestion window
  • ECE (0x40)
    • ECN echo (≈ FR FECN)
    • if passed with SYN ≡ ECN capable, otherwise – congestion notification
    • set by the receiver after an intermediate hop signals congestion with IP ECN = 11b
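The bit values listed above map directly onto a flag enumeration. A small sketch for decoding the flags byte; the `TcpFlags` and `flag_names` names are illustrative.

```python
from enum import IntFlag

class TcpFlags(IntFlag):
    FIN = 0x01
    SYN = 0x02
    RST = 0x04
    PSH = 0x08
    ACK = 0x10
    URG = 0x20
    ECE = 0x40
    CWR = 0x80

def flag_names(value: int) -> list[str]:
    """Names of all flags set in the flags byte, lowest bit first."""
    return [flag.name for flag in TcpFlags if value & flag]
```

For example, `flag_names(0x12)` describes the second segment of the handshake (SYN and ACK set).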

Options

  • 0: end of option list (also used as padding)
  • 1: no operation
  • 2: MSS
  • 3: window scale
  • 4: SACK permitted
  • 5: SACK
  • 8: timestamp
  • 19: MD5

Maximum segment size (MSS)

  • SYN segment only
  • not negotiated, segment size ≤ MSS
  • data size only, does not include headers
  • unidirectional PMTUD, however, can influence SYN-reply
  • CPU-processed ⇒ better be adjusted in distributed way (spoke in lieu of hub)
; off by default, mins = 10 by default
(config)# ip tcp path-mtu-discovery [age-timer <mins>]
; off by default
(config-if)# ip tcp adjust-mss <MSS>
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |   Length = 4  |           MSS (bytes)         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
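A sketch of the option layout above and of the usual MSS arithmetic. The 20-byte IPv4 and TCP header sizes are assumptions (no IP or TCP options); the function names are illustrative.

```python
import struct

def encode_mss_option(mss: int) -> bytes:
    """Type = 2, Length = 4, then the 16-bit MSS value."""
    return struct.pack("!BBH", 2, 4, mss)

def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS counts data only, so both headers are subtracted from the MTU."""
    return mtu - ip_header - tcp_header
```

With a 1500-byte Ethernet MTU this gives 1460; a tunnel that lowers the path MTU needs a correspondingly lower value in `ip tcp adjust-mss`.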

TCP window scale (scale factor)

  • max window size without scaling – 65535 bytes
  • allows increasing the maximum window size: up to 2¹⁶ → up to 2³⁰ bytes
  • hardware may process incorrectly (≈ 1Gb)
  • must be enabled on both peers (negotiated in SYN segment)
(config)# ip tcp window-size 65536
 0                   1                   2       
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |   Length = 3  |  Scale factor |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
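The scale factor is a left-shift count applied to the 16-bit Window size field. A sketch of the arithmetic, assuming the maximum shift of 14 that yields the 2³⁰ limit; the function name is illustrative.

```python
MAX_SCALE = 14  # 2**16 * 2**14 = 2**30

def effective_window(window_field: int, scale: int) -> int:
    """Real window in bytes: header field shifted left by the scale factor."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field is 16 bits")
    if not 0 <= scale <= MAX_SCALE:
        raise ValueError("scale factor must be 0..14")
    return window_field << scale
```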

SACK

  • options SACK and SACK-Permitted (SYN only, SACK negotiation)
(config)# ip tcp selective-ack

SACK-Permitted

 0                   1                  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      Type     |   Length = 2  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

SACK

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |      Type     |     Length    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Left edge of Block 1                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Right edge of Block 1                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Left edge of Block 2                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Right edge of Block 2                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Left edge of Block 3                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Right edge of Block 3                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Left edge of Block 4                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Right edge of Block 4                      |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Length = {10, 18, 26, 34}

Left edge: first byte of block, received

Right edge: first byte after the block, not received
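The Length values follow from 2 bytes of Type and Length plus 8 bytes per block. A sketch of a parser for one SACK option; the function name is illustrative.

```python
import struct

def parse_sack_option(option: bytes) -> list[tuple[int, int]]:
    """Return (left edge, right edge) pairs from a SACK option (Type = 5)."""
    if len(option) < 2:
        raise ValueError("option too short")
    kind, length = option[0], option[1]
    if kind != 5 or length not in (10, 18, 26, 34) or len(option) < length:
        raise ValueError("not a valid SACK option")
    blocks = (length - 2) // 8  # 1 to 4 blocks
    edges = struct.unpack(f"!{2 * blocks}I", option[2:length])
    return list(zip(edges[0::2], edges[1::2]))
```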

TCP timestamp

  • usecase
    • calculate uptime and boot time (e.g., connection between spoofed IP and MAC)
    • duplicate packet detection
      • protection against wrapped SN (PAWS): drop segments that are older than threshold
    • estimate RTT in LFN
      • in lieu of RTT for one packet per window
      • real-time retransmit timeout tuning
  • vulnerable
  • incompatible with TCP header compression
  • not copied on fragmentation
  • added by both peers
  • negotiated within SYN ≡ declare option support; within session ≡ timestamp role
(config)# ip tcp timestamp
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |      Type     |  Length = 10  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                        Timestamp value                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Timestamp echo reply                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Echo reply: latest received timestamp from peer ⇒ no need for clock synchronization
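Both use cases reduce to modulo-2³² arithmetic on the 32-bit timestamp fields. A sketch, assuming timestamps are plain tick counters of the local clock; the function names are illustrative and the tick length is implementation-specific.

```python
TS_MOD = 2 ** 32

def rtt_ticks(now: int, echo_reply: int) -> int:
    """RTT in local clock ticks: current tick minus the echoed own timestamp."""
    return (now - echo_reply) % TS_MOD

def is_older(ts_value: int, ts_recent: int) -> bool:
    """PAWS-style check: True if ts_value lies behind ts_recent, wrap-aware."""
    return (ts_value - ts_recent) % TS_MOD > TS_MOD // 2
```

Only the sender's own clock is involved in `rtt_ticks`, which is why the peers need no clock synchronization.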

MD5

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                |      Type     |  Length = 18  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                           MD5 digest                          |
|                                                               |
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Keepalive

  1. transmitter sends a segment without data, carrying a sequence number that the receiver has already acknowledged
  2. receiver registers an error and replies with up-to-date acknowledgement number, if session is still active
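The two steps can be sketched as sequence-number arithmetic. Using "next byte to send minus one" for the probe is an assumption here (a common convention); the names are illustrative.

```python
SEQ_MOD = 2 ** 32

def keepalive_probe_seq(snd_nxt: int) -> int:
    """Probe sequence number: one below the next unsent byte, already acknowledged."""
    return (snd_nxt - 1) % SEQ_MOD

def is_stale(probe_seq: int, rcv_nxt: int) -> bool:
    """Receiver side: True if the sequence number lies behind what it expects."""
    return (probe_seq - rcv_nxt) % SEQ_MOD > SEQ_MOD // 2

def reply_ack(rcv_nxt: int) -> int:
    """An active receiver answers with its up-to-date Acknowledgement number."""
    return rcv_nxt
```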

Flow control

  • via window size adjustment
  • decreases drops caused by buffer overrun

Session creation

A             SYN, Seq = n             B
    ------------------------------->
     SYN, ACK, Seq = m, Ack = n + 1
    <-------------------------------
      ACK, Seq = n + 1, Ack = m + 1
    ------------------------------->

If 100 bytes are sent after a SYN with Seq = n, they occupy n + 1 … n + 100 (the SYN consumes one sequence number) – received Ack = n + 101
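The acknowledgement arithmetic can be checked with a short sketch. It assumes n is the initial sequence number carried by the SYN, which itself consumes one sequence number; the function name is illustrative.

```python
SEQ_MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around

def expected_ack(isn: int, data_len: int, syn_sent: bool = True) -> int:
    """Acknowledgement number the peer returns after receiving data_len bytes."""
    return (isn + (1 if syn_sent else 0) + data_len) % SEQ_MOD
```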

Session tear down

A                 FIN                  B | A               FIN, ACK                B
    ------------------------------->     |     ------------------------------->
               FIN, ACK                  |                 FIN, ACK            
    <-------------------------------     |     <-------------------------------
                  ACK                    |
    ------------------------------->     |

Fast Open

  • allows sending user data during handshake

Sliding window

Last sequentially received byte number is acknowledged, out-of-order segments are buffered.

Time to wait for acknowledgement ≈ 200ms

Slow start

  • in the beginning of a session
  • segments are sent in small batches (initially = sender MSS) regardless of a large sliding window
  • if ACK received, effective window size is doubled ⇒ exponential acceleration
  • improvements:
    • no reset down to 0 for small packet loss
    • drop transmission speed on large losses (by 2 times), then increase linearly instead
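A toy trace of the behaviour above, counted in MSS units per round trip. This is a simplified sketch, not a real stack: the `ssthresh` threshold name and the one-round-per-step model are assumptions.

```python
def cwnd_trace(rounds: int, ssthresh: int, loss_rounds=frozenset()) -> list[int]:
    """Effective window per round: doubling, then linear; halved on loss."""
    cwnd = 1  # start with one MSS
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            ssthresh = max(cwnd // 2, 2)  # drop by 2 times, not down to 0
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # exponential acceleration
        else:
            cwnd += 1  # linear increase afterwards
    return trace
```

`cwnd_trace(8, 16, {5})` shows the window growing 1, 2, 4, 8, 16, 17, then falling to 8 after the loss and continuing linearly.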

Delayed ack

  • acknowledgement is not triggered by every segment
  • triggered by:
    • timer expiration: acknowledge batch of segments
    • if there is data to send (piggyback acknowledgement)

Nagle algorithm

  • small packets are buffered, sent in batch
  • Telnet
(config)# service nagle
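On end hosts the same algorithm is usually toggled per socket with the `TCP_NODELAY` option. A minimal sketch; the helper name is illustrative.

```python
import socket

def tcp_socket(nagle: bool = True) -> socket.socket:
    """Create a TCP socket; TCP_NODELAY = 1 switches the Nagle algorithm off."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0 if nagle else 1)
    return sock
```

Interactive traffic that cannot tolerate the batching delay would call `tcp_socket(nagle=False)`.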

Acceleration

  • retransmit timers tuning: no need to wait 200ms in fast networks
  • fast retransmit:
    • if 3 duplicate ACKs arrive (same Acknowledgement number as the preceding ACK), the segment starting with that byte is considered lost
    • does not wait for retransmit timer to expire
    • ≈ 10ms
  • SACK, NACK: negotiate in SYN segments with SACK-permitted option
  • performance enhancement proxy (PEP): dynamically changes window size for satellite links
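Fast retransmit can be sketched as a duplicate counter on the sender. The sketch assumes duplicates are counted after the first ACK with a given number, so the trigger fires on the fourth identical ACK; the function name is illustrative.

```python
def fast_retransmit_points(acks: list[int], threshold: int = 3) -> list[int]:
    """Acknowledgement numbers for which fast retransmit would fire."""
    retransmit = []
    last, duplicates = None, 0
    for ack in acks:
        if ack == last:
            duplicates += 1
            if duplicates == threshold:
                retransmit.append(ack)  # resend the segment starting at this byte
        else:
            last, duplicates = ack, 0
    return retransmit
```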

TCP ECN

(config)# ip tcp ecn

Maximum segment lifetime (MSL)

  • RFC 793
  • implemented on top of IP TTL
  • protection against duplicate segments on SN wrap
  • increase MSL: does not work for high-speed LAN because of SN wrap
  • decrease MSL: does not work for LFN
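The SN wrap trade-off is easy to quantify: the 32-bit sequence space covers 2³² bytes. A sketch that compares the wrap time with the MSL; the 120-second default is the RFC 793 value and the function names are illustrative.

```python
def seq_wrap_seconds(bits_per_second: float) -> float:
    """Time to send 2**32 bytes, i.e. to wrap the sequence number space."""
    return 2 ** 32 * 8 / bits_per_second

def wraps_within_msl(bits_per_second: float, msl_seconds: float = 120.0) -> bool:
    """True if an old segment could still be alive when its SN is reused."""
    return seq_wrap_seconds(bits_per_second) < msl_seconds
```

At 1 Gbit/s the space wraps in about 34 seconds, which is why fast links rely on timestamps (PAWS) rather than on MSL alone.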

CUBIC

  • if loss is detected, the BW at that moment is recorded as BWmax (origin of the cubic curve)
    • BW is decreased by 30% initially
    • BW increases cubically till BWmax: fast initial growth, careful approach to BWmax
    • after BWmax is reached without loss, BW grows cubically
  • default for Linux kernel
  • slow start as usual; CUBIC covers only congestion avoidance
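The growth curve can be written as W(t) = C·(t − K)³ + Wmax, where K is the time needed to get back to Wmax. A sketch under assumed constants C = 0.4 and β = 0.7 (window kept after a loss, i.e. the 30% decrease); real implementations add further details.

```python
def cubic_window(t: float, w_max: float, c: float = 0.4, beta: float = 0.7) -> float:
    """Window t seconds after a loss that happened at window size w_max."""
    k = (w_max * (1 - beta) / c) ** (1 / 3)  # time to climb back to w_max
    return c * (t - k) ** 3 + w_max
```

At t = 0 the window is 0.7 · Wmax, the curve flattens near t = K (careful approach), then steepens again beyond it.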

HyStart

  • evolution of CUBIC: changes slow start behaviour
  • if RTT or inter-ACK time increases over a threshold, exit slow start and enter CUBIC congestion avoidance
  • HyStart++:
    • RTT detection only
    • limited slow start (LSS) after slow start before congestion avoidance: LSS grows the congestion window faster than CUBIC congestion avoidance, but slower than vanilla slow start
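The RTT-based exit test can be sketched as a comparison of per-round minimum RTTs. The threshold constants below (4 ms, 16 ms, divisor 8) are assumptions taken from the HyStart++ specification as remembered and should be checked against it; the function name is illustrative.

```python
def should_exit_slow_start(last_round_min_rtt: float,
                           current_round_min_rtt: float,
                           min_thresh: float = 0.004,
                           max_thresh: float = 0.016,
                           divisor: int = 8) -> bool:
    """True if RTT (seconds) grew enough between rounds to leave slow start."""
    rtt_thresh = max(min_thresh, min(last_round_min_rtt / divisor, max_thresh))
    return current_round_min_rtt >= last_round_min_rtt + rtt_thresh
```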

QoS

TCP starvation

  • same as UDP dominance
  • TCP decreases transmission speed in reaction to drops (congestion control), however, UDP fills the newly freed bandwidth

Global synchronisation

  • if a device carrying several sessions is overloaded (queue tail drop), all sessions decrease speed simultaneously → load drops for everyone simultaneously, then is increased for everyone simultaneously as well → saw-like traffic pattern ≡ low utilization
  • solved by WRED

Ports

  • 19: CHARGEN
  • 49: TACACS+
  • 88: Kerberos
  • 143: IMAP
  • 445: SMB
  • 502: Modbus TCP
  • 601: reliable syslog
  • 860: iSCSI
  • 1344: ICAP
  • 1720: H.323 signalling (Q.931 + Q.932)
  • 1883: MQTT
  • 2049: NFSv4
  • 2377: Docker management (overlay driver)
  • 2404: IEC 60870-5-104
  • 2748: CTIQBE
  • 3205: iSNSP
  • 3225: FCIP
  • 3260: iSCSI
  • 4059: DLMS User Association
  • 5061: SIPS
  • 5222: XMPP
  • 7946: Docker inter-node communication (overlay driver)
  • 8883: MQTT TLS
  • 8905: Cisco ISE Posture
  • 20000: DNP3/IP

DCTCP

  • reduces window size in proportion to the fraction of congestion-marked segments instead of halving it
  • congestion signalling: segments lost, IP ECN
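The proportional reduction can be sketched as a running estimate α of the marked fraction, updated once per window. The gain g = 1/16 and the once-per-window update are assumptions based on the published DCTCP description; the names are illustrative.

```python
def dctcp_update(cwnd: float, alpha: float, marked: int, total: int,
                 g: float = 1 / 16) -> tuple[float, float]:
    """Return (new cwnd, new alpha) after one window of acknowledgements."""
    fraction = marked / total if total else 0.0
    alpha = (1 - g) * alpha + g * fraction  # smoothed share of marked segments
    if marked:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))  # mild cut for mild congestion
    return cwnd, alpha
```

With every segment marked for a long time α tends to 1 and the cut becomes the classic halving; with few marks the window shrinks only slightly.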