TCP
- upper-level protocol data might not be sent immediately: several chunks are aggregated
- segmentation and reassembly of byte stream input
- checksum is computed over the same IP pseudo-header as in UDP
- checksum is mandatory
- retransmission – per segment (especially with SACK)
; open TCP/UDP ports
# show ip sockets
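The checksum over the IP pseudo-header mentioned above can be sketched as follows. This is a minimal illustration for IPv4 only; the function names `internet_checksum` and `tcp_checksum` are made up for this example, and the segment is assumed to carry a zeroed checksum field when it is passed in.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the IPv4 pseudo-header plus the whole TCP segment."""
    pseudo = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,                           # reserved zero byte
        6,                           # IP protocol number of TCP
        len(segment),                # TCP length: header + data
    )
    return internet_checksum(pseudo + segment)
```

A receiver can verify a segment by running the same function over the segment as received (checksum field included): the result is 0 when nothing was corrupted.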
Header
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Source port | Destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Sequence number (bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Acknowledgement number (bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Hdr len| Rsvd | Flags | Window size (bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Checksum | Urgent pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\ \
/ Options /
\ \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
\ \
/ Data /
\ \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
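The fixed 20-byte part of the header drawn above can be unpacked with a few lines of Python. This is only a sketch: the `TcpHeader` type and the `parse_tcp_header` name are invented for the example, and options and data are ignored.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    header_len: int   # in bytes: the 4-bit header length field times 4
    flags: int        # low 8 bits: CWR ECE URG ACK PSH RST SYN FIN
    window: int
    checksum: int
    urgent: int


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Decode the fixed part of a TCP header (network byte order)."""
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", segment[:20]
    )
    header_len = (off_flags >> 12) * 4   # upper 4 bits count 32-bit words
    flags = off_flags & 0x00FF
    return TcpHeader(src, dst, seq, ack, header_len, flags,
                     window, checksum, urgent)
```

Anything between byte 20 and `header_len` is the Options area; the rest of the segment is data.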
Sequence number – own transmission enumeration
Acknowledgement number – which byte is expected next, peer’s transmission enumeration
Window size
- byte count that the receiver is ready to accept without acknowledgement
- max 65535 bytes
- 0 ≡ receiver cannot accept data: only segments without data (e.g., ACK, window probes) can be sent
Urgent pointer
- priority marker
- bytes after current Sequence number – urgent
- considered only along with URG flag
- offset of sequence number after urgent data piece
- can span several segments
Flags
- SYN (0x02)
- synchronize sequence numbers
- start of transmission
- ACK (0x10)
- during transmission
- URG (0x20)
- account for urgent pointer
- FIN (0x01)
- closing connection
- RST (0x04)
- abnormal connection close
- PSH (0x08)
- bypass delayed transmission with aggregation
- passed with last segment of priority data on transmission
- if received – pass to ULP immediately
- CWR (0x80)
- congestion window reduced (≈ FR BECN)
- sender received ECE in received packet and reduced congestion window
- ECE (0x40)
- ECN echo (≈ FR FECN)
- if passed with SYN ≡ ECN capable, otherwise – congestion notification
- set by the receiver after an intermediate hop signalled congestion with IP ECN = 11b
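The flag values listed above are single bits, so a flags byte can be decoded by masking. A small sketch (the names `TCP_FLAGS` and `flag_names` are invented for this example):

```python
from typing import List

# Bit values as listed in the Flags section
TCP_FLAGS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
    0x40: "ECE",
    0x80: "CWR",
}


def flag_names(flags: int) -> List[str]:
    """Return the names of all flags set in the flags byte, lowest bit first."""
    return [name for bit, name in TCP_FLAGS.items() if flags & bit]
```

For example, `flag_names(0x12)` gives `["SYN", "ACK"]`, the second segment of the handshake.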
Options
- 0: end of option list (also used as padding)
- 1: no operation
- 2: MSS
- 3: window scale
- 4: SACK permitted
- 5: SACK
- 8: timestamp
- 19: MD5
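Options are encoded as kind/length/value records, except kinds 0 and 1, which are a single byte each. A hedged sketch of walking the Options area (the function name `parse_tcp_options` is an assumption for this example, and the values are returned as raw bytes without interpretation):

```python
from typing import List, Tuple


def parse_tcp_options(raw: bytes) -> List[Tuple[int, bytes]]:
    """Split the Options area into (kind, value) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                 # end of option list: stop parsing
            break
        if kind == 1:                 # no operation: single byte, no length
            options.append((1, b""))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]           # length covers kind + length + value
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options
```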
Maximum segment size (MSS)
- SYN segment only
- not negotiated, segment size ≤ MSS
- data size only, does not include headers
- unidirectional PMTUD, however, can influence SYN-reply
- CPU-processed ⇒ better be adjusted in distributed way (spoke in lieu of hub)
; off by default, mins = 10 by default
(config)# ip tcp path-mtu-discovery [age-timer <mins>]
; off by default
(config-if)# ip tcp adjust-mss <MSS>
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length = 4 | MSS (bytes) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
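Since MSS counts data only, a value for `ip tcp adjust-mss` is usually derived from the path MTU by subtracting the headers. A sketch assuming IPv4 and TCP headers without options (20 bytes each); both function names are invented for the example:

```python
import struct


def mss_for_mtu(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS that fits into one packet of the given MTU (headers excluded)."""
    return mtu - ip_header - tcp_header


def encode_mss_option(mss: int) -> bytes:
    """Encode the MSS option: type 2, length 4, 16-bit MSS value."""
    return struct.pack("!BBH", 2, 4, mss)
```

For a plain Ethernet MTU of 1500 this gives 1460; tunnel overhead would have to be subtracted from the MTU first.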
TCP window scale (scale factor)
- max window size without scaling – 65535 bytes
- allows increasing the maximum window size: up to 2¹⁶ → up to 2³⁰ bytes
- hardware may process incorrectly (≈ 1Gb)
- must be enabled on both peers (negotiated in SYN segment)
(config)# ip tcp window-size 65536
0 1 2
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length = 3 | Scale factor |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
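The scale factor is a left shift applied to the 16-bit Window size field. The sketch below assumes the limit of 14 from RFC 7323; `effective_window` and `required_shift` are names made up for this example.

```python
MAX_SHIFT = 14  # largest scale factor allowed by RFC 7323


def effective_window(window_field: int, shift: int) -> int:
    """Real window in bytes for a given header field and scale factor."""
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("scale factor must be within 0..14")
    return window_field << shift


def required_shift(target_window: int) -> int:
    """Smallest scale factor whose maximum window covers target_window bytes."""
    for shift in range(MAX_SHIFT + 1):
        if (0xFFFF << shift) >= target_window:
            return shift
    raise ValueError("window does not fit even with the maximum scale factor")
```

With the maximum shift the window reaches 65535 × 2¹⁴ = 1 073 725 440 bytes, just under 2³⁰.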
SACK
- options SACK and SACK-permitted (SYN only, SACK negotiation)
(config)# ip tcp selective-ack
SACK-Permitted
0 1
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length = 2 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
SACK
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Left edge of Block 1 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Right edge of Block 1 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Left edge of Block 2 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Right edge of Block 2 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Left edge of Block 3 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Right edge of Block 3 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Left edge of Block 4 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Right edge of Block 4 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Length = {10, 18, 26, 34}
Left edge: sequence number of the first byte of the received block
Right edge: sequence number of the first byte after the block, not received yet
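Each block adds two 32-bit edges to the 2-byte type/length prefix, which is where the lengths 10, 18, 26 and 34 come from. A sketch of decoding the option value (function names are hypothetical):

```python
import struct
from typing import List, Tuple


def sack_option_length(blocks: int) -> int:
    """Total option length: 2 bytes of type/length plus 8 bytes per block."""
    if not 1 <= blocks <= 4:
        raise ValueError("a SACK option carries 1 to 4 blocks")
    return 2 + 8 * blocks


def parse_sack_blocks(value: bytes) -> List[Tuple[int, int]]:
    """Decode the option value (without type/length) into (left, right) edges."""
    if len(value) % 8:
        raise ValueError("value must be a multiple of 8 bytes")
    return list(struct.iter_unpack("!II", value))


def block_size(left: int, right: int) -> int:
    """Bytes covered by a block; modulo keeps it correct across SN wrap."""
    return (right - left) % 2**32
```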
TCP timestamp
- usecase
- calculate uptime and boot time (e.g., connection between spoofed IP and MAC)
- duplicate packet detection
- protection against wrapped SN (PAWS): drop segments that are older than threshold
- estimate RTT in LFN
- in lieu of RTT for one packet per window
- real-time retransmit timeout tuning
- vulnerable
- incompatible with TCP header compression
- not copied on fragmentation
- added by both peers
- negotiated within SYN ≡ declare option support; within session ≡ timestamp role
(config)# ip tcp timestamp
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length = 10 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Timestamp value |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Timestamp echo reply |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Echo reply: latest received timestamp from peer ⇒ no need for clock synchronization
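Because the peer simply echoes the value back, RTT is the difference between the local clock and the echoed timestamp, and PAWS is a comparison against the most recent timestamp seen. Both need 32-bit modular arithmetic. A simplified sketch (names are invented; units are timestamp clock ticks, whose duration depends on the implementation):

```python
MOD = 2**32


def rtt_ticks(now: int, echo_reply: int) -> int:
    """RTT in timestamp ticks: local clock minus the echoed value."""
    return (now - echo_reply) % MOD


def is_stale(ts_val: int, ts_recent: int) -> bool:
    """PAWS-style check: True if ts_val is older than the last seen timestamp.

    Values more than half the number space "behind" are treated as older.
    """
    return (ts_val - ts_recent) % MOD >= MOD // 2
```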
MD5
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Length = 18 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| |
| MD5 digest |
| |
| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Keepalive
- transmitter sends a segment without data, with a sequence number that the receiver has already acknowledged
- receiver registers an error and replies with up-to-date acknowledgement number, if session is still active
Flow control
- via window size adjustment
- decreases drops caused by buffer overrun
Session creation
A SYN, Seq = n B
------------------------------->
SYN, ACK, Seq = m, Ack = n + 1
<-------------------------------
ACK, Seq = n + 1, Ack = m + 1
------------------------------->
If 100 bytes are sent after a SYN with Seq = n (first data byte is n + 1) – received ack = n + 101
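The acknowledgement arithmetic can be checked with a small helper. It relies on the fact that SYN and FIN each consume one sequence number and that sequence numbers wrap at 2³²; the name `expected_ack` is made up for this sketch.

```python
def expected_ack(seq: int, payload_len: int = 0,
                 syn: bool = False, fin: bool = False) -> int:
    """Acknowledgement number the peer returns for this segment."""
    return (seq + payload_len + int(syn) + int(fin)) % 2**32


n = 1000                                   # example initial sequence number
after_syn = expected_ack(n, syn=True)      # n + 1
after_data = expected_ack(after_syn, 100)  # n + 101
```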
Session tear down
A FIN B | A FIN, ACK B
-------------------------------> | ------------------------------->
FIN, ACK | FIN, ACK
<------------------------------- | <-------------------------------
ACK |
-------------------------------> |
Fast Open
- allows sending user data during handshake
Sliding window
Last sequentially received byte number is acknowledged, out-of-order segments are buffered.
Time to wait for acknowledgement ≈ 200ms
Slow start
- in the beginning of a session
- segments are sent in small batches disregarding large sliding window (= sender MSS)
- every ACK received increases the effective window, so it doubles each RTT ⇒ exponential acceleration
- improvements:
- no reset down to 0 for small packet loss
- drop transmission speed on large losses (by 2 times), then increase linearly instead
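A toy model of the behaviour above: the window doubles every RTT until a threshold, then grows linearly, and is halved on loss. It is a deliberately simplified sketch in units of MSS per RTT, not the algorithm of any particular stack; `cwnd_trace`, `ssthresh` and `loss_at` are names chosen for the example.

```python
from typing import FrozenSet, List


def cwnd_trace(rtts: int, ssthresh: int,
               loss_at: FrozenSet[int] = frozenset()) -> List[int]:
    """Congestion window (in MSS) at the start of every RTT."""
    cwnd = 1
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd // 2, 2)   # loss: halve instead of resetting
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh) # slow start: exponential growth
        else:
            cwnd += 1                      # congestion avoidance: linear growth
    return trace
```

`cwnd_trace(8, 16)` shows the exponential phase followed by the linear one; adding `loss_at=frozenset({5})` shows the halving.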
Delayed ack
- acknowledgement is not triggered by every segment
- triggered by:
- timer expiration: acknowledge batch of segments
- if there is data to send (piggyback acknowledgement)
Nagle algorithm
- small packets are buffered, sent in batch
- Telnet
(config)# service nagle
Acceleration
- retransmit timers tuning: no need to wait 200ms in fast networks
- fast retransmit:
- if 3 duplicate ACKs are received (the same Acknowledgement number repeated after the original ACK), the segment starting with that byte is considered lost
- does not wait for retransmit timer to expire
- ≈ 10ms
- SACK, NACK: negotiate in SYN segments with SACK-permitted option
- performance enhancement proxy (PEP): dynamically changes window size for satellite links
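Fast retransmit detection boils down to counting repeated Acknowledgement numbers. The sketch below assumes the common threshold of three duplicates (RFC 5681), that is, four identical ACK numbers in a row; the function name and the `threshold` parameter are illustrative.

```python
from typing import Iterable, List


def fast_retransmit_points(acks: Iterable[int], threshold: int = 3) -> List[int]:
    """Sequence numbers to retransmit, given the stream of received ACK numbers."""
    lost = []
    last = None
    dups = 0
    for ack in acks:
        if ack == last:
            dups += 1
            if dups == threshold:     # fire once, without waiting for the timer
                lost.append(ack)
        else:
            last = ack                # window advanced: reset the counter
            dups = 0
    return lost
```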
TCP ECN
(config)# ip tcp ecn
Maximum segment lifetime (MSL)
- RFC 793
- implemented on top of IP TTL
- protection against duplicate segments on SN wrap
- increase MSL: does not work for high-speed LAN because of SN wrap
- decrease MSL: does not work for LFN
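The conflict between MSL and sequence number wrap is easy to quantify: the 32-bit sequence space covers 2³² bytes, so the wrap time depends only on the sending rate. The sketch assumes the 2-minute MSL suggested in RFC 793; function names are invented for the example.

```python
SEQ_SPACE_BYTES = 2**32


def wrap_seconds(bits_per_second: float) -> float:
    """Time to send 2**32 bytes, i.e. to wrap the sequence number once."""
    return SEQ_SPACE_BYTES * 8 / bits_per_second


def wraps_within_msl(bits_per_second: float, msl_seconds: float = 120.0) -> bool:
    """True if sequence numbers can wrap while an old segment is still alive."""
    return wrap_seconds(bits_per_second) < msl_seconds
```

At 1 Gbit/s the wrap takes about 34 seconds, well below the MSL, which is why PAWS relies on timestamps instead.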
CUBIC
- if loss is detected, the BW at that moment is recorded as BWmax and used as the origin (0) of the cubic curve
- BW is decreased by 30% initially
- BW increases cubically till BWmax: fast initial growth, careful approach to BWmax
- after BWmax is reached without loss, BW continues to grow cubically (slowly at first, then faster)
- default for Linux kernel
- slow start as usual; CUBIC covers only congestion avoidance
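The growth curve described above is W(t) = C·(t − K)³ + Wmax, where K is the time needed to return to Wmax. The sketch uses the default constants from RFC 9438 (C = 0.4, β = 0.7) and works on the window in segments rather than on BW; function names are made up for the example.

```python
C = 0.4      # scaling constant (RFC 9438 default)
BETA = 0.7   # window left after a loss, i.e. a 30 % decrease


def cubic_k(w_max: float) -> float:
    """Seconds needed to climb back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t: float, w_max: float) -> float:
    """Congestion window (segments) t seconds after the last loss."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max
```

At t = 0 the function returns 0.7·Wmax, at t = K it returns exactly Wmax, and beyond K it keeps growing to probe for more bandwidth.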
HyStart
- evolution of CUBIC: changes slow start behaviour
- if RTT or inter-ACK time increases over threshold, exit slow start and enter CUBIC congestion avoidance
- HyStart++:
- RTT detection only
- limited slow start (LSS) after slow start before congestion avoidance: LSS grows the congestion window faster than congestion avoidance, but slower than vanilla slow start
QoS
TCP starvation
- same as UDP dominance
- TCP decreases transmission speed because of congestion control, however, UDP fills newly freed bandwidth
Global synchronisation
- if a queue shared by several sessions overflows, segments of all sessions are dropped at once → load drops for everyone simultaneously, then is increased for everyone simultaneously as well → saw-like traffic pattern ≡ low utilization
- solved by WRED
Ports
- 19: CHARGEN
- 49: TACACS+
- 88: Kerberos
- 143: IMAP
- 445: SMB
- 502: Modbus TCP
- 601: reliable syslog
- 860: iSCSI
- 1344: ICAP
- 1720: H.323 signalling (Q.931 + Q.932)
- 1883: MQTT
- 2049: NFSv4
- 2377: Docker management (overlay driver)
- 2404: IEC 60870-5-104
- 2748: CTIQBE
- 3205: iSNSP
- 3225: FCIP
- 3260: iSCSI
- 4059: DLMS User Association
- 5061: SIPS
- 5222: XMPP
- 7946: Docker inter-node communication (overlay driver)
- 8883: MQTT TLS
- 8905: Cisco ISE Posture
- 20000: DNP3/IP
DCTCP
- reduces window size proportionally to the extent of congestion (share of marked or lost segments), not by a fixed factor
- congestion signalling: segments lost, IP ECN
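The proportional reduction can be sketched with the two update rules from RFC 8257: a moving average of the congestion-marked fraction and a window cut scaled by it. The gain g = 1/16 is the RFC's recommended value; function names are invented for the example.

```python
G = 1 / 16  # estimation gain recommended in RFC 8257


def dctcp_alpha(alpha: float, marked_fraction: float, g: float = G) -> float:
    """Update the running estimate of the congestion-marked fraction."""
    return (1 - g) * alpha + g * marked_fraction


def dctcp_cwnd(cwnd: float, alpha: float) -> float:
    """Reduce the window in proportion to alpha (alpha = 1 halves it)."""
    return cwnd * (1 - alpha / 2)
```

With alpha = 1 (every segment marked) the window is halved as in classic TCP; with alpha = 0.1 it shrinks by only 5%.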