TCP Cheatsheet

TCP

  • TCP/IP proposed by Vint Cerf and Bob Kahn in 1974
  • IPv4 specifications: IP RFC 791, TCP RFC 793
  • TCP is optimised for accurate rather than timely delivery
  • TCP uses a three-way handshake
  • Sequence numbers are picked randomly for security reasons
  • SYN packet: client picks random sequence number x, sends SYN packet which may include TCP flags and options
  • SYN ACK: server increments x by 1, picks random sequence number y, appends flags and options
  • ACK: client increments y by 1
  • Application data can be sent immediately after the ACK
  • Handshake imposes 1 RTT delay and makes TCP connection establishment expensive
  • TCP Fast Open (TFO) is available in Linux 3.7+
  • TFO includes a data payload with the SYN
  • TFO requires a cryptographic cookie and works for repeat connections only
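
The sequence-number bookkeeping of the three-way handshake can be traced with a small simulation. This is a minimal sketch rather than real socket code: the function name `three_way_handshake` and the tuple layout are hypothetical, and `secrets.randbits(32)` only stands in for however a real stack picks its initial sequence numbers.

```python
import secrets

SEQ_SPACE = 2**32  # TCP sequence numbers are 32-bit and wrap around


def three_way_handshake(x=None, y=None):
    """Return the (flags, seq, ack) triples exchanged during connection setup."""
    # Each side picks a random initial sequence number unless one is supplied.
    x = secrets.randbits(32) if x is None else x
    y = secrets.randbits(32) if y is None else y

    syn = ("SYN", x, None)                                    # client -> server
    syn_ack = ("SYN-ACK", y, (x + 1) % SEQ_SPACE)             # server -> client
    ack = ("ACK", (x + 1) % SEQ_SPACE, (y + 1) % SEQ_SPACE)   # client -> server
    return [syn, syn_ack, ack]


for flags, seq, ack in three_way_handshake(x=1000, y=5000):
    print(f"{flags:8} seq={seq} ack={ack}")
```

Fixed values for x and y are passed here only to make the trace readable; leaving them out picks random ones.
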
  • TCP achieves reliability through retransmission
  • TCP retransmission works by the sender detecting segments that have been lost in transmission, typically identified through timeouts or receiving duplicate ACKs, and then resending those segments
  • John Nagle documented congestion collapse in 1984
  • Congestion collapse affects networks with asymmetric bandwidth
  • Nagle’s algorithm reduces the number of packets that need to be sent over the network
  • Nagle’s algorithm works by combining a number of small outgoing messages, and sending them all at once
  • As long as there is a sent packet for which the sender has received no acknowledgment, the sender should keep buffering its output until it has a full packet’s worth of output, so that output can be sent all at once
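
The buffering rule above can be modelled in a few lines. This is a toy sketch under stated assumptions, not how any kernel implements it: the class `NagleSender`, its `mss` parameter, and the single `on_ack` call that acknowledges everything outstanding are all hypothetical simplifications.

```python
class NagleSender:
    """Toy model of Nagle's rule; `mss` is the size of a full segment in bytes."""

    def __init__(self, mss=1460):
        self.mss = mss
        self.buffer = b""      # small writes waiting to be coalesced
        self.unacked = False   # True while a sent segment awaits its ACK
        self.sent = []         # segments "put on the wire", kept for inspection

    def _flush(self):
        # Full segments always go out; a partial one only if nothing is in flight.
        while self.buffer and (len(self.buffer) >= self.mss or not self.unacked):
            segment, self.buffer = self.buffer[:self.mss], self.buffer[self.mss:]
            self.sent.append(segment)
            self.unacked = True

    def write(self, data):
        self.buffer += data
        self._flush()

    def on_ack(self):
        # Assume the ACK covers all outstanding data, so buffered bytes may go out.
        self.unacked = False
        self._flush()


sender = NagleSender()
for byte in (b"a", b"b", b"c"):
    sender.write(byte)
sender.on_ack()
print(sender.sent)  # [b'a', b'bc']
```

The first byte goes out at once because nothing is in flight; the next two are held back and leave together once the ACK arrives.
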
  • Mechanisms were added to avoid congestion collapse: flow control, congestion control, and congestion avoidance
  • Flow control
    • Flow control prevents the sender overwhelming the receiver with data
    • Each side of a TCP connection advertises its receive window (rwnd) which is available buffer space
    • rwnd is initialised from the system default settings
    • Each side can advertise a smaller window, or 0 for stop
    • Window scaling is specified in RFC 1323
    • Original TCP spec allocated 16 bits for rwnd, giving a maximum of 65,535 bytes
    • RFC 1323 allows window scaling option in the first SYN
    • The option specifies how many bits to left-shift the 16-bit window field in subsequent segments
    • Window scaling is enabled by default in all major platforms
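
As a rough illustration of window scaling, the advertised 16-bit value is shifted left by the count negotiated in the SYN. The helper name `effective_window` is hypothetical; the upper limit of 14 for the shift count is the one RFC 1323 defines.

```python
MAX_WINDOW_FIELD = 0xFFFF  # the header's window field is 16 bits wide
MAX_SHIFT = 14             # largest shift count RFC 1323 permits


def effective_window(window_field, shift):
    """Receive window in bytes implied by the 16-bit field and the negotiated shift."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift


print(effective_window(0xFFFF, 0))   # 65535
print(effective_window(0xFFFF, 7))   # 8388480
print(effective_window(0xFFFF, 14))  # 1073725440
```

With the maximum shift the window grows from roughly 64 KB to roughly 1 GB.
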
  • Congestion control and congestion avoidance
    • Prevents senders and receivers from overwhelming the network
    • Mechanisms to estimate bandwidth and adapt speeds to changing network conditions
    • In 1988 Van Jacobson and Michael J. Karels documented algorithms to address these problems
    • slow-start, congestion avoidance, fast retransmit, and fast recovery
    • Many variants: TCP Tahoe, Reno, Vegas, New Reno, BIC, CUBIC, or Compound TCP
  • Slow-start
    • Measure available capacity by exchanging data
    • Server initialises a new congestion window (cwnd) per TCP connection
    • cwnd starts at a conservative default (initcwnd on Linux)
    • cwnd is sender-side limit on amount of unacknowledged data
    • cwnd is not exchanged
    • RFC 6928 (April 2013) raises the initial cwnd to 10 segments
    • Maximum amount of un-ACKed data in flight on a new connection is the smaller of rwnd and cwnd
    • For every ACK received the server increases cwnd by 1 segment
    • In other words, for every ACK received two new packets can be sent, resulting in exponential growth
    • A TCP connection cannot use the full capacity of link straight away
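
The exponential growth during slow-start can be sketched as follows. This is a simplified model, assuming no packet loss, one ACK per segment, and windows counted in whole segments; the function name `slow_start` and the example receive window of 45 segments are illustrative only.

```python
def slow_start(initcwnd=10, rwnd_segments=45, round_trips=5):
    """Return the number of segments sent in each round trip of slow-start."""
    cwnd = initcwnd
    windows = []
    for _ in range(round_trips):
        in_flight = min(cwnd, rwnd_segments)  # sender may not exceed either limit
        windows.append(in_flight)
        # One ACK per segment, and each ACK grows cwnd by one segment,
        # so cwnd doubles every round trip until rwnd caps what is sent.
        cwnd += in_flight
    return windows


print(slow_start())  # [10, 20, 40, 45, 45]
```

Even with an initial cwnd of 10, it takes several round trips before the receiver's window becomes the limit.
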
  • Slow-start restart (SSR)
    • TCP implements an SSR mechanism, resetting the cwnd of idle connections as conditions may have changed
    • Disable SSR on the server (sysctl -w net.ipv4.tcp_slow_start_after_idle=0)
  • Congestion avoidance
    • TCP uses packet loss as feedback mechanism to regulate performance
    • Slow-start doubles data in flight every round trip until:
      • It exceeds receiver’s rwnd
      • It exceeds a system-configured threshold (ssthresh) window
      • A packet is lost, at which point congestion avoidance algorithm takes over
    • Packet loss indicates congested link or router
    • Originally TCP used Additive Increase and Multiplicative Decrease (AIMD)
    • AIMD: halve the congestion window on packet loss, then increase it by a fixed amount per round trip
    • RFC 6937 specifies new algorithm, Proportional Rate Reduction (PRR).
    • PRR is default in Linux 3.2+
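
The sawtooth that AIMD produces can be seen in a short simulation. This sketch models only the classic halve-on-loss, add-one-per-round-trip behaviour, not PRR; the function `aimd` and the way losses are supplied as a set of round numbers are assumptions made for illustration.

```python
def aimd(cwnd, loss_rounds, rounds, increase=1, decrease=0.5):
    """Return cwnd (in segments) at the start of each round trip."""
    history = []
    for r in range(rounds):
        history.append(cwnd)
        if r in loss_rounds:
            # Multiplicative decrease: shrink the window, never below one segment.
            cwnd = max(1, int(cwnd * decrease))
        else:
            # Additive increase: grow by a fixed amount per round trip.
            cwnd += increase
    return history


print(aimd(cwnd=20, loss_rounds={3}, rounds=7))  # [20, 21, 22, 23, 11, 12, 13]
```
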
  • Bandwidth-delay product (BDP): product of data link’s capacity and end-to-end delay
  • BDP is maximum amount of data that can be in-flight
  • If min(rwnd, cwnd) = 16 KB = 131,072 bits and RTT = 100 ms = 0.1 s, max throughput = 131,072 bits / 0.1 s ≈ 1.31 Mbps
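
The throughput arithmetic can be checked with two small helpers, assuming the effective window is the smaller of rwnd and cwnd as the slow-start notes state. The function names are hypothetical, and RTT is taken in milliseconds to keep the arithmetic exact.

```python
def max_throughput_bps(rwnd_bytes, cwnd_bytes, rtt_ms):
    """Upper bound on throughput: one effective window delivered per round trip."""
    window_bits = min(rwnd_bytes, cwnd_bytes) * 8
    return window_bits * 1000 / rtt_ms


def required_window_bytes(bandwidth_bps, rtt_ms):
    """Bandwidth-delay product: the window needed to keep the link full."""
    return bandwidth_bps * rtt_ms / 8000


print(max_throughput_bps(16 * 1024, 16 * 1024, 100))  # 1310720.0
print(required_window_bytes(10_000_000, 100))         # 125000.0
```

The second call shows the reverse calculation: filling a 10 Mbps link at 100 ms RTT needs a window of about 122 KB.
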
  • Fast retransmit and fast recovery
    • Fast retransmit reduces the time a sender waits before retransmitting a lost segment
    • Duplicate acknowledgement is the basis for the fast retransmit mechanism
    • If receiver receives a data segment that is out of order it immediately sends a duplicate ACK
    • If the sender receives three duplicate ACKs it will retransmit the missing segment
    • Fast recovery keeps TCP from falling back to slow-start after a fast retransmit
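
The duplicate-ACK trigger can be sketched as a counter on the sender side. This is a minimal model that only detects when a retransmit would fire; the function name `fast_retransmits` and the representation of ACKs as a plain list of acknowledgment numbers are assumptions.

```python
DUP_ACK_THRESHOLD = 3  # duplicates of the same ACK needed to trigger a retransmit


def fast_retransmits(acks):
    """Return the ACK numbers that trigger a fast retransmit, in order."""
    retransmitted = []
    last_ack = None
    duplicates = 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == DUP_ACK_THRESHOLD:
                # The segment starting at `ack` is presumed lost: resend it now
                # instead of waiting for the retransmission timer.
                retransmitted.append(ack)
        else:
            last_ack = ack
            duplicates = 0
    return retransmitted


print(fast_retransmits([100, 200, 200, 200, 200, 500]))  # [200]
```

Note that three duplicates means four identical ACKs in total: the original plus three repeats.
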
  • TCP receiver sees packet loss and retransmission as delivery delay when reading from socket
  • This is TCP head-of-line (HOL) blocking
  • Applications don’t have to reorder and reassemble and can be simple
  • The cost is unpredictable variation in latency, commonly known as jitter
  • ss is a tool to inspect statistics for open sockets
    • ss --options --extended --memory --processes --info
  • Performance checklist
    • Upgrade server kernel to latest version
    • Ensure that the initial cwnd is set to 10 segments
    • Disable slow-start after idle
    • Ensure that window scaling is enabled
    • Eliminate redundant transfers
    • Compress transferred data
    • Position servers closer to users to reduce round-trip times
    • Reuse established TCP connections whenever possible

Packet layout

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          Source Port          |       Destination Port        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        Sequence Number                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Acknowledgment Number                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Data |           |U|A|P|R|S|F|                               |
   | Offset| Reserved  |R|C|S|S|Y|I|            Window             |
   |       |           |G|K|H|T|N|N|                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Checksum            |         Urgent Pointer        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             data                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
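
The fixed 20-byte part of the layout above can be decoded with Python's `struct` module. This is a minimal sketch: the function name `parse_tcp_header` and the returned dict keys are hypothetical, it reads only the six flags drawn in the diagram, and it does not interpret options.

```python
import struct

FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]  # lowest bit first


def parse_tcp_header(segment):
    """Decode the fixed 20-byte part of a TCP header into a dict."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    # Network byte order: two 16-bit ports, two 32-bit numbers, four 16-bit fields.
    (src, dst, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = offset_flags >> 12   # header length in 32-bit words
    flag_bits = offset_flags & 0x3F    # low six bits hold the flags
    flags = [name for i, name in enumerate(FLAG_NAMES) if flag_bits & (1 << i)]
    return {
        "src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
        "header_bytes": data_offset * 4, "flags": flags,
        "window": window, "checksum": checksum, "urgent": urgent,
    }


# Hand-built SYN segment: data offset 5 (no options), only the SYN bit set.
syn = struct.pack("!HHIIHHHH", 54321, 443, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
header = parse_tcp_header(syn)
print(header["flags"], header["header_bytes"], header["window"])  # ['SYN'] 20 65535
```

Any bytes between offset 20 and `header_bytes` are options and padding; the payload starts at `header_bytes`.
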
  
This post is licensed under CC BY 4.0 by the author.