Transmission Control Protocol (TCP)
Table of Contents
- Introduction
- Reliable Communication Mechanisms
- Important Terms
- Congestion Control Algorithms
- Explicit Congestion Notification
- Versions of TCP
- Enabling a TCP Congestion Control Algorithm
- TCP Head-of-Line Blocking
- TCP Segmentation Offload
- Resources
Introduction
- Building Blocks of TCP
- Understanding the Network stack architecture in Linux
- Threads and Connections
- Socket Programming: Echo Server, and RTT and Throughput Measurement
- Exercise and good resources to get introduced to socket programming.
- github.com/HarshKapadia2/socket-programming (Private repo)
- Slow Start vs Congestion Avoidance in TCP
- Implementing Reliable Transport Protocols
- Implementing Stop-and-Wait (Alternating-Bit) Protocol, Go-Back-N Protocol (with Cumulative Acknowledgements) and Selective-Repeat Protocol (with Selective Acknowledgements).
- Sample Output Trace
- github.com/HarshKapadia2/reliable-transport-protocol (Private repo)
- Performance Comparison of TCP Versions
Reliable Communication Mechanisms
- Confirm delivery
- Sender gets Acknowledgement (ACK) Packets from the receiver
- No loss at receiver
- Flow Control
- Sliding Window
- Detect corrupted packets
- Checksum
- Detect lost packets
- Set up a timer on the sender.
- Retransmission Timeout (RTO)
- Recover from lost packets
- Re-transmit packets
- Automatic Repeat Request (ARQ)
- Detect duplicates
- Add a Sequence Number to each packet.
- In-order delivery
- Add a Sequence Number to each packet.
- Multiplexing and De-multiplexing
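As an illustration of the 'Detect corrupted packets' mechanism above, here is a minimal sketch of a 16-bit one's-complement checksum in the style of RFC 1071. The function names are hypothetical, and real TCP also covers a pseudo-header and the TCP header, which this sketch leaves out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def is_intact(data: bytes, checksum: int) -> bool:
    """Receiver-side check: recompute the checksum and compare."""
    return internet_checksum(data) == checksum
```

The sender would attach `internet_checksum(payload)` to the segment, and the receiver would drop the segment (and not ACK it) if `is_intact` returns `False`.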
Important Terms
- CWND
- Congestion Window
- RAWND
- Receiver Advertised Window
- Available buffer space on the receiver, advertised to the sender.
- SWS
- Sender Window Size
SWS = min(CWND, RAWND)
- ACK
- Acknowledgement Packet/Datagram
- Packet that acknowledges the receipt of another packet.
- MSS
- Maximum Segment Size
- Maximum payload (data) size per segment/datagram (Transport Layer).
- MTU
- Maximum Transmission Unit
- Maximum payload (data) size per frame (Data Link Layer).
- RTT
- Round Trip Time
- Time from the start of the first packet sent from the CWND to the receipt of the ACK of the last packet sent from the current CWND.
- Capacity/Bandwidth > Throughput > Goodput
- Capacity/Bandwidth
- The total transmission/sending rate of the link.
- Measured in bps (Bits per Second)
- Throughput
- Actual transmission/sending rate available after losses.
- Includes new data and retransmitted data.
- Measured in bps (Bits per Second)
- Goodput
- The transmission/sending rate of new data.
- Measured in bps (Bits per Second)
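A small sketch of how the terms above relate, assuming hypothetical helper names and byte counts measured over some interval. It only restates the definitions (SWS as the smaller of the two windows, goodput as throughput minus retransmissions).

```python
def sender_window(cwnd: int, rawnd: int) -> int:
    """SWS = min(CWND, RAWND)."""
    return min(cwnd, rawnd)


def rates_bps(new_bytes: int, retransmitted_bytes: int, seconds: float):
    """Return (throughput, goodput) in bits per second."""
    throughput = (new_bytes + retransmitted_bytes) * 8 / seconds
    goodput = new_bytes * 8 / seconds  # retransmissions do not count
    return throughput, goodput
```

For example, 1,000,000 new bytes plus 250,000 retransmitted bytes in 10 seconds gives a throughput of 1 Mbps but a goodput of only 0.8 Mbps.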
Congestion Control Algorithms
Read from Section 3.6 and Section 3.7 of the ‘Computer Networking - A Top-Down Approach’ book.
Slow Start (SS)
- Exponential growth (Doubling) to rapidly increase sending rate
- ‘SS Threshold’
- CWND size at which SS stops.
- CWND
- Initial
- CWND = 1 MSS
- SS threshold = Large value
- Incrementing CWND
- CWND size = CWND size + 1 MSS per ACK, which implies doubling the CWND size every RTT.
- Incrementing CWND stops when CWND = min(SS Threshold, RAWND).
- New SS Threshold value?
- SS phase stop (Whichever of the following occurs first.)
- Packet loss occurs
- Implies congestion
SS Threshold = CWND size / 2
- Packet loss indicators
- RTO expiring (Timeout)
- Heavy congestion
- Duplicate ACKs
- Low to moderate congestion
- SS Threshold is reached
- RAWND is reached
- Rate of sending becomes constant
- Ramps up sending rate faster than AIMD.
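The Slow Start growth rule above can be sketched as a tiny simulation. This assumes no loss, one ACK per segment, and CWND counted in whole MSS units; `slow_start` is a hypothetical name, not part of any TCP stack.

```python
def slow_start(ss_threshold: int, rawnd: int, initial_cwnd: int = 1) -> list:
    """Return CWND (in MSS) at the start of each RTT until Slow Start stops."""
    limit = min(ss_threshold, rawnd)
    cwnd = initial_cwnd
    history = [cwnd]
    while cwnd < limit:
        acks = cwnd                      # one ACK per segment sent this RTT
        cwnd = min(cwnd + acks, limit)   # +1 MSS per ACK, so CWND doubles
        history.append(cwnd)
    return history
```

With `ss_threshold=16` and `rawnd=64` the window goes 1, 2, 4, 8, 16 and then stops growing exponentially.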
Congestion Avoidance
- Linear increase of the sending rate rather than an exponential increase (as in Slow Start), as Congestion Avoidance slowly probes for the congestion point.
- The congestion point needs to be probed for to be able to operate at optimal throughput (just below link capacity).
- Congestion Avoidance practices AIMD (Additive Increase, Multiplicative Decrease).
- Starts after ‘SS Threshold’ is hit in Slow Start.
- AIMD
- Additive Increase (AI)
- CWND size = CWND size + MSS * (MSS / CWND size) per ACK, which implies increasing the CWND size by about one MSS every RTT.
- Linear increase
- Multiplicative Decrease (MD)
- CWND size = CWND size / 2
- This is a multiplicative decrease, as CWND is multiplied by a factor of 1 / 2.
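A minimal sketch of AIMD with CWND in bytes, assuming the per-ACK increase of `MSS * MSS / CWND` used by RFC 5681 style implementations. The function names are hypothetical.

```python
def additive_increase(cwnd: float, mss: int) -> float:
    """Per-ACK growth; adds up to roughly one MSS over a full RTT."""
    return cwnd + mss * mss / cwnd


def multiplicative_decrease(cwnd: float) -> float:
    """Halve the window when loss signals congestion."""
    return cwnd / 2


def one_rtt_of_acks(cwnd: float, mss: int) -> float:
    """Apply additive increase once for every segment ACKed in one RTT."""
    for _ in range(int(cwnd // mss)):
        cwnd = additive_increase(cwnd, mss)
    return cwnd
```

Starting from a CWND of 10,000 bytes with an MSS of 1,000 bytes, one RTT of ACKs grows the window by a little under 1,000 bytes, which is the 'one MSS per RTT' linear increase.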
Fast Retransmit
- On receiving three consecutive duplicate ACKs, the sender immediately re-transmits the presumably lost packet.
- This keeps the channel in use instead of leaving it idle until the RTO expires and triggers a re-transmission.
- The packet might not be lost and might just arrive late, but Fast Retransmit is used anyway to speed up the transfer and use the link capacity well.
- This is fair to do, as loss detection by duplicate ACKs implies that the network is less congested than when a loss is detected by an RTO expiring (which suggests that packets or ACKs are not getting through at all), so a possibly unnecessary retransmission is an acceptable price for faster, more efficient communication.
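The duplicate-ACK trigger can be sketched as a counter over the stream of ACK numbers the sender sees. This is a simplified model with hypothetical names; it ignores window updates and SACK.

```python
DUP_ACK_THRESHOLD = 3  # three duplicates after the original ACK


def fast_retransmit_points(acks):
    """Return the ACK numbers for which Fast Retransmit would fire."""
    retransmit = []
    last_ack, dup_count = None, 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                retransmit.append(ack)  # resend the segment starting at `ack`
        else:
            last_ack, dup_count = ack, 0
    return retransmit
```

For the ACK stream `[1000, 2000, 3000, 3000, 3000, 3000, 7000]`, the segment starting at 3000 is retransmitted once, without waiting for the RTO.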
Fast Recovery
- Fast Recovery works with Fast Retransmit.
- It starts on the detection of three consecutive duplicate ACKs.
- Duplicate ACKs indicate low to moderate congestion.
- On the detection of three consecutive duplicate ACKs
- Fast Retransmit kicks in here and sends the missing packet.
- SS Threshold = CWND size / 2
- Now CWND size = SS Threshold + 3 MSS (1 MSS per duplicate ACK and there are three duplicate ACKs here), which implies that the CWND is artificially inflated.
- For every duplicate ACK received after the three duplicate ACKs, CWND size = CWND size + 1 MSS
- Once the ACK for the retransmitted packet is received, CWND size = SS Threshold, which implies that the CWND is deflated and returned to its usual condition.
- Fast Recovery now goes to the Congestion Avoidance algorithm, but if a timeout (RTO expiry) occurs, then it goes to the Slow Start algorithm.
- More info (RFC 5681, Section 3.2)
- Where is the Slow Start Threshold value set by TCP Reno Fast Recovery used?
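The inflate/deflate steps above can be sketched as a small state holder, with CWND and SS Threshold counted in MSS. The class and method names are hypothetical, and the sketch follows the simplified rules in these notes rather than every detail of RFC 5681.

```python
class RenoFastRecovery:
    def __init__(self, cwnd: float, ss_threshold: float):
        self.cwnd = cwnd
        self.ss_threshold = ss_threshold
        self.dup_acks = 0
        self.in_recovery = False

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.dup_acks == 3 and not self.in_recovery:
            self.ss_threshold = self.cwnd / 2
            self.cwnd = self.ss_threshold + 3   # inflate: 1 MSS per dup ACK
            self.in_recovery = True
            return "fast_retransmit"
        if self.in_recovery:
            self.cwnd += 1                      # keep inflating per extra dup ACK
        return None

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ss_threshold       # deflate, go to Congestion Avoidance
            self.in_recovery = False
        self.dup_acks = 0

    def on_timeout(self):
        self.cwnd = 1                           # back to Slow Start
        self.dup_acks = 0
        self.in_recovery = False
```

Starting at a CWND of 20 MSS, the third duplicate ACK sets SS Threshold to 10 and CWND to 13, a fourth duplicate raises CWND to 14, and the ACK for the retransmitted packet brings CWND back to 10.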
Explicit Congestion Notification
- ECN: Explicit Congestion Notification
- Section 3.7.2 of the ‘Computer Networking - A Top-Down Approach’ book.
- What is the ECN (Explicit Congestion Notification) flag within a TCP header used for?
- Explicit Congestion Notification (ECN) Explained
Versions of TCP
- TCP Tahoe
- TCP Reno
- TCP NewReno
- TCP CUBIC
- TCP Vegas
- TCP BBR (TCP Bottleneck Bandwidth and RTT)
- CTCP (Compound TCP)
- FAST TCP (FAST Active Queue Management Scalable TCP)
- TCP Veno
- TCP Westwood
- TCP Bic
- H-TCP (TCP Hamilton)
- HS-TCP (Highspeed TCP)
- TCP Hybla
- TCP Illinois
- TCP SACK
- DCTCP (Data Center TCP)
and more…
TCP Tahoe
- A Loss-based Congestion Control Algorithm.
- Congestion Control algorithms used
- Slow Start (SS)
- Congestion Avoidance
- Only Additive Increase (AI)
- No Fast Recovery: any detected packet loss (an RTO expiry or, with Fast Retransmit, three duplicate ACKs) sets CWND size = 1 MSS and restarts Slow Start.
TCP Reno
- A Loss-based Congestion Control Algorithm.
- Congestion Control algorithms used
- Packet loss detection
- Timeout (RTO expiry)
CWND size = 1 MSS
- Three consecutive duplicate ACKs
CWND size = CWND size / 2 (MD)
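A rough side-by-side of how Tahoe and Reno react to a loss signal, with CWND in MSS. The function name and signal strings are hypothetical, and Tahoe is modelled under the assumption that it collapses to 1 MSS on every loss signal it acts on.

```python
def react_to_loss(version: str, signal: str, cwnd: float):
    """Return (new_cwnd, new_ss_threshold) in MSS after a loss signal."""
    if signal not in ("timeout", "triple_dup_ack"):
        raise ValueError(f"unknown signal: {signal}")
    ss_threshold = cwnd / 2
    if version == "tahoe" or signal == "timeout":
        return 1, ss_threshold             # restart from Slow Start
    if version == "reno":
        return ss_threshold, ss_threshold  # halve (MD) and keep going
    raise ValueError(f"unknown version: {version}")
```

The only difference in this model is the triple-duplicate-ACK case, where Reno halves CWND instead of dropping to 1 MSS.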
TCP CUBIC
- A Loss-based Congestion Control Algorithm.
- Similar to TCP Reno, but has changes in the Congestion Avoidance phase.
- More info
TCP Vegas
- A Delay-based Congestion Control Algorithm.
- It compares the current Throughput with Throughput when the link was uncongested, and decides the current sending rate based on that.
- More info
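The throughput comparison in Vegas can be sketched as follows, with CWND in segments and RTTs in seconds. The `alpha` and `beta` thresholds here are illustrative assumptions and vary between implementations.

```python
def vegas_adjust(cwnd: float, base_rtt: float, current_rtt: float,
                 alpha: float = 2, beta: float = 4) -> float:
    """Return the next CWND based on expected vs actual throughput."""
    expected = cwnd / base_rtt             # throughput if the link were uncongested
    actual = cwnd / current_rtt            # throughput currently being achieved
    diff = (expected - actual) * base_rtt  # estimated segments queued in the network
    if diff < alpha:
        return cwnd + 1                    # little queueing: speed up
    if diff > beta:
        return cwnd - 1                    # queues building: slow down
    return cwnd
```

When the current RTT equals the base RTT the window grows, and when the RTT doubles (a large queueing delay) the window shrinks, all without waiting for a packet loss.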
TCP BBR
- BBR: Congestion-Based Congestion Control (Research paper)
- TCP BBR - Exploring TCP congestion control
DCTCP
- DCTCP: Data Center TCP
- Prerequisite: Explicit Congestion Notification (ECN)
- DCTCP is a TCP Congestion Control scheme for Data Center traffic.
- DCTCP extends ECN processing to estimate the fraction of bytes that encounter congestion rather than simply detecting that some congestion has occurred (as in ECN). DCTCP then scales the CWND based on this estimate.
- DCTCP achieves high burst tolerance, low latency and high throughput.
- DCTCP modifies ECN for Congestion Control, but requires standard TCP Congestion Control practices (Fast Retransmit, Fast Recovery, etc.) for packet losses and timeouts.
- DCTCP reacts to congestion at most once per CWND.
- The growth of the CWND is handled just like normal TCP, with Slow Start, Congestion Avoidance, etc.
- In the absence of ECN, variables like CWND and SS Threshold should be handled like conventional TCP.
- RFC 8257: Data Center TCP (DCTCP): TCP Congestion Control for Data Centers
- Research paper: Data Center TCP (DCTCP)
- Datacenter TCP explained
- Enabling DCTCP
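The 'estimate the fraction, then scale the CWND' idea can be sketched using the update rules described in RFC 8257. The function names are hypothetical, and `g = 1/16` is the estimation gain that the RFC recommends.

```python
def update_alpha(alpha: float, marked_bytes: int, acked_bytes: int,
                 g: float = 1 / 16) -> float:
    """Moving average of the fraction of bytes that saw congestion marks."""
    fraction = marked_bytes / acked_bytes if acked_bytes else 0.0
    return (1 - g) * alpha + g * fraction


def scale_cwnd(cwnd: float, alpha: float) -> float:
    """Reduce CWND in proportion to the estimated congestion level."""
    return cwnd * (1 - alpha / 2)
```

With `alpha = 1` (every byte marked) the window is halved like in conventional TCP, while a small `alpha` only trims the window slightly.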
Enabling a TCP Congestion Control Algorithm
Instructions for Linux.
- Check available TCP Congestion Control algorithms
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic
- Check the current TCP Congestion Control algorithm
$ sysctl net.ipv4.tcp_congestion_control
net.ipv4.tcp_congestion_control = cubic
- List all available loadable TCP Congestion Control Linux kernel modules
$ find /lib/modules/$(uname -r) -type f -name '*.ko*' | grep tcp
/lib/modules/4.15.0-169-generic/kernel/net/netfilter/xt_tcpudp.ko
/lib/modules/4.15.0-169-generic/kernel/net/netfilter/xt_tcpmss.ko
/lib/modules/4.15.0-169-generic/kernel/net/rds/rds_tcp.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_dctcp.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_hybla.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_vegas.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_bic.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_nv.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_cdg.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_veno.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_diag.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_bbr.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_illinois.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_westwood.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_yeah.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_probe.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_highspeed.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_scalable.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_htcp.ko
/lib/modules/4.15.0-169-generic/kernel/net/ipv4/tcp_lp.ko
/lib/modules/4.15.0-169-generic/kernel/drivers/usb/typec/tcpm.ko
/lib/modules/4.15.0-169-generic/kernel/drivers/atm/atmtcp.ko
/lib/modules/4.15.0-169-generic/kernel/drivers/rapidio/switches/idtcps.ko
/lib/modules/4.15.0-169-generic/kernel/drivers/scsi/libiscsi_tcp.ko
/lib/modules/4.15.0-169-generic/kernel/drivers/scsi/iscsi_tcp.ko
/lib/modules/4.15.0-169-generic/kernel/drivers/staging/typec/tcpci.ko
- Load the DCTCP Linux kernel module
$ sudo modprobe tcp_dctcp
- Check available TCP Congestion Control algorithms again
$ sysctl net.ipv4.tcp_available_congestion_control
net.ipv4.tcp_available_congestion_control = reno cubic dctcp
- The current TCP Congestion Control algorithm can be changed as well
$ sudo vim /etc/sysctl.conf # Add `net.ipv4.tcp_congestion_control=dctcp` to the last line of the file.
$ sudo sysctl -p # Load the configuration (from `/etc/sysctl.conf`) to apply the changes
# OR
$ sudo sysctl -w net.ipv4.tcp_congestion_control=dctcp
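Besides the system-wide setting, Linux also allows a per-socket choice through the `TCP_CONGESTION` socket option. The sketch below assumes Linux and Python 3.6+ (where `socket.TCP_CONGESTION` is defined); both helper names are hypothetical, and the chosen algorithm must already be available (for example after `modprobe tcp_dctcp`).

```python
import socket


def parse_algorithms(sysctl_line: str) -> list:
    """Split a 'key = a b c' sysctl output line into ['a', 'b', 'c']."""
    _, _, value = sysctl_line.partition("=")
    return value.split()


def set_socket_congestion_control(sock: socket.socket, name: str) -> str:
    """Set the algorithm for one socket and return what the kernel reports."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode())
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b"\x00", 1)[0].decode()  # strip NUL padding
```

`parse_algorithms` can be used on the `sysctl` output shown above to check that an algorithm is listed before trying to select it for a socket.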
TCP Head-of-Line Blocking
- HoLB: Head-of-Line Blocking
- How does HTTP2 solve Head of Line blocking (HOL) issue
- When multiple messages are multiplexed over a single TCP connection (as in HTTP/2), a single packet at the start of the Congestion Window (CWND) that needs to be retransmitted holds back every packet after it: they are buffered at the receiver and not handed to their respective streams up the networking stack, even if the packets belonging to those other streams have all arrived.
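The blocking effect can be sketched with a toy receiver that only releases packets in sequence order. The names and the `(seq, stream)` tuples are hypothetical; real TCP sequences bytes, not packets.

```python
def deliver_in_order(arrivals, next_seq=0):
    """arrivals: (seq, stream_id) pairs in arrival order.

    Returns, for each arrival, the list of packets handed up the stack.
    """
    buffered = {}
    timeline = []
    for seq, stream in arrivals:
        buffered[seq] = stream
        delivered = []
        while next_seq in buffered:        # release only contiguous data
            delivered.append((next_seq, buffered.pop(next_seq)))
            next_seq += 1
        timeline.append(delivered)
    return timeline
```

If packet 0 (stream A) is lost and retransmitted, packets 1 (stream B) and 2 (stream C) sit in the buffer even though they are complete, and all three are released together only once packet 0 arrives.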
TCP Segmentation Offload
- TSO: TCP Segmentation Offload
- A ‘Large Send Offload’ (LSO) technique
- More on TSO