This paper examines the consequences of one potential solution to the incast collapse problem (the loss of useful throughput when a sink receives several simultaneous TCP streams): lowering or eliminating the minimum retransmission timeout (minRTO). The intuition behind this change is that it will speed recovery from serious congestion events, and that the very short and predictable round-trip times of datacenter networks do not require the current default's large margin of safety. The authors implemented this timeout reduction both in simulation and by modifying the Linux kernel TCP implementation (which required allowing timeouts to be triggered other than on the periodic timer interrupt, and giving internal timers finer than 1ms granularity).
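As a concrete reference point, the standard RTO computation (RFC 6298) keeps smoothed estimates of the RTT and its variance and clamps the result to the configured minimum. The sketch below (with illustrative datacenter-scale RTT samples; not the paper's code) shows how a 200ms floor completely dominates the computed value on a ~100µs path:

```python
# Minimal sketch of the standard TCP RTO computation (RFC 6298), with a
# configurable minimum, showing why a 200ms floor dominates at
# datacenter RTT scales. Constants and sample RTTs are illustrative.

ALPHA, BETA, K = 1/8, 1/4, 4  # RFC 6298 gains

def rto(rtt_samples, min_rto):
    """Return the RTO (seconds) after processing rtt_samples (seconds)."""
    srtt = rttvar = None
    for r in rtt_samples:
        if srtt is None:                 # first measurement
            srtt, rttvar = r, r / 2
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
    return max(min_rto, srtt + K * rttvar)

samples = [100e-6] * 20               # ~100us datacenter round trips
print(rto(samples, min_rto=0.200))    # 0.2   -- the default floor dominates
print(rto(samples, min_rto=0.001))    # 0.001 -- a 1ms floor still dominates
print(rto(samples, min_rto=0.0))      # ~1e-4 -- no floor: tracks the path
```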
First, the authors evaluated the change in an incast scenario: the sink pulled 1MB of data split evenly across a varying number of servers, and they measured the effective throughput ('goodput'). Both simulation results and experiments showed a dramatic increase in useful throughput for small (approx. 1ms) minimum retransmission timeouts compared to values closer to the current default of 200ms. Based on RTT information from high-bandwidth (~10Gbps) datacenter networks and simulations, the authors further suggested that future networks will have substantially lower round-trip times, so the minRTO should be eliminated entirely. For such networks, they also recommended adding random variation to the computed RTO to avoid accidental synchronization among flows with similar RTTs.
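The randomization suggestion can be illustrated with a small sketch. The uniform [1.0, 1.5] scaling factor here is an assumed form chosen for illustration, not necessarily the exact scheme from the paper:

```python
# Hedged sketch of the desynchronization idea: scatter retransmissions
# from flows with near-identical RTTs by randomizing each computed RTO.
import random

def randomized_rto(base_rto):
    # Scale the computed RTO by a random factor so that flows that timed
    # out together do not also retransmit (and collide) together.
    return base_rto * random.uniform(1.0, 1.5)

base = 200e-6  # computed RTO on a low-latency path (illustrative value)
print([round(randomized_rto(base) * 1e6, 1) for _ in range(4)])  # jittered, in us
```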
The authors also examined the possible negative impacts of drastically lowering or eliminating the minimum retransmission timeout. They considered two sources of spurious retransmissions and delays: underestimating the RTT after a delay spike, and ACKs delayed for longer than the retransmission timeout. For both problems, the authors compared the BitTorrent throughputs observed with and without a very small minimum RTO. They found little difference in the per-flow throughput distribution with a low minimum retransmission timeout.
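A worked example (with assumed, illustrative numbers) shows why delay spikes are the worry: once a path is stable, the estimator's variance term shrinks, so without a floor the computed RTO can undercut a sudden spike that a 200ms floor would absorb:

```python
# Worked example of the delay-spike concern; all values are assumed for
# illustration, not taken from the paper's measurements.

K = 4                          # RFC 6298 variance multiplier
srtt, rttvar = 0.100, 0.002    # steady ~100ms wide-area RTT, low variance
base_rto = srtt + K * rttvar   # 108ms once the estimator has converged
spike = 0.150                  # a one-off 150ms delay spike

print(spike > base_rto)              # True  -- spurious timeout without a floor
print(spike > max(0.200, base_rto))  # False -- the 200ms floor absorbs it
```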
For delayed ACKs, the authors compared throughput in a 16-node fan-in scenario with and without delayed ACKs. They found that enabling delayed ACKs (with a delayed ACK timer larger than the round-trip time) substantially lowered throughput at large fan-in.
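The interaction is easy to see with rough numbers; the 40ms delayed-ACK timer below reflects a common Linux value, and the other figures are illustrative. A lone segment's ACK is held for the delayed-ACK timer, so a ~1ms RTO fires long before the ACK could arrive:

```python
# Timeline sketch of the delayed-ACK interaction with a tiny RTO.
# All values are illustrative assumptions.

rtt = 100e-6          # ~100us datacenter round trip
rto = 1e-3            # aggressive 1ms minimum RTO
delayed_ack = 40e-3   # receiver holds the ACK waiting for a second segment

ack_arrival = rtt + delayed_ack   # lone segment: ACK held for the full timer
if rto < ack_arrival:
    print(f"spurious retransmit at {rto*1e3:.1f}ms; "
          f"ACK would arrive at {ack_arrival*1e3:.1f}ms")
```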
The solution proposed in this paper is simple, relatively easy to deploy, and seems to work with few negative consequences. The authors suggest that the spurious-timeout avoidance the minRTO was designed for is no longer necessary, now that recovery from a spurious timeout does not force the connection back into slow-start. But do the delay spikes that motivated the original minRTO setting still exist? (Is better recovery from misestimation enough?)