Wednesday, September 30, 2009

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

This paper compares the performance of TCP over a wireless link under various schemes believed to improve TCP performance, primarily by dealing better with losses on the wireless link. The authors used a real wireless link between one TCP endpoint and a gateway; the gateway was connected over a wired path (Ethernet or the wide-area US Internet) to the other TCP endpoint. Though the wireless link was real, the authors simulated single-bit losses (by corrupting TCP checksums) according to an exponential distribution. (They also performed one experiment in which introduced errors caused more than one packet to be lost, finding that selective acknowledgments helped more than usual.)

The authors compare several types of schemes for improving TCP performance: link-layer solutions where the gateway resends missing segments and (in TCP-aware schemes) possibly also suppresses spurious duplicate ACKs; gateway solutions that split the TCP connection into two TCP connections; and purely end-to-end solutions. They find that several proposed improvements do help, each with caveats:

  • link-layer retransmission schemes help by themselves but work a lot better when they also suppress spurious duplicate acknowledgments, which lead to spurious retransmits and congestion window size reductions (TCP-aware link-layer schemes provided the most improvement of all the schemes tested in the paper);
  • end-to-end selective acknowledgments make good use of much of the bandwidth of the link without gateway modifications, but do not perform as well as gateway modifications especially with a wide-area connection; and
  • splitting the TCP stream at the gateway does not perform as well as link-layer snooping because of limited buffer sizes at the gateway.


Though the authors mentioned “very high interactive delays” as a motivating problem observed in TCP over wireless, they disappointingly did not attempt to measure it in this paper.

MACAW: A Media Access Protocol for Wireless LANs

This article proposes a media access protocol targeted at a particular contended wireless network. In the network environment considered by the authors, connectivity is symmetric (if A can receive from B, then B can receive from A) and relatively short-range with a sharp signal-strength drop-off. Thus, the primary efficiency concern in implementing wireless in the authors' environment is collisions at the receiver, including collisions from signals whose transmitters are out of range of the sender.

The hidden terminal problem makes carrier sense inappropriate. Instead, the authors use explicit Request-to-Send and Clear-to-Send messages to manage the channel. Additionally, the authors add an explicit acknowledgment (to guard against giving false congestion signals to TCP-like protocols) and an explicit Data-Sending message (so other nodes within range of the transmitter can avoid interfering with its reception of the CTS and ACK).

Nodes control their use of the network based on these messages. A CTS or DS message reserves the channel for long enough to send the corresponding data and its acknowledgment. Based on past apparent collisions and their (apparent) locations, nodes back off the rate at which they send RTSs, separately estimating congestion at their own location and at the receiver.
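
A minimal sketch of a MILD-style backoff counter (multiplicative increase, linear decrease), kept per destination so that congestion at the sender's and the receiver's locations is estimated separately; the 1.5 factor and the bounds are from my reading of the paper but should be treated as illustrative:

    # Hedged sketch of MILD backoff: gentler than binary exponential
    # backoff, and tracked per destination rather than globally.
    BO_MIN, BO_MAX = 2, 64  # illustrative bounds on the backoff counter

    class Backoff:
        def __init__(self):
            self.per_dest = {}            # destination -> current backoff

        def on_collision(self, dest):     # an RTS apparently collided
            bo = self.per_dest.get(dest, BO_MIN)
            self.per_dest[dest] = min(int(bo * 1.5), BO_MAX)

        def on_success(self, dest):       # RTS/CTS exchange succeeded
            bo = self.per_dest.get(dest, BO_MIN)
            self.per_dest[dest] = max(bo - 1, BO_MIN)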

The authors evaluated the MAC protocol with simulations of a variety of network topologies, verifying that the MAC protocol provided reasonable fairness between data streams and achieved good link utilization (even in the presence of hidden terminals, unresponsive nodes, etc.). Most of the simulations used loss-insensitive flows and assumed that reception success depended entirely on distance from the receiver and from competing transmitters.

It is very nice that the authors evaluated many network topologies, and these evaluations clearly show where the authors' improvements produce huge differences. However, the paper seems to lack evaluation of more realistic scenarios. Even though the authors' ACK scheme is motivated by transport protocol performance, almost none of their simulations consider the performance of a TCP-like transport protocol, or even how intermittent noise affects their link-layer protocol alone. The authors also do not consider any metric besides stream throughput, such as how much jitter and latency this MAC protocol introduces into the network.

Monday, September 28, 2009

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

This paper examines the consequences of one potential solution to the incast collapse problem (the loss of useful throughput when several simultaneous TCP streams are received by a single sink): lowering or eliminating the minimum retransmission timeout. The intuition behind this change is that it speeds recovery from serious congestion events, and that the very short and predictable round-trip times of datacenter networks do not require the current default's large margin of safety. The authors implemented this timeout reduction both in simulation and by modifying the Linux kernel TCP implementation (which required allowing timeouts to be triggered other than on the timer interrupt, and internal timers with better than 1ms granularity).

First, they evaluated this change in an incast scenario: the sink pulled 1MB of data split evenly across a varying number of servers, and they measured the effective throughput (‘goodput’). Both simulation results and experiments showed a dramatic increase in useful throughput for small (approx. 1ms) minimum retransmission timeouts compared to values closer to the current default of 200ms. Based on RTT measurements from high-bandwidth (~10Gbps) datacenter networks and on simulations, the authors further suggested that future networks will have substantially lower round-trip times, so the minRTO should be eliminated entirely. For such networks, the authors also recommended adding random variation to the computed RTO to avoid accidental synchronization (from similar RTTs).
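
For concreteness, a minimal sketch of the timeout computation under discussion; the srtt + 4*rttvar form is standard TCP, while dropping the floor and the jitter magnitude below are illustrative assumptions, not the paper's exact recipe:

    import random

    MIN_RTO = 0.200  # the conventional 200ms floor

    def rto_classic(srtt, rttvar):
        # Standard TCP RTO with the minimum clamp.
        return max(MIN_RTO, srtt + 4 * rttvar)

    def rto_fine_grained(srtt, rttvar, jitter=0.5):
        # No floor: microsecond-scale srtt/rttvar are used directly.
        # Randomize to desynchronize senders with near-identical RTTs
        # (the jitter magnitude here is an assumption, not the paper's).
        return (srtt + 4 * rttvar) * (1 + random.uniform(0, jitter))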

The authors also examined some possible negative impacts of drastically lowering or eliminating the minimum retransmission timeout. Two sources of spurious retransmissions and delays are considered: underestimates of the RTT due to a delay spike, and ACKs delayed for longer than the retransmission timeout. For both problems, the authors compared the BitTorrent throughputs observed with and without a very small minimum RTO, and found little difference in the per-flow throughput distribution with a low minimum retransmission timeout.

For delayed ACKs, the authors compared throughput in a 16-node fan-in situation with and without delayed ACKs. They found that enabling delayed ACKs (with a delayed ACK timer larger than the round-trip time) caused substantially lower throughput with large fan-in.

The solution proposed in this paper is relatively easy to deploy, simple, and seems to work with few negative consequences. The authors suggest that the spurious timeout avoidance the minRTO was set for is no longer necessary now that recovery from spurious timeouts does not cause entry into slow-start. But do the delay spikes that motivated the original minRTO setting still exist? (Is better recovery from misestimation enough?)

Sunday, September 27, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

This paper is another examination of the cause of, and solutions to, the incast problem. The authors analyze the incast phenomenon with and without the changes proposed in the CMU paper (a much lower minRTO and disabling delayed ACKs). Unlike the CMU paper, the evaluation here primarily uses a constant-data-per-source scenario rather than a constant-data-per-sink scenario. They find that this fixed-source-size workload yields more complex goodput behavior than the other scenario. And in contrast to the CMU findings, the authors find that delayed ACKs do, in fact, help against incast for large numbers of senders, because they induce increased variance in the RTT estimate and TCP window size.

The authors produced a simple numerical model of the throughput collapse behavior, assuming it is determined by the number of retransmission timeout events and the interpacket wait time. From this model and refinements of it, the authors produce an explanation of the throughput collapse: first, throughput collapses as the number of retransmission timeouts increases with the number of senders; then the width of the period in which retransmission timeouts are triggered (per congestion event) increases, so recovering senders become desynchronized; but eventually that distribution becomes so wide that senders about to experience a timeout interfere with senders just recovering from one.
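
A minimal sketch of this style of model, assuming goodput is governed by the ideal transfer time plus time spent stalled in retransmission timeouts; the growth of timeout count with sender count below is an illustrative assumption, not the paper's fitted value:

    def modeled_goodput(n_senders, data_per_sender, link_bw,
                        rto=0.200, timeouts=lambda n: 0.1 * n):
        # Ideal time: total data clocked out over the bottleneck link.
        ideal = n_senders * data_per_sender / link_bw
        # Time lost stalled in RTO, assumed to grow with the number of
        # senders (the 0.1*n form is purely illustrative).
        stalled = timeouts(n_senders) * rto
        return n_senders * data_per_sender / (ideal + stalled)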

It is disappointing that the qualitative observations which would seem to explain the throughput collapse are not incorporated into the model, nor is a good empirical way to test for their presence (beyond fitting a more complicated model) attempted or suggested.

Monday, September 21, 2009

VL2

This paper is remarkably similar to PortLand. Therefore, I will focus first on the differences:
  • Topology — the supported topologies are folded Clos networks. Similar to fat trees, these provide many paths between racks and, as in PortLand, the authors propose routing flows randomly over the paths. Both papers examine three-stage topologies, which would support approximately the same number of end-hosts for a given switch port count, though the fat tree topology examined would use considerably more switches. To achieve similar performance, VL2 assumes that switch-to-switch bandwidths are substantially higher than end-host-to-switch bandwidths. The fat tree topology is also considerably more regular, so where PortLand relies on location-specifying MAC addresses and explicit failure notifications, the VL2 scheme runs a link-state protocol between switches (but does not use it to export end-host addresses).
  • Layer choice — VL2 uses location-specifying IP addresses, providing a mapping from location-agnostic (relocatable) IP addresses to location-specific ones, analogous to PortLand's AMAC to PMAC mapping. Using IP addresses combines the two mappings PortLand maintained (AMAC to PMAC and PMAC to IP) into the one mapping VL2 maintains (AA to LA); see the sketch after this list. The disadvantage, of course, is that servers need extra support to use this mapping (doing IP-in-IP encapsulation) rather than it being backwards-compatible the way using the ARP cache is.
  • The directory servers — in this paper, they are only responsible for the AA to LA mappings (no failure matrix), and replication/distribution of the directory servers is actually implemented. Also, the directory servers are queried by end-hosts rather than by switches;
  • Autoconfiguration story — None seems apparent in this paper.
  • Hardware required — The system in this paper runs on commodity switching ASICs, relying mostly on existing switch features rather than custom switching software.
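
As promised above, a minimal sketch of the AA-to-LA indirection; the directory contents, names, and packet representation here are toy assumptions, not VL2's actual interfaces:

    # Hedged sketch of VL2's layer split: a host-side agent maps a
    # location-agnostic application address (AA) to a location address
    # (LA) via the directory service, then encapsulates IP-in-IP so the
    # fabric routes only on LAs.
    DIRECTORY = {"10.0.0.5": "20.1.1.1"}   # AA -> LA (toy directory)
    cache = {}

    def resolve_la(aa):
        # End-hosts (not switches) query the directory and cache the result.
        if aa not in cache:
            cache[aa] = DIRECTORY[aa]
        return cache[aa]

    def encapsulate(inner_packet, dst_aa):
        # Outer destination is the LA (e.g. the destination's ToR switch);
        # the inner packet keeps the relocatable AA.
        return {"outer_dst": resolve_la(dst_aa), "payload": inner_packet}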

The authors use some traces (presumably from MS data centers) to show that flow size varies a lot, traffic patterns are not very predictable, and small failures are frequent. They use these traces to motivate their choice of random load balancing rather than assuming a smarter scheme exists. For their prototype evaluation, they measure utilization and fairness under an all-to-all TCP scenario on their 75-node testbed (apparently with near-commodity hardware) and measure interference between two services (with and without large numbers of small TCP flows (‘mice’) in one service).

Probably the best feature of this paper is the authors' extensive reuse of existing protocol implementations (OSPF, ECMP, etc.). This means their fabric is probably more readily deployable and more hardware-agnostic than PortLand. It is also a disadvantage: they do not take advantage of the relatively predictable topology they demand for routing, and many more pieces of hardware (each end-host) need to be reconfigured than under the PortLand scheme, which is truly invisible to the end-hosts.

Saturday, September 19, 2009

PortLand

This paper describes a layer-2 switch fabric targeted at datacenters. The obvious goals of scalability and simple correctness (loop-free forwarding, fault-tolerance) apply, as well as ease of configuration and allowing IP addresses to move around within the datacenter.
To make this task simpler (that is, with less per-switch state) the authors fix the datacenter's topology as a fat tree and designate a centralized ‘fabric manager’ (which must be replicated for fault tolerance, of course) which coordinates between switches.

As with SEATTLE, the switches in this scheme handle all ARP requests themselves in addition to managing routing.
Switch state is kept small by distributing ‘pseudo MAC’ addresses which encode the position of the destination node in the topology; a host's actual MAC address is never distributed to other nodes. The fabric manager controls the IP to PMAC mapping, allowing switches to respond to ARP requests and to generate gratuitous ARPs if a PMAC becomes invalid.
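
A minimal sketch of what such a position-encoding address looks like; PortLand's 48-bit PMACs pack pod, position, port, and a VM id, though the exact field widths below are from memory and merely illustrative:

    # Hedged sketch of a PMAC-style 48-bit address:
    # pod (16 bits) . position (8) . port (8) . vmid (16).
    def make_pmac(pod, position, port, vmid):
        value = (pod << 32) | (position << 24) | (port << 16) | vmid
        return ":".join("%02x" % ((value >> s) & 0xFF)
                        for s in range(40, -8, -8))

    # A switch can forward on prefixes of this address instead of
    # storing one forwarding entry per host:
    print(make_pmac(pod=2, position=1, port=3, vmid=1))  # 00:02:01:03:00:01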

Forwarding is kept loop-free by assuming a rooted fat tree and by following the rule that a packet changes from going up toward the root to down toward a node exactly once. For load balancing, switches can choose freely between upstream ports (the authors suggest switching on flows to keep each flow single-path); for fault-tolerance, the fabric manager distributes failure information to affected switches who then update their routing tables accordingly.

Probably the most clever scheme here is switch autoconfiguration. Bottom level switches discover that they are connected to enough non-switches, and the level number of higher-level switches propagates up the fat-tree. Position number assignment is done in a greedy fashion, synchronized by a higher level switch.
Combined with failed link detection, this should make (re)configuration automatic except for the pesky issue of the fabric manager and communication with it.

The authors tested their system on a 16-host prototype, evaluating its response to failures, IP migration, and the overhead it imposed on the fabric manager. Failures resulted in approximately 100 milliseconds of disconnectivity (which was, of course, made worse by TCP timeouts), and in the VM migration scenario, the TCP flow stalled for periods of approximately 200-600 ms. For the fabric manager, they assumed miss rates of between 25 and 100 ARP misses/sec/host and linear scaling, extrapolated the CPU and network bandwidth requirements of the fabric manager, and found that it would require a lot of CPU resources (70 cores for the 100 ARP/sec case with ~30k hosts).

These numbers are disappointing though the authors point out that these miss rates seem unrealistic and that the fabric manager could probably be distributed to a cluster. Since presumably the authors are pretty confident that ARP miss rates are considerably lower in a real data center, it is disappointing that they didn't try to measure or intelligently estimate the rate at which they would be expected to occur. The linear extrapolation is also disappointing — couldn't the authors have really run their fabric manager on a synthetic trace with these miss rates and host count and measured the CPU usage?

Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

This paper describes a system (called NetMedic) for identifying the likely location of the root cause of faults from component state/performance traces. Machine-local information (such as CPU and memory utilization), in addition to network information (such as traffic rates and response times), is collected and assigned to “components” which, in the paper's evaluation, correspond to entities such as machines, user processes, or the collective networking infrastructure.

Since the system does not assume any domain knowledge about the measurements it receives, faults are detected based on measurements deviating from their usual values. Diagnosis uses historical correlations to determine whether an anomalous measurement is likely influencing, or influenced by, another. To determine which components' measurements to compare, domain knowledge is used: ‘templates’ mark the dependencies each component type has on other components (e.g. between a machine and all of its processes, or between processes communicating over IP). Given this directed dependency graph and the likelihood of measurements influencing each other, the ‘best’ cause is determined from the geometric mean of influence scores along paths from the alleged cause to components (weighted by the destination components' abnormality), combined with this geometric mean for the component being diagnosed.
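
A loose sketch of the path-scoring idea as I read it, assuming per-edge influence scores in (0, 1] on the dependency graph; the paper's actual formula also folds in destination abnormality weights, which I omit here:

    import math

    def path_score(edge_scores):
        # Geometric mean of the influence scores along one path from
        # the alleged cause toward an affected component.
        return math.exp(sum(math.log(s) for s in edge_scores)
                        / len(edge_scores))

    def rank_causes(candidates, paths_to_diagnosed):
        # candidates: iterable of cause names; paths_to_diagnosed maps a
        # cause to its per-edge score lists (one list per path to the
        # component being diagnosed). Higher score = more likely cause.
        return sorted(candidates,
                      key=lambda c: max(path_score(p)
                                        for p in paths_to_diagnosed[c]),
                      reverse=True)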

The authors include an evaluation where they analyze whether their system can correctly diagnose the causes of injected faults. They compare their approach to a simpler approach and a hand-tuned approach; all three share the dependency graph and root cause ranking method. The simpler method only measures whether components are ‘abnormal’ and computes influence ranks based solely on whether both components are abnormal; the hand-tuned method only measures influence between variables hand-marked as related. The authors find that their system gave the correct cause as the first-ranked cause for almost all the injected faults, and did so substantially better than the simple approach, which was successful about half of the time. The hand-tuned approach did not provide a substantial improvement, showing that the authors had not lost much by avoiding domain-specific knowledge in that part.

One troubling aspect of this work is the formulas that seem to be chosen mostly arbitrarily (e.g. using the geometric mean and sqrt(cause-to-effect value * global weighted value) for cause ranking; the choice of 0.1 and 0.8 as ‘default’ edge weights; etc.). Of course, the authors had to make some choice here, but from the results we cannot tell how sensitive correct diagnosis is to these choices — is their technique actually general enough that they mostly don't matter? (The authors did do a sensitivity analysis for one of these parameters — the length of history used for diagnosis.)

There's also the big step away from fully domain-agnostic diagnosis in requiring templates. These templates are fortunately applicable to many kinds of applications, which is presumably why the authors felt it was okay to use them but not similar input about the measurements being taken. But these templates do not look entirely trivial to make: the ones they show have many subtle dependencies, such as on machine firewalls and other traffic. Are they actually easy to infer? And given that the authors have gone to great effort to develop a method for ignoring irrelevant correlations, how useful are they?

Floodless in SEATTLE

This paper proposes an architecture for managing Ethernet routing over an enterprise-type network. It aims to provide more scalability than Ethernet bridging, and easier administration and address portability than IP subnetting and VLAN-based approaches. In this approach, switches use a link-state protocol to route amongst themselves and, using this, form a DHT to store end-host routing information. This DHT distributes the routing information across the switches, so the amount of information stored by each switch is proportional to the average number of end hosts per switch. This information is, of course, not local to the switches which need it, so switches maintain a cache of it, invalidated (by the destination switch) on the use of a stale entry. To interact with unmodified end hosts, the authors implement ARP and DHCP backed by information in the DHT. ARP broadcasts used to invalidate cached entries are translated into unicasts to interested hosts by tracking which hosts may have cached entries.
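
A minimal sketch of the one-hop DHT lookup, assuming (as in the paper) that the link-state protocol gives every switch the full switch membership, so consistent hashing picks the resolver switch directly; names below are illustrative:

    import hashlib

    def h(x):
        return int(hashlib.sha1(x.encode()).hexdigest(), 16)

    def resolver_switch(host_mac, switches):
        # The resolver for a host is the switch whose hash is the
        # successor of hash(host_mac) on the ring; since link state
        # gives everyone the switch list, resolution takes one hop.
        key = h(host_mac)
        ring = sorted(switches, key=h)
        for s in ring:
            if h(s) >= key:
                return s
        return ring[0]  # wrap around the ring

    print(resolver_switch("00:11:22:33:44:55", ["sw1", "sw2", "sw3"]))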

The authors evaluated their design using a trace-driven simulation on several real network topologies and an evaluation with an actual implementation on a very small network.
From the simulations, they showed that the information maintained per switch was actually smaller than in Ethernet-bridging-based approaches, and that it achieved better routes (given similar cache sizes) than past identity-based routing proposals. From the actual implementation, they measured the per-packet overhead, performance when attempting to contact many non-existent hosts, and switch fail-over speed.

It's unfortunate that this paper did not include an explicit comparison to IP subnetting or VLANs, by which it is clearly motivated. It is understandable why, of course; their arguments against IP subnetting are based on ease of configuration which is very difficult to measure. For VLANs, only some of their complaints were about ease of configuration, but for the others it might be hard to infer a suitable configuration from their topology information.

This scheme is attractive for its careful attention to complete transparency for end-hosts, which should make it relatively easy to deploy. The big problem is that deploying it in the most straightforward way involves changing all switches over; it would seem, however, that there's probably some way of constructing a reasonable middle-ground where it is deployed gradually across the network.

Monday, September 14, 2009

Random Early Detection Gateways

This describes a queue discipline that gives better congestion signals to TCP flows than Drop Tail and slight refinements on Drop Tail like Early Random Drop. The authors observe that a problem with Drop Tail and Early Random Drop is that they only act when the queue is near-full, thereby discriminating against bursty flows and causing global synchronization. To avoid this, their scheme (RED) tracks a long-term average queue size and ‘marks’ (for TCP, drops) packets according to this average such that marks are distributed approximately evenly over time. The authors evaluate their scheme with multiple TCP flows with different round-trip times going over a bottleneck link, and show that it avoids global synchronization and achieves fairness for bursty flows (in their example, TCP flows with a high bandwidth-delay product relative to their window size).
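
A condensed sketch of the marking rule, following the paper's algorithm: an exponentially weighted average of the queue size, a marking probability that ramps between two thresholds, and a count that spreads marks evenly between marks; the constants here are illustrative:

    import random

    W_Q, MIN_TH, MAX_TH, MAX_P = 0.002, 5.0, 15.0, 0.02  # illustrative
    avg, count = 0.0, -1

    def red_should_mark(queue_len):
        global avg, count
        avg = (1 - W_Q) * avg + W_Q * queue_len  # long-term average size
        if avg < MIN_TH:
            count = -1
            return False
        if avg >= MAX_TH:
            count = -1
            return True
        count += 1
        p_b = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)
        p_a = p_b / max(1 - count * p_b, 1e-9)   # spreads marks evenly
        if random.random() < p_a:
            count = 0
            return True
        return False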

The oddest omission from this paper to me was packet sizes: The algorithm and the simulations work in terms of whole packets apparently without considering the size of packets. This is particularly odd given that one might expect bursty interactive flows to have variable packet sizes.

This paper does remarkably simplify its problem by assuming that all flows are running a congestion control algorithm, as this scheme does not converge to fairness in the face of loss-agnostic flows (which the authors did not test). The authors propose sampling dropped packets, showing that a window of dropped packets is likely to contain close to the expected number (based on actual throughput) of dropped packets from each flow; however, they don't evaluate this scheme beyond this theoretical observation.

Congestion Control for High Bandwidth-Delay Product Networks

This paper describes a protocol (called XCP) for congestion control that utilizes explicit signaling of the degree of congestion by routers. To replace the dropped-packet signal, senders include an explicit ‘feedback’ field indicating their desired window size adjustment, which (after modification by routers) is returned to them in the ACK for the packet. Routers modify this value (guided by the sender's current window size and RTT estimate, which the sender must also include) according to both fairness and efficiency criteria. For efficiency, routers estimate the difference between the incoming bandwidth and the maximum outgoing bandwidth, and ensure that the provided feedback values add up to this difference over the measurement interval. For fairness, the additive-increase/multiplicative-decrease scheme of TCP is used to apportion the feedback, with the overall amount of increase or decrease chosen to match the measured bandwidth deficit or surplus. When the bandwidth deficit or surplus is almost 0 but fairness has not been achieved, feedback is still set for some portion of traffic (‘shuffling’) to ensure convergence to fairness.
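
A minimal sketch of the efficiency controller: over each control interval of length d (the average RTT), the router computes an aggregate feedback phi from the spare bandwidth S and the persistent queue Q, then apportions it AIMD-style across packets (apportionment omitted). The constants are the stability values I recall from the paper:

    ALPHA, BETA = 0.4, 0.226   # stability constants from the paper

    def aggregate_feedback(capacity, input_rate, persistent_queue, d):
        # phi = alpha * d * S - beta * Q: drain the queue and close the
        # gap between input rate and capacity over one control interval.
        spare = capacity - input_rate          # S: spare bandwidth
        return ALPHA * d * spare - BETA * persistent_queue

    # Positive phi is handed out as per-flow-equal window increases;
    # negative phi as decreases proportional to each flow's rate.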

The authors measured their scheme in simulation in a number of circumstances. Their simulations showed that it outperformed TCP with any of several queuing disciplines in terms of speed of convergence to near-full utilization, fairness, and behavior with competing flows. (One notable exception was that their "shuffling" scheme to ensure convergence to fairness would cause a notable loss of throughput when a router downstream of a bottleneck would try to adjust bandwidth upwards to provide a better share.) Additionally, the authors' simulations showed that their new protocol could be made not to discriminate against TCP flows.

The biggest issue with this protocol is the difficulty of implementation. To implement this scheme, routers will need to track per-flow state, and to provide it with security, would need to occasionally “test” flows to ensure that they are not cheating. The authors do propose a scheme similar to Core-Stateless Fair Queuing to isolate the changes of XCP to an administrative domain, but besides reducing the security issue and the number of potential flows to track to a function of the number of edge routers, this scheme does not seem to reduce the per-router overhead. And though the scheme to multiplex TCP flows alongside XCP flows helps, even the core/edge strategy for XCP (which the authors did not simulate) appears ill-suited for incremental deployment.

From a security standpoint, I find it somewhat disturbing that the ‘check if the flow responds’ scheme relies upon the sender-provided RTT estimate. I guess, however, that reasonable capping of this RTT would still yield an improvement upon the response time of such ‘enforcement’ schemes on traditional TCP.

Wednesday, September 9, 2009

Core-Stateless Fair Queuing

This paper describes a method to implement fair queuing across a network without requiring exceptional state tracking from most routers. Rather than computing a priority for packets, the primary mechanism used to control sending rates is probabilistic dropping based on the rate of the flow the packet is in. Computing the flow rate requires substantial per-flow state, but that state can be tracked only at the network edges and then attached to outgoing packets (and modified as necessary within the network). To determine how often to drop, it is also necessary to estimate the fair share deserved by each flow, which is done by comparing the observed overall drop and arrival rates at each router.
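
The core drop rule is compact. A minimal sketch, where the packet label r carries the flow's arrival rate (stamped at the edge) and alpha is the router's current fair-share estimate; the estimation machinery itself is omitted:

    import random

    def csfq_forward(packet_rate_label, fair_share):
        # Drop with probability max(0, 1 - alpha/r): a flow sending at
        # or below the fair share alpha is never dropped; a flow at rate
        # r > alpha is throttled, in expectation, down to alpha.
        p_drop = max(0.0, 1.0 - fair_share / packet_rate_label)
        return random.random() >= p_drop   # True = forward the packet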

The authors evaluate this mechanism over several simulated topologies with variously behaved competing flows. They show that the performance of CSFQ is often competitive with a full per-flow-state implementation of fair queuing (Deficit Round Robin). Its absolute performance is usually slightly worse than DRR's (with the exception of performing better in cases where flows suffer from not being able to temporarily exceed their "fair" buffer share to accommodate burstiness).

CSFQ has some parameters which seem to be set arbitrarily, and some for which it is not clear how to set them in general. The arbitrary parameters include thresholds used to adjust for buffering effects. The exponential smoothing constants, the authors recommend, should be set substantially larger than the delay jitter but shorter than the average duration of a flow. Since things like web page fetches are in fact short-lived flows, the average flow duration is probably relatively short; how should these constants be set when flows may traverse the globe (with presumably high delay jitter) before exiting the network, or only reach the local metropolis's peering point (with hopefully very low delay jitter)?

Analysis and Simulation of a Fair Queueing Algorithm

This article constructs and analyzes a method for multiple flows to share a bottleneck link fairly. The intuition derives from the intuitively fair algorithm when bits and parts of bits can be interleaved arbitrarily: if there are N flows and we can send one bit per unit of time, then every unit of time we want to send 1/N bits from each flow. Since packets must be kept together across the link, we can't actually do this, but the same effect can be achieved by computing the finishing time each packet would have if we could, and using these times as priorities for sending packets (ignoring effects from ties and from small packets that arrive while larger packets are being sent). To provide better delay to flows not using their full share of bandwidth, the authors also give extra priority to new packets from flows which were recently inactive (by effectively backdating the packets to fill the gap, up to some threshold).
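
A minimal sketch of the bookkeeping, following the paper's formulation: a packet's priority is its hypothetical bit-by-bit finish time F = P + max(F_prev, R(t) - delta), where R(t) is the current round number of the bit-by-bit schedule and delta is the backdating allowance for recently inactive flows:

    def finish_time(f_prev, round_now, pkt_len, delta=0.0):
        # f_prev: finish time of this flow's previous packet.
        # round_now: R(t), the round of the hypothetical bit-by-bit
        #   round robin (it advances at rate 1/N_active).
        # delta: backdating credit that lets a recently inactive flow
        #   claim a start time slightly in the past, improving its delay.
        start = max(f_prev, round_now - delta)
        return start + pkt_len

    # The queued packet with the smallest finish time is sent next.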

The authors evaluated their fair queuing algorithm (in simulation) against first-come first-served and DECbit (which gives explicit congestion notifications to flows over their fair share), running against continuous ("FTP") and intermittent ("telnet") flow-controlled streams (both traditional TCP-style and with explicit congestion notification) and against misbehaved flows. In contrast to the other algorithms, their fair queuing algorithm gives well-behaved flows good performance in the presence of an ill-behaved flow and severely punishes the ill-behaved flow (which they assume to be loss-sensitive). Even in the normal cases, they show that they provide lower latency to the intermittent flow (rewarding it for using less than its fair share of bandwidth) than the DECbit scheme.

The incidental punishment of ill-behaved flows is strange; as the authors note, it is a result of charging flows for packets that cannot be queued, but it would be nicer to make the degree of punishment more explicit. (Surely well-behaved hosts will sometimes send somewhat above their fair share, while adjusting or due to network perturbation, and should not necessarily lose throughput for it.)

Sunday, September 6, 2009

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

This article is concerned with congestion avoidance algorithms in which nodes coordinate using a single-bit signal indicating whether the network is operating above or below its desired level of load. In the model of the paper, time is divided into discrete intervals, and each user of the network sets its load for the next interval based on the feedback generated during the previous interval.

Primarily evaluated are algorithms where, at each time t, the new load is a(signal) + b(signal)*(old load), where signal is the single-bit indicator of whether the network was over the desired load at time t-1. The appropriate values of a and b for each case are analyzed in light of convergence to both the desired level of load and the desired fairness. The main result is that additive increase/multiplicative decrease (i.e. a(underloaded) > 0, a(overloaded) = 0, b(underloaded) = 1, 0 < b(overloaded) < 1) is the class of a and b functions which achieves convergence on both criteria. With respect to the rate of convergence, there is a tradeoff between quick convergence and low levels of oscillation around the goal load.
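
A minimal sketch of this control rule with illustrative constants, showing the convergence-to-fairness property:

    def next_load(x, congested, a=1.0, b=0.5):
        # Additive increase / multiplicative decrease:
        #   underloaded: x + a   (a(under)=a > 0, b(under)=1)
        #   overloaded:  b * x   (a(over)=0, 0 < b(over)=b < 1)
        return b * x if congested else x + a

    # Two users started unequally converge toward fairness: each
    # decrease shrinks their difference multiplicatively, while each
    # increase adds the same amount to both.
    x1, x2 = 1.0, 10.0
    for congested in [False, True] * 20:
        x1, x2 = next_load(x1, congested), next_load(x2, congested)
    print(round(x1, 3), round(x2, 3))   # nearly equal by now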

For me, the most interesting observation of the paper is the idealized throughput and response-time curves. TCP-style congestion control has a boolean signal of a sort (a dropped packet) — but where on that throughput/response-time curve does it fall?

Congestion Avoidance and Control

This paper describes congestion control techniques for TCP built around reaching an equilibrium between packets entering and leaving the network and maintaining that equilibrium:
reaching the equilibrium
slow start — ramp up sending speed slowly so the correct sending speed won't be overshot. Prior systems would not recover from a mismatch between their local network rate and the least capable link to the other end host because they would continuously overshoot.
linear window size increase — be conservative about increasing the amount of data in flight to avoid massively overdriving the network, from which it is hard to recover.
don't retransmit until the packet is out of the network
RTT variance estimate — because of queues, etc., round-trip time varies a lot in the network. This must be taken into account, which the authors do by creating an estimator of the RTT variance (see the sketch after this list).
backoff after retransmission enough to avoid more losses
exponential backoff — motivated by some theoretical results (about what sort of backoff is required for stability) and a desire to be as general as possible with what network topologies they support, the authors choose exponential backoff.
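
A minimal sketch of the RTT variance estimator referenced above, using the commonly cited gains (quoted from memory; the paper's appendix gives a scaled-integer version):

    def update_rtt(srtt, rttvar, measurement, g=0.125, h=0.25):
        # Jacobson's mean/deviation estimator: track a smoothed RTT and
        # a smoothed mean deviation, and set the timeout from both so
        # that RTT variance (queueing) doesn't cause spurious retransmits.
        err = measurement - srtt
        srtt += g * err
        rttvar += h * (abs(err) - rttvar)
        rto = srtt + 4 * rttvar
        return srtt, rttvar, rto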

Despite the authors' attempt at generality in these algorithms (as they note, there were TCP implementations for systems ranging from 56Kbit/s to hundreds of Mbit/s), there are some interesting assumptions about the environment. The most glaring one is that losses due to damage are rare, which seems a strange assumption when one is considering microwave links, etc. It is hard to see, though, how this sort of congestion control (only at the ends, and where loss types are not distinguishable) could function very well without assuming that at least most losses are due to congestion.

Wednesday, September 2, 2009

Understanding BGP Misconfiguration

This is a study of BGP misconfigurations inferred from changes in route announcements. The authors chose short-lived route announcements as candidate misconfigurations, then guessed the type of ‘misconfiguration’ from the AS graph (assuming the usual provider/peer/customer relationships only) combined with an e-mail survey. They found that misconfigurations were relatively common, affecting as much as 1% of route prefixes during their relatively short survey. A majority of misconfigurations were apparently due to poor usability of routers — either through apparent configuration file slips or through issues relating to recovering old routes after reboot. Some of the misconfigurations were a surprise to operators, notably those caused by misplaced assumptions about their provider's filtering and some from router initialization bugs; and of course, some were false positives from Internet routing arrangements not being fully described by their model of inter-network business relationships.

Perhaps most remarkable about this paper is that most of these problems do not adversely affect connectivity. In this sense, BGP appears to have partially achieved the old goal of survivability, if only by accident. (Indeed, it seems to perform better than the registries' databases at providing information about how to reach networks.) The most serious effect seems to come from the added load these background misconfigurations place on the system, and from the more serious possibilities they reflect. They are reminders that configuration is not very automated or well understood, so future global network-disrupting events still seem quite possible.

Interdomain Internet Routing

This article primarily describes how inter-network routing works over the present Internet. BGP, the protocol that supports such routing, is relatively simple (at least at network edges), consisting simply of annotated offers, or withdrawals of offers, to route to some set of addresses. The annotations include information about the overall path of the route, which is used to prevent loops and to help make routing decisions, but routing decisions are heavily influenced by other policies.

Much of the interesting part of inter-network routing, how these routing decisions are made and which offers are made or withdrawn, are not part of BGP. Typically, these policies are determined by the economic relationship between the two networks. A provider wants to maximize utilization of links that make it money, so it has an incentive to advertise any available routes using such links. When deciding between multiple routes to the same destination, a provider will similarly be motivated to configure their software to prefer routes using links for which they are paid, and may then decide based on traditional “technical” criteria like path length. (Of course, explicit contractual arrangements also influence these decisions.)

BGP does seem to have been designed for a smaller, friendlier Internet. The state size and its rate of change are not trivial, and are likely to get worse as more organizations sensibly want an Internet connection or a backup connection. As administered, there is a lot of trust, revealed through the poor filtering illustrated by hijacked blocks and the general lack of route authentication in the protocol. And, despite its flexibility for route announcers to choose their policy, some policies seem hard to enforce technically, such as preferences for peers to use their own networks for cross-country/sea transit. (Presumably, this could be prevented by only advertising the routes through the preferred entry point, except that this would break reachability during glitches on the peer's network, so it is easier to contractually require honoring MED when needed?)