Wednesday, October 28, 2009

Resilient Overlay Networks

Starting from the observation that BGP as deployed on the Internet does not allow for particularly fast or flexible use of the many available paths, this paper develops an overlay network (RON) to take advantage of alternate paths. Each node in the overlay network probes its (Internet-routed) link to each other node, estimating its loss rate and latency as well as detecting unavailability of the link within tens of seconds. From these measurements, RON chooses an intermediate node (if any) to optimize for an application-chosen metric.
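To make the path-selection step concrete, here is a minimal sketch of choosing at most one intermediate node to minimize latency. This is not the authors' code: the function name `best_path`, the table layout, and the sample latencies are all made up for illustration, and it assumes the latency of a two-hop path is simply the sum of the two measured hops.

```python
from math import inf
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical probe table: latency[(a, b)] is the measured latency in ms of
# the direct Internet path from a to b. A missing entry means "unavailable".
Latency = Dict[Tuple[str, str], float]


def best_path(src: str, dst: str, nodes: Iterable[str],
              latency: Latency) -> Tuple[Optional[str], float]:
    """Return (intermediate, cost); intermediate is None for the direct path."""
    best_via: Optional[str] = None
    best_cost = latency.get((src, dst), inf)  # start from the direct path
    for via in nodes:
        if via in (src, dst):
            continue
        # Assumption: two-hop latency is the sum of the two measured hops.
        cost = latency.get((src, via), inf) + latency.get((via, dst), inf)
        if cost < best_cost:
            best_via, best_cost = via, cost
    return best_via, best_cost


# Made-up measurements: the direct A->B path is slow, the detour via C is not.
example_nodes = ["A", "B", "C"]
example_latency: Latency = {("A", "B"): 100.0, ("A", "C"): 30.0, ("C", "B"): 40.0}
print(best_path("A", "B", example_nodes, example_latency))  # ('C', 70.0)
```

Optimizing for a different metric would only change how the two hops are combined; for loss, for instance, one plausible choice is 1 - (1 - p1)(1 - p2) rather than a sum.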

The authors deployed their system across approximately 20 nodes on the Internet (mostly US and mostly educational) and measured the packet loss, latency, and throughput provided in practice over their overlay network. They found a dramatic reduction in large (greater than 30%) packet loss rates and a small but significant improvement in latency and throughput across all links. These improvements persisted even when they ignored Internet2 links (which, for policy reasons, may not be used in practice for commercial traffic). They noted that different paths were chosen in practice when optimizing for different metrics (latency, loss, throughput) and that these paths were frequently not the ‘normal’ Internet route.

A subtext of this paper is that BGP and IP do a bad job: they are insufficiently reactive and expressive to make good use of the available Internet routes subject to policy constraints. The headline result, avoiding most outages to which BGP takes many minutes to respond, certainly does seem like something that a hypothetical new BGP could achieve. Of course (as the authors argue) some of the advantages come from things no BGP replacement could do: the amount of state maintained by RON as it stands certainly would not scale to hundreds of thousands of routers (probably not even to hundreds, as the authors leave this tricky issue for future work). Also, the authors note that it is to their advantage that RON only deals with a small amount of the traffic (relative to link capacity) and thus avoids inducing new failures when it reroutes traffic.
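A rough back-of-the-envelope for the scaling concern, assuming (as described above) that every node probes every other node, so the number of probed directed paths grows quadratically. The function name `probed_paths` is made up for this sketch, and real state and bandwidth costs depend on details not covered here.

```python
def probed_paths(n: int) -> int:
    """Directed node pairs when each of n nodes probes every other node."""
    return n * (n - 1)


print(probed_paths(20))       # 380, roughly the size of the deployment
print(probed_paths(100_000))  # 9999900000, hundreds of thousands of routers
```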

1 comment:

  1. An interesting question is whether independently deployed overlays could cooperate, whether they would be incentivized to do so, and, if they didn't, what the performance implications would be. You could imagine several independent overlays choosing the same uncongested logical path, thus killing its performance and negating its selection as a good alternative path.
