Wednesday, September 2, 2009

Understanding BGP Misconfiguration

This is a study of BGP misconfigurations inferred from changes in route announcements. The authors chose short-lived route announcements as candidate misconfigurations, then guessed the type of misconfiguration from the AS graph (assuming only the usual provider/peer/customer relationships) combined with an e-mail survey of operators. They found that misconfigurations were relatively common, affecting as much as 1% of announced prefixes during their relatively short study. A majority of misconfigurations were apparently due to poor router usability, either through slips in configuration files or through problems recovering old routes after a reboot. Some misconfigurations were a surprise to the operators themselves, notably those caused by misplaced assumptions about their provider's filtering and some caused by router initialization bugs; and, of course, some were false positives arising from Internet routing arrangements that are not fully captured by the authors' model of inter-network business relationships.
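
To make the detection heuristic concrete, here is a minimal sketch (in Python) of how short-lived announcements could be flagged as candidate misconfigurations from a simplified BGP update stream. The record format, the field names, and the roughly one-day "short-lived" threshold are my assumptions for illustration, not the authors' actual tooling.

    from collections import namedtuple

    # Hypothetical, simplified BGP update records (real MRT dumps are richer):
    # announcements carry the origin AS, withdrawals carry only the prefix.
    Update = namedtuple("Update", "ts prefix origin_as kind")  # kind: "A" or "W"

    # The threshold for "short-lived" is an assumption, roughly in the spirit of
    # treating routes that last less than about a day as candidates.
    SHORT_LIVED_SECONDS = 24 * 60 * 60

    def candidate_misconfigurations(updates):
        """Return (prefix, origin_as, lifetime_seconds) for routes that were
        announced and then withdrawn within the short-lived threshold."""
        live = {}        # prefix -> (announce_time, origin_as)
        candidates = []
        for u in sorted(updates, key=lambda u: u.ts):
            if u.kind == "A":
                live.setdefault(u.prefix, (u.ts, u.origin_as))
            elif u.kind == "W" and u.prefix in live:
                announced, origin = live.pop(u.prefix)
                lifetime = u.ts - announced
                if lifetime < SHORT_LIVED_SECONDS:
                    candidates.append((u.prefix, origin, lifetime))
        return candidates

    # Example: a prefix that disappears after an hour is flagged; one that
    # stays up (no withdrawal in the window) is not.
    updates = [
        Update(0,    "192.0.2.0/24",    64500, "A"),
        Update(3600, "192.0.2.0/24",    None,  "W"),
        Update(0,    "198.51.100.0/24", 64501, "A"),
    ]
    print(candidate_misconfigurations(updates))   # [('192.0.2.0/24', 64500, 3600)]

The paper then goes further, classifying each candidate by likely cause using AS relationships and operator responses; the sketch above covers only the first filtering step.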

Perhaps the most remarkable finding in this paper is that most of these problems do not adversely affect connectivity. In this sense, BGP appears to have partially achieved the old goal of survivability, if only by accident. (Indeed, it seems to perform better than the Registries' databases at providing information about how to reach networks.) The most serious effect seems to come from the added load these background misconfigurations place on the system, and from the more serious failures they hint at. They are reminders that configuration is neither very automated nor well understood, so future globally disruptive routing events still seem quite possible.

2 comments:

  1. They don't affect connectivity much relative to the overall failure rate... but if I understand their numbers correctly (and their extrapolations hold), then about 10% of *all route updates* lead to some connectivity loss. That seems pretty big to me, despite their dismissal of it as small compared to the real failure rate.

  2. This simply means that some small edge network becomes unreachable. For this to matter, it needs to be weighted by how much traffic originates or terminates in that AS. Not easy to calculate, but it could be a tiny piece of the traffic for a small AS (even if Berkeley gets disconnected, it has little effect on the Internet at large, though a big effect on us).
