Monday, August 31, 2009

The Design Philosophy of the DARPA Internet Protocols

This article gives a design rationale for the DARPA Internet protocols. The primary motivation was connecting a variety of separately administered networks and supporting a variety of applications on top of them. The targeted applications tended to be non-jitter-sensitive ones with varying bandwidth requirements, so packet switching was chosen over circuit switching. Another important goal was tolerance of failures; fault tolerance, combined with separate administration and the least common denominator of a variety of networks, led to the datagram model: nothing was to be assumed of the intermediate gateways in a connection, so all connection state would be kept at the ends of the connection. The IP layer, then, provides exactly this service, on top of which end-user services (TCP being the initial example) could be built with modifications only at the end points.
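As a concrete (if anachronistic) illustration of fate-sharing, here is a toy sketch of my own, not from the article: a stop-and-wait transfer over a channel that may silently drop packets. The network object keeps no per-connection state; every sequence number, retransmission decision, and reassembly buffer lives at the two ends, roughly as in TCP.

    import random

    class UnreliableNetwork:
        """Stateless, IP-like channel: best-effort delivery, no failure signals."""
        def __init__(self, loss=0.3):
            self.loss = loss
        def send(self, packet):
            if random.random() < self.loss:
                return None        # dropped silently; the sender gets no notice
            return packet

    class Receiver:
        """Half of the connection state is resident at the receiving end."""
        def __init__(self):
            self.expected = 0
            self.received = []
        def deliver(self, packet):
            seq, payload = packet
            if seq == self.expected:   # out-of-order or duplicate data is ignored
                self.received.append(payload)
                self.expected += 1
            return seq                 # end-to-end ack (assumed lossless here)

    def stop_and_wait_send(net, receiver, chunks):
        """The sender holds the other half: sequence numbers and retries."""
        for seq, payload in enumerate(chunks):
            while True:
                delivered = net.send((seq, payload))
                ack = receiver.deliver(delivered) if delivered else None
                if ack == seq:
                    break              # a miss here models timeout + retransmit

    rx = Receiver()
    stop_and_wait_send(UnreliableNetwork(), rx, ["he", "ll", "o"])
    assert rx.received == ["he", "ll", "o"]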

Probably the biggest tradeoff mentioned in the article is one of end-host complexity. Because the network provides only a minimalist datagram service and the end hosts must tolerate fail-stop failures of everything between them, implementing the protocols on the end hosts is not easy (compared to, e.g., network architectures that explicitly signal failures or flow control limits).

Somehow, despite placing high importance on resilience to explicit attack, the design considerations mentioned fail to account for the presence of an adversary on the network who can introduce non-fail-stop failures. The article recognizes this as a loss of robustness:

In this respect, the goal of robustness, which led to the method of fate-sharing, which led to host-resident algorithms, contributes to a loss of robustness if the host misbehaves.
but fails to mention the apparent design assumption that the separately administered hosts are reasonably well-behaved.

It is interesting that the article describes routers' not carrying connection state as a robustness feature. At our current scale, it seems more like a scalability feature. Though the article talks about performance considerations (particularly the difficulty of specifying them to defense contractors) and mentions network-performance-sensitive activities like teleconferencing, performance generally and scalability in particular did not appear to be major goals. (The closest among the enumerated goals is “[permitting] host attachment with a low level of effort”, which was observed not to be well achieved because of the protocol reimplementation effort.)

The End-to-End Argument

Caricature of argument:
Detecting failures at every step is too much work! Just try the whole thing and see if it worked.
Justifications:
duplicated work
Because there are possible failures at multiple levels that look the same ("duplicate" message, corrupted data, requests that were not processed, etc.), checks should happen at the higher layers anyway (a concrete sketch follows this list).
match the goal
By checking the high-level guarantees explicitly, we may detect, “for free”, failures in components we wouldn't have thought to verify (e.g. router buffers in the network).
more moving parts
More integrity checking (e.g. hop-by-hop) means more code to find bugs in.
avoid excess functionality
Not every high-level task needs feature X (e.g. duplicate suppression), so always providing it at the lower level will be wasted effort.
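To make the caricature concrete, here is a minimal sketch (mine, not the paper's) of the classic “careful file transfer” example: the sender hashes the whole file at the source and the receiver re-verifies it at the destination, so corruption anywhere along the way (links, router buffers, disks) is caught by one end-to-end check, regardless of what the intermediate hops verify.

    import hashlib

    def package_file(path):
        """At the source: read the file and compute an end-to-end checksum."""
        data = open(path, "rb").read()
        return data, hashlib.sha256(data).hexdigest()

    def accept_file(data, digest):
        """At the destination: verify the whole transfer, however it traveled."""
        if hashlib.sha256(data).hexdigest() != digest:
            raise IOError("end-to-end check failed; request a retransfer")
        return data

Note that this check matches the goal (the bytes accepted equal the bytes sent), which is exactly what no amount of hop-by-hop checking can promise.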

We love abstraction; we just give our problem to the next layer down and never think about it again… The end-to-end argument derives from the experience that this strategy fails: the lower layers lack knowledge of the overall application, and in practice the feature is (or should have been) reimplemented at a higher layer.

The examples in the paper, then, are failures of abstraction. In some cases (the file transfer relying on the transport protocol for integrity, or relying on the communication subsystem for encryption), it really may make sense to move the work into the application layer. In other cases there is still hope for modularity: it's just that the lower layer is providing the wrong interface. Acking only after a message is processed, detecting duplicate messages based on their content, ordering messages by vector clocks instead of stream order, etc.; convenient abstractions exist for these (a sketch of the first two follows).
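For instance, here is a sketch (the names and structure are mine) of those first two interfaces: a handler that suppresses duplicates by a content-derived ID and acknowledges only after processing, so a crash between processing and acking leads at worst to a harmless retransmission.

    import hashlib

    class IdempotentEndpoint:
        def __init__(self, handler):
            self.handler = handler
            self.processed = set()         # duplicates detected by content, not seq

        def on_message(self, body: bytes) -> str:
            msg_id = hashlib.sha256(body).hexdigest()
            if msg_id not in self.processed:
                self.handler(body)         # process first...
                self.processed.add(msg_id) # ...record the content ID...
            return msg_id                  # ...then ack; never ack before processing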

One should not be surprised that a layer that solves a weaker problem is not, in fact, useful for solving the stronger problem. The more compelling cases come when no substitute layer could solve the real problem without substantial application intrusion, as in the end-to-end integrity and fault-tolerance cases.