Monday, August 31, 2009

The End-to-End Argument

Caricature of argument:
Detecting failures at every step is too much work! Just try the whole thing and see if it worked.
Justifications:
duplicated work
Because failures at multiple levels can look the same ("duplicate" messages, corrupted data, requests that were not processed, etc.), the higher layers have to check for them anyway.
match the goal
By checking the high-level guarantees explicitly, we may detect, “for free”, failures in components we wouldn't have thought to verify (e.g. router buffers in the network).
more moving parts
More integrity checking (e.g. hop-by-hop) means more code to find bugs in.
avoid excess functionality
Not every high-level task needs feature X (e.g. duplicate suppression), so always providing it at the lower level will be wasted effort.
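As a rough illustration of the "match the goal" point, here is a minimal Python sketch in which every link verifies its own checksum, yet a bit flipped inside a router buffer is only noticed by the end-to-end check. All names (`send_over_link`, `faulty_router_buffer`, `transfer`) are made up for this example, and the fault is injected deliberately.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def send_over_link(data: bytes) -> bytes:
    """One hop. The link computes and verifies its own checksum."""
    sent_sum = checksum(data)
    received = data  # assumption: the wire itself is clean in this sketch
    assert checksum(received) == sent_sum  # hop-by-hop check passes
    return received


def faulty_router_buffer(data: bytes) -> bytes:
    """Hypothetical fault: one bit flips while the data sits in router memory,
    after the incoming link check and before the outgoing one."""
    corrupted = bytearray(data)
    corrupted[0] ^= 0x01
    return bytes(corrupted)


def transfer(data: bytes) -> bytes:
    at_router = send_over_link(data)            # hop 1: check passes
    buffered = faulty_router_buffer(at_router)  # corruption between checks
    return send_over_link(buffered)             # hop 2: check passes too


original = b"end-to-end"
expected = checksum(original)            # computed by the sending application
received = transfer(original)
end_to_end_ok = checksum(received) == expected  # verified by the receiving application
```

Both link checks succeed, but `end_to_end_ok` comes out false: the only check that notices the damage is the one that compares what the application sent with what the application received.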

We love abstraction; we just give our problem to the next layer down and never think about it again… The end-to-end argument derives from the experience that this strategy fails because the lower layers do not have knowledge of the overall application. In practice, the feature is reimplemented in a higher layer, or should have been.

The examples in the paper, then, are failures of abstraction. In some cases (file transfer relying on the transport protocol for integrity, or relying on the communication subsystem for encryption), it may really make sense to move the action to the application layer. In other cases, there is still hope for modularity: it's just that the lower layer is providing the wrong interface. Convenient abstractions exist for ack'ing after a message is processed, identifying duplicate messages by some part of their content, ordering messages by vector clocks instead of by a stream's order, and so on.
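Two of those interfaces can be sketched together in Python: the acknowledgement is returned only after the message has been processed, and duplicates are recognized by part of the message content (a request ID) rather than by transport-level sequence numbers. The `Receiver` class, the `handle` method, and the deposit example are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    balance: int = 0
    seen: set = field(default_factory=set)  # request IDs already applied

    def handle(self, request_id: str, amount: int) -> str:
        """Apply a deposit at most once, then acknowledge it."""
        if request_id in self.seen:
            # Duplicate by content: already applied, so just acknowledge again.
            return "ack"
        self.balance += amount
        self.seen.add(request_id)
        return "ack"  # returned only after the deposit has been applied


receiver = Receiver()
receiver.handle("deposit-42", 100)  # suppose this ack is lost on the way back
receiver.handle("deposit-42", 100)  # so the sender retries the same request
receiver.handle("deposit-43", 50)   # a genuinely new request
```

The retried request is acknowledged but not applied a second time, so the sender can safely resend until it sees an ack.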

One should not be surprised that a layer that solves a weaker problem is not, in fact, useful for solving the stronger problem. The more compelling cases come when there is no substitute layer without substantial application intrusion that would solve the real problem, as in the end-to-end integrity/fault tolerance cases.
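For the integrity case, the "try the whole thing and see if it worked" strategy might look like the following Python sketch. The `unreliable_copy` function stands in for the entire path between the two applications and corrupts data at random; it, `transfer_with_retry`, and the parameter values are assumptions made for this example.

```python
import hashlib
import random


def unreliable_copy(data: bytes, rng: random.Random, failure_rate: float) -> bytes:
    """Hypothetical transfer path that sometimes corrupts one byte."""
    if data and rng.random() < failure_rate:
        corrupted = bytearray(data)
        index = rng.randrange(len(corrupted))
        corrupted[index] ^= 0xFF
        return bytes(corrupted)
    return data


def transfer_with_retry(data: bytes, rng: random.Random,
                        failure_rate: float = 0.5, max_attempts: int = 20):
    """Send the whole thing, check it end to end, and retry on mismatch."""
    expected = hashlib.sha256(data).digest()  # computed at the sending end
    for attempt in range(1, max_attempts + 1):
        received = unreliable_copy(data, rng, failure_rate)
        if hashlib.sha256(received).digest() == expected:  # checked at the receiving end
            return received, attempt
    raise RuntimeError("transfer failed after max_attempts")


payload = b"some file contents"
result, attempts = transfer_with_retry(payload, random.Random(0))
```

Nothing in the loop needs to know where along the path the corruption happened, which is the sense in which the check cannot be delegated to any single lower layer.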

1 comment:

  1. Not sure that I agree with your characterization that doing it at all levels is too much work. The argument, in my view, is that you can't count on the underlying subnets doing the job, so you need to do it end to end regardless.
