Monday, November 23, 2009
BotGraph: Large Scale Spamming Botnet Detection
To detect the unusually high rate of account generation, the authors propose trackng the exponentially weighted moving average of the signup rate at each IP. Because real signup rates are typically very low, the authors argue that they can easily detect mass signups. It does seem that a clever botnet would ramp up its account generation rate slowly to evade these long-term averaging techniques. (The bots would try to appear to be a popular proxy, etc.) The authors argue that such techniques would be ineffective due to the natural churn of the botnet's compromised hosts, but given that, at least in office environments, many desktop machines are certainly not turned off routinely, and legitimate signup patterns can expect to see the same sorts of diurnal and weekly variations that machines being on and off see, I do not find this convincing.
Most of this paper is devoted to defending against attacks which do not use aggressive creation of new users. The authors use the correlation between accounts and the hosts which use them. The botnets the authors observe apparently load balance these accounts across many of its hosts over time and use many accounts from any particular compromised host. The authors therefore relate any two users based on the ASs they share being logged in from. Using these as edge weights and choosing a threshold, this induces a graph the users which can be used to cluster the users into candidate bot groups. These groups can be analyzed collectively for spammy behavior (because effective spamming requires the user accounts as a whole to have a high send rate per account given the effectiveness of signup rate limiting).
An obvious attack on this scheme, mentioned in the discussion of this paper, is for botnets to not reuse accounts between ASs. The authors argue for a countermeasure of grouping by IPs or some other smaller granularity to continue to catch those botnets. Clearly, as the authors argue, spammers have something to lose when they effectively cannot reuse their accounts between arbitrary pairs of machines, which makes this a worthwhile technique in spite of the relative ease of evading detection. The proposed future work, also correlating users by e-mail metadata and content would also likely be similarly worthwhile because an inability to reuse domains or arbitrary distribute destinations across accounts would also make spamming effectively more expensive.
Not-A-Bot: Improving Service Availablity in the Face of Botnet Attacks
For services for which botnet abuse was a concern, applications would be able to ask the attester to sign some specified data with a statement indicating that keyboard or mouse activity had occurred at a time close to the requested action. The attester would avoid generating multiple attestations for the same and thereby effectively ratelimits the sensitive applications to the rate of real user activity on the machine. In the "normal" use case, the application would send these attestations to the service. The service would verify the attestation and adjust its behavior; for example, it could prioritize attested requests over non-attested requests.
Botnets attempting to create false attestations would be able to do so on non-idle machines of course, but the authors argue the rate at which they could do so effectively would be seriously limited. One danger is that attestations in this system are storable indefinitely — they contain a nonce (generated by the attestor to prevent trivial replays), but no timestamp unless one is added. The authors assume that users of attestations will encode the time in the signed material; for example, signing the date header of any sent e-mail and using URLs encoding a service-generated nonce or timestamp. Services must also maintain a nonce cache to prevent replay, which may be quite large in the case of high-latency services like e-mail. The authors predict that even for large botnets, the victim machines on which these botnets are running will not have user activity often enough to produce attested spam, DDoS, or click fraud traffic at nearly the rate they are acapable of today.
Wednesday, November 18, 2009
Cutting the Electric Bill for Internet-Scale System
The paper uses traces of both these electricity spot prices and Akamai's workload to evaluate the potential savings from moving computation where energy is cheaper. Some practical concerns with moving workload were modeled: keeping work close to users due to latency concerns, not dramatically increasing bandwidth costs, and the realities of not having a power-proportional datacenter. Under somewhat optimistic scenarios, electricity cost reductions of more than 30% seemed achievable with this type of optimization.
The simulation evaluation was very thorough and used a real workload (which are quite hard to come by). The real worklaod used here does not, however, sound like the type one would want to prove this system by: Akamai's customers are presumably paying in large part for Akamai's closeness to the end-users, but increased distance is just about the only compromise that cannot be avoided. Surprisingly, the US electricity market seems dynanmic enough that even allowing for modest compromises (e.g. less than 1000 miles) can provide substantial benefit.
Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems
The traces reveal that there is a lot of background noise on networks. Some of this traffic is broadcast at the Ethernet level but not actually logically directed to the end host (for example IPX traffic, routing backup traffic, ARP requests not for the end host's IP). Some is for protocols that require periodic maintainence traffic to maintain liveness, e.g. ARP (when querying the host), IGMP (if the host belongs to IP multicast groups), DHCP, TCP (for keepalives), SSDP. Such traffic was frequent enough in the author's traces that a proxy that would not handle that traffic would not be able to allow end hosts to sleep a substantial amount of time without disrupting service. Making such simple fill-in-the-blank replies for a handful of services and conservatively waking machines up for other services is all that would be required to allow machines to stay asleep for most their idle periods without sacrificing availability. Problematically, all it takes to disrupt this result is one misconfigured service (for example, the machine updating service the authors observed in the enterprise environment).
Scalable Application Layer Multicast
The multicast tree maintained in this protocol has a strictly layered structure. Within each layer, nodes are partitioned into ‘clusters’. Each member of a cluster keeps track of all the other members of its clusters and forwards messages it generates to all other members of the cluster. For forwarding messages in and out of the clusters, each cluster elects a ‘leader’. This leader is required to be a member of some cluster in the next higher layer to forward messages outside of the original cluster. To allow recovery from loss of the leader, each member of a cluster also tracks the membership of its leader's cluster in the next layer.
To add a node to this tree, the joining node determines its distance (presumably in terms of latency) to all nodes of the highest level of the cluster, then the cluster in the next layer lead by the best node in the highest layer and so on. When it finds an appropriate cluster in the lowest layer, it joins it. When the leader of a cluster later detects that it is larger than the size threshold, it is split (potentially causing this procedure to repeat at a higher layer). Similar delayed detection by cluster leaders is used to initiate the merging clusters.
To maintain closeness in the network, nodes are responsible for periodically probing the alternate cluster leaders (in the same clsuter as its current cluster leader) and switching clusters if it has chosen inappropriately (or, as was surprisingly not mentioned in the paper, if network conditions have changed.)
Tuesday, November 17, 2009
A Reliable Multicast Framework for Light-weight Sessions and Application-Level Framing
From this reception information, participants can decide when to send a negative acknowledgment (which the paper calls a "repair request"). To prevent many repair requests from being sent at once (due to the failure of a link relatively "early" in the multicast tree), senders wait a random amount of time proportional to their distance from the original sender before sending the repair request. If, before this timer expires, they see another repair request for the same block, they backoff their timer exponentially (and use the same strategy to resend a transmission if the timer expires). Any host which has the data (all that received it are assumed to buffer the data), can send a retransmission, but uses a similar timer mechanism to lower the liklihood of duplicate retransmissions.
To select the random backoff values, the authors propose a heuristic scheme using feedback from the numbher of seen repair requests and retransmissions and some manually sent clamps. The authors evaluated this scheme in simulations over random multicast trees with up to 5000 nodes.
Presumably because it was divised with all participants being "first-class" — equally able to send messages, not distinguished by the strucutre of the mutlicast tree, and able to leave or join the group at will — this scheme requires nodes to track and resend a disappointing amount of state. Each node needs to keep estimates of the round-trip-time to each other node in the multicast group. To make these estimates, each node must send periodic reception reports. Without a means to summarize these reports (besides switching "pages"), the size of each report is proportional to the number of senders. (The authors do propose "local recovery" to limit the scope of retransmissions and associated messages, but this scheme does not reduce the need Though this amount state and messages are probably manageable for a low-bandwidth stream with thousands of recipients, it does not sound like scaling this scheme to larger data sizes and more recipients will be comfortable.
Thursday, November 12, 2009
X-Trace: A Pervasive Network Tracing Framework
This dependency tracking is done by piggybacking information about the previous subtask in intercomponent request, including the “root’ task whether the new subtask is the next subtask or a subsubtask (which may be performed concurrently with other subsubtasks). When a component acts on a request, this metadata is attached to the log entries it creates, and these logs are aggregated using the task ID.
Propagating this metadata and aggregating the logs is relatively cheap, but not free (the authors observed a 15% decrease in throughput in an instrumented Apache). The main cost, however, is clearly in the modification of each component of the system. The system's pervasiveness requires its metadata hooks to infect all the components. Ideally, much of this could be automated, since most systems already ‘know’ which request they are acting on at any point, which presumably was what the authors were exploring in their modifications to libasync. Such knowledge does not easily let one automatically choose between ‘next’ versus ‘down’, and certainly won't easily detect the non-tree structures the authors argue (describing as future work) would be more appropriate for some systems.
After proposing tracing all the way down to the link layer, the authors sensibly proposed, as future work, limiting what is traced by a combination of choosing which layers to trace and which incoming requests to mark. Of course, we already have tools for diagnosing network difficulties at the network path level, so logically, it might be nice (and certainly require less switch modification) to use X-Trace to find the path in our maze of an overlay network and then compose in our existing tool. Unfortuantely, based on the figures in the paper, X-Trace does not really record failures in such straightforward a manner. What humans interpret as a (service-stopping) failure in their typical traces is an omitted node, apparently without a hint was to what that next-hop might be save (non-automated) comparision with the correct case.
Wednesday, November 11, 2009
NetFPGA: A Tool for Network Research and Education
The research need filled by NetFPGA is an obvious one: most deployed networks work in large part using specialized hardware. Networking researchers typically did not have the means to create or modify hardware routers, so new ideas were tested with software routers. When tests with software routers yielded subpar performance, as they often do, authors would often argue that next-generation hardware routers or switches would be able to provide the functionality easily and without any loss of performance. NetFPGA allows these researchers to stop the hand-waving.
Monday, November 9, 2009
A Policy-aware Switching Layer for Data Centers
Switches along the paths of packets handle traffic that needs to be routed through a middlebox by encapsulating the ethernet frame in another ethernet frame destined for the middlebox. The router adjacent to the middlebox chosen removes this encapsulation, allowing the middlebox to act as if it were on the actual path.
The primary challenges of implementing this scheme comes with handling changes. To ensure compliance during a policy change without being able to mark packets passing through middleboxes, the authors propose dividing a change from policy P1 to P2 into multiple steps. First, each type of middlebox (relevant to the policy change) is split into two types (say, old and new) of middleboxes. Then (after the new-type middleboxes have processed all old traffic), the new policy is installed replacing the old type with the new type everywhere. Finally, after the new policy has been installed on all switches (and the old type middleboxes have processed all pending traffic), all middleboxes of the old type are switched to middleboxes of the new type.
The primary disadvantage of this routing scheme is latency. Compared to the traditional approach of placing middleboxes directly in the natural path of the packets, this scheme imposes an extra approx. 0.3 ms (based on the authors' testbed) of latency. The authors argue that this difference is insignificant given the general low-latency of the environment, but, for data-center-internal traffic, this may double the latency, which is very likely to substantially affect performance.
Internet Indirection Infrastructure
The paper envisions that this addressing mechanism can be used to implement many commonly requested routing services:
- mobility — the mobile host simply maintains a trigger with a permanent address, arranging for it to point to its current real address;
- multiast — each subscribing host inserts a trigger for the multicast address (or, for large multicast groups, the hosts can arrange for a tree of triggers);
- anycast — endpoint encode their preferences in the inexactly matched portion of the address and ensure that exactly one trigger is the closest match;
- service composition — the end-host wishing to insert the service inserts a trigger listing the service, then itself; the service agrees to forward the message to the address it receives as metadata;
Other concerns about this system are ones of security. The most obvious issue is that the flexibility to implement multicast comes with the ability to send massive amounts of duplicitous traffic to a victim host. The authors propose sending challenges to the trigger's mapped address to assure that the address to which triggers map actually permits the mapping. For cases where the victim is the infrastructure itself, the authors propose assigning resources based on the end-host inserting the trigger and detecting loops using either test packets or a bloom filter to mark visited hosts.
The authors also address concerns of eavesdropping or hijacking communications using the triggers. The authors argued that random private triggers would prevent eavesdropping attacks because a malicious third-party would need to guess the private trigger. Similarly, the authors would prevent hijacking by having the public trigger merely forward to a (hopefully unguessable) private trigger required to remove the public.
I felt that these security concerns would be better solved by using a traditional cryptographic end-to-end solution for hijacking and eavesdropping concerns (perhaps by encoding public keys in the public triggers) and authentication for trigger-removal requests. The authors wish i3 to be a relatively configuration-free system distributed between multiple administrative domains, to achieve these goals, almost any party should be able to run an i3 node. Given this, it seems a more viable attack would be to collect triggers by running (perhaps many) i3 nodes; since the trigger's values will be distributed at random among the i3 nodes and forwarded to other i3 nodes for caching, an attacker should be able to get the values of many private triggers.
Wednesday, November 4, 2009
DNS Performance and the Effectiveness of Caching
The authors of the paper, however, were not mostly concerned with misconfiguration but with the effect of caching and TTL settings. The authors generally found that caching worked well for nameserver records, so most queries required few referrals (each of which is an extra round-trip to a server), substantially decreasing the latency of lookups. The authors found that the names requested followed a Zipf-like distribution and therefore most of the benefit of caching was obtained by caching relatively few names. They also found, based on the TCP trace, that because accesses to the same domain were highly clustered, short TTLs (popular for load balancing), did not cause substantial increases in the number of DNS queries or user-perceived latency.
Development of the Domain Name System
To achieve the distributed administrative control, DNS used an explicit hierarchy, which was to reflect the actual administrative hierarchy of the network. Each part of this hierarchy (a ‘zone’) is assigned to one or more DNS servers (each of which may be responsible for many distinct zones). Each DNS server returns a time-to-live (some number of seconds), permitting querying clients and servers to cache the records for an appropriate amount of time. The same caching policy applies to references to other DNS servers, allowing most queries to avoid loading the top of the hierarchy.
In spite of this caching strategy and centralized caches per-organization, the authors observed a relatively high query rate on the root servers of about seven queries per second (over all seven root servers; when only about twenty-five thousand hosts had names on the Internet), and overall poor performance due to wide-area network issues that would not be discovered in testing. Some of these issues were due to poor implementations. Some of the non-cached traffic was due to an initial lack of negative response caching.
A notable problem with DNS mentioned in the paper is one of misconfiguration. Administrators apparently did a very poor job of understanding or correctly setting TTLs, and the consequences of doing so were usually quite subtle.
With hindsight, it is clear that the paper's evaluation of DNS security was very pure. Though the authors mention the security concerns of resolvers blindly accepting records — albeit mostly from a misconfiguration scenario — they missed the chance to discover the poor use of additional answer sections that would cause major security problems later.
Monday, November 2, 2009
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Chord maps its keys and nodes to conceptual ring, and uses the distance in one direction to route messages. A node is directly responsible for the keys between its address on the ring and the previous node's address and uses the following nodes as replicas for those keys. To guarantee routability to some node that thinks it is responsible for a key, each node has pointers to their current notion of their next or previous node. To provide fast routing in the common case, each node also keeps a pointer to nodes approximately 2^k units further around the ring for all k less than the key size.
Given these data structures to maintain, the paper presents algorithms for handling initialization of the finger table with relatively few messages and for maintaining the finger tables and replicas in the presence of failures. The paper provides some theoretical evidence these algorithms maintain good performance and availability in the presence of failures. The simulation evaluation gives good evidence that the algorithms can, in fact, maintain network connectivity in the presence of high failure rates, though they don't seem to show that that replicas would actually mitigate the impact of these failures in practice.
Chord is a very nice paper, but DHTs as envisioned in Chord have been relatively rare. Many distributed systems (especially lookup services which are relatively undemanding) are not actually large or unstable or decentralized enough to benefit from this sort of detailed handling of churn and scale, yielding approaches like the single-hop DHT (e.g. Dynamo, SEATTLE) and the replicated master (e.g. GFS).
Looking Up Data in P2P Systems
From these choices, the authors focus on DHTs, which are all symmetric, structured solutions to the distributed lookup problem. The paper describes several routing schemes (and their corresponding distance functions) and concludes with several practical issues that were apparently not well-solved on DHTs when the paper was written: such as assessing the impact of churn, handling malicious nodes, and providing more powerful primitives than lookup-by-key.
Wednesday, October 28, 2009
Active network visions and reality: lessons from a capsule-based system
Most of the paper deals with countermeasures to security and scalability problems arising from running external code on routers. The authors use tight resource limits per-service in terms of node resource consumption and assume software isolation between services on the active nodes. But these restrictions are not enough to prevent abuses under their scheme: malicious users can write programs that ping-pong traffic within the network to generate hundreds of times the traffic of their edge link — and possibly much more if they are to be able to implement multicast, etc. in their network programs.
The authors solution (which they acknowledge is unsatisfying) is for services to be endorsed by a central authority. Even with this restriction, it is not clear how these services are to satisfy more local constraints: how, for example, could nodes homed against the commercial Internet and the Internet2 backbone enforce the restriction that the Internet2 backbone was only to be used for non-commercial traffic? Generally, the scheme takes away control from the middle of the network (especially given the difficulty of answering “policy questions” about arbitrary forwarding programs) while asking the middle to devote a lot of resources to supporting it.
Resilient Overlay Networks
The authors deployed their system across approximately 20 nodes on the Internet (mostly US and mostly educational) and measured the packet loss, latency, and throughput provided in practice over their overlay network. They found a dramatic reduction in large (greater than 30%) packet loss rates and a small but significant improvement in latency and throughput across all links. These improvements persisted even when they ignored Internet2 links (which, for policy reasons, may not be used in practice for commercial traffic). They noted that different paths were chosen in practice when optimizing for different metrics (latency, loss, throughput) and that these paths were frequently not the ‘normal’ Internet route.
A subtext of this paper is that BGP and IP do a bad job: they are insufficiently reactive and expressive to make good use of the available Internet routes subject to policy constraints. The headline result: avoiding most outages to which BGP takes many minutes to respond, certainly does seem like something that a hypothetical new BGP could avoid. Of course (as the authors argue) some of advantages come from things no BGP replacement could do: the amount of state maintained by RON as is certainly would not scale to hundreds of thousands of routers (probably not even to hundreds, as the authors leave this tricky issue for future work). Also, the authors note that it is to their advantage that RON only deals with a small amount of the traffic (relative to link capacity) and thus avoids inducing new failures when it reroutes traffic.
Monday, October 26, 2009
On Inferring Autonomous System Relationships in the Internet
- provider to customer — a provider exports any routes on these links and a customer only exports its and its own customers' routes
- peer to peer — only routes to customers (and the parties themselves) are exported
- sibling to sibling — both customer and peer and provider routes may be exported
- AS paths are valley free (that is, they contain zero or more customer-to-provider and sibling-to-sibling links followed by zero or one peer-to-peer links followed by zero or more provider-to-customer and sibling-to-sibling links)
- the highest degree AS in a path is the ‘top’ of the path (that is, all customer-to-provider links occur before it appears and all provider-to-customer links occur after it)
The authors ran their inference algorithm over BGP routing tables observed from 24 sites and compared their results to AT&T internal information about their own AS-AS relationship. They found that the inferences were mostly — approximately 99% — accurate for the AT&T. There were some notable errors; for example, they inferred that AT&T had a provider, characterized several customers (with ASs) as having no relationship to AT&T entirely and most of the sibling-to-sibling relationships they identified AT&T has having were considered other types of relationships by AT&T.
Based on a manual analysis, the authors blame their errors primarily on misconfigurations and inter-AS relationships not fitting into their simplistic model and the inaccuracy of their heuristic. Unfortunately, the authors do not attempt to quantify the relative effect of these issues and it is not apparent from their single example what sort of AS relationships could be added to make a substantially more complete model. The heuristics used are generally a bit disappointing: do the authors have good evidence that AS degree really determines the top of the path? is there a better way to ignore incorrectly advertised customer-to-provider paths (e.g. by noticing they typically won't ever be used).
On Power-Law Relationships of the Internet Toplogy
The authors demonstrate that these four graphs have what the authors call power laws: that is, relationships of the form y ≈ A*x^B for some A, B over some properties of parts of the graph x and y. The power laws they find empirically are:
- between the outdegree rank (position in list of nodes sorted by outdegree) of a node and outdegree of a node;
- between the frequency of an outdegree in the graph and the outdegree itself;
- between the number of pairs of node within a distance of each other and the distance;
- between the eigenvalues of the graph (presumably of its adjacency matrix though the authors appear to forget to specify this) and their ranks
As applications for these observations, the authors primarily propose using these observations to more realistic evaluate applications for the Internet. They observe that prior work which ignored these power laws made some poor estimates about properties of the internet graph: for example, they implicitly assumed a roughly uniform distribution of outdegrees and thus assumed that most nodes had much smaller 2 or 3 or 4-hop than actually occured. Toward the end of using these properties for evaluation, the authors try to extract properties that can be used directly to produce synthetic graphs for Internet protocol evaluation (for example, a division into a core/tree structure and a direct estimator of AS and router hop counts).
The authors also using these properties to produce test environments that will continue to reflect the Internet in the future. They suggest that these power laws will continue to exist over future topology graphs and so can be used to construct reasonable evaluations for future Internet protocols intended for a larger Internet.
Though the formulas for diameter, neighborhood size, etc. are good qualitative summaries of the Internet topology (and thus useful for protocol design, etc.), I find it hard to believe that one would really prefer to use them as is to evaluate protocols intended for the current topology when one could sample from an actual topology map.
It is disappointing that the authors did not have a very long time span of data to analyze. I found it strange that the authors did not attempt to collect a (inevitably partial) 1997 or 1998 map of the router-level topology of the Internet to allow for a several-year analysis of the evolution of the topology (which seems to be one of the major goals of the work). And, for the BGP views, the authors did not attempt to quantify how complete their graphs were (since some links may not be visible from some BGP peering points).
Monday, October 19, 2009
Interactive WiFi Connectivity For Moving Vehicles
The authors evaluated their protocol using a trace-driven simulation (based on loss rate/location traces measured from two real testbed) which the authors apparently validated with measurements on the real testbed. Using their measurements, they compared ViFi to optimal handoff between base stations (with no backups to relay) and to optimal use of relaying (where the best possible base station always did the relying) and showed that they achieved substantially better utilization than using a single base station and nearly the optimal utilization for the relaying strategy. They also tested the application performance of some transport-layer protocols (TCP, VoIP), which was much better with their scheme than simple handoff though they ignored some TCP anomalies (e.g. apparently indefinately stalled connections).
A disappointing artifact of the paper was that the evaluation was apparently performed with very few simulated users (though it is unclear what the DeiselNet trace contained). The authors did not evaluate fairness between users despite a dubious decision to rely on carrier-sense and not attempt to reimplement exponential backoff which using broadcasts had disabled.
White Space Networking with Wi-Fi like Connectivity
To assign channels, the authors measure the utilization of each 6MHz frequency range and the number of active access points on each. Based on this, they estimate the share of bandwidth each node can expect receive and send these measurements to the access point. The AP choses a channel by aggregating these measurements (after taking into account which channels are used by incumbents seen by any end host or the AP).
To allow devices to quickly switch channels after incumbents are detected, a backup channel to which the AP listens (using its scanner radio) and clients fall back to in quiet periods is used.
The authors evaluated their active channel detection, AP discovery, and channel switching speed on an (apparently two-node) prototype deployment. They evaluated their channel selection algorithm in simulation and showed that it appeared to perform close the optimum in the presence of simulated background traffic, spatial variation.
Disappointingly, the authors don't appear to evaluate the effects their channel switching will have on end users. On their only experiment on the real testbed with this issue (Fig. 14) shows that there appears to be some serious throughput penalty when channel switching must occur, and from their evaluation it is not clear what effect that would likely have on transport protocols.
Wednesday, October 14, 2009
XORs in The Air: Practical Wireless Network Coding
The authors evaluated COPE in building ad-hoc 802.11a testbed. With UDP flows between random pairs of hosts (with flow sizes corresponding to those from some Internet file sample), they observed approximately double effective throughput at high levels of network utilziation and very little improvement at low levels of network utilization.
With TCP flows in the same test bed, performance was considerably worse. The authors attributed this problem to TCP not recovering from unavoidable collisions losses due to hidden terminals that MAC layer retries often did not recovery from; they achieved an approximately 40% performance when they artificially eliminated hidden terminals (by placing hidden terminals within carrier-sense range of all other nodes but artificially restricting the offered paths) and sufficiently loaded the network. (The authors went to some extra effort to make TCP flows work well — for example, avoiding packet reordering though they did not evaluate the effect of this improvement.)
Though ordinarily random pair connection patterns seem like a good way to stress the network, it would seem that in this case, it's likely to make the coding scheme look better than it may be for many practical situations. It performs especially well when there are flows that cross each other on the network, but many common applications are asymmetric (e.g. web browsing-like patterns would likely have most traffic originating from a small number of gateways with relatively little traffic to code against in the opposite direction).
ExOR: Opportunistic Multi-Hop Routing for Wireless Networks
The authors evaluated ExOR on RoofNet comparing its efficiency to the RoofNet's routing with the ETX metric. The authors measured throughput compared to traditional routing (presumably with the ETX metric). They sent 10% more traffic when testing ExOR to compensate for the reduced delivery rate. Even accounting for this extra traffic, the authors found substantially better throughput than the ‘traditional’ routing for multi-hop paths and especially for the poorest routes under traditional routing.
Based on the evaluation, it appears that this protocol targets bulk transfers. The authors did not evaluate latency, and appear to expect users to run a transport protocol unlike TCP, since this technique will certainly introduce jitter and/or out-of-order delivery and/or losses. The authors implemented a web proxy, only using ExOR for 100KB or larger transfers, but unfortunately the paper does not evaluate that proxy's actual overhead.
Wednesday, October 7, 2009
A Performance Comparision of Multi-Hop Wireless Ad Hoc Networking Routing Protocols
- route construction time — on demand (AODV, DSR, TORA) or always (DSDV)
- loop avoidance mechanism — sequence number (DSDV, AODV), source routing (DSR), topological ordering + drop packets during link reversal (temporary routing loop) (TORA)
- route storage — next hop table (DSDV, AODV), topological ordering (“height”) (TORA), at source/on data packets (DSR)
The authors of this paper implemented DSDV (with two different route propagation policies), AODV (with explicit beacons and with on-send loss detection), DSR, TORA in a detailed simulation of a 802.11 network with RTS/CTS. They choose several simulations scenarios with varying degrees of mobility (both in speed and movement frequencies) randomly in the simulated space. They used several fixed-rate packet sources (between 10, 20, or 30 random pairs of hosts) with sufficiently small packet sizes to usually avoid congestion losses. Given these scenarios, they measured the delivery rate, routing overhead (in bytes and packets), and route length versus the optimal path.
They found that routing protocol efficiency greatly depended upon the degree of mobility: faster, more frequent movement resulted in substantially larger overheads for all but DSDV. Except for TORA at high packet rates (where it experienced a routing protocol congestion collapse) and DSDV with high mobility (where its periodic updates were not fast enough), performance was good overall: more than 95% of packets were delivered and most packets took the shortest route. In terms of routing overhead, results were not surprising (except for TORA, which seems to be clearly broken), demand routing protocols (except for TORA) did substantially better and DSR seemed to trade off increased byte usage for decreased packet usage.
It's good to see a careful simulation of these various routing protocols under enough circumstances that it should be difficult to claim that an advantageous situation has been missed. Of course, the downside of a big sensitivity analysis like this is that reading the paper, I can't really tell how the choice of protocol would matter for a realistic application of an ad-hoc network. Except for the obviously bad cases (e.g. most of TORA, DSDV with very high mobility), would these differences (in overheads, delivery rates) even have a substantial effect on application performance?
A High-Throughput Path Metric for Multi-Hop Wireless Routing
The authors evaluated their deployment with a 802.11b network deployed in offices in their lab. They measured the hop-by-hop loss rates and computing a baseline ‘best’ set of route throughputs (assuming that their model of interference along a multi-hop route based on loss rates was accurate). They compared these throughputs to actual throughputs obtained using (modified to avoid instabilities) DSDV and DSR with both the standard hop-count metric and with transmission count metric.
Not surprisingly, the authors observed improved performance with their metric. They identified some non-obvious sources of the improvement which could be used independently by DSDV and DSR: avoiding asymmetric routes (which can be done by ‘handshaking’ before considering a link usable) and usually avoiding routes that experience transmission errors (which DSR does when given link-layer feedback).
The authors also evaluated the accuracy of their link quality measurements. They found that significant bias was introduced by the packet sizes of their probes — larger packets were received significantly less often than smaller ones, but their metric assumed all packets were equal.
Monday, October 5, 2009
Architecture and Evaluation of an Unplanned 802.11b Mesh Network
The authors showed that this network did, in fact, benefit in both connectivity and bandwidth from making extensive use of multihop paths because using several shorter links provided much better throughput (for TCP connections) than the (typically longer) paths with a lower hop count. A vast majority of nodes in the network had more than two neighbors in the routed network. Though the link quality (in terms of achievable throughput) was correlated with physical distance, there were plenty of relatively short links with low quality and some highly beneficial long links with high quality.
The authors notably evaluated (802.11b) RTS/CTS versus in their environment. Even though they had observed self-interference when a flow spanned multiple hops, RTS/CTS did not improve throughput.
As a case for creating wireless mesh networks without planning, this paper's result are disappointing. In the multi-hop TCP test 10% could not communicate (despite the nodes of the network being physically stationary), multi-hop routes were much slower than single-hop routes (even considering the penalty of switching off), and adding nodes at random appeared (from simulations) to have seriously diminishing returns.
Modeling Wireless Links for Transport Protocols
- losses — most link-layer protocols use forward error correction with a limit on retries, so loss rates due to errors aren't very high (as often assumed) except in very poor radio conditions. More common, the paper claims, are bursty losses from handoffs.
- delay variation — contrary to the default of many simulators, delay jitter is frequently high due to error recovery and handovers and link scheduling (e.g. in cellular links) can cause delay spikes
- not duplex
- traffic models — consider multiple flows and flows including more than the wireless link
Wednesday, September 30, 2009
A Comparision of Mechanisms for Improving TCP Performance over Wireless Links
The authors compare several types of schemes for improving TCP performance: gateway solutions where the gateway resends missing segments and (in TCP-aware schemes) possibly suppresses spurious duplicate ACKs also; gateway solutions that split the TCP connection into two TCP connnections; and purely end-to-end solutions. The authors find that several proposed improvements do help with caveats:
- link-layer retransmission schemes help by themselves but work a lot better when they also suppress spurious duplicate acknowledgments, which lead to spurious retransmits and congestion window size reductions (TCP-aware link-layer schemes provided the most improvement of all the schemes tested in the paper);
- end-to-end selective acknowledgments make good use of much of the bandwidth of the link without gateway modifications, but do not perform as well as gateway modifications especially with a wide-area connection; and
- splitting the TCP stream at the gateway performs not as well as link-layer snooping because of limited buffer sizes at the gateway
Though the authors mentioned “very high interactive delays” as a motivating problem observed in TCP over wireless, they disappointingly did not attempt to measure it in this paper.
MACAW: A Media Access Protocol for Wireless LANs
The hidden terminal problem makes carrier sense inappropriate. Instead, the authors explicit Request-to-Send and Clear-to-Send messages to manage the channel. Additionally the authors add an explicit acknowledgment (to guard against giving the false congestion signals to TCP-like protocols) and an explicit Data-Sending message (so other nodes within range of the transmitter can avoid interfering with its reception of the CTS and ACK).
Nodes control their use of the network based on these messages. A CTS or DS message reserves the channel for long enough to send the corresponding data and its acknowledgment. Based on past apparent collisions and their (apparent) locations, nodes backoff the rate at which they send RTSs, separately estimating congestion at their location and at the receiver.
The authors evaluated the MAC protocol with simulations of a variety of network topologies, verifying that reasonably fairness between data streams was provided by the MAC protocol and good link utilization was achieved (even in the presence of hidden terminals, unresponsive nodes, etc.) Most of the simulations used loss-insensitive flows and assumed the reception success was dependent entirely on distance from the receiver and competing transmitters.
It is very nice that the authors evaluated many network topologies and these clearly evaluations where the author's improvements produce huge differences. However, the paper seems to lack evaluation of more realistic scenarios. Even though the authors ACK scheme is motivated by transport protocol performance, almost none of the authors simulations consider performance of a TCP-like transport protocol or even how intermittent noise affects their link-layer protocol alone. The authors also do not consider any metric besides stream throughput, such as how much jitter and latency this MAC protocol introduces into the network.
Monday, September 28, 2009
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
First, they evaluated this workload in an incast scenario; here, they had the sink pull 1MB data split evenly across a varying number of servers and measured the effective throughput (‘goodput’). Both simulation results and experiments showed a dramatic increase in useful throughput for small (approx. 1ms) minimum retransmission timeouts compared to values closer to the current default of 200ms. Based on RTT information from high-bandwidth (~10Gbps) datacenter networks and simulations, the authors further suggested that future networks will have substantially lower round-trip times, so minRTO should be eliminatined entirely. For such networks, the authors also reccommended adding random variation to the computed RTO to avoid accidental synchronization (from similar RTTs).
The authors also examined some possible negative impact drastically eliminating the minimum retransmission timeout would have. Two different sources of spurious retransmissions and delays are considered: from underestimates RTT due to a delay spike and from ACKs delayed for longer than the retransmission timeout. For both problems, the authors performed a comparision of Bit Torrrent throughputs observed with and without a very small minimum RTO. They found little difference in the per-flow throughput distribution with a low minimum retranmission timeout.
For delayed ACKs, the authors compared throughput in an 16-node fan-in situation with and without delayed ACKs. They found that enabling delayed ACKs (with a delayed ACK timer larger than the round-trip time) caused a substantially lower throughput with large fan-in.
The solution proposed in this paper is relatively easy to deploy, simple, and seems to work with few negative consequences. The authors suggest that the spurious timeout avoidance the minRTO was set for is no longer necessary now that recovery from spurious timeouts does not cause entry into slow-start. But do the delay spikes that motivated the original minRTO setting still exist? (Is better recovery from misestimation enough?)
Sunday, September 27, 2009
Understanding TCP Incast Throughput Collapse in Datacenter Networks
The authors produced a simple numerical model of the throughput collapse behavior assuming that it is determined by the number of retransmission timeout events and the interpacket wait time. From this model and refinements, the authors produce an explanation for the throughput collapse behavior: first, throughput collapses as the number of retransmission timeouts increases with the number of senders, then the width of period in which retransmission timeouts are triggered (per congestion event) increases, so recovery senders become desynchronized, but eventually the width of that distribution becomes so wide that senders about to experience a timeout interfere with senders just recovering from a timeout.
It is disappointing that the qualitative observations which would seem to explain the throughput collapse are not incorporated into the model, nor is an good empirical way to test for their presence beyond fitting a more complicated model attempted or suggested.
Monday, September 21, 2009
VL2
- Topology — the supported topologies are folded Clos networks. Similar to fat trees, these provide many paths between racks and, as in PortLand, the authors propose routing flows randomly over the paths. Both of the papers examine three-stage topologies, both of which would support approximately the same number of end-hosts based on the number of ports each switch had; though the fat tree topology examined would use considerably more switches. To achieve similar performance VL2 assumes that switch-to-switch bandwidths are substantially higher than end-host-to-switch bandwidths. The fat tree topology is also considerably more regular, so where PortLand relies on location-specifying MAC addresses and explicit failure notifications the VL2 scheme runs a link state protocol between switches (but not to export end-host addresses) while .
- Layer choice — VL2 uses location-specifying IP addresses; providing a mapping from location-agnostic (relocatable) IP addresses to location-specific ones analagous to PortLand's AMAC to PMAC mapping. Unlike PortLand, using IP addresses combines two mappings (AMAC to PMAC and PMAC to IP) that PortLand maintained into one mapping that VL2 contains (AA to LA). The disadvantage, of course, is that servers need extra support to know this mapping (doing IP in IP encapsulation) rather than it being backwards-compatible as using the ARP cache is.
- The directory servers — in this paper, they are only responsible for the AA to LA mappings (no failure matrix), and replication/distribution of the directory servers is actually implemented. Also, the directory servers are queried by end-hosts rather than by switches;
- Autoconfiguration story — None seems apparent in this paper.
- Hardware required — The system in this paper runs on commodity switching ASICs, relying mostly on existing switch features rather than custom switching software.
The authors use some traces (presumably from MS data centers) and show that flow size varies a lot, traffic patterns are not very predictable, and small failures are frequent. They use these traces to motivate their choice of random load balancing rather than assuming a smarter scheme exists. For their prototype evaluation they measure utilization and fairness under an all-to-all TCP scenario on their 75-node testbed (apparently with near-commodity hardware) and measure interference between two services (with and without large numbers of TCP flows (‘mice’) in one service).
Probably the best feature of this paper is the authors extensive reuse of existing protocol implementations (OSPF, ECMP, etc.) This means that their fabric is probably more readily deployable and more hardware-agnostic than PortLand. It is also a disadvantage: they are not taking advantage of the relatively predictable topology they demand for routing, and many more pieces of hardware (each end-host) needs to be reconfigured than under the PortLand scheme which is truly invisible to the end-hosts.
Saturday, September 19, 2009
PortLand
To make this task simpler (that is, with less per-switch state) the authors fix the datacenter's topology as a fat tree and designate a centralized ‘fabric manager’ (which must be replicated for fault tolerance, of course) which coordinates between switches.
As with SEATTLE, the switches in this scheme handle all ARP requests themselves in addition to managing routing.
Switch state is kept small by distributing ‘pseudo MAC’ addresses which encodes the position of the destination node in the topology; a host's actual MAC address is never distributed to other nodes. The fabric manager controls the IP to PMAC mapping, allowing switches to respond to ARP requests and generate gratuitous ARPs if a PMAC becomes invalid.
Forwarding is kept loop-free by assuming a rooted fat tree and by following the rule that a packet changes from going up toward the root to down toward a node exactly once. For load balancing, switches can choose freely between upstream ports (the authors suggest switching on flows to keep each flow single-path); for fault-tolerance, the fabric manager distributes failure information to affected switches who then update their routing tables accordingly.
Probably the most clever scheme here is switch autoconfiguration. Bottom level switches discover that they are connected to enough non-switches, and the level number of higher-level switches propagates up the fat-tree. Position number assignment is done in a greedy fashion, synchronized by a higher level switch.
Combined with failed link detection, this should make (re)configuration automatic except for the pesky issue of the fabric manager and communication with it.
The authors tested their system on a 16-host prototype, evaluating its response to failures, IP migration, and the overhead it imposed on the fabric manager. Failures resulted in approximately 100 milliseconds of disconnectivity (which was, of course, made worse by TCP timeouts), and for the VM migration scenario, the TCP flow stalls fails for periods of approx 200-600 ms. For the fabric manager, they evaluated the overhead of between 25 and 100 ARP misses/sec/host, asssumed scaling, and extrapolated the CPU and network bandwidth uses of the fabric manager, and found that the fabric manager would require a lot of CPU resources (70 cores for the 100 ARP/sec case with ~30k hosts).
These numbers are disappointing though the authors point out that these miss rates seem unrealistic and that the fabric manager could probably be distributed to a cluster. Since presumably the authors are pretty confident that ARP miss rates are considerably lower in a real data center, it is disappointing that they didn't try to measure or intelligently estimate the rate at which they would be expected to occur. The linear extrapolation is also disappointing — couldn't the authors have really run their fabric manager on a synthetic trace with these miss rates and host count and measured the CPU usage?
Wednesday, September 16, 2009
Detailed Diagnosis in Enterprise Networks
Since the system does not assume any domain knowledge about the measurements it receives, faults are detected based on measurements deviating from their usual values. Diagnosis uses historical correlations to determine whether an anomalous measurement bis likely influencing/influenced by another. To determine which components' measurements to compare, domain knowledge is used: ‘template’ mark the dependencies each component type has to other components (e.g. between a machine and all of its processes or between processes communicating over IP). Given this directed dependency graph and the likelihood of measurements influencing each other, the ‘best’ cause is determined based on the best geometric means of influence scores on paths from the alleged cause to components, weighted by the destination components' abnormality, combined with this geometric mean for the component being diagnosed.
The authors include an evaluation where they analyze whether their system can diagnose injected faults causes correctly. They compare their approach to a simpler approach and a hand-tuned aporoach. Both share the dependency graph and root cause ranking method. The simpler method only measures whether components are ‘abnormal’ and computes influencing ranks based solely on whether both components are abnormal; the hand-tuned method only measures influence between variables hand-marked as related. The authors find that their system gave the correct cause as the first ranked cause for almost all the injected faults, and did so substantially better than the simple approach, which was successful about half of the time. The hand-tuned approach did not a substantial improvement, showing that the authors had not lost much but avoiding domain-specific knowledge in that part.
One troubling aspect of this work is formulas that seem to be chosen mostly arbitrarily (e.g. using geometric mean and sqrt(c->e value * global weighted value) for cause ranking; the choice of 0.1 and 0.8 has a ‘default’ edge weight; etc.) Of course, the authors had to make some choice here, but from the results we cannot tell how sensitive correct diagnosis is to these choices — is there technique actually general enough that they mostly don't matter? (The authors did do a sensitivity analysis for one of these parameters — the length of history used for diagnosis.)
There's also the big step away from fully domain-agnostic diagnosis in requiring templates. These templates are fortunately applicable to many kinds of applications, which is why, based on their motivation, they felt it okay to use them but not such similar input about the measurements they are taking. But these templates do not look entirely trivially to make: the ones they show have many subtle dependencies, such as on machine firewalls and other traffic. Are they actually easy to infer, and given that the authors have gone to great effort to develop a method for ignoring irrelevant correlations, how useful are they?
Floodless in SEATTLE
The authors evaluated their using a trace driven simulation on several real network topologies and an evaluation with an actual implementation on a very small network.
From the simulations, they showed that the information maintained per router was actually smaller than both Ethernet bridging based approaches and that it achieved better routes (given similar cache sizes) than past identity-based routing proposals. From the actual implementation, they measured the per-packet overhead, performance when attempting to contact many non-existent hosts, and switch fail-over speed.
It's unfortunate that this paper did not include an explicit comparison to IP subnetting or VLANs, by which it is clearly motivated. It is understandable why, of course; their arguments against IP subnetting are based on ease of configuration which is very difficult to measure. For VLANs, only some of their complaints were about ease of configuration, but for the others it might be hard to infer a suitable configuration from their topology information.
This scheme is attractive for its careful attention to complete transparency for end-hosts, which should make it relatively easy to deploy. The big problem is that deploying it in the most straightforward way involves changing all switches over; it would seem, however, that there's probably some way of constructing a reasonable middle-ground where it is deployed gradually across the network.
Monday, September 14, 2009
Random Early Detection Gateways
The oddest omission from this paper to me was packet sizes: The algorithm and the simulations work in terms of whole packets apparently without considering the size of packets. This is particularly odd given that one might expect bursty interactive flows to have variable packet sizes.
This paper does remarkably simplify its problem by assuming that all flows are running a congestion control algorithm, as this scheme does not converge to fairness in the phase of loss-agnostic flows (which the authors did not test). The authors propose sampling dropped packet, showing that a window of dropped packets is likely to have close to the expected number (based on actual) of dropped packets from each flow; however, the authors don't evaluate this scheme beyond this theoretical observation.
Congestion Control for High Bandwidth-Delay Product Networks
The authors measured their scheme in simulation in number of circumstances. Their simulations showed that it output performed TCP with any of several queuing disciplines in terms of speed of convergence to near full utilization and fairness and behavior with competing flows. (One notable exception was their "shuffling" scheme to ensure convergence to fairness would cause notable loss of throughput when a router downstream of a bottleneck would try to adjust bandwidth upwards to provide it a better share.) Additionally, the authors' simulations showed that their new protocol could be made to not discriminate against TCP flows.
The biggest issue with this protocol is the difficulty of implementation. To implement this scheme, routers will need to track per-flow state, and to provide it with security, would need to occasionally “test” flows to ensure that they are not cheating. The authors do propose a scheme similar to Core-Stateless Fair Queuing to isolate the changes of XCP to an administrative domain, but besides reducing the security issue and the number of potential flows to track to a function of the number of edge routers, this scheme does not seem to reduce the per-router overhead. And though the scheme to multiplex TCP flows alongside XCP flows helps, even the core/edge strategy for XCP (which the authors did not simulate) appears ill-suited for incremental deployment.
From a security standpoint, I find it somewhat disturbing that the ‘check if the flow’ responds scheme relies upon the sender-provided RTT estimate. I guess, however, that reasonable capping of this RTT would still yield an improvement upon the response time of such ‘enforcement’ schemes on traditional TCP.
Wednesday, September 9, 2009
Core-Stateless Fair Queuing
The authors evaluate this mechanism over several simulated topologies with variously behaved competing flows. They show that the performance of CSFQ is often competitive to a full per-flow state implementation of fair queuing (Deficient Fair Queuing). Its aboslute performance is usually slightly worse than DFQ (with the exception of performing better in cases where flows may suffer from not being able to temporarily exceed their "fair" buffer to accommodate burstiness).
CSFQ has some parameters which seem to be set arbitrarily and some which are not clear how to set in general. Arbitrary parameters include thresholds used to adjust for buffering effects. Exponential smoothing parameters the authors recommend setting substantially larger than delay-jitter and shorter than the average duration of a flow. Since with things like web pages which are in fact short-lived flows, we probably have a relatively short average flow duration; how should these parameters be set when flows may traverse the globe (with presumably high delay jitter) before exiting the network or traverse only to the local metropolis's peering point (with hopefully very low delay jitter)?
Analysis and Simulation of a Fair Queueing Algorithm
The authors evaluated their fair queuing algorithm (in simulation) against first-come first-serve and DECbit (which gives explicit congestion notifcations to flows over their fair share) running against continuous ("FTP") and intermittent ("telnet") flow controlled streams (both traditional TCP-style and with explicit congestion notification) and against misbehaved flows. In contrast to the other algorithms, their fair queuing algorithm provides well-behaved flows good performance in the presence of an ill-behaved flow and severely punishes the ill-behaved flow (which they assume to be loss-sensitive). Even for the normal cases, they show that they provide lower latency to the intermittent flow (rewarding it for using less than its fair share of bandwidth) than the DECbit scheme.
The incidental punishment of ill-behaved flows is strange; as the authors note, it is a result of charging flows for packets that cannot be queued, but it would be nicer to make the degree of punishment more explicit. (Since, surely, well-behaved hosts will send somewhat above their fair share while adjusting or due to network perturbation, and should necessarily lose throughput for it.)
Sunday, September 6, 2009
Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks
Primarily evaluated are algorithms where at each time t, the new load is a(signal) + b(signal)*(old load) where signal is the single-bit indicator of whether the network was over the desired load at time t -1 . The appropriate values of a, b for each case are analyzed in light of convergence to both the desired level of load and the desired fairness. The main result is that additive increase/multiplicative decrease (i.e. a(underloaded) > 0, a(overloaded) = 0, b(underloaded) = 0, 0 < b(overloaded) < 1) is are the types of a and b functions which achieves convergence on both criteria. With respect to the rate of convergence, there is a tradeoff between quick convergence and low levels of oscillation around the goal load.
For me, the most interesting observation of the paper are the idealized throughput and response-time curves. TCP-style congestion control has its boolean signal of a sort (a dropped packet) — but where on that throughput/response time curve does it fall?
Congestion Avoidance and Control
- reaching the equilibirum
- slow start — ramp up sending speed slowly so the correct sending speed won't be overshot. Prior systems would not recover from a mismatch in their local network rate and the least capable link to their end host because they would continuous overshot.
- linear window size increase — be conservative about increasing the amount of data in flight to avoid massively overdriving the network, from which it is hard to recover.
- don't retransmit until packets is out of the network
- RTT variance estimate — because of queues, etc. round-trip time varies in the network a lot. This must be taken into account, which the authors do by creating an estimator of RTT variance.
- backoff after retransmission enough to avoid more losses
- exponential backoff — motivated by some theoretical results (about what sort of backoff is required for stability) and a desire to be as general as possible with what network topologies they support, the authors choose exponential backoff.
Despite the authors attempt at generality in these algorithms, since as they note there were TCP implementations for systems ranging from 56Kbit/s to hundreds of Mbit/s, there are some interesting assumptions about the environment. The most glaring one is that losses due to damage are are rare, which seems a strange assumption when one is considering microwave links, etc. It is hard to see, though, how this sort of congestion control (only at the ends and where loss types are not distinguishable) could function very well without assuming that at least most losses are due to congestion.
Wednesday, September 2, 2009
Understanding BGP Misconfiguration
Perhaps most remarkable about this paper is that most of these problems do not adversely affect connectivity. In this sense, BGP appears to have partially achieved the old goal of survivability, if only by accident. (Indeed, it seems to perform better than the Registries' databases at providing information about how to reach networks.) The most serious effect seems to come from the added load these background misconfigurations place on the system and the more serious possibilities they reflect. They are reminders that configuration is not very automated or well understood, so future global network disrupting events are still seem quite possible.
Interdomain Internet Routing
Much of the interesting part of inter-network routing, how these routing decisions are made and which offers are made or withdrawn, are not part of BGP. Typically, these policies are determined by the economic relationship between the two networks. A provider wants to maximize utilization of links that make it money, so it has an incentive to advertise any available routes using such links. When deciding between multiple routes to the same destination, a provider will similarly be motivated to configure their software to prefer routes using links for which they are paid, and may then decide based on traditional “technical” criteria like path length. (Of course, explicit contractual arrangements also influence these decisions.)
BGP does seem to have been designed for a smaller, friendlier Internet. The state size is and its changes are not trivial and likely to get worse as more people would sensibly like to have an internet or a backup internet connection. As administrated, there is a lot of trust, revealed through the poor filtering illustrated by hijacked blocks and then general lack of route authentication in the protocol. And, despite its flexibility for route announcers to choose their policy, some policies seem hard to enforce technically, such as preferences for peers to use their own networks for cross-country/sea transit. (Presumably, this could be prevented by only advertising the routes through the preferred entry point except that this would break reachability during glitches on the peer's network, so it is easier to contractually require honoring MED when needed?)
Monday, August 31, 2009
The Design Philosophy of the DARPA Internet Protocols
This article gives a design rationale for the DARPA Internet protocols. The primary motivation derives from connecting a variety, separately administered networks and supporting a variety of applications on top of them. The targeted applications tended to be non-jitter-sensitive ones with varying bandwidth requirements, so packet switching was chosen over circuit-switching. Another important goal was tolerance of failures; fault-tolerance combined with separate adminstration and the least common denominator of a variety of networks lead to the datagram model: Nothing was to be assumed of the intermediate hosts in a connection, so all the state of the network would be provided at the ends of the connection. The IP layer, then, provides exactly this service, on top of which end-user services would be built with modifications only at the end points (including the initial example of TCP).
Probably the biggest tradeoff mentioned in the article was one of end-host complexity. By relying on such a minimalist network stack and tolerating fail-stop failures of the network between two end hosts, implementing the protocols on the end-hosts is not easy (compared to, e.g., network architectures that explicitly signal failures or flow control limits).
Somehow, despite placing high importance on resilience from explicit attack, the design considerations mentioned fail to account for the presence of an adversary on the network who can insert non-fail-stop failures. The article recognizes this is a lost of robustness:
In this respect, the goal of robustness, which lead to the method of fate-sharing, which led to host-resident algorithms, contributes to a loss of robustness if the host misbehaves.but fails to mention the apparent design assumption that the separately administered hosts are reasonably well-behaved.
It is interesting that the article describes the lack of router state-carrying as a robustness feature. At our current scale, it seems more like a scalability feature. Though the article talks about performance considerations (particularly the difficult of specifying them to defense contractors) and mentions network-performance-sensitive activities like teleconferencing, performance generally and scalability in particular did not appear to be major goals. (The closest among the enumerated goals is “[permitting] host attachment with a low level of effort”, which was observed to not be well-achieved because of the protocol reimplementation effort.)
The End-to-End Argument
Detecting failures at every step is too much work! Just try the whole thing and see if it worked.Justifications:
- duplicated work
- Because there are possible failures at multiple levels which look the same ("duplicate" message, corrupted data, requests that were not processed, etc.), checks should happen at higher layers anyways.
- match the goal
- By checking for the high-level gaurentees explicitly, we may detect failures in components we wouldn't have thought to verify “for free” (e.g. router buffers in the network).
- more moving parts
- More integrity checking (e.g. hop-by-hop) means more code to find bugs in.
- avoid excess functionality
- Not every high-level task needs feature X (e.g. duplicate suppression), so always providing it at the lower level will be wasted effort.
We love abstraction; we just give our problem to the next layer down and never think about it again… The end-to-end argument derives from the experience that this strategy fails because the lower layers do not have knowledge of the overall application: and in practice, the feature is — or should have been — reimplemented in a higher layer.
The examples in the paper, then, are failures of abstraction. In some cases (the file transfer relying on the transport protocol for integrity, relying on the communication subsystem for encryption), it may really make sense to move the action to the application layer. In other cases, there is still hope for modularity: it's just that the lower layer is providing the wrong interface. Ack'ing after a message is processed, considering message duplicates based on some part of their content, ordering messages based upon vector clocks instead of a streams order, etc. — convenient abstractions exist for these.
One should not be surprised that a layer that solves a weaker problem is not, in fact, useful for solving the stronger problem. The more compelling cases come when there is no substitute layer without substantial application intrusion that would solve the real problem, as in the end-to-end integrity/fault tolerance cases.