Monday, November 23, 2009

BotGraph: Large Scale Spamming Botnet Detection

The paper describes a log analysis technique intended to detect botnet-generated e-mail spamming through a commercial e-mail service. It assumes that botnets can be detected either while they are generating accounts, because they do so at an unusually high rate, or based on unusual relationships between the multiple sending accounts a botnet maintains.

To detect the unusually high rate of account generation, the authors propose tracking the exponentially weighted moving average (EWMA) of the signup rate at each IP. Because legitimate signup rates are typically very low, the authors argue that they can easily detect mass signups. It does seem that a clever botnet could ramp up its account generation rate slowly to evade these long-term averaging techniques (the bots would try to appear to be a popular proxy, etc.). The authors argue that such techniques would be ineffective due to the natural churn of the botnet's compromised hosts. But at least in office environments many desktop machines are certainly not turned off routinely, and legitimate signup patterns can be expected to show the same sorts of diurnal and weekly variations that machines turning on and off do, so I do not find this argument convincing.
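
A minimal sketch of what the per-IP EWMA tracking might look like; the smoothing factor, ratio threshold, and absolute floor below are illustrative placeholders, not values from the paper:

```python
# Sketch of per-IP signup-rate anomaly detection with an exponentially
# weighted moving average (EWMA). All constants here are illustrative.

class SignupRateDetector:
    def __init__(self, alpha=0.1, ratio=20.0, floor=10):
        self.alpha = alpha      # EWMA smoothing factor
        self.ratio = ratio      # flag if today's count exceeds the EWMA by this factor
        self.floor = floor      # ...and also exceeds this absolute minimum
        self.ewma = {}          # IP -> smoothed signups per day

    def observe(self, ip, signups_today):
        """Update the average for `ip`; return True if today looks anomalous."""
        prev = self.ewma.get(ip, 0.0)
        anomalous = signups_today > max(self.ratio * prev, self.floor)
        # Update after the check so a burst does not immediately inflate
        # the baseline it is being compared against.
        self.ewma[ip] = self.alpha * signups_today + (1 - self.alpha) * prev
        return anomalous

detector = SignupRateDetector()
for day, count in enumerate([1, 0, 2, 1, 150]):
    if detector.observe("203.0.113.7", count):
        print(f"day {day}: suspicious signup burst ({count} accounts)")
```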

Most of this paper is devoted to defending against attacks which do not use aggressive creation of new users. The authors use the correlation between accounts and the hosts which use them. The botnets the authors observe apparently load-balance these accounts across many of their hosts over time and use many accounts from any particular compromised host. The authors therefore relate any two users by the number of ASs from which both have logged in. Using these counts as edge weights and choosing a threshold induces a graph on the users which can be used to cluster them into candidate bot groups. These groups can be analyzed collectively for spammy behavior (because effective spamming requires the accounts as a whole to have a high send rate per account, given the effectiveness of signup rate limiting).
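
Roughly, the grouping step might be sketched as below, assuming login records of (user, AS) pairs; the edge threshold and the use of plain connected components (rather than the paper's recursive clustering) are simplifications for illustration:

```python
# Sketch of the user-user graph: edge weight = number of ASs from which both
# users logged in; users are then grouped by connected components after
# thresholding. The input format and threshold are assumptions.
from collections import defaultdict
from itertools import combinations

def bot_groups(logins, threshold=2):
    """logins: iterable of (user, asn) pairs. Returns candidate bot groups."""
    users_by_as = defaultdict(set)
    for user, asn in logins:
        users_by_as[asn].add(user)

    weight = defaultdict(int)                 # (u, v) -> shared-AS count
    for members in users_by_as.values():
        for u, v in combinations(sorted(members), 2):
            weight[(u, v)] += 1

    adj = defaultdict(set)                    # keep only edges above the threshold
    for (u, v), w in weight.items():
        if w >= threshold:
            adj[u].add(v)
            adj[v].add(u)

    seen, groups = set(), []
    for start in adj:                         # connected components
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        groups.append(comp)
    return groups
```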

An obvious attack on this scheme, mentioned in the discussion of this paper, is for botnets to not reuse accounts between ASs. The authors argue for a countermeasure of grouping by IPs or some other smaller granularity to continue to catch those botnets. Clearly, as the authors argue, spammers have something to lose when they effectively cannot reuse their accounts between arbitrary pairs of machines, which makes this a worthwhile technique in spite of the relative ease of evading detection. The proposed future work of also correlating users by e-mail metadata and content would likely be similarly worthwhile, because an inability to reuse domains or to distribute destinations arbitrarily across accounts would also make spamming effectively more expensive.

Not-A-Bot: Improving Service Availability in the Face of Botnet Attacks

This paper proposes distinguishing between human-initiated requests and machine-initiated requests by having a trusted attester generate signatures tying an application-specified action to recent input activity. The attester is isolated from the host operating system by a trusted virtual machine. Integrity of the attester (and, assuming the VM's correctness, the confidentiality of its signing key) and the virtual machine is assured by a TPM (Trusted Platform Module).

For services for which botnet abuse was a concern, applications would be able to ask the attester to sign some specified data with a statement indicating that keyboard or mouse activity had occurred at a time close to the requested action. The attester avoids generating multiple attestations for the same input activity, thereby effectively rate-limiting the sensitive applications to the rate of real user activity on the machine. In the "normal" use case, the application would send these attestations to the service. The service would verify the attestation and adjust its behavior; for example, it could prioritize attested requests over non-attested requests.

Botnets attempting to create false attestations would be able to do so on non-idle machines, of course, but the authors argue the rate at which they could do so effectively would be seriously limited. One danger is that attestations in this system are storable indefinitely: they contain a nonce (generated by the attester to prevent trivial replays), but no timestamp unless one is added. The authors assume that users of attestations will encode the time in the signed material, for example by signing the date header of any sent e-mail and by using URLs encoding a service-generated nonce or timestamp. Services must also maintain a nonce cache to prevent replay, which may be quite large in the case of high-latency services like e-mail. The authors predict that even for large botnets, the victim machines on which these botnets are running will not have user activity often enough to produce attested spam, DDoS, or click fraud traffic at nearly the rate they are capable of today.
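
A rough sketch of the service-side checks described here (signature, freshness, and a nonce cache); the field names, the HMAC stand-in for the TPM-rooted signature, and the freshness window are all assumptions for illustration:

```python
# Sketch of service-side attestation checking: verify the signature, reject
# stale timestamps, and remember nonces to prevent replay. HMAC stands in
# for the TPM-rooted signature; names and constants are illustrative.
import hmac, hashlib, time

class AttestationVerifier:
    def __init__(self, attester_key, freshness=300):
        self.key = attester_key             # shared verification key (bytes)
        self.freshness = freshness          # seconds an attestation stays valid
        self.seen_nonces = {}               # nonce -> expiry time

    def verify(self, payload, nonce, timestamp, signature, now=None):
        now = now if now is not None else time.time()
        msg = f"{payload}|{nonce}|{timestamp}".encode()
        expected = hmac.new(self.key, msg, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, signature):
            return False                    # bad signature
        if now - timestamp > self.freshness:
            return False                    # too old to accept
        if nonce in self.seen_nonces:
            return False                    # replayed attestation
        # Expire old nonces so the cache only needs to cover the freshness window.
        self.seen_nonces = {n: t for n, t in self.seen_nonces.items() if t > now}
        self.seen_nonces[nonce] = now + self.freshness
        return True
```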

Wednesday, November 18, 2009

Cutting the Electric Bill for Internet-Scale Systems

Based on the observation that energy prices differ around the world, and in different ways at different times, this paper proposes redirecting traffic to utilize energy at the cheapest (sufficiently close) site. The price differences the paper speaks of exploiting are not just diurnal and seasonal ones, but hourly ones from the electricity spot market, which show substantially larger fluctuations that cannot be planned for even hours in advance.

The paper uses traces of both these electricity spot prices and Akamai's workload to evaluate the potential savings from moving computation where energy is cheaper. Some practical concerns with moving workload were modeled: keeping work close to users due to latency concerns, not dramatically increasing bandwidth costs, and the realities of not having a power-proportional datacenter. Under somewhat optimistic scenarios, electricity cost reductions of more than 30% seemed achievable with this type of optimization.
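
The core placement decision could be sketched roughly as follows; the site representation, distance budget, and fallback rule are assumptions, and the paper's actual formulation also models bandwidth costs and the lack of power proportionality:

```python
# Sketch of price-aware request routing: among sites within a latency
# (distance) budget of the client, pick the one with the lowest current
# spot price; otherwise fall back to the nearest site. Illustrative only.
import math

def distance_km(a, b):
    """Great-circle distance between (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2 +
         math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def choose_site(client_location, sites, max_distance_km=1000):
    """sites: list of dicts with 'name', 'location' (lat, lon), 'price_per_mwh'."""
    nearby = [s for s in sites
              if distance_km(client_location, s["location"]) <= max_distance_km]
    if not nearby:      # nobody close enough: just use the nearest site
        return min(sites, key=lambda s: distance_km(client_location, s["location"]))
    return min(nearby, key=lambda s: s["price_per_mwh"])
```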

The simulation evaluation was very thorough and used a real workload (which is quite hard to come by). The real workload used here does not, however, sound like the type one would want to prove this system with: Akamai's customers are presumably paying in large part for Akamai's closeness to the end users, and increased distance is just about the only compromise that this scheme cannot avoid. Surprisingly, the US electricity market seems dynamic enough that even allowing only modest compromises (e.g. less than 1000 miles) can provide substantial benefit.

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

This paper evaluates the viability of putting computers to sleep while preventing disruption from the machine not being on the network. Current platforms have ‘wake-on-LAN’ functionality, enabling machines to be woken up from a low-power state by a ‘magic packet’. A simple proxy could examine traffic destined to the machine and inject the magic packet before traffic to which the machine needs to respond. A more sophisticated version of this proxy could respond on the machine's behalf for simple but important queries. Using network traces for desktop machines in both home and work environments, this paper evaluates the likely power savings achieved with various levels of proxy sophistication.
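
For reference, waking a sleeping machine is simple: a 'magic packet' is just six 0xFF bytes followed by the target MAC address repeated sixteen times, usually sent as a UDP broadcast. A minimal sender (the MAC and broadcast address below are placeholders):

```python
# Minimal wake-on-LAN sender. The payload is six 0xFF bytes followed by the
# target MAC repeated sixteen times, broadcast as UDP (commonly port 9).
import socket

def send_magic_packet(mac="00:11:22:33:44:55", broadcast="255.255.255.255", port=9):
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    payload = b"\xff" * 6 + mac_bytes * 16
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, (broadcast, port))
```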

The traces reveal that there is a lot of background noise on networks. Some of this traffic is broadcast at the Ethernet level but not actually logically directed to the end host (for example IPX traffic, routing backup traffic, ARP requests not for the end host's IP). Some is for protocols that require periodic maintenance traffic to maintain liveness, e.g. ARP (when querying the host), IGMP (if the host belongs to IP multicast groups), DHCP, TCP (for keepalives), and SSDP. Such traffic was frequent enough in the authors' traces that a proxy that did not handle it would not allow end hosts to sleep a substantial amount of time without disrupting service. Still, making simple fill-in-the-blank replies for a handful of services and conservatively waking machines up for the others is all that would be required to let machines stay asleep for most of their idle periods without sacrificing availability. Problematically, all it takes to disrupt this result is one misconfigured service (for example, the machine-updating service the authors observed in the enterprise environment).

Scalable Application Layer Multicast

In contrast to the '95 paper, this paper, written in the days of a considerably larger Internet, is very concerned about limiting the state stored on each node as the network grows. Probably also motivated by the realities of the Internet, this paper does not assume the infrastructure (the routers) will provide multicast services. Instead, end nodes are drafted into service as internal forwarding nodes in the multicast tree. The paper is devoted primarily to selecting these internal nodes and their neighbors, even in the face of failures and newly joining nodes.

The multicast tree maintained in this protocol has a strictly layered structure. Within each layer, nodes are partitioned into ‘clusters’. Each member of a cluster keeps track of all the other members of its cluster and forwards messages it generates to all other members of the cluster. For forwarding messages in and out of the clusters, each cluster elects a ‘leader’. This leader is required to be a member of some cluster in the next higher layer to forward messages outside of the original cluster. To allow recovery from loss of the leader, each member of a cluster also tracks the membership of its leader's cluster in the next layer.

To add a node to this tree, the joining node determines its distance (presumably in terms of latency) to all nodes in the highest layer, then to the members of the cluster in the next layer down led by the best of those nodes, and so on. When it finds an appropriate cluster in the lowest layer, it joins it. When the leader of a cluster later detects that the cluster has grown beyond the size threshold, the cluster is split (potentially causing this procedure to repeat at a higher layer). Similar delayed detection by cluster leaders is used to initiate the merging of clusters.

To maintain closeness in the network, nodes are responsible for periodically probing the alternate cluster leaders (in the same cluster as their current cluster leader) and switching clusters if they have chosen inappropriately (or, as was surprisingly not mentioned in the paper, if network conditions have changed).

Tuesday, November 17, 2009

A Reliable Multicast Framework for Light-weight Sessions and Application-Level Framing

This paper describes a technique for providing reliable delivery on top of an unreliable multicast. The principal problem dealt with in the paper is avoiding the explosion of messages that would result from adapting TCP-like loss recovery to this many-receiver world. The first change necessary is switching to negative acknowledgments. To make sure receivers are aware of which messages have been sent, periodic reports of the extent of messages (expressed as sequence numbers for each sender) are multicast to the group by each participant. To prevent the size of each of these reception reports from exploding, a reception report only refers to a particular application-level entity being operated upon (a whiteboard "page" in the application in the paper).

From this reception information, participants can decide when to send a negative acknowledgment (which the paper calls a "repair request"). To prevent many repair requests from being sent at once (due to the failure of a link relatively "early" in the multicast tree), receivers wait a random amount of time proportional to their distance from the original sender before sending the repair request. If, before this timer expires, they see another repair request for the same block, they back off their timer exponentially (and use the same strategy to resend the request if the timer expires). Any host which has the data (all that received it are assumed to buffer the data) can send a retransmission, but uses a similar timer mechanism to lower the likelihood of duplicate retransmissions.
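
A sketch of the request-timer logic as I understand it: the initial delay is drawn uniformly from an interval scaled by the estimated distance (RTT) to the original sender, and overhearing someone else's request for the same data doubles the interval and re-draws. The constants are illustrative, not the paper's tuned values:

```python
# Sketch of an SRM-style repair-request timer with distance-proportional
# random delay and exponential backoff on duplicate suppression.
import random

class RepairTimer:
    def __init__(self, rtt_to_sender, c1=1.0, c2=1.0):
        self.rtt = rtt_to_sender
        self.c1, self.c2 = c1, c2
        self.backoff = 1
        self.deadline = self._draw()

    def _draw(self):
        # Uniform in [c1*d, (c1+c2)*d], stretched by the current backoff level.
        return self.backoff * random.uniform(self.c1 * self.rtt,
                                             (self.c1 + self.c2) * self.rtt)

    def heard_duplicate_request(self):
        """Someone else asked for the same data first: back off and re-draw."""
        self.backoff *= 2
        self.deadline = self._draw()

    def should_send(self, elapsed_since_loss_detected):
        return elapsed_since_loss_detected >= self.deadline
```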

To select the random backoff values, the authors propose a heuristic scheme using feedback from the number of observed repair requests and retransmissions along with some manually set clamps. The authors evaluated this scheme in simulations over random multicast trees with up to 5000 nodes.

Presumably because it was devised with all participants being "first-class" (equally able to send messages, not distinguished by the structure of the multicast tree, and able to leave or join the group at will), this scheme requires nodes to track and send a disappointing amount of state. Each node needs to keep estimates of the round-trip time to each other node in the multicast group. To make these estimates, each node must send periodic reception reports. Without a means to summarize these reports (besides switching "pages"), the size of each report is proportional to the number of senders. (The authors do propose "local recovery" to limit the scope of retransmissions and associated messages, but this does not reduce the need for the global reception reports.) Though this amount of state and messaging is probably manageable for a low-bandwidth stream with thousands of recipients, it does not sound like scaling this scheme to larger data sizes and more recipients will be comfortable.

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Applications that run on networks inherently rely on the correct functioning of many components. Understanding failures or performance problems in such systems requires understanding how a logical task is performed across these components. X-Trace is a system that tracks each task's intercomponent dependencies and thereby allows the components to produce logs that can be recombined to show the logical execution of a task.

This dependency tracking is done by piggybacking information about the previous subtask on each intercomponent request, including the ‘root’ task ID and whether the new subtask is the next subtask or a subsubtask (which may be performed concurrently with other subsubtasks). When a component acts on a request, this metadata is attached to the log entries it creates, and these logs are aggregated using the task ID.
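
In rough pseudocode-as-Python, the piggybacked metadata might look like the following; the field names and ID format are illustrative, not X-Trace's actual wire format:

```python
# Sketch of per-request metadata: a shared task ID plus the ID of the
# operation that caused this one, tagged as either a 'next' step (same
# layer) or a 'down' step (lower layer). Names are illustrative only.
import uuid
from dataclasses import dataclass

@dataclass
class TraceMetadata:
    task_id: str       # identifies the whole logical task
    op_id: str         # identifies this particular operation
    parent_op: str     # operation that caused this one
    edge_type: str     # 'next' (same layer) or 'down' (sub-layer)

    @classmethod
    def new_task(cls):
        return cls(task_id=uuid.uuid4().hex[:8], op_id=uuid.uuid4().hex[:8],
                   parent_op="", edge_type="next")

    def child(self, edge_type):
        """Metadata to piggyback on a request this operation issues."""
        return TraceMetadata(task_id=self.task_id,
                             op_id=uuid.uuid4().hex[:8],
                             parent_op=self.op_id,
                             edge_type=edge_type)

# A component logs with the metadata it received, then propagates child
# metadata on the requests it issues in turn.
md = TraceMetadata.new_task()
dns_md = md.child("down")       # e.g. an HTTP request causing a DNS lookup
next_md = md.child("next")      # e.g. the next hop at the same layer
```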

Propagating this metadata and aggregating the logs is relatively cheap, but not free (the authors observed a 15% decrease in throughput in an instrumented Apache). The main cost, however, is clearly in the modification of each component of the system. The system's pervasiveness requires its metadata hooks to infect all the components. Ideally, much of this could be automated, since most systems already ‘know’ which request they are acting on at any point, which presumably was what the authors were exploring in their modifications to libasync. Such knowledge does not easily let one automatically choose between ‘next’ and ‘down’, and it certainly won't easily detect the non-tree structures the authors argue (describing this as future work) would be more appropriate for some systems.

After proposing tracing all the way down to the link layer, the authors sensibly proposed, as future work, limiting what is traced by a combination of choosing which layers to trace and which incoming requests to mark. Of course, we already have tools for diagnosing network difficulties at the network-path level, so logically it might be nice (and certainly require less switch modification) to use X-Trace to find the path in our maze of an overlay network and then compose it with our existing tools. Unfortunately, based on the figures in the paper, X-Trace does not really record failures in so straightforward a manner. What humans interpret as a (service-stopping) failure in their typical traces is an omitted node, apparently without a hint as to what that next hop might have been, save (non-automated) comparison with the correct case.

Wednesday, November 11, 2009

NetFPGA: A Tool for Network Research and Education

This paper describes a programmable hardware platform designed to allow rapid prototyping of new hardware switch and router designs. This hardware is built up from components commonly used for research and education outside of the networking space; the second version described in the paper uses a standard PCI bus for control and provides four Gigabit Ethernet ports, so it is practical for researchers to deploy and can handle convincingly large amounts of load.

The research need filled by NetFPGA is an obvious one: most deployed networks work in large part using specialized hardware. Networking researchers typically did not have the means to create or modify hardware routers, so new ideas were tested with software routers. When tests with software routers yielded subpar performance, as they often do, authors would often argue that next-generation hardware routers or switches would be able to provide the functionality easily and without any loss of performance. NetFPGA allows these researchers to stop the hand-waving.

Monday, November 9, 2009

A Policy-aware Switching Layer for Data Centers

This paper proposes a switching architecture to ensure that some traffic is always routed through certain middleboxes. Given some collection of middleboxes of various types, the administrator specifies that the traffic matching some (source interface, source address/port, destination address/port) pattern must be forwarded through a certain type of middlebox (or routed to its final destination). Chains of middleboxes are specified by using a prior middlebox as the source interface type.
Switches along the paths of packets handle traffic that needs to be routed through a middlebox by encapsulating the Ethernet frame in another Ethernet frame destined for the middlebox. The switch adjacent to the chosen middlebox removes this encapsulation, allowing the middlebox to act as if it were on the actual path.

The primary challenges of implementing this scheme come with handling changes. To ensure compliance during a policy change without being able to mark packets passing through middleboxes, the authors propose dividing a change from policy P1 to P2 into multiple steps. First, each type of middlebox (relevant to the policy change) is split into two types (say, old and new) of middleboxes. Then (after the new-type middleboxes have processed all old traffic), the new policy is installed, replacing the old type with the new type everywhere. Finally, after the new policy has been installed on all switches (and the old-type middleboxes have processed all pending traffic), all middleboxes of the old type are switched to middleboxes of the new type.

The primary disadvantage of this routing scheme is latency. Compared to the traditional approach of placing middleboxes directly in the natural path of the packets, this scheme imposes an extra approximately 0.3 ms (based on the authors' testbed) of latency. The authors argue that this difference is insignificant given the generally low latency of the environment, but, for data-center-internal traffic, this may double the latency, which is very likely to substantially affect performance.

Internet Indirection Infrastructure

This paper describes an overlay-based Internet routing service built over the primitives of a DHT. Addresses in the overlay consist of two pieces, one of which must match exactly and the other of which is matched to the closest (longest prefix match) available address when routing messages. Receivers arrange to have an address routed to them by inserting a mapping (called a trigger) from an overlay address to some list of (overlay or IP) addresses. When a message is sent to an overlay address, these triggers are found, and for each trigger, the message is forwarded to the first functional address, passing along any left-over addresses in the trigger as metadata. These triggers expire unless refreshed periodically and can be removed only if the destination part is known. Conflicts in address assignment are avoided by using a sufficiently large address space.
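
A sketch of the trigger-matching rule on bit-string identifiers; the identifier length and the size of the exact-match prefix are placeholders, not necessarily the values i3 uses:

```python
# Sketch of i3-style trigger matching: the first EXACT_BITS of the
# identifier must match exactly, and among those triggers the one sharing
# the longest prefix of the remaining bits wins. Identifiers are represented
# here as strings of '0'/'1' characters for clarity; sizes are illustrative.
EXACT_BITS = 64

def prefix_match_len(a, b):
    """Length of the common prefix of two equal-length bit strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_trigger(packet_id, triggers):
    """triggers: dict mapping identifier bit-string -> list of next-hop addresses."""
    best, best_len = None, -1
    for trig_id, targets in triggers.items():
        if trig_id[:EXACT_BITS] != packet_id[:EXACT_BITS]:
            continue
        plen = prefix_match_len(trig_id[EXACT_BITS:], packet_id[EXACT_BITS:])
        if plen > best_len:
            best, best_len = targets, plen
    return best        # forward to the first reachable address in this list
```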

The paper envisions that this addressing mechanism can be used to implement many commonly requested routing services:
  • mobility — the mobile host simply maintains a trigger with a permanent address, arranging for it to point to its current real address;
  • multicast — each subscribing host inserts a trigger for the multicast address (or, for large multicast groups, the hosts can arrange for a tree of triggers);
  • anycast — endpoints encode their preferences in the inexactly matched portion of the address and ensure that exactly one trigger is the closest match;
  • service composition — the end host wishing to insert the service inserts a trigger listing the service, then itself; the service agrees to forward the message to the address it receives as metadata.
The implementation of this primitive over a Chord DHT assigns responsibility for routing to an overlay address to the Chord node responsible for the corresponding key in Chord's keyspace. (To allow closest-match routing, all addresses with the same exact-match portion correspond to the same Chord key.) The responsible Chord node forwards all data destined for the overlay address in question.

This routing scheme poses some performance concerns. The chosen intermediate server may be a poor choice, routing traffic between relatively close hosts through a far-away or slow router. To avoid this, the authors propose using separate addresses for connection setup (done with a public address) and for the connection itself (which uses a temporary private address); the private address can be chosen to fall into the range a nearby server is responsible for (nearby servers presumably being discovered by querying the addresses of nearby servers in the DHT). Another concern is overloading a DHT node responsible for a popular public address; the authors believe they can address this by caching the associated triggers along the lookup path.

Other concerns about this system are ones of security. The most obvious issue is that the flexibility to implement multicast comes with the ability to send massive amounts of duplicated traffic to a victim host. The authors propose sending challenges to the trigger's mapped address to ensure that the host at that address actually permits the mapping. For cases where the victim is the infrastructure itself, the authors propose assigning resources based on the end host inserting the trigger and detecting loops using either test packets or a Bloom filter to mark visited hosts.

The authors also address concerns of eavesdropping on or hijacking communications using the triggers. The authors argued that random private triggers would prevent eavesdropping attacks because a malicious third party would need to guess the private trigger. Similarly, the authors would prevent hijacking by having the public trigger merely forward to a (hopefully unguessable) private trigger, knowledge of which is required to remove the public one.

I felt that these security concerns would be better solved by using a traditional cryptographic end-to-end solution for the hijacking and eavesdropping concerns (perhaps by encoding public keys in the public triggers) and authentication for trigger-removal requests. The authors wish i3 to be a relatively configuration-free system distributed between multiple administrative domains; to achieve these goals, almost any party should be able to run an i3 node. Given this, it seems a more viable attack would be to collect triggers by running (perhaps many) i3 nodes; since triggers' values will be distributed at random among the i3 nodes and forwarded to other i3 nodes for caching, an attacker should be able to learn the values of many private triggers.

Wednesday, November 4, 2009

DNS Performance and the Effectiveness of Caching

This is a study of DNS in the wild, using traces of DNS requests and responses leaving two administrative domains along with TCP connections and disconnections. They found typical lookup performance was better than in the '88 paper, but, still, many lookups would take an exceptionally long time from a user's perspective, apparently due to network or configuration problems. About 20% of all lookups got no answer (and no error response), resulting in both high lookup latency and a probably uncacheable negative result. Around 4% of the unanswered lookups were authority forwarding loops, an obvious misconfiguration. Some popular lookups with negative responses were found to be common misconfigurations (e.g. asking the root servers to resolve 'loopback' or 'loghost').

The authors of the paper, however, were mostly concerned not with misconfiguration but with the effect of caching and TTL settings. The authors generally found that caching worked well for nameserver records, so most queries required few referrals (each of which is an extra round trip to a server), substantially decreasing the latency of lookups. The authors found that the names requested followed a Zipf-like distribution and that therefore most of the benefit of caching was obtained by caching relatively few names. They also found, based on the TCP trace, that because accesses to the same domain were highly clustered, short TTLs (popular for load balancing) did not cause substantial increases in the number of DNS queries or in user-perceived latency.

Development of the Domain Name System

This article describes the design goals and some of the rationale behind the DNS protocol. The system DNS replaced was clearly neither scalable nor efficient: previously, programs would scan a flat file listing hostnames and their corresponding IP addresses, and this file had to be distributed to all end hosts. The replacement was to be distributed, so multiple administrative domains could add hosts, and was to be lightweight and provide caching, so as to offer performance comparable to the local database approach. There were loftier goals of extensibility, but these seemed to be reflected in the initial design primarily in the existence of multiple types of records and the lack of structure imposed on names.

To achieve the distributed administrative control, DNS used an explicit hierarchy, which was to reflect the actual administrative hierarchy of the network. Each part of this hierarchy (a ‘zone’) is assigned to one or more DNS servers (each of which may be responsible for many distinct zones). Each DNS server returns a time-to-live (some number of seconds), permitting querying clients and servers to cache the records for an appropriate amount of time. The same caching policy applies to references to other DNS servers, allowing most queries to avoid loading the top of the hierarchy.

In spite of this caching strategy and centralized caches per organization, the authors observed a relatively high query rate on the root servers of about seven queries per second (over all seven root servers, at a time when only about twenty-five thousand hosts had names on the Internet), and overall poor performance due to wide-area network issues that would not be discovered in testing. Some of these issues were due to poor implementations. Some of the non-cached traffic was due to an initial lack of negative response caching.

A notable problem with DNS mentioned in the paper is one of misconfiguration. Administrators apparently did a very poor job of understanding or correctly setting TTLs, and the consequences of getting them wrong were usually quite subtle.

With hindsight, it is clear that the paper's treatment of DNS security was quite naive. Though the authors mention the security concerns of resolvers blindly accepting records (albeit mostly in a misconfiguration scenario), they missed the chance to discover the poor use of additional answer sections that would cause major security problems later.

Monday, November 2, 2009

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

This paper describes the Chord DHT, a simulation-based evaluation of its load balancing and performance under churn, and a very small testbed evaluation of its latency across the wide-area network.

Chord maps its keys and nodes to a conceptual ring and uses the distance in one direction around the ring to route messages. A node is directly responsible for the keys between the previous node's address and its own address on the ring and uses the following nodes as replicas for those keys. To guarantee routability to some node that thinks it is responsible for a key, each node keeps pointers to its current notion of its next and previous nodes. To provide fast routing in the common case, each node also keeps a pointer to the node approximately 2^k units further around the ring for each k less than the key size (its ‘finger table’).
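
A compact sketch of the lookup side, assuming successor pointers and finger tables are already consistent (building and repairing them under churn is the bulk of the actual protocol):

```python
# Sketch of Chord-style greedy lookup on a 2^M identifier ring. Assumes a
# static, consistent ring; real Chord must also maintain these structures.
import bisect

M = 16                  # identifier bits (illustrative)
RING = 1 << M

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    if a == b:          # single-node ring: the interval covers everything
        return True
    return (x - a) % RING <= (b - a) % RING and x != a

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.fingers = []       # entry k: first node at or after id + 2^k

    def closest_preceding_finger(self, key):
        for f in reversed(self.fingers):
            if f.id != self.id and f.id != key and between(f.id, self.id, key):
                return f
        return self.successor

    def find_successor(self, key):
        node = self
        while not between(key, node.id, node.successor.id):
            node = node.closest_preceding_finger(key)
        return node.successor

def build_ring(ids):
    """Construct a consistent ring with correct fingers for a static set of IDs."""
    nodes = sorted((Node(i % RING) for i in set(ids)), key=lambda n: n.id)
    sorted_ids = [n.id for n in nodes]
    def successor_of(key):
        return nodes[bisect.bisect_left(sorted_ids, key % RING) % len(nodes)]
    for n in nodes:
        n.successor = successor_of(n.id + 1)
        n.fingers = [successor_of(n.id + (1 << k)) for k in range(M)]
    return nodes

nodes = build_ring([5, 980, 12000, 33000, 51000])
print(nodes[0].find_successor(40000).id)    # 51000: the node responsible for key 40000
```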

Given these data structures to maintain, the paper presents algorithms for handling initialization of the finger table with relatively few messages and for maintaining the finger tables and replicas in the presence of failures. The paper provides some theoretical evidence that these algorithms maintain good performance and availability in the presence of failures. The simulation evaluation gives good evidence that the algorithms can, in fact, maintain network connectivity in the presence of high failure rates, though it does not seem to show that the replicas would actually mitigate the impact of these failures in practice.

Chord is a very nice paper, but DHTs as envisioned in Chord have been relatively rare. Many distributed systems (especially lookup services which are relatively undemanding) are not actually large or unstable or decentralized enough to benefit from this sort of detailed handling of churn and scale, yielding approaches like the single-hop DHT (e.g. Dynamo, SEATTLE) and the replicated master (e.g. GFS).

Looking Up Data in P2P Systems

This paper reviews peer-to-peer overlay networks from the perspective of how they approach the lookup problem (finding where to find data). It observes that there is a spectrum of actual implementations from centralized to hierarchical to symmetric. The paper focuses primarily on the symmetric sort, observing that they gain in resilience by not having any special nodes whose failures matter more than the failures of other nodes. (I find it curious that the actual variability in the performance and availability of machines in P2P systems is not mentioned here.) Separate from the symmetric/hierarchical distinction, the paper makes a distinction between structured and unstructured storage, where structured systems have "a well-defined set of information about other nodes". The authors focus only on structured systems, since the systems they consider unstructured preclude retrievability guarantees.

From these choices, the authors focus on DHTs, which are all symmetric, structured solutions to the distributed lookup problem. The paper describes several routing schemes (and their corresponding distance functions) and concludes with several practical issues that were apparently not well solved for DHTs when the paper was written, such as assessing the impact of churn, handling malicious nodes, and providing more powerful primitives than lookup-by-key.

Wednesday, October 28, 2009

Active network visions and reality: lessons from a capsule-based system

This paper describes an active networking implementation (that is, one allowing network users to specify forwarding and related code to be deployed within the network). Rather than embed the code itself in the forwarded packets, this design embeds an identifier, and the code is retrieved from the prior hop on demand. The authors' implementation of this scheme shows that it has less than 1 millisecond of latency overhead but can only keep up with approximately T1-speed links. The authors blame this inefficiency on their use of Java and argue that a production implementation, which would presumably use a different isolation technique, could perform at the speed of faster links. The authors envision a system where not all nodes on the network are ‘active’ (which also enables incremental deployment), so high-speed nodes failing to implement active networking is not a fatal flaw.

Most of the paper deals with countermeasures to security and scalability problems arising from running external code on routers. The authors impose tight per-service limits on node resource consumption and assume software isolation between services on the active nodes. But these restrictions are not enough to prevent abuses under their scheme: malicious users can write programs that ping-pong traffic within the network to generate hundreds of times the traffic of their edge link, and possibly much more if they are to be able to implement multicast, etc. in their network programs.

The authors' solution (which they acknowledge is unsatisfying) is for services to be endorsed by a central authority. Even with this restriction, it is not clear how these services are to satisfy more local constraints: how, for example, could nodes homed on both the commercial Internet and the Internet2 backbone enforce the restriction that the Internet2 backbone was only to be used for non-commercial traffic? Generally, the scheme takes away control from the middle of the network (especially given the difficulty of answering "policy questions" about arbitrary forwarding programs) while asking the middle to devote a lot of resources to supporting it.


Resilient Overlay Networks

From the observation that BGP as deployed on the Internet does not allow for particularly fast or flexible use of the many available paths, this paper develops an overlay network (RON) to take advantage of alternate paths. Each node in the overlay network probes its (Internet-routed) link to each other node, estimating its loss rate and latency as well as detecting unavailability of the link within tens of seconds. From these measurements, RON chooses an intermediate node (if any) to optimize for an application-chosen metric.

The authors deployed their system across approximately 20 nodes on the Internet (mostly US and mostly educational) and measured the packet loss, latency, and throughput provided in practice over their overlay network. They found a dramatic reduction in large (greater than 30%) packet loss rates and a small but significant improvement in latency and throughput across all links. These improvements persisted even when they ignored Internet2 links (which, for policy reasons, may not be used in practice for commercial traffic). They noted that different paths were chosen in practice when optimizing for different metrics (latency, loss, throughput) and that these paths were frequently not the ‘normal’ Internet route.

A subtext of this paper is that BGP and IP do a bad job: they are insufficiently reactive and expressive to make good use of the available Internet routes subject to policy constraints. The headline result, avoiding most outages to which BGP takes many minutes to respond, certainly does seem like something that a hypothetical new BGP could also achieve. Of course (as the authors argue), some of the advantages come from things no BGP replacement could do: the amount of state maintained by RON as-is certainly would not scale to hundreds of thousands of routers (probably not even to hundreds of nodes, as the authors leave this tricky issue for future work). Also, the authors note that it is to their advantage that RON only deals with a small amount of traffic (relative to link capacity) and thus avoids inducing new failures when it reroutes traffic.

Monday, October 26, 2009

On Inferring Autonomous System Relationships in the Internet

This paper describes a technique for inferring an annotated graph of AS to AS links. The annotations describe three types of relationships, one of which is directional, based on the export policies of the ASs:
  • provider to customer — a provider exports any routes on these links and a customer only exports its and its own customers' routes
  • peer to peer — only routes to customers (and the parties themselves) are exported
  • sibling to sibling — both customer and peer and provider routes may be exported
The authors infer these relationships from the AS paths in a number of BGP-derived routing tables from various peering points on the Internet. The authors work based on two assumptions:
  • AS paths are valley free (that is, they contain zero or more customer-to-provider and sibling-to-sibling links followed by zero or one peer-to-peer links followed by zero or more provider-to-customer and sibling-to-sibling links)
  • the highest degree AS in a path is the ‘top’ of the path (that is, all customer-to-provider links occur before it appears and all provider-to-customer links occur after it)
From these assumptions, the authors can determine that every AS before the ‘top’ AS gets transit from the following AS in the AS path and every AS after the top gives transit to the following AS. These transit relationships distinguish provider-to-customer links from other types of links. Peering relationships can then be found by checking that no routes learned from providers or peers are exported through the link in question. Some extra heuristics are used to deal with misconfigured BGP sites or exceptions such as an apparent customer that (probably accidentally, the authors assume) provides transit between its two providers; ordinarily, the algorithm would infer that this was not a provider/customer relationship.
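
A stripped-down sketch of that basic inference pass (degrees from the observed graph, highest-degree AS as the top of each path, transit records on either side of it); the peering identification and the misconfiguration heuristics the authors add on top are omitted:

```python
# Sketch of valley-free relationship inference: record who appears to give
# transit to whom on either side of the highest-degree AS in each path, then
# classify edges from those records. Simplified relative to the paper.
from collections import defaultdict

def infer_relationships(as_paths):
    degree = defaultdict(int)
    edges = set()
    for path in as_paths:
        for a, b in zip(path, path[1:]):
            if (a, b) not in edges and (b, a) not in edges:
                degree[a] += 1
                degree[b] += 1
                edges.add((a, b))

    transit = defaultdict(set)          # transit[x] = ASes x gives transit to
    for path in as_paths:
        top = max(range(len(path)), key=lambda i: degree[path[i]])
        for i in range(len(path) - 1):
            a, b = path[i], path[i + 1]
            if i < top:
                transit[b].add(a)       # uphill: b acts as a's provider
            else:
                transit[a].add(b)       # downhill: a acts as b's provider

    relationships = {}                  # labeled in the (a, b) orientation stored
    for a, b in edges:
        a_gives, b_gives = b in transit[a], a in transit[b]
        if a_gives and b_gives:
            relationships[(a, b)] = "sibling"
        elif a_gives:
            relationships[(a, b)] = "provider->customer"
        elif b_gives:
            relationships[(a, b)] = "customer->provider"
        else:
            relationships[(a, b)] = "peer (or unknown)"
    return relationships
```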

The authors ran their inference algorithm over BGP routing tables observed from 24 sites and compared their results to AT&T internal information about AT&T's own AS-AS relationships. They found that the inferences were mostly (approximately 99%) accurate for AT&T. There were some notable errors; for example, they inferred that AT&T had a provider, characterized several customers (with their own ASs) as having no relationship to AT&T at all, and most of the sibling-to-sibling relationships they identified AT&T as having were considered other types of relationships by AT&T.

Based on a manual analysis, the authors blame their errors primarily on misconfigurations, on inter-AS relationships not fitting into their simplistic model, and on the inaccuracy of their heuristics. Unfortunately, the authors do not attempt to quantify the relative effect of these issues, and it is not apparent from their single example what sorts of AS relationships could be added to make a substantially more complete model. The heuristics used are generally a bit disappointing: do the authors have good evidence that AS degree really determines the top of the path? Is there a better way to ignore incorrectly advertised customer-to-provider paths (e.g. by noticing they typically won't ever be used)?

On Power-Law Relationships of the Internet Topology

This paper examines several maps of the Internet topology: three at the autonomous system level (derived from a BGP peering point) spanning just over a year in 1997-1998, and one of the router-level topology from 1995.

The authors demonstrate that these four graphs have what the authors call power laws: that is, relationships of the form y ≈ A*x^B for some A and B, where x and y are properties of parts of the graph. The power laws they find empirically are:
  • between the outdegree rank (position in list of nodes sorted by outdegree) of a node and outdegree of a node;
  • between the frequency of an outdegree in the graph and the outdegree itself;
  • between the number of pairs of nodes within a given distance of each other and that distance;
  • between the eigenvalues of the graph (presumably of its adjacency matrix though the authors appear to forget to specify this) and their ranks
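
As a worked example of how such a law is checked, the fit is just a least-squares line in log-log space, as for the outdegree-versus-rank law in the first item above. A minimal sketch, with toy data purely for illustration:

```python
# Fit y ~ A * x**B by linear regression on (log x, log y) and report the
# exponent B and correlation coefficient r, as the paper does per law.
import math

def fit_power_law(xs, ys):
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    syy = sum((y - my) ** 2 for y in ly)
    B = sxy / sxx
    A = math.exp(my - B * mx)
    r = sxy / math.sqrt(sxx * syy)
    return A, B, r

# Toy data purely for illustration: outdegrees sorted in decreasing order.
outdegrees = sorted([64, 32, 20, 12, 8, 6, 5, 4, 3, 3, 2, 2, 2, 1, 1, 1], reverse=True)
ranks = list(range(1, len(outdegrees) + 1))
A, B, r = fit_power_law(ranks, outdegrees)
print(f"outdegree ~ {A:.1f} * rank^{B:.2f} (correlation {r:.3f})")
```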

As applications for these observations, the authors primarily propose using them to evaluate protocols for the Internet more realistically. They observe that prior work which ignored these power laws made some poor estimates about properties of the Internet graph: for example, it implicitly assumed a roughly uniform distribution of outdegrees and thus assumed that most nodes had much smaller 2-, 3-, or 4-hop neighborhoods than actually occur. Toward the end of using these properties for evaluation, the authors try to extract properties that can be used directly to produce synthetic graphs for Internet protocol evaluation (for example, a division into a core/tree structure and a direct estimator of AS and router hop counts).

The authors also propose using these properties to produce test environments that will continue to reflect the Internet in the future. They suggest that these power laws will continue to hold over future topology graphs and so can be used to construct reasonable evaluations for future Internet protocols intended for a larger Internet.

Though the formulas for diameter, neighborhood size, etc. are good qualitative summaries of the Internet topology (and thus useful for protocol design, etc.), I find it hard to believe that one would really prefer to use them as is to evaluate protocols intended for the current topology when one could sample from an actual topology map.

It is disappointing that the authors did not have a very long time span of data to analyze. I found it strange that the authors did not attempt to collect an (inevitably partial) 1997 or 1998 map of the router-level topology of the Internet to allow for a several-year analysis of the evolution of the topology (which seems to be one of the major goals of the work). And, for the BGP views, the authors did not attempt to quantify how complete their graphs were (since some links may not be visible from some BGP peering points).

Monday, October 19, 2009

Interactive WiFi Connectivity For Moving Vehicles

This paper describes a protocol for vehicular WiFi (ViFi) based on utilizing multiple access points. A mobile host chooses a primary base station and several auxiliary base stations. When the auxiliary base stations overhear a packet to or from the mobile host that is not ACK'd, they probabilistically resend it. The probability of resending is based on the probability of other nodes deciding to resend the packet and the likelihood of reaching the ultimate destination, with the goal of usually resending the (apparently) unACK'd packet exactly once.
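
Very roughly, the relaying decision might look like the sketch below: each auxiliary that overheard an unACKed packet relays with a probability weighted by its own delivery probability to the destination and normalized across the contending auxiliaries. This is a simplification of ViFi's actual computation (which also accounts for the chance that the other auxiliaries heard the packet at all):

```python
# Rough sketch of the auxiliary relaying decision; normalization targets an
# expected total of about one relay across contending auxiliaries.
import random

def relay_probability(my_delivery_prob, other_delivery_probs):
    total = my_delivery_prob + sum(other_delivery_probs)
    if total <= 0:
        return 0.0
    return min(1.0, my_delivery_prob / total)

def maybe_relay(packet, my_delivery_prob, other_delivery_probs):
    """Return the packet if this auxiliary decides to relay it, else None."""
    if random.random() < relay_probability(my_delivery_prob, other_delivery_probs):
        return packet
    return None
```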

The authors evaluated their protocol using a trace-driven simulation (based on loss rate/location traces measured from two real testbeds) which the authors apparently validated with measurements on the real testbeds. Using their measurements, they compared ViFi to optimal handoff between base stations (with no backups to relay) and to optimal use of relaying (where the best possible base station always did the relaying) and showed that they achieved substantially better utilization than using a single base station and nearly the optimal utilization for the relaying strategy. They also tested the application performance of some transport-layer protocols (TCP, VoIP), which was much better with their scheme than with simple handoff, though they ignored some TCP anomalies (e.g. apparently indefinitely stalled connections).

A disappointing artifact of the paper was that the evaluation was apparently performed with very few simulated users (though it is unclear what the DieselNet trace contained). The authors did not evaluate fairness between users despite a dubious decision to rely on carrier sense and not attempt to reimplement the exponential backoff that their use of broadcasts had disabled.

White Space Networking with Wi-Fi like Connectivity

This paper describes a MAC protocol for Wi-Fi-like whitespace networking ("WhiteFi"). The primary challenge is to choose and maintain choices of channels that avoid interference with existing users of the channel. To achieve this goal, the hardware of a WhiteFi node includes both a scanner that drives a software-defined radio (used for access point detection and for detecting ‘incumbents’ of channels) and a Wi-Fi-like radio.

To assign channels, the authors measure the utilization of each 6 MHz frequency range and the number of active access points on each. Based on this, each node estimates the share of bandwidth it can expect to receive and sends these measurements to the access point. The AP chooses a channel by aggregating these measurements (after taking into account which channels are used by incumbents seen by any end host or by the AP).

To allow devices to quickly switch channels after incumbents are detected, a backup channel is used: the AP listens to it with its scanner radio, and clients fall back to it during quiet periods.

The authors evaluated their active channel detection, AP discovery, and channel-switching speed on an (apparently two-node) prototype deployment. They evaluated their channel selection algorithm in simulation and showed that it appeared to perform close to the optimum in the presence of simulated background traffic and spatial variation.

Disappointingly, the authors don't appear to evaluate the effects their channel switching will have on end users. Their only experiment on the real testbed touching this issue (Fig. 14) shows that there appears to be a serious throughput penalty when channel switching must occur, and from their evaluation it is not clear what effect that would likely have on transport protocols.

Wednesday, October 14, 2009

XORs in The Air: Practical Wireless Network Coding

This paper presents a routing and MAC scheme (COPE) that attempts to increase wireless throughput by using a coding technique to send multiple packets at once. When a router has a set of packets to forward to multiple destinations and each destination already has all but one packet in the set, it can send the XOR of all the packets in the set and each destination can decode the one it is missing. To allow routers to use this feature, they discover which packets their neighbors have available based on periodic reception reports and by guessing from overheard packets and the transmission probabilities tracked by routing.
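
The coding and decoding step itself is just XOR, as in this sketch (the packet contents are placeholders; COPE's real work is in deciding which packets can safely be coded together):

```python
# Sketch of the XOR coding step: a router broadcasts the XOR of a set of
# packets, and each next hop recovers its own packet by XORing in the
# packets it already holds or overheard.

def xor_packets(packets):
    """XOR a list of equal-length byte strings (real packets need padding)."""
    out = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            out[i] ^= byte
    return bytes(out)

# Router side: p1 is destined for neighbor A and p2 for neighbor B, where
# A already has p2 and B already has p1 (e.g. because they sent them).
p1, p2 = b"hello from B", b"hello from A"
coded = xor_packets([p1, p2])

# Neighbor A decodes its packet by XORing out the one it already knows.
recovered_at_A = xor_packets([coded, p2])
assert recovered_at_A == p1
```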

The authors evaluated COPE in an in-building ad-hoc 802.11a testbed. With UDP flows between random pairs of hosts (with flow sizes corresponding to those from an Internet file sample), they observed approximately double the effective throughput at high levels of network utilization and very little improvement at low levels of network utilization.

With TCP flows in the same testbed, performance was considerably worse. The authors attributed this problem to TCP not recovering well from the unavoidable collision losses caused by hidden terminals, which MAC-layer retries often did not recover from; they achieved an approximately 40% improvement when they artificially eliminated hidden terminals (by placing hidden terminals within carrier-sense range of all other nodes but artificially restricting the offered paths) and sufficiently loaded the network. (The authors went to some extra effort to make TCP flows work well, for example by avoiding packet reordering, though they did not evaluate the effect of this improvement.)

Though ordinarily random pair connection patterns seem like a good way to stress the network, in this case they are likely to make the coding scheme look better than it may be in many practical situations. It performs especially well when there are flows that cross each other on the network, but many common applications are asymmetric (e.g. web-browsing-like patterns would likely have most traffic originating from a small number of gateways, with relatively little traffic to code against in the opposite direction).

ExOR: Opportunistic Multi-Hop Routing for Wireless Networks

ExOR is a routing protocol that attempts to exploit ‘lucky’ delivery of packets along multiple lossy (hopefully in an uncorrelated way) hops from each router on the path. Each sent batch of packets is broadcast with an ordered list of next hops. When a router receives a packet, it waits an amount of time according to where it is in the priority list and the sending rate in its environment, and then forwards the packets it did not overhear from nodes higher in the priority list. The nodes on the ordered list are sorted by each node's ETX to the destination and pruned based on which nodes are actually used enough in a simulation of ExOR. ExOR only attempts to transmit 90% of the packets in a batch, on the assumption that the last packets in a batch will be more expensive to send due to a higher likelihood of collisions (due to insufficient information to schedule packet sending well). The remaining packets are expected to be transmitted through traditional hop-by-hop routing (though the authors did not appear to do this in their evaluation).
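
A heavily simplified sketch of the per-node forwarding decision within one batch; the slot timing and the toy batch maps below are illustrative, and the real protocol schedules transmissions from overheard batch maps and channel measurements:

```python
# Sketch: wait according to position in the forwarder priority list, then
# forward only the packets that higher-priority nodes have not already
# reported receiving.

def forwarding_delay(priority_index, slot_time=0.01):
    """Higher-priority nodes (smaller index) get to transmit earlier."""
    return priority_index * slot_time

def packets_to_forward(received, covered_by_higher_priority):
    """received: ids of batch packets this node heard;
    covered_by_higher_priority: ids already reported by nodes above it."""
    return received - covered_by_higher_priority

# Example: this node is third in the priority list for a 100-packet batch.
received = set(range(0, 100, 2))            # it happened to hear the even packets
covered = set(range(0, 100, 3))             # higher-priority nodes report these
print(forwarding_delay(priority_index=2))   # 0.02: wait two slots before sending
print(len(packets_to_forward(received, covered)))   # 33 packets left to forward
```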

The authors evaluated ExOR on RoofNet, comparing its efficiency to RoofNet's routing with the ETX metric. The authors measured throughput compared to this traditional routing. They sent 10% more traffic when testing ExOR to compensate for the reduced delivery rate. Even accounting for this extra traffic, the authors found substantially better throughput than ‘traditional’ routing for multi-hop paths, and especially for the poorest routes under traditional routing.

Based on the evaluation, it appears that this protocol targets bulk transfers. The authors did not evaluate latency, and appear to expect users to run a transport protocol unlike TCP, since this technique will certainly introduce jitter and/or out-of-order delivery and/or losses. The authors implemented a web proxy, only using ExOR for 100KB or larger transfers, but unfortunately the paper does not evaluate that proxy's actual overhead.

Wednesday, October 7, 2009

A Performance Comparison of Multi-Hop Wireless Ad Hoc Network Routing Protocols

There are several choices to be made in constructing ad-hoc routing protocols, several of which are explored by the protocols in this paper:
  • route construction time — on demand (AODV, DSR, TORA) or always (DSDV)
  • loop avoidance mechanism — sequence number (DSDV, AODV), source routing (DSR), topological ordering + drop packets during link reversal (temporary routing loop) (TORA)
  • route storage — next hop table (DSDV, AODV), topological ordering (“height”) (TORA), at source/on data packets (DSR)
Additionally, these algorithms need to choose various timeouts and rates and delays (to avoid accidental synchronization) for link state and route discovery messages as well as route propagation policies.

The authors of this paper implemented DSDV (with two different route propagation policies), AODV (with explicit beacons and with on-send loss detection), DSR, and TORA in a detailed simulation of an 802.11 network with RTS/CTS. They chose several simulation scenarios with varying degrees of mobility (both in speed and in movement frequency), with nodes moving randomly in the simulated space. They used several fixed-rate packet sources (between 10, 20, or 30 random pairs of hosts) with sufficiently small packet sizes to usually avoid congestion losses. Given these scenarios, they measured the delivery rate, routing overhead (in bytes and packets), and route length versus the optimal path.

They found that routing protocol efficiency greatly depended upon the degree of mobility: faster, more frequent movement resulted in substantially larger overheads for all but DSDV. Except for TORA at high packet rates (where it experienced a routing-protocol congestion collapse) and DSDV with high mobility (where its periodic updates were not fast enough), performance was good overall: more than 95% of packets were delivered and most packets took the shortest route. In terms of routing overhead, the results were not surprising (except for TORA, which seems to be clearly broken): the on-demand routing protocols (other than TORA) did substantially better, and DSR seemed to trade increased byte usage for decreased packet usage.

It's good to see a careful simulation of these various routing protocols under enough circumstances that it should be difficult to claim that an advantageous situation has been missed. Of course, the downside of a big sensitivity analysis like this is that reading the paper, I can't really tell how the choice of protocol would matter for a realistic application of an ad-hoc network. Except for the obviously bad cases (e.g. most of TORA, DSDV with very high mobility), would these differences (in overheads, delivery rates) even have a substantial effect on application performance?

A High-Throughput Path Metric for Multi-Hop Wireless Routing

This paper describes and evaluates the routing metric that was ultimately used in RoofNet based on an office environment deployment. The metric assumes hop-by-hop acknowledgments and retransmissions for unicast traffic. For each hop, the expected number of transmissions required to successfully send a packet along that hop is computed based upon the success rate of broadcasted probes (assumed to not use retransmission) in both directions. The authors believe that this metric will tend to find higher bandwidth paths better than merely counting hops and lower the overall load on the network. The authors recognize that their scheme assumes that variable transmission power would not be very advantageous (which seems strange given that their radios can definitely vary their power). Their scheme also seems specialized for their (common in wireless networks of this sort) half-duplex environment where a forwarded transmission interferes with the original sender so ‘pipelining’ is not possible.
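
Concretely, the per-link and per-route computation is small; a sketch (the delivery ratios here are placeholders):

```python
# Sketch of the ETX computation: per-link ETX is the expected number of
# transmissions given the measured forward and reverse probe delivery
# ratios (the data and its link-layer ACK must both get through), and a
# route's metric is the sum of its links' ETX values.

def link_etx(forward_delivery, reverse_delivery):
    """forward/reverse delivery ratios in (0, 1], from broadcast probes."""
    if forward_delivery <= 0 or reverse_delivery <= 0:
        return float("inf")
    return 1.0 / (forward_delivery * reverse_delivery)

def route_etx(links):
    """links: list of (forward_delivery, reverse_delivery) per hop."""
    return sum(link_etx(df, dr) for df, dr in links)

# A two-hop route with good links can beat a single lossy hop:
print(route_etx([(0.9, 0.9), (0.95, 0.9)]))   # ~2.4 expected transmissions
print(route_etx([(0.3, 0.6)]))                # ~5.6 expected transmissions
```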

The authors evaluated their scheme with an 802.11b network deployed in offices in their lab. They measured the hop-by-hop loss rates and computed a baseline ‘best’ set of route throughputs (assuming that their model of interference along a multi-hop route based on loss rates was accurate). They compared these throughputs to the actual throughputs obtained using DSDV and DSR (modified to avoid instabilities) with both the standard hop-count metric and the transmission-count metric.

Not surprisingly, the authors observed improved performance with their metric. They identified some non-obvious sources of the improvement which could be used independently by DSDV and DSR: avoiding asymmetric routes (which can be done by ‘handshaking’ before considering a link usable) and usually avoiding routes that experience transmission errors (which DSR does when given link-layer feedback).

The authors also evaluated the accuracy of their link quality measurements. They found that significant bias was introduced by the packet sizes of their probes — larger packets were received significantly less often than smaller ones, but their metric assumed all packets were equal.

Monday, October 5, 2009

Architecture and Evaluation of an Unplanned 802.11b Mesh Network

This paper evaluates the performance of a mesh network deployed in Cambridge, Massachusetts. The network had access points in locations that were convenient for the ‘community’ participants to use, and routed between them using a link-state protocol. Using broadcast probes, the loss rate of every link was evaluated at each of the supported 802.11b bit rates (also taking into account the loss rate for ACKs on the reverse path). From this information, routes (including the 802.11b bit rate at each hop) are chosen to maximize the end-to-end throughput across the network. Experimental evidence showed that the bandwidth predicted from the probes slightly overestimated the throughput of multihop routes due to interference.

The authors showed that this network did, in fact, benefit in both connectivity and bandwidth from making extensive use of multihop paths because using several shorter links provided much better throughput (for TCP connections) than the (typically longer) paths with a lower hop count. A vast majority of nodes in the network had more than two neighbors in the routed network. Though the link quality (in terms of achievable throughput) was correlated with physical distance, there were plenty of relatively short links with low quality and some highly beneficial long links with high quality.

The authors notably evaluated 802.11b RTS/CTS in their environment. Even though they had observed self-interference when a flow spanned multiple hops, RTS/CTS did not improve throughput.

As a case for creating wireless mesh networks without planning, this paper's results are disappointing. In the multi-hop TCP test, 10% could not communicate (despite the nodes of the network being physically stationary), multi-hop routes were much slower than single-hop routes (even considering the penalty of switching off), and adding nodes at random appeared (from simulations) to have seriously diminishing returns.

Modeling Wireless Links for Transport Protocols

This paper characterizes several types of wireless links in terms of how they may affect transport protocols. The paper seems to be primarily motivated by the poor models used for wireless links in prior work. Some areas with serious modeling errors:
  • losses — most link layers use forward error correction and retransmission (with a limit on retries), so loss rates due to transmission errors aren't very high (as often assumed) except in very poor radio conditions. More common, the paper claims, are bursty losses from handovers.
  • delay variation — contrary to the default of many simulators, delay jitter is frequently high due to error recovery and handovers, and link scheduling (e.g. in cellular links) can cause delay spikes
  • duplexing — many wireless links are not full duplex
  • traffic models — evaluations should consider multiple flows and flows that traverse more than just the wireless link
The paper gives suggestions on how to model these and some other effects (e.g. queue management, resource allocation, bandwidth variation, asymmetry) and provides some examples of how varying these parameters of the model can drastically affect the performance of transport protocols. The paper's value does not seem to lie so much in its evaluation of model differences (though they show some non-obvious results: e.g., buffer size can have a substantial effect on transport protocol behavior) as in providing a checklist of mostly simple mistakes to avoid.

Wednesday, September 30, 2009

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

This paper compares the performance of TCP over a wireless link using various schemes believed to improve TCP performance, primarily by dealing better with losses over the wireless link. They used a real wireless link between one TCP endpoint and a gateway; the gateway was connected over a wired path (Ethernet or the wide-area US Internet) to the other TCP endpoint. Though the authors used a real wireless link, they simulated single-bit losses (by corrupting TCP checksums) according to an exponential distribution. (The authors also performed one simulation where introduced errors caused more than one packet to be lost, finding that selective acknowledgment helped more than usual.)

The authors compare several types of schemes for improving TCP performance: link-layer solutions at the gateway, where the gateway resends missing segments and (in TCP-aware schemes) also suppresses spurious duplicate ACKs; gateway solutions that split the TCP connection into two TCP connections; and purely end-to-end solutions. The authors find that several proposed improvements do help, with caveats:

  • link-layer retransmission schemes help by themselves but work a lot better when they also suppress spurious duplicate acknowledgments, which otherwise lead to spurious retransmits and congestion window reductions (TCP-aware link-layer schemes provided the most improvement of all the schemes tested in the paper; a sketch of this snooping-style approach follows this list);
  • end-to-end selective acknowledgments make good use of much of the bandwidth of the link without gateway modifications, but do not perform as well as gateway modifications especially with a wide-area connection; and
  • splitting the TCP stream at the gateway does not perform as well as link-layer snooping because of limited buffer sizes at the gateway.
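
To make the TCP-aware link-layer idea concrete, here is a toy sketch of a snooping-style agent at the wireless gateway; the class, field names, and surrounding I/O callbacks are hypothetical, not the paper's implementation.

    # Toy sketch of a TCP-aware link-layer ("snoop"-style) agent at the wireless
    # gateway: cache unacknowledged data segments heading to the mobile host,
    # retransmit locally on duplicate ACKs, and suppress those duplicates so the
    # sender's congestion window is not punished for wireless losses.

    class SnoopAgent:
        def __init__(self):
            self.cache = {}        # seq -> segment bytes, not yet ACKed
            self.last_ack = -1
            self.dup_acks = 0

        def from_wired_sender(self, seq, segment, send_wireless):
            self.cache[seq] = segment
            send_wireless(seq, segment)

        def from_mobile_receiver(self, ack, send_wireless, forward_to_sender):
            if ack > self.last_ack:
                # New ACK: drop covered segments and pass the ACK through.
                self.cache = {s: d for s, d in self.cache.items() if s >= ack}
                self.last_ack, self.dup_acks = ack, 0
                forward_to_sender(ack)
            else:
                # Duplicate ACK: likely a wireless loss. Retransmit locally
                # and suppress the duplicate instead of forwarding it.
                self.dup_acks += 1
                if ack in self.cache:
                    send_wireless(ack, self.cache[ack])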


Though the authors mentioned “very high interactive delays” as a motivating problem observed in TCP over wireless, they disappointingly did not attempt to measure it in this paper.

MACAW: A Media Access Protocol for Wireless LANs

This article proposes a media access protocol targeted at a particular contended wireless network. In the network environment considered by the authors, connectivity is symmetric (if A can receive from B, then B can receive from A) and relatively short range with a sharp signal-strength drop-off. Thus, the primary efficiency concern with implementing wireless in the authors' environment is collisions at the receiver, including from transmitters out of range of the sender.

The hidden terminal problem makes carrier sense inappropriate. Instead, the authors use explicit Request-to-Send and Clear-to-Send messages to manage the channel. Additionally, the authors add an explicit acknowledgment (to guard against giving false congestion signals to TCP-like protocols) and an explicit Data-Sending message (so other nodes within range of the transmitter can avoid interfering with its reception of the CTS and ACK).

Nodes control their use of the network based on these messages. A CTS or DS message reserves the channel for long enough to send the corresponding data and its acknowledgment. Based on past apparent collisions and their (apparent) locations, nodes back off the rate at which they send RTSs, separately estimating congestion at their own location and at the receiver.
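
Here is a toy sketch of the gentler backoff adjustment this kind of protocol argues for (multiplicative increase on apparent collision, linear decrease on success, with the backoff value shared between contenders); the 1.5x factor and the bounds are illustrative assumptions.

    # Toy backoff sketch: grow the contention backoff multiplicatively when an
    # RTS draws no CTS, shrink it linearly after a successful exchange, and let
    # stations copy backoff values they overhear so contenders converge.

    BO_MIN, BO_MAX = 2, 64

    def after_failure(backoff):
        # No CTS arrived: assume congestion at the receiver, back off harder.
        return min(int(backoff * 1.5), BO_MAX)

    def after_success(backoff):
        # Completed exchange: ease off gradually instead of resetting to the
        # minimum, avoiding the wild oscillation of plain exponential backoff.
        return max(backoff - 1, BO_MIN)

    def copy_from_overheard(my_backoff, heard_backoff):
        # "Backoff copying": adopt the backoff carried in overheard packets so
        # stations sharing a congested region contend comparably fairly.
        return heard_backoff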

The authors evaluated the MAC protocol with simulations of a variety of network topologies, verifying that reasonable fairness between data streams was provided by the MAC protocol and that good link utilization was achieved (even in the presence of hidden terminals, unresponsive nodes, etc.). Most of the simulations used loss-insensitive flows and assumed that reception success depended entirely on distance from the receiver and from competing transmitters.

It is very nice that the authors evaluated many network topologies, and these clearly include cases where the authors' improvements produce huge differences. However, the paper seems to lack evaluation of more realistic scenarios. Even though the authors' ACK scheme is motivated by transport protocol performance, almost none of the authors' simulations consider the performance of a TCP-like transport protocol or even how intermittent noise affects their link-layer protocol alone. The authors also do not consider any metric besides stream throughput, such as how much jitter and latency this MAC protocol introduces into the network.

Monday, September 28, 2009

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

This paper examines the consequences of one potential solution to the incast collapse problem (loss of useful throughput when several simultaneous TCP streams are received by a single sink): lowering or eliminating the minimum retransmission timeout. The intuition behind this change is that it will speed recovery from serious congestion events and that the very short and predictable round-trip times of datacenter networks do not require the current default's large margin of safety. They implemented this timeout reduction both in simulation and by modifying the Linux kernel TCP implementation (which required allowing timeouts to be triggered other than on the timer interrupt and giving internal timers better than 1 ms granularity).

First, they evaluated the change in an incast scenario: the sink pulled 1 MB of data split evenly across a varying number of servers, and they measured the effective throughput (‘goodput’). Both simulation results and experiments showed a dramatic increase in useful throughput for small (approx. 1 ms) minimum retransmission timeouts compared to values closer to the current default of 200 ms. Based on RTT information from high-bandwidth (~10 Gbps) datacenter networks and simulations, the authors further suggested that future networks will have substantially lower round-trip times, so the minimum RTO should be eliminated entirely. For such networks, the authors also recommended adding random variation to the computed RTO to avoid accidental synchronization (from similar RTTs).
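
A toy sketch of the desynchronization suggestion: with a fine-grained timer and no 200 ms floor, add a small random factor to each computed timeout so flows that timed out on the same congestion event do not all retransmit in lockstep. The 0-50% randomization range here is my assumption, not the paper's exact recommendation.

    import random

    def schedule_retransmit(now, rto):
        # rto comes from the usual smoothed-RTT estimate; no minimum is applied,
        # since datacenter RTTs are tens to hundreds of microseconds.
        return now + rto * (1.0 + random.uniform(0.0, 0.5))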

The authors also examined some possible negative impacts of drastically lowering or eliminating the minimum retransmission timeout. Two different sources of spurious retransmissions and delays are considered: underestimating the RTT due to a delay spike, and ACKs delayed for longer than the retransmission timeout. For both problems, the authors performed a comparison of BitTorrent throughputs observed with and without a very small minimum RTO. They found little difference in the per-flow throughput distribution with a low minimum retransmission timeout.

For delayed ACKs, the authors compared throughput in a 16-node fan-in situation with and without delayed ACKs. They found that enabling delayed ACKs (with a delayed-ACK timer larger than the round-trip time) caused substantially lower throughput with large fan-in.

The solution proposed in this paper is relatively easy to deploy, simple, and seems to work with few negative consequences. The authors suggest that the spurious timeout avoidance the minRTO was set for is no longer necessary now that recovery from spurious timeouts does not cause entry into slow-start. But do the delay spikes that motivated the original minRTO setting still exist? (Is better recovery from misestimation enough?)

Sunday, September 27, 2009

Understanding TCP Incast Throughput Collapse in Datacenter Networks

This paper is another examination of the cause of and solutions to the incast problem. The authors analyze the incast phenomenon with and without the changes proposed in the CMU paper (a much lower minRTO and disabling delayed ACKs). Unlike the CMU paper, the evaluation here primarily uses a constant-data-per-source scenario rather than a constant-data-per-sink scenario. They find that this fixed-source-size workload yields more complex goodput behavior than the other scenario. And in contrast to the CMU findings, the authors find that delayed ACKs do, in fact, help against incast for large numbers of senders because they induce increased variance in the RTT estimate and TCP window size.

The authors produced a simple numerical model of the throughput collapse behavior, assuming that it is determined by the number of retransmission timeout events and the interpacket wait time. From this model and refinements, the authors produce an explanation for the throughput collapse behavior: first, throughput collapses as the number of retransmission timeouts increases with the number of senders; then the width of the period in which retransmission timeouts are triggered (per congestion event) increases, so recovering senders become desynchronized; but eventually the width of that distribution becomes so wide that senders about to experience a timeout interfere with senders just recovering from a timeout.
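
To give the flavor of this kind of model, here is a toy version in which goodput is driven by how many retransmission-timeout events the transfer suffers and how long each one stalls the senders; the exact form is my simplification, not the authors' fitted model, and the numbers are illustrative.

    def modeled_goodput(total_bytes, link_bandwidth, num_rto_events, rto):
        # Ideal transfer time plus time lost sitting in retransmission timeouts.
        ideal_time = total_bytes / link_bandwidth
        total_time = ideal_time + num_rto_events * rto
        return total_bytes / total_time

    # With a 200 ms minimum RTO, even a handful of timeout events dominates a
    # transfer that would otherwise take a few milliseconds.
    print(modeled_goodput(1e6, 1e9 / 8, num_rto_events=3, rto=0.2))
    print(modeled_goodput(1e6, 1e9 / 8, num_rto_events=3, rto=0.001))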

It is disappointing that the qualitative observations which would seem to explain the throughput collapse are not incorporated into the model, nor is a good empirical way to test for their presence (beyond fitting a more complicated model) attempted or suggested.

Monday, September 21, 2009

VL2

This paper is remarkably similar to PortLand. Therefore, I will focus first on the differences:
  • Topology — the supported topologies are folded Clos networks. Similar to fat trees, these provide many paths between racks and, as in PortLand, the authors propose routing flows randomly over the paths. Both papers examine three-stage topologies, which would support approximately the same number of end-hosts given the number of ports each switch has, though the fat tree topology examined would use considerably more switches. To achieve similar performance, VL2 assumes that switch-to-switch bandwidths are substantially higher than end-host-to-switch bandwidths. The fat tree topology is also considerably more regular, so where PortLand relies on location-specifying MAC addresses and explicit failure notifications, the VL2 scheme runs a link-state protocol between switches (but does not use it to export end-host addresses).
  • Layer choice — VL2 uses location-specifying IP addresses, providing a mapping from location-agnostic (relocatable) IP addresses to location-specific ones, analogous to PortLand's AMAC-to-PMAC mapping. Using IP addresses collapses the two mappings PortLand maintained (AMAC to PMAC and PMAC to IP) into the single AA-to-LA mapping VL2 maintains. The disadvantage, of course, is that servers need extra support to use this mapping (doing IP-in-IP encapsulation; see the sketch after this list) rather than it being backwards-compatible the way using the ARP cache is.
  • The directory servers — in this paper, they are responsible only for the AA-to-LA mappings (no failure matrix), and replication/distribution of the directory servers is actually implemented. Also, the directory servers are queried by end-hosts rather than by switches.
  • Autoconfiguration story — None seems apparent in this paper.
  • Hardware required — The system in this paper runs on commodity switching ASICs, relying mostly on existing switch features rather than custom switching software.
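
Concretely, here is a toy sketch of the end-host agent role that the AA-to-LA mapping implies: look up the location-specific address for an application address at the directory, cache it, and tunnel the packet. The directory API, header fields, and invalidation hook are illustrative assumptions, not VL2's actual interfaces.

    class VL2Agent:
        def __init__(self, directory_lookup):
            self.lookup = directory_lookup      # AA -> LA of destination's switch
            self.cache = {}

        def send(self, aa_dst, payload, transmit):
            la = self.cache.get(aa_dst)
            if la is None:
                la = self.lookup(aa_dst)        # query a directory server
                self.cache[aa_dst] = la
            # Encapsulate: outer header addressed to the LA, inner header keeps
            # the AA, so the AA can move without renumbering while the core
            # routes only on LAs.
            transmit({"outer_dst": la, "inner_dst": aa_dst, "data": payload})

        def invalidate(self, aa_dst):
            # Directory update (e.g. VM migration): drop the stale mapping.
            self.cache.pop(aa_dst, None)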

The authors use some traces (presumably from MS data centers) and show that flow size varies a lot, traffic patterns are not very predictable, and small failures are frequent. They use these traces to motivate their choice of random load balancing rather than assuming a smarter scheme exists. For their prototype evaluation they measure utilization and fairness under an all-to-all TCP scenario on their 75-node testbed (apparently with near-commodity hardware) and measure interference between two services (with and without large numbers of TCP flows (‘mice’) in one service).

Probably the best feature of this paper is the authors' extensive reuse of existing protocol implementations (OSPF, ECMP, etc.). This means that their fabric is probably more readily deployable and more hardware-agnostic than PortLand. It is also a disadvantage: they are not taking advantage of the relatively predictable topology they demand for routing, and many more pieces of hardware (each end-host) need to be reconfigured than under the PortLand scheme, which is truly invisible to the end-hosts.

Saturday, September 19, 2009

PortLand

This paper describes a layer-2 switch fabric targeted at datacenters. Scalability and the obvious simple correctness goals apply (loop-free forwarding, fault-tolerance) as well as ease of configuration and allowing IP addresses to move around in the datacenter.
To make this task simpler (that is, with less per-switch state) the authors fix the datacenter's topology as a fat tree and designate a centralized ‘fabric manager’ (which must be replicated for fault tolerance, of course) which coordinates between switches.

As with SEATTLE, the switches in this scheme handle all ARP requests themselves in addition to managing routing.
Switch state is kept small by distributing ‘pseudo MAC’ (PMAC) addresses which encode the position of the destination node in the topology; a host's actual MAC address is never distributed to other nodes. The fabric manager controls the IP-to-PMAC mapping, allowing switches to respond to ARP requests and generate gratuitous ARPs if a PMAC becomes invalid.
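
A toy encoding of such a location-specifying pseudo MAC, packing the host's position in the fat tree into 48 bits. I believe PortLand uses a pod/position/port/vmid split roughly along these lines, but the exact field widths here should be treated as an assumption.

    def make_pmac(pod, position, port, vmid):
        assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
        return (pod << 32) | (position << 24) | (port << 16) | vmid

    def parse_pmac(pmac):
        return {"pod": pmac >> 32,
                "position": (pmac >> 24) & 0xff,
                "port": (pmac >> 16) & 0xff,
                "vmid": pmac & 0xffff}

    # A core switch can forward on the pod field alone; an aggregation switch
    # on (pod, position); only the edge switch needs the full mapping back to
    # the host's real MAC.
    print(parse_pmac(make_pmac(pod=3, position=1, port=4, vmid=7)))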

Forwarding is kept loop-free by assuming a rooted fat tree and by following the rule that a packet changes from going up toward the root to down toward a node exactly once. For load balancing, switches can choose freely between upstream ports (the authors suggest switching on flows to keep each flow single-path); for fault tolerance, the fabric manager distributes failure information to affected switches, which then update their routing tables accordingly.

Probably the most clever scheme here is switch autoconfiguration. Bottom level switches discover that they are connected to enough non-switches, and the level number of higher-level switches propagates up the fat-tree. Position number assignment is done in a greedy fashion, synchronized by a higher level switch.
Combined with failed link detection, this should make (re)configuration automatic except for the pesky issue of the fabric manager and communication with it.

The authors tested their system on a 16-host prototype, evaluating its response to failures, IP migration, and the overhead it imposed on the fabric manager. Failures resulted in approximately 100 milliseconds of disconnectivity (which was, of course, made worse by TCP timeouts), and in the VM migration scenario, the TCP flow stalled for periods of approximately 200-600 ms. For the fabric manager, they evaluated the overhead of between 25 and 100 ARP misses/sec/host, assumed linear scaling, extrapolated the CPU and network bandwidth use of the fabric manager, and found that the fabric manager would require a lot of CPU resources (70 cores for the 100 ARP/sec case with ~30k hosts).

These numbers are disappointing though the authors point out that these miss rates seem unrealistic and that the fabric manager could probably be distributed to a cluster. Since presumably the authors are pretty confident that ARP miss rates are considerably lower in a real data center, it is disappointing that they didn't try to measure or intelligently estimate the rate at which they would be expected to occur. The linear extrapolation is also disappointing — couldn't the authors have really run their fabric manager on a synthetic trace with these miss rates and host count and measured the CPU usage?

Wednesday, September 16, 2009

Detailed Diagnosis in Enterprise Networks

This paper describes a system (called NetMedic) for identifying the likely location of the root cause of faults from component state/performance traces. Machine-local information (such as CPU and memory utilization), in addition to network information (such as traffic rates and response times), is collected and assigned to “components” which, in the paper's evaluation, correspond to entities such as machines, user processes, or the collective networking infrastructure.

Since the system does not assume any domain knowledge about the measurements it receives, faults are detected based on measurements deviating from their usual values. Diagnosis uses historical correlations to determine whether an anomalous measurement is likely influencing, or influenced by, another. To determine which components' measurements to compare, domain knowledge is used: ‘templates’ mark the dependencies each component type has on other components (e.g. between a machine and all of its processes, or between processes communicating over IP). Given this directed dependency graph and the likelihood of measurements influencing each other, the ‘best’ cause is determined from the geometric means of influence scores on paths from the alleged cause to other components, weighted by the destination components' abnormality, combined with this geometric mean for the component being diagnosed.
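
A heavily simplified sketch of the ranking idea: given per-edge influence scores (derived from historical correlation) on the template dependency graph, score each candidate cause by the strength of the influence paths from it to the affected components, weighted by abnormality, and rank. The scoring details here only loosely imitate the paper's actual formulas.

    import math

    def path_score(graph, path):
        # Geometric mean of edge influence scores along a dependency path.
        edges = [graph[(a, b)] for a, b in zip(path, path[1:])]
        if not edges or any(e <= 0 for e in edges):
            return 0.0
        return math.exp(sum(math.log(e) for e in edges) / len(edges))

    def rank_causes(graph, paths_to_victim, abnormality):
        """paths_to_victim maps candidate -> list of dependency paths from that
        candidate to the affected component; abnormality maps component -> score."""
        ranked = []
        for cause, paths in paths_to_victim.items():
            best = max((path_score(graph, p) for p in paths), default=0.0)
            ranked.append((best * abnormality.get(cause, 0.0), cause))
        return sorted(ranked, reverse=True)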

The authors include an evaluation where they analyze whether their system can correctly diagnose the causes of injected faults. They compare their approach to a simpler approach and a hand-tuned approach; all three share the dependency graph and root-cause ranking method. The simpler method only measures whether components are ‘abnormal’ and computes influence ranks based solely on whether both components are abnormal; the hand-tuned method only measures influence between variables hand-marked as related. The authors find that their system gave the correct cause as the first-ranked cause for almost all the injected faults, and did so substantially better than the simple approach, which was successful about half of the time. The hand-tuned approach did not provide a substantial improvement, showing that the authors had not lost much by avoiding domain-specific knowledge in that part.

One troubling aspect of this work is that some formulas seem to be chosen mostly arbitrarily (e.g. using the geometric mean and sqrt(c->e value * global weighted value) for cause ranking; the choice of 0.1 and 0.8 as ‘default’ edge weights; etc.). Of course, the authors had to make some choice here, but from the results we cannot tell how sensitive correct diagnosis is to these choices — is their technique actually general enough that they mostly don't matter? (The authors did do a sensitivity analysis for one of these parameters — the length of history used for diagnosis.)

There's also the big step away from fully domain-agnostic diagnosis in requiring templates. These templates are fortunately applicable to many kinds of applications, which, given their motivation, is presumably why the authors felt it acceptable to use them while declining similar domain knowledge about the measurements they take. But these templates do not look entirely trivial to make: the ones they show have many subtle dependencies, such as on machine firewalls and other traffic. Are they actually easy to infer, and given that the authors have gone to great effort to develop a method for ignoring irrelevant correlations, how useful are they?

Floodless in SEATTLE

This paper proposes an architecture for managing Ethernet routing over an enterprise-type network. It aims to provide more scalability than Ethernet bridging and easier administration and address portability than IP subnetting and VLAN-based approaches. In this approach, switches use a link-state protocol to route amongst themselves and, using this, form a DHT to store end-host routing information. This DHT distributes the routing information across the switches, so the amount of information stored by each switch is proportional to the average number of end hosts per switch. This information is, of course, not local to the switches which need it, so switches maintain a cache of this information, invalidated (by the destination switch) upon the use of a stale entry. To interact with unmodified end hosts, the authors implement ARP and DHCP backed by information in the DHT. ARP broadcasts that would be used to invalidate cached entries are translated to unicasts to interested hosts by tracking which hosts may have cached entries.
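
A toy sketch of the hash-based lookup this implies: because every switch learns the full switch-level topology from the link-state protocol, all switches can agree on which switch stores a given host's information by hashing onto a consistent-hash ring of switch IDs. The details below are illustrative, not the paper's exact scheme.

    import bisect, hashlib

    def h(value):
        return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

    class Resolver:
        def __init__(self, switch_ids):
            # Ring of (hash, switch) points; all switches compute the same ring
            # from the same link-state view.
            self.ring = sorted((h(s), s) for s in switch_ids)

        def owner(self, host_mac):
            """The switch responsible for storing host_mac's location."""
            keys = [k for k, _ in self.ring]
            i = bisect.bisect(keys, h(host_mac)) % len(self.ring)
            return self.ring[i][1]

    ring = Resolver(["sw1", "sw2", "sw3", "sw4"])
    print(ring.owner("00:16:3e:aa:bb:cc"))   # publish and lookup both go here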

The authors evaluated their design using a trace-driven simulation on several real network topologies and with an actual implementation on a very small network.
From the simulations, they showed that the information maintained per switch was actually smaller than in Ethernet bridging-based approaches and that it achieved better routes (given similar cache sizes) than past identity-based routing proposals. From the actual implementation, they measured the per-packet overhead, performance when attempting to contact many non-existent hosts, and switch fail-over speed.

It's unfortunate that this paper did not include an explicit comparison to IP subnetting or VLANs, by which it is clearly motivated. It is understandable why, of course; their arguments against IP subnetting are based on ease of configuration which is very difficult to measure. For VLANs, only some of their complaints were about ease of configuration, but for the others it might be hard to infer a suitable configuration from their topology information.

This scheme is attractive for its careful attention to complete transparency for end-hosts, which should make it relatively easy to deploy. The big problem is that deploying it in the most straightforward way involves changing all switches over; it would seem, however, that there's probably some way of constructing a reasonable middle-ground where it is deployed gradually across the network.

Monday, September 14, 2009

Random Early Detection Gateways

This paper describes a queue discipline that gives better congestion signals to TCP flows than Drop Tail and slight refinements of Drop Tail like Early Random Drop. The authors observe that a problem with Drop Tail and Early Random Drop is that they both only act when the queue is near-full, thereby discriminating against bursty flows and causing global synchronization. To avoid this, this scheme (RED) tracks a long-term average queue size and ‘marks’ (drops, for TCP) packets according to this size such that marks are distributed approximately evenly over time. The authors evaluate their scheme with multiple TCP flows with different round-trip times going over a bottleneck link, and show that it avoids global synchronization and achieves fairness for bursty flows (in their example, TCP flows over a path whose bandwidth-delay product is high relative to their window size).
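
A toy sketch of the marking rule: keep an exponentially weighted average of the queue length and mark arriving packets with a probability that rises between two thresholds, so marks are spread out in time rather than arriving in a burst when the queue overflows. The parameter values are illustrative, and the count-based spreading of marks from the full algorithm is omitted.

    import random

    class REDQueue:
        def __init__(self, w=0.002, min_th=5, max_th=15, max_p=0.02):
            self.w, self.min_th, self.max_th, self.max_p = w, min_th, max_th, max_p
            self.avg = 0.0

        def on_arrival(self, queue_len):
            """Returns True if this packet should be marked (dropped, for TCP)."""
            self.avg = (1 - self.w) * self.avg + self.w * queue_len
            if self.avg < self.min_th:
                return False
            if self.avg >= self.max_th:
                return True
            p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
            return random.random() < p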

The oddest omission from this paper to me was packet sizes: The algorithm and the simulations work in terms of whole packets apparently without considering the size of packets. This is particularly odd given that one might expect bursty interactive flows to have variable packet sizes.

This paper does remarkably simplify its problem by assuming that all flows run a congestion control algorithm, as this scheme does not converge to fairness in the face of loss-agnostic flows (which the authors did not test). The authors propose sampling dropped packets, showing that a window of dropped packets is likely to contain close to the expected number (based on actual rates) of dropped packets from each flow; however, the authors don't evaluate this scheme beyond this theoretical observation.

Congestion Control for High Bandwidth-Delay Product Networks

This paper describes a protocol (called XCP) for congestion control that utilizes explicit signaling of the degree of congestion by routers. To replace the dropped-packet signal, senders include an explicit ‘feedback’ field indicating their desired window size adjustment upon receiving an ACK for a packet. Routers modify this value (guided by the sender's current window size and RTT estimate, which the sender must also include) according to both fairness and efficiency criteria. To adjust for efficiency, routers estimate the difference between incoming and the maximum outgoing bandwidth and ensure that provided feedback values add up to this value over the measurement interval. To provide fairness, the additive-increase/multiplicative-decrease scheme of TCP is used to set the feedbacks, with the overall amount of increase or decrease chosen to match the measured bandwidth deficit or surplus. When the bandwidth deficit or surplus is almost 0 but fairness has not been achieved, feedback is still set for some portion of traffic to ensure convergence to fairness.
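
A simplified sketch of the router-side computation: estimate the spare bandwidth and persistent queue over a control interval, turn that into an aggregate window adjustment, and hand pieces of it out in packet headers. The real protocol apportions the aggregate AIMD-style using each packet's advertised cwnd and RTT; here it is split evenly, and the gain constants should be treated as illustrative.

    ALPHA, BETA = 0.4, 0.226

    def aggregate_feedback(avg_rtt, capacity_bps, input_rate_bps, persistent_queue_bytes):
        spare = capacity_bps - input_rate_bps           # bits/sec of headroom
        return ALPHA * avg_rtt * spare - BETA * persistent_queue_bytes * 8

    def per_packet_feedback(aggregate_bits, packets_this_interval):
        # Toy equal split; positive values tell senders to grow their windows,
        # negative values tell them to shrink.
        if packets_this_interval == 0:
            return 0.0
        return aggregate_bits / packets_this_interval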

The authors measured their scheme in simulation in a number of circumstances. Their simulations showed that it outperformed TCP with any of several queuing disciplines in terms of speed of convergence to near-full utilization, fairness, and behavior with competing flows. (One notable exception was that their "shuffling" scheme to ensure convergence to fairness would cause notable loss of throughput when a router downstream of a bottleneck tried to adjust bandwidth upwards to provide it a better share.) Additionally, the authors' simulations showed that their new protocol could be made to not discriminate against TCP flows.

The biggest issue with this protocol is the difficulty of implementation. To implement this scheme, routers will need to track per-flow state, and, to secure it, would need to occasionally “test” flows to ensure that they are not cheating. The authors do propose a scheme similar to Core-Stateless Fair Queuing to isolate the changes of XCP to an administrative domain, but besides reducing the security issue and the number of potential flows to track to a function of the number of edge routers, this scheme does not seem to reduce the per-router overhead. And though the scheme to multiplex TCP flows alongside XCP flows helps, even the core/edge strategy for XCP (which the authors did not simulate) appears ill-suited for incremental deployment.

From a security standpoint, I find it somewhat disturbing that the ‘check if the flow responds’ scheme relies upon the sender-provided RTT estimate. I guess, however, that reasonable capping of this RTT would still yield an improvement over the response time of such ‘enforcement’ schemes on traditional TCP.

Wednesday, September 9, 2009

Core-Stateless Fair Queuing

This paper describes a method to implement fair queuing across a network without requiring exceptional state tracking from most routers. Rather than computing a priority for packets, the primary mechanism used to control sending rates is probabilistic dropping based on the rate of the flow the packet belongs to. Computing the flow rate involves tracking substantial per-flow state, but such per-flow state can be tracked only at network edges and then attached to outgoing packets (and modified as necessary within the network). To determine how often to drop, it is also necessary to estimate the fair share deserved by each flow, which is done by comparing the observed overall drop and arrival rates at each router.
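
A toy sketch of the core-router behavior this describes: each packet carries its flow's arrival rate (stamped at the edge), and the core drops probabilistically so that each flow's accepted rate is capped near the fair share. Estimation of the fair share itself is simplified away here.

    import random

    def drop_probability(flow_rate, fair_share):
        # A flow sending at or below the fair share is never dropped; above it,
        # enough packets are dropped to bring its accepted rate down to the share.
        if flow_rate <= fair_share:
            return 0.0
        return 1.0 - fair_share / flow_rate

    def forward(packet_flow_rate, fair_share):
        return random.random() >= drop_probability(packet_flow_rate, fair_share)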

The authors evaluate this mechanism over several simulated topologies with variously behaved competing flows. They show that the performance of CSFQ is often competitive with a full per-flow-state implementation of fair queuing (Deficit Round Robin). Its absolute performance is usually slightly worse than that reference (with the exception of performing better in cases where flows may suffer from not being able to temporarily exceed their "fair" buffer share to accommodate burstiness).

CSFQ has some parameters which seem to be set arbitrarily and some for which it is not clear how they should be set in general. Arbitrary parameters include thresholds used to adjust for buffering effects. The exponential smoothing constants the authors recommend setting substantially larger than the delay jitter but shorter than the average duration of a flow. Since things like web pages are in fact short-lived flows, the average flow duration is probably relatively short; how should these parameters be set when flows may traverse the globe (with presumably high delay jitter) before exiting the network, or traverse only to the local metropolis's peering point (with hopefully very low delay jitter)?

Analysis and Simulation of a Fair Queueing Algorithm

This article constructs and analyzes a method for multiple flows to share a bottleneck link fairly. The intuition derives from the intuitively fair algorithm when bits and parts of bits can be interleaved arbitrarily: if there are N flows and we can send one bit per unit of time, then every unit of time we want to send 1/N bits from each flow. Since packets must be kept together across the link, we can't actually do this, but the same effect can be achieved by computing the packet finishing times we would have if we could do this and using these as priorities for sending packets (ignoring effects from ties and small packets that arrive while larger packets are being sent). To provide better delay to flows not using their full share of bandwidth the authors also give extra priority to new packets from flows which were recently inactive (by effectively backdating the packets to fill the gap up to some threshold).
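
A toy sketch of the finish-tag idea: give each arriving packet the finish time it would have under bit-by-bit round robin and always transmit the packet with the smallest tag. For simplicity, the virtual clock below is advanced self-clocked-style (to the tag of the packet being served) rather than by the paper's exact round-number emulation, and all flows get equal weight.

    import heapq

    class FairQueue:
        def __init__(self):
            self.vtime = 0.0
            self.last_finish = {}     # flow -> finish tag of its latest packet
            self.heap = []            # (finish tag, flow, length)

        def enqueue(self, flow, length):
            start = max(self.last_finish.get(flow, 0.0), self.vtime)
            finish = start + length                 # equal weights for all flows
            self.last_finish[flow] = finish
            heapq.heappush(self.heap, (finish, flow, length))

        def dequeue(self):
            finish, flow, length = heapq.heappop(self.heap)
            self.vtime = finish                     # advance the virtual clock
            return flow, length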

The authors evaluated their fair queuing algorithm (in simulation) against first-come first-served and DECbit (which gives explicit congestion notifications to flows over their fair share), running against continuous ("FTP") and intermittent ("telnet") flow-controlled streams (both traditional TCP-style and with explicit congestion notification) and against misbehaved flows. In contrast to the other algorithms, their fair queuing algorithm provides well-behaved flows good performance in the presence of an ill-behaved flow and severely punishes the ill-behaved flow (which they assume to be loss-sensitive). Even in the normal cases, they show that they provide lower latency to the intermittent flow (rewarding it for using less than its fair share of bandwidth) than the DECbit scheme does.

The incidental punishment of ill-behaved flows is strange; as the authors note, it is a result of charging flows for packets that cannot be queued, but it would be nicer to make the degree of punishment more explicit. (Surely, well-behaved hosts will send somewhat above their fair share while adjusting or due to network perturbation, and should not necessarily lose throughput for it.)

Sunday, September 6, 2009

Analysis of the Increase and Decrease Algorithms for Congestion Avoidance in Computer Networks

This article is concerned with congestion avoidance algorithms which utilize a single bit signal that the network is operating above or below its desired level of load for coordination between nodes on the network. In the model of the paper, time is divided into discrete intervals, and each user of the network sets its load for the next interval based on feedback generated by the previous interval.

Primarily evaluated are algorithms where, at each time t, the new load is a(signal) + b(signal)*(old load), where signal is the single-bit indicator of whether the network was over the desired load at time t-1. The appropriate values of a and b for each case are analyzed in light of convergence to both the desired level of load and the desired fairness. The main result is that additive increase/multiplicative decrease (i.e. a(underloaded) > 0, b(underloaded) = 1, a(overloaded) = 0, 0 < b(overloaded) < 1) is the combination of a and b functions which achieves convergence on both criteria. With respect to the rate of convergence, there is a tradeoff between quick convergence and low levels of oscillation around the goal load.
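
As a direct transcription of that linear control with the additive-increase/multiplicative-decrease choice of coefficients, here is a small sketch; the specific constants are illustrative, not values from the paper.

    def next_load(old_load, overloaded):
        if overloaded:
            # a(overloaded) = 0, 0 < b(overloaded) < 1: multiplicative decrease
            return 0.875 * old_load
        else:
            # a(underloaded) > 0, b(underloaded) = 1: additive increase
            return old_load + 1.0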

For me, the most interesting observation of the paper is the idealized throughput and response-time curves. TCP-style congestion control has a boolean signal of a sort (a dropped packet) — but where on that throughput/response-time curve does it fall?

Congestion Avoidance and Control

This paper describes congestion control techniques for TCP built around reaching an equilibrium between packets entering and leaving the network and then maintaining that equilibrium:
reaching the equilibrium
  • slow start — ramp up the sending rate slowly so that the correct sending rate won't be overshot. Prior systems would not recover from a mismatch between their local network's rate and the least capable link to the end host because they would continuously overshoot.
  • linear window size increase — be conservative about increasing the amount of data in flight, to avoid massively overdriving the network, from which it is hard to recover.
not retransmitting until the packet is out of the network
  • RTT variance estimate — because of queues, etc., round-trip time varies a lot in the network. This must be taken into account, which the authors do by creating an estimator of the RTT variance (a sketch of such an estimator follows this list).
backing off after retransmission enough to avoid more losses
  • exponential backoff — motivated by some theoretical results (about what sort of backoff is required for stability) and a desire to be as general as possible about which network topologies they support, the authors choose exponential backoff.
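
A toy version of the mean/variance retransmit-timer estimator referred to above: track a smoothed RTT and a smoothed mean deviation, and set the timeout a few deviations above the mean so that queueing-induced RTT swings do not trigger spurious retransmissions. The 1/8 and 1/4 gains follow the usual choices and should be treated as illustrative here.

    def update_rtt(srtt, rttvar, measurement, g=0.125, h=0.25):
        err = measurement - srtt
        srtt += g * err                      # smoothed round-trip time
        rttvar += h * (abs(err) - rttvar)    # smoothed mean deviation
        rto = srtt + 4 * rttvar              # retransmission timeout
        return srtt, rttvar, rto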

Despite the authors' attempt at generality in these algorithms (as they note, there were TCP implementations for systems ranging from 56 Kbit/s to hundreds of Mbit/s), there are some interesting assumptions about the environment. The most glaring one is that losses due to damage are rare, which seems a strange assumption when one is considering microwave links, etc. It is hard to see, though, how this sort of congestion control (only at the ends and where loss types are not distinguishable) could function very well without assuming that at least most losses are due to congestion.

Wednesday, September 2, 2009

Understanding BGP Misconfiguration

This is a study of BGP misconfigurations inferred from changes in route announcements. The authors choose short-lived route announcements as candidate misconfigurations, then guessed the type of ‘misconfiguration’ from the AS graph (assuming the usual provider/peer/customer relationships only) combined with an e-mail survey. They found that misconfigurations were relatively common, affecting as much as 1% of route prefixes in their relatively short survey. A majority of misconfigurations were apparently due to poor usability of routers — either through apparent configuration file slips or through issues relating to recovering old routes after reboot. Some of the misconfigurations were a surprise to operators, notably those caused by misplaced assumptions about their provider's filtering and some from router initialization bugs, and of course, some were false positives from Internet routing arrangements not being fully described by their model of inter-network business relationships.

Perhaps most remarkable about this paper is that most of these problems do not adversely affect connectivity. In this sense, BGP appears to have partially achieved the old goal of survivability, if only by accident. (Indeed, it seems to perform better than the registries' databases at providing information about how to reach networks.) The most serious effect seems to come from the added load these background misconfigurations place on the system and the more serious possibilities they reflect. They are reminders that configuration is not very automated or well understood, so future globally disruptive network events still seem quite possible.

Interdomain Internet Routing

This article primarily describes how inter-network routing works over the present Internet. BGP, the protocol that supports such routing, is relatively simple (at least at network edges), consisting simply of annotated offers, or withdrawals of offers, to route to some set of addresses. The annotations provided include information about the overall path of the route, which is used to prevent loops and help make routing decisions, but routing decisions are heavily influenced by other policies.

Much of the interesting part of inter-network routing, namely how these routing decisions are made and which offers are made or withdrawn, is not part of BGP. Typically, these policies are determined by the economic relationship between the two networks. A provider wants to maximize utilization of links that make it money, so it has an incentive to advertise any available routes using such links. When deciding between multiple routes to the same destination, a provider will similarly be motivated to configure its software to prefer routes using links for which it is paid, and may then decide based on traditional “technical” criteria like path length. (Of course, explicit contractual arrangements also influence these decisions.)
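
A rough sketch of that kind of route-preference policy, where the business preference (local preference) is consulted before any "technical" tie-breakers like AS-path length. The ordering follows the commonly described BGP decision process, trimmed and simplified; the attribute names and example routes are illustrative.

    def route_preference_key(route):
        return (
            -route["local_pref"],        # business preference first (e.g. customer routes)
            len(route["as_path"]),       # then shorter AS paths
            route.get("med", 0),         # then lower MED, where comparable
            route["igp_cost"],           # then the closest exit ("hot potato")
        )

    def best_route(routes):
        return min(routes, key=route_preference_key)

    candidates = [
        {"local_pref": 200, "as_path": ["AS7", "AS3"], "med": 0, "igp_cost": 10},
        {"local_pref": 100, "as_path": ["AS2"],        "med": 0, "igp_cost": 1},
    ]
    print(best_route(candidates))   # the higher-preference route wins despite the longer path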

BGP does seem to have been designed for a smaller, friendlier Internet. The state size and its rate of change are not trivial and are likely to get worse as more networks sensibly want their own or a backup Internet connection. As administered, there is a lot of trust, revealed through the poor filtering illustrated by hijacked address blocks and the general lack of route authentication in the protocol. And, despite its flexibility for route announcers to choose their policy, some policies seem hard to enforce technically, such as preferences for peers to use their own networks for cross-country/sea transit. (Presumably, this could be prevented by only advertising the routes through the preferred entry point, except that this would break reachability during glitches on the peer's network, so it is easier to contractually require honoring MED when needed?)

Monday, August 31, 2009

The Design Philosophy of the DARPA Internet Protocols

This article gives a design rationale for the DARPA Internet protocols. The primary motivation derives from connecting a variety of separately administered networks and supporting a variety of applications on top of them. The targeted applications tended to be non-jitter-sensitive ones with varying bandwidth requirements, so packet switching was chosen over circuit switching. Another important goal was tolerance of failures; fault tolerance, combined with separate administration and the least common denominator of a variety of networks, led to the datagram model: nothing was to be assumed of the intermediate hosts in a connection, so all the state of a connection would be kept at its ends. The IP layer, then, provides exactly this service, on top of which end-user services would be built with modifications only at the end points (including the initial example of TCP).

Probably the biggest tradeoff mentioned in the article is one of end-host complexity. Because the network stack is so minimalist and end hosts must tolerate fail-stop failures of the network between them, implementing the protocols on the end hosts is not easy (compared to, e.g., network architectures that explicitly signal failures or flow-control limits).

Somehow, despite placing high importance on resilience against explicit attack, the design considerations mentioned fail to account for the presence of an adversary on the network who can insert non-fail-stop failures. The article recognizes this as a loss of robustness:

In this respect, the goal of robustness, which lead to the method of fate-sharing, which led to host-resident algorithms, contributes to a loss of robustness if the host misbehaves.
but fails to mention the apparent design assumption that the separately administered hosts are reasonably well-behaved.

It is interesting that the article describes the lack of router state-carrying as a robustness feature. At our current scale, it seems more like a scalability feature. Though the article talks about performance considerations (particularly the difficulty of specifying them to defense contractors) and mentions network-performance-sensitive activities like teleconferencing, performance generally and scalability in particular did not appear to be major goals. (The closest among the enumerated goals is “[permitting] host attachment with a low level of effort”, which was observed to not be well-achieved because of the protocol reimplementation effort.)

The End-to-End Argument

Caricature of argument:
Detecting failures at every step is too much work! Just try the whole thing and see if it worked.
Justifications:
duplicated work
Because there are possible failures at multiple levels which look the same ("duplicate" message, corrupted data, requests that were not processed, etc.), checks should happen at higher layers anyways.
match the goal
By checking for the high-level guarantees explicitly, we may detect failures in components we wouldn't have thought to verify “for free” (e.g. router buffers in the network).
more moving parts
More integrity checking (e.g. hop-by-hop) means more code to find bugs in.
avoid excess functionality
Not every high-level task needs feature X (e.g. duplicate suppression), so always providing it at the lower level will be wasted effort.

We love abstraction; we just give our problem to the next layer down and never think about it again… The end-to-end argument derives from the experience that this strategy fails because the lower layers do not have knowledge of the overall application, and in practice the feature is — or should have been — reimplemented in a higher layer.

The examples in the paper, then, are failures of abstraction. In some cases (the file transfer relying on the transport protocol for integrity, relying on the communication subsystem for encryption), it may really make sense to move the action to the application layer. In other cases, there is still hope for modularity: it's just that the lower layer is providing the wrong interface. Ack'ing after a message is processed, considering messages duplicates based on some part of their content, ordering messages based upon vector clocks instead of a stream's order, etc. — convenient abstractions exist for these.

One should not be surprised that a layer that solves a weaker problem is not, in fact, useful for solving the stronger problem. The more compelling cases come when there is no substitute layer without substantial application intrusion that would solve the real problem, as in the end-to-end integrity/fault tolerance cases.