Monday, November 23, 2009

BotGraph: Large Scale Spamming Botnet Detection

The paper describes a log analysis technique intended to detect botnet-generated e-mail spamming through a commercial e-mail service. It assumes that botnets can be detected either when they are generating accounts, because they create accounts at an unusually high rate, or through the unusual relationships between the multiple sending accounts a botnet maintains.

To detect the unusually high rate of account generation, the authors propose tracking the exponentially weighted moving average (EWMA) of the signup rate at each IP address. Because legitimate signup rates are typically very low, the authors argue that mass signups are easy to detect. It does seem that a clever botnet could ramp up its account generation rate slowly to evade these long-term averaging techniques (the bots would try to appear to be a popular proxy, etc.). The authors argue that such techniques would be ineffective due to the natural churn of the botnet's compromised hosts, but I do not find this convincing: at least in office environments, many desktop machines are certainly not turned off routinely, and legitimate signup patterns can be expected to show the same diurnal and weekly variations that machine uptime does.
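
For concreteness, here is a minimal sketch (with an assumed smoothing factor and alert threshold, not the paper's parameters) of the per-IP EWMA signup-rate tracking described above:

from collections import defaultdict

ALPHA = 0.1           # weight given to the newest observation (assumed)
ALERT_THRESHOLD = 20  # signups/day considered suspicious (assumed)

ewma_signup_rate = defaultdict(float)  # per-IP smoothed signup rate

def record_signups(ip: str, signups_today: int) -> bool:
    """Update the EWMA for `ip` and return True if it looks like mass signup."""
    prev = ewma_signup_rate[ip]
    ewma_signup_rate[ip] = ALPHA * signups_today + (1 - ALPHA) * prev
    return ewma_signup_rate[ip] > ALERT_THRESHOLD

# Example: a normally quiet IP suddenly creating many accounts.
for day, count in enumerate([1, 0, 2, 150, 300]):
    if record_signups("198.51.100.7", count):
        print(f"day {day}: IP flagged, EWMA={ewma_signup_rate['198.51.100.7']:.1f}")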

Most of this paper is devoted to defending against attacks which do not rely on aggressive creation of new users. The authors instead use the correlation between accounts and the hosts which use them. The botnets the authors observe apparently load-balance their accounts across many hosts over time and use many accounts from any particular compromised host. The authors therefore relate any two users by the autonomous systems (ASes) from which both have logged in. Using the number of shared ASes as edge weights and choosing a threshold induces a graph on the users, which can be used to cluster them into candidate bot groups. These groups can then be analyzed collectively for spammy behavior (because effective spamming requires the accounts as a whole to have a high send rate per account, given the effectiveness of signup rate limiting).
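
A small sketch of that graph construction (a toy, not the paper's large-scale distributed implementation; the edge threshold is an assumed value):

from collections import defaultdict
from itertools import combinations

EDGE_THRESHOLD = 2   # minimum shared ASes to keep an edge (assumed value)

def bot_groups(logins):
    """logins: iterable of (user, asn) pairs observed in the login log."""
    user_asns = defaultdict(set)
    for user, asn in logins:
        user_asns[user].add(asn)

    # Union-find over users connected by a sufficiently heavy edge.
    parent = {u: u for u in user_asns}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in combinations(user_asns, 2):
        if len(user_asns[u] & user_asns[v]) >= EDGE_THRESHOLD:
            parent[find(u)] = find(v)

    groups = defaultdict(set)
    for u in user_asns:
        groups[find(u)].add(u)
    return [g for g in groups.values() if len(g) > 1]

# Users a and b share two ASes (a candidate bot group); c is unrelated.
print(bot_groups([("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 7)]))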

An obvious attack on this scheme, mentioned in the discussion of the paper, is for botnets not to reuse accounts across ASes. The authors argue for a countermeasure of grouping by IPs or some other finer granularity to continue to catch those botnets. As the authors argue, spammers lose something real when they effectively cannot reuse their accounts between arbitrary pairs of machines, which makes this a worthwhile technique in spite of the relative ease of evading detection. The proposed future work of also correlating users by e-mail metadata and content would likely be similarly worthwhile, because an inability to reuse domains or to arbitrarily distribute destinations across accounts would also make spamming effectively more expensive.

Not-A-Bot: Improving Service Availability in the Face of Botnet Attacks

This paper proposes distinguishing between human-initiated requests and machine-initiated requests by having a trusted attester generate signatures tying an application-specified action to recent input activity. The attester is isolated from the host operating system by a trusted virtual machine. The integrity of the attester and the virtual machine (and, assuming the VM's correctness, the confidentiality of the attester's signing key) is assured by a TPM (Trusted Platform Module).

For services where botnet abuse is a concern, applications can ask the attester to sign some specified data along with a statement that keyboard or mouse activity occurred close in time to the requested action. The attester avoids generating multiple attestations for the same input activity and thereby effectively rate-limits the sensitive applications to the rate of real user activity on the machine. In the normal use case, the application sends these attestations to the service, which verifies them and adjusts its behavior; for example, it could prioritize attested requests over non-attested ones.
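
A highly simplified sketch of this behavior (hypothetical names; an HMAC stands in for the TPM-rooted signing key described in the paper):

import hmac, hashlib, os, time

ATTESTER_KEY = os.urandom(32)   # stands in for the TPM-protected key
INPUT_FRESHNESS = 1.0           # seconds of allowed input staleness (assumed)

last_input_time = 0.0
last_attested_input = None

def on_user_input():
    global last_input_time
    last_input_time = time.time()

def attest(action_digest: bytes):
    """Return (nonce, signature) only if recent, unused input activity exists."""
    global last_attested_input
    now = time.time()
    if now - last_input_time > INPUT_FRESHNESS:
        return None                         # machine looks idle: refuse
    if last_attested_input == last_input_time:
        return None                         # this input event already used
    last_attested_input = last_input_time
    nonce = os.urandom(16)
    sig = hmac.new(ATTESTER_KEY, nonce + action_digest, hashlib.sha256).digest()
    return nonce, sig

on_user_input()
print(attest(hashlib.sha256(b"Date: Mon, 23 Nov 2009").digest()) is not None)  # True
print(attest(hashlib.sha256(b"second message").digest()))                      # None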

Botnets attempting to create false attestations would, of course, be able to do so on non-idle machines, but the authors argue the rate at which they could do so would be seriously limited. One danger is that attestations in this system can be stored indefinitely: they contain a nonce (generated by the attester to prevent trivial replays) but no timestamp unless one is added. The authors assume that users of attestations will encode the time in the signed material, for example by signing the Date header of any sent e-mail or by using URLs that encode a service-generated nonce or timestamp. Services must also maintain a nonce cache to prevent replay, which may be quite large for high-latency services like e-mail. The authors predict that even for large botnets, the victim machines will not have user activity often enough to produce attested spam, DDoS, or click-fraud traffic at anywhere near the rate they are capable of today.

Wednesday, November 18, 2009

Cutting the Electric Bill for Internet-Scale Systems

Based on the observation that energy prices differ around the world, and in different ways at different times, this paper proposes redirecting traffic to use energy at the cheapest (sufficiently close) site. The price differences the paper exploits are not just diurnal and seasonal ones but hourly ones from the electricity spot market, which show substantially larger fluctuations and cannot be reliably planned for hours in advance.

The paper uses traces of both these electricity spot prices and Akamai's workload to evaluate the potential savings from moving computation where energy is cheaper. Several practical concerns with moving workload were modeled: keeping work close to users for latency reasons, not dramatically increasing bandwidth costs, and the reality that datacenters are not power-proportional. Under somewhat optimistic scenarios, electricity cost reductions of more than 30% appeared achievable with this type of optimization.
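
As a toy illustration of the routing policy being evaluated, here is a sketch that picks the cheapest site within a distance budget; the sites, prices, and bound are made up, not taken from the paper's traces:

SITES = {                # site -> (spot price $/MWh, distance to client in miles)
    "virginia":   (42.0, 300),
    "texas":      (28.0, 1200),
    "california": (35.0, 900),
}
MAX_DISTANCE_MILES = 1000   # "sufficiently close" bound (assumed)

def pick_site(sites, max_distance):
    eligible = {s: p for s, (p, d) in sites.items() if d <= max_distance}
    if not eligible:
        return min(sites, key=lambda s: sites[s][1])  # fall back to the closest site
    return min(eligible, key=eligible.get)            # cheapest among close-enough sites

print(pick_site(SITES, MAX_DISTANCE_MILES))  # -> "california"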

The simulation evaluation was very thorough and used a real workload (which is quite hard to come by). The real workload used here does not, however, sound like the kind one would want to prove this system with: Akamai's customers are presumably paying in large part for Akamai's closeness to end users, yet increased distance is just about the only compromise this scheme cannot avoid. Surprisingly, the US electricity market seems dynamic enough that even modest compromises (e.g., staying within 1000 miles) can provide substantial benefit.

Skilled in the Art of Being Idle: Reducing Energy Waste in Networked Systems

This paper evaluates the viability of putting computers to sleep while avoiding the disruption caused by the machine being off the network. Current platforms have ‘wake-on-LAN’ functionality, enabling machines to be woken from a low-power state by a ‘magic packet’. A simple proxy could examine traffic destined for a sleeping machine and inject the magic packet when traffic arrives to which the machine needs to respond. A more sophisticated proxy could respond on the machine's behalf to simple but important queries. Using network traces from desktop machines in both home and office environments, this paper evaluates the power savings likely to be achieved with various levels of proxy sophistication.
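
For concreteness, the ‘magic packet’ itself is simple enough to sketch: six 0xFF bytes followed by the target's MAC address repeated sixteen times, typically sent as a UDP broadcast (the port below is conventional, not mandated):

import socket

def wake(mac: str, broadcast_addr: str = "255.255.255.255") -> None:
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    packet = b"\xff" * 6 + mac_bytes * 16          # magic packet payload
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(packet, (broadcast_addr, 9))

# wake("00:11:22:33:44:55")   # would wake a sleeping host on the local subnet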

The traces reveal a lot of background noise on these networks. Some of this traffic is broadcast at the Ethernet level but not actually logically directed to the end host (for example, IPX traffic, routing backup traffic, and ARP requests for other hosts' IPs). Some is for protocols that require periodic maintenance traffic to maintain liveness, e.g. ARP (when the host is queried), IGMP (if the host belongs to IP multicast groups), DHCP, TCP keepalives, and SSDP. Such traffic was frequent enough in the authors' traces that a proxy which ignored it would not allow end hosts to sleep for any substantial amount of time without disrupting service. Making simple fill-in-the-blank replies for a handful of these protocols and conservatively waking machines up for anything else is all that would be required to let machines stay asleep for most of their idle periods without sacrificing availability. Problematically, all it takes to disrupt this result is one misconfigured service (for example, the machine-update service the authors observed in the enterprise environment).

Scalable Application Layer Multicast

In contrast to the '95 paper, this paper, written in the days of a considerably larger Internet, is very concerned with limiting the state stored on each node as the network grows. Probably also motivated by the realities of the Internet, this paper does not assume the infrastructure (the routers) will provide multicast services. Instead, end nodes are drafted into service as internal forwarding nodes in the multicast tree. The paper is devoted primarily to selecting these internal nodes and their neighbors, even in the face of failures and newly joining nodes.

The multicast tree maintained in this protocol has a strictly layered structure. Within each layer, nodes are partitioned into ‘clusters’. Each member of a cluster keeps track of all the other members of its cluster and forwards messages it generates to all of them. For forwarding messages into and out of a cluster, each cluster elects a ‘leader’, which is required to be a member of some cluster in the next higher layer. To allow recovery from the loss of a leader, each member of a cluster also tracks the membership of its leader's cluster in the next layer.

To add a node to this tree, the joining node measures its distance (presumably in terms of latency) to all nodes in the topmost cluster, then to the members of the cluster in the next layer down led by the closest of those nodes, and so on. When it reaches an appropriate cluster in the lowest layer, it joins it. When the leader of a cluster later detects that the cluster has grown beyond the size threshold, the cluster is split (potentially causing this procedure to repeat at a higher layer). Similar delayed detection by cluster leaders is used to initiate the merging of clusters.
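
A rough sketch of this top-down join walk, with assumed in-memory data structures standing in for the paper's protocol messages:

class Node:
    def __init__(self, name, pos):
        self.name, self.pos = name, pos
        self.child_cluster = None   # set if this node leads a cluster one layer down

class Cluster:
    def __init__(self, layer, members):
        self.layer, self.members = layer, members

def latency(a, b):                  # stand-in for real RTT probes
    return abs(a.pos - b.pos)

def join(new_node, top_cluster):
    cluster = top_cluster
    while cluster.layer > 0:
        # Probe members of the current cluster, descend into the closest leader's cluster.
        closest = min(cluster.members, key=lambda m: latency(new_node, m))
        cluster = closest.child_cluster
    cluster.members.append(new_node)        # join the chosen lowest-layer cluster
    return cluster

# Two-layer example: leaders a and b form the top cluster; each leads a
# lowest-layer cluster. A new node near b ends up in b's cluster.
a, b = Node("a", 0), Node("b", 100)
a.child_cluster = Cluster(0, [a, Node("a1", 5)])
b.child_cluster = Cluster(0, [b, Node("b1", 95)])
top = Cluster(1, [a, b])
print(join(Node("new", 90), top) is b.child_cluster)   # True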

To maintain closeness in the network, each node is responsible for periodically probing the alternate cluster leaders (those in the same cluster as its current leader) and switching clusters if it has chosen poorly (or, as was surprisingly not mentioned in the paper, if network conditions have changed).

Tuesday, November 17, 2009

A Reliable Multicast Framework for Light-weight Sessions and Application-Level Framing

This paper describes a technique for providing reliable delivery on top of an unreliable multicast. The principal problem dealt with in the paper is avoiding the explosion of messages that would result from adapting TCP-like loss recovery to this many-receiver world. The first change necessary is switching to negative acknowledgments. To make sure receivers are aware of which messages have been sent, periodic reports of the extent of messages (expressed as sequence numbers for each sender) are multicast to the group by each participant. To keep the size of these reception reports from exploding, they refer only to the particular application-level entity currently being operated on (a whiteboard "page" in the application in the paper).

From this reception information, participants can decide when to send a negative acknowledgment (which the paper calls a "repair request"). To prevent many repair requests from being sent at once (say, due to the failure of a link relatively "early" in the multicast tree), receivers wait a random amount of time proportional to their distance from the original sender before sending the repair request. If, before this timer expires, they see another repair request for the same data, they back off their timer exponentially (and the same strategy governs resending if the timer expires again). Any host which has the data (all hosts that received it are assumed to buffer it) can send a retransmission, using a similar timer mechanism to lower the likelihood of duplicate retransmissions.
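
A small sketch of these suppression timers (the constants C1 and C2 are illustrative; the paper tunes its timer parameters adaptively):

import random

C1, C2 = 2.0, 2.0          # timer constants (assumed values)

class RepairTimer:
    def __init__(self, rtt_to_sender):
        self.rtt = rtt_to_sender
        self.backoff = 1
        self.deadline = self.draw()

    def draw(self):
        # Uniform in [C1*d, (C1+C2)*d], scaled by the current backoff level,
        # where d is the distance estimate to the original sender.
        low, high = C1 * self.rtt, (C1 + C2) * self.rtt
        return self.backoff * random.uniform(low, high)

    def heard_duplicate_request(self):
        """Another host asked for the same data: double the interval and re-draw."""
        self.backoff *= 2
        self.deadline = self.draw()

t = RepairTimer(rtt_to_sender=0.050)
print(f"initial deadline: {t.deadline:.3f}s")
t.heard_duplicate_request()
print(f"after suppression: {t.deadline:.3f}s")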

To select the random backoff parameters, the authors propose a heuristic scheme using feedback from the number of observed repair requests and retransmissions, plus some manually set clamps. The authors evaluated this scheme in simulations over random multicast trees with up to 5000 nodes.

Presumably because it was devised with all participants being "first-class" (equally able to send messages, not distinguished by the structure of the multicast tree, and able to leave or join the group at will), this scheme requires nodes to track and exchange a disappointing amount of state. Each node needs to estimate the round-trip time to every other node in the multicast group, and to make these estimates, each node must send periodic reception reports. Without a means to summarize these reports (besides switching "pages"), the size of each report is proportional to the number of senders. (The authors do propose "local recovery" to limit the scope of retransmissions and associated messages, but this does not reduce the per-node state.) Though this amount of state and messaging is probably manageable for a low-bandwidth stream with thousands of recipients, it does not sound like scaling this scheme to larger data sizes and more recipients will be comfortable.

Thursday, November 12, 2009

X-Trace: A Pervasive Network Tracing Framework

Applications that run on networks inherently rely on the correct functioning of many components. Understanding failures or performance problems in such systems requires understanding how a logical task is performed across these components. X-Trace is a system that tracks each task's inter-component dependencies and thereby allows the components to produce logs that can be reassembled to show the logical execution of a task.

This dependency tracking is done by piggybacking metadata about the previous subtask on each inter-component request, including the root task ID and whether the new subtask is the next subtask or a sub-subtask (which may be performed concurrently with other sub-subtasks). When a component acts on a request, this metadata is attached to the log entries it creates, and the logs are later aggregated by task ID.
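
A toy sketch of this propagation (field names and the report format are assumed; real X-Trace metadata is a compact header carried within each protocol layer):

import os, uuid

REPORTS = []   # stand-in for the report-collection infrastructure

class XTraceMetadata:
    def __init__(self, task_id=None, op_id=None):
        self.task_id = task_id or uuid.uuid4().hex
        self.op_id = op_id or os.urandom(4).hex()

    def _child(self):
        return XTraceMetadata(self.task_id, os.urandom(4).hex())

    def push_next(self, label):
        """Next operation in the same layer (e.g. the next hop)."""
        child = self._child()
        REPORTS.append((self.task_id, self.op_id, child.op_id, "next", label))
        return child

    def push_down(self, label):
        """A sub-operation in a lower layer (possibly one of several)."""
        child = self._child()
        REPORTS.append((self.task_id, self.op_id, child.op_id, "down", label))
        return child

req = XTraceMetadata()
http = req.push_down("HTTP GET")                # the HTTP layer handles the task...
tcp = http.push_down("TCP connect")             # ...which spawns a TCP sub-operation...
backend = http.push_next("HTTP GET to backend") # ...and then a next hop at the same layer
print(len(REPORTS), "reports for task", req.task_id)   # enough to rebuild the task tree offline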

Propagating this metadata and aggregating the logs is relatively cheap, but not free (the authors observed a 15% decrease in throughput in an instrumented Apache). The main cost, however, is clearly the modification of each component of the system: the system's pervasiveness requires its metadata hooks to infect all the components. Ideally, much of this could be automated, since most systems already ‘know’ which request they are acting on at any point, which presumably is what the authors were exploring in their modifications to libasync. Such knowledge does not easily let one automatically choose between ‘next’ and ‘down’, and certainly will not easily capture the non-tree structures the authors argue (describing it as future work) would be more appropriate for some systems.

After proposing tracing all the way down to the link layer, the authors sensibly propose, as future work, limiting what is traced by choosing both which layers to trace and which incoming requests to mark. Of course, we already have tools for diagnosing network difficulties at the level of a network path, so it might be nice (and would certainly require less switch modification) to use X-Trace to find the path through our maze of an overlay network and then compose it with those existing tools. Unfortunately, based on the figures in the paper, X-Trace does not record failures in so straightforward a manner: what humans interpret as a (service-stopping) failure in their typical traces is an omitted node, apparently without a hint as to what that next hop might be, save (non-automated) comparison with the correct case.

Wednesday, November 11, 2009

NetFPGA: A Tool for Network Research and Education

This paper describes a programmable hardware platform designed to allow rapid prototyping of new hardware switch and router designs. The platform is built from components commonly used for research and education outside of the networking space; the second version described in the paper uses a standard PCI bus for control and provides four Gigabit Ethernet ports, so it is practical for researchers to deploy and can handle convincingly large loads.

The research need filled by NetFPGA is an obvious one: most deployed networks work in large part using specialized hardware. Networking researchers typically did not have the means to create or modify hardware routers, so new ideas were tested with software routers. When tests with software routers yielded subpar performance, as they often did, authors would argue that next-generation hardware routers or switches could provide the functionality easily and without any loss of performance. NetFPGA allows these researchers to stop the hand-waving.

Monday, November 9, 2009

A Policy-aware Switching Layer for Data Centers

This paper proposes a switching architecture to ensure that some traffic is always routed through certain middleboxes. Given some collection of middleboxes of various types, the administrator specifies that traffic matching some (source interface, source address/port, destination address/port) pattern must be forwarded through a certain type of middlebox (or routed to its final destination). Chains of middleboxes are specified by using a prior middlebox type as the source interface.
Switches along the paths of packets handle traffic that needs to traverse a middlebox by encapsulating the Ethernet frame in another Ethernet frame destined for the middlebox. The switch adjacent to the chosen middlebox removes this encapsulation, allowing the middlebox to act as if it were on the actual path.
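
As a toy illustration, here is a sketch of the kind of policy table and next-hop decision described above (field names and matching logic are assumed for illustration; the paper's policy-aware switches do this in the data plane):

POLICY = [
    # match on (source interface, destination port) -> next middlebox type (None = deliver)
    {"src_if": "external",      "dst_port": 80, "next": "firewall"},
    {"src_if": "firewall",      "dst_port": 80, "next": "load_balancer"},
    {"src_if": "load_balancer", "dst_port": 80, "next": None},
]

MIDDLEBOX_MAC = {"firewall": "02:00:00:00:00:01",
                 "load_balancer": "02:00:00:00:00:02"}

def next_hop(frame):
    """Return the middlebox MAC to encapsulate toward, or None to deliver normally."""
    for rule in POLICY:
        if rule["src_if"] == frame["src_if"] and rule["dst_port"] == frame["dst_port"]:
            mb = rule["next"]
            return MIDDLEBOX_MAC[mb] if mb else None
    return None

print(next_hop({"src_if": "external", "dst_port": 80}))       # firewall's MAC
print(next_hop({"src_if": "load_balancer", "dst_port": 80}))  # None: deliver to destination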

The primary challenge of implementing this scheme comes with handling changes. To ensure compliance during a policy change without being able to mark packets that have passed through middleboxes, the authors propose dividing a change from policy P1 to P2 into multiple steps. First, each type of middlebox relevant to the policy change is split into two types (say, old and new). Then (after the new-type middleboxes have processed all old traffic), the new policy is installed, with the new type replacing the old type everywhere. Finally, after the new policy has been installed on all switches (and the old-type middleboxes have processed all pending traffic), all middleboxes of the old type are switched to the new type.

The primary disadvantage of this routing scheme is latency. Compared to the traditional approach of placing middleboxes directly on the natural path of the packets, this scheme imposes roughly 0.3 ms of extra latency (based on the authors' testbed). The authors argue that this difference is insignificant given the generally low latency of the environment, but for data-center-internal traffic this may double the latency, which is very likely to substantially affect performance.

Internet Indirection Infrastructure

This paper describes an overlay-based Internet routing service built over the primitives of a DHT. Addresses in the overlay consist of two pieces: one that must match exactly and one that is matched to the closest (longest prefix match) available address when routing messages. Receivers arrange to have an address routed to them by inserting a mapping (called a trigger) from an overlay address to a list of (overlay or IP) addresses. When a message is sent to an overlay address, the matching triggers are found, and for each trigger the message is forwarded to the first functional address, passing along any leftover addresses in the trigger as metadata. Triggers expire unless refreshed periodically and can be removed only by someone who knows the destination part. Conflicts in address assignment are avoided by using a sufficiently large address space.
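
A small sketch of trigger insertion and matching as described above (the 8-byte exact-match split and the byte-string identifiers are assumptions for illustration):

EXACT_LEN = 8          # bytes that must match exactly (assumed split)
triggers = {}          # exact-prefix -> list of (full_id, addr_stack)

def insert_trigger(ident: bytes, addr_stack):
    triggers.setdefault(ident[:EXACT_LEN], []).append((ident, addr_stack))

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(ident: bytes):
    """Return the address stacks of the best-matching triggers for this packet."""
    candidates = triggers.get(ident[:EXACT_LEN], [])
    if not candidates:
        return []
    best = max(common_prefix_len(ident, t_id) for t_id, _ in candidates)
    return [stack for t_id, stack in candidates
            if common_prefix_len(ident, t_id) == best]

insert_trigger(b"mobilkey" + b"\x01" * 8, ["10.0.0.5"])
print(route(b"mobilkey" + b"\x01" * 8))   # -> [['10.0.0.5']]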

The paper envisions that this addressing mechanism can be used to implement many commonly requested routing services:
  • mobility — the mobile host simply maintains a trigger with a permanent address, arranging for it to point to its current real address;
  • multicast — each subscribing host inserts a trigger for the multicast address (or, for large multicast groups, the hosts can arrange for a tree of triggers);
  • anycast — endpoints encode their preferences in the inexactly matched portion of the address and ensure that exactly one trigger is the closest match;
  • service composition — the end host wishing to insert the service inserts a trigger listing the service, then itself; the service agrees to forward the message to the address it receives as metadata.
The implementation of this primitive over a Chord DHT assigns responsibility for routing to an overlay address to the Chord node responsible for the corresponding key in Chord's keyspace. (To allow closest-match routing, all addresses with the same exact-match portion correspond to the same Chord key.) The responsible Chord node forwards all data destined for the overlay address in question. This routing scheme poses some performance concerns. The chosen intermediate server may be a poor choice, routing traffic between relatively close hosts through a far-away or slow node. To avoid this, the authors propose using separate addresses for connection setup (done with a public address) and for the connection itself (which uses a temporary private address); the private address can be chosen to fall into the range a nearby server is responsible for (presumably discovered by querying the addresses of nearby servers in the DHT). Another concern is overloading the DHT node responsible for a popular public address; the authors believe they can address this by caching the associated triggers along the lookup path.
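
A one-line illustration of that mapping: if the Chord key is derived from only the exact-match portion of the identifier, every identifier sharing that portion necessarily lands on the same node (the hash and sizes here are assumptions, reusing the split from the sketch above):

import hashlib

CHORD_BITS = 160
EXACT_LEN = 8   # same assumed exact-match split as above

def chord_key(ident: bytes) -> int:
    # Only the exact-match prefix determines the key, so closest-match
    # candidates all live on the same Chord node.
    return int.from_bytes(hashlib.sha1(ident[:EXACT_LEN]).digest(), "big") % (2 ** CHORD_BITS)

print(chord_key(b"mobilkey" + b"\x01" * 8) == chord_key(b"mobilkey" + b"\xff" * 8))  # True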

Other concerns about this system are ones of security. The most obvious issue is that the flexibility to implement multicast comes with the ability to send massive amounts of duplicated traffic to a victim host. The authors propose sending challenges to a trigger's mapped address to verify that the host at that address actually permits the mapping. For cases where the victim is the infrastructure itself, the authors propose assigning resources based on the end host inserting the trigger and detecting loops using either test packets or a Bloom filter to mark visited hosts.

The authors also address concerns about eavesdropping on or hijacking communications through the triggers. They argue that random private triggers would prevent eavesdropping attacks because a malicious third party would need to guess the private trigger. Similarly, they would prevent hijacking by having the public trigger merely forward to a (hopefully unguessable) private trigger, knowledge of which is required to remove the public one.

I felt that these security concerns would be better addressed with a traditional end-to-end cryptographic solution for hijacking and eavesdropping (perhaps by encoding public keys in the public triggers) and authentication for trigger-removal requests. The authors want i3 to be a relatively configuration-free system distributed across multiple administrative domains; to achieve these goals, almost any party should be able to run an i3 node. Given this, it seems a more viable attack would be to collect triggers by running (perhaps many) i3 nodes; since trigger values are distributed at random among the i3 nodes and forwarded to other i3 nodes for caching, an attacker should be able to learn the values of many private triggers.

Wednesday, November 4, 2009

DNS Performance and the Effectiveness of Caching

This is a study of DNS in the wild, using traces of DNS requests and responses leaving two administrative domains, along with TCP connection and disconnection events. The authors found typical lookup performance to be better than in the '88 paper, but many lookups still took an exceptionally long time from a user's perspective, apparently due to network or configuration problems. About 20% of all lookups got no answer (and no error response), producing both high lookup latency and a probably uncacheable negative result. Around 4% of the unanswered lookups were authority forwarding loops, an obvious misconfiguration. Some popular lookups with negative responses turned out to be common misconfigurations (e.g., asking the root servers to resolve 'loopback' or 'loghost').

The authors of the paper, however, were concerned not so much with misconfiguration as with the effect of caching and TTL settings. They generally found that caching worked well for nameserver records, so most queries required few referrals (each of which costs an extra round trip to a server), substantially decreasing lookup latency. They found that the names requested followed a Zipf-like distribution, and therefore most of the benefit of caching was obtained by caching relatively few names. They also found, based on the TCP traces, that because accesses to the same domain were highly clustered in time, short TTLs (popular for load balancing) did not cause substantial increases in the number of DNS queries or in user-perceived latency.

Development of the Domain Name System

This article describes the design goals and some of the rationale behind the DNS protocol. The system DNS replaced was clearly neither scalable nor efficient: previously, programs would scan a flat file listing hostnames and their corresponding IP addresses, and this file had to be distributed to all end hosts. The replacement was to be distributed, so multiple administrative domains could add hosts, and was to be lightweight and provide caching, so as to offer performance comparable to the local database approach. There were loftier goals of extensibility, but these seemed to be reflected in the initial design primarily in the existence of multiple record types and the lack of structure imposed on names.

To achieve distributed administrative control, DNS uses an explicit hierarchy, intended to reflect the actual administrative hierarchy of the network. Each part of this hierarchy (a ‘zone’) is assigned to one or more DNS servers (each of which may be responsible for many distinct zones). Each record a DNS server returns carries a time-to-live (some number of seconds), permitting querying clients and servers to cache it for an appropriate amount of time. The same caching policy applies to referrals to other DNS servers, allowing most queries to avoid loading the top of the hierarchy.
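
A minimal sketch of this TTL-driven caching (a toy in-memory cache, not a real resolver):

import time

cache = {}   # (name, record_type) -> (expiry_timestamp, value)

def cache_put(name, rtype, value, ttl_seconds):
    cache[(name, rtype)] = (time.time() + ttl_seconds, value)

def cache_get(name, rtype):
    entry = cache.get((name, rtype))
    if entry is None:
        return None
    expiry, value = entry
    if time.time() >= expiry:
        del cache[(name, rtype)]          # stale: forget and re-resolve
        return None
    return value

cache_put("example.com", "A", "93.184.216.34", ttl_seconds=300)
print(cache_get("example.com", "A"))      # served from cache for five minutes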

In spite of this caching strategy and centralized per-organization caches, the authors observed a relatively high query rate on the root servers, about seven queries per second across all seven root servers at a time when only about twenty-five thousand hosts had names on the Internet, and overall poor performance due to wide-area network issues that would not be discovered in testing. Some of these issues were due to poor implementations; some of the non-cached traffic was due to an initial lack of negative response caching.

A notable problem with DNS mentioned in the paper is misconfiguration. Administrators apparently did a very poor job of understanding or correctly setting TTLs, and the consequences of getting them wrong were usually quite subtle.

With hindsight, it is clear that the paper's evaluation of DNS security was very poor. Though the authors mention the security concerns of resolvers blindly accepting records (albeit mostly in a misconfiguration scenario), they missed the chance to discover the poor use of additional answer sections that would cause major security problems later.

Monday, November 2, 2009

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

This paper describes the Chord DHT, a simulation-based evaluation of its load balancing and performance under churn, and a very small testbed evaluation of its latency across the wide-area network.

Chord maps its keys and nodes onto a conceptual ring and uses the distance in one direction around the ring to route messages. A node is directly responsible for the keys between its predecessor's address and its own, and uses the nodes following it as replicas for those keys. To guarantee routability to some node that believes it is responsible for a key, each node maintains pointers to its current notion of its successor and predecessor. To provide fast routing in the common case, each node also keeps a pointer to the node approximately 2^k positions further around the ring for each k less than the key size in bits.
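
A compact sketch of this routing structure (in-memory objects stand in for the real RPC-based protocol, and the 8-bit identifier space is only for the example): a lookup is forwarded to the farthest finger that does not overshoot the key, so each hop roughly halves the remaining ring distance.

M = 8                       # identifier bits; the ring has 2^M positions
RING = 2 ** M

def between(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class ChordNode:
    def __init__(self, ident):
        self.id = ident
        self.fingers = []       # fingers[k] = successor of (id + 2^k) mod RING

    def find_successor(self, key):
        if between(key, self.id, self.fingers[0].id):   # fingers[0] is the direct successor
            return self.fingers[0]
        for f in reversed(self.fingers):                # closest preceding finger
            if between(f.id, self.id, (key - 1) % RING):
                return f.find_successor(key)
        return self.fingers[0]

def successor_of(key, nodes):
    return min(nodes, key=lambda n: (n.id - key) % RING)

# Tiny three-node ring with precomputed finger tables.
nodes = [ChordNode(i) for i in (10, 80, 160)]
for n in nodes:
    n.fingers = [successor_of((n.id + 2 ** k) % RING, nodes) for k in range(M)]
print(nodes[0].find_successor(100).id)   # -> 160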

Given these data structures, the paper presents algorithms for initializing the finger table with relatively few messages and for maintaining the finger tables and replicas in the presence of failures. The paper provides some theoretical evidence that these algorithms maintain good performance and availability under failures. The simulation evaluation gives good evidence that the algorithms can, in fact, maintain network connectivity under high failure rates, though it does not seem to show that replicas would actually mitigate the impact of these failures in practice.

Chord is a very nice paper, but DHTs as envisioned in Chord have remained relatively rare. Many distributed systems (especially lookup services, which are relatively undemanding) are not actually large, unstable, or decentralized enough to benefit from this sort of detailed handling of churn and scale, yielding approaches like the single-hop DHT (e.g. Dynamo, SEATTLE) and the replicated master (e.g. GFS).

Looking Up Data in P2P Systems

This paper reviews peer-to-peer overlay networks from the perspective of how they approach the lookup problem (finding where data is stored). It observes that there is a spectrum of actual implementations from centralized to hierarchical to symmetric. The paper focuses primarily on the symmetric sort, observing that these gain resilience by not having any special nodes whose failures matter more than others'. (I find it curious that the highly variable performance and availability of machines in P2P systems is not mentioned here.) Separately from the symmetric/hierarchical distinction, the paper distinguishes between structured and unstructured systems, where structured systems maintain "a well-defined set of information about other nodes". The authors focus only on structured systems, since the systems they consider unstructured cannot offer retrievability guarantees.

From these choices, the authors focus on DHTs, which are symmetric, structured solutions to the distributed lookup problem. The paper describes several routing schemes (and their corresponding distance functions) and concludes with several practical issues that were apparently not well solved in DHTs when the paper was written, such as assessing the impact of churn, handling malicious nodes, and providing more powerful primitives than lookup-by-key.