This paper describes a layer-2 switch fabric targeted at datacenters. Scalability and the obvious simple correctness goals apply (loop-free forwarding, fault-tolerance) as well as ease of configuration and allowing IP addresses to move around in the datacenter.
To make this task simpler (that is, achievable with less per-switch state), the authors fix the datacenter's topology as a fat tree and designate a centralized ‘fabric manager’ that coordinates between switches (and which must, of course, be replicated for fault tolerance).
As with SEATTLE, the switches in this scheme handle all ARP requests themselves in addition to managing routing.
Switch state is kept small by distributing ‘pseudo MAC’ (PMAC) addresses, which encode the position of the destination node in the topology; a host's actual MAC address is never distributed to other nodes. The fabric manager controls the IP-to-PMAC mapping, allowing switches to respond to ARP requests and to generate gratuitous ARPs if a PMAC becomes invalid.
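To make the position-encoding idea concrete, here is a minimal sketch of packing a topological position into a 48-bit address. The field layout (16-bit pod, 8-bit position, 8-bit port, 16-bit VM id) is how I recall the paper's format and should be treated as an assumption, as should the function names, which are hypothetical.

```python
# Sketch of a position-encoding pseudo MAC (PMAC).
# Assumed layout, 48 bits total: pod (16) | position (8) | port (8) | vmid (16).
FIELDS = (("pod", 16), ("position", 8), ("port", 8), ("vmid", 16))


def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    """Pack the four fields into a colon-separated 48-bit address string."""
    values = (pod, position, port, vmid)
    raw = 0
    for (name, bits), value in zip(FIELDS, values):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
        raw = (raw << bits) | value
    # Emit the six bytes from most significant to least significant.
    return ":".join(f"{(raw >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def decode_pmac(pmac: str) -> tuple[int, int, int, int]:
    """Recover (pod, position, port, vmid) from an encoded PMAC string."""
    raw = int(pmac.replace(":", ""), 16)
    return (raw >> 32) & 0xFFFF, (raw >> 24) & 0xFF, (raw >> 16) & 0xFF, raw & 0xFFFF


if __name__ == "__main__":
    pmac = encode_pmac(pod=2, position=1, port=0, vmid=1)
    print(pmac, decode_pmac(pmac))
```

Because the pod and position sit in the high-order bits, a switch only needs to match on an address prefix to decide where a packet goes, which is what keeps the forwarding tables small.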
Forwarding is kept loop-free by assuming a rooted fat tree and by following the rule that a packet changes from going up toward the root to going down toward a node exactly once. For load balancing, switches can choose freely between upstream ports (the authors suggest choosing per flow, so that each flow stays on a single path); for fault tolerance, the fabric manager distributes failure information to the affected switches, which then update their routing tables accordingly.
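The up-then-down rule and per-flow uplink choice can be sketched as follows. This is an illustration under my own assumptions rather than the authors' implementation: the three level names, the `direction` and `pick_uplink` helpers, and the use of CRC32 over the 5-tuple are all hypothetical choices.

```python
import zlib

# Hypothetical switch levels in a three-tier fat tree.
EDGE, AGGREGATION, CORE = 0, 1, 2


def direction(level: int, my_pod: int, my_position: int,
              dst_pod: int, dst_position: int) -> str:
    """Return "down" if the destination lies in this switch's subtree, else "up".

    A switch sends a packet down only when the destination is below it, so
    every later hop also chooses "down" and the packet turns exactly once.
    """
    if level == CORE:
        return "down"  # the root level has nowhere further up to go
    if level == AGGREGATION:
        return "down" if dst_pod == my_pod else "up"
    # Edge switch: the destination must hang off this very switch.
    same_switch = dst_pod == my_pod and dst_position == my_position
    return "down" if same_switch else "up"


def pick_uplink(flow: tuple, uplink_ports: list[int]) -> int:
    """Choose an upstream port by hashing the flow's 5-tuple.

    Hashing on the flow, not the packet, keeps each flow on a single path.
    """
    if not uplink_ports:
        raise ValueError("no live uplink ports")
    key = "|".join(str(field) for field in flow).encode()
    return uplink_ports[zlib.crc32(key) % len(uplink_ports)]


if __name__ == "__main__":
    flow = ("10.0.1.2", "10.2.0.3", 6, 40000, 80)
    print(direction(EDGE, 0, 1, 2, 0), pick_uplink(flow, [2, 3]))
```

Fault handling fits the same shape: when the fabric manager reports a failed link, the affected switch would drop that port from `uplink_ports`, and flows rehash over the remaining ones.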
Probably the most clever scheme here is switch autoconfiguration. Bottom-level switches discover that they are connected to enough non-switches, and the level number of higher-level switches propagates up the fat tree. Position number assignment is done in a greedy fashion, synchronized by a higher-level switch.
Combined with failed link detection, this should make (re)configuration automatic except for the pesky issue of the fabric manager and communication with it.
The authors tested their system on a 16-host prototype, evaluating its response to failures, IP migration, and the overhead it imposed on the fabric manager. Failures resulted in approximately 100 milliseconds of disconnectivity (which was, of course, made worse by TCP timeouts), and in the VM migration scenario the TCP flow stalls for periods of approximately 200-600 ms. For the fabric manager, they evaluated the overhead of between 25 and 100 ARP misses/sec/host, assumed linear scaling, and extrapolated the fabric manager's CPU and network bandwidth usage. They found that the fabric manager would require a lot of CPU resources (70 cores for the 100 ARP/sec case with ~30k hosts).
These numbers are disappointing, though the authors point out that these miss rates seem unrealistic and that the fabric manager could probably be distributed across a cluster. Since the authors are presumably pretty confident that ARP miss rates are considerably lower in a real datacenter, it is disappointing that they didn't try to measure or intelligently estimate the rate at which misses would be expected to occur. The linear extrapolation is also disappointing: couldn't the authors have actually run their fabric manager on a synthetic trace with these miss rates and host count and measured the CPU usage?
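For reference, the linear extrapolation being criticized amounts to nothing more than the back-of-envelope arithmetic below. The host count is rounded to 30,000 (the review's "~30k"), the per-core rate is inferred from the 70-core figure rather than taken from the paper, and the function name is hypothetical.

```python
HOSTS = 30_000          # assumed: the "~30k hosts" figure, rounded
CORES_AT_100 = 70       # reported: cores needed at 100 ARP misses/sec/host

# Inferred per-core service rate implied by the 70-core figure (ARPs/sec/core).
PER_CORE_RATE = 100 * HOSTS / CORES_AT_100


def cores_needed(misses_per_host_per_sec: float, hosts: int = HOSTS) -> float:
    """Cores required if load scales linearly with total ARP miss rate."""
    return misses_per_host_per_sec * hosts / PER_CORE_RATE


if __name__ == "__main__":
    for rate in (25, 50, 100):
        print(f"{rate} misses/sec/host -> {cores_needed(rate):.1f} cores")
```

Under these assumptions the 25 misses/sec/host case works out to roughly 17.5 cores, but the model says nothing about contention, queueing, or memory effects, which is exactly why a measured run on a synthetic trace would have been more convincing.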