Applications that run on networks inherently rely on the correct functioning of many components. Understanding failures or performance problems in such systems requires understanding how a logical tasks is performed across these components. X-Trace is a system that tracks each task's intercomponent dependencies and thereby allows the component to produce logs that can be reconstructed to show the logical execution of a task.
This dependency tracking is done by piggybacking information about the previous subtask in intercomponent request, including the “root’ task whether the new subtask is the next subtask or a subsubtask (which may be performed concurrently with other subsubtasks). When a component acts on a request, this metadata is attached to the log entries it creates, and these logs are aggregated using the task ID.
Propagating this metadata and aggregating the logs is relatively cheap, but not free (the authors observed a 15% decrease in throughput in an instrumented Apache). The main cost, however, is clearly in the modification of each component of the system. The system's pervasiveness requires its metadata hooks to infect all the components. Ideally, much of this could be automated, since most systems already ‘know’ which request they are acting on at any point, which presumably was what the authors were exploring in their modifications to libasync. Such knowledge does not easily let one automatically choose between ‘next’ versus ‘down’, and certainly won't easily detect the non-tree structures the authors argue (describing as future work) would be more appropriate for some systems.
After proposing tracing all the way down to the link layer, the authors sensibly proposed, as future work, limiting what is traced by a combination of choosing which layers to trace and which incoming requests to mark. Of course, we already have tools for diagnosing network difficulties at the network path level, so logically, it might be nice (and certainly require less switch modification) to use X-Trace to find the path in our maze of an overlay network and then compose in our existing tool. Unfortuantely, based on the figures in the paper, X-Trace does not really record failures in such straightforward a manner. What humans interpret as a (service-stopping) failure in their typical traces is an omitted node, apparently without a hint was to what that next-hop might be save (non-automated) comparision with the correct case.
Thursday, November 12, 2009
Subscribe to:
Post Comments (Atom)
Actually I don't see much problem in collecting "working" xtraces and then using these as comparison with failure cases to pinpoint problems. One of the most interesting things about xtrace is the shape and structure of the graphs extracted from the traces.
ReplyDelete