This paper describes a system (called NetMedic) for identifying the likely location of the root cause of faults from component state/performance traces. Machine-local information (such as CPU and memory utilization) and network information (such as traffic rates and response times) are collected and assigned to “components”, which, in the paper's evaluation, correspond to entities such as machines, user processes, or the collective networking infrastructure.
Since the system does not assume any domain knowledge about the measurements it receives, faults are detected based on measurements deviating from their usual values. Diagnosis uses historical correlations to determine whether an anomalous measurement is likely influencing, or influenced by, another. To determine which components' measurements to compare, domain knowledge is used: ‘templates’ mark the dependencies each component type has on other components (e.g. between a machine and all of its processes, or between processes communicating over IP). Given this directed dependency graph and the likelihood of measurements influencing each other, the ‘best’ cause is chosen by a ranking score: for each path from the alleged cause to another component, take the geometric mean of the influence scores along its edges; aggregate the best such path scores over all components, weighted by the destination components' abnormality; and combine that aggregate with the best path score to the component being diagnosed.
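To make that ranking step concrete, here is a rough sketch of how I read it. Everything below, the function names, the use of a maximum over paths, and the final square-root combination, is my reconstruction from the paper's description rather than NetMedic's actual code.

```python
import math

def path_weight(edge_weights, path):
    """Geometric mean of the edge influence scores along one path."""
    ws = [edge_weights[(a, b)] for a, b in zip(path, path[1:])]
    return math.prod(ws) ** (1.0 / len(ws))

def best_path_score(edge_weights, paths):
    """Best (maximum) path weight among all paths between two components."""
    return max(path_weight(edge_weights, p) for p in paths)

def global_impact(edge_weights, paths_from_cause, abnormality):
    """Abnormality-weighted average, over all components, of the cause's
    best path score to each of them."""
    total = sum(abnormality[c] * best_path_score(edge_weights, ps)
                for c, ps in paths_from_cause.items())
    norm = sum(abnormality[c] for c in paths_from_cause)
    return total / norm if norm else 0.0

def rank_score(edge_weights, paths_from_cause, effect, abnormality):
    """Combine the cause->effect path score with the global impact;
    I read the combination as the square root of their product."""
    c_to_e = best_path_score(edge_weights, paths_from_cause[effect])
    return math.sqrt(c_to_e *
                     global_impact(edge_weights, paths_from_cause, abnormality))
```

Under this reading, the candidate cause with the highest score is reported first.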
The authors include an evaluation where they analyze whether their system diagnoses the causes of injected faults correctly. They compare their approach to a simpler approach and a hand-tuned approach; both share the dependency graph and root-cause ranking method. The simpler method only measures whether components are ‘abnormal’ and assigns influence based solely on whether both components are abnormal (see the sketch below); the hand-tuned method only measures influence between variables hand-marked as related. The authors find that their system gave the correct cause as the first-ranked cause for almost all the injected faults, and did so substantially better than the simple approach, which was successful about half of the time. The hand-tuned approach did not yield a substantial improvement, showing that the authors did not lose much by avoiding domain-specific knowledge in that part.
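For contrast with the sketch above, my reading of the simple baseline's per-edge influence reduces to roughly the following; the 1.0/0.0 values are placeholders of mine, not constants from the paper.

```python
def simple_edge_weight(src_abnormal: bool, dst_abnormal: bool) -> float:
    # Influence is credited only when both endpoints look abnormal;
    # no historical correlation between their values is consulted.
    return 1.0 if (src_abnormal and dst_abnormal) else 0.0
```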
One troubling aspect of this work is formulas that seem to be chosen mostly arbitrarily (e.g. using the geometric mean and sqrt(c->e value * global weighted value) for cause ranking; the choice of 0.1 and 0.8 as ‘default’ edge weights; etc.). Of course, the authors had to make some choices here, but from the results we cannot tell how sensitive correct diagnosis is to these choices: is their technique actually general enough that they mostly don't matter? (The authors did do a sensitivity analysis for one of these parameters, the length of history used for diagnosis.)
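As an illustration, the kind of sweep that would answer this is easy to state. Here `diagnose` stands in for the whole NetMedic pipeline and is hypothetical, as are the candidate values for the two default edge weights.

```python
from itertools import product

def sweep_default_weights(diagnose, fault_traces, true_causes):
    """For each pair of candidate default edge weights, re-run diagnosis on the
    injected-fault traces and record how often the true cause is ranked first."""
    results = {}
    for w_low, w_high in product([0.05, 0.1, 0.2], [0.6, 0.8, 0.9]):
        hits = sum(1 for trace, truth in zip(fault_traces, true_causes)
                   if diagnose(trace, default_weights=(w_low, w_high))[0] == truth)
        results[(w_low, w_high)] = hits / len(fault_traces)
    return results
```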
There's also the big step away from fully domain-agnostic diagnosis in requiring templates. These templates are, fortunately, applicable to many kinds of applications, which is presumably why, given their motivation, the authors felt it was okay to require them but not to require similar input about the measurements being taken. But these templates do not look entirely trivial to write: the ones they show have many subtle dependencies, such as on machine firewalls and on other traffic. Are they actually easy to produce, and, given that the authors have gone to great effort to develop a method for ignoring irrelevant correlations, how useful are they?
Good point ... a sensitivity analysis is missing in this paper, although the reported results are actually rather impressive.
Where is the floodless paper?
Apparently, I forgot to press the 'Post' button for the Floodless paper until you pointed that out (though I had written it out in the same sitting as Detailed Diagnosis...)