Adaptive algorithms for diagnosing large-scale failures in computer networks
Abstract
We propose a greedy algorithm, Cluster-MAX-COVERAGE (CMC), to efficiently diagnose large-scale clustered failures. We primarily address the challenge of identifying faults from incomplete symptoms. CMC makes novel use of both positive and negative symptoms to quickly produce a hypothesis list with few false negatives and false positives. CMC requires reports from about half as many nodes as existing algorithms to determine failures with 100 percent accuracy. Moreover, CMC achieves this gain significantly faster (sometimes by two orders of magnitude) than an algorithm that matches its accuracy. When fewer positive and negative symptoms are available at a reporting node, CMC performs much better than existing algorithms. We also propose an adaptive algorithm, Adaptive-MAX-COVERAGE (AMC), that performs efficiently during both independent and clustered failures. During a series of failures that includes both independent and clustered failures, AMC reduces the number of false negatives and false positives.
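To make the role of positive and negative symptoms concrete, the following is a minimal sketch of a generic greedy maximum-coverage diagnosis step. It is not the CMC or AMC algorithm itself, whose details are not given in this abstract. The function name `greedy_max_coverage`, the `explains` mapping, and the link and path names are all hypothetical. The sketch assumes that a positive symptom is a path observed to fail, that a negative symptom is a path observed to work, and that any candidate fault lying on a working path can be exonerated.

```python
from typing import Dict, Hashable, List, Set


def greedy_max_coverage(
    explains: Dict[Hashable, Set[Hashable]],
    positive: Set[Hashable],
    negative: Set[Hashable],
) -> List[Hashable]:
    """Return a hypothesis list of candidate faults, chosen greedily.

    explains: hypothetical map from each candidate fault to the set of
              paths that would fail if that candidate were faulty.
    positive: paths observed to fail (positive symptoms).
    negative: paths observed to work (negative symptoms).
    """
    # Assumption: a candidate that would break a path known to be
    # working cannot be faulty, so it is dropped up front.
    candidates = {
        fault: paths & positive
        for fault, paths in explains.items()
        if not (paths & negative)
    }
    unexplained = set(positive)
    hypothesis: List[Hashable] = []
    while unexplained:
        # Pick the candidate that explains the most unexplained failures.
        # Ties go to the candidate listed first in `explains`.
        best = max(
            candidates,
            key=lambda f: len(candidates[f] & unexplained),
            default=None,
        )
        if best is None or not (candidates[best] & unexplained):
            break  # no remaining candidate explains the leftover failures
        hypothesis.append(best)
        unexplained -= candidates[best]
        del candidates[best]
    return hypothesis


# Illustrative (made-up) topology: which paths each link would break.
explains = {
    "link_a": {"p1", "p2", "p3"},
    "link_b": {"p1", "p2"},
    "link_c": {"p4"},
    "link_d": {"p3", "p5"},
}
positive = {"p1", "p2", "p4"}  # failed paths
negative = {"p3"}              # path known to work

print(greedy_max_coverage(explains, positive, negative))  # ['link_b', 'link_c']
print(greedy_max_coverage(explains, positive, set()))     # ['link_a', 'link_c']
```

In this toy input, ignoring the negative symptom lets `link_a` into the hypothesis list even though it lies on a working path, which is the kind of false positive that negative symptoms can rule out.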