Publication
DSN 2008
Conference paper
Evaluating availability under quasi-heavy-tailed repair times
Abstract
The time required to recover from failures has a great impact on the availability of Information Technology (IT) systems. We define a class of probability distributions named quasi-heavy-tailed distributions as those distributions whose time series graph of the sample mean shows intermittent jumps in a given period. We find that the distribution of repair time is quasi-heavy-tailed for three IT systems: an in-house system hosted by IBM, a high performance computing system at the Los Alamos National Laboratory, and a distributed memory computer at the National Energy Research Scientific Computing Center. This means that the mean time to repair estimated from incidents observed within a certain period could change dramatically if we continue to observe incidents for another period. In other words, the estimated mean time to repair fluctuates widely over time. As a result, classical metrics based on the mean time to repair are not optimal for evaluating the availability of these systems. We propose to evaluate the availability of IT systems with the T-year return value, estimated based on extreme value theory. The T-year return value is the value that the repair time is estimated to exceed, on average, once every T years. We find that the T-year return value is a sound metric of the availability of the three IT systems. © 2008 IEEE.
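The two quantities discussed in the abstract can be illustrated with a short Python sketch. It is not the paper's procedure: the abstract says only that the return value is estimated with extreme value theory, so the Hill and Weissman estimators used here are a swapped-in, commonly used choice that assumes a Pareto-type tail. The function names, the parameter k (number of upper order statistics), and incidents_per_year are hypothetical names introduced for illustration.

```python
import math


def running_mean(samples):
    """Sample mean after each successive observation (the 'time series of the sample mean')."""
    means, total = [], 0.0
    for i, x in enumerate(samples, start=1):
        total += x
        means.append(total / i)
    return means


def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest observations.

    Assumption (not from the source): repair times are positive and have a
    Pareto-type upper tail.
    """
    if not 0 < k < len(samples):
        raise ValueError("k must satisfy 0 < k < len(samples)")
    ordered = sorted(samples)
    threshold = ordered[-k - 1]  # (k+1)-th largest value, used as the tail threshold
    if threshold <= 0:
        raise ValueError("repair times above the threshold must be positive")
    return sum(math.log(x / threshold) for x in ordered[-k:]) / k


def return_value(samples, k, incidents_per_year, t_years):
    """Weissman-type estimate of the repair time exceeded about once every t_years.

    With incidents_per_year incidents per year on average, a single repair
    must exceed the return value with probability p = 1 / (incidents_per_year * t_years).
    """
    n = len(samples)
    gamma = hill_tail_index(samples, k)
    threshold = sorted(samples)[-k - 1]
    p = 1.0 / (incidents_per_year * t_years)
    return threshold * (k / (n * p)) ** gamma
```

Plotting running_mean over an incident log would show the intermittent jumps the abstract describes when a few very long repairs dominate the mean. The return value instead depends on the fitted tail, and in this sketch it is sensitive to the choice of k, which would have to be selected with care on real data.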