Reliability modeling techniques for self-repairing computer systems
Abstract
This paper develops techniques for generating and using mathematical models to evaluate the architectural tradeoffs involved in designing self-repairing, highly reliable computers for long missions. These systems must use standby sparing, and their reliability is shown to be extremely sensitive to small variations in a new design parameter, the coverage, c, defined as the probability of system recovery given the existence of a failure. Interactive terminal calculations show c to be the single most important parameter in high-reliability system design. Changing the coverage from 1 to 0.98 can change the system mission time achievable at a specified reliability by orders of magnitude. Most techniques for increasing system reliability (e.g., adding more spares) are shown to be futile in the face of an inadequate coverage of 0.99. Examples of system tradeoff evaluation show that adding checking, diagnostics, etc. to improve failure coverage is the most advantageous technique. This mandates extensive application of modeling techniques throughout all phases of computer system design.
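The sensitivity to coverage can be illustrated with a minimal sketch. It assumes a simplified textbook model that is not spelled out in the abstract: one active unit with constant failure rate lambda, a fixed number of standby spares that cannot fail while dormant, and each failure recovered independently with probability c. Under those assumptions the mission reliability is R = exp(-lambda*T) * sum over k from 0 to S of (c*lambda*T)^k / k!. The function name and parameters below are hypothetical and chosen only for illustration.

```python
import math


def standby_reliability(lam_t, spares, coverage):
    """Reliability of one active unit plus `spares` standby spares.

    Assumed model (not taken from the abstract): constant failure rate,
    spares do not fail while dormant, and each failure is recovered
    with probability `coverage`. `lam_t` is failure rate times mission time.
    """
    total = 0.0
    term = 1.0  # (coverage * lam_t)^k / k!, starting at k = 0
    for k in range(spares + 1):
        if k > 0:
            term *= coverage * lam_t / k
        total += term
    return math.exp(-lam_t) * total


if __name__ == "__main__":
    # Compare perfect coverage with c = 0.99 as spares are added.
    for spares in (0, 1, 2, 4, 8):
        perfect = standby_reliability(1.0, spares, 1.0)
        imperfect = standby_reliability(1.0, spares, 0.99)
        print(spares, perfect, imperfect)
```

In this assumed model, adding spares with c < 1 can raise the reliability only toward the limit exp(-(1 - c)*lambda*T), never to 1, which is one way to see why extra spares alone cannot compensate for inadequate coverage.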