Markov Modelling
Introduction
The reliability models explored in previous chapters assume independence between system components, meaning that the failure or repair of a component is not affected by what is going on with any other component. Consequently, the system failure state is expressed as a combination of component failures. For example, in a series system, if any component fails, then the system fails; in a parallel system, if all components fail, then the system fails. For these models, it is important to know the set of failed components; however, the order in which the components failed is not significant.
For complex systems that are modelled using Reliability Block Diagrams (RBDs) or Fault Trees, there may exist a set of component failure combinations that lead to a failed system state. In most cases, it is assumed that these component failures are independent, meaning that the failure of one component does not affect the failure times or behaviours of any other component. However, in shared load systems, the failure of a component can increase the load on the surviving components, thereby increasing their failure rates and hence the failure rate of the system. In addition, a common cause failure can arise, whose occurrence can lead to the failure of one or more components in the system. Examples of common cause failures include the loss of a common power supply, earthquakes and extreme weather conditions.
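The effect of load sharing can be illustrated with a minimal sketch. It assumes a two-component system in which each component fails at a constant rate lam while both are working, and the survivor fails at a constant rate lam_after once it carries the full load. These rates, the function names and the exponential assumption are illustrative choices, not taken from the previous chapters.

```python
import math

def independent_parallel_reliability(lam, t):
    """Two independent components in parallel, each with constant failure rate lam."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** 2

def load_sharing_reliability(lam, lam_after, t):
    """Two-component load-sharing system (assumed model).

    Both components fail at rate lam while sharing the load; after the
    first failure the survivor fails at rate lam_after.
    """
    # Probability that neither component has failed by time t.
    both_up = math.exp(-2.0 * lam * t)
    # Probability that exactly one component has failed and the survivor still works.
    if math.isclose(2.0 * lam, lam_after):
        one_up = 2.0 * lam * t * math.exp(-2.0 * lam * t)
    else:
        one_up = (2.0 * lam / (2.0 * lam - lam_after)) * (
            math.exp(-lam_after * t) - math.exp(-2.0 * lam * t)
        )
    return both_up + one_up

if __name__ == "__main__":
    lam, t = 0.001, 500.0
    print("independent:", independent_parallel_reliability(lam, t))
    print("load sharing (rate triples):", load_sharing_reliability(lam, 3.0 * lam, t))
```

When lam_after equals lam, the load-sharing expression reduces to the independent parallel formula; any increase in the survivor's rate lowers the system reliability below the value that the independence assumption would predict.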
Although the previous chapters provide formulas to compute such reliability-related measures as reliability, availability and MTTF for standby systems, they do not provide methods for deriving these equations. For example, in a system with cold standby components, components cannot fail in the standby mode, but they can fail when they are in operation. Thus, the failure rate (or failure time distribution) is different in these two modes. The time at which the standby component is put into operation depends on the failure time of the active unit. This means that component failures depend on the failure times of other components. In such cases, components cannot be assumed to be statistically independent. In addition to considering the set of component failure states, the order in which components fail must be considered.
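The dependence described for cold standby can be sketched with a small simulation. It assumes two identical units with a constant failure rate lam, perfect switching and no failures in standby, so the standby unit's clock starts only when the active unit fails. The function name and parameters are hypothetical and serve only as an illustration.

```python
import random

def simulate_mean_lifetimes(lam, n_runs=100_000, seed=0):
    """Estimate the mean system lifetime of two-unit cold standby and active parallel systems.

    Assumptions: exponential lifetimes with rate lam, perfect switching,
    no failures while a unit is in cold standby.
    """
    rng = random.Random(seed)
    standby_total = 0.0
    parallel_total = 0.0
    for _ in range(n_runs):
        t1 = rng.expovariate(lam)
        t2 = rng.expovariate(lam)
        # Cold standby: the second unit starts ageing only after the first fails.
        standby_total += t1 + t2
        # Active parallel: both units age from time zero; the system fails with the last one.
        parallel_total += max(t1, t2)
    return standby_total / n_runs, parallel_total / n_runs

if __name__ == "__main__":
    standby_mttf, parallel_mttf = simulate_mean_lifetimes(lam=0.001)
    print("cold standby MTTF estimate:", standby_mttf)
    print("active parallel MTTF estimate:", parallel_mttf)
```

Under these assumptions the estimates should be close to the theoretical values 2/lam for the cold standby system and 1.5/lam for the active parallel system. The difference arises solely because the operating period of the standby unit depends on when the active unit fails.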
Also, the previous chapters assume that all components are non-repairable. The equations given for system availability are based on the availability of individual components. The equations not only assume that component failure times are independent, but they also assume that the component repair times are independent. This means that the repair time of a component is independent of the states of other system components. This may not be true if a common repair facility (a group of repair technicians) exists for a set of components, because a failed component may have to wait for a repair crew that is busy repairing some other failed component.
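The impact of a shared repair facility can be sketched as follows. The example assumes a two-unit parallel system with constant failure rate lam and constant repair rate mu per unit, and models the number of failed units as a birth-death process. The function name, the crews parameter and the rate values are assumptions made for illustration.

```python
def parallel_unavailability(lam, mu, crews):
    """Steady-state probability that both units of a two-unit parallel system are down.

    The state is the number of failed units (0, 1 or 2). With `crews` repair
    crews, at most `crews` failed units can be repaired at the same time.
    """
    failure_rates = [2.0 * lam, lam]            # transitions 0 -> 1 and 1 -> 2
    repair_rates = [mu, min(2, crews) * mu]     # transitions 1 -> 0 and 2 -> 1
    # Unnormalised steady-state weights from the birth-death balance equations.
    weights = [1.0]
    for up, down in zip(failure_rates, repair_rates):
        weights.append(weights[-1] * up / down)
    return weights[-1] / sum(weights)

if __name__ == "__main__":
    lam, mu = 0.01, 0.1
    print("two crews (independent repairs):", parallel_unavailability(lam, mu, crews=2))
    print("one shared crew:", parallel_unavailability(lam, mu, crews=1))
```

With two crews the result equals the product of the individual component unavailabilities, (lam / (lam + mu)) ** 2, which is what the independence assumption gives. With a single shared crew the unavailability is higher, because a second failed unit has to wait for the crew to become free.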
In most cases, it is assumed that a good component operates continuously, even during system failure. This assumption is generally valid. However, when such independence between component failures and/or repairs should not be assumed, stochastic processes (rather than RBDs, fault trees or other combinatorial models) should be used. Even when the failure and repair times of all components are independent, cases exist where stochastic processes are necessary.
For example, exact reliability evaluation of a parallel system with repairable components cannot be performed using combinatorial models because the reliability of this system depends not only on the set of component states at a specified time but also on the history of component failure and repair events. Most combinatorial models do not even provide formulas for approximating the reliability of a repairable system. Moreover, combinatorial models cannot directly calculate the availability of a single component because all of the possible sequences of failures and repairs of that component must be considered.
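The distinction between reliability and availability for a repairable parallel system can be illustrated with a minimal sketch. It assumes two identical units with constant failure rate lam and constant repair rate mu, independent repairs, and treats system failure (both units down) as an absorbing state. The state equations are integrated with a fourth-order Runge-Kutta scheme; the function names and the step count are illustrative choices.

```python
import math

def repairable_parallel_reliability(lam, mu, t, steps=10_000):
    """Probability that a two-unit repairable parallel system has never failed by time t.

    State 0: both units up. State 1: one unit up, one under repair.
    The state with both units down is absorbing, so R(t) = P0(t) + P1(t).
    """
    def deriv(p0, p1):
        return (-2.0 * lam * p0 + mu * p1,
                2.0 * lam * p0 - (lam + mu) * p1)

    p0, p1 = 1.0, 0.0
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p0, p1)
        k2 = deriv(p0 + 0.5 * h * k1[0], p1 + 0.5 * h * k1[1])
        k3 = deriv(p0 + 0.5 * h * k2[0], p1 + 0.5 * h * k2[1])
        k4 = deriv(p0 + h * k3[0], p1 + h * k3[1])
        p0 += h * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        p1 += h * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return p0 + p1

def parallel_availability(lam, mu, t):
    """Combinatorial result: probability that at least one unit is up at time t."""
    a = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
    return 1.0 - (1.0 - a) ** 2

if __name__ == "__main__":
    lam, mu, t = 0.001, 0.1, 1000.0
    print("reliability:", repairable_parallel_reliability(lam, mu, t))
    print("availability:", parallel_availability(lam, mu, t))
```

The combinatorial formula only uses the component states at time t, so it yields the system availability. The reliability is lower, because it also excludes every history in which both units were down at some earlier time and were subsequently repaired. With mu set to zero the numerical result reduces to the familiar non-repairable parallel formula.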
Stochastic processes can handle all of these complex and sequence-dependent situations. They can also accurately and completely model such dynamic system behaviours as:
Repairs.
Shocks (shared loads and induced failures).
Common cause and dependent failures.
Sequence/state-dependent failure rates (standby components).
Variable configurations.
Complex error handling and recovery mechanisms (common pool of repair technicians).
Phased mission requirements.
Because of their flexibility, generalized stochastic processes can be used to specify various complex system behaviours. Thus, they are widely used to assess system reliability and related characteristics in mission critical systems and research-oriented projects. However, their complexity makes them much harder to understand than combinatorial models. Consequently, generalized stochastic processes are not used in all industries.