Using Monte Carlo Simulation in Your Reliability Model

Once a model of a system has been constructed, you can simulate the system to predict how it will perform through time.  By definition, however, the performance of any system involving reliability elements is stochastic (i.e., inherently variable), since failure and repair processes can only be described statistically.  That is, we cannot say exactly when a component will fail; we can only describe the failure (and repair) process in terms of probabilities.  For example, if we had 100 identical computers, their failure (and repair) histories would not be identical; they would display a distribution of behaviors.
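
To make this concrete, here is a minimal Python sketch (entirely outside GoldSim) of 100 identical components whose lifetimes follow the same exponential distribution; the 10,000-hour mean time to failure is purely illustrative.  Although every component is statistically identical, each sampled history is different:

    # Minimal sketch of inherent variability (illustrative, not GoldSim itself):
    # 100 identical computers share the same mean time to failure, yet each
    # sampled failure time differs.
    import numpy as np

    rng = np.random.default_rng(seed=1)

    mttf_hours = 10_000.0            # assumed mean time to failure
    n_computers = 100

    # Sample one failure time per computer from the same exponential distribution.
    failure_times = rng.exponential(scale=mttf_hours, size=n_computers)

    # Identical components, yet a distribution of behaviors:
    print(f"earliest failure: {failure_times.min():.0f} h")
    print(f"latest failure:   {failure_times.max():.0f} h")
    print(f"mean:             {failure_times.mean():.0f} h")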

In addition to this inherent variability, we might also be uncertain about some of the input parameters controlling the model.  For example, if we had not carried out actual tests on the components, the parameters describing their failure modes would be uncertain, and we could enter these as probability distributions in order to capture this uncertainty.
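
As a sketch of how such parameter uncertainty can be layered on top of inherent variability, the example below first samples an uncertain mean time to failure (the lognormal distribution and its parameters are illustrative assumptions, not GoldSim defaults) and then samples a failure time given that parameter:

    # Sketch of parameter uncertainty layered on top of inherent variability.
    # The MTTF itself is uncertain here, so each realization first samples an
    # MTTF, then samples a failure time given that MTTF.
    import numpy as np

    rng = np.random.default_rng(seed=2)
    n_realizations = 1_000

    # Uncertain parameter: MTTF with an assumed geometric mean of 10,000 h.
    mttf_samples = rng.lognormal(mean=np.log(10_000.0), sigma=0.3,
                                 size=n_realizations)

    # Inherent variability: one failure time per realization, given its MTTF.
    failure_times = rng.exponential(scale=mttf_samples)

    print(f"5th-95th percentile of failure time: "
          f"{np.percentile(failure_times, 5):.0f} - "
          f"{np.percentile(failure_times, 95):.0f} h")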

Variability and uncertainty are represented in GoldSim using Monte Carlo simulation.  Monte Carlo simulation consists of calculating a large number of “realizations” (potential futures).  Each realization simulates the same system with the same initial conditions, but with different sampled stochastic values, both at the beginning of the simulation and as the system evolves through time.  This produces a large number of separate and independent results, each of which is considered equally likely.  These realizations can then be combined to provide statistical information on possible outcomes.
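
The following sketch shows the Monte Carlo loop in miniature: each realization simulates the same simplified failure-and-repair process over the same mission time, and the independent results are then combined into statistics.  The mission length, mean time to failure, and mean time to repair are all assumed values:

    # Sketch of a Monte Carlo loop: many realizations of the same system,
    # each with different sampled failure and repair times.
    import numpy as np

    rng = np.random.default_rng(seed=3)
    mission_hours = 8_760.0          # one year of operation (assumed)
    mttf, mttr = 1_000.0, 50.0       # assumed mean time to failure / to repair

    def one_realization():
        """Return the fraction of the mission spent operating."""
        t, uptime = 0.0, 0.0
        while t < mission_hours:
            run = rng.exponential(mttf)           # time until the next failure
            uptime += min(run, mission_hours - t)
            t += run + rng.exponential(mttr)      # advance past the repair
        return uptime / mission_hours

    # 500 separate, independent, equally likely results...
    results = np.array([one_realization() for _ in range(500)])

    # ...combined to provide statistical information on outcomes.
    print(f"mean availability: {results.mean():.4f}")
    print(f"5th / 95th percentile: {np.percentile(results, 5):.4f} / "
          f"{np.percentile(results, 95):.4f}")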

   Note: If you are studying a system that is effectively at steady state, it can be appropriate to run a single, arbitrarily long simulation (e.g., 1000 years), since a long history captures the variability in the failures and repairs. However, for a system that is aging, you should simulate the actual life span of the system over multiple Monte Carlo realizations.

The number of realizations required to accurately capture the behavior of a system is a complex issue.  It is influenced both by the computational cost of running each realization and by the frequency of the behaviors you wish to capture: the rarer the behavior of interest, the more realizations are needed to observe it.

A rule of thumb for determining the number of realizations is that it should be large enough that at least 10 realizations include an occurrence of the most infrequent behavior you want to capture.  For example, if you wanted to observe two consequences, one of which occurred once in every 10 realizations and another that occurred once in every 500 realizations, an adequate number of realizations for the simulation would be 10 × 500 = 5000 (which would also yield roughly 500 occurrences of the more common consequence).
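
The arithmetic behind this rule of thumb can be expressed as a short helper (the function name and the example frequencies are illustrative):

    import math

    def realizations_needed(frequency_per_realization, min_occurrences=10):
        """Rule of thumb: run enough realizations that the behavior of
        interest is expected to occur at least min_occurrences times."""
        return math.ceil(min_occurrences / frequency_per_realization)

    print(realizations_needed(1 / 10))    # 100 for the common consequence
    print(realizations_needed(1 / 500))   # 5000 for the rare one; use the larger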

Another factor to consider when deciding on the number of realizations is the impact that this choice has on the results and statistics available at the end of a dynamic simulation.

In order to generate confidence bounds on the Inherent Availability, Operational Availability and Reliability metrics, at least six realizations must be run.  However, this is a bare minimum, and if you are making use of these confidence bounds we strongly suggest running at least 100 realizations.
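
To see why so few realizations give such crude bounds, consider a generic Student-t confidence interval on mean availability across realizations (GoldSim's own procedure may differ in detail; the per-realization results below are simulated stand-ins):

    # Sketch of why more realizations tighten confidence bounds, using a
    # standard Student-t interval on the mean across realizations.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=5)

    for n in (6, 100):
        # Stand-in per-realization availability results (illustrative only).
        sample = rng.normal(loc=0.95, scale=0.02, size=n)
        half_width = (stats.t.ppf(0.975, df=n - 1)
                      * sample.std(ddof=1) / np.sqrt(n))
        print(f"n={n:3d}: {sample.mean():.4f} +/- {half_width:.4f}")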

Furthermore, in order for the Failure Times results analysis to be available in result mode, the number of realizations must be large enough so that the sum of the number of failures over all realizations is greater than 5.  Similarly, for the Repair Times analysis to be available, the sum of the number of repairs over all realizations must be greater than 5.  As with the confidence bounds, the more failures and repairs that can be simulated, the less likely it is that the results represent an anomalous situation.
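
As a simple check of this criterion, the sketch below totals per-realization failure counts (the counts themselves are made up) and tests whether the Failure Times analysis would be available:

    # Sketch of the availability criterion for the Failure Times analysis:
    # the total number of failures across all realizations must exceed 5.
    failures_per_realization = [0, 1, 0, 2, 0, 1, 0, 0, 1, 1]  # assumed counts

    total_failures = sum(failures_per_realization)
    print(total_failures, total_failures > 5)   # 6 True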
