Engineering method of fault-tolerant systems simulation

Original Scientific Paper
Larkin V. Evgenii (elarkin@mail.ru), et al.


INTRODUCTION
Reliability assurance is a central problem for complex systems that operate autonomously for long periods in an unfavorable environment [1], [2], [3], [4], [5], [6]. A hostile environment increases the failure rate of equipment, so it is necessary either to use units with higher reliability and/or resistance to external influences, or to compensate for failures by building a fault-tolerant system [7], [8], [9], [10]. The common principle of fault tolerance is to introduce redundancy into the system structure so that a failed unit can be substituted during operation; this increases system weight, dimensions, energy consumption, etc. [27], [28], [29], [30]. Moreover, the redundant units are themselves a source of faults. This is why preliminary simulation of a redundant system is of interest.
A common approach to system reliability simulation is based on the theory of Markov [11], [12], [13], [14] or semi-Markov [14], [15], [16] processes. It is rather widely used when it is necessary to describe the sequence of faults and recoveries in one separate unit, subassembly, or system that is considered as a single unit [17]. In this approach, the states of the Markov chain are abstract analogues of the unit states, and Markov switches correspond to real faults and recoveries of the element. Wandering through the Markov chain therefore characterizes the life cycle of the unit.
When an assembly of interacting units is considered, a simple Markov or semi-Markov process is insufficient to describe its operation in the fault-tolerance regime. Destructive and restorative processes in the elements gathered into an assembly develop independently. Moreover, the processes compete [16], [18], [19] with one another for the current fault/recovery event. It is therefore necessary to have a mathematical apparatus that, on the one hand, describes the fault/recovery process inside a separate element and, on the other hand, simulates the competition effect in assemblies.
Such an apparatus would make it possible to evaluate the probabilities and time intervals of wandering through a parallel Markov/semi-Markov process [16]. By definition, these characteristics determine the reliability factors of the fault-tolerant system as a whole [5], [6], [7], [8], such as the failure rate, mean time between failures, mean time to recovery, etc.

The approach to simulation of fault-tolerant systems
Let us consider a fault-tolerant system that includes M units, the m-th of which is described, in terms of fault-tolerance modeling, by the semi-Markov process $\mu_m$. The processes in the units develop in parallel, so together the processes $\mu_m$, $1 \le m \le M$, form a complex M-parallel semi-Markov process [16, 20]:
$$\mu = \{\mu_1, \ldots, \mu_m, \ldots, \mu_M\}. \qquad (1)$$
The semi-Markov process $\mu_m$ includes a set of structural states (below, "states") $A_m$ and a semi-Markov matrix $h_m(t)$, where t is the time:
$$\mu_m = \{A_m, h_m(t)\}. \qquad (2)$$
The set of states may be represented as the union
$$A = \bigcup_{m=1}^{M} A_m, \qquad (3)$$
in which $A_m \cap A_n = \varnothing$ when $m \ne n$; $A_m = \{a_{0(m)}, \ldots, a_{j(m)}, \ldots, a_{J(m)}\}$; $a_{0(m)}$ is the starting state of the m-th semi-Markov process, meaning the start of operation of the m-th unit (the element is certainly able to work); $a_{j(m)}$ is the mathematical analogue of a state of the m-th unit (able to work, unable to work, short-time failed, under repair, etc.); $a_{J(m)}$ is the absorbing state of the semi-Markov process, which is the mathematical analogue of a fully destroyed element.
The semi-Markov matrix $h_m(t)$ in (4), with the stochastic matrix $p_m$ and the matrix of time densities $f_m(t)$, has the following properties. The elements of the main diagonal of $h_m(t)$ are equal to zero, which describes a graph without loops; physically, this means that during operation the physical condition of a unit switches from one state to another and, after switching, the unit does not remain in the same condition as before. The probabilities of the stochastic matrix $p_m$ reflect one or another cause of element failure, or the possibility of repairing the element with one or another result. The densities of the matrix $f_m(t)$ describe how long the period until a failure with a specific cause lasts, or how much time will be spent until the element is restored. For the elements of rows $0(m)$ through $J(m)-1$, expression (5) holds. Both the probabilities of the matrix $p_m$ and the parameters of the densities of the matrix $f_m(t)$ (expectation, dispersion, initial and central moments of higher orders) depend on the material of which the element is made, the quality of element manufacturing and assembly, the operating conditions, side effects, etc., and define the parameters of wandering through the semi-Markov process $\mu_m$.
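The structural properties listed for a single unit can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes that expression (5), which is not reproduced here, requires every non-absorbing row of the stochastic matrix to sum to one, and that the absorbing row is all zeros. The three-state unit (working, under repair, destroyed) and the function name `check_unit_structure` are hypothetical.

```python
import numpy as np

def check_unit_structure(p, tol=1e-12):
    """Check the assumed structural properties of a unit's stochastic matrix.

    Assumptions (illustrative): the last state is the absorbing one, the
    main diagonal is zero (graph without loops), every non-absorbing row
    sums to one, and the absorbing row contains only zeros.
    """
    p = np.asarray(p, dtype=float)
    no_loops = np.allclose(np.diag(p), 0.0, atol=tol)
    rows_full = np.allclose(p[:-1].sum(axis=1), 1.0, atol=tol)
    absorbing_row_empty = np.allclose(p[-1], 0.0, atol=tol)
    return bool(no_loops and rows_full and absorbing_row_empty)

# Hypothetical unit: 0 = working, 1 = under repair, 2 = destroyed (absorbing).
p_unit = np.array([
    [0.0, 0.9, 0.1],   # working -> repair or destroyed
    [1.0, 0.0, 0.0],   # repair always returns to working
    [0.0, 0.0, 0.0],   # destroyed: no outgoing arcs
])

print(check_unit_structure(p_unit))
```

A matrix with a non-zero diagonal element, that is, a loop, fails the check.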

Single element reliability parameters calculation
As is known [16], [21], the time density of wandering through the semi-Markov process (2) from the state $a_{0(m)}$ to the state $a_{J(m)}$ is given by (6). In (6) it is necessary to replace the convolution of semi-Markov matrices, which is not defined, with the multiplication of characteristic matrices, which is well known. Dependence (6) defines the pure (not weighted) density of the time of reaching the state $a_{J(m)}$ from the state $a_{0(m)}$, by the definition of the matrix product [22] and because the semi-Markov process has only one starting state $a_{0(m)}$ and only one absorbing state $a_{J(m)}$. The mean "lifetime" of the m-th element in the system is $\int_0^\infty t\,f_{0(m),J(m)}(t)\,dt$, and the probability that the m-th element remains operable during time $\theta$ is $1-\int_0^\theta f_{0(m),J(m)}(t)\,dt$.
Besides determining the time density $f_{0(m),J(m)}(t)$, the designer of a fault-tolerant system may need to determine the time density of reaching an arbitrary state $a_{k(m)}$ from a state $a_{j(m)}$, $1(m) \le j(m) \le J(m)$, where both states are neither starting nor absorbing. When $j(m) \ne k(m)$, the task may be interpreted as determining the time interval until failure or the repair time. When $j(m) = k(m)$, the task may be interpreted as determining the time interval between failures or between repairs. The only restriction imposed on the wandering trajectories is that the process should not fall into either the state $a_{j(m)}$ or the state $a_{k(m)}$ twice. In order to satisfy this restriction, the state $a_{j(m)}$ should get the status of a starting state, and the state $a_{k(m)}$ should get the status of an absorbing state. When $j(m) = k(m)$, the state $a_{j(m)}$ should be split into starting and absorbing substates.
The first case is shown in Fig. 1 b, which shows the graph of Fig. 1 a with all arcs leading to the state $a_{j(m)}$ and all arcs leading from the state $a_{k(m)}$ deleted. To form such a structure in the semi-Markov matrix $h_m(t)$, all elements of the $j(m)$-th column and the $k(m)$-th row should be replaced with zeros, and the elements $h_{i(m),l(m)}(t)$ should be recalculated, which gives the matrix $h'_m(t)$. The resulting expressions use a row vector in which the $j(m)$-th element is equal to one and the other elements are equal to zero, and a column vector in which the $k(m)$-th element is equal to one and the other elements are equal to zero. In the semi-Markov process $h'_m(t)$ there are at least two absorbing states, namely $a_{k(m)}$ and $a_{J(m)}$, so the group of events of reaching $a_{k(m)}$ from $a_{j(m)}$ is not full, and in the general case the density obtained is weighted rather than pure.
In the second case, $j(m) = k(m)$, the absorbing substate is the additional state $a_{J(m)+1}$, and the column vector used is the one in which the $[J(m)+1]$-th element is equal to one and the other elements are equal to zero. In the semi-Markov process $h''_m(t)$ there are two absorbing states, namely $a_{J(m)}$ and $a_{J(m)+1}$, so the group of events of reaching $a_{J(m)+1}$ from $a_{j(m)}$ is not full, and in the general case the density obtained is weighted rather than pure. In both cases the target state is reached from $a_{j(m)}$ with a probability that is in general less than one, and the pure time density of wandering is obtained by dividing the weighted density by this probability. To reduce the computational complexity of evaluating (13) and (17), it is necessary to take advantage of the B. Grigelionis theorem [24] and the recursive procedure [21].
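The restructuring step (zeroing the $j(m)$-th column and the $k(m)$-th row) and the fact that the resulting group of events is not full can be illustrated on the stochastic part of the matrix alone. The sketch below is illustrative and does not reproduce the source's expressions. The five-state chain and the function names are hypothetical. The sum over all powers of the modified matrix is evaluated with the standard closed form $(I - P)^{-1} - I$ of a convergent matrix geometric series, rather than by raising the matrix to growing powers.

```python
import numpy as np

def make_start_and_absorbing(p, j, k):
    """Return a copy of p in which state j has no incoming arcs (starting)
    and state k has no outgoing arcs (absorbing)."""
    p_mod = np.array(p, dtype=float)
    p_mod[:, j] = 0.0   # delete all arcs leading to state j
    p_mod[k, :] = 0.0   # delete all arcs leading from state k
    return p_mod

def reach_probabilities(p_mod):
    """Sum of p_mod**n over n >= 1, computed as (I - p_mod)^-1 - I."""
    n = p_mod.shape[0]
    identity = np.eye(n)
    return np.linalg.inv(identity - p_mod) - identity

# Hypothetical unit: 0 = start, 1..3 = intermediate, 4 = destroyed (absorbing).
p = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.0, 0.3],
    [0.0, 0.4, 0.0, 0.6, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

j, k = 1, 3
p_mod = make_start_and_absorbing(p, j, k)
reach = reach_probabilities(p_mod)

# The probabilities of ending in state k or in the original absorbing
# state 4 do not add up to one, because the arcs returning to state j
# were deleted.
print(reach[j, k], reach[j, 4])
```

In this example the probability of reaching state 3 from state 1 is 0.7 · 0.6 = 0.42, so a density obtained this way is weighted and has to be divided by 0.42 to become a pure density.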
According to the theorem of B. Grigelionis, a combination of non-Poisson flows approximately converges to a flow with Poisson properties [25]. So $f_{0(m),J(m)}(t)$, $f'_{j(m),k(m)}(t)$, and $f''_{j(m),J(m)+1}(t)$ may each be approximated with the exponential distribution densities (22), (23), (24), each of the form $\frac{1}{T}\exp\left(-\frac{t}{T}\right)$, where T is the expectation of the corresponding density. In this way, the description of time intervals between events based on semi-Markov processes is replaced by a description based on pure Markov processes. Of course, such a substitution coarsens the model [26], but it substantially simplifies the calculations when one investigates and/or solves practical problems of fault-tolerant system design. As seen from (22), (23), (24), the exponential density has only one parameter, namely the expectation, which is all one needs for a comprehensive description of the event flow. So the method should be oriented toward accelerated numerical calculation of the expectations. Moreover, the method should not include raising a matrix to a power that tends to infinity, as is done in (6), (11), (15).
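Once a density is replaced by an exponential one, a single expectation determines both the density and the probability of remaining operable. The sketch below is illustrative only. The function names and the value of the expectation are assumptions, and the formulas are the standard ones for an exponential distribution.

```python
import math

def exponential_density(t, expectation):
    """Exponential density with the given expectation T: exp(-t/T) / T."""
    return math.exp(-t / expectation) / expectation

def survival_probability(theta, expectation):
    """Probability that no event occurs during time theta: exp(-theta/T)."""
    return math.exp(-theta / expectation)

# Hypothetical expectation of a unit lifetime, in arbitrary time units.
T = 2.0
print(exponential_density(0.0, T))    # density at t = 0 equals 1/T
print(survival_probability(T, T))     # survival over one expectation, exp(-1)
```

The survival probability over an interval equal to the expectation is exp(-1), roughly 0.37, whatever the value of T. This may be a useful sanity check when comparing the approximation with a histogram.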

The recursive method of expectation calculation
To describe the method, let us consider an abstract Markov process described by the matrix $x(t)$ (25). The abstract Markov process may be represented with the stochastic matrix $q = (q_{r,s})$ and the expectation matrix $Y = (Y_{r,s})$, obtained from $x(t)$ as follows:
$$q = \int_0^\infty x(t)\,dt, \qquad Y' = \int_0^\infty t\,x(t)\,dt, \qquad Y = Y'/q,$$
where $Y'$ is the matrix of weighted expectations and $Y'/q$ is the operation of direct (element-by-element) division of matrices.

To realize the recursive method [21], let us introduce four elementary transformations of the Markov process structure, which are shown in Fig. 2.
Fig. 2 a shows the merging of sequentially situated states. The probability and the expectation of the time of wandering from $a_r$ to $a_s$ are determined by dependencies (29), where $q_l \in (q_{r,s})$, $l \in \{r, r+1, \ldots, l, \ldots, s-1, s\}$, are the probabilities of switching from the previous state of the wandering trajectory shown in Fig. 2 a to the next state of the same trajectory, and $Y_l \in (Y_{r,s})$ are the time expectations of residence in the states belonging to that trajectory.
Fig. 2 b shows the merging of parallel arcs. The probability and the time expectation of switching from r to s are determined by dependencies (30), where l is the number of the arc leading from r to s, and $q_{r,l,s}$, $T_{r,l,s}$ are the probabilities and expectations of wandering from r to s along the l-th arc.
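The two transformations just described can be sketched in a few lines. The formulas in this sketch are assumptions, because dependencies (29) and (30) are not reproduced here. For a sequential trajectory the probabilities are multiplied and the expectations added. For parallel arcs the probabilities are added and the expectation is the probability-weighted mean. The function names and the numbers are hypothetical.

```python
import math

def merge_series(q_list, y_list):
    """Merge sequentially situated states into a single arc.

    Assumed form: probability is the product of the switching
    probabilities, expectation is the sum of the residence-time
    expectations along the trajectory.
    """
    return math.prod(q_list), sum(y_list)

def merge_parallel(q_list, t_list):
    """Merge parallel arcs between the same pair of states.

    Assumed form: probability is the sum over arcs, expectation is the
    mean of the arc expectations weighted by the arc probabilities.
    """
    q_total = sum(q_list)
    t_mean = sum(q * t for q, t in zip(q_list, t_list)) / q_total
    return q_total, t_mean

# Hypothetical numbers, in arbitrary time units.
print(merge_series([0.5, 0.8], [1.0, 2.0]))
print(merge_parallel([0.2, 0.3], [1.0, 2.0]))
```

With these inputs the series merge gives probability 0.4 and expectation 3.0, and the parallel merge gives probability 0.5 and expectation 1.6.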
It should be noted that initially the Markov process $x(t)$ has neither parallel arcs nor loops. These structural elements appear during the recursive transformations of the process (25). The recursive transformation presupposes sequential elimination of states, starting from the s-th, until the necessary number remains. Let the graph at the first step of the recursion be a complete graph with S states and no loops. At the step under discussion only s states remain in the graph. The states are renumbered, in comparison with the numbering of (25), so that the initial and destination states have numbers one and two, respectively, which simplifies the indexing of the recursion procedure.
In order to ensure the equivalence of the transformations, the probabilities and expectations must be recalculated in accordance with dependencies (32) and (29). At the second step, in accordance with (30), parallel arcs should be merged, and both the probabilities and the expectations should be recalculated. At the third step, the loops that emerge after the double splitting should be eliminated, and the probabilities and expectations should be recalculated in accordance with (31). After the loops are deleted, only s - 1 states remain in the graph, and it is ready for the next recursion stage. The recursion must be continued until only two states (in the case of $h_m(t)$) or four states (in the cases of $h'_m(t)$, $h''_m(t)$), including the states under investigation, remain in the reduced Markov process. The result of the calculation is the time expectations of the Markov processes (22), (23), or (24), which can be used to investigate competition in the fault-tolerant system.
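The elimination loop can be sketched as follows. This is a generic state-elimination scheme written under stated assumptions, not a transcription of dependencies (29) to (37). Each step removes one state and folds the series merge, the parallel merge, and the loop elimination into a single update of the probabilities `q` and the weighted expectations `w = q * y`. The function name and the three-state example are hypothetical.

```python
import numpy as np

def expected_time_by_elimination(q, y, start, target):
    """Probability and expected time of reaching `target` from `start`.

    q[i, j] is the switching probability, y[i, j] the expected time of
    that switch. States other than start and target are eliminated one
    by one; no matrix is raised to a growing power.
    """
    q = np.array(q, dtype=float)
    w = q * np.array(y, dtype=float)          # weighted expectations
    for s in range(q.shape[0]):
        if s in (start, target):
            continue
        d = 1.0 - q[s, s]                     # probability of leaving the loop at s
        if d <= 0.0:
            raise ValueError("state %d can never be left" % s)
        a, b = q[:, s].copy(), q[s, :].copy()
        wa, wb = w[:, s].copy(), w[s, :].copy()
        # Paths i -> s -> (loop any number of times) -> j, merged with arc i -> j.
        w += (np.outer(wa, b) + np.outer(a, wb)) / d + np.outer(a, b) * w[s, s] / d**2
        q += np.outer(a, b) / d
        q[s, :] = 0.0
        q[:, s] = 0.0
        w[s, :] = 0.0
        w[:, s] = 0.0
    # Only start and target remain; remove the loop at the start state.
    d = 1.0 - q[start, start]
    prob = q[start, target] / d
    weighted = w[start, target] / d + q[start, target] * w[start, start] / d**2
    return prob, weighted / prob

# Hypothetical unit: 0 = working, 1 = under repair, 2 = destroyed.
q_unit = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
y_unit = [[0.0, 1.0, 1.0], [0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(expected_time_by_elimination(q_unit, y_unit, start=0, target=2))
```

For this unit the expected time to destruction E satisfies E = 1 + 0.5 · (0.5 + E), which gives E = 2.5. The sketch returns that value together with a reaching probability of 1.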

Interaction in fault-tolerant system
In a real fault-tolerant system (1), the semi-Markov processes $\mu_m$ do not operate separately, each by itself; they interact with one another [16, 18, 19]. So an abstraction such as the complex M-parallel semi-Markov process [16] may be created, given by (38), where ${}^{M}A$ is the set of functional states and ${}^{M}h(t)$ is the semi-Markov matrix.
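After each unit has been reduced to an exponential flow, the competition between units for the next event has a simple closed form. The sketch below uses the standard property of independent exponential times and is not a reproduction of (38) or of the expressions that follow it. The function name and the expectations are hypothetical.

```python
def compete(expectations):
    """Competition of independent exponential flows.

    Returns the expected time to the first event among all units and,
    for each unit, the probability that this unit produces that event.
    Assumes each unit's time is exponential with the given expectation.
    """
    rates = [1.0 / t for t in expectations]
    total_rate = sum(rates)
    first_event_expectation = 1.0 / total_rate
    win_probabilities = [r / total_rate for r in rates]
    return first_event_expectation, win_probabilities

# Hypothetical system of three units, expectations in arbitrary time units.
print(compete([2.0, 2.0, 4.0]))
```

For expectations 2, 2 and 4 the first event is expected after 0.8 time units, and the units produce it with probabilities 0.4, 0.4 and 0.2.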

Example
Consider a system that includes units whose fault/recovery structure is shown in Fig. 2 c (right, without side arcs). The semi-Markov model of the unit fault/recovery process is given by (42), where r, s are the states that simulate operation between failures and overall destruction of the unit, respectively; $g(t)$ is the time density with expectation T; p is the probability of a short-time failure; $(1 - p)$ is the probability of unit destruction.
In accordance with the method discussed in Section 3, the semi-Markov process (42) is transformed into the process $\mu$ (43), whose structure is shown in Fig. 2 c (left, without side arcs) and which is expressed through the unit lifetime.
Conformity of the lifetime interval to the exponential law was verified with the direct Monte Carlo method for the case when $g(t)$ is a uniform distribution, $g(t) = 1$ when $|t - 1| \le 0.5$. The result of the verification is shown in Fig. 5, where the experimental histogram, which closely resembles the exponential law, is shown. The experimental expectation is equal to 1.96 units of time (error less than 1.5 % of the theoretical 2 units of time).
In the fault-tolerant system, M units are connected in parallel to provide passive redundancy [7], [8], [9], as shown in Fig. 5. The structural states of the semi-Markov model (43) fill the functional states shown in Fig. 5. The boundary of the functional state is used as the absorbing state in the structure. The state of the system as a whole is divided into hierarchical levels. The total time to failure of the fault-tolerant system as a whole and the backup efficiency are then evaluated, where $T_0$ is the time to failure of a single unit in the non-overloaded operating regime. Models based on the classical semi-Markov process theory [20, 31, 32] are rather cumbersome, but they make it possible to describe quite exactly a number of effects emerging in the systems under investigation. The theory of parallel semi-Markov processes is not in common use for solving urgent fault-tolerance problems, which explains the necessity of developing the proposed mathematical apparatus.
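The Monte Carlo check of the unit lifetime can be reproduced along the following lines. This is a sketch under assumptions, not the authors' experiment. The uniform density on [0.5, 1.5] is taken from the text. The value p = 0.5 is an assumption, chosen because it gives the theoretical lifetime T / (1 - p) = 2 for T = 1. The seed, the number of runs, and the function names are arbitrary.

```python
import random

def simulate_lifetime(p, rng):
    """One lifetime: periods of operation repeat until destruction.

    Each period lasts a uniform time on [0.5, 1.5] (expectation 1). It
    ends with a short-time failure with probability p, after which
    operation resumes, or with destruction with probability 1 - p.
    """
    total = 0.0
    while True:
        total += rng.uniform(0.5, 1.5)
        if rng.random() >= p:       # destruction of the unit
            return total

def mean_lifetime(p=0.5, runs=100_000, seed=1):
    """Average simulated lifetime over the given number of runs."""
    rng = random.Random(seed)
    return sum(simulate_lifetime(p, rng) for _ in range(runs)) / runs

print(mean_lifetime())
```

The printed estimate should lie close to the theoretical value of 2 time units. A histogram of the simulated lifetimes can then be compared with the exponential density that has the same expectation.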

CONCLUSION
To sum up, dependences for calculating the time intervals and probabilities of wandering through ordinary semi-Markov processes with the use of the recursive method have been obtained. Notions of the complex semi-Markov process, such as functional states and the Cartesian product of semi-Markov matrices, have been introduced. It is proposed to divide the simulation of fault-tolerant systems into three stages. At the first stage, ordinary semi-Markov models of the life cycles of separate units should be developed, and these models should be simplified to semi-Markov processes with a minimal number of states, in the limit to two-state Markov processes. At the second stage, the ordinary semi-Markov processes operating in parallel should be gathered into a complex semi-Markov process with functional states, the parameters of residence in which are calculated using the Cartesian multiplication of semi-Markov matrices. At the third stage, the abstraction of the complex semi-Markov process is used to estimate the reliability parameters of the fault-tolerant system as a whole. If the system has a more complex hierarchical structure, in which blocks of the next level are assembled from units of the previous level, then the complex semi-Markov processes describing the blocks may be treated as ordinary processes from which the complex process of the next hierarchical level is formed, and so on. The proposed approach makes it possible to create a model of a redundant system of any degree of complexity. Further research in this area may be directed toward simulating the large number of practical redundant systems with complex interactions between components and complex life-cycle algorithms. A method of fault-tolerant system optimization based on the Petri-Markov net approach may also be developed.