Measuring and Managing Operational Risk in Industrial Processes

The risk of not achieving process outcomes in key processes should be identified and reduced to achieve better results. The identification of risks involves defining the possible events that compromise the achievement of process outcomes, as well as the consequences of the occurrence of these events. This paper proposes a methodology to minimize operational/technical risk across different processes or departments, avoiding excessive spending of resources on a given process while other processes pose bigger risks to the organisation or considered system. The probability of occurrence of the events that hinder the achievement of outcomes can be influenced by technical, administrative or management actions. The methodology is tested through a case study of a CNC milling section in a furniture manufacturing process. The reliability of a key item, the electrospindle, is determined, and the influence of maintenance activities on this reliability, and therefore on risk, is analysed.


INTRODUCTION
In organisations, decisions and actions are taken on a daily basis to deliver products/services that fulfil customer orders. According to the Total Quality Management philosophy and the ISO 9001 quality standard, organisations should define processes to facilitate their management activities [1]. There are several process classifications [2], for example, operational (or value added, key or business), support and management processes. The ISO 9001:2015 [3] revision focuses on risk management at the organisational level. Risk management is introduced as a major change in ISO 9001:2015: the preventive action requirements of the previous version are replaced by the identification of risks and opportunities in key processes.
Risk management is a relevant field, which consists of dealing with unknown events that may influence the company goals or process outputs. Risk can be expressed as a function of the likelihood of occurrence of an event and its impact on the organisation [4]. The definition of risk adopted in this work is "a measure of the probability and consequence of not achieving a defined goal" [4].
The probability of occurrence of an event can be estimated if historical data about its occurrence are available. Many factors influence the consequences of a failure or of an event with some impact on the outcome; if these consequences can be translated into a cost, management decisions are facilitated. The associated cost will be of different magnitude depending on whether the event influences the outcomes directly or indirectly, and on the type of outcome affected. As an example, with regard to the manufacturing of nonconforming product, quality cost assessment ([5], [6]) provides a template to estimate the size of quality problems (from a customer perspective) to help justify an improvement effort. The cost of failure associated with a specific defect depends on the moment or place at which the defect is detected. For example, one defective unit at the end of the production process can cause a customer order to be unfulfilled, but a similar problem at the beginning of the productive system may be corrected without any impact on the fulfilment of the order. Thus, the cost of failure, the inefficiency of processes and lost sales may be the basis for estimating the impact of a manufacturing process failure. This cost also depends on the size of stock (buffers or work-in-process). A thorough analysis of such costs is presented in [7].
The research question explored in this paper is: how can risk be assessed across different processes or activities so that the actions to manage it are taken logically across the entire company or system and not only in a given process? The main objectives of this paper are to propose a methodology for risk analysis and management, and to test its application in a process of a large multinational company in the furniture industry.
In the application example, the studied process relies on machinery to carry out its activities. Focus is given to reliability analysis, which can be important for improving preventive maintenance decisions. A machine breakdown is analysed as an event that hinders the achievement of the expected yield. The case study shows how to select actions so that the risk associated with this process is reduced. For confidentiality reasons, the company name is not disclosed in this study.
The paper is organized as follows: in Section 2 a literature review is presented on risk management and on reliability analysis of manufacturing systems. Section 3 presents the proposed methodology for risk management in industrial systems. Section 4 shows an application example. Section 5 presents the main conclusions and further work.

Risk management and classification
Risk management can be characterised by a set of processes to identify and reduce risk. According to Kerzner [4], the following processes can be defined to manage risk: risk planning, risk assessment, risk handling and risk monitoring.
Risk assessment is a critical process to identify and evaluate events, e.g. critical technical risks, that could influence predefined objectives. This assessment is performed to increase the likelihood of meeting objectives (i.e. cost, performance and schedule). This process includes two relevant activities: risk identification and risk analysis. The latter examines each identified risk issue to estimate the likelihood of the risk and predict its impact on the process outcome.
To facilitate the identification of risks, several authors have proposed risk classifications. For example, [4] considers business risks (competitor activities, bad weather, inflation, recession, customer response and availability of resources) and insurable risks (direct property damage, indirect consequential loss, legal liability, personnel). The Project Management Institute [8] classifies risk as external-unpredictable (government regulations, natural hazards, acts of God), external-predictable (cost of money, borrowing rates, raw material availability), internal-nontechnical (labour stoppages, cash flow problems, safety issues, health benefit plans), technical (changes in technology, changes in the state of the art, design issues) and legal (licences, patent rights, lawsuits, subcontractor performance and contractual failure). In a renowned company (Boeing) the risk categories are: financial, market, technical and production [4].
Whatever the risk classification, it can always be questioned because risk, in some areas, evolves over time, and thus risk identification is a continuous process. Kaplan and Mikes [9] proposed a classification based on three categories: avoidable risks, strategic risks and external risks. The category "avoidable risks" covers risks that can be controlled and that derive from the internal dynamics of the organisation; they may, therefore, be eliminated or safeguarded against, as they do not translate into strategic benefits. In their opinion, the management of such risks is the result of active prevention, through monitoring of operational processes and guidance of behaviours and decisions towards the desirable standards, making clear from the outset the organisation's goals and the values by which it is governed.
The process of identifying, evaluating, selecting and implementing one or more strategies (assumption, avoidance, control or mitigation, or transfer) to bring risk to acceptable levels is called risk handling [4].
Despite the variety of organisations and contexts, organisations have intrinsic capacities that provide them with competitive advantage. Those capacities can be described as a set of key and support processes, which may fail to provide the expected outputs, putting the entire organisation at risk.
In this work, it is proposed that risks are identified in key processes where the organisation has control over factors that can change either the likelihood of occurrence of unwanted events or their impact on organisational goals. The following section reviews the literature on quantification and modelling of the likelihood of failure in industrial systems.

Reliability analysis of technology intensive manufacturing systems
For technology intensive manufacturing systems, the probability of occurrence of events that hinder outcomes is given by the reliability of the considered system, taking into account only events internal to the system. External events, such as the failure of the electrical power system or of the centralized industrial vacuum system, may be considered at a higher level. Over the past few decades, several approaches and methods have been developed for the reliability analysis of items (component, part, unit) and manufacturing systems. In general, there are two fundamental approaches to manufacturing system reliability evaluation: analytical methods and Monte Carlo simulation methods [10]. Analytical methods such as the Markov Chain method, the Contingency Enumeration method and the Minimum Cut Set method evaluate system reliability indices using numerical solutions [11]. Monte Carlo simulation methods, on the other hand, treat the problem as a series of stochastic experiments and estimate the reliability performance by simulating the actual process and the random behaviour of the system [12].
For the reliability analysis of systems using one of these methods, two main stages must be performed beforehand: (i) building a functional model that represents the relationship between components and their interactions; and (ii) computing the reliability of all system components.
The Reliability-Based Method is an approach to predicting system failure based on statistical reliability models of component failure. Reliability is defined as the probability that an item will be functioning at time t under certain conditions.
When the failure data of the items are available and exhibit high variability or randomness, probability distribution functions (e.g. Exponential, Lognormal, Weibull, Erlang, etc.) can be used for reliability modelling and analysis. Indeed, an effective reliability analysis of the system requires an accurate estimation of the failure distribution of the different items that constitute the system. However, before choosing a specific distribution for a given data set, it is important to know the characteristics of the various alternative distributions that can be used to model the data [13]. The standard (2-parameter) Weibull distribution is one of the most commonly used models in life data analysis due to its versatility ([14], [15]).
Over the years, several new models, in some way related to the standard Weibull distribution, have been proposed to capture the reliability of different components under different operational conditions with different behaviours. Murthy, Xie and Jiang [16] integrated the literature on Weibull models and developed a taxonomy for the classification of such models as members of the Weibull family of distributions.
Most of the distributions in the Weibull family have a characteristic shape on the Weibull probability plot (WPP). For example, a standard Weibull distribution appears as a straight line, a 3-parameter Weibull as a concave curve, and so on ([16], [17], [18]). Hence, the WPP can provide a systematic procedure to determine which model is more appropriate for a specific data set.
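The contrast between the two approaches can be illustrated with a minimal Python sketch. It assumes a hypothetical series system of two independent components with Weibull lifetimes; the function names, the parameter values and the series structure are illustrative assumptions and are not taken from the case study. The simulation estimate can be compared with the closed-form product of the component reliabilities.

```python
import math
import random


def simulate_series_reliability(scales, shapes, mission_time, n_trials=20000, seed=1):
    """Monte Carlo estimate of P(every component outlives mission_time).

    Assumes a series system of independent components with Weibull lifetimes
    (hypothetical structure, for illustration only).
    """
    rng = random.Random(seed)
    survived = 0
    for _ in range(n_trials):
        # weibullvariate(alpha, beta): alpha is the scale, beta is the shape
        lifetimes = [rng.weibullvariate(a, b) for a, b in zip(scales, shapes)]
        if min(lifetimes) > mission_time:  # a series system fails at the first failure
            survived += 1
    return survived / n_trials


def analytic_series_reliability(scales, shapes, mission_time):
    """Closed-form reliability of the same series system."""
    return math.prod(
        math.exp(-((mission_time / a) ** b)) for a, b in zip(scales, shapes)
    )


# Hypothetical parameters (hours); not data from the paper.
scales, shapes = [2000.0, 3000.0], [1.5, 1.2]
estimate = simulate_series_reliability(scales, shapes, mission_time=500.0)
exact = analytic_series_reliability(scales, shapes, mission_time=500.0)
print(f"simulated: {estimate:.3f}  analytic: {exact:.3f}")
```

For a system this simple the analytical result is preferable; simulation becomes attractive when the functional model includes repairs, buffers or dependencies that have no convenient closed form.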
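As a minimal sketch of what such a distribution provides, the following Python functions evaluate the reliability and the probability of failure for a Weibull model. The parameter names (shape, scale, location) and the numerical values are assumptions chosen for illustration; with location equal to zero the function reduces to the standard 2-parameter form.

```python
import math


def weibull_reliability(t, shape, scale, location=0.0):
    """R(t) = exp(-((t - location) / scale) ** shape) for t > location, else 1."""
    if t <= location:
        return 1.0  # no failure is possible before the location parameter
    return math.exp(-(((t - location) / scale) ** shape))


def weibull_unreliability(t, shape, scale, location=0.0):
    """F(t) = 1 - R(t), the probability of failure by time t."""
    return 1.0 - weibull_reliability(t, shape, scale, location)


# Hypothetical parameters (hours); not estimates from the case study.
for hours in (500, 1000, 2000):
    r = weibull_reliability(hours, shape=1.8, scale=2500.0)
    print(f"R({hours}) = {r:.3f}")
```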
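The straight-line behaviour of the standard Weibull on the WPP follows from a double logarithmic transformation of its distribution function. The symbols below (shape β, scale η, location γ) are notation assumed here for illustration and may differ from the notation of the cited references.

```latex
% Standard (2-parameter) Weibull distribution function
F(t) = 1 - \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]

% Taking logarithms twice gives a straight line in x = \ln t
y = \ln\!\bigl[-\ln\bigl(1 - F(t)\bigr)\bigr] = \beta \ln t - \beta \ln \eta

% With a location parameter \gamma > 0 the same transformation gives
y = \beta \ln\!\left(e^{x} - \gamma\right) - \beta \ln \eta ,
\qquad
\frac{d^{2}y}{dx^{2}} = -\,\frac{\beta \gamma e^{x}}{\left(e^{x} - \gamma\right)^{2}} < 0
```

The slope of the line is the shape parameter and the intercept gives the scale parameter, while the negative second derivative in the last expression shows why a positive location parameter bends the plot into a concave curve.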

PROPOSED METHODOLOGY FOR RISK MEASURING AND MANAGING
Considering a process with a set of outputs, the overall process may depend on humans, equipment, material and other support processes to work properly. Processes may be dominated by some of these factors.
The process often has a set of attributes or solutions that provide flexibility and robustness. For example, it may have machine redundancy, or safety stock to cope with process failures. It may have contracts with operators who are "on call", or contracts with other organisations that perform the same activities. Solutions of this type may have two implications: they may affect the probability of process failure and/or the consequence of process failure. Typically, all of them imply costs for the organisation.
To increase the sustainability of the organisation, the above consequences should be minimized across the organisation, i.e. across its key processes. However, these solutions aim to cope with the risk associated with the processes or activities. When defining solutions to avoid or mitigate the consequences of undesirable events, risks should be evaluated in order to prioritize the actions or solutions that reduce the global impact on the outcomes of the organisation or system.
The proposed methodology aims to identify and focus on major risks. It requires historical data to estimate the probabilities of failure. It also assumes the existence of solutions or actions that make it possible, at an associated cost, to change the probability of failure or the consequence of failure. These solutions are designated in the proposed methodology as controllable factors.
The methodology has the following stages:
1. Identify the system (the entire organisation, the manufacturing system, etc.) and its key processes.
For each key process, stages 2-7 must be followed:
2. Identify process outcomes and respective internal events that result in process failure (not achieving the process outcome).
3. Select the dominant event in terms of impact on the results. In this stage a Failure Mode and Effect Analysis (FMEA) can be performed to select the event (or the vital few events) with the highest impact on the system.
4. Estimate the consequence of process failure for the selected event. The consequences can be estimated based on cost. This information is useful to the decision maker, since it can help to decide whether further investment is needed to reduce the failure probability or the consequence. The cost differs depending on whether the event has a direct or indirect impact on the outcomes, and on its intensity or severity. Even if its true value is not known, its relative value compared to other processes is valuable information.
5. Collect data about the occurrence of the event: (a) identify controllable factors that influence the event occurrence; (b) collect historical data about the occurrence of failures.
6. Select a model to fit the data: (a) preliminary analysis of data; (b) estimate model parameters; (c) validate the model.
8. After calculating the respective risk for each key process and identifying the associated controllable factor(s), the decision maker analyses the data in order to reduce the level of internal risk of the considered system. Reducing the level of internal risk requires actions on controllable factors across all key processes, focusing first on the processes with the highest associated risk. For instance, consider a process A with a probability of failure P_A = 0.1 and another process B with the same probability of failure P_B = 0.1 and a consequence of failure C_B = 2*C_A. Without knowing the exact value of C_A, it is recommended that P_B be reduced to 0.05 or less in order to reduce the overall internal risk.
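The comparison between processes A and B can be reproduced with a short Python sketch. Risk is taken here as probability of failure multiplied by consequence, with consequences expressed relative to C_A; the function name and the dictionary layout are illustrative assumptions rather than part of the methodology.

```python
def relative_risk(probability, relative_consequence):
    """Risk as probability of failure times consequence (relative units)."""
    return probability * relative_consequence


# Values from the example above: P_A = P_B = 0.1 and C_B = 2 * C_A.
# Consequences are expressed relative to C_A, so C_A = 1.0 and C_B = 2.0.
processes = {"A": (0.1, 1.0), "B": (0.1, 2.0)}

risks = {name: relative_risk(p, c) for name, (p, c) in processes.items()}

# Probability of failure that would bring process B down to the risk of process A.
target_p_b = risks["A"] / processes["B"][1]

print(risks)
print("P_B needed to match the risk of A:", target_p_b)
```

Working in relative units is enough to rank the processes, which is why the exact value of C_A is not required.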
For manufacturing processes dominated by machines, some of the steps of the methodology can be explored in more detail as follows.

Identify the system (1)
It is important to establish the boundaries of the system, since events such as a delay in material delivery or the failure of the power system may be considered external events, outside the responsibility of a manufacturing system such as a milling centre.

Identify process outcomes and respective internal events that result in process failure (2)
The outcome of a manufacturing system consists of producing quality products on time. In a process dominated by machines, equipment failures or malfunctions are the internal events to be considered, since other types of events can be considered external, as mentioned in step 1.

Estimate the consequence of process failure for the selected event (4)
This cost may be estimated per unit not produced or per period (e.g. day). This issue is addressed in [7].

Collect data about the occurrence of the event (5)
a. Identify controllable factors that influence the event occurrence
The main actions that can affect the occurrence of failure, reducing its probability, are preventive maintenance and the addition of redundancies. Some actions or solutions, such as holding stock, can also be considered to mitigate the impact of failure on the process outcome.
b. Collect historical data about the occurrence of failures
The basic data in reliability analysis are time between failures (TBF) for a repairable component and time to failure (TTF) for a non-repairable component. Life data (TBF or TTF) can be classified into two types: complete and censored data. In complete data, the actual realized values are known for each observation in the data set. In censored data, the actual realized values of some or all of the variables are not known. The most common case of censoring is right-censored data, also called suspended data. In the case of life data, these data sets include components that did not fail ([19], [20]). Other types of censoring arise when, for instance, the date of failure of the component is known but the date of installation is not. For further details of the different types of censored data, see [21]. It should be noted that, in a given data set, the uncertainty associated with the reliability analysis increases as the number of censored observations increases.

Select a model to fit data (6)
a. Preliminary analysis of data
Preliminary statistical analysis includes calculating the maximum, minimum, mean, variance, median, trends and correlation coefficient of the data set. This gives an idea of which mathematical model is best suited to the data. If the failure data of a component exhibit a high degree of variability or randomness, probabilistic and stochastic models can be applied to the data set. On the other hand, if the range of the data (range = max − min) is small compared to the sample mean and the component is not critical for the system, the variability of the data can be ignored and the sample mean can be used to model the data. In other words, the mean of the sample represents the mean time to failure (MTTF) of the component. The reliability analysis continues by taking a sample of failure data for model selection and validation. The size of the data set is an important factor when taking this sample. If n denotes the size of the data set, the data set is small when n < 20, medium when 20 ≤ n ≤ 50 and large when n > 50 [16]. For small and medium data sets, all data should be used for the WPP [22].
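A minimal Python sketch of this preliminary analysis is given below. The sample-size thresholds are those quoted from [16]; the function names and the sample values are hypothetical, and no cut-off is coded for when the range is "small" relative to the mean, since the text does not specify one.

```python
import statistics


def classify_sample_size(n):
    """Size classes quoted in the text: small (< 20), medium (20-50), large (> 50)."""
    if n < 20:
        return "small"
    if n <= 50:
        return "medium"
    return "large"


def preliminary_summary(times):
    """Descriptive statistics used to judge whether variability can be ignored."""
    mean = statistics.mean(times)
    data_range = max(times) - min(times)
    return {
        "n": len(times),
        "size_class": classify_sample_size(len(times)),
        "min": min(times),
        "max": max(times),
        "mean": mean,
        "median": statistics.median(times),
        "variance": statistics.variance(times),  # sample variance
        "range": data_range,
        "range_over_mean": data_range / mean,
    }


# Hypothetical times between failures in hours; not the data of Table 1.
sample = [420.0, 980.0, 1510.0, 2230.0, 3100.0]
for key, value in preliminary_summary(sample).items():
    print(f"{key}: {value}")
```

The ratio of range to mean is reported rather than judged, leaving the decision on whether the variability can be ignored to the analyst.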

b. Estimate model parameters
There are several methods for obtaining parameter estimates of probability distributions. These include graphical methods, the method of moments, maximum likelihood and least-squares estimation, Bayes methods, nonlinear median rank regression and Monte Carlo simulation methods, among many others [23]. The accuracy of the estimation depends on the size of the data set and on the estimation method. In some cases, such as small data sets, graphical methods provide a more precise estimation.
c. Validate the model
The last step in the process of selecting an appropriate model is validation. A basic principle in model validation is the assessment of the predictive power of the model. For small and medium data sets, where all the data are used in model selection and parameter estimation, the validation can be carried out using methods such as the chi-square test, the Kolmogorov-Smirnov (KS) test, the Coefficient of Correlation (R²) and the Anderson-Darling (AD) test ([16], [24]).
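One of the listed options, median rank regression, can be sketched in a few lines of Python. The sketch makes several assumptions that should be checked against the actual data: it handles complete (uncensored) data only, it fits a 2-parameter Weibull, it uses Bernard's approximation for the median ranks, and it regresses the transformed rank on the logarithm of time. Software packages differ in these choices, so the results need not match those of any particular tool.

```python
import math


def median_rank_regression(failure_times):
    """Least-squares fit of a 2-parameter Weibull on the probability plot.

    Assumes complete data; censored observations would require adjusted
    ranks, which this sketch does not implement.
    """
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        f = (i - 0.3) / (n + 0.4)  # Bernard's approximation of the median rank
        xs.append(math.log(t))
        ys.append(math.log(-math.log(1.0 - f)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    shape = slope                          # slope of the fitted line
    scale = math.exp(-intercept / slope)   # from intercept = -shape * ln(scale)
    return shape, scale


# Hypothetical complete failure times in hours; not the data of Table 1.
shape, scale = median_rank_regression([310.0, 720.0, 1150.0, 1640.0, 2400.0])
print(f"shape = {shape:.2f}, scale = {scale:.0f} h")
```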

APPLICATION EXAMPLE
In this section, the central part of the proposed methodology is implemented in a manufacturing process dominated by a machine. The probability of failure of the most critical item is calculated and the influence of preventive maintenance (the considered controllable factor) on risk is analysed.

Identifying the system and one key process
The study was conducted in the production system of a furniture company. One of the key processes of this production system is the milling process, which has four CNC milling machining centres (MMC). One of the main features of these MMC (with automatic tool changers, tool magazines or carousels, CNC control, coolant systems, and enclosures) is their ability to perform a wide variety of three-dimensional work on panels of various types, such as solid wood, chipboard and MDF (Medium-Density Fibreboard). Cutting can be performed on parallel surfaces, on perpendicular planes or at different angles, which allows circular, elliptical, ball-shaped, concave and convex grooves to be milled quickly and accurately.
Since the four MMC are similar, the management of the stock of spare parts becomes more efficient, achieving better service levels with lower safety stocks, and consequently, lower costs.

Identifying dominant failure event
The cost of spares and repair of the electrospindles (Figure 1), also called CNC milling motors, is very high compared to other machine parts (a spare costs around €12,000 and a repair approximately €5,000). Moreover, the failure of this item stops the production centre, which also entails high costs of lost availability. Thus, a reliability study of this item is justified in order to estimate indexes that support informed decisions. The combination of different forms of cutters (Figure 2) confers special features on these machines, and in particular advantages over other machine tools.

Figure 2. Milling cutters combination of CNC machining centre
Electrospindles are crucial items for the performance of CNC milling machines. The main problems of these items are their failures and the long time they spend in the failed state, when compared to other machine items.

Identify controllable failure factors
The maintenance policy is the main controllable factor that affects the electrospindles' probability of failure. To support the planning of preventive maintenance actions for these items, a reliability study based on the failure times was carried out.

Collect historical failure data
Each electrospindle has 4 axes, and the data collected consist basically of the exact time of failure of a particular axis. Of the 4 MMC, only 3 have available data. The fourth MMC was recently installed and there are no data available for it.
The time between failures of the electrospindles was compiled from the daily maintenance reports for the period from 19/01/2013 to 15/07/2014. The time between failures of failed electrospindle units was calculated based on failure date and installation date. There are 30 observations, of which 11 are classified as electrospindle failures (complete data) and the remaining 19 are censored data from units that had not failed at the time of observation. The company works five days a week, with two daily shifts of 8 hours each. Furthermore, it is assumed that the operational conditions are the same for all the electrospindles.
Table 1 shows the data set of electrospindles obtained in this study, which includes 30 failure and censored observations. Failure data (uncensored) are marked with an "F" and data that do not correspond to a failure (right censored) with a "C".
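The conversion from installation and failure dates to operating hours can be sketched in Python as follows. The sketch assumes a Monday-to-Friday working week with 16 operating hours per working day, as stated above, and it additionally assumes no holidays or other stoppages and whole-day resolution; the function name and the example dates are hypothetical.

```python
from datetime import date, timedelta

HOURS_PER_WORKING_DAY = 2 * 8  # two daily shifts of 8 hours each


def operating_hours(start, end):
    """Operating hours from start (inclusive) to end (exclusive).

    Counts Monday-Friday only and assumes no holidays or other stoppages,
    which is a simplification made for this sketch.
    """
    working_days = 0
    current = start
    while current < end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            working_days += 1
        current += timedelta(days=1)
    return working_days * HOURS_PER_WORKING_DAY


# Hypothetical installation and failure dates inside the observation period.
installed = date(2013, 1, 21)
failed = date(2013, 6, 3)
print(operating_hours(installed, failed), "operating hours")
```

For a right-censored unit, the same function would be called with the end of the observation period in place of the failure date.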

Preliminary analysis
The distribution of the failure data of Table 1 by milling line (MMC1, MMC2 and MMC3) and axis (Z1, Z2, Z3 and Z4) is shown in Table 2. The data in Table 2 do not show significant differences in the number of failures between MMC or between axes. Technically there is also no difference between the three MMC, all axes operate in a similar manner, and the data per machine or per axis are very scarce. Note also that the electrospindle failure mode that underlies these failure data is always the same: bearing "flu". For all these reasons, the data in Table 1 are treated as being taken from a homogeneous population. Table 3 shows the results of the preliminary analysis of the electrospindle failure data. Based on these results, since the range of the data is not small relative to the sample mean, the variability of the data cannot be ignored. Hence, probability models should be used to model the reliability of the electrospindles.
A study of four candidate distributions (three of the Weibull family and the Logistic distribution), based on the data in Table 1 and Table 3 and carried out with the Minitab16 software, gave the results shown in Figure 3. Based on a visual analysis of the probability plots, since the empirical unreliability (probability of failure) of the data set is close to the theoretical unreliability, the 3-parameter Weibull seems to be the most suitable model. Table 6, presented below, supports the selection of this model.

Parameter estimation
In this study, the WPP is used to compute the values of the different parameters of the Weibull distribution. Parameter estimation (Table 4) using the Least Squares Estimation Method (LSXY Estimates) and the WPP (Figure 4) were carried out using the Minitab16 software.

Model validation
A basic principle in model validation is the assessment of the predictive power of the model. Following the literature ([16], [24]), the quality of the fit, and therefore the model validation, was assessed using the Correlation Coefficient (R²) and the Anderson-Darling (AD*) test. The results obtained are presented in Table 5. From Eqs. (1), (2), (3) or (4), the probability of failure of the electrospindles for a given time between preventive maintenance actions can be determined. Alternatively, the time between preventive maintenance actions needed to obtain a given probability of failure can be determined.
To find out how the required reliability level (e.g. 0.80, 0.90, 0.95 and 0.975) affects the time interval between maintenance actions, Eq. (5) is solved for the variable t and the values are tabulated in Table 6. For example, to achieve a reliability of 0.90 for the electrospindles, Eq. (5) is solved, giving a value for t of 1064 hours.
Thus, the maintenance interval will be equal to 2794 hours (at t=0 the equipment has already been in operation for 1730 hours).
This table can be used to find a suitable preventive maintenance interval (i.e. time interval for repair, servicing or replacement). By choosing the interval from this table, the probability of system failure can be changed. Considering the consequences estimated by a cost C_f [7], the process risk is F(T)*C_f, where T represents the maintenance interval.
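The link between the required reliability, the maintenance interval and the process risk F(T)*C_f can be illustrated with a minimal Python sketch. It assumes a Weibull model with hypothetical shape, scale and location parameters and a hypothetical failure cost; these are not the estimates of Table 4, and the sketch does not reproduce the values of Table 6.

```python
import math


def time_for_reliability(target_reliability, shape, scale, location=0.0):
    """Time t at which a Weibull model reaches R(t) = target_reliability."""
    return location + scale * (-math.log(target_reliability)) ** (1.0 / shape)


def process_risk(interval, shape, scale, failure_cost, location=0.0):
    """Risk F(T) * C_f for a maintenance interval T under the same model."""
    if interval <= location:
        probability_of_failure = 0.0
    else:
        probability_of_failure = 1.0 - math.exp(
            -(((interval - location) / scale) ** shape)
        )
    return probability_of_failure * failure_cost


# Hypothetical model parameters (hours) and failure cost (euros).
shape, scale, cost = 1.5, 3000.0, 20000.0
for reliability in (0.80, 0.90, 0.95, 0.975):
    interval = time_for_reliability(reliability, shape, scale)
    risk = process_risk(interval, shape, scale, cost)
    print(f"R = {reliability}: T = {interval:.0f} h, risk = {risk:.0f} EUR")
```

A shorter interval lowers F(T) and therefore the risk, at the price of more frequent maintenance, which is the trade-off the decision maker adjusts through this controllable factor.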

CONCLUSIONS
Considering risks in an organisation is no longer a distinctive activity but will be a basic requirement for future holders of ISO 9001:2015 certification. Amongst a variety of risks, internal controllable risks can be determined and managed. A methodology is proposed to be applied in key processes (operational and support processes). For each process, the probability of failure and the consequences of such failure (expressed as a financial cost) should be determined.
The goal of the organisation should not be to bring all key processes to the same probability of failure but instead to have similar levels of risk.
The proposed methodology was tested in a company, where the method of estimating the probability of failure was detailed.
After determining the probability of failure as a function of a controllable factor (the interval between preventive maintenance actions), management has a mechanism to adjust the risk of the studied process. This quantitative methodology prevents excessive actions to mitigate risk in a given process (with associated costs) alongside an absence of actions in other critical processes.
The methodology was applied to a process whose failure depends heavily on equipment but, if historical data exist, it could be replicated in other processes. As future work, the authors will carry out further validations in organisations from different activity sectors.
The proposed methodology could influence not only risk management but also knowledge management, because it is a quantitative method that strives to estimate probabilities of failure and associated costs, giving collaborators a common language for discussion.
Finally, the proposed methodology supports top management in complying with the new requirements of the ISO 9001 standard and, more generally, in dealing with risk in a quantitative way.

Figure 3. Probability Plots for failure time of electrospindles

Figure 5. Distribution Overview Plot for life time of the electrospindles