Variational Inference for Robust Sequential Learning of Multilayered Perceptron Neural Network

We derive a new sequential learning algorithm for the Multilayered Perceptron (MLP) neural network that is robust to outliers. The presence of outliers in data can cause model failure, especially when data processing is performed on-line or in real time. The Extended Kalman filter robust to outliers (EKF-OR) is a probabilistic generative model in which the measurement noise covariance is modeled as a stochastic process over the set of symmetric positive-definite matrices, with an inverse Wishart prior. The derivation follows directly from first principles within the Bayesian framework. The analytical intractability of the Bayes update step is resolved using Variational Inference (VI). Experimental results obtained using real world stochastic data show that an MLP network trained with the proposed algorithm achieves low error and an average improvement rate of 7% when compared directly to the conventional EKF learning algorithm.


INTRODUCTION
Outliers are of enormous importance when modelling engineering problems in which mathematical models of the physical system operate on-line. Outliers are defined as observations that significantly differ from the rest of the data [1], [2]. In engineering applications on-line processing of data is essential [2,3,4,5,6,7,8,9,10,11], and failing to recognize, identify and process outliers may seriously jeopardize system performance and eventually cause failure. Outliers may occur by chance but, more often, they originate from temporary sensor failures, unknown system anomalies, unmodeled reactions from the environment, or other disturbances [2].
We develop an original sequential learning algorithm for the Multilayered Perceptron (MLP) neural network. A system with this ability is of great importance for engineering because it bypasses off-line identification and removal of outliers. Furthermore, for sequential systems it is essential to process outliers as the data arrive. Our algorithm is based on the conventional extended Kalman filter (EKF), but with the ability to process outliers during the learning process like any other data point.
The structure of the paper is as follows. The second section provides a brief survey of related research efforts. The third section provides a detailed and thorough derivation of EKF-OR. Experimental results are given in the fourth section, while concluding remarks close the paper.

LITERATURE REVIEW AND CONTRIBUTIONS OF THE PAPER
Robust statistics is a broad research field; in this research we focus on robust sequential algorithms for neural network training. For a wider perspective and additional concepts in general robust statistics, the reader is referred to [12,13,14,15,16] and references therein. The dominant approach in robust neural network training is to use a robust cost function called an M-estimator [16]; the research community has proposed a number of M-estimators suited for this job: Hampel [17], Welsch [18], and Tukey's biweight [19], among others. All these robust cost functions enable down-weighting of outliers.
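As an illustration of the M-estimator idea, the sketch below implements Tukey's biweight weight function; the tuning constant c = 4.685 is the common choice for 95% efficiency under Gaussian noise (our choice for illustration). Residuals beyond c receive zero weight, i.e. gross outliers are completely rejected:

```python
import numpy as np

def tukey_biweight(residual, c=4.685):
    """Tukey's biweight weighting: quadratic taper inside |r| <= c,
    zero weight (complete rejection) for residuals beyond c."""
    r = np.abs(np.asarray(residual, dtype=float))
    u = np.minimum(r / c, 1.0)   # clamp so the weight reaches 0 at |r| = c
    return (1.0 - u ** 2) ** 2
```

A residual of zero gets full weight 1, moderate residuals are smoothly down-weighted, and anything past the cutoff contributes nothing to the fit.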
Another approach is to identify and separate outliers before learning; one then trains the model on data free of outliers [20,21,22,23].
Finally, the third approach is to use hybrid algorithms and hope for the best. One may find hybrids of fuzzy and Radial Basis Function (RBF) networks trained with Particle Swarm Optimization (PSO) [24], Support Vector Machines (SVM) combined with RBF and fuzzy inference [25], and Support Vector Regression combined with RBF networks [21].
The main features that set our paper apart from other research efforts are: 1. EKF-OR processes outliers as any other data point and naturally down-weights them within the Bayesian framework; 2. robustness to outliers is achieved using the "uncertainty about uncertainty" approach [1]: the sequence of measurement noise covariances is modelled as a stochastic process over the set of symmetric positive-definite matrices, with an inverse Wishart prior; 3. the analytical intractability of the update step is resolved by applying a structured variational approximation [26,27,28,29], which puts a tight lower bound on the marginal likelihood of the data.
For additional information and deeper analysis of other research efforts, the reader is kindly referred to the analysis given in [2] and references therein.

EXTENDED KALMAN FILTER ROBUST TO OUTLIERS
Let us define the sequential learning problem of an MLP network in the following state-space form [2,3,4]:

p(w_t | w_{t-1}) = N(w_t; w_{t-1}, Q),          (1)
p(y_t | w_t, R_t) = N(y_t; g(w_t, x_t), R_t),   (2)

where N(μ, Σ) denotes the multivariate Gaussian distribution with mean μ and covariance Σ; w_t is the vector of all network parameters (weights and biases), g(·,·) is the MLP mapping for input x_t, Q is the process covariance and R_t is the observation noise covariance matrix. Finally, y_t is the measurement, while H_t is the measurement Jacobian used in the EKF linearization. Both distributions are conditionally Gaussian, while it is important to stress that the measurement covariance is no longer fixed, i.e. R_t evolves over time and is estimated at each filter iteration. This is the first distinction that sets our algorithm apart from its predecessor, the Kalman filter.
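To make the recursion concrete, here is a minimal sketch of one EKF weight update in the spirit of (1)-(2). This is our own toy construction, not the paper's code: the 1-3-1 network size, the weight-vector packing and the forward-difference Jacobian are all illustrative assumptions.

```python
import numpy as np

def mlp_forward(w, x):
    """Toy 1-3-1 MLP; w packs [W1 (3), b1 (3), W2 (3), b2 (1)] (our layout)."""
    W1, b1, W2, b2 = w[0:3], w[3:6], w[6:9], w[9]
    h = np.tanh(W1 * x + b1)
    return float(np.dot(W2, h) + b2)

def ekf_step(w, P, x, y, Q, R, eps=1e-6):
    """One EKF weight update: random-walk predict, then linearized correct."""
    P = P + Q                                    # predict: w_t = w_{t-1} + q_t
    g0 = mlp_forward(w, x)
    H = np.array([(mlp_forward(w + eps * e, x) - g0) / eps
                  for e in np.eye(len(w))])      # numerical Jacobian dg/dw
    S = H @ P @ H + R                            # innovation variance (scalar output)
    K = P @ H / S                                # Kalman gain
    w = w + K * (y - g0)                         # correct with the innovation
    P = P - np.outer(K, H @ P)
    return w, P
```

In EKF-OR the fixed scalar R above would be replaced by the expected measurement covariance estimated at each step, which is what grants robustness.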
The sequence {R_t} is modeled as a stochastic process over the set of symmetric positive-definite matrices [1]. Let us define a prior distribution over R_t at each time step. In Bayesian statistical modeling, a conjugate distribution is a prior that yields a posterior of the same functional form [26,27,28]. In this research we assume an inverse Wishart prior over R_t,

p(R_t) = IW(R_t; s, Ω),

where Ω and s denote the harmonic mean and the degrees of freedom, respectively. Marginalizing R_t out of the Gaussian likelihood yields a Student t distributed measurement model with scale matrix Λ. The Student t distribution has some very attractive properties when it comes to modeling the possible presence of outliers in data. Firstly, the Student t distribution has heavier and longer tails than the Gaussian: it decays at a less than exponential rate, which tells our learner that probability mass is spread over a wider region. Secondly, its influence function is less sensitive to infinitesimal changes in the data. For additional information the reader is referred to [2].
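The heavy-tail claim is easy to verify numerically. The sketch below compares log-densities of a standard Gaussian and a standard Student t with ν = 3 (our choice of ν for illustration): at the origin the Gaussian is denser, but at a 6σ "outlier" the t density is larger by many orders of magnitude, so such a point is far less surprising under the t model.

```python
import math

def gauss_logpdf(x):
    """Log-density of the standard Gaussian N(0, 1)."""
    return -0.5 * (math.log(2.0 * math.pi) + x * x)

def student_t_logpdf(x, nu=3.0):
    """Log-density of the standard Student t with nu degrees of freedom."""
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1.0) / 2.0 * math.log(1.0 + x * x / nu))
```

At x = 6 the log-density gap exceeds 12 nats in favour of the t distribution; this is precisely the mechanism by which an outlying observation is down-weighted rather than allowed to wreck the estimate.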

Bayesian learning and Variational Inference
Let us suppose that the data are given by the sequence y_{1:t} = {y_1, ..., y_t}. The state vector is not Gaussian due to the Student t distributed measurement model. Bayesian learning is performed using Bayes' rule [2]:

Posterior = Likelihood × Prior / Evidence.

Let z_t = {w_t, R_t} collect the latent variables, and let q(z_t) be an approximate posterior distribution over z_t given y_t. The marginal log-likelihood of the data can then be decomposed as [2]:

log p(y_t) = L(q) + KL(q || p),

where

 
L(q) denotes the lower bound on the marginal log-likelihood of the data,

L(q) = ∫ q(z_t) log [ p(y_t, z_t) / q(z_t) ] dz_t,

and KL(q || p) denotes the Kullback-Leibler divergence, also known as relative entropy,

KL(q || p) = − ∫ q(z_t) log [ p(z_t | y_t) / q(z_t) ] dz_t.

We emphasize that KL is not symmetric and hence not a distance (metric); it quantifies how much one distribution differs from another.

The approximation would be exact if KL(q || p) = 0. However, this is hard to achieve (if not impossible); since KL(q || p) ≥ 0, it follows that log p(y_t) ≥ L(q), which is why L(q) is known as the lower bound on the data log-likelihood [2]. Minimization of KL(q || p), which has to be performed to achieve a good approximation of the joint posterior by q(z_t), therefore inevitably leads to maximization of L(q).
However, the main advantage is that L(q) operates on the complete data log-likelihood and does not involve operations on the intractable true posterior.
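The decomposition log p(y) = L(q) + KL(q || p) can be checked exactly on a toy model. The sketch below (our own illustrative example, not from the paper) uses a binary latent variable z and an arbitrary approximate posterior q:

```python
import math

def elbo_decomposition(prior, lik, q):
    """For a discrete latent z, return (log-evidence, L(q), KL(q || posterior));
    the first quantity always equals the sum of the other two."""
    evidence = sum(prior[z] * lik[z] for z in prior)
    lower_bound = sum(q[z] * math.log(prior[z] * lik[z] / q[z]) for z in q)
    posterior = {z: prior[z] * lik[z] / evidence for z in prior}
    kl = sum(q[z] * math.log(q[z] / posterior[z]) for z in q)
    return math.log(evidence), lower_bound, kl
```

For any valid q the identity holds to machine precision, and KL ≥ 0 delivers the lower-bound property: raising L(q) necessarily squeezes the KL gap.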
Structured variational inference enables us to search for the solution within a family of distributions q(·). In this paper we search among distributions that factor as (see [2]):

q(z_t) = q(w_t) q(R_t).

By doing this, we preserve the statistical dependencies within the state and within the noise, but omit the dependencies between them. In physics, this factorization assumption is called the mean field approximation [26,27,28].
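The textbook instance of such a factorized scheme is a Gaussian with unknown mean μ and precision τ, where q(μ)q(τ) plays the same role as our q(w_t)q(R_t). The sketch below uses the standard coordinate-ascent updates for that model (hyperparameters are chosen arbitrarily for illustration), alternating the two factor updates until they stabilize:

```python
def gaussian_mean_field_vi(x, mu0=0.0, lam0=1e-3, a0=1.0, b0=1.0, iters=50):
    """Coordinate-ascent mean-field VI, q(mu, tau) = q(mu) q(tau), for data
    x_i ~ N(mu, 1/tau) with a Normal prior on mu and a Gamma prior on tau."""
    n = len(x)
    xbar = sum(x) / n
    e_tau = a0 / b0                          # initial guess for E[tau]
    for _ in range(iters):
        # update q(mu) = N(mu_n, 1/lam_n), holding q(tau) fixed
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # update q(tau) = Gamma(a_n, b_n), holding q(mu) fixed
        a_n = a0 + (n + 1) / 2.0
        ss = sum((xi - mu_n) ** 2 for xi in x) + n / lam_n
        b_n = b0 + 0.5 * (ss + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau = a_n / b_n
    return mu_n, e_tau                       # posterior means E[mu], E[tau]
```

Each update conditions on the current estimate of the other factor, exactly the alternation pattern EKF-OR uses between the state and noise posteriors.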

Derivation of the EKF-OR learning algorithm
Firstly, we need to specify the complete data likelihood, defined as the product

p(y_t, w_t, R_t) = p(y_t | w_t, R_t) p(w_t) p(R_t).

The approximate posteriors of the noise and the state vector that maximize (9) are then given by the following expressions [2]:

log q(w_t) = E_{q(R_t)}[ log p(y_t, w_t, R_t) ] + const,
log q(R_t) = E_{q(w_t)}[ log p(y_t, w_t, R_t) ] + const,

where the expectation of log p(w_t, R_t, y_t) is calculated with respect to the approximate distribution over the complementary factor. Taking the logarithm of (12) and the expectation over q(R_t), we may formulate the expression for the state vector. One may notice that the last term of (15) is quadratic with respect to the state vector w_t. The update equation for q(w_t) therefore has the same functional form as the standard EKF recursion [2,3,4,30,31], in which one starts with an initial estimate of the state w_1, iteratively propagates it via the state transition model (1), and updates it with the newest observation y_t via the measurement update equations derived from (2). The difference is that in EKF-OR we use the expected value of the measurement covariance under q(R_t) in place of the fixed nominal covariance.
Having found the expression for the distribution of the state vector, it remains to derive the expression for the measurement covariance. To do that, we need to specify a model for the measurement noise. In this research we focus on the case of independent identically distributed (IID) noise, characterized by the same inverse Wishart prior at each time stamp. Taking the expectation of the complete data log-likelihood with respect to the distribution q(w_t) results in the expression for the measurement noise covariance, where the last term in (17) implies that the noise posterior further factors over time (see [2]); this result is a consequence of (11) and the IID assumption. Equation (15) shows that the marginal of the state vector given the observations is Gaussian, while the measurement covariance R_t is distributed according to an inverse Wishart, given the observations y_t, with harmonic mean Ω_t given by (20). In the limit s → ∞, the harmonic mean Ω_t reduces to the nominal noise covariance R. We emphasize another interpretation of (21): when the difference between the predicted output of the network ĝ(w_t, x_t) and the current measurement y_t is large, Ω_t inflates, which automatically down-weights the influence of that data point on the network parameters.
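The qualitative behaviour just described can be caricatured in one dimension. The convex blend below is our illustrative assumption, not the paper's exact equations (20)-(21): a large s pins the expected variance to the nominal value, while a large innovation inflates it.

```python
def expected_measurement_variance(r_nominal, innovation, s):
    """Scalar caricature of the harmonic-mean behaviour (our assumption,
    not the paper's exact update): a convex blend of the nominal variance
    and the squared innovation, weighted by the degrees of freedom s."""
    return (s * r_nominal + innovation ** 2) / (s + 1.0)
```

An inflated variance shrinks the Kalman gain, so the outlying point barely moves the weight estimate; as s → ∞ the filter degenerates to the conventional EKF with fixed R.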
It is important to stress that the measurement Jacobian H_t, defined as the partial derivative of the network output with respect to the weights evaluated at the current estimate, is iteratively calculated (within the While loop of Algorithm 1, step #2). We point out that other nonlinear Kalman filters may be used; for example, the unscented Kalman filter (UKF) or the extended information filter (EIF) may be applied as the learning algorithm of the neural network. These algorithms have been extensively tested for RBF network training and have proven their performance in [4,32,33]. Instead of Taylor series linearization, these algorithms may be used in our robust sequential learning algorithm without any change to its basic idea.
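When an analytic Jacobian is inconvenient, H_t can be approximated numerically; the helper below is a generic central-difference sketch (an illustrative utility, not the paper's implementation):

```python
import numpy as np

def numerical_jacobian(g, w, eps=1e-6):
    """Central-difference Jacobian of g: R^n -> R^m with respect to w."""
    w = np.asarray(w, dtype=float)
    m = np.atleast_1d(g(w)).size
    J = np.zeros((m, w.size))
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        J[:, i] = (np.atleast_1d(g(w + dw)) - np.atleast_1d(g(w - dw))) / (2.0 * eps)
    return J
```

Central differences cost two function evaluations per parameter but are second-order accurate, which matters when the Jacobian feeds the gain computation at every step.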

EXPERIMENTAL RESULTS
To fully assess the performance of an MLP network trained with the EKF-OR sequential learning algorithm, we set up the following experiment. The MLP network is to learn highly nonlinear stochastic functions, namely the values of different stock indexes on the financial market; we have chosen stock values over three consecutive years. Performance is measured by the root mean squared error on a test set free of outliers,

RMSE = sqrt( (1/N_test) Σ_i (y_i − ŷ_i)² ),

where N_test denotes the total number of available testing samples and ŷ_i is the network prediction. A lower RMSE on the outlier-free test set implies better generalization.
All code is written and run in the Matlab 7.12 programming environment; experiments are conducted on a laptop computer with an Intel(R) Core™ i5-4200U CPU @ 1.6 GHz (2.3 GHz) and 6 GB of RAM, running 64-bit Windows 7. Five different MLP architectures are tested: (1) 3-10-1; (2) 3-20-1; (3) 3-5-5-1; (4) 3-10-5-1; (5) 3-10-10-1. Each experiment is repeated 30 times; each time new initial values of weights and biases are generated and the entire learning process is performed. The results averaged over 30 independent trials are given in Tables 1-5, which report the average, maximum and minimum RMSE for the outlier-free test set, as well as the improvement rate (IR) of EKF-OR when compared directly to EKF.
As the experimental results in Tables 1-5 show, EKF-OR outperforms EKF in terms of accuracy, where accuracy is measured by the RMSE calculated on the outlier-free test set. Furthermore, the average maximum RMSE of EKF-OR is lower than the average maximum RMSE of EKF; similarly, the average minimum RMSE of EKF-OR is higher than the average minimum RMSE of EKF. The average improvement rate is 7%, and in some experiments the highest IR value reaches 21%.
In Figure 1 one may see the S&P500 stock index. The upper part of the figure shows the nominal value of the stock (blue solid line) and the values polluted by outliers (black dotted line); the sudden, wild bursts of heavy tailed noise (outliers) are easily noticed. The lower part of Figure 1 depicts the test set for the S&P500 stock index; as mentioned, the test set is free of outliers. One may see that the MLP trained with EKF-OR is able to reconstruct the original signal/data regardless of the presence of outliers. Similarly, in the upper left corner of Figure 2 one may see the nominal time series (solid blue) and the series polluted with outliers (dotted black) for the Apple stock. The lower left part of Figure 2 depicts the reconstructed time series plotted against the nominal, outlier-free series. The right part of Figure 2 shows box plots for EKF and EKF-OR over 30 independent experimental runs.

CONCLUSION
In real world applications of neural networks, designers and engineers have to deal with the presence of outliers in data. For on-line and especially real time implementations, it is essential to have a learning model able to tackle possible outliers in the data stream.
To enable real world implementation of neural networks, in this paper we have derived a new sequential algorithm for robust learning of the Multilayered Perceptron (MLP) neural network in the presence of outliers. The Extended Kalman filter robust to outliers (EKF-OR) is based on the simple intuition of "uncertainty about uncertainty" [1,2]: we allow the measurement covariance matrix to evolve over time and model this evolution as a stochastic process with an inverse Wishart prior. EKF-OR sequentially processes all data points, regardless of whether a given point is an outlier or not; outliers are "naturally" down-weighted within the learning setup. To solve the analytical intractability of the update step in the Bayesian framework, we applied Variational Inference (VI) in the form of a structured variational approximation. This enables the algorithm to operate on the complete data log-likelihood and to iteratively improve the estimates of the state and the noise.
Experimental results on real world data (real world time series polluted with bursts of noise) demonstrate the effectiveness and good generalization ability of the derived learning algorithm and, together with the developed theoretical concept, provide strong foundations for future research.

Figure 1. S&P500 stock index in a given time frame.

Figure 2. Apple stock index in a given period. The right part shows box plots of EKF and EKF-OR for 30 independent trials (3-20-1).

Algorithm 1. Learning algorithm based on the extended Kalman filter Robust to Outliers (EKF-OR) for a Multilayered Perceptron neural network (IID noise case)

126 ▪ VOL.43, No 2, 2015

In the experiments, the nominal time series is polluted with bursts of outliers governed by a binary indicator b_t. During a burst, measurements quickly become larger and more volatile than in nominal conditions. We emphasize that this noise sequence does not obey the IID assumption, which may jeopardize the performance of both EKF-OR and the generic EKF. However, although the IID case is deeply embedded in the roots of EKF-OR, as we will see, EKF-OR easily overcomes this problem thanks to the VI machinery. The MLP network is to predict the next value in the series given the three previous values/measurements.
The indicator b_t is sampled from a Markov process with p = 5% probability of transition; the probability that two consecutive elements are equal is 1 − p = 95%. If b_t = 0, the noise is sampled from an inverse Wishart distribution with unit covariance and 10 degrees of freedom; otherwise (b_t = 1), it is sampled from a Gaussian distribution N(0, σ²).
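A minimal generator for such a two-state burst process can be sketched as follows; for simplicity this sketch replaces the inverse Wishart draw with a fixed nominal standard deviation, and sigma_burst is our illustrative choice, not the paper's value:

```python
import random

def burst_noise(n, p=0.05, sigma_nominal=1.0, sigma_burst=10.0, seed=0):
    """Binary indicator b_t follows a Markov chain that flips state with
    probability p; samples are drawn with a much larger spread while b_t = 1."""
    rng = random.Random(seed)
    b, samples = 0, []
    for _ in range(n):
        if rng.random() < p:       # state transition with probability p = 5%
            b = 1 - b
        sigma = sigma_burst if b == 1 else sigma_nominal
        samples.append(rng.gauss(0.0, sigma))
    return samples
```

Because transitions are rare, the bursts arrive in sustained runs rather than as isolated spikes, which is exactly the non-IID regime the experiment stresses.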