FEEDFORWARD NEURAL NETWORKS: THE LEVENBERG-MARQUARDT OPTIMIZATION AND THE OPTIMAL BRAIN SURGEON PRUNING

Abstract: This paper presents the training, testing and pruning of a feedforward neural network with one hidden layer used for the prediction of the vowel "a". The paper also describes the Gradient Descent, Gauss-Newton and Levenberg-Marquardt optimization techniques. Optimal Brain Surgeon pruning is applied to the trained network, with an abrupt change of the Normalized Sum Squares Error as the stopping criterion. The structure of the feedforward neural network (FNN) was 18 inputs (four for glottal and 14 for speech samples), 3 neurons in the hidden layer and one output. The results show that, after pruning, the glottal signal has no effect on the model for a female speaker, while it affects the prediction of the speech pronounced by a male speaker. In both cases, the structure of the FNN is reduced to a small number of parameters.


Introduction
In recent years, artificial neural networks (ANNs) have become one of the most successful structures for solving diverse problems related to artificial intelligence, pattern recognition, classification, curve fitting, time series prediction, and a wide variety of practical problems. The multilayer perceptron (MLP) is the oldest and most frequently used type of ANN. A typical MLP consists of several layers of neurons: the input layer nodes correspond to the feature vector, while the output layer consists of one or several neurons. If the input signal propagates from the input layer to the output layer, the network is called a feedforward neural network (FNN). In order to find the FNN weights, one must minimize a cost function. In the gradient descent (GD) optimization algorithm, the derivatives of the cost function are calculated and the weights are updated iteratively, i.e. the error signal is propagated back to the lower layers. This technique is known as the back-propagation (BP) algorithm. The network weights are moved along the negative of the gradient to find a minimum of the error function (Silva et al., 2008), (Wu et al., 2011), (Protić, 2014). However, the GD is relatively slow and the network solution may become trapped in one of the local minima instead of the global minimum. The Levenberg-Marquardt (LM) algorithm combines the GD and the Gauss-Newton (GN) method, giving more efficient convergence and better optimization than the GD alone (Riecke et al., 2009), (Shahin, Pitt, 2012), (Levenberg, 1944), (Marquardt, 1963). The GN presumes that the error function is approximately quadratic near the optimal solution, which is based on the second order Taylor approximation of the sum of the squared errors.

FIELD: Telecommunications. ARTICLE TYPE: Original Scientific Paper. ARTICLE LANGUAGE: English.
In this way, FNNs and the LM algorithm help address the problems of slow convergence and high computational cost caused by the structures of MLPs. The experimental analysis presented in this paper was based on a prediction of the vowel "a", spoken by both a female and a male speaker. All experimental results were obtained using an FNN with one hidden layer and the LM algorithm. The structure of the FNN was 18 inputs, 3 hidden-layer neurons, and one output neuron (18-3-1). The activation functions were hyperbolic tangent for all neurons. Since the outputs of the tangent function are limited to the (-1, 1) interval, the signals were also normalized to [-1, 1]. The training was carried out on 1700 samples of speech signals and the corresponding glottal signals. The first 14 inputs to the FNN were successive samples of speech, y_{n-1}, ..., y_{n-14}, and the following four inputs corresponded to the glottal signal, g_{n-1}, ..., g_{n-4}. The result of processing in the forward direction was the predicted speech sample y_n. The prediction error, or residual, was used to obtain the Sum of Squared Errors (SSE). After training, the resulting structure was tested on 1700 samples of an independent test set.
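As an illustrative sketch (not code from the paper), the 18-dimensional input vector described above can be assembled from past speech and glottal samples; the array names `speech` and `glottal` and the helper `make_inputs` are hypothetical:

```python
import numpy as np

def make_inputs(speech, glottal, n):
    """Build the 18-dimensional input vector for predicting speech sample y_n:
    14 past speech samples y_{n-1}..y_{n-14} and 4 past glottal samples g_{n-1}..g_{n-4}."""
    return np.concatenate([speech[n - 14:n][::-1],    # y_{n-1}, ..., y_{n-14}
                           glottal[n - 4:n][::-1]])   # g_{n-1}, ..., g_{n-4}
```

During training, one such vector is formed for every predicted sample y_n, n = 15, ..., 1700.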
Furthermore, Optimal Brain Surgeon (OBS) pruning was applied to the trained FNN. This technique reduced the network structure by removing neurons (from the hidden layer) that did not affect the total error rate. The stopping criterion was an abrupt change of the Normalized SSE (NSSE). As a demonstration of the aforementioned techniques, the results are presented in figures and summarized in tables. This paper is organized as follows: the next section describes the GD, the GN and the LM optimization algorithms, as well as the LM optimization technique for the FNN with one hidden layer. The experimental results are then presented, and the last section concludes the paper.

Levenberg-Marquardt optimization
The Levenberg-Marquardt (LM) algorithm can be regarded as a linear combination of the GN method and the GD method. The alternation between these two methods is called a damping strategy, and is controlled by a damping factor. If the damping parameter is large, the LM adjusts the parameters like the GD method; if it is small, the LM updates the parameters like the GN method (Young-tae et al. 2011). The GD, GN and LM methods are optimization algorithms for the basic Least-Squares (LS) problem, i.e. they use LS to fit data. Fitting requires a parametric model that relates the response data to the predictor data through coefficients. The LS method minimizes the summed square of the residuals, a residual being the difference between an observed value and the fitted value provided by the model:

S = Σ_{i=1}^{n} r_i² = Σ_{i=1}^{n} (y_i − ŷ_i)²,

where S is the sum of squares, y_i and ŷ_i are the observed response and its fitted value, respectively, r_i is the residual, and n is the number of data points included in the fit. When fitting data, the residuals (errors) are assumed to be random and to follow a Gaussian distribution with zero mean and constant variance σ² (the spread of the errors is constant).
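A minimal numerical illustration of this LS criterion (the function name `sum_squares` is ours, not the paper's):

```python
import numpy as np

def sum_squares(y, y_hat):
    """S = sum_i r_i^2, with residual r_i = y_i - y_hat_i."""
    r = y - y_hat
    return np.sum(r ** 2)

# Example: observed responses vs. fitted values of a hypothetical model
S = sum_squares(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, 1.0]))
```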
The LS fitting may be linear, weighted, robust or nonlinear (fitting a nonlinear model to the data). In general, LS methods involve an iterative procedure for parameter optimization. The nonlinear LS methods iteratively improve the parameter values in order to reduce the sum of the squares of the errors between the function and the measured data points. This is optimization, i.e. the process of finding the minimum or maximum value of an objective function.
When using nonlinear optimization algorithms, a choice must be made about when to stop the fitting process. Some possible choices are: 1) stop after a fixed number of iterations, 2) stop when the error function falls below a predetermined value, 3) stop when the relative change in the error function falls below some specified value, etc. (Bishop, 1995).

Gradient descent method
The GD method is a general minimization method which updates parameter values in the direction opposite to the gradient of the objective function. It is recognized as a highly convergent method for finding the minimum of simple objective functions (Gavin, 2013). The GD is a first order algorithm because it takes only the first derivative of the function.
Suppose that f(x) is a cost function of a single parameter x. To find a minimum of f, x has to be evolved in such a way that the cost function f(x) decreases. Given some initial value x_0 for x, the value of x can be changed in many directions. In higher dimensions, the analogue of the derivative is the gradient: a vector whose entries are the partial derivatives of f with respect to each dimension. So, in order to decrease f, evaluate the gradient ∇f at a particular point, take a step in its negative direction, evaluate the gradient at the new point, get a new vector and take a step in its negative direction, and so on, until a local minimum of the function is found. This gives a simple gradient descent algorithm:
- initialize x in some way, often just randomly, x = x_0;
- find the gradient ∇f and update x by a small positive amount λ, the step size (sometimes called the learning rate): x ← x − λ∇f(x);
- check the stopping condition (stop the procedure either when f is not changing sufficiently quickly, or when the gradient is sufficiently small).
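The steps above can be sketched as follows; this is a minimal illustration with a fixed step size, and the function names are ours, not the paper's:

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, tol=1e-8, max_iter=10000):
    """Minimal gradient descent sketch: follow the negative gradient
    with a fixed step size until the gradient is sufficiently small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # stopping condition
            break
        x = x - step * g              # x <- x - lambda * grad f(x)
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```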
Given stable conditions (a certain choice of λ), it is guaranteed that f(x_{k+1}) ≤ f(x_k). One problem is how to choose the step size. Local minima are sensitive to the starting point, as well as to the step size. The step size influences both the convergence rate and the behaviour of the minimization procedure. If the step size is chosen too large, the current fit jumps over the minimum, perhaps jumping back, and it may take a long time to converge. On the other hand, if the step size is too small, the algorithm takes very small steps and makes very little progress per iteration (Ihler, 2013). One option is to choose a fixed step size that will assure convergence wherever the GD starts. Another option is to choose a different step size at each iteration (the step size is changed inside the algorithm).
The GD is the simplest convergent algorithm for finding the minimum of f(x). It quickly approaches the solution from a distance, which is its main advantage. Its main disadvantage is that it approaches the best fit very slowly, and tends to zigzag along the bottom of long narrow canyons (Zinn-Bjorkman, 2010). This problem is solved by the GN method.

Gauss-Newton method
In the GN method, the SSE is reduced by assuming that the LS function is locally quadratic and finding the minimum of that quadratic. It presumes that the objective function is approximately quadratic in the parameters near the optimal solution. For moderately-sized problems, the GN method typically converges much faster than the GD method (Marquardt, 1963). The GN is based on the second order Taylor approximation of the SSE E around the optimal solution E_0:

E(u + δu) ≈ E_0 + gᵀδu + (1/2) δuᵀ H δu,    (2)
where u is the parameter vector, δu is the parameter deviation, g is the gradient of E, and H is the symmetric Hessian matrix of the second derivatives of E. Intuitively, the Hessian describes the local n-dimensional parabolic curvature that approximates E. If H is positive-definite, the parabola is indeed positively curved, meaning that it goes up in all directions and has a definite minimum, and the parameters are estimated as

u* = u − H⁻¹g,

where u* is the estimated parameter vector.
The problem of finding H⁻¹ is solved by applying the outer product approximation, which gives a computationally efficient procedure for the calculation of the inverse Hessian. For N data points and the gradient vector g of the error function, a sequential procedure for building up the Hessian is obtained by separating off the contribution of the data point N+1:

H_{N+1} = H_N + g_{N+1} g_{N+1}ᵀ.

The initial matrix H_0 is chosen to be αI, where α is a small quantity.
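Under these assumptions, the inverse of each rank-one update can be obtained with the Sherman-Morrison formula, which is the standard way to make this procedure efficient; the following sketch (function names are ours) builds H⁻¹ sequentially from H_0 = αI:

```python
import numpy as np

def update_inverse_hessian(H_inv, g):
    """Sherman-Morrison update of H^{-1} after adding the outer-product
    contribution g g^T of one more data point to the Hessian."""
    Hg = H_inv @ g
    return H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)

def build_inverse_hessian(grads, alpha=1e-4):
    """Start from H_0 = alpha * I and fold in one gradient at a time."""
    n = grads[0].shape[0]
    H_inv = np.eye(n) / alpha          # (alpha I)^{-1}
    for g in grads:
        H_inv = update_inverse_hessian(H_inv, g)
    return H_inv
```

Each update costs O(n²) instead of the O(n³) of a full matrix inversion.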
As many other algorithms for the minimization of an objective function, the LM algorithm provides a numerical solution to the problem of minimizing a function over a space of parameters (Kashyap, 1980), (Ljung, 1987), (Larsen, 1993), (Hansen, Rasmussen, 1994), (Fahlman, 1988). The parameter update with a small step is a blend of the two algorithms mentioned above, i.e. the search direction δ is the solution of the matrix equation

(JᵀJ + λI) δ = JᵀE,

where λ is the damping factor, adjusted at each iteration to guide the optimization process, δ is the parameter update, J is the Jacobian matrix of the first derivatives of E, and I is the identity matrix. If the reduction of E is rapid, a smaller value of λ brings the algorithm closer to the GN algorithm, whereas if an iteration gives an insufficient reduction in the residual, λ can be increased, giving a step closer to the GD direction. In other words, the LM method acts more like the GD method when the parameters are far from their optimal value, and more like the GN method when the parameters are close to their optimal value.
The update rule goes as follows: if the error goes down following an update, it implies that the quadratic assumption is working, and λ is reduced (usually by a factor of 10) to decrease the influence of the GD. On the other hand, if the error goes up, λ is increased by the same factor. The LM algorithm is thus: perform an update in the direction given by the rule above and evaluate the error at the new parameter vector; if the error has increased, retract the step, increase λ and go to the first step; otherwise, accept the step and decrease λ (Ranganathan, 2004).
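The damping strategy above can be sketched as a short loop; this is an illustrative implementation, not the paper's code, and the example problem (fitting a·exp(b·t) to synthetic data) is ours:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, u0, lam=1e-3, factor=10.0,
                        max_iter=100, tol=1e-10):
    """LM sketch: solve (J^T J + lam*I) delta = -J^T r each iteration and
    adapt the damping factor lam as described in the text."""
    u = np.asarray(u0, dtype=float)
    r = residual(u)
    err = r @ r
    for _ in range(max_iter):
        J = jacobian(u)
        delta = np.linalg.solve(J.T @ J + lam * np.eye(len(u)), -J.T @ r)
        r_new = residual(u + delta)
        err_new = r_new @ r_new
        if err_new < err:      # quadratic assumption works: accept, move toward GN
            u, r, err = u + delta, r_new, err_new
            lam /= factor
            if err < tol:
                break
        else:                  # error increased: retract the step, move toward GD
            lam *= factor
    return u

# Example: fit y = a * exp(b * t) with true parameters a = 2, b = -1
t = np.linspace(0, 2, 20)
y = 2.0 * np.exp(-1.0 * t)
res = lambda u: u[0] * np.exp(u[1] * t) - y
jac = lambda u: np.column_stack([np.exp(u[1] * t),
                                 u[0] * t * np.exp(u[1] * t)])
u_fit = levenberg_marquardt(res, jac, u0=[1.0, 0.0])
```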

Levenberg-Marquardt optimization for the feedforward neural network learning
According to Azimi-Sadjadi and Liou (1992), the FNN with one hidden layer and a sigmoidal-type nonlinearity can approximate any nonlinear function and generate any complex decision region needed for classification and recognition tasks. Consider the FNN

y_i = F_i( Σ_{j=1}^{m} W_{ij} f_j( Σ_{k=1}^{q} w_{jk} x_k ) ),

where y_i is the output, w and W are the synaptic weight matrices, and f_j and F_i are the activation functions of the hidden and the output layer, respectively; q and m represent the number of nodes in the input and the hidden layer, respectively. For the FNN with differentiable activation functions of both input variables and weights, each unit computes a weighted sum of its inputs,

a_j = Σ_i w_ji z_i,

where z_i is the activation that sends a connection to unit j, and w_ji is the weight associated with that connection. For a given a_j, a nonlinear function g(•) is applied, so that z_j = g(a_j); g(•) is known in advance (Protić, 2014).
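For the 18-3-1 network used in this paper with tanh activations, the forward pass can be sketched as follows (biases are omitted and the weight shapes are our assumption for illustration):

```python
import numpy as np

def fnn_forward(x, w, W):
    """Forward pass of a one-hidden-layer FNN, y = F(W f(w x)),
    with tanh activations; w is the (3, 18) hidden weight matrix
    and W is the (1, 3) output weight matrix."""
    z = np.tanh(w @ x)   # hidden-layer activations f_j
    y = np.tanh(W @ z)   # output neuron F_i
    return y

# Example with random weights initialized in [-1, 1], as in the paper
rng = np.random.default_rng(1)
w = rng.uniform(-1, 1, size=(3, 18))
W = rng.uniform(-1, 1, size=(1, 3))
y = fnn_forward(rng.uniform(-1, 1, size=18), w, W)
```

Because both layers use tanh, the output is confined to (-1, 1), which is why the signals are normalized to [-1, 1].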
Also consider the error function

E = Σ_n E_n,  where E_n = E_n(y_1, ..., y_c).

The goal is to evaluate the derivatives of the error E_n with respect to the weights. The BP formula gives

∂E_n/∂w_ji = δ_j z_i,

where the δ's can be evaluated backwards. In a similar way, the other derivatives may be calculated.
Consider the evaluation of the Jacobian matrix, whose elements are given by the derivatives of the network outputs y_k with respect to the network inputs x_i, J_ki = ∂y_k/∂x_i. The optimization process goes as follows: propagate the input signal through the FNN in the forward direction to obtain the outputs at each layer; generate the output signal for each node; compute the matrices for updating the weights; determine the state of a particular node and proceed if the input to the node is within the ramp region (otherwise there is no need for a weight update, so examine the next node); finally, update the weight vector using the recursion.
Pruning is a technique that helps decide upon the structure of the FNN. It addresses only the neurons in the hidden layer. There are two types of pruning: 1) incremental pruning, which starts with the input or output layers and then incrementally decreases the size of the FNN, retraining after each iteration; 2) selective pruning, which starts with an already trained FNN of fixed size and then removes hidden neurons that do not affect the error rate of the FNN. In this way, unproductive neurons are removed (Hansen, Rasmussen, 1994).
After the network updating is finished, pruning is carried out in the following way. For a network trained to a local minimum of the error, the linear term, as well as the higher order terms, in the Taylor expansion of the error vanishes, so the change of the error due to a weight perturbation δw is ΔE ≈ (1/2) δwᵀ H δw. The goal is to set one of the weights, w_q, to zero while minimizing this increase in error; the resulting saliency of weight q is L_q = w_q² / (2 [H⁻¹]_qq), and the remaining weights are adjusted by δw = −(w_q / [H⁻¹]_qq) H⁻¹ e_q, where e_q is the unit vector along w_q. It should be noted that neither H nor H⁻¹ has to be diagonal (Hassibi, Stork and Wolff, 1993).
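One OBS step, following the saliency and weight-update formulas of Hassibi, Stork and Wolff (1993), can be sketched as follows (the function name is ours; H_inv is assumed to be the inverse Hessian of the trained network):

```python
import numpy as np

def obs_prune_one(w, H_inv):
    """One Optimal Brain Surgeon step: pick the weight with the smallest
    saliency L_q = w_q^2 / (2 [H^-1]_qq), set it to zero, and adjust the
    remaining weights by delta_w = -(w_q / [H^-1]_qq) * H^-1 e_q."""
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    w_new = w - (w[q] / H_inv[q, q]) * H_inv[:, q]   # zeroes w_q exactly
    return w_new, q, saliency[q]
```

Repeating this step until the error jumps abruptly yields the pruned structure; note that the update uses full columns of H⁻¹, not just its diagonal.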

Results
The experiments were carried out on the vowel "a" spoken by a female and a male speaker. Both glottal and vocal signals were used for training and testing. The training was carried out on 1700 samples of speech and the corresponding glottal signals. For the FNN training, the LM method was applied as the optimization algorithm. The network structure was 18-3-1, i.e. 18 inputs (14 inputs for the speech and four inputs for the glottal signal), three neurons in the hidden layer, and one output neuron, with hyperbolic tangent activations for all neurons. According to Protić (2014), the two-pole model with poles at approximately (2n+1)·500 Hz, n = 0, 1, ..., is suitable for speech prediction. Considering the frequency interval of speech sound [0-4] kHz, a model with seven two-pole sections, i.e. 14 inputs, was used for the purpose of this research. The four inputs of the glottal signal correspond to two zeros. Since the resulting values of the neurons' outputs were inside the interval (-1, 1), the signal samples were normalized to values in the interval [-1, 1]. The weights were initialized as random values in the same interval. This structure simulated the speech production system (glottal excitation and vocal tract). The NSSE was used for optimization, where NSSE = SSE/n and n is the size of the error set. The training was carried out so that NSSE_train was less than 0.001. The resulting structure was tested on an independent test set, and the value of NSSE_test was determined. The OBS pruning algorithm was then applied to determine a minimum structure of the FNN such that the rejection of neurons did not affect the overall NSSE. The stopping criterion for pruning was an abrupt change in NSSE_prune to more than 10 times the minimum value of the NSSEs calculated in the preceding steps.
The optimal structures were 14-3-1 for the female speaker (Figure 1) and 16-3-1 for the male speaker (Figure 2). For these structures, the abrupt changes of the NSSEs showed that NSSE_prune was more than ten times higher than NSSE_train and NSSE_test (see Table 1). The fully connected FNNs were reduced in size, but the reduction was not significant. As presented in Figure 1 and Table 1, the nonlinear structure of the form 14-3-1 is a good predictor of the speech signal for the female speaker. It should be noted that, in this case, the glottal signal has no effect on the prediction, and the FNN behaves as a nonlinear AutoRegressive (AR) model, with a grouping of the first six parameters (excluding the fifth node). Grouping is also noticeable for nodes 9-11. The impact of the sample at the 14th node stands out. For the male speaker, there is an impact of the glottal signal, which can be seen as the influence of the excitation signal, i.e. the lower vibration frequency of the vocal cords of the male speaker compared to the female. In this case, the network has the 16-3-1 structure, and the nodes that describe the impact of the speech signal are arranged identically to the previous case (for the woman), so the FNN behaves as a nonlinear AR model with an eXogenous input (ARX).
In both cases, the structure of the FNN is reduced to a small number of parameters. It should be emphasized that, for the recognition of speech whose fundamental frequency is relatively high, the glottal signal does not have to be recorded, which is particularly convenient because the recording technology (electroglottography) is uncomfortable for the speaker (the electrodes are placed externally, on the larynx).

Conclusion
The FNN with one hidden layer, the 18-3-1 structure and hyperbolic tangent transfer functions of all nodes is a good predictor of the speech signal for the prediction of vowels. The prediction algorithm involves optimization techniques and the pruning of neurons that do not have a significant impact on the prediction error. Several optimization algorithms described in this paper can be applied to train a neural network. The most popular are the GD and the GN, as well as their combination, known as the LM algorithm. For the GD algorithm, the optimization is based on the first derivative of the SSE, which is calculated forward from the input layer to the output layer, after which the parameters are adjusted backwards, from the output to the input, using the BP algorithm. The GN uses the quadratic approximation of the prediction error obtained from the Taylor series, so the optimization method also includes solving a quadratic equation; the second derivatives of the error are evaluated. In the first case, convergence is not fast enough, while in the second case the processing time and resource consumption are high because of the necessity of calculating inverse Hessian matrices. The LM algorithm is a combination of the two previous algorithms, determined by the damping factor (damping strategy), which adjusts the trade-off between the GD and the GN methods.
In this paper, the LM optimization algorithm was used to train the network. The network was tested after training, after which OBS pruning was derived. Those neurons that do not affect the change of the error were rejected by pruning, while the stopping criterion was the abrupt change of the NSSE.
The results showed that the LM algorithm provided high quality solutions in the prediction of vowels. The experiments were based on the prediction of the vowel "a" pronounced by female and male speakers. The amplitudes of the signals were normalized to the interval [-1, 1] and adjusted to the tanh transfer functions of the neurons, which was also suitable for the comparison of the male and female speakers. The optimization algorithm was based on the NSSE, in such a way that the training stopped when the NSSE was less than 0.001. The results showed that this was a proper value for the optimization criterion, considering that the optimized structure gave the minimum test error, and that pruning reduced the FNN weights mostly related to the glottal signal. The experiments have shown that OBS pruning can reduce the number of input parameters so that the glottal wave has no influence on the prediction for the female speaker. However, this did not hold for the male speaker, owing to the lower basic excitation frequency, i.e. the lower speed of opening and closing of the vocal cords. According to the results, the vowel "a" can be predicted accurately with the given FNN, training based on the LM algorithm, and OBS pruning.