FIXED-BUDGET APPROXIMATION OF THE INVERSE KERNEL MATRIX FOR IDENTIFICATION OF NONLINEAR DYNAMIC PROCESSES

Antropov N., Agafonov E., Tynchenko V., Bukhtoyarov V., Kukartsev V. (2022) Fixed-budget approximation of the inverse kernel matrix for identification of nonlinear dynamic processes, Journal of Applied Engineering Science, 20(1), 150-159, DOI: 10.5937/jaes0-31772

Evgeny Agafonov, Reshetnev Siberian State University of Science and Technology, Institute of Computer Science and Telecommunications, Department of Systems Analysis and Operations Research, Krasnoyarsk, Russian Federation


INTRODUCTION
Most identification methods are based on linear models, since their properties and limits are well known and established. At the same time, one quite often has to deal with nonlinear processes that cannot be identified by linear methods. As a result, over the past decades many identification algorithms for nonlinear processes have been proposed, for example, algorithms based on neural network modeling [1], fuzzy logic [2], and kernel methods [3]. Nowadays kernel-based methods [4][5][6][7] have become the most widely used identification methods, since they allow solving nonlinear identification problems using linear algorithms without any assumptions about the model structure. Kernel methods rely on the so-called kernel trick [8]. Roughly speaking, the kernel trick replaces the inner products in the original linear algorithms with kernels, which calculate the distances between input data points in a high-dimensional Hilbert space [9]. This procedure is also known as metric kernelization. The key feature of kernelization is that nonlinear functions are likely to become linear in a high-dimensional space. The shortcoming of this procedure is an increase of the problem dimension, which becomes dependent on the training sample size. As a result, solving identification problems with kernel methods involves considerable computational difficulties, mostly related to the inversion of large matrices, which requires O(N³) floating-point operations, where N is the number of training points.

There are two approaches to decreasing the computational burden of kernel-based methods. The first approach uses a reduced training data sample instead of the full one. A reduced training data sample is often called an active set of training points. Points in the active training set can be selected either randomly or greedily w.r.t. some criterion [10][11][12]. When an active training set of size M is used, kernel algorithms require O(M³) operations, which is computationally more efficient, but in practice this approach has some drawbacks. In particular, the random active set selection procedure can lead to model misspecification, since kernel methods are nonparametric techniques whose estimates are built directly on the training data. A greedy selection procedure, in turn, requires at least O(MN²) additional operations for evaluating the selected criterion, which is likely to be too expensive. The second approach is based on reduced-rank approximations of the kernel matrix [13][14][15][16].
Such low-rank approximations can be constructed, for example, using the Nyström method [17] or the incomplete Cholesky decomposition [18]. The computational complexity of these methods is O(MN²), where M is either unknown until the approximation is done or predefined by the user [19]. In the first case, one may exceed the computational resource limit. In the second case, a more compact approximation with a smaller M than the one predefined by the user may exist, so the predefined choice is not efficient enough.
In order to overcome this obstacle, we propose a new regularized kernel least squares algorithm based on a low-rank approximation of the inverse kernel matrix with a fixed budget. The proposed algorithm relies on the incomplete Cholesky decomposition with a modified stopping criterion that combines an approximation error criterion with an upper bound M on the maximum approximation rank. The novelty of the proposed algorithm is that it allows regulating the maximum computational burden of the identification algorithm while obtaining the least approximation error at the same time.

OBJECT OF STUDY
Let us consider a nonlinear process (plant) that can be modeled by a discrete nonlinear equation:

y_{n+1} = f(x_n) + ε_n (1)

where y_{n+1} ∈ R is an output, x_n = [u_n, y_n, y_{n-1}, ..., y_{n-d}] ∈ R^m is an input, d is a model order, f(x_n) is an unknown function, and ε_n ~ N(0, σ_n²) is independent and identically distributed Gaussian noise. Figure 1 presents a schematic representation of the simulated nonlinear process. Many real processes and plants can be described by the model (1). As an example, the application of the proposed algorithm is considered on the problem of identifying the input and output pressure of a pump station. The pump station is shown in Figure 2.
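To make the regressor construction in model (1) concrete, the following minimal Python sketch (ours, not from the paper; the helper name build_regressors is hypothetical) forms the vectors x_n = [u_n, y_n, ..., y_{n-d}] and the targets y_{n+1} from recorded input/output sequences:

```python
import numpy as np

def build_regressors(u, y, d):
    """Form NARX regressors x_n = [u_n, y_n, y_{n-1}, ..., y_{n-d}]
    and targets y_{n+1} for model (1)."""
    X, t = [], []
    for n in range(d, len(y) - 1):
        # y[n-d : n+1][::-1] = [y_n, y_{n-1}, ..., y_{n-d}]
        X.append(np.concatenate(([u[n]], y[n - d:n + 1][::-1])))
        t.append(y[n + 1])
    return np.asarray(X), np.asarray(t)
```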

Figure 2: Pump station HM 2500-230
The key elements of the pump station are the input pipe, the pump unit, the valve, and the output pipe. The input and output pipes are equipped with pressure sensors. The valve is used to control the input and output pressure.

Kernel methods
Let us denote the training sample as D = {x_n, y_n}, n = 1, ..., N. The kernel model of the process output is ŷ = k^T α, where α = [α_1, ..., α_N]^T is a vector of weights and k = [k(x_1, x), ..., k(x_N, x)]^T. The kernel k(x_i, x_j) is a positive definite function satisfying Mercer's condition [9]:

∫∫ k(x, x′) g(x) g(x′) dx dx′ ≥ 0 for all g ∈ L₂ (2)

We will use the kernel function k(x_i, x_j) of the following form:

k(x_i, x_j) = σ_f² k_l(x_i, x_j) (3)

where σ_f² is a height-scale parameter and the kernel k_l(x_i, x_j) depends on length-scale parameters l_1, l_2, .... Further, we will use the notation θ = [σ_f², σ_n², l_1, l_2, ...] for the vector of hyperparameters of the kernel function k(x_i, x_j). Batch estimation of the parameters α by the regularized kernel least-squares criterion on the training data D = {x_n, y_n}, n = 1, ..., N has the following form [20]:

α = (K + σ_n²I)^{-1} y (4)

where y = [y_1, y_2, ..., y_N]^T, σ_n² is the noise variance (regularization), and I is an identity matrix. K is the positive semi-definite kernel matrix (Gram matrix):

[K]_{ij} = k(x_i, x_j), i, j = 1, ..., N (5)

When a basic matrix inversion algorithm such as Gaussian elimination or Cholesky factorization is used, the calculation of the expression (4) requires O(N³) floating-point operations.
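As an illustration of the batch estimate (4), a minimal sketch (ours, not the authors' code) computes α with a Cholesky solve; this is the O(N³) baseline that the rest of the paper accelerates:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def fit_alpha(K, y, sigma_n2):
    """Regularized kernel LS estimate (4): alpha = (K + sigma_n^2 I)^{-1} y."""
    c, low = cho_factor(K + sigma_n2 * np.eye(len(y)))
    return cho_solve((c, low), y)
```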
The hyperparameters θ are estimated by maximizing the marginal likelihood (model evidence) of the training data [21]:

L(θ) = -0.5 y^T α - 0.5 ln|K + σ_n²I| - 0.5 N ln 2π (6)

The criterion (6) has the following terms [22]. The first term, -0.5 y^T α, is responsible for the data fit. The second term, -0.5 ln|K + σ_n²I|, is the complexity penalty. The third term, -0.5 N ln 2π, is the normalization constant. If one would like to apply gradient optimization methods, the partial derivatives of the criterion (6) w.r.t. the hyperparameters θ = [σ_f², σ_n², l_1, l_2, ...] are calculated as follows [21]:

∂L/∂θ_j = 0.5 tr( (αα^T − (K + σ_n²I)^{-1}) ∂(K + σ_n²I)/∂θ_j ) (7)

The computational complexity of the expressions (6) and (7) is O(N³) due to the need to compute the inverse matrix (K + σ_n²I)^{-1}.
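The evidence (6) and its gradients (7) can be evaluated as in the following sketch; the naming is ours (dK_list holds the derivative matrices of K + σ_n²I with respect to each hyperparameter), not the authors':

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_evidence_and_grad(K, dK_list, y, sigma_n2):
    """Evaluate the marginal likelihood (6) and its gradients (7)."""
    N = len(y)
    c, low = cho_factor(K + sigma_n2 * np.eye(N))
    alpha = cho_solve((c, low), y)
    # ln|K + sigma_n^2 I| from the diagonal of the Cholesky factor
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    L = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2.0 * np.pi)
    Kinv = cho_solve((c, low), np.eye(N))
    grads = [0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)
             for dK in dK_list]
    return L, grads
```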

Fixed-budget approximation of the inverse kernel matrix
Since the matrix K + σ_n²I is positive definite, there exists a unique decomposition K + σ_n²I = LL^T, where L is a lower triangular matrix with positive diagonal elements [23]. In practice, for a wide range of kernel functions k(x_i, x_j), the eigenvalues of the matrix K + σ_n²I decrease rapidly, which leads to the existence of a low-rank approximation K ≈ GG^T, where the matrix G is of dimension N × M. For example, the commonly used Gaussian kernel has a rapidly decaying eigenspectrum; for more details see [24].
In the general case, if the eigenspectrum of the matrix K + σ_n²I has a more complex structure, a symmetric permutation of the rows and columns of the matrix K should be performed to ensure acceptable approximation accuracy and stability of the decomposition algorithm. Existing methods for computing the decomposition K ≈ GG^T require O(MN²) operations, where M is either unknown until the approximation is done or predefined by the user [19]. In the first case, the resulting dimension M depends only on the chosen approximation accuracy δ in the criterion ‖K − GG^T‖ < δ, which is hard to choose properly when one would like to prevent using more computational resources than is permissible. In order to avoid these problems, we suggest setting an upper bound on the maximum dimension M of the matrix G, while continuing to use ‖K − GG^T‖ < δ as an approximation error criterion. That is, if ‖K − GG^T‖ < δ is not yet fulfilled at iteration M, the algorithm stops in any case. The proposed stopping criterion allows, on the one hand, getting accurate representations by minimizing the approximation error and, on the other hand, fixing the computational burden of the algorithm at O(MN²), where M can be predefined by the user. By directly applying the approximation K ≈ GG^T one cannot reduce the computational burden, since the matrix GG^T has the same dimension as the matrix K. The computational efficiency can be improved by applying the matrix inversion lemma [25], in particular:

(GG^T + σ_n²I)^{-1} = σ_n^{-2}(I − G(G^T G + σ_n²I)^{-1} G^T) (8)

To ensure stability, the calculation of the inverse (G^T G + σ_n²I)^{-1} should be performed via Cholesky factorization:

(G^T G + σ_n²I)^{-1} = (LL^T)^{-1} = L^{-T} L^{-1} (9)

where the matrix L is the Cholesky factor of the matrix G^T G + σ_n²I.
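A pivoted incomplete Cholesky sketch with the dual stopping rule described above might look as follows. This is our reconstruction of the idea, not the authors' exact Algorithm 1; in particular, we use the residual trace as a computable surrogate for ‖K − GG^T‖:

```python
import numpy as np

def fixed_budget_ichol(K, M, delta):
    """Pivoted incomplete Cholesky K ~ G G^T with a fixed budget:
    stop when the residual trace drops below delta OR when M columns
    have been produced, whichever comes first."""
    N = K.shape[0]
    G = np.zeros((N, M))
    d = np.diag(K).astype(float).copy()  # residual diagonal of K - G G^T
    piv = np.arange(N)
    m = 0
    while m < M and d[piv[m:]].sum() > delta:
        # pivot: move the column with the largest residual forward
        j = m + int(np.argmax(d[piv[m:]]))
        piv[[m, j]] = piv[[j, m]]
        i = piv[m]
        G[i, m] = np.sqrt(d[i])
        rest = piv[m + 1:]
        G[rest, m] = (K[rest, i] - G[rest, :m] @ G[i, :m]) / G[i, m]
        d[rest] -= G[rest, m] ** 2
        m += 1
    return G[:, :m]  # N x m factor, m <= M
```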
Combining expressions (8) and (9), the estimate of the parameters α takes the form:

α = σ_n^{-2}(y − G L^{-T} L^{-1} G^T y) (10)

and the corresponding model evidence has the form:

L(θ, d) = -0.5 y^T α − 0.5 ln|K + σ_n²I| − 0.5 N ln 2π (11)

where the matrix L is the Cholesky factor of the matrix G^T G + σ_n²I. The expression ln|K + σ_n²I| can be calculated using the matrix determinant lemma [25]:

ln|K + σ_n²I| ≈ (N − M) ln σ_n² + 2 Σ_{i=1}^{M} ln L_ii (12)

The optimal model order d̂ is the one that maximizes the model evidence:

d̂ = arg max_d L(θ, d) (13)

The computational complexity of the optimization procedure (13) using Algorithm 1 is O(DCN²), where C < N for a significant number of kernel functions. The proposed algorithm for kernel-based model order estimation is shown in Algorithm 2.
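Putting (10)-(12) together, a sketch of the reduced-complexity evaluation (again our code, under the approximation K + σ_n²I ≈ GG^T + σ_n²I) costs O(NM²) instead of O(N³):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def evidence_from_G(G, y, sigma_n2):
    """alpha via (10) and the evidence via (11)-(12) from the factor G."""
    N, M = G.shape
    c, low = cho_factor(G.T @ G + sigma_n2 * np.eye(M))
    # (10): alpha = sigma_n^{-2} (y - G (G^T G + sigma_n^2 I)^{-1} G^T y)
    alpha = (y - G @ cho_solve((c, low), G.T @ y)) / sigma_n2
    # (12): ln|K + sigma_n^2 I| ~ (N - M) ln sigma_n^2 + 2 sum_i ln L_ii
    logdet = (N - M) * np.log(sigma_n2) + 2.0 * np.sum(np.log(np.diag(c)))
    L = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2.0 * np.pi)
    return alpha, L
```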
Algorithm 2: Model order estimation algorithm based on the marginal likelihood
Input: training sample D = {x_n, y_n}, n = 1, ..., N; approximation accuracy δ; maximum approximation rank M; maximum model order D; hyperparameters θ.
for d = 1 ... D
    calculate the low-rank factor G of the matrix K + σ_n²I using Algorithm 1
    calculate the vector α using (10)
    calculate the value L(θ, d) using (11)-(12)
end for
calculate the model order d̂ using (13)
Output: estimate d̂ of the model order.
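A direct Python transcription of Algorithm 2, reusing the sketches above (build_regressors, fixed_budget_ichol, and evidence_from_G are our hypothetical names, not the paper's), might read:

```python
import numpy as np

def estimate_model_order(u, y, D_max, M, delta, kernel, sigma_n2):
    """Pick the model order d that maximizes the evidence, as in (13)."""
    best_d, best_L = 1, -np.inf
    for d in range(1, D_max + 1):
        X, t = build_regressors(u, y, d)
        K = kernel(X, X)
        G = fixed_budget_ichol(K, M, delta)      # Algorithm 1
        _, L = evidence_from_G(G, t, sigma_n2)   # (10)-(12)
        if L > best_L:
            best_d, best_L = d, L
    return best_d
```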

EXPERIMENTAL RESEARCH
The efficiency of the proposed algorithms was tested in simulations using artificial training samples [26] generated by discrete difference equations with additive Gaussian noise. The list of artificial discrete difference equations used is shown in Table 1.
Experiments were performed for algorithms summarized in Table 2.
Training and test samples are given by D = {x_n, y_n}, n = 1, ..., N, where x_n = [u_{n-1}, y_{n-1}, y_{n-2}, ..., y_{n-d}]. For the simulations we used the approximation accuracy threshold δ = 1×10⁻⁶, and we also fixed the maximum approximation ranks and active set sizes at M = N/2, where N is the sample size. Simulations were performed using the squared exponential kernel function (a code sketch of this kernel is given at the end of this subsection):

k(x_i, x_j) = σ_f² exp( −Σ_m (x_{i,m} − x_{j,m})² / (2 l_m²) )

From the simulation results we can see RMSE oscillations for the algorithms based on the Nyström approximation and on the subset of data points, which can be explained by their random subset construction procedure causing model misspecification in some cases, especially noticeable for small training samples. Another interesting observation can be made from Figure 5. In particular, for training sample sizes N from 500 to 2000 and M from 250 to 1000, the subset of data performs somewhat better than the other algorithms. It seems that when the training sample is small (N < 2500 and M < 1250), the random subset construction procedure can produce a subset that is more representative than the full sample, while for larger training samples (N > 2500 and M > 1250) this is no longer true. From Table 4 we see that the proposed algorithm for the training sample size N = 5000 and M = 2500 is about 4 times faster than the subset of data and about 8 times faster than the Nyström method. For small training samples (N < 2000 and M < 1000), the algorithm based on the random subset of data is faster than the proposed one. In conclusion, the proposed algorithm seems to be more attractive when the training sample size N is larger than 2500 and the maximum approximation rank M is larger than 1250. For smaller training samples, the Nyström method and the subset of data can show similar or even better performance.
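Our reading of this kernel in code form (the ARD squared exponential; a sketch, since the exact parameterization used in the paper may differ):

```python
import numpy as np

def se_kernel(X1, X2, sigma_f2, lengthscales):
    """k(x_i, x_j) = sigma_f^2 exp(-sum_m (x_im - x_jm)^2 / (2 l_m^2))."""
    D = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return sigma_f2 * np.exp(-0.5 * np.sum(D ** 2, axis=-1))
```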

SIMULATIONS USING REAL DATA
The prediction performance of the proposed algorithm was tested on a real dataset obtained during field experiments on the pump station. The structure of the pump station is presented in Figure 6. Training and test data samples are given by D = {x_n, y_n}, n = 1, ..., N, where x_n = [u_{n-1}, y_{n-1}, y_{n-2}, ..., y_{n-5}]. The input variable u_n is the valve position (closing percentage). The output variable is the pressure in the input (output) pipe. Figure 7 presents the valve closing percentage in the test sample. Figure 8 shows one-step-ahead prediction results when the maximum rank M is not fixed. Figure 9 shows one-step-ahead predictions for a fixed maximum rank M that is less than the one obtained by using the approximation accuracy threshold only. RMSE values and the maximum approximation ranks M are shown in the graphs. From Figures 8-9 it is clear that a smaller maximum approximation rank M can lead to more accurate predictions, even though the approximation accuracy δ is smaller (the approximation accuracy δ ≤ 1×10⁻⁶ is reached at M = 785 for the input pressure prediction and at M = 534 for the output pressure prediction). The presented results confirm the applicability and effectiveness of the proposed algorithm.
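For completeness, the one-step-ahead predictions and the RMSE scores reported in Figures 8-9 can be computed as in the following sketch (helper names are ours; K_star is assumed to hold k(x_test, x_train) for each test regressor):

```python
import numpy as np

def predict_one_step(K_star, alpha):
    """One-step-ahead prediction y_hat = k^T alpha for each test regressor."""
    return K_star @ alpha

def rmse(y_pred, y_true):
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```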

CONCLUSION
In this paper, we addressed the computational complexity problem of kernel algorithms. Kernel algorithms are nonparametric techniques that allow solving nonlinear