Simulation of The Linear Mechanical Oscillator on GPU

In this short paper we are describing some methods for solving stochastic differential equations which appears in mechanics. We are concerned with solutions of mechanical oscillators which are subject to random excitation. We are primarly interested in solving stochastic differential equations excited with white noise. We rewrite the random mechanical oscillator equation in the form suitable for applying numerical procedures for solving it. Finally, we give some design notes on the methods used to implement numerical simulation of the graphical processing units.


INTRODUCTION
Mathematical models in the form of Stochastic differential equations (SDE) with oscillating solutions can be found in various scientific areas: mechanical system [1], vibratory impact systems [2], statistical radio engineering [3], vibro-absorber systems [4], selfoscillating systems [5], oscillations of elasto-plastic bodies [6], statistical radio physics [7], and financial mathematics [8].In solving numerically the oscillating ordinary differential equations (ODEs), satisfactory accuracy of calculation by the Euler method is usually obtained when 32 integration steps are used per oscillation period (see [26]).
Previous experiments [9], have shown that when oscillating SDEs with such an integration step are solved numerically, there often occurs instability in the numerical solution, that is, a great increase in the amplitude and variance of oscillations, whereas for the exact SDE solution the variance either does not increase or increase slightly as one moves along the integration interval.Therefore, in comparison to ODEs, the integration step must be decreased by several orders of magnitude.In addition, when small volumes of simulated trajectories of SDE solutions are used, absolutely incorrect estimate of solution moments may be obtained if the distribution densities are highly asymmetric.
With extremely small integration steps of the generalized Euler method and great volumes of simulated trajectories of SDE solutions, supercomputers with a large number of processors are needed to obtain satisfactory accuracy of numerical analysis in a reasonable calculation time.It should be noted that the use of some time-consuming methods that are more accurate than the Euler method, for instance, the Milstein method [10], does not speed up solving the problem, but, on the contrary, makes it more difficult.These methods and their error estimates have been received considerable attentions in the literatures, see [18], [19], and [20].It should be noted that the use of some time-consuming methods that are more accurate than the Euler method, for instance, the Milstein method [10], does not speed up solving the problem, but, on the contrary, makes it more difficult.Some preliminary investigations of the authors on solving SDEs with increasing variance on supercomputers are described in [21].
In this paper we study equation of linear mechanical oscillator with constant coefficients: where  is white noise random process which in every time instance represents Gaussian random variable with zero mean and variance equals to one.We are going to reinterpret the results already known, for the solution of the general form of the stochastic differential equation which is equivalent to the equation (1.1), for our special case.The general results about the solution of the linear stochastic differential equation can be found in [12], [13], and [14].We are going to give the forms of these solutions in the case of equation (1.1) in the section 2. We use these analytic solutions to compare the results with the simulation.We present implementation, in section 3, of the Euler method and analytic solution on Tesla C2075, with computational capability 2.0, with graphics driver TCC, product of NVIDIA Corporation.Finally, we give comparison of the results in Section 5.

ANALYTIC SOLUTION
Equation (1.1) can be written in the following form where ( , ) X x x   and W is the Wiener process.In the special case of the constant coefficients we can solve equation explicitly.For the general form of the solution see [12], [13], and [14].
Proof.We use equation (2.1) for the proof and rewrite it in the matrix notation We know that in general case, (see [12] and [13]), linear stochastic differential equation 3) It is easy to prove that Using this computation and equation (2.3), we can produce stated result. q.e.d Following results from Evans, we are in the position to give the following theorem.
Theorem 2.2.We have where  is covariance matrix, and where Proof.It can be proven using equations given in Evans, and already calculated expressions for At e given in (2.4). q.e.d.

NUMERICAL METHODS
In order to be able to implement simulator for the equation (2.2) we give the following theorem.
Theorem 3.1.Let interval of interest of the values of the stochastic process X be given by   0,T and suppose we are given partition , 0,..., , where k  are Gaussian random variables with zero means and unit variance and . According to Ito definition, see [11], of the stochastic integral we can approximate 0 ( ) 0 where we have used the fact that increment of the Wiener process 1 ( ) ( ) is Gaussian random variable with zero mean and variance l t  .
According to definition of the Ito stochastic integral, we know that expression to the right is converging to the integral on the left.Hence, it follows 0 ( ) 0 Similarly according to the Riemann definition of the integral and according to the fact that integral is smooth, we know that 0 ( ) 0 Combining this two facts statement of the theorem follows. q.e.d.
In order to be able to implement an efficient simulator, we can rewrite equation ( where However, if we want to continue computation for some moment ' t t  , we must renormalize k C and k S with the factor There is also another method which can be used for computation.Actually, it is numerical method which can be used for the computation of the solution of general stochastic differential equations.

Theorem 3.2. Let interval of interest for the solution of
X be   0,T , and define partition of the interest by the points , There exist  such that for every , there exists k C and ( ) It can be proved that Proof.This is well-known Euler-Maruyama method and bound for error can be found in Higham (2001).However, this is result for the Euler-Maruyama method in the case of the general form of the stochastic differential equation, and our equation is the special form in which case Euler-Maruyama method is equivalent to the Milstein method and Mistein method has γ = 1 (see [15]).q.e.d.
Actually, it can be shown that constant C k in the previous theorem depends exponentially on k t , which means that stability of computation is not guaranteed for the presented method.Actually, this exponential behavior requires that  should be small in order to be able to compute with requested precision, otherwise computation break down rapidly.

IMPLEMENTATION DETAILS
In this section we give implementation details for the methods presented in theorems 3.1 and 3.2.We are interested in the computation of the probabilty density function for the random variable   X t , for some fixed moment t > 0. According to the architecture of the graphic processing unit, the obvious parallelization is to compute one realization of   X t in every thread.This is rather obvious since we are working on the SIMT architecture and all threads in the simulation are going to compute the same instruction over different data sets, which guaranties optimal performance (see [17], [22], [23], [24], [25]).
In order to implement simulations on parallel architecture, we use the fact that every sample path is independent of any other sample path.Hence, we implement in every thread one sample path simulation.We are going to present implementation details for the NVIDIA graphic processing unit Tesla C2075, with compute capability 2.0 and driver model TCC.
We use simulation on the interval [0,T] with number of points N, which are equidistantly distrubuted by , / , 0,..., . Parameters that are needed for simulation are 0 0 ,   , and functions f and g.
Functions f and g are implemented as device functions and are specified for the concrete simulation.Parameters 0 0 ,   are transferred to the device in the constant memory.We use constant memory to transfer parameters X to the graphical processing unit.This means that we have seven variables of type double in constant memory.These are not the only variables we need to pass to the constant memory.Obviously we cannot store results of the computation in every time step, since, due to the limitation of the 4GB of total memory.Rather, we pass indexes of time steps in which we want to store results.Indexes of this time steps are passed to the graphical processing unit using integer values in the constant memory.We launch kernel using parameters <<< numOfBlocks, numOfThread >>>.Parameter numOfBlocks amounts to the total number of Blocks kernel is launched for.Architecture of the graphical processing unit we are using shows that this number can not be bigger than 2 16 = 65536.Parameter numOfThread, however, has maximal possible value 2 10  = 1024.The problem with achieving this value for the amount of possible number of threads is rather hard to achieve since there is the problem with maximal number of registers which can be assigned to the every thread.Namely, maximal number of registers per block for this architecture is limited to the value of 2 15 = 32768.However, every thread needs to compute random variable which has Gaussian distribution and every thread need to store variables needed for the computation.Those variables are 0 0 ,   , step size , h h , and variables 1 x and 2 x which represent current value of the computation of 1 X and 2 X .There are also variables needed to compute functions f and g which additionally occupy total number of registers.In total, using careful implementation we can achieve launch of 512 threads per block.Random variables with Gaussian distribution are computed using library function curand_double_normal. Additionally, for the method presented in the theorem 3.1 we use CUDA library functions cos, sin and exp.Extensive usage of this function however, increases complexity of the computation on the CUDA device with respect to the method presented in the theorem 3.2.
After computation of the solutions of the stochastic differential equation, we need to compute probability density functions for the random variables .Firstly, we compute distribution function.However, in order to be able to compute distribution function we first sort realization of the variables i X , 1, 2 i  .We sort using function sort of the thrust library.Sorting can be achieved with at most ten present of the total computational time.After sorting the ensembles we divide realizations of the variables i X , 1, 2 i  , in the small pieces and compute distribution function on the small segments of realizations.Distribution function on these small segments of the samples can be achieved using one thread per segment.Using this implementation for the computation can be achieved using at most one present of time needed for the computation of the realizations.However, since we do not return data to the central processing unit memory, we need to pass parameters for this computation of the distribution function into the constant memory.This amounts in additional two integers passed to the graphical processing unit using constant memory.

NUMERICAL RESULTS
We have used method presented in the previous section and implemented it on Tesla C2075, with compute capability 2.0 and driver model TCC.In this section we present some numerical results.We present results for the linear oscillator with parameters 0 0.02, g  .We assume that  is the Gaussian process with flat power spectral density S = 0.1.We present results for 512 numbers of threads and 2000 blocks.In total this means that we are using 1024000 realizations of the Wiener process.We have chosen to compute probability density function for the time instances 20, 40, 60, 80 and 100.

CONCLUSION
In this short paper we studied possibility of simulating linear oscillator with random excitation on graphical processing units.We give description of the implementation we used and report that we are in position to run over million simulations in parallel.Complexity of running lot of simulations is spread over the significant number of threads on the graphical processing unit which can speed up computation by the factor of one hundred.

Figure 1 . 1 gFigure 1
Figure 1.Probability density function for the displacement for the linear simulator with 0 0.02,  

Figure 2 .
Figure 2. Probability density function for the velocity for the linear simulator with 0 0.02,  

Figure 3 .
Figure 3. Probability density function for the envelope for the linear simulator with 0 0.02,  