INTERPRETING RELATIONSHIPS BETWEEN POLLUTANTS AND CARBON DIOXIDE EMITTED INTO AIR FROM INDUSTRIES IN SERBIA

This study focuses on the air pollution problem in Serbia and analyzes the relationships between CO2 emitted into air from industries and air quality indicators such as particulate matter (PM2.5, PM10), nitrogen and sulfur oxides (NOx, SOx), and volatile organic compounds. To identify the dependencies, both parametric and nonparametric statistical learning-based evaluation algorithms were considered. Both model structures produced satisfactory estimations with high accuracy levels. The model interpretation identified PM2.5 as the main indicator for exploring the variability in CO2 concentrations. The implementations showed that interpretable machine learning can provide meta-data and sufficient information for making black-box air quality systems more explainable. Thus, the modelling tools applied, the interrelationships identified and the new information could be considered by the national authorities within a computational environmental management strategy.


INTRODUCTION
Air pollution affects health and has many long- and short-term effects, such as cancer, cardiovascular diseases, and respiratory diseases. Recently, a strong relationship between air pollution and the Covid-19 pandemic has been underlined (Brauer et al. 2021). Providing public air quality is one of the most important environmental management strategies in public management. Managing air quality covers control strategies to reduce air pollution. To put pollution prevention approaches into practice, determining and controlling the amounts of emissions released by industrial production and consumption is critically important. Ultimately, air quality and climate change are functionally linked from their emission sources to their effects on human health and ecosystems (Melamed, Schmale, & Schneidermesser, 2016).
The number of national and global agreements focusing on environmental conversion and sustainability issues tends to increase (Liu, & Shen, 2014). There is also a strong relationship between policies and air pollution measures as well as GHG emissions. According to a report published in 2012 by the OECD (Organization for Economic Cooperation and Development), air pollution would become the top environmental cause of mortality worldwide by 2050 (OECD, 2012). Therefore, both national authorities and international organizations have dealt more with the potential impacts and the necessary precautions in recent years.
The greenhouse gas (GHG) effect refers to the natural warming of the earth that results when gases in the atmosphere trap heat arising from the sun. GHGs such as carbon dioxide (CO2), methane (CH4) and nitrous oxide (N2O) maintain the surface temperature of the Earth. The GHGs affect global climate conditions, and without this effect the Earth would be a frozen globe. Among the GHGs, CO2 is the most plentiful pollutant that contributes to climate change through fossil fuel combustion (Manisalidis et al., 2020; S. Paraschiv, & L. S. Paraschiv, 2020). ... industrial production and consumption can be identified, such as particulate matter (PM2.5, PM10), nitrogen and sulfur oxides (NOx, SOx), ozone, carbon monoxide (CO) as well as volatile organic compounds. The main industrial sources of these pollutants are fuel combustion sources, refineries, cement kilns, quarries and transportation components like highway vehicles (Cristescu, Stoica, & Suditu, 2019).
Even though many investigation outputs on greenhouse gases and air pollution indicators have been presented in the literature, only a limited number of studies have handled them together. In one of the prominent studies, Zhang et al. (2013) focused on PM10 and CO2 emissions in Tianjin, China, based on the electricity industry. In an analytical policy-based study, Bollen and Brink (2014) suggested an analytical framework for air pollution control in Europe by combining greenhouse gases and air pollutants. Recently, an emission reduction analysis for freight transportation and containers has been presented based on the parameters CO2, NOx and PM10 (Bal, & Vleugel, 2020).
Beyond this general scientific literature, the use of statistical techniques and machine learning (ML) algorithms in air pollution control and GHGs has gained popularity in recent years. In one of the earlier studies, Ni et al. (2017) evaluated agricultural air quality in Indiana, USA, using national air emissions and statistical analyses. In that study, both CO2 and some pollutants (NH3, H2S) were considered. Bellinger et al. (2017) reviewed the ML techniques used in the air pollution problem and epidemiology. Air pollution emissions recorded in Spain were predicted by random forest and hierarchical clustering (Martinez-Espana et al., 2018). Similarly, air quality in urban areas of Bulgaria was evaluated by ARIMA and random forest models (Gocheva-Ilieva, Ivanov, & Livieris, 2020). Recently, the air quality assessment of Eskisehir city in Turkey has been performed by statistical learning-based regularization (Tutmez, 2020).
At this stage, understanding the relationships between air pollutants and CO2 could be critical at both regional and global scales. Due to the heterogeneous nature of the air pollution environment and the multi-source data properties, this modelling attempt requires a holistic perspective that covers model accuracy and interpretation together. For this purpose, supervised regression analyses have been identified as optimal for exploring the dependencies with linear and non-linear approaches. Exploring the relationships between CO2 emitted into air from industries and air quality indicators is still a novel problem type, and there is no ML-based study in the literature focusing on this identification problem. Therefore, the dependencies between the major greenhouse gas (CO2) and the major air quality measures have been handled within a statistical machine learning framework. To make realistic and multi-directional analyses, both linear and non-linear regression algorithms have been utilized. Both model types provided high accuracy and meta-data. The numerical outcomes revealed that relatively small particulate matter and nitrogen oxide are the main indicators for understanding the variability in carbon dioxide concentrations originating from different industrial sources and consumption.
The rest of the article is structured as follows. Section 2 introduces the problem and the methodological ground used in this study. Section 3 gives the numerical experiments; the results and a brief discussion are given in this section. Section 4 summarizes the findings of the investigation.

Parametric analysis: principal component regression (PCR)
In a multivariate regression analysis, using a limited number of variables in a model is necessary for simplicity. PCR decreases the number of indicator variables and solves the data collinearity problem. This approach decomposes a data matrix X into scores T and loadings P as follows (Varmuza, & Filzmoser, 2009):
$X = T P^{T} + E$. (1)
The score matrix T includes the maximum amount of information of the data matrix based on orthogonal projections on linear combinations. In Eq. (1), E denotes a residual matrix. The structure can be clarified by the conventional regression form:
$y = X b + e$. (2)
In PCR analysis, the matrix X in Eq. (2) is replaced by T. In this way, the main information of the x-data for regression on y is represented (Suryanarayana, ...). The structure given in Eq. (4) has the ability to remove collinearity and produce new score vectors. Thus, the PCR model coefficients can be provided as follows:
Using the PCR approach in air quality modelling can provide some advantages, such as solving the multicollinearity problem and a notable improvement in testing predictive accuracy compared with the conventional multiple linear regression (MLR) approach. It should also be noted that PCR modelling simply seeks to decrease the variability present throughout the predictor space.
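The PCR steps described above (decompose the centered predictors, regress y on the scores, map the result back to the original variables) can be sketched as follows. This is only an illustrative sketch, not the implementation used in the study: it assumes a single retained component, obtains the loading vector by a plain power iteration, and the function name `pcr_one_component` and the symbol `g` for the score-space coefficient are hypothetical.

```python
import math


def pcr_one_component(X, y, n_iter=200):
    """One-component PCR sketch (assumption: only the first component is kept).

    X is a list of rows, y a list of responses.
    Returns (intercept, b) with b the coefficients on the original variables.
    """
    n, k = len(X), len(X[0])
    # Center predictors and response.
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(k)] for row in X]
    yc = [v - y_mean for v in y]

    # Cross-product matrix X^T X of the centered data.
    C = [[sum(r[a] * r[b] for r in Xc) for b in range(k)] for a in range(k)]

    # Power iteration for the leading eigenvector (first loading vector p).
    p = [1.0 / math.sqrt(k)] * k
    for _ in range(n_iter):
        q = [sum(C[a][b] * p[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0.0:
            break
        p = [v / norm for v in q]

    # Scores t = X p, then ordinary least squares of y on the single score.
    t = [sum(r[j] * p[j] for j in range(k)) for r in Xc]
    g = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    # Map the score-space coefficient back to the original variables.
    b = [g * pj for pj in p]
    intercept = y_mean - sum(bj * mj for bj, mj in zip(b, x_mean))
    return intercept, b
```

Because the regression is carried out on the score vector rather than on X itself, perfectly collinear predictors do not cause a singular system in this sketch.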

Nonparametric Analysis: multivariate adaptive regression splines (MARS)
MARS is a multivariate additive model and a flexible nonparametric solution. It approximates the relationship between the independent variables x and the response variable y using piecewise linear basis functions (BFs) (Hastie, Tibshirani, & Friedman, 2017):
$(x - t)_{+} = \max(0, x - t), \qquad (t - x)_{+} = \max(0, t - x)$
where t denotes the critical point, i.e. a knot. The "+" denotes the positive part.
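The piecewise linear basis functions described above are commonly written as a mirrored pair of hinge functions around a knot. A minimal sketch, with the hypothetical helper name `hinge_pair`:

```python
def hinge_pair(x, t):
    """Return the pair ((x - t)_+, (t - x)_+) for a knot t.

    Each element is the positive part of its argument, so at most one
    of the two values is non-zero for any x.
    """
    return max(0.0, x - t), max(0.0, t - x)
```

For a knot at t = 3, a value x = 5 activates only the first function and a value x = 1 activates only the second.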
A MARS algorithm follows a two-step implementation consisting of forward selection and backward deletion procedures. In the forward process, the algorithm determines the BFs that are added to the regression model by a fast search. This process stops when the algorithm reaches the maximum number of BFs. In the backward process, the model is pruned to reduce overfitting, and the BFs that contribute least are removed from the model. For this operation, the algorithm utilizes the generalized cross-validation (GCV) criterion. Thus, an optimally estimated regression structure is provided (Özmen, 2016). The following functional regression structure gives the model (Zhang, & Goh, 2016):
$f(x) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x)$
where $B_m(x)$ represents a BF, $\beta_0$ and $\beta_m$ are the model coefficients, and M denotes the number of BFs in the regression model.
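To make the additive structure and the pruning criterion concrete, the sketch below evaluates a model built from hinge terms and computes a GCV score. It is a simplified illustration under stated assumptions: a single predictor, hypothetical function names, and an effective number of parameters that is passed in directly, since the penalty convention used in the study is not given here.

```python
def mars_predict(x, intercept, terms):
    """Evaluate f(x) = intercept + sum of coef * hinge for one predictor.

    terms is a list of (coef, knot, sign); sign = +1 gives (x - knot)_+
    and sign = -1 gives (knot - x)_+.
    """
    total = intercept
    for coef, knot, sign in terms:
        total += coef * max(0.0, sign * (x - knot))
    return total


def gcv(y_true, y_pred, effective_params):
    """Generalized cross-validation: (RSS / n) / (1 - C / n)^2.

    effective_params (C) is an assumption supplied by the caller; it
    grows with the number of basis functions and knots in the model.
    """
    n = len(y_true)
    if effective_params >= n:
        raise ValueError("effective_params must be smaller than n")
    rss = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return (rss / n) / (1.0 - effective_params / n) ** 2
```

During backward deletion, a candidate model with fewer basis functions is preferred when it yields a lower GCV value, because the denominator penalizes model size.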
Use of the MARS approach in air modelling analysis also has the potential to provide reliable outcomes. Unlike other nonlinear models, it offers an intuitive stepping block into nonlinearity once the structure of MLR has been grasped (Boehmke, & Greenwell, 2020).

Study Data
Health impact of ambient air pollution in Serbia has been reported by the World Health Organization. According to the report, an estimate of 6

Structure Identification
The industrial pollution sources in Serbia are categorized by the national authority. Because the industry-specific sources have impacts at very different levels, a logarithmic transformation has been performed. Figure 1, developed from the application data, shows the impact levels of the main sectors on air pollution. In addition to the electricity-gas-air conditioning category, non-metallic manufacturing and mining-quarrying come to the forefront among the industries in Serbia.
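A logarithmic transformation of this kind can be sketched as below. The base of the logarithm and the treatment of non-positive values are not stated in the text, so base 10 and a strict positivity check are assumptions, and the function name is hypothetical.

```python
import math


def log10_transform(values):
    """Return log10 of each value (assumption: base 10, strictly positive input)."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithm is undefined for non-positive values")
    return [math.log10(v) for v in values]
```

On this scale, emission totals that differ by several orders of magnitude between sectors become comparable in a single plot or model.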
To explore the relationships between CO2 and the air quality characteristics (PM10, PM2.5, NOx, organic compounds and SOx), the measured projections were first displayed using contour maps in Figure 2. The domains around the average CO2 concentrations have similar levels and shapes. Among the parameters, SOx produces a relatively different structure. This could partly be due to the scale of the horizontal axis. In addition, the burning of fossil fuels is the major source of this toxic gas, so a steady emission could be expected.

Results and Discussion
The PCR-based parametric analysis was carried out following a two-step procedure. First, uncorrelated components were obtained. After that, the resulting features were employed as predictors in a linear model. The 10-fold cross validation (CV) on the PCR model revealed the optimal number of components based on the Root Mean Squared Error (RMSE) measure (Figure 3). Although the optimization process points to one component, the use of two or three components with relatively low error levels could produce satisfactory results as well.
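The selection of a model by 10-fold CV RMSE can be sketched generically as follows. This is an illustration rather than the procedure used in the study: the folds are assigned deterministically without shuffling, the RMSE is averaged over folds, and the two simple candidate models (a mean model and a one-predictor line) are hypothetical stand-ins for PCR models with different numbers of components.

```python
import math


def kfold_indices(n, k):
    """Split indices 0..n-1 into k folds by round-robin assignment."""
    if k > n:
        raise ValueError("k must not exceed the number of samples")
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds


def cv_rmse(x, y, fit, predict, k=10):
    """Average the held-out RMSE over k folds for a fit/predict pair."""
    n = len(y)
    fold_rmse = []
    for test_idx in kfold_indices(n, k):
        held = set(test_idx)
        x_tr = [x[i] for i in range(n) if i not in held]
        y_tr = [y[i] for i in range(n) if i not in held]
        model = fit(x_tr, y_tr)
        sq = [(y[i] - predict(model, x[i])) ** 2 for i in test_idx]
        fold_rmse.append(math.sqrt(sum(sq) / len(sq)))
    return sum(fold_rmse) / k


# Two hypothetical candidate models for demonstration only.
def fit_mean(x, y):
    return sum(y) / len(y)


def predict_mean(model, xi):
    return model


def fit_line(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def predict_line(model, xi):
    intercept, slope = model
    return intercept + slope * xi
```

The candidate with the lowest cross-validated RMSE would be retained, which is the same criterion that Figure 3 uses to compare numbers of components.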

Figure 3: 10-fold CV RMSE provided by PCR
The nonlinear regression analyses were carried out using spline-based adaptive regression. Before the CV-optimized structure was built, a preliminary implementation was conducted and the variation of the residuals was recorded. In this way, potential outliers and the overall performance of the MARS model were checked. Figure 4 displays the residuals at different levels using a cumulative distribution plot. In a grid search by 10-fold CV, the focus was on interactions of the first and the second degrees. The Generalized Cross Validation (GCV) approach has been considered during the optimization process. The MARS model that gives the optimal combination consists of the first-degree interaction effects and retains 4 terms (Figure 5).
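A cumulative distribution of residuals of the kind mentioned above can be computed as in the sketch below. It assumes that absolute residuals are plotted, which is a common convention but is not confirmed by the text; the function name is hypothetical.

```python
def abs_residual_ecdf(y_true, y_pred):
    """Return (absolute residual, cumulative proportion) pairs in ascending order."""
    abs_res = sorted(abs(a - b) for a, b in zip(y_true, y_pred))
    n = len(abs_res)
    return [(r, (i + 1) / n) for i, r in enumerate(abs_res)]
```

Reading the curve at a given proportion shows the residual size that this share of observations does not exceed, which helps to spot outliers in the upper tail.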

B. Tutmez
Interpreting relationships between pollutants and carbon dioxide emitted into air from industries in Serbia 120 JEMC, VOL. 11, NO. 2, 2021, 115-123
Both the PCR and the MARS models also identified effective regression parameters. Figure 6 also indicates that there is no relationship between CO2 emissions and non-methane volatile organic compounds.

Figure 6: Variable importance for models: a) PCR, b) MARS
Along with the training implementations, the accuracy of the regression models has also been inspected by a series of testing applications. For a comparative evaluation, the outcomes of conventional LSE-based multiple regression have also been added to the table. Mean Squared Error (MSE) has been considered as the performance measure. The performances summarized in Table 2 revealed that both models outperform conventional regression in testing. It should be underlined that the suggested models achieved this capability using a limited number of variables. Therefore, these models have better generality, transparency and flexibility.
Looking at the outcomes more closely, there is a close relationship between fine particulate matter (PM2.5, tiny particles in the air that are two and a half microns or less in width) and carbon dioxide (CO2) concentrations. As a notable air pollutant, PM2.5 can reach the lungs. The main sources of these particles are vehicle exhausts, the burning of fuels and forest fires. CO2 also comes from similar sources. Among the industries in Serbia, mining-quarrying and electricity-air conditioning supply are also critically important in generating emissions of both PM2.5 and CO2. Therefore, a direct relationship can be expected by nature. According to the measured data, a medium-level correlation (about 60%) between these parameters was calculated by the Pearson correlation coefficient. However, Table 1 indicates a far more striking contribution of PM2.5 to the CO2 estimations (about 100%). Because both models consider interactions and functional evaluations, the major importance of PM2.5 particles for the main greenhouse gas CO2 has been captured by these algorithms.
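The two measures used in this discussion, the MSE for testing accuracy and the Pearson correlation coefficient for the pairwise dependency, can be computed as in the following sketch. The function names are hypothetical and no data from the study are reproduced.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    if saa == 0.0 or sbb == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sab / math.sqrt(saa * sbb)
```

The Pearson coefficient captures only the linear, pairwise association, whereas a variable importance score reflects how a predictor contributes within a fitted model, so the two quantities need not agree in magnitude.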

CONCLUSIONS
This study focused on uncovering the relationships between CO2 emitted into air from industries and air quality indicators in Serbia. To specify the dependencies between the major greenhouse gas (CO2) and the major air quality indicators such as particulate matter (PM2.5, PM10), nitrogen and sulfur oxides (NOx, SOx), and volatile organic compounds, statistical learning-based evaluation tools were considered.
To conduct realistic and multi-directional analyses, both parametric and non-parametric regression algorithms have been utilized.

The numerical experiments on training and testing data showed that both modelling approaches achieve high and satisfactory accuracy. The model interpretation revealed that the major pollutant for appraising CO2 emissions is PM2.5. In addition, a medium-level dependency between NOx and CO2 has been recorded. Interpretable machine learning algorithms provide meta-data and new information for making black-box natural systems explainable. The findings provide a dependable statistical ground, with valid algorithms and new information, for the national authorities to appraise the relationships and sources and to take precautions as part of the environmental management system of the country.