MACHINE LEARNING APPROACHES FOR BURNED AREA IDENTIFICATION USING SENTINEL-2 IN CENTRAL KALIMANTAN

Forest and land fires are disasters that commonly occur in Indonesia, mainly in Kalimantan and Sumatera. Optical remote sensing satellites are a promising technology for identifying burned areas quickly in support of disaster management response. This study evaluated supervised machine learning methods, namely Support Vector Machine (SVM), Random Forest (RF), and Deep Neural Network (DNN), for classifying burned areas in Central Kalimantan province using Sentinel-2 imagery from June and August 2019 as the pre-fire and post-fire events, respectively. An imbalanced and a balanced dataset with varying hyper-parameters were used with these classifiers. Hotspot data derived from MODIS and Suomi NPP were also used in the training and testing datasets. The results show that the imbalanced dataset influences the precision, recall, and accuracy of the SVM and DNN classifiers, but has less effect on RF. The RF classifier outperforms SVM and DNN in terms of precision, recall, and accuracy for both the balanced and the imbalanced dataset, with accuracy ranging from 98.2% to 99.3%. The accuracy of the SVM classifier ranges from 94.7% to 98.1% for the imbalanced dataset and from 90.4% to 98.2% for the balanced dataset. Although the DNN classifier still achieves high accuracy, its accuracy drops from 98.5-98.8% with the balanced dataset to 95.5-95.7% with the imbalanced dataset. These findings imply that high accuracy can still be achieved by the SVM, RF, and DNN classifiers with either an imbalanced or a balanced dataset.


INTRODUCTION
Forest and land fires are disasters that occur every year in Indonesia, mainly in Kalimantan and Sumatera [1], [2], [3]. In 2015, the burned area in Indonesia reached 2.6 million hectares, of which 16% was located in Central Kalimantan. According to the World Bank, the overall economic loss reached USD 71 million [4]. Data from the Ministry of Environment and Forestry also show that Central Kalimantan had the largest burned area in 2019, about 134 thousand hectares [5]. Burned areas have many negative impacts, including disruption of health and social activities, greenhouse gas emissions, and loss of flora and fauna diversity [6], [7]. Given these impacts, the government and related institutions need to act immediately. To do so, they need information about the burned area, such as its exact location and the extent of the impact. Burned area information is therefore needed for disaster management and impact analysis [8]. Remote sensing satellite imagery is an alternative technology that can be used to acquire reliable information about burned areas [8]. Optical remote sensing satellite data at low to moderate spatial resolution have been widely used to identify burned areas, for example with the MODIS sensor [2], [9], [10], [11] and Landsat [12], [13], [14]. High spatial resolution satellite data such as Sentinel-2 are also promising for burned area detection [15], [16], [17], [18], as they yield larger burned area estimates owing to the ability to identify small fires [19]. In recent years, with the rise of machine learning technology, research on geospatial data using machine learning has also been increasing.
One of the applications most often performed with machine learning is land use and land cover classification, for example with Deep Neural Network (DNN), Support Vector Machine (SVM), and ensemble classifiers on high spatial resolution imagery [20], and with Random Forest (RF), k-Nearest Neighbor, and SVM on Sentinel-2 data [21]. Those studies implied that SVM and RF classifiers are more versatile than deep learning for land use and land cover applications, since they are less sensitive to imbalanced data [22]. However, the DNN classifier is beneficial for remote sensing applications that involve complex physical models [23]. For instance, increased accuracy can be achieved with a DNN classifier for land use and land cover applications [24]. Some research has also analyzed burned areas with machine learning methods. Classifiers such as RF, SVM, and neural networks have been investigated using PROBA-V reflectance imagery [25] and MODIS data [11], [26], [27], [28]. Those studies implied that the RF classifier is the most suitable to implement, whereas the neural network classifier requires parameter tuning and is therefore more complex than RF and SVM. Strategies such as validation-loss should also be applied to handle an imbalanced dataset in a neural network classifier [11], so the use of a DNN classifier for burned area identification still needs further research. This study aims to acquire and evaluate burned area information models from SVM, RF, and DNN classifiers using Sentinel-2 imagery. The SVM and RF classifiers were chosen because they are simple methods that commonly work well in image classification. Although deep learning is a promising technology in image analysis, its effectiveness still needs to be researched comprehensively.
Therefore, these methods need to be evaluated by comparing the classifiers. The physical parameters used are the post-fire Normalized Burn Ratio (NBR post-fire) and the difference in NBR between pre-fire and post-fire (ΔNBR), since NBR is one of the most significant parameters for burned area discrimination [29], [30]. Differences in the dataset used are also considered in the evaluation.

Study area and data
Since Central Kalimantan was one of the areas most affected by forest and land fires in 2019 [5], it was chosen as the study area, as shown in Figure 1. Sentinel-2 imagery was chosen because its high temporal resolution [31] helps minimize cloud cover issues. Satellite images acquired in June and August 2019 were used as the dataset. Sentinel-2 provides the spectral bands listed in Table 1, of which the Narrow Near-Infrared (B8A) and Short-wave Infrared (B12) bands were used to compute the Normalized Burn Ratio. Figure 2 illustrates the methodology of this study. It covers the processing of the remote sensing imagery, namely cloud masking, mosaicking, feature extraction, algorithm training, parameter tuning, and classification with each machine learning method. Model evaluation, including accuracy and error assessment, was also carried out. Sentinel-2 imagery was acquired from Google Earth Engine, while the MODIS and Suomi NPP hotspot data were obtained from the Indonesian National Institute of Aeronautics and Space. In the processing stage, the Sentinel-2 data were atmospherically corrected and clipped to the study area. All tasks in the processing stage were done using the Google Earth Engine API in Python.

Methods
The features used in this study as physical parameters to classify burned area are the pre-fire Normalized Burn Ratio (NBR pre-fire), the post-fire Normalized Burn Ratio (NBR post-fire), and the difference between them (ΔNBR), as defined in Equations (1) and (2).
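$$\mathrm{NBR} = \frac{\mathrm{NIR} - \mathrm{SWIR}}{\mathrm{NIR} + \mathrm{SWIR}} \qquad (1)$$
$$\Delta\mathrm{NBR} = \mathrm{NBR}_{\text{pre-fire}} - \mathrm{NBR}_{\text{post-fire}} \qquad (2)$$
in which NIR is the near-infrared band and SWIR is the shortwave infrared band.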
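As an illustration of how NBR and ΔNBR can be computed per pixel, the following is a minimal NumPy sketch. It assumes the B8A (NIR) and B12 (SWIR) reflectances are already available as arrays; the function names and the sample reflectance values are hypothetical and are not taken from the study.

```python
import numpy as np

def nbr(nir, swir):
    """Normalized Burn Ratio: (NIR - SWIR) / (NIR + SWIR)."""
    nir = np.asarray(nir, dtype=float)
    swir = np.asarray(swir, dtype=float)
    denom = nir + swir
    # Guard against division by zero: such pixels are returned as NaN.
    safe = np.where(denom == 0, 1.0, denom)
    return np.where(denom == 0, np.nan, (nir - swir) / safe)

def delta_nbr(nbr_pre, nbr_post):
    """Difference between pre-fire and post-fire NBR."""
    return np.asarray(nbr_pre, dtype=float) - np.asarray(nbr_post, dtype=float)

# Hypothetical reflectances for three pixels (not real Sentinel-2 values).
nir_pre = np.array([0.40, 0.35, 0.30])
swir_pre = np.array([0.10, 0.12, 0.15])
nir_post = np.array([0.15, 0.34, 0.30])
swir_post = np.array([0.30, 0.12, 0.15])

nbr_pre = nbr(nir_pre, swir_pre)
nbr_post = nbr(nir_post, swir_post)
dnbr = delta_nbr(nbr_pre, nbr_post)
```

In this toy input the first pixel loses NIR reflectance and gains SWIR reflectance after the fire, so its ΔNBR is large and positive, while the third pixel is unchanged and its ΔNBR is zero.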

Training and testing dataset
In this study, the dataset consists of two classes, namely burned and unburned pixels, based on Sentinel-2 imagery derived from Google Earth Engine. The ΔNBR burn severity categories from the United States Geological Survey (USGS), shown in Table 2, and the NBR post-fire values were used to label pixels as burned or unburned. Hotspot data were also used as training samples. Locations where a hotspot with a medium or high confidence level coincides with pronounced feature values were classified as burned area, whereas hotspots without pronounced feature values were classified as unburned area. This study used two different datasets.
The first is an imbalanced dataset with a 75:25 ratio between burned and unburned pixels, and the second is a balanced dataset with a 50:50 ratio between burned and unburned pixels. Each dataset was split into training data and test data.
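A minimal sketch of how the 75:25 and 50:50 datasets could be drawn from a pool of labeled pixels is given below. The pool is synthetic, and the sample sizes, the helper name `sample_by_ratio`, and the 80/20 train/test proportion are assumptions for illustration only; the study does not state them.

```python
import numpy as np

def sample_by_ratio(X, y, n_total, burned_fraction, rng):
    """Draw n_total pixels with the requested fraction of burned (label 1) pixels."""
    n_burned = int(round(n_total * burned_fraction))
    n_unburned = n_total - n_burned
    burned_idx = rng.choice(np.flatnonzero(y == 1), size=n_burned, replace=False)
    unburned_idx = rng.choice(np.flatnonzero(y == 0), size=n_unburned, replace=False)
    # Shuffle so that the classes are mixed before splitting.
    idx = rng.permutation(np.concatenate([burned_idx, unburned_idx]))
    return X[idx], y[idx]

rng = np.random.default_rng(0)

# Synthetic pool: three features per pixel (NBR pre-fire, NBR post-fire, dNBR).
X_pool = rng.normal(size=(4000, 3))
y_pool = np.repeat([1, 0], 2000)  # 1 = burned, 0 = unburned

X_imb, y_imb = sample_by_ratio(X_pool, y_pool, 2000, 0.75, rng)  # 75:25
X_bal, y_bal = sample_by_ratio(X_pool, y_pool, 2000, 0.50, rng)  # 50:50

# Assumed 80/20 train/test split (the proportion is not given in the study).
n_train = int(0.8 * len(y_imb))
X_train, X_test = X_imb[:n_train], X_imb[n_train:]
y_train, y_test = y_imb[:n_train], y_imb[n_train:]
```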

Implementation of classifier
The classification models were developed in the Python programming language. The TensorFlow framework was used for the DNN classifier, whereas the SVM and RF classifiers were implemented with the scikit-learn module. Parameter tuning was performed for the Support Vector Machine (SVM), Random Forest (RF), and Deep Neural Network (DNN) algorithms.
SVM is a supervised machine learning method commonly used for classification and regression. The hyper-parameters that can be tuned in an SVM implementation include the kernel, the regularization parameter, also called cost (C), and gamma (γ). The parameter γ determines the influence of a single training sample on the model, and C controls the penalty for misclassifying training samples [27]. A high value of C tends to select a small margin and results in a small error tolerance. Polynomial kernels and the radial basis function (RBF) [...] [27].
In a random forest, the number of trees is the hyper-parameter that has to be set. In this study, 5, 10, and 20 trees were investigated.
A Deep Neural Network (DNN) consists of four main components, namely the input layer, neurons, hidden layers, and the output layer. The number of hidden layers influences model performance, but there is no fixed rule for determining it [20]. In this study, the number of hidden layers was varied. The rectified linear unit was used as the activation function, and dropout was used to prevent over-fitting.
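The DNN configuration described here (a variable number of hidden layers, rectified linear unit activation, and dropout) could be sketched in TensorFlow Keras as follows. The number of units per layer, the dropout rate, the sigmoid output, the optimizer, and the loss are assumptions chosen for illustration; the study does not report them, and `build_dnn` is a hypothetical helper name.

```python
import tensorflow as tf

def build_dnn(n_features, n_hidden_layers, units=32, dropout_rate=0.2):
    """Binary burned/unburned classifier with ReLU hidden layers and dropout.

    units, dropout_rate, optimizer, and loss are illustrative assumptions.
    """
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_features,)))
    for _ in range(n_hidden_layers):
        model.add(tf.keras.layers.Dense(units, activation="relu"))
        model.add(tf.keras.layers.Dropout(dropout_rate))
    # A single sigmoid unit outputs the probability of the burned class.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Three input features (NBR pre-fire, NBR post-fire, dNBR), two hidden layers.
model = build_dnn(n_features=3, n_hidden_layers=2)
```

Given training arrays `X_train` and `y_train`, a fixed-epoch run would then be `model.fit(X_train, y_train, epochs=50)`, and varying `n_hidden_layers` reproduces the kind of comparison described in the text.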
The DNN was trained for a fixed 50 epochs.
Table 3 and Table 4 show the root mean square error (RMSE) of the SVM classifier for the imbalanced and the balanced dataset, respectively, for varying values of C and γ. For the imbalanced dataset, the lowest error is produced by a high value of C and a low value of γ. For instance, a C value of 100 and a γ value of 0.0001 generate the lowest RMSE. For the balanced dataset, RMSE changes significantly when the value of γ is high. In general, both the imbalanced and the balanced dataset give the lowest RMSE for a high value of C and a low value of γ. Figure 3 and Figure 4 show the accuracy of the SVM classifier as a function of C and γ for the imbalanced and the balanced dataset. The accuracy ranges from 94.7% to 98.1% for the imbalanced dataset and from 90.4% to 98.2% for the balanced dataset. The result implies that the proportion of burned and unburned area pixels in the dataset affects accuracy, although high accuracy can still be achieved for both the imbalanced and the balanced dataset.
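The kind of sweep over C and γ summarized in these tables can be sketched with scikit-learn as below. The data are synthetic, the RBF kernel is an assumption, and the grid values other than C = 100 and γ = 0.0001 are illustrative, so the numbers produced will not match the study's tables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the pixel features, with a 75:25 class ratio.
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.25, 0.75], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

C_values = [1, 10, 100]           # illustrative grid
gamma_values = [0.0001, 0.01, 1]  # illustrative grid

results = {}
for C in C_values:
    for gamma in gamma_values:
        clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        acc = accuracy_score(y_test, y_pred)
        results[(C, gamma)] = (rmse, acc)
```

With 0/1 class labels, the squared RMSE equals the misclassification rate, so RMSE and accuracy carry the same information in this binary setting.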

[Figure: burned area precision versus hyper-parameters (C and γ) of the SVM classifier with an imbalanced dataset]
The precision and recall values were also investigated, since they are effective measures of prediction success [11]. In the imbalanced dataset, the ratio between burned area and unburned area pixels was 75:25. Table 7 and Table 11 show that there are no significant changes in the precision and recall values for burned area pixels, which range from 0.95 to 0.99 across the C and γ values. In contrast, for unburned area pixels in the imbalanced dataset, the recall values change significantly while precision remains high. However, the precision values obtained for unburned area pixels, which range from 0.88 to 0.96 as shown in Table 9, are lower than those for burned area pixels. This shows that the SVM classifier is slightly affected by the imbalanced dataset.
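To make the per-class metrics concrete, the following is a small pure-Python sketch that computes precision and recall for each class from true and predicted labels. The eight labels are a made-up 75:25 example, not data from the study.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy labels: 1 = burned, 0 = unburned, six burned and two unburned pixels.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]

burned_precision, burned_recall = precision_recall(y_true, y_pred, positive=1)
unburned_precision, unburned_recall = precision_recall(y_true, y_pred, positive=0)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case one error is made in each direction. The burned class keeps precision and recall of 5/6, while the minority unburned class drops to 0.5 for both, which illustrates why per-class metrics are more sensitive to class imbalance than overall accuracy.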
In the balanced dataset, the ratio between burned area and unburned area pixels was modified to 50:50. Table 8 and Table 13 show that, for the balanced dataset, burned area precision and unburned area recall are lower than for the imbalanced dataset when γ is increased, ranging from 0.83 to 1 across the C and γ values, although precision for burned area pixels remains high. In contrast, there are no significant changes in unburned area precision and burned area recall, which range from 0.96 to 1.
Figure 5 depicts the accuracy of the RF classifier for the imbalanced and the balanced dataset with varying numbers of trees. The accuracy ranges from 98.7% to 99.1% for the imbalanced dataset and from 98.2% to 99.3% for the balanced dataset. This implies that high accuracy can be obtained for both the imbalanced and the balanced dataset. This study also finds that accuracy is not affected by the number of trees, which supports Maxwell et al. [22], who explicitly concluded for remote sensing data studies that accuracy is not determined by the number of trees.
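A comparison across the three tree counts investigated in the study (5, 10, and 20) can be sketched with scikit-learn as follows. The data are synthetic and the remaining settings are scikit-learn defaults, which is an assumption, so the resulting accuracies are not comparable with the reported figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced (50:50) pixel dataset.
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.5, 0.5], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

accuracy_by_trees = {}
for n_trees in (5, 10, 20):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    rf.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out test data.
    accuracy_by_trees[n_trees] = rf.score(X_test, y_test)
```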

DNN Classifier
The accuracy of the DNN classifier for the imbalanced and the balanced dataset with varying numbers of layers is shown in Figure 6. The accuracy ranges from 95.5% to 95.7% for the imbalanced dataset and from 98.5% to 98.8% for the balanced dataset. The result shows that the proportion of burned and unburned area pixels in the dataset affects accuracy, although high accuracy can still be achieved for both the imbalanced and the balanced dataset. Table 17 and Table 18 show the precision and recall values of the DNN classifier for the imbalanced and the balanced dataset, respectively. The overall precision ranges from 0.83 to 0.88 for the imbalanced dataset and from 0.97 to 0.98 for the balanced dataset, which indicates that the class proportion influences the precision value. However, the recall value is relatively unaffected by the class proportion, as seen in Table 17 and Table 18. [...] dataset. Deep learning gives strong results with a balanced dataset and a large number of training samples. It has also produced valuable results in applications such as image classification, so future work will focus on using a balanced dataset with several deep learning methods. One such method is the generative adversarial network, which has shown promising results with a small number of training samples.