INFLUENCE OF PRE-PROCESSING ON ANOMALY-BASED INTRUSION DETECTION

Introduction/purpose: An anomaly-based intrusion detection system detects intrusions based on a reference model which identifies the normal behavior of a computer network and flags deviations from it as anomalies. Machine-learning models classify network traffic as either normal or anomalous. In complex computer networks, the number of training records is large, which makes the evaluation of the classifiers computationally expensive. Methods: A feature selection algorithm that reduces the dataset size is presented in this paper. Results: The experiments are conducted on the Kyoto 2006+ dataset and four classifier models: feedforward neural network, k-nearest neighbor, weighted k-nearest neighbor, and medium decision tree. The results show high accuracy of the models, as well as low false positive and false negative rates. Conclusion: The three-step pre-processing algorithm for feature selection and instance normalization improved the performances of the four binary classifiers and decreased the processing time.


Introduction
Intrusion detection systems (IDSs) monitor computer network behavior to perform diagnostics of the security status and protect the network from malicious activities or various anomalies. Intrusion detection systems can be divided into two basic groups. Misuse or signature based IDSs detect malware based on knowledge accumulated from known attacks. Anomaly based IDSs detect deviations from a model of usual network behavior. The goal of anomaly detection is to build a statistical model of normal network behavior and look for activities which deviate from the created model (Protić & Stanković, 2020, p.7). The main disadvantage of the signature based IDS is the difficulty of detecting unknown attacks. The biggest challenge in anomaly detection is to identify what is considered normal. Machine learning (ML) based binary classifiers can detect anomalies with a high accuracy of prediction. In supervised learning, the number of training instances collected over a period of time can be large, which makes the evaluation of the models computationally expensive. Feature selection reduces the training set, which speeds up the processing time and increases the accuracy of the classifiers. This paper shows the results of experiments with the three-step feature selection and instance normalization pre-processing algorithm, conducted on the Kyoto 2006+ dataset and four machine learning models, namely: feedforward neural network (FNN), k-nearest neighbor (k-NN), weighted k-NN (wk-NN), and medium decision tree (DT). Accuracy (ACC), false positive rate (FPR), false negative rate (FNR), and processing time are given.
Feature selection and instance normalization: Three-step pre-processing algorithm
One of the major issues in supervised ML is the large number of instances in the training set. The aim of feature selection is to reduce the dataset size and remove irrelevant features. Furthermore, raw data have to be pre-processed before being fed to the input of the model so that the effects of one feature cannot dominate the others. In this paper, a three-step pre-processing algorithm for feature selection is presented. The algorithm is given as follows:
1 Identify and discard all irrelevant features;
2 Remove features which cannot be normalized into the range [-1,1];
3 Normalize instances into the range [-1,1] by applying the hyperbolic tangent function:

x̃_i = tanh(x_i) = (e^{x_i} - e^{-x_i}) / (e^{x_i} + e^{-x_i}),  i = 1, ..., n,  (1)

where n is the number of instances in the dataset. Feature selection improves the performances of the classifier, saves memory space and decreases processing time. Additionally, instance normalization speeds up the model training and reduces the domination of one feature over the other ones (Protić, 2018, pp.587-589).

The Kyoto 2006+ dataset
The experiments in this paper are conducted on the Kyoto 2006+ dataset. During the observation period, over 50 million sessions of normal traffic, 43 million sessions of known attacks and 426 thousand sessions of unknown attacks were recorded (Protić & Stanković, 2018, p.44). The dataset consists of 14 statistical features derived from the KDD Cup '99 dataset (Table 1) and 10 additional features which enable more efficient investigation (Table 2) (Ashok Kumar & Venugopalan, 2018), (Song et al, 2011, pp.29-36).

Table 1 - Statistical features of the Kyoto 2006+ dataset
1 Duration: The length (number of seconds) of the connection.
2 Service: The connection's service type (e.g., http, telnet).
3 Source bytes: The number of data bytes sent by the source IP address.

4 Destination bytes: The number of data bytes sent by the destination IP address.
5 Count: The number of connections whose source IP address and destination IP address are the same as those of the current connection in the past two seconds.
6 Same_srv_rate: % of connections to the same service in the Count feature.
7 Serror_rate: % of connections that have 'SYN' errors in the Count feature.
8 Srv_serror_rate: % of connections that have 'SYN' errors in the Srv_count (% of connections whose service type is the same as that of the current connection in the past two seconds) feature.
9 Dst_host_count: Among the past 100 connections whose destination IP address is the same as that of the current connection, the number of connections whose source IP address is also the same as that of the current connection.
10 Dst_host_srv_count: Among the past 100 connections whose destination IP address is the same as that of the current connection, the number of connections whose service type is also the same as that of the current connection.
11 Dst_host_same_src_port_rate: % of connections whose source port is the same as that of the current connection in the Dst_host_count feature.
12 Dst_host_serror_rate: % of connections that have 'SYN' errors in the Dst_host_count feature.
13 Dst_host_srv_serror_rate: % of connections that have 'SYN' errors in the Dst_host_srv_count feature.
14 Flag: The state of the connection at the time the connection summary was written.

Table 2 - Additional features of the Kyoto 2006+ dataset
1 IDS_detection: Reflects whether the IDS triggered an alert for the connection; '0' means no alert was triggered, and an Arabic numeral means the different kinds of alerts. The parenthesis indicates the number of the same alert.

2 Malware_detection: Indicates whether malware, also known as malicious software, was observed in the connection; '0' means no malware was observed, and a string indicates the corresponding malware observed at the connection. The parenthesis indicates the number of the same malware.
3 Ashula_detection: Indicates whether shellcodes and exploit codes were used in the connection; '0' means neither shellcode nor exploit code was observed, and an Arabic numeral means the different kinds of shellcodes or exploit codes. The parenthesis indicates the number of the same shellcode or exploit code.
4 Label: Indicates whether the session was an attack or not; '1' means the session was normal, '-1' means a known attack was observed in the session, and '-2' means an unknown attack was observed in the session.
5 Source_IP_Address: Indicates the source IP address used in the session. The original IP address on IPv4 was sanitized to one of the Unique Local IPv6 Unicast Addresses. Also, the same private IP addresses are only valid in the same month; if two private IP addresses are the same within the same month, their IP addresses on IPv4 were also the same, otherwise they are different.
6 Source_Port_Number: Indicates the source port number used in the session.
7 Destination_IP_Address: Indicates the destination IP address used in the session; it was also sanitized.
8 Destination_Port_Number: Indicates the destination port number used in the session.
9 Start_Time: Indicates when the session started.
10 Duration: Indicates how long the session was being established.
The proposed algorithm discards all categorical features as well as the features intended for further investigation, excluding the Label feature (step 1), cuts all features containing instances that cannot be normalized into the range [-1,1] (step 2), and normalizes the remaining instances (step 3). Out of the 24 features of the Kyoto 2006+ dataset, 17 features are left after the first pre-processing step and nine features (5-13) are left after the pre-processing is done. The Label feature is used for the detection of anomalies. Scaled instances not only reduce the effects of one feature on the others but also speed up the FNN, since the network training is more efficient if normalization is performed on the inputs. If the number of inputs is ≥3, the sigmoid functions used in the hidden layer become easily saturated. If the saturation happens at the beginning of the training, the gradients will be small, which may slow down the network training (Protić & Stanković, 2020, p.9). Also, instances are scaled due to the fact that distances in the wk-NN model lose accuracy because of a small difference between the farthest and the nearest neighbors.
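As an illustration, step 3 of the pre-processing can be sketched in Python. This is a minimal sketch: steps 1 and 2 are dataset-specific column drops and are not shown, and the nine retained features are assumed to be numeric. The toy matrix is illustrative, not taken from the Kyoto 2006+ dataset.

```python
import numpy as np

def normalize_instances(X):
    """Step 3 of the pre-processing algorithm: map every instance
    into the range [-1, 1] with the hyperbolic tangent function."""
    return np.tanh(np.asarray(X, dtype=float))

# A toy matrix standing in for instances of the nine retained features
X = np.array([[0.0, 2.5, -1.3],
              [4.0, -0.5, 0.7]])
X_norm = normalize_instances(X)
assert X_norm.min() >= -1.0 and X_norm.max() <= 1.0
```

Because tanh saturates quickly, very large raw values all map close to ±1; this is exactly what keeps a feature with a large numeric range from dominating the distance computations of the k-NN models.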

Classifier Models
In supervised machine learning, classifiers can be divided into two groups. Lazy learners, such as the k-NN and the wk-NN, do not focus on constructing a general model, but store the training data and wait until a test set appears. Eager learners, such as the FNN, construct a classification model before getting data for predictions.

k-Nearest Neighbor
The k-NN stores all instances corresponding to the training data in the n-dimensional space. Classification is computed on a simple majority vote of the k nearest neighbors of each point, based on the Euclidean distance measure (Protić & Stanković, 2020, p.9):

d(x, y) = sqrt(Σ_{i=1}^{n} (x_i - y_i)^2)  (2)

The prediction speed of the k-NN is medium, as is its memory usage. Interpretability of the classifier is hard. In the experiments, the distinction between classes is medium, and the number of neighbors is set to 10 (Protić & Stanković, 2018, p.48).
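A minimal sketch of this majority-vote classification follows; the toy training points and labels are illustrative, not drawn from the dataset, and the label convention (1 = normal, -1 = attack) mirrors the Label feature.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=10):
    """Classify x by a simple majority vote among its k nearest
    training instances under the Euclidean distance."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]         # majority class

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([1, 1, -1, -1])            # 1 = normal, -1 = attack
print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))  # → 1
```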

Weighted k-Nearest Neighbor
The main idea of the wk-NN is to extend the k-NN so that the instances within the training set which are particularly close to the new instance get a higher weight in the decision than the more distant ones (Tsigkritis et al, 2018, pp.70-84). The distances are transformed into the weights as follows:

w_i = 1 / d(x, x_i)^2,  (3)

where d(x, x_i) is the distance between the new instance x and the training instance x_i. The wk-NN classifier adapts as new training data are collected, which allows the algorithm to respond quickly to changes in the input during real-time use. In contrast with the fast training stage, the algorithm requires expensive testing. All the cost of the algorithm is in the processing time. All characteristics of the classifier are the same as for the k-NN model except flexibility, which is also medium (medium distinctions between classes using a distance weight) (Protić & Stanković, 2018, p.48).
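The weighted vote can be sketched as below, assuming the squared-inverse distance weight used by common wk-NN implementations; the small epsilon guarding against division by zero is an implementation detail added here, and the toy data are illustrative.

```python
import numpy as np
from collections import defaultdict

def wknn_predict(X_train, y_train, x, k=10, eps=1e-12):
    """Weighted k-NN: closer neighbors contribute larger
    squared-inverse distance weights w_i = 1 / d_i^2 to the vote."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (d[i] ** 2 + eps)  # weight the vote
    return max(scores, key=scores.get)

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0]])
y_train = np.array([1, -1, -1])
print(wknn_predict(X_train, y_train, np.array([0.01, 0.0]), k=3))  # → 1
```

In the example the query point is overwhelmingly closest to the single class-1 instance, so the weighted vote picks class 1 even though class -1 holds the plain majority among the three neighbors.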

Medium Decision Tree
The Decision Tree is one of the graph-like algorithms which use branching methods to illustrate every possible outcome of a decision, where nodes represent features, links represent decision rules and leaves represent outcomes. The Iterative Dichotomiser 3 algorithm (ID3) calculates entropy and information gain to build the tree (Protić & Stanković, 2020, p.9). Entropy is the measure which controls how the tree decides to split the data. If the target feature can take on k different values, the entropy of S relative to this k-wise classification is given as follows:

Entropy(S) = -Σ_{i=1}^{k} p_i log2(p_i),  (4)

where p_i represents the proportion of S belonging to the class i. The information gain represents the expected reduction in entropy after the dataset is split on a feature (See Eq. 5).
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)  (5)

The feature with the highest information gain splits first. The Gain(S, A) of a feature A relative to a collection of examples S provides information about the target function value, given the value of the feature A that splits S into the subsets S_v (Protić & Stanković, 2020, p.10). The characteristics of the medium DT classifier are: fast prediction speed, low memory usage, easy interpretability and medium model flexibility, i.e. a medium number of leaves for finer distinctions between classes. The maximum number of splits is 20 (Protić & Stanković, 2018, p.48).
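The entropy and information-gain computations that drive ID3's splits can be sketched as follows; the tiny four-row dataset and feature name "f" are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Expected reduction in entropy after splitting S on a feature."""
    total = entropy(labels)
    n = len(labels)
    for value in set(r[feature] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[feature] == value]
        total -= (len(subset) / n) * entropy(subset)  # weighted subset entropy
    return total

rows = [{"f": "a"}, {"f": "a"}, {"f": "b"}, {"f": "b"}]
labels = [1, 1, -1, -1]
print(information_gain(rows, labels, "f"))  # → 1.0 (a perfect split)
```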

Feedforward Neural Network
The objective of the FNN is to minimize the output error in accordance with the back-propagation algorithm. The FNN transfer function used in the experiments is given with Eq. 6:

y_i = F_i(Σ_j W_{ij} f_j(Σ_{l=1}^{m} w_{jl} x_l + w_{j0}) + W_{i0}),  i = 1, ..., q,  (6)

where x_l are the inputs, y_i are the outputs, w and W are the weight matrices of the hidden and output layers, f_j and F_i denote the transfer functions of the hidden and output layers, m represents the number of inputs, q represents the number of outputs, and w_{j0} and W_{i0} denote the biases. The objective of the FNN presented in this paper is to minimize the output error in accordance with the Levenberg-Marquardt (LM) algorithm (Levenberg, 1944, pp.164-168), (Marquardt, 1963, pp.431-441). The LM algorithm performs a combined training process: around an area with complex curvature, the LM switches to the gradient descent (GD) algorithm until the local curvature is proper to make a quadratic approximation; then it approximately becomes the Gauss-Newton (GN) algorithm, which can speed up the convergence (Kwak et al, 2011, pp.327-340). The structure of the FNN presented in this paper is 9 inputs, 9 neurons in the hidden layer and one output. The transfer function in the hidden layer is the hyperbolic tangent while the output layer's transfer function is linear.
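The forward pass of such a 9-9-1 network can be sketched as below. This shows only the network structure of Eq. 6 with a tanh hidden layer and a linear output; the random weights are placeholders, and the LM optimization of those weights is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, h = 9, 9                       # 9 inputs, 9 hidden neurons, 1 output
w  = rng.normal(size=(h, m))      # hidden-layer weight matrix
w0 = np.zeros(h)                  # hidden-layer biases
W  = rng.normal(size=(1, h))      # output-layer weight matrix
W0 = np.zeros(1)                  # output-layer bias

def fnn_forward(x):
    """y = F(W f(w x + w0) + W0), with f = tanh and F = identity."""
    hidden = np.tanh(w @ x + w0)  # hidden layer, tanh activation
    return W @ hidden + W0        # linear output layer

x = np.tanh(rng.normal(size=m))   # a normalized instance in [-1, 1]
y = fnn_forward(x)
print(y.shape)  # → (1,)
```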

Experiments
A key criterion which differentiates classification techniques is the prediction accuracy, which represents the ratio of the number of correctly classified instances to the total number of instances (See Eq. 7).

ACC = (TP + TN) / (TP + TN + FP + FN)  (7)
where TP (true positive) represents the number of positive samples correctly predicted by the classifier, FN (false negative) represents the number of positive samples wrongly predicted as negative, FP (false positive) represents the number of negative samples wrongly predicted as positive and TN (true negative) represents the number of negative samples correctly predicted by the model (Ambedkar & Kishore Babu, 2015, pp.25-29). Additionally, the processing time (t), false positive rate (FPR) and false negative rate (FNR) are also measurement criteria (Nguyen & Armitage, 2008, p.56). The processing time (t) is the sum of the training and testing times. FPR represents the fraction of negative samples predicted as the positive class (See Eq. 8).

FPR = FP / (FP + TN)  (8)
FPR is a measure of the accuracy of a test. It is defined as the probability of falsely rejecting the null hypothesis, i.e. the probability that a false alarm will be raised: a positive result is given when the true value is negative (Split, 2020). Ideally, FPR should be low (0.1 or less). A low FPR indicates that the classifier does not classify many irrelevant examples as relevant (Shirabad et al, 2007, p.198).
FNR represents the fraction of positive samples predicted as the negative class:

FNR = FN / (FN + TP)  (9)

FNR is the probability that a true positive will be missed by the classifier.
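The three measures can be computed directly from the confusion-matrix counts; the counts below are illustrative, not the paper's results.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, false positive rate and false negative rate."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # fraction classified correctly
    fpr = fp / (fp + tn)                   # negatives predicted positive
    fnr = fn / (fn + tp)                   # positives predicted negative
    return acc, fpr, fnr

acc, fpr, fnr = metrics(tp=980, tn=990, fp=10, fn=20)
print(round(acc, 3), round(fpr, 3), round(fnr, 3))  # → 0.985 0.01 0.02
```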
The experiments presented here are conducted on three pre-processed daily records from the Kyoto 2006+ dataset (See Table 3) and performed on an Intel(R) Core(TM) i7-2620M CPU 2.7 GHz processor with 16 GB of installed RAM. All classifiers are trained so that 70% of the daily records are used for training and 30% are used for testing. The results on accuracy, FPR, FNR, and processing time are given in Figures 1-4, respectively. As can be seen from Figure 1, the wk-NN has the highest accuracy of all the models (up to 99.5%). However, the accuracies of both the k-NN and the DT are also high (99.4%). The FNN is less accurate than the other models, but its accuracy is still very high (99.2%).
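The 70/30 partition of a daily record can be sketched as below; the shuffling with a fixed seed and the toy array sizes are illustrative assumptions, not details stated in the paper.

```python
import numpy as np

def train_test_split_70_30(X, y, seed=0):
    """Shuffle the daily records and split them 70% train / 30% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order of the records
    cut = int(0.7 * len(X))                # 70% boundary
    train, test = idx[:cut], idx[cut:]
    return X[train], y[train], X[test], y[test]

X = np.arange(20).reshape(10, 2)           # 10 toy records, 2 features
y = np.ones(10)
X_tr, y_tr, X_te, y_te = train_test_split_70_30(X, y)
print(len(X_tr), len(X_te))  # → 7 3
```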
The low FPR (see Figure 2) indicates that the models classify a small number of irrelevant examples as relevant, so the probability that a false alarm will be raised is very low. Although all models show a low probability of a false alarm, the k-NN model classifies the highest number of irrelevant examples as relevant.
The FNR has the highest value for the DT model trained on the 14.02.2007 dataset. The lowest FNR is achieved by the FNN trained on the 27.02.2007 dataset. However, the FNRs of all classifiers are lower than 0.8%, and lower than 0.2% if the DT is not considered.
As expected, the processing time is high for the lazy learners (the k-NN and the wk-NN) and exceeds 50 s. The FNN and the DT show significantly (more than 10 times) shorter processing times.

Conclusion
The three-step pre-processing algorithm for feature selection and instance normalization improved the performances of four binary classifiers and decreased the processing time.
The algorithm reduced three training sets derived from the Kyoto 2006+ dataset. Accuracy, false positive rate, false negative rate, and processing time are given as measures of the performances of the classifiers.
The results show that the wk-NN model has the highest accuracy. The low false positive rates indicate that the models classify a small number of irrelevant examples as relevant. The FNR is significantly higher for the DT model than for the FNN, k-NN, and wk-NN models.
The processing time of the lazy learners is significantly higher than the processing time of eager learners.