A Two-Layer Dimension Reduction and Two-Tier Classification Model for Anomaly-Based Intrusion Detection in IoT Backbone Networks

Abstract—With increasing reliance on Internet of Things (IoT) devices and services, the capability to detect intrusions and malicious activities within IoT networks is critical for the resilience of the network infrastructure. In this paper, we present a novel model for intrusion detection based on a two-layer dimension reduction and two-tier classification module, designed to detect malicious activities such as User to Root (U2R) and Remote to Local (R2L) attacks. The proposed model uses principal component analysis and linear discriminant analysis in its dimension reduction module to transform the high-dimensional dataset into a lower-dimensional one with fewer features. We then apply a two-tier classification module utilizing Naïve Bayes and a Certainty-Factor version of K-Nearest Neighbor to identify suspicious behaviors. Experimental results using the NSL-KDD dataset show that our model outperforms previous models designed to detect U2R and R2L attacks.

I. INTRODUCTION
Internet of Things (IoT) technologies are becoming increasingly prevalent across different industry sectors such as health care, personal and social domains, and smart cities [1]. Similar to most consumer technologies, IoT technologies are not designed with security in mind, which is now emerging as a key barrier to the wider adoption of IoT networks and services [2]. Intrusion detection is one of several security mechanisms for managing security intrusions [3], which can be detected in any of the four layers of the IoT architecture shown in Figure 1 [4]. The Network layer not only serves as a backbone for connecting different IoT devices, but also provides opportunities for deploying network-based security defense mechanisms such as Network Intrusion Detection Systems (NIDS) [5], [6], [7]. According to the analysis of KDD99 [8] and its later version NSL-KDD [9], malicious behaviors (attacks) in network-based intrusions can be classified into the following four main categories [7]:
• Probe: an attacker seeks only to gain information about the target network through network and host scanning activities (e.g. port scanning).
• DoS (Denial of Service): an attacker interrupts legitimate users' access to a given service or machine.
• U2R (User to Root): an attacker attempts to escalate a limited user's privileges to super-user or root access (e.g. via malware infection or stolen credentials).
• R2L (Remote to Local): an attacker gains remote access to a victim machine by imitating existing local users.

User to Root (U2R) and Remote to Local (R2L) attacks are among the most challenging attacks to detect, as they mimic normal user behavior [10], [11].
IDSes are categorized into signature-based and anomaly-based detection, according to the technique used to detect an intrusion [12]. Signature-based IDSes rely on a set of pre-defined malicious activity patterns and attack signatures to detect intrusions, while anomaly-based IDSes rely on deviations from normal behavior [6]. Signature-based IDSes generally outperform anomaly-based IDSes in detecting previously known attacks, but they are ineffective against unknown or polymorphic attacks [13]. Anomaly-based IDSes, on the other hand, are capable of detecting unknown attacks in the absence of a pre-defined pattern. Due to the diversity of devices deployed in IoT networks, it would be unrealistic and impractical to rely on pre-defined attack patterns for intrusion detection, which limits signature-based IDS utilization in IoT networks [14].
In this paper, we present a network anomaly-based model for intrusion detection, hereafter referred to as the Two-layer Dimension reduction and Two-tier Classification (TDTC) model. The proposed model, designed for anomaly-based intrusion detection in IoT backbone networks, uses two-layer dimension reduction and two-tier classification to detect "hard-to-detect" intrusions, such as U2R and R2L attacks. We also demonstrate that the proposed model has the following characteristics:
• higher overall detection rates, due to the deployment of a multi-layer classifier;
• a lower false positive rate, due to the deployment of a refinement feature;
• accurate detection of U2R and R2L attacks, without reducing overall performance; and
• lower computational complexity, due to the deployment of dimension reduction in two layers.

In the next section, we present related work. The proposed model is presented in Section 3, and its evaluation in Section 4. Section 5 concludes this paper and outlines future research topics.

II. RELATED WORK
Existing intrusion detection and prevention models generally use statistical approaches [15] such as the Hidden Markov Model (HMM) [15], Bayes theory [16], cluster analysis [17], signal processing [18] and distance measures [19] to detect anomalous activities. Anomaly detection approaches can be broadly categorized into supervised and unsupervised learning [6]. In supervised anomaly detection, the normal behavior of a system or network is constructed using a labeled dataset [20]. Unsupervised techniques assume that normal behaviors are more frequent and build the model on this assumption; thus, no training data is required [21]. Casas et al. [22] proposed an unsupervised NIDS based on subspace clustering and outlier detection, and demonstrated that their approach performs well against unknown attacks. In [23], a feature selection filter module is proposed, which utilizes Principal Component Analysis and Fisher Dimension Reduction to filter noise. In that approach, a Self-Organizing Map (SOM) neural model is also used to filter out normal activities; however, the approach has a high false positive rate. Bostani and Sheikhan [24] proposed an unsupervised framework based on the Optimum-Path Forest algorithm and the K-Means clustering technique, which models both malicious and normal network behavior.
The supervised anomaly detection approach in [25] leverages both distance measures and cluster density for intrusion detection. Zhuang et al. [26] proposed a model based on the random forest algorithm to discover anomaly patterns with high accuracy and a low false negative rate.
Guo et al. [27] proposed a two-level intrusion detection approach, which first detects misuse and then uses the KNN algorithm to reduce false alarms. Toosi et al. [28] proposed a multi-attack classifier model, which combines a fuzzy neural network, a fuzzy inference approach, and genetic algorithms for intrusion detection. Despite a high accuracy rate in identifying normal behaviors and detecting simpler attacks such as DoS and Probe, the model performs poorly in detecting low-frequency, distributed attacks such as R2L. Horng et al. [29] proposed a multi-classification attack model consisting of support vector machines (SVM) and the BIRCH hierarchical clustering technique to extract significant attributes from the KDD99 dataset. Their model has a high detection rate for DoS and Probe attacks, but is ineffective against U2R and R2L attacks.
Tan et al. [30] proposed a system for DoS detection using multivariate correlation analysis (MCA) to improve the accuracy of network traffic characterization. In [31], a two-layer classification module was used to detect U2R and R2L attacks with low computational complexity, due to its optimized feature reduction. Osanaiye et al. [13] proposed an ensemble-based multi-filter feature selection method to detect distributed DoS attacks in cloud environments, using four filter methods to achieve an optimum selection over the NSL-KDD dataset. Iqbal et al. [32] presented an attack taxonomy for cloud services and suggested a cloud-based intrusion detection system.
Ambusaidi et al. [33] proposed a mutual-information-based IDS that selects optimal features for classification using a feature selection algorithm. Their approach was evaluated on three benchmark datasets (KDD Cup 99, NSL-KDD and Kyoto 2006+).
Intrusion detection systems have also been used to manage security risks in industrial control systems [14]. For example, Pan et al. [34] proposed a systematic and automated approach to build a hybrid IDS that learns temporal state-based specifications for electric power systems, to accurately differentiate between disturbances, normal control operations, and cyber-attacks. Zhou et al. [35] presented a multi-model-driven anomaly IDS for industrial systems, based on a Hidden Markov Model, to distinguish attacks from actual faults.
Security issues can be a barrier to the widespread adoption of IoT devices [36]. Whitmore et al. [37] showed that a wide range of techniques can mitigate cyber threats targeting IoT systems. Ning et al. [38] proposed a hierarchical authentication architecture to provide anonymous data transmission in IoT networks. Cao et al. [39] highlighted the impact and importance of ghost attacks on ZigBee-based IoT devices. Chen et al. [40] proposed an autonomic, model-driven cyber security management approach for IoT systems, which can be used to estimate, detect, and respond to cyber-attacks with little or no human intervention. Teixeira et al. [41] proposed a scheme for thwarting insider attacks in IoT networks by cross-checking the data transformation of every IoT node.

III. PROPOSED TDTC MODEL
The proposed model comprises a dimension reduction module (Sections III.A and III.B) and a classification module (Section III.C).

A. Dimension Reduction Module
The dimension reduction module is deployed to address limitations due to dimensionality, which may lead to wrong decisions while increasing the computational complexity of the classifier. We deployed both Linear Discriminant Analysis (LDA), a supervised dimension reduction technique, and Principal Component Analysis (PCA), an unsupervised dimension reduction technique, to address the high-dimensionality issue. PCA can be used to perform both feature selection and feature extraction [42]:
a) Feature selection: choose a subset of all features based on their effectiveness in classification (i.e. choosing the more informative features).
b) Feature extraction: create a set of new features by combining existing features.

In TDTC, we use PCA as a feature extraction mechanism to map the NSL-KDD dataset, which consists of 41 features, to a lower-dimensional feature space by removing less significant features. Feature extraction is commonly limited to linear transforms, y = Wx, as shown in Figure 2.
Let x be an N-dimensional random vector in the original dataset of n samples, and let the new feature space consist of M dimensions (M < N), where M is the number of features in the transformed dataset. For the transformation, we compute Eq. 1 to Eq. 3. The covariance matrix is

Σ = (1/n) Σ_{i=1}^{n} (x_i − m)(x_i − m)^T, (Eq. 1)

where the mean vector m is

m = (1/n) Σ_{i=1}^{n} x_i. (Eq. 2)

The eigenvector-eigenvalue decomposition is

Σv = λv, (Eq. 3)

where v is an eigenvector and λ its eigenvalue. PCA then sorts the eigenvectors in descending order of eigenvalue. In other words, eigenvectors with lower eigenvalues carry the least information about the distribution of the data, and these are the eigenvectors we wish to drop. A common approach is to rank the eigenvectors from the highest to the lowest eigenvalue and choose the top M eigenvectors. Similarly, in TDTC, one may decide which eigenvalues are more useful; the resulting feature mapping matrix W can then be used for the linear transformation of the training and test datasets.
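The transformation in Eq. 1 to Eq. 3 can be sketched in a few lines of NumPy (an illustrative reimplementation, not the paper's MATLAB code; the function name and the fixed target dimension M are ours):

```python
import numpy as np

def pca_transform(X, M):
    """Project samples onto the top-M principal components.

    X: (n_samples, N) data matrix; M: target dimensionality (M < N).
    Returns (W, Y), where W is the (N, M) projection matrix and
    Y = (X - mean) @ W is the transformed dataset.
    """
    m = X.mean(axis=0)                       # Eq. 2: mean vector
    Xc = X - m
    cov = (Xc.T @ Xc) / X.shape[0]           # Eq. 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # Eq. 3: eigen-decomposition
    order = np.argsort(eigvals)[::-1]        # descending eigenvalue order
    W = eigvecs[:, order[:M]]                # keep the top-M eigenvectors
    return W, Xc @ W
```

The same matrix W obtained on the training set would then be applied, unchanged, to the test set.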
At this layer of dimension reduction, the Imbedded Error Function (IEF) factor-analysis measure [43] is used to select the principal components [44], as shown in Eq. 4:

IEF(l) = sqrt( l · Σ_{j=l+1}^{m} λ_j / (N · m · (m − l)) ), (Eq. 4)

where l and m denote the number of Principal Components (PCs) used to represent the data and the total number of dimensions, respectively, and N and λ denote the number of samples and the eigenvalues, respectively.
Cross Validation (CV) is used to evaluate the optimum principal components with minimum error, as shown in Figure 3. Applying this selection criterion reduces some features and helps the next layer of the dimension reduction module to compute a lower-dimensional matrix with more separable objects. As observed in Figure 5, the Cumulative Percent Variance (CPV) measure with a 95% threshold is also examined to justify the selection of the optimum dimensions.
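As an illustration of the CPV criterion (a NumPy sketch, not the paper's MATLAB implementation; the helper name is ours), the number of retained components is the smallest l whose cumulative percent variance reaches the threshold:

```python
import numpy as np

def n_components_cpv(eigenvalues, threshold=0.95):
    """Return the smallest number of principal components whose cumulative
    percent variance (CPV) reaches the given threshold (default 95%)."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cpv = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cpv, threshold) + 1)
```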

B. Linear Discriminant Analysis
Linear computations can be used to achieve reasonable speed in intrusion detection systems [31].
Since objects (samples) in the PCA-transformed dataset are not ideal for classification, the proposed model utilizes a second feature reduction module that applies the labeled data in an optimal transformation to new dimensions. LDA examines the class labels to reduce the dimension of large working datasets, and is widely used in domains such as image processing and stock analysis [45]. LDA chooses an optimal projection matrix to map a higher-dimensional feature space to a new lower-dimensional space while preserving the information required for data classification [46]. After the transformation using LDA, the newly mapped features have only four dimensions {lda1, ..., lda4}. Figure 4 shows two dimensions of the original dataset newly mapped by LDA. In other words, the dataset has been converted into C − 1 dimensions, where C is the number of class labels in the original dataset.
There are two scatter matrices to be obtained in LDA: S_B, the between-class scatter matrix, and S_W, the within-class scatter matrix. In TDTC, the LDA dimension reduction module transforms the NSL-KDD dataset to a lower dimension. Assume a set of n d-dimensional vectors x_1, ..., x_n belonging to k different class labels C_i, where each class i = 1, 2, ..., k has n_i samples (in TDTC, k = 5: normal, DoS, Probe, U2R, R2L). The projection matrix W is calculated to maximize S_B (Eq. 6) and minimize S_W (Eq. 7):

S_B = Σ_{i=1}^{k} n_i (μ_i − μ)(μ_i − μ)^T, (Eq. 6)

S_W = Σ_{i=1}^{k} Σ_{x ∈ C_i} (x − μ_i)(x − μ_i)^T, (Eq. 7)

where μ is the mean of all samples.
Here μ_i is the mean of the samples in class C_i, given by Eq. 8:

μ_i = (1/n_i) Σ_{x ∈ C_i} x. (Eq. 8)
Since the ratio J in Eq. 9 relates S_B and S_W, it can be maximized as an optimization problem over the projection matrix W_r:

J(W_r) = |W_r^T S_B W_r| / |W_r^T S_W W_r|. (Eq. 9)

All these operations are conducted on the training dataset (see Section IV) to obtain an ideal transformation matrix that can then be applied to future test sets or unknown instances.
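The LDA training step can be sketched in NumPy (an illustrative reimplementation of the formulation in Eq. 6 to Eq. 9, not the authors' MATLAB code; the function name and the pseudo-inverse fallback are ours):

```python
import numpy as np

def lda_projection(X, y, r):
    """Fisher LDA: projection maximizing |W^T S_B W| / |W^T S_W W| (Eq. 9).

    X: (n, d) samples; y: integer class labels; r: output dims (r <= C - 1).
    Returns the (d, r) projection matrix.
    """
    d = X.shape[1]
    mu = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)                  # Eq. 8: class mean
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * diff @ diff.T          # Eq. 6: between-class scatter
        S_W += (Xc - mu_c).T @ (Xc - mu_c)      # Eq. 7: within-class scatter
    # Maximizing Eq. 9 reduces to the eigenproblem S_W^{-1} S_B w = lambda w
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:r]]
```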

C. Classification Module
At this stage, TDTC has already been trained on the transformed dataset and classifies incoming traffic using a multi-layer classifier (introduced in [31]) to detect anomalies. This classifier was chosen for its capability in detecting abnormal behaviors, due to its use of:
• two embedded classifiers for assigning exact class labels;
• simple classification techniques, namely Naïve Bayes [47] and K-Nearest Neighbor (KNN);
• a good similarity measure for rare instances, to handle imbalanced datasets; and
• a bucketing technique to speed up classification tasks.

Figure 6 illustrates how the classification modules are applied to incoming labeled instances. The Naïve Bayes classifier is used to classify anomalous behavior, which is then refined to normal instances using the Certainty-Factor version of K-Nearest Neighbor (CF-KNN). Naïve Bayes is an efficient classification method since it presumes independence of all features of each sample in the given class label (the conditional independence assumption).
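The conditional independence assumption above can be illustrated with a minimal Gaussian Naïve Bayes classifier (a sketch; the paper does not specify which Naïve Bayes variant is used, and the class name is ours):

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes: each feature is modelled independently
    per class, which is exactly the conditional independence assumption."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = {c: X[y == c].mean(axis=0) for c in self.classes}
        self.var = {c: X[y == c].var(axis=0) + 1e-9 for c in self.classes}
        self.logprior = {c: np.log(np.mean(y == c)) for c in self.classes}
        return self

    def predict(self, X):
        scores = []
        for c in self.classes:
            # Log-likelihood of each sample under the per-feature Gaussians
            ll = -0.5 * np.sum(np.log(2 * np.pi * self.var[c])
                               + (X - self.mu[c]) ** 2 / self.var[c], axis=1)
            scores.append(ll + self.logprior[c])
        return self.classes[np.argmax(np.stack(scores), axis=0)]
```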
The transformed features are assessed using the correlation coefficient. This measure [48] expresses the relation between variables (features) as a number in the [−1, 1] interval, where 1 indicates a positive linear correlation, 0 no linear correlation, and −1 a negative linear correlation. The correlation coefficient assessment of the final features shows that the transformed features at the two layers of the dimension reduction module are mostly independent, since ρ ≈ 0. This indicates that there is no strict dependency among the classifier input features (see Tables 1 and 2); the dependency among the features also decreases significantly in comparison to the findings reported in [31]. The certainty-factor similarity measure in the classification module is based on the distribution proportion of classes in the training dataset, to resolve the imbalanced dataset issue. The Certainty Factor (CF) is a number in the [−1, 1] interval that specifies the amount of certainty for a given incoming sample [49].
The CF measure is incorporated into the KNN [50] classification module as follows. Let N(S, k) be the k closest neighbors of sample S, P(C = c_i | D) be the ratio of class c_i in the training set D, and P(C = c_i | N(S, k)) be the ratio of c_i in the query result. The CF measure can then be computed using Eq. 10 and Eq. 11:

CF(C = c_i, N(S, k)) = (P(C = c_i | N(S, k)) − P(C = c_i | D)) / (1 − P(C = c_i | D)), if P(C = c_i | N(S, k)) ≥ P(C = c_i | D), (Eq. 10)

CF(C = c_i, N(S, k)) = (P(C = c_i | N(S, k)) − P(C = c_i | D)) / P(C = c_i | D), otherwise. (Eq. 11)

The values of CF(C = c_i, N(S, k)) are in the range [−1, 1]. The CF strategy for KNN classification is defined as:

S_CF = argmax_i {CF(C = c_i, N(S, k))}. (Eq. 12)

At this tier, the KNN classifier uses a bucketing technique called the k-d tree [51] to accelerate the nearest-neighbor search.
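A minimal sketch of CF-KNN under this formulation (illustrative Python, not the authors' implementation; it uses a brute-force neighbor search rather than the k-d tree for brevity):

```python
import numpy as np

def cf_knn_predict(X_train, y_train, query, k=3):
    """CF-KNN sketch: choose the class whose proportion among the k nearest
    neighbours rises the most relative to its prior in the training set D.
    Assumes at least two classes in D (so no prior is exactly 0 or 1)."""
    dists = np.linalg.norm(X_train - query, axis=1)
    neighbours = y_train[np.argsort(dists)[:k]]      # N(S, k)
    best_class, best_cf = None, -np.inf
    for c in np.unique(y_train):
        prior = np.mean(y_train == c)                # P(C = c | D)
        local = np.mean(neighbours == c)             # P(C = c | N(S, k))
        if local >= prior:                           # Eq. 10
            cf = (local - prior) / (1.0 - prior)
        else:                                        # Eq. 11
            cf = (local - prior) / prior
        if cf > best_cf:                             # Eq. 12: argmax over classes
            best_class, best_cf = c, cf
    return best_class
```

Because CF compares the local class ratio against the class prior, a rare class over-represented among the neighbours can win even on an imbalanced training set.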

IV. EVALUATION

A. NSL-KDD
In the NSL-KDD dataset, the flaws reported in the original KDD99 dataset [52] were removed. Although there are still known issues in the NSL-KDD dataset [53], these do not affect the application of the dataset in this research or the validity of the findings. Each NSL-KDD record consists of a network connection with 41 defined attributes (e.g. protocol type, service and flag), labeled as normal or as one of 24 attack types (grouped into Probe, DoS, U2R and R2L).

B. Data transformation
Before the dataset is used, each feature vector is normalized to a positive integer in the range [1, 100], in order to improve the performance of the classifier and the dimension reduction module. Each nominal feature value is assigned a unique integer (e.g. TCP = 1, UDP = 2, ICMP = 3). The resulting value of each feature is mapped to an integer, to avoid any bias, as shown in Eq. 13: each continuous-valued feature is normalized using the logarithm to base 2 and then cast to an integer value.
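A sketch of this preprocessing step, assuming the integer cast in Eq. 13 is a floor and using log2(v + 1) so that zero-valued features stay defined (both are our assumptions; the exact form of Eq. 13 is not reproduced above):

```python
import math

# Nominal-value map as specified in the text (TCP = 1, UDP = 2, ICMP = 3).
PROTOCOL_MAP = {"tcp": 1, "udp": 2, "icmp": 3}

def normalize_continuous(v):
    """Log-2 scale a continuous feature value, then cast to an integer.
    The +1 offset (an assumption) keeps v = 0 well-defined."""
    return int(math.log2(v + 1))

def encode_nominal(value):
    """Map a nominal feature value to its unique integer code."""
    return PROTOCOL_MAP[value.lower()]
```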

V. FINDINGS
The experiments were conducted using MATLAB R2015a running on a personal computer (PC) with an AMD Phenom II X6 3.8 GHz processor and 12 GB of RAM. TDTC was trained with both training sets and then evaluated using the test set (Test+). TDTC's classification module is adopted from [31], with the same parameter settings; thus, k = 3 was used for the CF-KNN classifier.
Figure 7 shows the test dataset mapped into the new feature space after applying the dimension reduction module. TDTC uses only two features of the newly mapped data (instead of all four features lda1 to lda4), based on detection rates (see Figure 8). In addition, TDTC has improved performance in detecting U2R and R2L attacks, as shown in Table 5, as well as a higher detection rate against Probe attacks. The detection rate for DoS attacks in TDTC is also better than the two-tier models proposed in [16] and [55]. The false alarm rate is reduced to 5.56%, from the 6.3% reported in [55].

A. Computation complexity
In TDTC, the complexity overhead was reduced, since only 35 of the 41 dataset features were used. TDTC's two-layer dimension reduction is an offline task, applied once to obtain the transform vectors for incoming samples. The first dimension reduction module is completely unsupervised, while the generated class labels are added to the training dataset for a second transformation based on the (supervised) LDA technique. The two-tier classification module (defined in [31]) embedded in TDTC further reduces the computational complexity.
The computational complexity of the Naïve Bayes classifier in the classification module is O(e × f), where e is the number of samples in the dataset and f the number of features. Due to the optimal LDA transformation, the first classifier of TDTC operates on only four features instead of 35; thus, the computational overhead decreases by approximately ten times. In the second classification tier, where the KNN classifier is implemented, TDTC maintains only the two attributes of the training dataset with the highest detection rate, as shown in Figure 8; therefore, KNN consumes less memory than with the original dataset. In addition, the KNN classifier is equipped with a k-d tree [51] for searching the nearest samples. A k-d tree is a data structure that organizes data samples by their distances, helping KNN to search faster than the traditional exhaustive approach.
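The k-d tree idea can be sketched as follows (an illustrative Python implementation, not the structure from [51] verbatim): the tree splits on coordinate axes cyclically, and the search prunes any subtree on the far side of a splitting plane that cannot contain a closer point, which is what yields the O(log n) average lookup time:

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Build a minimal k-d tree: split on axes cyclically at the median."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Nearest-neighbour search: descend the near side first, then visit the
    far side only if the splitting plane is closer than the current best."""
    if node is None:
        return best
    if best is None or (np.linalg.norm(node["point"] - query)
                        < np.linalg.norm(best - query)):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if abs(diff) < np.linalg.norm(best - query):   # far side may hide a closer point
        best = nearest(far, query, best)
    return best
```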
With this second-tier classifier, searching for the nearest samples takes O(log n) time on average.

B. Real-world Applications
Since TDTC offers higher performance with relatively low resource requirements, it can be deployed to detect intrusion attempts in IoT backbone networks and their infrastructure services. TDTC can also be deployed as an auxiliary service for digital forensics in the IoT ecosystem, such as those discussed in [56], to detect residual attack patterns in the IoT network layer.
Given the increase in low-frequency, low-profile IoT-based attacks [39], TDTC's capability to detect U2R and R2L attacks is useful in incident detection and handling.

VI. CONCLUDING REMARKS
With the widespread adoption of IoT devices and services in our data-centric and Internet-connected societies, securing the IoT infrastructure is essential to a secure and stable society. A successful attack on the IoT infrastructure can have crippling effects; for example, the compromise of IoT services in smart cities could easily lead to major chaos or even life-threatening situations (see [58], [59], [60]).
In this paper, a model with two-layer dimension reduction and two-tier classification was proposed. The model is designed to detect intrusive activities in IoT backbone networks, particularly low-frequency attacks (e.g. U2R and R2L) that could have potentially damaging consequences. Our proposed model outperformed existing similar models in terms of detection rate for both low-frequency and common attacks. Since TDTC uses both unsupervised (PCA) and supervised (LDA) feature extraction, the classification algorithms were able to accurately distinguish between different attack types and normal behavior.
Future research includes exploring the potential of non-parametric methods, such as non-parametric dimension reduction and fuzzy clustering, to achieve better classification of U2R, R2L and other attacks. Another interesting direction is extending the proposed model to detect intrusions at other layers of the IoT architecture, such as the application and support layers, as well as against other protocols running in the network layer.

Fig 2 .
Fig 2. In PCA, a linear transformation is used to reduce a high-dimensional dataset to a low-dimensional one.

Fig 5 .
Fig 5. Imbedded Error Function measure of the NSL-KDD training set, used to select the optimum number of dimensions with minimum error and information loss.

Fig 3 .
Fig 3. Imbedded Error Function measure of the NSL-KDD training set, used to select the optimum number of dimensions with minimum error and information loss.

Fig 4 .
Fig 4. Two dimensions of the newly mapped dataset processed by the dimension reduction module.

C. Performance indicators
The four common performance indicators for intrusion detection systems are as follows [54]:
• True Positive (TP): malicious behavior is correctly detected as malicious;
• True Negative (TN): benign behavior is correctly classified as benign;
• False Positive (FP): benign behavior is wrongly detected as malicious; and
• False Negative (FN): malicious behavior is wrongly classified as benign.
The Detection Rate (DR) measures how often the classifier correctly detects malicious samples out of all malicious objects, and is computed as DR = TP / (TP + FN). The False Alarm Rate (FAR) measures how often the classifier wrongly flags benign samples as malicious out of all benign objects, and is computed as FAR = FP / (FP + TN).
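The two rates can be computed directly from the confusion-matrix counts (a trivial sketch; the function names are ours):

```python
def detection_rate(tp, fn):
    """DR: fraction of malicious samples correctly flagged as malicious."""
    return tp / (tp + fn)

def false_alarm_rate(fp, tn):
    """FAR: fraction of benign samples wrongly flagged as malicious."""
    return fp / (fp + tn)
```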

FIG 7 .
FIG 7. TWO-DIMENSIONS OF TRANSFORMED TEST SET WITH OBTAINED PROJECTION MATRIX

Hamed HaddadPajouh, Reza Javidan, Raouf Khayami, Ali Dehghantanha and Kim-Kwang Raymond Choo

Fig 1. IoT Network Security Architecture [4]

Table 1 .
Transformed feature dependency of the Train+ dataset after applying two levels of reduction, according to the correlation coefficient measure.

Table 2 .
Transformed feature dependency of the Train_20% dataset after applying two levels of reduction, according to the correlation coefficient measure.

TABLE 3.
NSL-KDD DATA SET CLASSES DISTRIBUTION

NSL-KDD has two training sets and one test set with different distributions (see Table 3). Since the test set contains 17 attack types not included in the training set, we can evaluate the effectiveness of TDTC in detecting unknown or uncommon attacks.

Table 4
NSL-KDD dataset attack label taxonomy and the presence of each label in the train and test sets, respectively.

TABLE 5 .
A COMPARATIVE SUMMARY