Article

Combining Entropy Measures for Anomaly Detection

by Alberto Muñoz 1, Nicolás Hernández 1, Javier M. Moguerza 2 and Gabriel Martos 3
1 Department of Statistics, Universidad Carlos III de Madrid, 28903 Getafe, Madrid, Spain
2 Department of Computer Science and Statistics, University Rey Juan Carlos, 28933 Móstoles, Madrid, Spain
3 Department of Mathematics and Statistics, Universidad Torcuato Di Tella and CONICET, Buenos Aires C1428BCW, Argentina
* Authors to whom correspondence should be addressed.
Entropy 2018, 20(9), 698; https://doi.org/10.3390/e20090698
Submission received: 31 July 2018 / Revised: 7 September 2018 / Accepted: 10 September 2018 / Published: 12 September 2018
(This article belongs to the Special Issue Entropy: From Physics to Information Sciences and Geometry)

Abstract: The combination of different sources of information is a problem that arises in several situations, for instance, when data are analysed using different similarity measures. Often, each source of information is given as a similarity, distance, or kernel matrix. In this paper, we propose a new class of methods which consists of producing, for anomaly detection purposes, a single Mercer kernel (that acts as a similarity measure) from a set of local entropy kernels, while at the same time avoiding the task of model selection. This kernel is used to build an embedding of the data in a variety that allows the use of a (modified) one-class Support Vector Machine to detect outliers. We study several information combination schemes and their limiting behaviour as the data sample size increases, within an Information Geometry context. In particular, we study the variety containing the given positive definite kernel matrices so that the desired kernel combination belongs to that variety. The proposed methodology has been evaluated on several real and artificial problems.

1. Introduction

Usual Data Mining tasks, such as classification, regression and anomaly detection, are heavily dependent on the geometry of the underlying data space. Kernel Methods, such as Support Vector Machines (SVM), provide control over the data space geometry through the use of a Mercer kernel function [1,2]. Such functions, defined in the next section, induce embeddings of the data in feature spaces where Mercer kernels act as inner products. The choice of an appropriate kernel, including its parameters, is a particular case of the model selection problem.
For instance, when working with SVM, a delicate parameterization is needed; otherwise, solutions might be suboptimal. In other words, the choice of a suitable kernel function and its parameters affects both the geometry of the data embedding and the success of the algorithms [3,4]. A typical way to proceed is by means of cross-validation procedures [5]. However, these parameter calibration strategies, although intuitive and simple from an applied point of view, have some important drawbacks. In particular, their computational burden becomes a practical concern when cross-validation has to calibrate a medium-to-large number of parameters. An appealing alternative to model selection when working with SVM is to combine or merge different kernel functions into a single kernel [6,7].
Functional data [8] have the particularity of being intrinsically infinite dimensional. This peculiarity implies that classical procedures for multivariate data must be adapted or redesigned to cope with functional data. The statistical distribution of the data is a basic ingredient in outlier detection problems. Entropies are natural functions to use in anomaly detection, given that any definition of entropy should produce large values for scattered distributions and small values for concentrated ones. In addition, statistical distributions are a particular case of functional data, and in this way entropy comes into play in this context.
In this paper, we present an alternative proposal to solving anomaly detection problems that avoids the selection of kernel hyperparameters. A novelty of this work is that the methodology is developed to deal with functional data. We will explore several kernel combination techniques, including some methods from Information Geometry that respect the geometry of the manifold that contains the Gram matrices associated with the Mercer kernels involved.
The paper is organized as follows: Section 2 describes the functional data analysis methods used to produce the data representations from kernels, as well as the minimum entropy method used in this paper for anomaly detection. Section 3 develops several methods to obtain kernel combinations for the task of outlier detection. Section 4 illustrates the theory with simulations and examples; and Section 5 concludes the work.

2. Reproducing Kernel Hilbert Spaces for Multivariate and Functional Data

Let $X$ be the "space" where the data live (a compact metric space). A Mercer kernel is a symmetric, continuous function $K: X \times X \to \mathbb{R}$ such that, for every finite set $S = \{x_1, \ldots, x_n\} \subseteq X$, the matrix with entries $K(x_i, x_j)$, $i, j \in \{1, \ldots, n\}$, is positive semidefinite. We will often use the term "kernel function" when referring to a Mercer kernel. Kernel functions admit expansions of the type $K(x, z) = \phi(x)^T \phi(z)$ for some $\phi: X \to \mathbb{R}^d$, where $d$ is usually large. In particular, $\phi(X)$ is some manifold embedded in $\mathbb{R}^d$ [9]. For $x \in X$, denote by $K_x$ the function $K_x: X \to \mathbb{R}$ given by $K_x(z) = K(x, z)$. There exists a unique Hilbert space $\mathcal{H}_K$ of functions on $X$, made up of the span of the set $\{K_x \mid x \in X\}$, such that for all $f \in \mathcal{H}_K$ and $x \in X$, $f(x) = \langle K_x, f \rangle_{\mathcal{H}_K}$. The Hilbert space $\mathcal{H}_K$ is said to be a Reproducing Kernel Hilbert Space (RKHS) [10]. Next, we describe the use of RKHS for data analysis, distinguishing between the multivariate and functional cases.
In the multivariate case, we consider data sets $S = \{x_1, \ldots, x_n\} \subseteq X$, where $X$ is a compact subset of $\mathbb{R}^D$. Consider the RKHS $\mathcal{H}_K$ and the linear integral operator $L_K$ defined by $L_K(f) = \int_X K(\cdot, s)\, f(s)\, ds$. Since $X$ is compact and $K$ continuous, $L_K$ has a countable sequence of eigenvalues $\{\lambda_j\}$ and eigenfunctions $\{\phi_j\}$, and $K$ can be expressed as $K(x, y) = \sum_j \lambda_j \phi_j(x) \phi_j(y)$, where the convergence is absolute and uniform (Mercer's theorem).
Consider the Gram matrix $K_S = \big(K(x_i, x_j)\big)_{i, j \in \{1, \ldots, n\}}$. This matrix is real, symmetric and positive semidefinite (by definition of $K$), and $K_S(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, where $\phi(x_i) = (\sqrt{\lambda_j}\, \phi_j(x_i))_j$ is the image of $x_i$ under the mapping $\phi: X \to \mathbb{R}^d$. Straightforwardly, $K$ acts as the standard scalar product in $\mathbb{R}^d$. Thus, the use of $K$ induces both a data transformation and a metric on the original data given by:
$$
d_K^2(x_i, x_j) = \|\phi(x_i) - \phi(x_j)\|^2 = \phi(x_i)^T \phi(x_i) + \phi(x_j)^T \phi(x_j) - 2\,\phi(x_i)^T \phi(x_j) = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j). \qquad (1)
$$
Equation (1) shows that the choice of the kernel $K$ determines the geometry of the data set after the transformation $X \mapsto \phi(X)$.
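As a concrete illustration, the following minimal sketch (not taken from the paper's code) computes the kernel-induced distance of Equation (1) directly from a Gram matrix; the Gaussian kernel and the value $\sigma = 1$ in the example are illustrative choices only.

```python
import numpy as np

def kernel_distance_matrix(K):
    """Pairwise distances d_K(x_i, x_j) induced by a Mercer kernel.

    K is an (n, n) Gram matrix with K[i, j] = K(x_i, x_j);
    d_K^2(x_i, x_j) = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j).
    """
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives due to rounding

# Example with a Gaussian kernel on random bivariate data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
K = np.exp(-1.0 * sq)                                  # sigma = 1 (illustrative)
D = kernel_distance_matrix(K)                          # geometry of the embedded data
```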
Now, we consider the functional data case, that is, the case where data are functions or, more generally, infinite-dimensional objects (such as images, for instance). Let $(\Omega, \mathcal{F}, P)$ be a probability space, where $\mathcal{F}$ is the $\sigma$-algebra on $\Omega$ and $P$ a $\sigma$-finite measure. We consider random elements (functions) $X(\omega, t): \Omega \times T \to \mathbb{R}$ in a metric space $(T, \tau)$, where $T \subseteq \mathbb{R}$ is compact, and we assume the paths $X(\omega, \cdot)$ to be continuous functions. We consider kernels $K_X(s, t) = K(X(\omega, s), X(\omega, t))$ (the classical choice is $K_X(s, t) = E(X(\omega, s) X(\omega, t))$). Then, there exists a basis $\{e_i\}_{i \geq 1}$ of $C(T)$ such that, for all $t \in T$,
$$
X(\omega, t) = \sum_{i=1}^{\infty} \xi_i(\omega)\, e_i(t), \qquad (2)
$$
for appropriate coefficients, where the $e_i$ are the eigenfunctions associated with the integral operator of $K_X(s, t)$.
In real data analysis, we do not have theoretical random paths, or functional data described by mathematical equations, but finite samples from such processes. For instance, if we are considering normal distributions as the object of analysis, we will not know the vector of means and the true covariance matrix ($\mu$ and $\Sigma$), but a sample $X = \{x_i\} \subseteq \mathbb{R}^n$ from which we will estimate the covariance matrix $S = \frac{1}{n} X X^T$. In the case of functions, $X$ will be a compact space or manifold in a Euclidean space, $Y = \mathbb{R}$, and the available sample curves $f_n$ are identified with data sets $\{(x_i, y_i) \in X \times Y\}_{i=1}^{n}$. Let $K: X \times X \to \mathbb{R}$ be a Mercer kernel and $\mathcal{H}_K$ its associated RKHS. Then, the coefficients in Equation (2) can be approximated by solving the following optimization problem [11]:
$$
\min_{f \in \mathcal{H}_K} \ \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i) - y_i\big)^2 + \gamma \|f\|_K^2, \qquad (3)
$$
where $\gamma > 0$ and $\|f\|_K^2$ is the norm of the function $f$ in $\mathcal{H}_K$. The solution, which constitutes an example of Equation (2), is given by $f^*(x) = \sum_j \hat{\lambda}_j \phi_j(x)$, where the $\hat{\lambda}_j$ are the weights of the projection of the function corresponding to the sample $\{(x_i, y_i)\}$ onto the function space generated by the eigenfunctions of $L_K$.
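As an illustration of Equation (3), the representer theorem gives an equivalent finite-sample form of the solution, $f^*(x) = \sum_i c_i K(x, x_i)$ with $c = (K + n\gamma I)^{-1} y$; the following sketch applies it to smooth a single discretely observed curve. The Gaussian kernel and the values of $\gamma$ and $\sigma$ are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def smooth_curve(t, y, gamma=1e-3, sigma=50.0):
    """Return a function approximating a sample curve {(t_i, y_i)} in an RKHS."""
    t = np.asarray(t, float)[:, None]
    K = np.exp(-sigma * (t - t.T) ** 2)                 # Gram matrix on the design points
    c = np.linalg.solve(K + len(t) * gamma * np.eye(len(t)), y)

    def f(s):
        s = np.atleast_1d(np.asarray(s, float))[:, None]
        return np.exp(-sigma * (s - t.T) ** 2) @ c      # evaluate f* at new points
    return f

# Usage: smooth a noisy, discretely observed curve.
t = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * t) + 0.1 * np.random.default_rng(0).normal(size=50)
f_hat = smooth_curve(t, y)
```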
Next, we use local entropies for anomaly detection through kernel combinations. In this preliminary work, we explore linear combinations and Karcher means, to validate the intuition that, when positive definite matrices are involved, using a more natural mean than the arithmetic mean produces better practical results.

2.1. Local Entropy Kernels

In order to link the metric induced by the kernel function and the underlying (empirical) density of the data, we propose local entropy kernels. Consider a measurable cover of $(\Omega, \mathcal{F}, P)$ (the probability space where the random element of interest $X$ is defined), say $\{\Omega_i\}_{i \geq 1}$, where $\bigcup_{i \geq 1} \Omega_i = \Omega$ and $\Omega_i \cap \Omega_j = \emptyset$ for any $i \neq j$; we can define the $\alpha$-entropy [12] of $X$ as follows:
$$
H_\alpha(X) = \frac{1}{1 - \alpha} \log \sum_{i \geq 1} P(\Omega_i)^{\alpha}, \quad \text{for } \alpha \geq 0 \text{ and } \alpha \neq 1. \qquad (4)
$$
The parameter $\alpha$ determines which entropy within the family of $\alpha$-entropies we are referring to. For instance, when $\alpha = 0$, $H_\alpha$ is the Hartley entropy; when $\alpha \to 1$, $H_\alpha$ converges to the Shannon entropy; and when $\alpha \to \infty$, $H_\alpha$ converges to the min-entropy measure. Let $S_\Omega$ be the collection of finite partitions of $\Omega$; for any partition $A = \bigcup_{i=1}^{n} A_i \in S_\Omega$, the entropy of $A$ can be computed as follows:
$$
H_\alpha(A) = \frac{1}{1 - \alpha} \log \sum_{i=1}^{n} P(A_i)^{\alpha}. \qquad (5)
$$
This paves the way to define the $\Delta$-local entropy [13] corresponding to any subset $\Delta \in \mathcal{F}$ as follows:
$$
h_\alpha(\Delta) = \inf_{\tilde{\Delta} \in S_\Omega} H_\alpha(\tilde{\Delta}), \quad \text{such that } \Delta \in \tilde{\Delta}. \qquad (6)
$$
Let $(X_1, \ldots, X_n)$ be a random sample drawn i.i.d. from $P$. We would like to compute the local entropies of the corresponding random sets $\Delta_1, \ldots, \Delta_n$, where $\Delta_i = \Omega \cap B(X_i^{-1}(\omega), r)$ and $B(X_i^{-1}(\omega), r) \subseteq \Omega$ is the open ball centred at $X_i^{-1}(\omega)$ with a (data-driven) small radius $r$. In practice, given a sample $S_n = (x_1, \ldots, x_n)$, we compute the local entropy using the estimator $\hat{h}_\alpha(\Delta_i) = \bar{d}_k(x_i, S_n)/(1 - \alpha)$, where $\bar{d}_k(x_i, S_n)$ is the average distance from $x_i$ to its $k$ nearest neighbours. Notice that the locality parameter $k$ in $\bar{d}_k(x, S_n)$, which represents the number of neighbours used to approximate the local entropy around $x$, is related to $r$ in $\Delta_x = \Omega \cap B(x^{-1}(\omega), r)$. We then consider $\varphi(x) = \hat{h}_\alpha(\Delta_x)$, with $\alpha = 0$, and define the local entropy kernel as
$$
K(x, y) = \varphi(x)^T \varphi(y). \qquad (7)
$$
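A minimal sketch of the construction in Equations (6) and (7) follows: $\varphi(x_i)$ is estimated as the average distance from $x_i$ to its $k$ nearest neighbours (the case $\alpha = 0$), and the local entropy kernel is the corresponding rank-one Gram matrix. The value $k = 10$ is an illustrative choice.

```python
import numpy as np
from scipy.spatial.distance import cdist

def local_entropy_features(S, k=10, alpha=0.0):
    """phi(x_i) = hat{h}_alpha(Delta_i): average distance to the k nearest neighbours."""
    D = cdist(S, S)
    D_sorted = np.sort(D, axis=1)[:, 1:k + 1]   # drop the zero self-distance
    return D_sorted.mean(axis=1) / (1.0 - alpha)

def local_entropy_kernel(S, k=10):
    phi = local_entropy_features(S, k=k)
    return np.outer(phi, phi)                   # rank-one Gram matrix K(x_i, x_j)

# Usage: outliers sit in sparse regions, so their phi values are comparatively large.
rng = np.random.default_rng(1)
S = np.vstack([rng.normal(size=(200, 2)), rng.normal(5, 0.3, size=(10, 2))])
K_entropy = local_entropy_kernel(S)
```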
In the next section, we discuss how to avoid the model selection problem. To this end, a set of local entropy kernels is first estimated from the data. Then, we estimate an average local entropy kernel that takes into account the particular geometry of the space of positive definite matrices. In this way, we obtain a unique low-dimensional data representation from which outliers are detected. This approach includes neither a model selection step nor a parameter estimation procedure.

3. Kernel Combination for Anomaly Detection

Consider a data sample $S_n = \{x_1, \ldots, x_n\} \subseteq X$, where the $x_i$ can be multivariate or functional data, and consider a set of $m$ Mercer kernels (or matrices) $K_1^e, \ldots, K_m^e$ that induce $m$ different data embeddings $\phi_j: X \to \mathbb{R}^{d_j}$, where $K_j^e(x, y) = \phi_j(x)^T \phi_j(y)$. As stated in Equation (1), each of these kernels induces a kernel distance $d_{K_j}$ on the original data space $X$, corresponding to the Euclidean distance on the manifold $Z_j = \phi_j(X)$.
Next, we define a new set of transformations, suitable for anomaly detection, in line with the theory of Section 2.1 by:
$$
\varphi_j(x) = d_{K_j}\big(\phi_j(x), \phi_j(S_n)\big). \qquad (8)
$$
The corresponding kernels suitable for outlier detection are
$$
K_j(x, y) = \varphi_j(x)^T \varphi_j(y). \qquad (9)
$$
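The distance from a point to the image of the whole sample in Equation (8) admits several readings; in the sketch below it is assumed, for illustration only, to be the average kernel distance from $x$ to the sample points, computed from the Gram matrix via Equation (1).

```python
import numpy as np

def anomaly_features(K):
    """varphi_j(x_i): average kernel distance from x_i to the sample (an assumption)."""
    diag = np.diag(K)
    d = np.sqrt(np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0))
    return d.mean(axis=1)

def anomaly_kernel(K):
    """K_j(x_i, x_l) = varphi_j(x_i) * varphi_j(x_l), Equation (9), on the training sample."""
    phi = anomaly_features(K)
    return np.outer(phi, phi)
```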
Now, kernel functions are positive definite type functions, i.e., the empirical kernel matrix $K$, obtained by evaluating the kernel function on the set of $n$ training points, belongs to the cone of symmetric positive semidefinite matrices $\mathcal{P} := \{K \in \mathbb{R}^{n \times n} \mid K = K^T,\ K \succeq 0\}$. Let $K_1, \ldots, K_m$ be the empirical kernel matrices defined in Equation (9), all of them in $\mathcal{P}$, and let $(w_1, \ldots, w_m)^T$ be a suitable non-negative vector of combination parameters; then define the "fusion" kernel $K$ as
$$
K(w_1, \ldots, w_m) := w_1 K_1 + \cdots + w_m K_m \succeq 0. \qquad (10)
$$
In the context of SVM classification problems, the goal is to find the parameters $w_1, \ldots, w_m$ that maximize the optimal margin. Instead, in anomaly detection, the goal is to estimate the parameters $w_1, \ldots, w_m$ that produce a suitable data representation. This is achieved when the regular data within the sample, represented in the coordinate space provided by the fusion kernel $K$, have a reduced entropy (equivalently, are scarcely scattered), while the observations that are atypical in the sample are projected into regions distant from that of the regular data.
Next, we consider three particular combination schemes. The first is rather straightforward, the second uses the mean in the manifold that contains the kernels, and the third is a weighting scheme that assigns the weights according to appropriate choices of entropy functions.
Definition 1
(Multivariate sparsity measures). Consider $m$ different sparsity measures $\phi_1, \ldots, \phi_m$ and let $K_1, \ldots, K_m$ be the corresponding set of Mercer kernels, where $K_i(x, y) = \phi_i(x)^T \phi_i(y)$. We define a multivariate sparsity (concentration) measure by $\Phi = (\phi_1, \ldots, \phi_m): X \to \mathbb{R}^m$.
The corresponding kernel, evaluated at the sample S, will be
$$
K_\Phi(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = (\phi_1(x_i), \ldots, \phi_m(x_i))^T (\phi_1(x_j), \ldots, \phi_m(x_j)) = \sum_{l=1}^{m} \phi_l(x_i)^T \phi_l(x_j) = \sum_{l=1}^{m} K_l(x_i, x_j).
$$
Thus, the kernel corresponding to a multivariate sparsity measure $\Phi = (\phi_1, \ldots, \phi_m)$ is the sum of the univariate kernels $K_i$ associated with the $\phi_i$. This fact allows us to interpret linear combinations of kernels $\sum_i w_i K_i$ as coming from (weighted) multivariate sparsity measures.

3.1. Entropy Weighting

Definition 2
(K-entropy of a data set). Consider a Mercer kernel $K$ acting on a space $X$, a sample data set $S_n$, and the corresponding transformation $\phi: X \to \mathbb{R}^d$ induced by $K$, where $K(x, y) = \phi(x)^T \phi(y)$. The $K$-entropy of $S_n$ is defined by:
$$
E_K(S_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} |K(x_i, x_j)| = \sum_{i=1}^{n} \sum_{j=1}^{n} |\phi(x_i)^T \phi(x_j)|. \qquad (11)
$$
In the context of outlier detection, consider $K_1, \ldots, K_m$ obtained from sparsity measures. From Equations (8) and (9), if a point $x$ is an outlier, then it will lie off the main bulk of data points and, thus, $\varphi_j(x) = d_{K_j}(\phi_j(x), \phi_j(S_n))$ will be large, and the same will be true for $K_j(x, x_i)$ for most $x_i \in S_n$. As a consequence, $E_{K_j}(S_n)$ will tend to be large. On the other hand, by the same reasoning, if a particular kernel $K_j$ induces a representation that is not suitable for detecting the outliers, then $E_{K_j}(S_n)$ will be small. Thus, the measure defined in Equation (11) acts as a true entropy for kernel matrices: if the data are very concentrated after the transformation induced by $K$, then the entropy of the data set (as measured by Equation (11)) will be low.
We establish the entropy-weighting scheme by solving the following semidefinite optimization problem:
$$
\max_{\lambda_1, \ldots, \lambda_m} \ \sum_{j=1}^{m} \lambda_j E_{K_j}(S_n) \quad \text{s.t.} \quad \sum_{j=1}^{m} \lambda_j K_j \succeq 0, \quad \sum_{j=1}^{m} \lambda_j = 1, \quad \text{and} \quad 0 \leq \lambda_j \leq u_j, \qquad (12)
$$
where the $u_j \in (0, 1]$ are positive constants that may be associated with each kernel matrix $K_j$. We refer to [14] for a detailed description of the basics of semidefinite programming.
Theorem 1.
Consider the previous semidefinite optimization problem. If $K_1, \ldots, K_m \succeq 0$ and $u_j = E_{K_j}(S_n) / \sum_{l} E_{K_l}(S_n)$, then the solution to the optimization problem is given by $\lambda_j^* = E_{K_j}(S_n) / \sum_{l} E_{K_l}(S_n)$.
Proof. 
Given that $\lambda_j^* = E_{K_j}(S_n) / \sum_{l} E_{K_l}(S_n) \geq 0$ and $K_j \succeq 0$, the constraint $\sum_{j=1}^{m} \lambda_j K_j \succeq 0$ holds. In addition,
$$
\sum_{j=1}^{m} \lambda_j^* = \sum_{j=1}^{m} \frac{E_{K_j}(S_n)}{\sum_{l} E_{K_l}(S_n)} = 1.
$$
Since the objective is linear with non-negative coefficients $E_{K_j}(S_n)$, it is maximized when each $\lambda_j$ attains its upper bound $u_j$; as the $\lambda_j^*$ do so while satisfying all the constraints, the theorem holds and the solution is unique. □
Thus, the entropy-weighting scheme will be:
$$
K^* = \frac{E_{K_1}(S_n)}{\sum_{l} E_{K_l}(S_n)} K_1 + \frac{E_{K_2}(S_n)}{\sum_{l} E_{K_l}(S_n)} K_2 + \cdots + \frac{E_{K_m}(S_n)}{\sum_{l} E_{K_l}(S_n)} K_m. \qquad (13)
$$
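A minimal sketch of the entropy-weighting scheme of Theorem 1 and Equation (13): each Gram matrix is weighted by its $K$-entropy (Equation (11)) normalized over the set of kernels.

```python
import numpy as np

def k_entropy(K):
    """E_K(S_n) = sum_{i,j} |K(x_i, x_j)| (Equation (11))."""
    return np.abs(K).sum()

def entropy_weighted_kernel(kernels):
    """K* = sum_j lambda_j* K_j with lambda_j* = E_{K_j} / sum_l E_{K_l} (Equation (13))."""
    entropies = np.array([k_entropy(K) for K in kernels])
    weights = entropies / entropies.sum()
    K_star = sum(w * K for w, K in zip(weights, kernels))
    return K_star, weights
```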

3.2. Karcher Mean

Next, we introduce the Karcher mean [15,16,17] of kernel matrices as an alternative approach to the linear combinations of matrices presented in Section 3.1. The Karcher mean preserves the particular Riemannian manifold in which the kernel matrices lie and constitutes a natural definition for the geometric mean of the matrices.
The set $\mathcal{P}$ of symmetric positive definite matrices is a Riemannian manifold, with inner product $\langle A, B \rangle_X = \mathrm{Tr}(X^{-1} A X^{-1} B)$ on the tangent space to $\mathcal{P}$ at the point $X$. The distance between $A, B \in \mathcal{P}$ is given by $d_{\mathcal{P}}(A, B) = \|\log(A^{-1/2} B A^{-1/2})\|_F$, where $\|\cdot\|_F$ is the Frobenius norm, that is, $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$. Given kernel matrices $K_1, \ldots, K_m$, the Karcher mean, denoted onwards as $\bar{\bar{K}}$, is defined as the minimizer of the function $f(X) = \sum_{i=1}^{m} d_{\mathcal{P}}(X, K_i)^2$, and it is the unique solution $X \in \mathcal{P}$ of the matrix equation $\sum_{i=1}^{m} \log(K_i^{-1} X) = 0$.
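The sketch below computes the Karcher mean with a standard fixed-point iteration on the manifold of symmetric positive definite matrices (in the spirit of the algorithms discussed in [17]); it is an illustrative implementation under the above definitions, not necessarily the one used in the experiments.

```python
import numpy as np
from scipy.linalg import expm, logm, sqrtm, inv

def karcher_mean(mats, tol=1e-10, max_iter=100):
    """Karcher (geometric) mean of a list of symmetric positive definite matrices."""
    X = sum(mats) / len(mats)                     # arithmetic mean as starting point
    for _ in range(max_iter):
        Xs = np.real(sqrtm(X))
        Xs_inv = inv(Xs)
        # Average the log-maps of the K_i at the current iterate X.
        T = sum(np.real(logm(Xs_inv @ K @ Xs_inv)) for K in mats) / len(mats)
        X = Xs @ expm(T) @ Xs
        if np.linalg.norm(T, 'fro') < tol:        # at the mean, the averaged log-map vanishes
            break
    return X

# Example: Karcher mean of a few random SPD matrices.
rng = np.random.default_rng(0)
mats = []
for _ in range(5):
    A = rng.normal(size=(3, 3))
    mats.append(A @ A.T + 3 * np.eye(3))
K_karcher = karcher_mean(mats)
```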

4. Experimental Section

In this section, we illustrate, with the aid of several numerical examples and real data sets, the performance of the proposed methodology when the goal is to detect abnormal observations in a sample. We consider a list of kernel functions, namely: (i) the Gaussian kernel $K_G(x_i, x_j) = e^{-\sigma \|x_i - x_j\|^2}$, with the parameter $\sigma$ taking values on the grid $\sigma \in \{0.1^3, 0.1^2, 0.1, 1, 10, 50, 100, 500, 10^3\}$; (ii) the linear kernel $K_L(x_i, x_j) = \langle x_i, x_j \rangle$; and (iii) the second-degree polynomial kernel $K_P(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^2$. As explained in Section 1, the proposed combination methods can be considered as an alternative to model selection techniques for outlier detection purposes. Therefore, their results are presented jointly with those of the single kernel methods. Our combination methods are denoted as: (i) the average kernel ($\bar{K}$); (ii) the kernel constructed using the Karcher mean of the single kernel matrices ($\bar{\bar{K}}$); and (iii) the minimum entropy linear combination kernel, or entropy kernel ($E_K$).
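For reference, a minimal sketch of how the bank of single Gram matrices used in the experiments (the Gaussian kernels over the grid of $\sigma$ values, plus the linear and second-degree polynomial kernels) can be evaluated on a data matrix:

```python
import numpy as np

def gram_matrices(X, sigmas=(0.1**3, 0.1**2, 0.1, 1, 10, 50, 100, 500, 10**3)):
    """Return the list of single-kernel Gram matrices used as inputs for the combinations."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    lin = X @ X.T
    kernels = [np.exp(-s * sq) for s in sigmas]            # K_G(x_i, x_j) = exp(-sigma ||x_i - x_j||^2)
    kernels.append(lin)                                     # K_L(x_i, x_j) = <x_i, x_j>
    kernels.append((lin + 1.0) ** 2)                        # K_P(x_i, x_j) = (<x_i, x_j> + 1)^2
    return kernels

# Usage on a small random sample.
kernels = gram_matrices(np.random.default_rng(0).normal(size=(100, 2)))
```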
For comparison purposes, we consider several alternative approaches for anomaly detection in both the multivariate and the functional data frameworks. In the multivariate case, we consider well-known techniques from the machine learning literature: (i) LOF [18] and (ii) HiCS [19]. In the functional case, we test our proposals against three widely used depth measures: the Modified Band Depth (MBD) [20], the Modal Depth (HMD) [21] and the Random Tukey Depth (RTD) [22]. These depth measures induce an ordering of the functional data set that can be used to determine which observations (curves) are far from the deepest or central point and can therefore be classified as outliers.
Each database contains a set of regular observations and has been contaminated with abnormal or outlying observations. Let P and N be the number of outlying and normal observations in the sample, respectively, and let TP (True Positives) and TN (True Negatives) be the corresponding quantities detected by the different methods. In Table 1 and Table 2, we report the following average metrics: TPR = TP/P (True Positive Rate or sensitivity) and TNR = TN/N (True Negative Rate or specificity). For the comparison with other techniques using real data sets, in Table 4 and Table 5 we report the area under the ROC curve (AUC) for each experiment.

4.1. Synthetic Data

For the simulated experiments, we consider two synthetic data schemes: the first is based on a synthetic multivariate data set, while the second uses a synthetic functional data set.
Synthetic multivariate data: We consider a conditionally normal bivariate distribution model [23] for the regular data, while the outliers were sampled from three different standard Gaussian models. The sample size is $n = 1000$. The data for the experiment, illustrated in Figure 1, were obtained using a Gibbs sampler.
Synthetic functional data: We consider random samples of Gaussian processes $\{x_1(t), \ldots, x_n(t)\}$, with sizes 4000 (training) and 2000 (test), where a proportion $\nu = 0.1$, known a priori, presents an atypical pattern and the remaining $n(1 - \nu)$ curves constitute the main data. We consider the following generating processes:
$$
X_l(t) = \sum_{j=1}^{2} \xi_j \sin(j \pi t) + \varepsilon_l(t), \quad \text{for } l = 1, \ldots, (1 - \nu)n, \qquad Y_l(t) = \sum_{j=1}^{2} \zeta_j \sin(j \pi t) + \varepsilon_l(t), \quad \text{for } l = 1, \ldots, \nu n,
$$
where $t \in [0, 1]$, the $\varepsilon_l(t)$ are independent autocorrelated random error functions, and $(\xi_1, \xi_2)$ is a bivariate normally distributed random vector with mean $\mu_\xi = (1, 2)$ and diagonal covariance matrix $\Sigma_\xi = \mathrm{diag}(1, 1)$. To generate the outliers, we consider $(\zeta_1, \zeta_2)$ normally distributed with parameters $\mu_\zeta = (4, 5)$ and $\Sigma_\zeta = \Sigma_\xi$. The data are plotted in Figure 2.
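A minimal sketch of the generating processes displayed above is given below; the AR(1) structure of the error functions is an illustrative assumption, since the text only states that they are autocorrelated.

```python
import numpy as np

def ar1_noise(n_curves, n_points, rho=0.5, scale=0.1, rng=None):
    """Autocorrelated error functions epsilon_l(t), modelled here as AR(1) (assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    e = np.zeros((n_curves, n_points))
    innov = rng.normal(scale=scale, size=(n_curves, n_points))
    for k in range(1, n_points):
        e[:, k] = rho * e[:, k - 1] + innov[:, k]
    return e

def generate_curves(n=4000, nu=0.1, n_points=100, rng=None):
    """Regular curves X_l (proportion 1 - nu) and outlying curves Y_l (proportion nu)."""
    rng = np.random.default_rng(0) if rng is None else rng
    t = np.linspace(0, 1, n_points)
    basis = np.vstack([np.sin(j * np.pi * t) for j in (1, 2)])       # sin(pi t), sin(2 pi t)
    n_out = int(nu * n)
    xi = rng.multivariate_normal([1, 2], np.eye(2), size=n - n_out)   # regular coefficients
    zeta = rng.multivariate_normal([4, 5], np.eye(2), size=n_out)     # outlying coefficients
    X = xi @ basis + ar1_noise(n - n_out, n_points, rng=rng)
    Y = zeta @ basis + ar1_noise(n_out, n_points, rng=rng)
    return t, X, Y
```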
Table 1 shows the results of the experiment using synthetic multivariate data. The best results are marked in bold. It can be observed that the proposed combination methods, namely the mean, the weighted entropy and the Karcher mean, perform as well as the best single kernel in terms of the TNR. With respect to the TPR, the best combination method is the one based on the Karcher mean.
In Table 2, the results of the experiment using synthetic functional data are presented. In this case, two of the three proposals, the mean and the weighted entropy, are always able to perform as well as the best single kernel (the polynomial kernel) in terms of both the TNR and the TPR. The method based on the Karcher mean obtains good results with respect to the TNR measure.

4.2. Real Data

Regarding real data, we also distinguish between multivariate and functional data. To test and compare the proposals on multivariate data, we consider six databases from the UCI machine learning repository [24], which are available and properly described in [25]. The testing and comparison of our proposals on functional data is carried out over two functional data sets: (i) Poblenou NOx Emissions (NOx). This data set contains the nitrogen oxides (NOx) emission levels measured every hour by a control station in the Poblenou district of Barcelona (Spain); the data are publicly available in the R package 'fda.usc' [26]. In this data set, working-day NOx emissions, considered as regular data, can be distinguished from weekend-day NOx emissions, considered as atypical data. (ii) Vertical Density Profiles (VDP). This data set contains 24 Vertical Density Profile curves which come from the manufacture of engineered wood boards; each curve consists of 314 measurements taken at 0.002-inch increments (see [27] for further details). In Table 3, we give details about the sample size, the dimension and the percentage of outlying observations for each of the data sets. The NOx and VDP data sets are illustrated in Figure 3.
Table 4 shows the results of the experiment using real multivariate data. It can be observed that the best overall method on average is the weighted entropy proposal. In particular, this method attains the best results for two of the six data sets (Pima and Cardio), and for the remaining sets its results are close to the best ones. Although the proposed methodologies seem to perform systematically better than the other machine learning approaches, it is not clear, in terms of the AUC, whether for some data sets (Glass, Breast Cancer, Breast Cancer Diagnostic and Pima) the difference is statistically significant.
In Table 5, the results of the experiment using real functional data are presented. For the VDP data set, in terms of the AUC measure, the weighted entropy and the mean proposals perform as well as the best single kernels and the MBD. For the NOx data set, the best overall method is the one based on the calculation of the Karcher mean, followed closely by the MBD approach.

4.3. Robustness of the Karcher Mean

In this experiment, we explore the robustness of the proposed procedure in the context of atypical functional data detection. To this end, we generate $n = 100$ independent sample paths from the following Gaussian stochastic model:
$$
X(t) = \xi_1 \sin(t) + \xi_2 \cos(t) \quad \text{for } t \in [0, \pi], \quad \text{with} \quad \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \sim N\!\left( \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma = \begin{pmatrix} 0.75 & 0.5 \\ 0.5 & 0.75 \end{pmatrix} \right), \qquad (14)
$$
that is, $(\xi_1, \xi_2)$ follows a zero-mean bivariate normal distribution with covariance parameters $\sigma_{11} = \sigma_{22} = 0.75$ and $\sigma_{12} = \sigma_{21} = 0.5$. Using the representation techniques introduced in Section 2, we can represent these curves as points in $\mathbb{R}^2$ and, moreover, estimate (by maximum likelihood) a covariance matrix $\hat{\Sigma}$ from this data representation. We replicate the previous generating process 10 times, obtaining 10 estimated covariance matrices, namely $\hat{\Sigma}_i$ for $i = 1, \ldots, 10$. Next, we construct the average estimated covariance matrix, $\hat{\Sigma} = \sum_{i=1}^{10} \hat{\Sigma}_i / 10$, and the Karcher mean estimated covariance matrix, $\bar{\bar{\Sigma}}(\hat{\Sigma}_1, \ldots, \hat{\Sigma}_{10})$. The estimates are illustrated in Figure 4 (left), where each grey ellipse corresponds to the following equation:
$$
\frac{\big(x_1 \cos(\hat{\theta}_i) + x_2 \sin(\hat{\theta}_i)\big)^2}{\hat{\lambda}_{1,i}\, \chi^2_{2,\,0.99}} + \frac{\big(x_2 \cos(\hat{\theta}_i) - x_1 \sin(\hat{\theta}_i)\big)^2}{\hat{\lambda}_{2,i}\, \chi^2_{2,\,0.99}} = 1, \quad \text{for } i = 1, \ldots, 10, \qquad (15)
$$
where $\chi^2_{2,\,0.99}$ is the 0.99 quantile of a chi-square distribution with two degrees of freedom, $\hat{\lambda}_{1,i}$ and $\hat{\lambda}_{2,i}$ are the eigenvalues estimated from each $\hat{\Sigma}_i$, and $\hat{\theta}_i$ is the estimated rotation angle with respect to the $x_1$ axis. In the same figure, the ellipse corresponding to the estimated mean $\hat{\Sigma}$ is shown as a red dashed line, and that of the Karcher mean $\bar{\bar{\Sigma}}$ as a blue dashed line. To introduce an anomaly in the data, in Figure 4 (right) we added one ellipse constructed from an anomalous bivariate distribution with covariance matrix elements $\sigma_{11} = \sigma_{22} = 7.5$ and $\sigma_{12} = \sigma_{21} = 10$; this atypical covariance matrix corresponds to a stochastic Gaussian model different from the baseline introduced in Equation (14).
It can be observed in Figure 4 (left) that the average covariance matrix and the Karcher mean of the covariance matrices generate similar 99th percentile ellipses. Since the estimated covariance matrices $\hat{\Sigma}_i$ are located in a small region within the cone of positive semidefinite matrices, that region can be approximated by a linear subspace that contains the average covariance matrix. On the other hand, in Figure 4 (right), the effect of the curvature of the cone appears through the very different dispersion of the anomalous covariance matrix, illustrated by the black dashed ellipse. In this scenario, the Karcher mean of the covariance matrices generates a 99th percentile ellipse similar to that of the regular scenario (left panel), which shows the robustness of the Karcher mean in the presence of outliers. In contrast, in the contaminated scenario (right panel), the 99th percentile ellipse generated with the simple arithmetic mean of the covariance matrices changes radically with respect to the regular scenario. This robustness in the estimation of the covariance matrix suggests that the procedure proposed in this paper, based on the estimation of the Karcher mean in the cone of positive definite matrices, will be useful when solving atypical functional data identification problems.
Last but not least, the relevant aspect of this numerical example is that, by using the Karcher mean as an estimator of the centre of the distribution of positive semidefinite matrices, we are minimizing the Riemannian distance defined in Section 3.2 and, as a consequence, the proposed method is able to identify the anomalous covariance matrix with respect to the pattern given by the rest of the distributions.

5. Discussion

In this work, we have explored how to combine different sources of information for anomaly detection within the framework of entropy measures. We define entropies associated with the transformation induced by Mercer kernels, both for random variables and for data sets. We propose a new class of combination methods that generate a single Mercer kernel (acting as a similarity measure) for anomaly detection purposes from a set of entropy measures in the context of density estimation. In particular, three combination schemes have been proposed and analysed, namely: (i) an average of the kernel matrices; (ii) the mean in the manifold that contains the kernels; and (iii) a weighting scheme that assigns the weights as the solution of an optimization problem that maximizes a particular kernel entropy. The proposals based on the idea of building the final combined kernel matrix within the same variety where the kernel matrices to be combined live seem to be the most successful ones on average.
An innovative application of this methodology is the use of the Karcher mean to identify anomalous covariance matrices. The success of this proposal is due to the fact that the Karcher mean acts as an estimator of the centre of the distribution of positive semidefinite matrices while minimizing their Riemannian distance, allowing the identification of outlying matrices with respect to the pattern given by such an estimator.
A relevant aspect for the method's applicability in real problems is its complexity and cost in comparison with other alternatives. The proposals whose structure is based on a linear combination of kernel matrices have a very low computational cost, involving only products by constants and sums of matrices. The proposal based on the use of the Karcher mean has the typical drawback of any semidefinite programming problem: the computational and memory costs are related to the size of the matrices involved. Current systems are not able to deal with large dense matrices, given that processing time and memory grow quasi-exponentially as the size of the matrices increases. See [28] for a discussion of these aspects and of current trends to improve the performance of methods for solving semidefinite programming problems. Most applications of semidefinite programming with general dense matrices involve a few hundred data cases. Fortunately, in this particular application (outlier detection), we do not need to work with the full database to succeed: due to the presence of statistical regularities, a few thousand data cases will usually be enough to capture all the relevant statistical aspects of the data set at hand.
Further research remains to be done, especially regarding the possibility of exploring other embeddings of the data. For instance, higher-dimensional transformations specific to anomaly detection could be designed. In this regard, care should be taken with the scaling of such transformations, as dimensions with magnitudes much larger than the others may lead to suboptimal results. In this work, for multivariate data, we have compared the proposed methodologies with some multivariate outlier detection techniques. In the future, systematic experiments comparing with other well-known methodologies such as XGBOD [29], LODES [30], iForest [31] or MASS [32] are to be carried out. Regarding these multivariate techniques, another interesting research line is the extension of such methodologies to functional data analysis. In this regard, suitable multivariate representations of functional data similar to those in [2] should be explored.

Author Contributions

All authors have contributed equally to the paper.

Funding

This research was funded by CONICET Argentina project 20020150200110BA, the Spanish Ministry of Economy and Competitiveness projects ECO2015–66593-P and GROMA (MTM2015-63710-P). The APC was funded by Project GROMA (MTM2015-63710-P).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Moguerza, J.M.; Muñoz, A. Support vector machines with applications. Stat. Sci. 2006, 21, 322–336.
2. Muñoz, A.; González, J. Representing functional data using support vector machines. Pattern Recognit. Lett. 2010, 31, 511–516.
3. Gold, C.; Sollich, P. Model selection for support vector machine classification. Neurocomputing 2003, 55, 221–249.
4. Chapelle, O.; Vapnik, V.; Bousquet, O.; Mukherjee, S. Choosing multiple parameters for support vector machines. Mach. Learn. 2002, 46, 131–159.
5. Wahba, G. Support vector machines, reproducing kernel Hilbert spaces, and randomized GACV. In Advances in Kernel Methods: Support Vector Learning; Schölkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1999; pp. 69–88.
6. Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 2004, 5, 27–72.
7. De Diego, I.M.; Muñoz, A.; Moguerza, J.M. Methods for the combination of kernel matrices within a support vector framework. Mach. Learn. 2010, 78, 137–172.
8. Hsing, T.; Eubank, R. Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators; John Wiley & Sons: Hoboken, NJ, USA, 2015.
9. Smola, A.; Gretton, A.; Song, L.; Schölkopf, B. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, Sendai, Japan, 1–4 October 2007; pp. 13–31.
10. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer: New York, NY, USA, 2011.
11. Kimeldorf, G.; Wahba, G. Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 1971, 33, 82–94.
12. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960.
13. Martos, G.; Hernández, N.; Muñoz, A.; Moguerza, J.M. Entropy measures for stochastic processes with applications in functional anomaly detection. Entropy 2018, 20, 33.
14. Vandenberghe, L.; Boyd, S. Semidefinite programming. SIAM Rev. 1996, 38, 49–95.
15. Karcher, H. Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 1977, 30, 509–541.
16. Arnaudon, M.; Barbaresco, F.; Yang, L. Medians and means in Riemannian geometry: Existence, uniqueness and computation. arXiv 2011, arXiv:1111.3120.
17. Bini, D.A.; Iannazzo, B. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra Appl. 2013, 438, 1700–1710.
18. Breunig, M.M.; Kriegel, H.-P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 15–18 May 2000; pp. 93–104.
19. Keller, F.; Müller, E.; Böhm, K. HiCS: High contrast subspaces for density-based outlier ranking. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Washington, DC, USA, 1–5 April 2012; pp. 1037–1048.
20. López-Pintado, S.; Romo, J. On the concept of depth for functional data. J. Am. Stat. Assoc. 2009, 104, 718–734.
21. Cuevas, A.; Febrero, M.; Fraiman, R. Robust estimation and classification for functional data via projection-based depth notions. Comput. Stat. 2007, 22, 481–496.
22. Cuesta-Albertos, J.A.; Nieto-Reyes, A. The random Tukey depth. Comput. Stat. Data Anal. 2008, 52, 4979–4988.
23. Gelman, A.; Meng, X.L. A note on bivariate distributions that are conditionally normal. Am. Stat. 1991, 45, 125–126.
24. Blake, C.L.; Merz, C.J. UCI Repository of Machine Learning Databases. Available online: http://archive.ics.uci.edu/ml/index.php (accessed on 10 September 2018).
25. Rayana, S. ODDS Library. Available online: http://odds.cs.stonybrook.edu (accessed on 10 September 2018).
26. Febrero-Bande, M.; de la Fuente, M.O. Statistical computing in functional data analysis: The R package fda.usc. J. Stat. Softw. 2012, 51, 1–28.
27. Moguerza, J.M.; Muñoz, A.; Psarakis, S. Monitoring nonlinear profiles using support vector machines. In Iberoamerican Congress on Pattern Recognition; Springer: Berlin, Germany, 2007; pp. 574–583.
28. Zheng, Y.; Yan, Y.; Liu, S.; Huang, X.; Xu, W. An efficient approach to solve the large-scale semidefinite programming problems. Math. Probl. Eng. 2012, 2012, 764760.
29. Zhao, Y.; Hryniewicki, M.K. XGBOD: Improving supervised outlier detection with unsupervised representation learning. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018.
30. Sathe, S.; Aggarwal, C. LODES: Local density meets spectral outlier detection. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; pp. 171–179.
31. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
32. Ting, K.M.; Chuan, T.S.; Liu, F.T. Mass: A New Ranking Measure for Anomaly Detection; Technical Report TR2009/1; Gippsland School of Information Technology, Monash University: Victoria, Australia, 2009.
Figure 1. Main data in black (●) and outlying observations in red (*).
Figure 2. Main data in black (—) and outlying observations in red.
Figure 3. NOx (left) and VDP (right) functional data sets. The sample of regular curves is shown in black ("—") and the abnormal curves in red.
Figure 4. First two coordinates of ten 99th percentile ellipses (grey). $\hat{\Sigma}$ ellipse in red ("- - -") and $\bar{\bar{\Sigma}}$ ellipse in blue ("- - -"). Left panel: Gaussian scenario; right panel: Gaussian scenario contaminated with an anomalous covariance matrix in black ("- - -").
Table 1. Percentage of TPR (sensitivity) and TNR (specificity) for synthetic multivariate data.

Kernel | TPR | TNR
K_G (σ = 0.1³) | 69.3 | 97.7
K_G (σ = 0.1²) | 69.3 | 97.7
K_G (σ = 0.1) | 69.3 | 97.7
K_G (σ = 1) | 65.3 | 97.7
K_G (σ = 10) | 0.0 | 92.5
K_G (σ = 50) | 0.0 | 92.5
K_G (σ = 100) | 0.0 | 92.5
K_G (σ = 500) | 0.0 | 92.5
K_G (σ = 10³) | 0.0 | 92.5
K_L | 69.3 | 97.7
K_P | 62.6 | 97.7
Mean kernel | 62.6 | 97.7
Karcher mean kernel | 69.3 | 97.7
Entropy kernel (E_K) | 62.6 | 97.7
Table 2. Percentage of TPR (sensitivity) and TNR (specificity) for synthetic functional data.

Experiment | TPR (train, n = 4000) | TNR (train, n = 4000) | TPR (test, n = 2000) | TNR (test, n = 2000)
K_G (σ = 0.1³) | 90.25 | 98.91 | 89.5 | 98.55
K_G (σ = 0.1²) | 90.00 | 98.88 | 90.5 | 98.44
K_G (σ = 0.1) | 2.25 | 89.13 | 5.0 | 84.11
K_G (σ = 1) | 0.25 | 88.91 | 0.0 | 81.00
K_G (σ = 10) | 32.00 | 92.44 | 20.0 | 91.05
K_G (σ = 50) | 23.75 | 91.52 | 20.0 | 97.77
K_G (σ = 100) | 24.00 | 91.55 | 20.0 | 99.22
K_G (σ = 500) | 14.50 | 90.50 | 44.0 | 57.00
K_G (σ = 10³) | 38.75 | 93.19 | 44.0 | 54.88
K_L | 90.25 | 98.91 | 89.0 | 98.55
K_P | 94.75 | 99.41 | 95.5 | 99.27
Mean kernel | 94.75 | 99.41 | 95.5 | 99.27
Karcher mean kernel | 38.75 | 93.19 | 44.0 | 55.05
Entropy kernel (E_K) | 94.75 | 99.41 | 95.5 | 99.27
Table 3. Summary of the data sets.

Data Set | Sample Size (n) | Dimension (d) | Outliers (%)
Glass | 214 | 9 | 9 (4.2%)
Vertebral | 240 | 6 | 30 (12.5%)
Breast | 683 | 9 | 239 (35%)
Breast (Diagnosis) | 278 | 30 | 21 (5.6%)
Pima (Diabetes) | 768 | 8 | 268 (35%)
Cardio | 1831 | 21 | 176 (9.6%)
VDP | 24 | sampled at 314 points | 3 (12.5%)
NOx | 115 | sampled at 24 points | 35 (30.3%)
Table 4. Area under the ROC curve (AUC) for multivariate data sets.

Experiment | Glass | Vertebral | Breast Cancer | Breast Cancer (Diag.) | Pima | Cardio
K_G (σ = 0.1³) | 88.13 | 57.5 | 60.9 | 94.8 | 49.5 | 90.0
K_G (σ = 0.1²) | 88.51 | 71.4 | 60.8 | 94.7 | 49.7 | 89.5
K_G (σ = 0.1) | 91.49 | 63.0 | 60.5 | 94.8 | 50.6 | 69.6
K_G (σ = 1) | 82.87 | 66.9 | 62.2 | 94.4 | 71.9 | 65.3
K_G (σ = 10) | 76.40 | 77.9 | 68.6 | 86.1 | 74.8 | 49.2
K_G (σ = 50) | 51.49 | 89.0 | 68.1 | 85.2 | 48.6 | 46.8
K_G (σ = 100) | 56.42 | 79.2 | 68.1 | 84.7 | 43.8 | 45.4
K_G (σ = 500) | 53.90 | 85.8 | 68.1 | 65.2 | 52.4 | 64.8
K_G (σ = 10³) | 57.51 | 79.5 | 68.1 | 72.9 | 62.2 | 67.0
K_L | 87.86 | 72.8 | 60.8 | 94.8 | 49.1 | 90.1
K_P | 85.85 | 74.1 | 59.4 | 96.3 | 49.4 | 94.8
Mean kernel | 85.85 | 66.4 | 62.9 | 94.3 | 48.8 | 94.8
Karcher mean kernel | 88.13 | 82.2 | 60.7 | 94.4 | 78.7 | 63.4
Entropy kernel (E_K) | 88.13 | 82.2 | 61.4 | 94.4 | 78.7 | 94.8
LOF | 76.8 | 59.3 | 56.4 | 86.9 | 70.9 | 59.6
HiCS | 80.0 | 56.6 | 59.3 | 94.2 | 72.4 | 63.0
Table 5. Area under the ROC curve (AUC) for functional data sets.

Experiment | VDP | NOx
K_G (σ = 0.1³) | 100.0 | 65.7
K_G (σ = 0.1²) | 73.8 | 42.9
K_G (σ = 0.1) | 88.8 | 48.1
K_G (σ = 1) | 53.9 | 48.1
K_G (σ = 10) | 50.7 | 48.1
K_G (σ = 50) | 45.2 | 48.1
K_G (σ = 100) | 52.3 | 48.1
K_G (σ = 500) | 52.3 | 48.1
K_G (σ = 10³) | 52.3 | 48.1
K_L | 100.0 | 57.7
K_P | 100.0 | 52.0
Mean kernel | 100.0 | 65.1
Karcher mean kernel | 63.4 | 70.9
Entropy kernel (E_K) | 100.0 | 65.1
MBD | 100.0 | 69.6
HMD | 98.4 | 51.6
RTD | 87.3 | 62.8
