Research
   Home
   Computational Finance
   ICA
   Hybrid Systems
   Recurrent Networks
   Support Vector Machines
   Input Variable Selection
   Classification Methods

Publications
   Book Chapters
   Journal Papers
   Conference Papers
   BibTex

Software
Windale Technologies

Conferences
   NIPS*96
   NIPS*95
   NIPS*94

Information
   Biography
   Contact Me

Support Vector Machines

Classification Using Support Vector Machines

Andrew D. Back
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan.

1 Introduction

Historically, classifiers have been determined by choosing a structure¹, and then selecting a parameter estimation algorithm used to optimize some cost function. The structure chosen fixes the best achievable generalization error, while the parameter estimation algorithm optimizes the cost function with respect to the empirical risk.

There are a number of problems with this approach, however. These include:

The model structure needs to be selected in some manner. If this is not done correctly, then even with zero empirical risk, it is still possible to have a large generalization error [18].
If we wish to avoid the problem of overfitting, as indicated by the above problem, by choosing a smaller model size or order, then it may be difficult to fit the training data (and hence minimize the empirical risk).
Determining a suitable learning algorithm for minimizing the empirical risk may still be quite difficult. It may be very hard or impossible to guarantee that the correct set of parameters.

The support vector method is a recently developed technique which is designed for efficient multidimensional function approximation . The basic idea of support vector machines (SVMs) is to determine a classifier or regression machine which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error) [18].

In SVMs, the idea is to fix the empirical risk associated with an architecture and then to use a method to minimize the generalization error. The primary advantage of SVMs as adaptive models for binary classification and regression is that they provide a classifier with minimal VC dimension which implies low expected probability of generalization errors. SVMs can be used to classify linearly separable data and nonlinearly separable data. They can be used as nonlinear classifiers and regression machines by mapping the input space to a high dimensional feature space. In this high dimensional feature space, linear classification can be performed.

In the last few years, a significant amount of research has been performed in SVMs. Learning algorithms and training methods are examined in . Methods for determining the data to use in support vector methods has been considered in [16]. Decision rules have been considered in [3].

Applications of support vector machines to speaker identification are considered in [14]. Time series prediction applications of support vector machines have been considered in .

Support vector machines have been shown to have a relationship with other recent nonlinear classification and modeling techniques such as: radial basis function networks [17], sparse approximation [5], PCA and regularization [13]. Support vector machines have been used to choose radial basis function centers .

The key to understanding SVMs is to see how they introduce optimal hyperplanes to separate classes of data in the classifiers. We review the main concepts of SVMs in the next section.

2 How Support Vector Machines Work

2.1 Optimal Hyperplanes

Consider an m-dimensional input vector x = [x₁,...,x_m]^T Î X Ì R^m and a 1-dimensional output y Î {-1,1}. Let there exist n training vectors (x_i,y_i) i = 1,..,n. Hence we may write X = [ x₁x₂¼x_n] or

X =

é
ê
ê
ê
ê
ë

x₁₁

x_1n

^··_·

x_m1

x_mn

ù
ú
ú
ú
ú
û

(1)

A hyperplane capable of performing a linear separation of the training data is described by

w^Tx+b = 0

(2)

where w = [ w₁w₂¼w_m] ^T, w Î W Ì R^m.

The concept of an optimal hyperplane was proposed by Vapnik [18]. For the case where the training data is linearly separable, an optimal hyperplane separates the data without error and the distance between the hyperplane and the closest training points is maximal.

2.2 Canonical Hyperplanes

A canonical hyperplane is a hyperplane (in this case we consider the optimal hyperplane) in which the parameters are normalized in a particular manner [18].

Consider (2) which defines the general hyperplane. It is evident that there is some redundancy in this equation as far as separating sets of points. Suppose we have the following classes

y_i[ w^Tx_i+b] ³ 1 i = 1,...,n

(3)

where y Î [ -1,1].

One way in which we can constrain the hyperplane is to observe that on either side of the hyperplane, we may have w^Tx+b > 0 or w^Tx+b < 0. Thus, if we place the hyperplane midway between the two closest points to the hyperplane, then we can scale w,b such that

min
i = 1..n

| w^Tx_i+b| = 0

(4)

Now, the distance d from a point x_i to the hyperplane denoted by ( w,b) is given by

d(w,b;x_i) =

| w^Tx_i+b|

|| w||

(5)

where || w|| = w^Tw [8]. By considering two points on opposite sides of the hyperplane, the canonical hyperplane is found by maximizing the margin

p( w,b)

min
i;y_i = 1

d(w,b;x_i)+

min
j;yj = 1

d(w,b;x_j)

|| w||

(6)

This implies that the minimum distance between two classes i and j is at least [2/( || w|| )] [6,15].

Hence an optimization function which we seek to minimize to obtain canonical hyperplanes, is

J(w) =

|| w|| ²

(7)

Normally, to find the parameters, we would minimize the training error and there are no constraints on w,b. However, in this case, we seek to satisfy the inequality in (3). Thus, we need to solve the constrained optimization problem in which we seek a set of weights which separates the classes in the usually desired manner and also minimizing J(w), so that the margin between the classes is also maximized. Thus, we obtain a classifier with optimally separating hyperplanes.

2.3 An SVM Learning Rule

For any given data set, how can we determine w₀,b₀ such that (8) is minimized? One possible method to use would be a constrained form of gradient descent. In this case, a gradient descent algorithm is used to minimize the cost function J(w), while constraining the changes in the parameters according to ( 3). A better approach to this problem however, is to use Langrange multipliers which is well suited to the nonlinear constraints of (3). Thus, we introduce the Lagrangian

L(w,b,a) =

|| w||²-

n
å
i = 1

a_i( y_i[ w^Tx_i+b] -1)

(8)

where a_i are the Lagrange multipliers and a_i > 0.

The solution is found by maximizing L with respect to a_i and minimizing it with respect to the primal variables w and b . This problem may be transformed from the primal case into its dual [6] and hence we need to solve

max
a

min
w,b

L(w,b,a)

(9)

At the solution point, we have the following conditions

¶L(w₀,b₀,a₀)

¶w

¶L(w₀,b₀,a₀)

¶b

(10)

where solution variables w₀,b₀,a₀ are found. Performing the differentiations, we obtain respectively,

n
å
i = 1

a_0iy_i

w₀

n
å
i = 1

a_0ix_iy_i

(11)

and in each case a_0i > 0, i = 1,..,n.

These are properties of the optimal hyperplane specified by ( w₀,b₀). From (14) we note that given the Lagrange multipliers, the desired weight vector solution can be found directly in terms of the training vectors.

To determine the specific coefficients of the optimal hyperplane specified by ( w₀,b₀) we proceed as follows. Substitute (13) and (14) into (9) to obtain

L_D(w,b,a) =

n
å
i = 1

a_i-

n
å
i = 1

n
å
j = 1

a_ia_jy_iy_j( x_i^Tx_j)

(12)

It is necessary to maximize the dual form of the Lagrangian in (15) to obtain the required Lagrange multipliers. Before doing so however, consider (3) once again. We observe that for this inequality, there will only be some training vectors for which the equality holds true. That is, only for some ( x_i,y_i) will the following equation hold:

y_i[ w^Tx_i+b] = 1 i = 1,...,n

(13)

The training vectors for which this is the case, are called support vectors.

Since we have the Karush-Kühn-Tucker (KKT) conditions that a_0i > 0, i = 1,..,n and that given by (3), from the resulting Lagrangian in (9), we may write a further KKT condition

a_0i( y_i[ w₀^Tx_i+b₀]-1) = 0 i = 1,...,n

(14)

This means, that since the Lagrangian multipliers a_0i are nonzero with only the support vectors as defined in (16), the expansion of w₀ in (14) is with regard to the support vectors only.

Hence we have

w₀ =

å
i Ì S

a_0ix_iy_i

(15)

where S is the set of all support vectors in the training set. To obtain the Lagrangian multipliers a_0i, we need to maximize (15) only over the support vectors, subject to the constraints a_0i > 0, i = 1,..,n and that given in (13). This is a quadratic programming problem and can be readily solved [18]. Having obtained the Lagrangian multipliers, the weights w₀ can be found from (18).

3 Linear Support Vector Machines

3.1 Classification of Linearly Separable Data

A support vector machine which performs the task of classifying linearly separable data is defined as

f(x) = sgn{ w^Tx+b}

(16)

where w,b are found from the training set as indicated in Section ref:sec_canhyp. Hence (19) may be written as

f(x) = sgn

ì
í
î

å
i Ì S

a_0iy_i( x_i^Tx) +b₀

ü
ý
þ

(17)

where a_0i are determined from the solution of the quadratic programming problem in (15) and b₀ is found as

b₀ =

( w₀^Tx_i⁺+w₀^Tx_i^-)

(18)

where x_i⁺ and x_i^- are any input training vector examples from the positive and negative classes respectively. For greater numerical accuracy, we may also use

b₀ =

n
å
i = 1

( w₀^Tx_i⁺+ w₀^Tx_i^-)

(19)

3.2 Classification of Nonlinearly Separable Data

For the case where the data is nonlinearly separable, the above approach can be extended to find a hyperplane which minimizes the number of errors on the training set. This approach is also referred to as soft margin hyperplanes [15]. In this case, the aim is to

y_i[ w^Tx_i+b] ³ 1-x_i i = 1,...,n

(20)

where x_i > 0, i = 1,...,n. In this case, we seek to minimize to optimize

J(w,x) =

|| w||²+C

n
å
i = 1

x_i

(21)

4 Nonlinear Support Vector Machines

For some problems, improved classification results can be obtained using a nonlinear classifier. Consider (20) which is a linear classifier. A nonlinear classifier can be obtained using support vector machines as follows.

The classifer is obtained by the inner product x_i^Tx where i Ì S, the set of support vectors. However, it is not necessary to use the explicit input data to form the classifer. Instead, all that is needed is to use this inner products between the support vectors and the vectors of the feature space [18].

That is, by defining the kernel

K(x_i,x) = x_i^Tx

(22)

a nonlinear classifier can be obtained as

f(x) = sgn

ì
í
î

å
i Ì S

a_0iy_iK(x_i,x)+b₀

ü
ý
þ

(23)

Based on Mercer's theorem, it is possible to introduce a variety of kernel functions, including [18,15]:

Polynomial
The pth order polynomial kernel function is given by

K(x_i,x) =
(24)
Radial basis function

K(x_i,x) = e
(25)
where g > 0.
Multilayer networks
A multilayer network can be employed as a kernel function as follows. We have

K(x_i,x) = s( q( x_i^Tx) +f)
(26)
where s is a sigmoid function.

Note that the use of a nonlinear kernel permits a linear decision function to be used in a high dimensional feature space. We find the parameters following the same procedure as before. The Lagrange multipliers can be found by maximizing the functional

L_D(w,b,a) =

n
å
i = 1

a_i-

n
å
i = 1

n
å
j = 1

a_ia_jy_iy_jK(x_i,x)

(27)

5 Conclusions

Support vector machines offer an extremely powerful method of obtaining models for classification. They provide a mechanism for choosing the model structure in a natural manner which gives low generalization error and empirical risk.

References

[1]: G.E.P. Box and G.M. Jenkins. Time Series Analysis: Forecasting and Control. T. Holden-Day, San Francisco, 1976.
[2]: C. Burges, F. Girosi, P. Niyogi, T. Poggio, B. Schölkopf, K.K. Sung, and V. Vapnik. Choosing RBF centers with the support vector algorithm. A.I. Memo, MIT Artificial Intelligence Lab., Cambridge, MA, 1995.
[3]: C.J.C. Burges. Simplified support vector decision rules, 1996.
[4]: C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.
[5]: F. Girosi. An equivalence between sparse approximation and Support Vector Machines. Neural Computation, 1997. (in press).
[6]: S. Gunn. Support vector machines for classification and regression. ISIS technical report, Image Speech & Intelligent Systems Group, University of Southampton, 1997.
[7]: K.-R. Müller, A.J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V. Vapnik. Predicting time series with support vector machines. 1997. (in press).
[8]: M.Kurtz. Handbook of Applied Mathematics for Engineers and Scientists. McGraw Hill, New York, 1991.
[9]: S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In IEEE Workshop on Neural Networks and Signal Processing (in press), Amelia Island, FL, September 1997.
[10]: E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks and Signal Processing (in press), Amelia Island, FL, September 1997.
[11]: E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. A.I. Memo (in press) 1602, MIT A. I. Lab., 1997.
[12]: E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Proc. Computer Vision and Pattern Recognition, Puerto Rico, June 16-20 1997.
[13]: T. Poggio and F. Girosi. Notes on PCA, Regularization, Support Vector Machines and Sparsity. A.I. Memo (in press), MIT Artificial Intelligence Laboratory, 1997.
[14]: M. Schmidt. Identifying speakers with support vectors networks. In Proceedings of Interface '96, Sydney, July 1996.
[15]: B. Scholkopf. Support Vector Learning. PhD thesis, Technical University of Berlin, 1997.
[16]: B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U.M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA, 1995. AAAI Press.
[17]: B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Trans. Sign. Processing, 45:2758 - 2765, 1997.
[18]: V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[19]: V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processings Systems 9, pages 281-287, San Mateo, CA, 1997. Morgan Kaufmann Publishers.