
Classification Using Support Vector Machines

Andrew D. Back
RIKEN Brain Science Institute, Wako-shi, Saitama, Japan.

1  Introduction

Historically, classifiers have been determined by first choosing a structure¹, and then selecting a parameter estimation algorithm to optimize some cost function. The chosen structure fixes the best achievable generalization error, while the parameter estimation algorithm optimizes the cost function with respect to the empirical risk.

There are a number of problems with this approach, however. These include:

  • The model structure needs to be selected in some manner. If this is not done correctly, then even with zero empirical risk, it is still possible to have a large generalization error [18].

  • If we attempt to avoid the overfitting problem indicated above by choosing a smaller model size or order, then it may be difficult to fit the training data (and hence to minimize the empirical risk).

  • Determining a suitable learning algorithm for minimizing the empirical risk may still be quite difficult. It may be very hard or impossible to guarantee that the correct set of parameters is found.

The support vector method is a recently developed technique which is designed for efficient multidimensional function approximation [19]. The basic idea of support vector machines (SVMs) is to determine a classifier or regression machine which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error) [18].

In SVMs, the idea is to fix the empirical risk associated with an architecture and then to use a method to minimize the generalization error. The primary advantage of SVMs as adaptive models for binary classification and regression is that they provide a classifier with minimal VC dimension, which implies a low expected probability of generalization errors. SVMs can be used to classify both linearly separable and nonlinearly separable data. They can be used as nonlinear classifiers and regression machines by mapping the input space to a high-dimensional feature space in which linear classification can be performed.

In the last few years, a significant amount of research has been performed on SVMs. Learning algorithms and training methods are examined in [10,11]. Methods for determining the data to use in support vector methods have been considered in [16]. Decision rules have been considered in [3].

Applications of support vector machines to speaker identification are considered in [14]. Time series prediction applications of support vector machines have been considered in [7,9].

Support vector machines have been shown to have a relationship with other recent nonlinear classification and modeling techniques such as radial basis function networks [17], sparse approximation [5], and PCA and regularization [13]. Support vector machines have been used to choose radial basis function centers [2].

The key to understanding SVMs is to see how they construct optimal hyperplanes to separate the classes of data. We review the main concepts of SVMs in the next section.

2  How Support Vector Machines Work

2.1  Optimal Hyperplanes

Consider an m-dimensional input vector x = [x_1, ..., x_m]^T ∈ X ⊂ R^m and a one-dimensional output y ∈ {-1, 1}. Let there exist n training vectors (x_i, y_i), i = 1, ..., n. Hence we may write X = [x_1 x_2 ... x_n], or

X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix}    (1)
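
As a concrete illustration of this notation, the following short Python sketch (illustrative only; the data values are hypothetical) builds a small linearly separable training set in which the columns of X are the training vectors x_i and y holds the corresponding labels:

    import numpy as np

    # Toy linearly separable data: m = 2 inputs, n = 8 training vectors.
    # The columns of X are the training vectors x_i, as in equation (1).
    rng = np.random.default_rng(0)
    X_pos = rng.normal(loc=[+2.0, +2.0], scale=0.5, size=(4, 2)).T   # class +1
    X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(4, 2)).T   # class -1

    X = np.hstack([X_pos, X_neg])          # shape (m, n) = (2, 8)
    y = np.array([+1, +1, +1, +1, -1, -1, -1, -1])

    m, n = X.shape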

A hyperplane capable of performing a linear separation of the training data is described by

w^T x + b = 0    (2)
where w = [w_1 w_2 ... w_m]^T, w ∈ W ⊂ R^m.

The concept of an optimal hyperplane was proposed by Vapnik [18]. For the case where the training data is linearly separable, an optimal hyperplane separates the data without error and the distance between the hyperplane and the closest training points is maximal.

2.2  Canonical Hyperplanes

A canonical hyperplane is a hyperplane (in this case we consider the optimal hyperplane) in which the parameters are normalized in a particular manner [18].

Consider (2), which defines the general hyperplane. It is evident that there is some redundancy in this equation as far as separating sets of points is concerned. Suppose we require that the classes satisfy

y_i [ w^T x_i + b ] \geq 1, \quad i = 1, \dots, n    (3)
where y_i ∈ {-1, 1}.

One way in which we can constrain the hyperplane is to observe that on either side of the hyperplane we may have w^T x + b > 0 or w^T x + b < 0. Thus, if we place the hyperplane midway between the two closest points to the hyperplane, then we can scale w, b such that

\min_{i = 1, \dots, n} | w^T x_i + b | = 1    (4)

Now, the distance d from a point x_i to the hyperplane denoted by (w, b) is given by

d(w, b; x_i) = \frac{| w^T x_i + b |}{\| w \|}    (5)

where ||w|| = \sqrt{w^T w} [8]. By considering two points on opposite sides of the hyperplane, the canonical hyperplane is found by maximizing the margin

\rho(w, b) = \min_{i : y_i = 1} d(w, b; x_i) + \min_{j : y_j = -1} d(w, b; x_j) = \frac{2}{\| w \|}    (6)

This implies that the minimum distance between the two classes is at least 2/||w|| [6,15].

Hence the optimization function which we seek to minimize to obtain canonical hyperplanes is

J(w) = \frac{1}{2} \| w \|^2    (7)

Normally, to find the parameters we would simply minimize the training error, with no constraints on w, b. In this case, however, we seek to satisfy the inequality in (3). Thus, we need to solve a constrained optimization problem in which we seek a set of weights which separates the classes in the usually desired manner while also minimizing J(w), so that the margin between the classes is maximized. In this way we obtain a classifier with optimally separating hyperplanes.
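
To make the preceding definitions concrete, the sketch below (a minimal illustration, with hand-picked hypothetical data and hyperplane) checks the separation constraints (3) and the canonical normalization (4), and computes the margin 2/||w|| of (6):

    import numpy as np

    def distances(w, b, X):
        # d(w, b; x_i) = |w^T x_i + b| / ||w|| for each column x_i of X, as in (5)
        return np.abs(w @ X + b) / np.linalg.norm(w)

    # Hand-made example: two points per class; columns of X are training vectors.
    X = np.array([[2.0, 3.0, -2.0, -3.0],
                  [2.0, 1.0, -2.0, -1.0]])
    y = np.array([+1, +1, -1, -1])

    w, b = np.array([0.25, 0.25]), 0.0

    print(np.all(y * (w @ X + b) >= 1))      # constraints (3) hold for this w, b
    print(np.min(np.abs(w @ X + b)))         # equals 1: the canonical form (4)
    print(2.0 / np.linalg.norm(w))           # margin 2 / ||w|| from (6)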

2.3  An SVM Learning Rule

For any given data set, how can we determine w_0, b_0 such that (7) is minimized? One possible method would be a constrained form of gradient descent, in which a gradient descent algorithm is used to minimize the cost function J(w) while constraining the changes in the parameters according to (3). A better approach to this problem, however, is to use Lagrange multipliers, which are well suited to the nonlinear constraints of (3). Thus, we introduce the Lagrangian

L(w, b, \alpha) = \frac{1}{2} \| w \|^2 - \sum_{i=1}^{n} \alpha_i ( y_i [ w^T x_i + b ] - 1 )    (8)
where the α_i are the Lagrange multipliers and α_i ≥ 0.

The solution is found by maximizing L with respect to the α_i and minimizing it with respect to the primal variables w and b. This problem may be transformed from the primal case into its dual [6], and hence we need to solve


\max_{\alpha} \min_{w, b} L(w, b, \alpha)    (9)

At the solution point, we have the following conditions

\frac{\partial L(w_0, b_0, \alpha_0)}{\partial w} = 0, \qquad \frac{\partial L(w_0, b_0, \alpha_0)}{\partial b} = 0    (10)

where the solution variables w_0, b_0, α_0 are found. Performing the differentiations, we obtain respectively

\sum_{i=1}^{n} \alpha_{0i} y_i = 0, \qquad w_0 = \sum_{i=1}^{n} \alpha_{0i} y_i x_i    (11)
and in each case α_0i ≥ 0, i = 1, ..., n.

These are properties of the optimal hyperplane specified by (w_0, b_0). From (11) we note that, given the Lagrange multipliers, the desired weight vector solution can be found directly in terms of the training vectors.

To determine the specific coefficients of the optimal hyperplane specified by (w_0, b_0) we proceed as follows. Substitute the conditions (11) into the Lagrangian (8) to obtain

L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j ( x_i^T x_j )    (12)

It is necessary to maximize the dual form of the Lagrangian in (12) to obtain the required Lagrange multipliers. Before doing so, however, consider (3) once again. We observe that for this inequality there will be only some training vectors for which the equality holds. That is, only for some (x_i, y_i) will the following equation hold:

y_i [ w^T x_i + b ] = 1    (13)
The training vectors for which this is the case are called support vectors.

Since we have the Karush-Kuhn-Tucker (KKT) conditions that α_0i ≥ 0, i = 1, ..., n, together with the constraint (3), from the resulting Lagrangian in (8) we may write a further KKT condition

\alpha_{0i} ( y_i [ w_0^T x_i + b_0 ] - 1 ) = 0, \quad i = 1, \dots, n    (14)

This means that, since the Lagrange multipliers α_0i are nonzero only for the support vectors defined by (13), the expansion of w_0 in (11) is with regard to the support vectors only.

Hence we have

w_0 = \sum_{i \in S} \alpha_{0i} y_i x_i    (15)
where S is the set of all support vectors in the training set. To obtain the Lagrange multipliers α_0i, we maximize the dual (12) subject to the constraints α_0i ≥ 0, i = 1, ..., n, and the equality constraint in (11). This is a quadratic programming problem and can be readily solved [18]. Having obtained the Lagrange multipliers, the weights w_0 can be found from (15).
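
As a numerical check on this procedure (not part of the original text), the sketch below solves the dual (12) for a small hypothetical data set using scipy.optimize.minimize with the SLSQP method, then recovers w_0 from (15) and b_0 from the support-vector equality (13). A dedicated quadratic programming solver would normally be used instead; SLSQP is chosen here only to keep the example short.

    import numpy as np
    from scipy.optimize import minimize

    # Small linearly separable data set; columns of X are the training vectors x_i.
    X = np.array([[2.0, 3.0, -2.0, -3.0],
                  [2.0, 1.0, -2.0, -1.0]])
    y = np.array([+1.0, +1.0, -1.0, -1.0])
    n = X.shape[1]

    # Matrix of y_i y_j (x_i^T x_j) terms appearing in the dual (12).
    Q = (y[:, None] * y[None, :]) * (X.T @ X)

    def neg_dual(alpha):
        # Negative of L_D(alpha); minimizing this maximizes the dual (12).
        return 0.5 * alpha @ Q @ alpha - alpha.sum()

    res = minimize(neg_dual, x0=np.zeros(n), method="SLSQP",
                   bounds=[(0.0, None)] * n,                    # alpha_i >= 0
                   constraints=[{"type": "eq",
                                 "fun": lambda a: a @ y}])      # sum_i alpha_i y_i = 0
    alpha = res.x

    sv = alpha > 1e-6                                  # support vectors: alpha_0i > 0
    w0 = (alpha[sv] * y[sv] * X[:, sv]).sum(axis=1)    # expansion (15)
    i = np.argmax(alpha)                               # index of one support vector
    b0 = y[i] - w0 @ X[:, i]                           # from y_i (w0^T x_i + b0) = 1

    print(alpha.round(3), w0, b0)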

3  Linear Support Vector Machines

3.1  Classification of Linearly Separable Data

A support vector machine which performs the task of classifying linearly separable data is defined as

f(x) = \mathrm{sgn}\{ w^T x + b \}    (16)
where w, b are found from the training set as indicated in Section 2.2. Hence (16) may be written as
f(x) = \mathrm{sgn}\left\{ \sum_{i \in S} \alpha_{0i} y_i ( x_i^T x ) + b_0 \right\}    (17)
where the α_0i are determined from the solution of the quadratic programming problem in (12) and b_0 is found as
b_0 = -\frac{1}{2} ( w_0^T x_i^{+} + w_0^T x_i^{-} )    (18)
where x_i^+ and x_i^- are any support vector examples from the positive and negative classes respectively. For greater numerical accuracy, we may also use the average
b_0 = -\frac{1}{2n} \sum_{i=1}^{n} ( w_0^T x_i^{+} + w_0^T x_i^{-} )    (19)

3.2  Classification of Nonlinearly Separable Data

For the case where the data is nonlinearly separable, the above approach can be extended to find a hyperplane which minimizes the number of errors on the training set. This approach is also referred to as soft margin hyperplanes [15]. In this case, the aim is to find a hyperplane satisfying

y_i [ w^T x_i + b ] \geq 1 - \xi_i, \quad i = 1, \dots, n    (20)
where the slack variables ξ_i ≥ 0, i = 1, ..., n. In this case, the function we seek to minimize is
J(w, \xi) = \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} \xi_i    (21)
where C is a penalty parameter chosen by the user, controlling the trade-off between the size of the margin and the training errors.
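
The penalty C in (21) is the trade-off parameter exposed by most SVM libraries. The following sketch (illustrative, with hypothetical data) fits soft-margin classifiers on two overlapping classes for several values of C and reports the number of support vectors; a smaller C tolerates larger slacks ξ_i and typically yields more support vectors:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # Two overlapping Gaussian classes; one training vector per row.
    X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
                   rng.normal(-1.0, 1.0, size=(50, 2))])
    y = np.array([+1] * 50 + [-1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, int(clf.n_support_.sum()))   # number of support vectors for this C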

4  Nonlinear Support Vector Machines

For some problems, improved classification results can be obtained using a nonlinear classifier. Consider (17), which is a linear classifier. A nonlinear classifier can be obtained using support vector machines as follows.

The classifier is formed from the inner products x_i^T x, where i ∈ S, the set of support vectors. However, it is not necessary to use the explicit input data to form the classifier. Instead, all that is needed are the inner products between the support vectors and the vectors of the feature space [18].

That is, by defining the kernel

K(x_i, x) = x_i^T x    (22)
a nonlinear classifier can be obtained as
f(x) = \mathrm{sgn}\left\{ \sum_{i \in S} \alpha_{0i} y_i K(x_i, x) + b_0 \right\}    (23)

Based on Mercer's theorem, it is possible to introduce a variety of kernel functions, including the following [18,15]; simple implementations of these kernels are sketched after this list.

  • Polynomial

    The pth order polynomial kernel function is given by

    K(x_i, x) = ( x_i^T x + 1 )^p    (24)

  • Radial basis function

    K(x_i, x) = e^{ -\gamma \| x_i - x \|^2 }    (25)
    where γ > 0.

  • Multilayer networks

    A multilayer network can be employed as a kernel function as follows. We have

    K(x_i, x) = \sigma( \theta ( x_i^T x ) + \phi )    (26)
    where σ is a sigmoid function.
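
The three kernels above are straightforward to implement directly. The sketch below is illustrative only; the parameter values p, γ, θ and φ are hypothetical defaults, and tanh is used as the sigmoid σ:

    import numpy as np

    def poly_kernel(xi, x, p=3):
        # p-th order polynomial kernel, K(x_i, x) = (x_i^T x + 1)^p, as in (24)
        return (xi @ x + 1.0) ** p

    def rbf_kernel(xi, x, gamma=0.5):
        # radial basis function kernel, K(x_i, x) = exp(-gamma ||x_i - x||^2), as in (25)
        return np.exp(-gamma * np.sum((xi - x) ** 2))

    def sigmoid_kernel(xi, x, theta=1.0, phi=-1.0):
        # multilayer-network kernel of (26), with tanh used as the sigmoid
        return np.tanh(theta * (xi @ x) + phi)

    xi, x = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(poly_kernel(xi, x), rbf_kernel(xi, x), sigmoid_kernel(xi, x))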

Note that the use of a nonlinear kernel permits a linear decision function to be used in a high dimensional feature space. We find the parameters following the same procedure as before. The Lagrange multipliers can be found by maximizing the functional

L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K( x_i, x_j )    (27)
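
To see the kernel formulation in use, the sketch below (an illustration, not part of the original text) trains SVMs with linear and RBF kernels on data that is not linearly separable in the input space; the RBF machine, which internally maximizes a functional of the form (27), typically separates the classes where the linear machine cannot:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two concentric rings: not linearly separable in the input space.
    X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
    y = 2 * y - 1                                   # relabel {0, 1} as {-1, +1}

    linear = SVC(kernel="linear", C=1.0).fit(X, y)
    rbf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

    print("linear kernel training accuracy:", linear.score(X, y))
    print("RBF kernel training accuracy   :", rbf.score(X, y))
    print("RBF support vectors            :", int(rbf.n_support_.sum()))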

5  Conclusions

Support vector machines offer an extremely powerful method of obtaining models for classification. They provide a mechanism for choosing the model structure in a natural manner which gives low generalization error as well as low empirical risk.

References

[1]
G.E.P. Box and G.M. Jenkins. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, 1976.

[2]
C. Burges, F. Girosi, P. Niyogi, T. Poggio, B. Schölkopf, K.K. Sung, and V. Vapnik. Choosing RBF centers with the support vector algorithm. A.I. Memo, MIT Artificial Intelligence Lab., Cambridge, MA, 1995.

[3]
C.J.C. Burges. Simplified support vector decision rules, 1996.

[4]
C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.

[5]
F. Girosi. An equivalence between sparse approximation and Support Vector Machines. Neural Computation, 1997. (in press).

[6]
S. Gunn. Support vector machines for classification and regression. ISIS technical report, Image Speech & Intelligent Systems Group, University of Southampton, 1997.

[7]
K.-R. Müller, A.J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V. Vapnik. Predicting time series with support vector machines. 1997. (in press).

[8]
M. Kurtz. Handbook of Applied Mathematics for Engineers and Scientists. McGraw-Hill, New York, 1991.

[9]
S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In IEEE Workshop on Neural Networks and Signal Processing (in press), Amelia Island, FL, September 1997.

[10]
E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In IEEE Workshop on Neural Networks and Signal Processing (in press), Amelia Island, FL, September 1997.

[11]
E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. A.I. Memo (in press) 1602, MIT A. I. Lab., 1997.

[12]
E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Proc. Computer Vision and Pattern Recognition, Puerto Rico, June 16-20 1997.

[13]
T. Poggio and F. Girosi. Notes on PCA, Regularization, Support Vector Machines and Sparsity. A.I. Memo (in press), MIT Artificial Intelligence Laboratory, 1997.

[14]
M. Schmidt. Identifying speakers with support vectors networks. In Proceedings of Interface '96, Sydney, July 1996.

[15]
B. Schölkopf. Support Vector Learning. PhD thesis, Technical University of Berlin, 1997.

[16]
B. Schölkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In U.M. Fayyad and R. Uthurusamy, editors, Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA, 1995. AAAI Press.

[17]
B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans. Signal Processing, 45:2758-2765, 1997.

[18]
V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[19]
V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems 9, pages 281-287, San Mateo, CA, 1997. Morgan Kaufmann Publishers.


Footnotes:

1 This same approach has also been used for time series modelling [1].

