Support Vector Machines
Classification Using Support Vector Machines
Andrew D. Back
RIKEN Brain Science Institute, Wakoshi, Saitama, Japan.
1 Introduction
Historically, classifiers have been determined by choosing a structure^{1}, and then selecting a parameter estimation
algorithm used to optimize some cost function. The structure chosen fixes
the best achievable generalization error, while the parameter estimation
algorithm optimizes the cost function with respect to the empirical risk.
There are a number of problems with this approach, however. These include:
 The model structure needs to be selected in some manner. If this is
not done correctly, then even with zero empirical risk, it is still possible
to have a large generalization error [18].
 If we wish to avoid the problem of overfitting, as indicated by the
above problem, by choosing a smaller model size or order, then it may be
difficult to fit the training data (and hence minimize the empirical risk).
 Determining a suitable learning algorithm for minimizing the
empirical risk may still be quite difficult. It may be very hard or
impossible to guarantee that the correct set of parameters.
The support vector method is a recently developed technique which is
designed for efficient multidimensional function approximation . The basic idea of support vector machines (SVMs) is
to determine a classifier or regression machine which minimizes the
empirical risk (that is, the training set error) and the confidence interval
(which corresponds to the generalization or test set error) [18].
In SVMs, the idea is to fix the empirical risk associated with an
architecture and then to use a method to minimize the generalization error.
The primary advantage of SVMs as adaptive models for binary classification
and regression is that they provide a classifier with minimal VC dimension
which implies low expected probability of generalization errors. SVMs can be
used to classify linearly separable data and nonlinearly separable data.
They can be used as nonlinear classifiers and regression machines by mapping
the input space to a high dimensional feature space. In this high
dimensional feature space, linear classification can be performed.
In the last few years, a significant amount of research has been performed
in SVMs. Learning algorithms and training methods are examined in . Methods for determining the data to
use in support vector methods has been considered in [16].
Decision rules have been considered in [3].
Applications of support vector machines to speaker identification are
considered in [14]. Time series prediction applications of
support vector machines have been considered in .
Support vector machines have been shown to have a relationship with other
recent nonlinear classification and modeling techniques such as: radial
basis function networks [17], sparse approximation
[5], PCA and regularization [13]. Support vector
machines have been used to choose radial basis function centers .
The key to understanding SVMs is to see how they introduce optimal
hyperplanes to separate classes of data in the classifiers. We review the
main concepts of SVMs in the next section.
2 How Support Vector Machines Work
2.1 Optimal Hyperplanes
Consider an mdimensional input vector
x = [x_{1},...,x_{m}]^{T} Î X Ì R^{m} and a 1dimensional output
y Î {1,1}. Let there exist n training vectors
(x_{i},y_{i}) i = 1,..,n. Hence we may write
X = [ x_{1}x_{2}¼x_{n}] or
A hyperplane capable of performing a linear separation of the training data
is described by
where w = [ w_{1}w_{2}¼w_{m}] ^{T}, w Î W Ì R^{m}.
The concept of an optimal hyperplane was proposed by Vapnik [18].
For the case where the training data is linearly separable, an optimal
hyperplane separates the data without error and the distance between the
hyperplane and the closest training points is maximal.
2.2 Canonical Hyperplanes
A canonical hyperplane is a hyperplane (in this case we consider the optimal
hyperplane) in which the parameters are normalized in a particular manner
[18].
Consider (2) which defines the general hyperplane. It is evident
that there is some redundancy in this equation as far as separating sets of
points. Suppose we have the following classes
y_{i}[ w^{T}x_{i}+b] ³ 1 i = 1,...,n 
 (3) 
where y Î [ 1,1].
One way in which we can constrain the hyperplane is to observe that on
either side of the hyperplane, we may have w^{T}x+b > 0 or
w^{T}x+b < 0. Thus, if we place the hyperplane midway
between the two closest points to the hyperplane, then we can scale w,b such that

min
i = 1..n

 w^{T}x_{i}+b = 0 
 (4) 
Now, the distance d from a point x_{i} to the hyperplane
denoted by ( w,b) is given by
d(w,b;x_{i}) = 
 w^{T}x_{i}+b
 w


 (5) 
where  w = w^{T}w [8]. By considering
two points on opposite sides of the hyperplane, the canonical hyperplane is
found by maximizing the margin




min
i;y_{i} = 1

d(w,b;x_{i})+ 
min
j;yj = 1

d(w,b;x_{j}) 
 


 
 (6) 
This implies that the minimum distance between two classes i and j is at
least [2/(  w )] [6,15].
Hence an optimization function which we seek to minimize to obtain canonical
hyperplanes, is
Normally, to find the parameters, we would minimize the training error and
there are no constraints on w,b. However, in this case, we seek
to satisfy the inequality in (3). Thus, we need to solve the
constrained optimization problem in which we seek a set of weights which
separates the classes in the usually desired manner and also minimizing J(w), so that the margin between the classes is also maximized.
Thus, we obtain a classifier with optimally separating hyperplanes.
2.3 An SVM Learning Rule
For any given data set, how can we determine w_{0},b_{0}
such that (8) is minimized? One possible method to use would be a
constrained form of gradient descent.
In this case, a
gradient descent algorithm is used to minimize the cost function J(w), while constraining the changes in the parameters according to (
3). A better approach to this problem however, is to use Langrange
multipliers which is well suited to the nonlinear constraints of (3). Thus, we introduce the Lagrangian
L(w,b,a) = 
1
2

 w^{2} 
n å
i = 1

a_{i}( y_{i}[ w^{T}x_{i}+b] 1) 
 (8) 
where a_{i} are the Lagrange multipliers and a_{i} > 0.
The solution is found by maximizing L with respect to a_{i} and
minimizing it with respect to the primal variables w and b . This problem may be transformed from the primal case into its
dual [6] and hence we need to solve
At the solution point, we have the following conditions
where solution variables w_{0},b_{0},a_{0} are found.
Performing the differentiations, we obtain respectively,



 


n å
i = 1

a_{0i}x_{i}y_{i} 

 
 (11) 
and in each case a_{0i} > 0, i = 1,..,n.
These are properties of the optimal hyperplane specified by ( w_{0},b_{0}). From (14) we note that given the
Lagrange multipliers, the desired weight vector solution can be found
directly in terms of the training vectors.
To determine the specific coefficients of the optimal hyperplane specified
by ( w_{0},b_{0}) we proceed as follows. Substitute (13) and (14) into (9) to obtain
L_{D}(w,b,a) = 
n å
i = 1

a_{i} 
1
2


n å
i = 1


n å
j = 1

a_{i}a_{j}y_{i}y_{j}( x_{i}^{T}x_{j}) 
 (12) 
It is necessary to maximize the dual form of the Lagrangian in (15) to obtain the required Lagrange multipliers. Before doing so however,
consider (3) once again. We observe that for this inequality,
there will only be some training vectors for which the equality holds true.
That is, only for some ( x_{i},y_{i}) will the
following equation hold:
y_{i}[ w^{T}x_{i}+b] = 1 i = 1,...,n 
 (13) 
The training vectors for which this is the case, are called support
vectors.
Since we have the KarushKühnTucker (KKT) conditions that
a_{0i} > 0, i = 1,..,n and that given by (3), from the resulting
Lagrangian in (9), we may write a further KKT condition
a_{0i}( y_{i}[ w_{0}^{T}x_{i}+b_{0}]1) = 0 i = 1,...,n 
 (14) 
This means, that since the Lagrangian multipliers a_{0i} are nonzero
with only the support vectors as defined in (16), the expansion of
w_{0} in (14) is with regard to the support vectors
only.
Hence we have
w_{0} = 
å
i Ì S

a_{0i}x_{i}y_{i} 
 (15) 
where S is the set of all support vectors in the training set. To obtain
the Lagrangian multipliers a_{0i}, we need to maximize (15) only over the support vectors, subject to the constraints a_{0i} > 0,
i = 1,..,n and that given in (13). This is a quadratic
programming problem and can be readily solved [18]. Having
obtained the Lagrangian multipliers, the weights w_{0} can be
found from (18).
3 Linear Support Vector Machines
3.1 Classification of Linearly Separable Data
A support vector machine which performs the task of classifying linearly
separable data is defined as
where w,b are found from the training set as indicated in Section
ref:sec_canhyp. Hence (19) may be written as
f(x) = sgn 
ì í
î


å
i Ì S

a_{0i}y_{i}( x_{i}^{T}x) +b_{0} 
ü ý
þ


 (17) 
where a_{0i} are determined from the solution of the quadratic
programming problem in (15) and b_{0} is found as
b_{0} = 
1
2

( w_{0}^{T}x_{i}^{+}+w_{0}^{T}x_{i}^{}) 
 (18) 
where x_{i}^{+} and x_{i}^{} are any input training
vector examples from the positive and negative classes respectively. For
greater numerical accuracy, we may also use
b_{0} = 
1
2n


n å
i = 1

( w_{0}^{T}x_{i}^{+}+ w_{0}^{T}x_{i}^{}) 
 (19) 
3.2 Classification of Nonlinearly Separable Data
For the case where the data is nonlinearly separable, the above approach can
be extended to find a hyperplane which minimizes the number of errors on the
training set. This approach is also referred to as soft margin hyperplanes
[15]. In this case, the aim is to
y_{i}[ w^{T}x_{i}+b] ³ 1x_{i} i = 1,...,n 
 (20) 
where x_{i} > 0, i = 1,...,n. In this case, we seek to minimize to
optimize
J(w,x) = 
1
2

 w^{2}+C 
n å
i = 1

x_{i} 
 (21) 
4 Nonlinear Support Vector Machines
For some problems, improved classification results can be obtained using a
nonlinear classifier. Consider (20) which is a linear
classifier. A nonlinear classifier can be obtained using support vector
machines as follows.
The classifer is obtained by the inner product x_{i}^{T}x
where i Ì S, the set of support vectors. However, it is not necessary to use the explicit input data to form the
classifer.
Instead, all that is needed is to use this inner
products between the support vectors and the vectors of the feature space
[18].
That is, by defining the kernel
a nonlinear classifier can be obtained as
f(x) = sgn 
ì í
î


å
i Ì S

a_{0i}y_{i}K(x_{i},x)+b_{0} 
ü ý
þ


 (23) 
Based on Mercer's theorem, it is possible to introduce a variety of kernel
functions, including [18,15]:
Note that the use of a nonlinear kernel permits a linear decision function
to be used in a high dimensional feature space. We find the parameters
following the same procedure as before. The Lagrange multipliers can be
found by maximizing the functional
L_{D}(w,b,a) = 
n å
i = 1

a_{i} 
1
2


n å
i = 1


n å
j = 1

a_{i}a_{j}y_{i}y_{j}K(x_{i},x) 
 (27) 
5 Conclusions
Support vector machines offer an extremely powerful method of obtaining
models for classification. They provide a mechanism for choosing the model
structure in a natural manner which gives low generalization error and empirical risk.
References
 [1]

G.E.P. Box and G.M. Jenkins.
Time Series Analysis: Forecasting and Control.
T. HoldenDay, San Francisco, 1976.
 [2]

C. Burges, F. Girosi, P. Niyogi, T. Poggio, B. Schölkopf, K.K. Sung, and
V. Vapnik.
Choosing RBF centers with the support vector algorithm.
A.I. Memo, MIT Artificial Intelligence Lab., Cambridge, MA, 1995.
 [3]

C.J.C. Burges.
Simplified support vector decision rules, 1996.
 [4]

C. Cortes and V. Vapnik.
Support vector networks.
Machine Learning, 20:125, 1995.
 [5]

F. Girosi.
An equivalence between sparse approximation and Support Vector
Machines.
Neural Computation, 1997.
(in press).
 [6]

S. Gunn.
Support vector machines for classification and regression.
ISIS technical report, Image Speech & Intelligent Systems Group,
University of Southampton, 1997.
 [7]

K.R. Müller, A.J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V.
Vapnik.
Predicting time series with support vector machines.
1997.
(in press).
 [8]

M.Kurtz.
Handbook of Applied Mathematics for Engineers and Scientists.
McGraw Hill, New York, 1991.
 [9]

S. Mukherjee, E. Osuna, and F. Girosi.
Nonlinear prediction of chaotic time series using support vector
machines.
In IEEE Workshop on Neural Networks and Signal Processing (in
press), Amelia Island, FL, September 1997.
 [10]

E. Osuna, R. Freund, and F. Girosi.
An improved training algorithm for support vector machines.
In IEEE Workshop on Neural Networks and Signal Processing (in
press), Amelia Island, FL, September 1997.
 [11]

E. Osuna, R. Freund, and F. Girosi.
Support vector machines: Training and applications.
A.I. Memo (in press) 1602, MIT A. I. Lab., 1997.
 [12]

E. Osuna, R. Freund, and F. Girosi.
Training support vector machines: an application to face detection.
In Proc. Computer Vision and Pattern Recognition, Puerto Rico,
June 1620 1997.
 [13]

T. Poggio and F. Girosi.
Notes on PCA, Regularization, Support Vector Machines and Sparsity.
A.I. Memo (in press), MIT Artificial Intelligence Laboratory, 1997.
 [14]

M. Schmidt.
Identifying speakers with support vectors networks.
In Proceedings of Interface '96, Sydney, July 1996.
 [15]

B. Scholkopf.
Support Vector Learning.
PhD thesis, Technical University of Berlin, 1997.
 [16]

B. Schölkopf, C. Burges, and V. Vapnik.
Extracting support data for a given task.
In U.M. Fayyad and R. Uthurusamy, editors, Proceedings of the
First International Conference on Knowledge Discovery and Data Mining, Menlo
Park, CA, 1995. AAAI Press.
 [17]

B. Schölkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and
V. Vapnik.
Comparing support vector machines with gaussian kernels to radial
basis function classifiers.
IEEE Trans. Sign. Processing, 45:2758  2765, 1997.
 [18]

V. Vapnik.
The Nature of Statistical Learning Theory.
Springer, New York, 1995.
 [19]

V. Vapnik, S.E. Golowich, and A. Smola.
Support vector method for function approximation, regression
estimation, and signal processing.
In Advances in Neural Information Processings Systems 9, pages
281287, San Mateo, CA, 1997. Morgan Kaufmann Publishers.
Footnotes:
^{1} This same approach has also been used for time series modelling [1].
File translated from T_{E}X by T_{T}H, version 1.57.
