Supervised learning: Support Vector Machines (SVM)

Support Vector Machines (SVM) are a robust and versatile family of supervised machine learning algorithms, used for both classification and regression tasks but best known for classification. Originally developed in the 1960s and refined in the 1990s, SVMs have become a fundamental tool in the data scientist’s toolkit because they can fit both linear and non-linear decision boundaries, depending on the kernel used.

How SVM Works

The main idea of SVM is to find the optimal hyperplane that best separates the data into its respective classes. For two-dimensional data, this hyperplane is simply a line dividing the plane into two parts, with one class lying on each side.

Key Components:

  • Hyperplane: This is the decision boundary that separates the different classes in the feature space. In SVM, the aim is to find the hyperplane with the maximum margin, i.e., the greatest distance to the nearest data points of either class.
  • Support Vectors: These are the data points that lie closest to the hyperplane and determine its position and orientation. SVMs are named after these points: moving or removing a support vector can change the hyperplane, while the remaining points have no influence on it.
  • Margin: This is the gap between the hyperplane and the closest points of each class; SVM chooses the hyperplane that makes this gap as wide as possible (see the sketch after this list).
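
As a minimal illustration of these components, the sketch below (assuming scikit-learn's SVC and a made-up toy dataset, neither of which comes from the original text) fits a linear SVM and reads off the hyperplane's normal vector, the support vectors, and the resulting margin width.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data: two small clusters.
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

w = clf.coef_[0]                         # normal vector of the hyperplane w·x + b = 0
b = clf.intercept_[0]                    # bias term of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("hyperplane normal:", w)
print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```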

Types of SVM

  1. Linear SVM: Linear SVM is used for linearly separable data, where the data can be separated using a straight line (or hyperplane in higher dimensions).
  2. Non-linear SVM: When data is not linearly separable, SVM uses a method called the kernel trick to transform the input space into a higher-dimensional space where a hyperplane can separate the data (a short comparison sketch follows this list). Common kernels include:
    • Polynomial Kernel
    • Radial Basis Function (RBF) Kernel: Often just called the Gaussian kernel.
    • Sigmoid Kernel
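
To make the difference concrete, here is a brief sketch (my own illustration, using scikit-learn's SVC and the synthetic make_circles dataset) comparing a linear SVM with kernelized SVMs on data that no straight line can separate.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: a classic example of data that is not linearly separable.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(f"{kernel:8s} test accuracy: {clf.score(X_test, y_test):.2f}")
```

On this kind of data the RBF kernel typically scores far higher than the linear kernel, which is the point of using the kernel trick.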

The Kernel Trick

The kernel trick is the key feature that allows SVM to fit a maximum-margin hyperplane in a transformed feature space. The transformation is defined implicitly by a kernel function, which computes the dot product between the images of two input vectors in that feature space. This lets the algorithm separate the data in a higher-dimensional space without ever explicitly calculating coordinates in that space, which could be computationally expensive.
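
A small numerical check of this idea, using a hand-picked degree-2 polynomial kernel and arbitrary example vectors (my own illustration): the kernel value computed in the original 2-D space matches the dot product of an explicit feature map into 3-D, which is exactly the computation the kernel trick lets us skip.

```python
import numpy as np

def phi(v):
    # Explicit feature map for the homogeneous degree-2 kernel K(x, z) = (x·z)^2.
    return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = np.dot(x, z) ** 2          # computed directly in the 2-D input space
explicit_value = np.dot(phi(x), phi(z))   # computed in the 3-D feature space

print(kernel_value, explicit_value)       # both are 16.0
```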

Training SVM

Training an SVM involves selecting a kernel type and its parameters, as well as a regularization parameter C that controls the trade-off between achieving a low error on the training data and minimizing the norm of the weights, which keeps model complexity low and improves generalization to new data.
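
In practice this selection is usually done with a cross-validated search. The sketch below (using scikit-learn's GridSearchCV, a standard built-in dataset, and a parameter grid chosen arbitrarily for illustration) tunes C and the RBF kernel width gamma together.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale the features first: SVMs are sensitive to feature scales.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": [0.001, 0.01, 0.1, 1]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```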

Applications of SVM

  • Face Detection: SVM classifies regions of an image as face or non-face and draws a bounding box around each detected face.
  • Text and Hypertext Categorization: SVMs classify textual data with high accuracy given labeled training data, and are used for tasks such as categorizing news articles, emails, and web pages.
  • Classification of Images: SVMs often achieve higher accuracy for image classification than traditional query-based search techniques.
  • Bioinformatics: Applications include protein classification and cancer classification; SVMs have been reported to classify up to 90% of compounds correctly in protein classification tasks.
  • Handwriting Recognition: SVMs are widely used to recognize handwritten characters.

Advantages of SVM

  • Effective in high dimensional spaces.
  • Still effective in cases where the number of dimensions exceeds the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: Different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.
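
As a sketch of that last point, scikit-learn's SVC accepts any callable that returns the Gram matrix between two sets of samples; the quadratic kernel below is just an arbitrary hand-written example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def my_kernel(A, B):
    # An inhomogeneous degree-2 polynomial kernel, written by hand for illustration.
    return (1.0 + A @ B.T) ** 2

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=my_kernel).fit(X, y)
print("training accuracy with the custom kernel:", clf.score(X, y))
```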

Limitations of SVM

  • If the number of features is much greater than the number of samples, the method is likely to perform poorly.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the sketch below).
  • They are sensitive to the type of kernel used and can be quite prone to over-fitting if the kernel parameters and regularization terms are not carefully chosen.
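
To illustrate the probability point in scikit-learn terms (my choice of library and dataset, not something from the original text): probability estimates must be requested explicitly, which triggers the extra internal cross-validation and slows training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True enables Platt scaling via internal cross-validation,
# which makes training noticeably slower than a plain SVC.
clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # class probabilities for three test samples
```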