
svm

ai.svm

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

  • If the number of features is much greater than the number of samples, avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

Types of SVMs:

  • Linear SVM

    Linear SVMs use a linear decision boundary to separate the data points of different classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes. A hyperplane that maximizes the margin between the classes is the decision boundary.

  • Non-Linear SVM

    Non-Linear SVMs can be used to classify data that cannot be separated into two classes by a straight line (in the case of 2D). By using kernel functions, nonlinear SVMs can handle nonlinearly separable data. These kernel functions transform the original input data into a higher-dimensional feature space, where the data points can be linearly separated. A linear decision boundary found in this transformed space corresponds to a nonlinear decision boundary in the original space.
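The kernel idea can be sketched with a tiny, self-contained Python example (illustrative only, not part of the ai library): data that no straight line can separate in the original space becomes linearly separable after an explicit feature map.

```python
import numpy as np

# Two classes that no straight line in 2D can separate:
# class -1 lies inside the unit circle, class +1 outside it.
inner = np.array([[0.3, 0.2], [-0.4, 0.1], [0.0, -0.5]])   # label -1
outer = np.array([[1.5, 0.0], [-1.2, 1.1], [0.0, -2.0]])   # label +1

# Explicit feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2).
# In the mapped 3D space the classes ARE linearly separable:
# the plane z = 1 splits them.
def phi(X):
  return np.c_[X, (X ** 2).sum(axis=1)]

z_inner = phi(inner)[:, 2]   # squared radii of the inner class
z_outer = phi(outer)[:, 2]   # squared radii of the outer class

# A single threshold on the new coordinate separates the classes.
print(z_inner.max() < 1 < z_outer.min())   # True
```

Kernel functions achieve the same effect implicitly, without ever computing the mapped coordinates.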

The following SVM implementations are included in ai:

  • ai.svm.classes.LinearSVC

LinearSVC

Linear SVC classifier implementation.

Linear SVMs use a linear decision boundary to separate the data points of different classes. When the data can be precisely linearly separated, linear SVMs are very suitable. This means that a single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data points into their respective classes. A hyperplane that maximizes the margin between the classes is the decision boundary.

The objective of training a LinearSVC is to minimize the norm of the weight vector \(||w||\), which is the slope of the decision function.

\[\min_{{w, b}} \dfrac{1}{2} w^{T}w = \dfrac{1}{2} ||w||^{2}\]
\[\mathrm{subject\ to}\ t^{i} (w^{T}x^{i} + b) \ge 1\]
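A brief justification for why minimizing \(||w||\) maximizes the margin: the two margin hyperplanes are \(w^{T}x + b = 1\) and \(w^{T}x + b = -1\), and the distance between two parallel hyperplanes \(w^{T}x + b = c_{1}\) and \(w^{T}x + b = c_{2}\) is \(|c_{1} - c_{2}|/||w||\), so

\[\mathrm{margin} = \dfrac{2}{||w||}\]

and a smaller \(||w||\) gives a wider margin.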

Note:

We are minimizing \(\dfrac{1}{2}w^{T}w\), which is equal to \(\dfrac{1}{2}||w||^{2}\), rather than minimizing \(||w||\). Indeed, \(\dfrac{1}{2}||w||^{2}\) has a nice and simple derivative (just \(w\)) while \(||w||\) is not differentiable at \(w = 0\). Optimization algorithms work much better on differentiable functions.

The cost function used by Linear SVM classifier is the following:

\[J(w, b) = \dfrac{1}{2}w^{T}w + C\sum_{i=1}^{m}\max(0, 1 - t^{i}(w^{T}x^{i} + b))\]

The loss function used is the Hinge Loss function that clips the value at \(0\):

\[\max(0, 1 - t^{i}(w^{T}x^{i} + b))\]
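As a quick numerical illustration (a minimal sketch, independent of the library source), the hinge loss can be evaluated directly with NumPy:

```python
import numpy as np

def hinge_loss(w, b, X, t):
  """Mean hinge loss max(0, 1 - t * (w.x + b)) over all samples."""
  margins = t * (X @ w + b)
  return np.maximum(0.0, 1.0 - margins).mean()

w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0],    # confidently correct: margin  2.0 -> loss 0
              [0.5, 0.0],    # inside the margin:   margin  0.5 -> loss 0.5
              [-1.0, 0.0]])  # misclassified:       margin -1.0 -> loss 2.0
t = np.array([1, 1, 1])

print(hinge_loss(w, b, X, t))  # (0 + 0.5 + 2.0) / 3
```

Note that correctly classified points with a margin of at least 1 contribute nothing to the loss; only margin violations are penalized.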
Source code in ai/svm/classes.py
class LinearSVC:
  """**Linear SVC classifier** implementation.

  Linear SVMs use a linear decision boundary to separate the data points of
  different classes. When the data can be precisely linearly separated, linear
  SVMs are very suitable. This means that a single straight line (in 2D) or a
  hyperplane (in higher dimensions) can entirely divide the data points into
  their respective classes. A hyperplane that maximizes the margin between the
  classes is the decision boundary.

  The objective of training a `LinearSVC` is to minimize the norm of the weight
  vector $||w||$, which is the slope of the decision function.

  $$\\min_{{w, b}} \\dfrac{1}{2} w^{T}w = \\dfrac{1}{2} ||w||^{2}$$

  $$\\mathrm{subject\\ to}\\ t^{i} (w^{T}x^{i} + b) \\ge 1$$

  Note:

    We are minimizing $\\dfrac{1}{2}w^{T}w$, which is equal to
    $\\dfrac{1}{2}||w||^{2}$, rather than minimizing $||w||$.
    Indeed, $\\dfrac{1}{2}||w||^{2}$ has a nice and simple derivative
    (just $w$) while $||w||$ is not differentiable at $w = 0$.
    Optimization algorithms work much better on differentiable functions.

  The cost function used by **Linear SVM classifier** is the following:

  $$J(w, b) = \\dfrac{1}{2}w^{T}w +
  C\\sum_{i=1}^{m}\\max(0, 1 - t^{i}(w^{T}x^{i} + b))$$

  The loss function used is the **Hinge Loss** function that clips the value at
  $0$:

  $$\\max(0, 1 - t^{i}(w^{T}x^{i} + b))$$
  """

  def __init__(
    self,
    *,
    alpha: Optional[np.float16] = .01,
    lambda_p: Optional[np.float32] = .01,
    n_iters: Optional[np.int64] = 1_000
  ):
    """Initializes model's `alpha`, `lambda parameter` and number of
    `iterations`.

    Args:
      alpha: Model's learning rate. A high value might overshoot the minimum
        loss, while a low value might make the model take forever to learn.
      lambda_p: Lambda parameter for updating weights.
      n_iters: Maximum number of updates to make over the weights and bias in
        order to reach an efficient prediction that minimizes the loss.
    """
    self._alpha = alpha
    self._lambda_p = lambda_p
    self._n_iters = n_iters

    self._bias = None
    self._weights = None

  def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearSVC':  # pylint: disable=invalid-name
    """Fit `LinearSVC` according to X, y.

    The objective of training a `LinearSVC` is to minimize the norm of the
    weight vector $||w||$, which is the slope of the decision function.

    $$\\min_{{w, b}} \\dfrac{1}{2} w^{T}w = \\dfrac{1}{2} ||w||^{2}$$

    $$\\mathrm{subject\\ to}\\ t^{i} (w^{T}x^{i} + b) \\ge 1$$

    Note:

      We are minimizing $\\dfrac{1}{2}w^{T}w$, which is equal to
      $\\dfrac{1}{2}||w||^{2}$, rather than minimizing $||w||$.
      Indeed, $\\dfrac{1}{2}||w||^{2}$ has a nice and simple derivative
      (just $w$) while $||w||$ is not differentiable at
      $w = 0$. Optimization algorithms work much better on differentiable
      functions.

    The cost function used by **Linear SVM classifier** is the following:

    $$J(w, b) = \\dfrac{1}{2}w^{T}w +
    C\\sum_{i=1}^{m}\\max(0, 1 - t^{i}(w^{T}x^{i} + b))$$

    The loss function used is the **Hinge Loss** function that clips the value
    at $0$:

    $$\\max(0, 1 - t^{i}(w^{T}x^{i} + b))$$

    Args:
      X: Training vectors, where `n_samples` is the number of samples and
        `n_features` is the number of features.
      y: Target values.

    Returns:
      Returns the instance itself.
    """
    self._bias = 0
    self._weights = np.random.rand(X.shape[1])

    # We need the decision function to be greater than 1 for all positive
    # training instances, and lower than -1 for all negative training instances.
    # If we define ``t_i = -1`` for negative instances, and ``t_i = 1`` for
    # positive instances, then we can achieve fewer margin violations.
    t = np.where(y <= 0, -1, 1)
    for _ in range(self._n_iters):
      for idx, x_i in enumerate(X):
        # Since the cost function is different on either side of the margin, we
        # update the weights and bias differently.
        if t[idx] * (np.dot(x_i, self._weights) - self._bias) >= 1:
          self._weights -= self._alpha * (2 * self._lambda_p * self._weights)
        else:
          self._weights -= self._alpha * (
            2 * self._lambda_p * self._weights - np.dot(t[idx], x_i)
          )
          self._bias -= self._alpha * t[idx]
    return self

  def predict(self, X: np.ndarray) -> np.ndarray:  # pylint: disable=invalid-name
    """Predict for `X` using the previously calculated `weights` and `bias`.

    Args:
      X: Feature vector.

    Returns:
      Target vector.

    Raises:
      RuntimeError: If `predict` is called before `fit`.
      ValueError: If shape of the given `X` differs from the shape of the `X`
        given to the `fit` function.
    """
    if self._weights is None or self._bias is None:
      raise RuntimeError(
        f'{self.__class__.__name__}: predict called before fitting data'
      )

    if X.shape[1] != self._weights.shape[0]:
      raise ValueError(
        (
          f'Number of features {X.shape[1]} does not match previous data '
          f'{self._weights.shape[0]}.'
        )
      )

    return np.sign(np.dot(X, self._weights) - self._bias)
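The training loop above can be exercised with a compact, self-contained sketch on toy data (this reimplements the same update rule inline for illustration; it is not the library class itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label 1 in the upper-right, 0 in the lower-left.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, 0, 0])

alpha, lambda_p, n_iters = 0.01, 0.01, 1000
t = np.where(y <= 0, -1, 1)          # remap labels to {-1, +1}
w = rng.random(X.shape[1])
b = 0.0

for _ in range(n_iters):
  for x_i, t_i in zip(X, t):
    if t_i * (x_i @ w - b) >= 1:     # outside the margin: only shrink w
      w -= alpha * (2 * lambda_p * w)
    else:                            # margin violation: pull w, b toward x_i
      w -= alpha * (2 * lambda_p * w - t_i * x_i)
      b -= alpha * t_i

print(np.sign(X @ w - b))            # predictions in {-1, +1}
```

After training, the sign of the decision function recovers the remapped labels for all four points.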

__init__(*, alpha=0.01, lambda_p=0.01, n_iters=1000)

Initializes model's alpha, lambda parameter and number of iterations.

Parameters:

  • alpha (Optional[np.float16], default 0.01): Model's learning rate. A high value might overshoot the minimum loss, while a low value might make the model take forever to learn.
  • lambda_p (Optional[np.float32], default 0.01): Lambda parameter for updating weights.
  • n_iters (Optional[np.int64], default 1000): Maximum number of updates to make over the weights and bias in order to reach an efficient prediction that minimizes the loss.
Source code in ai/svm/classes.py
def __init__(
  self,
  *,
  alpha: Optional[np.float16] = .01,
  lambda_p: Optional[np.float32] = .01,
  n_iters: Optional[np.int64] = 1_000
):
  """Initializes model's `alpha`, `lambda parameter` and number of
  `iterations`.

  Args:
    alpha: Model's learning rate. A high value might overshoot the minimum
      loss, while a low value might make the model take forever to learn.
    lambda_p: Lambda parameter for updating weights.
    n_iters: Maximum number of updates to make over the weights and bias in
      order to reach an efficient prediction that minimizes the loss.
  """
  self._alpha = alpha
  self._lambda_p = lambda_p
  self._n_iters = n_iters

  self._bias = None
  self._weights = None

fit(X, y)

Fit LinearSVC according to X, y.

The objective of training a LinearSVC is to minimize the norm of the weight vector \(||w||\), which is the slope of the decision function.

\[\min_{{w, b}} \dfrac{1}{2} w^{T}w = \dfrac{1}{2} ||w||^{2}\]
\[\mathrm{subject\ to}\ t^{i} (w^{T}x^{i} + b) \ge 1\]

Note:

We are minimizing \(\dfrac{1}{2}w^{T}w\), which is equal to \(\dfrac{1}{2}||w||^{2}\), rather than minimizing \(||w||\). Indeed, \(\dfrac{1}{2}||w||^{2}\) has a nice and simple derivative (just \(w\)) while \(||w||\) is not differentiable at \(w = 0\). Optimization algorithms work much better on differentiable functions.

The cost function used by Linear SVM classifier is the following:

\[J(w, b) = \dfrac{1}{2}w^{T}w + C\sum_{i=1}^{m}\max(0, 1 - t^{i}(w^{T}x^{i} + b))\]

The loss function used is the Hinge Loss function that clips the value at \(0\):

\[\max(0, 1 - t^{i}(w^{T}x^{i} + b))\]

Parameters:

  • X (np.ndarray): Training vectors, where n_samples is the number of samples and n_features is the number of features. Required.
  • y (np.ndarray): Target values. Required.

Returns:

  • LinearSVC: The instance itself.

Source code in ai/svm/classes.py
def fit(self, X: np.ndarray, y: np.ndarray) -> 'LinearSVC':  # pylint: disable=invalid-name
  """Fit `LinearSVC` according to X, y.

  The objective of training a `LinearSVC` is to minimize the norm of the
  weight vector $||w||$, which is the slope of the decision function.

  $$\\min_{{w, b}} \\dfrac{1}{2} w^{T}w = \\dfrac{1}{2} ||w||^{2}$$

  $$\\mathrm{subject\\ to}\\ t^{i} (w^{T}x^{i} + b) \\ge 1$$

  Note:

    We are minimizing $\\dfrac{1}{2}w^{T}w$, which is equal to
    $\\dfrac{1}{2}||w||^{2}$, rather than minimizing $||w||$.
    Indeed, $\\dfrac{1}{2}||w||^{2}$ has a nice and simple derivative
    (just $w$) while $||w||$ is not differentiable at
    $w = 0$. Optimization algorithms work much better on differentiable
    functions.

  The cost function used by **Linear SVM classifier** is the following:

  $$J(w, b) = \\dfrac{1}{2}w^{T}w +
  C\\sum_{i=1}^{m}\\max(0, 1 - t^{i}(w^{T}x^{i} + b))$$

  The loss function used is the **Hinge Loss** function that clips the value
  at $0$:

  $$\\max(0, 1 - t^{i}(w^{T}x^{i} + b))$$

  Args:
    X: Training vectors, where `n_samples` is the number of samples and
      `n_features` is the number of features.
    y: Target values.

  Returns:
    Returns the instance itself.
  """
  self._bias = 0
  self._weights = np.random.rand(X.shape[1])

  # We need the decision function to be greater than 1 for all positive
  # training instances, and lower than -1 for all negative training instances.
  # If we define ``t_i = -1`` for negative instances, and ``t_i = 1`` for
  # positive instances, then we can achieve fewer margin violations.
  t = np.where(y <= 0, -1, 1)
  for _ in range(self._n_iters):
    for idx, x_i in enumerate(X):
      # Since the cost function is different on either side of the margin, we
      # update the weights and bias differently.
      if t[idx] * (np.dot(x_i, self._weights) - self._bias) >= 1:
        self._weights -= self._alpha * (2 * self._lambda_p * self._weights)
      else:
        self._weights -= self._alpha * (
          2 * self._lambda_p * self._weights - np.dot(t[idx], x_i)
        )
        self._bias -= self._alpha * t[idx]
  return self

predict(X)

Predict for X using the previously calculated weights and bias.

Parameters:

  • X (np.ndarray): Feature vector. Required.

Returns:

  • np.ndarray: Target vector.

Raises:

  • RuntimeError: If predict is called before fit.
  • ValueError: If the shape of the given X differs from the shape of the X given to the fit function.

Source code in ai/svm/classes.py
def predict(self, X: np.ndarray) -> np.ndarray:  # pylint: disable=invalid-name
  """Predict for `X` using the previously calculated `weights` and `bias`.

  Args:
    X: Feature vector.

  Returns:
    Target vector.

  Raises:
    RuntimeError: If `predict` is called before `fit`.
    ValueError: If shape of the given `X` differs from the shape of the `X`
      given to the `fit` function.
  """
  if self._weights is None or self._bias is None:
    raise RuntimeError(
      f'{self.__class__.__name__}: predict called before fitting data'
    )

  if X.shape[1] != self._weights.shape[0]:
    raise ValueError(
      (
        f'Number of features {X.shape[1]} does not match previous data '
        f'{self._weights.shape[0]}.'
      )
    )

  return np.sign(np.dot(X, self._weights) - self._bias)

classes

Linear Support Vector Classification.


kernels

Common kernel implementations for Support Vector Machines.

grbf(x, x_prime, sigma)

Implements Gaussian Radial Basis Function (RBF).

The Gaussian RBF is a commonly used kernel function in Support Vector Machines and other machine learning algorithms. It's defined as:

\[K(x, x') = e^{-\dfrac{||x - x'||^{2}}{2\sigma^{2}}}\]

Here's the derivation:

  • The Gaussian RBF is defined as a function of two data points, \(x\) and \(x'\), with a parameter \(\sigma\) that controls the width of the kernel.
  • We start by computing the Euclidean distance between the two data points \(x\) and \(x'\):
\[||x - x'||^{2} = \sum_{i=1}^{n}(x_{i} - x'_{i})^2\]

Now, let's insert this distance into the Gaussian RBF formula:

\[K(x, x') = e^{-\dfrac{||x - x'||^{2}}{2\sigma^{2}}}\]

We have an exponential term, and we can simplify it further:

\[e^{-\dfrac{||x - x'||^{2}}{2\sigma^{2}}} = e^{-\dfrac{1}{2\sigma^{2}} \sum_{i=1}^{n}(x_{i} - x'_{i})^2}\]

Using the property \(e^{a + b} = e^{a}e^{b}\), we can separate the exponential factors:

\[K(x, x') = e^{-\dfrac{1}{2\sigma^{2}} \sum_{i=1}^{n}(x_{i} - x'_{i})^2} = \prod_{i=1}^{n}e^{-\dfrac{(x_{i} - x'_{i})^{2}}{2\sigma^{2}}}\]

This is the mathematical form of the Gaussian RBF. It measures the similarity between two data points \(x\) and \(x'\) based on the Euclidean distance between them, with the parameter \(\sigma\) controlling the width of the kernel.

Parameters:

  • x (np.ndarray): The numpy vector \(x\). Required.
  • x_prime (np.ndarray): The numpy vector \(x'\). Required.
  • sigma (np.float32): The parameter \(\sigma\) that controls the width of the kernel. Required.

Returns:

  • np.ndarray: RBF value.

Examples:

>>> # Example data points
>>> x1 = np.array([1, 2, 3])
>>> x2 = np.array([4, 5, 6])
>>> # Set the width parameter
>>> sigma = 1.0
>>> # Calculate the RBF value
>>> rbf_result = grbf(x1, x2, sigma)
>>> print("Gaussian RBF Value:", rbf_result)

Source code in ai/svm/kernels.py
def grbf(x: np.ndarray, x_prime: np.ndarray, sigma: np.float32) -> np.ndarray:
  """Implements Gaussian Radial Basis Function (RBF).

  The Gaussian RBF is a commonly used kernel function in Support Vector Machines
  and other machine learning algorithms. It's defined as:

  $$K(x, x') = e^{-\\dfrac{||x - x'||^{2}}{2\\sigma^{2}}}$$

  Here's the derivation:

  * The Gaussian RBF is defined as a function of two data points, $x$ and
  $x'$, with a parameter $\\sigma$ that controls the width of the
  kernel.
  * We start by computing the Euclidean distance between the two data points
  $x$ and $x'$:

  $$||x - x'||^{2} = \\sum_{i=1}^{n}(x_{i} - x'_{i})^2$$

  Now, let's insert this distance into the Gaussian RBF formula:

  $$K(x, x') = e^{-\\dfrac{||x - x'||^{2}}{2\\sigma^{2}}}$$

  We have an exponential term, and we can simplify it further:

  $$e^{-\\dfrac{||x - x'||^{2}}{2\\sigma^{2}}} =
  e^{-\\dfrac{1}{2\\sigma^{2}} \\sum_{i=1}^{n}(x_{i} - x'_{i})^2}$$

  Using the property $e^{a + b} = e^{a}e^{b}$, we can separate the
  exponential factors:

  $$K(x, x') = e^{-\\dfrac{1}{2\\sigma^{2}}
  \\sum_{i=1}^{n}(x_{i} - x'_{i})^2} =
  \\prod_{i=1}^{n}e^{-\\dfrac{(x_{i} - x'_{i})^{2}}{2\\sigma^{2}}}$$

  This is the mathematical form of the Gaussian RBF. It measures the
  similarity between two data points $x$ and $x'$ based on the
  Euclidean distance between them, with the parameter $\\sigma$
  controlling the width of the kernel.

  Args:
    x: The numpy vector $x$.
    x_prime: The numpy vector $x'$.
    sigma: The parameter $\\sigma$ that controls the width of the kernel.

  Returns:
    RBF value.

  Examples:

  >>> # Example data points
  >>> x1 = np.array([1, 2, 3])
  >>> x2 = np.array([4, 5, 6])
  ...
  >>> # Set the width parameter
  >>> sigma = 1.0
  ...
  >>> # Calculate the RBF value
  >>> rbf_result = grbf(x1, x2, sigma)
  >>> print("Gaussian RBF Value:", rbf_result)
  """
  sqr_dist = np.sum(np.power((x - x_prime), 2))
  grbf_val = np.exp(-sqr_dist / (2 * sigma**2))
  return grbf_val
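The formula can be verified with a quick self-contained check (this reimplements grbf inline for illustration, following the source above):

```python
import numpy as np

def grbf(x, x_prime, sigma):
  # K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))
  sqr_dist = np.sum((x - x_prime) ** 2)
  return np.exp(-sqr_dist / (2 * sigma ** 2))

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([4.0, 5.0, 6.0])

# ||x1 - x2||^2 = 9 + 9 + 9 = 27, so K = exp(-27 / 2) = exp(-13.5)
print(grbf(x1, x2, 1.0))

# Identical points give the maximum similarity of 1.
print(grbf(x1, x1, 1.0))
```

As expected for a similarity measure, the kernel value decays toward 0 as the points move apart and peaks at 1 when they coincide.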