Support vector machines are algorithms that find the optimal hyperplane separating data points with different target values. Just like logistic regression models, they draw a linear decision boundary in the input space, but here geometric arguments motivate the position and orientation of that boundary. Once the location and orientation of the hyperplane are determined, it becomes the decision boundary, and new cases are scored according to which side of the boundary they fall on.

If the data points are linearly separable, an infinite number of separating hyperplanes exist, as evidenced by the varying angles shown in the diagram. The starting point for reaching a unique solution is to imagine a "fat" hyperplane between the points with different target values. This leads to the separator with the largest margin, essentially the "wiggle room" on either side. Among all separating hyperplanes, only one has the maximum margin: it is the midline of the fattest hyperplane that fits between the classes.

Data is not always linearly separable, so points that fall on the wrong side of the decision boundary incur a penalty in the algorithm. This penalty is the hinge loss function, which generalizes support vector machines to cases where the data is not linearly separable. The resulting soft-margin classifier still seeks the maximum margin, but now also minimizes the hinge loss associated with margin violations and misclassified points.

In many real-world situations, the data is so far from linearly separable that even a soft-margin classifier would make too many mistakes to be a viable solution. One solution to this problem is to transform the data into a higher-dimensional space and then find the maximum margin hyperplane in that higher dimension. Data points that are not linearly separable in lower dimensions can become linearly separable in higher dimensions, although computation grows harder as we increase the number of dimensions.
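To make the hinge loss concrete, here is a minimal NumPy sketch (the function and variable names are my own, not from any particular library): points on the correct side of the margin contribute zero loss, while points inside the margin or on the wrong side are penalized linearly.

```python
import numpy as np

def hinge_loss(y, scores):
    """Average hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1},
    where scores are the raw decision values f(x) = w.x + b."""
    return np.maximum(0.0, 1.0 - y * scores).mean()

y = np.array([1, 1, -1, -1])                # true labels
scores = np.array([2.0, 0.5, -3.0, 0.2])    # raw decision values

# Point 1 (score 2.0) and point 3 (score -3.0) are beyond the margin: zero loss.
# Point 2 (score 0.5) is inside the margin: loss 0.5.
# Point 4 (score 0.2) is misclassified: loss 1.2.
print(hinge_loss(y, scores))  # (0 + 0.5 + 0 + 1.2) / 4 = 0.425
```

A soft-margin classifier minimizes this loss summed over the training points, plus a term that rewards a wide margin; the trade-off between the two is controlled by a regularization constant.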
To manage this computation, the kernel trick converts the dot products required in the higher-dimensional space into evaluations of a kernel function in the lower-dimensional space. This nonlinear kernel function lets us find the maximum margin hyperplane in higher dimensions without ever transforming the data points into their higher-dimensional representation. Once the maximum margin hyperplane is found in the higher-dimensional space, it maps back to a nonlinear decision boundary in the lower-dimensional space, enabling support vector machines to produce nonlinear decision boundaries.
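The kernel trick can be verified numerically. For the degree-2 polynomial kernel in two dimensions, the explicit feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and the kernel (x.z)^2 computed in the original 2-D space equals the dot product phi(x).phi(z) computed in the 3-D space (a small illustrative sketch; the function names are my own):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Kernel trick: (x.z)^2 equals phi(x).phi(z) without forming phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

lhs = poly2_kernel(x, z)        # one dot product in 2 dimensions
rhs = np.dot(phi(x), phi(z))    # explicit dot product in 3 dimensions
print(lhs, rhs)                 # both equal 16.0
```

The saving is modest here, but for kernels such as the Gaussian (RBF) kernel the implicit feature space is infinite-dimensional, so evaluating the kernel in the original space is the only practical option.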