This post is meant to be a stand-alone post, but just in case you want a deeper dive into the theory behind SVMs, feel free to check out the previous two posts in this mini-series:
Binary SVMs (Part 1): https://medium.com/@ssp247/ai-sidewalk-2-delving-into-binary-svms-part-1-d304907dd520
Binary SVMs (Part 2): https://medium.com/@ssp247/ai-sidewalk-3-delving-into-binary-svms-part-2-eea83a4d4546
We will make our own Binary SVM using Python and Numpy to optimize the hinge loss function, and then train and test our model on the Wisconsin Breast Cancer Dataset. Let’s see what accuracy we can achieve, and do some further exploration of the results!
This dataset is meant to be used for predicting whether a tumor is malignant (M) or benign (B) given a set of input features such as mean radius, mean texture, mean perimeter, and mean area, amongst many more. In total the dataset has 32 columns: an ID, the diagnosis label, and 30 input features, all of which are continuous numeric quantities.
The data can be downloaded from here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Hinge loss optimization is one of the most popular modern-day methods for developing custom SVMs without use of any existing frameworks. This type of formulation has a few advantages, including:
- Relative ease of implementation
- Pretty good performance across different datasets
That being said, here is the optimization we will implement:
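For a weight vector w, training points x_i with labels y_i ∈ {−1, +1}, and regularization parameter C, the standard C-regularized hinge loss objective takes the following form:

```latex
\min_{w} \; \frac{1}{2}\|w\|^2 \;+\; C \sum_{i=1}^{n} \max\big(0,\; 1 - y_i \,(w \cdot x_i)\big)
```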
The method we will use to optimize this is known as Stochastic Gradient Descent. In pseudocode, the general Gradient Descent routine could be described as follows:
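In its generic form (here L is the loss being minimized and η the learning rate):

```
initialize w (e.g., randomly)
repeat until convergence (or for max_iterations):
    g ← ∇L(w)          # gradient of the loss at the current weights
    w ← w − η · g      # step against the gradient, scaled by learning rate η
return w
```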
Because our Hinge Loss function contains a summation over all of the training data points, we will slightly tweak the base Gradient Descent algorithm to update the weights one training point at a time (i.e., in batches of size 1) rather than once per full pass over the data. Furthermore, since the Hinge Loss function isn’t differentiable everywhere, we use its sub-gradient, which becomes the following:
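For a single training point (x_i, y_i), the standard per-sample sub-gradient of the regularized hinge loss with respect to w is:

```latex
\nabla_w J_i =
\begin{cases}
w - C\, y_i\, x_i & \text{if } y_i\,(w \cdot x_i) < 1 \\[4pt]
w & \text{otherwise}
\end{cases}
```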
We will implement this update and much more in the section below!
Step 0 — Bring in Necessary Imports
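A plausible set of imports for what follows (the exact aliases are conventional choices, not the article's originals):

```python
# Core numerics and data handling
import numpy as np
import pandas as pd

# sklearn utilities used for splitting and standardizing the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```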
Step 1a — Train-Test Splitting the Data + Mapping ‘M’ to -1 and ‘B’ to 1
Just like last time, we will settle on an 80–20 split, where 80% of the data will be used to train our SVM and the other 20% will be used for testing. Before performing this split, we make sure to re-map our labels to numeric classes. For this example, we will simply map ‘M’ to -1 and ‘B’ to 1.
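A sketch of this step, assuming the UCI column names `id` and `diagnosis`; a tiny inline frame stands in for the CSV here, which in practice would be loaded with `pd.read_csv` from the downloaded file:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv("wdbc.csv") -- one illustrative feature column
df = pd.DataFrame({
    "id": range(10),
    "diagnosis": ["M", "B"] * 5,
    "radius_mean": [17.9, 12.1, 19.6, 11.4, 20.2, 12.4, 18.2, 13.5, 21.0, 11.8],
})

# Re-map the string labels to numeric classes: 'M' -> -1, 'B' -> 1
df["diagnosis"] = df["diagnosis"].map({"M": -1, "B": 1})

X = df.drop(columns=["id", "diagnosis"]).values
y = df["diagnosis"].values

# 80% of the rows go to training, 20% to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```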
Step 1b — Standardize the input data to our Model
If we look closely at the input feature values for our model, we can see that some of them are of a much larger magnitude than others, so before utilizing them on our model we use sklearn’s StandardScaler class to normalize both the train and test feature values:
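A sketch of the standardization step; the stand-in matrices below play the role of the split WDBC feature arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrices (in the article these come from the train-test split)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=25.0, size=(8, 3))
X_test = rng.normal(loc=100.0, scale=25.0, size=(2, 3))

# Fit the scaler on the training set only, then reuse the same statistics on
# the test set so no test-set information leaks into training
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
```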
Step 2 — Produce our Model’s class
We encapsulate all of our SVM’s logic into a single Python class called SVM, which we can then instantiate whenever we wish to use an SVM for prediction! This class will have a constructor and two methods, all of which are outlined below:
- __init__: This ‘constructs’ our SVM object and we will provide it with any class parameters we wish to initialize.
- fit: This method will simply train our model, and is where we will tweak the model’s weight vector (w) based on the optimization derived above.
- predict: This method will predict labels on training data we haven’t seen before by utilizing our model’s fully tweaked weight vector.
(2a) Init Method Breakdown
In this class constructor, we simply take in three components:
- C: Our regularization term in the hinge loss optimization
- max_iterations: The maximum number of iterations we will train our model for in the fit() method
- learning_rate: The “size” of the steps we will take during each of the training iterations (More on this in a jiffy!)
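A minimal sketch of this constructor; the default values shown are placeholder assumptions, not the article's originals:

```python
class SVM:
    def __init__(self, C=1.0, max_iterations=100, learning_rate=1e-3):
        self.C = C                            # regularization strength
        self.max_iterations = max_iterations  # training passes in fit()
        self.learning_rate = learning_rate    # step size per update
        self.w = None                         # weight vector, set in fit()
```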
(2b) Fit Method Breakdown
The fit method is where the bulk of our computational work will be handled. The following is the code for this method:
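One possible reconstruction of this method, applying the per-sample sub-gradient update derived earlier; the constructor is repeated in minimal form so the sketch is self-contained, and the random seed and weight initialization scale are assumptions:

```python
import numpy as np

class SVM:
    def __init__(self, C=1.0, max_iterations=100, learning_rate=1e-3):
        self.C, self.max_iterations, self.learning_rate = C, max_iterations, learning_rate
        self.w = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        rng = np.random.default_rng(0)
        # Step 1: initialize the weights to small random values
        self.w = rng.normal(scale=0.01, size=n_features)
        for it in range(self.max_iterations):
            # Step 2: one pass over the shuffled training points (batch size 1)
            for i in rng.permutation(n_samples):
                margin = y[i] * np.dot(self.w, X[i])
                # Step 3: per-sample sub-gradient of the regularized hinge loss
                if margin < 1:
                    grad = self.w - self.C * y[i] * X[i]
                else:
                    grad = self.w
                self.w -= self.learning_rate * grad
            # Step 4: report the full loss so we can watch it decrease
            hinge = np.maximum(0, 1 - y * (X @ self.w)).sum()
            loss = 0.5 * np.dot(self.w, self.w) + self.C * hinge
            print(f"iteration {it}: loss = {loss:.4f}")
        return self
```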
As outlined by the comments, we first initialize the weights to random values (Step 1), and then continually update those weights once per training point for max_iterations iterations. Finally, the print statement at the end of each iteration is simply meant as a check that the loss is continually decreasing.
(2c) Predict Method Breakdown
The predict method is pretty simple. Here we simply utilize the trained weights to produce predicted labels for the test set. The rule used for classifying individual points is as follows:
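With our label mapping (‘B’ → +1, ‘M’ → −1), the decision rule is the sign of the point's projection onto the weight vector:

```latex
\hat{y} =
\begin{cases}
+1 & \text{if } w \cdot x \ge 0 \\[4pt]
-1 & \text{otherwise}
\end{cases}
```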
The above translated into code:
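A minimal sketch of that rule, shown here as a standalone function taking the trained weight vector rather than as a class method:

```python
import numpy as np

def predict(w, X):
    # The sign of w . x decides the class: +1 (Benign) on or above the
    # hyperplane, -1 (Malignant) below it
    return np.where(X @ w >= 0, 1, -1)
```

Inside the SVM class the same line would read `return np.where(X @ self.w >= 0, 1, -1)`, using the weights fitted earlier.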
All of the relevant code utilized for this article can be found at:
This is a log of loss decay from a sample training run of our code:
This shows that the largest decreases take place in the first rounds of training, and these improvements slowly become smaller and smaller in magnitude.
Our model achieves an impressive 96.49% accuracy on our test set, without any tuning of our hyperparameters (namely C and the learning rate)! But let’s dive a little deeper into our results.
Confusion matrices are frequently used to get a better idea of how well classification models perform. Here are a couple of key terms to know:
- True Negative: When a ‘negative’ label is correctly classified (in our case a 1, or a Benign tumor)
- True Positive: When a ‘positive’ label is correctly classified (in our case a -1, or a Malignant tumor)
- False Negative: When a ‘positive’ label is incorrectly classified as ‘negative’ (a Malignant tumor classified as Benign)
- False Positive: When a ‘negative’ label is incorrectly classified as ‘positive’ (a Benign tumor classified as Malignant)
Our model’s Confusion Matrix:
It turns out that our model produces only 2 false positives and 2 false negatives on the test data. In the case of tumor classification, we would have to assign different ‘costs’ to each of the false negative (patient having a malignant tumor but thinking it is benign) and false positive (patient having a benign tumor but thinking it is malignant) cases in order to come up with an analysis of which of these errors we would want to minimize.
What are some ways of making our model better (in addition to the above)? Well, right now we are simply using pre-defined values for our learning rate and regularization parameters, so we could definitely start by tuning these via cross-validation. Other possible extensions include adding an intercept parameter, b, to our SVM’s hyperplane calculation, and transforming the input data using kernels.
That’s it for the SVM mini-series! Next time we will branch out from the world of ML into the more general world of AI, and explore the Minimax algorithm and how we can leverage it to make decisions under uncertainty.