Learn to build Support Vector Machine models for classification problems with Python
Support Vector Machine
An SVM works by mapping data into a high-dimensional feature space so that the data points can be categorized even when they are not linearly separable in the original space. The data are transformed in such a way that the separator between the categories can be drawn as a hyperplane. Following this, the characteristics of a new record can be used to predict the group to which it should belong.
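As a rough illustration of this idea, the sketch below uses a small made-up one-dimensional data set (the values and the cut-off of 2.5 are assumptions chosen for the example). No single cut-off on x separates the two classes, but once each point is mapped to (x, x**2), a straight line in the new space does.

```python
import numpy as np

# Hypothetical 1-D data: class 1 sits on both sides of class 0,
# so no single threshold on x can separate the classes.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Map every point into a 2-D feature space: (x, x**2).
phi = np.column_stack([x, x ** 2])

# In the new space the horizontal line x**2 = 2.5 acts as a separating hyperplane.
threshold = 2.5
pred = (phi[:, 1] > threshold).astype(int)

print(pred)
print("separable after mapping:", bool((pred == y).all()))
```

Kernel functions let an SVM work in such a mapped space without computing the mapping explicitly.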
Advantages
SVM is a very helpful method when we know little about the data. It can be applied to data such as images, text, and audio, and to data whose distribution is irregular or unknown.
Many algorithms are used for classification in machine learning, and SVM is often among the more accurate of them.
SVM generalizes well to out-of-sample data. Prediction can also be fast: to classify one sample, the kernel function only has to be evaluated against each support vector, not against the whole training set.
SVM is relatively robust to overfitting and performs well when there is a clear margin of separation between classes. It can be used when the number of samples is smaller than the number of dimensions, and it is memory efficient because the decision function uses only a subset of the training points (the support vectors).
SVM finds a separating hyperplane, which is what allows the data to be classified correctly into different groups.
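The remark above that classifying one sample only needs kernel evaluations against the support vectors can be checked with a short sketch. It assumes scikit-learn's SVC with an RBF kernel, synthetic two-cluster data, and an arbitrary gamma of 0.5; none of these values come from the cancer example.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-cluster data (an assumption made for this illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

gamma = 0.5  # arbitrary value, fixed so the kernel can be recomputed by hand
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Score one new sample using only the support vectors.
x_new = np.array([0.5, -0.3])
sq_dist = np.sum((clf.support_vectors_ - x_new) ** 2, axis=1)
kernel_vals = np.exp(-gamma * sq_dist)  # RBF kernel against each support vector
manual = clf.dual_coef_[0] @ kernel_vals + clf.intercept_[0]

# The library's own decision value for the same sample.
library = clf.decision_function(x_new.reshape(1, -1))[0]

print("support vectors:", len(clf.support_vectors_), "of", len(X), "training points")
print("scores match:", bool(np.isclose(manual, library)))
```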
Disadvantages
SVMs do not perform well on highly skewed (imbalanced) data sets, that is, training sets in which the samples of one class far outnumber those of the other. Logistic Regression, by comparison, tends to handle skewed data sets better.
SVMs are also less convenient when you have multiple classes. An SVM is inherently a binary classifier, so a multi-class problem is broken down into several binary problems, and some kind of voting mechanism then assigns each sample to one of the classes.
When the number of features is much larger than the number of training samples, SVMs can overfit unless the kernel and the regularization term are chosen carefully.
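To make the multi-class point concrete, here is a minimal sketch assuming scikit-learn's SVC and four synthetic clusters generated with make_blobs. With decision_function_shape set to 'ovo' (one-vs-one), the model exposes one score per pair of classes; these are the binary classifiers that vote on the final label.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four synthetic classes (hypothetical data, unrelated to the cancer example).
X, y = make_blobs(n_samples=120, centers=4, random_state=0)

# One-vs-one: a separate binary SVM is trained for every pair of classes.
clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

scores = clf.decision_function(X[:1])
n_classes = len(clf.classes_)

print("classes:", n_classes)
print("pairwise classifiers:", scores.shape[1])  # n_classes * (n_classes - 1) / 2
```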
SVM Applications
Image Recognition
Text Category Assignment
Spam Detection
Sentiment Analysis
Gene Expression Classification
Regression
Outlier Detection
Clustering
Python for SVM
Having finished the theory part on SVM, we are now ready to build and train an SVM model in Python. Before that, why use Python? Python is a general-purpose language that is easy to learn and has a rich set of machine learning libraries, so many kinds of models can be built with it. With that, let’s get started.
Importing Packages
In this article, we are going to build an SVM model to predict whether a patient’s cell sample is benign or malignant. We begin by importing the packages we need.
Python Implementation:
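A plausible set of imports for the steps that follow, assuming pandas, NumPy, Matplotlib, and scikit-learn are installed. The exact list is an assumption based on the functions named later in the article.

```python
import numpy as np                    # arrays for the features and labels
import pandas as pd                   # loading and cleaning the data
import matplotlib.pyplot as plt       # scatter plot and confusion matrix plot

from sklearn import svm                                # the SVM classifier (svm.SVC)
from sklearn.model_selection import train_test_split   # train/test split
from sklearn.metrics import accuracy_score, confusion_matrix  # evaluation metrics
```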
Importing Data & EDA
In this article, we are going to use a cancer dataset that contains measurements of cell samples taken from patients. Follow the code to import the data in Python.
Python Implementation:
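A minimal sketch of the import step. The file name 'cell_samples.csv', the column names other than 'Clump', 'BareNuc', 'Mit', and 'Class', and the four rows below are all assumptions; the rows are made up so that the example runs without the real file.

```python
import io
import pandas as pd

# With the real file this would be something like:
#   df = pd.read_csv("cell_samples.csv")   # file name is an assumption
# Here a few hypothetical rows stand in for the file.
csv_text = """ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
1,5,1,1,1,2,1,3,1,1,2
2,8,10,10,8,7,10,9,7,1,4
3,3,1,1,1,2,2,3,1,1,2
4,7,8,7,6,5,9,7,8,2,4
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.shape)
```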
Output:
The characteristics of the cell samples from each patient are contained in fields ‘Clump’ to ‘Mit’. The values are graded from 1 to 10, with 1 being the closest to benign. The ‘Class’ field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).
Let's look at the distribution of the classes based on clump thickness and uniformity of cell size using a scatter plot in Python.
Python Implementation:
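One way to draw such a scatter plot with Matplotlib is sketched below. The column name 'UnifSize' for uniformity of cell size, the colors, and the handful of rows are assumptions made for the illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical rows standing in for the cancer data.
df = pd.DataFrame({
    "Clump":    [5, 3, 1, 4, 8, 10, 7, 9],
    "UnifSize": [1, 1, 2, 1, 10, 7, 8, 6],
    "Class":    [2, 2, 2, 2, 4, 4, 4, 4],
})

fig, ax = plt.subplots()
# Draw one set of points per class so each gets its own color and legend entry.
for label, color, name in [(2, "tab:blue", "benign"), (4, "tab:red", "malignant")]:
    subset = df[df["Class"] == label]
    ax.scatter(subset["Clump"], subset["UnifSize"], color=color, label=name)

ax.set_xlabel("Clump")
ax.set_ylabel("UnifSize")
ax.legend()
# In an interactive session, plt.show() displays the figure.
```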
Output:
Data processing
Let’s first have a look at the data types of the variables in our cancer data. To do this, we can use the ‘dtypes’ attribute provided by the Pandas package in Python.
Python Implementation:
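A short sketch of the check, using a few made-up rows. The '?' entry in 'BareNuc' is an assumed example of a non-numeric value; it is what keeps that column from being read as integers.

```python
import pandas as pd

# Hypothetical rows; the '?' stands in for a non-numeric entry.
df = pd.DataFrame({
    "Clump":   [5, 3, 8],
    "BareNuc": ["1", "?", "10"],
    "Class":   [2, 2, 4],
})

# dtypes is an attribute, so it is accessed without parentheses.
print(df.dtypes)
```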
Output:
It looks like the ‘BareNuc’ column includes values that are not numerical. Using Pandas’ ‘to_numeric’ and ‘astype’ functions, we can convert this object-type column into an integer column. Follow the code to convert the values in Python.
Python Implementation:
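One possible way to do the conversion is sketched below, on made-up rows. Dropping the rows whose 'BareNuc' value cannot be parsed is an assumption of this sketch; imputing a value would be another option.

```python
import pandas as pd

# Hypothetical rows; the '?' stands in for a non-numeric entry.
df = pd.DataFrame({
    "Clump":   [5, 3, 8, 1],
    "BareNuc": ["1", "?", "10", "2"],
})

# Entries that cannot be parsed as numbers become NaN.
df["BareNuc"] = pd.to_numeric(df["BareNuc"], errors="coerce")

# Drop the unparseable rows, then store the column as integers.
df = df.dropna(subset=["BareNuc"]).copy()
df["BareNuc"] = df["BareNuc"].astype("int")

print(df.dtypes)
print(df)
```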
Output:
Our next step is to define the independent variables and the dependent variable, and then use them to split our data into a training set and a testing set.
Feature Selection & Train Test Split
As I said before, we are going to define the X and Y variables. After defining them, it is highly recommended to convert them into arrays, which is the form the model expects. Follow the code to define the variables in Python.
Python Implementation:
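A minimal sketch of this step. The nine feature column names (apart from 'Clump', 'BareNuc', and 'Mit') and the three rows are assumptions made so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Assumed names for the fields from 'Clump' to 'Mit'.
feature_cols = ["Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
                "BareNuc", "BlandChrom", "NormNucl", "Mit"]

# Hypothetical rows standing in for the cleaned cancer data.
rows = [
    [5, 1, 1, 1, 2, 1, 3, 1, 1],
    [8, 10, 10, 8, 7, 10, 9, 7, 1],
    [3, 1, 1, 1, 2, 2, 3, 1, 1],
]
df = pd.DataFrame(rows, columns=feature_cols)
df["Class"] = [2, 4, 2]

X = np.asarray(df[feature_cols])  # independent variables, one row per sample
y = np.asarray(df["Class"])       # dependent variable: 2 = benign, 4 = malignant

print(X.shape, y.shape)
```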
Output:
Now we can split the data into a training set and a testing set using our defined X and Y variables. To do this, we can use the ‘train_test_split’ function provided by scikit-learn in Python.
Python Implementation:
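A sketch of the split on placeholder arrays. The 80/20 ratio and the random_state value are assumptions; the article does not state which settings were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays: 10 samples with 9 features each, labels 2 or 4.
X = np.arange(90).reshape(10, 9)
y = np.array([2, 4] * 5)

# Hold out 20% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

print("train:", X_train.shape, y_train.shape)
print("test: ", X_test.shape, y_test.shape)
```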
Output:
Modeling & Prediction
The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:
Linear
Polynomial
Radial Basis Function (RBF)
Sigmoid
Each of these functions has its own characteristics, pros and cons, and equation. As there is no easy way of knowing which function performs best with a given dataset, we usually try different functions in turn and compare the results. In this article, we are going to use the Radial Basis Function (RBF) kernel to build and train our model. Let’s do it in Python!
Python Implementation:
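A minimal sketch of building and training the model with scikit-learn's SVC class and the RBF kernel. The tiny two-feature training set is made up for the illustration; with the real data, X_train and y_train would come from the split above.

```python
import numpy as np
from sklearn import svm

# Hypothetical training data: low values are benign (2), high values malignant (4).
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]])
y_train = np.array([2, 2, 2, 4, 4, 4])

# kernel='rbf' selects the Radial Basis Function kernel.
model = svm.SVC(kernel="rbf")
model.fit(X_train, y_train)

print("kernel:", model.kernel)
print("classes:", model.classes_)
print("support vectors per class:", model.n_support_)
```

Swapping the kernel argument for 'linear', 'poly', or 'sigmoid' is how the other kernel functions listed above would be tried.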
Output:
With our trained SVM model, we can pass some test values to it and make predictions. Follow the code to make predictions in Python.
Python Implementation:
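A sketch of the prediction step, again on a tiny made-up data set rather than the cancer data. The name yhat for the predictions is an assumption.

```python
import numpy as np
from sklearn import svm

# Hypothetical training and test data with two features.
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 9], [9, 8], [9, 9]])
y_train = np.array([2, 2, 2, 4, 4, 4])
X_test = np.array([[2, 2], [8, 8]])

model = svm.SVC(kernel="rbf").fit(X_train, y_train)

# predict returns one class label (2 or 4) per row of X_test.
yhat = model.predict(X_test)
print(yhat)
```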
Output:
Evaluation
We have now built and trained our SVM model and made some predictions with it. To check the accuracy of the results, we can use the evaluation metrics provided by scikit-learn in Python. In this article, we are going to use the ‘accuracy_score’ metric and the ‘confusion_matrix’ metric to evaluate our model. Let’s start with the ‘accuracy_score’ evaluation metric.
Python Implementation:
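A minimal sketch of the accuracy calculation. The two label arrays are hypothetical; with the real model, y_test would come from the split and yhat from the prediction step.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (2 = benign, 4 = malignant).
y_test = np.array([2, 2, 4, 4, 2])
yhat = np.array([2, 2, 4, 2, 2])

# Fraction of samples whose predicted label equals the true label.
acc = accuracy_score(y_test, yhat)
print("Accuracy:", acc)  # 4 of the 5 predictions match, so 0.8
```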
Output:
Our next evaluation metric is ‘confusion_matrix’. Instead of just printing the confusion matrix, it makes more sense to plot it. Even though there are built-in functions for plotting a confusion matrix, plotting it manually makes it easier to understand. Follow the code to produce a confusion matrix plot in Python.
Python Implementation:
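One way to plot the matrix by hand with Matplotlib is sketched below. The label arrays, class names, and color map are assumptions made for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (2 = benign, 4 = malignant).
y_test = np.array([2, 2, 4, 4, 2])
yhat = np.array([2, 2, 4, 2, 2])

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_test, yhat, labels=[2, 4])
class_names = ["Benign (2)", "Malignant (4)"]

fig, ax = plt.subplots()
ax.imshow(cm, cmap="Blues")
ax.set_xticks([0, 1])
ax.set_xticklabels(class_names)
ax.set_yticks([0, 1])
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted label")
ax.set_ylabel("True label")
ax.set_title("Confusion matrix")

# Write each count into its cell.
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")

print(cm)
# In an interactive session, plt.show() displays the figure.
```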
Output:
Final Thoughts!
In this article, we walked through the basics of SVM, its advantages, and its disadvantages. We then learned to program SVM models in Python on the cancer data and made some predictions. Finally, we learned to evaluate our model using evaluation metrics in Python. The only thing we left out is the math, so don’t forget to cover it before you move on to the next concept. Also, we used only one kernel in this article; make sure you practice with the other kernels too. With that, we come to the end of this article. I’ve provided the source code for the SVM model at the end of the article.
Happy Machine Learning!
Full code: