Classification
Naive Bayes
Naive Bayes is a classification algorithm that estimates the probability of a test point belonging to each class. Unlike many other algorithms, which can serve both regression and classification, Naive Bayes is purely a classification algorithm.
As explained in the last release, Naive Bayes is, as the name suggests, inspired by Bayes' theorem:

P(A|B) = P(B|A) · P(A) / P(B)
Naive Bayes is not one algorithm; it is a family of algorithms that all follow the assumption that no two variables are dependent on each other.
For example, a car may be red and seat 2 people. These are 2 independent variables and can be used in the Naive Bayes algorithm to differentiate between a Honda and a Ferrari.
This independence condition is also a drawback of the algorithm, because in real life most features are interrelated.
Naive Bayes works well for large datasets, and despite its simplicity it often performs on par with far more sophisticated classification methods.
The main advantages are:
- It is fast and easy to understand
- It is not prone to overfitting
- It does not need much training data
There are different types of Naive Bayes classifiers and the 3 major types are:
- Bernoulli Naive Bayes: This is the case when the features are binary, taking only 2 values such as “Yes” or “No”, “True” or “False”. The data follows the Bernoulli distribution, hence the name.
- Multinomial Naive Bayes: This type follows the multinomial distribution, where each feature vector holds counts or frequencies of events. It is usually used in natural language processing to tag emails and documents into various categories.
- Gaussian Naive Bayes: This is the type used when numerical and continuous features are present. The likelihoods are calculated using the Gaussian distribution. It is also used for text classification in natural language processing, where the probability of appearance of words is used to tag the texts (a minimal scikit-learn sketch of all three variants follows below).
(Figure: Gaussian distribution graph)
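To make the three variants concrete, here is a minimal scikit-learn sketch; the toy arrays are made-up assumptions purely for illustration:

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Toy data: 4 samples, 3 features, 2 classes (illustrative only)
X_binary = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])                          # 0/1 features -> BernoulliNB
X_counts = np.array([[3, 0, 1], [0, 2, 0], [1, 4, 2], [0, 1, 5]])                          # word counts -> MultinomialNB
X_real = np.array([[5.1, 3.5, 1.4], [4.9, 3.0, 1.3], [6.2, 2.9, 4.3], [5.9, 3.0, 4.2]])    # continuous -> GaussianNB
y = np.array([0, 0, 1, 1])

print(BernoulliNB().fit(X_binary, y).predict(X_binary))
print(MultinomialNB().fit(X_counts, y).predict(X_counts))
print(GaussianNB().fit(X_real, y).predict(X_real))

All three classifiers expose the same fit/predict interface, so switching between the variants mostly comes down to matching the classifier to the type of features.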
A majority of research papers on text classification start off by using the Naive Bayes classifier as the baseline model.
Now that we have an introductory idea of how Naive Bayes works, let's dive a little deeper into the maths behind it.
Starting with Bayes' theorem:

P(y|X) = P(X|y) · P(y) / P(X)

Here X represents the independent variables while y represents the output or dependent variable.
The assumption is that all variables are completely independent of each other, hence X translates to x1, x2, x3 and so on:

P(y|x1, x2, ..., xn) = P(x1|y) · P(x2|y) · ... · P(xn|y) · P(y) / (P(x1) · P(x2) · ... · P(xn))

Since the denominator stays constant for a given input, the proportionality becomes:

P(y|x1, x2, ..., xn) ∝ P(y) · P(x1|y) · P(x2|y) · ... · P(xn|y)

Combining this for all x:

P(y|x1, x2, ..., xn) ∝ P(y) · ∏ P(xi|y)

Now the target for the Naive Bayes algorithm is to find the class that has the maximum probability for the given input, which means finding the value of y that maximises this expression. For this, the argmax operation is used:

y = argmax_y P(y) · ∏ P(xi|y)
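As a rough illustration of this argmax rule, the following sketch scores two hypothetical classes for a single test point using Gaussian likelihoods; all the numbers (priors, means, standard deviations) are assumed values for illustration only:

import numpy as np

# Hypothetical per-class statistics learned from training data
priors = np.array([0.6, 0.4])                  # P(y) for classes 0 and 1
means = np.array([[2.0, 5.0], [4.0, 1.0]])     # mean of each feature per class
stds = np.array([[1.0, 2.0], [1.5, 0.5]])      # std of each feature per class

x = np.array([3.0, 4.0])                       # a single test point

def log_gaussian(x, mean, std):
    # log of the Gaussian density N(x; mean, std)
    return -0.5 * np.log(2 * np.pi * std ** 2) - ((x - mean) ** 2) / (2 * std ** 2)

# log P(y) + sum_i log P(x_i | y) for each class, then pick the class with the highest score
scores = np.log(priors) + log_gaussian(x, means, stds).sum(axis=1)
predicted_class = np.argmax(scores)
print(scores, predicted_class)

Working in log space avoids multiplying many small probabilities together, which is how most implementations compute the same argmax in practice.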
Python Implementation for Naive Bayes
Link to run and check code:
Importing the libraries
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
Reading the dataset
dataframe = pd.read_csv("../input/titanic/train.csv")
Taking only the independent and useful data into the final data frame
# Name, Ticket, Passenger ID have almost no correlation to the outcome
final_dataframe = dataframe[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
# Drop rows with missing values (e.g. Age) so GaussianNB does not receive NaNs
final_dataframe = final_dataframe.dropna()
Labeling the values in the “Sex” column of the dataset to numbers
final_dataframe["Sex"] = final_dataframe["Sex"].replace(to_replace=final_dataframe["Sex"].unique(), value=[1, 0])
One hot encoding the dataset
One hot encoding splits categorical data into separate binary columns so that the dataset can be sent to the machine learning model. Here it is done with pandas' get_dummies function (scikit-learn's OneHotEncoder is an alternative).
# One hot encoding
final_dataframe = pd.get_dummies(final_dataframe, drop_first=True)
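For intuition, here is a small made-up example of what pd.get_dummies does with drop_first=True; the toy frame below is an assumption, not the Titanic data:

import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "Q", "C", "S"]})
# With drop_first=True, "C" becomes the baseline and only Embarked_Q / Embarked_S columns remain
print(pd.get_dummies(toy, drop_first=True))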
Creating the training and testing datasets
train_y = final_dataframe["Survived"]
train_x = final_dataframe[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']]
Train Test Split for the data
from sklearn.model_selection import train_test_split
train_data, val_data, train_target, val_target = train_test_split(train_x, train_y, train_size=0.8)
Fitting the training data into the model and then using it to predict on the validation dataset.
model = GaussianNB()
model.fit(train_data, train_target)
val_pred = model.predict(val_data)
Calculating Accuracy
from sklearn.metrics import accuracy_score
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(val_target, val_pred)))
This was a sample run on the famous Titanic dataset, where Naive Bayes reaches an accuracy of about 76.9%.
Things that can be done for better usage of the Naive Bayes Model:
- Transform continuous features towards a Gaussian distribution if they do not already follow one (see the sketch after this list).
- When needed, apply smoothing techniques like Laplace correction before predicting the class of the test dataset.
- Remove interdependent features.
- Ensembling techniques like bagging and boosting are not applicable to Naive Bayes, because these methods work by minimising variance and Naive Bayes has no variance to minimise.
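As a rough sketch of the first two points, scikit-learn's PowerTransformer can push a skewed continuous feature towards a Gaussian shape, and the alpha parameter of MultinomialNB (or BernoulliNB) applies Laplace smoothing; the arrays below are assumptions for illustration only:

import numpy as np
from sklearn.preprocessing import PowerTransformer
from sklearn.naive_bayes import MultinomialNB

# Skewed continuous feature -> transform it towards a Gaussian shape
skewed = np.random.exponential(scale=2.0, size=(100, 1))
gaussian_like = PowerTransformer().fit_transform(skewed)

# Laplace (add-one) smoothing for count features via the alpha parameter
X_counts = np.array([[3, 0, 1], [0, 2, 0], [1, 4, 2], [0, 1, 5]])
y = np.array([0, 0, 1, 1])
model = MultinomialNB(alpha=1.0)   # alpha=1.0 is the classic Laplace correction
model.fit(X_counts, y)
print(model.predict(X_counts))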