Machine Learning – Classification

Classification

Naive Bayes

Naive Bayes is a classification algorithm that computes the probability of a test point belonging to each class and assigns it to the most probable one. Unlike many algorithms that can be adapted for both regression and classification, Naive Bayes is purely a classification algorithm.

As explained in the last release, the formula below is Bayes' theorem, and as the name suggests, the Naive Bayes algorithm is inspired by it.

Image source: https://www.simplilearn.com/ice9/free_resources_article_thumb/Bayes.JPG

 

Naive Bayes is not a single algorithm; it is a family of algorithms that all follow the assumption that no two features are dependent on each other.

For example, a car may be red and may seat two people. These are two independent variables and can be used by the Naive Bayes algorithm to differentiate between a Honda and a Ferrari.

This independence condition is also one of the algorithm's drawbacks, because in real life most features are interrelated.

 

Naive Bayes works well on large datasets and, despite its simplicity, it often outperforms more sophisticated classification methods.

The main advantages are:

 

  • It is fast and easy to understand
  • It is not prone to overfitting
  • It does not need much training data

 

 

There are different types of Naive Bayes classifiers. The 3 major types are listed below, and a short scikit-learn sketch of all three follows the Gaussian distribution graph:

  1. Bernoulli Naive Bayes: This is used when the features are binary, taking only 2 values such as “Yes” or “No”, “True” or “False” etc. The data follows the Bernoulli distribution, hence the name.
  2. Multinomial Naive Bayes: This type follows the multinomial distribution, where the events are represented by count-based feature vectors (for example word counts). It is usually used in natural language processing to tag emails and documents into various categories.

Image source: https://www.mathworks.com/help/examples/stats/win64/ComputeTheMultinomialDistributionPdfExample_01.png

 

  3. Gaussian Naive Bayes: This is the type used when the features are numerical and continuous. The probabilities are calculated using the Gaussian (normal) distribution. It is also used for text classification in natural language processing, where the probability of appearance of words is used to tag the texts.

Gaussian Distribution Graph
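
To make the distinction concrete, here is a minimal scikit-learn sketch of all three variants. The toy arrays and labels are made up purely for illustration; the point is only which data shape each classifier expects.

import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

y = np.array([0, 0, 1, 1])  # toy class labels

# Bernoulli NB: binary (0/1) features
X_binary = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
print(BernoulliNB().fit(X_binary, y).predict([[1, 0]]))

# Multinomial NB: count features, e.g. word counts per document
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 1], [0, 3, 2]])
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))

# Gaussian NB: continuous features
X_cont = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 4.0], [3.0, 4.2]])
print(GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]]))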

 

A majority of research papers on text classification start off by using the Naive Bayes classifier as the baseline model.

 

Now that there is an introductory idea of how Naive Bayes works, let's dive a little deeper into the maths behind it.

 

Starting with the Bayes theorem:

P(y | X) = P(X | y) · P(y) / P(X)

Here X represents the independent variables (the features) while y represents the output or dependent variable (the class).

The assumption is that all the variables are completely independent of each other, hence X translates into the individual features x1, x2, x3 and so on.

 

So, for a single feature xi the proportionality becomes:

P(y | xi) ∝ P(xi | y) · P(y)

 

Combining this for all the features x1 … xn:

P(y | x1, …, xn) ∝ P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y)

 

Now the target of the Naive Bayes algorithm is to find the class that has the maximum probability for a given test point, which means finding the value of y that maximises this expression. For this the argmax operation is used:

y = argmax over y of P(y) · P(x1 | y) · P(x2 | y) · … · P(xn | y)
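
To see how the argmax works in practice, below is a minimal from-scratch sketch of the Gaussian case (the toy data, class labels and helper names are made up for illustration): it estimates P(y) and per-feature Gaussian parameters from the data, then scores a test point with P(y) · ∏ P(xi | y) and returns the class with the highest score.

import numpy as np

# Toy training data: two continuous features per sample, two classes
X = np.array([[1.0, 2.1], [0.9, 1.9], [3.2, 4.0], [3.0, 4.2]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}               # P(y)
means = {c: X[y == c].mean(axis=0) for c in classes}         # per-class feature means
stds = {c: X[y == c].std(axis=0) + 1e-9 for c in classes}    # per-class feature std (avoid division by zero)

def gaussian_pdf(x, mu, sigma):
    # Likelihood P(xi | y) for a continuous feature under a Gaussian
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def predict(x_new):
    # y_hat = argmax over y of P(y) * product of P(xi | y)
    posteriors = {c: priors[c] * np.prod(gaussian_pdf(x_new, means[c], stds[c])) for c in classes}
    return max(posteriors, key=posteriors.get)

print(predict(np.array([1.1, 2.0])))  # closer to the first two samples, so class 0 is expected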

 

Python Implementation for Naive Bayes

Link to run and check code:

Importing the libraries

 

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB

 

Reading the dataset

 

dataframe = pd.read_csv("../input/titanic/train.csv")

 

Taking only the independent and useful data into the final data frame

 

# Name, Ticket, Passenger ID have almost no correlation to the outcome

final_dataframe = dataframe[["Survived", "Pclass", "Sex", "Age", "SibSp",
                             "Parch", "Fare", "Embarked"]]
final_dataframe = final_dataframe.dropna()

 

Labeling the values in the “Sex” column of the dataset to numbers

final_dataframe["Sex"] = final_dataframe["Sex"].replace(to_replace=final_dataframe["Sex"].unique(), value=[1, 0])
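
Note that this relies on the order in which unique() happens to return the values. An explicit mapping is a safer alternative; a minimal sketch, assuming the standard "male"/"female" values of the Titanic Sex column:

# Explicit mapping avoids depending on the order of Series.unique()
final_dataframe["Sex"] = final_dataframe["Sex"].map({"male": 1, "female": 0})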

 

One hot encoding the dataset

get_dummies is a pandas function that splits a categorical column into separate indicator columns, encoding the data in a way that the dataset can be fed to the machine learning model.

 

# One hot encoding
final_dataframe = pd.get_dummies(final_dataframe, drop_first=True)
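
As a quick illustration of what this does here (a self-contained toy example, not part of the pipeline above): with drop_first=True the three-valued Embarked column (C, Q, S) becomes the two indicator columns Embarked_Q and Embarked_S, and the dropped category C is represented by both columns being 0.

import pandas as pd

demo = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
print(pd.get_dummies(demo, drop_first=True))
# Columns: Embarked_Q, Embarked_S; the row with "C" has 0 in both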

 

Creating the training and testing datasets

 

train_y = final_dataframe["Survived"]
train_x = final_dataframe[["Pclass", "Sex", "Age", "SibSp",
                           "Parch", "Fare", "Embarked_Q", "Embarked_S"]]

 

Train Test Split for the data

 

from sklearn.model_selection import train_test_split

train_data, val_data, train_target, val_target = train_test_split(train_x, train_y, train_size=0.8)
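
Because the split is random, the accuracy reported below will vary slightly between runs. Passing a fixed random_state (the value 42 here is arbitrary) makes the split, and therefore the score, reproducible:

train_data, val_data, train_target, val_target = train_test_split(train_x, train_y, train_size=0.8, random_state=42)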

 

Fitting the model on the training dataset and then using it to predict on the validation dataset.

 

model = GaussianNB()
model.fit(train_data,train_target)
val_pred = model.predict(val_data)
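
Since Naive Bayes is a probabilistic model, the fitted classifier can also expose the per-class posterior probabilities behind each prediction, not just the argmax label; an optional check using the variables above:

# Posterior probability of each class for the first few validation rows
print(model.predict_proba(val_data[:5]))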

 

Calculating Accuracy

 

from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(val_target, val_pred)))

 

 

 

This was a sample run on the famous Titanic dataset, where Naive Bayes achieved an accuracy of 76.9%.

 

Things that can be done to make better use of the Naive Bayes model:

  1. Transform continuous features so that they follow a Gaussian distribution if they do not already.
  2. When needed, apply smoothing techniques such as Laplace correction before predicting the class of the test dataset (a short sketch follows this list).
  3. Remove interdependent features.
  4. Ensembling techniques like bagging and boosting are not applicable to Naive Bayes, because these methods work by minimising variance and Naive Bayes has no variance to minimise.
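
As a minimal sketch of point 2, using the standard scikit-learn parameters: MultinomialNB (and BernoulliNB) expose Laplace/Lidstone smoothing through the alpha parameter, while GaussianNB has the analogous var_smoothing parameter for stabilising its variance estimates.

from sklearn.naive_bayes import MultinomialNB, GaussianNB

# alpha=1.0 is classic Laplace (add-one) smoothing for count features
smoothed_mnb = MultinomialNB(alpha=1.0)

# var_smoothing adds a small fraction of the largest feature variance
# to every per-feature variance, stabilising the Gaussian estimates
smoothed_gnb = GaussianNB(var_smoothing=1e-9)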
