Machine Learning – Classification

Classification

Naive Bayes

Naive Bayes is a classification algorithm that estimates the probability of a test point belonging to each class and assigns the most probable one. Unlike many algorithms that can be adapted to both regression and classification, Naive Bayes is purely a classification algorithm.

As explained in the previous article, the formula below is Bayes' theorem, and as the name suggests, Naive Bayes is inspired by it:

P(A|B) = P(B|A) · P(A) / P(B)

Image source: https://www.simplilearn.com/ice9/free_resources_article_thumb/Bayes.JPG
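
As a quick illustration, here is a minimal sketch of the theorem applied to made-up numbers (the spam probabilities below are hypothetical, purely for illustration):

# Hypothetical example: what is the probability an email is spam,
# given that it contains the word "offer"?
p_spam = 0.2                # P(A): prior probability of spam (assumed)
p_offer_given_spam = 0.6    # P(B|A): "offer" appears in spam (assumed)
p_offer = 0.25              # P(B): "offer" appears in any email (assumed)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer
print(p_spam_given_offer)   # 0.48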

Naive Bayes is not a single algorithm; it is a family of algorithms that all follow the assumption that no two features depend on each other.

For example, a car may be red and seat 2 people. These are 2 independent features that can be used in the Naive Bayes algorithm to differentiate between a Honda and a Ferrari.

This independence assumption is also the algorithm's main drawback, because in real life most features are interrelated.

Naive Bayes works well for large datasets, and despite its simplicity it often outperforms more complex classification methods.

The main advantages are:

  • It is fast and easy to understand
  • It is not prone to overfitting
  • It does not need much training data

There are different types of Naive Bayes classifiers; the 3 major types are listed below, followed by a short usage sketch:

  1. Bernoulli Naive Bayes: This is used when the features are binary, with options like “Yes” or “No”, “True” or “False”, etc. The features follow the Bernoulli distribution, hence the name.
  2. Multinomial Naive Bayes: This type follows the multinomial distribution, where events are represented by feature counts. It is usually used in natural language processing to tag emails and documents into various categories.
  3. Gaussian Naive Bayes: This is used when the features are numerical and continuous. The probabilities are calculated using the Gaussian distribution. It is also used for text classification in natural language processing, where the frequency of word appearance is used to tag the texts.

Gaussian Distribution Graph
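
As a hedged sketch, all three variants are available in scikit-learn and share the same fit/predict interface; the toy data below is made up purely for illustration:

from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Binary features (e.g. word present / absent)  -> BernoulliNB
# Count features (e.g. word counts)             -> MultinomialNB
# Continuous features (e.g. age, fare)          -> GaussianNB
X_binary = [[1, 0, 1], [0, 1, 0], [1, 1, 1]]
X_counts = [[3, 0, 1], [0, 2, 0], [1, 1, 4]]
X_cont   = [[22.0, 7.25], [38.0, 71.3], [26.0, 7.9]]
y = [0, 1, 0]

print(BernoulliNB().fit(X_binary, y).predict([[1, 0, 0]]))
print(MultinomialNB().fit(X_counts, y).predict([[2, 0, 1]]))
print(GaussianNB().fit(X_cont, y).predict([[30.0, 8.0]]))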

Many research papers on text classification start off by using the Naive Bayes classifier as the baseline model.

Now that we have an introductory idea of how Naive Bayes works, let's dive a little deeper into the maths behind it.

Starting with Bayes' theorem:

P(y|X) = P(X|y) · P(y) / P(X)

Here X represents the independent variables (the features) while y represents the output or dependent variable (the class).

The naive assumption is that all features are completely independent of each other, so X expands into x1, x2, x3, and so on:

P(y|x1, …, xn) = P(x1|y) · P(x2|y) · … · P(xn|y) · P(y) / (P(x1) · P(x2) · … · P(xn))

Since the denominator is constant for a given input, the proportionality becomes:

P(y|x1, …, xn) ∝ P(y) · ∏ P(xi|y)

Now the target of the Naive Bayes algorithm is to find the class with the maximum probability for a given input, which means finding the value of y that maximises this expression.

For this, you can use the argmax operation:

y = argmax over y of P(y) · ∏ P(xi|y)
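
To make the argmax concrete, here is a minimal NumPy sketch of a Gaussian Naive Bayes classifier; the helper names are hypothetical, not from any library:

import numpy as np

def fit_gaussian_nb(X, y):
    # Estimate P(y) and the per-class mean/variance of each feature
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    stats = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9) for c in classes}
    return classes, priors, stats

def predict_gaussian_nb(X, classes, priors, stats):
    preds = []
    for x in X:
        log_posts = []
        for c in classes:
            mean, var = stats[c]
            # log P(y) + sum_i log P(x_i | y), with Gaussian likelihoods
            log_likelihood = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
            log_posts.append(np.log(priors[c]) + log_likelihood)
        preds.append(classes[np.argmax(log_posts)])  # y = argmax_y P(y) * prod P(x_i|y)
    return np.array(preds)

X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.1], [2.9, 4.0]])
y = np.array([0, 0, 1, 1])
classes, priors, stats = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([[1.1, 2.1]]), classes, priors, stats))  # [0]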

Python Implementation for Naive Bayes


Importing the libraries

import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB

Reading the dataset

dataframe = pd.read_csv("../input/titanic/train.csv")

Taking only the independent and useful data into the final data frame

# Name, Ticket, and Passenger ID have almost no correlation to the outcome

final_dataframe = dataframe[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
                             'Parch', 'Fare', 'Embarked']]
final_dataframe = final_dataframe.dropna()

Labeling the values in the “Sex” column of the dataset to numbers

final_dataframe["Sex"] = final_dataframe["Sex"].replace(to_replace=final_dataframe["Sex"].unique(), value=[1, 0])
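
Relying on the order of unique() can be fragile, since it follows the order in which values first appear. A more explicit alternative, assuming the column holds the strings "male" and "female" as in the Titanic data:

# Alternative to the replace() call above (use one or the other)
final_dataframe["Sex"] = final_dataframe["Sex"].map({"male": 1, "female": 0})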

One hot encoding the dataset

One-hot encoding transforms each categorical column into multiple binary columns, one per category, and encodes the data in a format suitable for input into a machine learning model. Here it is done with the get_dummies function from pandas.

# One hot encoding
final_dataframe = pd.get_dummies(final_dataframe, drop_first=True)
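
For illustration, here is roughly what get_dummies does to a small categorical column (toy data; drop_first=True drops the first category to avoid redundancy):

import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
print(pd.get_dummies(toy, drop_first=True))
# Produces indicator columns Embarked_Q and Embarked_S; "C" becomes the
# dropped baseline, encoded as zeros in both columns.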

Creating the training and testing datasets

train_y = final_dataframe["Survived"]
train_x = final_dataframe[['Pclass', 'Sex', 'Age', 'SibSp',
                           'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']]

Train Test Split for the data

from sklearn.model_selection import train_test_split

train_data, val_data, train_target, val_target = train_test_split(train_x,train_y, train_size=0.8)
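
The split above is random, so results will vary between runs. Fixing random_state (and optionally stratifying on the target) makes the experiment reproducible; a hedged variant:

train_data, val_data, train_target, val_target = train_test_split(
    train_x, train_y, train_size=0.8, random_state=42, stratify=train_y)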

Fitting the model on the training data and then using it to predict on the validation set.

model = GaussianNB()
model.fit(train_data,train_target)
val_pred= model.predict(val_data)

Calculating Accuracy

from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(val_target, val_pred)))

This was a sample run on the famous Titanic dataset, where Naive Bayes reaches an accuracy of about 76.9%.
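
Accuracy alone can hide class-level behaviour. A short sketch of a fuller evaluation using the same predictions (the exact numbers will depend on the random split):

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(val_target, val_pred))       # rows: actual, columns: predicted
print(classification_report(val_target, val_pred))  # precision, recall, F1 per class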

Things that you can do for better usage of the Naive Bayes Model:

  1. Transform continuous features so that they follow a Gaussian distribution if they do not already (for example with a log transform).
  2. When needed, apply smoothing techniques like Laplace correction before predicting classes on the test dataset (see the sketch after this list).
  3. Remove interdependent features.
  4. Ensembling techniques like bagging and boosting are generally not useful for Naive Bayes, because these methods work by reducing variance and Naive Bayes has very little variance to minimise.
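
On smoothing (point 2 above): in scikit-learn the multinomial and Bernoulli variants expose an alpha parameter (alpha=1.0 is Laplace smoothing), while GaussianNB has a var_smoothing parameter that stabilises the variance estimates. A minimal sketch:

from sklearn.naive_bayes import MultinomialNB, GaussianNB

# alpha=1.0 is classic Laplace (add-one) smoothing for count features
text_model = MultinomialNB(alpha=1.0)

# var_smoothing adds a small fraction of the largest feature variance
# to all variances, avoiding divisions by near-zero values
numeric_model = GaussianNB(var_smoothing=1e-9)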

FAQs

1. What is machine learning classification?

Machine learning classification is a type of supervised learning where the algorithm learns from labeled data to predict the categorical class labels of new data points. It involves training a model on a dataset with known class labels and then using that model to classify unseen data into predefined categories or classes.

2. How does classification differ from other types of machine learning tasks?

Classification is a specific type of supervised learning where the output variable is categorical or discrete, meaning it belongs to a finite set of classes or categories. In contrast, regression predicts continuous numerical values, and clustering groups data points based on similarity without predefined classes.

3. What are some common algorithms used for classification tasks?

Several algorithms are commonly used for classification tasks, including:

  • Logistic Regression
  • Decision Trees
  • Random Forests
  • Support Vector Machines (SVM)
  • k-Nearest Neighbors (k-NN)
  • Naive Bayes
  • Neural Networks (e.g., Multilayer Perceptrons)

4. What types of problems can classification algorithms solve?

Classification algorithms are used to solve various problems, including:

  • Email spam detection
  • Sentiment analysis (e.g., positive or negative sentiment in text)
  • Disease diagnosis
  • Credit risk assessment
  • Handwritten digit recognition
  • Image classification (e.g., identifying objects in images)

5. How do you evaluate the performance of a classification model?

The performance of a classification model is evaluated using various metrics (a short code sketch follows the list), including:

  • Accuracy: The proportion of correctly classified instances out of total instances.
  • Precision: The proportion of true positive predictions out of total positive predictions.
  • Recall (Sensitivity): The proportion of true positive predictions out of actual positive instances.
  • F1 score: The harmonic mean of precision and recall, providing a balance between the two.
  • Confusion matrix: A matrix showing the counts of true positive, true negative, false positive, and false negative predictions.
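
As a hedged illustration, all of these are available in sklearn.metrics; the labels below are made up:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1]   # hypothetical predicted labels

print(accuracy_score(y_true, y_pred))    # 0.833...
print(precision_score(y_true, y_pred))   # 1.0
print(recall_score(y_true, y_pred))      # 0.75
print(f1_score(y_true, y_pred))          # ~0.857
print(confusion_matrix(y_true, y_pred))  # [[2 0], [1 3]]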

6. What are some common challenges in classification tasks?

Common challenges in classification tasks include:

  • Imbalanced data: When one class is significantly more prevalent than others, leading to biased models.
  • Overfitting: When the model learns the training data too well, capturing noise and resulting in poor generalization to unseen data.
  • Feature selection: Choosing relevant features and dealing with high-dimensional data can impact model performance.
  • Model interpretation: Understanding how the model makes decisions and explaining its predictions to stakeholders.

7. How can classification models be improved?

Classification models can be improved through various techniques (a brief tuning sketch follows the list), including:

  • Feature engineering: Creating new features or transforming existing ones to improve model performance.
  • Hyperparameter tuning: Optimizing model parameters to find the best configuration for improved accuracy.
  • Ensemble methods: Combining multiple models to reduce variance and improve generalization.
  • Handling imbalanced data: Using techniques such as resampling, weighted loss functions, or ensemble methods to address class imbalance.
  • Regularization: Adding penalties to the model’s parameters to prevent overfitting and improve generalization.
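
As one concrete example of hyperparameter tuning (second bullet above), here is a minimal GridSearchCV sketch for the GaussianNB model from earlier; the parameter grid is an assumption, not a recommended setting:

from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Search over the variance-smoothing strength using 5-fold cross-validation
grid = GridSearchCV(GaussianNB(),
                    param_grid={"var_smoothing": [1e-11, 1e-9, 1e-7, 1e-5]},
                    cv=5, scoring="accuracy")
grid.fit(train_data, train_target)
print(grid.best_params_, grid.best_score_)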

 
