The right classification algorithm depends on the dataset being used, and data scientists have developed a variety of algorithms suited to different situations.
The most popular types of classification algorithms are:
- Linear Classifiers
- Logistic regression
- Naive Bayes classifier
- Support vector machines
- Kernel estimation
- k-nearest neighbor
- Decision trees
- Random forests
Let’s discuss these in more detail and understand when to use each algorithm.
1. Logistic Regression:
Even though the term “regression” appears in its name, logistic regression is a classification algorithm, not a regression one. It gives us a binary output such as 0 or 1, or pass or fail.
The logistic regression algorithm fits a function to the dataset and then predicts the probability of the output being 0 or 1.
This probability is written P(Y=1|X) or P(Y=0|X): the probability that the output is 1 given X, and the probability that it is 0 given X.
Here X represents the independent variables, and Y represents the dependent variable.
Because its output is a probability between 0 and 1, the algorithm can be used to estimate how likely an event is or to decide, for example, whether an object is present in a picture.
Two important parts of the logistic regression algorithm are the hypothesis and the sigmoid curve, which will be discussed later.
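As a brief preview, the two parts can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the `weights` and `bias` values are made-up assumptions, and in practice they would be learned from the dataset.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Hypothesis: P(Y=1|X) = sigmoid(w . x + b)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def predict(weights, bias, x, threshold=0.5):
    """Turn the probability into a binary 0/1 output."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical parameters, chosen only for illustration.
weights, bias = [0.8, -0.4], -0.2
p_one = predict_proba(weights, bias, [2.0, 1.0])  # P(Y=1|X)
p_zero = 1.0 - p_one                              # P(Y=0|X)
label = predict(weights, bias, [2.0, 1.0])
```

The 0.5 threshold is a common default rather than a requirement; it can be raised or lowered depending on which kind of mistake is more costly.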
2. Naive Bayes:
Naive Bayes is not a single algorithm but a family of algorithms that share the same assumptions: the input variables are independent of one another, and predictions follow Bayes’ theorem.
For example, a car may be red and seat 2 people. These are treated as 2 independent variables and can be used by the Naive Bayes algorithm to differentiate between a Honda and a Ferrari.
Bayes’ theorem formula:
P(A|B) = P(B|A) × P(A) / P(B)
where:
P(A|B) – the probability of event A occurring, given event B has occurred
P(B|A) – the probability of event B occurring, given event A has occurred
P(A) – the probability of event A
P(B) – the probability of event B
Naive Bayes is mainly used for text classification problems and multiclass problems. It is a quick way to predict categories even when the dataset is small.
The biggest drawback is the assumption that all variables in the dataset are independent, which rarely holds in real-life data. Also, if a category value was not present in the training dataset, the model assigns it a probability of 0 and cannot make a prediction (often called the zero-frequency problem).
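The sketch below shows one way these ideas might fit together for categorical data, reusing the car example. Everything in it is illustrative: the toy `samples` and `labels` are invented, and the `alpha` parameter applies Laplace (add-one) smoothing, a common remedy for the zero-probability issue that is not part of the description above.

```python
import math
from collections import Counter, defaultdict

def train(samples, labels):
    """Count how often each feature value appears within each class."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen per feature
    for x, y in zip(samples, labels):
        for i, v in enumerate(x):
            feature_counts[y][i][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict(model, x, alpha=1.0):
    """Return the class with the highest (log) posterior score."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        log_p = math.log(n_y / total)  # prior P(class)
        for i, v in enumerate(x):
            count = feature_counts[y][i][v]
            # Smoothed likelihood P(value | class); alpha keeps it above 0.
            log_p += math.log((count + alpha) / (n_y + alpha * len(values[i])))
        scores[y] = log_p
    return max(scores, key=scores.get)

# Invented toy data: (color, seats) -> make.
samples = [("red", 2), ("red", 2), ("yellow", 2),
           ("blue", 5), ("white", 5), ("red", 5)]
labels = ["Ferrari", "Ferrari", "Ferrari", "Honda", "Honda", "Honda"]
model = train(samples, labels)

seen = predict(model, ("red", 2))
unseen_color = predict(model, ("green", 2))  # "green" never appeared in training
```

With smoothing, the unseen color no longer forces every class score to zero, so the seat count can still decide the prediction.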
Support Vector Machine Algorithm:
The SVM algorithm plots each data point in an n-dimensional space (n is the number of independent variables) and then creates decision boundaries that separate the points into categories/classes. The best decision boundary is called the hyperplane: the one that leaves the widest margin between the classes. The algorithm chooses the extreme data points, those lying closest to the boundary, to create this hyperplane. These data points are called support vectors, hence the name Support Vector Machine.
The image below represents a 2-dimensional SVM model dividing the datasets into 2 categories.
The SVM algorithm is most often used for face detection, image classification, text categorization, etc.
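To make the geometry concrete, here is a small sketch of how a hyperplane classifies points once it has been found. It deliberately skips training: the hyperplane `w`, `b` and the four `points` are hypothetical values picked so the arithmetic is easy to follow.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

# Hypothetical 2-D hyperplane: x1 + x2 - 5 = 0 (assumed, not learned here).
w, b = [1.0, 1.0], -5.0
points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

predicted = [classify(w, b, p) for p in points]

# The points closest to the boundary play the role of support vectors.
min_dist = min(distance_to_hyperplane(w, b, p) for p in points)
closest = [p for p in points
           if math.isclose(distance_to_hyperplane(w, b, p), min_dist)]
```

Here the two points nearest the boundary sit on opposite sides of it at equal distance, which is the situation a trained SVM aims for when it maximizes the margin.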
K-Nearest Neighbor Algorithm:
K-nearest neighbor (KNN) is one of the most basic classification algorithms. It follows the assumption that similar things exist in close proximity to each other.
The algorithm calculates the distance from a new data point to every point whose category is known, selects the k points with the shortest distances, and assigns the new point the most common category among them.
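The procedure is short enough to sketch directly. The version below assumes Euclidean distance and a simple majority vote; the function name `knn_predict`, the toy points, and the choice of `k=3` are all illustrative rather than prescribed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` with the most common label among its k nearest neighbors."""
    # Distance from the query to every labeled point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented toy data: two well-separated clusters.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_points, train_labels, (2, 2))
near_b = knn_predict(train_points, train_labels, (6, 5))
```

Every prediction scans the whole training set, which is why the approach becomes slow as the dataset grows.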
This is an excellent algorithm for small datasets, and it is simple: it makes no assumptions about how the data is distributed and needs little tuning beyond choosing k and a distance measure.
The most famous use case for KNN is in recommender systems, such as recommending products on Amazon, movies on Netflix, or videos on YouTube. KNN may not be the most efficient approach for large amounts of data, so large companies may use their own algorithms to do the job, but for small-scale recommendation systems, KNN is a good choice.
Decision Tree Algorithm:
The Decision Tree algorithm uses a tree-like representation to solve problems. It starts by splitting the dataset into 2 or more subsets using rules found in the data. Each decision node asks a question about the data and sorts it accordingly until the desired categories are reached. The final layer of nodes is called the leaf nodes.
Consider an example where fruits need to be classified. The root node asks about color, sorting the dataset into Green, Yellow, and Red branches. The next level checks the size of the Green and Red branches and the shape of the Yellow branch. At this point some fruits already have the label they need: Banana, Watermelon, Apple, and Grape. The still-unclassified data takes one more iteration: the Yellow branch is divided by size into Grapefruit and Lemon, while the Red branch is divided by taste into Cherry and Grape.
In the end, the leaf nodes hold the classifier’s outputs, i.e., Watermelon, Apple, Grape, Grapefruit, Lemon, Banana, and Cherry.
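The fruit example can be written out as nested rules, one `if` per decision node. The questions (color, size, shape, taste) come from the example, but the specific answer values used here, such as "big", "thin", or "sweet", are assumptions made so the sketch can run. A real decision tree learns its rules from data rather than having them written by hand.

```python
def classify_fruit(color, size=None, shape=None, taste=None):
    """Walk the hand-written tree from the root (color) down to a leaf."""
    if color == "green":
        # Green branch: size decides directly (assumed size values).
        return {"big": "Watermelon", "medium": "Apple", "small": "Grape"}[size]
    if color == "yellow":
        # Yellow branch: shape first, then size for the remaining fruits.
        if shape == "thin":
            return "Banana"
        return "Grapefruit" if size == "big" else "Lemon"
    if color == "red":
        # Red branch: size first, then taste for the remaining fruits.
        if size == "medium":
            return "Apple"
        return "Cherry" if taste == "sweet" else "Grape"
    raise ValueError(f"no branch for color {color!r}")

examples = [
    classify_fruit("green", size="big"),
    classify_fruit("yellow", shape="thin"),
    classify_fruit("yellow", shape="round", size="small"),
    classify_fruit("red", size="small", taste="sweet"),
]
```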
Random Forest Algorithm:
During training, random forest constructs many decision trees and combines their predictions, typically by majority vote, to produce the output category/class. This helps reduce overfitting and is more accurate than a single decision tree in most cases.
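The combining step can be illustrated with a simple majority vote. In this sketch the three "trees" are hand-written one-rule stand-ins with invented thresholds and field names. An actual random forest would train each tree on a random sample of the data and a random subset of the features.

```python
from collections import Counter

def forest_predict(trees, x):
    """Ask every tree for a prediction and return the most common answer."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stand-ins for trained trees (thresholds are made up).
trees = [
    lambda a: "approve" if a["income"] > 50000 else "reject",
    lambda a: "approve" if a["debt"] < 10000 else "reject",
    lambda a: "approve" if a["years_employed"] >= 2 else "reject",
]

applicant = {"income": 60000, "debt": 15000, "years_employed": 3}
decision = forest_predict(trees, applicant)  # two of the three trees vote "approve"
```

Using an odd number of trees for a two-class problem avoids tied votes.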
Because it uses more than one decision tree, the algorithm’s complexity increases and real-time prediction becomes slower. It also does not provide good results when data is sparse.
One use case of this algorithm is in the healthcare industry, where it is used to diagnose patients based on past medical records. It can also be used in the banking industry to assess the creditworthiness of loan applicants.
This was meant to give an introductory intuition for the algorithms. More details to follow.