The choice of classification algorithm depends on the type of dataset being used, and data scientists have developed a range of algorithms suited to different situations.
The most popular types of classification algorithms are:
- Linear classifiers
  - Logistic regression
  - Naive Bayes classifier
- Support vector machines
- Kernel estimation
  - k-nearest neighbour
- Decision trees
  - Random forests
Let’s discuss these in some more detail and build an intuition for when to use which algorithm.
Logistic Regression Algorithm
Even though the term "regression" appears in its name, logistic regression is a classification algorithm, not a regression one. It gives a binary output such as 0 or 1, pass or fail, etc.
The logistic regression algorithm fits a function to the dataset and then predicts the probability of the output being 0 or 1.
The probability is given by P(Y=1|X) or P(Y=0|X), that is, the probability that the output is 1 given the input X and the probability that it is 0 given the input X. The two always add up to 1.
This algorithm can be used to estimate the probability of an event as a value between 0 and 1, for example to determine whether an object is in a picture or not.
Two important parts of the logistic regression algorithm are the hypothesis and the sigmoid curve, which will be discussed later.
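As a rough illustration of how a probability between 0 and 1 turns into a binary output, the sketch below passes a weighted sum of the inputs through a sigmoid function and applies a threshold. The weights, the bias, the 0.5 threshold and the "hours studied / hours slept" features are hypothetical values chosen for the example, not taken from a fitted model.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, weights, bias):
    """Return P(Y=1|X=x) for the given (hypothetical) parameters."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

def predict(x, weights, bias, threshold=0.5):
    """Turn the probability into a binary label, 0 or 1."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0

# Hypothetical parameters: [hours studied, hours slept] -> pass (1) / fail (0)
weights = [1.2, 0.4]
bias = -6.0

p_pass = predict_proba([4.0, 7.0], weights, bias)  # P(Y=1|X)
p_fail = 1.0 - p_pass                              # P(Y=0|X)
print(f"P(pass) = {p_pass:.3f}, P(fail) = {p_fail:.3f}")
print("label:", predict([4.0, 7.0], weights, bias))
```

With these made-up parameters the student who studied 4 hours and slept 7 gets a pass probability of roughly 0.83, so the predicted label is 1. In practice the weights and bias would be learned from the training data.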
Naive Bayes Algorithm
Naive Bayes is not a single algorithm but a family of algorithms that share the same assumptions: every feature is independent of the others given the class, and the prediction follows Bayes' theorem.
For example, a car may be red and seat 2 people. These are treated as 2 independent variables and can be used by the Naive Bayes algorithm to differentiate between a Honda and a Ferrari.
Bayes' theorem formula:
P(A|B) = P(B|A) × P(A) / P(B)
where:
P(A|B) – the probability of event A occurring, given event B has occurred
P(B|A) – the probability of event B occurring, given event A has occurred
P(A) – the probability of event A
P(B) – the probability of event B
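To make the four quantities above concrete, here is a small numeric sketch. The probabilities are hypothetical numbers invented for the example (A = "the car is a Ferrari", B = "the car is red"); they are not real statistics.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) computed as P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b

# Hypothetical numbers, chosen only for illustration
p_red_given_ferrari = 0.6   # P(B|A): share of Ferraris that are red
p_ferrari = 0.1             # P(A):   share of cars that are Ferraris
p_red = 0.15                # P(B):   share of cars that are red

p_ferrari_given_red = bayes(p_red_given_ferrari, p_ferrari, p_red)
print(round(p_ferrari_given_red, 3))  # roughly 0.4
```

Under these assumed numbers, knowing that a car is red raises the probability that it is a Ferrari from 0.1 to about 0.4.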
Naive Bayes is mainly used for text classification problems and multiclass problems. It is a quick way to predict categories even when the dataset is small.
Its biggest drawback is the assumption that all features in the dataset are independent of each other, which real-life data rarely satisfies. Also, if a category of a feature appears at prediction time but was not present in the training dataset, the model assigns it a probability of 0 and cannot make a meaningful prediction.
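The sketch below shows one way a Naive Bayes classifier for categorical features could look, using the car example. The training rows are made up, and the `alpha` parameter (add-one, or Laplace, smoothing) is an addition not covered in the text above: it is a common remedy for the zero-probability problem, included here so the effect can be compared with and without it.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def posterior(row, model, alpha=1.0):
    """Return normalised P(class | row); alpha is the smoothing constant."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # The "naive" step: multiply the per-feature likelihoods
            score *= (count + alpha) / (n_label + alpha * len(vocab[i]))
        scores[label] = score
    evidence = sum(scores.values())
    if evidence == 0:
        return scores  # every class scored 0, nothing to normalise
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical training data: (colour, seats) -> make
rows = [("red", 2), ("red", 2), ("yellow", 2),
        ("blue", 5), ("white", 5), ("red", 5)]
labels = ["Ferrari", "Ferrari", "Ferrari", "Honda", "Honda", "Honda"]
model = train(rows, labels)

smoothed = posterior(("red", 2), model)               # with smoothing
unsmoothed = posterior(("red", 2), model, alpha=0.0)  # without smoothing
print(smoothed)
print(unsmoothed)
```

In this toy dataset no Honda has 2 seats, so without smoothing the Honda probability for a red two-seater collapses to exactly 0. With `alpha=1.0` the Honda keeps a small non-zero probability and the Ferrari is still the clear winner.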
Support Vector Machine Algorithm
In the SVM algorithm, the data points are plotted in an n-dimensional space (n is the number of independent variables). The SVM algorithm then creates a decision boundary to separate the points into categories/classes. The best decision boundary, the one with the largest margin between the classes, is called the hyperplane. The algorithm chooses the extreme data points, those closest to the boundary, to create this hyperplane. These data points are called support vectors, hence the name Support Vector Machine.
The SVM algorithm is most often used for face detection, image classification, text categorization, etc.
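The following sketch does not train an SVM. It assumes a hyperplane has already been found and only illustrates the geometry: which side of the hyperplane a point falls on, how wide the margin is, and which points act as support vectors. The points, the labels and the hyperplane x + y - 5 = 0 are hypothetical.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def classify(point, w, b):
    """Assign class +1 or -1 depending on the side of the hyperplane."""
    return 1 if signed_distance(point, w, b) >= 0 else -1

def margin_and_support_vectors(points, w, b, tol=1e-9):
    """Return the smallest distance to the hyperplane and the points at it."""
    distances = [abs(signed_distance(p, w, b)) for p in points]
    margin = min(distances)
    support = [p for p, d in zip(points, distances) if d - margin <= tol]
    return margin, support

# Hypothetical 2-D data with two classes
points = [(1.0, 1.0), (2.0, 0.0), (0.0, 0.0),
          (4.0, 4.0), (5.0, 3.0), (6.0, 6.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Hypothetical hyperplane x + y - 5 = 0, assumed to be given
w, b = (1.0, 1.0), -5.0

margin, support_vectors = margin_and_support_vectors(points, w, b)
print("margin:", round(margin, 4))
print("support vectors:", support_vectors)
print("predictions:", [classify(p, w, b) for p in points])
```

Only the points closest to the boundary determine the margin; the points at (0, 0) and (6, 6) could move further away without changing the result.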
K-Nearest Neighbour Algorithm
K-nearest neighbour (KNN) is one of the most basic classification algorithms. It follows the assumption that similar things exist in close proximity to each other.
The K-nearest neighbour algorithm calculates the distance between a new data point and the points whose category is known, selects the K points with the shortest distances, and assigns the new point the most common category among them.
This algorithm works well with small datasets and is simple: it makes few assumptions about the data and needs little tuning beyond the choice of K and the distance measure.
The most famous use case for KNN is in recommender systems, such as recommending products on Amazon, movies on Netflix or videos on YouTube. KNN may not be the most efficient approach for large amounts of data, so these companies may use their own algorithms to do the job, but for small-scale recommendation systems KNN is a good choice.
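A minimal KNN classifier can be sketched in a few lines. The example below assumes Euclidean distance and K = 3, and the data points and the labels "A" and "B" are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest points."""
    # Pair every known point's distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D points with known categories
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2, 2)))  # near the "A" cluster
print(knn_predict(train_points, train_labels, (6, 5)))  # near the "B" cluster
```

Every prediction scans all known points, which is why the approach becomes slow on large datasets. An odd K avoids tied votes when there are two classes.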
Decision Tree Algorithm
The Decision Tree algorithm uses a tree-like representation to solve problems. It starts by splitting the dataset into 2 or more subsets using rules found in the dataset. Each decision node asks a question about the data and sorts it accordingly until the desired categories are reached. The last layer of nodes is called the leaf nodes.
Let’s use an example where fruits need to be classified. At the root node, the tree starts with colour and sorts the dataset into Green, Yellow and Red. It then checks the size of the Red and Green subsets and the shape of the Yellow subset. In this step, part of the data coming from the Yellow, Green and Red sides of the root node is classified into Banana, Watermelon, Apple and Grape. These parts of the dataset now have the labels they needed. Next, the tree takes another iteration over the still-unclassified data: it divides the yellow side by size into Grapefruit and Lemon, and the red side by taste into Cherry and Grape.
In the end, the leaf nodes hold the outputs of the classifier, i.e. Watermelon, Apple, Grape, Grapefruit, Lemon, Banana and Cherry.
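A trained decision tree is essentially a set of nested questions, so the fruit example can be written as plain conditionals. The sketch below is hand-written rather than learned from data, and the specific feature values ("big", "medium", "small", "thin", "round", "sweet", "sour") and which value leads to which fruit are assumptions made for illustration; the text only fixes the order of the questions and the final labels.

```python
def classify_fruit(colour, size=None, shape=None, taste=None):
    """Walk a hand-written decision tree and return the leaf label."""
    if colour == "green":
        # Green branch: decided by size alone (assumed mapping)
        return {"big": "Watermelon", "medium": "Apple", "small": "Grape"}[size]
    if colour == "yellow":
        # Yellow branch: shape first, then size for the round fruits
        if shape == "thin":
            return "Banana"
        return "Grapefruit" if size == "big" else "Lemon"
    if colour == "red":
        # Red branch: size first, then taste for the small fruits
        if size == "medium":
            return "Apple"
        return "Cherry" if taste == "sweet" else "Grape"
    raise ValueError(f"unknown colour: {colour!r}")

print(classify_fruit("green", size="big"))
print(classify_fruit("yellow", shape="round", size="small"))
print(classify_fruit("red", size="small", taste="sweet"))
```

Each `if` corresponds to a decision node and each `return` to a leaf node. A real decision tree algorithm would choose these questions automatically from the training data.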
Random Forest Algorithm
During training, Random Forest constructs many decision trees and combines their outputs, typically by majority vote, to decide the output category/class. This algorithm helps prevent overfitting and is more accurate than a single decision tree in most cases.
Because it uses more than one decision tree, the complexity of the algorithm increases and real-time prediction becomes slower. It also does not give good results when the data is sparse.
One use case of this algorithm is in the health care industry, where it is used to diagnose patients based on past medical records. It can also be used in the banking industry to assess the creditworthiness of loan applicants.
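The idea of building many trees and letting them vote can be sketched as follows. This is a heavily simplified illustration, not a production random forest: each "tree" is a one-split stump, each stump is trained on a bootstrap sample (rows drawn with replacement) and on a random subset of the features, and the forest predicts by majority vote. The dataset, the number of trees and the seed are hypothetical.

```python
import math
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Fit a one-split tree: pick the (feature, threshold) with fewest errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda row: left_label if row[f] <= threshold else right_label

def train_forest(rows, labels, n_trees=15, seed=0):
    """Train stumps on bootstrap samples with random feature subsets."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    n_pick = max(1, int(math.sqrt(n_features)))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        features = rng.sample(range(n_features), n_pick)
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], features))
    return forest

def forest_predict(forest, row):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: two numeric features, classes 0 and 1
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(rows, labels)
print(forest_predict(forest, (0.5, 1.0)))
print(forest_predict(forest, (9.5, 10.0)))
```

Because each tree sees a slightly different sample and feature subset, individual trees can make mistakes, but the majority vote tends to cancel them out. Prediction cost grows with the number of trees, which is the slowdown described above.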
This was just an introductory intuition for the algorithms. More details to follow.