Activation functions get their name from their role in neural networks: they decide whether a particular neuron should be activated or not. In this release, the sigmoid function is discussed in detail because it is used in the logistic regression algorithm for binary classification.
Sigmoid Activation function
The sigmoid function’s main use is to squash input values into the range between 0 and 1.
The formula for the sigmoid function is:
sigmoid(z) = 1/(1 + e^(-z))
This function looks like an “S” when plotted on a 2-D graph.
Logistic regression is essentially a modification of linear regression: to make the linear equation viable for classification, its output is passed through the sigmoid function, and the final classification is made based on the value of σ (sigmoid).
y = σ(w·x + b)
Here y is the output of logistic regression.
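To make the decision rule concrete, here is a minimal sketch of how logistic regression turns the sigmoid output into a class label; the weights `w`, bias `b`, and the 0.5 threshold are illustrative assumptions, not values from the text:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias, chosen only for illustration
w = np.array([0.8, -0.4])
b = 0.1

def predict(x, threshold=0.5):
    # y = sigmoid(w . x + b); classify as class 1 when y >= threshold
    y = sigmoid(np.dot(w, x) + b)
    return int(y >= threshold), y

label, prob = predict(np.array([2.0, 1.0]))
```

A probability above the threshold maps to class 1, below it to class 0; the threshold itself can be tuned if the two error types have different costs.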
The sigmoid is a continuously increasing function that is differentiable everywhere.
This differentiability is useful for neural networks: during backpropagation, gradient descent updates the network’s weights using the derivative of the activation function.
(Some terms might not be clear right now but will be understood after reading the neural networks release)
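Since the derivative is what backpropagation actually uses, it is worth noting that the sigmoid’s derivative has a convenient closed form, σ'(z) = σ(z)·(1 − σ(z)). A small sketch of that identity:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)):
    # the derivative is computed from the function's own output,
    # which is one reason sigmoid was convenient for backpropagation
    s = sigmoid(z)
    return s * (1 - s)
```

The derivative peaks at 0.25 when z = 0 and shrinks toward zero for large |z|.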
Unfortunately, the sigmoid is computationally expensive and its output is not zero-centred, so it is generally avoided in neural networks and used mainly for binary classification.
Python Implementation of the sigmoid function:
import numpy as np

def sigmoid(z):
    # Squash any real-valued input into the range (0, 1)
    y_head = 1 / (1 + np.exp(-z))
    return y_head
This activation function is also called a squashing function because it can take very large values and squash them into the range (0, 1).
The sigmoid function is very important in the world of neural networks: if only linear functions were used, the model could learn only linear mappings, but adding a hidden sigmoid layer lets it work with non-linear problems as well.
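The claim that purely linear layers add no power can be checked directly: two stacked linear layers collapse into a single linear layer, while inserting a sigmoid between them breaks that collapse. A small sketch with arbitrary random weights (all names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))       # 5 sample inputs with 3 features
W1 = rng.normal(size=(3, 4))      # first "layer" weights
W2 = rng.normal(size=(4, 2))      # second "layer" weights

# Two purely linear layers...
two_linear = (x @ W1) @ W2
# ...are identical to one linear layer with combined weights W1 @ W2
one_linear = x @ (W1 @ W2)

# With a sigmoid between the layers, the collapse no longer holds
nonlinear = sigmoid(x @ W1) @ W2
```

Because `two_linear` and `one_linear` are equal, depth alone buys nothing without a non-linear activation in between.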
As mentioned above, sigmoid works only with binary classification problems; a modified version of it called softmax is used for multi-class classification.
Softmax:
This is a function that uses multiple sigmoid-like terms in one function. Mathematically it looks like:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
Applying softmax gives the probability of the datapoint belonging to each class, and these probabilities always sum to one.
Python Implementation for softmax function:
import numpy as np

def softmax_function(x):
    # Exponentiate, then normalise so the outputs sum to one
    z = np.exp(x)
    z_ = z / z.sum()
    return z_
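One practical caveat worth knowing: the naive implementation above can overflow when the inputs are large, because `np.exp` of a big number is infinite. A common remedy (not from the original text, but mathematically equivalent) is to subtract the maximum input before exponentiating:

```python
import numpy as np

def softmax_stable(x):
    # Subtracting the max before exponentiating prevents overflow
    # for large inputs; the result is mathematically unchanged,
    # since the constant factor cancels in the normalisation.
    z = np.exp(x - np.max(x))
    return z / z.sum()
```

For moderate inputs it agrees with the naive version; for inputs like 1000 it still returns finite probabilities.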
Tanh:
Tanh is another activation function; it is a modification of the sigmoid that makes it symmetric around the origin.
The formula for Tanh is:
tanh(x) = 2*sigmoid(2x) - 1
Or
tanh(x) = 2/(1 + e^(-2x)) - 1
Notice that the graphs of both sigmoid and tanh have an “S” shape, but tanh is centred around the origin and keeps its range between -1 and 1, unlike sigmoid’s range of 0 to 1.
Python implementation for tanh:
import numpy as np

def tanh(x):
    # tanh(x) = 2*sigmoid(2x) - 1, written out directly
    z = (2 / (1 + np.exp(-2 * x))) - 1
    return z
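The identity tanh(x) = 2·sigmoid(2x) − 1 stated above can be verified numerically against NumPy’s built-in `np.tanh`, which is a quick sanity check on both the formula and the implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.linspace(-3, 3, 7)
lhs = np.tanh(x)               # reference implementation
rhs = 2 * sigmoid(2 * x) - 1   # the identity from the text
# The two agree pointwise across the whole range
```

This also makes the relationship visible: tanh is just a sigmoid stretched vertically to (-1, 1) and compressed horizontally by a factor of 2.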
Swish:
Swish is another modification of sigmoid and was developed by researchers at Google while looking for a computationally efficient function.
Formula for swish is:
f(x) = x*sigmoid(x)
or
f(x) = x/(1+e^(-x))
Python implementation for swish:
import numpy as np

def swish_function(x):
    # x * sigmoid(x), with sigmoid(x) = 1/(1 + e^(-x))
    return x / (1 + np.exp(-x))
This function was created to outperform ReLU (another activation function). Unlike the sigmoid, whose output is confined to (0, 1), swish is unbounded above and can also take small negative values.
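The contrast with ReLU can be seen with a few sample points: ReLU cuts all negative inputs to exactly zero, while swish lets small negative values through, giving a smooth curve instead of a hard kink at the origin. A quick illustrative comparison:

```python
import numpy as np

def swish(x):
    # x * sigmoid(x)
    return x / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
# ReLU zeroes every negative input; swish instead returns small
# negative values there and approaches x for large positive inputs.
swish_out = swish(x)
relu_out = relu(x)
```

The smoothness at and below zero is one of the properties the swish authors argued helps optimisation compared with ReLU’s hard cutoff.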
In conclusion, the sigmoid function and its variations are mostly used for classification problems.