What is an activation function?
Activation functions get their name from their use in neural networks, where they decide whether a particular neuron should be activated. In this release, the sigmoid function is discussed in detail, as it is the function used by the logistic regression algorithm for binary classification.
Sigmoid Activation function
The sigmoid function’s main use is to squash input values into the range between 0 and 1.
The formula for the sigmoid function is:
σ(z) = 1 / (1 + e^(-z))
This function looks like an “S” when plotted on a 2-D graph.
Logistic regression is essentially a modification of linear regression: the sigmoid function converts the output of the linear equation into a value viable for classification. If y is the linear output, the model computes σ(y) = 1 / (1 + e^(-y)), and the final classification is made based on the value of σ (sigmoid).
Here y is the output of the linear part of logistic regression.
This is a continuously increasing function that is differentiable everywhere.
Differentiability is important in neural networks because, during backpropagation, we use gradient descent to calculate the weights of the neural network by taking the derivative of the activation function.
(Some terms might not be clear right now but will be understood after reading the neural networks release)
Unfortunately, the sigmoid is computationally expensive and not zero-centered, so it is generally avoided in the hidden layers of neural networks and is used essentially for binary classification.
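The derivative mentioned above has a convenient closed form, σ'(z) = σ(z)(1 − σ(z)), which means backpropagation can reuse the value already computed in the forward pass. A minimal sketch that checks this closed form against a numerical finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic sigmoid
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # Closed form: sigma(z) * (1 - sigma(z)) -- cheap once the
    # forward value sigma(z) is already known
    s = sigmoid(z)
    return s * (1 - s)

# Compare against a central finite-difference approximation
z = 0.5
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(abs(sigmoid_derivative(z) - numeric) < 1e-8)  # True
```

Note that the derivative peaks at 0.25 when z = 0, a fact that matters for the vanishing-gradient discussion later in this release.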
Python Implementation of the sigmoid function:
```python
import numpy as np

def sigmoid(z):
    y_head = 1 / (1 + np.exp(-z))
    return y_head
```
This activation function is also called a squashing function because it can take very large values and squash them into the range (0, 1).
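The squashing behaviour is easy to see directly: even inputs far from zero land strictly inside (0, 1). A quick sketch, using the sigmoid defined above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Even large-magnitude inputs are squashed into (0, 1)
inputs = np.array([-50.0, -5.0, 0.0, 5.0, 50.0])
outputs = sigmoid(inputs)
print(outputs)  # near 0 ... 0.5 ... near 1
print(np.all((outputs >= 0) & (outputs <= 1)))  # True
```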
The sigmoid function is very important in the world of neural networks because if you only use linear functions, then the model would learn linearly. By adding a hidden sigmoid layer, it can work with non-linear problems.
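The point about hidden layers can be illustrated with the classic XOR problem, which no purely linear model can solve. Below is a sketch of a tiny 2-2-1 network with a sigmoid hidden layer; the weights are hand-picked for illustration (an assumption, not learned values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hand-picked (not learned) weights for a 2-2-1 network computing XOR.
# Hidden unit 1 acts like OR, hidden unit 2 like NAND; the output ANDs them.
W1 = np.array([[20.0, 20.0], [-20.0, -20.0]])
b1 = np.array([-10.0, 30.0])
W2 = np.array([20.0, 20.0])
b2 = -30.0

def forward(x):
    h = sigmoid(W1 @ x + b1)     # non-linear hidden layer
    return sigmoid(W2 @ h + b2)  # output squashed into (0, 1)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(float(forward(np.array(x, dtype=float)))))
# (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0
```

Removing the sigmoid from the hidden layer collapses the whole network into a single linear map, which cannot reproduce this truth table.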
As mentioned earlier, while the sigmoid function is suitable for binary classification problems, its modified version, the softmax function, is employed for multi-class classification scenarios.
Softmax:
This is a function that combines multiple sigmoid-like terms in one function. Mathematically it looks like:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
Applying softmax gives the probability of the data point belonging to each class, and the probabilities always sum to one.
Python Implementation for softmax function:
```python
import numpy as np

def softmax_function(x):
    z = np.exp(x)
    z_ = z / z.sum()
    return z_
```
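A short usage sketch, with hypothetical class scores, showing that the outputs form a probability distribution and that the largest score receives the highest probability:

```python
import numpy as np

def softmax_function(x):
    z = np.exp(x)
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical class scores
probs = softmax_function(scores)
print(probs.sum())     # 1.0, up to floating-point rounding
print(probs.argmax())  # 0 -- the largest score gets the highest probability
```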
Tanh:
Tanh is also an activation function. This function modifies the sigmoid by making it symmetric around the origin.
The formula for Tanh is:
tanh(x) = 2 sigmoid(2x) - 1
or, equivalently,
tanh(x) = 2 / (1 + e^(-2x)) - 1
You will see that both the sigmoid and tanh graphs have an “S” shape. While the sigmoid function ranges from 0 to 1, the tanh function is centered around the origin and spans a range from -1 to 1.
Python implementation for tanh:
```python
import numpy as np

def tanh(x):
    z = (2 / (1 + np.exp(-2 * x))) - 1
    return z
```
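A quick comparison makes the zero-centering concrete: sigmoid outputs are always positive, while tanh is symmetric around the origin and agrees with NumPy's built-in `np.tanh`:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return (2 / (1 + np.exp(-2 * x))) - 1

x = np.array([-3.0, 0.0, 3.0])
print(sigmoid(x))  # all positive, centred around 0.5
print(tanh(x))     # symmetric around 0: tanh(-3) == -tanh(3)
print(np.tanh(x))  # NumPy's built-in agrees with the formula above
```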
Swish:
Researchers at Google developed Swish, a modification of the sigmoid function, in their search for a computationally efficient function.
Formula for swish is:
f(x) = x*sigmoid(x)
or, equivalently,
f(x) = x/(1+e^(-x))
Python implementation for swish:
```python
import numpy as np

def swish_function(x):
    # Equivalent to x * sigmoid(x)
    return x / (1 + np.exp(-x))
```
The function was designed to perform better than ReLU, another common activation function. It is built from the sigmoid, but unlike the sigmoid it is not bounded above: its output ranges from approximately -0.28 to infinity.
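A small sketch checking that the two forms of swish agree, and showing the behaviour that distinguishes it from ReLU: small negative inputs produce small negative outputs instead of being zeroed out.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def swish_function(x):
    # Same as x * sigmoid(x)
    return x / (1 + np.exp(-x))

x = np.linspace(-5, 5, 101)
print(np.allclose(swish_function(x), x * sigmoid(x)))  # True

# Unlike ReLU, swish passes small negative values through rather than
# clipping them to zero
print(swish_function(np.array([-1.0])))  # a small negative value
```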
In conclusion, the sigmoid function and its variations are mostly used for classification problems.
The sigmoid activation function, popularly known as the logistic function due to its mathematical form resembling the logistic curve, serves as a non-linear activation function in artificial neural networks. It maps the input values to a range between 0 and 1, which makes it suitable for problems that require binary classification or probabilistic outputs.
The sigmoid function is defined as:
σ(x) = 1 / (1 + e^(-x))
where x is the input to the function and e is the base of the natural logarithm (approximately 2.71828).
The sigmoid function has several important properties that make it useful in neural networks:
- Non-linearity: The sigmoid function introduces non-linearity to the network, allowing it to model complex relationships between inputs and outputs. The non-linear nature of the sigmoid function enables neural networks to approximate non-linear functions and solve more complex tasks.
- Differentiability: The sigmoid function is differentiable for all values of x. This property is crucial for training neural networks using gradient-based optimization algorithms, such as backpropagation. The ability to calculate derivatives allows the network to update its weights and learn from the training data.
- Output Range: The sigmoid function maps the input values to a range between 0 and 1. This property makes it suitable for binary classification tasks, where the network needs to assign probabilities to the two classes. We can interpret the output value of the sigmoid function as the probability that the input belongs to the positive class.
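The probabilistic reading in the last bullet can be sketched directly: apply the sigmoid to some hypothetical classifier scores and threshold the resulting probabilities at 0.5 to obtain hard class labels.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical pre-activation scores from a binary classifier
scores = np.array([-2.0, 0.3, 4.0])
probs = sigmoid(scores)              # probability of the positive class
labels = (probs >= 0.5).astype(int)  # threshold at 0.5 for hard labels
print(probs)
print(labels)  # [0 1 1]
```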
Despite its usefulness, the sigmoid activation function has some limitations:
- Vanishing Gradient Problem: The gradient of the sigmoid function becomes very small for large positive and negative values of x. This causes the gradients to “vanish” during backpropagation, which can slow down or hinder the learning process in deep neural networks. The vanishing gradient problem can make it difficult for the network to propagate errors and update the weights effectively.
- Output Saturation: The output of the sigmoid function saturates at 0 for large negative values of x and at 1 for large positive values of x. When the network is in these saturated regions, the gradients become very small, and the network may struggle to make further progress in learning. This can result in slower convergence or even the complete cessation of learning.
- Not Zero-Centered: The sigmoid function is not zero-centered, meaning that its output values are always positive. This can introduce biases in the subsequent layers of the network, particularly when combined with certain weight initialization techniques. Biases in the network can affect its ability to learn and generalize effectively.
- Computationally Expensive: The exponential calculations involved in evaluating the sigmoid function can be computationally expensive, especially when dealing with large matrices or in deep neural networks. This can lead to slower training and inference times compared to other activation functions.
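The vanishing-gradient and saturation points above can be made concrete by evaluating the sigmoid's gradient, σ(x)(1 − σ(x)), at a few inputs:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

for x in [0.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# The gradient peaks at 0.25 for x = 0 and collapses toward 0 as |x|
# grows -- the vanishing-gradient / saturation behaviour described above.
```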
Conclusion
Despite its limitations, the sigmoid activation function still finds applications in certain scenarios, such as binary classification tasks and as an activation function in the output layer for probabilistic outputs. However, when it comes to hidden layers in deep neural networks, practitioners often prefer other activation functions such as ReLU (Rectified Linear Unit) or its variants like Leaky ReLU, ELU (Exponential Linear Unit), and SELU (Scaled Exponential Linear Unit). These functions help mitigate the vanishing gradient problem and enhance training performance.
In conclusion, the sigmoid activation function is a non-linear function commonly used in artificial neural networks. It maps input values to a range between 0 and 1 and is useful for binary classification tasks and probabilistic outputs. However, it suffers from the vanishing gradient problem, output saturation, lack of zero-centering, and computational complexity. Therefore, practitioners often favor other activation functions for the hidden layers in deep neural networks.
FAQs
What is the Sigmoid activation function?
The Sigmoid activation function is a mathematical function used in neural networks to introduce non-linearity into the output of a neuron.
How does the Sigmoid function work?
The Sigmoid function takes an input value and squashes it into the range [0, 1] using the logistic function: f(x) = 1 / (1 + e^(-x)).
When is the Sigmoid function typically used?
The Sigmoid function is commonly used in binary classification problems, where it maps input values to probabilities between 0 and 1.
What are the advantages of using the Sigmoid function?
The Sigmoid function produces smooth outputs and ensures that the output of a neuron is bounded, making it useful for gradient-based optimization algorithms.
Are there any limitations of the Sigmoid function?
Yes, the Sigmoid function tends to saturate and produce gradients close to zero for large input values, which can slow down learning during training (vanishing gradient problem).
Can the Sigmoid function handle multi-class classification problems?
While the Sigmoid function is primarily used for binary classification, it can be extended to handle multi-class classification problems using techniques like one-vs-all or one-vs-one.
How does the choice of activation function impact neural network performance?
The choice of activation function can significantly impact neural network performance, affecting factors such as convergence speed, model stability, and the ability to capture complex patterns in the data.