Activation functions get their name from their role in neural networks: they decide whether a particular neuron should activate (fire) or not. In this release, we will discuss the sigmoid function in detail, as it is used in the logistic regression algorithm for binary classification.
Sigmoid Activation Function
The sigmoid function's main purpose is to squash input values into the range between 0 and 1.
The formula for the sigmoid function is:
σ(z) = 1/(1 + e^(-z))
This function looks like an "S" when plotted on a 2-D graph.
Logistic regression is fundamentally a modification of linear regression. To adapt the linear equation for classification, the sigmoid function is applied to it, and the final classification is determined based on the value of σ (sigmoid):
y = σ(w*x + b)
Here y is the output of logistic regression.
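To make this concrete, here is a minimal sketch of how a trained logistic regression model would produce a prediction; the weight vector w, the bias b, and the 0.5 threshold are illustrative assumptions rather than part of any particular library.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(x, w, b):
    # Linear combination of the features, squashed into (0, 1) by the sigmoid
    y = sigmoid(np.dot(w, x) + b)
    # Treat the output as a probability and classify with a 0.5 threshold
    return 1 if y >= 0.5 else 0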
The sigmoid is a monotonically increasing function that is differentiable everywhere.
This differentiability is useful for neural networks because, during backpropagation, gradient descent updates the network's weights using the derivative of the activation function.
(Some terms might not be clear right now but will be understood after reading the neural networks release)
Unfortunately, the sigmoid is computationally expensive (it requires evaluating an exponential) and its output is not zero-centred, so it is usually avoided inside neural networks and is used essentially for binary classification.
Python Implementation of the sigmoid function:
import numpy as np

def sigmoid(z):
    # Squash any real-valued input into the range (0, 1)
    y_head = 1 / (1 + np.exp(-z))
    return y_head
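Because backpropagation needs the derivative of the activation function, it is worth noting that the sigmoid's derivative has the convenient closed form σ'(z) = σ(z)(1 - σ(z)); here is a minimal sketch building on the function above:

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), so the forward-pass value can be reused
    s = sigmoid(z)
    return s * (1 - s)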
This activation function is also known as a squashing function because it compresses any input value into the range (0, 1).
The sigmoid function holds great significance in the realm of neural networks. If you employ only linear functions, the model can learn only linear relationships. By introducing a hidden layer with a sigmoid activation, the model can effectively handle non-linear problems.
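As an illustration, below is a minimal forward-pass sketch of a network with one sigmoid hidden layer; the weight matrices W1, W2 and biases b1, b2 stand in for parameters you would obtain from training.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden layer: linear transform followed by the sigmoid non-linearity
    h = sigmoid(W1 @ x + b1)
    # Output layer: another sigmoid so the result can be read as a probability
    return sigmoid(W2 @ h + b2)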
As previously noted, sigmoid is suitable for binary classification tasks, but for multi-class classification scenarios, a modified version called softmax is utilized.
Softmax:
This function can be thought of as an extension of the sigmoid to several classes at once. Mathematically it looks like:
softmax(x_i) = e^(x_i) / Σ_j e^(x_j)
Applying softmax gives the probability of the data point belonging to each class, and these probabilities always sum to one.
Python Implementation for softmax function:
import numpy as np

def softmax_function(x):
    # Exponentiate each score, then normalize so the outputs sum to 1
    z = np.exp(x)
    z_ = z / z.sum()
    return z_
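A quick usage example; subtracting the maximum score before exponentiating is a common numerical-stability trick (it leaves the result unchanged) and is a small refinement not present in the version above:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])
shifted = np.exp(scores - scores.max())   # avoids overflow for large scores
probs = shifted / shifted.sum()
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0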
Tanh:
Tanh is another activation function. It is a modification of the sigmoid that makes the curve symmetric around the origin.
The formula for Tanh is:
tanh(x) = 2*sigmoid(2x) - 1
or
tanh(x) = 2/(1 + e^(-2x)) - 1
If you observe, both the sigmoid and tanh functions exhibit an “S” shape in their graphs. However, tanh is centered around the origin and has a range between -1 and 1, in contrast to the sigmoid function which has a range between 0 and 1.
Python implementation for tanh:
import numpy as np

def tanh(x):
    # Equivalent to np.tanh(x); output lies in the range (-1, 1)
    z = (2 / (1 + np.exp(-2 * x))) - 1
    return z
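As a quick sanity check, this sigmoid-based formula agrees with NumPy's built-in np.tanh:

import numpy as np

x = np.linspace(-3, 3, 7)
# The formula above and the library implementation produce the same values
print(np.allclose((2 / (1 + np.exp(-2 * x))) - 1, np.tanh(x)))  # True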
Swish:
Researchers at Google developed Swish as a modification of the sigmoid function while seeking a computationally efficient alternative.
Formula for swish is:
f(x) = x*sigmoid(x)
or
f(x) = x/(1 + e^(-x))
Python implementation for swish:
import numpy as np

def swish_function(x):
    # x multiplied by sigmoid(x), i.e. x / (1 + e^(-x))
    return x / (1 + np.exp(-x))
This function was created to outperform ReLU (another activation function). Unlike the sigmoid, whose output is confined to (0, 1), swish is unbounded above: its output keeps growing for large positive inputs while staying close to zero for large negative inputs.
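To see how swish differs from ReLU in practice, here is a short comparison on a few arbitrary sample points:

import numpy as np

def relu(x):
    return np.maximum(0, x)

def swish(x):
    return x / (1 + np.exp(-x))

xs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(relu(xs))   # hard cutoff: negative inputs become exactly 0
print(swish(xs))  # smooth: slightly negative for negative inputs, unbounded above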
In conclusion, you can use the sigmoid function and its variations for classification problems.
FAQs
What is an activation function in a neural network?
An activation function is a mathematical function applied to the output of each neuron in a neural network. It introduces non-linearity to the network, allowing it to learn and represent complex patterns and relationships in the data. Activation functions determine whether a neuron should be activated (fired) or not based on the weighted sum of its inputs.
Why are activation functions necessary in neural networks?
Activation functions are necessary in neural networks because they introduce non-linearity, enabling the network to learn and approximate complex functions. Without activation functions, neural networks would be limited to representing linear relationships, which would severely restrict their expressive power and ability to solve non-linear problems effectively.
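A small numerical illustration of this point: stacking linear layers without an activation in between collapses into a single linear layer (the shapes and random values below are arbitrary).

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between...
two_layers = W2 @ (W1 @ x)
# ...are equivalent to one linear layer whose weight matrix is W2 @ W1
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True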
What are some common activation functions used in neural networks?
There are several common activation functions used in neural networks, including:
- Sigmoid: S-shaped curve that squashes input values between 0 and 1.
- Hyperbolic Tangent (Tanh): Similar to the sigmoid function but squashes input values between -1 and 1.
- Rectified Linear Unit (ReLU): Piecewise linear function that returns 0 for negative input values and the input itself for positive values.
- Leaky ReLU: Similar to ReLU but allows a small, non-zero gradient for negative input values to prevent dying neurons (a short sketch of ReLU and Leaky ReLU follows this list).
- Softmax: Converts a vector of arbitrary real values into a vector of probabilities that sum to 1, commonly used in the output layer of classification networks.
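Since ReLU and Leaky ReLU are listed above but not implemented elsewhere in this release, here is a minimal sketch of both; the 0.01 negative-side slope is a common but arbitrary choice.

import numpy as np

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Keeps a small slope for negative inputs so the gradient never becomes exactly zero
    return np.where(x > 0, x, alpha * x)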
How do I choose the right activation function for my neural network?
The choice of activation function depends on the specific characteristics of your data and the task you’re trying to solve. In general, ReLU is a good default choice for hidden layers due to its simplicity and effectiveness in training deep networks. In the output layer, sigmoid is commonly used for binary classification and softmax for multi-class classification. It’s also common to experiment with different activation functions and architectures to find the best combination for your particular problem.
What are the properties of an ideal activation function?
An ideal activation function should have the following properties:
- Non-linearity: Introduce non-linearity to the network to learn complex functions.
- Continuity: Be continuous and differentiable to enable efficient optimization using gradient-based methods like backpropagation.
- Monotonicity: Preserve the order of the input values to ensure that increasing input values result in increasing output values.
- Computational efficiency: Be computationally efficient to evaluate and differentiate, especially in deep neural networks with millions of parameters.