Lesson notes from my master's class on neural network essentials at Tokyo Data Science, with some added notes from a Stanford CS230 lecture.
Sigmoid and Softmax Functions
A sigmoid function produces an output between 0 and 1, and is often
used in binary classification. It is especially useful for models that
must predict a probability as the output: since the probability of
anything lies only in the range 0 to 1, the sigmoid is the right
choice. It is defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
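As a quick illustration (a minimal sketch in NumPy; the variable names and inputs are my own), the sigmoid squashes any real-valued input into the open interval (0, 1):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)) maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5, the usual decision boundary in binary classification
print(sigmoid(4.0))   # ~0.982, read as a high probability of the positive class
print(sigmoid(-4.0))  # ~0.018
```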
Softmax works on the same principle for multi-class classification:
each class gets a probability, and the probabilities add up to 1.0.
Softmax is usually applied to the output layer of a neural network,
where a raw prediction can take any value in $(-\infty, +\infty)$.
Softmax is an activation function, not a loss; a way to measure loss
after softmax is applied to the output layer is cross-entropy. The
softmax function is

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

where $z_i$ represents each raw output, and each softmax output depends
on all the others.
The formula computes the exponential (e-power) of the given input value
and the sum of the exponentials of all the input values; the ratio of
the input's exponential to that sum is the output of the softmax
function. Softmax calculates the probability distribution of an event
over $n$ different events. In other words, it calculates the
probability of each target class over all possible target classes. The
calculated probabilities are then used to determine the target class
for the given inputs.
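To make this concrete, here is a minimal softmax sketch in NumPy (the max-subtraction is a standard numerical-stability trick; it does not change the result because softmax is shift-invariant):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))  # shift by max for numerical stability
    return exps / np.sum(exps)

raw_scores = np.array([2.0, 1.0, -1.0])  # raw outputs may lie anywhere in (-inf, +inf)
probs = softmax(raw_scores)
print(probs)        # ~[0.705 0.259 0.035]
print(probs.sum())  # 1.0, a valid probability distribution
```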
- There are a lot of analogies used for entropy: disorder,
  uncertainty, surprise, unpredictability, and amount of information.
- Calculating entropy means finding the smallest possible average size of
  a lossless encoding of messages sent from a source to a destination.
- The entropy tells us the theoretical minimum average encoding size
  for events that follow a particular probability distribution. Each
  event's encoding size can be calculated from its probability as the
  minimum number of bits needed to relay that message ($-\log_2 p$ bits
  for an event with probability $p$).
- If entropy is high (the encoding size is big on average), it means we
  have many message types with small probabilities. Hence, every time
  a new message arrives, you'd expect a different type than the previous
  messages. You may see this as disorder, uncertainty, or
  unpredictability. When a message type with a much smaller probability
  than the other message types occurs, it appears as a surprise, because
  on average you'd expect the more frequently sent message types.
  A rare message type also carries more information than more frequent
  message types, because it eliminates many other possibilities and
  tells us something more specific.
- More details on entropy: for a distribution $p$, the entropy is
  $H(p) = -\sum_{i} p_i \log_2 p_i$, as in the sketch below.
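As a small sketch of the encoding-size intuition (NumPy; the two distributions are made-up toy examples):

```python
import numpy as np

def entropy(p):
    # H(p) = -sum_i p_i * log2(p_i): the theoretical minimum average
    # number of bits per message for a source with distribution p
    p = np.asarray(p)
    return -np.sum(p * np.log2(p))

uniform = [0.25, 0.25, 0.25, 0.25]  # four equally likely message types
skewed  = [0.97, 0.01, 0.01, 0.01]  # one dominant, predictable message type
print(entropy(uniform))  # 2.0 bits per message, maximally unpredictable
print(entropy(skewed))   # ~0.24 bits per message, highly compressible
```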
Cross-entropy loss is defined as

$$CE = -\sum_{i} t_i \log(s_i)$$

where $t_i$ is the ground-truth label and $s_i$ is the predicted
probability for class $i$.
Binary Cross-Entropy Loss

$$CE = -t \log(\sigma(z)) - (1 - t)\log(1 - \sigma(z))$$

where $\sigma$ represents the sigmoid activation function applied to
the raw logit $z$.
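A minimal sketch of this loss on a raw logit (NumPy; the labels and logits are illustrative values of my own):

```python
import numpy as np

def binary_cross_entropy(t, z):
    # t: true label (0 or 1); z: raw logit
    p = 1.0 / (1.0 + np.exp(-z))  # sigmoid turns the logit into a probability
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

print(binary_cross_entropy(1, 3.0))   # ~0.049, confident and correct: low loss
print(binary_cross_entropy(1, -3.0))  # ~3.049, confident and wrong: high loss
```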
Categorical Cross-Entropy Loss

Categorical cross-entropy applies the cross-entropy loss after a
softmax activation:

$$CE = -\sum_{i}^{C} t_i \log\left(\frac{e^{z_i}}{\sum_{j}^{C} e^{z_j}}\right)$$

In the usual case of multi-class classification the labels are one-hot
encoded, so only the positive class $p$ keeps its term in the loss: there
is only one element of the target vector $t$ which is not zero.
Discarding the elements of the summation which are zero due to the target
labels, we can write it as:

$$CE = -\log\left(\frac{e^{z_p}}{\sum_{j}^{C} e^{z_j}}\right)$$
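A minimal sketch of the simplified one-hot form (NumPy; `target_idx`, the index of the positive class, is my own naming):

```python
import numpy as np

def categorical_cross_entropy(target_idx, z):
    # with one-hot targets only the positive class survives the sum,
    # so CE = -log(softmax(z)[target_idx])
    exps = np.exp(z - np.max(z))
    probs = exps / np.sum(exps)
    return -np.log(probs[target_idx])

z = np.array([2.0, 1.0, -1.0])
print(categorical_cross_entropy(0, z))  # ~0.349, true class has the highest score
print(categorical_cross_entropy(2, z))  # ~3.349, true class has the lowest score
```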
Derivative with respect to the positive-class score $z_p$:

$$\frac{\partial CE}{\partial z_p} = s_p - 1$$

Derivative with respect to a negative-class score $z_n$ ($n \neq p$):

$$\frac{\partial CE}{\partial z_n} = s_n$$

where $s_i$ is the softmax output for class $i$. Both cases combine into
the compact form $\partial CE / \partial z_i = s_i - t_i$, as checked
numerically below.
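A small numerical check of these derivatives against finite differences (NumPy; the toy scores and one-hot target are my own illustrative values):

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / np.sum(exps)

def ce(z, p=0):
    # categorical cross-entropy with positive class index p
    return -np.log(softmax(z)[p])

z = np.array([2.0, 1.0, -1.0])
t = np.array([1.0, 0.0, 0.0])  # one-hot target, positive class = 0

analytic = softmax(z) - t  # s_p - 1 for the positive class, s_n for the rest

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric[i] = (ce(z_plus) - ce(z_minus)) / (2 * eps)

print(analytic)  # ~[-0.295  0.259  0.035]
print(numeric)   # matches the analytic gradient to high precision
```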