Neural Network Essentials 3: Classification and Cross-Entropy

Lesson notes from my master's class on neural network essentials at Tokyo Data Science, with some added notes from the Stanford CS230 lectures.

  • Entropy: $H(p) = -\sum_{i} p_{i} \log(p_{i})$

Sigmoid and Softmax Function

A sigmoid function produces an output between 0 and 1 and is often used in binary classification. It is especially useful for models that have to predict a probability as their output: since a probability always lies between 0 and 1, sigmoid is a natural choice.
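As a rough illustration of the squashing behaviour described above, here is a minimal sketch of the logistic sigmoid in plain Python. The function name and the sample scores are my own choices for illustration, and the branch on the sign of the input is just one common way to avoid overflow in `math.exp`.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: squashes any real number into the range 0 to 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Hypothetical raw scores: strongly negative, neutral, strongly positive.
for score in (-4.0, 0.0, 4.0):
    print(score, sigmoid(score))
```

A score of 0 maps to exactly 0.5, and scores further from 0 move the output towards 0 or 1, which is why the output can be read as the probability of the positive class.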


Softmax applies the same principle to multi-class classification: each class gets a probability, and the probabilities all add up to 1.0. Softmax is usually applied to the output layer of a neural network, where each raw prediction can take any value in (-infinity, +infinity). Softmax is an activation function, not a loss. One way to measure the loss after softmax has been applied to the output layer is Cross-Entropy Loss.


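$$s_{i} = \frac{e^{S_{i}}}{\sum_{j}^{C} e^{S_{j}}}$$

where $s_{i}$ is the softmax output for class $i$, $S_{j}$ is the raw score for class $j$, and $C$ is the number of classes. Each output depends on all the other scores through the sum in the denominator.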

The formula takes the exponential of the given input value and divides it by the sum of the exponentials of all the input values; that ratio is the output of the softmax function. The result is a probability distribution over ‘n’ different events. In other words, for each target class, softmax gives the probability of that class among all possible target classes. These probabilities are later used to determine the target class for the given inputs.
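The ratio described above can be sketched in a few lines of plain Python. This is only an illustration: the function name and the example scores are assumptions of mine, and subtracting the maximum score before exponentiating is an extra numerical-stability trick that is not part of the formula itself (it leaves the result unchanged).

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max does not change the ratios but keeps exp() from overflowing.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(sum(probs))
```

The largest score gets the largest probability, and changing any single score changes every output because all of them share the same denominator.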



  • Many analogies are used for entropy: disorder, uncertainty, surprise, unpredictability, amount of information, and so on.
  • Calculating entropy means finding the smallest possible average size of a lossless encoding of the messages sent from a source to a destination.
  • The entropy tells us the theoretical minimum average encoding size for events that follow a particular probability distribution. The encoding size of each message can be calculated from its probability, which fixes the minimum number of bits needed to relay it.
  • If entropy is high (the encoding size is big on average), we have many message types with small probabilities. Hence, every time a new message arrives, you’d expect a different type than the previous messages. You may see this as disorder, uncertainty, or unpredictability. When a message type with a much smaller probability than the others occurs, it comes as a surprise, because on average you’d expect the more frequently sent message types. A rare message type also carries more information than a frequent one, because it eliminates many other possibilities and tells us something more specific.
  • More details on entropy:
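To make the "average encoding size" reading of entropy concrete, here is a small sketch that computes the entropy in bits for two made-up distributions. The function name and both distributions are assumptions chosen for illustration; base-2 logarithms are used so the result can be read as bits per message.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits: -sum(p * log2(p)) over the distribution."""
    # Outcomes with p == 0 are skipped; they contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # four equally likely message types
skewed = [0.97, 0.01, 0.01, 0.01]    # one message type dominates

print(entropy_bits(uniform))
print(entropy_bits(skewed))
```

The uniform case needs 2 bits per message on average, while the skewed case needs far less, because the next message is almost always the common type.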

Cross entropy loss is defined as

$$CE = -\sum_{i} t_{i} \log(s_{i})$$
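A direct translation of this sum into Python might look like the sketch below. The function name, the `eps` guard against `log(0)`, and the example vectors are assumptions added for illustration; `targets` plays the role of $t$ and `probs` the role of $s$.

```python
import math

def cross_entropy(targets, probs, eps=1e-12):
    """CE = -sum_i t_i * log(s_i) for targets t and predicted probabilities s."""
    # eps only protects against log(0); it is not part of the formula.
    return -sum(t * math.log(max(s, eps)) for t, s in zip(targets, probs))

targets = [0.0, 1.0, 0.0]        # one-hot: class 1 is the correct class
confident = [0.05, 0.90, 0.05]   # hypothetical prediction, mostly right
unsure = [0.40, 0.30, 0.30]      # hypothetical prediction, mostly wrong

print(cross_entropy(targets, confident))
print(cross_entropy(targets, unsure))
```

The confident, correct prediction gives the smaller loss; the loss grows as the probability assigned to the correct class shrinks.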

Binary Cross-Entropy Loss


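In its standard form, for a single raw score $S$ and a target $t \in \{0, 1\}$:

$$CE = -t \log(\sigma(S)) - (1 - t) \log(1 - \sigma(S))$$

where $\sigma$ represents the sigmoid (logistic) activation function.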
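For a single raw score $S$ and a 0/1 target $t$, binary cross-entropy is commonly written as $-t \log(\sigma(S)) - (1 - t) \log(1 - \sigma(S))$. The sketch below computes that quantity directly from the raw score. The function names are my own, and the `softplus` rewrite is a stability choice I am assuming here rather than something taken from the lesson: it uses the identities $\log(\sigma(S)) = -\mathrm{softplus}(-S)$ and $\log(1 - \sigma(S)) = -\mathrm{softplus}(S)$.

```python
import math

def softplus(x):
    """log(1 + e^x), computed without overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def binary_cross_entropy(target, score):
    """-t*log(sigmoid(S)) - (1 - t)*log(1 - sigmoid(S)) for a raw score S."""
    # log(sigmoid(S)) = -softplus(-S) and log(1 - sigmoid(S)) = -softplus(S)
    return target * softplus(-score) + (1 - target) * softplus(score)

# Hypothetical scores for a sample whose true label is 1.
print(binary_cross_entropy(1, 3.0))    # score agrees with the label
print(binary_cross_entropy(1, -3.0))   # score disagrees with the label
```

A score that agrees with the label gives a small loss, and the same score with the opposite label gives a large one.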

Categorical Cross-Entropy Loss

In the usual multi-class case the labels are one-hot encoded, so only the positive class keeps its term in the loss. Only one element of the target vector is non-zero, $t_{i} = t_{p}$. Discarding the terms of the summation that are zero because of the target labels, we can write it as:

$$CE = -\log\left(\frac{e^{S_{p}}}{\sum_{j}^{C} e^{S_{j}}}\right)$$


Derivative with respect to $S_{positive}$, the score of the positive class:

$$\frac{\partial L}{\partial S_{p}} = \frac{e^{S_{p}}}{\sum_{j}^{C} e^{S_{j}}} - 1$$

Derivative with respect to $S_{negative}$, the score of any negative class:

$$\frac{\partial L}{\partial S_{n}} = \frac{e^{S_{n}}}{\sum_{j}^{C} e^{S_{j}}}$$
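Put together, the gradient of the categorical cross-entropy loss with respect to each raw score is the softmax output for that class, with 1 subtracted at the positive class. The sketch below checks that claim numerically with central finite differences. All function names, the example scores, and the step size `h` are assumptions for illustration, not part of the lesson.

```python
import math

def softmax(scores):
    shift = max(scores)  # stability shift; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_ce(scores, positive):
    """Loss for one sample: -log of the softmax output at the positive class."""
    return -math.log(softmax(scores)[positive])

def categorical_ce_grad(scores, positive):
    """Analytic gradient: softmax output, minus 1 at the positive class."""
    grad = softmax(scores)
    grad[positive] -= 1.0
    return grad

def numeric_grad(scores, positive, h=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for k in range(len(scores)):
        up = list(scores)
        up[k] += h
        down = list(scores)
        down[k] -= h
        diff = categorical_ce(up, positive) - categorical_ce(down, positive)
        grad.append(diff / (2 * h))
    return grad

scores = [2.0, 1.0, 0.1]   # hypothetical raw scores
positive = 0               # index of the correct class
analytic = categorical_ce_grad(scores, positive)
numeric = numeric_grad(scores, positive)
print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places. The positive-class entry is negative (raising that score lowers the loss) and the negative-class entries are positive.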

Written by @Ryan Liwag
Data scientist who dabbles in machine learning and software engineering. If you're interested in working with me, drop me an email at