Lesson notes from my master's class on neural network essentials at Tokyo Data Science, with some additional notes from the Stanford CS230 lectures.
MSE, MAE and RMSE
- MSE is the most commonly used regression loss function. As the name
  suggests, mean squared error is measured as the average of the squared
  differences between predictions and actual observations. It is only
  concerned with the average magnitude of the errors, irrespective of
  their direction. However, due to the squaring, predictions that are
  far from the actual values are penalized much more heavily than less
  deviated ones.
- MAE is the mean of the absolute differences between the target and
  predicted variables, so it measures the average magnitude of the
  errors in a set of predictions without considering their direction.
  (If we did consider direction, that would be the Mean Bias Error
  (MBE), which is the mean of the signed residuals/errors.) Like MSE,
  its range is 0 to infinity.
- Both RMSE and MAE are ways to measure the distance between two
  vectors: the vector of predictions and the vector of target values.
  MAE corresponds to the l1 (Manhattan) norm, while RMSE corresponds to
  the l2 (Euclidean) norm. The higher the norm index, the more it
  focuses on large values and neglects small ones.
- The key difference between MAE and MSE is the squaring, which makes
  MSE penalize large errors more heavily than MAE does. This is also
  why RMSE is always at least as large as MAE on the same data.
- One big problem with using MAE to train neural networks is its
  constant, large gradient, which can cause gradient descent to
  overshoot the minimum at the end of training. With MSE, the gradient
  shrinks as the loss approaches its minimum, making convergence more
  precise.
- Huber loss is less sensitive to outliers than the squared error loss,
  and unlike MAE it is differentiable everywhere, including at 0. It is
  basically absolute error that becomes quadratic when the error is
  small. How small the error has to be for it to become quadratic
  depends on a hyper-parameter, δ (delta), which can be tuned. Huber
  loss approaches MAE as δ → 0 and MSE as δ → ∞.
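To make the metrics above concrete, here is a minimal NumPy sketch; the targets and predictions are made-up numbers purely for illustration:

```python
import numpy as np

# Hypothetical targets and predictions, just to illustrate the metrics.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_pred - y_true

mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
mae = np.mean(np.abs(errors))   # mean absolute error

# RMSE is never smaller than MAE (l2 norm vs l1 norm of the error vector).
assert rmse >= mae
print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```

For this toy data the errors are [-0.5, 0.5, 0, 1], so MSE = 0.375, RMSE ≈ 0.612, and MAE = 0.5, matching the RMSE ≥ MAE property.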
Because MAE's gradient stays large even near the minimum while MSE's
shrinks, Huber loss can be really helpful here: it curves around the
minimum, which decreases the gradient, and it is more robust to
outliers than MSE. It therefore combines the good properties of both
MSE and MAE. The drawback of Huber loss is that we may need to tune the
hyper-parameter δ, which is an iterative process.
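A small sketch of Huber loss, assuming an illustrative default of δ = 1.0 (not a recommended value), shows the quadratic-near-zero, linear-far-away behavior described above:

```python
import numpy as np

def huber(error, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it.

    delta is the tunable hyper-parameter from the notes; 1.0 here is
    just an illustrative default.
    """
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([0.1, 0.5, 2.0, 10.0])

# Small errors are penalized quadratically (MSE-like); large errors
# grow only linearly (MAE-like), so outliers dominate far less than
# under squared error.
print(huber(errors, delta=1.0))  # [0.005, 0.125, 1.5, 9.5]
```

Note that the two branches meet smoothly at |error| = δ with matching value and slope, which is what keeps the loss differentiable everywhere.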