Neural Network Essentials 4: MSE, MAE and Huber Loss

Lesson notes from my master's class on neural network essentials at Tokyo Data Science, with some added notes from the Stanford CS230 lectures.

  • MAE: $f(y,\hat{y}) = \frac{1}{m}\sum_{i=1}^{m} |y_{i} - \hat{y}_{i}|$
  • MSE: $f(y,\hat{y}) = \frac{1}{m}\sum_{i=1}^{m} (y_{i} - \hat{y}_{i})^{2}$
  • RMSE: $\sqrt{MSE}$
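As a quick sanity check of the definitions above, here is a minimal NumPy sketch (the function and variable names are my own) that computes the three metrics for a vector of targets and predictions:

```python
import numpy as np

def mae(y, y_hat):
    # Mean Absolute Error: average magnitude of the residuals, ignoring direction
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    # Mean Squared Error: average of the squared residuals, penalizes large errors more
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    # Root Mean Squared Error: square root of MSE, same units as the target
    return np.sqrt(mse(y, y_hat))

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mae(y, y_hat), mse(y, y_hat), rmse(y, y_hat))  # 0.5, 0.375, ~0.612
```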

MSE, MAE and RMSE

  • MSE is the most commonly used regression loss function. As the name suggests, mean squared error is the average of the squared differences between predictions and actual observations. It is only concerned with the average magnitude of the errors, irrespective of their direction. However, because of the squaring, predictions that are far from the actual values are penalized much more heavily than less deviated predictions.
  • MAE is the mean of the absolute differences between our target and predicted variables, so it measures the average magnitude of the errors in a set of predictions without considering their direction. (If we also consider direction, we get the Mean Bias Error (MBE), which is the mean of the residuals/errors.) Its range is also 0 to $\infty$.
  • Both RMSE and MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. MAE corresponds to the $\ell_1$ (Manhattan) norm, while RMSE corresponds to the $\ell_2$ (Euclidean) norm. The higher the norm index, the more it focuses on large values and neglects small ones.
  • The biggest difference between MAE and MSE is the squaring, which causes MSE to penalize large errors much more heavily than MAE does. This is also why RMSE is always greater than or equal to MAE on the same set of errors.
  • One big problem with using MAE to train neural nets is its constant gradient, which stays large even when the loss is small and can cause gradient descent to overshoot the minimum at the end of training. For MSE, the gradient decreases as the loss approaches its minimum, making the final steps more precise. The per-prediction gradients are compared below.
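To make this concrete, differentiating the definitions above with respect to a single prediction $\hat{y}_i$ gives:

$$
\frac{\partial\,\text{MAE}}{\partial \hat{y}_{i}} = -\frac{1}{m}\,\operatorname{sign}(y_{i} - \hat{y}_{i}),
\qquad
\frac{\partial\,\text{MSE}}{\partial \hat{y}_{i}} = -\frac{2}{m}\,(y_{i} - \hat{y}_{i})
$$

The MAE gradient always has magnitude $1/m$ no matter how close the prediction is to the target, while the MSE gradient shrinks with the residual, which is exactly the behaviour described in the bullet above.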

Huber Loss

  • Huber loss is less sensitive to outliers in the data than the squared error loss, and it is differentiable at 0. It is basically absolute error that becomes quadratic when the error is small. How small the error has to be for it to become quadratic depends on a hyperparameter, $\delta$ (delta), which can be tuned. Huber loss approaches MAE as $\delta \to 0$ and MSE as $\delta \to \infty$. Its piecewise definition is given below.
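For reference, the standard piecewise definition of the Huber loss for a single prediction is:

$$
L_{\delta}(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^{2} & \text{if } |y - \hat{y}| \le \delta \\
\delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right) & \text{otherwise}
\end{cases}
$$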

As noted above, MAE's constant gradient can cause gradient descent to miss the minimum at the end of training, while MSE's shrinking gradient is precise near the minimum but sensitive to outliers. Huber loss can be really helpful here: it curves around the minimum, which decreases the gradient, and it is more robust to outliers than MSE. It therefore combines good properties from both MSE and MAE. The drawback of Huber loss is that we might need to tune the hyperparameter $\delta$, which is an iterative process. A small sketch of the loss is shown below.
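Here is a minimal NumPy sketch of the Huber loss (the names and the default $\delta$ are my own choices), showing how it switches between the quadratic and linear regimes at $|y - \hat{y}| = \delta$:

```python
import numpy as np

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals (MSE-like), linear for large ones (MAE-like, robust to outliers)
    residual = y - y_hat
    is_small = np.abs(residual) <= delta
    quadratic = 0.5 * residual ** 2
    linear = delta * (np.abs(residual) - 0.5 * delta)
    return np.mean(np.where(is_small, quadratic, linear))

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 20.0])   # last prediction is an outlier
print(huber(y, y_hat, delta=1.0))          # the outlier only contributes linearly
```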


Written by @Ryan Liwag
Data scientist who dabbles in machine learning and software engineering. If you're interested in working with me, drop me an email at rjhontomin@gmail.com