Neural Network Essentials 2: Loss and Gradient Descent

Lesson notes from my master's class on neural network essentials at Tokyo Data Science, with some added notes from the Stanford CS230 lectures.

  • Observation $(x_{i}, y_{i})$
  • Linear Model $y_{i}=\alpha x_{i}+\beta$
  • Loss Function (MSE) $L=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-\hat{y}_{i})^{2}$
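
To make these three pieces concrete, here is a minimal NumPy sketch; the data values are made up purely for illustration:

```python
import numpy as np

# Made-up observations (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.9, 5.1, 7.2, 8.8])

# Linear model y_hat = alpha * x + beta with some guessed parameters
alpha, beta = 2.0, 1.0
y_hat = alpha * x + beta

# Mean squared error: L = (1/m) * sum((y_i - y_hat_i)^2)
m = len(x)
loss = np.sum((y - y_hat) ** 2) / m
print(loss)
```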

Neural network learning is ultimately an optimization problem. To solve it, we need a clearly defined metric that tells us whether the network is learning; this metric is known as the loss. Gradient descent is an iterative algorithm that minimizes the loss by adjusting the parameters, so in terms of our linear model we are adjusting $\alpha$ and $\beta$. Gradient descent starts from random parameters and travels toward the loss function's lowest point. (Reminder: the convexity of the problem is determined by the loss function. With few parameters it is better to use a convex loss function, but with many parameters a non-convex loss is acceptable, because most critical points are then saddle points rather than poor local minima; see the discussion below.)

So how does gradient descent lower the loss exactly?

The idea is to update the parameters so as to lower the loss. To do that, you need the gradient (slope) of the loss with respect to each parameter. (In our example, knowing the loss function and the linear model, we differentiate the loss with respect to $\alpha$ and $\beta$.) The sign of the gradient tells us in which direction to adjust each parameter to decrease the loss. From this, a step size can be calculated for each parameter: step size = gradient × learning rate, and the new parameter is: new parameter = old parameter − step size. The learning rate is a hyperparameter that influences the convergence of the algorithm. A large learning rate makes the algorithm take huge steps down the slope, so it might jump across the minimum point and miss it; a small learning rate such as 0.01 is a common, safe starting point. It can also be shown mathematically that gradient descent takes larger steps down the slope when the starting point is far from the minimum, where the gradient is large, and baby steps as it gets closer, where the gradient shrinks toward zero, so it is quick early on and careful not to overshoot near the destination.

Summary of the gradient descent step: $\Theta^{(t+1)}=\Theta^{(t)}-\epsilon\cdot\frac{\partial}{\partial\Theta}L(\Theta^{(t)},y)$, where $\Theta^{(t)}$ is the current parameter iterate, $\Theta^{(t+1)}$ is the updated parameter, and $\epsilon$ is the learning rate.
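
To see the "larger steps far away, baby steps near the minimum" behaviour concretely, here is a toy sketch applying this update rule to a one-dimensional quadratic loss (chosen purely for illustration):

```python
# Minimize L(theta) = theta^2, whose gradient is dL/dtheta = 2 * theta
theta = 10.0   # start far from the minimum at theta = 0
eps = 0.1      # learning rate

for t in range(10):
    grad = 2 * theta        # gradient at the current parameter value
    step = eps * grad       # step size = gradient * learning rate
    theta = theta - step    # gradient descent update
    print(f"iteration {t}: theta = {theta:.4f}, step size = {step:.4f}")
```

The printed step sizes shrink as `theta` approaches the minimum, because the gradient itself shrinks toward zero there.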

Written out with the linear model substituted in, the loss function is

$L=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-(\alpha x_{i}+\beta))^{2}$

Gradient of the loss is

$\nabla L=\left(\frac{\partial L}{\partial\alpha},\frac{\partial L}{\partial\beta}\right)$

Gradient of the loss with respect to $\alpha$ and $\beta$:

$\frac{\partial L}{\partial\alpha}=\frac{\partial}{\partial\alpha}\frac{1}{m}\sum_{i=1}^{m}(y_{i}-(\alpha x_{i}+\beta))^{2}$

$\;=\frac{2}{m}\sum_{i=1}^{m}(y_{i}-(\alpha x_{i}+\beta))\cdot\frac{\partial}{\partial\alpha}(y_{i}-(\alpha x_{i}+\beta))$

$\;=\frac{2}{m}\sum_{i=1}^{m}(y_{i}-(\alpha x_{i}+\beta))\cdot(-x_{i})$

$\;=-\frac{2}{m}\sum_{i=1}^{m}x_{i}(y_{i}-\hat{y}_{i})$

$\frac{\partial L}{\partial\beta}=\frac{\partial}{\partial\beta}\frac{1}{m}\sum_{i=1}^{m}(y_{i}-(\alpha x_{i}+\beta))^{2}$

$\;=-\frac{2}{m}\sum_{i=1}^{m}(y_{i}-\hat{y}_{i})$

If the learning rate is $\epsilon$, then the gradient updates are

$\alpha\leftarrow\alpha-\epsilon\frac{\partial L}{\partial\alpha}$

$\beta\leftarrow\beta-\epsilon\frac{\partial L}{\partial\beta}$
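
Putting the derived gradients and the update rule together, a minimal gradient descent loop for the linear model might look like the sketch below; the data, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

# Made-up data roughly following y = 2x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
m = len(x)

alpha, beta = 0.0, 0.0   # initial parameters
eps = 0.01               # learning rate

for _ in range(5000):
    y_hat = alpha * x + beta
    # Gradients derived above
    d_alpha = -(2 / m) * np.sum(x * (y - y_hat))
    d_beta = -(2 / m) * np.sum(y - y_hat)
    # Gradient descent updates
    alpha -= eps * d_alpha
    beta -= eps * d_beta

print(alpha, beta)   # should approach roughly 2 and 1
```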

Polynomial regressions as linear regressions
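
A polynomial model such as $y=\alpha_{1}x+\alpha_{2}x^{2}+\beta$ is still linear in its parameters, so the same machinery applies: treat each power of $x$ as a separate input feature and the MSE loss remains convex in the parameters. A minimal sketch, using made-up data and NumPy's least-squares solver for brevity (gradient descent on the same loss would converge to the same parameters):

```python
import numpy as np

# Made-up data, roughly quadratic in x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.1, 7.0, 15.8, 29.3])

# Stack powers of x as columns [1, x, x^2]; the model is linear in theta
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Closed-form least-squares fit of theta = [beta, alpha_1, alpha_2]
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```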

Convex function optimization is relatively easy and always leads to a global optimum. Optimization of non-convex functions is much harder. For non-convex loss functions with few parameters, gradient descent can easily get stuck in local optima with a high loss. For neural networks with many parameters, however, this is not a problem. To illustrate why, think of a critical point of the loss function (where the gradient is zero). For this to be a local minimum, moving from this point in the direction of any parameter must take us higher. For simplicity, assume there is a 50% chance of going higher and a 50% chance of going lower in each direction, independently across parameters. With $n$ parameters, the probability that every direction goes higher is then $0.5^{n}$, which is vanishingly small for large $n$, so almost all critical points are saddle points rather than local minima. Before getting close to a local minimum, we will probably pass many saddle points, so the loss is likely to have been substantially reduced by the time we reach one. We will not reach a global minimum, but this is unlikely to be a problem, because the global minimum could correspond to overfitting.
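
Under that independence assumption, a quick back-of-the-envelope check shows how fast the chance of a true local minimum vanishes as the parameter count $n$ grows:

```python
# Probability that the loss increases in all n independent directions,
# i.e. that a critical point is a local minimum rather than a saddle point
for n in [2, 10, 100, 1000]:
    print(n, 0.5 ** n)
```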

Written by @Ryan Liwag
Data scientist who dabbles in machine learning and software engineering. If you're interested in working with me, drop me an email at rjhontomin@gmail.com