
Gradient Descent is decent!


      OBJECTIVE: To find the best fit line.
      CLAIM: When the Loss Function is minimized, we obtain the best fit line.

       Well, here we go. In Machine Learning (ML), finding the optimal solution for a given set of data points is an essential step, and Gradient Descent plays an important role in doing so with the help of statistical analysis. There are three types of Gradient Descent:
  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-batch Gradient Descent 
     Here we will see how Gradient Descent works overall. Considering a linear regression, let us work with the graph shown below.

     The 'x' marks represent the data points, whereas the line passing through them represents the solution. The current solution may or may not be optimal, since the difference between the ACTUAL POINTS and the PREDICTED POINTS is quite visible. The vertical dotted lines in the next graph depict the error.
    
      So the main objective here is to reduce the Loss Function (Total Error). In ML, this is done by adjusting two factors:
SLOPE & INTERCEPT.

      There are several ways to calculate the Loss Function:
  1. Sum of Absolute Errors (SAE)
  2. Sum of Squared Errors (SSE)
  3. Mean Squared Error (MSE)
  4. Root Mean Squared Error (RMSE)
       The most commonly used metric is MSE, but when more precision is required, RMSE should be used.
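
To get a feel for how these metrics differ, here is a quick Python sketch computing each of them for a few made-up actual and predicted values (the numbers are purely illustrative):

```python
import math

# Made-up actual vs. predicted heights for five points (illustrative only)
actual    = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.4, 3.8, 5.3]

errors = [a - p for a, p in zip(actual, predicted)]

sae  = sum(abs(e) for e in errors)   # Sum of Absolute Errors
sse  = sum(e ** 2 for e in errors)   # Sum of Squared Errors
mse  = sse / len(errors)             # Mean Squared Error
rmse = math.sqrt(mse)                # Root Mean Squared Error

print(sae, sse, mse, rmse)
```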
     
        We know the equation for a line is given by y = mx + c, where m = slope and c = y-intercept. To calculate the Loss Function value, we start from a randomly chosen intercept; this gives the gradient descent a point to optimize from. Assuming c = 0, and with the slope of this line coming out to be m = 1.5, the prediction for a single point is given by PREDICTED HEIGHT (y1) = 0 + 1.5 * x1, where x1 is that point's x-coordinate. RESIDUAL/ERROR = ACTUAL POINT HEIGHT (y2) - PREDICTED HEIGHT (y1). The errors for the other points can be calculated in the same way. We then plot these residuals (combined into the Loss Function) against different values of the intercept, as shown below.
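
As a tiny sketch of that residual calculation (only m = 1.5 and the starting intercept c = 0 come from the text; the point's coordinates are made up):

```python
m, c = 1.5, 0.0      # slope from the text, intercept assumed to start at 0

x1, y2 = 2.0, 3.4    # hypothetical data point: x-coordinate and actual height
y1 = c + m * x1      # predicted height
residual = y2 - y1   # actual point height minus predicted height
print(residual)      # 0.4 for these made-up numbers
```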


Loss Function Vs Intercept
Calculation Illustration

          Looking at the Loss Function vs. Intercept graph, one might be terrified by how many intercept values would have to be tried, but Gradient Descent is far more efficient than that! What do I mean by that? Gradient Descent takes large steps when the cost is large and reduces its step size as the cost approaches its minimum.

The generalized equation for the squared residual at a point can be given by,

Squared Residual = (A.H - (c + 1.5 * x))^2

where,
A.H = actual point height
c = y-intercept
x = the point's x-coordinate (the slope m is fixed at 1.5 here)
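
Summing that squared term over all the data points gives the Loss Function as a function of the intercept c. A minimal sketch, with the slope fixed at 1.5 as in the text and a few made-up points:

```python
# Hypothetical data points: (x-coordinate, actual height)
points = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
m = 1.5  # slope kept fixed while the intercept is optimized

def loss(c):
    """Sum of squared residuals for a given intercept c."""
    return sum((y - (c + m * x)) ** 2 for x, y in points)

print(loss(0.0))  # loss at the starting intercept c = 0
```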

        The next step is to take the derivative of the squared terms at all the points w.r.t. c (the intercept). The derivative tells us the slope of the curve at any particular point. So when we plug a value of the intercept into the differentiated equation, we obtain the slope of the Loss curve at that intercept.
     
Obtaining slope on differentiation.
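
In symbols, differentiating (A.H - (c + 1.5 * x))^2 with respect to c gives -2 * (A.H - (c + 1.5 * x)), and summing this over all points gives the slope of the Loss curve at that intercept. A small sketch, using the same made-up points and fixed slope as above:

```python
# Same hypothetical points and fixed slope as in the earlier sketch
points = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
m = 1.5

def loss_slope(c):
    """Derivative of the sum of squared residuals with respect to the intercept c."""
    return sum(-2 * (y - (c + m * x)) for x, y in points)

print(loss_slope(0.0))  # slope of the Loss-vs-Intercept curve at c = 0
```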

        This slope is then multiplied by a factor called the 'Learning Rate' (usually around 0.1) to give the Step Size, and the New Intercept is obtained as Old Intercept - Step Size. As we come down the curve, the slope gets smaller than the previous value and so does the Step Size. This way the algorithm takes steps that are neither too large nor too small.
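
A single update step might look like this (the learning rate of 0.1 is from the text; the current intercept and slope value are made up for illustration):

```python
learning_rate = 0.1

current_intercept = 0.0
current_slope = -5.7   # hypothetical slope of the Loss curve at this intercept

step_size = learning_rate * current_slope      # large slope -> large step
new_intercept = current_intercept - step_size  # move the intercept downhill
print(new_intercept)
```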


        One might ask: how does the algorithm know when to stop calculating? Well, there is something known as the Exit Criterion. It states that if the slope obtained is less than or equal to 0.001, the algorithm should be terminated, and the values for which that latest slope was obtained should be noted.
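
Putting it all together, here is a minimal sketch of the whole loop with that exit criterion. The data points and fixed slope are the same illustrative choices as above, and the cap on iterations is my own safety net, not something from the text:

```python
points = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]  # hypothetical data points
m = 1.5                                        # slope kept fixed
learning_rate = 0.1

c = 0.0                                        # random starting intercept
for _ in range(1000):                          # safety cap on iterations (assumption)
    slope = sum(-2 * (y - (c + m * x)) for x, y in points)  # slope of the Loss curve
    if abs(slope) <= 0.001:                    # exit criterion from the text
        break
    c -= learning_rate * slope                 # step size = learning rate * slope

print(c)                                       # intercept where the loss is (near) minimal
```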

        Conclusion: in this way, the Gradient Descent algorithm works to satisfy the objective mentioned at the top of this blog. One should always remember that the claim is not entirely true; one may get the line with the least error, but it may not be the best-fitting one. One should also take care to avoid overfitting.
        
        Thank you!

Some useful references:
  1. https://www.geeksforgeeks.org/gradient-descent-algorithm-and-its-variants/
  2. https://youtu.be/sDv4f4s2SB8
