Gradient Descent

Gradient descent is the most popular optimizer for updating parameters. It uses the gradient of the error with respect to each parameter. The variants of gradient descent differ in how the update is computed, mainly in how much of the data is used for each update.
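As a compact reference, the update shared by all the variants can be written as below. The symbols are notation introduced here, not taken from the code: θ stands for the parameters (slope and intercept in this post), α for the learning rate, and J for the error.

```latex
\theta \leftarrow \theta - \alpha \, \nabla_{\theta} J(\theta)
```

The variants differ in which examples are used to evaluate the gradient of J at each step.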

Mini-Batch Gradient Descent

It is a simple algorithm in which we update the parameters once per batch. Each batch contains a fixed number of examples.

Import necessary modules

pandas : for working with DataFrames
numpy : for array operations
matplotlib : for visualization

Read CSV data into a pandas DataFrame

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('F:/MDS-Private-Study-Materials/Second Semester/Applied Machine Learning/data.csv')
df.shape

(100, 2)

df.head()

	x	y
0	32.502345	31.707006
1	53.426804	68.777596
2	61.530358	62.562382
3	47.475640	71.546632
4	59.813208	87.230925

Split Data

We split the data into a training set and a test set in a 70:30 ratio.

split = int(len(df)*0.7)
train_x, test_x = df[:split].x , df[split:].x
train_y, test_y = df[:split].y , df[split:].y

len(train_x)
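The split above takes the first 70% of the rows in file order. If the rows happen to be sorted, it may be safer to shuffle before splitting. Here is a minimal sketch of that idea; `shuffled_split`, its `seed` argument and the demo arrays are hypothetical and not part of the original code.

```python
import numpy as np

def shuffled_split(x, y, train_fraction=0.7, seed=0):
    """Shuffle paired arrays, then split them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))          # random row order
    split = int(len(x) * train_fraction)     # same rule as the post: int(len * 0.7)
    train_idx, test_idx = order[:split], order[split:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical stand-in for the 100-row data set.
x = np.arange(100.0)
y = 2.0 * x
train_x_demo, test_x_demo, train_y_demo, test_y_demo = shuffled_split(x, y)
print(len(train_x_demo), len(test_x_demo))
```

With the DataFrame from this post, the inputs would be `df.x.to_numpy()` and `df.y.to_numpy()`.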

Create class `Mini_batch_gradient_descent`

Create the method `create_batch` inside the class. It takes `train_x`, `train_y` and `batch_size` as parameters. We create `mini_batches = []` to store the batches. `data = np.stack((train_x,train_y), axis=1)` joins `train_x` and `train_y` as the two columns of one array, with the x values in the first column and the y values in the second. The number of full batches is the number of rows integer-divided by the batch size.

We loop over the range of `no_of_batches`, and `mini_batch = data[i * batch_size: (i+1)* batch_size]` slices out one batch: with a `batch_size` of 10, the first iteration takes rows 0 to 10, the next takes rows 10 to 20, and so on. Each batch is appended to `mini_batches` as a pair of x values and y values. If the number of rows is not exactly divisible by the batch size, some rows are left over, so we check `train_x.shape[0]%batch_size != 0` and append the remaining rows as one final, smaller batch. Finally we return `mini_batches`.

class Mini_batch_gradient_descent:
    def create_batch(self, train_x, train_y, batch_size):
        # Pair each x with its y: column 0 holds x, column 1 holds y.
        mini_batches = []
        data = np.stack((train_x, train_y), axis=1)

        no_of_batches = train_x.shape[0] // batch_size

        # Full batches of exactly batch_size rows.
        for i in range(no_of_batches):
            mini_batch = data[i * batch_size: (i + 1) * batch_size]
            mini_batches.append((mini_batch[:, 0], mini_batch[:, 1]))

        # Leftover rows, if any, form one final smaller batch.
        if train_x.shape[0] % batch_size != 0:
            mini_batch = data[no_of_batches * batch_size:]
            mini_batches.append((mini_batch[:, 0], mini_batch[:, 1]))
        return mini_batches
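To see how batching behaves when the rows do not divide evenly, here is a small standalone sketch. `make_batches` is a hypothetical helper written for this illustration, not the class method itself; it steps through the data in strides of `batch_size` so the last slice naturally picks up the leftover rows.

```python
import numpy as np

def make_batches(x, y, batch_size):
    """Split paired arrays into consecutive (x_batch, y_batch) tuples."""
    x = np.asarray(x)
    y = np.asarray(y)
    batches = []
    for start in range(0, len(x), batch_size):
        stop = start + batch_size            # slicing past the end is safe in NumPy
        batches.append((x[start:stop], y[start:stop]))
    return batches

x = np.arange(10)
y = 2 * x
sizes = [len(xb) for xb, _ in make_batches(x, y, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Ten rows with a batch size of 4 give two full batches and one final batch of 2 rows, so no example is dropped or repeated.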

Model Fitting

We define the method `fit()` with the parameters `train_x`, `train_y`, `alpha`, `epochs` and `batch_size`. The slope and intercept are first initialized to random values, and `l` holds the total number of observations. We loop over the range of epochs and, in each epoch, call the batch function explained in the previous code block. We then loop over the batches, with `xb` holding the x values and `yb` the y values of the current batch, and reshape both: `xb.reshape(-1,1)` keeps all rows but arranges them in a single column, and the same is done for `yb`. Next we compute the linear regression prediction `yp` and the mean squared error on the batch. Finally we compute the gradients of the error with respect to the slope and the intercept and use them to update both parameters.

import time 
class Mini_batch_gradient_descent:

    def create_batch(self,train_x, train_y, batch_size):
        mini_batches = []
        data = np.stack((train_x,train_y), axis=1)

        no_of_batches = train_x.shape[0] // batch_size

        # Full batches of exactly batch_size rows.
        for i in range(no_of_batches):
            mini_batch = data[i * batch_size: (i + 1) * batch_size]
            mini_batches.append((mini_batch[:, 0], mini_batch[:, 1]))

        # Leftover rows, if any, form one final smaller batch.
        if train_x.shape[0] % batch_size != 0:
            mini_batch = data[no_of_batches * batch_size:]
            mini_batches.append((mini_batch[:, 0], mini_batch[:, 1]))
        return mini_batches

    def fit(self, train_x, train_y, alpha, epochs, batch_size, show_every=1000):

        self.m = np.random.randn(1,1)
        self.c = np.random.randn(1,1)

        l = len(train_x)

        for i in range(epochs):

            batches = self.create_batch(train_x, train_y, batch_size)

            t1=time.time()
            for batch in batches:

                xb = batch[0]
                yb = batch[1]
                n = len(xb)

                xb = xb.reshape(-1,1)
                #print(xb)
                yb = yb.reshape(-1,1)
                #print(yb)

                delta_intercept = None
                delta_slope = None
                yp = np.dot(xb, self.m)+self.c

                err=(0.5/n) * np.sum((yb-yp)**2)

                delta_intercept = -(1/n) * np.sum(yb-yp)
                delta_slope = -(1/n) * np.sum((yb-yp)*xb)

                self.m = self.m - alpha* (delta_slope)
                self.c = self.c - alpha * (delta_intercept)
            if i% show_every==0:
                print(f"Iteration {i}: Error: {err}, time = {(time.time()-t1)/60}")

    def slope_intercept(self):
        print(f"Slope is {self.m[0][0]}")
        print(f"Intercept is {self.c[0][0]}")

    def predict(self, test_x):
        # Accept a pandas Series or NumPy array and turn it into a column vector.
        test_x = np.asarray(test_x).reshape(-1, 1)
        # Prediction is slope * x + intercept.
        result = np.dot(test_x, self.m) + self.c
        return result

We have also defined the methods `slope_intercept` and `predict` in the code block above. `predict` takes the `test_x` values and returns the predicted y values.
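For reference, the error and gradients that `fit()` computes on each batch can be written out as follows. Here n is the number of examples in the batch and ŷ is the prediction `yp`; the subscript i is notation added here.

```latex
\hat{y}_i = m x_i + c, \qquad
J(m, c) = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

\frac{\partial J}{\partial m} = -\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)\, x_i, \qquad
\frac{\partial J}{\partial c} = -\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)

m \leftarrow m - \alpha \frac{\partial J}{\partial m}, \qquad
c \leftarrow c - \alpha \frac{\partial J}{\partial c}
```

These correspond to `err`, `delta_slope`, `delta_intercept` and the two update lines in the code.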

Function call

We first call the model on our own dummy data to check that it works properly. The data follow y = 1 + 1500x exactly, so the model should give an intercept c close to 1 and a slope m close to 1500, and the output below shows that it does.

mgd = Mini_batch_gradient_descent()

mgd.fit(np.arange(1,500)/500, 1+3*np.arange(1,500),0.01,5000, 32)

Iteration 0: Error: 759204.9669527125, time = 2.5204817454020183e-05
Iteration 1000: Error: 6.181828420803425e-05, time = 0.0
Iteration 2000: Error: 4.176020605988289e-14, time = 0.0
Iteration 3000: Error: 7.235164641068069e-23, time = 1.659393310546875e-05
Iteration 4000: Error: 7.235164641068069e-23, time = 1.1889139811197917e-05

mgd.slope_intercept()

Slope is 1499.999999999971
Intercept is 1.0000000000151974
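As an independent check of those two numbers, the same dummy data can be fitted with NumPy's least-squares polynomial fit. This is only a sanity check and not part of the gradient descent code.

```python
import numpy as np

# Same dummy data as in the call above: y = 1 + 1500 * x.
x = np.arange(1, 500) / 500
y = 1 + 3 * np.arange(1, 500)

# Degree-1 polynomial fit returns the coefficients as (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

The slope and intercept should come out very close to 1500 and 1, matching what gradient descent found.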

It seems that our model did pretty well. Let's try it on our original data.

mgd.fit(train_x, train_y,0.0001,5000, 16)

Iteration 0: Error: 273.93246976150095, time = 8.436044057210286e-06
Iteration 1000: Error: 47.568936899285255, time = 1.6987323760986328e-05
Iteration 2000: Error: 47.47580315198226, time = 1.659393310546875e-05
Iteration 3000: Error: 47.38452227067236, time = 0.0
Iteration 4000: Error: 47.29505307924061, time = 1.6828378041585285e-05

Let's see the fitted model on our data.

plt.scatter(train_x, train_y)
plt.plot(train_x, np.dot(train_x.to_numpy().reshape(-1,1), mgd.m)+mgd.c, c="r")
plt.show()


Since the data points are scattered, the best-fit line passes through the middle of them.
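We kept a test set aside earlier, so one natural next step is to measure the error on it. Below is a minimal sketch, assuming the fitted slope and intercept are available as numbers or 1x1 arrays; `half_mse` is a hypothetical helper that computes the same half mean squared error the training loop prints.

```python
import numpy as np

def half_mse(x, y, m, c):
    """Half mean squared error of the line y = m * x + c on the given data."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    m = np.asarray(m, dtype=float).reshape(1, 1)   # works for scalars and 1x1 arrays
    c = np.asarray(c, dtype=float).reshape(1, 1)
    predictions = x @ m + c
    return (0.5 / len(x)) * np.sum((y - predictions) ** 2)

# A perfect fit gives zero error.
print(half_mse([0, 1, 2], [1, 3, 5], m=2, c=1))  # 0.0
```

With the fitted model from this post, the call would look like `half_mse(test_x, test_y, mgd.m, mgd.c)`.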

Batch Gradient Descent

The function `batch_gradient_descent()` takes the parameters `train_x`, `y_true`, `epochs` and `learning_rate`. It initializes `m` and `c` to random values and creates an empty list `errs` to keep track of the error at each iteration. `l` is the total number of observations. Inside a for loop over the range of epochs, we define `xb` and `yb` from `train_x` and `y_true`, where `.reshape(-1,1)` keeps all rows but arranges them in a single column. We then compute the prediction `yp` and the mean squared error, compute the gradient for both the slope and the intercept, and update the parameters `m` and `c`. Unlike the mini-batch version, every update uses the whole training set. As before, we try the function on dummy data before running it on the real data to make sure it works properly.

def batch_gradient_descent(train_x, y_true, epochs, learning_rate = 0.001, show_every=1000):

    m = np.random.randn(1,1)
    c = np.random.randn(1,1)
    l = len(train_x)
    errs = []

    for i in range(epochs):
        t1= time.time()

        xb = train_x.reshape(-1,1)
        yb = y_true.reshape(-1,1)

        delta_intercept = None
        delta_slope = None
        yp = np.dot(xb, m)+c

        n = len(xb)
        err=(0.5/n) * np.sum((yb-yp)**2)

        delta_intercept = -(1/n) * np.sum(yb-yp)
        delta_slope = -(1/n) * np.sum((yb-yp)*xb)

        m = m - learning_rate * delta_slope
        c = c- learning_rate* delta_intercept
        errs.append(err)
        if i%show_every==0:
            print(f"Iteration {i}: Error: {err}, time: {(time.time()-t1)/60}")
    return m, c, errs

Let's use the simple example again.

m,c,errs=batch_gradient_descent(np.arange(1,500)/500, 1+3*np.arange(1,500),5000,0.1)

Iteration 0: Error: 377387.06815160357, time: 1.781781514485677e-05
Iteration 1000: Error: 0.11260563371092129, time: 0.0
Iteration 2000: Error: 2.2137776134628694e-07, time: 0.0
Iteration 3000: Error: 4.3521897372030944e-13, time: 0.0
Iteration 4000: Error: 8.557543667725901e-19, time: 1.662572224934896e-05

array([[1500.]])

The slope comes out as 1500, which is expected because we constructed the data that way.

In our case the error decreases rapidly in the first iterations and only gradually in the later ones.

plt.plot(errs)


Using original Data

m,c,errs = batch_gradient_descent(train_x.to_numpy(), train_y.to_numpy(),5000,0.0001)

plt.plot(errs)

Iteration 0: Error: 2621.494652507566, time: 0.0
Iteration 1000: Error: 58.74680115227563, time: 0.0
Iteration 2000: Error: 58.745205635524144, time: 1.659393310546875e-05
Iteration 3000: Error: 58.743620912278544, time: 0.0
Iteration 4000: Error: 58.74204690952185, time: 0.0


It did pretty well.

Stochastic Gradient Descent

If we set batch_size = 1 in mini-batch gradient descent, it becomes stochastic gradient descent: the parameters are updated after every single example.

class Stochastic_gradient_descent:
    def create_batch(self,train_x, train_y, batch_size):
        mini_batches = []
        data = np.stack((train_x,train_y), axis=1)

        no_of_batches = train_x.shape[0] // batch_size

        # Full batches of exactly batch_size rows.
        for i in range(no_of_batches):
            mini_batch = data[i * batch_size: (i + 1) * batch_size]
            mini_batches.append((mini_batch[:, 0], mini_batch[:, 1]))

        # Leftover rows, if any, form one final smaller batch.
        if train_x.shape[0] % batch_size != 0:
            mini_batch = data[no_of_batches * batch_size:]
            mini_batches.append((mini_batch[:, 0], mini_batch[:, 1]))
        return mini_batches

    def fit(self, train_x, train_y, alpha, epochs, batch_size, show_every=1000):

        self.m = np.random.randn(1,1)
        self.c = np.random.randn(1,1)

        l = len(train_x)
        self.errs = []
        for i in range(epochs):

            batches = self.create_batch(train_x, train_y, batch_size)

            t1=time.time()
            for batch in batches:

                xb = batch[0]
                yb = batch[1]
                n = len(xb)

                xb = xb.reshape(-1,1)
                #print(xb)
                yb = yb.reshape(-1,1)
                #print(yb)

                delta_intercept = None
                delta_slope = None
                yp = np.dot(xb, self.m)+self.c

                err=(0.5/n) * np.sum((yb-yp)**2)

                delta_intercept = -(1/n) * np.sum(yb-yp)
                delta_slope = -(1/n) * np.sum((yb-yp)*xb)

                self.m = self.m - alpha* (delta_slope)
                self.c = self.c - alpha * (delta_intercept)
                self.errs.append(err)
            if i%show_every==0:
                print(f"Iteration {i}: Error: {err}, time = {(time.time()-t1)/60}") 

    def slope_intercept(self):
        print(f"Slope is {self.m[0][0]}")
        print(f"Intercept is {self.c[0][0]}")

    def predict(self, test_x):
        # Accept a pandas Series or NumPy array and turn it into a column vector.
        test_x = np.asarray(test_x).reshape(-1, 1)
        # Prediction is slope * x + intercept.
        result = np.dot(test_x, self.m) + self.c
        return result

Again, let's use the simple example.

mgd = Stochastic_gradient_descent()

mgd.fit(np.arange(1,500)/500, 1+3*np.arange(1,500),0.01,5000, 1)

Iteration 0: Error: 7859.2012987632725, time = 0.0008223692576090495
Iteration 1000: Error: 6.462348535570529e-23, time = 0.00018201271692911783
Iteration 2000: Error: 6.462348535570529e-23, time = 0.0001661062240600586
Iteration 3000: Error: 6.462348535570529e-23, time = 0.000166932741800944
Iteration 4000: Error: 6.462348535570529e-23, time = 0.00018284320831298828

Let's see the parameters and the error.

mgd.m, mgd.c

(array([[1500.]]), array([[1.]]))

plt.plot(mgd.errs)
plt.show()
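The class above visits the examples in the same order in every epoch. A common variation is to shuffle the order each epoch. Here is a minimal standalone sketch of that variation; `sgd_linear`, its defaults and the small demo data are assumptions made for this illustration, and the shuffling is an addition that the class above does not do.

```python
import numpy as np

def sgd_linear(x, y, alpha=0.01, epochs=50, seed=0):
    """Fit y = m * x + c with one parameter update per example."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, c = 0.0, 0.0
    for _ in range(epochs):
        # Visit the examples in a fresh random order each epoch.
        for i in rng.permutation(len(x)):
            residual = y[i] - (m * x[i] + c)
            # Gradient of 0.5 * residual**2 is -residual * x (slope), -residual (intercept).
            m += alpha * residual * x[i]
            c += alpha * residual
    return m, c

x = np.linspace(0, 1, 50)
y = 2 * x + 1
m_sgd, c_sgd = sgd_linear(x, y, alpha=0.1, epochs=500)
print(m_sgd, c_sgd)
```

On this noise-free line the estimates should approach a slope of 2 and an intercept of 1.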


Momentum Gradient Descent

In ordinary gradient descent, a parameter is updated using only the gradient of the current step. With momentum, the update also uses the previous step: we add a small fraction of the previous gradient to the current one. Where ordinary gradient descent can get stuck in a local minimum, momentum can provide an extra push to get out of it.

def momentum_batch_gradient_descent(train_x, y_true, epochs, 
                                   learning_rate = 0.001, momentum_rate=0.0001, show_every=1000):
    m = np.random.randn(1,1)
    c = np.random.randn(1,1)
    l = len(train_x)

    pds = 0
    pdi = 0
    errs = []

    for i in range(epochs):

        xb = train_x.reshape(-1,1)
        yb = y_true.reshape(-1,1)

        delta_intercept = None
        delta_slope = None
        yp = np.dot(xb, m)+c

        n = len(xb)
        err=(0.5/n) * np.sum((yb-yp)**2)

        delta_intercept = -(1/n) * np.sum(yb-yp)
        delta_slope = -(1/n) * np.sum((yb-yp)*xb)

        if i%show_every==0:
            print(f"Iteration {i}: Error: {err}")

        # Step against the current gradient plus a small fraction of the previous one.
        m = m - learning_rate * delta_slope - momentum_rate * pds
        c = c - learning_rate * delta_intercept - momentum_rate * pdi

        pds = delta_slope
        pdi = delta_intercept
        errs.append(err)

    return m, c, errs

m,c,errs = momentum_batch_gradient_descent(train_x.to_numpy(), train_y.to_numpy(), 5000, 0.0001, 0.00001)

Iteration 0: Error: 2299.3333871682066
Iteration 1000: Error: 58.997090247292114
Iteration 2000: Error: 58.99412941821364
Iteration 3000: Error: 58.99118662197093
Iteration 4000: Error: 58.98826174873571

plt.plot(errs)
plt.show()
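The function above reuses only the gradient from the immediately preceding step. A more common formulation of momentum keeps a running "velocity" that accumulates all past gradients with a decay factor. Here is a minimal sketch of that formulation; `momentum_gd`, the velocity variables and the default values are assumptions for this illustration, not the function used in this post.

```python
import numpy as np

def momentum_gd(x, y, learning_rate=0.1, momentum=0.9, epochs=2000):
    """Batch gradient descent for y = m * x + c with a velocity term."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    m, c = 0.0, 0.0
    v_m, v_c = 0.0, 0.0                      # velocities start at rest
    for _ in range(epochs):
        residual = y - (m * x + c)
        grad_m = -(1 / n) * np.sum(residual * x)
        grad_c = -(1 / n) * np.sum(residual)
        # Decay the old velocity, then add the new downhill step.
        v_m = momentum * v_m - learning_rate * grad_m
        v_c = momentum * v_c - learning_rate * grad_c
        m += v_m
        c += v_c
    return m, c

x = np.linspace(0, 1, 50)
y = 2 * x + 1
m_mom, c_mom = momentum_gd(x, y)
print(m_mom, c_mom)
```

On this noise-free line the result should be very close to a slope of 2 and an intercept of 1.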


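Comparison Between the Three Gradient Descent Variants in Terms of Time

Per epoch, batch gradient descent and mini-batch gradient descent are faster than stochastic gradient descent, because they process many examples in one vectorized update while stochastic gradient descent performs one update per example. In the timings printed above, an epoch of stochastic gradient descent took roughly ten times longer than an epoch of mini-batch gradient descent.
In terms of memory, stochastic gradient descent needs the least per update, since it looks at one example at a time. On the dummy data above it also reached a near-zero error within the first 1000 epochs.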
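One way to see where the time difference comes from is to count parameter updates per epoch. Here is a small sketch; `updates_per_epoch` is a hypothetical helper, and 70 is the size of the training set used in this post (70% of 100 rows).

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(n_examples / batch_size)

n = 70
print(updates_per_epoch(n, n))   # batch gradient descent: 1
print(updates_per_epoch(n, 16))  # mini-batch with batch_size 16: 5
print(updates_per_epoch(n, 1))   # stochastic gradient descent: 70
```

Each update has some fixed Python overhead, so more updates per epoch means a slower epoch, even though each individual update touches less data.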

There are some more blog posts about gradient descent on our site, for example:

Linear Regression Using Different Gradient Descent
