Multilayer Perceptron: Solving XOR Problem from Scratch in Python
In this blog we are going to explore how nonlinear problem like XOR can be solved using multi layer perceptron. We have already tried how to apply multi layer perceptron on majority function please have a look here.
We all are familiar with single layer perceptron (SLP) which are commonly used to classify problems that are linearly separable. If we choose a single layer perceptron for a nonlinearly separable problem, the results may not be as fruitful as we expect. As a result, we must look for an alternative solution to a nonlinear problem, and one such solution is the multilayer perceptron. Because AND and OR like linearly separable problem, a single layer perceptron is adequate to be used. However, the problem which we can not solve linearly like XOR, thus we employed multi layer perceptron with back propagation to solve it.
Forward Propagation
The network's synaptic weights are fixed in this phase, and the input signal is propagated layer by layer through the network until it reaches the output. As a result, alterations in this phase are limited to the activation potentials and outputs of the network's neurons.
Algorithm
 Initialize all weights and biases in the network
 While termination conditions are not satisfied
 For each training tuple X in D
 For each input layer unit
j: yj = xj
. Output of an input unit is its actual input value  For each hidden or output unit j, compute the net input of unit j with respect to the previous layer,
compute the output of each unit j.
$$ y_i = \frac{1}{1 + e^{vj}}$$
Back Propagation
The backpropagation algorithm is a popular approach for training multilayer perceptron. There are two stages while training of MLP.
 Forward Phase
 Backward Phase
Forward Phase
The network's synaptic weights are fixed in this phase, and the input signal is propagated layer by layer through the network until it reaches the output. As a result, alterations in this phase are limited to the activation potentials and outputs of the network's neurons.
Backward Phase
An error signal is generated in this step by comparing the network's output to a desired response. The ensuing error signal is then propagated backwards, layer by layer, via the network. During the second phase, the network's synaptic weights are gradually adjusted.
Algorithm
Forward algorithm is same as I wrote in single layer perceptron. Back propagation algorithm is as follows:

The error signal at the output of neuron j at iteration n is given by
$$ e_j(n) = d_j(n)  y_j(n) $$ where dj(n) is actual output and yj(n) is predicted output By MLP 
The total error energy E(n) for all the neurons in the output layer is therefore
$$ E(n) = \frac{1}{2} \sum(e_j^2(n))$$
Where c is the set of neuron in the output layer.

Let N be the total number of training vectors (examples). Then the average squared error is

Consider the neuron j the local field vj(n) and output yj(n) of neuron j is given by
$$ vj(n) = \sum_{i=1}^{m} wji(n)yi(n) $$
$$ yj(n) = \phi_j(v_j(n))$$ where y_j is output of neuron i and w_ji is weight of link from i to j. 
The correction
$$
\Delta w_{ij}(n) $$
made to the weight is proportional to the partial derivative $$\frac{ \delta(E)}{\delta(w_{ji}})$$ of instantaneous error
Using the chain rule of calculus, this gradient can be expressed as follows
 We can get the following partial derivatives
$$ \frac{\delta E(n)}{\delta e_j(n)} = e_j(n)$$ $$ \frac{\delta e_j(n)}{\delta y_j(n)} = 1 $$
$$ \frac{\delta y_j(n)}{\delta v_j(n)} = \phi'_j(v_j(n)) $$
Putting all partial derivatives in equation 1 we get,
$$ \frac{\delta E}{\delta w_{ij}} =  e_j(n)\phi'_j(v_j(n)) y_i(n).....(2)$$
Correction factor:
Using equation (2) and (3) we can write,
$$ \Delta w_{ij} =  \alpha e_j(n)\phi'_j(v_j(n)) y_i(n)......(4) $$
This can be written as,
$$ \delta w_{ij} = \alpha \delta y_j(n)$$ Where $$ \delta_j(n) = e_j(n)\phi'_j(v_j(n))......(5)$$
 The error term in equation 4 and 5 depends upon location of neuron in the MLP.
Implementation of multilayer perceptron for XOR problem from scratch
 Start by defining an activation function and its derivative, for us, Sigmoid.
# Training of XOR function using Backpropagation
import numpy as np
def sigmoid (x):
return 1/(1 + np.exp(x))
def sigmoid_derivative(x):
return x * (1  x)
 Prepare input datasets. Lets use XOR, for which output will be 1 if x1, x2 differs from one another.
#Input datasets
x = np.array([[0,0],[0,1],[1,0],[1,1]])
t = np.array([[0],[1],[1],[0]])
x.shape,t.shape
((4, 2), (4, 1))
 Set the epoch to the desired outcome and the learning parameter to the same value.
epochs = 100000
lr = 0.01
 Here, we are going to use multilayer perceptron of size 2,2,1. That means 2 input node in input layer one hidden layers of size 2 and output layer.
ILNeurons, HLNeurons, OLNeurons = 2,2,1
 Initialization of weight and bias for respective layer.
#Random weights and bias initialization
wh = np.random.uniform(size=(ILNeurons,HLNeurons))
#print(wh)
bh =np.random.uniform(size=(1,HLNeurons))
wo = np.random.uniform(size=(HLNeurons,OLNeurons))
#print(wo)
bo = np.random.uniform(size=(1,OLNeurons))
Forward Training
The output of input layer is the input for the first hidden layer and so on.
#Training algorithm
for i in range(epochs):
#Forward Propagation
vh= np.dot(x,wh)
vh = vh+bh
yh = sigmoid(vh)
vo = np.dot(yh,wo)
vo = vo+bo
yo = sigmoid(vo)
Backward Training
Backward Propagation Has to be done differently based on the position of neuron
Case I: Neuron j is output layer neuron The neuron j's desired response dj(n) is directly available. In this example, calculating the error ej(n) is trivial. To determine the weight update term, we can utilize equations 4 or 5.
Case II: Neuron j is hidden layer neuron Neuron j does not have any desirable responses. A hidden neuron's error signal must be calculated recursively in terms of the error signals of all neurons connected to it, as shown below.
#Training algorithm
for i in range(epochs):
#Forward Propagation
vh= np.dot(x,wh)
vh = vh+bh
yh = sigmoid(vh)
vo = np.dot(yh,wo)
vo = vo+bo
yo = sigmoid(vo)
#Backward propagation
error = t  yo
deltao = error * sigmoid_derivative(yo)
hidden_error = deltao.dot(wo.T)
deltah = hidden_error * sigmoid_derivative(yh)
#Updating Weights and Biases
wo += yh.T.dot(deltao) *lr
bo += np.sum(deltao,axis=0,keepdims=True) *lr
wh += x.T.dot(deltah)*lr
bh += np.sum(deltah,axis=0,keepdims=True) *lr
print("Final hidden weights: ",end='')
print(*wh)
print("Final hidden bias: ",end='')
print(*bh)
print("Final output weights: ",end='')
print(*wo)
print("Final output bias: ",end='')
print(*bo)
print("\nOutput from neural network after 10,000 epochs: ",end='')
print(*yo)
Final hidden weights: [5.81424148 3.65030952] [5.79863503 3.64731901]
Final hidden bias: [2.4062392 5.58225138]
Final output weights: [7.40755241] [8.04340317]
Final output bias: [3.32739567]
Output from neural network after 10,000 epochs: [0.06037213] [0.94396036] [0.94401095] [0.06078971]
In above result, the output [0.06037213] [0.94396036] [0.94401095] [0.06078971]
is near to our expected output [0 1 1 0]
and thus we can say that our network solved a XOR problem.