PCA From Scratch In Python

Principal Component Analysis (PCA) is a method of dimensionality reduction. When a machine learning problem has a large number of features, we want to keep the features that contribute the most information and discard those that are less significant. This is where PCA comes into play. PCA is an unsupervised statistical technique that converts observations of possibly correlated features into a set of linearly uncorrelated features, called principal components, via an orthogonal transformation.
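For readers who just want the result, here is a minimal sketch of that idea using scikit-learn (this assumes scikit-learn is installed; everything after this point builds the same answer from scratch with NumPy instead):

import numpy as np
from sklearn.decomposition import PCA

# Four samples with two features each; scikit-learn expects samples in rows
X = np.array([[4, 11], [8, 4], [13, 5], [7, 14]])

pca = PCA(n_components=1)        # keep only the first principal component
scores = pca.fit_transform(X)    # 1-D coordinate for each sample
print(scores.ravel())            # the sign of the component is arbitrary
print(pca.explained_variance_)   # equals the largest eigenvalue of the covariance matrix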

PCA on a Numerical Dataset

Given the following data, use PCA to reduce the dimensionality from 2 to 1.

Feature   EN.1   EN.2   EN.3   EN.4
x         4      8      13     7
y         11     4      5      14

Following are the steps to obtain the required principal components.

Step 1: Dataset

Feature   EN.1   EN.2   EN.3   EN.4
x         4      8      13     7
y         11     4      5      14

Number of features, n = 2

Number of samples, N = 4

Step 2: Computation of mean of variables

$$\bar{x} = \frac{4 + 8 + 13 + 7}{4} = 8, \qquad \bar{y} = \frac{11 + 4 + 5 + 14}{4} = 8.5$$

Step 3: Computation of covariance matrix

The ordered pairs are
(x, x), (x, y), (y, x), (y, y)

If we have n variables, then we have $$n^2$$ ordered pairs. The sample covariance of a pair is

$$\mathrm{cov}(a, b) = \frac{1}{N-1}\sum_{i=1}^{N}(a_i - \bar{a})(b_i - \bar{b})$$

(i) Covariance of all ordered pairs

Using the deviations from the means $$\bar{x} = 8$$ and $$\bar{y} = 8.5$$:

$$\mathrm{cov}(x, x) = \frac{(-4)^2 + 0^2 + 5^2 + (-1)^2}{3} = \frac{42}{3} = 14$$

$$\mathrm{cov}(x, y) = \mathrm{cov}(y, x) = \frac{(-4)(2.5) + (0)(-4.5) + (5)(-3.5) + (-1)(5.5)}{3} = \frac{-33}{3} = -11$$

$$\mathrm{cov}(y, y) = \frac{(2.5)^2 + (-4.5)^2 + (-3.5)^2 + (5.5)^2}{3} = \frac{69}{3} = 23$$

(ii) Covariance Matrix

$$C = \begin{bmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) \end{bmatrix} = \begin{bmatrix} 14 & -11 \\ -11 & 23 \end{bmatrix}$$
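Before moving on, the same matrix can be reproduced from scratch with NumPy (a minimal sketch; np.cov uses the same N-1 denominator by default):

import numpy as np

x = np.array([4, 8, 13, 7])
y = np.array([11, 4, 5, 14])

# Sample covariance with the N-1 denominator, written out explicitly
def cov(a, b):
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

C = np.array([[cov(x, x), cov(x, y)],
              [cov(y, x), cov(y, y)]])
print(C)             # [[ 14. -11.]  [-11.  23.]]
print(np.cov(x, y))  # np.cov agrees (its default ddof is 1)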

Step 4: Eigenvalues, Eigenvectors, Normalized Eigenvectors

(i) Eigenvalues

The eigenvalues of the covariance matrix satisfy $$\det(C - \lambda I) = \lambda^2 - 37\lambda + 201 = 0$$, giving

$$\lambda_1 = 30.3849, \qquad \lambda_2 = 6.6151$$

(ii) Eigenvectors

Hence, the normalized eigenvector of $$\lambda_1 = 30.3849$$ is $$e_1 = [0.55738997, -0.83025082]^T$$

and the normalized eigenvector of $$\lambda_2 = 6.6151$$ is $$e_2 = [-0.83025082, -0.55738997]^T$$

(The sign of an eigenvector is arbitrary: $$-e_1$$ would serve equally well and would only flip the sign of the scores below.)
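A short NumPy check of these eigenpairs (a sketch; note that np.linalg.eig returns eigenvectors as the columns of its second return value, a detail that is easy to trip over):

import numpy as np

C = np.array([[14., -11.],
              [-11., 23.]])
w, v = np.linalg.eig(C)

# Eigenvectors are the COLUMNS of v; pair each column with its eigenvalue
for i in range(2):
    e = v[:, i]
    print(w[i], e, np.allclose(C @ e, w[i] * e))  # prints True for both pairs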

Step 5: Derive New Dataset

Feature   EN.1   EN.2   EN.3   EN.4
pc1       p11    p12    p13    p14

Each sample is first centered by subtracting the means $$(\bar{x}, \bar{y}) = (8, 8.5)$$ and then projected onto $$e_1$$.

for pc1:
$$ p_{11} = e_1^T[-4, 2.5]^T = -4.3052 $$
$$ p_{12} = e_1^T[0, -4.5]^T = 3.7361 $$
$$ p_{13} = e_1^T[5, -3.5]^T = 5.6928 $$
$$ p_{14} = e_1^T[-1, 5.5]^T = -5.1238 $$

Feature   EN.1      EN.2     EN.3     EN.4
pc1       -4.3052   3.7361   5.6928   -5.1238
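A quick sanity check on these scores (a small sketch): projections of centered data sum to zero, and their sample variance equals the largest eigenvalue $$\lambda_1$$.

import numpy as np

pc1 = np.array([-4.3052, 3.7361, 5.6928, -5.1238])
print(pc1.sum())        # approximately 0: centered data gives zero-mean scores
print(pc1.var(ddof=1))  # approximately 30.385, i.e. the largest eigenvalue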

Import Necessary Modules

import numpy as np
import pandas as pd

Dataset

# Understanding the mathematics behind PCA
A = np.array([[4, 8, 13, 7],
              [11, 4, 5, 14]])
A.shape
(2, 4)
dataset = pd.DataFrame(A, columns=['EN1', 'EN2', 'EN3', 'EN4'], index=['x', 'y'])
dataset
   EN1  EN2  EN3  EN4
x    4    8   13    7
y   11    4    5   14

Compute the mean of each variable

dataset.mean(axis=1)
x    8.0
y    8.5
dtype: float64

Compute the covariance matrix

cov_mat = dataset.T.cov()
cov_mat
      x     y
x  14.0 -11.0
y -11.0  23.0
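Equivalently, NumPy's np.cov gives the same matrix directly from the raw array, since its default ddof of 1 matches pandas:

np.cov(A)
array([[ 14., -11.],
       [-11.,  23.]])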

Computation of eigenvalues and eigenvectors

w, v = np.linalg.eig(cov_mat)
w
array([ 6.61513568, 30.38486432])
v
array([[-0.83025082,  0.55738997],
       [-0.55738997, -0.83025082]])
# Eigenvectors are the columns of v, matched with w by position;
# the largest eigenvalue is w[1], so its eigenvector is the second column
e1 = v[:, 1]
e1
array([ 0.55738997, -0.83025082])

Computation of first principal component

# Center each sample (subtract the means 8 and 8.5), then project onto e1
p11 = np.dot(e1, [4 - 8, 11 - 8.5])    # EN1
p12 = np.dot(e1, [8 - 8, 4 - 8.5])     # EN2
p13 = np.dot(e1, [13 - 8, 5 - 8.5])    # EN3
p14 = np.dot(e1, [7 - 8, 14 - 8.5])    # EN4
pc1 = pd.Series([p11, p12, p13, p14], index=['EN1', 'EN2', 'EN3', 'EN4'])
pc1
EN1   -4.305187
EN2    3.736129
EN3    5.692828
EN4   -5.123769
dtype: float64
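Putting the whole derivation together, here is a compact from-scratch helper (a sketch under the same conventions as this post: features in rows, sample covariance with the N-1 denominator, eigenvectors taken as columns; pca_scores is a name chosen here for illustration):

import numpy as np

def pca_scores(A, k=1):
    """Project data A (n_features x n_samples) onto its top k principal components."""
    centered = A - A.mean(axis=1, keepdims=True)  # subtract each feature's mean
    C = np.cov(centered)                          # covariance matrix (ddof=1 by default)
    w, v = np.linalg.eig(C)
    order = np.argsort(w)[::-1]                   # largest eigenvalue first
    components = v[:, order[:k]]                  # top-k eigenvectors as columns
    return components.T @ centered                # k x n_samples matrix of scores

A = np.array([[4, 8, 13, 7],
              [11, 4, 5, 14]])
print(pca_scores(A))   # matches pc1 above (the overall sign may flip on other platforms)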
