Principal Component Analysis (PCA) is a dimensionality reduction method. When a machine learning problem has a large number of features, we want to keep the information that contributes the most and discard what is less significant. This is where PCA comes into play. PCA is an unsupervised method: rather than picking a subset of the original features, it builds a small set of new features, the principal components, from the full collection. It is a statistical technique that converts observations of correlated features into a set of linearly uncorrelated features via an orthogonal transformation.
PCA on a Numerical Dataset
Given the following data, use PCA to reduce the dimension from 2 to 1.
Feature | EN.1 | EN.2 | EN.3 | EN.4 |
---|---|---|---|---|
x | 4 | 8 | 13 | 7 |
y | 11 | 4 | 5 | 14 |
The following steps produce the required number of principal components.
Step 1: Dataset
Feature | EN.1 | EN.2 | EN.3 | EN.4 |
---|---|---|---|---|
x | 4 | 8 | 13 | 7 |
y | 11 | 4 | 5 | 14 |
Number of features, n = 2
Number of samples, N = 4
Step 2: Computation of mean of variables
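The means follow directly from the table in Step 1. As a worked sketch (the symbols $$\bar{x}$$ and $$\bar{y}$$ are notation introduced here for the two feature means):

$$ \bar{x} = \frac{4 + 8 + 13 + 7}{4} = 8 $$
$$ \bar{y} = \frac{11 + 4 + 5 + 14}{4} = 8.5 $$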
Step 3: Computation of covariance matrix
The ordered pairs are
(x, x), (x, y), (y, x), (y, y)
With n variables there are $$n^2$$ ordered pairs.
(i) Covariance of all ordered pairs
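As a worked sketch, assuming the sample covariance with the $$N - 1$$ denominator (the convention that reproduces the matrix in (ii)) and the feature means $$\bar{x} = 8$$ and $$\bar{y} = 8.5$$:

$$ cov(x, y) = \frac{1}{N - 1} \sum_{k=1}^{N} (x_k - \bar{x})(y_k - \bar{y}) $$
$$ cov(x, x) = \frac{(-4)^2 + 0^2 + 5^2 + (-1)^2}{3} = \frac{42}{3} = 14 $$
$$ cov(x, y) = cov(y, x) = \frac{(-4)(2.5) + (0)(-4.5) + (5)(-3.5) + (-1)(5.5)}{3} = \frac{-33}{3} = -11 $$
$$ cov(y, y) = \frac{(2.5)^2 + (-4.5)^2 + (-3.5)^2 + (5.5)^2}{3} = \frac{69}{3} = 23 $$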
(ii) Covariance Matrix
$$ \begin{bmatrix} cov(x,x) & cov(x,y) \\ cov(y,x) & cov(y,y) \end{bmatrix} = \begin{bmatrix} 14 & -11 \\ -11 & 23 \end{bmatrix} $$
Step 4: Eigenvalues, eigenvectors, and normalized eigenvectors
(i) Eigenvalues
$$\lambda_1 = 30.38, \lambda_2 = 6.62$$
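The eigenvalues are the roots of the characteristic equation of the covariance matrix from Step 3. A worked sketch:

$$ \det \begin{bmatrix} 14 - \lambda & -11 \\ -11 & 23 - \lambda \end{bmatrix} = (14 - \lambda)(23 - \lambda) - 121 = \lambda^2 - 37\lambda + 201 = 0 $$
$$ \lambda = \frac{37 \pm \sqrt{37^2 - 4 \cdot 201}}{2} = \frac{37 \pm \sqrt{565}}{2} \approx 30.385, \; 6.615 $$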
(ii) Eigenvectors
The normalized eigenvector of 𝜆1 (the larger eigenvalue) is e1 = [0.55738997, -0.83025082]
The normalized eigenvector of 𝜆2 is e2 = [-0.83025082, -0.55738997]
Step 5: Derive New Dataset
Each entry of pc1 is the projection of one mean-centered sample onto e1, with the x mean 8 and the y mean 8.5 subtracted first.
Feature | EN.1 | EN.2 | EN.3 | EN.4 |
---|---|---|---|---|
pc1 | p11 | p12 | p13 | p14 |
for pc1:
$$ p11 = e1^T[4 - 8, 11 - 8.5]^T = e1^T[-4, 2.5]^T = -4.3052 $$
$$ p12 = e1^T[8 - 8, 4 - 8.5]^T = e1^T[0, -4.5]^T = 3.7361 $$
$$ p13 = e1^T[13 - 8, 5 - 8.5]^T = e1^T[5, -3.5]^T = 5.6928 $$
$$ p14 = e1^T[7 - 8, 14 - 8.5]^T = e1^T[-1, 5.5]^T = -5.1238 $$
Feature | EN.1 | EN.2 | EN.3 | EN.4 |
---|---|---|---|---|
pc1 | -4.305 | 3.736 | 5.693 | -5.124 |
Import Necessary Modules
import numpy as np
import pandas as pd
Dataset
# Understanding the mathematics behind PCA
# Rows are the features x and y; columns are the four samples
A = np.array([[4,8,13,7],
              [11,4,5,14]])
A.shape
(2, 4)
dataset = pd.DataFrame(A,columns = ['EN1','EN2','EN3','EN4'], index = ["x","y"])
dataset
Feature | EN1 | EN2 | EN3 | EN4 |
---|---|---|---|---|
x | 4 | 8 | 13 | 7 |
y | 11 | 4 | 5 | 14 |
Compute the mean of each variable
dataset.mean(axis=1)
x 8.0
y 8.5
dtype: float64
Compute the covariance matrix
cov_mat = dataset.T.cov()
cov_mat
Feature | x | y |
---|---|---|
x | 14.0 | -11.0 |
y | -11.0 | 23.0 |
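To make the denominator explicit, here is a minimal sketch that rebuilds the same matrix by hand with NumPy only. The names `centered`, `n_samples`, and `cov_manual` are illustrative, and the data is typed in again so the block runs on its own.

```python
import numpy as np

# Rows are the features x and y, columns are the four samples EN1..EN4.
A = np.array([[4, 8, 13, 7],
              [11, 4, 5, 14]], dtype=float)

# Center each feature by subtracting its row mean.
centered = A - A.mean(axis=1, keepdims=True)

# Sample covariance: divide by N - 1, where N is the number of samples.
n_samples = A.shape[1]
cov_manual = centered @ centered.T / (n_samples - 1)

print(cov_manual)
# np.cov treats rows as variables and also divides by N - 1 by default.
print(np.allclose(cov_manual, np.cov(A)))
```

Dividing by N instead of N - 1 would scale every entry by 3/4. That changes the eigenvalues but not the eigenvectors, so the principal directions stay the same.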
Computation of eigenvalues and eigenvectors
w, v = np.linalg.eig(cov_mat)
w
array([ 6.61513568, 30.38486432])
v
array([[-0.83025082, 0.55738997],
[-0.55738997, -0.83025082]])
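Two details of `np.linalg.eig` matter here. The eigenvalues are not returned in sorted order (in the output above the smaller one comes first), and the eigenvector that belongs to `w[i]` is the column `v[:, i]`, not the row `v[i]`. The sketch below orders the pairs from largest to smallest eigenvalue, assuming the covariance matrix is typed in by hand. The names `order`, `eigvals`, and `eigvecs` are illustrative.

```python
import numpy as np

cov = np.array([[14.0, -11.0],
                [-11.0, 23.0]])

w, v = np.linalg.eig(cov)

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(w)[::-1]
eigvals = w[order]
eigvecs = v[:, order]  # reorder the columns to match the eigenvalues

# The first column is now the direction of the first principal component.
e1 = eigvecs[:, 0]

print(eigvals)
# Check the defining property: cov @ e1 equals lambda_1 * e1.
print(np.allclose(cov @ e1, eigvals[0] * e1))
```

The sign of an eigenvector is arbitrary, so `e1` may come back as either [0.557, -0.830] or [-0.557, 0.830]. Both describe the same axis.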
# eigenvectors are the columns of v; take the column of the largest eigenvalue
e1 = v[:, [np.argmax(w)]].T
e1
array([[ 0.55738997, -0.83025082]])
e1.shape
(1, 2)
Computation of first principal component
p11 = np.dot(e1, np.array([[4-8],[11-8.5]]))
p11
array([[-4.30518692]])
p12 = np.dot(e1, np.array([[8-8],[4-8.5]]))
p12
array([[3.73612869]])
p13 = np.dot(e1, np.array([[13-8],[5-8.5]]))
p13
array([[5.69282771]])
p14 = np.dot(e1, np.array([[7-8],[14-8.5]]))
p14
array([[-5.12376947]])
# flatten the four 1x1 results into a single column of scores
pc1 = np.ravel([p11,p12,p13,p14])
pd.Series(pc1)
0 -4.305187
1 3.736129
2 5.692828
3 -5.123769
dtype: float64
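The four dot products can also be done in a single matrix product. This is a minimal sketch, not the only way to do it. It assumes the same data typed in again and uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, which returns eigenvalues in ascending order. Because the sign of a principal component is arbitrary, the scores may come out with all signs flipped.

```python
import numpy as np

# Rows are the features x and y, columns are the four samples.
A = np.array([[4, 8, 13, 7],
              [11, 4, 5, 14]], dtype=float)

centered = A - A.mean(axis=1, keepdims=True)
cov = np.cov(A)

# eigh returns eigenvalues in ascending order, so the last column
# of the eigenvector matrix belongs to the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
e1 = eigvecs[:, -1]

# Project every centered sample onto e1 in one step.
pc1 = e1 @ centered

print(pc1)
# The sample variance of the scores equals the largest eigenvalue.
print(np.isclose(pc1.var(ddof=1), eigvals[-1]))
```

For this data `eigvals[-1] / eigvals.sum()` is about 0.82, so the single component keeps roughly 82% of the total variance.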