Naive Bayes is a machine learning algorithm used for classification tasks. It is called "naive" because it makes a simplifying assumption about the data: that the features are independent of one another given the class. Despite this assumption, the algorithm has proven effective in many real-world applications.
Previously, we wrote blogs on many machine learning algorithms (classification, prediction) as well as many other topics to help you sharpen your understanding of how machines learn. Please visit our site; we would be happy to receive feedback from you to improve our writing. To see some of those posts, you can follow the links mentioned.
This post has been cross-posted from my GitHub page iamdurga.github.io.
In this blog, we are going to implement Naive Bayes from scratch. We could easily fit a Naive Bayes model from a library instead, but writing the code ourselves really helps you appreciate how the algorithm works. We already discussed how Naive Bayes works mathematically; if you want to see that derivation, click on the following link.
Naive Bayes Theorem
Naive Bayes works on the basis of Bayes' theorem:
$$P(A\mid B)=\frac {P(B\mid A) \cdot P(A)}{P(B)}$$
where,
P(A|B) = posterior probability
P(B|A) = conditional probability (likelihood)
P(A) = prior probability
P(B) = marginal probability (evidence).
Here, P(B) is the same for every class, so it does not affect which class ends up with the highest posterior; we therefore drop that term from the calculations.
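Concretely, under the naive assumption that the features are conditionally independent given the class, the rule we implement below picks the class that maximizes the prior multiplied by the product of the per-feature conditional probabilities:

$$\hat{y} = \underset{y \in \{0, 1\}}{\operatorname{arg\,max}}\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$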
Import Necessary Module
import pandas as pd
Data Load
Here we use the diabetes dataset. If you are interested in obtaining the dataset, please visit link. The dataset is used to predict whether or not a patient has diabetes on the basis of measurements such as pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, diabetes pedigree function, and age.
df = pd.read_csv('diabetes.csv')
df.head()
|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
Convert Continuous Data to Categorical Data
For this implementation of Naive Bayes we need the features in categorical form, but the available data is continuous. So our first task is to convert the data into categories while also dealing with missing values. To do this we use pandas' cut function, which works in the following manner:
pd.cut(x, bins, labels)
One way to choose the number of bins, if it is not fixed in advance, is from the range of the values:
M = maximum value of the array
m = minimum value of the array
R = M - m
bins = round(sqrt(R))
But in our case we provide the number of bins ourselves (three), so pd.cut splits the range between the minimum and maximum of each column into three equal-width intervals and assigns each value the label of the interval it falls into.
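A minimal sketch of what pd.cut does with three equal-width bins (a toy series, not our dataset):

import pandas as pd

# Toy example: six values split into three equal-width bins
s = pd.Series([1, 2, 3, 10, 20, 30])
print(pd.cut(s, bins=3, labels=['lower', 'medium', 'higher']))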
labels = ['lower','medium','higher']
for i in df.columns[:-1]:
    # Treat zeros as missing values and replace them with the column mean
    mean = df[i].mean()
    df[i] = df[i].replace(0, mean)
    # Bin each feature into three equal-width categories
    df[i] = pd.cut(df[i], bins=len(labels), labels=labels)
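After this loop every feature column is categorical and holds only the three labels; a quick check such as the following (output omitted, since the exact counts depend on the data) can confirm it:

# Each feature column should now contain only 'lower', 'medium' and 'higher'
print(df['Glucose'].value_counts())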
Count Function
The count function below simply counts how many rows have the given category in the given column and the given target class.
def count(df, colname, label, target):
    # Number of rows where the given column equals `label`
    # and the Outcome column equals `target`
    rule = (df[colname] == label) & (df['Outcome'] == target)
    return len(df[rule])

predicted = []                   # predictions for the test set
probabilities = {0: {}, 1: {}}   # conditional probabilities per class
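As a quick illustration of the helper (a hypothetical call; the exact number depends on the data):

# Rows whose Glucose value falls in the 'lower' bin and whose Outcome is 0
print(count(df, 'Glucose', 'lower', 0))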
Split Data into Train and Test Set
We use the first 70% of the rows for training, without any shuffling. This is not a good approach in general because it can introduce bias, so I strongly recommend using k-fold cross-validation or another random splitting procedure instead; a random alternative is sketched after the code below.
train_percent = 70
train_len = int((train_percent * len(df)) / 100)
train_X = df.iloc[:train_len, :]         # first 70% of rows, all columns
test_X = df.iloc[train_len + 1:, :-1]    # remaining rows, features only
test_y = df.iloc[train_len + 1:, -1]     # remaining rows, Outcome only
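A minimal sketch of a random split, assuming scikit-learn is available (this sketch is not used in the rest of the post):

from sklearn.model_selection import train_test_split

# Randomly hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :-1], df['Outcome'], test_size=0.3, random_state=42)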
Prior probability
Now, let's calculate the prior probability of each class using the following formulas:
P(Outcome = 0) = count(Outcome = 0) / total number of training rows
P(Outcome = 1) = count(Outcome = 1) / total number of training rows
total_0 = count(train_X, 'Outcome', 0, 0)   # training rows with Outcome == 0
total_1 = count(train_X, 'Outcome', 1, 1)   # training rows with Outcome == 1
prior_prob_0 = total_0 / len(train_X)
prior_prob_1 = total_1 / len(train_X)
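Since every training row has Outcome 0 or 1, the two priors should sum to one; a quick sanity check (values depend on the split):

print('P(Outcome=0) =', prior_prob_0)
print('P(Outcome=1) =', prior_prob_1)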
Conditional Probability
In Bayes' theorem there is another kind of probability, known as the conditional probability (the likelihood). Let's calculate it for every feature category and each class using the following formulas:
P(feature = category | Outcome = 0) = count(feature = category and Outcome = 0) / count(Outcome = 0)
P(feature = category | Outcome = 1) = count(feature = category and Outcome = 1) / count(Outcome = 1)
for col in train_X.columns[:-1]:
    probabilities[0][col] = {}
    probabilities[1][col] = {}
    for category in labels:
        # Rows of each class that fall into this category of the feature
        total_ct_0 = count(train_X, col, category, 0)
        total_ct_1 = count(train_X, col, category, 1)
        # Conditional probability of the category given the class
        probabilities[0][col][category] = total_ct_0 / total_0
        probabilities[1][col][category] = total_ct_1 / total_1
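One caveat: if some category never occurs with one of the classes in the training set, its conditional probability becomes zero and wipes out the whole product for that class. A common fix is Laplace (add-one) smoothing; a minimal sketch of how the last two lines above could be changed:

        # Laplace smoothing: add 1 to every count and the number of categories
        # (len(labels)) to every denominator, so no probability is exactly zero
        probabilities[0][col][category] = (total_ct_0 + 1) / (total_0 + len(labels))
        probabilities[1][col][category] = (total_ct_1 + 1) / (total_1 + len(labels))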
Posterior probability
Finally, the (unnormalized) posterior probability of a class is its prior probability multiplied by the conditional probabilities of all of the row's feature values:
Posterior probability ∝ Prior probability × product of conditional probabilities
For each test row we compute this product for both classes and predict the class with the larger value.
for row in range(0, len(test_X)):
    # Start each product with the class prior
    prod_0 = prior_prob_0
    prod_1 = prior_prob_1
    for feature in test_X.columns:
        # Multiply by the conditional probability of this row's category
        prod_0 *= probabilities[0][feature][test_X[feature].iloc[row]]
        prod_1 *= probabilities[1][feature][test_X[feature].iloc[row]]
    # Predict the class with the larger (unnormalized) posterior
    if prod_0 > prod_1:
        predicted.append(0)
    else:
        predicted.append(1)
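With only eight features this product is fine, but with many features multiplying small probabilities can underflow to zero. A common remedy is to sum log-probabilities instead; a minimal sketch of how the body of the loop above could be rewritten (assuming no zero probabilities, e.g. after smoothing):

import math

    # Inside the loop over rows: add log-probabilities instead of multiplying
    score_0 = math.log(prior_prob_0)
    score_1 = math.log(prior_prob_1)
    for feature in test_X.columns:
        score_0 += math.log(probabilities[0][feature][test_X[feature].iloc[row]])
        score_1 += math.log(probabilities[1][feature][test_X[feature].iloc[row]])
    predicted.append(0 if score_0 > score_1 else 1)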
Which class is predicted for most of the test set?
have_diabetes = 0
do_not_have = 0
for i in predicted:
    if i == 0:
        do_not_have += 1
    else:
        have_diabetes += 1
print(have_diabetes)
print(do_not_have)
print("Final prediction: the majority of patients in the test set do not have diabetes")
79
151
Final prediction: the majority of patients in the test set do not have diabetes
Accuracy of Our Classification Model
tp, tn, fp, fn = 0, 0, 0, 0
# Here class 0 (no diabetes) is treated as the "positive" class
for j in range(0, len(predicted)):
    if predicted[j] == 0:
        if test_y.iloc[j] == 0:
            tp += 1
        else:
            fp += 1
    else:
        if test_y.iloc[j] == 1:
            tn += 1
        else:
            fn += 1
print('Accuracy for training length ' + str(train_percent) + '% : ', ((tp + tn) / len(test_y)) * 100)
Accuracy for training length 70% : 80.0
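For comparison, the same binned data could be fed to scikit-learn's CategoricalNB (a sketch, assuming scikit-learn is installed; its default Laplace smoothing means the numbers may differ slightly from ours):

from sklearn.naive_bayes import CategoricalNB

# Encode the three bins as integers (lower/medium/higher -> 0/1/2)
enc_train = train_X.iloc[:, :-1].apply(lambda c: c.cat.codes)
enc_test = test_X.apply(lambda c: c.cat.codes)

# min_categories=3 guards against a bin that never appears in training
clf = CategoricalNB(min_categories=3)
clf.fit(enc_train, train_X['Outcome'])
print('sklearn accuracy:', (clf.predict(enc_test) == test_y).mean() * 100)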