How to Preprocess a Dataset Before Applying Machine Learning Algorithms

First, import the packages required to continue.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Read the dataset using Pandas

Read the dataset file (train.csv) into a variable named df and display the first few rows with head(). By default, head() returns the first 5 rows of the dataset, but you can request any number of rows, for example head(10).

df = pd.read_csv('train.csv')
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
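As a minimal sketch of how head() behaves with an explicit row count, the snippet below reads a tiny in-memory CSV instead of train.csv. The names csv_text and sample are hypothetical, and the three rows are a hand-copied subset of the columns shown above.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for train.csv (assumption for illustration)
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
"""

# read_csv accepts any file-like object, so StringIO works like a file on disk
sample = pd.read_csv(io.StringIO(csv_text))

# Ask for the first 2 rows instead of the default 5
first_two = sample.head(2)
print(first_two)
```

If the frame has fewer rows than requested, head() simply returns all of them.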

# Check the Dimension of the Dataset

Let's check some basic information about the dataset by running a few simple commands.

# df.shape returns the number of rows and columns in the dataset
df.shape
(891, 12)
# df.columns returns the column headings
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
# df.isnull().sum() returns the number of null values in each column
df.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
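Raw counts are easier to judge as a share of all rows. The following is a small sketch, on a made-up four-row frame (sample is a hypothetical name), of turning the null check into percentages with isnull().mean().

```python
import numpy as np
import pandas as pd

# Made-up frame with missing values (assumption for illustration)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the percentage of missing entries per column
missing_pct = sample.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

On the real dataset, the same expression applied to df would show that most Cabin values are missing (687 of 891 rows).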

# Observations

The target variable, "Survived", is what we will predict once the data is preprocessed, so we keep that column.

Only the "Age", "Cabin", and "Embarked" columns have missing values.

In their raw form, "PassengerId", "Name", and "Ticket" contribute little to predicting the target variable.

We can create a new column, "FamilySize", because "Parch" (parents/children) and "SibSp" (siblings/spouses) both describe a passenger's family.

The categorical columns that need to be encoded as numerical values are "Sex", "Cabin", and "Embarked".

# Dropping columns

In this step, we remove the columns that matter least for prediction, such as "Ticket" and "PassengerId". To remove the columns, use drop().

# Drop the Ticket and PassengerId columns
df.drop(['Ticket', 'PassengerId'], axis=1, inplace=True)
df.head()
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 NaN S
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C85 C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 C123 S
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 NaN S
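One possible variation, sketched here on a hypothetical two-row frame, is to pass the columns keyword and keep the result in a new variable rather than modifying the frame in place. This leaves the original data available if a dropped column is needed again.

```python
import pandas as pd

# Hypothetical two-row frame (assumption for illustration)
sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Fare": [7.25, 71.2833],
})

# columns= makes the intent explicit; without inplace=True a new frame is returned
trimmed = sample.drop(columns=["Ticket", "PassengerId"])

print(trimmed.columns.tolist())
print(sample.columns.tolist())  # the original frame is unchanged
```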

# Now, let’s run df.info() and look at the state of our dataset.

df.info()

RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
#   Column    Non-Null Count  Dtype
---  ------    --------------  -----
0   Survived  891 non-null    int64
1   Pclass    891 non-null    int64
2   Name      891 non-null    object
3   Sex       891 non-null    object
4   Age       714 non-null    float64
5   SibSp     891 non-null    int64
6   Parch     891 non-null    int64
7   Fare      891 non-null    float64
8   Cabin     204 non-null    object
9   Embarked  889 non-null    object
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB

# Which columns have missing values?

The ‘Cabin’, ‘Embarked’, and ‘Age’ columns have missing values.

# Creating a new feature from "Cabin"

Although the Cabin column has 687 missing values, each value that is present begins with a letter that identifies the deck. We will therefore create a column named Deck to hold this letter, which we can use later in our prediction.

# User-defined function: return the deck letter, or 'Unknown' if the cabin is missing
def assignDeckValue(CabinCode):
    if pd.isnull(CabinCode):
        category = 'Unknown'
    else:
        category = CabinCode[0]
    return category

Deck = np.array([assignDeckValue(cabin) for cabin in df['Cabin'].values])

df = df.assign(Deck = Deck)
df.head()
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Deck
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 NaN S Unknown
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C85 C C
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S Unknown
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 C123 S C
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 NaN S Unknown
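The same Deck column can probably be built without a helper function. The sketch below, on a hypothetical five-row Cabin column, uses the pandas string accessor: .str[0] takes the first character and leaves missing entries as NaN, which fillna() then replaces.

```python
import numpy as np
import pandas as pd

# Hypothetical Cabin values, including missing ones (assumption for illustration)
sample = pd.DataFrame({"Cabin": [np.nan, "C85", np.nan, "C123", "E46"]})

# First character of each cabin code; NaN stays NaN until fillna replaces it
sample["Deck"] = sample["Cabin"].str[0].fillna("Unknown")

print(sample)
```

This is an alternative to assignDeckValue, not a change to it; both give the same labels.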

# Derive a new column

‘Parch’ and ‘SibSp’ both describe family size, so let’s derive a new column named ‘FamilySize’.

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
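A short worked example of this arithmetic, on three hypothetical rows whose SibSp and Parch values match rows 0, 2, and 13 of the dataset. The extra 1 is presumably there to count the passenger themselves, so someone travelling alone gets a family size of 1 rather than 0.

```python
import pandas as pd

# Three rows with SibSp/Parch values taken from the outputs in this article
sample = pd.DataFrame({"SibSp": [1, 0, 1], "Parch": [0, 0, 5]})

# Siblings/spouses + parents/children + the passenger
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

print(sample)
```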

‘Name’: Rather than dropping this column right away, we first extract each passenger’s title from it.

# Use a regular expression to extract the title of the passenger
df['Title'] = df.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Changing to common category
df['Title'] = df['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Countess', 'Sir', 'Jonkheer', 'Lady', 'Capt', 'Don'], 'Others')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df.head()
Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Deck FamilySize Title
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 NaN S Unknown 2 Mr
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C85 C C 2 Mrs
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S Unknown 1 Miss
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 53.1000 C123 S C 2 Mrs
4 0 3 Allen, Mr. William Henry male 35.0 0 0 8.0500 NaN S Unknown 1 Mr
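To see what the pattern captures, here is a small sketch on a handful of names. The first three come from the output above; the last two ("Doe, Mlle. Jane" and "Roe, Dr. John") are invented for illustration. It also shows that the separate replace() calls could be collapsed into a single dictionary, here called title_map (a hypothetical name, and only a partial mapping).

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Doe, Mlle. Jane",   # invented name (assumption for illustration)
    "Roe, Dr. John",     # invented name (assumption for illustration)
])

# A space, then letters, then a literal dot: the letters are the title
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# One dictionary instead of several replace() calls (partial mapping for the sketch)
title_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", "Dr": "Others"}
titles = titles.replace(title_map)

print(titles.tolist())
```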

Now let's drop the Cabin and Name columns, since we have extracted the information we need from them.

df.drop(['Cabin', 'Name'], axis=1, inplace=True)

# This is how our dataset looks now.

df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Deck FamilySize Title
0 0 3 male 22.0 1 0 7.2500 S Unknown 2 Mr
1 1 1 female 38.0 1 0 71.2833 C C 2 Mrs
2 1 3 female 26.0 0 0 7.9250 S Unknown 1 Miss
3 1 1 female 35.0 1 0 53.1000 S C 2 Mrs
4 0 3 male 35.0 0 0 8.0500 S Unknown 1 Mr

# Handling missing values

‘Embarked’: Only two rows are missing a value in the Embarked column. Embarked is categorical, with values C = Cherbourg, Q = Queenstown, and S = Southampton, so we can simply replace the missing values with the most frequently occurring value, which here is 'S'.

# Returns count for each category
df['Embarked'].value_counts()

# Fill null values with 'S', the most common value
common = 'S'
df['Embarked'] = df['Embarked'].fillna(common)

# Check the number of null values now
df['Embarked'].isnull().sum()
0
df.isnull().sum()
Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Fare            0
Embarked        0
Deck            0
FamilySize      0
Title           0
dtype: int64
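Instead of reading the most common value off value_counts() and typing it in, it can be looked up with mode(). A minimal sketch on a hypothetical six-row column:

```python
import numpy as np
import pandas as pd

# Hypothetical Embarked values with two gaps (assumption for illustration)
sample = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN by default and returns a Series; take its first entry
common = sample["Embarked"].mode()[0]
sample["Embarked"] = sample["Embarked"].fillna(common)

print(common)
print(sample["Embarked"].isnull().sum())
```

If two values tie for most frequent, mode() returns both in sorted order, so [0] picks the first alphabetically.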

Age: We will impute the missing values in the 'Age' column with the mean age of each Title group. Because ages differ widely between groups (for example, 'Master' versus 'Mr'), the mean of the entire column would be a poor estimate for many passengers.

means = df.groupby('Title')['Age'].mean()

title_list = ['Master','Miss','Mr','Mrs','Others']
def age_missing_replace(means, dframe, title_list):
    # For each title, fill that group's missing ages with the group's mean age
    for title in title_list:
        temp = dframe['Title'] == title
        dframe.loc[temp, 'Age'] = dframe.loc[temp, 'Age'].fillna(means[title])

age_missing_replace(means, df, title_list)
df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Deck FamilySize Title
0 0 3 male 22.0 1 0 7.2500 S Unknown 2 Mr
1 1 1 female 38.0 1 0 71.2833 C C 2 Mrs
2 1 3 female 26.0 0 0 7.9250 S Unknown 1 Miss
3 1 1 female 35.0 1 0 53.1000 S C 2 Mrs
4 0 3 male 35.0 0 0 8.0500 S Unknown 1 Mr
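The group-wise imputation can also be sketched as a single expression with groupby().transform(), which returns each row's group mean aligned to the original index. The five-row frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: one missing age in each title group (assumption for illustration)
sample = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss"],
    "Age": [20.0, 40.0, np.nan, 10.0, np.nan],
})

# transform("mean") gives every row the mean age of its own Title group
group_mean = sample.groupby("Title")["Age"].transform("mean")

# Only the missing ages are replaced; known ages are left as they are
sample["Age"] = sample["Age"].fillna(group_mean)

print(sample)
```

This avoids maintaining a separate title_list, at the cost of being a little less explicit than the loop.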

# Encoding categorical features

Many machine learning algorithms cannot handle categorical variables unless they are converted to numerical values. Fortunately, the scikit-learn and pandas Python libraries provide several ways to handle this.

They include:

• Find and replace

• Label encoding

• One-hot encoding

• Custom binary encoding

• scikit-learn's LabelEncoder
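Of the approaches listed, one-hot encoding is not used later in this walkthrough, so here is a brief sketch of it with pandas' get_dummies() on a hypothetical four-row column. Each category becomes its own indicator column.

```python
import pandas as pd

# Hypothetical Embarked values (assumption for illustration)
sample = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category, named <prefix>_<category>
one_hot = pd.get_dummies(sample, columns=["Embarked"], prefix="Embarked")

print(one_hot)
```

Unlike integer codes, this representation does not suggest any ordering between the categories.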

First, we will simply use map() to convert the categorical values into numerical data.

df['Embarked'] = df['Embarked'].map({'C':0, 'Q':1, 'S':2})
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
df['Title'] = df['Title'].map({'Master':0,'Miss':1,'Mr':2,'Mrs':3,'Others':4})

df.head(20)
Survived Pclass Sex Age SibSp Parch Fare Embarked Deck FamilySize Title
0 0 3 0 22.000000 1 0 7.2500 2 Unknown 2 2
1 1 1 1 38.000000 1 0 71.2833 0 C 2 3
2 1 3 1 26.000000 0 0 7.9250 2 Unknown 1 1
3 1 1 1 35.000000 1 0 53.1000 2 C 2 3
4 0 3 0 35.000000 0 0 8.0500 2 Unknown 1 2
5 0 3 0 32.368090 0 0 8.4583 1 Unknown 1 2
6 0 1 0 54.000000 0 0 51.8625 2 E 1 2
7 0 3 0 2.000000 3 1 21.0750 2 Unknown 5 0
8 1 3 1 27.000000 0 2 11.1333 2 Unknown 3 3
9 1 2 1 14.000000 1 0 30.0708 0 Unknown 2 3
10 1 3 1 4.000000 1 1 16.7000 2 G 3 1
11 1 1 1 58.000000 0 0 26.5500 2 C 1 1
12 0 3 0 20.000000 0 0 8.0500 2 Unknown 1 2
13 0 3 0 39.000000 1 5 31.2750 2 Unknown 7 2
14 0 3 1 14.000000 0 0 7.8542 2 Unknown 1 1
15 1 2 1 55.000000 0 0 16.0000 2 Unknown 1 3
16 0 3 0 2.000000 4 1 29.1250 1 Unknown 6 0
17 1 2 0 32.368090 0 0 13.0000 2 Unknown 1 2
18 0 3 1 31.000000 1 0 18.0000 2 Unknown 2 3
19 1 3 1 35.788991 0 0 7.2250 0 Unknown 1 3
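One thing to watch with map(): any value that is missing from the dictionary becomes NaN without raising an error. A small sketch of a check for that, using a hypothetical column that contains an unexpected value 'X':

```python
import pandas as pd

# 'X' is an invented, unexpected category (assumption for illustration)
embarked = pd.Series(["S", "C", "Q", "X"])

codes = embarked.map({"C": 0, "Q": 1, "S": 2})

# Values that had no entry in the dictionary show up as NaN in the result
unmapped = embarked[codes.isnull()]

print(codes.tolist())
print(unmapped.tolist())
```

Running a check like this right after each map() call can catch typos in the dictionary keys before they reach a model.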

Manually mapping each categorical value is impractical when there are many categories. Let’s do one conversion using LabelEncoder() from the sklearn.preprocessing module.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Deck'] = le.fit_transform(df['Deck'])

df.head()
Survived Pclass Sex Age SibSp Parch Fare Embarked Deck FamilySize Title
0 0 3 0 22.0 1 0 7.2500 2 8 2 2
1 1 1 1 38.0 1 0 71.2833 0 2 2 3
2 1 3 1 26.0 0 0 7.9250 2 8 1 1
3 1 1 1 35.0 1 0 53.1000 2 2 2 3
4 0 3 0 35.0 0 0 8.0500 2 8 1 2
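LabelEncoder assigns integers in sorted order of the labels, which is why 'C' became 2 and 'Unknown' became 8 above. As a pandas-only sketch of the same idea (not the method used in this article), a categorical dtype produces sorted codes too and keeps the lookup table for decoding. The five Deck values below are hypothetical.

```python
import pandas as pd

# Hypothetical Deck labels (assumption for illustration)
deck = pd.Series(["Unknown", "C", "Unknown", "C", "E"])

# Converting to category sorts the distinct labels and numbers them from 0
deck_cat = deck.astype("category")
codes = deck_cat.cat.codes
categories = list(deck_cat.cat.categories)

# The category list doubles as a decoder from code back to label
decoded = [categories[c] for c in codes]

print(categories)
print(codes.tolist())
```

With only three distinct labels here, 'Unknown' gets code 2 rather than 8; the codes always depend on which labels are present.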