Load the dataset
First, import the packages required to continue.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Read the dataset using Pandas
Read previously loaded data and store it in a variable named df, display the first few rows with head(), by default head() will return first 5 rows of the dataset, but you can specify any number of rows like head(10).
df = pd.read_csv('train.csv')
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Check the Dimension of the Dataset
Let's check for the basic information about the dataset by running simple commands.
#training_set.shape which returns a number of rows coloumns in dataset
df.shape
(891, 12)
#training_set.columns: It returns column headings.
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
#training_set.isnull().sum(): It returns a number of null values in each column.
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Preparing the dataset
Perspectives
The target variable, "Survived," is what we will forecast after preprocessing our data. Therefore, we keep that column.
Only the "Age," "Cabin," and "Embarked" columns lack values.
PassengerId, Name, and Ticket don't really contribute much to forecasting the target variable.
We can create a new column called "Size of the family" because "ParCh" (Parent/Child) and "SibSp" (Siblings/Spouse) details are related to families.
The category data that needs to be encoded to numerical values are "Sex," "Cabin," and "Embarked."
Columns dropping
We will eliminate the columns with the lowest priority in this step. This category includes columns like "Ticket" and "PassengerId." To remove the columns, use drop().
#dropping ticket column
df.drop(['Ticket','PassengerId'], 1, inplace=True)
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S |
Now, let’s run df.info(), and look at the status of our dataset.
df.info()
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null object
4 Age 714 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Fare 891 non-null float64
8 Cabin 204 non-null object
9 Embarked 889 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB
Which column has missing Value?
‘Cabin’, ‘Embarked’ and ‘Age’ column has missing values.
Creating new classes "Cabin"
Although the Cabin column has 687 missing values, it has a distinctive character that indicates the deck number at the beginning. As a result, we are going to establish a column named Deck to extract this information, which we may utilize in our forecast in the future.
# user defined function
def assignDeckValue(CabinCode):
if pd.isnull(CabinCode):
category = 'Unknown'
else:
category = CabinCode[0]
return category
Deck = np.array([assignDeckValue(cabin) for cabin in df['Cabin'].values])
df = df.assign(Deck = Deck)
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Deck | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S | Unknown |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C | C |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S | Unknown |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S | C |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S | Unknown |
Derive a new column
‘ParCh’ and ‘SibSp’ are the details related to family size, so let’s derive a new column named ‘Size of the Family’.
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
‘Name’: Instead of dropping right away, from the Name of the Passenger, we need to get only their Title
#Using expression pattern to extract the Title of the passenger
df['Title'] = df.Name.str.extract(' ([A-Za-z]+)\.', expand=False)
# Changing to common category
df['Title'] = df['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Countess', 'Sir', 'Jonkheer', 'Lady', 'Capt', 'Don'], 'Others')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df['Title'] = df['Title'].replace('Mlle', 'Miss')
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Deck | FamilySize | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S | Unknown | 2 | Mr |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C | C | 2 | Mrs |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.9250 | NaN | S | Unknown | 1 | Miss |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1000 | C123 | S | C | 2 | Mrs |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.0500 | NaN | S | Unknown | 1 | Mr |
Now, let's drop Cabin, Name columns, we have extracted needed information from these two.
df.drop(['Cabin','Name'],1,inplace=True)
This is how our dataset looks like now.
df.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | FamilySize | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Unknown | 2 | Mr |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | 2 | Mrs |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Unknown | 1 | Miss |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | 2 | Mrs |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Unknown | 1 | Mr |
Managing missing values
‘Embarked’: For the Embarked column, only two rows are missing their data.
Embarked accepts categorical values such as C = Cherbourg, Q = Queenstown, and S = Southampton; in this instance, we may just substitute the most frequently occurring value, which in this example is 'S,' for the missing values.
# Returns count for each category
df['Embarked'].value_counts()
# Fills null values with 'S'-most common occurence
common = 'S'
df['Embarked']=df['Embarked'].fillna('S')
# Checking the no of null values now
df['Embarked'].isnull().sum()
0
df.isnull().sum()
Survived 0
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
Embarked 0
Deck 0
FamilySize 0
Title 0
dtype: int64
Age: Using the mean value for each group, we will impute the missing values in the 'Age' column. Because there are many age ranges, taking the mean of the entire column can lead to conflicting results.
means = df.groupby('Title')['Age'].mean()
title_list = ['Master','Miss','Mr','Mrs','Others']
def age_missing_replace(means, dframe, title_list):
for title in title_list:
temp = dframe['Title'] == title
dframe.loc[temp, 'Age'] = dframe.loc[temp, 'Age'].fillna(means[title])
age_missing_replace(means, df, title_list)
df.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | FamilySize | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Unknown | 2 | Mr |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | 2 | Mrs |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Unknown | 1 | Miss |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | C | 2 | Mrs |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Unknown | 1 | Mr |
Coding categorical characteristics
Without being converted to numerical values, many machine learning algorithms are unable to accommodate categorical variables. Fortunately, there are numerous methods to manage this scenario provided by the sci-kit-learn and pandas python tools.
It's them,
-
Finding and Replacing
-
Labeling
-
A single hot encode
-
Individual Binary Encoding
-
Using Sci-kit Learn's LabelEncoder
At first, we will only use map to convert the categorical values into numerical data ().
df['Embarked'] = df['Embarked'].map({'C':0, 'Q':1, 'S':2})
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
df['Title'] = df['Title'].map({'Master':0,'Miss':1,'Mr':2,'Mrs':3,'Others':4})
df.head(20)
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | FamilySize | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 22.000000 | 1 | 0 | 7.2500 | 2 | Unknown | 2 | 2 |
1 | 1 | 1 | 1 | 38.000000 | 1 | 0 | 71.2833 | 0 | C | 2 | 3 |
2 | 1 | 3 | 1 | 26.000000 | 0 | 0 | 7.9250 | 2 | Unknown | 1 | 1 |
3 | 1 | 1 | 1 | 35.000000 | 1 | 0 | 53.1000 | 2 | C | 2 | 3 |
4 | 0 | 3 | 0 | 35.000000 | 0 | 0 | 8.0500 | 2 | Unknown | 1 | 2 |
5 | 0 | 3 | 0 | 32.368090 | 0 | 0 | 8.4583 | 1 | Unknown | 1 | 2 |
6 | 0 | 1 | 0 | 54.000000 | 0 | 0 | 51.8625 | 2 | E | 1 | 2 |
7 | 0 | 3 | 0 | 2.000000 | 3 | 1 | 21.0750 | 2 | Unknown | 5 | 0 |
8 | 1 | 3 | 1 | 27.000000 | 0 | 2 | 11.1333 | 2 | Unknown | 3 | 3 |
9 | 1 | 2 | 1 | 14.000000 | 1 | 0 | 30.0708 | 0 | Unknown | 2 | 3 |
10 | 1 | 3 | 1 | 4.000000 | 1 | 1 | 16.7000 | 2 | G | 3 | 1 |
11 | 1 | 1 | 1 | 58.000000 | 0 | 0 | 26.5500 | 2 | C | 1 | 1 |
12 | 0 | 3 | 0 | 20.000000 | 0 | 0 | 8.0500 | 2 | Unknown | 1 | 2 |
13 | 0 | 3 | 0 | 39.000000 | 1 | 5 | 31.2750 | 2 | Unknown | 7 | 2 |
14 | 0 | 3 | 1 | 14.000000 | 0 | 0 | 7.8542 | 2 | Unknown | 1 | 1 |
15 | 1 | 2 | 1 | 55.000000 | 0 | 0 | 16.0000 | 2 | Unknown | 1 | 3 |
16 | 0 | 3 | 0 | 2.000000 | 4 | 1 | 29.1250 | 1 | Unknown | 6 | 0 |
17 | 1 | 2 | 0 | 32.368090 | 0 | 0 | 13.0000 | 2 | Unknown | 1 | 2 |
18 | 0 | 3 | 1 | 31.000000 | 1 | 0 | 18.0000 | 2 | Unknown | 2 | 3 |
19 | 1 | 3 | 1 | 35.788991 | 0 | 0 | 7.2250 | 0 | Unknown | 1 | 3 |
Manually replacing the categorical value is not the right choice if there are many categories.
Let’s do one conversion using LabelEncoder() provided by sklearn.preprocessing library.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Deck'] = le.fit_transform(df['Deck'])
Dataset ready…
df.head()
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Deck | FamilySize | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.2500 | 2 | 8 | 2 | 2 |
1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 2 | 2 | 3 |
2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.9250 | 2 | 8 | 1 | 1 |
3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | 2 | 2 | 2 | 3 |
4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.0500 | 2 | 8 | 1 | 2 |