Exploratory Data Analysis in R
Hello everyone, welcome to another blog in R, where we will perform various statistical tests on the Titanic dataset.
Code to Read Titanic Dataset
data = read.csv("E:/code/Titanic Survival Practice/train.csv")
df = data.frame(data)
summary(df)
  PassengerId       Survived          Pclass          Name
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character
 Mean   :446.0   Mean   :0.3838   Mean   :2.309
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
 Max.   :891.0   Max.   :1.0000   Max.   :3.000
     Sex                 Age            SibSp           Parch
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000
                    NA's   :177
    Ticket               Fare            Cabin             Embarked
 Length:891         Min.   :  0.00   Length:891         Length:891
 Class :character   1st Qu.:  7.91   Class :character   Class :character
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character
                    Mean   : 32.20
                    3rd Qu.: 31.00
                    Max.   :512.33
From the sample of the RMS Titanic data, we can find the columns below:
- Survived: Outcome of survival (0 = No, 1 = Yes)
- Pclass: Socio-economic class (1 = upper class, 2 = middle class, 3 = lower class)
- Name: Name of the passenger
- Sex: Sex of the passenger
- Age: Age of the passenger (177 entries are missing)
- SibSp: Number of siblings and spouses of the passenger aboard
- Parch: Number of parents and children of the passenger aboard
- Ticket: Ticket number of the passenger
- Fare: Fare paid by the passenger
- Cabin: Cabin number of the passenger (some entries are missing)
- Embarked: Port of embarkation of the passenger
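Since some columns have missing entries, it can help to count them per column before running any test. The following is a minimal sketch, not the code used for this post: it writes a tiny hypothetical CSV (the rows are made up for illustration) to a temporary file so that it runs without `train.csv`, and it assumes empty cells should be treated as missing via `na.strings`.

```r
# Hypothetical three-row extract, written to a temporary file so the sketch
# runs without the real train.csv.
csv_text <- c(
  "PassengerId,Survived,Age,Cabin",
  "1,0,22,",
  "2,1,,C85",
  "3,1,26,"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# na.strings = c("", "NA") turns empty cells into NA in every column,
# including character columns such as Cabin.
toy <- read.csv(tmp, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Number of missing values per column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)
```

On the real data frame the same call would be `colSums(is.na(df))`.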
Parametric facts about age
Age of all passengers
summary(df$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.42 20.12 28.00 29.70 38.00 80.00 177
From the above, the mean age of the passengers was 29.70, the median age was 28.00, and the maximum age was 80 years. There were 177 missing values in the data.
Average Age of Male
df1 = df$Age[df$Sex == "male"]
summary(df1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.42 21.00 29.00 30.73 39.00 80.00 124
The average age of male passengers was 30.73, and the maximum age of male passengers was 80.
Average Age of Female
df2 = df$Age[df$Sex == "female"]
summary(df2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.75 18.00 27.00 27.92 37.00 63.00 53
From the above we can see that the average age of female passengers was 27.92, the median age was 27, and the maximum age was 63.
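The per-gender averages above can also be computed in a single call instead of one subset per gender. This is a small sketch on a hypothetical toy data frame (the values are invented; only the column names mirror the Titanic data), using base R's `tapply` and `aggregate`.

```r
# Hypothetical toy data; the column names mirror the Titanic data frame.
toy <- data.frame(
  Sex = c("male", "female", "male", "female", "male"),
  Age = c(30, 20, NA, 40, 50)
)

# Mean age per gender in one call; na.rm = TRUE is passed on to mean().
mean_age_by_sex <- tapply(toy$Age, toy$Sex, mean, na.rm = TRUE)
print(mean_age_by_sex)

# aggregate() returns the same figures as a data frame; with the formula
# interface, rows with a missing Age are dropped by default.
print(aggregate(Age ~ Sex, data = toy, FUN = mean))
```

With the real data this would be `tapply(df$Age, df$Sex, mean, na.rm = TRUE)`.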
Taking Random Sample From Dataframe
s1 <- df[sample(nrow(df), 600), ]
summary(s1$Age )
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.67 20.00 28.00 29.63 38.25 80.00 128
We have randomly sampled 600 rows and calculated the summary from that sample.
mean(s1$Age, na.rm=TRUE)
29.62625
The mean age of the sample calculated above is 29.63.
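Because `sample()` draws different rows on every run, the numbers above will change if the code is rerun. A minimal sketch of making the draw repeatable with `set.seed()` follows; the data frame is a simulated stand-in (not the Titanic data) and the seed value 42 is an arbitrary choice.

```r
# Hypothetical stand-in for the passenger table: 891 rows with simulated ages.
set.seed(1)
toy <- data.frame(Age = round(runif(891, min = 1, max = 80)))

# Fixing the seed immediately before sampling makes the draw repeatable.
set.seed(42)
rows_first <- sample(nrow(toy), 600)
set.seed(42)
rows_second <- sample(nrow(toy), 600)

# Both draws pick exactly the same rows, so every summary matches too.
print(identical(rows_first, rows_second))

s1 <- toy[rows_first, , drop = FALSE]
print(mean(s1$Age))
```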
Does the Average Age of the Sample Resemble the Entire Population?
Here we will use Student's t-test (the one-sample t-test).
The t-test is a parametric test: it assumes the data are approximately normally distributed, and the t distribution itself approaches the normal distribution as the sample size increases. When choosing a t-test, we need to consider two things: whether the groups being compared come from a single population or two different populations, and whether we want to test the difference in a specific direction. (Taken from here.)
Hypothesis Testing with t-test
There are two types of hypotheses:
- $H_0$: The null hypothesis, also defined as the hypothesis of no difference. It should be a simple hypothesis. It is tested for possible rejection on the basis of sample observations drawn from the population. We fail to reject it when the p-value is greater than 0.05.
- $H_1$: Any hypothesis that is mutually exclusive and complementary to the null hypothesis is called the alternative hypothesis, also known as the research hypothesis. We reject $H_0$ in its favor when the p-value is less than 0.05.

They are set up as follows:
- $H_0$: $\bar{x} = \mu$ (The sample mean and the population mean are equal, i.e. the sample is taken from the same population.)
- $H_1$: $\bar{x} \neq \mu$ (The sample mean is not equal to the population mean.)
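For reference, the one-sample statistic is $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$ with $n - 1$ degrees of freedom, where $s$ is the sample standard deviation. The sketch below computes it by hand and compares it with `t.test`. It uses simulated ages rather than the Titanic data, and `mu0` is a hypothetical reference mean chosen only for illustration.

```r
# Simulated ages, used only to illustrate the formula (not Titanic data).
set.seed(123)
x <- rnorm(50, mean = 30, sd = 14)
mu0 <- 29  # hypothetical reference mean

# t = (sample mean - reference mean) / standard error of the mean
n <- length(x)
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# The built-in test should agree with the manual calculation.
res <- t.test(x, mu = mu0)
print(all.equal(t_manual, unname(res$statistic)))
print(all.equal(p_manual, res$p.value))
```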
t.test(df$Age,mu = mean(s1$Age, na.rm=TRUE))
One Sample t-test
data: df$Age
t = 0.13404, df = 713, p-value = 0.8934
alternative hypothesis: true mean is not equal to 29.62625
95 percent confidence interval:
28.63179 30.76645
sample estimates:
mean of x
29.69912
From the above, the p-value (0.8934) is not less than 0.05, hence we do not have strong evidence to reject our null hypothesis. That means the mean age of the sample is consistent with the mean age of the whole dataset; we cannot conclude that they differ.
Do two samples have the same average age?
s2 <- df[sample(nrow(df), 700), ]
summary(s2$Age )
t.test(s1$Age, s2$Age, alternative = "two.sided", var.equal = FALSE)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.42 21.00 28.00 29.61 37.00 74.00 144
Welch Two Sample t-test
data: s1$Age and s2$Age
t = 0.022409, df = 992.82, p-value = 0.9821
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.767968 1.808813
sample estimates:
mean of x mean of y
29.62625 29.60583
The p-value (0.9821) is far above 0.05, so there is not enough evidence to reject the null hypothesis that the two samples have the same mean age.
Percentage of passengers by gender
number_per_gender <- table(df$Sex)
prop.table(number_per_gender)*100
female male
35.2413 64.7587
Is the average age of males and females significantly different?
We will use a two-sample t-test.
t.test(Age~Sex, data=df)
Welch Two Sample t-test
data: Age by Sex
t = -2.5259, df = 560.05, p-value = 0.01181
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-4.9967983 -0.6250732
sample estimates:
mean in group female mean in group male
27.91571 30.72664
From the above p-value (0.01181, which is less than 0.05), we conclude that the average age is not equal across genders. That means the mean age of male passengers is not equal to the mean age of female passengers.
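Rather than reading the p-value off the printed output, the object returned by `t.test` can be inspected directly. The following is a rough sketch on simulated data (the group means and spreads are invented, not taken from the Titanic data), showing the `p.value`, `conf.int` and `estimate` components.

```r
# Simulated two-group data; values are invented for illustration only.
set.seed(7)
toy <- data.frame(
  Sex = rep(c("female", "male"), each = 40),
  Age = c(rnorm(40, mean = 28, sd = 14), rnorm(40, mean = 31, sd = 14))
)

# Welch two-sample t-test, the default for the formula interface.
res <- t.test(Age ~ Sex, data = toy)

# Components of the result can be used programmatically.
print(res$p.value)    # p-value of the test
print(res$conf.int)   # confidence interval for the difference in means
print(res$estimate)   # the two group means

# Decision at the 5% significance level.
reject_null <- res$p.value < 0.05
print(reject_null)
```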
Find the survival ratio for each gender.
library(dplyr)
# Count passengers per Sex/Survived pair, then convert counts to shares within each gender.
gen_df = df %>%
  group_by(Sex, Survived) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
gen_df
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
`summarise()` has grouped output by 'Sex'. You can override using the `.groups` argument.
Sex | Survived | n | freq |
---|---|---|---|
<chr> | <int> | <int> | <dbl> |
female | 0 | 81 | 0.2579618 |
female | 1 | 233 | 0.7420382 |
male | 0 | 468 | 0.8110919 |
male | 1 | 109 | 0.1889081 |
From the above table we can say that about 81% of males did not survive, whereas only about 25.8% of females did not survive. Without running any statistical test, one might assume that the female survival rate is higher, but let's verify it.
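The same within-gender shares can be obtained in base R with `prop.table`. This is a small sketch that rebuilds the contingency table from the counts shown above, so that it runs without the CSV; on the real data frame the table would come from `table(df$Sex, df$Survived)`.

```r
# Counts copied from the table above: rows are Sex, columns are Survived.
counts <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("0", "1"))
)

# margin = 1 divides each cell by its row total, giving shares per gender.
freq <- prop.table(counts, margin = 1)
print(round(freq, 4))
```

Each row of `freq` sums to 1, which matches how `freq` is computed in the dplyr version.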
Is survival rate different for each gender or not?
t.test(df$Survived[df$Sex == "male"],
df$Survived[df$Sex == "female"],
alternative = "two.sided", var.equal = FALSE)
Welch Two Sample t-test
data: df$Survived[df$Sex == "male"] and df$Survived[df$Sex == "female"]
t = -18.672, df = 584.43, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6113121 -0.4949481
sample estimates:
mean of x mean of y
0.1889081 0.7420382
Since the p-value is far below 0.05, we reject the null hypothesis that the survival rates of the two genders are equal.
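Because `Survived` is a 0/1 outcome, a test of two proportions is a common alternative to the t-test used above. The sketch below is offered as a cross-check, not as the method of this post: it calls `prop.test` on the survivor counts and group sizes taken from the survival table earlier in this post.

```r
# Survivors and group sizes, taken from the per-gender survival table.
survived <- c(male = 109, female = 233)
total    <- c(male = 109 + 468, female = 233 + 81)

# Two-sample test for equality of proportions (continuity correction is
# applied by default).
res <- prop.test(x = survived, n = total)

print(res$estimate)  # survival proportion in each group
print(res$p.value)   # p-value for the null hypothesis of equal proportions
```

The estimated proportions should match the group means reported by the t-test, and a p-value below 0.05 leads to the same conclusion.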
Using Chi-square Test
The chi-square test is a non-parametric test, so it does not assume that the data follow a normal distribution. A chi-square test (also written $\chi^2$ test) is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis. Source
There are different types of chi-square test:
- Pearson's chi-square test (test of independence of attributes)
- Chi-square test of goodness of fit
- Chi-square test of homogeneity of proportions

In this blog we are going to implement Pearson's chi-square test. In this test, we check the p-value. Moreover, like all statistical tests, this test has a null hypothesis and an alternative hypothesis.
The main evidence for rejecting the null hypothesis is a p-value less than 0.05. If the p-value is greater than 0.05, we fail to reject the null hypothesis. We set up the null and alternative hypotheses as follows:
- $H_0$: The two variables are independent of each other.
- $H_1$: The two variables are dependent on each other.
Let's do it in code
chisq.test(df$Sex, df$Survived)
Pearson's Chi-squared test with Yates' continuity correction
data: df$Sex and df$Survived
X-squared = 260.72, df = 1, p-value < 2.2e-16
From the above test we see that the p-value is less than 0.05, hence we have strong evidence to reject our null hypothesis in favor of $H_1$. It means that the Gender and Survival variables are dependent on each other.
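To see where the statistic comes from, it can help to look at the expected counts under independence. The following sketch rebuilds the 2x2 table from the counts in the per-gender survival table earlier in this post and runs `chisq.test` on it; the `correct = FALSE` call is included only to show how to switch off Yates' continuity correction, which R applies by default for 2x2 tables.

```r
# Contingency table rebuilt from the per-gender survival counts.
counts <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("0", "1"))
)

# Default call: Yates' continuity correction is used for 2x2 tables.
res <- chisq.test(counts)
print(res$statistic)  # should be close to the X-squared reported above
print(res$expected)   # counts expected if Sex and Survived were independent

# Same test without the continuity correction, for comparison.
res_uncorrected <- chisq.test(counts, correct = FALSE)
print(res_uncorrected$statistic)
```

Comparing `counts` with `res$expected` shows the direction of the dependence: far more females survived than independence would predict.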