Exploratory Data Analysis in R with Tests

# Exploratory Data Analysis in R

Hello everyone, and welcome to another blog post on R, where we will perform various statistical tests on the Titanic dataset.

## Code to Read Titanic Dataset

data = read.csv("E:/code/Titanic Survival Practice/train.csv")
df = data.frame(data)
summary(df)
  PassengerId       Survived          Pclass          Name
Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
Median :446.0   Median :0.0000   Median :3.000   Mode  :character
Mean   :446.0   Mean   :0.3838   Mean   :2.309
3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
Max.   :891.0   Max.   :1.0000   Max.   :3.000

Sex                 Age            SibSp           Parch
Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
Mode  :character   Median :28.00   Median :0.000   Median :0.0000
                   Mean   :29.70   Mean   :0.523   Mean   :0.3816
                   3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
                   Max.   :80.00   Max.   :8.000   Max.   :6.0000
                   NA's   :177

Ticket               Fare           Cabin             Embarked
Length:891         Min.   :  0.00   Length:891         Length:891
Class :character   1st Qu.:  7.91   Class :character   Class :character
Mode  :character   Median : 14.45   Mode  :character   Mode  :character
                   Mean   : 32.20
                   3rd Qu.: 31.00
                   Max.   :512.33

The sample of the RMS Titanic data includes the following columns:

1. Survived: Survival outcome (0 = No, 1 = Yes)
2. Pclass: Socio-economic class (1 = upper class, 2 = middle class, 3 = lower class)
3. Name: Name of the passenger
4. Sex: Sex of the passenger
5. SibSp: Number of siblings and spouses of the passenger aboard
6. Parch: Number of parents and children of the passenger aboard
7. Ticket: Ticket number of the passenger
8. Fare: Fare paid by the passenger
9. Cabin: Cabin number of the passenger (some entries are missing)
10. Embarked: Port of embarkation of the passenger
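The `summary()` output reports missing values for Age but not for Cabin, even though some Cabin entries are empty. One likely reason is that `read.csv()` keeps blank text cells as empty strings rather than `NA`. The sketch below illustrates this on a hypothetical three-row extract (the rows are made up for illustration, not taken from `train.csv`), using the `na.strings` argument to treat blanks as missing.

```r
# Hypothetical three-row extract with a few of the train.csv column names.
csv_text <- "PassengerId,Survived,Age,Cabin
1,0,22,
2,1,38,C85
3,1,,"

# Default read: a blank numeric cell becomes NA, a blank text cell stays "".
raw <- read.csv(text = csv_text)

# Treat blank cells as missing in every column.
clean <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Count missing values per column in both versions.
colSums(is.na(raw))
colSums(is.na(clean))
```

With the second call, the missing Cabin entries show up in the `NA` counts alongside the missing ages.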

summary(df$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.42   20.12   28.00   29.70   38.00   80.00     177

From the output above, the mean passenger age was 29.70, the median age was 28.00, and the maximum age was 80 years. There are 177 missing age values in the data.

### Average Age of Male

df1 = df$Age[df$Sex == "male"]
summary(df1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.42   21.00   29.00   30.73   39.00   80.00     124

The average age of male passengers was 30.73, and the maximum age of a male passenger was 80.

### Average Age of Female

df2 = df$Age[df$Sex == "female"]
summary(df2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.75   18.00   27.00   27.92   37.00   63.00      53

From the output above, the average age of female passengers was 27.92, the median age was 27, and the maximum age was 63.

## Taking Random Sample From Dataframe

s1 <- df[sample(nrow(df), 600), ]
summary(s1$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
0.67   20.00   28.00   29.63   38.25   80.00     128 

We drew a random sample of 600 rows and computed the summary of Age from that sample.
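Because `sample()` draws different rows on every run, the numbers above will change each time the code is executed. A minimal sketch of a reproducible draw is shown below. The data frame `toy` is a hypothetical stand-in for `df`, and the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed, chosen only so the draw can be repeated

# Hypothetical stand-in for df: 891 rows, 177 of them with a missing Age.
toy <- data.frame(Age = c(rnorm(714, mean = 30, sd = 14), rep(NA, 177)))

# sample() draws row indices without replacement by default.
idx <- sample(nrow(toy), 600)
s <- toy[idx, , drop = FALSE]

nrow(s)                     # number of sampled rows
mean(s$Age, na.rm = TRUE)   # sample mean, ignoring missing ages
```

Calling `set.seed()` with the same value before `sample()` gives the same rows again, which makes the later test results repeatable.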

mean(s1$Age, na.rm=TRUE)
29.62625

The mean age of the sample calculated above is 29.63.

### Does The Average Age Of Sample Resemble The Entire Population

Here we use Student's t-test (one-sample t-test). The t-test is a parametric test: it assumes the sample mean is approximately normally distributed, an approximation that improves as the sample size increases. When choosing a t-test, we need to consider two things: whether the groups being compared come from a single population or two different populations, and whether we want to test the difference in a specific direction. (Taken from here.)

#### Hypothesis Testing with t-test

There are two types of hypothesis:

• $H_0$: The null hypothesis, also defined as the hypothesis of no difference. It should be a simple hypothesis. It is tested for possible rejection on the basis of sample observations drawn from the population. We fail to reject it when the p-value is greater than 0.05.
• $H_1$: Any hypothesis that is mutually exclusive and complementary to the null hypothesis is called the alternative hypothesis. It is also known as the research hypothesis. We reject $H_0$ in favor of $H_1$ when the p-value is less than 0.05.

They are tested like this:

• $H_0$: $\bar{x} = \mu$ (The sample mean and the population mean are equal, i.e. the sample is taken from the same population.)
• $H_1$: $\bar{x} \neq \mu$ (The sample mean is not equal to the population mean.)

t.test(df$Age, mu = mean(s1$Age, na.rm=TRUE))
    One Sample t-test

data:  df$Age
t = 0.13404, df = 713, p-value = 0.8934
alternative hypothesis: true mean is not equal to 29.62625
95 percent confidence interval:
28.63179 30.76645
sample estimates:
mean of x
29.69912 

The p-value (0.8934) is not less than 0.05, so we do not have evidence to reject the null hypothesis. In other words, the data are consistent with the sample and the full dataset having the same mean age.
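The call above passes the full Age column as the data and the sample mean as `mu`. A more conventional arrangement is to test the sample against the population mean. The sketch below shows that arrangement and reproduces the t statistic by hand from $t = (\bar{x} - \mu) / (s / \sqrt{n})$. The ages are simulated, so the variable names and numbers are assumptions and not Titanic results.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Hypothetical "population" of ages and a random sample drawn from it.
population_age <- rnorm(714, mean = 30, sd = 14)
sample_age <- sample(population_age, 200)

mu0 <- mean(population_age)  # value of the mean under the null hypothesis
n <- length(sample_age)

# One-sample t statistic and two-sided p-value computed by hand.
t_manual <- (mean(sample_age) - mu0) / (sd(sample_age) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# The same test through t.test().
res <- t.test(sample_age, mu = mu0)

c(manual = t_manual, t_test = unname(res$statistic))
c(manual = p_manual, t_test = res$p.value)
```

Both lines of output should show matching pairs, which confirms what `t.test()` computes for the one-sample case.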

#### Do two samples have the same average age?

s2 <- df[sample(nrow(df), 700), ]
summary(s2$Age)
t.test(s1$Age, s2$Age, alternative = "two.sided", var.equal = FALSE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   0.42   21.00   28.00   29.61   37.00   74.00     144

    Welch Two Sample t-test

data:  s1$Age and s2$Age
t = 0.022409, df = 992.82, p-value = 0.9821
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.767968 1.808813
sample estimates:
mean of x mean of y
29.62625 29.60583

With a p-value of 0.9821, there is not enough evidence to reject the null hypothesis that the two samples have the same mean age.

## Percentage of passengers with respect to gender

number_per_gender <- table(df$Sex)
prop.table(number_per_gender)*100
 female    male
35.2413 64.7587 

### Is the average age of males and females significantly different or not?

We will use a two-sample t-test.

t.test(Age~Sex, data=df)
    Welch Two Sample t-test

data:  Age by Sex
t = -2.5259, df = 560.05, p-value = 0.01181
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-4.9967983 -0.6250732
sample estimates:
mean in group female   mean in group male
27.91571             30.72664 

Since the p-value (0.01181) is less than 0.05, we reject the null hypothesis and conclude that the mean age of male passengers is not equal to the mean age of female passengers.
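The output is labelled "Welch Two Sample t-test" because `t.test()` does not assume equal variances by default. As a rough sketch of what that means, the code below computes the Welch statistic and its degrees of freedom by hand and compares them with `t.test()`. The two age vectors are simulated and hypothetical, so the numbers will not match the Titanic output.

```r
set.seed(7)  # arbitrary seed for a repeatable illustration

# Hypothetical age vectors for the two groups.
female_age <- rnorm(80, mean = 28, sd = 14)
male_age <- rnorm(120, mean = 31, sd = 15)

# Squared standard error of each group mean.
v_f <- var(female_age) / length(female_age)
v_m <- var(male_age) / length(male_age)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_welch <- (mean(female_age) - mean(male_age)) / sqrt(v_f + v_m)
df_welch <- (v_f + v_m)^2 /
  (v_f^2 / (length(female_age) - 1) + v_m^2 / (length(male_age) - 1))

# var.equal = FALSE is the default, so this is the Welch test.
res <- t.test(female_age, male_age)

c(manual = t_welch, t_test = unname(res$statistic))
c(manual = df_welch, t_test = unname(res$parameter))
```

The fractional degrees of freedom (such as 560.05 in the output above) come from this approximation.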

## Find the survival ratio for each gender

library(dplyr)
gen_df = df %>%
  group_by(Sex, Survived) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
gen_df
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

filter, lag

The following objects are masked from 'package:base':

intersect, setdiff, setequal, union

summarise() has grouped output by 'Sex'. You can override using the .groups argument.
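A grouped_df: 4 × 4

| Sex | Survived | n | freq |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<dbl>` |
| female | 0 | 81 | 0.2579618 |
| female | 1 | 233 | 0.7420382 |
| male | 0 | 468 | 0.8110919 |
| male | 1 | 109 | 0.1889081 |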

From the table above, more than 81% of males did not survive, whereas only 25.8% of females did not survive. Even without a statistical test, one might assume that the female survival rate is higher, but let's verify it.
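The same within-gender proportions can also be obtained in base R with `prop.table()` and `margin = 1`. The sketch below rebuilds the contingency table from the counts shown above instead of reading `train.csv`. The object names `counts` and `row_props` are illustrative.

```r
# Counts taken from the grouped table above (rows: Sex, columns: Survived).
counts <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("0", "1"))
)

# margin = 1 divides each cell by its row total, so each row sums to 1.
row_props <- prop.table(counts, margin = 1)
round(row_props, 4)
```

With the real data frame, `prop.table(table(df$Sex, df$Survived), margin = 1)` would be the equivalent call.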

### Is survival rate different for each gender or not?

t.test(df$Survived[df$Sex == "male"],
df$Survived[df$Sex == "female"],
alternative = "two.sided", var.equal = FALSE)
    Welch Two Sample t-test

data:  df$Survived[df$Sex == "male"] and df$Survived[df$Sex == "female"]
t = -18.672, df = 584.43, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6113121 -0.4949481
sample estimates:
mean of x mean of y
0.1889081 0.7420382 

Since the p-value is far below 0.05, we reject the null hypothesis that the survival rates of the two genders are equal.
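Survived is a 0/1 variable, so its group means are proportions. A test designed for proportions, such as `prop.test()`, is a common alternative to the t-test here. The sketch below is a swapped-in technique and not the method used in this post. It takes the survivor counts and group sizes from the grouped table earlier in the post.

```r
# Survivors and group sizes, taken from the grouped table in this post.
survived <- c(male = 109, female = 233)
total <- c(male = 577, female = 314)

# Two-sample test of equal proportions (continuity correction on by default).
res_prop <- prop.test(survived, total)

res_prop$estimate   # estimated survival proportion per group
res_prop$p.value    # p-value for the null hypothesis of equal proportions
res_prop$conf.int   # confidence interval for the difference in proportions
```

The estimated proportions match the group means reported by the t-test above (0.1889 for males and 0.7420 for females).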

### Using Chi-square Test

The chi-square test is a non-parametric test, so it does not assume that the data follow a normal distribution. A chi-squared test (also written $\chi^2$ test) is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis. Source

There are different types of chi-square-test:

• Pearson's chi-square test (test of independence of attributes)
• Chi-square goodness-of-fit test
• Chi-square test of homogeneity of proportions

In this blog we are going to implement Pearson's chi-square test. As with all statistical tests, we set up a null hypothesis and an alternative hypothesis and then check the p-value. We reject the null hypothesis if the p-value is less than 0.05. If the p-value is greater than 0.05, we fail to reject the null hypothesis. We set up the null and alternative hypotheses as follows:

• $H_0$ : The two variables are independent of each other.
• $H_1$ : The two variables are dependent on each other.

#### Let's do it in code

chisq.test(df$Sex, df$Survived)
    Pearson's Chi-squared test with Yates' continuity correction

data:  df$Sex and df$Survived
X-squared = 260.72, df = 1, p-value < 2.2e-16

The test gives a p-value of less than 0.05, so we have strong evidence to reject the null hypothesis in favor of $H_1$. This means the Sex and Survived variables are dependent on each other.
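To see where the X-squared value comes from, the statistic can be rebuilt by hand: compute the expected count of each cell under independence, then sum $(|O - E| - 0.5)^2 / E$ over the four cells (the 0.5 is Yates' continuity correction for a 2 × 2 table). The sketch below uses the counts from the grouped table earlier in the post. The object names are illustrative.

```r
# Observed counts (rows: Sex, columns: Survived), from the grouped table.
observed <- matrix(
  c(81, 233,
    468, 109),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("female", "male"), Survived = c("0", "1"))
)

# Expected counts under independence: row total * column total / grand total.
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Chi-square statistic with Yates' continuity correction, and its p-value.
chi_manual <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_manual <- pchisq(chi_manual, df = 1, lower.tail = FALSE)

# chisq.test() on the same 2 x 2 table applies the correction by default.
res <- chisq.test(observed)

c(manual = chi_manual, chisq_test = unname(res$statistic))
```

Both values should agree with the X-squared = 260.72 reported above.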