Exploratory Data Analysis in R with Tests


Hello everyone, and welcome to another R blog post, in which we will run several statistical tests on the Titanic dataset.

Code to Read Titanic Dataset

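# Read the Titanic training data (adjust the path to your own copy)
data = read.csv("E:/code/Titanic Survival Practice/train.csv")
# read.csv() already returns a data frame, so this only copies it under a shorter name
df = data.frame(data)
# Per-column summary: quartiles for numeric columns, length and class for character columns
summary(df)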
  PassengerId       Survived          Pclass          Name          
 Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
 1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
 Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
 Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
 3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
 Max.   :891.0   Max.   :1.0000   Max.   :3.000                     

     Sex                 Age            SibSp           Parch       
 Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
 Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000  
 Mode  :character   Median :28.00   Median :0.000   Median :0.0000  
                    Mean   :29.70   Mean   :0.523   Mean   :0.3816  
                    3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000  
                    Max.   :80.00   Max.   :8.000   Max.   :6.0000  
                    NA's   :177                                     
    Ticket               Fare           Cabin             Embarked        
 Length:891         Min.   :  0.00   Length:891         Length:891        
 Class :character   1st Qu.:  7.91   Class :character   Class :character  
 Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
                    Mean   : 32.20                                        
                    3rd Qu.: 31.00                                        
                    Max.   :512.33                                        

The Titanic training data contains the following columns, in addition to PassengerId (a row identifier) and Age (the passenger's age in years):

  1. Survived: Survival outcome (0 = No, 1 = Yes)
  2. Pclass: Socio-economic class (1 = upper class, 2 = middle class, 3 = lower class)
  3. Name: Name of the passenger
  4. Sex: Sex of the passenger
  5. SibSp: Number of siblings and spouses of the passenger aboard
  6. Parch: Number of parents and children of the passenger aboard
  7. Ticket: Ticket number of the passenger
  8. Fare: Fare paid by the passenger
  9. Cabin: Cabin number of the passenger (some entries are missing)
  10. Embarked: Port of embarkation of the passenger
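Before running any test it helps to check each column's type and how many values are missing. The sketch below uses a small made-up data frame called `toy` (an assumption for illustration, not the real file) and assumes that blank text fields are read as empty strings, which is how `read.csv()` treats them by default.

```r
# Toy data frame reusing a few Titanic column names; the values are invented
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 3L, 2L),
  Age      = c(22, 38, NA, 35),
  Cabin    = c("", "C85", "", ""),
  stringsAsFactors = FALSE
)

# Class of every column
col_types <- sapply(toy, class)

# Missing numbers show up as NA, missing text as an empty string
na_counts    <- colSums(is.na(toy))
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(col_types)
print(na_counts)
print(blank_counts)
```

On the real data, the same two counts can be taken on `df` to see which columns need care before testing.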

Summary statistics for age

Age of all passengers

summary(df$Age)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
       0.42   20.12   28.00   29.70   38.00   80.00     177 

The mean age of the passengers was 29.70 years, the median was 28.00, and the oldest passenger was 80 years old. Age is missing for 177 passengers.
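Missing values matter here because most R summary functions return NA unless told to skip them. A minimal sketch with a made-up vector `ages` (not the real column):

```r
# Invented ages with two missing values
ages <- c(22, 38, NA, 35, NA, 54)

mean_with_na    <- mean(ages)                # NA, because NAs propagate
mean_without_na <- mean(ages, na.rm = TRUE)  # (22 + 38 + 35 + 54) / 4 = 37.25
n_missing       <- sum(is.na(ages))          # 2

print(mean_with_na)
print(mean_without_na)
print(n_missing)
```

`summary()` drops the missing values on its own and reports their count in the NA's column, which is why no `na.rm` argument was needed above.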

Average Age of Male Passengers

df1 = df$Age[df$Sex == "male"]
summary(df1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   21.00   29.00   30.73   39.00   80.00     124 

The average age of male passengers was 30.73 years, and the oldest male passenger was 80.

Average Age of Female Passengers

df2 = df$Age[df$Sex == "female"]
summary(df2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.75   18.00   27.00   27.92   37.00   63.00      53 

The average age of female passengers was 27.92 years, the median was 27, and the oldest female passenger was 63.
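The two subsets above can also be summarised in one call. The sketch below uses base R's `tapply()` on a made-up data frame `toy` (an assumption for illustration); on the real data the same call would take `df$Age` and `df$Sex`.

```r
# Invented rows, only to show the mechanics
toy <- data.frame(
  Sex = c("male", "female", "male", "female", "male"),
  Age = c(22, 38, NA, 26, 35)
)

# Mean age per sex, skipping missing ages
mean_by_sex <- tapply(toy$Age, toy$Sex, mean, na.rm = TRUE)
print(mean_by_sex)
# female: (38 + 26) / 2 = 32, male: (22 + 35) / 2 = 28.5
```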

Taking Random Sample From Dataframe

# Draw 600 rows at random, without replacement
s1 <- df[sample(nrow(df), 600), ]
summary(s1$Age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.67   20.00   28.00   29.63   38.25   80.00     128 

We drew a random sample of 600 passengers and computed the summary of Age for that sample.

mean(s1$Age, na.rm=TRUE)

29.62625

The mean age of the sample calculated above is 29.63.
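Because `sample()` is random, the sample and its mean change on every run unless a seed is fixed first. A minimal sketch, assuming a made-up stand-in data frame `pop` and arbitrary seed values:

```r
# Stand-in for the Titanic data frame: 891 rows with invented ages
set.seed(42)                        # arbitrary seed, any integer works
pop <- data.frame(id = 1:891, Age = round(runif(891, min = 1, max = 80)))

set.seed(1)
idx_a <- sample(nrow(pop), 600)     # 600 row numbers
set.seed(1)
idx_b <- sample(nrow(pop), 600)     # same seed, so the same draw

same_draw  <- identical(idx_a, idx_b)
no_repeats <- anyDuplicated(idx_a) == 0   # sample() draws without replacement by default
s <- pop[idx_a, ]

print(same_draw)
print(no_repeats)
print(nrow(s))
```

Calling `set.seed()` once at the top of a script is usually enough to make every later sample repeatable.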

Does The Average Age Of Sample Resemble The Entire Population

Here we use Student's t-test (the one-sample t-test).
The t-test is a parametric test: it assumes the data are approximately normally distributed, and the t-distribution of its test statistic approaches the normal distribution as the sample size increases. When choosing a t-test, we need to consider two things: whether the groups being compared come from a single population or two different populations, and whether we want to test the difference in a specific direction. (Taken from here.)

Hypothesis Testing with t-test

There are two types of hypotheses:

  • $H_0$: The null hypothesis, also called the hypothesis of no difference. It should be a simple hypothesis, and it is tested for possible rejection on the basis of sample observations drawn from the population. We fail to reject it when the p-value is greater than 0.05.

  • $H_1$: Any hypothesis that is mutually exclusive and complementary to the null hypothesis is called the alternative hypothesis, also known as the research hypothesis. We reject $H_0$ in favor of $H_1$ when the p-value is less than 0.05.

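For the one-sample test they are stated as:

  • $H_0$: $\bar{x} = \mu$ (The sample mean equals the population mean, i.e. the sample is taken from the same population.)
  • $H_1$: $\bar{x} \neq \mu$ (The sample mean is not equal to the population mean.)

# Here df$Age (all passengers) is the data and the mean of sample s1 is the reference value mu
t.test(df$Age, mu = mean(s1$Age, na.rm = TRUE))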
    One Sample t-test

data:  df$Age
t = 0.13404, df = 713, p-value = 0.8934
alternative hypothesis: true mean is not equal to 29.62625
95 percent confidence interval:
 28.63179 30.76645
sample estimates:
mean of x 
 29.69912 

The p-value (0.8934) is well above 0.05, so we do not have enough evidence to reject the null hypothesis. In other words, the data give no reason to think the mean age of the sample differs from that of the full dataset.
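The call above passes the full Age column as data and the sample mean as `mu`. The more conventional setup passes the sample as data and the population mean as `mu`. The sketch below does that on simulated ages (the vector `population` and its parameters are assumptions, not the Titanic data) and recomputes the statistic by hand from $t = (\bar{x} - \mu) / (s / \sqrt{n})$.

```r
set.seed(123)                                   # arbitrary seed
population <- rnorm(891, mean = 30, sd = 14)    # simulated stand-in for Age
smp <- sample(population, 600)                  # random sample of 600
mu0 <- mean(population)                         # reference value under H0

# One-sample t statistic and two-sided p-value by hand
n        <- length(smp)
t_manual <- (mean(smp) - mu0) / (sd(smp) / sqrt(n))
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

# Same test with the built-in function
res <- t.test(smp, mu = mu0)

print(c(manual = t_manual, builtin = unname(res$statistic)))
print(c(manual = p_manual, builtin = res$p.value))

# Decision at the 5% level
reject_h0 <- res$p.value < 0.05
print(reject_h0)
```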

Do two samples have the same average age?

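# Draw a second random sample of 700 rows
s2 <- df[sample(nrow(df), 700), ]
summary(s2$Age)
# Welch's two-sample t-test (does not assume equal variances)
t.test(s1$Age, s2$Age, alternative = "two.sided", var.equal = FALSE)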
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.42   21.00   28.00   29.61   37.00   74.00     144 

    Welch Two Sample t-test

data:  s1$Age and s2$Age
t = 0.022409, df = 992.82, p-value = 0.9821
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.767968  1.808813
sample estimates:
mean of x mean of y 
 29.62625  29.60583 

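The p-value (0.9821) is well above 0.05, so there is not enough evidence to reject the null hypothesis that the two samples have the same mean age. Note that both samples are drawn from the same 891 rows and therefore overlap heavily, which the two-sample t-test does not account for.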
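Samples of 600 and 700 rows taken from 891 rows must share at least 600 + 700 - 891 = 409 rows, so they are not independent. One way to get independent samples is to shuffle the row numbers once and cut two non-overlapping slices. A minimal sketch on simulated ages (the vector `ages` and the slice sizes are assumptions):

```r
set.seed(7)                                # arbitrary seed
ages <- rnorm(891, mean = 30, sd = 14)     # simulated ages, not the real column

# Shuffle all row numbers once, then take two disjoint slices
shuffled <- sample(length(ages))
idx1 <- shuffled[1:400]
idx2 <- shuffled[401:800]
overlap <- length(intersect(idx1, idx2))   # 0 by construction

res_disjoint <- t.test(ages[idx1], ages[idx2],
                       alternative = "two.sided", var.equal = FALSE)

print(overlap)
print(res_disjoint$p.value)
```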

Percentage of passengers by gender

number_per_gender <- table(df$Sex)
prop.table(number_per_gender)*100
 female    male 
35.2413 64.7587 

Is the average age of male and female passengers significantly different?

We will use a two-sample t-test.

t.test(Age~Sex, data=df)
    Welch Two Sample t-test

data:  Age by Sex
t = -2.5259, df = 560.05, p-value = 0.01181
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -4.9967983 -0.6250732
sample estimates:
mean in group female   mean in group male 
            27.91571             30.72664 

The p-value (0.01181) is less than 0.05, so we reject the null hypothesis and conclude that the mean age of male passengers differs from the mean age of female passengers.
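To see where the t value and the fractional degrees of freedom come from, Welch's statistic can be recomputed by hand. The sketch below uses simulated ages in a data frame `sim`. The group sizes 261 and 453 are the non-missing counts implied by the summaries above (314 - 53 and 577 - 124), while the means and standard deviations passed to `rnorm()` are assumptions.

```r
set.seed(11)                                # arbitrary seed
sim <- data.frame(
  Sex = rep(c("female", "male"), times = c(261, 453)),
  Age = c(rnorm(261, mean = 28, sd = 14), rnorm(453, mean = 31, sd = 15))
)

f <- sim$Age[sim$Sex == "female"]
m <- sim$Age[sim$Sex == "male"]

# Squared standard error of each group mean
se2_f <- var(f) / length(f)
se2_m <- var(m) / length(m)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (mean(f) - mean(m)) / sqrt(se2_f + se2_m)
df_welch <- (se2_f + se2_m)^2 /
  (se2_f^2 / (length(f) - 1) + se2_m^2 / (length(m) - 1))

# The formula interface orders groups alphabetically: female minus male
res <- t.test(Age ~ Sex, data = sim)

print(c(manual = t_welch,  builtin = unname(res$statistic)))
print(c(manual = df_welch, builtin = unname(res$parameter)))
```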

Survival ratio for each gender

library(dplyr)
# Count passengers per (Sex, Survived) pair, then convert the counts to within-sex shares
gen_df = df %>%
  group_by(Sex, Survived) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
gen_df
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

summarise() has grouped output by 'Sex'. You can override using the .groups argument.
A grouped_df: 4 × 4
Sex     Survived      n       freq
<chr>      <int>  <int>      <dbl>
female         0     81  0.2579618
female         1    233  0.7420382
male           0    468  0.8110919
male           1    109  0.1889081

The table shows that more than 81% of males did not survive, whereas only 25.8% of females did not survive. Even without a statistical test, the female survival rate looks higher, but let's verify it.
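The same shares can be reproduced without dplyr. A minimal sketch in base R, typing in the four counts from the table above as a matrix (the object names `counts` and `row_share` are ours):

```r
# Counts copied from the table above: rows are Sex, columns are Survived
counts <- matrix(c(81, 233,
                   468, 109),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Sex = c("female", "male"),
                                 Survived = c("0", "1")))

# margin = 1 divides every row by its own total, giving within-sex shares
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 4))
```

On the full data frame, `prop.table(table(df$Sex, df$Survived), margin = 1)` would give the same layout.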

Is the survival rate different for each gender?

t.test(df$Survived[df$Sex == "male"],
       df$Survived[df$Sex == "female"],
       alternative = "two.sided", var.equal = FALSE)
    Welch Two Sample t-test

data:  df$Survived[df$Sex == "male"] and df$Survived[df$Sex == "female"]
t = -18.672, df = 584.43, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6113121 -0.4949481
sample estimates:
mean of x mean of y 
0.1889081 0.7420382 

The p-value is far below 0.05, so we reject the null hypothesis that the survival rates of the two genders are equal.
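Since Survived is a 0/1 variable, its mean is a proportion, so a two-proportion test is a natural cross-check on the t-test. A minimal sketch using base R's `prop.test()` with the counts from the table above (109 of 577 males and 233 of 314 females survived); the object name `ptest` is ours:

```r
survivors <- c(male = 109, female = 233)   # passengers who survived
totals    <- c(male = 577, female = 314)   # all passengers of each sex

# Two-sided test of equal proportions (continuity correction is on by default)
ptest <- prop.test(x = survivors, n = totals)

print(unname(ptest$estimate))   # the two survival rates
print(ptest$p.value)
```

The estimated rates match the group means reported by the t-test, and the conclusion is the same.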

Using Chi-square Test

The chi-square test is a non-parametric test, so it does not assume that the data follow a normal distribution. A chi-square test (also written $\chi^2$ test) is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis. Source

There are different types of chi-square test:

  • Pearson's chi-square test of independence (test of independence of attributes)
  • Chi-square goodness-of-fit test
  • Chi-square test of homogeneity of proportions

In this blog we implement Pearson's chi-square test of independence. As with every statistical test, we state a null hypothesis and an alternative hypothesis and then check the p-value.
If the p-value is less than 0.05, we reject the null hypothesis; if it is greater than 0.05, we fail to reject it. We set up the null and alternative hypotheses as follows:

  • $H_0$: The two variables are independent of each other.
  • $H_1$: The two variables are dependent on each other.

Let's do it in code

chisq.test(df$Sex, df$Survived)
    Pearson's Chi-squared test with Yates' continuity correction

data:  df$Sex and df$Survived
X-squared = 260.72, df = 1, p-value < 2.2e-16

The p-value is less than 0.05, so we have strong evidence to reject the null hypothesis in favor of $H_1$. It means the Sex and Survived variables are dependent on each other.
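The statistic can be rebuilt by hand from the 2 × 2 table of counts shown earlier. The sketch below computes the expected counts under independence and applies Yates' continuity correction, $\chi^2 = \sum (|O - E| - 0.5)^2 / E$, then compares the result with `chisq.test()` on the same matrix. The object names are ours.

```r
# Observed counts: rows are Sex, columns are Survived
observed <- matrix(c(81, 233,
                     468, 109),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Sex = c("female", "male"),
                                   Survived = c("0", "1")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic with Yates' continuity correction (used for 2 x 2 tables)
x2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Built-in test on the same table
res <- chisq.test(observed)

print(round(expected, 2))
print(c(manual = x2_yates, builtin = unname(res$statistic)))  # both about 260.72
```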
