by Kevin Vo

In [2]:

```
# Set up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stat
```

- The given data is:

In [3]:

```
# Import titanic data into a data frame
filename = 'titanic_data.csv'
titanic_df = pd.read_csv(filename)
titanic_df.head()
```

Out[3]:

The following columns are not needed for this analysis and are dropped: **Name, SibSp, Parch, Ticket**, as well as **Cabin** (I looked at all the values of Cabin; there is not enough information to consider it as a factor) and **Embarked**.

In [4]:

```
titanic_df.drop(['SibSp','Parch','Ticket','Cabin','Embarked','Name'], axis = 1, inplace = True)
titanic_df.head()
```

Out[4]:

Currently, the dimensions of the Titanic data frame are:

In [5]:

```
titanic_df.shape
```

Out[5]:

**Missing Data**: Let's check how many NAs there are in each variable (column)

In [6]:

```
def number_of_NA(column):
    # Missing values = total entries minus non-missing entries
    return column.size - column.dropna(axis = 0).size

print(titanic_df.apply(number_of_NA))
```
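The same per-column count can also be obtained with built-in pandas methods. The sketch below is only an illustration: `toy_df` is a made-up stand-in for `titanic_df`, and its values are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df (values are made up).
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
na_counts = toy_df.isnull().sum()
print(na_counts)
```

Applied to the real data frame, `titanic_df.isnull().sum()` should give the same counts as the `number_of_NA` helper.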

**Age** contains 177 *NA* values. We should only drop these NAs when we analyze the **Survived** condition by **Age** separately. Therefore, we will handle this issue later.

**SURVIVED AND AGE**:

Histogram of Age grouped by Survival Condition

In [7]:

```
%pylab inline
# 20 bin edges evenly spaced over ages 0 to 100
bins = np.linspace(0,100,20)
plt.hist(titanic_df.groupby('Survived').get_group(0)['Age'].dropna(axis = 0),
         bins,histtype='stepfilled',alpha=0.5, color = 'r', label = 'Death')
plt.hist(titanic_df.groupby('Survived').get_group(1)['Age'].dropna(axis = 0),
         bins,histtype='stepfilled',alpha=0.5, color = 'g', label = 'Survived')
plt.title("Histogram of Age grouped by Survival condition")
plt.legend(loc = 'upper right')
plt.show()
print("Figure 1")
```

The histogram tells us:

```
+ Children are more likely to survive.
+ Adults have a lower chance of survival.
```

**Question:** Do the Death and Survived groups have the same age distribution?

**SURVIVED AND PCLASS & FARE**:

As expected, first-class passengers generally paid higher fares:

In [8]:

```
g = sns.stripplot(x = "Pclass", y = "Fare", data = titanic_df)
print("Figure 2")
```

**Survived** and **Pclass**:

In [9]:

```
sns.set_style("whitegrid")
# factorplot was renamed catplot (and size renamed height) in seaborn 0.9
g = sns.catplot(x = 'Survived', col = 'Pclass', col_wrap = 4, data = titanic_df, kind = 'count', height = 2.5, aspect = .8)
sns.despine(left=True)
print("Figure 3")
```

The above chart tells us that passengers in the lower class (Pclass 3) were more likely to die.
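The counts in the chart can be summarized as a survival rate per class. The following is a minimal sketch, assuming a frame with `Pclass` and `Survived` columns like `titanic_df`; the helper `survival_rate_by` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data (not the real Titanic counts).
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

def survival_rate_by(df, column):
    """Return the fraction of survivors within each level of `column`."""
    # Survived is coded 0/1, so its mean is the share of survivors.
    return df.groupby(column)["Survived"].mean()

rates = survival_rate_by(toy_df, "Pclass")
print(rates)
```

The same helper could be called with `"Sex"` on the real data frame to compare survival rates by gender.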

**SURVIVED AND SEX**:

In [10]:

```
sns.set_style("whitegrid")
# factorplot was renamed catplot (and size renamed height) in seaborn 0.9
g = sns.catplot(x = 'Survived', col = 'Sex', col_wrap = 4, data = titanic_df, kind = 'count', height = 2.5, aspect = .8)
sns.despine(left=True)
print("Figure 4")
```

The above chart tells us that males were more likely to die.

- The **independent variable** is the **survival** condition: *non-survived (death)* or *survived*.
- The **dependent variable** is the **Age** of each person in each survival condition group.

**Null Hypothesis**:
__The two populations represented by the two conditions (survived group and non-survived group) have the same distribution of age.__

$H_0$: `Age values of the two datasets (survived vs non-survived) come from the same population.`

$H_a$: `The non-survived group tends to have larger Age values than the survived group.`

The **Mann-Whitney U test** is used as the statistical test for this hypothesis. It is chosen based on the following assumptions:

Assumption #1: The dependent variable should be measured at the ordinal or continuous level. In this case, age is a continuous variable.

Assumption #2: The independent variable should consist of two categorical, independent groups. Our independent variable is the survival condition, which has two categorical, independent groups: survived and non-survived.

Assumption #3: Independence of observations. It is satisfied because no passenger can be in both the non-survived group and the survived group.

Assumption #4: A Mann-Whitney U test can be used when the two groups are not normally distributed. Based on Figure 1, we can see that the histograms of both groups are not symmetric. They are skewed to the right.
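To make the test statistic concrete, here is a minimal pure-Python sketch of how a U statistic can be computed by comparing every pair of observations across the two groups. The function name `mann_whitney_u` and the age values are hypothetical, and the sketch computes only the statistic, not the p-value that `scipy.stats.mannwhitneyu` also returns.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y): pairwise 'wins' of each sample, ties counted as half."""
    u_x = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u_x += 1.0      # x observation is larger than y observation
            elif xi == yj:
                u_x += 0.5      # a tie is split between the two samples
    # Every pair is counted once, so the two statistics sum to len(x) * len(y).
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Made-up ages, for illustration only.
non_survived_ages = [30, 40, 50]
survived_ages = [20, 35, 45]

u_non, u_surv = mann_whitney_u(non_survived_ages, survived_ages)
print(u_non, u_surv)
```

If the two groups came from the same population, each U would be expected to be close to half of `len(x) * len(y)`; values far from that midpoint suggest one group tends to have larger values.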

In [11]:

```
# Ages (NAs dropped) for each survival group
survived_age = titanic_df.groupby('Survived').get_group(1)['Age'].dropna(axis = 0)
non_survived_age = titanic_df.groupby('Survived').get_group(0)['Age'].dropna(axis = 0)
with_survived_mean = survived_age.mean()
with_non_survived_mean = non_survived_age.mean()
U, p = stat.mannwhitneyu(non_survived_age, survived_age)
print("Mean age of Survived group is: {}".format(with_survived_mean))
print("Mean age of Non-Survived group is: {}".format(with_non_survived_mean))
print("U-statistic: {}".format(U))
print("p-value : {}".format(p))
```

The U-statistic is very high and the p-value = 0.16 > 0.05

**=>** We fail to reject the null hypothesis at the 5% significance level.

**Conclusion:** __Based on the provided data, we do not have sufficient evidence at the 5% significance level to conclude that age has an effect on survival condition.__

- The **independent variable** is the **survival** condition: *non-survived (death)* or *survived*.
- The **dependent variable** is the **Pclass**, which is the socio-economic status: 1 = Upper, 2 = Middle, 3 = Lower.

**Null Hypothesis**:
__In the population, the two categorical variables are independent.__

$H_0$: `In the population, survival variable and Pclass variable are independent.`

$H_a$: `In the population, survival variable and Pclass variable are associated (dependent).`

The **Chi-square Test of Independence** is used as the statistical test for this hypothesis. It is chosen because:

- The independent and dependent variables are categorical.
- It is a 2x3 contingency table (each case contributes to 1 cell only)
- The sample data is large enough.
- Each cell in the frequency table is larger than 5 (the frequency table is shown below)
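The idea behind the test can be sketched in a few lines of plain Python: under independence, the expected count of a cell is its row total times its column total divided by the grand total, and the statistic sums the squared deviations from those expected counts. The function name `chi_square_independence` and the toy table are assumptions for illustration, not the Titanic counts.

```python
def chi_square_independence(observed):
    """Return (chi2, dof, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = float(sum(row_totals))

    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Pearson statistic: sum of (observed - expected)^2 / expected over all cells.
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected

# Made-up 2x3 table: rows are two outcomes, columns are three classes.
toy_table = [[10, 20, 30],
             [20, 20, 20]]

chi2_toy, dof_toy, expected_toy = chi_square_independence(toy_table)
print(chi2_toy, dof_toy, expected_toy)
```

A table with two rows and three columns has (2 - 1) x (3 - 1) = 2 degrees of freedom, which is why the sketch returns `dof` alongside the statistic.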

In [12]:

```
print("Frequency Table is:")
titanic_df[['Survived','Pclass']].pivot_table(columns= ['Pclass'],index=['Survived'],aggfunc= len, margins = True)
```

Out[12]:

__Notice:__ The frequency table shows that far more passengers from the lower class did not survive compared to the other classes.

In [13]:

```
chi2, p, ddof, expected = stat.chi2_contingency(
    titanic_df[['Survived','Pclass']].pivot_table(columns= ['Pclass'],index=['Survived'],aggfunc= len))
expected = pd.DataFrame(expected, index = [0,1], columns = [1,2,3])
expected.index.names = ['Survived']
expected.columns.names = ['Pclass']
msg = """
Test Statistic X^2: {}\np-value: {}\nDegrees of Freedom: {}
==================================================
\n\t\tExpected table"""
print( msg.format( chi2, p, ddof) )
display(expected)
```

As expected, the chi-square statistic is very high, and the p-value is extremely close to zero.

**=>** We reject the null hypothesis at the 5% significance level.

**Conclusion:** __Based on the provided data, we conclude at the 5% significance level that Survival Condition is associated with Socio-Economic Class.__

- The **independent variable** is the **survival** condition: *non-survived (death)* or *survived*.
- The **dependent variable** is the **Sex**, which takes the values female and male.

**Null Hypothesis**:
__In the population, the two categorical variables are independent.__

$H_0$: `In the population, survival variable and Sex variable are independent.`

$H_a$: `In the population, survival variable and Sex variable are associated (dependent).`

The **Chi-square Test of Independence** is used as the statistical test for this hypothesis. It is chosen because:

- The independent and dependent variables are categorical.
- It is a 2x2 contingency table (each case contributes to 1 cell only)
- The sample data is large enough.
- Each cell in the frequency table is larger than 5 (the frequency table is shown below)
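One detail worth knowing for a 2x2 table: `scipy.stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`) when the table has one degree of freedom. The sketch below shows, under that assumption, how the corrected statistic differs from the uncorrected one. The function name `yates_chi_square_2x2` and the toy counts are hypothetical.

```python
def yates_chi_square_2x2(observed, correction=True):
    """Chi-square statistic for a 2x2 table, optionally with Yates' correction."""
    (a, b), (c, d) = observed
    n = float(a + b + c + d)
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            diff = abs(o - e)
            if correction:
                # Shrink each deviation by 0.5, but never past zero.
                diff -= min(0.5, diff)
            chi2 += diff ** 2 / e
    return chi2

# Made-up 2x2 counts, for illustration only.
toy_2x2 = [[10, 20],
           [30, 40]]

corrected = yates_chi_square_2x2(toy_2x2)
uncorrected = yates_chi_square_2x2(toy_2x2, correction=False)
print(corrected, uncorrected)
```

The correction always makes the statistic smaller, so it gives a more conservative p-value; with counts as large as those in this data set the difference is usually small.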

In [14]:

```
print("Frequency Table is:")
titanic_df[['Survived','Sex']].pivot_table(columns= ['Sex'],index=['Survived'],aggfunc= len, margins = True)
```

Out[14]:

__Notice:__ The frequency table shows that far more males did not survive compared to the other groups.

In [16]:

```
chi2, p, ddof, expected = stat.chi2_contingency(
    titanic_df[['Survived','Sex']].pivot_table(columns= ['Sex'],index=['Survived'],aggfunc= len))
expected = pd.DataFrame(expected, index = [0,1], columns = ['female','male'])
expected.index.names = ['Survived']
expected.columns.names = ['Sex']
msg = """
Test Statistic X^2: {}\np-value: {}\nDegrees of Freedom: {}
==================================================
\n\t\tExpected table"""
print( msg.format( chi2, p, ddof) )
display(expected)
```

As expected, the chi-square statistic is very high, and the p-value is extremely close to zero.

**=>** We reject the null hypothesis at the 5% significance level.

**Conclusion:** __Based on the provided data, we conclude at the 5% significance level that Survival Condition is associated with Gender.__

**Based on the given data, we have very little evidence to conclude that age has any effect on survival condition. In contrast, gender and socio-economic class are associated with survival condition. From the above frequency tables, we can see that passengers in the lower class were more likely to die and females were more likely to survive.**