by Kevin Vo
# Set up
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stat
# Import titanic data into a data frame
filename = 'titanic_data.csv'
titanic_df = pd.read_csv(filename)
titanic_df.head()
However, the data contains more variables than we need. Since we are interested in the factors that made people more likely to survive, I will remove these variables from the dataset: Name, SibSp, Parch, Ticket, Cabin and Embarked. (I looked at all the values of Cabin; there is not enough information to consider it as a factor.)
titanic_df.drop(['SibSp','Parch','Ticket','Cabin','Embarked','Name'], axis = 1, inplace = True)
titanic_df.head()
The Titanic data frame now has the following dimensions:
titanic_df.shape
def number_of_NA(column):
    # Count the missing values in a column
    return column.size - column.dropna(axis = 0).size
print(titanic_df.apply(number_of_NA))
So only Age contains missing values (177 NAs). We should drop these NAs only when we analyze survival by Age separately, so we will handle this issue later.
Histogram of Age grouped by Survival Condition
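As a cross-check on the helper above, missing values can also be counted with `isnull().sum()`. This is a minimal sketch on a small made-up frame; `toy_df` and its values are hypothetical and are not taken from the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not the Titanic file)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Sex': ['male', 'female', 'female', 'male'],
})

# isnull() marks missing cells as True; sum() counts them per column
na_counts = toy_df.isnull().sum()
print(na_counts)
```

Applied to `titanic_df`, the same call should give one count per column, comparable to the output of `number_of_NA`.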
%pylab inline
bins = np.linspace(0,100,20)
plt.hist(titanic_df.groupby('Survived').get_group(0)['Age'].dropna(axis = 0),
bins,histtype='stepfilled',alpha=0.5, color = 'r', label = 'Death')
plt.hist(titanic_df.groupby('Survived').get_group(1)['Age'].dropna(axis = 0),
bins,histtype='stepfilled',alpha=0.5, color = 'g', label = 'Survived')
plt.title("Histogram of Age grouped by Survival condition")
plt.legend(loc = 'upper right')
plt.show()
print("Figure 1")
The histogram tells us:
+ Children are more likely to survive.
+ Adults have a lower chance of surviving.
Question: Does the age distribution differ between the Death and Survived groups?
It is clear that first-class passengers paid more in fares:
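One way to look at this question numerically, before any formal test, is to compute a survival rate per age band. The sketch below uses made-up rows, and the band edges (18 and 100) and the labels `child` and `adult` are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the Titanic file)
toy_df = pd.DataFrame({
    'Age':      [5, 10, 30, 40, 60],
    'Survived': [1, 1, 0, 1, 0],
})

# Assumed age bands: (0, 18] is "child", (18, 100] is "adult"
age_band = pd.cut(toy_df['Age'], bins=[0, 18, 100], labels=['child', 'adult'])

# Survived is coded 0/1, so the mean within each band is the survival rate
rate_by_band = toy_df.groupby(age_band, observed=True)['Survived'].mean()
print(rate_by_band)
```

Rows with a missing Age fall into no band and are left out of the groups, which matches dropping the NAs for this part of the analysis.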
g = sns.stripplot(x = "Pclass", y = "Fare", data = titanic_df)
print("Figure 2")
Since Pclass and Fare carry similar information, we can choose one of them as a factor that made people more likely to survive. Let's take a look at Survived and Pclass:
sns.set_style("whitegrid")
g = sns.factorplot('Survived', col = 'Pclass', col_wrap = 4, data = titanic_df, kind = 'count', size = 2.5, aspect = .8)
sns.despine(left=True)
print("Figure 3")
The chart above shows that lower-class passengers were more likely to die.
sns.set_style("whitegrid")
g = sns.factorplot('Survived', col = 'Sex', col_wrap = 4, data = titanic_df, kind = 'count', size = 2.5, aspect = .8)
sns.despine(left=True)
print("Figure 4")
The chart above shows that males were more likely to die.
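The counts in Figures 3 and 4 can also be summarized as survival rates per group. This is a minimal sketch, assuming a frame with the same `Survived`, `Pclass` and `Sex` columns; the rows are made up for illustration and `survival_rate_by` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

def survival_rate_by(df, column):
    """Return the share of survivors within each level of `column`."""
    # Survived is coded 0/1, so the group mean is the survival rate
    return df.groupby(column)['Survived'].mean()

# Hypothetical toy data, for illustration only (not the Titanic file)
toy_df = pd.DataFrame({
    'Survived': [1, 0, 1, 0, 0, 1],
    'Pclass':   [1, 1, 2, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
})

rate_by_class = survival_rate_by(toy_df, 'Pclass')
rate_by_sex = survival_rate_by(toy_df, 'Sex')
print(rate_by_class)
print(rate_by_sex)
```

Calling `survival_rate_by(titanic_df, 'Pclass')` or `survival_rate_by(titanic_df, 'Sex')` would give the corresponding rates for the real data.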
The independent variable is the survival condition: non-survived (death) or survived.
The dependent variable is the Age of each person in each survival condition group.
Null Hypothesis: The two populations represented by the two conditions (survived group and non-survived group) have the same distribution of age.
$H_0$: Age values of the two groups (survived vs. non-survived) come from the same population.
$H_a$: The non-survived group tends to have larger Age values than the survived group.
The Mann-Whitney U test is used as the statistical test for this hypothesis. It is chosen based on the following assumptions:
Assumption #1: The dependent variable should be measured at the ordinal or continuous level. In this case, age is a continuous variable.
Assumption #2: The independent variable should consist of two categorical, independent groups. Our independent variable is survival condition, which has two categorical, independent groups: survived and non-survived.
Assumption #3: Independence of observations. This is satisfied because no one can be in both the non-survived group and the survived group.
Assumption #4: A Mann-Whitney U test can be used when the two groups are not normally distributed. Based on Figure 1, we can see that the histograms of both groups are not symmetric. They are skewed to the right.
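To make the test statistic concrete, here is a minimal sketch of how a U statistic can be computed directly from pairwise comparisons between two samples. The two small samples are invented for illustration and `mann_whitney_u` is a hypothetical helper; the analysis itself relies on `scipy.stats.mannwhitneyu`, and which of the two U values SciPy reports can depend on its version.

```python
import numpy as np

def mann_whitney_u(x, y):
    """U statistic for sample x: the number of (x_i, y_j) pairs
    with x_i > y_j, where each tie counts as 0.5."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Matrix of all pairwise differences x_i - y_j
    diff = x[:, None] - y[None, :]
    return np.sum(diff > 0) + 0.5 * np.sum(diff == 0)

# Hypothetical ages, for illustration only
older = [30, 40, 50]
younger = [20, 30, 35]

u_older = mann_whitney_u(older, younger)
u_younger = mann_whitney_u(younger, older)
print(u_older, u_younger)
```

The two U values always add up to the product of the sample sizes, so a U far from half of that product indicates that one sample tends to have larger values than the other.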
# Age values for each survival group, with missing ages removed
age_survived = titanic_df.groupby('Survived').get_group(1)['Age'].dropna(axis = 0)
age_non_survived = titanic_df.groupby('Survived').get_group(0)['Age'].dropna(axis = 0)
with_survived_mean = age_survived.mean()
with_non_survived_mean = age_non_survived.mean()
# Note: the default `alternative` of mannwhitneyu differs between SciPy versions,
# which changes how the p-value is computed
U, p = stat.mannwhitneyu(age_non_survived, age_survived)
print("Mean age of Survived group is: {}".format(with_survived_mean))
print("Mean age of Non-Survived group is: {}".format(with_non_survived_mean))
print("U-statistic: {}".format(U))
print("p-value: {}".format(p))
The U-statistic is very high and the p-value = 0.16 > 0.05
=> We fail to reject the null hypothesis at the 5% significance level.
Conclusion: Based on the provided data, there is not enough evidence at the 5% significance level to conclude that age has an effect on survival condition.
The independent variable is the survival condition: non-survived (death) or survived.
The dependent variable is Pclass, which represents socioeconomic status: 1 = Upper, 2 = Middle, 3 = Lower.
Null Hypothesis: In the population, the two categorical variables are independent.
$H_0$: In the population, the survival variable and the Pclass variable are independent.
$H_a$: In the population, the survival variable and the Pclass variable are associated (dependent).
The Chi-square Test of Independence is used as the statistical test for this hypothesis. It is chosen because both variables are categorical.
print("Frequency Table is:")
titanic_df[['Survived','Pclass']].pivot_table(columns= ['Pclass'],index=['Survived'],aggfunc= len, margins = True)
Note: The frequency table shows that far more passengers from the lower social class died than passengers from the other classes.
chi2, p, ddof, expected = stat.chi2_contingency(
titanic_df[['Survived','Pclass']].pivot_table(columns= ['Pclass'],index=['Survived'],aggfunc= len))
expected = pd.DataFrame(expected, index = [0,1], columns = [1,2,3])
expected.index.names = ['Survived']
expected.columns.names = ['Pclass']
msg = """
Test Statistic X^2: {}\np-value: {}\nDegrees of Freedom: {}
==================================================
\n\t\tExpected table"""
print( msg.format( chi2, p, ddof) )
display(expected)
As expected, the Chi-squared statistic is very high, and the p-value is extremely close to zero.
=> We reject the null hypothesis at the 5% significance level.
Conclusion: Based on the provided data, at the 5% significance level there is evidence of an association between socioeconomic class and survival condition.
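The expected table reported by `chi2_contingency` follows from the row and column totals: each expected count is (row total × column total) / grand total. The sketch below reproduces that calculation with NumPy on made-up counts; the `observed` values are hypothetical and are not the Titanic frequencies.

```python
import numpy as np

# Hypothetical 2 x 3 frequency table, for illustration only
observed = np.array([[10, 20, 30],
                     [20, 20, 20]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 3)
grand_total = observed.sum()

# Expected counts under independence of rows and columns
expected = row_totals * col_totals / grand_total

# Pearson chi-squared statistic and its degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(expected)
print(chi2, dof)
```

For a 2 x 2 table, `chi2_contingency` applies Yates' continuity correction by default, so its statistic would differ slightly from this uncorrected formula.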
The independent variable is the survival condition: non-survived (death) or survived.
The dependent variable is Sex, which takes the values female and male.
Null Hypothesis: In the population, the two categorical variables are independent.
$H_0$: In the population, the survival variable and the Sex variable are independent.
$H_a$: In the population, the survival variable and the Sex variable are associated (dependent).
The Chi-square Test of Independence is used as the statistical test for this hypothesis. It is chosen because both variables are categorical.
print("Frequency Table is:")
titanic_df[['Survived','Sex']].pivot_table(columns= ['Sex'],index=['Survived'],aggfunc= len, margins = True)
Note: The frequency table shows that far more males died than any other group.
chi2, p, ddof, expected = stat.chi2_contingency(
titanic_df[['Survived','Sex']].pivot_table(columns= ['Sex'],index=['Survived'],aggfunc= len))
expected = pd.DataFrame(expected, index = [0,1], columns = ['female','male'])
expected.index.names = ['Survived']
expected.columns.names = ['Sex']
msg = """
Test Statistic X^2: {}\np-value: {}\nDegrees of Freedom: {}
==================================================
\n\t\tExpected table"""
print( msg.format( chi2, p, ddof) )
display(expected)
As expected, the Chi-squared statistic is very high, and the p-value is extremely close to zero.
=> We reject the null hypothesis at the 5% significance level.
Conclusion: Based on the provided data, at the 5% significance level there is evidence of an association between gender and survival condition.
Based on the given data, we have very little evidence to conclude that age has any effect on survival condition. On the other hand, gender and socioeconomic class are both associated with survival condition. From the frequency tables above, we can see that lower-class passengers were more likely to die and females were more likely to survive.