Exploring Survival on the Titanic

Welcome to the final coding challenge! This comprehensive project asks you to apply and combine many of the methods we have learned. Where a method was not introduced in our tutorial, a short hint names the method to use; you can call Python's help() or IPython's ? on it to bring up its documentation.

The Case

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The ultimate goal of this challenge is to build a predictive model that answers the question "what sorts of people were more likely to survive?" from the characteristics of the passengers.

As it's your first day in the data science track, we will not build a classification model; instead, we will analyze the data and answer that question in an exploratory way.

Data

The data has been split into two parts, which you will be asked to combine into one dataset.

part1.csv
part2.csv

Data Dictionary

Workflow of Exploratory Data Analysis

Load the data.
Explore the data.
Wrangle, prepare and cleanse the data.
Analyze, visualize and identify patterns.

Fun time

Download the data

Dataset 1: click here
Dataset 2: click here

Import packages

Hint: pandas and numpy.

import pandas as pd
import numpy as np

Load the two datasets

Hint: read_csv()

df1 = pd.read_csv("part1.csv")
df2 = pd.read_csv("part2.csv")

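Append dataset 2 to dataset 1

Hint: pd.concat() (the older DataFrame.append() was removed in pandas 2.0)

# stack the rows of df2 under df1 and renumber the rows 0..n-1
df = pd.concat([df1, df2], ignore_index=True)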
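
Stacking two frames row-wise raises one subtlety: each CSV is read with its own 0-based row labels. The sketch below uses two tiny stand-in frames (the names `left` and `right` and their values are hypothetical, not the challenge data) to show how `pd.concat` treats those labels with and without `ignore_index=True`.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical values, not the Titanic data).
left = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# By default each frame keeps its own row labels, so labels repeat.
stacked = pd.concat([left, right])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([left, right], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Either form gives the same rows and the same shape; the difference only matters when you later select rows by label.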

Explore Data

Get the column names of the data.
Get the shape of the data.
Preview the data by looking at the first 10 rows.
Which variables/columns are purely numerical? Which columns hold strings?
Which columns have missing values, and how many values are missing in each?
Get the descriptive statistics of the numerical variables.
Get the descriptive statistics of the nominal/categorical variables.
Make a boxplot of the variable Age.
Find out how many passengers died and how many survived by looking at the Survived column.

#get the column names of the new dataset

df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# get the shape info of the new dataset

df.shape

(891, 12)

#preview the dataset, looking at the first 10 rows

df.head(10)

#which variables are numerical? which variables hold strings?

df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

#which columns have missing values? 
#and how many values are missing from those columns?

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
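
Raw counts are easier to judge as a share of all rows. As a rough sketch, the block below re-enters the three non-zero counts printed above and divides them by the 891 rows reported by `df.shape`; the variable names are illustrative.

```python
import pandas as pd

# Missing-value counts copied from the output above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of rows reported by df.shape

# Share of rows with a missing value, as a percentage.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

On the full frame, `df.isnull().mean()` returns the same fractions directly, because the mean of a boolean column is the share of True values.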

#The descriptive statistics of those numerical variables/columns? 
#specifically we want to know the mean, std, min, max and percentiles.
#Hint: describe()

df.describe()

#The descriptive statistics of the categorical variables? 
#e.g., count, unique, top, frequency
#Hint: .describe(include=[object])
#(np.object was removed in NumPy 1.24; the builtin object works instead)
#Link to the help documentation of the describe function 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

df.describe(include=[object])

#Plot the Age variable as a boxplot
#Hint: .plot(y='Age', kind='box')
# if the diagram is not shown, run the following two lines in a new cell first

import matplotlib
%matplotlib inline

df.plot(y='Age',kind='box')

<matplotlib.axes._subplots.AxesSubplot at 0x10827186128>
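
To read the boxplot, it helps to know where the whiskers can end. Assuming matplotlib's default whisker setting of 1.5 times the interquartile range, the sketch below works out the fences from the Age quartiles that `df.describe()` reports (20.125 and 38.0, computed before any imputation). The variable names are illustrative.

```python
# Quartiles of Age as printed by df.describe() (before imputation).
q1, q3 = 20.125, 38.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # whiskers cannot extend below this value
upper_fence = q3 + 1.5 * iqr    # whiskers cannot extend above this value

print(iqr, lower_fence, upper_fence)  # 17.875 -6.6875 64.8125
```

Under that assumption, ages above roughly 64.8 are drawn as individual outlier points, which is consistent with the maximum age of 80 in the data. The lower fence is negative, so no age falls below it.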

# Count the passengers who died (0) and who survived (1)
# Hint: value_counts(); chain .plot(kind='bar') to draw the counts as a bar chart

df.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64
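
The counts can be turned into shares of all passengers in the dataset. A minimal sketch that re-enters the two counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output above: 0 = died, 1 = survived.
counts = pd.Series({0: 549, 1: 342}, name="Survived")

# Divide by the total to get the share of each outcome.
shares = counts / counts.sum()
print(shares.round(4))
```

The share for 1 matches the mean of the Survived column (0.383838) in the `describe()` output. On the full frame, `df.Survived.value_counts(normalize=True)` should give these shares in one step.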

Drop columns

Drop the Cabin column: most of its values are missing (687 of 891 rows), and we will not use it in this analysis.
Hint: .drop()

df.drop(columns=['Cabin'],inplace=True)

Impute missing values

For numerical variables, use the median: fill the missing Age values with the median age. Hint: df.Age.fillna()
For nominal/categorical variables, use the most frequent category: fill the missing Embarked values with the most frequent port. Hint: df.Embarked.fillna()
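
The two imputation rules can be tried on a small frame first. The sketch below uses a hypothetical four-row frame called `toy` (made-up values, not the Titanic data) and assigns the filled columns back rather than filling in place.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with one gap per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Embarked": ["S", "C", None, "S"],
})

age_median = toy["Age"].median()      # median() skips missing values
top_port = toy["Embarked"].mode()[0]  # most frequent non-missing value

# Assign the filled columns back to the frame.
toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(top_port)

print(toy)
```

The median is less sensitive than the mean to a few very old or very young passengers, which is one reason to prefer it for a skewed column such as Age.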

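#impute the missing values of the Age column
#(assign the result back; fillna(inplace=True) on a selected column is unreliable in recent pandas)

df['Age'] = df['Age'].fillna(df['Age'].median())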

#impute the missing values of the Embarked column
#Hint: get the most frequent value of Embarked with
#   df.Embarked.mode()[0]
#or df.Embarked.value_counts().index[0]

most_fre = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(most_fre)

#check if all the missing values are gone

df.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

Analyze the data

Get the survival rates by gender (hypothesis: were women more likely to survive?).
Get the survival rates by ticket class (hypothesis: did rich people have a higher chance of surviving?).
Get the survival rates by both gender and ticket class.
What is the survival rate of Rose? What is the survival rate of Jack?

#use the groupby method to calculate the mean of the Survived column
#(Survived is coded 0/1, so its mean within a group is that group's survival rate)

#calculate the survival rate by Sex

df.groupby(by='Sex')['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

#calculate the survival rate by Pclass group

df.groupby(by='Pclass')['Survived'].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

#calculate the survival rate by both Sex and Pclass

df.groupby(by=['Sex', 'Pclass'])['Survived'].mean()

Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64
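
The same two-way breakdown can also be laid out as a grid, with one row per sex and one column per class. A minimal sketch using `pivot_table` on a hypothetical six-row frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical six-row frame, not the Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Rows: Sex, columns: Pclass, cells: mean of Survived (the survival rate).
table = toy.pivot_table(index="Sex", columns="Pclass",
                        values="Survived", aggfunc="mean")
print(table)
```

Calling `.unstack()` on the grouped result should produce the same layout; which form to use is a matter of taste.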

#What is the survival rate of Rose? How about Jack?

Taking Rose as a first-class female passenger and Jack as a third-class male passenger: Rose: 96.81%, Jack: 13.54%
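
These two figures can be looked up rather than read off by eye. The sketch below re-enters the six rates printed above as a Series with a (Sex, Pclass) index, then selects two groups. It assumes that Rose travelled first class and Jack third class; that assumption comes from the film characters, not from the dataset.

```python
import pandas as pd

# Survival rates copied from the groupby output above.
index = pd.MultiIndex.from_product(
    [["female", "male"], [1, 2, 3]], names=["Sex", "Pclass"]
)
rates = pd.Series(
    [0.968085, 0.921053, 0.500000, 0.368852, 0.157407, 0.135447],
    index=index, name="Survived",
)

# Assumption: Rose = first-class woman, Jack = third-class man.
rose = rates.loc[("female", 1)]
jack = rates.loc[("male", 3)]

print(f"Rose: {rose:.2%}, Jack: {jack:.2%}")  # Rose: 96.81%, Jack: 13.54%
```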

Output of df.head(10):
	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

Output of df.describe():
	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Output of describe() on the string columns:
	Name	Sex	Ticket	Cabin	Embarked
count	891	891	891	204	889
unique	891	2	681	147	3
top	Morley, Mr. William	male	1601	G6	S
freq	1	577	7	4	644