Exploring Survival on the Titanic

Welcome to the final coding challenge! This project asks you to apply and combine many of the methods we have learned. Where a method was not introduced in the tutorial, a short hint names it; use Python's help() or IPython's ? to bring up its documentation.

The Case

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The ultimate goal of this challenge is to build a predictive model that answers the question "what sorts of people were more likely to survive?" using the characteristics of the passengers.

As it's your first day in the data science track, we will not build a classification model yet. Instead, we will analyze the data and answer that question in an exploratory way.

Data

The data has been split into two parts. You will be asked to merge them together.

  1. part1.csv
  2. part2.csv

Data Dictionary

Workflow of Exploratory Data Analysis

  1. Load the data.
  2. Explore the data.
  3. Wrangle, prepare and cleanse the data.
  4. Analyze, visualize and identify patterns.

Fun time

Download the data.

Dataset 1: click here
Dataset 2: click here

Import packages

Hint: pandas and numpy.

In [1]:
import pandas as pd
import numpy as np

Load the two datasets

Hint: read_csv()

In [2]:
df1 = pd.read_csv("part1.csv")
df2 = pd.read_csv("part2.csv")

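Append dataset 2 to dataset 1

Hint: pd.concat() (DataFrame.append() was removed in pandas 2.0)

In [3]:
# Stack the rows of both parts; ignore_index=True renumbers the rows 0..n-1
df = pd.concat([df1, df2], ignore_index=True)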
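
To see what combining two parts row-wise does, here is a minimal sketch on two made-up frames. The names part1 and part2 and their values are illustrative only and are not the real data; pd.concat is used because it works across pandas versions.

```python
import pandas as pd

# Two small made-up frames standing in for part1.csv and part2.csv.
part1 = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
part2 = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# Stack the rows; ignore_index=True renumbers the result 0..n-1
# instead of keeping each part's own row labels.
combined = pd.concat([part1, part2], ignore_index=True)

print(combined.shape)
print(list(combined.index))
```

Without ignore_index=True the row labels of each part are kept, so the same label can appear twice in the combined frame.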

Explore Data

  1. Get the column names of the data.
  2. Get the shape of the data.
  3. Preview the data by looking at the first 10 rows.
  4. Which variables/columns are purely numerical? Which columns are string types?
  5. Which columns have missing values, and how many values are missing?
  6. Get the descriptive statistics of the numerical variables.
  7. Get the descriptive statistics of the nominal/categorical variables.
  8. Make a boxplot of the variable Age.
  9. Find out how many passengers died and how many survived by looking at the Survived column.
In [26]:
#get the column names of the new dataset
In [4]:
df.columns
Out[4]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [27]:
# get the shape info of the new dataset
In [5]:
df.shape
Out[5]:
(891, 12)
In [29]:
#preview the dataset, looking at the first 10 rows
In [6]:
df.head(10)
Out[6]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
In [21]:
#which variables are numerical? which variables hold strings?
In [7]:
df.dtypes
Out[7]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
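
Reading the dtypes listing by eye works for 12 columns. As a hedged alternative, select_dtypes can split the columns automatically. The sketch below uses a made-up frame (toy and its values are assumptions, not the Titanic data).

```python
import pandas as pd

# Made-up frame with two numeric columns and one string column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# "number" matches every integer and float dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric (here: the string column).
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

Note that a numeric dtype does not make a variable quantitative: Pclass and Survived are stored as int64 but are really categories.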
In [8]:
#which columns have missing values? 
#and how many values are missing from those columns?
In [9]:
df.isnull().sum()
Out[9]:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
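
Raw counts are easier to judge as shares: 177 of 891 ages (about 19.9%) and 687 of 891 cabins (about 77.1%) are missing. A minimal sketch of computing both numbers at once, on a made-up frame (toy and its values are assumptions):

```python
import pandas as pd

# Made-up frame with gaps in two of its three columns.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives a True/False table; sum() counts the Trues per column,
# mean() gives the fraction of Trues per column.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(summary)
```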
In [10]:
#The descriptive statistics of those numerical variables/columns? 
#specifically we want to know the mean, std, min, max and percentiles.
#Hint: describe()
In [11]:
df.describe()
Out[11]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
In [42]:
#The descriptive statistics of the categorical variables? 
#e.g., count, unique, top, frequency
#Hint: .describe(include=object)
#(np.object was removed in NumPy 1.24; the built-in object works instead)
#Link to the help documentation of the describe function 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
In [12]:
df.describe(include=object)
Out[12]:
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Morley, Mr. William male 1601 G6 S
freq 1 577 7 4 644
In [49]:
#Plot the Age variable using a boxplot
#Hint:.plot(y='Age', kind='box')
# if the diagram is not shown, please do the following in a new cell 

import matplotlib
%matplotlib inline

In [13]:
df.plot(y='Age',kind='box')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x10827186128>
In [14]:
# Count how many passengers died (0) and survived (1); the counts can be drawn as a bar chart
# Hint: value_counts()
In [21]:
df.Survived.value_counts()
Out[21]:
0    549
1    342
Name: Survived, dtype: int64
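
The cell above only produces the counts. A minimal sketch of turning them into a bar chart follows. The two counts are copied from the output above so the example runs on its own; with the real data you would call .plot(kind='bar') directly on df.Survived.value_counts(). The axis labels are illustrative choices.

```python
import pandas as pd

# Counts copied from the value_counts() output (0 = died, 1 = survived).
counts = pd.Series({0: 549, 1: 342}, name="Survived")

# One bar per category; rot=0 keeps the tick labels horizontal.
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("Survived (0 = no, 1 = yes)")
ax.set_ylabel("Number of passengers")

# Share of passengers in the data who survived.
survival_share = counts.loc[1] / counts.sum()
print(round(survival_share, 4))
```

The share agrees with the mean of the Survived column reported by describe().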

Drop columns

Drop the Cabin column: most of its values are missing (687 of 891), and we will not use it in this analysis.
Hint: .drop()

In [22]:
df.drop(columns=['Cabin'],inplace=True)

Impute missing values

  1. For numerical variables, use the median: Age -> median. Hint: df.Age.fillna()
  2. For nominal/categorical variables, use the most frequent category: Embarked -> most frequent port. Hint: df.Embarked.fillna()
In [55]:
#impute the missings of the Age column
In [23]:
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
In [24]:
#impute the missings of the Embarked column
#Hint: how to get the most frequent value of Embarked
#df.Embarked.mode()[0]
#or df.Embarked.value_counts().index[0]
In [25]:
most_fre = df.Embarked.mode()[0]
df['Embarked'] = df['Embarked'].fillna(most_fre)
In [60]:
#check if all the missing values are gone
In [28]:
df.isnull().sum()
Out[28]:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64
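
To check what the two imputation rules do, here is a minimal sketch on a made-up frame small enough to verify by hand (toy and its values are assumptions, not the Titanic data). The median of the known ages 22, 30 and 40 is 30, and the most frequent port is "S".

```python
import pandas as pd

# Made-up frame with one missing age and one missing port.
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

filled = toy.copy()

# Numerical column: replace gaps with the median of the known values.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Categorical column: replace gaps with the most frequent category.
# mode() returns a Series (ties are possible), so take its first entry.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(filled)
print(filled.isnull().sum().sum())
```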

Analyze the data

  1. Get the survival rates by gender. (Hypothesis: were women more likely to survive?)
  2. Get the survival rates by ticket class. (Hypothesis: did rich people have a higher chance of surviving?)
  3. Get the survival rates by both gender and ticket class.
  4. What is the survival rate of Rose? What is the survival rate of Jack?
In [31]:
#use the groupby method to calculate the means of the Survived column
In [35]:
#calculate the survival rate by Sex
In [34]:
df.groupby(by='Sex')['Survived'].mean()
Out[34]:
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
In [81]:
#calculate the survival rate by Pclass group
In [36]:
df.groupby(by='Pclass')['Survived'].mean()
Out[36]:
Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
In [37]:
#calculate the survival rate by both Pclass and Sex
In [38]:
df.groupby(by=['Sex', 'Pclass'])['Survived'].mean()
Out[38]:
Sex     Pclass
female  1         0.968085
        2         0.921053
        3         0.500000
male    1         0.368852
        2         0.157407
        3         0.135447
Name: Survived, dtype: float64
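
Because Survived is coded 0/1, the mean of each group is that group's survival rate. A minimal sketch on a made-up frame (toy and its values are assumptions) shows how the two-level result can be reshaped into a table and how a single group's rate can be looked up:

```python
import pandas as pd

# Made-up passengers: six rows, small enough to check by hand.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Mean of a 0/1 column = fraction of 1s = survival rate per group.
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

# Move Pclass from the row index to the columns for a compact table.
table = rates.unstack("Pclass")
print(table)

# Look up single groups by their (Sex, Pclass) label.
female_first = rates.loc[("female", 1)]
male_third = rates.loc[("male", 3)]
print(female_first, male_third)
```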
In [39]:
#What is the survival rate of Rose? How about Jack?  

Rose (a first-class woman): 96.81%; Jack (a third-class man): 13.54%

In [ ]: