Exploring Survival on the Titanic

Welcome to the final coding challenge! This comprehensive project requires you to apply and combine many of the methods we have learned. Where a method was not introduced in our tutorial, a short hint names the method to use; you can call Python's help() or IPython's ? to bring up its help documentation.

The Case

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The ultimate goal of this challenge is to build a predictive model that answers the question "what sorts of people were more likely to survive?" by looking at the characteristics of the passengers.

As it's your first day in the data science track, instead of building a classification model, we will analyze the data and answer that question in an exploratory way.

Data

The data has been split into two parts. You will be asked to merge them together.

  1. part1.csv
  2. part2.csv

Data Dictionary

Workflow of Exploratory Data Analysis

  1. Load the data
  2. Explore the data
  3. Wrangle, prepare and cleanse the data.
  4. Analyze, visualize and identify patterns.

Fun time

Download the data.

Dataset 1: click here
Dataset 2: click here

Import packages

Hint: pandas and numpy.

In [ ]:
 

Load the two datasets

Hint: read_csv()
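As a rough illustration of what read_csv() returns, here is a minimal sketch. The two in-memory strings are made-up stand-ins for part1.csv and part2.csv, and the column names (PassengerId, Survived, Pclass, Sex, Age) are assumptions; check them against the real files. With the downloaded data you would pass the file name instead, e.g. pd.read_csv("part1.csv").

```python
import io

import pandas as pd

# Made-up stand-ins for part1.csv and part2.csv (assumed column names).
part1_text = "PassengerId,Survived,Pclass,Sex,Age\n1,0,3,male,22\n2,1,1,female,38\n"
part2_text = "PassengerId,Survived,Pclass,Sex,Age\n3,1,3,female,26\n4,0,3,male,\n"

# read_csv accepts a file path or any file-like object.
df1 = pd.read_csv(io.StringIO(part1_text))
df2 = pd.read_csv(io.StringIO(part2_text))

# Each part becomes its own DataFrame; empty fields are read as NaN.
print(df1.shape, df2.shape)
print(df2)
```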

In [ ]:
 

Append dataset 2 to dataset 1

Hint: pd.concat() (the older DataFrame.append() method was removed in pandas 2.0)
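A minimal sketch of stacking two DataFrames row-wise. DataFrame.append() was removed in pandas 2.0, so this sketch uses pd.concat(), which works in both older and newer versions. The two small frames and their column names are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for the two loaded datasets (assumed columns).
df1 = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
df2 = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# ignore_index=True renumbers the rows from 0 instead of repeating
# each part's original index labels.
df = pd.concat([df1, df2], ignore_index=True)
print(df)
```

Without ignore_index=True the combined frame would keep the labels 0, 1, 0, 1, which can make later label-based lookups ambiguous.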

In [ ]:
 

Explore Data

  1. Get the column names of the data
  2. Get the shape of the data
  3. Preview the data by looking at the first 10 rows
  4. Which variables/columns are purely numerical? Which columns are string types?
  5. Which columns have missing values, and how many values are missing in each?
  6. Get the descriptive statistics of the numerical variables
  7. Get the descriptive statistics of the nominal/categorical variables
  8. Make a boxplot of the variable Age
  9. Find out how many passengers died and how many survived by looking at the Survival column
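The first few exploration steps can be sketched on a toy table as follows. The data and the column names are made up for illustration (the real column names may differ), so treat this as a pattern rather than the solution.

```python
import numpy as np
import pandas as pd

# Toy data with assumed column names and two deliberately missing values.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", "S", None],
})

print(df.columns.tolist())  # column names
print(df.shape)             # (number of rows, number of columns)
print(df.head(10))          # first 10 rows (fewer if the table is shorter)
print(df.dtypes)            # numeric columns vs. string columns

# Count missing values per column and keep only columns that have any.
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])

# describe() summarises the numerical columns by default.
numeric_summary = df.describe()
print(numeric_summary)
```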
In [26]:
#get the column names of the new dataset
In [ ]:
 
In [27]:
# get the shape info of the new dataset
In [ ]:
 
In [29]:
#preview the dataset, looking at the first 10 rows
In [ ]:
 
In [21]:
#which variables are numerical? which variables hold strings?
In [ ]:
 
In [8]:
#which columns have missing values? 
#and how many values are missing from those columns?
In [ ]:
 
In [10]:
#The descriptive statistics of those numerical variables/columns? 
#specifically we want to know the mean, std, min, max and percentiles.
#Hint: describe()
In [ ]:
 
In [42]:
#The descriptive statistics of the categorical variables? 
#e.g., count, unique, top, frequency
#Hint: .describe(include=object)  (np.object was removed in NumPy 1.24; use the built-in object)
#Link to the help documentation of the describe function 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
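A small sketch of summarising the non-numerical columns. Instead of include=object, this variant uses exclude=[np.number], which keeps every column that is not numeric regardless of how the strings are stored; that choice is an assumption made for this sketch, not something the exercise requires. The toy data and column names are made up.

```python
import numpy as np
import pandas as pd

# Toy data (assumed column names).
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# Excluding numeric dtypes leaves the categorical/string columns, for
# which describe() reports count, unique, top and freq.
categorical_summary = df.describe(exclude=[np.number])
print(categorical_summary)
```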
In [ ]:
 
In [49]:
#Plotting the Age variable using a boxplot
#Hint:.plot(y='Age', kind='box')
# if the diagram is not shown, please do the following in a new cell 

import matplotlib
%matplotlib inline

In [ ]:
 
In [2]:
# how many passengers died and how many survived? 
# Hint: value_counts()
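A minimal sketch of value_counts() on a made-up survival column, assuming 0 means died and 1 means survived (check the data dictionary for the actual coding and column name).

```python
import pandas as pd

# Made-up outcomes: 0 = died, 1 = survived (assumed coding).
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# Absolute counts per category, most frequent first.
counts = survived.value_counts()
print(counts)

# normalize=True returns proportions instead of counts.
shares = survived.value_counts(normalize=True)
print(shares)
```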
In [ ]:
 

Drop columns

Drop the Cabin column: it has too many missing values and is not relevant to the classification model.
Hint: .drop()
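A minimal sketch of dropping a column with .drop(). The toy frame is made up; only the Cabin column name comes from the exercise.

```python
import pandas as pd

# Toy data in which Cabin is mostly missing.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Cabin": ["C85", None, None],
})

# drop() returns a new DataFrame, so assign the result back.
df = df.drop(columns=["Cabin"])
print(df.columns.tolist())
```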

In [ ]:
 

Impute missing values

  1. For numerical variables, use the median (Age -> median). Hint: df.Age.fillna()
  2. For nominal/categorical variables, use the most frequent category (Embarked -> most frequent port). Hint: df.Embarked.fillna()
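The two imputation rules above can be sketched on made-up data like this. Assigning the result of fillna() back to the column is a choice made for this sketch; it avoids relying on in-place modification of a selected column.

```python
import numpy as np
import pandas as pd

# Toy data with one missing Age and one missing Embarked value.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Embarked": ["S", "C", "S", None],
})

# Numerical column: fill with the median of the observed values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the most frequent category.
# mode() returns a Series, so take its first entry.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Total number of missing values left in the table.
remaining_missing = df.isna().sum().sum()
print(df)
print(remaining_missing)
```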
In [55]:
#impute the missing values of the Age column
In [ ]:
 
In [24]:
#impute the missing values of the Embarked column
#Hint: how to get the most frequent value of Embarked:
#df.Embarked.mode()[0]
#or df.Embarked.value_counts().index[0]
In [ ]:
 
In [4]:
#check if all the missing values are gone
In [ ]:
 

Analyze the data

  1. Get the survival rates by gender (hypothesis: were women more likely to survive?)
  2. Get the survival rates by ticket class (hypothesis: did rich people have a higher chance of surviving?)
  3. Get the survival rates by both gender and ticket class
  4. What is the survival rate of Rose? What is the survival rate of Jack?
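Because the survival column is coded 0/1, its mean within a group is that group's survival rate. A minimal sketch on made-up data follows; the column names and the numbers are assumptions and do not reflect the real dataset. It also assumes Rose is treated as a first-class female passenger and Jack as a third-class male passenger, as in the 1997 film.

```python
import pandas as pd

# Made-up passengers (assumed column names; 1 = survived, 0 = died).
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 3, 1, 1],
    "Survived": [1, 0, 0, 1, 0, 1],
})

# Mean of a 0/1 column within each group = survival rate of that group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
rate_by_class = df.groupby("Pclass")["Survived"].mean()
rate_by_both = df.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rate_by_both)

# Look up one (class, sex) combination in the two-level result.
rose_rate = rate_by_both.loc[(1, "female")]  # first-class woman
jack_rate = rate_by_both.loc[(3, "male")]    # third-class man
print(rose_rate, jack_rate)
```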
In [31]:
#use the groupby method to calculate the means of the survival column
In [35]:
#calculate the survival rate by Sex
In [ ]:
 
In [81]:
#calculate the survival rate by pclass group
In [ ]:
 
In [37]:
#calculate the survival rate by both pclass and sex
In [ ]:
 
In [5]:
#What is the survival rate of Rose? How about Jack?  
In [ ]: