Welcome to the final coding challenge! This comprehensive project asks you to apply and combine many of the methods we have learned. When a method was not introduced in our tutorial, a short hint names it; use Python's help() or IPython's ? to bring up the documentation of that function.

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The ultimate goal of this challenge is to build a predictive model that answers the question "what sorts of people are more likely to survive?" by looking at the characteristics of the passengers.

As it's your first day in the data science track, instead of building a classification model, we will analyze the data and answer that question in an exploratory way.

The data has been split into two parts. You will be asked to merge them together.

- part1.csv
- part2.csv

**Data Dictionary**

- Load the data
- Explore the data
- Wrangle, prepare and cleanse the data.
- Analyze, visualize and identify patterns.

**Hint:** pandas and numpy.

In [1]:

```
import pandas as pd
import numpy as np
```

**Hint:** read_csv()

In [2]:

```
df1 = pd.read_csv("part1.csv")
df2 = pd.read_csv("part2.csv")
```

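**Hint**: pd.concat() (DataFrame.append() was removed in pandas 2.0)

In [3]:

```
# stack the two parts row-wise and renumber the rows
df = pd.concat([df1, df2], ignore_index=True)
```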
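To see what row-wise concatenation does to the index, here is a small sketch on two made-up frames. The values are invented for illustration and are not taken from part1.csv or part2.csv.

```python
import pandas as pd

# Two toy frames standing in for the two CSV parts (hypothetical values)
toy1 = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
toy2 = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# Without ignore_index, each part keeps its own row labels, so they repeat
stacked = pd.concat([toy1, toy2])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([toy1, toy2], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```

Renumbering is optional here, but unique row labels make later label-based selection less surprising.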

- Get the column names of the data.
- Get the shape of the data.
- Preview the data by looking at the first 10 rows.
- Which variables/columns are purely numerical? Which columns are string types?
- Which columns have missing values, and how many values are missing in each?
- Get the descriptive statistics of the numerical variables.
- Get the descriptive statistics of the nominal/categorical variables.
- Make a boxplot of the variable Age.
- Find out how many passengers died and how many survived by looking at the Survived column.

In [26]:

```
#get the column names of the new dataset
```

In [4]:

```
df.columns
```

Out[4]:

In [27]:

```
# get the shape info of the new dataset
```

In [5]:

```
df.shape
```

Out[5]:

In [29]:

```
#preview the dataset, looking at the first 10 rows
```

In [6]:

```
df.head(10)
```

Out[6]:

In [21]:

```
#which variables are numerical? which variables hold strings?
```

In [7]:

```
df.dtypes
```

Out[7]:
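Reading df.dtypes by eye works, but the split can also be made programmatically. The sketch below uses select_dtypes() on a small made-up frame; the column values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with two numeric columns and one string column (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Columns with a numeric dtype (int or float)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else, which here means the string column
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(string_cols)
```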

In [8]:

```
#which columns have missing values?
#and how many values are missing from those columns?
```

In [9]:

```
df.isnull().sum()
```

Out[9]:
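On a wide table it can help to list only the columns that actually have gaps, and to see the gaps as a share of all rows. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count missing values per column, then keep only columns with at least one
missing = toy.isnull().sum()
missing = missing[missing > 0]

# The mean of a boolean mask is the fraction of missing rows per column
share = toy.isnull().mean()

print(missing)
print(share)
```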

In [10]:

```
#The descriptive statistics of those numerical variables/columns?
#specifically we want to know the mean, std, min, max and percentiles.
#Hint: describe()
```

In [11]:

```
df.describe()
```

Out[11]:

In [42]:

```
#The descriptive statistics of the categorical variables?
#e.g., count, unique, top, frequency
#Hint: .describe(include='object')
#(the np.object alias was removed in NumPy 1.24, so pass the string 'object' instead)
#Link to the help documentation of the describe function
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
```

In [12]:

```
df.describe(include='object')
```

Out[12]:
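An alternative that does not depend on how pandas stores strings is to exclude the numeric columns instead of including the object ones. The sketch below runs on a made-up frame and shows where count, unique, top and freq end up in the result.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (hypothetical values)
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54],
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", "Q"],
})

# Describe every column that is not numeric
summary = toy.describe(exclude="number")

# Rows of the summary are count, unique, top and freq
print(summary)
print(summary.loc["top", "Sex"], summary.loc["freq", "Sex"])
```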

In [49]:

```
#Plot the Age variable using a boxplot
#Hint:.plot(y='Age', kind='box')
# if the diagram is not shown, please do the following in a new cell
```

`import matplotlib`

`%matplotlib inline`

In [13]:

```
df.plot(y='Age',kind='box')
```

Out[13]:
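The numbers behind a boxplot can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for flagging outliers (matplotlib's default whisker setting) and uses invented ages rather than the Titanic data.

```python
import pandas as pd

# Toy ages with one very small and one very large value (hypothetical values)
ages = pd.Series([2, 22, 24, 28, 30, 35, 40, 80], name="Age")

# The box spans the first to the third quartile
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(q1, q3, iqr)
print(outliers.tolist())
```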

In [14]:

```
# Count the values of the Survived variable (the counts can then be drawn as a bar chart)
# Hint: value_counts()
```

In [21]:

```
df.Survived.value_counts()
```

Out[21]:
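Counts are easier to compare as proportions. The following sketch, on an invented Survived column, shows value_counts() with and without normalize=True; the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

# Toy outcome column: 0 = died, 1 = survived (hypothetical values)
survived = pd.Series([0, 1, 0, 0, 1], name="Survived")

# Absolute number of passengers per outcome
counts = survived.value_counts()

# Share of passengers per outcome
shares = survived.value_counts(normalize=True)

print(counts)
print(shares)

# counts.plot(kind="bar") would draw the counts as a bar chart
```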

Drop the Cabin column, as it has too many missing values and is not needed for the classification model.

**Hint**: .drop()

In [22]:

```
df.drop(columns=['Cabin'],inplace=True)
```

- For numerical variables, fill missing values with the median (Age -> median). **Hint:** df.Age.fillna()
- For nominal/categorical variables, fill missing values with the most frequent category (Embarked -> most frequent port). **Hint:** df.Embarked.fillna()

In [55]:

```
#impute the missings of the Age column
```

In [23]:

```
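# assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())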
```

In [24]:

```
#impute the missings of the Embarked column
#Hint: how to get the most frequent value of Embarked
#Hint: df.Embarked.mode()[0]
#or df.Embarked.value_counts().index[0]
```

In [25]:

```
most_fre = df.Embarked.mode()[0]
# assign the filled column back instead of using inplace=True on a selected column
df['Embarked'] = df['Embarked'].fillna(most_fre)
```

In [60]:

```
#check if all the missing values are gone
```

In [28]:

```
df.isnull().sum()
```

Out[28]:
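The two hints for finding the most frequent port can be compared side by side. The sketch below uses invented values with no ties; with ties the two approaches may pick different categories.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", None, "S", "C"],
})

# Two ways to get the most frequent category; both skip missing values
by_mode = toy["Embarked"].mode()[0]
by_counts = toy["Embarked"].value_counts().index[0]

# Fill the gaps and assign the result back to the frame
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(by_mode)

# A total of zero means no missing values remain
remaining = int(toy.isnull().sum().sum())
print(by_mode, by_counts, remaining)
```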

- Get the survival rates by gender (hypothesis: were women more likely to survive?)
- Get the survival rates by ticket class (hypothesis: did rich people have a higher chance to survive?)
- Get the survival rates by both gender and ticket class
- What is the survival rate of Rose? What is the survival rate of Jack?

In [31]:

```
#use the groupby method to calculate the means of the Survived column
```

In [35]:

```
#calculate the survival rate by Sex
```

In [34]:

```
df.groupby(by='Sex')['Survived'].mean()
```

Out[34]:

In [81]:

```
#calculate the survival rate by Pclass group
```

In [36]:

```
df.groupby(by='Pclass')['Survived'].mean()
```

Out[36]:

In [37]:

```
#calculate the survival rate by both Pclass and Sex
```

In [38]:

```
df.groupby(by=['Sex', 'Pclass'])['Survived'].mean()
```

Out[38]:

In [39]:

```
#What is the survival rate of Rose? How about Jack?
```

Rose (female, first class): 96.80%, Jack (male, third class): 13.54%
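Rates like these can be read straight out of the grouped result instead of scanning the printed output. A minimal sketch on invented data, assuming Rose is looked up as a first-class woman and Jack as a third-class man:

```python
import pandas as pd

# Toy passenger list (hypothetical values, not the Titanic data)
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 1, 3, 3, 1, 3],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group is the survival rate of that group
rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

# Look up a single group by its (Sex, Pclass) pair
rose_rate = rates.loc[("female", 1)]
jack_rate = rates.loc[("male", 3)]

# unstack() turns the class level into columns for a compact table
table = rates.unstack("Pclass")

print(f"Rose: {rose_rate:.2%}, Jack: {jack_rate:.2%}")
print(table)
```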

In [ ]:

```
```