Essential Functions

Reading/Writing Data

Pandas can read and write most common data file formats. Below is a table of supported data formats and their reader & writer functions.

[Table: supported data formats and their reader/writer functions]
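As a quick taste of the pattern, here is a minimal sketch: readers are top-level pandas functions (pd.read_*) and writers are DataFrame methods (.to_*). The file name is made up for illustration.

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.to_csv('out.csv', index=False)   # writer: a DataFrame method, .to_<format>()
df2 = pd.read_csv('out.csv')        # reader: a top-level function, pd.read_<format>()
# Other formats follow the same pairing, e.g., read_excel/to_excel,
# read_json/to_json, read_pickle/to_pickle.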

Import Parameters

read_csv

read_csv() has more than 50 parameters! Here are some of the most important ones to pay attention to; a short sketch that combines several of them follows the list.

  1. filepath_or_buffer: any valid string path is acceptable. The string could also be a URL. For example, Windows: "D:/data/xx.csv"; macOS/Linux: "/usr/you/document/xxx.csv".
    Tip: for Mac users, if you do not know the path of a file, you can simply drag the target file into a terminal. The terminal then shows the path of that file.
  2. sep: the separator/delimiter used to parse the file. Defaults to ','. If your data is a TSV (tab-separated values) file, pass '\t' to this parameter.
  3. header: the row number to use as column names. Defaults to 'infer'. If your file contains no header row, set this parameter to None and also pass a list of names to the names parameter.
  4. names: a list of column names to use, typically when your file has no header row.
  5. usecols: return a subset of the columns. Instead of reading the whole dataset, this parameter tells pandas to read only the columns of interest. It is particularly useful when you are dealing with a massive dataset.
  6. encoding: defaults to 'utf-8'. See the Python codecs documentation for a list of all supported encodings.
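Here is a minimal sketch that combines several of these parameters; the file path and column names are hypothetical.

import pandas as pd

df = pd.read_csv(
    'data/scores.tsv',               # hypothetical path; the string could also be a URL
    sep='\t',                        # the file is tab-separated
    header=None,                     # the file has no header row...
    names=['id', 'name', 'score'],   # ...so supply column names ourselves
    usecols=['id', 'score'],         # load only the columns of interest
    encoding='utf-8',                # the default, stated explicitly
)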

Example

  1. Use the link below to download the CSV file to your laptop, then use the read_csv() method to load it into the Notebook.
  2. The URL of the Titanic dataset: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv
In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv(filepath_or_buffer='data/titanic.csv')  # change the file path accordingly, or use the URL above
In [3]:
data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Exploring Data

The first thing to do after your data is loaded into the Notebook is to get to know its basic information (e.g., the dimensions, size, shape, and data types of all the columns) and roughly what the data looks like. All of this basic information can be retrieved from the DataFrame object's attributes.

Hint: What are the attributes of an object? Attributes are properties computed when you instantiate an object; unlike methods, they are accessed without parentheses.

Getting the basic info of the data through attributes

  1. shape: return a tuple representing the dimensionality of the DataFrame
  2. ndim: return an integer representing the number of dimensions
  3. dtypes: return a Series containing the data type of each column of the DataFrame
  4. index: return the index (also known as the row labels) of the DataFrame
  5. columns: return the column labels of the DataFrame as an Index
In [5]:
#Example: Here I am using some Numpy functions to manually create a 2-d array and then transform it to a DataFrame instance. 
#You could skip over this step and simply use the Titanic data. 
data = pd.DataFrame(np.random.randint(0,10, (4,3)), 
                    columns= ['col1', 'col2','col3'],
                   index=['row1','row2','row3','row4'])
In [6]:
data.shape
Out[6]:
(4, 3)
In [18]:
data.ndim
Out[18]:
2
In [19]:
data.dtypes
Out[19]:
col1    int32
col2    int32
col3    int32
dtype: object
In [24]:
data.index
Out[24]:
Index(['row1', 'row2', 'row3', 'row4'], dtype='object')
In [23]:
data.columns
Out[23]:
Index(['col1', 'col2', 'col3'], dtype='object')

Viewing the data

  1. head(n): return the first n rows. It is a quick way to get a sense of the data and to check that it loaded as you expect. The parameter n defaults to 5.
  2. tail(n): return the last n rows. The parameter n defaults to 5.
  3. describe(percentiles=None, include=None, exclude=None): generate descriptive statistics of the included columns, summarizing their central tendency, dispersion, and shape while ignoring NaN values. By default, the summary covers only numeric columns.
  4. mean(), sum(), min(), max(), idxmin(), idxmax(): get the mean, sum, minimum, maximum, index of the minimum value, and index of the maximum value of each column.
In [122]:
#Example
data = pd.DataFrame(np.random.randint(0,10, (4,3)), 
                    columns= ['col1', 'col2','col3'],
                   index=['row1','row2','row3','row4'])
In [123]:
data.head(2)
Out[123]:
col1 col2 col3
row1 6 5 1
row2 0 9 0
In [124]:
data.tail(1)
Out[124]:
col1 col2 col3
row4 9 3 5
In [125]:
data.describe()
Out[125]:
col1 col2 col3
count 4.000000 4.00 4.00000
mean 4.250000 5.75 3.00000
std 4.031129 2.50 2.94392
min 0.000000 3.00 0.00000
25% 1.500000 4.50 0.75000
50% 4.000000 5.50 3.00000
75% 6.750000 6.75 5.25000
max 9.000000 9.00 6.00000
In [126]:
data.mean()
Out[126]:
col1    4.25
col2    5.75
col3    3.00
dtype: float64
In [127]:
data.sum()
Out[127]:
col1    17
col2    23
col3    12
dtype: int64
In [128]:
data.max()
Out[128]:
col1    9
col2    9
col3    6
dtype: int32
In [129]:
data.idxmax()
Out[129]:
col1    row4
col2    row2
col3    row3
dtype: object

Indexing: Selecting, Slicing and Modifying Data

Pandas has two indexing systems: position-based indexing and label-based indexing. When dealing with Series or DataFrame objects, you can use either positions (as you would when indexing a Python list) or labels for selecting, slicing, and modifying values.

Position-based Indexing System

[Diagram: position-based indexing system]

Label-based indexing System

[Diagram: label-based indexing system]

Two methods for indexing

  1. .iloc[]: position-based method for indexing. Allowed inputs are:
    1. An integer, e.g. 5.
    2. A list or array of integers, e.g. [4, 3, 0].
    3. A slice object with ints, e.g. 1:7.
    4. A boolean array.
    5. A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
  2. .loc[]: label-based method for indexing. Allowed inputs are:
    1. A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index).
    2. A list or array of labels, e.g. ['a', 'b', 'c'].
    3. A slice object with labels, e.g. 'a':'f'. Warning: contrary to usual Python slices, both the start and the stop bounds of a pandas label slice are included.
    4. A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
    5. A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above). A short sketch contrasting these input types follows this list.
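Here is a quick sketch contrasting the non-boolean inputs of the two methods (boolean masks are demonstrated in the cells below); the frame is made up for illustration.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  columns=['a', 'b', 'c'],
                  index=['w', 'x', 'y', 'z'])

df.iloc[1]                        # a single position: the second row, as a Series
df.iloc[[2, 0]]                   # a list of positions, in the given order
df.iloc[0:2, 0:2]                 # position slices: the end bound is EXCLUDED

df.loc['x']                       # a single label
df.loc[['y', 'w'], ['a', 'c']]    # lists of row and column labels
df.loc['w':'y', 'a':'b']          # label slices: BOTH bounds are included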

What is a boolean array? (Aka Boolean Mask array)

A boolean array is an array filled with True or False values. It is primarily used as a mask indicating whether the corresponding entry is selected or not.

In [7]:
#Example
data = pd.DataFrame(np.random.random((4,5)))
data
Out[7]:
0 1 2 3 4
0 0.375997 0.835685 0.940172 0.121947 0.906368
1 0.436972 0.303608 0.871348 0.708181 0.617845
2 0.160288 0.534491 0.341066 0.628389 0.236705
3 0.390603 0.284738 0.604126 0.764724 0.681014
In [8]:
mask_row = [True, False, False, False]
data.iloc[mask_row,:] # data.loc[mask_row,:] also works
Out[8]:
0 1 2 3 4
0 0.375997 0.835685 0.940172 0.121947 0.906368
In [10]:
mask_column = [True, True, False,False, False]
data.iloc[:, mask_column] # data.loc[:, mask_column] also works
Out[10]:
0 1
0 0.375997 0.835685
1 0.436972 0.303608
2 0.160288 0.534491
3 0.390603 0.284738
In [12]:
data.loc[:, mask_column]
Out[12]:
0 1
0 0.375997 0.835685
1 0.436972 0.303608
2 0.160288 0.534491
3 0.390603 0.284738
In [13]:
## A step further: select the rows whose first column > 0.4
data.loc[data.iloc[:,0]>0.4]
Out[13]:
0 1 2 3 4
1 0.436972 0.303608 0.871348 0.708181 0.617845
In [14]:
#Decompose the compound statement above.
#The same effect can be accomplished by the following code.
mask = data.iloc[:,0]>0.4
data.loc[mask]
Out[14]:
0 1 2 3 4
1 0.436972 0.303608 0.871348 0.708181 0.617845

Pre-processing Data

Before you start to analyze your data, you may find that you have to spend a large amount of time cleaning it up. Typical cleansing operations include:

1. dropping irrelevant columns/rows,
2. adding new columns/rows,
3. fixing missing data, and dropping columns/rows that contain too many missing entries,
4. transforming certain columns, e.g., adding 100 to column x.
  1. .drop(columns=, inplace=False): drop columns by name; pass a list of names to the parameter columns. If inplace is True, the change is applied to the original data; otherwise, a modified copy is returned.
  2. .dropna(axis='columns', how='any', thresh=None, subset=None, inplace=False): drop columns by missing values. thresh: the minimum number of non-NaN values required to keep a column. subset: labels along the other axis to consider.
  3. .insert(loc, column, value): add a new column. loc: the insertion index; must be between 0 and len(columns).
  4. .rename(columns={oldname: newname, ...}, inplace=False): rename columns.
  5. data.column_name.apply(func): apply a transformation function to a particular column.
In [17]:
#Example
data = pd.DataFrame(np.random.random((4,5)), 
                    columns=['col_'+str(x) for x in range(5)], 
                   index=['row_'+str(x) for x in range(4)])
data
Out[17]:
col_0 col_1 col_2 col_3 col_4
row_0 0.800205 0.601902 0.311152 0.837320 0.473667
row_1 0.016961 0.857147 0.777783 0.353794 0.792751
row_2 0.740049 0.344949 0.909057 0.786513 0.218409
row_3 0.951858 0.032113 0.102229 0.816484 0.616248
In [18]:
#drop col_0 column
data.drop(columns='col_0')
Out[18]:
col_1 col_2 col_3 col_4
row_0 0.601902 0.311152 0.837320 0.473667
row_1 0.857147 0.777783 0.353794 0.792751
row_2 0.344949 0.909057 0.786513 0.218409
row_3 0.032113 0.102229 0.816484 0.616248
In [21]:
data # why is col_0 still there?
Out[21]:
col_0 col_1 col_2 col_3 col_4
row_0 0.800205 0.601902 0.311152 0.837320 0.473667
row_1 0.016961 0.857147 0.777783 0.353794 0.792751
row_2 0.740049 0.344949 0.909057 0.786513 0.218409
row_3 0.951858 0.032113 0.102229 0.816484 0.616248
In [22]:
#Before we use the dropna method, let's add a NaN value to col_1 using .loc
data.loc['row_0','col_1'] = np.nan
data
Out[22]:
col_0 col_1 col_2 col_3 col_4
row_0 0.800205 NaN 0.311152 0.837320 0.473667
row_1 0.016961 0.857147 0.777783 0.353794 0.792751
row_2 0.740049 0.344949 0.909057 0.786513 0.218409
row_3 0.951858 0.032113 0.102229 0.816484 0.616248
In [23]:
data.dropna(axis='columns', how='any')
Out[23]:
col_0 col_2 col_3 col_4
row_0 0.800205 0.311152 0.837320 0.473667
row_1 0.016961 0.777783 0.353794 0.792751
row_2 0.740049 0.909057 0.786513 0.218409
row_3 0.951858 0.102229 0.816484 0.616248
In [24]:
data #Rule 1: Most operations generate a copy. 
Out[24]:
col_0 col_1 col_2 col_3 col_4
row_0 0.800205 NaN 0.311152 0.837320 0.473667
row_1 0.016961 0.857147 0.777783 0.353794 0.792751
row_2 0.740049 0.344949 0.909057 0.786513 0.218409
row_3 0.951858 0.032113 0.102229 0.816484 0.616248
In [26]:
data.dropna(axis='columns', how='any', inplace=True) #Rule 2: if the inplace parameter is present in a method, setting it to True makes the change in place.
In [27]:
#insert a new column
data.insert(value=np.random.random(4), column='col_new', loc=0)
data
Out[27]:
col_new col_0 col_2 col_3 col_4
row_0 0.322507 0.800205 0.311152 0.837320 0.473667
row_1 0.027035 0.016961 0.777783 0.353794 0.792751
row_2 0.092833 0.740049 0.909057 0.786513 0.218409
row_3 0.288416 0.951858 0.102229 0.816484 0.616248
In [109]:
#rename col_new to col_new2
In [29]:
data.rename(columns={'col_new':'col_new2'},inplace=True)
In [30]:
data
Out[30]:
col_new2 col_0 col_2 col_3 col_4
row_0 0.322507 0.800205 0.311152 0.837320 0.473667
row_1 0.027035 0.016961 0.777783 0.353794 0.792751
row_2 0.092833 0.740049 0.909057 0.786513 0.218409
row_3 0.288416 0.951858 0.102229 0.816484 0.616248
In [31]:
## Apply a function to each column to do transformations.
#Tip: you can use attribute-style access to reference a column, e.g., data.A
#E.g., add 10 to each value of col_new2
data.col_new2.apply(lambda x: x+10)
Out[31]:
row_0    10.322507
row_1    10.027035
row_2    10.092833
row_3    10.288416
Name: col_new2, dtype: float64
In [33]:
# Wait, what is that? How come I get a Series object? I was expecting an updated dataframe.
# Remember Rule 1? Most operations generate a copy.
data.col_new2 = data.col_new2.apply(lambda x: x+10) #rewrite the column col_new2
data
Out[33]:
col_new2 col_0 col_2 col_3 col_4
row_0 10.322507 0.800205 0.311152 0.837320 0.473667
row_1 10.027035 0.016961 0.777783 0.353794 0.792751
row_2 10.092833 0.740049 0.909057 0.786513 0.218409
row_3 10.288416 0.951858 0.102229 0.816484 0.616248
  1. .drop(index=, inplace=False): drop rows by label; pass a list of labels to the parameter index. If inplace is True, the change is applied to the original data; otherwise, a modified copy is returned.
  2. .dropna(axis='index', how='any', thresh=None, subset=None, inplace=False): drop rows by missing values. thresh: the minimum number of non-NaN values required to keep a row. subset: labels along the other axis to consider.
  3. .append(other, ignore_index=False): add another DataFrame to the end of the current one. If ignore_index=True, the index of the output is reset. (Note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement.)
  4. .rename(index={oldname: newname, ...}, inplace=False): rename row labels.
  5. .apply(func, axis=1): apply a transformation function to each row, for instance to get the sum of each row.
In [34]:
data = pd.DataFrame(np.random.rand(4,5), 
                    columns=['col_'+str(x) for x in range(5)], 
                   index=['row_'+str(x) for x in range(4)])
data
Out[34]:
col_0 col_1 col_2 col_3 col_4
row_0 0.688586 0.447904 0.676060 0.270816 0.111429
row_1 0.239293 0.055340 0.114697 0.491123 0.868270
row_2 0.434565 0.725536 0.049033 0.386461 0.684407
row_3 0.457990 0.338172 0.031036 0.002844 0.502192
In [68]:
#drop the first row
In [67]:
data.drop(index='row_0')
Out[67]:
col_0 col_1 col_2 col_3 col_4
row_1 0.892339 0.109345 0.146900 0.133874 0.259947
row_2 0.769893 0.700546 0.367378 0.040472 0.939881
row_3 0.239945 0.688736 0.923779 0.880585 0.342324
In [ ]:
data
In [71]:
#add some missing values to row_2
data.loc['row_2',['col_0','col_1','col_2']]=np.nan
data
Out[71]:
col_0 col_1 col_2 col_3 col_4
row_0 0.318474 0.177934 0.641396 0.718538 0.687692
row_1 0.892339 0.109345 0.146900 0.133874 0.259947
row_2 NaN NaN NaN 0.040472 0.939881
row_3 0.239945 0.688736 0.923779 0.880585 0.342324
In [74]:
#Drop the rows in which the number of non-NaN values is less than the threshold
data.dropna(axis=0, thresh=3)
Out[74]:
col_0 col_1 col_2 col_3 col_4
row_0 0.318474 0.177934 0.641396 0.718538 0.687692
row_1 0.892339 0.109345 0.146900 0.133874 0.259947
row_3 0.239945 0.688736 0.923779 0.880585 0.342324
In [75]:
data.rename(index={'row_0':'row_new'})
Out[75]:
col_0 col_1 col_2 col_3 col_4
row_new 0.318474 0.177934 0.641396 0.718538 0.687692
row_1 0.892339 0.109345 0.146900 0.133874 0.259947
row_2 NaN NaN NaN 0.040472 0.939881
row_3 0.239945 0.688736 0.923779 0.880585 0.342324
In [77]:
#try append function
In [78]:
data2 = pd.DataFrame(np.random.rand(4,5), 
                    columns=['col_'+str(x) for x in range(5)], 
                   index=['row_'+str(x) for x in range(4)])
data2
Out[78]:
col_0 col_1 col_2 col_3 col_4
row_0 0.413115 0.251739 0.378637 0.954995 0.986009
row_1 0.200280 0.069230 0.264094 0.756599 0.752411
row_2 0.880196 0.166490 0.081792 0.817598 0.975656
row_3 0.848336 0.406114 0.525199 0.840423 0.331894
In [79]:
data.append(data2, ignore_index=False)
Out[79]:
col_0 col_1 col_2 col_3 col_4
row_0 0.318474 0.177934 0.641396 0.718538 0.687692
row_1 0.892339 0.109345 0.146900 0.133874 0.259947
row_2 NaN NaN NaN 0.040472 0.939881
row_3 0.239945 0.688736 0.923779 0.880585 0.342324
row_0 0.413115 0.251739 0.378637 0.954995 0.986009
row_1 0.200280 0.069230 0.264094 0.756599 0.752411
row_2 0.880196 0.166490 0.081792 0.817598 0.975656
row_3 0.848336 0.406114 0.525199 0.840423 0.331894
In [80]:
data.append(data2, ignore_index=True)
Out[80]:
col_0 col_1 col_2 col_3 col_4
0 0.318474 0.177934 0.641396 0.718538 0.687692
1 0.892339 0.109345 0.146900 0.133874 0.259947
2 NaN NaN NaN 0.040472 0.939881
3 0.239945 0.688736 0.923779 0.880585 0.342324
4 0.413115 0.251739 0.378637 0.954995 0.986009
5 0.200280 0.069230 0.264094 0.756599 0.752411
6 0.880196 0.166490 0.081792 0.817598 0.975656
7 0.848336 0.406114 0.525199 0.840423 0.331894
In [82]:
#Apply a transformation to the values of each row
data.apply(func=lambda x: x+[0, 1, 2, 3, 4], axis=1)
Out[82]:
col_0 col_1 col_2 col_3 col_4
row_0 0.318474 1.177934 2.641396 3.718538 4.687692
row_1 0.892339 1.109345 2.146900 3.133874 4.259947
row_2 NaN NaN NaN 3.040472 4.939881
row_3 0.239945 1.688736 2.923779 3.880585 4.342324
  1. .applymap(func): apply a function to each element of the dataframe
In [83]:
#Example: apply a function element-wise (here, double each element)
data = pd.DataFrame(np.random.randint(0,100, (4,3)), 
                    columns=['col_'+str(x) for x in range(3)], 
                   index=['row_'+str(x) for x in range(4)])
data
Out[83]:
col_0 col_1 col_2
row_0 91 22 34
row_1 53 39 49
row_2 35 24 89
row_3 9 7 84
In [85]:
data.applymap(lambda x: x*2)
Out[85]:
col_0 col_1 col_2
row_0 182 44 68
row_1 106 78 98
row_2 70 48 178
row_3 18 14 168

Missing values

  1. .isnull(): element-wise operation that detects whether a value is NaN. Tip: np.nan is the floating-point value NumPy uses to represent missing data.
  2. .notnull(): element-wise operation; the opposite of isnull().
  3. .fillna(value=None, method=None, axis=None, inplace=False): fill NaN values using the specified value(s) or method.

Count how many missing values are present in each column

In [35]:
data = pd.DataFrame(np.random.randint(0,100, (4,3)), 
                    columns= ['col_'+str(x) for x in range(3)],
                   index = ['row_'+str(x) for x in range(4)])
data.iloc[[2,3],[0,1]] = np.nan
data
Out[35]:
col_0 col_1 col_2
row_0 84.0 58.0 85
row_1 29.0 50.0 78
row_2 NaN NaN 56
row_3 NaN NaN 6
In [131]:
data.isnull()
Out[131]:
col_0 col_1 col_2
row_0 False False False
row_1 False False False
row_2 True True False
row_3 True True False
In [135]:
data.notnull()
Out[135]:
col_0 col_1 col_2
row_0 True True True
row_1 True True True
row_2 False False True
row_3 False False True
In [136]:
#Count how many missing values are present in each column
data.isnull().sum()
Out[136]:
col_0    2
col_1    2
col_2    0
dtype: int64
In [137]:
#Count the number of non-missing values in each column
data.notnull().sum()
Out[137]:
col_0    2
col_1    2
col_2    4
dtype: int64

Fill missing values with 0s

In [115]:
data = pd.DataFrame(np.random.randint(0,100, (4,3)), 
                    columns=['col_'+str(x) for x in range(3)], 
                   index=['row_'+str(x) for x in range(4)])
#add some missing values
data.iloc[[0,1],[0,1]] = np.nan
data
Out[115]:
col_0 col_1 col_2
row_0 NaN NaN 72
row_1 NaN NaN 78
row_2 85.0 56.0 96
row_3 0.0 35.0 56
In [116]:
data.fillna(value=0)
Out[116]:
col_0 col_1 col_2
row_0 0.0 0.0 72
row_1 0.0 0.0 78
row_2 85.0 56.0 96
row_3 0.0 35.0 56

Fill missing values with column means

In [118]:
data.fillna(value=data.mean(), axis=0)
Out[118]:
col_0 col_1 col_2
row_0 42.5 45.5 72
row_1 42.5 45.5 78
row_2 85.0 56.0 96
row_3 0.0 35.0 56

Fill missing values with the backward-fill method

In [121]:
data.fillna(axis=0, method='backfill')
Out[121]:
col_0 col_1 col_2
row_0 85.0 56.0 72
row_1 85.0 56.0 78
row_2 85.0 56.0 96
row_3 0.0 35.0 56

Merging/Joining Data

  1. .merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False): merge two DataFrame objects on the specified column(s) or index.
  2. .join(other, on=None, lsuffix='', rsuffix=''): similar to merge, except that join only uses the index of the other DataFrame object.
  3. Merge vs. Join:
    1. merge gives you more customization power than join; you can tell this from its larger number of parameters.
    2. join only uses the index of the other DataFrame objects, whereas merge can use either an index or a column.
    3. The biggest advantage of join is that you can join many DataFrames at once, as long as you are aligning the data by index. A sketch of such a multi-frame join follows this list.
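As a sketch of point 3, join also accepts a list of DataFrames, provided their column names do not overlap (lsuffix/rsuffix are not supported for lists). The frames here are made up for illustration.

import pandas as pd
import numpy as np

idx = ['row_0', 'row_1', 'row_2']
a = pd.DataFrame({'col_a': np.random.randint(0, 10, 3)}, index=idx)
b = pd.DataFrame({'col_b': np.random.randint(0, 10, 3)}, index=idx)
c = pd.DataFrame({'col_c': np.random.randint(0, 10, 3)}, index=idx)

a.join([b, c])   # aligns all three frames on their shared index at once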
In [36]:
data1 = pd.DataFrame(np.random.randint(0,10, (4,3)),
                     columns= ['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
data2 = pd.DataFrame(np.random.randint(0,10, (4,3)),
                     columns= ['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
In [42]:
data1
Out[42]:
col_0 col_1 col_2
row_0 4 5 6
row_1 7 1 1
row_2 6 0 5
row_3 6 7 1
In [43]:
data2
Out[43]:
col_0 col_1 col_2
row_0 2 0 4
row_1 4 3 8
row_2 4 8 3
row_3 9 8 4

"Join" two dataframes using index

In [154]:
data1.join(data2, rsuffix='_r')
Out[154]:
col_0 col_1 col_2 col_0_r col_1_r col_2_r
row_0 6 0 3 1 7 9
row_1 7 1 6 9 2 1
row_2 6 6 1 2 8 3
row_3 6 4 3 8 5 7

"Merge" two dataframes using index

In [39]:
#Merge two dfs on the index; this gives the same result as the join above
data1.merge(data2, left_index=True, right_index=True)
Out[39]:
col_0_x col_1_x col_2_x col_0_y col_1_y col_2_y
row_0 4 5 6 2 0 4
row_1 7 1 1 4 3 8
row_2 6 0 5 4 8 3
row_3 6 7 1 9 8 4

Merge two dataframes on a common column

In [41]:
#Merge two dfs on a specified common column. The result below is empty because data1 and data2 share no values in col_2.
data1.merge(data2, on='col_2')
Out[41]:
col_0_x col_1_x col_2 col_0_y col_1_y

Iteration of DataFrame

  1. Iterate over rows:
    1. iterrows(): return (index, Series) pairs
    2. itertuples(): return named tuples, e.g., Pandas(Index=0, col_1=2, ...). (Recommended, as this method preserves the data types of the values.)
  2. Iterate over columns:
    1. iteritems(): return (column name, Series) pairs
In [159]:
data = pd.DataFrame(np.random.randint(0,10, (4,3)),
                     columns= ['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
In [160]:
data
Out[160]:
col_0 col_1 col_2
row_0 0 5 1
row_1 5 5 4
row_2 4 4 8
row_3 3 9 9

Iteration over rows

In [162]:
##Example: iterate over rows
for row in data.itertuples():
    print(row.Index, row.col_2)
row_0 1
row_1 4
row_2 8
row_3 9
In [165]:
#Example iterate over rows using iterrows
for i, row in data.iterrows():
    print(row.col_2)
1
4
8
9

Iteration over columns

In [167]:
#example iterate over columns
for col in data.iteritems():
    print(col) #col is a series object
('col_0', row_0    0
row_1    5
row_2    4
row_3    3
Name: col_0, dtype: int32)
('col_1', row_0    5
row_1    5
row_2    4
row_3    9
Name: col_1, dtype: int32)
('col_2', row_0    1
row_1    4
row_2    8
row_3    9
Name: col_2, dtype: int32)

GroupBy Operations

.groupby(by=None). The parameter by determines how the groups are formed; the canonical example is passing a column name.
A groupby operation is a combination of splitting the object into groups, applying a function to each group, and combining the results.

  1. The splitting step breaks up the dataframe into groups depending on the values of the specified key.
  2. The applying step performs some computation (e.g., a transformation, or an aggregation such as sum or std) within each group.
  3. The combining step merges the results of all the groups into an output dataframe. A sketch that makes these three steps explicit follows this list.
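To make the three steps concrete, here is a sketch that performs them by hand on a tiny made-up frame; the one-liner at the end is the idiomatic equivalent.

import pandas as pd

df = pd.DataFrame({'state': ['TX', 'RI', 'TX'],
                   'height': [4.0, 5.0, 6.2]})

groups = df.groupby('state')                                  # 1. split by key
applied = {key: grp['height'].mean() for key, grp in groups}  # 2. apply per group
result = pd.Series(applied)                                   # 3. combine results

# idiomatic equivalent: df.groupby('state')['height'].mean()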
In [58]:
#Example
data = pd.DataFrame({'name':['p1','p2','p3','p4','p5'],
    'state':['TX',"RI","TX","CA","MA"],
                     'income(K)':np.random.randint(20,70,5), 
                    'height': [4, 5, 6.2, 5.2, 5.1]})
data
Out[58]:
name state income(K) height
0 p1 TX 64 4.0
1 p2 RI 31 5.0
2 p3 TX 42 6.2
3 p4 CA 56 5.2
4 p5 MA 30 5.1
In [60]:
#get the mean income of each state
data.groupby('state').mean() # Question: where is the name column? The mean operation is not compatible with string columns, so pandas drops them.
Out[60]:
income(K) height
state
CA 56 5.2
MA 30 5.1
RI 31 5.0
TX 53 5.1
In [61]:
data.groupby('state').sum()
Out[61]:
income(K) height
state
CA 56 5.2
MA 30 5.1
RI 31 5.0
TX 106 10.2
In [62]:
#What if we want the max of the income column and the min of the height column?
In [66]:
data.groupby('state').aggregate({'income(K)':max, 'height':min })
Out[66]:
income(K) height
state
CA 56 5.2
MA 30 5.1
RI 31 5.0
TX 64 4.0

Visualization

DataFrame.plot(x=None, y=None, kind='line'). The three most important parameters:

  1. x : label or position, default None
  2. y : label, position or list of label, positions, default None
  3. kind : str

    1. ‘line’ : line plot (default)
    2. ‘bar’ : vertical bar plot
    3. ‘barh’ : horizontal bar plot
    4. ‘hist’ : histogram
    5. ‘box’ : boxplot
    6. ‘kde’ : Kernel Density Estimation plot
    7. ‘density’ : same as ‘kde’
    8. ‘area’ : area plot
    9. ‘pie’ : pie plot
    10. ‘scatter’ : scatter plot
    11. ‘hexbin’ : hexbin plot

The plotting power of pandas is limited compared to dedicated Python visualization packages, e.g., Matplotlib and seaborn. Oftentimes, we use the pandas plot function only to explore data; a short scatter-plot sketch follows.
Further reading: Overview of Python Visualization Tools
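Since the examples below only pass y, here is a sketch of a plot kind that needs both x and y; the columns are made up for illustration.

import pandas as pd
import numpy as np

df = pd.DataFrame({'x': np.random.random(50),
                   'y': np.random.random(50)})
df.plot(x='x', y='y', kind='scatter')   # x and y must both name columns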

In [176]:
data = pd.DataFrame({'count':np.random.randint(0,1000,500)})
data.dtypes
Out[176]:
count    int32
dtype: object
In [185]:
data.plot(y='count', kind='hist')
Out[185]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c08e9b3080>
In [184]:
data.plot(y='count',kind='box')
Out[184]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c08e75e208>

Gotchas

Always use .loc[] and .iloc[] for indexing/slicing, to avoid confusion

Situation 1: integers as row labels

In [68]:
#example 1: A Series with integers as index labels. 
data = pd.Series(np.random.random(5),index=[2,3,4,8,6])
data
Out[68]:
2    0.015593
3    0.296934
4    0.501529
8    0.766180
6    0.608410
dtype: float64
In [73]:
#What will happen? Are we selecting the first item, or the item with label 0?
data[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-73-a9429a10ca83> in <module>
      1 #What will happen? we are selecting the first item or the item with a label=0?
----> 2 data[0]

C:\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
   1069         key = com.apply_if_callable(key, self)
   1070         try:
-> 1071             result = self.index.get_value(self, key)
   1072 
   1073             if not is_scalar(result):

C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
   4728         k = self._convert_scalar_indexer(k, kind="getitem")
   4729         try:
-> 4730             return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
   4731         except KeyError as e1:
   4732             if len(self) > 0 and (self.holds_integer() or self.is_boolean()):

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 0
In [72]:
data.loc[2]
Out[72]:
0.01559272108993981
In [71]:
data.iloc[0]
Out[71]:
0.01559272108993981

Situation 2: Dataframe

In [77]:
#example 2: dataframe
data = pd.DataFrame(np.random.random((3,4)),columns=['a','b','c','d'], index=['a','b','c'])
data
Out[77]:
a b c d
a 0.134431 0.621955 0.359193 0.875364
b 0.965817 0.081602 0.933112 0.511970
c 0.224720 0.501750 0.449910 0.559874
In [78]:
data['a'] #returns a column
Out[78]:
a    0.134431
b    0.965817
c    0.224720
Name: a, dtype: float64
In [81]:
data[0]  # returns what? the first column or the first row?
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-81-4428dd38234c> in <module>
----> 1 data[0]# returns what?? the first column or the first row?

C:\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2993             if self.columns.nlevels > 1:
   2994                 return self._getitem_multilevel(key)
-> 2995             indexer = self.columns.get_loc(key)
   2996             if is_integer(indexer):
   2997                 indexer = [indexer]

C:\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0
In [82]:
data[0:1]  # returns what? the first column or the first row?
Out[82]:
a b c d
a 0.134431 0.621955 0.359193 0.875364

With the plain [] operator, a single key is interpreted as a column label, while a slice selects rows; this inconsistency is exactly why .loc[] and .iloc[] are recommended.

Label-based slicing is inclusive

The end bound of a slice in .loc[] is included, which differs from the default slicing behavior of Python.

In [206]:
data = pd.Series(np.random.randint(0,100,10),index=range(0,10))
data
Out[206]:
0    59
1    99
2    12
3    41
4    21
5    21
6    92
7    64
8    31
9    97
dtype: int32
In [212]:
#use iloc 
data.iloc[:2]
Out[212]:
0    59
1    99
dtype: int32
In [213]:
data.loc[:2] # the end point of this slice is included.
Out[213]:
0    59
1    99
2    12
dtype: int32

Copy Vs. View

  1. Most operations return a copy.
  2. If inplace=True is set in the calling method, the operation modifies the data in place; only some operations support this parameter.
  3. Assignment through .iloc[] and .loc[] always modifies the data in place.
  4. Never use chained indexing when modifying data; the modification is not guaranteed to work.
In [107]:
#example of chained indexing
data = pd.DataFrame({'name':['p1','p2','p3','p4','p5'],
    'state':['TX',"RI","TX","CA","MA"],
                     'income(K)':np.random.randint(20,70,5), 
                    'height': [4, 5, 6.2, 5.2, 5.1]})
In [108]:
data
Out[108]:
name state income(K) height
0 p1 TX 44 4.0
1 p2 RI 46 5.0
2 p3 TX 21 6.2
3 p4 CA 49 5.2
4 p5 MA 62 5.1
In [109]:
#We try to modify cell [0,0] to 'mike' via chained indexing.
data.loc[0].loc['name'] = 'mike'
C:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
In [110]:
data
Out[110]:
name state income(K) height
0 p1 TX 44 4.0
1 p2 RI 46 5.0
2 p3 TX 21 6.2
3 p4 CA 49 5.2
4 p5 MA 62 5.1

The reason is that the first part of the chained indexing returns a copy, and the second part modifies that returned copy (an intermediate variable) rather than the original data.
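The fix is to perform the selection and the assignment in a single indexing operation:

# One .loc[] call with both the row and column indexers writes directly
# into the original DataFrame:
data.loc[0, 'name'] = 'mike'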

Comprehensive Project

Link to the project

Comprehensive Project Solution

link to the project solution
