Pandas can read and write a wide range of data file formats. Below is a table of commonly used formats and their reader & writer functions.

Format    Reader        Writer
CSV       read_csv      to_csv
JSON      read_json     to_json
Excel     read_excel    to_excel
HDF5      read_hdf      to_hdf
SQL       read_sql      to_sql
read_csv alone has more than 50 parameters!? Here are some important ones you might need to pay attention to: sep, header, index_col, usecols, dtype, na_values, and nrows.
import pandas as pd
import numpy as np
data = pd.read_csv(filepath_or_buffer='data/titanic.csv') #change the file path accordingly, or use a URL
data.head()
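To illustrate some of those parameters, here is a minimal sketch; the extra arguments are illustrative choices, not requirements of the Titanic file:
#a sketch of read_csv with a few commonly tuned parameters
data = pd.read_csv('data/titanic.csv',
                   sep=',',          #field delimiter (',' is the default)
                   header=0,         #row number to use as the column names
                   index_col=0,      #use the first column as the row index
                   na_values=['?'],  #extra strings to treat as missing
                   nrows=100)        #read only the first 100 rows
data.head()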
The first thing to do after your data is loaded into the notebook is to get to know its basic information (e.g., the dimensions, size, shape, and data types of all the columns) and roughly what the data looks like. All of this basic information can be retrieved from the DataFrame object's attributes.
Hint: what are attributes of an object? Attributes are the data/properties stored on an object when you instantiate it; unlike methods, you access them without parentheses.
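For instance, a minimal illustration of the distinction:
data.shape #attribute: accessed without parentheses
data.head() #method: called with parentheses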
#Example: here I am using some NumPy functions to manually create a 2-d array and then transform it into a DataFrame instance.
#You could skip over this step and simply use the Titanic data.
data = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                    columns=['col1', 'col2', 'col3'],
                    index=['row1', 'row2', 'row3', 'row4'])
data.shape #(number of rows, number of columns)
data.ndim #number of dimensions
data.dtypes #data type of each column
data.index #row labels
data.columns #column labels
#Example
data = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                    columns=['col1', 'col2', 'col3'],
                    index=['row1', 'row2', 'row3', 'row4'])
data.head(2) #first two rows
data.tail(1) #last row
data.describe() #summary statistics for each numeric column
data.mean() #column means
data.sum() #column sums
data.max() #column maxima
data.idxmax() #index label of each column's maximum
Pandas has two indexing systems: position-based indexing and label-based indexing. When working with Series or DataFrame objects, you can use either positions (as you would when indexing a Python list) or labels for selecting, slicing, and modifying values.
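Here is a minimal sketch of the contrast (the values and labels are made up for illustration):
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.iloc[0] #position-based: the first element -> 10
s.loc['a'] #label-based: the element labeled 'a' -> 10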
A boolean array is an array filled with True/False values. It is primarily used as a mask indicating whether each corresponding entry is selected or not.
#Example
data = pd.DataFrame(np.random.random((4,5)))
data
mask_row = [True, False, False, False]
data.iloc[mask_row,:] # data.loc[mask_row,:] also works
mask_column = [True, True, False,False, False]
data.iloc[:, mask_column] # data.loc[:, mask_column] also works
data.loc[:, mask_column]
##A step further: we want to select the rows whose first column is greater than 0.4
data.loc[data.iloc[:,0]>0.4]
#decompose the compound statement above
#the same effect can be accomplished by the following code
mask = data.iloc[:,0]>0.4
data.loc[mask]
Before you start to analyze your data, you might realize that you have to spend a large amount of time cleaning it up. Possible cleansing operations include:
1. dropping irrelevant columns/rows,
2. adding new columns/rows,
3. filling in missing values, or dropping columns/rows that contain too many missing entries,
4. transforming certain columns, e.g., adding 100 to column x.
#Example
data = pd.DataFrame(np.random.random((4, 5)),
                    columns=['col_'+str(x) for x in range(5)],
                    index=['row_'+str(x) for x in range(4)])
data
#drop col_0 column
data.drop(columns='col_0')
data # why is col_0 still in there?
#before we use the dropna method, let's set a value in col_1 to NaN using loc
data.loc['row_0','col_1'] = np.nan
data
data.dropna(axis='columns', how='any')
data #Rule 1: Most operations generate a copy.
data.dropna(axis='columns', how='any', inplace=True) #Rule 2: if the inplace parameter is present in a method, it lets you make an in-place change.
#insert a new column
data.insert(loc=0, column='col_new', value=np.random.random(4))
data
#rename col_new to col_new2
data.rename(columns={'col_new':'col_new2'},inplace=True)
data
## Apply a function to each column to do transformations.
#Tip: you can use attribute-style access to reference a column, e.g., data.A for a column named A
#e.g., add 10 to each value of col_new2
data.col_new2.apply(lambda x: x+10)
# wait!!!! What is that? How come I get a Series object? I was expecting an updated DataFrame.
# Remember Rule 1? Most operations generate a copy.
data.col_new2 = data.col_new2.apply(lambda x: x+10) #rewrite the column col_new2
data
data = pd.DataFrame(np.random.rand(4, 5),
                    columns=['col_'+str(x) for x in range(5)],
                    index=['row_'+str(x) for x in range(4)])
data
#drop the first row
data.drop(index='row_0')
data
#add some missing values to row_2
data.loc['row_2',['col_0','col_1','col_2']]=np.nan
data
#drop the rows in which the number of non-NaN values is less than the threshold
data.dropna(axis=0, thresh=3)
data.rename(index={'row_0':'row_new'})
#try appending one DataFrame to another (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
data2 = pd.DataFrame(np.random.rand(4, 5),
                     columns=['col_'+str(x) for x in range(5)],
                     index=['row_'+str(x) for x in range(4)])
data2
pd.concat([data, data2], ignore_index=False) #keep the original index labels
pd.concat([data, data2], ignore_index=True) #renumber the resulting rows 0..n-1
#apply a transformation to each value of each row
data.apply(func=lambda x: x+[0, 1, 2, 3, 4], axis=1)
#apply a function to every element of the DataFrame
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data
data.applymap(lambda x: x*2) #double each element (in pandas 2.1+, DataFrame.map is the preferred name)
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data.iloc[[2,3],[0,1]] = np.nan
data
data.isnull()
data.notnull()
#Count how many missing values are present in each column
data.isnull().sum()
#Count the number of non-missing values in each column
data.notnull().sum()
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
#add some missing values
data.iloc[[0,1],[0,1]] = np.nan
data
data.fillna(value=0) #replace every NaN with 0
data.fillna(value=data.mean()) #fill each column's NaNs with that column's mean
data.fillna(axis=0, method='backfill') #fill each NaN with the next valid value below it (data.bfill() in newer pandas)
.join(other, on=None, lsuffix='', rsuffix=''): similar to the merge function, except that the join method always uses the index of the other DataFrame object(s).
Merge vs. join:
data1 = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                     columns=['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
data2 = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                     columns=['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
data1
data2
data1.join(data2, rsuffix='_r')
#merge two dfs on the index, same as the join
data1.merge(data2, left_index=True, right_index=True)
#merge two dfs on a specified common column
data1.merge(data2, on='col_2')
data = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data
##Example: iterate over rows
for row in data.itertuples():
    print(row.Index, row.col_2)
#Example: iterate over rows using iterrows
for i, row in data.iterrows():
    print(row.col_2)
#Example: iterate over columns (iteritems was removed in pandas 2.0; use items)
for name, col in data.items():
    print(col) #col is a Series object
.groupby(by=None): the by parameter determines how to form the groups. The canonical example is by = a column name.
A groupby operation is a combination of splitting the object into groups, applying a function to each group, and combining the results.
#Example
data = pd.DataFrame({'name': ['p1', 'p2', 'p3', 'p4', 'p5'],
                     'state': ['TX', 'RI', 'TX', 'CA', 'MA'],
                     'income(K)': np.random.randint(20, 70, 5),
                     'height': [4, 5, 6.2, 5.2, 5.1]})
data
#get the mean income of each state
data.groupby('state').mean(numeric_only=True) # Question: where is name? The mean operation is not compatible with a string column, so it is dropped (newer pandas requires numeric_only=True here).
data.groupby('state').sum()
#what if we want the max of column income(K) and the min of column height
data.groupby('state').aggregate({'income(K)': 'max', 'height': 'min'})
DataFrame.plot(x=None, y=None, kind='line'). The 3 most important parameters:
x : label or position of the column to use for the x-axis
y : label or position of the column(s) to plot
kind : str, the type of plot, e.g., 'line', 'bar', 'barh', 'hist', 'box', 'kde', 'area', 'pie', 'scatter', or 'hexbin'
The plotting power of Pandas is limited compared to dedicated Python visualization packages, e.g., Matplotlib and seaborn. Oftentimes, we only use the pandas .plot function to explore data.
further reading: Overview of Python Visualization Tools
data = pd.DataFrame({'count':np.random.randint(0,1000,500)})
data.dtypes
data.plot(y='count', kind='hist')
data.plot(y='count',kind='box')
#example 1: A Series with integers as index labels.
data = pd.Series(np.random.random(5),index=[2,3,4,8,6])
data
#What will happen? Are we selecting the first item, or the item with label 0?
data[0] #raises a KeyError: with an integer index, [] is label-based and there is no label 0
data.loc[2] #label-based: the item labeled 2
data.iloc[0] #position-based: the first item
#example 2: dataframe
data = pd.DataFrame(np.random.random((3,4)),columns=['a','b','c','d'], index=['a','b','c'])
data
data['a'] #returns the column labeled 'a'
data[0] #raises a KeyError: a single value in [] looks up column labels, and there is no column 0
data[0:1] #returns the first row: a slice in [] operates on rows
The end bound of a slice in .loc[] is included, which is different from the default slicing behavior of Python.
data = pd.Series(np.random.randint(0,100,10),index=range(0,10))
data
#use iloc
data.iloc[:2]
data.loc[:2] # the end point of this slice is included.
#example of chained indexing
data = pd.DataFrame({'name': ['p1', 'p2', 'p3', 'p4', 'p5'],
                     'state': ['TX', 'RI', 'TX', 'CA', 'MA'],
                     'income(K)': np.random.randint(20, 70, 5),
                     'height': [4, 5, 6.2, 5.2, 5.1]})
data
#we try to modify the cell at row 0, column 'name' to 'mike'
data.loc[0].loc['name'] = 'mike'
data
The reason is that the first part of the chained indexing returns a copy, and the second part modifies that returned copy (an intermediate object) rather than the original data.
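The fix is to do the row and column selection in a single .loc call, so the assignment operates directly on the original DataFrame:
#select the row and column in one .loc call; the assignment now hits the original object
data.loc[0, 'name'] = 'mike'
data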