First things first: Pandas is a Python package and one of the most important tools for data manipulation and analysis (Python Developers Survey 2020 Results).
The name 'Pandas' is derived from 'panel data'.
Glossary
panel data: In statistics and econometrics, panel data (or longitudinal data) are multi-dimensional data involving measurements over time.
In terms of release dates, NumPy preceded Pandas. NumPy is an open-source Python library for scientific computing that provides high-performance operations on arrays and matrices (the ndarray object). Thanks to vectorized operations, ndarrays are stored and processed far more efficiently than Python's built-in list objects.
The Pandas library is built on top of NumPy as a higher-level wrapper, for the convenience of data science users. Pandas abstracts away the mathematical terms (e.g., vectors, matrices) and replaces them with familiar concepts such as columns and spreadsheets. This way everybody is HAPPY!
a_list = [2, 3, 4, 't']
#now we want to apply x+10 to all elements of this list
result = [x + 10 for x in a_list] # raises a TypeError: 't' is a str, not a number
result
A TypeError was triggered. During each iteration, the Python interpreter checks whether the intended operation works with the element at hand. If you have a list of 1 million elements, Python has to run that compatibility check 1 million times, which is a huge waste of time and computing resources. NumPy's answer to this problem is a new data type: it requires every single element in the array to be of the same data type. This saves a ton of computing time and resources. In addition, it opens the door to vectorized operations, which allow more data to be processed at once.
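To see the payoff, here is a minimal sketch of the same operation on a NumPy array (the import is repeated so the snippet is self-contained):
import numpy as np

arr = np.array([2, 3, 4]) # every element shares one dtype, so no per-element checks
arr + 10                  # vectorized: returns array([12, 13, 14])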
NumPy is only concerned with high-performance numeric computation (e.g., matrix transpose, matrix multiplication, and so on), whereas Pandas is designed for data scientists who care about data manipulation: missing data, queries, splitting, and so on. For instance, many SQL-style operations (joins, group-bys, filters) that some of you might be familiar with have direct counterparts in Pandas.
Also, as mentioned earlier, Pandas acts as a higher-level wrapper around the NumPy core, which makes it a lot easier to learn.
import pandas as pd
import numpy as np #Some of the demos below will use functions from numpy.
pd.__version__
Pandas has two main data structures: Series (one-dimensional, labeled) and DataFrame (two-dimensional, tabular).
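A minimal sketch of each (the names and values here are made up for illustration):
s = pd.Series([90, 85, 77], name='Score')      # Series: a 1-D labeled array
df = pd.DataFrame({'Name': ['Amy', 'Bob', 'Cai'],
                   'Score': [90, 85, 77]})     # DataFrame: a 2-D labeled table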
Pandas can handle virtually any data file format. Below are some commonly supported formats and their reader & writer functions:
CSV: read_csv / to_csv
JSON: read_json / to_json
Excel: read_excel / to_excel
SQL: read_sql / to_sql
HDF5: read_hdf / to_hdf
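As a quick sketch of how a reader and writer pair up (the local filename here is hypothetical):
df.to_csv('scores.csv', index=False) # writer: saves the DataFrame built above to a hypothetical local file
df_back = pd.read_csv('scores.csv') # reader: loads the same data back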
The read_csv function alone has more than 50 parameters!? Here are some important ones to pay attention to: sep, header, index_col, usecols, dtype, na_values, and nrows.
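A sketch using a few of these parameters on the training CSV used below (the parameter choices are just for illustration):
pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv",
            sep=",",                         # field delimiter (the default)
            usecols=["ID", "Name", "Score"], # load only these columns
            index_col="ID",                  # use the ID column as the row labels
            nrows=2)                         # read only the first 2 data rows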
Read a CSV file from the web via a URL: https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv
import pandas as pd
data = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
data.head()
The first thing to do after your data is loaded into the notebook is to get to know its basic information (e.g., the dimensions, size, shape, and data types of all the columns) and what the data roughly looks like. All of this basic information can be retrieved from the DataFrame object's attributes.
Hint: What are the attributes of an object? Attributes are properties stored on an object when it is instantiated; unlike methods, you access them without parentheses.
data.shape # 4 rows * 4 columns
data.ndim # the dimension of your data. If you are working with an HDF5 file, you might have more than 2 dimensions.
data.dtypes
#object is the dtype Pandas uses for strings and mixed-type columns.
data.index
data.columns
data.head(2)
data.tail(1)
data.describe()
data.mean(numeric_only=True) # recent pandas versions require explicitly excluding non-numeric columns
data.sum(numeric_only=True)
data.max()
data.idxmax(numeric_only=True) # index label of each numeric column's maximum
Pandas uses two indexing systems at the same time: position-based indexing (starting from 0) and label-based indexing. When dealing with Series or DataFrame objects, you can use either positions (as when indexing a Python list) or labels for selecting, slicing, and modifying values.
df = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
df.index = df.ID
df
A. passing in a single value
df.iloc[0] # returns row 0 using position-based indexing
df.loc['U001'] # returns row 'U001' using label-based indexing
df.iloc[0, 0] # refers to the cell at position (0, 0)
df.loc['U001', 'Name'] # refers to the cell at row 'U001', column 'Name'
B. passing in a list of values
df.iloc[[0, 3]] # select row 0 and row 3 using the position-based indexing system
df.loc[['U001', 'U004']]
df.loc[['U001', 'U004'], ["Name", "Grade"]]
C. using a slice
df.iloc[:, 0] # returns the first column
df.iloc[0, :] # returns the first row
A boolean array is an array filled with True/False values. It is primarily used as a mask indicating whether each corresponding entry is selected or not.
df
x = df.Score > 85
x
df.loc[x]
df.loc[df.Score > 85]
df.loc[(df.Score > 80) & (df.Name == 'Kevin')]
Syntax: In Pandas, when you combine multiple conditions, you need to use & (bitwise and), | (bitwise or), or ~ (bitwise not).
Question: Why not use and to connect two criteria?
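To see why, try it: the Python keyword and asks each Series for a single True/False value, which is ambiguous (a quick sketch using the df from above):
try:
    df.loc[(df.Score > 80) and (df.Name == 'Kevin')] # `and` forces bool(Series)
except ValueError as e:
    print(e) # "The truth value of a Series is ambiguous. Use a.empty, a.bool(), ..."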
Before you start to analyze your data, you might realize that you have to spend a large amount of time cleaning it up. Possible cleansing operations include:
1. dropping irrelevant columns/rows,
2. adding new columns/rows,
3. filling in missing data, or dropping columns/rows that contain too many missing entries,
4. transforming certain columns, e.g., adding 100 to column x.
#Example
df = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
df
#drop the Grade column
df.drop(columns='Grade')
df # why is Grade still in there?
df.drop(columns="Grade", inplace=True) # with inplace=True, drop returns None and modifies df directly
df
df = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
#before we use dropna method, let's add some nan values to col_1 by using loc method
df.loc[0, 'Score'] = pd.NA # pd.NA is Pandas' scalar for representing missing values
df
df.dropna(axis='columns', how='any')
df #Rule 1: Most operations generate a copy
df.dropna(axis='columns', how='any', inplace=True) # Rule 2: if the inplace parameter is present in a method, it lets you make an in-place change.
df
#insert a new column
df.insert(loc=0, column='col_new', value=np.random.random(4))
df
#rename col_new to random
df.rename(columns={'col_new': 'random'}, inplace=True)
df
df.sum(axis=0)
## Apply a function to each column to do transformations.
#Tip: you can use attribute-style access to reference a column, e.g., data.A
#e.g., add 10 to each value of the random column
df['random'].apply(lambda x: x+10)
df
# wait!!!! what is that? How come I get a Series object? I was expecting an updated dataframe.
# Remember Rule 1? Every operation generates a copy.
df['result'] = df['random'].apply(lambda x: x+10) # store the result in a new column
df
help(df.apply)
##what if I want to update the random column with the calculated results?
df['random'] = df['random'].apply(lambda x: x+10)
df
data = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
data
#drop the first two rows
data.drop(index=[0,1])
data
#add some missing values to the row_2
data.loc[0,'Grade']=pd.NA
data
#drop rows in which the number of non-NaN values is less than the threshold
data.dropna(axis=0, thresh=4) # if a row has less than 4 non-empty values, it gets deleted.
data.rename(index={1:'row_renamed'})
#try appending one dataframe to another
#Note: the old DataFrame.append method was removed in pandas 2.0; pd.concat is the replacement.
df1 = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
df2 = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
pd.concat([df1, df2], ignore_index=False) # ignore_index=False tells Pandas to keep the data's original index labels
pd.concat([df1, df2], ignore_index=True)
pd.concat([df1, df2], verify_integrity=True) # verify_integrity checks whether the inputs have duplicate index labels; if so, it raises an error
#apply a function across the values of each row
df = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
df_num = df.loc[:, ["Score"]].copy()
df_num["Score2"] = [10, 20, 30, 40] # you can also use df_num.insert() method to add a new column
df_num
df_num.apply(lambda x: sum(x), axis=1) # if you feel axis is confusing, always check with help()
#apply a function to every single element of the dataframe
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data
data.applymap(lambda x: x*2) # applies the function element-wise (applymap was renamed to DataFrame.map in pandas 2.1+)
data
Use this link to get another dataset that has missing values: https://raw.githubusercontent.com/s4-web/pythontraining/main/datanull.csv
data = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/datanull.csv")
data
data.isnull()
data.notnull()
#Count how many missing values are present in each column
data.isnull().sum()
#Count the number of missing values in each row
data.isnull().sum(axis=1)
data = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/datanull.csv")
data
data.fillna(value=0)
Question: I want to replace missing values in different columns differently: replace all missing Grades with 'B', and all missing Scores with 50.
data.fillna({"Grade": 'B', "Score": 50})
data
data.bfill() # backfill: use the next valid value to fill the holes (fillna(method='bfill') is deprecated in recent pandas)
.join(other, on=None, lsuffix='', rsuffix=''): similar to the merge function, but the join method can take a list of DataFrames and join them all at once. The limitation is that it only aligns on the index when concatenating.
Merge vs. Join vs. Append (concat): merge does SQL-style joins on key columns (or indexes); join is a convenience method that joins on the index and can take several DataFrames at once; append/concat simply stacks DataFrames row-wise.
df1 = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
df2 = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/datanull.csv")
df1
df2
df1.join(df2, lsuffix="_2018", rsuffix="_2019")
#merge two dfs on the index, same as the join
df1.merge(df2, left_index=True, right_index=True, suffixes=('_2018', "_2019"))
#merge two dfs on a specified common column
df1.merge(df2, left_on="ID", right_on="ID", suffixes=('_2018', "_2019"))
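merge also accepts a how parameter ('inner' by default, plus 'left', 'right', and 'outer') that controls which keys are kept; a quick sketch:
#keep every ID that appears in either dataframe
df1.merge(df2, on="ID", how="outer", suffixes=('_2018', '_2019'))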
data = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
data
##Example: iterate over rows
for row in data.itertuples():
    print(row.Index, row.Score) # Note that Index is capitalized.
#Example iterate over rows using iterrows
for index, row in data.iterrows():
print(row.Name)
#example: iterate over columns (iteritems was removed in pandas 2.0; use items instead)
for col_name, col in data.items():
    print(col) # col is a Series object
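A practical note: iteration should be a last resort, because vectorized operations are both shorter and dramatically faster. For example, instead of looping to add 10 to every score:
data['Score'] + 10 # one vectorized expression replaces the whole loop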
.groupby(by=None). The parameter by is used to determine how to make groups.
A groupby operation is a combination of splitting the object into groups, applying a function to each group, and combining the results.
#Example
data = pd.DataFrame({'name': ['p1','p2','p3','p4','p5'],
                     'state': ['TX',"RI","TX","CA","MA"],
                     'income(K)': np.random.randint(20,70,5),
                     'height': [4, 5, 6.2, 5.2, 5.1]})
data
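To make the 'split' step concrete, you can iterate over the groups yourself (a sketch using the data above):
for state, group in data.groupby('state'):
    print(state, len(group)) # each group is a sub-DataFrame holding that state's rows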
#get the mean income and mean height of each state
data.groupby('state').mean(numeric_only=True) # Question: where is name? The mean operation is not compatible with a string column, so non-numeric columns are excluded.
data.groupby('state').sum()
#what if we want to apply different methods to different columns?
data.groupby('state').aggregate({'income(K)': 'max', 'height': 'min'})
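aggregate also accepts a list of functions per column if you want several statistics at once; a minimal sketch:
data.groupby('state').aggregate({'income(K)': ['min', 'max'], 'height': 'mean'})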
DataFrame.plot(x=None, y=None, kind='line'). The 3 most important parameters: x (the column for the x-axis), y (the column(s) to plot), and kind (a str such as 'line', 'bar', 'hist', 'box', or 'scatter').
The plotting power of Pandas is limited compared to other Python visualization packages, e.g., Matplotlib and seaborn. Oftentimes, we only use the pandas plot function to explore data.
further reading: Overview of Python Visualization Tools
data = pd.DataFrame({'count':np.random.randint(0,1000,500)})
data.dtypes
data.plot(y='count', kind='hist')
data.plot(y='count',kind='box')
#example 1: A Series with integers as index labels.
data = pd.Series(np.random.random(5),index=[2,3,4,8,6])
data
#What will happen? Are we selecting the first item, or the item with label 0?
data[0] # with an integer index, [] is label-based; there is no label 0, so this raises a KeyError
data.loc[2]
data.iloc[0]
#example 2: dataframe
data = pd.DataFrame(np.random.random((3,4)),columns=['a','b','c','d'], index=['a','b','c'])
data
data['a'] #returns a column
data[0] # returns what?? For a DataFrame, [] selects columns by label; there is no column 0, so this raises a KeyError
data.iloc[0]
The end bound of a .loc slice is included, which is different from Python's default slicing behavior.
data = pd.Series(np.random.randint(0,100,10),index=range(0,10))
data
#use iloc
data.iloc[0:2] # select the first two rows
data.loc[0:2] # the end point of this slice is included.
df = pd.read_csv("https://raw.githubusercontent.com/s4-web/pythontraining/main/data.csv")
df
Task: Change Kevin's Score to 100
#Solution1: Find the row which belongs to Kevin and then update its score
df2 = df.copy() # this creates a copy of df
x = df2[df2.Name == "Kevin"] # the problem happens on this line: it creates a copy, rather than a view of the original df2
x['Score'] = 100
df2
Where is the problem? The problem is in the statement x = df2[df2.Name == "Kevin"]. This line returns a new copy of the selected rows. When you move to the next line to update the score, it turns out you are manipulating the values of a new variable rather than the original data. The chained form df2[df2.Name == 'Kevin']['Score'] = 100 is functionally the same as the example above.
##Solution 2: squeeze all the conditions in a single loc call
df1 = df.copy()
df1.loc[df1.Name == 'Kevin', "Score"] = 100
df1
Rules to follow: Copy vs. View (Source: Stack Overflow)