Pandas can read and write a wide range of data file formats. Below is a table of commonly used formats and their reader & writer functions.

Format    Reader        Writer
CSV       read_csv      to_csv
JSON      read_json     to_json
Excel     read_excel    to_excel
HDF5      read_hdf      to_hdf
SQL       read_sql      to_sql
read_csv alone has more than 50 parameters!? Here are some important ones you might need to pay attention to: sep, header, index_col, usecols, dtype, na_values, and nrows.
import pandas as pd
import numpy as np
data = pd.read_csv(filepath_or_buffer='data/titanic.csv') #change the file path accordingly, or use a URL
data.head()
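To illustrate some of those parameters, here is a minimal sketch; the extra arguments are illustrative choices, not requirements of the Titanic file:
#a sketch of read_csv with a few commonly tuned parameters
data = pd.read_csv('data/titanic.csv',
                   sep=',',          #field delimiter (',' is the default)
                   header=0,         #row number to use as the column names
                   index_col=0,      #use the first column as the row index
                   na_values=['?'],  #extra strings to treat as missing
                   nrows=100)        #read only the first 100 rows
data.head()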
The first thing to do after your data is loaded into the notebook is to get to know its basic information (e.g., the dimensions, size, shape, and data types of all the columns) and roughly what the data looks like. All of this basic information can be retrieved from the DataFrame object's attributes.
Hint: what are attributes of an object? Attributes are the data/properties stored on an object when you instantiate it; unlike methods, you access them without parentheses.
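For instance, a minimal illustration of the distinction:
data.shape #attribute: accessed without parentheses
data.head() #method: called with parentheses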
#Example: here I am using some NumPy functions to manually create a 2-d array and then transform it into a DataFrame instance.
#You could skip over this step and simply use the Titanic data.
data = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                    columns=['col1', 'col2', 'col3'],
                    index=['row1', 'row2', 'row3', 'row4'])
data.shape #(number of rows, number of columns)
data.ndim #number of dimensions
data.dtypes #data type of each column
data.index #row labels
data.columns #column labels
#Example
data = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                    columns=['col1', 'col2', 'col3'],
                    index=['row1', 'row2', 'row3', 'row4'])
data.head(2) #first two rows
data.tail(1) #last row
data.describe() #summary statistics for each numeric column
data.mean() #column means
data.sum() #column sums
data.max() #column maxima
data.idxmax() #index label of each column's maximum
Pandas has two indexing systems: position-based indexing and label-based indexing. When working with Series or DataFrame objects, you can use either positions (as you would when indexing a Python list) or labels for selecting, slicing, and modifying values.
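Here is a minimal sketch of the contrast (the values and labels are made up for illustration):
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s.iloc[0] #position-based: the first element -> 10
s.loc['a'] #label-based: the element labeled 'a' -> 10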
A boolean array is an array filled with True/False values. It is primarily used as a mask indicating whether each corresponding entry is selected or not.
#Example
data = pd.DataFrame(np.random.random((4,5)))
data
mask_row = [True, False, False, False]
data.iloc[mask_row,:] # data.loc[mask_row,:] also works
mask_column = [True, True, False,False, False]
data.iloc[:, mask_column] # data.loc[:, mask_column] also works
data.loc[:, mask_column]
##A step further: we want to select the rows whose first column is greater than 0.4
data.loc[data.iloc[:,0]>0.4]
#decompose the compound statement above
#the same effect can be accomplished by the following code
mask = data.iloc[:,0]>0.4
data.loc[mask]
Before you start to analyze your data, you might realize that you have to spend a large amount of time cleaning it up. Possible cleansing operations include:
1. dropping irrelevant columns/rows,
2. adding new columns/rows,
3. filling in missing values, or dropping columns/rows that contain too many missing entries,
4. transforming certain columns, e.g., adding 100 to column x.
#Example
data = pd.DataFrame(np.random.random((4, 5)),
                    columns=['col_'+str(x) for x in range(5)],
                    index=['row_'+str(x) for x in range(4)])
data
#drop col_0 column
data.drop(columns='col_0')
data # why is col_0 still in there?
#before we use the dropna method, let's set a value in col_1 to NaN using loc
data.loc['row_0','col_1'] = np.nan
data
data.dropna(axis='columns', how='any')
data #Rule 1: Most operations generate a copy.
data.dropna(axis='columns', how='any', inplace=True) #Rule 2: if the inplace parameter is present in a method, it lets you make an in-place change.
#insert a new column
data.insert(loc=0, column='col_new', value=np.random.random(4))
data
#rename col_new to col_new2
data.rename(columns={'col_new':'col_new2'},inplace=True)
data
## Apply a function to each column to do transformations.
#Tip: you can use attribute-style access to reference a column, e.g., data.A for a column named A
#e.g., add 10 to each value of col_new2
data.col_new2.apply(lambda x: x+10)
# wait!!!! What is that? How come I get a Series object? I was expecting an updated DataFrame.
# Remember Rule 1? Most operations generate a copy.
data.col_new2 = data.col_new2.apply(lambda x: x+10) #rewrite the column col_new2
data
data = pd.DataFrame(np.random.rand(4, 5),
                    columns=['col_'+str(x) for x in range(5)],
                    index=['row_'+str(x) for x in range(4)])
data
#drop the first row
data.drop(index='row_0')
data
#add some missing values to row_2
data.loc['row_2',['col_0','col_1','col_2']]=np.nan
data
#drop the rows in which the number of non-NaN values is less than the threshold
data.dropna(axis=0, thresh=3)
data.rename(index={'row_0':'row_new'})
#try appending one DataFrame to another (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
data2 = pd.DataFrame(np.random.rand(4, 5),
                     columns=['col_'+str(x) for x in range(5)],
                     index=['row_'+str(x) for x in range(4)])
data2
pd.concat([data, data2], ignore_index=False) #keep the original index labels
pd.concat([data, data2], ignore_index=True) #renumber the resulting rows 0..n-1
#apply a transformation to each value of each row
data.apply(func=lambda x: x+[0, 1, 2, 3, 4], axis=1)
#apply a function to every element of the DataFrame
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data
data.applymap(lambda x: x*2) #double each element (in pandas 2.1+, DataFrame.map is the preferred name)
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data.iloc[[2,3],[0,1]] = np.nan
data
data.isnull()
data.notnull()
#Count how many missing values are present in each column
data.isnull().sum()
#Count the number of non-missing values in each column
data.notnull().sum()
data = pd.DataFrame(np.random.randint(0, 100, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
#add some missing values
data.iloc[[0,1],[0,1]] = np.nan
data
data.fillna(value=0) #replace every NaN with 0
data.fillna(value=data.mean()) #fill each column's NaNs with that column's mean
data.fillna(axis=0, method='backfill') #fill each NaN with the next valid value below it (data.bfill() in newer pandas)
.join(other, on=None, lsuffix='', rsuffix=''): similar to the merge function, except that the join method always uses the index of the other DataFrame object(s).
Merge vs. join:
data1 = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                     columns=['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
data2 = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                     columns=['col_'+str(x) for x in range(3)],
                     index=['row_'+str(x) for x in range(4)])
data1
data2
data1.join(data2, rsuffix='_r')
#merge two dfs on the index, same as the join
data1.merge(data2, left_index=True, right_index=True)
#merge two dfs on a specified common column
data1.merge(data2, on='col_2')
data = pd.DataFrame(np.random.randint(0, 10, (4, 3)),
                    columns=['col_'+str(x) for x in range(3)],
                    index=['row_'+str(x) for x in range(4)])
data
##Example: iterate over rows
for row in data.itertuples():
    print(row.Index, row.col_2)
#Example: iterate over rows using iterrows
for i, row in data.iterrows():
    print(row.col_2)
#Example: iterate over columns (iteritems was removed in pandas 2.0; use items)
for name, col in data.items():
    print(col) #col is a Series object
.groupby(by=None): the by parameter determines how to form the groups. The canonical example is by = a column name.
A groupby operation is a combination of splitting the object into groups, applying a function to each group, and combining the results.
#Example
data = pd.DataFrame({'name': ['p1', 'p2', 'p3', 'p4', 'p5'],
                     'state': ['TX', 'RI', 'TX', 'CA', 'MA'],
                     'income(K)': np.random.randint(20, 70, 5),
                     'height': [4, 5, 6.2, 5.2, 5.1]})
data
#get the mean income of each state
data.groupby('state').mean(numeric_only=True) # Question: where is name? The mean operation is not compatible with a string column, so it is dropped (newer pandas requires numeric_only=True here).
data.groupby('state').sum()
#what if we want the max of column income(K) and the min of column height
data.groupby('state').aggregate({'income(K)': 'max', 'height': 'min'})
DataFrame.plot(x=None, y=None, kind='line'). The 3 most important parameters:
x : label or position of the column to use for the x-axis
y : label or position of the column(s) to plot
kind : str, the type of plot, e.g., 'line', 'bar', 'barh', 'hist', 'box', 'kde', 'area', 'pie', 'scatter', or 'hexbin'
The plotting power of Pandas is limited compared to dedicated Python visualization packages, e.g., Matplotlib and seaborn. Oftentimes, we only use the pandas .plot function to explore data.
further reading: Overview of Python Visualization Tools
data = pd.DataFrame({'count':np.random.randint(0,1000,500)})
data.dtypes
data.plot(y='count', kind='hist')
data.plot(y='count',kind='box')
#example 1: A Series with integers as index labels.
data = pd.Series(np.random.random(5),index=[2,3,4,8,6])
data
#What will happen? Are we selecting the first item, or the item with label 0?
data[0] #raises a KeyError: with an integer index, [] is label-based and there is no label 0
data.loc[2] #label-based: the item labeled 2
data.iloc[0] #position-based: the first item
#example 2: dataframe
data = pd.DataFrame(np.random.random((3,4)),columns=['a','b','c','d'], index=['a','b','c'])
data
data['a'] #returns the column labeled 'a'
data[0] #raises a KeyError: a single value in [] looks up column labels, and there is no column 0
data[0:1] #returns the first row: a slice in [] operates on rows
The end bound of a slice in .loc[] is included, which is different from the default slicing behavior of Python.
data = pd.Series(np.random.randint(0,100,10),index=range(0,10))
data
#use iloc
data.iloc[:2]
data.loc[:2] # the end point of this slice is included.
#example of chained indexing
data = pd.DataFrame({'name': ['p1', 'p2', 'p3', 'p4', 'p5'],
                     'state': ['TX', 'RI', 'TX', 'CA', 'MA'],
                     'income(K)': np.random.randint(20, 70, 5),
                     'height': [4, 5, 6.2, 5.2, 5.1]})
data
#we try to modify the cell at row 0, column 'name' to 'mike'
data.loc[0].loc['name'] = 'mike'
data
The reason is that the first part of the chained indexing returns a copy, and the second part modifies that returned copy (an intermediate object) rather than the original data.
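The fix is to do the row and column selection in a single .loc call, so the assignment operates directly on the original DataFrame:
#select the row and column in one .loc call; the assignment now hits the original object
data.loc[0, 'name'] = 'mike'
data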