Pandas

Introduction to Pandas

First things first: Pandas is a Python package and one of the most widely used tools for data manipulation and analysis (Python Developers Survey 2018 Results).
The name Pandas is derived from the phrase 'Panel Data'. As the name suggests, Pandas was designed to provide operations on panel data.

Glossary
panel data: In statistics and econometrics, panel data, or longitudinal data, are multi-dimensional data involving measurements over time. 1

What are Numpy and Pandas?

In terms of initial release time, Numpy preceded Pandas. Numpy is an open-source Python library for scientific computing that provides high-performance operations on arrays and matrices (the ndarray object). Thanks to vectorized operations, ndarrays are stored and processed more efficiently than Python's default list objects.
The Pandas library is built on the Numpy package, as a higher-level wrapper for the convenience of users whose data is not just numbers.
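
To make the comparison concrete, here is a minimal sketch that applies the same operation to a Python list, an ndarray, and a Pandas Series. The variable names are illustrative only and are not part of either library.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4]

# Plain Python: the addition runs once per element inside the interpreter loop
list_result = [x + 10 for x in values]

# Numpy: one vectorized expression applied to the whole array
arr = np.array(values)
arr_result = arr + 10

# A Pandas Series of numbers exposes its values as a Numpy array
ser = pd.Series(values)
underlying = ser.to_numpy()

print(list_result)        # [11, 12, 13, 14]
print(arr_result)         # [11 12 13 14]
print(type(underlying))   # <class 'numpy.ndarray'>
```

Both approaches give the same numbers; the difference lies in where the loop runs.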

One step further: why is Numpy faster?

  1. Numpy requires all the elements of an ndarray object to share a single data type, for example all integers or all floating point numbers; mixed input is converted to one common type when the array is created. This saves a great deal of time otherwise spent checking whether an operation is valid for each individual element.
In [3]:
a_list = [2,3,4,'t']
#now we want to apply x+10 to all elements of this list
result = [x+10 for x in a_list]
result
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-28563919a579> in <module>
      1 a_list = [2,3,4,'t']
      2 #now we want to apply x+10 to all elements of this list
----> 3 result = [x+10 for x in a_list]
      4 result

<ipython-input-3-28563919a579> in <listcomp>(.0)
      1 a_list = [2,3,4,'t']
      2 #now we want to apply x+10 to all elements of this list
----> 3 result = [x+10 for x in a_list]
      4 result

TypeError: can only concatenate str (not "int") to str

As you can tell, a TypeError was triggered. Within each iteration, Python has to check whether the specified operation is valid for the current element, which is time-consuming. Numpy avoids this by imposing a restriction: all the elements have to be of the same type!

  2. Once Numpy knows that an array's elements are homogeneous in data type, it delegates the work on the array to its optimized C code, which is blazingly fast.
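
The two points above can be sketched as follows. This is a minimal illustration with made-up values, not a benchmark; it only shows how Numpy settles on one data type per array and then applies an operation to the whole array at once.

```python
import numpy as np

# Mixing integers and floats: Numpy converts everything to one common dtype
mixed = np.array([2, 3, 4.5])
print(mixed.dtype)        # float64

# A homogeneous integer array: x + 10 is applied to all elements in one call
numbers = np.array([2, 3, 4])
shifted = numbers + 10
print(shifted)            # [12 13 14]

# Mixing numbers and text: every element is stored as a string
texty = np.array([2, 3, 4, 't'])
print(texty.dtype.kind)   # U
```

The last array shows that the list from the earlier cell can still become an ndarray, but only as an array of strings (kind U stands for Unicode text), so it is no longer numeric data.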

Python packages built on Numpy

Pandas is not the only package built on Numpy; many scientific computing packages are based on it. 2

Learning Pandas

Numpy is mainly concerned with high-performance numeric computation (e.g., matrix transpose, matrix multiplication, and so on). Pandas, by contrast, is designed more for data scientists who care about data manipulation, missing data, queries, splitting and so on.
Also, as we mentioned, Pandas provides a higher-level wrapper around the Numpy core, which is intended to make learning a lot easier.
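
As a rough sketch of the kind of task Pandas targets, the example below handles a missing value and runs a simple query. The table contents and the names scores, filled and high are hypothetical, made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one score is missing
scores = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "score": [90.0, np.nan, 72.0],
})

# Missing data: count the gaps, then fill them with a default value
n_missing = scores["score"].isna().sum()
filled = scores["score"].fillna(0.0)

# Query: keep only the rows that satisfy a condition
high = scores[scores["score"] > 80]

print(n_missing)               # 1
print(filled.tolist())         # [90.0, 0.0, 72.0]
print(high["name"].tolist())   # ['Ann']
```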

Import pandas and numpy modules

In [2]:
import pandas as pd
import numpy as np

You might be wondering why we have to import the numpy module. The reason is that we will sometimes use Numpy's vectorized functions to generate ndarrays or to perform operations on them. Another big reason is missing values: in numeric data, the counterpart of Python's built-in None is np.nan, a floating point value provided by Numpy. (What is np? It is the alias, that is, the namespace, under which we imported numpy.)
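
A short sketch of how np.nan behaves, and of what happens to None when it is placed in numeric data. The Series s is an illustrative example only.

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary floating point value
print(type(np.nan))         # <class 'float'>

# None in numeric data is stored as NaN, and the Series becomes float64
s = pd.Series([1, None, 3])
print(s.dtype)              # float64
print(s.isna().tolist())    # [False, True, False]

# NaN is never equal to itself, so use isna() rather than == to detect it
print(np.nan == np.nan)     # False
```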

Pandas's Data Structures

Pandas has two main data structures:

  • Series: a one-dimensional labeled array that can hold any data type (integers, strings, floating point numbers). Think of a Series as a column in your spreadsheet. Did you ever pay attention to the leftmost side of a spreadsheet, which shows the row numbers?
    A Series object has the same design; in Pandas these labels are called the index. The index can be a list of integers, letters or almost anything else.

  • DataFrame: a two-dimensional data structure, quite similar to a tabular spreadsheet with rows and columns. In addition to the row labels on the leftmost side, which are used to identify rows, a DataFrame object also has column labels for the convenience of indexing columns.
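
The two structures described above can be sketched as follows. The fruit names and numbers are made up for illustration.

```python
import pandas as pd

# A Series: one column of values plus an index of row labels
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])
print(prices["banana"])               # 2.0
print(prices.index.tolist())          # ['apple', 'banana', 'cherry']

# A DataFrame: row labels (index) and column labels (columns)
fruit = pd.DataFrame(
    {"price": [1.5, 2.0, 0.75], "stock": [10, 0, 25]},
    index=["apple", "banana", "cherry"],
)
print(fruit.columns.tolist())         # ['price', 'stock']
print(fruit.loc["cherry", "stock"])   # 25

# Each column of a DataFrame is itself a Series
print(isinstance(fruit["price"], pd.Series))   # True
```

Here the index holds strings rather than row numbers, which shows that row labels do not have to be integers.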