First things first: Pandas is a Python package and one of the most popular tools for data manipulation and analysis (Python Developers Survey 2018 Results).
The name Pandas is derived from the phrase 'Panel Data'. As the name suggests, Pandas was designed to provide operations on panel data.
Glossary
panel data: In statistics and econometrics, panel data, or longitudinal data, are multi-dimensional data involving measurements over time. 1
In terms of initial release date, NumPy preceded Pandas. NumPy is an open-source Python library for scientific computing that provides high-performance operations on arrays and matrices (the ndarray object). Thanks to vectorized operations, ndarrays are stored and processed more efficiently than Python's default list objects.
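To make the contrast concrete, here is a minimal sketch that adds 10 to every element, once with a plain list and once with an ndarray. The variable names are illustrative only.

```python
import numpy as np

# A plain Python list needs an explicit loop (here a list comprehension).
numbers = [2, 3, 4, 5]
list_result = [x + 10 for x in numbers]

# An ndarray applies the operation to every element at once (vectorized).
arr = np.array(numbers)
array_result = arr + 10

print(list_result)   # [12, 13, 14, 15]
print(array_result)  # [12 13 14 15]
```

Both versions give the same numbers, but the ndarray version has no Python-level loop, which is where the speed advantage on large arrays comes from.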
The Pandas library is built on top of the NumPy package, as a higher-level wrapper for the convenience of users whose data is not just numbers.
a_list = [2, 3, 4, 't']
# Now we want to apply x + 10 to every element of this list.
# The comprehension raises a TypeError when it reaches the string 't'.
result = [x + 10 for x in a_list]
result
As you can tell, a TypeError was triggered. In each iteration, Python has to check whether the requested operation is valid for the current element, which is time-consuming. NumPy avoids this by imposing a restriction: all the elements of an array have to be of the same type!
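The same-type restriction can be seen directly. The following is a small sketch (variable names are illustrative) of what NumPy does when it is handed mixed data, and of the element-wise addition on a homogeneous array.

```python
import numpy as np

# Mixing integers and a string forces NumPy to pick one common dtype;
# here every element is converted to a string.
mixed = np.array([2, 3, 4, 't'])
print(mixed.dtype.kind)  # U (a Unicode string dtype)
print(mixed.tolist())    # ['2', '3', '4', 't']

# With a homogeneous numeric array, the addition needs no per-element type checks.
numeric = np.array([2, 3, 4])
print(numeric + 10)      # [12 13 14]
```

Because the whole array shares one dtype, NumPy decides once, up front, whether an operation is valid, instead of checking on every iteration.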
NumPy is mainly concerned with high-performance numeric computation (e.g., matrix transpose, matrix multiplication, and so on). Pandas, by contrast, is designed more for data scientists who care about data manipulation, missing data, queries, splitting, and so on.
Also, as we mentioned, Pandas provides a higher-level wrapper around the NumPy core, which is intended to make learning a lot easier.
import pandas as pd
import numpy as np
You might be wondering why we also import the numpy module. One reason is that we may use some of NumPy's vectorized functions to generate ndarrays or operate on them. Another big reason is missing values: where plain Python would use the built-in None, Pandas uses np.nan, a special floating-point value provided by NumPy. (What is np? It is the alias, or namespace, under which we imported numpy.)
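As a quick illustration of the missing-value point, the sketch below builds a Series (one of the data structures described next) from a list containing None. The data is made up for the example.

```python
import numpy as np
import pandas as pd

# None in numeric data is stored as np.nan, so the Series becomes float64.
s = pd.Series([1, None, 3])
print(s.dtype)         # float64
print(s.isna().sum())  # 1

# np.nan is an ordinary Python float and is not equal to itself,
# so use isna() rather than == to detect it.
print(type(np.nan))      # <class 'float'>
print(np.nan == np.nan)  # False
```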
Pandas has two main data structures:
Series: a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, and so on). Think of a Series as a column in your spreadsheet. Did you ever pay attention to the leftmost side, which shows the row numbers?
A Series object has the same design, which in Pandas is termed the index. The index can be made of integers, letters, or almost any other hashable labels.
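A minimal sketch of a Series and its index follows; the values and labels are invented for illustration.

```python
import pandas as pd

# Without an explicit index, Pandas labels the rows 0, 1, 2, ...
prices = pd.Series([1.5, 2.0, 3.25])
print(list(prices.index))  # [0, 1, 2]

# A custom index lets us look values up by label instead of position.
labeled = pd.Series([1.5, 2.0, 3.25], index=['apple', 'banana', 'cherry'])
print(labeled['banana'])   # 2.0
```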
DataFrame: a two-dimensional data structure, quite similar to a tabular spreadsheet with rows and columns. In addition to the row labels on the leftmost side, which identify rows, a DataFrame object also has column labels for conveniently indexing columns.
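The row and column labels can be seen in a small example. This is only a sketch with made-up data; the names df and ages are illustrative.

```python
import pandas as pd

# Build a DataFrame from a dict of columns, with explicit row labels.
df = pd.DataFrame(
    {'name': ['Ann', 'Bob', 'Cid'], 'age': [28, 35, 41]},
    index=['r1', 'r2', 'r3'],
)
print(list(df.index))       # ['r1', 'r2', 'r3']
print(list(df.columns))     # ['name', 'age']

# Row label plus column label identifies a single cell.
print(df.loc['r2', 'age'])  # 35

# Selecting one column returns a Series that keeps the row labels.
ages = df['age']
print(list(ages.index))     # ['r1', 'r2', 'r3']
```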