Home » Technical Topics » Data Science

Getting Started with pandas a powerful Python data analysis toolkit

1. Introduction

Pandas

First, Pandas is an open source Python library for data analysis. It contains data manipulation and data structures tools designed to make spreadsheet-like data for loading, manipulating, merging, cleaning, among other functions, fast and easy in Python. It is often used with analytical libraries like scikit-learn, data visualization libraries like matplotlib, and numerical computing tools like NumPy and SciPy.

1.1 Introduction to pandas Data Structures

Pandas has introduced new data types to Python: Series and DataFrame. This two workhorse data structures are not a universal solution for every problem, but they provide a solid basis for most applications. The DataFrame represents your entire spreadsheet or a retangular table of data, whereas the Series is is a single column of the DataFrame.

Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its indexIt is similar to the built-in Python list.

Here is an example of an array of data.

1kcVoRDsCzDdh_xIAgBCdDQ

An array of data

This is a string representation of a Series. It shows the index on the left and the values on the right. We have not specifyed an index, so a default one is created. You can get the array respresentation and index object of the Series via its values and index attributes, respectively:

1Y_E0Cw-evx8STySBn9eukw

Array representation and index object

You can create a Series with a label pointing to each data:

10k2QleJaKaAz5ZsXVmTAuQ

Array representation with a label

Additionally, you can use labels in the index when selecting a single value or a set of values:

1Q5yKFEOoxnV1rMnryrbxog

A single value and a set of values

Also, we can use NumPy functions or NumPy-like operations, such as scalar multiplication, filtering with a boolean array, or applying math functions, will preserve the index-value link:

1JYCOdtZ6E6094tSm_SKEJQ

NumPy functions and NumPy-like operations

We can use Series as a specialized dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values.

We can make the Series-as-dictionary analogy even more clear by constructing a Series object directly from a Python dictionary:

Series as dictionary

By default, a Series will be created where the index is drawn from the sorted keys. From here, typical dictionary-style item access can be performed:

1IuvwEos99vEJQmBFxzGSXQ

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

1Vt_5t6fEv84EVka8VJWq9A

When you are only passing a dict, the index in the resulting Series will have the dict’s keys in sorted order. You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

1DdcNYD3X6mvTRXNegIzguQ

override

Since there is no value for ‘Berlin’, it appears as NaN(Not a Number).

Now you can detect the missing data with isnull and notnull functions.

1ueqwRE2-Wjs6JCqu82ic6Q

isnull and notnull funcitons

Series also has these as instance methods:

1lk5jAwf4jnJTwMr8NCAFhw

For arithmetic operations, the series functions are automatically aligned according to the index name. In addition, both the series object itself and its index have a name attribute:

1qlbAcU6fP3dURIFaBqn33g

DataFrame

First of all, let’s clarify the term DataFrame.

In Pandas it is a two-demonsional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be tought of as a dict-like container for Series objects. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

data.frame

Data Frames The function data.frame() creates data frames, tightly coupled collections of variables which share many of…

Let’s have a look at this.

Let’s first construct a new Series listing the area of each of the four states discussed before. Now that we have this along with the population Series from above, we can use a dictionary to construct a single two-dimensional object containing this information:

1w6_F_VMW4zF9ZEU1XrW2rw

Now we can access the index labels via the DataFrame index attribute.The DataFrame also has a column attribute, which is an index object that contains the column labels.

1gGB9cValmcG6MYDTVt4Hpg

Index Attribute

Therefore the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

We can also think of a DataFrame as a specialization of a dictionary. The DataFrame maps a column name to a Series column data, where a dictionary maps a key to a value. Now we get the Series object:

1lhNsSA5_gH7DysIp-LgOOg

Mapping

Note: For a DataFramedata[‘col0’] will return the first columnIn a two-dimensional NumPy arraydata[0] will return the first column.While a DataFrame is physically two-dimensional, you can use it to represent higher dimensional data in a tabular format using hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. Hierarchical indexing is more common practices but Pandas provide also Panel and Panel4d objects that handle three-dimensional and four-dimensional data.

We can construct DataFrame Objects in a variety of ways.

  • From a list of dicts
  • From a single Series object
  • From a dictionary of Series objects
  • From a two-dimensional NumPy array
  • From a NumPy structured array

1nI1NB7ZhTXlL0Pqx2RkeZQ

From a list of dicts

1aEeYr-F8yHBxc2_lghpsXg

From a single Series object

1iRtCk6K6qpBNwYOWqJjIeQ

From a dictionary of Series objects

1E2CcssuVpZ82euU9JRPW9g

From a two-dimensional NumPy array

14PcDkyOiwUVKtCTy8IIpRQ

From a NumPy structured array

This was a short intro to Data Analysis with Pandas. For further reading you can grap this book from O’Reilly Python for Data Analysis, 2nd Edition by Wes McKinney. In addition, you should definitely browse through the reference and of course try it out. Also, you can try the code on https://mybinder.org/v2/gh/MehmetGoekce/PandasRepo/master.

Originally posted here