Pandas is an open source Python library for data analysis. It provides data structures and data manipulation tools designed to make loading, manipulating, merging, and cleaning spreadsheet-like data fast and easy in Python. It is often used with analytical libraries like scikit-learn, data visualization libraries like matplotlib, and numerical computing tools like NumPy and SciPy.

Pandas introduces two new data types to Python: **Series** and **DataFrame**. These two workhorse data structures are not a universal solution for every problem, but they provide a solid basis for most applications. The **DataFrame** represents your entire spreadsheet or a rectangular table of data, whereas a **Series** is a single column of the **DataFrame**.

A Series is a **one-dimensional** array-like object containing a sequence of values and an associated array of data labels, called its *index*. It is similar to the built-in Python *list*.

Here is an example of an array of data.

An array of data
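The original screenshot is not available here, so the following is only a minimal sketch. The values are illustrative assumptions, not the author's data:

```python
import pandas as pd

# Illustrative values; any sequence of numbers would work.
obj = pd.Series([4, 7, -5, 3])

# Printing shows the index on the left and the values on the right.
print(obj)
```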

This is the string representation of a **Series**. It shows **the index on the left** and **the values on the right**. We have not specified an index, so a default one is created, consisting of the integers 0 through N - 1, where N is the length of the data. You can get the array representation and the index object of the Series via its **values** and **index** attributes, respectively:

Array representation and index object
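A short sketch of both attributes, again with made-up values:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])  # illustrative values

print(obj.values)  # the underlying NumPy array
print(obj.index)   # the default index, a RangeIndex starting at 0
```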

You can create a Series with a **label** pointing to each data point:

Array representation with a label
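One way this might look, assuming single-letter labels chosen purely for illustration:

```python
import pandas as pd

# The index argument assigns one label to each value, in order.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

print(obj2)
print(obj2.index)
```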

Additionally, you can use labels in the index when selecting a **single value** or a **set of values**:

A single value and a set of values
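A sketch of both kinds of selection, using the same illustrative labels and values:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

single = obj2["a"]              # one label returns a scalar
subset = obj2[["c", "a", "d"]]  # a list of labels returns a new Series

print(single)
print(subset)
```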

Also, using NumPy functions or NumPy-like operations, such as scalar multiplication, filtering with a boolean array, or applying math functions, will preserve the index-value link:

NumPy functions and NumPy-like operations
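The three operations mentioned above could be sketched as follows. The data is illustrative; in each case the result is a Series whose labels stay attached to the right values:

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

doubled = obj2 * 2         # scalar multiplication
positive = obj2[obj2 > 0]  # filtering with a boolean array
exps = np.exp(obj2)        # applying a NumPy math function

print(doubled)
print(positive)
print(exps)
```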

We can use **Series** as a specialized dictionary. **A dictionary** is a structure that maps arbitrary keys to a set of arbitrary values, and a **Series** is a structure that maps typed keys to a set of typed values.

We can make the **Series-as-dictionary** analogy even clearer by constructing a Series object directly from a Python dictionary:

Series as dictionary
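A minimal sketch of this construction. The state names and the rounded figures are placeholders chosen for illustration, not the author's data and not real statistics:

```python
import pandas as pd

# Hypothetical keys and placeholder values, for illustration only.
population_dict = {
    "Bavaria": 13_000_000,
    "Hesse": 6_000_000,
    "Saxony": 4_000_000,
    "Hamburg": 1_800_000,
}

# The dict keys become the index, the dict values become the data.
population = pd.Series(population_dict)
print(population)
```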

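By default, a **Series** will be created whose index is drawn from the dictionary's keys. Older pandas releases sorted the keys; since pandas 0.23 on Python 3.6+, the index follows the dict's insertion order. From here, typical dictionary-style item access can be performed: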
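For instance, with a Series built from placeholder data (the names and numbers are assumptions), lookups and membership tests work much like they do on a dict:

```python
import pandas as pd

# Placeholder data, for illustration only.
population = pd.Series(
    {"Bavaria": 13_000_000, "Hesse": 6_000_000, "Saxony": 4_000_000}
)

value = population["Hesse"]         # look up a value by its key
has_hesse = "Hesse" in population   # membership is tested against the index
has_berlin = "Berlin" in population
labels = list(population.keys())    # keys() returns the index

print(value, has_hesse, has_berlin, labels)
```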

Unlike a dictionary, though, the **Series** also supports array-style operations such as slicing:
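A sketch of slicing, using placeholder data and an explicit index so the order is unambiguous. Note that a label-based slice includes both endpoints, whereas a positional slice with `iloc` excludes the stop position:

```python
import pandas as pd

# Placeholder data with an explicit label order, for illustration only.
population = pd.Series(
    [13_000_000, 6_000_000, 4_000_000, 1_800_000],
    index=["Bavaria", "Hesse", "Saxony", "Hamburg"],
)

by_label = population["Bavaria":"Saxony"]  # includes "Saxony"
by_position = population.iloc[0:2]         # excludes position 2

print(by_label)
print(by_position)
```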

When you pass only a dict, the index in the resulting Series contains the dict’s keys (in sorted order in older pandas releases, in insertion order since pandas 0.23). You can override this by passing an explicit index with the labels in the order you want them to appear in the resulting Series:

override
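A sketch of passing an explicit index. The dictionary and the list of labels are placeholders; what matters is that 'Berlin' is requested in the index but has no entry in the dict, while 'Hamburg' is in the dict but not requested:

```python
import pandas as pd

# Placeholder data, for illustration only.
population_dict = {
    "Bavaria": 13_000_000,
    "Hesse": 6_000_000,
    "Saxony": 4_000_000,
    "Hamburg": 1_800_000,
}
states = ["Berlin", "Bavaria", "Hesse", "Saxony"]

# Only the requested labels are kept, in the requested order.
population2 = pd.Series(population_dict, index=states)
print(population2)
```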

Since there is no value for ‘Berlin’, it appears as NaN (Not a Number), which pandas uses to mark missing values.

Now you can detect the missing data with the **isnull** and **notnull** functions:

isnull and notnull functions
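A sketch with placeholder data in which one label has no value. Both functions return a boolean Series with the same index:

```python
import pandas as pd

# Placeholder data; "Berlin" has no entry in the dict, so it becomes NaN.
population2 = pd.Series(
    {"Bavaria": 13_000_000, "Hesse": 6_000_000},
    index=["Berlin", "Bavaria", "Hesse"],
)

missing = pd.isnull(population2)   # True where a value is missing
present = pd.notnull(population2)  # True where a value is present

print(missing)
print(present)
```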

Series also has these as instance methods:
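The method form gives the same result as the top-level functions. A sketch on the same kind of placeholder data:

```python
import pandas as pd

# Placeholder data; "Berlin" has no entry in the dict, so it becomes NaN.
population2 = pd.Series(
    {"Bavaria": 13_000_000, "Hesse": 6_000_000},
    index=["Berlin", "Bavaria", "Hesse"],
)

missing = population2.isnull()
present = population2.notnull()

print(missing)
print(present)
```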

In arithmetic operations, Series are automatically aligned by index label. In addition, both the Series object itself and its index have a **name** attribute:
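A sketch of both points, with two small Series whose labels only partly overlap (all names and numbers are illustrative). Labels present in only one operand produce NaN in the result:

```python
import pandas as pd

a = pd.Series({"Bavaria": 1, "Hesse": 2, "Saxony": 3})
b = pd.Series({"Hesse": 10, "Saxony": 20, "Berlin": 30})

# Values are matched by label, not by position.
total = a + b

# Both the Series and its index can carry a name.
total.name = "total"
total.index.name = "state"

print(total)
```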

First of all, let’s clarify the term **DataFrame**.

In Pandas, it is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for **Series** objects. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

Let’s have a look at this.

Let’s first construct a new **Series** listing the area of each of the four states discussed before. Now that we have this along with the **population Series** from above, we can use a dictionary to construct a single two-dimensional object containing this information:
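A sketch of this step. The four state names and all figures are placeholders (rounded, made-up numbers), since the original data is not shown:

```python
import pandas as pd

# Placeholder values, for illustration only.
population = pd.Series(
    {"Bavaria": 13_000_000, "Hesse": 6_000_000,
     "Saxony": 4_000_000, "Hamburg": 1_800_000}
)
area = pd.Series(
    {"Bavaria": 70_000, "Hesse": 21_000, "Saxony": 18_000, "Hamburg": 750}
)

# Each dict key becomes a column name; rows are aligned by index label.
states = pd.DataFrame({"population": population, "area": area})
print(states)
```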

Now we can access the index labels via the DataFrame's **index** attribute. The DataFrame also has a **columns** attribute, which is an Index object that contains the column labels.

Index Attribute
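Both attributes can be inspected directly. A sketch on a small DataFrame with placeholder values:

```python
import pandas as pd

# Placeholder values, for illustration only.
states = pd.DataFrame(
    {"population": [13_000_000, 6_000_000], "area": [70_000, 21_000]},
    index=["Bavaria", "Hesse"],
)

print(states.index)    # row labels
print(states.columns)  # column labels, also an Index object
```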

Therefore the DataFrame can be thought of as a **generalization of a two-dimensional NumPy array**, where both the rows and columns have a generalized index for accessing the data.

We can also think of a **DataFrame** as a specialization of a **dictionary**. Where a **dictionary** maps a **key to a value**, a **DataFrame** maps a **column name** to a **Series** of column data. Asking for a column by name therefore returns a **Series** object:

Mapping
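A sketch of the column lookup, assuming a DataFrame with a hypothetical `area` column and placeholder values:

```python
import pandas as pd

# Placeholder values, for illustration only.
states = pd.DataFrame(
    {"population": [13_000_000, 6_000_000], "area": [70_000, 21_000]},
    index=["Bavaria", "Hesse"],
)

# Dictionary-style access by column name returns that column as a Series.
area = states["area"]
print(type(area))
print(area)
```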

**Note:** For a **DataFrame**, **data[‘col0’]** will return the first *column*. In a two-dimensional **NumPy array**, **data[0]** will return the first *row*.

While a DataFrame is physically **two-dimensional**, you can use it to represent **higher dimensional** data in a tabular format using hierarchical indexing (also known as **multi-indexing**) to incorporate multiple index levels.
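To check how the two kinds of indexing in the note behave, here is a small sketch. The array contents and the column names `col0` and `col1` are chosen only for illustration:

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=["col0", "col1"])

first_row = arr[0]      # NumPy: a single integer index selects a row
first_col = df["col0"]  # pandas: a column label selects a column

print(first_row)
print(first_col)
```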

We can construct **DataFrame** objects in a variety of ways:

- From a list of dicts
- From a single Series object
- From a dictionary of Series objects
- From a two-dimensional NumPy array
- From a NumPy structured array

From a list of dicts
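A sketch with made-up records. Each dict becomes one row, and keys missing from a dict are filled with NaN:

```python
import pandas as pd

# Each dict is one row; the keys become column names.
records = [{"a": i, "b": 2 * i} for i in range(3)]
df = pd.DataFrame(records)

# Dicts with different keys still work; the gaps become NaN.
df_missing = pd.DataFrame([{"a": 1, "b": 2}, {"b": 3, "c": 4}])

print(df)
print(df_missing)
```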

From a single Series object
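One possible sketch, assuming a named Series with placeholder values. The Series name becomes the column name:

```python
import pandas as pd

# Placeholder values; the name is used as the column label.
population = pd.Series(
    {"Bavaria": 13_000_000, "Hesse": 6_000_000,
     "Saxony": 4_000_000, "Hamburg": 1_800_000},
    name="population",
)

df = pd.DataFrame(population)
print(df)
```

Calling `population.to_frame()` gives an equivalent single-column DataFrame.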

From a dictionary of Series objects
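A sketch with two Series whose labels only partly overlap (placeholder names and numbers). The resulting row index is the union of both indexes, and missing combinations become NaN:

```python
import pandas as pd

# Placeholder values, for illustration only.
population = pd.Series({"Bavaria": 13_000_000, "Hesse": 6_000_000})
area = pd.Series({"Hesse": 21_000, "Saxony": 18_000})

df = pd.DataFrame({"population": population, "area": area})
print(df)
```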

From a two-dimensional NumPy array
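A sketch using random numbers as stand-in data. The column and row labels are arbitrary; if they are omitted, pandas falls back to integer labels:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.random((3, 2))  # 3 rows, 2 columns of stand-in data

# Explicit column and row labels.
df = pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])

# Without labels, rows and columns are numbered from 0.
df_default = pd.DataFrame(data)

print(df)
print(df_default)
```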

From a NumPy structured array
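A sketch with a zero-filled structured array. The field names `A` and `B` and their dtypes are arbitrary choices; each field becomes a column:

```python
import numpy as np
import pandas as pd

# A structured array with an integer field "A" and a float field "B".
structured = np.zeros(3, dtype=[("A", "i8"), ("B", "f8")])

df = pd.DataFrame(structured)
print(df)
print(df.dtypes)
```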

This was a short intro to **Data Analysis with Pandas**. For further reading, you can grab the O’Reilly book **Python for Data Analysis, 2nd Edition by Wes McKinney**. In addition, you should definitely browse through the pandas reference and, of course, try it out. You can also try the code at https://mybinder.org/v2/gh/MehmetGoekce/PandasRepo/master.

