Python Book Goodies and Apache Arrow - DataScienceCentral.com

In my rundown this week, I cover two distinct topics: a new Python analytics book and the rise of Apache Arrow.

A New Python Data Analytics Book Published

One of the best-known books on data analysis now has a new (3rd) edition, available as an early open-access release.

Wes McKinney, the creator of the pandas library, has released the latest version of his book “Python for Data Analysis” as open access.

Book: https://bit.ly/3G2RCtq

Chapters

Preface

1 Preliminaries

2 Python Language Basics, IPython, and Jupyter Notebooks

3 Built-in Data Structures, Functions, and Files

4 NumPy Basics: Arrays and Vectorized Computation

5 Getting Started with pandas

6 Data Loading, Storage, and File Formats

7 Data Cleaning and Preparation

8 Data Wrangling: Join, Combine, and Reshape

9 Plotting and Visualization

10 Data Aggregation and Group Operations

11 Time Series

12 Introduction to Modeling Libraries in Python

13 Data Analysis Examples

While looking at Wes McKinney's site, I also saw another interesting project he is now involved in: Apache Arrow. The second half of this post is about the significance of Apache Arrow.


Apache Arrow Targets Columnar Analytics

The exact problem Arrow addresses needs some explanation.

Python, R, and similar languages came from scientific computing and statistics and have now become mainstream tools for data analytics. In doing so, these languages increasingly interact with big data frameworks like Hadoop and Spark. While Hadoop and Spark offer programming interfaces for Python and R, these interfaces perform worse than the native bindings of the big data ecosystem (typically Java or Scala), which run on the JVM.

To access data from a Python user-defined function, the data must first be converted into a format that can be sent to Python and then converted into Python objects (lists, dictionaries, pandas DataFrames, etc.).
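The round trip described above can be sketched in a few lines. This is only an illustration under stated assumptions: the standard-library pickle module stands in for whatever wire format a real engine uses, and the names rows, add_tax, and call_udf_via_serialization are hypothetical, not part of Spark or Hadoop.

```python
import pickle

# Hypothetical rows as they might sit inside an engine (illustrative data only).
rows = [{"id": i, "price": float(i) * 1.5} for i in range(5)]


def add_tax(row):
    """A toy user-defined function: add 10% to the price."""
    return row["price"] * 1.1


def call_udf_via_serialization(rows, udf):
    """Apply udf to each row, paying a serialize/deserialize cost per value."""
    results = []
    for row in rows:
        payload = pickle.dumps(row)        # engine side: row -> bytes
        py_obj = pickle.loads(payload)     # Python side: bytes -> dict
        out = udf(py_obj)                  # the actual work
        returned = pickle.dumps(out)       # Python side: result -> bytes
        results.append(pickle.loads(returned))  # engine side: bytes -> value
    return results


results = call_udf_via_serialization(rows, add_tax)
print(results)
```

Every value crosses the boundary twice as bytes, so the conversion work grows with the number of rows regardless of how cheap the function itself is.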

Also, Python and R both use array processing, whereas Spark and Hadoop process one value at a time. There are other developments, such as DataFrames in Spark, but the native implementations (Java and Scala) still provide better performance.
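To make the contrast concrete, here is a small sketch of the two styles using only the Python standard library. The records and column names are made up for illustration, and array.array is used as a simple stand-in for the typed arrays that NumPy or R vectors provide.

```python
from array import array

# Row-oriented data: each record carries all of its fields together.
records = [(1, 10.0), (2, 20.5), (3, 30.25)]

# Value-at-a-time processing: visit each record and pick out one field.
total_by_rows = 0.0
for _record_id, price in records:
    total_by_rows += price

# Array (columnar) processing: each field lives in its own contiguous,
# typed array, and an operation runs over the whole column at once.
ids = array("q", (r[0] for r in records))      # 64-bit signed integers
prices = array("d", (r[1] for r in records))   # 64-bit floats
total_by_column = sum(prices)

print(total_by_rows, total_by_column)
```

Both approaches give the same answer; the difference is in layout. The columnar version keeps all prices next to each other in memory, which is what array-oriented tools are built to exploit.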

Thus, Apache Arrow (a top-level project of The Apache Software Foundation) is designed for data analytics systems that need to move and process data fast, using efficient in-memory columnar analytics and low-overhead data transport.
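The idea behind low-overhead transport is that a consumer reads a column's memory directly instead of receiving a converted copy. The sketch below shows that idea with Python's built-in memoryview; it is an analogy only and does not use Arrow itself or its actual memory format, and the variable names are hypothetical.

```python
from array import array

# A "column" of 64-bit floats stored in one contiguous buffer.
prices = array("d", [10.0, 20.5, 30.25])

# A consumer obtains a view of the same memory; nothing is copied.
view = memoryview(prices)
window = view[1:]          # slicing a memoryview does not copy either

# Because the buffer is shared, a change made by the producer
# is visible through the consumer's view.
prices[1] = 99.0

print(view.nbytes, window[0])
```

With a shared layout, handing data to another component costs about as much as handing over a reference, rather than scaling with the size of the data.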

There are many real-life use cases, such as Databricks Cloud Fetch connectors, which connect business intelligence tools (like Tableau or Power BI) with data stored in the cloud.

Source for references and image: https://voltrondata.com/news/arrow-columnar-analytics/