Data Cleansing with Apache Spark and Optimus

Outdated, inaccurate, or duplicated data won’t drive optimal data-driven solutions. When data is inaccurate, leads are harder to track and nurture, and insights may be flawed. The data on which you base your big data strategy must be accurate, up to date, as complete as possible, and free of duplicate entries. Clean data results in better decisions.

Cleaning data is the most time-consuming and least enjoyable data science task, but it is also one of the most important. No one can start a data science, machine learning, or data-driven project without being sure that the data they’ll be consuming is in the best possible state. Although several data cleansing solutions exist, they either cannot keep up with the emergence of Big Data or are really hard to use.

Right now more and more companies are entering (or at least trying to enter) the Big Data and Machine Learning revolution. Every data-driven approach needs to clean, wrangle, normalize, and fix the data that will be fed to the models it creates. With Optimus we are launching an open source framework that is easy to use and easy to deploy to production, and that cleans and analyzes data in parallel using state-of-the-art technologies. It can be used by small, medium, and big companies, or even by startups that want to create data science solutions but don’t have the money to pay lots of data scientists and build their own cluster to clean the data they are going to use.

Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make your life much easier. The first obvious advantage over any other public data cleaning library is that it will work on your laptop or your big cluster; the second is that it is amazingly easy to install, use, and understand.

Requirements:

Apache Spark 2.2.0
Python 3.5
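
Before installing, it can help to confirm that your environment matches the requirements above. The snippet below is only a minimal sketch: the function name check_environment is made up for illustration, and it assumes that "Python 3.5" means 3.5 or newer, which the requirements list does not state explicitly. It uses only the Python standard library and reports the installed pyspark version, if any, so you can compare it with 2.2.0 yourself.

```python
import sys
from importlib import util

# Versions taken from the requirements list above.
# Assumption: the Python version is treated as a minimum.
REQUIRED_PYTHON = (3, 5)
REQUIRED_SPARK = "2.2.0"


def check_environment():
    """Return a small report on the Python and pyspark versions found.

    check_environment is a hypothetical helper, not part of Optimus.
    """
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python_ok": sys.version_info[:2] >= REQUIRED_PYTHON,
        "pyspark_version": None,  # stays None if pyspark is not installed
    }
    # find_spec returns None when the package cannot be imported.
    if util.find_spec("pyspark") is not None:
        import pyspark
        report["pyspark_version"] = pyspark.__version__
    return report


if __name__ == "__main__":
    for key, value in sorted(check_environment().items()):
        print(key, "=", value)
    print("expected Spark version:", REQUIRED_SPARK)
```

If pyspark_version comes back as None, Spark's Python bindings are not importable from this interpreter, and that is worth sorting out before moving on to the installation step.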

Installation (Windows, Mac & Linux):

In your terminal just type:

pip install optimuspyspark

For complete documentation on how to use it, please visit our GitHub repository:

https://github.com/ironmussa/Optimus

If you want a peek at what Optimus can do for you, check out this demo:

https://nbviewer.jupyter.org/github/ironmussa/Optimus/blob/master/e…

Contributors:

Project Manager: Argenis León.
Original developers: Andrea Rosales, Hugo Reyes, Alberto Bonsanto.
Principal developer and maintainer: Favio Vázquez.
