Sales Data Analysis using DataIku Studio

Overview

Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time taken by data scientists, data analysts, and data engineers to perform data loading, data cleaning, data preparation, data integration, and data transformation when building powerful predictive applications.

It is easy and more user-friendly to explore the data and perform data cleansing. It supports datasets such as Filesystem, FTP, HTTP, SSH, SFTP, Cloud (S3), PostgreSQL, MySQL, Hadoop (HDFS), Oracle, MS SQL Server, Analytic SQL (Vertica, Greenplum, Redshift, Teradata, and Exadata), and NoSQL (MongoDB, Cassandra, and Elasticsearch).

In this blog, let us discuss about data cleansing, data transformation, and data visualization of sales data of a financial company using Dataiku DSS.

Pre-requisites

Download and install Dataiku DSS Version 4.0.4 on Ubuntu from here

Importing Dataset

To import a dataset into Dataiku DSS, perform the following:

Open Dataiku DSS.
Create a new Project.
Click Add New Dataset and click Add a File to upload a new dataset.
Choose the required Filesystem and click Preview to view the added file.
The dataset looks similar to the one below:

The storage type of the data and meanings of the data will be automatically detected from the content of the columns, where the “meaning” is of rich semantic type. For example, DSS automatically detects the meaning of the column with email IDs and sets the meaning as “E-mail address”.

Data Cleansing

Analyze meaning of each column to explore the data and perform data cleansing.

For example, the E-mail address column has Valid, Invalid, and Empty data as shown in the below diagram:

Apply a filter to remove invalid email IDs.

For example, the Price column has both integer values and comma (,) as shown in the below diagram:

Apply a filter to remove the values with commas as shown in the below diagram:

Data Transformation

Data Preparation Recipes

This recipe has filtering and flagging rows, managing dates, sampling, and geographic processing.

To prepare data, perform the following:

Parse and format date columns.
Calculate difference between account created date and last login date for calculating dormant days.
Convert currency to required currency type.
For example, INR into Dollars.
Filter the unwanted columns by its name.
Concatenate two column values with delimiters as shown in the below diagram:

Calculate GeoPoint by giving latitude and longitude as input as shown in the below diagram:

You can also extract latitude and longitude from the given GeoPoint.

Visual Recipes

Visual recipes are used to create new datasets by transforming existing datasets.

Filter Recipe

This recipe is used to filter invalid rows/cells, filter rows/cells on date range, numerical range, and value, filter rows/cells with formula. It has filtering and flagging rows. The records, not accessed for a long-time period, are filtered out using this recipe as shown in the below diagram:

Split Recipe

This recipe is used to split one dataset rows into several other datasets based on certain rules. The dataset with split and dropped state (“Ireland”) is shown in the below diagram:

Grouping – Aggregating Data Recipe

This recipe allows you to perform aggregations on any dataset and is equivalent to SQL “group by” statement. It offers visual tools to setup aggregations and post filters. The rows, aggregated based on products, calculated count, and distinct count of state and country, are shown in the below diagram:

The rows after applying a filter for state_count not be less than 100 are shown in the below diagram:

Joining Datasets Recipe

To join two datasets, perform the following:

In the “Join” section of the recipe, click “Add input” button to add one join.
Select 2 datasets for joining.
Select Join Type and choose the appropriate join type such as “Inner Join”, “Outer Join” and “Left Join” as shown in the below diagram: