Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time taken by data scientists, data analysts, and data engineers to perform data loading, data cleaning, data preparation, data integration, and data transformation when building powerful predictive applications.
It is easy and more user-friendly to explore the data and perform data cleansing. It supports datasets such as Filesystem, FTP, HTTP, SSH, SFTP, Cloud (S3), PostgreSQL, MySQL, Hadoop (HDFS), Oracle, MS SQL Server, Analytic SQL (Vertica, Greenplum, Redshift, Teradata, and Exadata), and NoSQL (MongoDB, Cassandra, and Elasticsearch).
In this blog, let us discuss about data cleansing, data transformation, and data visualization of sales data of a financial company using Dataiku DSS.
Download and install Dataiku DSS Version 4.0.4 on Ubuntu from here
To import a dataset into Dataiku DSS, perform the following:
- Open Dataiku DSS.
- Create a new Project.
- Click Add New Dataset and click Add a File to upload a new dataset.
- Choose the required Filesystem and click Preview to view the added file.
The dataset looks similar to the one below:
The storage type of the data and meanings of the data will be automatically detected from the content of the columns, where the “meaning” is of rich semantic type. For example, DSS automatically detects the meaning of the column with email IDs and sets the meaning as “E-mail address”.
Analyze meaning of each column to explore the data and perform data cleansing.
For example, the E-mail address column has Valid, Invalid, and Empty data as shown in the below diagram:
Apply a filter to remove invalid email IDs.
For example, the Price column has both integer values and comma (,) as shown in the below diagram:
Apply a filter to remove the values with commas as shown in the below diagram:
Data Preparation Recipes
This recipe has filtering and flagging rows, managing dates, sampling, and geographic processing.
To prepare data, perform the following:
- Parse and format date columns.
- Calculate difference between account created date and last login date for calculating dormant days.
- Convert currency to required currency type.
For example, INR into Dollars.
- Filter the unwanted columns by its name.
- Concatenate two column values with delimiters as shown in the below diagram:
- Calculate GeoPoint by giving latitude and longitude as input as shown in the below diagram:
You can also extract latitude and longitude from the given GeoPoint.
Visual recipes are used to create new datasets by transforming existing datasets.
This recipe is used to filter invalid rows/cells, filter rows/cells on date range, numerical range, and value, filter rows/cells with formula. It has filtering and flagging rows. The records, not accessed for a long-time period, are filtered out using this recipe as shown in the below diagram:
This recipe is used to split one dataset rows into several other datasets based on certain rules. The dataset with split and dropped state (“Ireland”) is shown in the below diagram:
Grouping – Aggregating Data Recipe
This recipe allows you to perform aggregations on any dataset and is equivalent to SQL “group by” statement. It offers visual tools to setup aggregations and post filters. The rows, aggregated based on products, calculated count, and distinct count of state and country, are shown in the below diagram:
The rows after applying a filter for state_count not be less than 100 are shown in the below diagram:
Joining Datasets Recipe
To join two datasets, perform the following:
- In the “Join” section of the recipe, click “Add input” button to add one join.
- Select 2 datasets for joining.
- Select Join Type and choose the appropriate join type such as “Inner Join”, “Outer Join” and “Left Join” as shown in the below diagram:
- Click Conditions to add conditions.
The inner join based on Transaction_ID and Product is shown in the below diagram:
- On successfully completing the join definition, go to “Selected Columns” section of the recipe and select the columns of each dataset needed.
The Original Price and Profit calculated using formulas are shown in the below diagram:
Stacking Datasets Recipe
This recipe merges several datasets into one dataset and is equivalent of a union all SQL statement.
The build datasets can be visualized in the form of charts in Dashboard.
Average of Transaction ID Count by Payment Type
Profit by Country
Product Count by Country
Profit by Country and Product
Average of Profit by Year
Here is the flow created: