Dataiku Data Science Studio (DSS), a complete data science software platform, is used to explore, prototype, build, and deliver data products. It significantly reduces the time data scientists, data analysts, and data engineers spend on data loading, cleaning, preparation, integration, and transformation when building powerful predictive applications.
DSS makes it easy to explore data and perform data cleansing. It supports dataset connections such as Filesystem, FTP, HTTP, SSH, SFTP, cloud storage (S3), PostgreSQL, MySQL, Hadoop (HDFS), Oracle, MS SQL Server, analytic SQL databases (Vertica, Greenplum, Redshift, Teradata, and Exadata), and NoSQL stores (MongoDB, Cassandra, and Elasticsearch).
In this blog, let us discuss data cleansing, data transformation, and data visualization of a financial company's sales data using Dataiku DSS.
Download and install Dataiku DSS Version 4.0.4 on Ubuntu from here
To import a dataset into Dataiku DSS, perform the following:
The storage type and meaning of the data are automatically detected from the content of the columns, where the "meaning" is a rich semantic type. For example, DSS automatically detects that a column contains email IDs and sets its meaning to "E-mail address".
Analyze the meaning of each column to explore the data and perform data cleansing.
For example, the E-mail address column has Valid, Invalid, and Empty data as shown in the diagram below:
Apply a filter to remove invalid email IDs.
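Outside DSS, an equivalent cleansing step can be sketched in plain Python. The regex and the `email` column name below are illustrative assumptions; DSS's "E-mail address" meaning uses its own validation rules internally.

```python
import re

# Illustrative e-mail pattern -- not the exact rule DSS applies.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def filter_valid_emails(rows, column="email"):
    """Keep only rows whose e-mail cell is non-empty and valid."""
    return [r for r in rows if r.get(column) and EMAIL_RE.match(r[column])]

rows = [
    {"email": "alice@example.com"},  # valid
    {"email": "not-an-email"},       # invalid
    {"email": ""},                   # empty
]
print(filter_valid_emails(rows))  # [{'email': 'alice@example.com'}]
```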
For example, the Price column contains both integer values and values with commas (,) as shown in the diagram below:
Apply a filter to remove the values containing commas as shown in the diagram below:
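A minimal Python sketch of this comma filter, with a small illustrative price list standing in for the dataset:

```python
# Drop Price values formatted with commas (e.g. "1,200"), keeping plain
# integers, as in the filtering step above.
def has_comma(value):
    return isinstance(value, str) and "," in value

prices = ["1,200", "350", 42, "4,075"]
cleaned = [p for p in prices if not has_comma(p)]
print(cleaned)  # ['350', 42]
```

Alternatively, instead of dropping such rows, the values could be normalized by stripping the commas with `value.replace(",", "")` before converting to integers.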
This recipe supports filtering and flagging rows, managing dates, sampling, and geographic processing.
To prepare data, perform the following:
You can also extract latitude and longitude from the given GeoPoint.
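The latitude/longitude extraction can be sketched in Python, assuming the GeoPoint is stored as WKT text of the form "POINT(lon lat)"; the sample coordinates are illustrative.

```python
import re

# Assumes the geopoint is a WKT string "POINT(lon lat)".
def extract_lat_lon(geopoint):
    m = re.match(r"POINT\(\s*(-?[\d.]+)\s+(-?[\d.]+)\s*\)", geopoint)
    if not m:
        return None, None
    lon, lat = float(m.group(1)), float(m.group(2))
    return lat, lon

print(extract_lat_lon("POINT(-6.2603 53.3498)"))  # (53.3498, -6.2603)
```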
Visual recipes are used to create new datasets by transforming existing datasets.
This recipe is used to filter invalid rows/cells; filter rows/cells on a date range, numerical range, or value; and filter rows/cells with a formula. It supports both filtering and flagging rows. Records not accessed for a long period are filtered out using this recipe as shown in the diagram below:
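The date-range variant of this filter can be sketched as follows; the `last_access` column name and the cutoff date are assumptions for illustration.

```python
from datetime import date

# Keep only records accessed on or after a cutoff date.
def filter_recent(rows, cutoff):
    return [r for r in rows if r["last_access"] >= cutoff]

rows = [
    {"id": 1, "last_access": date(2017, 6, 15)},
    {"id": 2, "last_access": date(2015, 3, 2)},
]
recent = filter_recent(rows, cutoff=date(2017, 1, 1))
print([r["id"] for r in recent])  # [1]
```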
This recipe is used to split the rows of one dataset into several other datasets based on defined rules. The dataset after splitting on state and dropping "Ireland" is shown in the diagram below:
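A sketch of this split rule in Python: rows are routed to per-state outputs, and "Ireland" rows are dropped, mirroring the rule described above (column names and sample values are illustrative).

```python
# Route rows to per-state output datasets, dropping one value entirely.
def split_by_state(rows, drop_value="Ireland"):
    outputs = {}
    for r in rows:
        if r["state"] == drop_value:
            continue  # dropped: written to no output dataset
        outputs.setdefault(r["state"], []).append(r)
    return outputs

rows = [
    {"state": "Texas", "sales": 10},
    {"state": "Ireland", "sales": 5},
    {"state": "Texas", "sales": 7},
]
out = split_by_state(rows)
print(sorted(out), len(out["Texas"]))  # ['Texas'] 2
```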
This recipe allows you to perform aggregations on any dataset and is equivalent to an SQL GROUP BY statement. It offers visual tools to set up aggregations and post-filters. The rows aggregated by product, with the count and distinct count of state and country, are shown in the diagram below:
The rows after applying a post-filter requiring state_count to be at least 100 are shown in the diagram below:
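The grouping and post-filter steps can be sketched together in Python; the column names and the sample data are illustrative assumptions, with the state_count >= 100 threshold taken from the example above.

```python
from collections import defaultdict

# Aggregate by product: count state values and distinct states/countries,
# then post-filter groups on state_count >= 100.
def group_rows(rows, min_state_count=100):
    agg = defaultdict(lambda: {"state_count": 0, "states": set(), "countries": set()})
    for r in rows:
        g = agg[r["product"]]
        if r.get("state"):
            g["state_count"] += 1
            g["states"].add(r["state"])
        if r.get("country"):
            g["countries"].add(r["country"])
    return [
        {"product": p,
         "state_count": g["state_count"],
         "state_distinct": len(g["states"]),
         "country_distinct": len(g["countries"])}
        for p, g in agg.items()
        if g["state_count"] >= min_state_count
    ]

rows = ([{"product": "A", "state": "TX", "country": "US"}] * 120
        + [{"product": "B", "state": "NY", "country": "US"}] * 3)
print([r["product"] for r in group_rows(rows)])  # ['A']
```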
To join two datasets, perform the following:
The Original Price and Profit columns calculated using formulas are shown in the diagram below:
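The join step itself behaves like a left join on a shared key column. A minimal Python sketch (the `product_id` key and the sample columns are illustrative assumptions, not the blog's actual schema):

```python
# Left join: enrich each left-hand row with matching right-hand columns.
def left_join(left, right, key="product_id"):
    index = {r[key]: r for r in right}
    joined = []
    for row in left:
        match = index.get(row[key], {})
        extra = {k: v for k, v in match.items() if k != key}
        joined.append({**row, **extra})
    return joined

orders = [{"product_id": 1, "qty": 3}, {"product_id": 2, "qty": 1}]
products = [{"product_id": 1, "price": 9.99}]
print(left_join(orders, products))
# [{'product_id': 1, 'qty': 3, 'price': 9.99}, {'product_id': 2, 'qty': 1}]
```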
This recipe merges several datasets into one dataset and is the equivalent of an SQL UNION ALL statement.
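The stacking behavior can be sketched as a plain concatenation of datasets sharing a schema (sample rows are illustrative):

```python
# Concatenate datasets like SQL UNION ALL (no deduplication is performed).
def stack(*datasets):
    out = []
    for ds in datasets:
        out.extend(ds)
    return out

q1 = [{"region": "EU", "sales": 100}]
q2 = [{"region": "US", "sales": 250}]
print(stack(q1, q2))
# [{'region': 'EU', 'sales': 100}, {'region': 'US', 'sales': 250}]
```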
The built datasets can be visualized as charts in the Dashboard.
Here is the flow created: