
5 Minute Analysis: Underutilized Kaggle Data

In this 5 Minute Analysis we’ll explore the collection of Kaggle datasets in real time, reorganize it, and filter it to find popular datasets with many downloads but very few kernels.

Dataset: Complete Kaggle Datasets Collection

This blog post explores and analyzes the data using Pivot Billions, freely available on Docker.

Docker Image

Goals

  1. Load the data to Pivot Billions and explore its structure.
  2. Pivot the data to reorganize it by title, description, kernel use, and number of downloads.
  3. Use Pivot Billions’ built-in features to filter the data by kernel use and downloads, and find the datasets that attract a high level of interest but have seen little code development on Kaggle.

Steps

Load the Data and View its Structure

  1. Download the dataset from Kaggle.
  2. Unzip your downloaded data.
  3. Access the Pivot Billions URL for your machine.
  4. Click the Plus (+) icon on the top right-hand side of the window.
  5. Select Drag & Drop.


  1. Drag your downloaded “kaggle_datasets.csv” file to the Drag & Drop box in Pivot Billions.
  2. Click the dropdown arrow to the right of the file in Pivot Billions to view the schema of the data and see a sample.
  3. Then select the left checkbox next to the file and click Preview at the bottom of the screen.


You can now see the columns and types of the dataset and modify them as you see fit. You can also view or change which column or columns are set as primary keys. When you are done viewing or modifying the data structure to be imported, click Import.
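For readers who want to sanity-check the file outside the tool, the schema preview can be approximated in a few lines of plain Python. This is only a sketch of the idea, not how Pivot Billions infers types: the helper names `infer_type` and `preview_schema` are hypothetical, the sample rows are made up, and the only things taken from this post are the four column names used later on.

```python
import csv
import io

# Made-up sample standing in for kaggle_datasets.csv (values are illustrative only).
SAMPLE_CSV = """title,description,kernels,downloads
Example A,First made-up dataset,0,1500
Example B,Second made-up dataset,12,300
"""

def infer_type(values):
    """Guess 'integer', 'float', or 'string' for a list of raw CSV strings."""
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"

def preview_schema(handle, sample_size=5):
    """Read up to sample_size rows and guess a type for each column."""
    reader = csv.DictReader(handle)
    rows = [row for _, row in zip(range(sample_size), reader)]
    schema = {col: infer_type([r[col] for r in rows]) for col in reader.fieldnames}
    return schema, rows

schema, rows = preview_schema(io.StringIO(SAMPLE_CSV))
print(schema)
```

To try it on the real download, pass a file handle instead of the in-memory sample, for example `open("kaggle_datasets.csv", newline="", encoding="utf-8")`. The encoding is an assumption and may need adjusting for your copy of the file.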


Reorganize the Data to Explore Kernel-to-Download Discrepancies

Now that we can view and explore the data, let’s reorganize it to focus on datasets that have far more downloads than kernels.

  1. Click the Pivot icon in the top right of your data table.
  2. Click the Plus (+) icon under Dimensions and select the “title” column.
  3. Click the Plus (+) icon again and select the “description” column.
  4. Click the Plus (+) icon again and select the “kernels” column.
  5. Click the Plus (+) icon under Values and select the “downloads” column.
  6. Click View to pivot your data.
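Conceptually, the pivot configured above is a group-by: rows are bucketed by the three dimension columns and the “downloads” value is aggregated within each bucket. The sketch below mirrors that idea in plain Python. The function name `pivot_downloads` and the sample rows are hypothetical, and it computes only a count and a sum rather than the full set of statistics the tool reports.

```python
from collections import defaultdict

# Made-up rows using the four column names from this post.
ROWS = [
    {"title": "Example A", "description": "First made-up dataset", "kernels": 0, "downloads": 1500},
    {"title": "Example A", "description": "First made-up dataset", "kernels": 0, "downloads": 500},
    {"title": "Example B", "description": "Second made-up dataset", "kernels": 12, "downloads": 300},
]

def pivot_downloads(rows):
    """Group rows by (title, description, kernels) and aggregate downloads."""
    groups = defaultdict(lambda: {"count": 0, "sum": 0})
    for row in rows:
        key = (row["title"], row["description"], row["kernels"])
        groups[key]["count"] += 1
        groups[key]["sum"] += row["downloads"]
    return dict(groups)

pivot = pivot_downloads(ROWS)
for key, stats in pivot.items():
    print(key, stats)
```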

Pivot Billions now quickly reorganizes your data by dataset title, description, and number of kernels. It also provides counts, sums, and statistics on the downloads of each dataset. You can sort by a column or filter the data. Here we’ll add some filters to restrict the data to just datasets with many downloads but only a few kernels.

  1. In the top-left of the pivot widget, click the Plus (+) button.
  2. Select “kernels” and “Less Than”.
  3. Enter “5” and press enter.
  4. Click the Plus (+) button again.
  5. Select “Sum” and “Greater Than”.
  6. Enter “100” and press enter.


You can see the filters immediately applied and the data reduced from 7,666 unique combinations to just the 612 unique combinations matching our filters.
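The two filters amount to a simple predicate over each pivoted row: keep it only if “kernels” is less than 5 and the summed downloads are greater than 100. A minimal sketch in plain Python, assuming the pivot result is held as a dictionary keyed by (title, description, kernels); that layout, the function name `filter_underused`, and the sample values are all hypothetical.

```python
# Made-up pivot result: key is (title, description, kernels), value holds download stats.
PIVOT = {
    ("Example A", "First made-up dataset", 0): {"count": 1, "sum": 1500},
    ("Example B", "Second made-up dataset", 12): {"count": 1, "sum": 300},
    ("Example C", "Third made-up dataset", 2): {"count": 1, "sum": 40},
}

def filter_underused(pivot, max_kernels=5, min_downloads=100):
    """Keep combinations with kernels < max_kernels and download sum > min_downloads."""
    return {
        key: stats
        for key, stats in pivot.items()
        if key[2] < max_kernels and stats["sum"] > min_downloads
    }

underused = filter_underused(PIVOT)
for key, stats in underused.items():
    print(key, stats)
```

Both comparisons are strict, matching the “Less Than” and “Greater Than” options chosen in the steps above.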

We’ll now interactively view the data.

  1. Click the Switch View Type icon in the top right of the pivot widget and select Pivot View.
  2. Drag the title box to below the drop down selection box.
  3. Drag the description box to below the drop down selection box.
  4. Drag the kernels box to the right of the drop down selection box.
  5. Click the lower drop down selection box and change “Count” to “Summation”.
  6. Click the leftmost arrow twice so that it is pointing up to sort the data.
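One way to read the layout built in these steps is as a cross-tabulation: title and description label the rows, kernel counts label the columns, and each cell holds the summed downloads. The sketch below follows that reading in plain Python. It is an interpretation rather than a description of the widget’s internals: the helper names and sample rows are hypothetical, and sorting with the largest totals first is an assumption about the intended sort direction.

```python
from collections import defaultdict

# Made-up rows using the four column names from this post.
ROWS = [
    {"title": "Example A", "description": "First made-up dataset", "kernels": 0, "downloads": 1500},
    {"title": "Example B", "description": "Second made-up dataset", "kernels": 2, "downloads": 300},
    {"title": "Example C", "description": "Third made-up dataset", "kernels": 1, "downloads": 4200},
]

def cross_tab(rows):
    """Rows keyed by (title, description), columns keyed by kernels, cells hold summed downloads."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[(row["title"], row["description"])][row["kernels"]] += row["downloads"]
    return table

def sorted_by_total(table, descending=True):
    """Order the row keys by total downloads across all kernel columns."""
    return sorted(table, key=lambda key: sum(table[key].values()), reverse=descending)

table = cross_tab(ROWS)
for key in sorted_by_total(table):
    print(key, dict(table[key]))
```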


We can immediately see a variety of very popular datasets that have been downloaded thousands of times yet have very few or no kernels. Many of these are likely underutilized datasets that aren’t easily understood using existing tools and could benefit from additional exploration and analysis with new tools such as Pivot Billions.