
Celebrate the Big Data Problems – #2

How to identify the number of buckets for a Hive table while executing the HiveQL DDLs?

The dataottam team has come up with a blog-sharing initiative called “Celebrate the Big Data Problems”. In this series of blogs we will share our big data problems using the CPS (Context, Problem, Solutions) framework.

Context:

Bucketing is another technique for decomposing data sets into more manageable parts. For example, suppose a table that uses date as the top-level partition and employee_id as the second-level partition leads to too many small partitions. Instead, if we bucket the employee table and use employee_id as the bucketing column, the value of this column will be hashed into a user-defined number of buckets, and records with the same employee_id will always be stored in the same bucket. The challenge, however, is to identify the number of buckets for a given Hive table in the big data system. While creating the table you can specify CLUSTERED BY (employee_id) INTO XX BUCKETS, where XX is the number of buckets.

Bucketing has several advantages. The number of buckets is fixed, so it does not fluctuate with the data. If two tables are bucketed by employee_id, Hive can create a logically correct sampling. Bucketing also aids in doing efficient map-side joins.
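
As a quick illustration of that DDL, here is a minimal sketch of a bucketed (and date-partitioned) employee table; the table name, column list, 32-bucket count, and RCFile storage format are illustrative assumptions rather than values from this post.

    -- Illustrative only: columns, bucket count, and storage format are assumptions.
    CREATE TABLE employee (
      employee_id INT,
      name        STRING,
      salary      DOUBLE
    )
    PARTITIONED BY (hire_date STRING)
    CLUSTERED BY (employee_id) INTO 32 BUCKETS
    STORED AS RCFILE;

    -- On older Hive releases (pre-2.x), bucketing must be enforced before inserts:
    SET hive.enforce.bucketing = true;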

Problem:

How to identify the number of buckets for a Hive table while executing the HiveQL DDLs?

Solutions:

To identify the number of buckets, we need to do a small exercise with the steps below:

  • We need to get the daily / per-run record counts from the business, vertical, or domain teams.
  • Convert them into an average incremental percentage by taking at least five days’ data, or a week’s data.
  • Multiply the incremental percentage by 1024 to get the incremental size in megabytes.
  • Divide it by 192 for RCFile or by 128 for HiveIO.
  • Formulae:
    • Total-to-Incremental Ratio = Total Records / Incremental Records per run
    • Incremental Records % of Total = (Incremental Records / Total Records) * 100
    • Incremental Size in MB = Incremental Records % of Total * 1024
    • No. of Buckets = Incremental Size in MB / 192 for RCFile
    • No. of Buckets = Incremental Size in MB / 128 for HiveIO
  • The reason for converting into MB is that Hadoop stores its files in large blocks whose sizes are expressed in MB.
  • Example (the full calculation is sketched after this list):
    • Suppose we initially have 100 records, an average increment of 5 records per run/day, and we are interested in using RCFile.
    • Total-to-Incremental Ratio = 100 / 5 = 20
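
To make the arithmetic concrete, here is a minimal sketch that plugs the example numbers (100 total records, 5 records per run, RCFile) into the formulas above exactly as stated; the query form, the column aliases, and the rounding with ceil() are illustrative assumptions, and the FROM-less SELECT needs a reasonably recent Hive version (otherwise select from any single-row table).

    -- Worked example following the formulas above (illustrative assumptions).
    SELECT
      100 / 5                              AS total_to_incremental_ratio,  -- 20
      (5 / 100) * 100                      AS incremental_pct_of_total,    -- 5
      ((5 / 100) * 100) * 1024             AS incremental_size_mb,         -- 5120
      ceil(((5 / 100) * 100) * 1024 / 192) AS num_buckets_rcfile;          -- 27

By the same arithmetic, the HiveIO divisor of 128 would suggest ceil(5120 / 128) = 40 buckets.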
