How are database joins optimized? How can you do better when handling big data?

Conceptually, a join is a Cartesian product followed by filtering, and in many database systems joins can be very slow. What are the best solutions for optimizing joins?

  1. From a SQL coding point of view: what are the best coding practices for joins?
  2. From a database point of view: which database architectures have the best optimizers, and how are these optimizers implemented?

As an illustration, suppose you join two tables A and B, each with one million rows, with a "where" condition narrowing table A down to 10,000 rows and another "where" condition narrowing table B down to 500,000 rows. An efficient way to perform the join is to:

  • Identify the 10,000 qualifying rows in table A first,
  • Put these 10,000 rows into a hash table keyed on the join column,
  • Scan the one million records of table B and, for each of the 500,000 rows satisfying the "where" criterion on table B, do a quick look-up in the hash table.

This is far more efficient than computing the full Cartesian product, which would involve 1,000,000,000,000 row combinations. Obviously the final result (unless it is an outer join) would consist of at most 10,000 rows.
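
For illustration, here is a minimal Python sketch of that strategy; the column names, predicates and row layout are hypothetical, not taken from any particular schema.

def hash_join(table_a, table_b, pred_a, pred_b, key):
    # Step 1: apply the "where" condition on A (keeps ~10,000 of 1,000,000 rows).
    filtered_a = [row for row in table_a if pred_a(row)]

    # Step 2: index the surviving A rows by the join key.
    hash_index = {}
    for row in filtered_a:
        hash_index.setdefault(row[key], []).append(row)

    # Step 3: stream B once; for each of the ~500,000 rows passing B's "where"
    # condition, probe the hash table instead of comparing against every A row.
    for row_b in table_b:
        if not pred_b(row_b):
            continue
        for row_a in hash_index.get(row_b[key], []):
            yield {**row_a, **row_b}

# Hypothetical usage, with dictionaries standing in for table rows:
# matches = list(hash_join(rows_a, rows_b,
#                          lambda r: r["status"] == "active",
#                          lambda r: r["amount"] > 100,
#                          key="customer_id"))

This way table B is scanned once, and each qualifying row costs only a constant-time hash probe instead of a comparison against every row of A.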

Replies to This Discussion

A hash join can perform better than other join methods provided the join happens on primary/indexed keys with the same data types. However, the choice of join strategy also depends on the record sets in the joined tables. In the scenario you mention, where the record set is narrowed down, the optimizer chooses between a nested-loop and a merge join.

As far as database architecture is concerned, the Teradata optimizer makes a best effort to perform the join on a single node by copying or spooling the child table's data to the node where the master data is present. Joining locally gives the best cost for distributed databases. DB2 UDB provides similar functionality; however, it sometimes broadcasts the join, which costs more than a normal join.
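
As a rough, hedged illustration of that co-located join idea (not Teradata's or DB2's actual implementation), the sketch below broadcasts the smaller, already-filtered child table to every partition of the master table and joins locally; the partitioning scheme and row layout are assumptions.

def broadcast_join(master_partitions, child_rows, key):
    # "Broadcast" step: build one hash index over the (small) child table and
    # hand a copy to every node; here each partition stands in for a node.
    child_index = {}
    for row in child_rows:
        child_index.setdefault(row[key], []).append(row)

    results = []
    for partition in master_partitions:   # each partition lives on one node
        # Local join on that node: probe the broadcast copy, so no rows of the
        # large master table ever need to move between nodes.
        for row in partition:
            for match in child_index.get(row[key], []):
                results.append({**row, **match})
    return results

The trade-off is that broadcasting only pays off when the copied table is small; broadcasting a large table, as the DB2 case above suggests, can easily cost more than a normal join.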

Big data needs to be modeled differently than in conventional databases. If there is no way to save your boat from sinking, it is better to jump out and learn to swim.




As a first step for coding, I would recommend using the declarative power of SQL rather than trying to instruct the database how the query should be performed. The optimizer should have enough information on hand to build a query plan in much the same way as you have described above. Regarding joins, it is better to use an inner join over an outer join, and an outer join over a full outer join.

Any reasonable optimizer will be able to do what you have described in your illustration. Optimizers are typically cost-based: they take into account both the anticipated I/O and the CPU required to perform the task. They will usually review a number of candidate plans and then pick the one with the lowest overall cost.
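
As a toy sketch of that cost-based idea, a planner can be imagined as estimating a cost per candidate join strategy from row-count statistics and keeping the cheapest; the formulas below are simplified placeholders based only on row counts, not any real optimizer's I/O and CPU model.

import math

def estimate_costs(rows_outer, rows_inner, sel_outer, sel_inner):
    out = rows_outer * sel_outer          # rows surviving the outer "where"
    inn = rows_inner * sel_inner          # rows surviving the inner "where"
    return {
        # Nested loop: every surviving outer row is compared to every surviving inner row.
        "nested_loop": out * inn,
        # Hash join: build a hash table on one side, one probe per row of the other.
        "hash_join": out + inn,
        # Sort-merge: sort both sides (n log n), then a single merge pass.
        "sort_merge": out * math.log2(max(out, 2)) + inn * math.log2(max(inn, 2)) + out + inn,
    }

def choose_plan(**stats):
    costs = estimate_costs(**stats)
    return min(costs, key=costs.get), costs

# With the numbers from the original post (10,000 of 1M rows in A, 500,000 of 1M in B):
# choose_plan(rows_outer=1_000_000, rows_inner=1_000_000, sel_outer=0.01, sel_inner=0.5)
# -> ('hash_join', ...) since ~510,000 probes beat 5 billion nested-loop comparisons.

A real optimizer weighs estimated I/O and CPU separately and relies on up-to-date table statistics to get these row-count estimates right.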

If your SQL performs poorly, it is best to first review the plan generated by the optimizer rather than changing your SQL. You will likely find that the optimizer does NOT have enough information on hand.

Are you interested in more detail regarding optimizers?

For question #2, columnar databases tend to perform joins with the best CPU and I/O. Why not try a Cartesian join with one?

http://gigaom.com/cloud/facebook-trapped-in-mysql-fate-worse-than-d...

Not just a Facebook problem

In Stonebraker’s opinion, “old SQL (as he calls it) is good for nothing” and needs to be “sent to the home for retired software.” After all, he explained, SQL was created decades ago before the web, mobile devices and sensors forever changed how and how often databases are accessed.
