There is no doubt that SQL is the most widely-used working language for processing structured data. Not only is the language adopted by all relational database products, but its implementation is the goal of many newly-invented big data platforms. But in many aspects SQL isn’t so convenient to use in handling various computational and query demands. The procedurality issue stated in the last article is just a superficial one. SQL’s problems are rooted in its theory foundation, the relational algebra.
Relational algebra is an algebraic system, which is hard to be strictly defined within the length of this article, so we just give a relatively simple explanation. An algebraic system defines data objects and a set of operational rules to handle those data objects while making sure that the operations are closed and self-consistent. For example, the rational numbers and the basic arithmetic operations based on them form an algebraic system dealing with numerical calculations in everyday life. The closeness refers that the result must be one of the defined data objects, which means, for example, that the result of an arithmetic operation over rational numbers must also be a rational number. The self-consistency refers that the results of these operations must be logical. For instance, we stipulate that 0 can’t be divided by any number, because any definition of the quotient will result in logical errors.
The data objects defined by an algebraic system are different from the data objects defined by an object-oriented programming theory. The former emphasizes operations over data and aims to better describe and implement data operations, while the latter emphasizes the object’s encapsulation ability, inheritance and overload capacity, and is intended for code reuse.
A bad algebraic system seriously affects the degree of convenience and the efficiency of performing computations.
Here are two examples:
1. The present arithmetic operation system uses Arabic numerals, which are convenient in both representing numbers and performing the basic arithmetic operations. Now imagine what will be like if the Arabic numerals are replaced by Roman numerals?
2. Any multiplication of integers can be represented by an addition of integers. With the introduction of multiplication to express the addition of multiple same integers, people invent multiplication tables to make such a kind of addition a simple and fast task.
To make computers perform computations, we need a formal language based on an algebraic system. We code the computing target with a set of recognized syntax and give a computer to execute it. In some sense, the process of solving a problem with a computer is a process of translating the solution into a kind of formal language. With badly-designed algebraic system and formal language, the translation could be more difficult than the solution, like using Roman numerals to perform basic arithmetic operations.
The relational algebra is a kind of algebraic system to deal with batch processing of structured data. The formal language based on it is SQL. Since there are a lot of documents discussing relational algebra and SQL, we won’t dig deep.
What about SQL’s efficiency in processing structured data?
Two issues – the efficiency of describing a computing logic and the efficiency of code execution – are at the heart of our concern. They are easy to understand. The low descriptive efficiency means a high development cost and a difficult coding process. The low execution efficiency means very long time in getting the result, which reduces the practical value. Actually high execution efficiency lies in the efficient description of the computing target. We can’t enhance the hardware performance by improving software performance, but we are able to design an efficient algorithm. The condition is that the formal language should facilitate, rather hobble, the coding of such an algorithm.
SQL is very disappointing in both aspects when handling complex big data computing. Here are two simple examples:
1. Find the number of the longest continuously rising days for a stock.
It’s simple for Java or C++ programmers. They will create a counter whose initial value is 0, sort records by date and traverse them all, add 1 to the counter when the stock rises and reset it to 0 when it falls, and then find the largest number the counter ever has. The logic is natural, but it’s rather difficult to implement it in SQL. Relational algebra inherits the mathematical concept of unordered sets, which means the data sorting can only be performed at the output and the order of traversal can’t be specified, making it difficult to implement the logic in a natural way. Instead, programmers need to generate numbers for the dates, create a group mark, group a rising day with the previous day and put a falling day into another group, and then find the biggest COUNT() value among the groups. The logic is difficult to grasp. That is the issue of the translation of computing logic being more difficult than the solution itself.
2. Find the top 10 among one billion records.
There’s no need to sort the one billion records to find the desired ones. We create a 10-member empty set, and traverse data while keeping the set contain the top 10 records of the traversed records. That way we just need to perform one traversal with a very small memory space occupied and a good computational performance.
Unfortunately as the relational algebra doesn’t define an explicit set data type, SQL can’t express the algorithm. We have to sort all the 1 billion records to get the top 10, with the effort of sorting the rest of the records wasted. It’s resource-consuming to sort the 1 billion records. If the memory can’t accommodate all the records at once, we need to perform multiple data exchanges from the external memory to the internal memory, leading to a serious decline in performance. This is the dilemma that a good algorithm can’t be expressed with the SQL language. The only way to counter the dilemma is to perform database optimization, but for a complex SQL algorithm, the database optimization ability is far from enough (one instance is getting the top 10 from each group).
For SQL, there are a lot of such problems. The language is not as mighty as the legend goes. We won’t list these problems here but will have a deeper look into them later.
As a widely-used query language and with a large number of related software products, SQL is convenient to use when the computational logic is simple and high performance is not a necessity. With data requirements becoming more and more complicated and data amount ever-increasing in modern applications, SQL becomes more a hindrance to data manipulation than an efficient tool. SQL’s greatest problem isn’t in the implementation level, but at its theory foundation. The problem can’t be solved by application optimization. Relational algebra isn’t sophisticated enough for handling the complicated data manipulation scenarios. A revolution is needed at the mathematical theory foundation.
Comment
"Find the top 10 among one billion records."
The statement "We have to sort all the 1 billion records to get the top 10" is not true.
select * from ( select * from MASSIVE_TABLE order by 1 ) where rownum <= 10;
uses a smattering of kilobytes of memory on any Oracle database version from the past 15 years or so.
Find the number of the longest continuously rising days for a stock.
select *
from my_stock
match_recognize (
order by recorded_date
measures last(goes_up.recorded_date) - starting_price.recorded_date
pattern (starting_price goes_up*)
define up as goes_up.price > prev(goes_up.price)
)
Eight lines of SQL comprising
- a definition to what a price increase is ( price is higher than previous price )
- a familiar regex format ( "starting price" followed arbitrary number of price "goes up")
I'm not sure how that isn't quicker and simpler than cranking out dozens of lines of code ?
Hello Author, A very enlightening article, could you help me to find some examples in the computational biology domain? I would really appreciate your help.
Pratim
Thanks for the article. It was really hepful.
Let me begin by stating that I am a recovering database administrator (DBA) :-)
This top n query is a standard problem. I actually co-authored an article about this in Dr. Dobb's.
http://www.drdobbs.com/architecture-and-design/processing-rows-in-b...
SQL is nothing but marketing hype. The original promise of SQL: don't have to worry about how data is stored and don't have to worry about how data is retrieved. What you got was: entire manuals and books on how to do performance tuning. In addition, you ended up having to hire a database administrator who knew the inns/outs of the particular engine and knew the storage and access/patterns.
Just a (very) minor comment, as to you statement that "Find the top 10 among one billion records. ... SQL can’t express the algorithm".
That's not really true, I believe: SQL (the language) allows you to write this as something like:
SELECT ... FROM ... WHERE ... ORDER BY ... DESC LIMIT 10
It's not the SQL language's responsibility to implement this without a total sorting; it's the database engine's responsibility.
And you are right that most (current implementations of most) engines will indeed do an inefficient total sorting here; but luckily not all of them!
Specifically, Hive and SparkSQL will implement this exactly the way you are suggesting: by creating a 10-member empty set, and traversing the data while keeping the set contain the top 10 records.
© 2018 Data Science Central Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central