Subscribe to DSC Newsletter

I recently had the following discussions with a number of data architects, in different communities, in particular (but not limited to) the TDWI group on LinkedIn. This is a summary of the discussion, featuring differences between data scientists and data architects, and how both can work together.

It shows some of the challenges that still need to be addressed before this new analytics revolution is complete. Following are several questions asked by data architects and database administrators, and my answers. The discussion is about optimizing joins in SQL queries, or just moving away from SQL altogether. Several modern databases now offer many of the features discussed here, including hash table joins and fine-tuning the query optimizer by end users. The discussion illustrates the conflicts between data scientists, data architects, and also business analysts. It also touches on many innovative concepts.

Question: You say that one of the bottlenecks with SQL is users writing queries with (say) three joins, when these queries could be split into two queries each with two joins. Can you elaborate?

Answer: Typically, the way I write SQL code is to embed it into a programming language such as Python, and store all lookup tables that I need as hash tables in memory. So I rarely have to do a join, and when I do, it’s just two tables at most.

In some (rare) cases in which lookup tables were too big to fit in memory, I used sampling methods and worked with subsets and aggregation rules. A typical example is when a field in your data set (web log files) is a user agent (browser, abbreviated as UA). You have more unique UAs than can fit in memory, but as long as you keep the 10 million most popular, and aggregate the 200,000,000 rare UAs into a few million categories (based on UA string), you get good results in most applications.

Being an algorithm expert (not an SQL expert), it takes me a couple minutes to do an efficient four-table join via hash tables in Python (using my own script templates). Most of what I do is advanced analytics, not database aggregation: advanced algorithms, but simple to code in Python, such as hidden decision trees. Anyway, my point here is more about non-expert SQL users such as business analysts: Is it easier or more effective to train them to write better SQL code including sophisticated joins, or to train them to learn Python and blend it with SQL code?

To be more specific, what I have in mind is a system where you have to download the lookup tables not very often (maybe once a week) and access the main (fact) table more frequently. If you must re-upload the lookup tables very frequently, then the Python approach loses its efficiency, and you make your colleagues unhappy because of your frequent downloads that slow down the whole system.

Question: People like you (running Python or Perl scripts to access databases) are a DBA’s worst nightmare. Don’t you think you are a source of problems for DBAs?

Answer: Because I'm much better at Python and Perl than SQL, my Python or Perl code is bug-free, easy-to-read, easy-to-maintain, optimized, robust, and re-usable. If I coded everything in SQL, it would be much less efficient. Most of what I do is algorithms and analytics (machine learning stuff), not querying databases. I only occasionally download lookup tables onto my local machine (saved as hash tables and stored as text files), since most don't change that much from week to week. When I need to update them, I just extract the new rows that have been added since my last update (based on time stamp). And I do some tests before running an intensive SQL script to get an idea of how much time and resources it will consume, and to see whether I can do better. I am an SQL user, just like any statistician or business analyst, not an SQL developer.

But I agree we need to find a balance to minimize data transfers and processes, possibly by having better analytic tools available where the data resides. At a minimum, we need the ability to easily deploy Python code there in non-production tables and systems, and be allocated a decent amount of disk space (maybe 200 GB) and memory (at least several GB).

Question: What are your coding preferences?

Answer: Some people feel more comfortable using a scripting language rather than SQL. SQL can be perceived as less flexible and prone to errors, producing wrong output without anyone noticing due to a bug in the joins.

You can write simple Perl code, which is easy to read and maintain. Perl enables you to focus on the algorithms rather than the code and syntax. Unfortunately, many Perl programmers write obscure code, which creates a bad reputation for Perl (code maintenance and portability). But this does not have to be the case.

You can break down a complex join into several smaller joins using multiple SQL statements and views. You would assume that the DB engine would digest your not-so-efficient SQL code and turn it into something much more efficient. At least you can test this approach and see if it works as fast as one single complex query with many joins. Breaking down multiple joins into several simple statements allows business analysts to write simple SQL code, which is easy for fellow programmers to reuse or maintain.

It would be interesting to see some software that automatically corrects SQL syntax errors (not SQL logical errors). It would save lots of time for many non-expert SQL coders like me, as the same typos that typically occur over and over could be automatically fixed. In the meanwhile, you can use GUIs to produce decent SQL code, using tools provided by most database vendors or open-source, such as Toad for Oracle.

Question: Why do you claim that these built-in SQL optimizers are usually black-box technology for end users? Do you think parameters can’t be fine-tuned by the end user?

Answer: I always like to have a bit of control over what I do, though not necessary a whole lot. For instance, I'm satisfied with the way Perl handles hash tables and memory allocation. I'd rather use the Perl black-box memory allocation/hash table management system than creating it myself from scratch in C, or even worse, write a compiler. I’m just a bit worried with black-box optimization — I've seen the damage created by non-expert users who recklessly used black-box statistical software. I'd feel more comfortable if I had at least a bit of control, even as simple as sending an email to the DBA, having her look at my concern or issue, and having her help improve my queries, maybe fine-tuning the optimizer, if deemed necessary and beneficial for the organization and to other users.

Question: Don’t you think tour approach is 20 years old?

Answer: The results are more important than the approach, as long as the process is reproducible. If I can beat my competitors (or help my clients do so) with whatever tools I use, as one would say "”if it ain't broke, don’t fix it.” Sometimes I use APIs (for example, Google API's), sometimes I use external data collected with a web crawler, sometimes Excel or Cubes are good enough, and sometimes vision combined with analytic acumen and intuition (without using any data) works well. Sometimes I use statistical models, and other times a very modern architecture is needed. Many times, I use a combination of many of these. I have several examples of “light analytics” doing better than sophisticated architectures

Question: Why did you ask whether your data-to-analytic approach makes sense?

Answer: The reason I asked the question is because something has been bothering me, based on not-so-old observations (3-4 years old) in which the practices that I mention are well entrenched in the analytic community (by analytic, I mean machine learning, statistics, and data mining, not ETL). It is also an attempt to see if it’s possible to build a better bridge between two very different communities: data scientists and data architects. Database builders often (but not always) need the data scientist to bring insights and value out of organized data. And the data scientists often (but not always) need the data architect to build great, fast, efficient data processing systems so they can better focus on analytics.

Question: So you are essentially maintaining a cache system with regular, small updates to a local copy of the lookup tables. Two users like you doing the same thing would end up with two different copies after some time. How do you handle that?

Answer: You are correct that two users having two different copies (cache) of lookup tables causes problems. Although in my case, I tend to share my cache with other people, so it's not like five people working on five different versions of the lookup tables. Although I am a senior data scientist, I am also a designer/architect, but not a DB designer/architect, so I tend to have my own local architecture that I share with a team. Sometimes my architecture is stored in a local small DB and occasionally on the production databases, but many times as organized flat files or hash tables stored on local drives, or somewhere in the cloud outside the DB servers, though usually not very far if the data is big. Many times, my important “tables” are summarized extracts — either simple aggregates that are easy to produce with pure SQL, or sophisticated ones such as transaction scores (by client, day, merchant, or more granular) produced by algorithms too complex to be efficiently coded in SQL.

The benefit of my “caching” system is to minimize time-consuming data transfers that penalize everybody in the system. The drawback is that I need to maintain it, and essentially, I am replicating a process already in place in the database system itself.

Finally, for a statistician working on data that is almost correct (not the most recent version of the lookup table, but rather data stored in this “cache” system and updated rather un-frequently), or working on samples, this is not an issue. Usually the collected data is an approximation of a signal we try to capture and measure — it is always messy. The same can be said about predictive models, the ROI extracted from a very decent dataset (my “cache”), the exact original most-recent version of the dataset, or a version where 5 percent of noise is artificially injected into it — it is pretty much the same in most circumstances.

Question: Can you comment on code maintenance and readability?

Answer: Consider the issue of code maintenance when someone writing obscure SQL leaves the company — or worse, when SQL is ported to a different environment (not SQL) — and it’s a nightmare for the new programmers to understand the original code. If easy-to-read SQL (maybe with more statements, fewer elaborate high-dimensional joins) runs just as fast as one complex statement because of the internal user-transparent query optimizer, why not use the easy-to-read code instead? After all, the optimizer is supposed to make both approaches equivalent, right? In other words, if two pieces of code (one short and obscure; one longer and easy to read, maintain, and port) have the same efficiency because they are essentially turned into the same pseudo-code by the optimizer, I would favor the longer version that takes less time to write, debug, maintain, and so on.

There might be a market for a product that turns ugly, obscure, yet efficient code into nice, easy-to-read SQL — an “SQL beautifier.” It would be useful when migrating code to a different platform. Although this already exists to some extent, you can easily visualize any query or sets of queries in all DB systems with diagrams. The SQL beautifier would be in some ways similar to a program that translates Assembler into C++. In short, a reverse compiler or interpreter.

Related articles

Views: 18849


You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Jon Reade on September 18, 2017 at 2:42am

Afraid I'm going to be hyper critical here as I'm passionate about good development practices, and read with horror about implementing" efficient four-table join via hash tables in Python".

I'd argue VERY strongly against this approach, embedding SQL inside Python. It's not only difficult to maintain, but by using hash tables, you're ineffectively re-writing code that achieves what a database platform already does, but much better, as the database platform's binary code is optimised for exactly this sort of operation, whereas Python isn't.

By working in this way, you're also guilty of pulling all of the data from every data set you're using from a server (lots of IO) across a network (lots of bandwidth usage) to do something on a low spec laptop rather than a high spec server (lots of time penalties). Finally, you're grabbing ALL of the data from the server in order to perform the join on a lower spec piece of hardware. Why oh why? A  SQL query might filter it down to 5% of what you're pulling, through intelligently written joins and where clauses. This means less IO, less network usage, a smaller data set on your development machine, and most importantly, minimised impact on your colleagues and employer's servers and network.

By sticking to this approach, you're burning through your employer's money and time and potentially affecting other fee-earning  users of an operational system, because you believe it's more effective to code it in Python (which, by the way I love) than write it in a tool fit for the job (SQL, which I also love and use when appropriate) on a server that's designed for the job. In short, you're spending time, money and finite resources directed to supporting the business because you're choosing to use a hammer, where a scalpel is the most appropriate tool for the job.

In addition, you're making implicit assumptions about access control rights, which you may have but which other users in future may not have, long after you've left. Where are these documented? Not in the code. Although this approach is fine for exploratory data analysis on a dev/test server, on its own, non-operational network, these practices condone exactly those bad development habits that we've been trying to eliminate for years. Whilst some of this is fine for the hacking stuff together quickly approach, this type of practice frequently leaks over into production.

Why not instead educate yourself with some of the simple basics of SQL (a set based language ideal for the job in hand), rather than using a procedural language that's unsuited to this type of work? You're effectively moving your own learning pain as a cost onto your employer because you're insisting on using a tool you're familiar with (Python), rather than choosing the most suitable tool for the job (SQL), which I would argue every data scientist should have a half decent understanding of in order to achieve precisely these goals. I'd certainly not employ someone who claimed to be an operational data scientist if they could not write basic SQL (3-4 way joins, filtering, aggregates).

In summary, it's simply an expected part of the basic toolset, just as much as Python, logistic regression or decision trees.

Comment by Roberto Bruno Martins on February 11, 2014 at 4:55pm

Interesting post there, Vincent. I'd say that maybe the discussion started because the goals on each side are different. You are trying to get as fast as possible to a specific answer to one problem that may be very valuable (but has not been quantified to the other party) with the tools you have at hand, while the other party is trying to keep a shared environment running as efficiently as possible while convincing you that the tools they have are just as good for your particular case, while being much better for everybody.

I can't judge on your code quality, but knowing how a DBMS works (at least RDBMS), I'd say that what you'll get from a good SQL would be very similar to the hash tables you described. Not many people would know how to write that good SQL, but not many people would have efficient PERL or Python libraries or care to use them anyway. In the end, I guess you have to work with whatever gets the job done.

One point in favor of using SQL more is that it is a sort of lingua franca for accessing data. Many tools will generate SQL, so if you could work with the Data Architect/DBA and leverage the database more, these resources you have would be available to more people. On the other hand, it'd take longer to do all this, so we get back to the point: it's about getting to the value as fast as possible. All the rest is about repeatability, moving from discovery to production, and could be dealt with by other means (like, massive paralelism coupled with a roll-out in waves, while developers work at optimizing the logic you created in order for it to scale better). That's part of the beauty of services, encapsulation and cloud: you can afford to choose how you are going to approach a new functionality. Which means that the other party's mindset should not default to trying to keep things fair and balanced for all.

Follow Us


  • Add Videos
  • View All


© 2017   Data Science Central   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service