My nephew's a very impressive young man. Five years ago, he received a PhD in Biochemistry/Molecular Biology from a prestigious university, earning numerous teaching and research awards along the way. He then took a faculty position at a college where, several years later, he decided to become literate in analytic computation in addition to executing his teaching and research responsibilities.

He's become proficient in R, completing the Coursera Data Science Specialization from Johns Hopkins, along with other courses and self-study endeavors including SQL, bioinformatics/Bioconductor, Statistics with R for the Life Sciences, and Advanced Statistics for the Life Sciences. At this point, he's a true data scientist, combining his strong science background with newly developed computational and analytic skills.

As he's progressed, we've had occasional Skype video sessions to discuss aspects of R, RStudio, and SQL. Before one session a while back, he asked if we could focus on outer joins and correlated sub-queries, advanced topics from the SQL course he was taking at the time. Before the call, I pulled some readily accessible data into R and assembled a few R/SQL scripts that illustrate those advanced queries in action.

In SQL, an equi or inner join is one that links tables on exact matches of the join attribute(s). Rows from either table that are unmatched in the other are not returned. With outer joins, in contrast, unmatched rows in one or both tables can be returned. A left join, for example, returns matched rows plus unmatched rows from the left table only. I often use left joins in data wrangling.
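The inner-versus-left-join distinction above can be seen directly in SQL. The article's scripts use R/DBI with MonetDBLite, but as a minimal, self-contained sketch, the same queries run against toy tables in Python's stdlib sqlite3 (the table and tail-number values here are made up for illustration, not taken from nycflights13):

```python
import sqlite3

# Toy tables: flight 3 references a tail number missing from planes,
# so it is unmatched in any equi-join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flights (flight INTEGER, tailnum TEXT);
    CREATE TABLE planes  (tailnum TEXT, model TEXT);
    INSERT INTO flights VALUES (1,'N101'),(2,'N102'),(3,'N999');
    INSERT INTO planes  VALUES ('N101','A320'),('N102','B737');
""")

# Inner join: only rows with exact matches on the join attribute return.
inner = con.execute("""
    SELECT COUNT(*) FROM flights f
    JOIN planes p ON f.tailnum = p.tailnum
""").fetchone()[0]

# Left join: matched rows plus unmatched rows from the left table,
# whose right-side columns come back as NULL.
left = con.execute("""
    SELECT COUNT(*) FROM flights f
    LEFT JOIN planes p ON f.tailnum = p.tailnum
""").fetchone()[0]

print(inner, left)  # 2 3
```

The left join returns one more row than the inner join: the unmatched flight survives with NULL plane columns.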

A correlated sub-query is a query nested inside another in which the inner query references values from the outer query, row by row. Though very powerful, the correlated sub-query behaves like computationally expensive nested loops and must consequently be used prudently.
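That row-by-row probing can be sketched with the same toy sqlite3 tables (again hypothetical data, standing in for the R/MonetDBLite setup the article actually uses). The sub-query below references `f.tailnum` from the outer query, so it is re-evaluated once per outer row:

```python
import sqlite3

# Toy tables: flight 3 references a plane missing from the planes table.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flights (flight INTEGER, tailnum TEXT);
    CREATE TABLE planes  (tailnum TEXT, model TEXT);
    INSERT INTO flights VALUES (1,'N101'),(2,'N102'),(3,'N999');
    INSERT INTO planes  VALUES ('N101','A320'),('N102','B737');
""")

# Correlated sub-query: the inner SELECT uses f.tailnum from the outer
# query, so it runs for every flights row -- nested-loop behavior.
rows = con.execute("""
    SELECT f.flight,
           (SELECT COUNT(*) FROM planes p
             WHERE p.tailnum = f.tailnum) AS n_match
    FROM flights f
    ORDER BY f.flight
""").fetchall()

print(rows)  # [(1, 1), (2, 1), (3, 0)]
```

Each flight gets its own match count; the zero flags the row a plain inner join would silently drop.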

The scripts showcasing the SQL queries for our call used R and its database interface, DBI, with the MonetDBLite SQL engine. The R package nycflights13, comprising five data.frames, provided the data foundation.

I recently upgraded the code to a Jupyter Notebook running Microsoft Open R 3.4.4. In addition, I wrote several functions to encapsulate the MonetDBLite SQL after pushing the data.frames into MonetDB tables. flightsdb, consisting of 336,776 records, is the "fact" table, with dimensions airlinesdb, airportsdb, planesdb, and weatherdb. Conveniently, count queries with flightsdb as the left table of an outer join illustrate well the concepts my nephew wanted to see. It's also easy to probe the hit and miss counts of these joins with correlated sub-queries. The remainder of this notebook details the code and results of the inquiry.
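The hits-and-misses idea can be compressed into two queries: a left outer join that counts matches and non-matches in one pass, cross-checked by a correlated sub-query. Sketched once more on the toy sqlite3 tables (not the real 336,776-row flightsdb):

```python
import sqlite3

# Toy "fact" and "dimension" tables standing in for flightsdb/planesdb.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flights (flight INTEGER, tailnum TEXT);
    CREATE TABLE planes  (tailnum TEXT, model TEXT);
    INSERT INTO flights VALUES (1,'N101'),(2,'N102'),(3,'N999');
    INSERT INTO planes  VALUES ('N101','A320'),('N102','B737');
""")

# Left outer join: COUNT(p.tailnum) ignores the NULLs produced by
# unmatched flights, so one query yields both hits and misses.
hits, misses = con.execute("""
    SELECT COUNT(p.tailnum) AS hits,
           COUNT(*) - COUNT(p.tailnum) AS misses
    FROM flights f
    LEFT JOIN planes p ON f.tailnum = p.tailnum
""").fetchone()

# Cross-check the miss count with a correlated NOT EXISTS sub-query.
misses_chk = con.execute("""
    SELECT COUNT(*) FROM flights f
    WHERE NOT EXISTS
          (SELECT 1 FROM planes p WHERE p.tailnum = f.tailnum)
""").fetchone()[0]

print(hits, misses, misses_chk)  # 2 1 1
```

The two routes agreeing on the miss count is exactly the kind of probe the notebook runs against flightsdb and its dimension tables.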

BTW, my nephew got the gist of outer joins and correlated sub-queries immediately. And now, armed with computation and analytic skills, he's looking to switch from academia to commercial data science. Some lucky company will reap the good fortune of hiring him.

Read the entire article here.



Comment by AJ Meyers on August 1, 2018 at 6:35pm

Steve - completely understand.  But with the way the platform is evolving toward analytics (containers of engines on the MPP backbone), keep it in mind when it seems you're spending more time on the tech and less on the problems.  I'm personally a fan of the open source progress on capabilities, but when it means going back to replicating data without regard for keeping the semantics aligned with the business, I try to look for other options.  This goes for all the other platforms; sometimes "legacy" isn't a bad alternative for the right need.

Comment by steve miller on August 1, 2018 at 1:34pm

AJ --

Thanks much for your note. My quick research suggests a very intriguing analytics environment.

As a data science consultant, though, I don't have much sway over my customers' database decisions; indeed most relational/analytic databases are in place before I engage. Customers look to me to provide guidance on DS tools such as R, SAS, Python, Tableau, Hadoop, Spark, H2O, etc. In many instances, customer environments are built almost exclusively on free open source software. My efforts are thus better focused on proficiency with those tools.

I worked with SAS software for 20 years before switching to first S and then R -- and actually still do a bit of SAS consulting now and then. And I have embedded R in PostgreSQL and also experimented with R/Python in SQL Server, but am generally wary of database lockin. As a consultant with Oracle Corporation many years ago, I well remember the competitions with Teradata -- winning some and losing some. A formidable competitor.


Comment by AJ Meyers on August 1, 2018 at 6:01am

Hi Steve - Great article

You and your nephew may want to try using the free version of Teradata's platform to do all of the same things you noted above with your Jupyter notebook, with a lot less additional coding and tooling.  The R engine is embedded (as is Python).  It's scaled back from a performance perspective but has all the functionality of the paid version.  I always believed that spending more time framing the question and understanding the approach and results was more important than the tools used to get there.  But today, being a tool jockey is still a necessary evil.

Comment by Ralph Winters on July 28, 2018 at 5:58am

Thank you for sharing this informative article about your nephew.

Comment by Rogelio Mamani on July 20, 2018 at 6:19am

Excellent article.

© 2018   Data Science Central ®   Powered by