The Data Science Toolkit - The Future Web Toolkit

There's a lot of confusing jargon in this new field, and plenty of buzzwords. It helps to know who some of the major players are and what services they offer. This list is a gentle introduction and far from exhaustive.

Amazon Web Services: Infrastructure as a service (IaaS). EC2 virtual servers, S3 storage, Mechanical Turk, analytics, and more.

Yandex: Russian competitor to Google. Recently launched the Cocaine cloud platform, which is based on Docker.

Salesforce: Customer Relationship Management (CRM). Acquired Heroku in 2010.

Heroku: Platform as a service (PaaS) for hosting Ruby on Rails, Node.js, Java, Python (Django), and more.

EMC: Enterprise content management (ECM) and big data analytics.

Netflix: An AWS success story. Has implemented several improvements that came out of its own Netflix Prize competition, a forerunner of Kaggle-style contests.

Kaggle: Bounty-style engineering competitions used by NASA and Wikipedia. Famous for the $3M Heritage Health Prize.

Zementis: Predictive analytics for big data and real-time scoring.

SAP: Enterprise resource planning (ERP). Enterprise service-oriented architecture (SOA).

CrowdFlower: A crowdsourcing leader. Categorization, content generation, image moderation, sentiment analysis, transcription, and more.

Rexer Analytics’s Annual Data Miner Survey: A must-read for newcomers.

For small-time data slingers, "the cloud" has a simple interpretation: we can rent AWS EC2 instances by the hour, and S3 storage costs pennies per gigabyte per month. No need to buy a new MacBook Pro. Whenever a project needs more than your old PC can handle, "move it to the cloud." Vagrant, Docker, and Ansible take away a lot of the hurry-up-and-wait that IT usually needs to configure a machine.

"Scalability" is an overloaded term, and chasing it usually leads to premature optimization. Don't get distracted by buzzwords. IT managers at Twitter, Netflix, and Oracle worry about the problems of data volume, velocity, and variety. We're not building Netflix, but we might work for them one day. The goal is to learn the popular tools that Big Data start-ups use without going into the fine details of robust deployment.

A great talk on optimization and scalability of analytics is David Schachter's "How to Speed up a Python Program 114,000 times."

Schachter starts with the low-hanging fruit, rewriting the analyst's poorly written code, and stops when his optimization gets tied to hardware. He also explains why Hadoop isn't viable for his purposes and makes fun of Twitter engineers for using LAMP for mobile communication.
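
To make "low-hanging fruit" concrete, here is a minimal sketch of the kind of cheap rewrite that often pays off first. It is not taken from Schachter's talk; the function names and the data are hypothetical. It only shows that swapping a list lookup for a set lookup turns a quadratic loop into a roughly linear one.

```python
import timeit


def common_ids_slow(a, b):
    """Quadratic: every `in` check scans the whole list b."""
    return [x for x in a if x in b]


def common_ids_fast(a, b):
    """Roughly linear: build a set once, then each lookup is O(1) on average."""
    b_set = set(b)
    return [x for x in a if x in b_set]


if __name__ == "__main__":
    # Hypothetical data: two lists of integer IDs with some overlap.
    a = list(range(0, 4000, 2))
    b = list(range(0, 4000, 3))

    # Both versions must agree before timing means anything.
    assert common_ids_slow(a, b) == common_ids_fast(a, b)

    slow = timeit.timeit(lambda: common_ids_slow(a, b), number=5)
    fast = timeit.timeit(lambda: common_ids_fast(a, b), number=5)
    print(f"list lookup: {slow:.4f}s, set lookup: {fast:.4f}s")
```

The exact timings depend on the machine, so none are quoted here. The point is that the fix costs one line and needs no new hardware.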

I stick to Python as much as possible. I also only want to learn tools that promise longevity. In this battle between old comfort and new features, JavaScript often beats Python. Node.js MV* web frameworks are increasingly popular, D3 is written in JavaScript, HTML5 apps are replacing desktop programs, and the language is a prime candidate for architects who are trying to "move code to the data." JavaScript is a safe bet for longevity.

My favourite resource for learning JavaScript was the epic series Crockford on JavaScript: https://www.youtube.com/watch?v=JxAXlJEmNMg
Crockford recently published a talk on concurrency called Monads and Gonads, which helps in understanding promises: https://www.youtube.com/watch?v=b0EF0VTs9Dc

But I'm a Data Scientist, not a web developer!

If you are in favour of specialization, then you already have an excellent career. But the aspiring Data Scientist who can build a website and scalable REST services is much more likely to get hired by a Big Data start-up. Nobody can know it all, but understanding the context of the tools helps your abilities fit well within the team.
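
For a sense of how small a REST service can start out, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the `/scores/<user>` route, the `SCORES` data, and the port are hypothetical, and a real service would use a proper framework and sit in front of a model or database.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory "model output"; a real service would query a model or database.
SCORES = {"alice": 0.91, "bob": 0.42}


def lookup(path):
    """Map a request path such as /scores/alice to a (status, payload) pair."""
    parts = [p for p in path.split("/") if p]
    if len(parts) == 2 and parts[0] == "scores" and parts[1] in SCORES:
        return 200, {"user": parts[1], "score": SCORES[parts[1]]}
    return 404, {"error": "not found"}


class ScoreHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Resolve the path, then send the payload back as JSON.
        status, payload = lookup(self.path)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Serves on localhost only; port 8000 is an arbitrary choice.
    HTTPServer(("127.0.0.1", 8000), ScoreHandler).serve_forever()
```

With the script running, a GET request to http://127.0.0.1:8000/scores/alice returns that user's score as JSON, and any other path returns a 404. Keeping the routing in a plain function makes it testable without starting a server.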

Back in 2006, Tim Berners-Lee described the Semantic Web as:
...an overlay of scalable vector graphics – everything rippling and folding and looking misty ... integrated across a huge space of data...

Web 3.0 has been elusive so far, but we're almost there. Mobile web apps are becoming increasingly data-intensive, while Data Scientists are learning how to build REST services and MV* websites. Both industries are starting to use the same tools, and their skill sets are overlapping. The semantic web is waiting on that "huge space" of integrated data, which will emerge as the two industries mature enough to converge into one.

My corollary to Berners-Lee's quote is:

Web 3.0 => The Data Science Toolkit == The Future Web Toolkit

Next month: My Boot Camp Curriculum

Tags: Tools, Web3.0, cloud


Comment by Anzar Hasan on March 4, 2014 at 12:49pm

Great information Peter. Keep it up....
