This new field is full of confusing jargon and buzzwords. It helps to know who some of the major players are and what services they offer. This list is a gentle introduction and far from exhaustive.
Amazon Web Services: Infrastructure as a service (IaaS). EC2 virtual servers, S3 storage, Mechanical Turk, analytics, and more.
Yandex: Russian competitor to Google. Recently launched the Cocaine server, based on Docker.
Salesforce: Customer Relationship Management (CRM). Acquired Heroku in 2010.
Heroku: Platform as a service (PaaS) for hosting Ruby on Rails, Node.js, Java, Python/Django, and more.
EMC: Enterprise content management (ECM) and big data analytics.
Netflix: An AWS success story. Has implemented several improvements based on Kaggle competitions.
Kaggle: Bounty engineering competitions used by NASA and Wikipedia. Famous for the $3M Heritage Health Prize.
Zementis: Predictive analytics for big data and real-time scoring.
SAP: Enterprise resource planning (ERP). Enterprise service-oriented architecture (SOA).
CrowdFlower: Crowdsourcing leader. Categorization, content generation, image moderation, sentiment analysis, transcription, and more.
Rexer Analytics’s Annual Data Miner Survey: A must-read for newcomers.
For small-time data slingers, "the cloud" has a simple interpretation: we can rent AWS EC2 instances by the hour, and S3 storage costs pennies a month per gig. No need to buy a new MacBook Pro: whenever a project needs more than your old PC can handle, "move it to the cloud." Vagrant, Docker and Ansible remove much of the hurry-up-and-wait that IT needs to configure a machine.
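To make the "rent by the hour" arithmetic concrete, here is a minimal sketch of a back-of-the-envelope cost estimate. The function name and both prices are placeholders I made up for illustration; they are not real AWS rates, so check the current pricing pages before budgeting anything.

```python
def estimate_monthly_cost(compute_hours, hourly_rate, storage_gb, storage_rate_per_gb):
    """Return a rough monthly bill in dollars: compute rental plus storage."""
    compute_cost = compute_hours * hourly_rate        # instances billed per hour used
    storage_cost = storage_gb * storage_rate_per_gb   # storage billed per gig per month
    return compute_cost + storage_cost


# Hypothetical placeholder prices, NOT actual AWS rates.
HYPOTHETICAL_HOURLY_RATE = 0.10    # dollars per instance-hour (assumed)
HYPOTHETICAL_STORAGE_RATE = 0.03   # dollars per gig per month (assumed)

if __name__ == "__main__":
    # Example workload: 40 hours of compute and 50 gigs stored for the month.
    cost = estimate_monthly_cost(40, HYPOTHETICAL_HOURLY_RATE, 50, HYPOTHETICAL_STORAGE_RATE)
    print(f"Estimated monthly cost: ${cost:.2f}")
```

The point is only that the bill scales with the hours you actually use, which is what makes renting attractive for occasional heavy jobs.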
"Scalability" is an overloaded term, which usually leads to premature optimization. Don't get distracted by buzzwords. Twitter, Netflix, and Oracle IT managers worry about the problems of data volume, velocity and variety. We're not building Netflix, but we might work for them one day. The goal is to learn the popular tools that Big Data start-ups use without going into the fine details of robust deployment.
A great talk on optimization and scalability of analytics is David Schachter's "How to Speed up a Python Program 114,000 times".
Schachter starts with the low-hanging fruit, rewriting the analyst's poorly written code, and stops when his optimization gets tied to hardware. He also explains why Hadoop isn't viable for his purposes and makes fun of Twitter engineers for using LAMPP for mobile communication.
Crockford recently published a talk on concurrency called "Monads and Gonads", which helps in understanding promises: https://www.youtube.com/watch?v=b0EF0VTs9Dc
But I'm a Data Scientist, not a web developer!
If you are in favour of specialization, then you already have an excellent career. The aspiring Data Scientist who can build a website and scalable REST services is much more likely to get hired by a Big Data start-up. Nobody can know it all, but understanding the context of the tools helps your abilities fit well within a team.
Back in 2006, Tim Berners-Lee described the Semantic Web as:
..an overlay of scalable vector graphics – everything rippling and folding and looking misty ...integrated across a huge space of data..
Web 3.0 has been elusive so far, but we're almost there. Mobile web apps are becoming increasingly data-intensive, while Data Scientists are learning how to build REST services and MV* websites. Both industries are starting to use the same tools, and their skill sets are overlapping. The semantic web is waiting on that "huge space" of integrated data, which will emerge as the two industries mature enough to converge into one.
My corollary to Berners-Lee's quote is:
Web 3.0 => The Data Science Toolkit == The Future Web Toolkit
Next month: My Boot Camp Curriculum