Three myths about data scientists and big data

Here are my reactions to the LinkedIn conversation about the "Statistical data scientists". In short, I believe that modern big data needs new ways of processing data, and that data scientists are polyvalent (versatile): they have deep expertise in a few domains across multiple disciplines, and they master the original data science core material found in my book.

Data scientists who know data engineering

Pretty much any data is special and has its own intricacies, not just click data. I think your disagreement with me stems from the fact that you want to narrow down the role of the data scientist to a traditional statistician, while I want to expand his competencies to include data engineering and business decisioning. Perhaps because of your experience with big companies, you tend to favor division of roles, but in start-ups you sometimes don't have business analysts, data engineers, data architects and statisticians, but just one guy: a data scientist. That has been my experience over the last 5 years, and it is also part of the lean philosophy - lean start-up, agile deployment and so on. You have one leader who manages a project (in my case a data scientist, sometimes with a CTO or VP of engineering) and employees all over the world, with a much lower compensation, who execute. In my case, as a data scientist, I've been involved on both sides - analytic scoring deployment (data flows, data plumbing, including real-time big data design) as well as back-end analytics (machine learning, hidden decision trees, cross-validation, testing, prototyping etc.).

Sampling versus big data

About sampling, you can incrementally increase your sample until the additional patterns that you discover provide little added value compared to the cost of processing bigger samples. But most importantly, you need to create a good sample. In my example (click data) you must include a bunch of affiliates, not just the top 3. And if you request advertiser data rather than pure ad network data, a much smaller sample will work. If advertiser data is not available, just create a few dummy (honeypot) advertiser accounts to start collecting more valuable data, and blend this data with your ad network data. Data scientists must be able to identify these issues, find the solution, recommend a strategy and implement it, including recommending which metrics are going to be captured, and how. It requires data engineering, business expertise, design of experiments and sampling knowledge. In my case, as a data scientist, I've actually created these advertiser accounts myself, tracked this external data independently, and factored in the cost of managing these dummy advertising accounts when sending my bill to the client. These are tasks that a data scientist should be able to perform.
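The incremental approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `grow_sample`, `distinct_patterns`, `growth` and `min_gain` are hypothetical, a "pattern" is reduced to a distinct (affiliate, keyword) pair, and the relative gain in distinct patterns stands in for the value-versus-cost trade-off, which in practice would need a real cost model.

```python
import random

def distinct_patterns(sample):
    """Count distinct patterns in the sample (here: distinct tuples)."""
    return len(set(sample))

def grow_sample(population, start=1000, growth=2.0, min_gain=0.01, seed=0):
    """Grow a random sample until new patterns add little.

    Assumption: stop when the relative increase in distinct patterns
    between two successive sample sizes drops below min_gain.
    Returns (final sample size, distinct patterns seen).
    """
    rng = random.Random(seed)
    shuffled = list(population)
    rng.shuffle(shuffled)  # a prefix of a shuffled list is a random sample

    size = min(start, len(shuffled))
    seen = distinct_patterns(shuffled[:size])
    while size < len(shuffled):
        # Always grow by at least one record so the loop makes progress.
        next_size = min(max(size + 1, int(size * growth)), len(shuffled))
        next_seen = distinct_patterns(shuffled[:next_size])
        gain = (next_seen - seen) / max(seen, 1)
        size, seen = next_size, next_seen
        if gain < min_gain:
            break  # the bigger sample revealed almost nothing new
    return size, seen
```

Note that this only addresses sample size. It does nothing for sample quality: if the population itself covers only the top 3 affiliates, no stopping rule will recover the missing ones.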

Real data scientists are both IT guys and statisticians, though they might not be familiar with all the flavors of modern logistic regression, nor with all the details of the hardware that makes data flow efficiently (but they should know about things like load balancing, yield optimization, data redundancy and compression algorithms when it comes to server, database, file management or web crawling performance).

Being a data scientist in a start-up versus big company

First, the data scientist in these start-ups generally earns well above $100k per year. In my case it was always above $150k, and I was not (by far) the most highly paid data scientist. These data scientists have broad cross-discipline knowledge (computer science, business, maths, operations research, statistics, engineering) as well as deep expertise in a number of topics in a number of disciplines. The knowledge is gained through experience or during college years: in my case, I was exposed to both computer science and statistics, in equal proportions, when I completed my PhD thesis in 1993, and then to engineering and database engineering when I started to work. My title used to be Chief Scientist. The title of data scientist did not exist at that time, but mine could just as well have been Executive Data Scientist.

Here's my answer to your second question (How do you move up and scale in such a start-up when you've hired people to do your job?). In my case, I am an inventor, so seeing all my work outsourced and done by others is a great reward (and I make sure I help with the transition and training, to make it a success). When I prototyped the click scoring engine and it went live, I moved to dashboard design (so clients could benefit from a useful dashboard to extract more value out of the scores), and then to integrating my scores in keyword bidding algorithms and creating a bidding engine (automated bids) as well as real-time scoring. The scores were useful for scoring new keywords or new affiliates with no historical data. Later on, I moved on to producing scores consistent across clients and over time; I wrote a few patents at that time and sold one. The progression went essentially in three directions: providing more value to the client, outsourcing my work as soon as it made sense, and automating. The whole architecture was designed (by the data scientist, me in this case) from the ground up with scalability in mind, with targets such as scoring a trillion clicks per year.

My salary did not go down, but up. Then I moved to another start-up, then another one. And then, like many inventors, I ended up creating my own company (a lean, scalable, profitable start-up with high margins and no employee on payroll in the US except my 12-year-old daughter), with job security and compensation currently 3 times above what it was in the earlier start-up environments, and better perks than Google data scientists (work from home, no meetings, great trips paid by the company, tax advantages due to the LLC setting, and 50% ownership instead of stock options - much of it financially architected by me, as I also act as the CFO, which is some kind of financial data science role).

At some points in my career, I consulted with big companies - Visa, eBay, Wells Fargo, Microsoft. I also gained a lot of valuable data science expertise from these work experiences.

On Scalability

Once you've reached 5 billion clicks per day in a simulated stress test, which is more than all the clicks you'll ever be asked to score, what's the point of scaling further? Sure, you could add impression data, but do you really want to multiply hosting/server costs by 100 to provide little added value?

And in my current start-up, content / traffic / services are hosted / served by vendors able to handle 10,000 times more than we could ever need. So scalability, for us, means increasing revenue (traffic, revenue and profits are currently increasing at 100% per year), better segmenting our audience to offer more customized packages to clients, getting more clients, more self-service clients, getting syndicated content, automating news feeds, adding account managers and paying higher fees to vendors as we consume more bandwidth. We do that. But the choice of vendors, to begin with, was based on how well they could host us should we grow by a factor of 100. I don't agree that scaling means more servers; that is not true if these services are outsourced. It can be perceived as "not scaling" because the scaling capability (technical scaling as opposed to business scaling) was built upfront, directly (in my first start-up) or indirectly (through vendors, in my current start-up).

Don't listen to those who seek to divide to conquer

I think many analytic professionals would benefit from getting the cross-discipline training that I am talking about. It is not difficult to acquire the essential knowledge of computer science, statistics, engineering and business needed to become a versatile employee, highly sought after by data science start-ups, or to succeed independently - much less difficult than many make it out to be (hint: start with a versatile college education, then get diversified industry experience). Indeed, that's the purpose of my data science apprenticeship: to help you quickly become a highly valued employee, rather than spending decades discovering everything by yourself (like I did) to eventually become successful, happy, and wealthy practicing data science.

Those who disagree have incentives to put people into buckets, mostly to protect their career, business model or university curriculum. You gain nothing from listening to those who divide to conquer, and everything from listening to those who unify to conquer.

How to become polyvalent

What I found useful during my PhD (this could apply to a master's program too) is that I immediately started to work for a company on GIS, digital cartography, and water management: predicting extreme floods locally (how much the water could rise, at worst in 100 years, at any (x,y) coordinate on a digital map) and modeling how any drop of water falling somewhere runs down, goes underground, eventually reaches low elevation and merges with other water drops on the way down. The digital maps had elevation and land use data available for each pixel; by land use I mean crop, forest, water, rock and so on, as this is important to model how water moves. Very applied and interesting stuff. My first paper (after an article about flood predictions, in a local specialized journal) was in the Journal of Number Theory, though I never attended classes on number theory. I then started to publish in computational statistics journals, but also in IEEE Transactions on Pattern Analysis and Machine Intelligence, and the Journal of the Royal Statistical Society, Series B. I'm currently finishing a book on data science (Wiley, exp. publication March 2014).
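To give a flavor of the "drop of water running down" idea, here is a minimal sketch of steepest-descent routing on an elevation grid. It is not the model used in that project, which the text does not describe in detail: the function name `trace_drop` is hypothetical, the grid is a plain list of lists of elevations, and infiltration, land use and merging of drops are all ignored.

```python
def trace_drop(elevation, start):
    """Follow a drop of water downhill on a grid of elevations.

    At each step the drop moves to the neighbor (out of 8) with the
    steepest downward slope, and stops at a cell with no lower neighbor.
    Returns the list of (row, col) cells visited, starting with `start`.
    """
    rows, cols = len(elevation), len(elevation[0])
    r, c = start
    path = [(r, c)]
    while True:
        best, best_slope = None, 0.0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    # Diagonal neighbors are farther away (sqrt(2) cells).
                    distance = (dr * dr + dc * dc) ** 0.5
                    slope = (elevation[r][c] - elevation[nr][nc]) / distance
                    if slope > best_slope:
                        best, best_slope = (nr, nc), slope
        if best is None:
            return path  # local minimum: water pools here
        r, c = best
        path.append(best)
```

Because every move goes strictly downhill, the path cannot loop and always ends. A more realistic model would use the land use of each pixel to decide how much water is absorbed along the way.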

The takeaway from this is that it helps you become polyvalent if, as a PhD/Master student, you can do applied work for a real company, hired and paid as a real employee (partnership between university and private sector), at the beginning of your program. In my case, it was a small R&D company (20 people), so I had the chance to be exposed to many things, not least learning how to write good code used by a team, for real apps (for instance merging hundreds of small images to produce a big map, rotating and filtering images taken by a plane, making sure roads were not broken when moving from one image to another, and putting the whole thing into some kind of hierarchical database to retrieve and display any portion of the map, including adjacent parts, very fast to the end user querying the database). This was in 1989.

But I still do theoretical research; I call it theoretical data science. It has direct practical applications, and research topics are dictated by problems in real data (from data to research) rather than the other way around. We organized a (theoretical) data science competition last year (with $1,500 in awards), and unlike Kaggle, we attracted top scientists, many from big companies indeed. The winner, Jean-Francois Puget, is an IBM Distinguished Engineer and an expert in mathematical optimization. You can read the details here (see first comment on the landing page, after the main article). More competitions to be announced this year!

Another way to become polyvalent is working (during your leisure time) on stuff that is related to multiple disciplines, such as steganography (the art of hiding messages in images or videos): you can gain knowledge in image analysis, compression techniques, computer science, file management systems, security, writing code, and statistics all at once. And it's an engineering and business topic by itself. Here's a link to get you started.
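As a small taste of steganography, here is a sketch of least-significant-bit (LSB) hiding, one of the simplest techniques. It is a toy illustration only: the names `hide` and `reveal` are hypothetical, the "image" is just a sequence of raw pixel bytes, and real images would need a decoder, plus a lossless format so that compression does not destroy the hidden bits.

```python
def hide(pixels, message):
    """Hide `message` (bytes) in the lowest bit of successive pixel bytes.

    Each pixel value changes by at most 1, which is invisible to the eye.
    Returns a new bytearray; the input is left untouched.
    """
    # Unpack the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover image too small for this message")
    stego = bytearray(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite the lowest bit
    return stego

def reveal(pixels, n_bytes):
    """Read back `n_bytes` bytes hidden by `hide`."""
    message = bytearray()
    for b in range(n_bytes):
        value = 0
        for i in range(8):
            value = (value << 1) | (pixels[b * 8 + i] & 1)
        message.append(value)
    return bytes(message)
```

The receiver must know the message length (here passed as `n_bytes`); a common variant stores the length in the first few pixels. Detecting such hidden bits is in turn a statistics problem, which is part of what makes the topic a good cross-discipline exercise.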

Comment by HuaHua Huang-Abt on January 28, 2014 at 10:19am

Thanks for sharing! My only question is: how do you decide what is a clean sample?

Comment by Bill Schmarzo on January 24, 2014 at 5:52am

Well, keep up the good work, Vincent.  I love your mission, your blog and the material that you share.  And I'm going to have to get your book.  All very outstanding!

Comment by Vincent Granville on January 23, 2014 at 5:46pm

I'm hoping to spread my knowledge (the useful portions) through my apprenticeship and my book, so more people will have a more diversified and relevant background. I believe it can be learned :-)

Comment by Bill Schmarzo on January 23, 2014 at 2:46pm

Wow Vincent, very nice and thorough write up! A key challenge is that folks like you are very rare (unicorns?), so most organizations (especially more traditional companies) are trying to address the data science opportunity as a team that covers the roles that you outlined above.  These organizations already have a stable of talented people, but no one of them has the scope and depth that you possess. For those types of organizations, the team approach seems to be the quickest way to start realizing the data science riches.  IMHO

© 2017 Data Science Central