Your Data Science Portfolio: Math Skills Don't Matter

TL;DR: A Data Scientist is a data pipeline plumber. Analytics are icing, not cake.

This article is written specifically for unemployed and underemployed graduates of math intensive subjects like physics and statistics. Others may have more to prove.

After writing my introductory reviews of ETL and visualization, I was going to write something about algorithms and analysis. Then it dawned on me: beyond proving that I'm not completely brain dead, my math skills NEVER helped me get a job. Also, I wrote a quick summary of bread-and-butter algorithms in a previous post.

Don't Drink The Kool-Aid

Predictive analytics is supposed to be the heart of a Data Scientist. It's a lie. Numbers are a figment of our imagination and math is a powerless spectre. It's the code that does the work. Computer Scientists are good at math too, so unless you have domain-specific expertise, your so-called 'analytics' contributions to the team are questionable at best. Computer time is money, which means your MATLAB code and R scripts may need to be rewritten in a language that can be optimized to run 10-1000x faster (Python, Julia, Go, FORTRAN, or C).

Diminishing Returns in a Juiceless Lemon

In building the analytics section of your portfolio, the bare minimum is enough. You only need to show basic competency with the most popular models. Placing dead last in a Kaggle competition shows awareness, initiative, and basic competency. Trying out different types of algorithms yields diminishing returns. Spending hours tweaking parameters is strictly hobby time. If you think that you can beat Ivy League researchers at their own game, by all means, set your eyes on the prize. Otherwise, be happy to see your name at the bottom of the Kaggle standings and spend your time learning ETL instead.

Script Kiddies > Actuarial Scientists

The Data Scientist who covets math skills as a jewel puts themselves in a precarious position: as soon as a company has its most important scripts written, the worker becomes redundant. As GitHub's code grows to cover more situations, employment opportunities shrink. Years from now, script aggregating tools will be so sophisticated that actuaries will find themselves competing against kids who scorn certification and degrees. Excel jockeys will be put out to pasture if they are JavaScript illiterate. Don't get caught on the wrong side of the fence when Data Science dies. As Miko Matsumura says, "be the developer, programmer, or entrepreneur."

Context to The Rescue

As I've said before, one of the beautiful things about working with data is that it provides concrete context. An employee who intimately understands the context of the company's data is indispensable. They are able to take one glimpse at a report and say "something's wrong here", cutting through hours or weeks of an analyst's work. Statistical models are built on pyramids of assumptions, and assumptions are famously brittle. As the sun sets on the Data Science marketing hype, the success of your transition into a new position will depend on how well you understand the intricacies of how your industry's data relates to the real world.

The Map Is Not The Territory; There's No Such Thing As Raw Data.

The "concrete context" is that of the domain the data sits in. "Raw data" has the closest connection to real devices taking real measurements, but the data aren't really raw - they're numbers, not things. Numbers aren't real. Every time we summarize or aggregate data, abstractions push the context up the pyramid of assumptions, further away from our physical realm.

Asset or Liability?

Those that are unable to compensate for each assumption as it shifts or breaks down quickly become a liability. The annals of Wall Street are full of stories like Merton and Scholes' LTCM. The data they were analyzing was information about information, reports about reports. They didn't realize their models and abstractions pushed them too far out of context to make sound decisions. High on ego, they put total faith in their formula and doubled down on debt when they should have hedged with something more stable. A novice mortgage broker could have seen their insanity.

The problem isn't a lack of mathematical acuity - Merton and Scholes invented the Nobel Prize-winning formula their company was exploiting. The problem is that the data traders swim in are only tenuously connected to reality. Financial analysts are unable to understand the assumptions that are baked into the many formulas used to build Wall Street's house of cards.

Plumbers, Not Tinkerers

When you strip away all of the hype and jargon that stems from differences in hardware and software, Data Science is fundamentally about one thing: building data pipelines. Most of our intricate problems of analysis have been solved by a myriad of open source software and commodity hardware, certainly within the 80/20 margins. As niche markets for specialists shrink, a wise tinkerer will round out his skill set and become a pipeline plumber to stay relevant.
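For readers who have never written one, here is a minimal sketch of what "pipeline plumbing" can look like, using only the Python standard library. Everything in it is an assumption made for illustration: the sample CSV, the column names `station` and `reading_c`, the `readings` table, and the function names are all hypothetical, and a real pipeline would read from files or an API rather than an inline string.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """station,reading_c
A,21.5
B,
A,22.1
C,not_a_number
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only rows whose reading parses as a float."""
    clean = []
    for row in rows:
        try:
            clean.append((row["station"], float(row["reading_c"])))
        except (TypeError, ValueError):
            continue  # drop blank or malformed readings
    return clean


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (station TEXT, reading_c REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run extract -> transform -> load, then return per-station averages."""
    load(transform(extract(text)), conn)
    query = (
        "SELECT station, AVG(reading_c) FROM readings "
        "GROUP BY station ORDER BY station"
    )
    return conn.execute(query).fetchall()
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` returns a single row for station A, because stations B and C have no parseable readings. The "analysis" is one `AVG`; nearly all of the code is about getting dirty data into a shape where that one line can be trusted.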

Three Pronged Portfolio

To summarize, there are 3 major components of a comprehensive Data Science portfolio. Here is an example that should be received with serious consideration by any Big Data company.

Transform: GitHub scripts for open data curation. Mailing lists 3.0.

Visualize: Examples of the canonical plots from the Visualization Zoo. Matplotlib is just fine.

Model: Kaggle competition. Bottom of the standings.

Views: 3580

Tags: CV, Portfolio, analytics, hype

Comment by Eric A. King on August 11, 2014 at 9:44am

Respectfully Vincent, I don't think we're connecting here -- so I'm going to step out.

And after having advertised for some while through your channels, I'm sorry to see you say that comprehensive public training and mentoring for actionable analytics (particularly with a strategic focus) doesn't exist. I thought you knew that my company specifically serves that underserved market. And if you were referring to academia, I'm not sure that it would be as pragmatic and commercially grounded as a consulting company that is actively engaged and seasoned in the field.

Anyway -- I do appreciate the interaction... but will be moving on.  :^}

Comment by Vincent Granville on August 11, 2014 at 9:08am

Eric, if you can fulfill all the roles - from product guy, COO, CFO, CEO, marketing executive - combining the skills of a PhD scientist, HR person, and an MBA executive with business and product development experience, as an entrepreneur, you can do better. At the very least, you can jump-start a company without using VC money. There's a powerful incentive (acquiring these skills) to increase your revenue. A while ago, I was even taking care of sales. Here by product, I mean data science product - in my case what I offer is open patents, source code and state-of-the-art research articles about data science, to help people develop better data science solutions. Revenue for now comes from advertising, though we could sell data as well, or black-box predictions accessible via an API.

Now as your business grows, you eventually outsource many things to partners, vendors and employees, or you just automate it. I completely outsourced sales, IT and internal analytics (not the data science research part). There's also a lot of automation going on: many of my tweets are automatically generated, and thanks to my magic (secret data science tricks), I have the fastest growing Twitter account in the data science community (and other growth records). Yet, I still remain - like all successful entrepreneurs - an extremely polyvalent professional. I know less than many data scientists (about detailed statistical algorithms), but I have a much broader skill set (including soft skills such as growth hacking). And I can develop disruptive data science - better, more robust, simpler and scalable than what you can find in textbooks (my Wiley book is an exception, of course) because I have a high level vision rather than my head buried in technical minutiae.

And part of my goal, with Data Science Central, is to help many people acquire these skills, via projects offered online or mentoring. No schools are teaching this stuff (as far as I know), but it does not mean that it can't be done.

Comment by Eric A. King on August 11, 2014 at 8:47am

Vincent - I think you're speaking for yourself and not the masses.  As the leader of an advanced analytics training and consulting firm who is constantly engaged with the commercial and public sector markets (and 24 years delivering solutions and mentoring directly in the analytics field), allow me to express our general observations:

  • True scientists certainly have their value -- no argument there. But they typically don't possess (or apply) the soft skills to adequately assess, plan, design, implement, interpret and operationalize at the strategic level.  This arrives at highly accurate models ...that rarely see full implementation and adoption.
  • Entrepreneurs (especially of start-ups) probably shouldn't be heads-down in the tactical minutiae of the analytic discipline even if they do have the capability -- not time well spent in their role to grow the practice.  They need to be reducing their roles as the organization grows.
  • Even if a domain expert is a seasoned analyst, s/he should contribute to the problem definition -- then be separated from model development to avoid bias in feature engineering, data preparation, modeling and validation.  They should then return to collaborate on results interpretation and translation for leadership.

In short, the vast majority truly doesn't possess adequate skills or experience to function effectively in all roles required in a successful analytic practice.  Moreover, it should be argued that no one should fulfill all roles, even if they could -- particularly in an enterprise environment.

If 'data scientist' is to survive as a title, it needs to narrow its scope to tactical roles... and only functions that uphold objectivity in the analytic process.  For those who wish to also perform at the strategic level, maintaining 'scientist' in the title is restrictive and a bit misleading.  In this case, there needs to be a more encompassing and generalized title. 

For these reasons, when people tell me they're a (self-proclaimed) 'data scientist', I have to drill much further to determine which subset of the five major functional areas they're truly equipped to support -- (as opposed to what they believe they can do, and what roles should be fulfilled concurrently). 

Meanwhile, I'll keep warning the community to turn on their smoke detectors and don an investigative hat anytime they encounter a 'data scientist'.  From that point, there's a great amount of work to do in order to uncover who they really are and what they can truly do.  Ironically, few who are seeking to hire 'data scientists' truly understand the functional requirements or real shortcomings of the analytic practice that they're seeking to staff.

It's a mess out there... but that's what puts food on my table (at least, for those who recognize and acknowledge their analytic dysfunction)!

Comment by Vincent Granville on August 11, 2014 at 6:54am

As a data scientist, I am a leader, user, analyst, domain expert, and IT guy - and also a business hacker. I believe all analytic executives in startups (especially founders) wear these 6 hats. How can a small startup have a data scientist focused on tactical, rather than strategic, initiatives?

My role in IT is limited, I essentially outsourced it to vendors and hired an IT guy, but I still have to stay on top of it, and be knowledgeable about industry trends, regarding IT.

Comment by Eric A. King on August 11, 2014 at 5:01am

In time, I believe the 'data scientist' title will dissolve for a few reasons:

  • Like 'big data', the term is too broad and undefined.  It means many different things to different people -- as clearly evidenced in this thread.
  • Too many have loosely donned the title simply because it's popular at present, and garners attention on resumes. It has become difficult to quickly discern who really holds what skills and experience -- and which of those is truly important to the role at hand.  As such, the title itself is quickly losing luster and credibility.
  • There are really 5 primary functions that must come together for truly effective data analysis.  Some functions entail multiple people, and in some cases a person can cover multiple functions.  But there is no person I've encountered (and there should not be) who can encompass all 5 functional areas (IT, leadership, user, analyst, domain expert).  At most, I consider a 'data scientist' to cover up to three of those functions... most often the latter three.  

And to me, 'scientist' entails a highly technical and tactical role -- not synonymous with leadership.  And to avoid introducing bias into an otherwise objective model, the analyst and domain expert should collaborate -- but not be the same person.   Those who argue that a 'data scientist' should entail all aspects of an analytic practice are describing a team, not a person -- and it shouldn't include the 'scientist' aspect, as successful analytics must include a substantial strategic component.  Most who talk about data science overlook or grossly understate the strategic components at their own peril.

So, in my view, if the 'data scientist' title were to survive, I believe it is most fitting as a more seasoned analyst who follows strict tactical analytic process -- and interacts effectively with the other 3 or 4 roles on the overall analytic team.

Comment by Peter Higdon on August 8, 2014 at 7:05am

I'm surprised by Sean's comment. I agree with all of what he wrote. Much of my writing is motivated by opposition to the Hadoop + Mahout + Tableau = "Look, I'm a Data Scientist!" mindset.

I should clarify that by "ETL" I actually meant "data curation tools." Data curation gets the least media attention, and that's why I wrote the article in this way. For example I wrote a post each for curation/visualization/analysis, and the view count in the first week of posting each entry was approximately 500/1k/1k.

The target audience for this article was stated as very specific: underemployed graduates from math intensive subjects like statistics and physics (applied math, pure math, etc). My argument is that there is a cohort of people that DO need to downplay their math skills. They exclusively use Windows and MATLAB for data analysis and they might not be aware of how much more work they need to put in to round out their skill set.

I was trying to make the point that context should be the main point of reference for every project and that curation is the most marketable skill for people who already have a strong math education.

I apologize for any miscommunication.

Comment by William J McKibbin on August 7, 2014 at 7:00pm

Yes, the ETL and presentation layers are critical -- but do not downplay the importance of the analytics layer that occurs between these two tasks -- and it is in the analytics layer that I rely most on my mathematics skills -- to do my job, I have to be versed in descriptive and predictive statistics, data-fitting, hypothesis testing, cluster analysis, factor analysis, regression analysis, as well as the language that accompanies these tasks -- I would not want to skip my mathematics preparation as a data scientist -- however, I do acknowledge that the ETL and presentation layers are equally important -- we data scientists have to be able to do all of these tasks in an iterative manner -- thanks for the opportunity to comment.

Comment by Martin Squires on August 7, 2014 at 10:11am

Customer and marketing analysis at its simplest breaks down into 3 stages: pulling your data together, doing the analysis/building the models and communicating insights. The split in terms of analytical time and effort has usually been around 40/40/20 while the balance organisations want is nearer 20/40/40 i.e. they want their best strategic analysts to get access to data as quickly as possible, which in turn explains the amount of money spent on data warehouses.

The current situation where big data and the opening up of unstructured data sources etc has led to new skills being required in the data processing stage is an anomaly. Problems will be solved and tools built to put this new data into the hands of less technical users (yes, there is serious money to be made building those solutions and if that's where people see their careers then limited maths may be ok). The skills gaps organisations will still be struggling to fill are to support stages 2 & 3. We need people who can take complex business issues, apply their analytical skills and create insightful models and strategies. If you want to trade "good enough" maths off anywhere then being great at stage 3, creating and communicating insights, is where the blend could be improved. I've never been bothered, beyond table stakes, what software analysts have used and how well they know their code, but I love interviewing the ones who can tell me why, if I hire them, they'll help me meet and exceed my company's strategic goals and KPIs.

Comment by Sean McClure on August 7, 2014 at 9:25am

This article is so out-of-touch with what Data Science is I struggle to justify the time it will take to comment on it. First off, any Data Scientist worth their salary knows ETL as we know it is virtually dead and only here because of old legacy applications (built by people concerned with "pipelines") and outdated BI (what's BI?...exactly) bloated data warehouses that are good for...building dashboards that nobody uses?  Building software with no connection to the science that underlies WHY algorithms do what they do, and the assumptions they make about the knowledge representation that surfaces, is like building a rocket with no payload.  Engineering without science (the language of which is math) is just a pretty tool that does nothing truly useful or competitive. 

Software alone is nothing but building a tool that allows organizations to do what they have always done more efficiently...and probably with a prettier interface. If companies wanted to do what they have always done they wouldn't hire us. Writing code has absolutely nothing to do with the strategic direction of the company or with bringing fresh insight into how an organization can compete analytically. 15 year old children can write an App and sell it for millions. How many 15 year olds are winning Kaggle competitions?  Black boxing machine learning techniques is assuming that everyone's data is going to look the same.  Anyone who has actually worked with real-world data knows this is not even close to being true. EVERY situation requires a deep understanding of the mathematical frameworks that underlie the approaches as it is the MATH that uncovers the structure in the data, learns the concepts of the domain, and generalizes out to unseen instances.  It is the math that is making all the assumptions and all the predictions.

Science is here making all the difference because we finally have the volume and variety of data to apply our scientific theories in machine learning and AI to real-world data.  This requires, above all else, a deep understanding of the science and mathematics of how these algorithms work. It requires a deep understanding of the scientific approach to problem solving and vetting out hypotheses. This has nothing to do with coding or pipelining...these are mere vehicles that deliver the goods. Data Science is about math and science. Building an ETL pipeline (do people still use these?) with no science is like a rocket with no payload. What are you going to deliver with your fancy pipeline? Packaged algorithms that nobody understands and therefore have no relevance to the business you're building it for?  Is your client going to compete analytically because you pressed the go button on some vendor's so-called machine learning development kit?

Anyone can build a Hadoop cluster and lay some Mahout on top...then tell the client they are doing Data Science. This is beyond dishonest and taking advantage of the Data Science hype.  You have to do SCIENCE to model out the massive amounts of data underlying the client's business. You have to understand the assumptions being made by any data cleaning and modeling techniques applied. You must understand how to PROPERLY parameterize models to predict incoming data, and adapt to how the markets change.  You need to understand the trade-offs when managing various models with varying prediction accuracies.  This is all spoken in the language of MATH.

The math degree is the #1 degree ranked by salary + work-life balance.  Anyone focused on mere tool-building should stay away from Data Science.  The day will come when real scientists ask you about "your" models...what will you say? "They seem to work well with my pipeline and cluster"???. Good luck with that. 

© 2014   Data Science Central