Capture
Programming and Database skills
Familiarity with Hadoop, Java, Python, SQL, Hive, and Pig, and the ability to use them, are core essentials. Programming, and computer science in general, is the starting point for gathering data and understanding how to "get" data and piece it together. Just moving data is its own specialty, reserved for ETL (extract, transform, load) specialists. ETL tools include Informatica, MS SSIS, and Teradata bulk loading tools, among others. If you can't GET data, you sure can't analyze it. And you sure can't expect somebody else to capture it for you.
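To make the extract, transform, load idea concrete, here is a minimal sketch using only Python's standard library (`csv` and `sqlite3`). The data, the `orders` table, and the function names are all made up for illustration; a real ETL job built on Informatica, SSIS, or similar would look quite different.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """order_id,customer,amount
1,acme,19.99
2,globex,5.00
3,acme,
"""

def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and cast the types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        clean.append((int(row["order_id"]), row["customer"], float(row["amount"])))
    return clean

def load(conn, records):
    """Load: write the cleaned records into a SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
load(conn, transform(extract(RAW_CSV)))

# Once the data is loaded, plain SQL can answer questions about it.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The point is only the shape of the work: pull the raw data, clean it, and land it somewhere you can query.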
Business Domain Expertise and Knowledge
Understanding the business data itself is its own special domain expertise that only comes with working in that data domain. Medical data is different from ecological data which is different from all the varieties of business data. This only comes from studying and asking lots of questions while working in that particular field.
Data Modeling, Warehouse, and Unstructured Data Skills
Knowing the difference between a fact table that is put together well and one that is faulty, with semi-structured, unconstrained keys, makes all the difference in how easily you can trust and massage the data you're trying to capture. Knowing the validity and proper use of each of the dimensions is also key to leveraging any star-schema data structure. Unstructured data is another story: you may have to figure out and organize a staging layer yourself before the data is even useful. If you can't get through these things, you can't begin making propositional sense of the data you want to analyze.
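One quick way to tell whether a fact table can be trusted is to look for keys that don't resolve to a dimension row. The sketch below assumes a toy star schema with a hypothetical `fact_sales` table and `dim_customer` dimension (both names are invented here) and uses SQLite from Python's standard library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
-- No foreign key constraint on purpose: this is the "unconstrained" case.
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL);
INSERT INTO dim_customer VALUES (1, 'acme'), (2, 'globex');
INSERT INTO fact_sales VALUES
    (10, 1, 19.99),
    (11, 2, 5.00),
    (12, 99, 7.50),   -- key 99 has no dimension row
    (13, NULL, 3.25); -- key is missing entirely
""")

# Fact rows whose key is missing or has no matching dimension row.
orphans = conn.execute("""
    SELECT f.sale_id, f.customer_key
    FROM fact_sales AS f
    LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
    WHERE d.customer_key IS NULL
    ORDER BY f.sale_id
""").fetchall()
print(orphans)
```

Any rows that come back are facts you cannot safely roll up by that dimension until the keys are repaired or mapped to an "unknown" member.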
Analyze
Statistical Tool Skills
Using R, Excel, SAS, or other tools to piece together your propositions and discover potential patterns and correlations through statistics is the heart of working data to discover and apply your creativity. This is where true genius can shine, but learning the tools is the first essential grind. If you can't use the tools, you can't analyze the data. You could use paper and pencil or even a fancy calculator if you've got the math skills down cold.
Math skills
Understanding correlation, multivariate regression, and all the ways of massaging data together to look at it from different angles for use in predictive and prescriptive modeling is the backbone knowledge that's really step one of revealing intelligence. Nothing more to say. If you don't have this, all the data collection and presentation polishing in the world is meaningless.
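For anyone who wants to see the math without a stats package, here is a small sketch of Pearson correlation and a one-predictor least-squares fit in plain Python. The numbers are invented, and this is the simplest case only: multivariate regression adds more predictors and is normally done with R, SAS, or a numerical library rather than by hand.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Made-up example data: ad spend versus sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(ad_spend, sales)
slope, intercept = fit_line(ad_spend, sales)
print(f"r = {r:.3f}, sales = {slope:.2f} * ad_spend + {intercept:.2f}")
```

A high correlation here says the two series move together; it says nothing about whether one causes the other, which is where the domain knowledge comes back in.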
Present
Visualization Tool Skills
Potential list includes Flare, HighCharts, AmCharts, D3.js, Processing, Google Visualization API, Tableau, Excel, PowerPoint and Raphael.js (?). Most of those I admittedly don't know. Tableau and Excel should provide you with basic enough tools. Heck, if you're good, MS Paint will work just fine.
Storytelling Skills
This is that special set of soft skills that nobody can quite pin down. It's the art and communication side, the holistic, human side, of the complete data scientist package. This is what makes the difference between a geek scientist and a business-savvy Data Scientist of the sexy bent that's valued highly, with the pay and executive respect to match. When you can come into a meeting and throw up a PowerPoint presentation with an introduction, a proposition, and a revelation in business terms that tells the business what's wrong and what's right and how money is being made and lost, you've earned your income. The trick and the value lie in that elusive, almost lost art of storytelling.
Go sit on the porch with Grandpa and get him to tell some stories. Listen to how he sets them up, builds upon them, and then delivers the punch lines. You can still learn if you can put your analytic mind aside for a while. It's the ART of the holistic ART of Data Science. Without it, you might as well just wear a lab coat. With it you can wear your sunglasses at night.
Other Opinions and Lists
Is the garden-variety spreadsheet jockey a data scientist? Yes, to the extent that they build statistical models and use the tool to find non-obvious patterns in structured data, they are engaging in a form of data science. But if this exploration is not their primary job function, they are merely dabbling, not specializing.
Is BI report-building or OLAP cube-development data science? No. Those endeavors, although important, revolve around obvious data patterns — obvious in the sense that an organization has chosen to embed them in repeatable views and access patterns.
Data science is all about asking questions. You engage in it whenever you interactively and iteratively search for deep, hidden patterns.
- Analytical skill-set
- Mathematics / statistics (including experimental design)
- Domain knowledge (i.e., industry-specific processes where analytics are applied)
- Technology / data
- Communication skills (story-telling)
- Curiosity (willingness to challenge the status quo)
- Collaboration
- Commercial acumen / Strategic
- Customer-centric
- Problem-solving skills
- Proactive
- Diverse Technologies: Hadoop, Java, Python, C++, ECL, NoSQL, HBase, CouchDB
- Mathematics
- Business Skills
- Visualization: Flare, HighCharts, AmCharts, D3.js, Processing, Google Visualization API, and Raphael.js
- Innovation
- SQL
- Statistics
- Predictive modeling
- Programming (probably Python)
Further advice on what it takes to be a Data Scientist, from practitioners at Netflix, Orbitz and Hortonworks:
- Know the core competencies
- Know a little more
- Embrace online learning
- Learn to tell a story
- Prepare to be tested (aka “Your pedigree means nothing”).
- Exercise creativity
- Commitment
- Creativity
- Business savvy
- Presentation
- Intuition
- Open-source tools (G)
- Statistics (A)
- Presentation (P)
Michael Driscoll, Secrets of the Successful Data Scientist
Core Curriculum (which tells you a lot of what they think it takes in skills to do this stuff):
CIS 317-DL Database Systems Design & Impl
This course covers the fundamentals of database design and management. Topics include the principles and methodologies of database design, database application development, normalization, referential integrity, security, relational database models, and database languages. Principles are applied by performing written assignments and a project using an SQL database system.
CIS 435-0 Data Warehouse & Data Mining
This course provides an introduction to data mining, with a few hours of focus on data warehousing as one of the commonly used data sources for data-mining applications. Students learn data-mining applications, core concepts, and algorithms. Among these algorithms are supervised (Naive Bayes, Decision Tree, and Neural Network) and unsupervised (Association Rules, commonly used for market basket analysis, and Clustering) algorithms. Students learn via experimentation; they observe the outcome of applying data-mining algorithms to real-life data.
PREDICT 401-DL Statistical Analysis
Students learn to apply statistical techniques to the processing and interpretation of data from various industries and disciplines. Topics covered include probability, descriptive statistics, study design and linear regression. Emphasis will be placed on the application of the data across these industries and disciplines and serve as a core thought process through the entire Predictive Analytics curriculum.
PREDICT 410-DL Predictive Modeling I
This course introduces statistical models as they are used in predictive analytics. The course reviews traditional linear and generalized linear models, including multiple regression and logistic regression. It addresses issues of model specification and model selection, as well as best practices in developing models for management. The course also demonstrates the application of multivariate methods in predictive analytics.
PREDICT 411-DL Predictive Modeling II
Drawing upon examples from economics and business, this course provides an in-depth review of modeling practice. Special attention is paid to linear predictor and error structure specification for time series models. The course reviews econometric methods, including maximum likelihood estimation, two-stage and three-stage least squares, seemingly unrelated regressions, and simultaneous equation estimation. The course shows how to use autoregressive integrated moving average (ARIMA) models in time series forecasting. The course also demonstrates the application of survival/duration analysis in predictive analytics.
LEADERS 481-DL Leadership
The purpose of this course is to identify the fundamental leadership behaviors that enable people to excel in their careers, and to help students apply these behaviors to personal and professional success. The course builds from the basic premise that leadership is learned, and looks at the theory and practice of leadership at the individual and organizational level. The course will explore definitions of leadership, the importance of leadership, leadership styles, the role of vision and integrity, the importance of giving and receiving feedback, how to lead change and solve problems, effective teamwork, and communication strategies.
PREDICT 402-DL Analytics and Data Collection
This course will describe the appropriate uses of analytics and its limitations while defining how to approach the various stakeholders within an organization. Included will be a review of the ethical, regulatory, and compliance issues related to a given business problem and/or solution. Time will be spent interpreting performance-based organizational issues while concurrently identifying solutions for these same performance-based organizational issues. In addition, time will be spent identifying best practices to plan for engaging, implementing, and sustaining organizational change.
Happy modeling! :)