Guest blog by James Kobielus. James is IBM’s Big Data Evangelist. He is an industry veteran who spearheads IBM’s thought leadership activities in big data, data science, enterprise data warehousing, advanced analytics, Hadoop, business intelligence, data management, and next best action technologies. Prior to joining IBM, he was a leading industry analyst, with firms including Forrester Research, Current Analysis, and Burton Group. He has spoken at such leading industry events as IBM Information On Demand, IBM Big Data Integration and Governance, Strata, Hadoop Summit, and Forrester Business Process Forum. He has published several business technology books and is a very popular provider of original commentary on blogs, podcasts, bylined business/technology press publications, and social media.
The experience of being a working data scientist is not necessarily what people think. A profession that some regard as “sexy” is, more often than not, a difficult job involving long hours, tight budgets, limited staff, daunting tasks, shifting requirements, endless meetings, and inflated expectations.
For the working data scientist, pain points may dominate the fabric of their experience. High-performance data scientists are those who automate, accelerate, and streamline the more tedious aspects of their jobs so that they can focus on finding data-driven insights. They will embrace any tool, platform, or approach that can help free up mental bandwidth for tasks that demand their creativity and judgment.
Data scientists do exceptionally complex work. Their productivity depends on having access to tools and practices they can use to streamline and accelerate the details in which they immerse themselves. As discussed in this recent IDG News article, the most fulfilling experiences of high-performance data scientists fall into three broad categories:
- Learning: This is the core value that data scientists deliver: learning what insights the data may reveal and what relevance they may have to the business problem at hand. According to a data scientist who was quoted in the article, “The first step is understanding the area — I’ll spend a lot of time searching the literature, reading, and trying to understand the problem.” It also involves continual reassessment of the available and appropriate data-science computational approaches, algorithms, tools, platforms, and services necessary to tackle these problems effectively within constraints of time, budget, and staff.
- Collaborating: This is the process by which data scientists engage with team members, colleagues, customers, and stakeholders. These activities, such as meetings and emails, often consume a substantial part of the data scientist’s day. They involve everything from identifying a client’s business problem to assessing available data, tracking progress, discussing reports, sharing findings, explaining results, and putting the insights into an actionable business context. Productivity in this respect depends on data scientists’ consultative skills: their ability to guide stakeholders through the process of identifying how data-driven insights can drive disruptive business outcomes. In the words of another data scientist quoted in the cited article, “A lot of people know they need help with data, but they don’t know what they can do with it. It feels like being a magician, opening their minds to the possibilities. That kind of exploration and geeking out is now my favorite part.”
- Creating: These are the nitty-gritty data science tasks such as discovering and preparing data, building and refining statistical models, visualizing and assessing findings, and developing data-driven applications. Productivity in this respect depends on the data scientist’s ability to leverage high-performance data mining, predictive analytics, machine learning, artificial intelligence, and cognitive solutions to automate these tasks. It also depends on their ability to determine the appropriate analytics technique for addressing classes of business problems, with a clear understanding of both basic and advanced data mining techniques, ranging from regression analysis, cluster analysis, decision trees, neural networks and Bayesian machine learning methods to optimization, simulation and stochastic analysis.
One of the most frustrating experiences for any data scientist is having to work with disparate, fragmented tools and platforms in support of various lifecycle tasks, such as source discovery, data preparation, statistical modeling, and visualization. Considering that data science is increasingly a team-oriented discipline, it’s essential that diverse data professionals (including statistical modelers, data engineers, business analysts, visualization designers, app developers, and others) be able to pool their efforts within open, collaborative environments. The collaboration tools that data scientists employ can make a huge difference in whether they experience productivity or frustration in their daily tasks. Lack of an integrated development environment has unfortunate consequences for data scientist productivity: results that are unshareable, team collaborations that are awkward, and cross-project visibility that is limited or non-existent.
To power high-performance data science, an integrated development environment should facilitate the following development tasks:
- Acquire data from diverse data lakes, big data clusters, cloud data services and more
- Discover, acquire, aggregate, curate, prepare, pipeline, model and visualize complex, multistructured data
- Prototype and program data applications for execution in in-memory, streaming and other low-latency runtime environments
- Tap into libraries of algorithms and models for statistical exploration, data mining, predictive analytics, machine learning and natural language processing, among other functions
- Develop, share and reuse data-driven analytic applications as composable microservices for deployment in hybrid cloud environments
- Secure, govern, track, audit and archive data, algorithms, models, metadata and other assets throughout their lifecycles
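As a rough illustration of a few of the tasks listed above (acquire, prepare, model), here is a minimal Python sketch using only the standard library. It is not DSX code and does not reflect any particular product's API. The sample data, the column names `ad_spend` and `revenue`, and the helper functions `acquire`, `prepare`, and `fit_line` are all hypothetical, chosen only to show how these steps chain together.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for a real data source.
RAW_CSV = """ad_spend,revenue
1.0,3.1
2.0,4.9
,7.5
3.0,7.0
4.0,9.1
"""

def acquire(text):
    """Read rows from CSV text (a stand-in for a data lake or cloud service)."""
    return list(csv.DictReader(io.StringIO(text)))

def prepare(rows):
    """Drop rows with missing values and convert the fields to floats."""
    clean = []
    for row in rows:
        if all(row[key] not in (None, "") for key in ("ad_spend", "revenue")):
            clean.append((float(row["ad_spend"]), float(row["revenue"])))
    return clean

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

rows = acquire(RAW_CSV)
points = prepare(rows)
slope, intercept = fit_line(points)
print(f"rows acquired: {len(rows)}, rows kept: {len(points)}")
print(f"revenue ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```

When each step lives in a separate tool, the hand-offs between these functions become manual exports and imports, which is the kind of friction an integrated environment is meant to remove.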
Now in open beta, IBM Data Science Experience (DSX) delivers all of these capabilities in an open and integrated environment for team data science. DSX provides the following productivity features for next-generation data science:
- An interactive, cloud-based, scalable and secure visual workbench for consolidating open-source tools, languages and libraries and for collaborating within teams to rapidly put high-quality data science applications into production
- Access to open-source tools and libraries, including Spark, R, Python and Scala, as well as solutions from IBM, IBM partners such as RStudio and H2O.ai, and others through an extensible architecture
- A unified environment for data scientists and other analytics developers that allows them to connect with one another while accessing project dashboards and learning resources, forking and sharing projects, exchanging development assets (datasets, models, projects, tutorials and Jupyter notebooks) and sharing results, with follow-on releases aiming to include comments, user profiles, data science competitions, Zeppelin notebooks and real-time collaboration
- Built-in connectivity to diverse data sources as well as simplified data ingestion, refinement, curation and analysis capabilities, with follow-on releases aiming to include new features such as data shaping, Spark pipeline deployment, SPSS analytic algorithms, automated modeling and data preparation, model management and deployment, advanced visualizations, text analytics, geospatial analytics and integration with Watson Analytics
For more information on DSX and to participate in the open beta, please visit this page.