A Classic Data Science Project and approach looks like this:
Data Science (DS) and Machine Learning (ML) are the backbone of today’s data-driven business decision-making.
From a human viewpoint, an ML project consists of multiple phases, from gathering requirements and datasets, to deploying a model, to supporting human decision-making. We refer to these stages together as the DS/ML lifecycle. There are also various personas in the DS/ML team, and these personas must coordinate across the lifecycle: stakeholders set requirements, data scientists define a plan, and data engineers and ML engineers support with data cleaning and model building. Later, stakeholders verify the model, domain experts use model inferences in decision-making, and so on. Throughout the lifecycle, refinements may be performed at various stages as needed.
The work is so complex and time-consuming that there are not enough DS/ML professionals to meet job demand, and as much as 80% of their time is spent on low-level activities such as tweaking data, trying out algorithmic options, and tuning models. These two challenges, the dearth of data scientists and the time-consuming low-level activities, have stimulated AI researchers and system builders to explore an automated solution for DS/ML work: Automated Data Science (AutoML). Several AutoML algorithms and systems have been built to automate various stages of the DS/ML lifecycle. For example, automation of ETL (extract/transform/load) tasks has been applied to the data readiness, pre-processing, and cleaning stage, and has attracted research attention. Another heavily investigated stage is feature engineering, for which many new techniques have been developed, such as deep feature synthesis, one-button machine, reinforcement learning-based exploration, and historical pattern learning.
However, such work often targets only a single stage of the DS/ML lifecycle. For example, Auto-WEKA can automate the model building and training stage by automatically searching for the optimal algorithm and hyperparameter settings, but it offers no support for examining the quality of the training data, which is a critical step before training starts. In recent years, a growing number of companies and research organizations have started to invest in automation across the full end-to-end DS/ML lifecycle. For example, Google released its version of AutoML in 2018. Startups such as H2O and DataRobot have both introduced products. There are also Auto-sklearn and TPOT from the open-source community. Most of these systems aim to support end-to-end DS/ML automation. Dataiku has become a leading Enterprise AI platform and offers a fresh perspective of its own, and many other platforms are emerging.
Current capabilities are focused on the model building and data analysis stages, while little automation is offered for the human-labor-intensive and time-consuming data preparation or model runtime monitoring stages. Moreover, these works currently lack an analysis from the users’ perspective: Who are the potential users of envisioned full automation functionalities? Are they satisfied with the AutoML’s performance, if they have used it? Can they get what they want and trust the resulting models?
At the end of this article, we will look at how Dataiku and other AI platforms address these needs.
Data Science Team and Data Science Lifecycle:
Data science and machine learning are complex practices that require a team with interdisciplinary background and skills. For example, the team often includes stakeholders who have deep domain knowledge and own the problem; it also must have DS/ML professionals who can actively work with data and write code. Due to the interdisciplinary and complex nature of the DS/ML work, teams need to closely collaborate across different job roles, and the success of such collaboration directly impacts the DS/ML project’s final output model performance.
Data Science Automated:
A DS/ML project often starts with requirement gathering and problem formulation, followed by data cleaning and engineering, model training and selection, model tuning and ensembling, and finally deployment and monitoring. Automated Data Science (AutoML) is the endeavor of automating each stage of this process, separately or jointly. The data cleaning stage focuses on improving data quality. It involves an array of tasks such as missing value imputation, duplicate removal, noise correction, and the handling of invalid values and other data collection errors. AlphaClean and HoloClean are representative examples of automated data cleaning. Automation can be achieved through approaches like reinforcement learning, trial-and-error methodology, historical pattern learning, and, more recently, knowledge graphs. The hyperparameter selection stage is used to fine-tune a model or the sequence of steps in a model pipeline. Several automation strategies have been proposed, including grid search, random search, evolutionary algorithms, and sequential model-based optimization methods.
AutoML has witnessed considerable progress in recent years, in research as well as in commercial products. Various AutoML research efforts have moved beyond automating one specific step. Joint optimization, a family of Bayesian-optimization-based algorithms, enables AutoML to automate multiple tasks together. For example, Auto-WEKA, Auto-sklearn, and TPOT all automate the model selection, hyperparameter optimization, and ensembling steps of the data science pipeline. The result produced by such an AutoML system is called a “model pipeline”. A model pipeline is not only about the model algorithm; it also covers the data manipulation actions (e.g., filling in missing values) performed before the model algorithm is selected, and the model improvement actions (e.g., finding the best values for the model’s hyperparameters) performed after it is selected.
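To make two of the hyperparameter strategies named above more concrete, here is a minimal sketch of grid search and random search in plain Python. The objective function, the parameter names (learning_rate, depth), and their ranges are all made up for illustration; in a real project the objective would train a model and return its validation error.

```python
import itertools
import random


def validation_error(learning_rate, depth):
    """Toy stand-in for 'train a model and return its validation error'.

    Hypothetical objective whose minimum sits at learning_rate=0.1, depth=6.
    """
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_error(**config)
        if best is None or score < best[0]:
            best = (score, config)
    return best


def random_search(bounds, n_trials, seed=0):
    """Sample configurations at random from the given ranges and keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        config = {
            "learning_rate": rng.uniform(*bounds["learning_rate"]),
            "depth": rng.randint(*bounds["depth"]),
        }
        score = validation_error(**config)
        if best is None or score < best[0]:
            best = (score, config)
    return best


grid_best = grid_search({"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 6, 10]})
random_best = random_search(
    {"learning_rate": (0.01, 1.0), "depth": (2, 10)}, n_trials=20
)
```

Grid search only ever tries the values it is given, while random search can land anywhere inside the ranges. Sequential model-based methods go one step further and use the results of earlier trials to decide what to try next, which this sketch does not attempt.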
Among these advanced AutoML systems, Auto-sklearn and Auto-WEKA are two open-source efforts. Both use a sequential parameter optimization algorithm. This optimization approach generates model pipelines by selecting a combination of model algorithms, data pre-processors, and feature transformers. Their system architectures are both based on the same general-purpose algorithm configuration framework, SMAC (Sequential Model-based Algorithm Configuration). In applying SMAC, Auto-sklearn and Auto-WEKA translate the model selection problem into a configuration problem, where the choice of the algorithm itself is modeled as a configuration parameter.
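The idea of treating the algorithm choice as just another configuration parameter can be sketched in a few lines. Everything below is hypothetical: the search space, the algorithm names, and the scoring function are invented for illustration, and the search is a plain exhaustive enumeration rather than SMAC's model-based search.

```python
import itertools

# Hypothetical search space: the algorithm is itself a (categorical) parameter,
# and each algorithm brings its own conditional hyperparameters.
SEARCH_SPACE = {
    "decision_tree": {"max_depth": [2, 4, 8]},
    "knn": {"n_neighbors": [1, 5, 15]},
}


def all_configurations(space):
    """Flatten 'which algorithm' and 'which hyperparameters' into one list."""
    for algorithm in sorted(space):
        params = space[algorithm]
        names = sorted(params)
        for values in itertools.product(*(params[name] for name in names)):
            yield {"algorithm": algorithm, **dict(zip(names, values))}


def toy_validation_error(config):
    """Made-up scores standing in for real training runs."""
    if config["algorithm"] == "decision_tree":
        return 0.30 - 0.02 * config["max_depth"]
    return 0.20 + 0.005 * config["n_neighbors"]


configs = list(all_configurations(SEARCH_SPACE))
best = min(configs, key=toy_validation_error)
```

Once model selection is flattened this way, any configuration optimizer can search over algorithms and their settings together. A model-based optimizer would replace the exhaustive loop with a surrogate model that proposes which configuration to evaluate next.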
Auto-sklearn supports warm-starting the configuration search by trying to generalize configuration settings across data sets based on historic performance information. Other work leverages historical information to build a recommender system that can navigate that information more efficiently. This approach is effective in determining a pipeline but is also limited, because it can only select from a pre-defined and limited set of pre-existing pipelines. To enable AutoML to dynamically generate pipelines instead of only selecting pre-existing ones, other researchers took inspiration from AlphaGo Zero and modeled pipeline generation as a single-player game. The pipeline is built iteratively by selecting from a set of actions (insertion, deletion, replacement) and a set of pipeline components (e.g., logarithmic transformation of a specific predictor, or “feature”). Later work extends this idea with a reinforcement learning approach, so that the final outcome is an ensemble of multiple sub-optimal pipelines that nonetheless achieves state-of-the-art model performance when compared to other approaches. Model ensembles have become a mainstay in ML with all recent Kaggle competition-winning teams relying on them. Many AutoML systems generate a final output model pipeline as an ensemble of multiple model algorithms instead of a single algorithm. More specifically, the ensemble algorithms include:
1) Ensemble selection, a greedy-search-based algorithm that starts with an empty set of models, incrementally adds a model to the working set, and keeps that model if the addition improves the predictive performance of the ensemble.
2) Genetic programming, which does not create an ensemble of multiple model algorithms but can compose derived model algorithms. An advanced version uses multi-objective genetic programming to evolve a set of accurate and diverse models by introducing bias into the fitness function accordingly.
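The greedy ensemble selection described in item 1 can be sketched as follows. This is a simplified illustration, not the implementation used by any particular AutoML system: the model names and validation predictions are invented, the ensemble simply averages predicted probabilities, and accuracy at a 0.5 threshold is assumed as the performance measure.

```python
def accuracy(probabilities, labels):
    """Share of examples where thresholding at 0.5 matches the true label."""
    correct = sum((p >= 0.5) == bool(y) for p, y in zip(probabilities, labels))
    return correct / len(labels)


def average(prediction_lists):
    """Element-wise mean of several models' predicted probabilities."""
    return [sum(column) / len(column) for column in zip(*prediction_lists)]


def ensemble_selection(model_predictions, labels, max_size=10):
    """Start empty; each round add the model that most improves the ensemble."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_size:
        best_candidate = None
        for name in sorted(model_predictions):
            trial = [model_predictions[m] for m in selected + [name]]
            score = accuracy(average(trial), labels)
            if score > best_score:
                best_score, best_candidate = score, name
        if best_candidate is None:  # no addition improves the ensemble: stop
            break
        selected.append(best_candidate)
    return selected, best_score


# Hypothetical validation labels and per-model predicted probabilities.
labels = [1, 0, 1, 0, 1, 0]
model_predictions = {
    "a": [0.9, 0.2, 0.4, 0.1, 0.8, 0.6],
    "b": [0.6, 0.7, 0.9, 0.3, 0.7, 0.2],
    "c": [0.2, 0.4, 0.6, 0.6, 0.3, 0.4],
}
selected, score = ensemble_selection(model_predictions, labels)
```

In this toy data, model "b" is the best single model, and adding "a" corrects its one mistake, so the search stops with those two. A model may be picked more than once, which effectively gives it a larger weight in the average.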
Data Science Roles, Their Current and Preferred Levels of Automation, and Different Stages of the Lifecycle:
In this section, let's take a closer look at data science workers’ current level of automation and at the level of automation they would prefer in the future. The current and preferred levels of automation are also associated with different stages of the DS lifecycle. There is a clear gap between the level of automation in current DS work practices and the level preferred for the future. Most respondents reported that their current work is at automation level L0, which is “No automation, human performs the task”. Some participants reported L1 or even L2 levels of automation (i.e., “human-directed automation” and “system-suggested human-directed automation”, respectively) in their current work practice, and these automation activities happened mostly in the more technical stages of the DS lifecycle (e.g., data pre-processing, feature engineering, model building, and model deployment). These findings echo the existing trend that AutoML system development and algorithm research focus much more on the technical stages of the lifecycle. However, these degrees of automation fall well short of what the respondents desired: participants reported that they prefer at least L1 automation across all stages, with the only exceptions being requirement gathering and model verification, where a number of participants still prefer L0. The median across all stages is L2, system-suggested human-directed automation.
In some of the stages, when asked about future preferences, a few respondents indicated that they want full automation (L4) over other automation levels. The model deployment stage had the highest preference for full automation, but even there it was not the top choice. On average across stages, full automation was preferred by only 14% of respondents. System-suggested human-directed automation (L2) was preferred by the most respondents (42%), while system-directed automation (L3) was the second preference (22%). This suggests that users of AutoML want to stay informed and to retain some degree of control over the system. A fully automated end-to-end DS lifecycle was not what people wanted, so end-to-end AutoML systems should keep a human in the loop. There also seems to be a trend in the preferred levels of automation: in general, the desired level of automation increases as the lifecycle moves from less technical stages (e.g., requirement gathering) to more technical ones (e.g., model building). L2 (system-suggested human-directed automation), L3 (system-directed automation), and L4 (full automation) are the levels at which the human shifts some control and decision power to the system, and the AutoML system starts to take agency. Together, L2, L3, and L4 took the majority vote in each stage. In summary, these results suggest that people definitely welcome more automation to help with their DS/ML projects, and that there is a large gap between what they use today and what they want tomorrow. However, people do not want over-automated systems in human-centered tasks (i.e., requirement gathering, model verification, and decision optimization).
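For readers who want to work with this scale themselves, the five levels can be written down as an ordered enumeration, which makes per-stage medians easy to compute. The responses below are invented purely for illustration and are not the survey's data; the stage names are taken from the text.

```python
from enum import IntEnum
from statistics import median_low


class AutomationLevel(IntEnum):
    NO_AUTOMATION = 0                    # L0: human performs the task
    HUMAN_DIRECTED = 1                   # L1
    SYSTEM_SUGGESTED_HUMAN_DIRECTED = 2  # L2
    SYSTEM_DIRECTED = 3                  # L3
    FULL_AUTOMATION = 4                  # L4


# Hypothetical preferred levels per stage (one number per respondent).
preferred = {
    "requirement gathering": [0, 1, 1, 2, 2],
    "model building": [2, 2, 3, 3, 4],
    "model deployment": [2, 3, 3, 4, 4],
}


def median_level(responses):
    """Median response, reported as a named level (always an actual answer)."""
    return AutomationLevel(median_low(responses))


summary = {stage: median_level(r) for stage, r in preferred.items()}
```

Using median_low keeps the result on the scale itself, so the summary never reports a level such as "L2.5" that no respondent could have chosen.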
For a more in-depth examination of people’s preferred levels of automation, it helps to look at finer-grained preferences across different stages and across different roles. It is worth noting that participants across all roles agreed that requirement gathering should remain a relatively manual process. Data scientists, both expert and citizen data scientists, tend to be cautious about automation. Only a few of them expressed interest in fully automating (L4) feature engineering, model building, model verification, and runtime monitoring. For example, in model verification, they prefer system-suggested human-directed automation (L2) and system-directed automation (L3) over too little automation (L0/L1) or too much automation (L4).
AI-Ops had a more conservative perspective toward automation than other roles: they showed a majority preference for full automation (L4) only in the model deployment stage, with some interest in data acquisition and data pre-processing, but for the remaining stages they strongly prefer to have human involvement. Above all, there is a clear consensus among the different roles that model deployment, feature engineering, and model building are the places where practitioners want higher levels of automation. This suggests an opportunity for researchers and system builders to prioritize automation work on these stages. On the other hand, all roles agree that less automation is desired in the requirement gathering and decision optimization stages. This may be because these stages are currently labor-intensive human efforts, and it is difficult for participants to even imagine what automation would look like in these stages in the future.
Data Governance & Why?
Data governance is not a new idea: as long as data has been collected, companies have needed some level of policy and oversight for its management. Yet it largely stayed in the background, as businesses weren’t using data at a scale that required data governance to be top of mind. In the last few years, and certainly in the face of 2020’s tumultuous turn of events, data governance has shot to the forefront of discussions both in the media and in the boardroom as businesses take their first steps towards Enterprise AI. Recent increased government involvement in data privacy (e.g., GDPR and CCPA) has no doubt played a part, as has a magnified focus on AI risks and model maintenance in the face of the rapid development of machine learning. Companies are starting to realize that data governance has never really been established in a way that can handle the massive shift toward democratized machine learning required in the age of AI, and that with AI come new governance requirements. Today, the democratization of data science across the enterprise, together with tools that put data into the hands of the many and not just the elite few (like data scientists or even analysts), means that companies are using more data in more ways than ever before. And that’s super valuable; in fact, the businesses that have seen the most success in using data to drive the business take this approach.
But it also presents new challenges, mainly that businesses’ IT organizations are not able to handle the demands of data democratization, which has created a sort of power struggle between the two sides that slows down overall progress toward Enterprise AI. The answer to this challenge, and the topic of this section, is a fundamental organizational shift to a new type of data governance, one that enables data use while also protecting the business from risk.
Most enterprises today identify data governance as a very important part of their data strategy, but often, it’s because poor data governance is risky. And that’s not a bad reason to prioritize it; after all, complying with regulations and avoiding bad actors or security concerns is critical. However, governance programs aren’t just beneficial because they keep the company safe - their effects are much wider:
Governance isn’t about just keeping the company safe; data and AI governance are essential components to bringing the company up to today’s standards, turning data and AI systems into a fundamental organizational asset. As we’ll see in the next section, this includes wider use of data and democratization across the company.
AI Governance and Machine Learning Model Management?
Data governance traditionally includes the policies, roles, standards, and metrics to continuously improve the use of information that ultimately enables a company to achieve its business goals. Data governance ensures the quality and security of an organization’s data by clearly defining who is responsible for what data as well as what actions they can take (using what methods).
With the rise of data science, machine learning, and AI, the opportunities for leveraging the mass amounts of data at the company’s disposal have exploded, and it’s tempting to think that existing data governance strategies are sufficient to sustain this increased activity. Surely, it’s possible to get data to data scientists and analysts as quickly as possible via a data lake, and they can wrangle it to the needs of the business?
But this thinking is flawed; in fact, the need for data governance is greater than ever as organizations worldwide make more decisions with more data. Companies without effective governance and quality controls at the top are effectively kicking the can down the road, so to speak, for the analysts, data scientists, and business users to deal with, repeatedly and in inconsistent ways. This ultimately leads to a lack of trust at every stage of the data pipeline. If people across an organization do not trust the data, they can’t possibly make the right decisions confidently and accurately.
IT organizations have historically addressed, and been ultimately responsible for, data governance. But as businesses move into the age of data democratization (where stewardship, access, and data ownership become larger questions), those IT teams have often been put, incorrectly, in the position of also taking responsibility for information governance pieces that should really be owned by business teams. This matters because the skill sets for each of these governance components are different. Those responsible for data governance will have expertise in data architecture, privacy, integration, and modeling. However, those on the information governance side should be business experts. They know:
In brief, data governance needs to be teamwork between IT and business stakeholders.
Shifting from Traditional Data Governance to Data Science & AI Governance Model:
A traditional data governance program oversees a range of activities, including data security, reference and master data management, data quality, data architecture, and metadata management. Now, with the growing adoption of data science, machine learning, and AI, there are new components that should also sit under data governance, namely machine learning model management and Responsible AI governance. Just as the use of data is governed by a data governance program, the development and use of machine learning models in production require clear, unambiguous policies, roles, standards, and metrics. A robust machine learning model management program would aim to answer questions such as:
It’s worth noting that machine learning model management will play an especially important role in AI governance strategies in 2020 and beyond as businesses leverage Enterprise AI to both recover from and develop systems that better adapt to future economic change.
Responsible AI Governance and Keys to Defining a Successful AI Governance Strategy:
The second new aspect for a modern governance strategy is the oversight and policies around Responsible AI. While it has certainly been at the center of media attention as well as public debate, Responsible AI has also at the same time been somewhat overlooked when it comes to incorporating it concretely as part of governance programs.
Perhaps because data science is referred to as just that — a science — there is a perception among some that AI is intrinsically objective; that is, that its recommendations, forecasts, or any other output of a machine learning model isn’t subject to individuals’ biases. If this were the case, then the question of responsibility would be irrelevant to AI - an algorithm would simply be an indisputable representation of reality.
This misconception is extremely dangerous not only because it is false, but also because it tends to create a false sense of comfort, diluting team and individual responsibility when it comes to AI projects. Governance around Responsible AI should aim to address this misconception, answering questions such as:
Key Methods and Principles for Defining an AI Governance Strategy:
A Top-Down and Bottom-Up Plan:
Every AI governance program needs executive sponsorship. Without strong support from leadership, it is unlikely a company will make the right changes (which — full transparency — are often difficult changes) to improve data security, quality, and management. At the same time, individual teams must take collective responsibility for the data they manage and the analysis they produce. There needs to be a culture of continuous improvement and ownership of data issues. This bottom-up approach can only be achieved in tandem with top-down communications and recognition of teams that have made real improvements and can serve as an example to the rest of the organization.
Balance Between Governance and Enablement:
Governance shouldn’t be a blocker to innovation; rather, it should enable and support innovation. That means in many cases, teams need to make distinctions between proofs-of-concept, self-service data initiatives, and industrialized data products, as well as the governance needs surrounding each. Space needs to be given for exploration and experimentation, but teams also need to make a clear decision about when self-service projects or proofs-of-concept should receive the funding, testing, and assurance to become an industrialized, operationalized solution.
Excellence at its Heart:
In many companies, data products produced by data science and business intelligence teams have not had the same commitment to quality as traditional software development (through movements such as extreme programming and software craftsmanship). In many ways, this arose because five to ten years ago, data science was still a relatively new discipline, and practitioners were mostly working in experimental environments, not pushing to production. So, while data science used to be the wild west, today its adoption and importance have grown so much that the standards of quality applied to software development need to be applied to it as well. Not only does the quality of the data itself matter now more than ever, but data products also need to meet the same high standards of quality (through code review, testing, and continuous integration/continuous delivery, or CI/CD) that traditional software does if the insights are to be trusted and adopted by the business at scale.
As machine learning and deep learning models become more widespread in the decisions made across industries, model management is becoming a key factor in any AI Governance strategy. This is especially true today as the economic climate shifts, causing massive changes in underlying data and models that degrade or drift more quickly. Continuous monitoring, model refreshes, and testing are needed to ensure the performance of models meets the needs of the business. To this end, MLOps is an attempt to take the best of DevOps processes from software development and apply them to data science.
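One common way to put the continuous monitoring mentioned above into practice is to compare the distribution of a feature (or of model scores) at training time with what the model sees in production. The sketch below uses the Population Stability Index, which is one option among many and is swapped in here for illustration; the bin count, the epsilon floor, the sample data, and the 0.25 alert threshold (a widely used rule of thumb, not a standard) are all assumptions.

```python
import math


def population_stability_index(expected, actual, n_bins=5, epsilon=1e-6):
    """PSI between a baseline sample and a newer sample of the same feature.

    Bins are equal-width over the baseline's range; values outside that range
    are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if baseline is constant

    def bin_shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor empty bins at epsilon so the logarithm stays defined.
        return [max(count / len(values), epsilon) for count in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: production values have drifted toward the upper range.
training = [i / 100 for i in range(100)]
production = [0.5 + i / 200 for i in range(100)]

psi = population_stability_index(training, production)
needs_review = psi > 0.25  # assumed alert threshold
```

A check like this can run on a schedule, with a breach of the threshold triggering a model refresh or a closer look by the team that owns the model.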
Transparency and Accountable AI:
Even if, per the third component, data scientists write tidy code and adhere to high-quality standards, they are still giving away a certain level of control to complex algorithms. In other words, it’s not just about the quality of data or code, but about making sure that models do what they’re intended to do. There is growing scrutiny on decisions made by machine learning models, and rightly so. Models are making decisions that impact many people’s lives every day, so understanding the implications of the decisions they make and making the models explainable is essential (both for the people impacted and the companies producing them). Open source toolkits such as Aequitas, developed by the University of Chicago, make it simpler for machine learning developers, analysts, and policymakers to understand the types of bias that machine learning models bring.
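As a rough illustration of the kind of per-group measurement a bias audit might report, the sketch below computes the false positive rate for each group and its ratio to a reference group. This is a hand-rolled example and not the API of Aequitas or any other toolkit; the group names and records are invented, and it assumes the reference group's rate is non-zero.

```python
from collections import defaultdict


def false_positive_rates(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, truth, prediction in records:
        if truth == 0:
            negatives[group] += 1
            if prediction == 1:
                false_positives[group] += 1
    return {group: false_positives[group] / n for group, n in negatives.items()}


def disparity(rates, reference_group):
    """Each group's rate relative to the reference group's rate."""
    reference = rates[reference_group]  # assumed to be non-zero
    return {group: rate / reference for group, rate in rates.items()}


# Hypothetical audit data: (group, true label, model prediction).
records = [
    ("A", 0, 0), ("A", 0, 0), ("A", 0, 0), ("A", 0, 1), ("A", 1, 1),
    ("B", 0, 0), ("B", 0, 1), ("B", 0, 1), ("B", 0, 0), ("B", 1, 1),
]

rates = false_positive_rates(records)
ratios = disparity(rates, reference_group="A")
```

Here group B is wrongly flagged twice as often as group A, which is the sort of gap a governance process would want surfaced and explained before a model goes to production.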
Data & AI Governance Weaknesses:
Data and AI governance aren’t easy; as mentioned in the introduction, these programs require coordination, discipline, and organizational change, all of which become even more challenging the larger the enterprise. What’s more, their success is a question not just of successful processes, but a transformation of people and technology as well. That is why despite the clear importance and tangible benefit of having an effective AI governance program, there are several pitfalls that organizations can fall into along the way that might hamper efforts:
A governance program without senior sponsorship means policies without “teeth,” so to speak. Data scientists, analysts, and business people will often revert to the status quo if there isn’t top-down castigation when data governance policies aren’t adhered to and recognition for when positive steps are taken to improve data governance.
If there isn’t a culture of ownership and commitment to improving the use and exploitation of data throughout the organization, it is very difficult for a data governance strategy to be effective. As the saying goes, “Culture eats strategy for breakfast.” Part of the answer often comes back to senior sponsorship as well as communication and tooling.
A lack of clear and widespread communication around data governance policies, standards, roles, and metrics can lead to a data governance program being ineffective. If employees aren’t aware or educated around what the policies and standards are, then how can they do their best to implement them?
Training and education are hugely important pieces of good data and AI governance. It not only ensures that everyone is aware of policies but also can help explain practically why governance matters. Whether through webinars, e-learning, online documentation, mass emails, or videos, initial and continuing education should be a piece of the puzzle.
A centralized, controlled environment from which all data work happens makes data and AI governance infinitely simpler. Data science, machine learning, and AI platforms can be a basis for this environment, and essential features include, at a minimum, contextualized documentation, a clear delineation between projects, task organization, change management, rollback, monitoring, and enterprise-level security.
Which Platforms Are Trending for Enterprise AI and Make Life Easier for Data Scientists?
Not too hot, not too cold, but just right: these are the platforms that strike a balance between being loved by techies and non-techies alike. This middle ground offers a strong focus on citizen data science users and heavy integration with programming languages, allowing for flexibility and in-platform collaboration between people who can code and people who can’t. These platforms make life easier for data scientists by helping them execute the many steps of a lengthy data science project and automate parts of the work.
Azure Machine Learning:
Microsoft is well known for seamlessly integrating its product offerings with each other, making Azure Machine Learning an attractive option for users who are already working in an existing Azure stack. Azure Machine Learning’s main offering is the ability to build predictive models in-browser using a point-and-click GUI. Though the ability to write code directly in the platform is not available, specialized data scientists will be excited by Microsoft’s Python integration. The Azure ML library for Python allows users to normalize and transform data in Python themselves using familiar syntax, and call Azure Machine Learning models as needed using loops. Not only this, but Azure Machine Learning also integrates with existing Python ML packages (including scikit-learn, TensorFlow, and PyTorch). For users familiar with these tools, distributed cloud resources can be used to productionize results at scale, just like any other experiment. As of the writing of this article, Azure Machine Learning also offers an SDK for R in a public preview (i.e., non-productionisable) mode, which is expected to improve over time.
H2O Driverless AI:
H2O Driverless AI is the main commercial enterprise offering of the company H2O.ai, offering automated AI with some pretty in-depth algorithms, including advanced features like natural language processing. A strong focus on model interpretability gives users multiple options for visualizing algorithms in charts, decision trees, and flowcharts. H2O.ai is already well known in the industry for its fully open-source ML platform H2O, which can be accessed as a package through existing languages like Python and R, or in notebook format. H2O Driverless AI and H2O currently exist as fairly separate products, though there is potential for these to be further integrated in the future. Partnerships with multiple cloud infrastructure providers (including AWS, Microsoft, Google Cloud, and Snowflake) make H2O Driverless AI a product to watch in the coming years.
DataRobot:
DataRobot offers a tool that is intended to empower business users to build predictive models through a streamlined point-and-click GUI. The tool focuses very heavily on model explainability, by generating flowcharts for data normalization and automated visuals for assessing model outcomes. These out-of-the-box visuals include important exploratory charts like ROC curves, confusion matrices, and feature impact charts. DataRobot’s end-to-end capabilities were significantly bolstered by the company’s acquisition of Paxata (a data preparation platform) in December 2019, which has since been integrated with the DataRobot predictive platform. The company also boasts some big-name partnerships, including Qlik, Tableau, Looker, Snowflake, AWS, and Alteryx. DataRobot does offer Python and R packages, which allow many of the service’s predictive features to be called through code, though the ability to directly write code in the DataRobot platform and collaborate with citizen data scientist users is not currently available (as of the writing of this article). DataRobot’s new MLOps service also provides the ability to deploy independent models written in Python/R (in addition to models developed in DataRobot), as part of a robust operations platform that includes deployment tests, integrated source control, and the ability to track model drift over time.
RapidMiner Studio:
RapidMiner Studio is a drag-and-drop GUI-based tool for building predictive analytics solutions, with a free version providing analysis of up to 10,000 rows. In-database querying and processing are available through the GUI, but programmers and analysts also have the option to query in SQL code. The ETL process is handled by Turbo Prep, which offers point-and-click data preparation (as well as direct export to .qvx, for users who want to import results into Qlik). The cool thing about RapidMiner is the integration with Python and R modules, available as supported extensions in the RapidMiner Marketplace, through which coders and non-coders can collaborate on the same project. For coders working on a local Python instance, the RapidMiner library in Python also allows for the administration of projects and resources of a RapidMiner instance. For cloud-based scaling of models, RapidMiner also allows containerization using Docker and Kubernetes.
Alteryx:
An existing big player in the ETL tool market, Alteryx is used to build data transformation workflows in a GUI, replacing the need to write SQL code. Alteryx has significantly stepped up its game in recent years with its integrated data science offering, allowing users to build predictive models using their drag-and-drop “no-code” approach. The ability to visualize and troubleshoot results at every step of the operation is a huge plus, and users familiar with SQL should transition easily to the logical flowchart style of the ETL, removing the need for complex nested scripts. Alteryx has a fantastic online community with plenty of resources, and direct integration with both Python and R through out-of-the-box tools. The Python tool includes common data science packages such as pandas, scikit-learn, matplotlib, numpy, and others which will be familiar to the Python enthusiasts of this world.
Dataiku:
Dataiku is one of the world’s leading AI and machine learning platforms, supporting agility in organizations’ data efforts via collaborative, elastic, and responsible AI, all at enterprise scale. Hundreds of companies use Dataiku to underpin their essential business operations and ensure they stay relevant in a changing world. One quick look at the Dataiku website will make it immediately clear that this is a platform for everyone in the data space. Dataiku offers both a visual UI and a code-based platform for ML model development, along with a host of features that make Dataiku a highly sustainable platform in production. Data scientists will be delighted not only with the Python and R integration, but also with the flexibility of being able to code either in the embedded code editor or in their favorite environment, such as Jupyter notebooks or RStudio. Dataiku DSS (Data Science Studio) also exposes an HTTP REST API, allowing users to manage models, pipelines, and automation externally. Data analysts will be excited by the multitude of plugins available, including Power BI, Looker, Qlik .qvx export, Dropbox, Excel, Google Sheets, Google Drive, Google Cloud, OneDrive, SharePoint, Confluence, and many more. Automatic feature engineering, generation, and selection, in combination with the visual UI for model development, allows ML to sit firmly within the reach of citizen data scientists.