Predicting how long it will take to get a Machine Learning (ML) project into production can be tricky. When a project slips, the cause is more often than not a disconnect between the engineering and data science teams. Collaboration between data science and engineering is critical for ML projects, but it is often a challenge.
Although data scientists and engineers both work with code and machines, their roles and mindsets are different. Data scientists extract knowledge and insights from data, while software engineers build products and systems. Data scientists can spend considerable time creating and tweaking models and algorithms to get an ideal result, which makes their work more experimental and iterative than software development. Engineers are responsible for building functionality around the ML models and getting products into production within a set timeframe.
The model development portion of an ML project is considered the ‘research phase’ and is where many ML projects get stalled due to continual model adjustments. Therefore, it can be extremely beneficial for data scientists to think in engineering terms, which often leads to a faster production cycle.
When it comes to ML project management, one can separate the process into three stages: Proof of Concept (PoC) or the research phase, the Demo phase, and the Engineering phase. In this article, we examine these different phases and how one can handle them to ensure smooth and timely delivery of projects. The resulting protocol can also ensure better estimates of time for production deployment.
Based on several years of experience handling various ML projects as part of a data science team, we have created a number of heuristic rules that one can follow to ensure a smooth, predictable and faster path to production.
Almost all ML projects require a PoC phase. The PoC establishes that the problem is feasible and that a reasonably well-performing model can be built.
Rule 1: Time-bound PoC Efforts
Since the PoC is essentially a research effort, it can go on indefinitely for two main reasons: 1) data scientists are never done searching for a better model and 2) ML models have a multitude of hyperparameters to adjust and refine. Therefore, it is essential to set and stick to a pre-determined timeframe to complete the PoC. This reality also drives the need for Rule 2.
Rule 2: Set Expectations of PoC beforehand
Start by clearly defining the output of the PoC, either in terms of metrics or as a set of feature behaviours. One could argue that with a clearly defined target under Rule 2, Rule 1 is unnecessary. However, Rule 2 only helps if the problem can actually be solved. Rule 1 therefore ensures the team does not go beyond a certain number of retries before giving up.
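The interplay between Rules 1 and 2 can be sketched as a search loop with two exit conditions: stop as soon as the agreed target is met (Rule 2), or stop when the budget runs out (Rule 1). The snippet below is only an illustration under stated assumptions. The names `run_poc`, `PocResult`, `target_score` and `max_trials` are hypothetical, the budget is counted in trials rather than calendar time, and `evaluate` stands in for whatever training and validation procedure a team actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

C = TypeVar("C")  # a candidate model or hyperparameter configuration


@dataclass
class PocResult:
    succeeded: bool    # True if the agreed target was reached
    best_score: float  # best metric value seen across all trials
    trials_used: int   # how much of the budget was spent


def run_poc(
    candidates: Iterable[C],
    evaluate: Callable[[C], float],
    target_score: float,  # Rule 2: the expectation agreed beforehand
    max_trials: int,      # Rule 1: the budget (hypothetical unit: trials)
) -> PocResult:
    best = float("-inf")
    used = 0
    for candidate in candidates:
        if used >= max_trials:
            break  # Rule 1: budget exhausted, stop searching
        used += 1
        best = max(best, evaluate(candidate))
        if best >= target_score:
            return PocResult(True, best, used)  # Rule 2: target met
    return PocResult(False, best, used)


# Toy usage: each "candidate" is simply the score it would achieve.
print(run_poc([0.60, 0.70, 0.85, 0.90], lambda s: s, target_score=0.80, max_trials=5))
```

Without `target_score` the loop has no definition of done, and without `max_trials` it has no way of giving up on an unsolvable problem, which is why the two rules complement each other.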
So how do you estimate the appropriate amount of time to develop a PoC? This takes experience and can evolve, but as a rule of thumb:
Once the viability of the ML project is established, demonstrating the work becomes important. The demonstration also sets the path for the Minimum Viable Product (MVP).
Rule 3: Demonstrate PoC Effort to All the Stakeholders
Involving the stakeholders has a number of effects on the MVP:
Though stakeholders differ from project to project, the minimum stakeholders should include:
The quality of the demonstration matters because this is the buy-in phase of the project: the better the demonstration, the higher the chance of the project being approved. Data science is all about creating stories, and this is the phase where the stories should speak clearly. These data science stories, combined with visualizations that business stakeholders can understand, are direct indicators of a successful demo. In addition to the model demo, the engineering team and other dependent teams should present a snapshot of how the PoC would be taken to production.
The demo phase should be time-boxed and should not last more than a month. Delaying a demo increases the chances of the PoC landing in the scrap yard or being pre-empted by higher-priority projects.
Once the project is in an approved phase, the next step is to take the PoC to production. Taking a PoC into production needs to be handled carefully, since the underlying product sometimes becomes the face of the company.
Rule 4: Set the Requirements Clearly
Setting clear requirements is important because it defines the goals not only for the data science team, but also for all the teams and parties on which the ML project depends. The following factors should be accounted for:
The requirements should also determine what happens if the final model’s performance does not meet expectations, whether due to data unavailability or unforeseen model limitations. In such situations, one can still deploy to a limited set of users to validate the feature, as discussed under Rule 7.
Rule 5: Define Clear Timelines and the Design
Defining timelines for the data science team ensures the project is tracked and brought to closure within an estimated time. Because timelines also set the product launch date, they should be set carefully, accounting for unknowns. Timelines should likewise be defined for dependent teams so that all parties can work in parallel. Regular, agile-style tracking is required to identify blockers early and resolve them before they start to overwhelm the project.
Sufficient time for QA and code reviews is often left out of timelines. Code reviews ensure quality and code coverage before QA takes over. QA determines product stability and should therefore be accounted for appropriately during the planning stage.
Timelines should include integration points clearly. In cases where a dedicated engineering team is available, at least one engineer from the product team should work together with the ML engineer to ensure smooth and faster integration with the system.
Design is an integral part of the system, and a well-thought-out system design accommodates future changes and makes the system robust. Timelines should allocate sufficient time for design, which varies depending on whether the work is an entirely new feature/product or an add-on feature. Various aspects to consider while designing:
Most ML projects take roughly 6-8 months to go to production.
Rule 6: Pre-launch Demo with All Stakeholders
A pre-launch demo is a good way to make sure the final product is consistent with what was agreed upon at the start of the project. It also lets the team accommodate minor changes arising from observations of the final product and the resulting discussions, and it serves as a counter-check on the business metrics defined earlier. To leave room for those changes, the pre-launch demo should be completed roughly a month before launch.
Rule 7: Phase-wise Product Deployment
Deployment should be completed in phases to ensure user feedback is accounted for incrementally, thereby further ensuring product quality and stability. The specific phase-wise approach will differ depending on the type of ML project, but generally includes:
The final deployment may take 2-6 months, depending on the deployment phases involved, so plan on a minimum of 2 months.
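One common way to expose a feature to a limited set of users during a phased deployment is deterministic percentage bucketing. The sketch below is an illustration rather than a prescribed mechanism: `rollout_bucket`, `is_enabled`, the feature name and the percentages are all hypothetical, and a real system would usually rely on its existing feature-flag or experimentation infrastructure.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(user_id, feature) < percent


# Hypothetical phases: widen the audience as feedback comes in.
for percent in (5, 25, 100):
    enabled = sum(is_enabled(f"user-{i}", "ml-ranking", percent) for i in range(1000))
    print(f"{percent:>3}% phase -> {enabled} of 1000 sample users enabled")
```

Because the bucket depends only on the user and feature identifiers, a user who sees the feature in an early phase keeps seeing it in later ones, which keeps feedback consistent as the rollout widens.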
After considering all the phases and steps, it’s clear that an ML project can take roughly 10-12 months from PoC to production. To ensure the project is delivered within the allotted timeframe, start with clear requirements and well-defined business metrics. Also allow sufficient time for QA and a phased deployment schedule. By following the framework above, the probability of delivering your ML project on time can increase dramatically.
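The article does not spell out how the 10-12 month figure is reached. One way to arrive at a similar range is to add up the per-phase estimates, as in the rough sketch below. It assumes the phases run strictly one after another, takes the demo at its one-month cap and the deployment at its two-month minimum, and uses a one-month PoC purely as a placeholder, since no PoC duration is stated above. The helper name `total_duration` is hypothetical.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (shortest, longest) duration in months


def total_duration(phases: Dict[str, Range]) -> Range:
    """Sum per-phase ranges, assuming phases run sequentially with no overlap."""
    shortest = sum(low for low, _ in phases.values())
    longest = sum(high for _, high in phases.values())
    return shortest, longest


phases = {
    "poc": (1, 1),          # ASSUMPTION: placeholder, not a figure from the article
    "demo": (1, 1),         # time-boxed to at most one month
    "engineering": (6, 8),  # roughly 6-8 months to go to production
    "deployment": (2, 2),   # the 2-month minimum for phased deployment
}
print(total_duration(phases))  # (10, 12)

# Using the full 2-6 month deployment range widens the upper bound.
print(total_duration({**phases, "deployment": (2, 6)}))  # (10, 16)
```

Under these assumptions the upper end of the estimate is most sensitive to the deployment phase, so the number of deployment phases is worth settling early in planning.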