Say you need to implement a machine learning system. Should you purchase a product, re-use open-source code, or develop your own algorithms? The decision does not need to be a binary one. I discuss the pluses and minuses of each option, explain with examples how to combine them to get the best of both worlds, and show when reinventing the wheel makes sense.
Designing Your Own Algorithms
Even if you design your own algorithms, chances are you use a Python platform and leverage existing libraries. You rarely re-write standard algorithms unless there is a critical need for performance improvement. So you still face the problem of finding the right library or the right programming environment. Yet implementing home-made solutions is typically more flexible. I discuss the pluses and minuses here.
Besides flexibility, if you have experienced engineers, the learning curve can be short. You control the code and know exactly what it does. You are not subject to unfixable glitches, to features that become obsolete and require a new product version, or to random updates causing unpredictable issues.
In some cases, reinventing the wheel may take less time than simply finding the right solution, not to mention the time spent correctly understanding and implementing a third-party product. In some instances, your core algorithm is your intellectual property and competitive edge: you cannot outsource it. If you have a long history of building your own solutions and already have large, consistent home-made libraries, re-using your own code or adding new code of your own may be the faster solution, at least short-term, though you might on occasion integrate code from third parties. Also, chances are that ad-hoc solutions offer more specialized algorithms, customized to your needs. Still, you may use vendor solutions for specific tasks such as dashboards, databases, or visualizations.
Doing it your own way may be expensive: you need a team of good, reliable engineers, and you have to do your own maintenance and testing. You don’t benefit from the many fixes in vendor solutions, brought in by the large number of customers using the product; chances are your solution has more glitches than a vendor’s. Then there are the legal and compliance issues, especially if your algorithms impact people’s lives. Vendor solutions usually take care of these for you.
Using Third-Party Solutions

These are vendor products, generally offered for a yearly licensing fee, though the category also includes open-source solutions. Third-party solutions, unless you use the latest version, are thoroughly tested. Data integration and compliance are handled transparently (hopefully). Updates and maintenance are not your problem, though they can be a cause of problems. The cost can be lower, and the learning curve may be less steep depending on the platform, especially if your team includes expert users or developers of the platform in question.
Finding the right vendor can be a challenge, and some vendors may disappear or discontinue a product. On the plus side, they typically offer good customer support. The cost can be customized to your needs (the size of your data or the amount of resources you use) and seamlessly upgraded over time. Vendors may also provide better security against hijacking or data loss, though this is not always the case.
Finally, when working with a vendor, you want to become an expert with the product. You may even find smart uses or tricks that the product developers haven’t thought about. In my case, I’ve found that Excel can do much more than you would think: for instance, computing my own ad-hoc (spline-based) prediction intervals using a simple Excel formula equivalent to PercentileIf, even though that function does not exist in Excel (unlike SumIf). In the end, I implemented this algorithm in Excel rather than Python. As a bonus, thanks to Excel, my charts are interactive.
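To make the idea concrete, here is a minimal Python sketch of what such a conditional percentile computes. The function name percentile_if is hypothetical (neither Excel nor the standard library provides it), and this shows only the conditional-percentile building block, not the spline-based prediction intervals themselves:

```python
def percentile_if(values, condition, q):
    """q-th percentile (0-100, linear interpolation, same convention as
    Excel's PERCENTILE.INC) restricted to the entries of `values` whose
    matching entry in `condition` is truthy.

    Hypothetical "PercentileIf", analogous to Excel's built-in SumIf:
    filter first, then aggregate.
    """
    subset = sorted(v for v, keep in zip(values, condition) if keep)
    if not subset:
        raise ValueError("no value satisfies the condition")
    # Fractional rank into the sorted, filtered values
    rank = (len(subset) - 1) * q / 100.0
    lo = int(rank)
    frac = rank - lo
    if lo + 1 < len(subset):
        # Linear interpolation between the two surrounding order statistics
        return subset[lo] + frac * (subset[lo + 1] - subset[lo])
    return subset[lo]

# Example: median of the observations belonging to group "A" only
obs    = [1.0, 5.0, 2.0, 8.0, 3.0, 9.0]
groups = ["A", "B", "A", "A", "B", "A"]
print(percentile_if(obs, [g == "A" for g in groups], 50))  # → 5.0
```

The same filter-then-aggregate pattern is what a SumIf-style Excel formula expresses in spreadsheet terms.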
Also, you may choose your technology based on what the vendor offers, rather than trying to fit a pre-specified technique into the vendor solution. In short, adapt to the vendor (rather than the other way around) to get the best out of the product.
A Nice Compromise
You can decide to keep some components in-house, or use multiple vendors, each dealing with a specific part of your systems. Some vendors allow you to add customized code on top of their platforms. This may be the best solution.
For instance, my Python code uses third-party graphics libraries. Rather than writing anti-aliasing and image compression algorithms from scratch, I use those implemented in the Python libraries, despite my considerable experience on the subject. It is not perfect: one of the anti-aliasing methods stopped being supported, and I need to update my code accordingly. Some MP4-to-GIF compression works poorly: I can explore an alternate library, use a different tool for that step (that’s what I did), or only use the compression method where it works best.
Yet the libraries in question provide low-level access to the images and videos, at the pixel level. So technically, I could incorporate my own compression method in my own Python code if I wanted to. But so far, except in a few instances, I have used algorithms developed by other people for these rather mundane tasks.
A counter-example is my connected component detection algorithm, written from scratch. There are Python libraries doing that, and even full code available from other people: see here for a big GitHub repository covering pretty much any algorithm written in Python. I did it myself because I could reuse an old piece of code of mine, and because it is heavily customized to nearest neighbor graphs. But I had to make sure it runs just as fast as other algorithms in the literature. For those interested, my algorithm (available here) does not use recursion, unlike all the others: it serves as a tutorial showing how recursion can be emulated in programming languages.
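The recursion-emulation idea can be sketched as follows: a depth-first traversal where an explicit stack plays the role of the call stack. This is a generic illustration on an edge-list graph, under my own assumptions, not the customized nearest-neighbor-graph implementation mentioned above:

```python
from collections import defaultdict

def connected_components(edges):
    """Connected components of an undirected graph given as (a, b) edge
    pairs. Uses depth-first search with an explicit stack instead of
    recursion, so it cannot hit Python's recursion limit on deep graphs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []      # the list `stack` replaces the call stack
        while stack:
            node = stack.pop()
            comp.append(node)
            for nbr in graph[node]:
                if nbr not in seen:    # mark on push to avoid duplicates
                    seen.add(nbr)
                    stack.append(nbr)
        components.append(sorted(comp))
    return components

print(connected_components([(1, 2), (2, 3), (4, 5)]))  # → [[1, 2, 3], [4, 5]]
```

Any recursive traversal can be rewritten this way: each recursive call becomes a push, each return a pop.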