Data Tanks for Incremental Training of Machine Learning Models

You are probably familiar with the term data lake. A data lake is a repository that can store a virtually unlimited volume of data. Most cloud service providers now let us host scalable data lakes that store data as it arrives. The data does not have to be structured before it is stored, and we can run different types of applications on it, usually big data analytics and machine learning. Traditionally, these applications need the entire data set to be present in one data lake, and when a new set of data is added to the lake, the analytics and the machine learning training have to be repeated from scratch. This is time-consuming and affects the delivery schedule of solutions.

A recent advancement in big data analytics and machine learning is incremental analytics and incremental training. Incremental learning extends the knowledge of an existing machine learning model by training it further with new data. With this advancement, we do not have to run the whole process again after new data is added to the lake. Instead, we run the application on the newly added data only and augment the results and intelligence already captured. This also means that, to maintain an analytics and machine learning workflow, we need not keep subscribing to a data lake after the first version of the model is generated.

Once the initial analysis and training on the bulk data is completed, we can do the incremental learning by keeping the incremental data in “Data Tanks”. Data Tanks facilitate incremental learning. Unlike data lakes, data tanks have limited capacity, which helps in saving subscription costs on the cloud. Just like a water tank, the data tank fills up with data, and once it reaches its maximum capacity, the big data analytics or machine learning training is triggered. These applications make incremental updates to the results and intelligence already captured. Once the incremental learning is completed, the tank is emptied to make space for the next set of data. In essence, the data tank is reused and the data is not stored permanently.
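
The fill, trigger and empty cycle of a data tank can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the names DataTank, RunningMean, partial_fit and on_full are hypothetical, and a running mean stands in for a real model so that the incremental update is easy to follow.

```python
from typing import Callable, List


class RunningMean:
    """Toy stand-in for a model whose state can be updated incrementally."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, batch: List[float]) -> None:
        # Fold the new batch into the existing estimate
        # without revisiting any previously seen data.
        for x in batch:
            self.count += 1
            self.mean += (x - self.mean) / self.count


class DataTank:
    """Fixed-capacity buffer that triggers a training callback when full."""

    def __init__(self, capacity: int,
                 on_full: Callable[[List[float]], None]) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.on_full = on_full
        self._records: List[float] = []

    def add(self, record: float) -> None:
        self._records.append(record)
        if len(self._records) >= self.capacity:
            self.on_full(list(self._records))  # incremental training step
            self._records.clear()              # empty the tank for reuse

    def __len__(self) -> int:
        return len(self._records)


model = RunningMean()
tank = DataTank(capacity=3, on_full=model.partial_fit)
for value in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    tank.add(value)

# Two full tanks were consumed; the seventh record is still waiting.
print(model.count, model.mean, len(tank))  # 6 3.5 1
```

In a real workflow the callback would call whatever incremental training routine the chosen framework offers. The tank itself only needs to know when it is full and that it can be emptied afterwards.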

With the evolution of IoT technology, machine learning models are increasingly deployed on edge devices. The models deployed on edge devices see a lot of online data, and it is a challenge to train the already deployed models using this new data. Running a full-fledged machine learning training process would require the device to have a huge amount of storage and memory. This is where Data Tanks become useful. The tanks are deployable on edge devices, and the capacity of a tank can be decided based on the storage space available on the device.

Incremental learning consists of a set of techniques used to train models in an incremental fashion. If the models are developed with the TensorFlow framework using Keras, online and incremental learning is possible with Keras and Creme. Creme is a library specifically tailored to incremental learning. Amazon SageMaker also supports incremental training for some of its built-in algorithms. These features help you exploit Data Tanks when deploying learning applications on edge devices that can incrementally update their intelligence.
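
Deciding the capacity of a tank from the storage available on the device could look like the sketch below. The function tank_capacity and the parameters record_bytes and fraction are hypothetical, and the default 10% share of free space and the 4096-byte record size are arbitrary assumptions chosen for illustration. The only library call used is shutil.disk_usage from the Python standard library.

```python
import shutil


def tank_capacity(free_bytes: int, record_bytes: int,
                  fraction: float = 0.1) -> int:
    """Number of records a tank may hold using `fraction` of the free space."""
    if record_bytes <= 0:
        raise ValueError("record_bytes must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Reserve only part of the free space, then count whole records.
    return int(free_bytes * fraction) // record_bytes


# Free space on the volume holding the current directory (assumed location).
free = shutil.disk_usage(".").free
print(tank_capacity(free, record_bytes=4096))
```

A smaller fraction leaves more headroom for the model and the operating system, at the cost of triggering incremental training more often.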

Feel free to implement Data Tanks for incremental learning and cut short the time required for retraining your models.

See you next time……

Janardhanan PS
Machine Learning Evangelist