In general, computer scientists treat code and data in two very different ways. Virtual memory was originally developed to run large programs (code) in small memories, while data is kept in external storage and must be retrieved into memory before computing. As a result, today's application developers instinctively think in a programming model based on storage and explicit data retrieval. This model, referred to as storage-based computing, has played an important role and served well in transactional applications such as banking and ERP systems, where data integrity is the primary concern and the data size (per transaction) is assumed to be smaller than the code size.
Over the last decade, the weight of applications has gradually shifted from "transactional" to "analytic", and data sizes have grown from a few kilobytes to megabytes, gigabytes, terabytes, or even more, while the code footprint has remained relatively unchanged. The assumption that the data is smaller than the code no longer holds. With the landscape between code and data changed in this way, storage-based computing imposes serious performance issues, as follows.
The worst case is the combined effect of juggling and swapping, which leads to a special form of the double-paging anomaly.
The storage-based computing model has been deeply rooted in developers' minds for more than forty years, even as the landscape gradually changes. Observing this shift in data size and code footprint from transactional applications to big data analytics, we raise the first question for big data computing:
"Instead of moving data to the computing space, is it possible to move programs into the data space and perform computing tasks where the data is stored?"
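The contrast behind this question can be sketched with a toy example. This is purely illustrative and not from the whitepaper; the `DataSpace` class and its methods are hypothetical names standing in for external storage that can either hand data back or run code in place.

```python
# Illustrative sketch (hypothetical API): contrasting storage-based computing
# with moving the program into the data space.

class DataSpace:
    """Stands in for a large dataset living in external storage."""

    def __init__(self, records):
        self._records = records  # imagine these records residing in storage

    def retrieve_all(self):
        # Storage-based model: the caller pulls every record into its own
        # memory before any computation can happen.
        return list(self._records)

    def run(self, fn):
        # Code-to-data model: the caller ships a (small) function into the
        # data space; only the result travels back.
        return fn(self._records)


space = DataSpace(range(1_000_000))

# Storage-based: move one million records to the computing space, then sum.
total_a = sum(space.retrieve_all())

# Code-to-data: move one small function to the data instead.
total_b = space.run(sum)

assert total_a == total_b
```

Both paths compute the same answer, but the first materializes the entire dataset in the caller's memory, while the second moves only a function and a scalar result; the gap between the two widens as the data grows.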
For more details, please find the technical whitepaper here.