
The Future of Data is Real-Time

By Eric Sammer

I honestly don’t know how I fed myself, found my way home, hailed a taxi to important meetings or discovered what my friends were up to 15 years ago. Today, we rely on having apps like DoorDash, Waze, Uber, and social media at our fingertips, and we depend on them being accurate and timely – often with less than a minute’s tolerance for any delay.

These sophisticated companies have figured out how to deliver real-time apps – and that’s bad news if you’re in the business of delivering software experiences to customers or staff. User expectations for data freshness and accuracy are already set externally by their experiences as consumers. If your data architecture uses batch ETL concepts from 15 years ago, your users will feel it and – more alarmingly – you’re at risk of losing them to competitors with a modern data stack that delivers streaming, real-time data. In today’s environment, where everyone is a savvy consumer of tech experiences, your user experience is a big part of your brand.

Delayed Data Is Bad Data

Why does data freshness matter? Well, in all but the least dynamic systems, delayed data is incorrect data, and its negative impact grows the longer it is delayed. Consider an Uber driver looking for a customer who doesn’t know the car has arrived, or who sent an updated pickup location the driver never received. A navigation app that doesn’t know an accident has been cleared, sending the driver on a long, unnecessary detour. An airline that emails a passenger to check in 20 minutes after the customer already checked in using the app. A recommendation engine that promotes a product the customer already bought. An online store that shows inventory at a branch location, only for the customer to travel there and find there’s none in stock.

The above are all end-user impacts, but let’s not forget that in many cases machines are using data to make decisions, often drawing on machine learning feature stores – decisions that will be incorrect if the data feeding the model is stale. If decisions are based on signals, and machines are making more of the decisions, then less accurate signals result in worse outcomes, faster.
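To make that concrete, here is a minimal sketch – a toy in-memory lookup rather than any particular feature-store API, with a made-up feature name and freshness budget – of the only real defense once data can arrive late: checking how old a signal is before acting on it.

```python
# Toy illustration: a model will happily score whatever value it is handed,
# so the caller has to guard on feature age when data can be delayed.
import time

FEATURE_MAX_AGE_S = 60  # hypothetical freshness budget for this signal

# stand-in for a feature store: feature name -> (value, written_at)
feature_store = {
    "items_in_cart": (3, time.time() - 300),  # last written 5 minutes ago
}

def get_fresh_feature(name, max_age_s=FEATURE_MAX_AGE_S):
    value, written_at = feature_store[name]
    age_s = time.time() - written_at
    if age_s > max_age_s:
        return None, age_s  # stale signal: better to decline than decide on old data
    return value, age_s

value, age_s = get_fresh_feature("items_in_cart")
if value is None:
    print(f"items_in_cart is {age_s:.0f}s old – skipping the recommendation")
else:
    print(f"recommend based on items_in_cart={value}")
```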

Your Horse Won’t Evolve Into A Car

Human nature and risk avoidance will always favor an evolutionary approach. Using familiar and production-proven processes to mimic streaming by running smaller batches more frequently won’t upset the organization’s data-stack apple cart. But the teams supporting these apps know that the tools they’ve been using for 15+ years are buckling under the pressure.

If it takes six minutes to process five minutes’ worth of data, you’ve passed the point where you can ever catch up – kind of like the mythical snake eating its tail. But it’s worse than that, because as you approach that theoretical limit, you run the risk of data loss and corruption. This is compounded by the demands a bursty, chunky batch processing load puts on a system compared with a smooth, steady stream, where processing can be more easily amortized over longer periods of time. Batch orchestration dependencies – characterized as a DAG (Directed Acyclic Graph) – also introduce significantly more latency, reducing data freshness. Failures in a DAG can be massively problematic – especially as the number of steps escalates – requiring one or more previous steps to be fully rolled back before reprocessing.
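To see how quickly that catches up with you, here is a toy simulation of the arithmetic – the five- and six-minute figures are the same illustrative numbers used above, and nothing else about a real pipeline is assumed:

```python
# Toy simulation: when each batch takes longer to process than the window
# it covers, the backlog – and therefore data staleness – grows without bound.
BATCH_WINDOW_MIN = 5   # each batch covers 5 minutes of incoming data
PROCESS_TIME_MIN = 6   # ...but takes 6 minutes to process

backlog_min = 0.0      # minutes of already-arrived data still waiting
for batch in range(1, 13):  # one simulated hour of batches
    # while this batch is being processed, new data keeps piling up
    backlog_min += PROCESS_TIME_MIN - BATCH_WINDOW_MIN
    lag_min = BATCH_WINDOW_MIN + backlog_min
    print(f"after batch {batch:2d}: freshest available result is ~{lag_min:.0f} min behind")
```

After just one simulated hour, the freshest result is already more than a quarter of an hour old, and the gap only widens.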

Life is real-time by default. Continuous processing is more natural than discretized chunks of work. The impacts of the batch approach are self-evident: reduced throughput and therefore increased latency, driving apps and services to use stale data. Correctness also suffers, with records being lost or delivered multiple times. Impact assessment is complex: determining who is affected by a failure and estimating the time to recover. These are limitations of batch itself, so no amount of money thrown at the problem can make it go away.

To use a common analogy, the horse can’t go any faster and it’s time to buy that car.

Your Data Stack Is Already Going Real-Time

The signs that the shift to real-time is happening are all around if you look for them. After years of the same databases ruling the roost, a cohort of new, specialized real-time analytical databases like Druid, Pinot and ClickHouse is meeting the streaming needs of customers in ways that the incumbent database vendors can’t. By working on real-time data extracted from operational systems, these specialist databases let organizations avoid the age-old problem of internal customers trying to run analysis directly against operational databases.

The widespread enterprise experimentation with Kafka, and the architectural move to mirror operational data onto a data warehouse, show that organizations see the writing on the wall. But without simple stream processing to manipulate data records in flight, it’s not yet a complete solution to their real-time data needs. Many are struggling to build a platform out of open source components, emulating Uber and the other examples we talked about earlier. Those who succeed quickly realize the upside-down economics of building a custom platform and opt to buy a managed stream processing platform-as-a-service so they can focus on building streaming pipelines. Success in bringing modern, real-time data stacks to data teams means simplifying the process so that things like change data capture, multi-way joins, and change stream processing are simple to implement.
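As a flavor of what “manipulating records in flight” can look like, here is a minimal sketch that reads change events from one Kafka topic, reshapes them, and emits them to another – assuming a local broker and the kafka-python client, with made-up topic and field names (orders.cdc, orders.enriched) purely for illustration:

```python
# Minimal in-flight transform: consume raw change-data-capture events,
# keep only inserts/updates, reshape them, and republish them for
# downstream real-time analytics.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "orders.cdc",                              # hypothetical CDC topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    change = message.value
    if change.get("op") in ("insert", "update"):   # drop deletes/heartbeats
        producer.send("orders.enriched", {
            "order_id": change["order_id"],
            "status": change["status"],
            "amount_usd": round(change["amount_cents"] / 100, 2),
        })
```

A managed stream processing platform takes the same idea further, handling the joins, state, and failure recovery that a hand-rolled loop like this one leaves to you.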

The Future of Your Data Stack is Real-Time and Batch

If there’s one thing we know about IT operations teams, it’s that they don’t enjoy running multiple stacks if they can avoid it. It’s expensive, inefficient, often requires additional people with new skills, and elevates security and technology risk.

The good news is that a properly engineered and orchestrated real-time data stack can serve an organization’s batch *and* real-time needs equally well. Therefore, in the future, there will be just one data stack, and it will be real-time. 

Maintaining two full stacks – a batch stack and a real-time stack – is a needless and hefty ops expense if one stack can support both sets of use cases.

Preparing Your Organization for the Real-Time Future

If your organization is already feeling pressure to deliver analytics faster and faster to support real-time use cases, be prepared for that pressure to only increase over time. If you’re not feeling the pressure yet, factors like economic uncertainty and competitive dynamics will bring that pressure to your doorstep soon enough. Start thinking now about how your transition to real-time data stacks can support your data teams with tools designed for the job at hand, while paving the way to a future where real-time stacks take on more and more of the batch workloads you’re running.