Finding insight within one data stream is a challenge. Finding insight across multiple streams can be significantly more so. The simple example? Two databases created independently of each other that claim to capture the same kind of data. The larger the datasets, the more challenges we face aligning columns, de-duping content, making sure we don’t overwrite newer data with old data, and otherwise cleaning and preparing the data for analysis. Ask anyone who has worked on aligning data across databases. It can be a pain. (And while we’re at it, make sure you pay them well. No. Really. They deserve it.)
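To make the "don’t overwrite newer data with old data" problem concrete, here is a minimal sketch of a timestamp-aware merge. The record structure, the `id` key, and the `updated` field are all assumptions for illustration, not anyone’s actual schema:

```python
from datetime import date

def merge_records(primary, incoming, key="id", ts="updated"):
    """Merge two lists of record dicts. On a key collision, keep the
    record with the newer timestamp so stale data never clobbers
    fresh data. (Field names here are hypothetical.)"""
    merged = {rec[key]: rec for rec in primary}
    for rec in incoming:
        existing = merged.get(rec[key])
        if existing is None or rec[ts] > existing[ts]:
            merged[rec[key]] = rec
    return list(merged.values())

# Two independently maintained "databases" describing the same person.
db_a = [{"id": 1, "name": "Jane Doe", "updated": date(2023, 5, 1)}]
db_b = [
    {"id": 1, "name": "Jane A. Doe", "updated": date(2023, 6, 1)},
    {"id": 2, "name": "John Roe", "updated": date(2023, 4, 15)},
]

combined = merge_records(db_a, db_b)
# The id=1 record from db_b wins because its timestamp is newer.
```

Real pipelines layer much more on top of this, of course: column mapping, conflict rules per field, and audit trails for every overwrite.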
But what about when the data isn’t all neatly structured into columns and rows? What if some of your data comes from structured streams, but most of the data you need to tell your story is semi-structured, or even unstructured? How are you going to make sure that the Jane Doe listed as the same person in two separate databases is the same Jane Doe interviewed in an article mentioning your organization? And how do you determine whether that mention has any relevance to what you are doing?
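Linking a name mentioned in free text back to a database record is an entity-resolution problem. A toy sketch using simple string similarity gives a feel for it; the names, threshold, and record shape below are illustrative assumptions, and production systems use far richer signals (context, dates, locations, co-mentions) than a single similarity score:

```python
from difflib import SequenceMatcher

def best_match(mention, candidates, threshold=0.8):
    """Return the candidate record whose name is most similar to the
    text mention, or None if nothing clears the (arbitrary) threshold."""
    scored = [
        (SequenceMatcher(None, mention.lower(), c["name"].lower()).ratio(), c)
        for c in candidates
    ]
    score, record = max(scored, key=lambda pair: pair[0])
    return record if score >= threshold else None

# Hypothetical records from our structured sources.
people = [
    {"id": 1, "name": "Jane A. Doe"},
    {"id": 2, "name": "John Roe"},
]

# "Jane Doe" as she appears in the article text.
match = best_match("Jane Doe", people)
# matches the "Jane A. Doe" record
```

Whether that match is *relevant* is a separate question again, and usually needs context around the mention rather than the name alone.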
Well, that’s a whole different level of challenging, one that my friend and colleague Dan Hirpara understands intimately in his work as a Senior Data Architect, managing and fusing multiple massive data streams for use in analysis. Dan recently wrote a blog post about data fusion on our website. It’s one of a series we will publish from time to time highlighting the challenges of implementing enterprise-level, data-driven solutions in different environments. So if you’re interested in learning more, read the full article.