The Role of Data Curation in Big Data

Introduction

Good data management practices are essential for ensuring that research data are of high quality, findable, accessible and have high validity. You can then share data ensuring their sustainability and accessibility in the long-term, for new research and policy or to replicate and validate existing research and policy. It is important that researchers extend these practices to their work with all types of data, be it big (large or complex) data or smaller, more ‘curatable’ datasets.

In this blog, we are going to understand about the data curation. Furthermore, we will be looking into many other advantages which data curation will bring to the big data table.

What is Data Curation?

Curation is the end-to-end process of creating good data through the identification and formation of resources with long-term value. In information technology, it refers mainly to the management of data throughout its lifecycle, from creation and initial storage to the time when it is archived for future research and analysis, or becomes obsolete and is deleted. The goal of data curation in the enterprise is twofold: to ensure compliance and that data can be retrieved for future research or reuse

Why Do You Need Data Curation?

Organizations invest heavily in big data analytics — $44 billion in 2014 alone, according to Gartner; yet, studies show that most organizations use only about 10% of their collected data, data that remains scattered in silos and varied sources across the organization. With data volumes growing exponentially, along with the increasing variety and heterogeneity of data sources, getting the data you need ready for analysis has become a costly and time-consuming process. Multiple data sets from different sources must first be catalogued and connected before they can be used by various analytics tools. Duplicate data and blank fields need to be eliminated, misspellings fixed, columns split or reshaped, and data need to be enriched with data from additional or third party sources to provide more context.

1. Effective Machine Learning

Machine Learning algorithms have made great strides towards understanding the consumer space. AI consisting of “neural networks” collaborate, and can use Deep Learning to recognize patterns. However, Humans need to intervene, at least initially, to direct algorithmic behavior towards effective learning. Curations are about where the humans can actually add their knowledge to what the machine has automated. This results in prepping for intelligent self-service processes, setting up organizations up for insights.

2. Dealing with Data Swamps

A Data Lake strategy allows users to easily access raw data, to consider multiple data attributes at once, and the flexibility to ask ambiguous business driven questions. But Data Lakes can end up Data Swamps where finding business value becomes like a quest to find the Holy Grail. Such Data swamps minus well be a Data graveyard. Well data curation here can save your data lakes from becoming the data yards

3. Ensuring Data Quality

Data Curators clean and undertake actions to ensure the long undertake actions to ensure the long-term preservation and retention of the authoritative nature of digital objects.

Steps in Data Curation

Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process. It involves:

1. Identifying

One needs to identify different data sources of interest (whether from inside or outside the enterprise) before they start working on a problem statement. Identification of the dataset is as important a thing as solving a problem. Many people underestimate the value of data identification. But, when one does data identification the right way, one can save on a lot of time wastage which can happen while optimizing the solution of the problem

2. Cleaning

Once you have some data at hand, one needs to clean the data. The incoming data may have a lot of anomalies like spelling errors, missing values, improper entries etc. Most of the data is always dirty and you need to clean it before you can start working with it. Cleaning data is one of the most important tasks under data curation. There is almost 200% value addition once data is in the right format

3. Transforming

Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system. The usual process involves converting documents, but data conversions sometimes involve the conversion of a program from one computer language to another to enable the program to run on a different platform. The usual reason for this data migration is the adoption of a new system that’s totally different from the previous one. Data curation also takes care of the data transformation

The more data you need to curate for analytics and other business purposes, the more costly and complex curation becomes — mostly because humans (domain experts, or data owners) aren’t scalable. As such, most enterprises are “tearing their hair out” as they try to cope with data curation at scale.

Roles of a Data Curator

In practice, data curation is more concerned with maintaining and managing the metadata rather than the database itself and, to that end, a large part of the process of data curation revolves around ingesting metadata such as schema, table and column popularity, usage popularity, top joins/filters/queries. Data curators not only create, manage, and maintain data, but may also determine best practices for working with that data. They often present the data in a visual format such as a chart, dashboard or report.

Data curation starts with the “data set.” These data sets are the atoms of data curation. Determining which of these data sets are the most useful or relevant is the job of the data curator. Being able to present the data in an effective manner is also extremely important. While some rules of thumb and best practices apply, the data curator must make an educated decision about which data assets are appropriate to use.

It’s important to know the context of the data before it can be trusted. Data curation uses such arbiters of modern taste as lists, popularity rankings, annotations, relevance feeds, comments, articles and the upvoting or downvoting of data assets to determine their relevancy.

How to Start with Data Curation?

First, companies can inject additional data assessments into their reviews of data with end users that evaluate how data can be used or redirected. One way this can be done is by making data retention reviews a collaborative process across business functions. The collaboration enables users who ordinarily wouldn’t be exposed to some types of data to evaluate if there are ways that this data can be plugged in and used in their own departmental analytics processes.

Second, IT and the business should articulate rules governing data purges. Presently, there is a fear of discarding any data, no matter how useless.

Third, companies should consider adding a data curator, which is a librarian-like curation function, to their big data and analytics staffs.

Summary

Data sets are reusable components — anyone conducting analysis should share and expect data sets that they create to be re-used. Re-usability is key to self-service at scale. Companies such as GoDaddy and eBay have already embraced this approach to harvesting and distributing data for re-use, allowing any user to become a curator of data knowledge and resulting in higher productivity.

Data curation observes the use of data, focusing on how context, narrative, and meaning can be collected around a reusable data set. It creates trust in data by tracking the social network and social bonds between users of data. By employing lists, popularity rankings, annotations, relevance feeds, comments, articles and the upvoting or downvoting of data assets, curation takes organizations beyond data documentation to creating trust in data across the enterprise.

If you want to read more about data science, read our Data Science Blogs