
Are you running from one analysis to the next? From one data visualization project, data modeling exercise, dashboard build, or data quality review to another because demand for your skills is so high?

How big is your personal folder? Is it littered with spreadsheets, BI workbooks, and scripts left over from one-time projects?

Dark Data is getting Darker

What is dark data?

Dark data is data and content that exists and is stored, but is not leveraged and analyzed for intelligence or used in forward-looking decisions. - Isaac

Many organizations associate dark data with legacy data or with data lakes that will supposedly serve a future purpose. I call them data landfills: the result of accumulated database silos and artifacts from one-off analytics. For legacy data, I've proposed an agile approach to finding value in dark data.

But data scientists have more tools and far more capability today: R scripts, dashboards built in data visualization tools, software developed for Hadoop clusters, data processing pipelines, and more. Ideally, most of this effort and its artifacts are implemented directly in the data warehouses and reference data, becoming core extensions. But some of this work is one-time, single-purpose analysis, and it will likely contribute to the organization's dark data unless some action and governance is adopted.

Simple Solutions To Avoid Dark Data


The simple answer to the accumulation of assets tied to analytics is to catalog them. Develop a small database that identifies the analysis performed, its purpose, its owner, and the location of its assets. Develop a tagging taxonomy to make it easier to navigate the catalog. Ensure these artifacts are stored in a source control repository such as Git and are versioned whenever an analysis is updated. Schedule periodic reviews to identify opportunities to enhance enterprise data assets by leveraging these artifacts as prototypes, and archive the ones that are no longer valid.
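The catalog above doesn't need to be elaborate. A minimal sketch, using SQLite and entirely hypothetical table and field names (the article doesn't prescribe a schema), might look like this: one table for the analysis with its purpose, owner, and source-control location, and one for tags from the agreed taxonomy.

```python
import sqlite3

# Hypothetical sketch of an analytics catalog; schema and names are illustrative.
conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.executescript("""
CREATE TABLE analysis (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    purpose TEXT,
    owner TEXT,
    repo_url TEXT          -- location of the assets in source control
);
CREATE TABLE tag (
    analysis_id INTEGER REFERENCES analysis(id),
    label TEXT NOT NULL    -- entries from an agreed tagging taxonomy
);
""")

conn.execute(
    "INSERT INTO analysis (title, purpose, owner, repo_url) VALUES (?, ?, ?, ?)",
    ("Churn dashboard", "One-time churn review", "jdoe",
     "https://git.example.com/analytics/churn"),
)
conn.execute("INSERT INTO tag (analysis_id, label) VALUES (1, 'dashboard')")
conn.commit()

# Navigating the catalog by tag:
rows = conn.execute(
    "SELECT a.title, a.owner FROM analysis a "
    "JOIN tag t ON t.analysis_id = a.id WHERE t.label = 'dashboard'"
).fetchall()
print(rows)  # [('Churn dashboard', 'jdoe')]
```

Even a spreadsheet would do to start; the point is that every one-off artifact gets an owner, a purpose, and a pointer to where its assets live.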

If you have the latest analytics cataloged with reasonable practices around them, you'll help avoid another generation of dark data.



Tags: dark data, data governance, data visualization



Comment by Richard Ordowich on September 21, 2015 at 7:40am

Perhaps the data in the landfills is just garbage, and organizations should begin to practice becoming more "green" about data. Creating a catalog of data that was poorly designed, ill-defined, and lacking in quality is akin to cataloging the contents of a landfill. Legacy data may be just data that has lived beyond its expiry date.

We keep adding to the data landfill in the hope we will discover gold data while perhaps all that will be discovered is methane.

Comment by Isaac Sacolick on September 18, 2015 at 1:54am

Rick - Your best option here is to develop a data mart and some scripts that can load in the data yearly/biyearly. If you're using a reasonable BI tool, the dashboards and reports you develop can point to this data source and update automatically when there is new data. If you're doing manual work every time the data set is updated, or creating a new "file" for each year/biyear of data, then that process can be automated, which also helps avoid creating dark data silos.
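For the student-progress case Rick describes, the load script can be made idempotent per term, so re-running it for a term replaces that term's rows instead of creating a new file. A minimal sketch (table name, columns, and data are hypothetical, and SQLite stands in for whatever database backs the data mart):

```python
import sqlite3

# Hypothetical sketch: load each term's records into one data-mart table
# so dashboards point at a single source instead of a new file per year.
conn = sqlite3.connect(":memory:")  # use the real data mart in practice
conn.execute("""
CREATE TABLE student_progress (
    term TEXT, student_id TEXT, gpa REAL,
    PRIMARY KEY (term, student_id)
)
""")

def load_term(term, records):
    """Idempotent load: re-running for a term replaces that term's rows."""
    conn.execute("DELETE FROM student_progress WHERE term = ?", (term,))
    conn.executemany(
        "INSERT INTO student_progress VALUES (?, ?, ?)",
        [(term, sid, gpa) for sid, gpa in records],
    )
    conn.commit()

# Each new term is one scripted call, not a new spreadsheet.
load_term("2015-fall", [("s1", 3.2), ("s2", 3.8)])
load_term("2016-spring", [("s1", 3.4), ("s2", 3.7)])

count = conn.execute("SELECT COUNT(*) FROM student_progress").fetchone()[0]
print(count)  # 4
```

A BI tool pointed at `student_progress` then picks up each new term automatically, which is the automation the comment recommends.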

Comment by Rick Henderson on September 17, 2015 at 8:48am

What about situations where you are analysing for the same insights, but with a data set that changes yearly or biyearly? Like student progress each term. In one case, same students, different term for maybe 4 years, in another, same analysis every term for different students.

Comment by Eric Jensen on September 16, 2015 at 10:23pm

good advice!

