A common scenario that data analysts encounter is what I like to call "data denialism". Often, and especially while consulting, an analyst will find that the data tells a different story than the one the customer holds to be true. It is also often the case that, when presented with this finding, the customer will outright deny the evidence, asserting that either the data or the analysis must be wrong. For example, it may be that a retailer focused on the low-end market is getting most of its sales from high-end customers, a fact that upends months (maybe even years) of marketing planning and strategy. (This may, or may not, be based on one of my previous consulting experiences.)
It is of course part of the analyst's job to present and discuss such controversial findings carefully, in a way that lets them be understood and accepted, or to tell a story that is compelling enough to be believable. Some discussion about findings is, of course, healthy and desirable. But even if the customer is convinced that the analyst did their job right, there's still the matter of the data itself: how can the customer be assured that the data is correct? After the myriad transformations, schema modifications, unifications and predictive tasks, how can even the analyst be sure that everything went right?
What the analyst needs in this case is some form of data lineage system, that is, a way of keeping track of the data's origins and transformations. This helps not only in justifying controversial statements, but also in debugging and regenerating lost information. If the analyst also has a way of representing the lineage visually, or in a simple summary, they can easily convey their confidence in the analysis to the customer.
At the highest level of abstraction, lineage can be represented as a graph, with task executions (instances or runs of a task) and datasets as the nodes, and the edges representing the connections between them. For example, a collection task could generate several datasets, which could then be processed by many different scripts, resulting in a single, denormalized dataset that is used for a regression task, which finally produces predictions for a key variable (e.g. sales). In a data product pipeline, such predictions may then be used as input for a control task or presented to a decision maker through a webapp.
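As a rough sketch of this representation, the example pipeline described above could be encoded as a directed graph in plain Python. The node names (`collect#1`, `raw_sales`, and so on) and the `add_execution` helper are made up for illustration; they do not come from any particular lineage tool.

```python
# Nodes are either datasets or task executions; edges point in the
# direction the data flows (input -> execution -> output).
nodes = {}      # node name -> "dataset" or "execution"
edges = set()   # (source, target) pairs


def add_execution(name, inputs, outputs):
    """Record one run of a task with the datasets it read and wrote."""
    nodes[name] = "execution"
    for dataset in inputs:
        nodes[dataset] = "dataset"
        edges.add((dataset, name))
    for dataset in outputs:
        nodes[dataset] = "dataset"
        edges.add((name, dataset))


# Hypothetical pipeline: collection -> processing scripts -> regression.
add_execution("collect#1", inputs=[], outputs=["raw_sales", "raw_customers"])
add_execution("clean_sales#1", inputs=["raw_sales"], outputs=["sales"])
add_execution("clean_customers#1", inputs=["raw_customers"], outputs=["customers"])
add_execution("join#1", inputs=["sales", "customers"], outputs=["denormalized"])
add_execution("regression#1", inputs=["denormalized"], outputs=["sales_predictions"])
```

Keeping executions (runs) rather than task definitions as nodes means that re-running the same script with different parameters simply adds a new node, such as a hypothetical `regression#2`.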
An important advantage of maintaining and presenting the lineage is that these predictions (or summaries, cluster groups, market baskets or recommendations in general) are naturally understood to be the ultimate consequence of the full pipeline, not of a detached machine learning task. It is a natural way to present every single part of the pipeline as necessary and meaningful, as opposed to giving most importance to the final, more sophisticated task (a common mistake made by analysts trying to sell their work).
The lineage, then, is a useful tool for the analyst to provide transparency and confidence to their customer, but it also serves as a kind of process documentation (extremely useful for expanding existing work) and supports reproducibility. All of these are extremely important concepts in data analysis in particular, and in science in general.
Here is a simple lineage example, manually built and visualized using the excellent Neo4j browser UI.
A keen observer will note that a graph like this is not enough to determine exactly which datasets were involved in the creation of a specific output. It may be that a single task execution produces two different datasets, and if there is no explicit declaration of which inputs were involved in the creation of which output, there is no way to unambiguously connect two datasets.
One way to solve this issue would be to add explicit documentation about inputs and outputs, for example by implementing the Open Provenance Model in the lineage system. The model provides explicit, abstract definitions for the interactions between datasets and tasks.
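To make the ambiguity concrete, here is a minimal sketch of one execution with two outputs, followed by the kind of explicit dataset-to-dataset record that removes it. The `derived_from` mapping is loosely inspired by the derivation edges that the Open Provenance Model defines between artifacts, but it is a simplification with invented names, not an implementation of the specification.

```python
# One hypothetical execution that reads two datasets and writes two.
execution = {
    "name": "split_report#1",
    "inputs": ["orders", "returns"],
    "outputs": ["order_summary", "return_summary"],
}

# From the graph alone, every output appears to depend on every input.
apparent = {out: set(execution["inputs"]) for out in execution["outputs"]}

# An explicit declaration of which inputs each output was derived from.
derived_from = {
    "order_summary": {"orders"},
    "return_summary": {"returns"},
}


def spurious_dependencies(apparent, declared):
    """Inputs the graph suggests for each output but that were never used."""
    return {out: apparent[out] - declared[out] for out in apparent}


extra = spurious_dependencies(apparent, derived_from)
```

In this toy case the graph alone would wrongly suggest that `order_summary` depends on `returns`; the explicit record is what rules that out.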
Wait, this suddenly got really complicated. Can't we simply use an existing tool?
While there are many tools that can help you achieve a well-maintained lineage, there is no clear-cut winner, and the advantages of adopting any of them depend strongly on your current workflow and use case. For many analysts, a simple hand-drawn graph may be enough, and if the collection methods and tasks are highly manual in nature (for example, if the dataset is delivered by email) it may be the best option available.
For more sophisticated use cases, such as those of an enterprise, there are many technology-dependent tools, such as Manta for SQL or Cloudera Navigator for Hadoop. Unfortunately, such solutions require that all data transformations be done using a specific technology, or else they may not be registered in the lineage. They are also often packaged together with Data Quality and Data Governance solutions, which may be beyond the scope of a simple system or workflow. Finally, many tools are still in the development or maturing phase, which poses risks for their adoption.
When the use case is too complex to maintain manually but not big or complex enough to be enterprise-level, or when you are using many different technologies that no single lineage tool supports, maintaining a lineage graph can be a pain, especially if your tasks often have many outputs.
However, by rethinking your tasks and ensuring they are single-output and atomic (that is, independent of any task execution or dataset other than their inputs), you can go from a very obscure, complex lineage graph to a clear, simple one like this:
With this graph, because the tasks are single-output, you don't need to make explicit the relation between inputs and outputs, and the lineage of a specific dataset is simply the collection of nodes and edges that can reach it.
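Under the single-output assumption, retrieving a lineage reduces to a backwards graph traversal. The following is a minimal sketch of that idea; the edge list and the `lineage` function are illustrative and not taken from an existing tool.

```python
from collections import defaultdict

# Hypothetical single-output pipeline, as (source, target) edges.
edges = [
    ("raw_sales", "clean#1"), ("clean#1", "sales"),
    ("raw_customers", "dedupe#1"), ("dedupe#1", "customers"),
    ("sales", "join#1"), ("customers", "join#1"), ("join#1", "denormalized"),
    ("raw_weather", "clean#2"), ("clean#2", "weather"),  # unrelated branch
]


def lineage(edges, target):
    """Return the nodes and edges from which `target` can be reached."""
    parents = defaultdict(set)
    for source, dest in edges:
        parents[dest].add(source)

    seen_nodes, seen_edges = {target}, set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            seen_edges.add((parent, node))
            if parent not in seen_nodes:
                seen_nodes.add(parent)
                stack.append(parent)
    return seen_nodes, seen_edges


lineage_nodes, lineage_edges = lineage(edges, "denormalized")
```

The unrelated `weather` branch never appears in the result, which is what lets the lineage double as a compact explanation of where one specific dataset came from.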
Readers with some math experience or training may have noticed that the lineage graph in both this and the previous case is a directed acyclic graph (DAG), which gives it useful mathematical properties, such as the existence of a topological ordering, that make working with it relatively painless. It is also easy to understand a single-output, atomic task as a function; the full pipeline can then be seen as one or many function compositions, with each lineage an evaluation of these compositions.
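Both observations can be illustrated together in a short sketch: if each task is a pure function producing one dataset, a topological ordering of the graph gives a valid evaluation order for the whole pipeline. The task functions and dataset names below are invented for the example, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def clean(raw):
    return [x for x in raw if x is not None]


def total(cleaned):
    return sum(cleaned)


def mean(cleaned, total_value):
    return total_value / len(cleaned)


# Output dataset -> (task function, names of its input datasets).
pipeline = {
    "cleaned": (clean, ["raw"]),
    "total": (total, ["cleaned"]),
    "mean": (mean, ["cleaned", "total"]),
}


def evaluate(pipeline, sources):
    """Compute every dataset by running tasks in topological order."""
    dependencies = {out: set(inputs) for out, (_, inputs) in pipeline.items()}
    values = dict(sources)
    for name in TopologicalSorter(dependencies).static_order():
        if name in pipeline:  # source datasets are already in `values`
            func, inputs = pipeline[name]
            values[name] = func(*(values[i] for i in inputs))
    return values


results = evaluate(pipeline, {"raw": [1, None, 2, 3]})
```

Here `results["mean"]` is just the composition `mean(clean(raw), total(clean(raw)))` evaluated once, and `TopologicalSorter` raises a `CycleError` if the declared dependencies are not a DAG.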
The question remains: how do we actually maintain such a graph? For simple use cases, it is enough to build an API or web service that maintains the graph, adding the necessary nodes and edges. The service could use a graph database as its backend, and check that no inconsistencies are inserted (for example, a dataset cannot be the output of two different tasks). To retrieve the lineage of a dataset, one would then query the service for that dataset, and the service would in turn query the backend. At Datank we are working on such a service, and plan to release it soon as an open-source tool for anyone to use and extend.
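The core of such a service can be sketched in a few lines. The class below is a hypothetical, in-memory stand-in: plain dictionaries take the place of the graph database, and the `LineageStore` name and its methods are assumptions made for illustration, not the API of the service mentioned above.

```python
class LineageStore:
    """In-memory stand-in for a lineage service backed by a graph database."""

    def __init__(self):
        self.producer = {}   # dataset -> execution that produced it
        self.inputs = {}     # execution -> list of input datasets

    def register(self, execution, inputs, output):
        """Add one single-output task execution, rejecting inconsistencies."""
        if execution in self.inputs:
            raise ValueError(f"execution {execution!r} is already registered")
        if output in self.producer:
            raise ValueError(
                f"dataset {output!r} is already the output of "
                f"{self.producer[output]!r}"
            )
        if output in inputs:
            raise ValueError(f"dataset {output!r} cannot be its own input")
        self.inputs[execution] = list(inputs)
        self.producer[output] = execution

    def parents(self, dataset):
        """Datasets that `dataset` was directly computed from."""
        execution = self.producer.get(dataset)
        return [] if execution is None else self.inputs[execution]


store = LineageStore()
store.register("clean#1", inputs=["raw_sales"], output="sales")
store.register("forecast#1", inputs=["sales"], output="predictions")

# A second execution claiming the same output is refused.
try:
    store.register("clean#2", inputs=["raw_sales"], output="sales")
    rejected = False
except ValueError:
    rejected = True
```

Validating at insertion time keeps the stored graph consistent, so lineage queries never have to guess which of several producers a dataset came from.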
In this high-level lineage (also called coarse-grained lineage, as opposed to fine-grained lineage), it is enough to know what each task does and the parameters for each task execution to describe the lineage of a dataset. If the root node of each lineage query is a dataset delivered by the customer, and there is confidence that each task is correctly implemented, data denialism should be harder to justify.
A more complex use case is when we need to know the specific operation that generated each single datum (for example, a graph of low-level operations for each single row). Such low-level information is needed for debugging: if we suspect an output datum to be "wrong", we would like to know exactly how it came about, so that we can say where the faulty operation is. Fine-grained lineage is exactly the storing and management of that low-level information.
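A toy illustration of the idea, assuming nothing about how real fine-grained lineage systems are built: each value carries a record of the operations applied to it, so a suspicious output can be traced back step by step. The `Traced` class, the `apply` helper and the operations are all invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Traced:
    """A single value together with the operations that produced it."""
    value: float
    history: tuple = ()


def apply(name, func, traced):
    """Apply `func` to one traced value and append the step to its history."""
    result = func(traced.value)
    return Traced(result, traced.history + ((name, traced.value, result),))


rows = [Traced(120.0), Traced(80.0), Traced(-5.0)]
rows = [apply("to_thousands", lambda v: v / 1000, r) for r in rows]
rows = [apply("clip_negative", lambda v: max(v, 0.0), r) for r in rows]

# Inspect how the last row ended up at zero.
suspect = rows[2]
steps = [name for name, before, after in suspect.history if before != after]
```

The history of `suspect` shows that `clip_negative` is the step that turned a negative input into zero. Storing this much detail for every datum is also what makes fine-grained lineage expensive at scale.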
Unfortunately, tools for this are scarce, and most of them, such as Newt and Titian, are technology-specific and have not materialized into a commercial or open-source product. We could say that fine-grained lineage tools are still in their infancy.